Support for tool/function calling #7


Open
christianliebel opened this issue Jul 2, 2024 · 10 comments
Labels
ecosystem parity (A feature that other popular language model APIs offer), enhancement (New feature or request)

Comments

@christianliebel

I would like to request the addition of tool/function calling functionality to the Prompt API. This feature is available in some models and allows the model to invoke specific actions using a well-defined contract, typically in JSON format. This functionality is beneficial for various use cases that require outputs of a specific structure.

Examples:

domenic added the enhancement (New feature or request) label Jul 29, 2024
domenic added a commit that referenced this issue Aug 14, 2024
@KenjiBaheux

Thank you for the suggestion to add tool/function calling functionality.

To assess the feasibility of this feature, we would appreciate it if folks could provide more details on the typical context size required for defining the functions needed in actual use cases. This information will help us understand the potential impact on performance and resource requirements, especially for an on-device context.

@CakeCrusher

Function calling at this model size is not practical, although constrained generation would be a practical intermediate solution. You could delegate constrained generation to the user by giving them access to the output logits.

This is the most practical approach, IMO.

@ChristianWeyer

Function calling may indeed be too heavy for these models.
However, models fine-tuned for JSON data extraction would be really helpful. Then we could use established patterns such as those implemented in Instructor (https://js.useinstructor.com/).
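
To illustrate the kind of pattern meant here, a minimal userland sketch of schema-guided extraction on top of session.prompt(); the schema text, field names, and bare JSON.parse check are assumptions for illustration, not part of the Prompt API or of Instructor itself.

const schema = `{
  "title": string,            // a short headline
  "sentiment": "positive" | "neutral" | "negative"
}`;

async function extract(session, text) {
  const raw = await session.prompt(
    "Extract the following fields from the text and reply with JSON only, " +
    "matching this shape:\n" + schema + "\n\nText:\n" + text
  );
  try {
    // An Instructor-style helper would also validate the parsed object
    // against the schema; this sketch only checks that it is valid JSON.
    return JSON.parse(raw);
  } catch {
    throw new Error("Model did not return valid JSON: " + raw);
  }
}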

domenic added a commit that referenced this issue Oct 9, 2024
@schlessera

Function calling might be viable for specially tuned models even at lower sizes. Having a flexible API in place to allow for function execution would enable experimentation in that direction. There might be alternative ways to make this work with limited resources, such as dedicated helper logic in the browser to structure input and output or to extract and format arguments. And with advances in training smaller models, they could still improve drastically in this area. There could also be a pathway where two smaller models with separate responsibilities collaborate: one for understanding natural language and one for reasoning about the problem at hand outside the boundaries of human language.

If the model can, within a certain threshold of reliability, assess whether a function might be suited to solving an identified task, it can say so and pass the arguments back to the consumer code. The consumer code could then opt to either run that function directly on the browser thread or forward it to a service worker. It might even make sense to make service workers the default execution model, so that the in-browser API knows about the functions and the service worker that executes them, and the entire flow can run without requiring intermediate assistance from the main thread.

This could even allow browser extensions to provide a set of standard functions that augment the capabilities of an LLM in a way that is easy for end users (provided the security implications are handled correctly).
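
A purely illustrative sketch of that dispatch flow: the { name, arguments } tool-call shape, the tools map, and the service-worker message format below are all hypothetical, not part of any proposal.

// Hypothetical tool-call shape surfaced to consumer code: { name, arguments }.
async function handleToolCall(call, tools, { useServiceWorker = false } = {}) {
  const tool = tools[call.name];
  if (!tool) throw new Error("Unknown tool: " + call.name);

  if (useServiceWorker && navigator.serviceWorker?.controller) {
    // Forward the call to a service worker so the main thread stays free.
    return new Promise((resolve) => {
      const channel = new MessageChannel();
      channel.port1.onmessage = (event) => resolve(event.data);
      navigator.serviceWorker.controller.postMessage(
        { type: "tool-call", name: call.name, arguments: call.arguments },
        [channel.port2]
      );
    });
  }

  // Otherwise run the function directly on the page's thread.
  return tool(call.arguments);
}

Defaulting to the service-worker path would match the "no main-thread assistance" idea above, at the cost of requiring a registered worker that understands the message format.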

@domenic
Collaborator

domenic commented Jun 5, 2025

We think this would be an exciting next step.

Looking at popular APIs today (OpenAI HTTP docs, OpenAI TypeScript Agents SDK docs, Anthropic docs, Gemini docs, Vercel AI SDK docs) it seems like the common info for each tool is:

  • Name
  • Description (natural language)
  • Input arguments JSON schema

For a JavaScript (instead of HTTP) API, we can also provide the functions directly.

It's interesting to compare these docs to the MCP docs, which add annotations. I think those are probably less important for our use case? And we can always add them later. So, I think the following is a good starting point:

const result = await session.prompt(
  "What is the weather like in San Francisco?",
  {
    tools: [
      {
        name: 'weather',
        description: 'Get the weather in a location',
        inputSchema: {
          type: "object",
          properties: {
            location: {
              type: "string",
              description: "The city and state, e.g. San Francisco, CA"
            }
          },
          required: ["location"]
        },
        async execute({ location }) {
          const res = await fetch("https://weatherapi.example/?location=" + encodeURIComponent(location));

          // We could allow returning JS objects directly and stringifying them, but that
          // seems risky...
          return JSON.stringify(await res.json());
        }
      }
    ]
  }
);

Some minor points to bikeshed:

  • I like the Vercel AI SDK's use of object literals instead of an array of named items, e.g. I think tools: { weather: { description, parameters, execute } } would be nicer than the above. However, nobody else does that. And in the future, if we wanted to offer built-in tools, the above design is more flexible, since the built-in tools could be entries in the array such as "builtin:toolname" or LanguageModel.ToolName. So, I am inclined to stick with the array version.

  • There's a split between input_schema (Anthropic API), inputSchema (MCP), and parameters (everyone else). I find "input schema" to be a good bit clearer, so I went with that, but I could be persuaded to align with the majority.

  • The name execute() has several possible alternatives (run() being one), but I've found at least two places (OpenAI Agents SDK and Vercel AI SDK) using execute() so far.

  • It's not clear whether the name is necessary for our JavaScript API use case. In the HTTP APIs, it's used so that in the HTTP response, the model can tell the developer which tool it's calling. In our case, that whole process is hidden from the developer since they just provide a JavaScript function. So in theory it's not necessary. But I do suspect that models might get better performance with named tools. If we auto-generated tool names behind the scenes (like tool0, tool1, etc.) I wonder if the models might mess up more, compared to developer-supplied semantic names. We should probably test this!

@tomayac
Contributor

tomayac commented Jun 5, 2025

How do you envision error handling working (Vercel example)?

@domenic
Collaborator

domenic commented Jun 5, 2025

Good question.

  • Rejected promises from execute() should just cause the prompt to fail, bubbling the error to the return value of prompt().
  • I guess we'll have to vend specific errors for cases where the model screws up, like Vercel's InvalidToolArgumentsError and NoSuchToolError?
  • In all cases, errors should wipe all the related messages from the session.
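
A rough sketch of what those semantics could look like from the calling side, assuming the per-prompt tools option from the example above; the error names in the comment are placeholders borrowed from Vercel, not settled API.

try {
  const result = await session.prompt(
    "What is the weather like in San Francisco?",
    { tools: [weatherTool] } // the tool object from the example above; its execute() may reject
  );
  console.log(result);
} catch (err) {
  // A rejection from execute() bubbles out of prompt() unchanged, while
  // model-side mistakes would surface as dedicated error types
  // (placeholder names: NoSuchToolError, InvalidToolArgumentsError).
  // Either way, the related messages are wiped from the session.
  console.error(err.name, err.message);
}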

@sushraja-msft
Contributor

Thanks for taking this forward, domenic. I think the tool declaration needs to be on LanguageModel.create and be a session attribute:

  • Having the developer provide available tools on each turn would be redundant and also confusing for the model if the available tools change with each turn.
  • Some models (Phi4) expect the tools to be declared in the system prompt.

@domenic
Collaborator

domenic commented Jun 6, 2025

Interesting. That seems like a reasonable restriction to me. Creating sessions shouldn't be that hard for developers.

So, concretely, we'd place them in LanguageModelCreateCoreOptions.
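
A sketch of what that could look like, reusing the tool shape from the earlier example; the exact contents of LanguageModelCreateCoreOptions are still open, so the option name and placement here are assumptions.

const session = await LanguageModel.create({
  tools: [
    {
      name: "weather",
      description: "Get the weather in a location",
      inputSchema: {
        type: "object",
        properties: {
          location: {
            type: "string",
            description: "The city and state, e.g. San Francisco, CA"
          }
        },
        required: ["location"]
      },
      async execute({ location }) {
        const res = await fetch(
          "https://weatherapi.example/?location=" + encodeURIComponent(location)
        );
        return JSON.stringify(await res.json());
      }
    }
  ]
});

// The declared tools now apply to every turn in the session.
const result = await session.prompt("What is the weather like in San Francisco?");

Declaring tools at creation time also fits the point about models such as Phi4 that expect the tool list in the system prompt, since the implementation can render it there once per session.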

@nico-martin

I have created quite a lot of demos for tool calling with small LLMs.
Most of them don't support tool calling directly, so I built my own tooling around the LLM, where the tools are described in the system prompt and I then parse the response for potential tool calls.

Here are some of my findings:

  1. Yes, even small LLMs (like Gemini Nano, Gemma2 2B, Qwen3 1.7B) are actually not that bad at tool calling. It always depends on how much context is already in the conversation, how many functions you have, and how well your functions are described, but for simple use cases it does work.
  2. JSON does not work very well. If you force the LLM in the system prompt to always return the same structure, it often forgets about it by the second or third answer, or it just outputs invalid JSON (a semicolon here, a comma there, etc.). That's why I am using XML for the tool calling.

I just added Prompt API support to my LLM tool-calling demo. Feel free to try it yourself:
https://llm-tool-calling.nico.dev/ (click on the model name below the input; in the Settings modal, under Model, search for "Prompt API")

It works quite nicely, even with two tool calls in one prompt:

[Screenshot: demo output showing two tool calls handled in one prompt]
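
For anyone who wants to reproduce a similar userland setup, a minimal sketch of the describe-in-the-system-prompt-and-parse approach; the <tool_call> XML convention is hypothetical and may differ from what the demo actually uses.

const systemPrompt =
  "You can call tools by replying with\n" +
  '<tool_call name="TOOL_NAME"><arg name="ARG_NAME">VALUE</arg></tool_call>\n' +
  "Available tools: weather(location), calculator(expression).";

// Scan a model reply for tool calls and return them as { name, arguments } objects.
function parseToolCalls(reply) {
  const calls = [];
  const callRe = /<tool_call name="([^"]+)">([\s\S]*?)<\/tool_call>/g;
  const argRe = /<arg name="([^"]+)">([\s\S]*?)<\/arg>/g;
  for (const [, name, body] of reply.matchAll(callRe)) {
    const args = {};
    for (const [, argName, value] of body.matchAll(argRe)) {
      args[argName] = value.trim();
    }
    calls.push({ name, arguments: args });
  }
  return calls;
}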
