Functionary
Functionary is a language model that can interpret and execute functions/plugins.
The model determines when to execute functions, whether in parallel or serially, and can understand their outputs. It only triggers functions as needed. Function definitions are given as JSON Schema Objects, similar to OpenAI GPT function calls.
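For example, a function can be exposed to the model as an OpenAI-style tool whose parameters are described by a JSON Schema object. The function name and fields below are purely illustrative and are not shipped with Functionary:

```python
# An illustrative OpenAI-style tool definition; "get_current_weather" and its
# parameters are hypothetical examples, not part of Functionary itself.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {  # the parameters field is a JSON Schema object
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                },
                "required": ["location"],
            },
        },
    }
]
```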
- Intelligent parallel function calls / tool use
- Analyzes function/tool outputs and provides responses grounded in those outputs
- Decides when not to call functions/tools and responds with a normal chat message instead
- Grammar sampling that ensures 100% accuracy in generating function and argument names
- Automatic execution of Python functions called by Functionary
We evaluated Functionary's function-call prediction capabilities on MeetKai's in-house benchmark dataset. The accuracy metric measures the overall correctness of predicted function calls, including function name prediction and argument extraction.
Dataset | Model Name | Function Calling Accuracy (Name & Arguments) |
---|---|---|
In-house data | MeetKai-functionary-small-v2.2 | 0.546 |
In-house data | MeetKai-functionary-medium-v2.2 | 0.664 |
In-house data | OpenAI-gpt-3.5-turbo-1106 | 0.531 |
In-house data | OpenAI-gpt-4-1106-preview | 0.737 |
Make sure you have PyTorch installed. Then to install the required dependencies, run:
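Assuming the repository follows the usual requirements.txt convention (an assumption, since the file isn't shown here):

```sh
pip install -r requirements.txt
```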
Now you can start a blazing fast vLLM server:
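A minimal launch command might look like the sketch below; the `server_vllm.py` entry point and the small-model path are assumptions, while the `--model` and `--host` flags are documented further down:

```sh
# Sketch only: the script name and model path are assumptions; flags are described below.
python3 server_vllm.py --model "meetkai/functionary-small-v2.2" --host 0.0.0.0 --port 8000
```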
Since the server is running on vLLM, you can pass in various arguments for the vLLM engine; the full list is available in the vLLM documentation. The most important arguments are listed below, and an example launch command follows these lists:
- --model : the name of the model (e.g. "meetkai/functionary-medium-v2.2")
- --tensor-parallel-size (default = 1) : the number of GPUs to use for the server. Set this to a value greater than 1 for larger models such as "meetkai/functionary-medium-v2.2", depending on the type of GPU that you are using
- --max-model-len : the context window (in tokens) to set for your server
We also implement a few options of our own, independent of vLLM:
- --host (default="0.0.0.0") : the host IP address for the server
- --port (default="8000") : the port to be exposed for the server
- --grammar_sampling (default=True) : when enabled, the server uses our implementation of grammar sampling that ensures 100% accuracy for function and argument names. It works by constraining function and argument names to only the list of functions provided in the API call or no function call at all.
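Putting these arguments together, a multi-GPU launch for the medium model might look like the following sketch (again assuming `server_vllm.py` is the entry point; the 8k context length matches the model table below):

```sh
# Sketch only: the script name is an assumption; the flags are described above.
python3 server_vllm.py \
  --model "meetkai/functionary-medium-v2.2" \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000
# Grammar sampling (--grammar_sampling) is enabled by default, so no extra flag is needed here.
```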
If you're having trouble with dependencies and you have nvidia-container-toolkit installed, you can start your environment like this:
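One common approach is to run a CUDA-enabled PyTorch container with the GPUs exposed; this is only a sketch, and the image tag and mount path below are examples rather than the project's prescribed setup:

```sh
# Example only: any recent CUDA-enabled PyTorch image should work.
docker run --gpus all -it --rm \
  -p 8000:8000 \
  -v "$(pwd)":/workspace \
  nvcr.io/nvidia/pytorch:23.10-py3
```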
| Model | Description | Compute Requirements (for FP16 HF model weights) |
|---|---|---|
| | 8k context | Any GPU with 24GB VRAM |
| | 8k context, better accuracy | 2 x A100-80GB or equivalent |
| | 8k context | Any GPU with 24GB VRAM |
| | Parallel function call support | Any GPU with 24GB VRAM |
| | 4k context, better accuracy (deprecated) | Any GPU with 24GB VRAM |
| | 4k context (deprecated) | Any GPU with 24GB VRAM |
| functionary-7b-v0.1 | 2k context (deprecated). Not recommended, use v2.1 onwards | Any GPU with 24GB VRAM |
- v1 models are compatible with both OpenAI-python v0 and v1.
- v2 models are designed for compatibility with OpenAI-python v1.
You may refer to the official OpenAI documentation for the differences between OpenAI-python v0 and v1. A v1-style example against a locally running Functionary server is sketched below.
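The base_url, api_key value, and model path in this sketch are assumptions for illustration; the port matches the server defaults described above:

```python
from openai import OpenAI

# Point the OpenAI-python v1 client at the local Functionary server.
# The api_key is a placeholder; the address assumes the default host/port above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="functionary")

# A minimal illustrative tool; the name and schema are hypothetical.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="meetkai/functionary-small-v2.2",  # assumed model path
    messages=[{"role": "user", "content": "What is the weather in Istanbul?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message)
```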
We convert function definitions into text similar to TypeScript definitions and inject these definitions as system prompts. After that, we inject the default system prompt and then start the conversation messages.
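Roughly, a JSON Schema definition like the weather example above would be rendered into TypeScript-like text along these lines; the exact formatting is internal to Functionary, so this is only an illustrative sketch:

```typescript
// Illustrative only: an approximation of how a tool might be presented to the model.
namespace functions {
  // Get the current weather for a city
  type get_current_weather = (_: {
    // The city and state, e.g. San Francisco, CA
    location: string,
  }) => any;
}
```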
We use specially designed prompt templates, which we call "v2PromptTemplate" and "v1PromptTemplate". "v2PromptTemplate" breaks each turn down into from, recipient, and content portions. "v1PromptTemplate" uses a variety of special tokens in each turn, including a start-of-function-call token to indicate an assistant turn with a function call, and role-specific stop tokens for each turn.
We don't change the logit probabilities to conform to a certain schema, but the model itself knows how to conform. This allows us to use existing tools and caching systems with ease.
This project is licensed under the terms of the MIT license.