llama-cpp-python Usage

Every functionary model release comes with GGUF file formats. Thus, functionary can be loaded and used on a much wider variety of hardware using llama.cpp. Currently, we provide the following quantizations: 4-bit, 8-bit, and FP16 (except for functionary-medium-v2 due to file size).

Setup

Make sure that llama-cpp-python (https://github.com/abetlen/llama-cpp-python) is successfully installed on your system. The following is the sample code:

```python
from llama_cpp import Llama
from functionary.prompt_template import get_prompt_template_from_tokenizer
from transformers import AutoTokenizer

# For functionary-7b-v2 we use "tools"; for functionary-7b-v1.4 we use
# "functions" = [{"name": "get_current_weather", "description": ..., "parameters": ...}]
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g., San Francisco, CA",
                    }
                },
                "required": ["location"],
            },
        },
    }
]

# You can download gguf files from https://huggingface.co/meetkai/functionary-7b-v2-GGUF/tree/main
llm = Llama(model_path="PATH_TO_GGUF_FILE", n_ctx=4096, n_gpu_layers=-1)
messages = [
    {"role": "user", "content": "what's the weather like in Hanoi?"}
]

# Create the tokenizer from HF.
# We found that the tokenizer from llama_cpp is not compatible with the tokenizer from HF that we trained.
# The reason might be that we added new tokens to the original tokenizer,
# so we will use the tokenizer from HuggingFace.
tokenizer = AutoTokenizer.from_pretrained("meetkai/functionary-7b-v2", legacy=True)
# prompt_template will be used for creating the prompt
prompt_template = get_prompt_template_from_tokenizer(tokenizer)

# Before inference, we need to add an empty assistant message (without content or function_call)
messages.append({"role": "assistant"})

# Create the prompt to use for inference
prompt_str = prompt_template.get_prompt_from_messages(messages, tools)
token_ids = tokenizer.encode(prompt_str)

gen_tokens = []
# Get the list of stop token ids
stop_token_ids = [
    tokenizer.encode(token)[-1]
    for token in prompt_template.get_stop_tokens_for_generation()
]
print("stop_token_ids: ", stop_token_ids)

# We use the generate function (instead of __call__) so we can pass in a list of token ids
for token_id in llm.generate(token_ids, temp=0):
    if token_id in stop_token_ids:
        break
    gen_tokens.append(token_id)

llm_output = tokenizer.decode(gen_tokens)

# Parse the assistant message from llm_output
result = prompt_template.parse_assistant_response(llm_output)
print(result)
```

The output would be:

```python
{'role': 'assistant', 'content': None, 'tool_calls': [{'type': 'function', 'function': {'name': 'get_current_weather', 'arguments': '{\n  "location": "Hanoi"\n}'}}]}
```

Note: we should use the tokenizer from HuggingFace to convert the prompt into token ids instead of the tokenizer from llama.cpp, because we found that the tokenizer from llama.cpp doesn't give the same result as the one from HuggingFace. The reason might be that we added new tokens to the tokenizer during training, and llama.cpp doesn't handle this successfully.
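After parsing, the entries in `tool_calls` can be executed locally and their results appended to `messages` before running generation again. The sketch below is illustrative, not part of functionary or llama-cpp-python: `get_current_weather` here is a hypothetical local function, and the exact shape of the tool-result message may differ between functionary versions, so adjust it to what your prompt template expects.

```python
import json

# Hypothetical local implementation of the tool; replace with a real weather lookup.
def get_current_weather(location: str) -> str:
    return json.dumps({"location": location, "temperature": "30C", "condition": "sunny"})

# Map tool names to callables (an assumed convention for this sketch).
available_tools = {"get_current_weather": get_current_weather}

# `result` and `messages` come from the snippet above.
messages.append(result)
for tool_call in result.get("tool_calls") or []:
    name = tool_call["function"]["name"]
    arguments = json.loads(tool_call["function"]["arguments"])
    tool_output = available_tools[name](**arguments)
    # Feed the tool result back as a "tool" message (assumed format).
    messages.append({"role": "tool", "name": name, "content": tool_output})

# Re-run the same prompt-building and generation loop as above to get the
# model's final natural-language answer that uses the tool output.
```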