from httpx import RequestNotRead
from fastcore.test import *
from fastcore.utils import *  # patch, first, Config used by the snippets below
import httpx, hashlib, json; from pathlib import Path
Introduction
We often call APIs while prototyping and testing our code. A single API call (e.g. an Anthropic chat completion) can take hundreds of milliseconds to run. This can really slow down development, especially if our notebook contains many API calls 😞.
cachy caches API requests. It does this by saving the result of each API call to a local cachy.jsonl file. Before calling an API (e.g. OpenAI), it checks whether the request already exists in cachy.jsonl. If it does, it returns the cached result.
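In practice that means caching is a one-liner at the top of a notebook (this is the API we'll build up over the rest of this post):
from cachy import enable_cachy

enable_cachy()  # patch httpx so repeated calls to supported APIs are served from cachy.jsonl

# ...then use the OpenAI / Anthropic / LiteLLM SDKs exactly as usual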
How does it work?
Under the hood, popular SDKs like OpenAI, Anthropic, and LiteLLM use httpx.Client and httpx.AsyncClient.
cachy patches the send method of both clients and injects a simple caching mechanism:
- create a cache key from the request
- if the key exists in cachy.jsonl, return the cached response
- if not, call the API and save the response to cachy.jsonl
cachy.jsonl contains one API response per line.
Each line has the following format: {"key": key, "response": response}
- key: hash of the API request
- response: the API response
{
"key": "afc2be0c",
"response": "{\"id\":\"msg_xxx\",\"type\":\"message\",\"role\":\"assistant\",\"model\":\"claude-sonnet-4-20250514\",\"content\":[{\"type\":\"text\",\"text\":\"Coordination.\"}],\"stop_reason\":\"end_turn\",\"stop_sequence\":null,\"usage\":{\"input_tokens\":16,\"cache_creation_input_tokens\":0,\"cache_read_input_tokens\":0,\"cache_creation\":{\"ephemeral_5m_input_tokens\":0,\"ephemeral_1h_input_tokens\":0},\"output_tokens\":6,\"service_tier\":\"standard\"}}"
}
Patching httpx
Patching a method is very straightforward.
In our case we want to patch httpx._client.Client.send and httpx._client.AsyncClient.send.
These methods are called when running httpx.get, httpx.post, etc.
In the example below, we use @patch from fastcore to print 'calling an API' whenever httpx._client.Client.send is run.
@patch
def send(self:httpx._client.Client, r, **kwargs):
    print('calling an API')
    return self._orig_send(r, **kwargs)
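One detail the snippet glosses over: @patch replaces Client.send outright, so the original method has to be saved somewhere first. That's where the _orig_send used above (and later by disable_cachy) comes from. A minimal sketch of that bookkeeping, assuming cachy stashes the originals before patching (the real module may do this differently):
if not hasattr(httpx._client.Client, "_orig_send"):
    httpx._client.Client._orig_send = httpx._client.Client.send            # save the unpatched sync send
if not hasattr(httpx._client.AsyncClient, "_orig_send"):
    httpx._client.AsyncClient._orig_send = httpx._client.AsyncClient.send  # save the unpatched async send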
Cache Filtering
Now, let’s build up our caching logic piece-by-piece.
The first thing we need to do is ensure that our caching logic only runs on specific urls.
For now, let’s only cache API calls made to popular LLM providers like OpenAI, Anthropic, Google and DeepSeek. We can make this fully customizable later.
Exported source
doms = ("api.openai.com", "api.anthropic.com", "generativelanguage.googleapis.com", "api.deepseek.com")
Exported source
def _should_cache(url, doms): return any(dom in str(url) for dom in doms)
We could then use _should_cache like this.
@patch
def send(self:httpx._client.Client, r, **kwargs):
    if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
    # insert caching logic
    ...
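Before moving on, a couple of quick checks (using test_eq from fastcore.test, which is already imported) confirm the filter behaves as expected:
test_eq(_should_cache("https://api.openai.com/v1/chat/completions", doms), True)   # LLM provider -> cache
test_eq(_should_cache("https://example.com/v1/health", doms), False)               # anything else -> pass through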
Cache Key
The next thing we need to do is figure out if a response for the request r already exists in our cache.
Recall that each line in cachy.jsonl has the following format {"key": key, "response": response}.
Our key needs to be unique and deterministic. One way to do this is to concatenate the request URL and content, then generate a hash from the result.
def _key(r): return hashlib.sha256(str(r.url.copy_remove_param('key')).encode() + r.content).hexdigest()[:8]
When LiteLLM calls Gemini it includes the API key in a query param, so we strip the key param from the url.
Let’s test this out.
r1 = httpx.Request('POST', 'https://api.openai.com/v1/chat/completions', content=b'some content')
r1
<Request('POST', 'https://api.openai.com/v1/chat/completions')>
_key(r1)
'2d135d43'
If we run it again we should get the same key.
_key(r1)
'2d135d43'
Let’s modify the url and confirm we get a different key.
_key(httpx.Request('POST', 'https://api.anthropic.com/v1/messages', content=b'some content'))
'8a99b0a9'
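Different request content should also produce a different key. Here's a quick check with test_ne and a throwaway request r2 (not part of the original example):
r2 = httpx.Request('POST', 'https://api.openai.com/v1/chat/completions', content=b'other content')
test_ne(_key(r1), _key(r2))  # same url, different body -> different keys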
Great. Let’s update our patch.
@patch
def send(self:httpx._client.Client, r, **kwargs):
    if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
    key = _key(r)
    # if cache hit, return the cached response
    # else run the request, write the response to the cache and return it
    ...
Cache Reads/Writes
Now let’s add some methods that will read from and write to cachy.jsonl.
Exported source
def _cache(key, cfp):
with open(cfp, "r") as f:
line = first(f, lambda l: json.loads(l)["key"] == key)
return json.loads(line)["response"] if line else NoneExported source
def _write_cache(key, content, cfp):
with open(cfp, "a") as f: f.write(json.dumps({"key":key, "response": content})+"\n")Let’s update our patch.
@patch
def send(self:httpx._client.Client, r, **kwargs):
    if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
    key = _key(r)
    if res := _cache(key, "cachy.jsonl"): return httpx.Response(status_code=200, content=res, request=r)
    res = self._orig_send(r, **kwargs)
    content = res.read().decode()
    _write_cache(key, content, "cachy.jsonl")
    return httpx.Response(status_code=res.status_code, content=content, request=r)
Multipart Requests
_key will throw the following error for multipart requests (e.g. file uploads).
RequestNotRead: Attempted to access streaming request content, without having called `read()`.
rfu = httpx.Request('POST', 'https://api.openai.com/v1/chat/completions', files={"file": ("test.txt", b"hello")})
rfu<Request('POST', 'https://api.openai.com/v1/chat/completions')>
test_fail(lambda: _key(rfu), RequestNotRead)
rfu.read(); _key(rfu);
Each part of a multipart request is separated by a delimiter called a boundary, which has the structure --{RANDOM_ID}. Here’s an example for rfu.
b'--f9ee33966b45cc8c80952bb57cc728c4\r\nContent-Disposition: form-data; name="file"; filename="test.txt"\r\nContent-Type: text/plain\r\n\r\nhello\r\n--f9ee33966b45cc8c80952bb57cc728c4--\r\n'
As the boundary is a random id, two identical multipart requests will produce different boundaries. And as the boundary is part of the request content, _key will generate different keys, leading to cache misses 😞.
Let’s create a helper method _content that will extract content from any request and remove the non-deterministic boundary.
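The exported implementation of _content isn't shown in this section; a minimal sketch of what it could look like (the real version may differ) is:
def _content(r):
    "Request body with any random multipart boundary replaced by a fixed placeholder."
    ctype = r.headers.get("content-type", "")
    if "multipart/form-data" not in ctype: return r.read()
    boundary = ctype.split("boundary=")[1].encode()   # the random id httpx generated
    return r.read().replace(boundary, b"cachy-boundary")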
rfu = httpx.Request('POST', 'https://api.openai.com/v1/chat/completions', files={"file": ("test.txt", b"hello")})
rfu
<Request('POST', 'https://api.openai.com/v1/chat/completions')>
_content(rfu)
b'--cachy-boundary\r\nContent-Disposition: form-data; name="file"; filename="test.txt"\r\nContent-Type: text/plain\r\n\r\nhello\r\n--cachy-boundary--\r\n'
def _key(r): return hashlib.sha256(str(r.url.copy_remove_param('key')).encode() + _content(r)).hexdigest()[:8]
Let’s confirm that running _key multiple times on the same multipart request now returns the same key.
_key(rfu), _key(rfu)
('9ae79ac5', '9ae79ac5')
Streaming
Let’s add support for streaming.
First, let’s include an is_stream bool in our hash so that a non-streamed request will generate a different key from the same request when streamed.
Exported source
def _key(r, is_stream=False):
"Create a unique, deterministic id from the request `r`."
return hashlib.sha256(f"{r.url.copy_remove_param('key')}{is_stream}".encode() + _content(r)).hexdigest()[:8]In the patch we need to consume the entire stream before writing it to the cache.
@patch
def send(self:httpx._client.Client, r, **kwargs):
is_stream = kwargs.get("stream")
if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
key = _key(r, is_stream=False)
if res := _cache(key,"cachy.jsonl"): return httpx.Response(status_code=200, content=res, request=r)
res = self._orig_send(r, **kwargs)
content = res.read().decode() if not is_stream else b''.join(list(res.iter_bytes())).decode()
_write_cache(key, content, "cachy.jsonl")
return httpx.Response(status_code=res.status_code, content=content, request=r)enable_cachy
To make cachy as user-friendly as possible, let’s make it so that we can apply our patch by running a single function at the top of our notebook.
from cachy import enable_cachy
enable_cachy()
For this to work we’ll need to wrap our patch.
def _apply_patch():
    @patch
    def send(self:httpx._client.Client, r, **kwargs):
        is_stream = kwargs.get("stream")
        if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
        key = _key(r, is_stream=is_stream)
        if res := _cache(key, "cachy.jsonl"): return httpx.Response(status_code=200, content=res, request=r)
        res = self._orig_send(r, **kwargs)
        content = res.read().decode() if not is_stream else b''.join(list(res.iter_bytes())).decode()
        _write_cache(key, content, "cachy.jsonl")
        return httpx.Response(status_code=res.status_code, content=content, request=r)
def enable_cachy():
    _apply_patch()
Great. Now, let’s make cachy a little more customizable by making it possible to specify:
- the APIs (or domains) to cache
- the location of the cache file.
def enable_cachy(cache_dir=None, doms=doms):
    cfp = Path(cache_dir or getattr(Config.find("settings.ini"), "config_path", ".")) / "cachy.jsonl"
    cfp.touch(exist_ok=True)
    _apply_patch(cfp, doms)
Note: If our notebook is running in an nbdev project, Config.find("settings.ini").config_path automatically finds the base dir.
def _apply_patch(cfp, doms):
    @patch
    def send(self:httpx._client.Client, r, **kwargs):
        is_stream = kwargs.get("stream")
        if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
        key = _key(r, is_stream=is_stream)
        if res := _cache(key, cfp): return httpx.Response(status_code=200, content=res, request=r)
        res = self._orig_send(r, **kwargs)
        content = res.read().decode() if not is_stream else b''.join(list(res.iter_bytes())).decode()
        _write_cache(key, content, cfp)
        return httpx.Response(status_code=res.status_code, content=content, request=r)
Async
Now let’s add support for async requests.
Exported source
def _apply_async_patch(cfp, doms):
    @patch
    async def send(self:httpx._client.AsyncClient, r, **kwargs):
        is_stream = kwargs.get("stream")
        if not _should_cache(r.url, doms): return await self._orig_send(r, **kwargs)
        key = _key(r, is_stream=is_stream)
        if res := _cache(key, cfp): return httpx.Response(status_code=200, content=res, request=r)
        res = await self._orig_send(r, **kwargs)
        content = res.read().decode() if not is_stream else b''.join([c async for c in res.aiter_bytes()]).decode()
        _write_cache(key, content, cfp)
        return httpx.Response(status_code=res.status_code, content=content, request=r)
Let’s rename our original patch.
Exported source
def _apply_sync_patch(cfp, doms):
    @patch
    def send(self:httpx._client.Client, r, **kwargs):
        is_stream = kwargs.get("stream")
        if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
        key = _key(r, is_stream=is_stream)
        if res := _cache(key, cfp): return httpx.Response(status_code=200, content=res, request=r)
        res = self._orig_send(r, **kwargs)
        content = res.read().decode() if not is_stream else b''.join(list(res.iter_bytes())).decode()
        _write_cache(key, content, cfp)
        return httpx.Response(status_code=res.status_code, content=content, request=r)
Finally, let’s update enable_cachy.
enable_cachy
enable_cachy (cache_dir=None, doms=('api.openai.com', 'api.anthropic.com', 'generativelanguage.googleapis.com', 'api.deepseek.com'))
Exported source
def enable_cachy(cache_dir=None, doms=doms):
    cfp = Path(cache_dir or getattr(Config.find("settings.ini"), "config_path", ".")) / "cachy.jsonl"
    cfp.touch(exist_ok=True)
    _apply_sync_patch(cfp, doms)
    _apply_async_patch(cfp, doms)
And a way to turn it off:
disable_cachy
disable_cachy ()
Exported source
def disable_cachy():
    httpx._client.AsyncClient.send = httpx._client.AsyncClient._orig_send
    httpx._client.Client.send = httpx._client.Client._orig_send
Tests
Let’s test enable_cachy on 3 SDKs (OpenAI, Anthropic, LiteLLM) for the scenarios below:
- sync requests with(out) streaming
- async requests with(out) streaming
Add some helper functions.
class mods: ant="claude-sonnet-4-20250514"; oai="gpt-4o"; gem="gemini/gemini-2.0-flash"
def mk_msgs(m): return [{"role": "user", "content": f"write 1 word about {m}"}]
enable_cachy()
OpenAI
from openai import OpenAI
cli = OpenAI()
r = cli.responses.create(model=mods.oai, input=mk_msgs("openai sync"))
r
Collaboration
- id: resp_017850b8b871e44100692ede6ef6c081a085e7e17b2c19943b
- created_at: 1764679278.0
- error: None
- incomplete_details: None
- instructions: None
- metadata: {}
- model: gpt-4o-2024-08-06
- object: response
- output: [ResponseOutputMessage(id=‘msg_017850b8b871e44100692ede6f9b7081a0a6e61d0ebd2c78da’, content=[ResponseOutputText(annotations=[], text=‘Collaboration’, type=‘output_text’, logprobs=[])], role=‘assistant’, status=‘completed’, type=‘message’)]
- parallel_tool_calls: True
- temperature: 1.0
- tool_choice: auto
- tools: []
- top_p: 1.0
- background: False
- conversation: None
- max_output_tokens: None
- max_tool_calls: None
- previous_response_id: None
- prompt: None
- prompt_cache_key: None
- reasoning: Reasoning(effort=None, generate_summary=None, summary=None)
- safety_identifier: None
- service_tier: default
- status: completed
- text: ResponseTextConfig(format=ResponseFormatText(type=‘text’), verbosity=‘medium’)
- top_logprobs: 0
- truncation: disabled
- usage: ResponseUsage(input_tokens=15, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=3, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=18)
- user: None
- billing: {‘payer’: ‘developer’}
- prompt_cache_retention: None
- store: True
r = cli.responses.create(model=mods.oai, input=mk_msgs("openai sync"))
r
Collaboration
- id: resp_017850b8b871e44100692ede6ef6c081a085e7e17b2c19943b
- created_at: 1764679278.0
- error: None
- incomplete_details: None
- instructions: None
- metadata: {}
- model: gpt-4o-2024-08-06
- object: response
- output: [ResponseOutputMessage(id=‘msg_017850b8b871e44100692ede6f9b7081a0a6e61d0ebd2c78da’, content=[ResponseOutputText(annotations=[], text=‘Collaboration’, type=‘output_text’, logprobs=[])], role=‘assistant’, status=‘completed’, type=‘message’)]
- parallel_tool_calls: True
- temperature: 1.0
- tool_choice: auto
- tools: []
- top_p: 1.0
- background: False
- conversation: None
- max_output_tokens: None
- max_tool_calls: None
- previous_response_id: None
- prompt: None
- prompt_cache_key: None
- reasoning: Reasoning(effort=None, generate_summary=None, summary=None)
- safety_identifier: None
- service_tier: default
- status: completed
- text: ResponseTextConfig(format=ResponseFormatText(type=‘text’), verbosity=‘medium’)
- top_logprobs: 0
- truncation: disabled
- usage: ResponseUsage(input_tokens=15, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=3, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=18)
- user: None
- billing: {‘payer’: ‘developer’}
- prompt_cache_retention: None
- store: True
Let’s test streaming.
r = cli.responses.create(model=mods.oai, input=mk_msgs("openai sync streaming"), stream=True)
for ch in r: print(ch)ResponseCreatedEvent(response=Response(id='resp_05244a0e69c5f66800692ede7164d4819793bd4ccc65bc4237', created_at=1764679281.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, prompt_cache_retention=None, store=True), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_05244a0e69c5f66800692ede7164d4819793bd4ccc65bc4237', created_at=1764679281.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, prompt_cache_retention=None, store=True), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', content=[], role='assistant', status='in_progress', type='message'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=0, delta='Innov', item_id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', logprobs=[], output_index=0, sequence_number=4, type='response.output_text.delta', obfuscation='eNimFngwXhb')
ResponseTextDeltaEvent(content_index=0, delta='ative', item_id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', logprobs=[], output_index=0, sequence_number=5, type='response.output_text.delta', obfuscation='NIHlIsqyjma')
ResponseTextDoneEvent(content_index=0, item_id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', logprobs=[], output_index=0, sequence_number=6, text='Innovative', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=0, item_id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', output_index=0, part=ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[]), sequence_number=7, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', content=[ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[])], role='assistant', status='completed', type='message'), output_index=0, sequence_number=8, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_05244a0e69c5f66800692ede7164d4819793bd4ccc65bc4237', created_at=1764679281.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', content=[ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=In: 16; Out: 3; Total: 19, user=None, prompt_cache_retention=None, store=True), sequence_number=9, type='response.completed')
r = cli.responses.create(model=mods.oai, input=mk_msgs("openai sync streaming"), stream=True)
for ch in r: print(ch)ResponseCreatedEvent(response=Response(id='resp_05244a0e69c5f66800692ede7164d4819793bd4ccc65bc4237', created_at=1764679281.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, prompt_cache_retention=None, store=True), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_05244a0e69c5f66800692ede7164d4819793bd4ccc65bc4237', created_at=1764679281.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, prompt_cache_retention=None, store=True), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', content=[], role='assistant', status='in_progress', type='message'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=0, delta='Innov', item_id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', logprobs=[], output_index=0, sequence_number=4, type='response.output_text.delta', obfuscation='eNimFngwXhb')
ResponseTextDeltaEvent(content_index=0, delta='ative', item_id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', logprobs=[], output_index=0, sequence_number=5, type='response.output_text.delta', obfuscation='NIHlIsqyjma')
ResponseTextDoneEvent(content_index=0, item_id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', logprobs=[], output_index=0, sequence_number=6, text='Innovative', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=0, item_id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', output_index=0, part=ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[]), sequence_number=7, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', content=[ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[])], role='assistant', status='completed', type='message'), output_index=0, sequence_number=8, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_05244a0e69c5f66800692ede7164d4819793bd4ccc65bc4237', created_at=1764679281.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_05244a0e69c5f66800692ede71c3308197989e1d568fbfc387', content=[ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=In: 16; Out: 3; Total: 19, user=None, prompt_cache_retention=None, store=True), sequence_number=9, type='response.completed')
Let’s test async.
from openai import AsyncOpenAI
cli = AsyncOpenAI()
r = await cli.responses.create(model=mods.oai, input=mk_msgs("openai async"))
r
Innovative
- id: resp_0efa725f1445a96c00692ede73db0481a094ddddaab4135e1f
- created_at: 1764679283.0
- error: None
- incomplete_details: None
- instructions: None
- metadata: {}
- model: gpt-4o-2024-08-06
- object: response
- output: [ResponseOutputMessage(id=‘msg_0efa725f1445a96c00692ede74206881a0bd7959dfa7047a88’, content=[ResponseOutputText(annotations=[], text=‘Innovative’, type=‘output_text’, logprobs=[])], role=‘assistant’, status=‘completed’, type=‘message’)]
- parallel_tool_calls: True
- temperature: 1.0
- tool_choice: auto
- tools: []
- top_p: 1.0
- background: False
- conversation: None
- max_output_tokens: None
- max_tool_calls: None
- previous_response_id: None
- prompt: None
- prompt_cache_key: None
- reasoning: Reasoning(effort=None, generate_summary=None, summary=None)
- safety_identifier: None
- service_tier: default
- status: completed
- text: ResponseTextConfig(format=ResponseFormatText(type=‘text’), verbosity=‘medium’)
- top_logprobs: 0
- truncation: disabled
- usage: ResponseUsage(input_tokens=15, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=3, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=18)
- user: None
- billing: {‘payer’: ‘developer’}
- prompt_cache_retention: None
- store: True
r = await cli.responses.create(model=mods.oai, input=mk_msgs("openai async"))
r
Innovative
- id: resp_0efa725f1445a96c00692ede73db0481a094ddddaab4135e1f
- created_at: 1764679283.0
- error: None
- incomplete_details: None
- instructions: None
- metadata: {}
- model: gpt-4o-2024-08-06
- object: response
- output: [ResponseOutputMessage(id=‘msg_0efa725f1445a96c00692ede74206881a0bd7959dfa7047a88’, content=[ResponseOutputText(annotations=[], text=‘Innovative’, type=‘output_text’, logprobs=[])], role=‘assistant’, status=‘completed’, type=‘message’)]
- parallel_tool_calls: True
- temperature: 1.0
- tool_choice: auto
- tools: []
- top_p: 1.0
- background: False
- conversation: None
- max_output_tokens: None
- max_tool_calls: None
- previous_response_id: None
- prompt: None
- prompt_cache_key: None
- reasoning: Reasoning(effort=None, generate_summary=None, summary=None)
- safety_identifier: None
- service_tier: default
- status: completed
- text: ResponseTextConfig(format=ResponseFormatText(type=‘text’), verbosity=‘medium’)
- top_logprobs: 0
- truncation: disabled
- usage: ResponseUsage(input_tokens=15, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=3, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=18)
- user: None
- billing: {‘payer’: ‘developer’}
- prompt_cache_retention: None
- store: True
Let’s test async streaming.
r = await cli.responses.create(model=mods.oai, input=mk_msgs("openai async streaming"), stream=True)
async for ch in r: print(ch)ResponseCreatedEvent(response=Response(id='resp_05d129295d2177e400692ede76dbe481a3856cbae82d8bd97d', created_at=1764679286.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, prompt_cache_retention=None, store=True), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_05d129295d2177e400692ede76dbe481a3856cbae82d8bd97d', created_at=1764679286.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, prompt_cache_retention=None, store=True), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', content=[], role='assistant', status='in_progress', type='message'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=0, delta='Eff', item_id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', logprobs=[], output_index=0, sequence_number=4, type='response.output_text.delta', obfuscation='8cG9Ub8L8V5I9')
ResponseTextDeltaEvent(content_index=0, delta='icient', item_id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', logprobs=[], output_index=0, sequence_number=5, type='response.output_text.delta', obfuscation='ViOTkG9eDM')
ResponseTextDoneEvent(content_index=0, item_id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', logprobs=[], output_index=0, sequence_number=6, text='Efficient', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=0, item_id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', output_index=0, part=ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[]), sequence_number=7, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', content=[ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[])], role='assistant', status='completed', type='message'), output_index=0, sequence_number=8, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_05d129295d2177e400692ede76dbe481a3856cbae82d8bd97d', created_at=1764679286.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', content=[ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=In: 16; Out: 3; Total: 19, user=None, prompt_cache_retention=None, store=True), sequence_number=9, type='response.completed')
r = await cli.responses.create(model=mods.oai, input=mk_msgs("openai async streaming"), stream=True)
async for ch in r: print(ch)ResponseCreatedEvent(response=Response(id='resp_05d129295d2177e400692ede76dbe481a3856cbae82d8bd97d', created_at=1764679286.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, prompt_cache_retention=None, store=True), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_05d129295d2177e400692ede76dbe481a3856cbae82d8bd97d', created_at=1764679286.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, prompt_cache_retention=None, store=True), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', content=[], role='assistant', status='in_progress', type='message'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=0, delta='Eff', item_id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', logprobs=[], output_index=0, sequence_number=4, type='response.output_text.delta', obfuscation='8cG9Ub8L8V5I9')
ResponseTextDeltaEvent(content_index=0, delta='icient', item_id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', logprobs=[], output_index=0, sequence_number=5, type='response.output_text.delta', obfuscation='ViOTkG9eDM')
ResponseTextDoneEvent(content_index=0, item_id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', logprobs=[], output_index=0, sequence_number=6, text='Efficient', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=0, item_id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', output_index=0, part=ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[]), sequence_number=7, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', content=[ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[])], role='assistant', status='completed', type='message'), output_index=0, sequence_number=8, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_05d129295d2177e400692ede76dbe481a3856cbae82d8bd97d', created_at=1764679286.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_05d129295d2177e400692ede77796481a3b7a73f61fbe10b8b', content=[ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=In: 16; Out: 3; Total: 19, user=None, prompt_cache_retention=None, store=True), sequence_number=9, type='response.completed')
Anthropic
from anthropic import Anthropic
cli = Anthropic()
r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync"))
r
Coordination
- id: msg_01CfjNMMiotXXVesJp8bhZy2
- content: [{'citations': None, 'text': 'Coordination', 'type': 'text'}]
- model: claude-sonnet-4-20250514
- role: assistant
- stop_reason: end_turn
- stop_sequence: None
- type: message
- usage: {'cache_creation': {'ephemeral_1h_input_tokens': 0, 'ephemeral_5m_input_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'input_tokens': 15, 'output_tokens': 5, 'server_tool_use': None, 'service_tier': 'standard'}
r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync"))
r
Coordination
- id: msg_01CfjNMMiotXXVesJp8bhZy2
- content: [{'citations': None, 'text': 'Coordination', 'type': 'text'}]
- model: claude-sonnet-4-20250514
- role: assistant
- stop_reason: end_turn
- stop_sequence: None
- type: message
- usage: {'cache_creation': {'ephemeral_1h_input_tokens': 0, 'ephemeral_5m_input_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'input_tokens': 15, 'output_tokens': 5, 'server_tool_use': None, 'service_tier': 'standard'}
Let’s test streaming.
r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync streaming"), stream=True)
for ch in r: print(ch)RawMessageStartEvent(message=Message(id='msg_015x4UT4k9GhMN47kCy1ctrt', content=[], model='claude-sonnet-4-20250514', role='assistant', stop_reason=None, stop_sequence=None, type='message', usage=In: 16; Out: 2; Cache create: 0; Cache read: 0; Total Tokens: 18; Search: 0), type='message_start')
RawContentBlockStartEvent(content_block=TextBlock(citations=None, text='', type='text'), index=0, type='content_block_start')
RawContentBlockDeltaEvent(delta=TextDelta(text='Buff', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text='ering', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockStopEvent(index=0, type='content_block_stop')
RawMessageDeltaEvent(delta=Delta(stop_reason='end_turn', stop_sequence=None), type='message_delta', usage=MessageDeltaUsage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=16, output_tokens=6, server_tool_use=None))
RawMessageStopEvent(type='message_stop')
r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync streaming"), stream=True)
for ch in r: print(ch)RawMessageStartEvent(message=Message(id='msg_015x4UT4k9GhMN47kCy1ctrt', content=[], model='claude-sonnet-4-20250514', role='assistant', stop_reason=None, stop_sequence=None, type='message', usage=In: 16; Out: 2; Cache create: 0; Cache read: 0; Total Tokens: 18; Search: 0), type='message_start')
RawContentBlockStartEvent(content_block=TextBlock(citations=None, text='', type='text'), index=0, type='content_block_start')
RawContentBlockDeltaEvent(delta=TextDelta(text='Buff', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text='ering', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockStopEvent(index=0, type='content_block_stop')
RawMessageDeltaEvent(delta=Delta(stop_reason='end_turn', stop_sequence=None), type='message_delta', usage=MessageDeltaUsage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=16, output_tokens=6, server_tool_use=None))
RawMessageStopEvent(type='message_stop')
Let’s test async.
from anthropic import AsyncAnthropic
cli = AsyncAnthropic()
r = await cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant async"))
r
Concurrency
- id: msg_01NMWNEQCiGWepH1g4eB2yDb
- content: [{'citations': None, 'text': '**Concurrency**', 'type': 'text'}]
- model: claude-sonnet-4-20250514
- role: assistant
- stop_reason: end_turn
- stop_sequence: None
- type: message
- usage: {'cache_creation': {'ephemeral_1h_input_tokens': 0, 'ephemeral_5m_input_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'input_tokens': 15, 'output_tokens': 8, 'server_tool_use': None, 'service_tier': 'standard'}
r = await cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant async"))
r
Concurrency
- id: msg_01NMWNEQCiGWepH1g4eB2yDb
- content: [{'citations': None, 'text': '**Concurrency**', 'type': 'text'}]
- model: claude-sonnet-4-20250514
- role: assistant
- stop_reason: end_turn
- stop_sequence: None
- type: message
- usage: {'cache_creation': {'ephemeral_1h_input_tokens': 0, 'ephemeral_5m_input_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'input_tokens': 15, 'output_tokens': 8, 'server_tool_use': None, 'service_tier': 'standard'}
Let’s test async streaming.
r = await cli.messages.create(model=mods.ant,max_tokens=1024,messages=mk_msgs("ant async streaming"), stream=True)
async for ch in r.response.aiter_bytes(): print(ch.decode())
event: message_start
data: {"type":"message_start","message":{"model":"claude-sonnet-4-20250514","id":"msg_01HSogveNre4UeiteLGqiZkt","type":"message","role":"assistant","content":[],"stop_reason":null,"stop_sequence":null,"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"cache_creation":{"ephemeral_5m_input_tokens":0,"ephemeral_1h_input_tokens":0},"output_tokens":1,"service_tier":"standard"}} }
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""} }
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"**"} }
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Concurrent"} }
event: ping
data: {"type": "ping"}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"**"} }
event: content_block_stop
data: {"type":"content_block_stop","index":0 }
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"output_tokens":7} }
event: message_stop
data: {"type":"message_stop" }
r = await cli.messages.create(model=mods.ant,max_tokens=1024,messages=mk_msgs("ant async streaming"), stream=True)
async for ch in r.response.aiter_bytes(): print(ch.decode())
event: message_start
data: {"type":"message_start","message":{"model":"claude-sonnet-4-20250514","id":"msg_01HSogveNre4UeiteLGqiZkt","type":"message","role":"assistant","content":[],"stop_reason":null,"stop_sequence":null,"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"cache_creation":{"ephemeral_5m_input_tokens":0,"ephemeral_1h_input_tokens":0},"output_tokens":1,"service_tier":"standard"}} }
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""} }
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"**"} }
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Concurrent"} }
event: ping
data: {"type": "ping"}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"**"} }
event: content_block_stop
data: {"type":"content_block_stop","index":0 }
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"output_tokens":7} }
event: message_stop
data: {"type":"message_stop" }
LiteLLM
Let’s test the LiteLLM SDK by running sync/async calls with(out) streaming for OpenAI, Anthropic, & Gemini.
We’ll also double check tool calls and citations.
Sync Tests
from litellm import completion
Let’s define a helper method to display a streamed response.
def _stream(r):
    for ch in r: print(ch.choices[0].delta.content or "")
Anthropic
Let’s test claude-sonnet-x.
r = completion(model=mods.ant, messages=mk_msgs("lite: ant sync..."))
rModelResponse(id='chatcmpl-4a601d73-c5e0-4a99-9bb5-72184b503ceb', created=1764679321, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='**partial**', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=6, prompt_tokens=18, total_tokens=24, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))
r = completion(model=mods.ant, messages=mk_msgs("lite: ant sync..."))
rModelResponse(id='chatcmpl-858c663c-5f1a-4836-80ef-2f99c9cafef9', created=1764679321, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='**partial**', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=6, prompt_tokens=18, total_tokens=24, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))
Now, with streaming enabled.
r = completion(model=mods.ant, messages=mk_msgs("lite: ant sync stream..."), stream=True)
_stream(r)
**
Efficient
**
r = completion(model=mods.ant, messages=mk_msgs("lite: ant sync stream..."), stream=True)
_stream(r)
**
Efficient
**
OpenAI
Let’s test gpt-4o.
r = completion(model=mods.oai, messages=mk_msgs("lite: oai sync..."))
rModelResponse(id='chatcmpl-CiJzMl6JYB7siXcKdtd6IzFhjOSSU', created=1764679304, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_e819e3438b', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Connectivity', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=1, prompt_tokens=18, total_tokens=19, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')
r = completion(model=mods.oai, messages=mk_msgs("lite: oai sync..."))
rModelResponse(id='chatcmpl-CiJzMl6JYB7siXcKdtd6IzFhjOSSU', created=1764679304, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_e819e3438b', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Connectivity', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=1, prompt_tokens=18, total_tokens=19, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')
Now, with streaming enabled.
r = completion(model=mods.oai, messages=mk_msgs("lite: oai sync stream..."), stream=True)
_stream(r)
Integration
r = completion(model=mods.oai, messages=mk_msgs("lite: oai sync stream..."), stream=True)
_stream(r)
Integration
Gemini
Let’s test 2.0-flash.
r = completion(model=mods.gem, messages=mk_msgs("lite: gem sync..."))
rModelResponse(id='it4uaaeWJNKlkdUPv-fwiQg', created=1764679325, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Efficient.\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=3, prompt_tokens=10, total_tokens=13, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=10, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])
r = completion(model=mods.gem, messages=mk_msgs("lite: gem sync..."))
rModelResponse(id='it4uaaeWJNKlkdUPv-fwiQg', created=1764679325, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Efficient.\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=3, prompt_tokens=10, total_tokens=13, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=10, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])
Now, with streaming enabled.
r = completion(model=mods.gem, messages=mk_msgs("lite: gem sync stream..."), stream=True)
_stream(r)
Fast
r = completion(model=mods.gem, messages=mk_msgs("lite: gem sync stream..."), stream=True)
_stream(r)
Fast
Async Tests
from litellm import acompletion
async def _astream(r):
    async for chunk in r: print(chunk.choices[0].delta.content or "")
Anthropic
Let’s test claude-sonnet-x.
r = await acompletion(model=mods.ant, messages=mk_msgs("lite: ant async..."))
rModelResponse(id='chatcmpl-36db5251-9a24-4394-b820-b1b1e6eff9c3', created=1764679329, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='**coroutines**', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=8, prompt_tokens=18, total_tokens=26, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))
r = await acompletion(model=mods.ant, messages=mk_msgs("lite: ant async..."))
rModelResponse(id='chatcmpl-9455c342-beae-4ff2-a012-d11bd8e959ca', created=1764679329, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='**coroutines**', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=8, prompt_tokens=18, total_tokens=26, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))
Now, with streaming enabled.
r = await acompletion(model=mods.ant, messages=mk_msgs("lite: ant async stream..."), stream=True)
await(_astream(r))
**
concurrent
**
r = await acompletion(model=mods.ant, messages=mk_msgs("lite: ant async stream..."), stream=True)
await(_astream(r))
**
concurrent
**
OpenAI
Let’s test gpt-4o.
r = await acompletion(model=mods.oai, messages=mk_msgs("lite: oai async..."))
rModelResponse(id='chatcmpl-CiJzVWXtqcySUdl7b9G7E80vaWjwW', created=1764679313, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_e819e3438b', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Efficient.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=3, prompt_tokens=18, total_tokens=21, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')
r = await acompletion(model=mods.oai, messages=mk_msgs("lite: oai async..."))
rModelResponse(id='chatcmpl-CiJzVWXtqcySUdl7b9G7E80vaWjwW', created=1764679313, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_e819e3438b', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Efficient.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=3, prompt_tokens=18, total_tokens=21, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')
Now, with streaming enabled.
r = await acompletion(model=mods.oai, messages=mk_msgs("lite: oai async stream..."), stream=True)
await(_astream(r))
Illuminate
r = await acompletion(model=mods.oai, messages=mk_msgs("lite: oai async stream..."), stream=True)
await(_astream(r))
Illuminate
Gemini
Let’s test 2.0-flash.
r = await acompletion(model=mods.gem, messages=mk_msgs("lite: gem async..."))
rModelResponse(id='kt4uaenhMJ66xN8Pgpv6gAE', created=1764679333, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Concurrency\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=2, prompt_tokens=10, total_tokens=12, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=10, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])
r = await acompletion(model=mods.gem, messages=mk_msgs("lite: gem async..."))
rModelResponse(id='kt4uaenhMJ66xN8Pgpv6gAE', created=1764679334, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Concurrency\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=2, prompt_tokens=10, total_tokens=12, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=10, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])
Now, with streaming enabled.
r = await acompletion(model=mods.gem, messages=mk_msgs("lite: gem async stream..."), stream=True)
await(_astream(r))
Fast
r = await acompletion(model=mods.gem, messages=mk_msgs("lite: gem async stream..."), stream=True)
await(_astream(r))
Fast
Tool Calls
As a sanity check let’s confirm that tool calls work.
tools = [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {"type":"string", "description":"The city e.g. Reims"},
"unit": {"type":"string", "enum":["celsius", "fahrenheit"]},
},
"required": ["location"],
}
}
}
]
r = completion(model=mods.ant, messages=mk_msgs("Is it raining in Reims?"), tools=tools)
rModelResponse(id='chatcmpl-4a571e36-69d8-472c-8564-1e8056f6c849', created=1764679337, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='tool_calls', index=0, message=Message(content=None, role='assistant', tool_calls=[ChatCompletionMessageToolCall(index=0, function=Function(arguments='{"location": "Reims"}', name='get_current_weather'), id='toolu_017sHr4VFzg6Nh7jt8wSescj', type='function')], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=57, prompt_tokens=427, total_tokens=484, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))
r = completion(model=mods.ant, messages=mk_msgs("Is it raining in Reims?"), tools=tools)
rModelResponse(id='chatcmpl-807802a2-39df-47d6-b24b-946ece4c1d60', created=1764679337, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='tool_calls', index=0, message=Message(content=None, role='assistant', tool_calls=[ChatCompletionMessageToolCall(index=0, function=Function(arguments='{"location": "Reims"}', name='get_current_weather'), id='toolu_017sHr4VFzg6Nh7jt8wSescj', type='function')], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=57, prompt_tokens=427, total_tokens=484, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))
Multipart Request
cli = Anthropic()
r = cli.beta.files.upload(file=("ex.txt", b"hello world", "text/plain"))
r
FileMetadata(id='file_011CVhstprQYGfRzyf5RpTWu', created_at=datetime.datetime(2025, 12, 2, 12, 41, 59, 74000, tzinfo=datetime.timezone.utc), filename='ex.txt', mime_type='text/plain', size_bytes=11, type='file', downloadable=False)
cli = Anthropic()
r = cli.beta.files.upload(file=("ex.txt", b"hello world", "text/plain"))
r
FileMetadata(id='file_011CVhstprQYGfRzyf5RpTWu', created_at=datetime.datetime(2025, 12, 2, 12, 41, 59, 74000, tzinfo=datetime.timezone.utc), filename='ex.txt', mime_type='text/plain', size_bytes=11, type='file', downloadable=False)
Gemini Model Comparison
When LiteLLM calls Gemini it includes the model name in the url. Let’s test that we can run the same prompt with two different Gemini models.
mods.gem
'gemini/gemini-2.0-flash'
r = completion(model=mods.gem, messages=mk_msgs("lite: gem different models..."))
rModelResponse(id='l94uabyuKYKlkdUPh6DpkQg', created=1764679338, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Streamlined\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=3, prompt_tokens=11, total_tokens=14, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=11, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])
r = completion(model="gemini/gemini-2.5-flash", messages=mk_msgs("lite: gem different models..."))
rModelResponse(id='nN4uafGuNYGekdUP0cTEkQM', created=1764679338, model='gemini-2.5-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Versatile', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=867, prompt_tokens=12, total_tokens=879, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=865, rejected_prediction_tokens=None, text_tokens=2), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=12, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])