core

Cache your API calls with a single line of code. No mocks, no fixtures. Just faster, cleaner code.

Introduction

We often call APIs while prototyping and testing our code. A single API call (e.g. an Anthropic chat completion) can take hundreds of milliseconds to run. This can really slow down development, especially if our notebook contains many API calls 😞.

cachy caches API requests. It does this by saving the result of each API call to a local cachy.jsonl file. Before calling an API (e.g. OpenAI) it checks whether the request already exists in cachy.jsonl. If it does, it returns the cached result.
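For example, dropping these two lines at the top of a notebook is enough to start caching (this is the enable_cachy helper covered later in this notebook):

from cachy import enable_cachy
enable_cachy()  # subsequent LLM API calls are served from cachy.jsonl whenever possible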

How does it work?

Under the hood, popular SDKs like OpenAI, Anthropic and LiteLLM use httpx.Client and httpx.AsyncClient.

cachy patches the send method of both clients and injects a simple caching mechanism:

  • create a cache key from the request
  • if the key exists in cachy.jsonl return the cached response
  • if not, call the API and save the response to cachy.jsonl
Let's start with the imports we'll use throughout this notebook.

import hashlib, json, os, tempfile, httpx
from pathlib import Path
from types import SimpleNamespace
from httpx import AsyncClient, RequestNotRead
from fastcore.test import *
from fastcore.basics import patch

cachy.jsonl contains one API response per line.

Each line has the following format {"key": key, "response": response}

  • key: hash of the API request
  • response: the API response.
{
    "key": "afc2be0c", 
    "response": "{\"id\":\"msg_xxx\",\"type\":\"message\",\"role\":\"assistant\",\"model\":\"claude-sonnet-4-20250514\",\"content\":[{\"type\":\"text\",\"text\":\"Coordination.\"}],\"stop_reason\":\"end_turn\",\"stop_sequence\":null,\"usage\":{\"input_tokens\":16,\"cache_creation_input_tokens\":0,\"cache_read_input_tokens\":0,\"cache_creation\":{\"ephemeral_5m_input_tokens\":0,\"ephemeral_1h_input_tokens\":0},\"output_tokens\":6,\"service_tier\":\"standard\"}}"
}

Patching httpx

Patching a method is very straightforward.

In our case we want to patch httpx.Client.send and httpx.AsyncClient.send.

These methods are called when running httpx.get, httpx.post, etc.

In the example below we use @patch from fastcore to print 'calling an API' whenever httpx.Client.send is run.

@patch
def send(self:httpx.Client, r, **kwargs):
    print('calling an API')
    return self._orig_send(r, **kwargs)  # _orig_send holds the original, unpatched httpx.Client.send
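Note that _orig_send isn't an httpx attribute: for the snippet above to work, the original method has to be stashed on the class before the patch is applied. A minimal sketch of how that might look:

httpx.Client._orig_send = httpx.Client.send  # illustrative: run this *before* applying the patch above

httpx.get('https://example.com')  # now prints 'calling an API' before performing the request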

Cache Filtering

Now, let’s build up our caching logic piece-by-piece.

The first thing we need to do is ensure that our caching logic only runs on specific URLs.

For now, let’s only cache API calls made to popular LLM providers like OpenAI, Anthropic, Google and DeepSeek. We can make this fully customizable later.

Exported source
doms = ("chatgpt.com", "api.openai.com", "api.anthropic.com", "generativelanguage.googleapis.com", "api.deepseek.com",
    'api.fireworks.ai', 'openrouter.ai', 'api.groq.com', 'api.together.xyz', 'api.mistral.ai', 'api.x.ai')
Exported source
def _should_cache(url, doms): return any(dom in str(url) for dom in doms)
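For example:

_should_cache('https://api.openai.com/v1/chat/completions', doms)  # True
_should_cache('https://example.com/v1/whatever', doms)             # False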

We could then use _should_cache like this.

@patch
def send(self:httpx._client.Client, r, **kwargs):
    if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
    # insert caching logic
    ...

Cache Key

The next thing we need to do is figure out if a response for the request r already exists in our cache.

Recall that each line in cachy.jsonl has the following format {"key": key, "response": response}.

Our key needs to be unique and deterministic. One way to do this is to concatenate the request URL and content, then generate a hash from the result.

def _key(r): return hashlib.sha256(str(r.url.copy_remove_param('key')).encode() + r.content).hexdigest()[:8]

When LiteLLM calls Gemini it includes the API key in a query param, which is why we strip the key param from the URL.
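For example, copy_remove_param gives us the URL with the key query param dropped (the URL below is made up):

u = httpx.URL('https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key=SECRET')
str(u.copy_remove_param('key'))
'https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent'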

Let’s test this out.

r1 = httpx.Request('POST', 'https://api.openai.com/v1/chat/completions', content=b'some content')
r1
<Request('POST', 'https://api.openai.com/v1/chat/completions')>
_key(r1)
'2d135d43'

If we run it again we should get the same key.

_key(r1)
'2d135d43'

Let’s modify the url and confirm we get a different key.

_key(httpx.Request('POST', 'https://api.anthropic.com/v1/messages', content=b'some content'))
'8a99b0a9'

Great. Let’s update our patch.

@patch
def send(self:httpx._client.Client, r, **kwargs):
    if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
    key = _key(r)
    # if cache hit return the response
    # else run the request, write the response to the cache and return it
    ...

Cache Reads/Writes

Now let’s add some methods that will read from and write to cachy.jsonl.
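cachy's actual helpers aren't shown here, but a minimal sketch (the names _load_cache and _save_cache are illustrative, not cachy's real internals) could look like this:

def _load_cache(fp):
    "Read cachy.jsonl into a dict mapping key -> cached response."
    fp = Path(fp)
    if not fp.exists(): return {}
    lines = [l for l in fp.read_text().splitlines() if l.strip()]
    return {o['key']: o['response'] for o in map(json.loads, lines)}

def _save_cache(fp, key, response):
    "Append one {'key': ..., 'response': ...} record to cachy.jsonl."
    with open(fp, 'a') as f: f.write(json.dumps({'key': key, 'response': response}) + '\n')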

Multipart Requests

_key will throw the following error for multipart requests (e.g. file uploads).

RequestNotRead: Attempted to access streaming request content, without having called read().

rfu = httpx.Request('POST', 'https://api.openai.com/v1/chat/completions', files={"file": ("test.txt", b"hello")})
rfu
<Request('POST', 'https://api.openai.com/v1/chat/completions')>
test_fail(lambda: _key(rfu), RequestNotRead)
rfu.read(); _key(rfu);

Each part of a multipart request is separated by a delimiter called a boundary, which has the structure --{RANDOM_ID}. Here’s an example for rfu.

b'--f9ee33966b45cc8c80952bb57cc728c4\r\nContent-Disposition: form-data; name="file"; filename="test.txt"\r\nContent-Type: text/plain\r\n\r\nhello\r\n--f9ee33966b45cc8c80952bb57cc728c4--\r\n'

Because the boundary is a random id, two identical multipart requests will produce different boundaries. And since the boundary is part of the request content, _key will generate different keys, leading to cache misses 😞.

Let’s create a helper method _content that will extract content from any request and remove the non-deterministic boundary.
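Here's a rough sketch of the idea behind _content (the real cachy implementation may differ): read the body and swap the random boundary for a fixed placeholder.

import re

_boundary_re = re.compile(rb'--[0-9a-f]{32}')

def _content(r):
    "Request body with any random multipart boundary replaced by a fixed one."
    try: c = r.content
    except httpx.RequestNotRead: c = r.read()
    return _boundary_re.sub(b'--cachy-boundary', c)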

rfu = httpx.Request('POST', 'https://api.openai.com/v1/chat/completions', files={"file": ("test.txt", b"hello")})
rfu
<Request('POST', 'https://api.openai.com/v1/chat/completions')>
_content(rfu)
b'--cachy-boundary\r\nContent-Disposition: form-data; name="file"; filename="test.txt"\r\nContent-Type: text/plain\r\n\r\nhello\r\n--cachy-boundary--\r\n'
def _key(r): return hashlib.sha256(str(r.url.copy_remove_param('key')).encode() + _content(r)).hexdigest()[:8]

Let’s confirm that running _key multiple times on the same multipart request now returns the same key.

_key(rfu), _key(rfu)
('9ae79ac5', '9ae79ac5')

Streaming

Let’s add support for streaming.

First, let’s include an is_stream bool in our hash so that a non-streamed request generates a different key from the same request when streamed. We also want the hash to ignore incidental differences in how SDKs serialize JSON: the test below checks that _norm_content returns the same result for two JSON bodies whose fields appear in a different order.

c1 = b'{"model":"kimi-k2.5","messages":[{"content":"Say hello in French","role":"user"}],"max_tokens":64,"temperature":1.0}'
c2 = b'{"messages":[{"role":"user","content":"Say hello in French"}],"model":"kimi-k2.5","max_tokens":64,"temperature":1.0}'
test_eq(_norm_content(SimpleNamespace(headers={'content-type': 'application/json'}, content=c1, _content='done reading')), 
        _norm_content(SimpleNamespace(headers={'content-type': 'application/json'}, content=c2, _content='done reading')))
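A sketch of a normalization that would satisfy the test above (cachy's real _norm_content likely handles more cases):

def _norm_content(r):
    "Canonicalize JSON bodies so field order doesn't change the hash."
    if 'json' in r.headers.get('content-type', ''):
        try: return json.dumps(json.loads(r.content), sort_keys=True).encode()
        except Exception: pass
    return r.content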

In the patch we need to consume the entire stream before writing it to the cache.
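The patch below delegates to a _send helper; a simplified sync sketch of it (reusing the illustrative _load_cache/_save_cache from earlier, and glossing over headers and streaming replay details) might look like:

def _send(fp, doms, client, request, **kwargs):
    "Illustrative sketch of the caching wrapper around httpx.Client.send."
    if not _should_cache(request.url, doms): return client._orig_send(request, **kwargs)
    key, cache = _key(request), _load_cache(fp)
    if key in cache:  # cache hit: rebuild a Response from the stored body
        return httpx.Response(200, content=cache[key].encode(), request=request)
    resp = client._orig_send(request, **kwargs)
    resp.read()                      # consume the whole (possibly streamed) body
    _save_cache(fp, key, resp.text)  # then persist it
    return resp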

@patch
def send(self:httpx._client.Client, request, **kwargs):
    return _send('cachy.jsonl', doms, self, request, **kwargs)
doms = doms + ('httpbin.org',)  # add httpbin so we can exercise the cache without hitting an LLM provider
origdir = os.getcwd()
os.chdir(tempfile.mkdtemp())    # work in a temp dir so the test cachy.jsonl doesn't pollute the repo
r1 = httpx.post('https://httpbin.org/post', json={'a':1})
r1.json()['headers']
{'Accept': '*/*',
 'Accept-Encoding': 'gzip, deflate, br, zstd',
 'Content-Length': '7',
 'Content-Type': 'application/json',
 'Host': 'httpbin.org',
 'User-Agent': 'python-httpx/0.28.1',
 'X-Amzn-Trace-Id': 'Root=1-69f0039c-631b8c577999f0533e3506c4'}
r2 = httpx.post('https://httpbin.org/post', json={'a':1})
assert r2.text==r1.text

enable_cachy

To make cachy as user-friendly as possible, let’s make it possible to apply our patch by calling a single function at the top of our notebook.

from cachy import enable_cachy

enable_cachy()
def enable_cachy(cache_dir=None, doms=doms):
    cfp = Path(cache_dir or find_file_parents("pyproject.toml") or ".") / "cachy.jsonl"
    cfp.touch(exist_ok=True)   
    _apply_patch(cfp, doms)

Async

Now let’s add support for async requests.
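The async patch mirrors the sync one, assuming an async counterpart of the caching helper (called _asend here purely for illustration):

@patch
async def send(self:httpx.AsyncClient, request, **kwargs):
    # same flow as the sync patch, but awaiting the original AsyncClient.send
    return await _asend('cachy.jsonl', doms, self, request, **kwargs)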


enable_cachy


def enable_cachy(
    cache_dir:NoneType=None,
    doms:tuple=('chatgpt.com', 'api.openai.com', 'api.anthropic.com', 'generativelanguage.googleapis.com', 'api.deepseek.com', 'api.fireworks.ai', 'openrouter.ai', 'api.groq.com', 'api.together.xyz', 'api.mistral.ai', 'api.x.ai'),
    hdrs:NoneType=None, debug:bool=False
):

Patch httpx so that requests to the given domains are cached in cachy.jsonl.


disable_cachy


def disable_cachy(
    
):

Remove the cachy patches and restore the original httpx behavior.
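disable_cachy presumably just restores the stashed methods; a hedged sketch:

def disable_cachy():
    "Illustrative only: restore the original (unpatched) httpx send methods."
    if hasattr(httpx.Client, '_orig_send'): httpx.Client.send = httpx.Client._orig_send
    if hasattr(httpx.AsyncClient, '_orig_send'): httpx.AsyncClient.send = httpx.AsyncClient._orig_send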

enable_cachy(debug=True)
r1 = httpx.post('https://httpbin.org/post', json={'a':1})
r2 = httpx.post('https://httpbin.org/post', json={'a':1})
test_eq(r1.text, r2.text)
🔴 MISS 6422adea
b'{"a":1}'
🟢 HIT 6422adea
b'{"a":1}'
async with AsyncClient() as c:
    r1 = await c.post('https://httpbin.org/post', json={'a':2})
    r2 = await c.post('https://httpbin.org/post', json={'a':2})
test_eq(r1.text, r2.text)
🔴 MISS 46253226
b'{"a":2}'
🟢 HIT 46253226
b'{"a":2}'
with httpx.stream('POST', 'https://httpbin.org/post', json={'a':3}) as r1: t1 = r1.read()
with httpx.stream('POST', 'https://httpbin.org/post', json={'a':3}) as r2: t2 = r2.read()
test_eq(t1, t2)
🟢 HIT 3085a21f
b'{"a":3}'
🟢 HIT 3085a21f
b'{"a":3}'
from litellm.llms.custom_httpx.aiohttp_transport import LiteLLMAiohttpTransport
from aiohttp import ClientSession
async with ClientSession() as s, AsyncClient(transport=LiteLLMAiohttpTransport(client=s, owns_session=False)) as c:
    r1 = await c.post('https://httpbin.org/post', json={'a':4})
    r2 = await c.post('https://httpbin.org/post', json={'a':4})
test_eq(r1.text, r2.text)
🟢 HIT d536dd72
b'{"a":4}'
🟢 HIT d536dd72
b'{"a":4}'
from litellm import completion,acompletion
haik = "claude-haiku-4-5"
msg = [{"role": "user", "content": "Hi."}]

r = await acompletion(model=haik, messages=msg)
r
🟢 HIT 1f44e00f
b'{"model":"claude-haiku-4-5","messages":[{"role":"user","content":[{"type":"text","text":"Hi."}]}],"max_tokens":64000}'

Hi! How’s it going? What can I help you with?

  • id: chatcmpl-3b11ab88-ebbf-457c-943b-47e6e2115aa2
  • model: claude-haiku-4-5-20251001
  • finish_reason: stop
  • usage: Usage(completion_tokens=17, prompt_tokens=9, total_tokens=26, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=17, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=9, image_tokens=None, video_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0, inference_geo='not_available', speed=None)
r = await acompletion(model=haik, messages=msg, stream=True)
async for ch in r: print(ch.choices[0].delta.content or "")
🟢 HIT 99ef8862
b'{"model": "claude-haiku-4-5", "messages": [{"role": "user", "content": [{"type": "text", "text": "Hi."}]}], "max_tokens": 64000, "stream": true}'
Hi
! How can I help you today?
disable_cachy()

Tests

Let’s test enable_cachy on 3 SDKs (OpenAI, Anthropic, LiteLLM) for the scenarios below:

  • sync requests with(out) streaming
  • async requests with(out) streaming

Add some helper functions.

class mods: ant="claude-sonnet-4-20250514"; oai="gpt-4o"; gem="gemini/gemini-2.5-flash"
def mk_msgs(m): return [{"role": "user", "content": f"write 1 word about {m}"}]
enable_cachy(debug=True)

OpenAI

from openai import OpenAI
cli = OpenAI()
r = cli.responses.create(model=mods.oai, input=mk_msgs("openai sync"))
r
🟢 HIT 57d94e97
b'{"input":[{"role":"user","content":"write 1 word about openai sync"}],"model":"gpt-4o"}'
Response(id='resp_0e313853e3006e850069efcd5d7a2c81978ade6e918c1aa896', created_at=1777323357.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_0e313853e3006e850069efcd5e53848197ae68fd371c239602', content=[ResponseOutputText(annotations=[], text='Collaboration', type='output_text', logprobs=[])], role='assistant', status='completed', type='message', phase=None)], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, completed_at=1777323358.0, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, prompt_cache_retention='in_memory', reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=ResponseUsage(input_tokens=15, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=3, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=18), user=None, billing={'payer': 'developer'}, frequency_penalty=0.0, moderation=None, presence_penalty=0.0, store=True)
r = cli.responses.create(model=mods.oai, input=mk_msgs("openai sync"))
r
🟢 HIT 57d94e97
b'{"input":[{"role":"user","content":"write 1 word about openai sync"}],"model":"gpt-4o"}'
Response(id='resp_0e313853e3006e850069efcd5d7a2c81978ade6e918c1aa896', created_at=1777323357.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_0e313853e3006e850069efcd5e53848197ae68fd371c239602', content=[ResponseOutputText(annotations=[], text='Collaboration', type='output_text', logprobs=[])], role='assistant', status='completed', type='message', phase=None)], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, completed_at=1777323358.0, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, prompt_cache_retention='in_memory', reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=ResponseUsage(input_tokens=15, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=3, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=18), user=None, billing={'payer': 'developer'}, frequency_penalty=0.0, moderation=None, presence_penalty=0.0, store=True)

Let’s test streaming.

r = cli.responses.create(model=mods.oai, input=mk_msgs("openai sync streaming"), stream=True)
for ch in r: print(str(ch)[:60])
🟢 HIT 24e0bc9b
b'{"input":[{"role":"user","content":"write 1 word about openai sync streaming"}],"model":"gpt-4o","stream":true}'
ResponseCreatedEvent(response=Response(id='resp_0963cfaffbbc
ResponseInProgressEvent(response=Response(id='resp_0963cfaff
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='
ResponseContentPartAddedEvent(content_index=0, item_id='msg_
ResponseTextDeltaEvent(content_index=0, delta='Innov', item_
ResponseTextDeltaEvent(content_index=0, delta='ative', item_
ResponseTextDoneEvent(content_index=0, item_id='msg_0963cfaf
ResponseContentPartDoneEvent(content_index=0, item_id='msg_0
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='m
ResponseCompletedEvent(response=Response(id='resp_0963cfaffb
r = cli.responses.create(model=mods.oai, input=mk_msgs("openai sync streaming"), stream=True)
for ch in r: print(str(ch)[:60])
🟢 HIT 24e0bc9b
b'{"input":[{"role":"user","content":"write 1 word about openai sync streaming"}],"model":"gpt-4o","stream":true}'
ResponseCreatedEvent(response=Response(id='resp_0963cfaffbbc
ResponseInProgressEvent(response=Response(id='resp_0963cfaff
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='
ResponseContentPartAddedEvent(content_index=0, item_id='msg_
ResponseTextDeltaEvent(content_index=0, delta='Innov', item_
ResponseTextDeltaEvent(content_index=0, delta='ative', item_
ResponseTextDoneEvent(content_index=0, item_id='msg_0963cfaf
ResponseContentPartDoneEvent(content_index=0, item_id='msg_0
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='m
ResponseCompletedEvent(response=Response(id='resp_0963cfaffb

Let’s test async.

from openai import AsyncOpenAI
cli = AsyncOpenAI()
r = await cli.responses.create(model=mods.oai, input=mk_msgs("openai async"))
r
🟢 HIT b823478a
b'{"input":[{"role":"user","content":"write 1 word about openai async"}],"model":"gpt-4o"}'
Response(id='resp_058882a1287ac2fd0069efcd7037748196ab217b8d9b8eabca', created_at=1777323376.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_058882a1287ac2fd0069efcd718e5c819699c6ae1c591177c5', content=[ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[])], role='assistant', status='completed', type='message', phase=None)], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, completed_at=1777323377.0, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, prompt_cache_retention='in_memory', reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=ResponseUsage(input_tokens=15, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=3, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=18), user=None, billing={'payer': 'developer'}, frequency_penalty=0.0, moderation=None, presence_penalty=0.0, store=True)
r = await cli.responses.create(model=mods.oai, input=mk_msgs("openai async"))
r
🟢 HIT b823478a
b'{"input":[{"role":"user","content":"write 1 word about openai async"}],"model":"gpt-4o"}'
Response(id='resp_058882a1287ac2fd0069efcd7037748196ab217b8d9b8eabca', created_at=1777323376.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_058882a1287ac2fd0069efcd718e5c819699c6ae1c591177c5', content=[ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[])], role='assistant', status='completed', type='message', phase=None)], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, completed_at=1777323377.0, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, prompt_cache_retention='in_memory', reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=ResponseUsage(input_tokens=15, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=3, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=18), user=None, billing={'payer': 'developer'}, frequency_penalty=0.0, moderation=None, presence_penalty=0.0, store=True)

Let’s test async streaming.

r = await cli.responses.create(model=mods.oai, input=mk_msgs("openai async streaming"), stream=True)
async for ch in r: print(str(ch)[:60])
🟢 HIT b3ff4fe2
b'{"input":[{"role":"user","content":"write 1 word about openai async streaming"}],"model":"gpt-4o","stream":true}'
ResponseCreatedEvent(response=Response(id='resp_0df7c016da93
ResponseInProgressEvent(response=Response(id='resp_0df7c016d
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='
ResponseContentPartAddedEvent(content_index=0, item_id='msg_
ResponseTextDeltaEvent(content_index=0, delta='Innov', item_
ResponseTextDeltaEvent(content_index=0, delta='ative', item_
ResponseTextDoneEvent(content_index=0, item_id='msg_0df7c016
ResponseContentPartDoneEvent(content_index=0, item_id='msg_0
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='m
ResponseCompletedEvent(response=Response(id='resp_0df7c016da
r = await cli.responses.create(model=mods.oai, input=mk_msgs("openai async streaming"), stream=True)
async for ch in r: print(str(ch)[:60])
🟢 HIT b3ff4fe2
b'{"input":[{"role":"user","content":"write 1 word about openai async streaming"}],"model":"gpt-4o","stream":true}'
ResponseCreatedEvent(response=Response(id='resp_0df7c016da93
ResponseInProgressEvent(response=Response(id='resp_0df7c016d
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='
ResponseContentPartAddedEvent(content_index=0, item_id='msg_
ResponseTextDeltaEvent(content_index=0, delta='Innov', item_
ResponseTextDeltaEvent(content_index=0, delta='ative', item_
ResponseTextDoneEvent(content_index=0, item_id='msg_0df7c016
ResponseContentPartDoneEvent(content_index=0, item_id='msg_0
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='m
ResponseCompletedEvent(response=Response(id='resp_0df7c016da

Anthropic

from anthropic import Anthropic
cli = Anthropic()
r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync"))
r
🟢 HIT c526de0d
b'{"max_tokens":1024,"messages":[{"role":"user","content":"write 1 word about ant sync"}],"model":"claude-sonnet-4-20250514"}'
/var/folders/51/b2_szf2945n072c0vj2cyty40000gn/T/ipykernel_13011/1249625954.py:1: DeprecationWarning: The model 'claude-sonnet-4-20250514' is deprecated and will reach end-of-life on June 15th, 2026.
Please migrate to a newer model. Visit https://docs.anthropic.com/en/docs/resources/model-deprecations for more information.
  r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync"))
Message(id='msg_01K4YY9nbpommYrEp9jCBAJU', container=None, content=[TextBlock(citations=None, text='Coordination', type='text')], model='claude-sonnet-4-20250514', role='assistant', stop_details=None, stop_reason='end_turn', stop_sequence=None, type='message', usage=Usage(cache_creation=CacheCreation(ephemeral_1h_input_tokens=0, ephemeral_5m_input_tokens=0), cache_creation_input_tokens=0, cache_read_input_tokens=0, inference_geo='not_available', input_tokens=15, output_tokens=5, server_tool_use=None, service_tier='standard'))
r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync"))
r
🟢 HIT c526de0d
b'{"max_tokens":1024,"messages":[{"role":"user","content":"write 1 word about ant sync"}],"model":"claude-sonnet-4-20250514"}'
/var/folders/51/b2_szf2945n072c0vj2cyty40000gn/T/ipykernel_13011/1249625954.py:1: DeprecationWarning: The model 'claude-sonnet-4-20250514' is deprecated and will reach end-of-life on June 15th, 2026.
Please migrate to a newer model. Visit https://docs.anthropic.com/en/docs/resources/model-deprecations for more information.
  r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync"))
Message(id='msg_01K4YY9nbpommYrEp9jCBAJU', container=None, content=[TextBlock(citations=None, text='Coordination', type='text')], model='claude-sonnet-4-20250514', role='assistant', stop_details=None, stop_reason='end_turn', stop_sequence=None, type='message', usage=Usage(cache_creation=CacheCreation(ephemeral_1h_input_tokens=0, ephemeral_5m_input_tokens=0), cache_creation_input_tokens=0, cache_read_input_tokens=0, inference_geo='not_available', input_tokens=15, output_tokens=5, server_tool_use=None, service_tier='standard'))

Let’s test streaming.

r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync streaming"), stream=True)
for ch in r: print(ch)
🟢 HIT 922be3b8
b'{"max_tokens":1024,"messages":[{"role":"user","content":"write 1 word about ant sync streaming"}],"model":"claude-sonnet-4-20250514","stream":true}'
RawMessageStartEvent(message=Message(id='msg_0151oHQ1ysV6fdckd4MHiu8P', container=None, content=[], model='claude-sonnet-4-20250514', role='assistant', stop_details=None, stop_reason=None, stop_sequence=None, type='message', usage=Usage(cache_creation=CacheCreation(ephemeral_1h_input_tokens=0, ephemeral_5m_input_tokens=0), cache_creation_input_tokens=0, cache_read_input_tokens=0, inference_geo='not_available', input_tokens=16, output_tokens=2, server_tool_use=None, service_tier='standard')), type='message_start')
RawContentBlockStartEvent(content_block=TextBlock(citations=None, text='', type='text'), index=0, type='content_block_start')
RawContentBlockDeltaEvent(delta=TextDelta(text='Buff', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text='ering', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockStopEvent(index=0, type='content_block_stop')
RawMessageDeltaEvent(delta=Delta(container=None, stop_details=None, stop_reason='end_turn', stop_sequence=None), type='message_delta', usage=MessageDeltaUsage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=16, output_tokens=6, server_tool_use=None))
RawMessageStopEvent(type='message_stop')
/var/folders/51/b2_szf2945n072c0vj2cyty40000gn/T/ipykernel_13011/1920032579.py:1: DeprecationWarning: The model 'claude-sonnet-4-20250514' is deprecated and will reach end-of-life on June 15th, 2026.
Please migrate to a newer model. Visit https://docs.anthropic.com/en/docs/resources/model-deprecations for more information.
  r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync streaming"), stream=True)
r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync streaming"), stream=True)
for ch in r: print(ch)
🟢 HIT 922be3b8
b'{"max_tokens":1024,"messages":[{"role":"user","content":"write 1 word about ant sync streaming"}],"model":"claude-sonnet-4-20250514","stream":true}'
RawMessageStartEvent(message=Message(id='msg_0151oHQ1ysV6fdckd4MHiu8P', container=None, content=[], model='claude-sonnet-4-20250514', role='assistant', stop_details=None, stop_reason=None, stop_sequence=None, type='message', usage=Usage(cache_creation=CacheCreation(ephemeral_1h_input_tokens=0, ephemeral_5m_input_tokens=0), cache_creation_input_tokens=0, cache_read_input_tokens=0, inference_geo='not_available', input_tokens=16, output_tokens=2, server_tool_use=None, service_tier='standard')), type='message_start')
RawContentBlockStartEvent(content_block=TextBlock(citations=None, text='', type='text'), index=0, type='content_block_start')
RawContentBlockDeltaEvent(delta=TextDelta(text='Buff', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text='ering', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockStopEvent(index=0, type='content_block_stop')
RawMessageDeltaEvent(delta=Delta(container=None, stop_details=None, stop_reason='end_turn', stop_sequence=None), type='message_delta', usage=MessageDeltaUsage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=16, output_tokens=6, server_tool_use=None))
RawMessageStopEvent(type='message_stop')
/var/folders/51/b2_szf2945n072c0vj2cyty40000gn/T/ipykernel_13011/1920032579.py:1: DeprecationWarning: The model 'claude-sonnet-4-20250514' is deprecated and will reach end-of-life on June 15th, 2026.
Please migrate to a newer model. Visit https://docs.anthropic.com/en/docs/resources/model-deprecations for more information.
  r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync streaming"), stream=True)

Let’s test async.

from anthropic import AsyncAnthropic
cli = AsyncAnthropic()
r = await cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant async"))
r
🟢 HIT 630ebedc
b'{"max_tokens":1024,"messages":[{"role":"user","content":"write 1 word about ant async"}],"model":"claude-sonnet-4-20250514"}'
Message(id='msg_01PVs5uNGeGYV2iBZ7rk2YU2', container=None, content=[TextBlock(citations=None, text='**Concurrency**', type='text')], model='claude-sonnet-4-20250514', role='assistant', stop_details=None, stop_reason='end_turn', stop_sequence=None, type='message', usage=Usage(cache_creation=CacheCreation(ephemeral_1h_input_tokens=0, ephemeral_5m_input_tokens=0), cache_creation_input_tokens=0, cache_read_input_tokens=0, inference_geo='not_available', input_tokens=15, output_tokens=8, server_tool_use=None, service_tier='standard'))
r = await cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant async"))
r
🟢 HIT 630ebedc
b'{"max_tokens":1024,"messages":[{"role":"user","content":"write 1 word about ant async"}],"model":"claude-sonnet-4-20250514"}'
Message(id='msg_01PVs5uNGeGYV2iBZ7rk2YU2', container=None, content=[TextBlock(citations=None, text='**Concurrency**', type='text')], model='claude-sonnet-4-20250514', role='assistant', stop_details=None, stop_reason='end_turn', stop_sequence=None, type='message', usage=Usage(cache_creation=CacheCreation(ephemeral_1h_input_tokens=0, ephemeral_5m_input_tokens=0), cache_creation_input_tokens=0, cache_read_input_tokens=0, inference_geo='not_available', input_tokens=15, output_tokens=8, server_tool_use=None, service_tier='standard'))

Let’s test async streaming.

r = await cli.messages.create(model=mods.ant,max_tokens=1024,messages=mk_msgs("ant async streaming"), stream=True)
async for ch in r.response.aiter_bytes(): print(ch.decode())
🟢 HIT 86eb5c07
b'{"max_tokens":1024,"messages":[{"role":"user","content":"write 1 word about ant async streaming"}],"model":"claude-sonnet-4-20250514","stream":true}'
event: message_start
data: {"type":"message_start","message":{"model":"claude-sonnet-4-20250514","id":"msg_016xUKQiAYoDT5f18wMXc19W","type":"message","role":"assistant","content":[],"stop_reason":null,"stop_sequence":null,"stop_details":null,"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"cache_creation":{"ephemeral_5m_input_tokens":0,"ephemeral_1h_input_tokens":0},"output_tokens":2,"service_tier":"standard","inference_geo":"not_available"}}      }

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""} }

event: ping
data: {"type": "ping"}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Conc"}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"urrency"}               }

event: content_block_stop
data: {"type":"content_block_stop","index":0        }

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null,"stop_details":null},"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"output_tokens":6}          }

event: message_stop
data: {"type":"message_stop"             }

r = await cli.messages.create(model=mods.ant,max_tokens=1024,messages=mk_msgs("ant async streaming"), stream=True)
async for ch in r.response.aiter_bytes(): print(ch.decode())
🟢 HIT 86eb5c07
b'{"max_tokens":1024,"messages":[{"role":"user","content":"write 1 word about ant async streaming"}],"model":"claude-sonnet-4-20250514","stream":true}'
event: message_start
data: {"type":"message_start","message":{"model":"claude-sonnet-4-20250514","id":"msg_016xUKQiAYoDT5f18wMXc19W","type":"message","role":"assistant","content":[],"stop_reason":null,"stop_sequence":null,"stop_details":null,"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"cache_creation":{"ephemeral_5m_input_tokens":0,"ephemeral_1h_input_tokens":0},"output_tokens":2,"service_tier":"standard","inference_geo":"not_available"}}      }

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""} }

event: ping
data: {"type": "ping"}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Conc"}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"urrency"}               }

event: content_block_stop
data: {"type":"content_block_stop","index":0        }

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null,"stop_details":null},"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"output_tokens":6}          }

event: message_stop
data: {"type":"message_stop"             }

LiteLLM

Let’s test the LiteLLM SDK by running sync/async calls with(out) streaming for OpenAI, Anthropic, & Gemini.

We’ll also double check tool calls and citations.

Sync Tests

from litellm import completion

Let’s define a helper method to display a streamed response.

def _stream(r): 
    for ch in r: print(ch.choices[0].delta.content or "")
Anthropic

Let’s test claude-sonnet-x.

r = completion(model=mods.ant, messages=mk_msgs("lite: ant sync..."))
r
🟢 HIT eb6c6de3
b'{"model": "claude-sonnet-4-20250514", "messages": [{"role": "user", "content": [{"type": "text", "text": "write 1 word about lite: ant sync..."}]}], "max_tokens": 64000}'

Background

(Lite refers to a lightweight/simplified version, and ant sync suggests asynchronous processing or synchronization in a minimal, efficient manner)

  • id: chatcmpl-29d3c972-d7a6-4246-ba4b-16b8d84e54f5
  • model: claude-sonnet-4-20250514
  • finish_reason: stop
  • usage: Usage(completion_tokens=36, prompt_tokens=18, total_tokens=54, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=36, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=18, image_tokens=None, video_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0, inference_geo='not_available', speed=None)
r = completion(model=mods.ant, messages=mk_msgs("lite: ant sync..."))
r
🟢 HIT eb6c6de3
b'{"model": "claude-sonnet-4-20250514", "messages": [{"role": "user", "content": [{"type": "text", "text": "write 1 word about lite: ant sync..."}]}], "max_tokens": 64000}'

Background

(Lite refers to a lightweight/simplified version, and ant sync suggests asynchronous processing or synchronization in a minimal, efficient manner)

  • id: chatcmpl-efb78b97-7c8c-46dc-a019-ce21193137aa
  • model: claude-sonnet-4-20250514
  • finish_reason: stop
  • usage: Usage(completion_tokens=36, prompt_tokens=18, total_tokens=54, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=36, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=18, image_tokens=None, video_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0, inference_geo='not_available', speed=None)

Now, with streaming enabled.

r = completion(model=mods.ant, messages=mk_msgs("lite: ant sync stream..."), stream=True)
_stream(r)
🟢 HIT eb277a6d
b'{"model": "claude-sonnet-4-20250514", "messages": [{"role": "user", "content": [{"type": "text", "text": "write 1 word about lite: ant sync stream..."}]}], "max_tokens": 64000, "stream": true}'
**
lightweight**
r = completion(model=mods.ant, messages=mk_msgs("lite: ant sync stream..."), stream=True)
_stream(r)
🟢 HIT eb277a6d
b'{"model": "claude-sonnet-4-20250514", "messages": [{"role": "user", "content": [{"type": "text", "text": "write 1 word about lite: ant sync stream..."}]}], "max_tokens": 64000, "stream": true}'
**
lightweight**
OpenAI

Let’s test gpt-4o.

r = completion(model=mods.oai, messages=mk_msgs("lite: oai sync..."))
r
🟢 HIT e2bbf9a1
b'{"messages":[{"role":"user","content":"write 1 word about lite: oai sync..."}],"model":"gpt-4o"}'

Efficiency.

  • id: chatcmpl-DZNIjDIRHneic7UNq7E5ys72ciVzu
  • model: gpt-4o-2024-08-06
  • finish_reason: stop
  • usage: Usage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None, video_tokens=None))
r = completion(model=mods.oai, messages=mk_msgs("lite: oai sync..."))
r
🟢 HIT e2bbf9a1
b'{"messages":[{"role":"user","content":"write 1 word about lite: oai sync..."}],"model":"gpt-4o"}'

Efficiency.

  • id: chatcmpl-DZNIjDIRHneic7UNq7E5ys72ciVzu
  • model: gpt-4o-2024-08-06
  • finish_reason: stop
  • usage: Usage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None, video_tokens=None))

Now, with streaming enabled.

r = completion(model=mods.oai, messages=mk_msgs("lite: oai sync stream..."), stream=True)
_stream(r)
🟢 HIT ab1293ed
b'{"messages":[{"role":"user","content":"write 1 word about lite: oai sync stream..."}],"model":"gpt-4o","stream":true,"stream_options":{"include_usage":true}}'
Synchronization

r = completion(model=mods.oai, messages=mk_msgs("lite: oai sync stream..."), stream=True)
_stream(r)
🟢 HIT ab1293ed
b'{"messages":[{"role":"user","content":"write 1 word about lite: oai sync stream..."}],"model":"gpt-4o","stream":true,"stream_options":{"include_usage":true}}'
Synchronization

Gemini

Let’s test 2.5-flash.

import os
r = completion(model=mods.gem, messages=mk_msgs("lite: gem sync..."))
r
🟢 HIT 4abf8d07
b'{"contents":[{"role":"user","parts":[{"text":"write 1 word about lite: gem sync..."}]}]}'

Sync

  • id: pM3vaaW2FpyVg8UP_9mvyAQ
  • model: gemini-2.5-flash
  • finish_reason: stop
  • usage: Usage(completion_tokens=810, prompt_tokens=11, total_tokens=821, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=809, rejected_prediction_tokens=None, text_tokens=1, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=11, image_tokens=None, video_tokens=None), cache_read_input_tokens=None)
r = completion(model=mods.gem, messages=mk_msgs("lite: gem sync..."))
r
🟢 HIT 4abf8d07
b'{"contents":[{"role":"user","parts":[{"text":"write 1 word about lite: gem sync..."}]}]}'

Sync

  • id: pM3vaaW2FpyVg8UP_9mvyAQ
  • model: gemini-2.5-flash
  • finish_reason: stop
  • usage: Usage(completion_tokens=810, prompt_tokens=11, total_tokens=821, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=809, rejected_prediction_tokens=None, text_tokens=1, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=11, image_tokens=None, video_tokens=None), cache_read_input_tokens=None)

Now, with streaming enabled.

r = completion(model=mods.gem, messages=mk_msgs("lite: gem sync stream..."), stream=True)
_stream(r)
🟢 HIT 41b18f32
b'{"contents": [{"role": "user", "parts": [{"text": "write 1 word about lite: gem sync stream..."}]}]}'
Lite
r = completion(model=mods.gem, messages=mk_msgs("lite: gem sync stream..."), stream=True)
_stream(r)
🟢 HIT 41b18f32
b'{"contents": [{"role": "user", "parts": [{"text": "write 1 word about lite: gem sync stream..."}]}]}'
Lite

Async Tests

from litellm import acompletion
async def _astream(r):
    async for chunk in r: print(chunk.choices[0].delta.content or "")
Anthropic

Let’s test claude-sonnet-x.

r = await acompletion(model=mods.ant, messages=mk_msgs("lite: ant async..."))
r
🟢 HIT 056d7fa1
b'{"model":"claude-sonnet-4-20250514","messages":[{"role":"user","content":[{"type":"text","text":"write 1 word about lite: ant async..."}]}],"max_tokens":64000}'

coroutine

  • id: chatcmpl-b35c469a-c4a5-4ca5-8da5-83acdadc60f3
  • model: claude-sonnet-4-20250514
  • finish_reason: stop
  • usage: Usage(completion_tokens=8, prompt_tokens=18, total_tokens=26, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=8, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=18, image_tokens=None, video_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0, inference_geo='not_available', speed=None)
r = await acompletion(model=mods.ant, messages=mk_msgs("lite: ant async..."))
r
🟢 HIT 056d7fa1
b'{"model":"claude-sonnet-4-20250514","messages":[{"role":"user","content":[{"type":"text","text":"write 1 word about lite: ant async..."}]}],"max_tokens":64000}'

coroutine

  • id: chatcmpl-4ad6da97-5f81-45d1-be4b-bea9ac160c7d
  • model: claude-sonnet-4-20250514
  • finish_reason: stop
  • usage: Usage(completion_tokens=8, prompt_tokens=18, total_tokens=26, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=8, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=18, image_tokens=None, video_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0, inference_geo='not_available', speed=None)

Now, with streaming enabled.

r = await acompletion(model=mods.ant, messages=mk_msgs("lite: ant async stream..."), stream=True)
await(_astream(r))
🟢 HIT 9b778238
b'{"model": "claude-sonnet-4-20250514", "messages": [{"role": "user", "content": [{"type": "text", "text": "write 1 word about lite: ant async stream..."}]}], "max_tokens": 64000, "stream": true}'
**reactive
**
r = await acompletion(model=mods.ant, messages=mk_msgs("lite: ant async stream..."), stream=True)
await(_astream(r))
🟢 HIT 9b778238
b'{"model": "claude-sonnet-4-20250514", "messages": [{"role": "user", "content": [{"type": "text", "text": "write 1 word about lite: ant async stream..."}]}], "max_tokens": 64000, "stream": true}'
**reactive
**
OpenAI

Let’s test gpt-4o.

r = await acompletion(model=mods.oai, messages=mk_msgs("lite: oai async..."))
r
🟢 HIT 5066785c
b'{"messages":[{"role":"user","content":"write 1 word about lite: oai async..."}],"model":"gpt-4o"}'

Efficient

  • id: chatcmpl-DZNJ8def3HPOivOvvnzGPpVmcZVvo
  • model: gpt-4o-2024-08-06
  • finish_reason: stop
  • usage: Usage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None, video_tokens=None))
r = await acompletion(model=mods.oai, messages=mk_msgs("lite: oai async..."))
r
🟢 HIT 5066785c
b'{"messages":[{"role":"user","content":"write 1 word about lite: oai async..."}],"model":"gpt-4o"}'

Efficient

  • id: chatcmpl-DZNJ8def3HPOivOvvnzGPpVmcZVvo
  • model: gpt-4o-2024-08-06
  • finish_reason: stop
  • usage: Usage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None, video_tokens=None))

Now, with streaming enabled.

r = await acompletion(model=mods.oai, messages=mk_msgs("lite: oai async stream..."), stream=True)
await(_astream(r))
🟢 HIT 25dc75ed
b'{"messages":[{"role":"user","content":"write 1 word about lite: oai async stream..."}],"model":"gpt-4o","stream":true,"stream_options":{"include_usage":true}}'
Eff
icient

r = await acompletion(model=mods.oai, messages=mk_msgs("lite: oai async stream..."), stream=True)
await(_astream(r))
🟢 HIT 25dc75ed
b'{"messages":[{"role":"user","content":"write 1 word about lite: oai async stream..."}],"model":"gpt-4o","stream":true,"stream_options":{"include_usage":true}}'
Eff
icient

Gemini

Let’s test 2.5-flash.

r = await acompletion(model=mods.gem, messages=mk_msgs("lite: gem async..."))
r
🟢 HIT 903f999d
b'{"contents":[{"role":"user","parts":[{"text":"write 1 word about lite: gem async..."}]}]}'

Nimble

  • id: t83vae-wNqf84-EP07uUgAQ
  • model: gemini-2.5-flash
  • finish_reason: stop
  • usage: Usage(completion_tokens=882, prompt_tokens=11, total_tokens=893, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=880, rejected_prediction_tokens=None, text_tokens=2, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=11, image_tokens=None, video_tokens=None), cache_read_input_tokens=None)
r = await acompletion(model=mods.gem, messages=mk_msgs("lite: gem async..."))
r
🟢 HIT 903f999d
b'{"contents":[{"role":"user","parts":[{"text":"write 1 word about lite: gem async..."}]}]}'

Nimble

  • id: t83vae-wNqf84-EP07uUgAQ
  • model: gemini-2.5-flash
  • finish_reason: stop
  • usage: Usage(completion_tokens=882, prompt_tokens=11, total_tokens=893, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=880, rejected_prediction_tokens=None, text_tokens=2, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=11, image_tokens=None, video_tokens=None), cache_read_input_tokens=None)

Now, with streaming enabled.

r = await acompletion(model=mods.gem, messages=mk_msgs("lite: gem async stream..."), stream=True)
await(_astream(r))
🟢 HIT b998fffc
b'{"contents": [{"role": "user", "parts": [{"text": "write 1 word about lite: gem async stream..."}]}]}'
**Swift**
r = await acompletion(model=mods.gem, messages=mk_msgs("lite: gem async stream..."), stream=True)
await(_astream(r))
🟢 HIT b998fffc
b'{"contents": [{"role": "user", "parts": [{"text": "write 1 word about lite: gem async stream..."}]}]}'
**Swift**

Tool Calls

As a sanity check let’s confirm that tool calls work.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type":"string", "description":"The city e.g. Reims"},
                    "unit": {"type":"string", "enum":["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            }
        }
    }
]
r = completion(model=mods.ant, messages=mk_msgs("Is it raining in Reims?"), tools=tools)
r
🟢 HIT 0dba8ec6
b'{"model": "claude-sonnet-4-20250514", "messages": [{"role": "user", "content": [{"type": "text", "text": "write 1 word about Is it raining in Reims?"}]}], "tools": [{"name": "get_current_weather", "input_schema": {"type": "object", "properties": {"location": {"type": "string", "description": "The city e.g. Reims"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}}, "required": ["location"]}, "type": "custom", "description": "Get the current weather in a given location"}], "max_tokens": 64000}'

🔧 get_current_weather({“location”: “Reims”})

  • id: chatcmpl-72144365-d95a-4017-b170-8879a79757ff
  • model: claude-sonnet-4-20250514
  • finish_reason: tool_calls
  • usage: Usage(completion_tokens=57, prompt_tokens=427, total_tokens=484, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=57, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=427, image_tokens=None, video_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0, inference_geo='not_available', speed=None)
r = completion(model=mods.ant, messages=mk_msgs("Is it raining in Reims?"), tools=tools)
r
🟢 HIT 0dba8ec6
b'{"model": "claude-sonnet-4-20250514", "messages": [{"role": "user", "content": [{"type": "text", "text": "write 1 word about Is it raining in Reims?"}]}], "tools": [{"name": "get_current_weather", "input_schema": {"type": "object", "properties": {"location": {"type": "string", "description": "The city e.g. Reims"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}}, "required": ["location"]}, "type": "custom", "description": "Get the current weather in a given location"}], "max_tokens": 64000}'

🔧 get_current_weather({“location”: “Reims”})

  • id: chatcmpl-cd269bef-b0ac-42de-8102-79f9a81ca1c9
  • model: claude-sonnet-4-20250514
  • finish_reason: tool_calls
  • usage: Usage(completion_tokens=57, prompt_tokens=427, total_tokens=484, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=57, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=427, image_tokens=None, video_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0, inference_geo='not_available', speed=None)

Multipart Request

cli = Anthropic()
r = cli.beta.files.upload(file=("ex.txt", b"hello world", "text/plain"))
r
🟢 HIT 461a2b1c
b'--1d26302376ac7810c7a9c5a37c767380\r\nContent-Disposition: form-data; name="file"; filename="ex.txt"\r\nContent-Type: text/plain\r\n\r\nhello world\r\n--1d26302376ac7810c7a9c5a37c767380--\r\n'
FileMetadata(id='file_011CaUwnzxDSYjNoiLgZvEbc', created_at=datetime.datetime(2026, 4, 27, 20, 57, 42, 96000, tzinfo=datetime.timezone.utc), filename='ex.txt', mime_type='text/plain', size_bytes=11, type='file', downloadable=False, scope=None)
cli = Anthropic()
r = cli.beta.files.upload(file=("ex.txt", b"hello world", "text/plain"))
r
🟢 HIT 461a2b1c
b'--da2e1345dfd16a7282f8f68273a48a25\r\nContent-Disposition: form-data; name="file"; filename="ex.txt"\r\nContent-Type: text/plain\r\n\r\nhello world\r\n--da2e1345dfd16a7282f8f68273a48a25--\r\n'
FileMetadata(id='file_011CaUwnzxDSYjNoiLgZvEbc', created_at=datetime.datetime(2026, 4, 27, 20, 57, 42, 96000, tzinfo=datetime.timezone.utc), filename='ex.txt', mime_type='text/plain', size_bytes=11, type='file', downloadable=False, scope=None)

Gemini Model Comparison

When LiteLLM calls Gemini it includes the model name in the url. Let’s test that we can run the same prompt with two different Gemini models.

mods.gem
'gemini/gemini-2.5-flash'
r = completion(model=mods.gem, messages=mk_msgs("lite: gem different models..."))
r
🟢 HIT 439cd3d9
b'{"contents":[{"role":"user","parts":[{"text":"write 1 word about lite: gem different models..."}]}]}'

Refined

  • id: xs3vabmHOf-BqfkPwcbMwQU
  • model: gemini-2.5-flash
  • finish_reason: stop
  • usage: Usage(completion_tokens=531, prompt_tokens=12, total_tokens=543, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=529, rejected_prediction_tokens=None, text_tokens=2, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=12, image_tokens=None, video_tokens=None), cache_read_input_tokens=None)
r = completion(model="gemini/gemini-2.5-flash", messages=mk_msgs("lite: gem different models..."))
r
🟢 HIT 439cd3d9
b'{"contents":[{"role":"user","parts":[{"text":"write 1 word about lite: gem different models..."}]}]}'

Refined

  • id: xs3vabmHOf-BqfkPwcbMwQU
  • model: gemini-2.5-flash
  • finish_reason: stop
  • usage: Usage(completion_tokens=531, prompt_tokens=12, total_tokens=543, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=529, rejected_prediction_tokens=None, text_tokens=2, image_tokens=None, video_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=12, image_tokens=None, video_tokens=None), cache_read_input_tokens=None)

Gemini File Upload

The google-genai SDK’s files.upload() relies on the x-goog-upload-url and x-goog-upload-status response headers, so cachy needs to cache those headers too (that’s what the hdrs param is for).

from google import genai
cli = genai.Client()

When hdrs isn’t provided, the request fails:

tfw = tempfile.NamedTemporaryFile(suffix='.txt')
f = tfw.__enter__()
fn = Path(f.name)
fn.write_text("test content");
try: gfile = cli.files.upload(file=fn)
except Exception as e: print(e)
🟢 HIT 74953d92
b'{"file": {"mime_type": "text/plain", "size_bytes": 12}}'
🔴 MISS 9441134d
b'test content'
🔴 MISS 9441134d
b'test content'
🔴 MISS 9441134d
b'test content'
400 Bad Request. {'message': 'Upload has already been terminated.', 'status': 'Bad Request'}

When caching Gemini file uploads, by default request content only includes mime_type and size_bytes. This means different files with the same mime type and size produce identical cache keys, causing incorrect cache hits. The fix is to pass a file content fingerprint (a hash of the file bytes) as the display_name in the upload config: cli.files.upload(file=fn, config={"display_name": _fingerprint(fn)}). This ensures the request body is unique per file content, generating distinct cache keys.

def _fingerprint(path): return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:16]
enable_cachy(hdrs=['x-goog-upload-url', 'x-goog-upload-status'])
gfile = cli.files.upload(file=fn, config={"display_name": _fingerprint(fn)})
gfile
File(
  create_time=datetime.datetime(2026, 4, 27, 20, 57, 56, 10134, tzinfo=TzInfo(0)),
  display_name='6ae8a75555209fd6',
  expiration_time=datetime.datetime(2026, 4, 29, 20, 57, 54, 164349, tzinfo=TzInfo(0)),
  mime_type='text/plain',
  name='files/xbshh618qjkw',
  sha256_hash='NmFlOGE3NTU1NTIwOWZkNmM0NDE1N2MwYWVkODAxNmU3NjNmZjQzNWExOWNmMTg2Zjc2ODYzMTQwMTQzZmY3Mg==',
  size_bytes=12,
  source=<FileSource.UPLOADED: 'UPLOADED'>,
  state=<FileState.ACTIVE: 'ACTIVE'>,
  update_time=datetime.datetime(2026, 4, 27, 20, 57, 56, 10134, tzinfo=TzInfo(0)),
  uri='https://generativelanguage.googleapis.com/v1beta/files/xbshh618qjkw'
)
tfw.__exit__(None, None, None)

httpx.stream

headers = {"x-api-key": os.environ["ANTHROPIC_API_KEY"], "anthropic-version": "2023-06-01", "content-type": "application/json"}
url = "https://api.anthropic.com/v1/messages"
payload = json.dumps({"model": mods.ant, "max_tokens": 16, "messages": mk_msgs("ant sync")}).encode()
with httpx.stream("POST", url, headers=headers, content=payload) as r1: c1 = json.loads(b''.join(r1.iter_bytes()).decode())
with httpx.stream("POST", url, headers=headers, content=payload) as r2: c2 = json.loads(b''.join(r2.iter_bytes()).decode())
c1['content'][0]['text']
'Coordination'
test_eq(c1,c2)

Binary Content

headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}", "content-type": "application/json"}
url = "https://api.openai.com/v1/audio/speech"
payload = json.dumps({"model": "tts-1", "input": "cachy binary test", "voice": "alloy"}).encode()
r1 = httpx.post(url, headers=headers, content=payload)
r2 = httpx.post(url, headers=headers, content=payload)
test_eq(r1.content, r2.content)
test_eq(isinstance(r1.content, bytes), True)
os.chdir(origdir)