Exported source
doms = ("api.openai.com", "api.anthropic.com", "generativelanguage.googleapis.com", "api.deepseek.com")httpxenable_cachyWe often call APIs while prototyping and testing our code. A single API call (e.g. an Anthropic chat completion) can take 100’s of ms to run. This can really slow down development especially if our notebook contains many API calls 😞.
cachy caches API requests. It does this by saving the result of each API call to a local cachy.jsonl file. Before calling an API (e.g. OpenAI), it checks whether the request already exists in cachy.jsonl. If it does, it returns the cached result.
How does it work?
Under the hood, popular SDKs like OpenAI, Anthropic, and LiteLLM use httpx.Client and httpx.AsyncClient.
cachy patches the send method of both clients and injects a simple caching mechanism:
- before sending a request, check whether it already exists in cachy.jsonl
- if it does, return the cached response
- otherwise, call the API and save the response to cachy.jsonl

cachy.jsonl

cachy.jsonl contains one API response per line.
Each line has the following format {"key": key, "response": response}
- key: hash of the API request
- response: the API response.

{
"key": "afc2be0c",
"response": "{\"id\":\"msg_xxx\",\"type\":\"message\",\"role\":\"assistant\",\"model\":\"claude-sonnet-4-20250514\",\"content\":[{\"type\":\"text\",\"text\":\"Coordination.\"}],\"stop_reason\":\"end_turn\",\"stop_sequence\":null,\"usage\":{\"input_tokens\":16,\"cache_creation_input_tokens\":0,\"cache_read_input_tokens\":0,\"cache_creation\":{\"ephemeral_5m_input_tokens\":0,\"ephemeral_1h_input_tokens\":0},\"output_tokens\":6,\"service_tier\":\"standard\"}}"
}

httpx

Patching a method is very straightforward.
In our case we want to patch httpx._client.Client.send and httpx._client.AsyncClient.send.
These methods are called when running httpx.get, httpx.post, etc.
In the example below we use @patch from fastcore to print "calling an API" whenever httpx._client.Client.send is run.
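The original example cell isn’t shown in this extract, so here is a hedged reconstruction of what it might look like (stashing the original send so we can still call it):

import httpx
from fastcore.basics import patch

_orig_send = httpx._client.Client.send   # keep a handle on the original method

@patch
def send(self:httpx._client.Client, r, **kwargs):
    print("calling an API")
    return _orig_send(self, r, **kwargs)

httpx.get("https://example.com")   # prints "calling an API"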
Now, let’s build up our caching logic piece-by-piece.
The first thing we need to do is ensure that our caching logic only runs on specific URLs.
For now, let’s only cache API calls made to popular LLM providers like OpenAI, Anthropic, Google and DeepSeek. We can make this fully customizable later.
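A minimal sketch of what _should_cache might look like, assuming it simply checks the request host against the doms tuple defined above:

def _should_cache(url, doms):
    "Only cache requests whose host is one of our provider domains."
    return url.host in doms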
We could then use _should_cache like this.
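For example (an illustrative check, not output from the original notebook):

r = httpx.Request("POST", "https://api.openai.com/v1/chat/completions")
_should_cache(r.url, doms)                              # True
_should_cache(httpx.URL("https://example.com"), doms)   # False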
The next thing we need to do is figure out if a response for the request r already exists in our cache.
Recall that each line in cachy.jsonl has the following format {"key": key, "response": response}.
Our key needs to be unique and deterministic. One way to do this is to concatenate the request URL and content, then generate a hash from the result.
We use r.url.host instead of r.url because when LiteLLM calls Gemini it includes the API key in a query param. See #1.
If we used r.url, we wouldn’t be able to use dummy API keys when running our notebooks in a CI pipeline.
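The original definition isn’t shown here; a sketch along these lines would match the behaviour described above (a short hex key derived from the host and body):

import hashlib

def _key(r):
    "Deterministic key from the request host and body (query params excluded)."
    data = f"{r.url.host}{r.content.decode()}"
    return hashlib.sha256(data.encode()).hexdigest()[:8]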
Let’s test this out.
<Request('POST', 'https://api.openai.com/v1/chat/completions')>
If we run it again we should get the same key.
Let’s modify the URL and confirm we get a different key.
'2707fa05'
Great. Let’s update our patch.
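The intermediate patch cell isn’t shown in this extract; presumably it just wires _key in, something like this (assuming the original send is stashed as _orig_send, as in the later patches):

@patch
def send(self:httpx._client.Client, r, **kwargs):
    if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
    print(f"key: {_key(r)}")   # no caching yet, just show the key
    return self._orig_send(r, **kwargs)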
Now let’s add some methods that will read from and write to cachy.jsonl.
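Hedged sketches of those helpers, assuming _cache does a linear scan of the JSONL file and _write_cache appends one line per response:

import json
from pathlib import Path

def _cache(key, fp):
    "Return the cached response for `key`, or None if it isn't in `fp`."
    p = Path(fp)
    if not p.exists(): return None
    for line in p.read_text().splitlines():
        d = json.loads(line)
        if d["key"] == key: return d["response"]
    return None

def _write_cache(key, response, fp):
    "Append {key, response} as a single JSON line to `fp`."
    with open(fp, "a") as f: f.write(json.dumps({"key": key, "response": response}) + "\n")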
Let’s update our patch.
@patch
def send(self:httpx._client.Client, r, **kwargs):
    if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
    key = _key(r)
    if res := _cache(key, "cachy.jsonl"): return httpx.Response(status_code=200, content=res, request=r)
    res = self._orig_send(r, **kwargs)
    content = res.read().decode()
    _write_cache(key, content, "cachy.jsonl")
    return httpx.Response(status_code=res.status_code, content=content, request=r)

Let's add support for streaming.
First, let’s include an is_stream bool in our hash so that a streamed request generates a different key from the same request sent without streaming.
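Building on the earlier sketch, the updated _key might look like this:

def _key(r, is_stream=False):
    "Deterministic key from the request host, body and stream flag."
    data = f"{r.url.host}{r.content.decode()}{is_stream}"
    return hashlib.sha256(data.encode()).hexdigest()[:8]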
In the patch we need to consume the entire stream before writing it to the cache.
@patch
def send(self:httpx._client.Client, r, **kwargs):
    is_stream = kwargs.get("stream")
    if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
    key = _key(r, is_stream)
    if res := _cache(key, "cachy.jsonl"): return httpx.Response(status_code=200, content=res, request=r)
    res = self._orig_send(r, **kwargs)
    content = res.read().decode() if not is_stream else b''.join(list(res.iter_bytes())).decode()
    _write_cache(key, content, "cachy.jsonl")
    return httpx.Response(status_code=res.status_code, content=content, request=r)

enable_cachy

To make cachy as user-friendly as possible, let's make it so that we can apply our patch by running a single method at the top of our notebook.
For this to work we’ll need to wrap our patch.
def _apply_patch():
    @patch
    def send(self:httpx._client.Client, r, **kwargs):
        is_stream = kwargs.get("stream")
        if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
        key = _key(r, is_stream)
        if res := _cache(key, "cachy.jsonl"): return httpx.Response(status_code=200, content=res, request=r)
        res = self._orig_send(r, **kwargs)
        content = res.read().decode() if not is_stream else b''.join(list(res.iter_bytes())).decode()
        _write_cache(key, content, "cachy.jsonl")
        return httpx.Response(status_code=res.status_code, content=content, request=r)

Great. Now, let's make cachy a little more customizable by making it possible to specify:

- where the cache file lives (cache_dir)
- which domains to cache (doms)
Note: If our notebook is running in an nbdev project, Config.find("settings.ini").config_path automatically finds the base dir.
def _apply_patch(cfp, doms):
    @patch
    def send(self:httpx._client.Client, r, **kwargs):
        is_stream = kwargs.get("stream")
        if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
        key = _key(r, is_stream)
        if res := _cache(key, cfp): return httpx.Response(status_code=200, content=res, request=r)
        res = self._orig_send(r, **kwargs)
        content = res.read().decode() if not is_stream else b''.join(list(res.iter_bytes())).decode()
        _write_cache(key, content, cfp)
        return httpx.Response(status_code=res.status_code, content=content, request=r)

Now let's add support for async requests.
def _apply_async_patch(cfp, doms):
    @patch
    async def send(self:httpx._client.AsyncClient, r, **kwargs):
        is_stream = kwargs.get("stream")
        if not _should_cache(r.url, doms): return await self._orig_send(r, **kwargs)
        key = _key(r, is_stream)
        if res := _cache(key, cfp): return httpx.Response(status_code=200, content=res, request=r)
        res = await self._orig_send(r, **kwargs)
        content = res.read().decode() if not is_stream else b''.join([c async for c in res.aiter_bytes()]).decode()
        _write_cache(key, content, cfp)
        return httpx.Response(status_code=res.status_code, content=content, request=r)

Let's rename our original patch.
def _apply_sync_patch(cfp, doms):
    @patch
    def send(self:httpx._client.Client, r, **kwargs):
        is_stream = kwargs.get("stream")
        if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
        key = _key(r, is_stream)
        if res := _cache(key, cfp): return httpx.Response(status_code=200, content=res, request=r)
        res = self._orig_send(r, **kwargs)
        content = res.read().decode() if not is_stream else b''.join(list(res.iter_bytes())).decode()
        _write_cache(key, content, cfp)
        return httpx.Response(status_code=res.status_code, content=content, request=r)

Finally, let's update enable_cachy.
enable_cachy (cache_dir=None, doms=('api.openai.com', 'api.anthropic.com', 'generativelanguage.googleapis.com', 'api.deepseek.com'))
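The implementation isn’t shown in this extract; a rough sketch, assuming enable_cachy stashes the original send methods (so the patches can call self._orig_send), resolves the cache file path, and applies both patches:

from pathlib import Path

def enable_cachy(cache_dir=None, doms=doms):
    # Stash the originals once so the patched send methods can fall back to them.
    if not hasattr(httpx._client.Client, "_orig_send"):
        httpx._client.Client._orig_send = httpx._client.Client.send
        httpx._client.AsyncClient._orig_send = httpx._client.AsyncClient.send
    cfp = Path(cache_dir or ".") / "cachy.jsonl"
    _apply_sync_patch(cfp, doms)
    _apply_async_patch(cfp, doms)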
And a way to turn it off:
disable_cachy ()
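Presumably disable_cachy just restores the stashed originals, along these lines:

def disable_cachy():
    # Put the original send methods back if they were stashed by enable_cachy.
    if hasattr(httpx._client.Client, "_orig_send"):
        httpx._client.Client.send = httpx._client.Client._orig_send
    if hasattr(httpx._client.AsyncClient, "_orig_send"):
        httpx._client.AsyncClient.send = httpx._client.AsyncClient._orig_send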
Let’s test enable_cachy on 3 SDKs (OpenAI, Anthropic, LiteLLM) for the scenarios below:

- sync
- sync + streaming
- async
- async + streaming
Add some helper functions.
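For example, a hypothetical OpenAI helper (the actual helpers and prompts in the notebook aren’t shown here): call it twice and the second, identical request is served from cachy.jsonl.

from openai import OpenAI

oa_client = OpenAI()

def ask_openai(prompt="In one word, what makes notebooks great?", model="gpt-4o"):
    r = oa_client.responses.create(model=model, input=prompt)
    return r.output_text

print(ask_openai())   # hits the API and writes the response to cachy.jsonl
print(ask_openai())   # identical request, replayed from the cache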
Collaboration
Collaboration
Let’s test streaming.
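A hypothetical streamed call (again, the exact prompt differs from the notebook): running it twice shows the second pass replaying the cached SSE stream.

for _ in range(2):
    stream = oa_client.responses.create(model="gpt-4o",
                                        input="In one word, what makes notebooks great?",
                                        stream=True)
    for event in stream: print(event)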
ResponseCreatedEvent(response=Response(id='resp_0bedad8c74657bc30068c80bb26cf4819ca6920df9fbd23916', created_at=1757940658.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_0bedad8c74657bc30068c80bb26cf4819ca6920df9fbd23916', created_at=1757940658.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', content=[], role='assistant', status='in_progress', type='message'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=0, delta='Innov', item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', logprobs=[], output_index=0, sequence_number=4, type='response.output_text.delta', obfuscation='VCpZLMzWQSq')
ResponseTextDeltaEvent(content_index=0, delta='ative', item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', logprobs=[], output_index=0, sequence_number=5, type='response.output_text.delta', obfuscation='xFQTnzLeSC8')
ResponseTextDoneEvent(content_index=0, item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', logprobs=[], output_index=0, sequence_number=6, text='Innovative', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=0, item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', output_index=0, part=ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[]), sequence_number=7, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', content=[ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[])], role='assistant', status='completed', type='message'), output_index=0, sequence_number=8, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_0bedad8c74657bc30068c80bb26cf4819ca6920df9fbd23916', created_at=1757940658.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', content=[ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=In: 16; Out: 3; Total: 19, user=None, store=True), sequence_number=9, type='response.completed')
ResponseCreatedEvent(response=Response(id='resp_0bedad8c74657bc30068c80bb26cf4819ca6920df9fbd23916', created_at=1757940658.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_0bedad8c74657bc30068c80bb26cf4819ca6920df9fbd23916', created_at=1757940658.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', content=[], role='assistant', status='in_progress', type='message'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=0, delta='Innov', item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', logprobs=[], output_index=0, sequence_number=4, type='response.output_text.delta', obfuscation='VCpZLMzWQSq')
ResponseTextDeltaEvent(content_index=0, delta='ative', item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', logprobs=[], output_index=0, sequence_number=5, type='response.output_text.delta', obfuscation='xFQTnzLeSC8')
ResponseTextDoneEvent(content_index=0, item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', logprobs=[], output_index=0, sequence_number=6, text='Innovative', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=0, item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', output_index=0, part=ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[]), sequence_number=7, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', content=[ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[])], role='assistant', status='completed', type='message'), output_index=0, sequence_number=8, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_0bedad8c74657bc30068c80bb26cf4819ca6920df9fbd23916', created_at=1757940658.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', content=[ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=In: 16; Out: 3; Total: 19, user=None, store=True), sequence_number=9, type='response.completed')
Let’s test async.
Innovative
Innovative
Let’s test async streaming.
ResponseCreatedEvent(response=Response(id='resp_09667f14142ea18b0068c80bc16c8481a1988be17cf34e1823', created_at=1757940673.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_09667f14142ea18b0068c80bc16c8481a1988be17cf34e1823', created_at=1757940673.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', content=[], role='assistant', status='in_progress', type='message'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=0, delta='Eff', item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', logprobs=[], output_index=0, sequence_number=4, type='response.output_text.delta', obfuscation='bCNpIYrqdSk3w')
ResponseTextDeltaEvent(content_index=0, delta='icient', item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', logprobs=[], output_index=0, sequence_number=5, type='response.output_text.delta', obfuscation='wHhRAwEPuj')
ResponseTextDoneEvent(content_index=0, item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', logprobs=[], output_index=0, sequence_number=6, text='Efficient', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=0, item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', output_index=0, part=ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[]), sequence_number=7, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', content=[ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[])], role='assistant', status='completed', type='message'), output_index=0, sequence_number=8, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_09667f14142ea18b0068c80bc16c8481a1988be17cf34e1823', created_at=1757940673.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', content=[ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=In: 16; Out: 3; Total: 19, user=None, store=True), sequence_number=9, type='response.completed')
ResponseCreatedEvent(response=Response(id='resp_09667f14142ea18b0068c80bc16c8481a1988be17cf34e1823', created_at=1757940673.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_09667f14142ea18b0068c80bc16c8481a1988be17cf34e1823', created_at=1757940673.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', content=[], role='assistant', status='in_progress', type='message'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=0, delta='Eff', item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', logprobs=[], output_index=0, sequence_number=4, type='response.output_text.delta', obfuscation='bCNpIYrqdSk3w')
ResponseTextDeltaEvent(content_index=0, delta='icient', item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', logprobs=[], output_index=0, sequence_number=5, type='response.output_text.delta', obfuscation='wHhRAwEPuj')
ResponseTextDoneEvent(content_index=0, item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', logprobs=[], output_index=0, sequence_number=6, text='Efficient', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=0, item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', output_index=0, part=ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[]), sequence_number=7, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', content=[ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[])], role='assistant', status='completed', type='message'), output_index=0, sequence_number=8, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_09667f14142ea18b0068c80bc16c8481a1988be17cf34e1823', created_at=1757940673.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', content=[ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=In: 16; Out: 3; Total: 19, user=None, store=True), sequence_number=9, type='response.completed')
Now let's test the Anthropic SDK.

Coordination
msg_01AgKMiEZKSXzQYnSYWaducJ[{'citations': None, 'text': 'Coordination', 'type': 'text'}]claude-sonnet-4-20250514assistantend_turnNonemessage{'cache_creation': {'ephemeral_1h_input_tokens': 0, 'ephemeral_5m_input_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'input_tokens': 15, 'output_tokens': 5, 'server_tool_use': None, 'service_tier': 'standard'}

Coordination

msg_01AgKMiEZKSXzQYnSYWaducJ[{'citations': None, 'text': 'Coordination', 'type': 'text'}]claude-sonnet-4-20250514assistantend_turnNonemessage{'cache_creation': {'ephemeral_1h_input_tokens': 0, 'ephemeral_5m_input_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'input_tokens': 15, 'output_tokens': 5, 'server_tool_use': None, 'service_tier': 'standard'}

Let's test streaming.
RawMessageStartEvent(message=Message(id='msg_01Jnrv7itVfTgQGFkbGnoi6k', content=[], model='claude-sonnet-4-20250514', role='assistant', stop_reason=None, stop_sequence=None, type='message', usage=In: 16; Out: 1; Cache create: 0; Cache read: 0; Total Tokens: 17; Search: 0), type='message_start')
RawContentBlockStartEvent(content_block=TextBlock(citations=None, text='', type='text'), index=0, type='content_block_start')
RawContentBlockDeltaEvent(delta=TextDelta(text='**', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text='Buffering**', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockStopEvent(index=0, type='content_block_stop')
RawMessageDeltaEvent(delta=Delta(stop_reason='end_turn', stop_sequence=None), type='message_delta', usage=MessageDeltaUsage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=16, output_tokens=8, server_tool_use=None))
RawMessageStopEvent(type='message_stop')
RawMessageStartEvent(message=Message(id='msg_01Jnrv7itVfTgQGFkbGnoi6k', content=[], model='claude-sonnet-4-20250514', role='assistant', stop_reason=None, stop_sequence=None, type='message', usage=In: 16; Out: 1; Cache create: 0; Cache read: 0; Total Tokens: 17; Search: 0), type='message_start')
RawContentBlockStartEvent(content_block=TextBlock(citations=None, text='', type='text'), index=0, type='content_block_start')
RawContentBlockDeltaEvent(delta=TextDelta(text='**', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text='Buffering**', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockStopEvent(index=0, type='content_block_stop')
RawMessageDeltaEvent(delta=Delta(stop_reason='end_turn', stop_sequence=None), type='message_delta', usage=MessageDeltaUsage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=16, output_tokens=8, server_tool_use=None))
RawMessageStopEvent(type='message_stop')
Let’s test async.
concurrent
msg_01R33rKFKM5BqeJprT6A6DVM[{'citations': None, 'text': '**concurrent**', 'type': 'text'}]claude-sonnet-4-20250514assistantend_turnNonemessage{'cache_creation': {'ephemeral_1h_input_tokens': 0, 'ephemeral_5m_input_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'input_tokens': 15, 'output_tokens': 6, 'server_tool_use': None, 'service_tier': 'standard'}

concurrent

msg_01R33rKFKM5BqeJprT6A6DVM[{'citations': None, 'text': '**concurrent**', 'type': 'text'}]claude-sonnet-4-20250514assistantend_turnNonemessage{'cache_creation': {'ephemeral_1h_input_tokens': 0, 'ephemeral_5m_input_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'input_tokens': 15, 'output_tokens': 6, 'server_tool_use': None, 'service_tier': 'standard'}

Let's test async streaming.
event: message_start
data: {"type":"message_start","message":{"id":"msg_019CQEuc6f7D2ewx7Co1PcTW","type":"message","role":"assistant","model":"claude-sonnet-4-20250514","content":[],"stop_reason":null,"stop_sequence":null,"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"cache_creation":{"ephemeral_5m_input_tokens":0,"ephemeral_1h_input_tokens":0},"output_tokens":1,"service_tier":"standard"}} }
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""} }
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Async"} }
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Iterable"}}
event: ping
data: {"type": "ping"}
event: content_block_stop
data: {"type":"content_block_stop","index":0 }
event: ping
data: {"type": "ping"}
event: ping
data: {"type": "ping"}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"output_tokens":6} }
event: message_stop
data: {"type":"message_stop"}
event: message_start
data: {"type":"message_start","message":{"id":"msg_019CQEuc6f7D2ewx7Co1PcTW","type":"message","role":"assistant","model":"claude-sonnet-4-20250514","content":[],"stop_reason":null,"stop_sequence":null,"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"cache_creation":{"ephemeral_5m_input_tokens":0,"ephemeral_1h_input_tokens":0},"output_tokens":1,"service_tier":"standard"}} }
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""} }
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Async"} }
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Iterable"}}
event: ping
data: {"type": "ping"}
event: content_block_stop
data: {"type":"content_block_stop","index":0 }
event: ping
data: {"type": "ping"}
event: ping
data: {"type": "ping"}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"output_tokens":6} }
event: message_stop
data: {"type":"message_stop"}
Let's test the LiteLLM SDK by running sync and async calls, with and without streaming, against OpenAI, Anthropic, and Gemini.
We'll also double-check tool calls and citations.
Let’s define a helper method to display a streamed response.
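The helper isn’t shown in this extract; a minimal sketch could be:

def show_stream(resp):
    "Print each text delta from a LiteLLM streaming response."
    for chunk in resp:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content)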
Let’s test claude-sonnet-x.
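A hypothetical call along these lines (the prompt and exact model string are illustrative); the second identical call should be replayed from cachy.jsonl:

import litellm

msgs = [{"role": "user", "content": "In one word, what makes LLM APIs great?"}]
print(litellm.completion(model="anthropic/claude-sonnet-4-20250514", messages=msgs))
print(litellm.completion(model="anthropic/claude-sonnet-4-20250514", messages=msgs))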
ModelResponse(id='chatcmpl-b2bea735-71f6-46bb-8645-09b96052d3e2', created=1757941046, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='**Streamlined**', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=8, prompt_tokens=18, total_tokens=26, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))
ModelResponse(id='chatcmpl-c31df5f6-2278-4f92-9fab-5756ec4855e5', created=1757941046, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='**Streamlined**', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=8, prompt_tokens=18, total_tokens=26, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))
Now, with streaming enabled.
**
Lightweight**
Let’s test gpt-4o.
ModelResponse(id='chatcmpl-CG2xxEWT6Y33MdJvMnIRDuqlOQUvB', created=1757940685, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_f33640a400', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Efficiency', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=1, prompt_tokens=18, total_tokens=19, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')
ModelResponse(id='chatcmpl-CG2xxEWT6Y33MdJvMnIRDuqlOQUvB', created=1757940685, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_f33640a400', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Efficiency', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=1, prompt_tokens=18, total_tokens=19, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')
Now, with streaming enabled.
Synchronization
Let’s test 2.0-flash.
ModelResponse(id='zwvIaMfgK7CBvdIPsqDQqQY', created=1757941047, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Synchronization.\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=3, prompt_tokens=10, total_tokens=13, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=10, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])
ModelResponse(id='zwvIaMfgK7CBvdIPsqDQqQY', created=1757941047, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Synchronization.\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=3, prompt_tokens=10, total_tokens=13, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=10, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])
Now, with streaming enabled.
Eff
ortless
Let’s test claude-sonnet-x.
ModelResponse(id='chatcmpl-66487276-199b-4045-b980-eeb2c83148e9', created=1757941047, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='**coroutine**', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=8, prompt_tokens=18, total_tokens=26, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))
ModelResponse(id='chatcmpl-a1334d94-646c-4933-8b92-58178b067a88', created=1757941047, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='**coroutine**', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=8, prompt_tokens=18, total_tokens=26, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))
Now, with streaming enabled.
**
reactive**
Let’s test gpt-4o.
ModelResponse(id='chatcmpl-CG2y4f3zgvJK7cENDO4UDN0hHZ6fH', created=1757940692, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_f33640a400', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Fast.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')
ModelResponse(id='chatcmpl-CG2y4f3zgvJK7cENDO4UDN0hHZ6fH', created=1757940692, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_f33640a400', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Fast.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')
Now, with streaming enabled.
Eff
icient
.
Let’s test 2.0-flash.
ModelResponse(id='1gvIaLDAGe2kvdIPsY6--Q0', created=1757941047, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Concurrency\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=2, prompt_tokens=10, total_tokens=12, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=10, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])
ModelResponse(id='1gvIaLDAGe2kvdIPsY6--Q0', created=1757941047, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Concurrency\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=2, prompt_tokens=10, total_tokens=12, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=10, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])
Now, with streaming enabled.
Concurrency
As a sanity check let’s confirm that tool calls work.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type":"string", "description":"The city e.g. Reims"},
                    "unit": {"type":"string", "enum":["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            }
        }
    }
]

ModelResponse(id='chatcmpl-78f754da-9ec9-490e-86f5-07049d0446e5', created=1757941047, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='tool_calls', index=0, message=Message(content=None, role='assistant', tool_calls=[ChatCompletionMessageToolCall(index=0, function=Function(arguments='{"location": "Reims"}', name='get_current_weather'), id='toolu_0182nVBg1pTYTadKxS5qgCt4', type='function')], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=57, prompt_tokens=427, total_tokens=484, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))
ModelResponse(id='chatcmpl-73aac225-ac1b-42cc-80cf-61cefb999295', created=1757941047, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='tool_calls', index=0, message=Message(content=None, role='assistant', tool_calls=[ChatCompletionMessageToolCall(index=0, function=Function(arguments='{"location": "Reims"}', name='get_current_weather'), id='toolu_0182nVBg1pTYTadKxS5qgCt4', type='function')], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=57, prompt_tokens=427, total_tokens=484, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))