core

Cache your API calls with a single line of code. No mocks, no fixtures. Just faster, cleaner code.

Introduction

We often call APIs while prototyping and testing our code. A single API call (e.g. an Anthropic chat completion) can take hundreds of milliseconds to run. This can really slow down development, especially if our notebook contains many API calls 😞.

cachy caches API requests. It does this by saving the result of each API call to a local cachy.jsonl file. Before calling an API (e.g. OpenAI), it checks whether the request already exists in cachy.jsonl. If it does, it returns the cached result.
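In practice that means a notebook only needs the two lines below (shown again later when we build enable_cachy), and every repeated call to a supported provider is then served from the cache.

from cachy import enable_cachy
enable_cachy()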

How does it work?

Under the hood, popular SDKs like OpenAI, Anthropic and LiteLLM use httpx.Client and httpx.AsyncClient.

cachy patches the send method of both clients and injects a simple caching mechanism:

  • create a cache key from the request
  • if the key exists in cachy.jsonl, return the cached response
  • if not, call the API and save the response to cachy.jsonl

cachy.jsonl contains one API response per line.

Each line has the following format: {"key": key, "response": response}

  • key: hash of the API request
  • response: the API response.
{
    "key": "afc2be0c", 
    "response": "{\"id\":\"msg_xxx\",\"type\":\"message\",\"role\":\"assistant\",\"model\":\"claude-sonnet-4-20250514\",\"content\":[{\"type\":\"text\",\"text\":\"Coordination.\"}],\"stop_reason\":\"end_turn\",\"stop_sequence\":null,\"usage\":{\"input_tokens\":16,\"cache_creation_input_tokens\":0,\"cache_read_input_tokens\":0,\"cache_creation\":{\"ephemeral_5m_input_tokens\":0,\"ephemeral_1h_input_tokens\":0},\"output_tokens\":6,\"service_tier\":\"standard\"}}"
}

Patching httpx

Patching a method is very straightforward.

In our case we want to patch httpx._client.Client.send and httpx._client.AsyncClient.send.

These methods are called when running httpx.get, httpx.post, etc.

In the example below we use @patch from fastcore to print 'calling an API' whenever httpx._client.Client.send is run.

@patch
def send(self:httpx._client.Client, r, **kwargs):
    print('calling an API')
    return self._orig_send(r, **kwargs)
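Here _orig_send refers to the original httpx send, which cachy keeps on the client class so the patched method can still make real requests (and so the patch can be undone later). A minimal sketch of that setup, assuming we stash it ourselves before patching, might be:

import httpx

# Stash the original send methods once so the patched versions can delegate to them.
# (Sketch only; the exact way cachy records _orig_send may differ.)
if not hasattr(httpx._client.Client, "_orig_send"):
    httpx._client.Client._orig_send = httpx._client.Client.send
if not hasattr(httpx._client.AsyncClient, "_orig_send"):
    httpx._client.AsyncClient._orig_send = httpx._client.AsyncClient.send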

Cache Filtering

Now, let’s build up our caching logic piece-by-piece.

The first thing we need to do is ensure that our caching logic only runs on specific urls.

For now, let’s only cache API calls made to popular LLM providers like OpenAI, Anthropic, Google and DeepSeek. We can make this fully customizable later.

Exported source
doms = ("api.openai.com", "api.anthropic.com", "generativelanguage.googleapis.com", "api.deepseek.com")
Exported source
def _should_cache(url, doms): return any(dom in str(url) for dom in doms)
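For example (illustrative URLs):

_should_cache("https://api.openai.com/v1/chat/completions", doms)
True
_should_cache("https://example.com/v1/endpoint", doms)
False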

We could then use _should_cache like this.

@patch
def send(self:httpx._client.Client, r, **kwargs):
    if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
    # insert caching logic
    ...

Cache Key

The next thing we need to do is figure out if a response for the request r already exists in our cache.

Recall that each line in cachy.jsonl has the following format {"key": key, "response": response}.

Our key needs to be unique and deterministic. One way to do this is to concatenate the request URL and content, then generate a hash from the result.

def _key(r): return hashlib.sha256(str(r.url.host).encode() + r.content).hexdigest()[:8]

We use r.url.host instead of r.url because when LiteLLM calls Gemini it includes the API key in a query param. See #1.

If we used r.url we wouldn’t be able to use dummy API keys when running our notebooks in a CI pipeline.

Let’s test this out.

r1 = httpx.Request('POST', 'https://api.openai.com/v1/chat/completions', content=b'some content')
r1
<Request('POST', 'https://api.openai.com/v1/chat/completions')>
_key(r1)
'80acdde4'

If we run it again we should get the same key.

_key(r1)
'80acdde4'

Let’s modify the url and confirm we get a different key.

_key(httpx.Request('POST', 'https://api.anthropic.com/v1/messages', content=b'some content'))
'2707fa05'

Great. Let’s update our patch.

@patch
def send(self:httpx._client.Client, r, **kwargs):
    if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
    key = _key(r)
    # if cache hit, return the response
    # else run the request, write the response to the cache, and return it
    ...

Cache Reads/Writes

Now let’s add some methods that will read from and write to cachy.jsonl.

Exported source
def _cache(key, cfp):
    with open(cfp, "r") as f:
        line = first(f, lambda l: json.loads(l)["key"] == key)
        return json.loads(line)["response"] if line else None
Exported source
def _write_cache(key, content, cfp):
    with open(cfp, "a") as f: f.write(json.dumps({"key":key, "response": content})+"\n")
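As a quick, hypothetical round trip (using a throwaway demo_cache.jsonl rather than the real cache file):

_write_cache("deadbeef", '{"hello": "world"}', "demo_cache.jsonl")
_cache("deadbeef", "demo_cache.jsonl")
'{"hello": "world"}'
_cache("00000000", "demo_cache.jsonl") is None
True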

Let’s update our patch.

@patch
def send(self:httpx._client.Client, r, **kwargs):
    if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
    key = _key(r)
    if res := _cache(key,"cachy.jsonl"): return httpx.Response(status_code=200, content=res, request=r)
    res = self._orig_send(r, **kwargs)
    content = res.read().decode()
    _write_cache(key, content, "cachy.jsonl")
    return httpx.Response(status_code=res.status_code, content=content, request=r)

Streaming

Let’s add support for streaming.

First, let’s include an is_stream bool in our hash so that a non-streamed request generates a different key than the same request when streamed.

Exported source
def _key(r, is_stream=False):
    "Create a unique, deterministic id from the request `r`."
    return hashlib.sha256(f"{r.url.host}{is_stream}".encode() + r.content).hexdigest()[:8]
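Reusing r1 from earlier, the streamed and non-streamed variants of the same request now produce different keys:

_key(r1) == _key(r1, is_stream=True)
False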

In the patch we need to consume the entire stream before writing it to the cache.

@patch
def send(self:httpx._client.Client, r, **kwargs):
    is_stream = kwargs.get("stream")
    if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
    key = _key(r, is_stream=is_stream)
    if res := _cache(key,"cachy.jsonl"): return httpx.Response(status_code=200, content=res, request=r)
    res = self._orig_send(r, **kwargs)
    content = res.read().decode() if not is_stream else b''.join(list(res.iter_bytes())).decode()
    _write_cache(key, content, "cachy.jsonl")
    return httpx.Response(status_code=res.status_code, content=content, request=r)

enable_cachy

To make cachy as user-friendly as possible, let’s make it possible to apply our patch by calling a single function at the top of our notebook.

from cachy import enable_cachy

enable_cachy()

For this to work, we’ll need to wrap our patch in a function.

def _apply_patch():    
    @patch    
    def send(self:httpx._client.Client, r, **kwargs):
        is_stream = kwargs.get("stream")
        if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
        key = _key(r, is_stream=is_stream)
        if res := _cache(key,"cachy.jsonl"): return httpx.Response(status_code=200, content=res, request=r)
        res = self._orig_send(r, **kwargs)
        content = res.read().decode() if not is_stream else b''.join(list(res.iter_bytes())).decode()
        _write_cache(key, content, "cachy.jsonl")
        return httpx.Response(status_code=res.status_code, content=content, request=r)
def enable_cachy():  
    _apply_patch()

Great. Now, let’s make cachy a little more customizable by making it possible to specify:

  • the APIs (or domains) to cache
  • the location of the cache file.
def enable_cachy(cache_dir=None, doms=doms):
    cfp = Path(cache_dir or getattr(Config.find("settings.ini"), "config_path", ".")) / "cachy.jsonl"
    cfp.touch(exist_ok=True)   
    _apply_patch(cfp, doms)

Note: If our notebook is running in an nbdev project, Config.find("settings.ini").config_path automatically finds the base dir.
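For example, we could point cachy at a different directory and add an extra provider (hypothetical values, just to illustrate the signature):

enable_cachy(cache_dir="/tmp/my_caches", doms=("api.openai.com", "api.mistral.ai"))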

def _apply_patch(cfp, doms):    
    @patch
    def send(self:httpx._client.Client, r, **kwargs):
        is_stream = kwargs.get("stream")
        if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
        key = _key(r, is_stream=is_stream)
        if res := _cache(key,cfp): return httpx.Response(status_code=200, content=res, request=r)
        res = self._orig_send(r, **kwargs)
        content = res.read().decode() if not is_stream else b''.join(list(res.iter_bytes())).decode()
        _write_cache(key,content,cfp)
        return httpx.Response(status_code=res.status_code, content=content, request=r)

Async

Now let’s add support for async requests.

Exported source
def _apply_async_patch(cfp, doms):    
    @patch
    async def send(self:httpx._client.AsyncClient, r, **kwargs):
        is_stream = kwargs.get("stream")
        if not _should_cache(r.url, doms): return await self._orig_send(r, **kwargs)
        key = _key(r, is_stream=is_stream)
        if res := _cache(key, cfp): return httpx.Response(status_code=200, content=res, request=r)
        res = await self._orig_send(r, **kwargs)
        content = res.read().decode() if not is_stream else b''.join([c async for c in res.aiter_bytes()]).decode()
        _write_cache(key, content, cfp)
        return httpx.Response(status_code=res.status_code, content=content, request=r)

Let’s rename our original patch.

Exported source
def _apply_sync_patch(cfp, doms):    
    @patch
    def send(self:httpx._client.Client, r, **kwargs):
        is_stream = kwargs.get("stream")
        if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
        key = _key(r, is_stream=is_stream)
        if res := _cache(key,cfp): return httpx.Response(status_code=200, content=res, request=r)
        res = self._orig_send(r, **kwargs)
        content = res.read().decode() if not is_stream else b''.join(list(res.iter_bytes())).decode()
        _write_cache(key,content,cfp)
        return httpx.Response(status_code=res.status_code, content=content, request=r)

Finally, let’s update enable_cachy.


source

enable_cachy

 enable_cachy (cache_dir=None, doms=('api.openai.com',
               'api.anthropic.com', 'generativelanguage.googleapis.com',
               'api.deepseek.com'))
Exported source
def enable_cachy(cache_dir=None, doms=doms):
    cfp = Path(cache_dir or getattr(Config.find("settings.ini"), "config_path", ".")) / "cachy.jsonl"
    cfp.touch(exist_ok=True)   
    _apply_sync_patch(cfp, doms)
    _apply_async_patch(cfp, doms)

And a way to turn it off:


source

disable_cachy

 disable_cachy ()
Exported source
def disable_cachy():
    httpx._client.AsyncClient.send = httpx._client.AsyncClient._orig_send
    httpx._client.Client.send      = httpx._client.Client._orig_send
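A typical notebook session might then look like this (sketch):

enable_cachy()    # start caching httpx calls to cachy.jsonl
# ... run cells that call OpenAI / Anthropic / Gemini / DeepSeek ...
disable_cachy()   # restore the original send methods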

Tests

Let’s test enable_cachy on 3 SDKs (OpenAI, Anthropic, LiteLLM) for the scenarios below:

  • sync requests with(out) streaming
  • async requests with(out) streaming

Add some helper functions.

class mods: ant="claude-sonnet-4-20250514"; oai="gpt-4o"; gem="gemini/gemini-2.0-flash"
def mk_msgs(m): return [{"role": "user", "content": f"write 1 word about {m}"}]
enable_cachy()

OpenAI

from openai import OpenAI
cli = OpenAI()
r = cli.responses.create(model=mods.oai, input=mk_msgs("openai sync"))
r

Collaboration

  • id: resp_0adc61ccfb3938da0068c80baeb59c81a29e449ca5f3333fa2
  • created_at: 1757940654.0
  • error: None
  • incomplete_details: None
  • instructions: None
  • metadata: {}
  • model: gpt-4o-2024-08-06
  • object: response
  • output: [ResponseOutputMessage(id=‘msg_0adc61ccfb3938da0068c80baf67fc81a2b8fe99f555767ec5’, content=[ResponseOutputText(annotations=[], text=‘Collaboration’, type=‘output_text’, logprobs=[])], role=‘assistant’, status=‘completed’, type=‘message’)]
  • parallel_tool_calls: True
  • temperature: 1.0
  • tool_choice: auto
  • tools: []
  • top_p: 1.0
  • background: False
  • conversation: None
  • max_output_tokens: None
  • max_tool_calls: None
  • previous_response_id: None
  • prompt: None
  • prompt_cache_key: None
  • reasoning: Reasoning(effort=None, generate_summary=None, summary=None)
  • safety_identifier: None
  • service_tier: default
  • status: completed
  • text: ResponseTextConfig(format=ResponseFormatText(type=‘text’), verbosity=‘medium’)
  • top_logprobs: 0
  • truncation: disabled
  • usage: ResponseUsage(input_tokens=15, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=3, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=18)
  • user: None
  • store: True
r = cli.responses.create(model=mods.oai, input=mk_msgs("openai sync"))
r

Collaboration

  • id: resp_0adc61ccfb3938da0068c80baeb59c81a29e449ca5f3333fa2
  • created_at: 1757940654.0
  • error: None
  • incomplete_details: None
  • instructions: None
  • metadata: {}
  • model: gpt-4o-2024-08-06
  • object: response
  • output: [ResponseOutputMessage(id=‘msg_0adc61ccfb3938da0068c80baf67fc81a2b8fe99f555767ec5’, content=[ResponseOutputText(annotations=[], text=‘Collaboration’, type=‘output_text’, logprobs=[])], role=‘assistant’, status=‘completed’, type=‘message’)]
  • parallel_tool_calls: True
  • temperature: 1.0
  • tool_choice: auto
  • tools: []
  • top_p: 1.0
  • background: False
  • conversation: None
  • max_output_tokens: None
  • max_tool_calls: None
  • previous_response_id: None
  • prompt: None
  • prompt_cache_key: None
  • reasoning: Reasoning(effort=None, generate_summary=None, summary=None)
  • safety_identifier: None
  • service_tier: default
  • status: completed
  • text: ResponseTextConfig(format=ResponseFormatText(type=‘text’), verbosity=‘medium’)
  • top_logprobs: 0
  • truncation: disabled
  • usage: ResponseUsage(input_tokens=15, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=3, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=18)
  • user: None
  • store: True

Let’s test streaming.

r = cli.responses.create(model=mods.oai, input=mk_msgs("openai sync streaming"), stream=True)
for ch in r: print(ch)
ResponseCreatedEvent(response=Response(id='resp_0bedad8c74657bc30068c80bb26cf4819ca6920df9fbd23916', created_at=1757940658.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_0bedad8c74657bc30068c80bb26cf4819ca6920df9fbd23916', created_at=1757940658.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', content=[], role='assistant', status='in_progress', type='message'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=0, delta='Innov', item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', logprobs=[], output_index=0, sequence_number=4, type='response.output_text.delta', obfuscation='VCpZLMzWQSq')
ResponseTextDeltaEvent(content_index=0, delta='ative', item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', logprobs=[], output_index=0, sequence_number=5, type='response.output_text.delta', obfuscation='xFQTnzLeSC8')
ResponseTextDoneEvent(content_index=0, item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', logprobs=[], output_index=0, sequence_number=6, text='Innovative', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=0, item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', output_index=0, part=ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[]), sequence_number=7, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', content=[ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[])], role='assistant', status='completed', type='message'), output_index=0, sequence_number=8, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_0bedad8c74657bc30068c80bb26cf4819ca6920df9fbd23916', created_at=1757940658.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', content=[ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=In: 16; Out: 3; Total: 19, user=None, store=True), sequence_number=9, type='response.completed')
r = cli.responses.create(model=mods.oai, input=mk_msgs("openai sync streaming"), stream=True)
for ch in r: print(ch)
ResponseCreatedEvent(response=Response(id='resp_0bedad8c74657bc30068c80bb26cf4819ca6920df9fbd23916', created_at=1757940658.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_0bedad8c74657bc30068c80bb26cf4819ca6920df9fbd23916', created_at=1757940658.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', content=[], role='assistant', status='in_progress', type='message'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=0, delta='Innov', item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', logprobs=[], output_index=0, sequence_number=4, type='response.output_text.delta', obfuscation='VCpZLMzWQSq')
ResponseTextDeltaEvent(content_index=0, delta='ative', item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', logprobs=[], output_index=0, sequence_number=5, type='response.output_text.delta', obfuscation='xFQTnzLeSC8')
ResponseTextDoneEvent(content_index=0, item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', logprobs=[], output_index=0, sequence_number=6, text='Innovative', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=0, item_id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', output_index=0, part=ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[]), sequence_number=7, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', content=[ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[])], role='assistant', status='completed', type='message'), output_index=0, sequence_number=8, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_0bedad8c74657bc30068c80bb26cf4819ca6920df9fbd23916', created_at=1757940658.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_0bedad8c74657bc30068c80bb33eb4819c9b4ac3a1ef6111d7', content=[ResponseOutputText(annotations=[], text='Innovative', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=In: 16; Out: 3; Total: 19, user=None, store=True), sequence_number=9, type='response.completed')

Let’s test async.

from openai import AsyncOpenAI
cli = AsyncOpenAI()
r = await cli.responses.create(model=mods.oai, input=mk_msgs("openai async"))
r

Innovative

  • id: resp_07966b24b5f8fcde0068c80bc084cc819785b99bc0eaa06e47
  • created_at: 1757940672.0
  • error: None
  • incomplete_details: None
  • instructions: None
  • metadata: {}
  • model: gpt-4o-2024-08-06
  • object: response
  • output: [ResponseOutputMessage(id=‘msg_07966b24b5f8fcde0068c80bc1038481978db7b032ab75dbef’, content=[ResponseOutputText(annotations=[], text=‘Innovative’, type=‘output_text’, logprobs=[])], role=‘assistant’, status=‘completed’, type=‘message’)]
  • parallel_tool_calls: True
  • temperature: 1.0
  • tool_choice: auto
  • tools: []
  • top_p: 1.0
  • background: False
  • conversation: None
  • max_output_tokens: None
  • max_tool_calls: None
  • previous_response_id: None
  • prompt: None
  • prompt_cache_key: None
  • reasoning: Reasoning(effort=None, generate_summary=None, summary=None)
  • safety_identifier: None
  • service_tier: default
  • status: completed
  • text: ResponseTextConfig(format=ResponseFormatText(type=‘text’), verbosity=‘medium’)
  • top_logprobs: 0
  • truncation: disabled
  • usage: ResponseUsage(input_tokens=15, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=3, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=18)
  • user: None
  • store: True
r = await cli.responses.create(model=mods.oai, input=mk_msgs("openai async"))
r

Innovative

  • id: resp_07966b24b5f8fcde0068c80bc084cc819785b99bc0eaa06e47
  • created_at: 1757940672.0
  • error: None
  • incomplete_details: None
  • instructions: None
  • metadata: {}
  • model: gpt-4o-2024-08-06
  • object: response
  • output: [ResponseOutputMessage(id=‘msg_07966b24b5f8fcde0068c80bc1038481978db7b032ab75dbef’, content=[ResponseOutputText(annotations=[], text=‘Innovative’, type=‘output_text’, logprobs=[])], role=‘assistant’, status=‘completed’, type=‘message’)]
  • parallel_tool_calls: True
  • temperature: 1.0
  • tool_choice: auto
  • tools: []
  • top_p: 1.0
  • background: False
  • conversation: None
  • max_output_tokens: None
  • max_tool_calls: None
  • previous_response_id: None
  • prompt: None
  • prompt_cache_key: None
  • reasoning: Reasoning(effort=None, generate_summary=None, summary=None)
  • safety_identifier: None
  • service_tier: default
  • status: completed
  • text: ResponseTextConfig(format=ResponseFormatText(type=‘text’), verbosity=‘medium’)
  • top_logprobs: 0
  • truncation: disabled
  • usage: ResponseUsage(input_tokens=15, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=3, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=18)
  • user: None
  • store: True

Let’s test async streaming.

r = await cli.responses.create(model=mods.oai, input=mk_msgs("openai async streaming"), stream=True)
async for ch in r: print(ch)
ResponseCreatedEvent(response=Response(id='resp_09667f14142ea18b0068c80bc16c8481a1988be17cf34e1823', created_at=1757940673.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_09667f14142ea18b0068c80bc16c8481a1988be17cf34e1823', created_at=1757940673.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', content=[], role='assistant', status='in_progress', type='message'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=0, delta='Eff', item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', logprobs=[], output_index=0, sequence_number=4, type='response.output_text.delta', obfuscation='bCNpIYrqdSk3w')
ResponseTextDeltaEvent(content_index=0, delta='icient', item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', logprobs=[], output_index=0, sequence_number=5, type='response.output_text.delta', obfuscation='wHhRAwEPuj')
ResponseTextDoneEvent(content_index=0, item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', logprobs=[], output_index=0, sequence_number=6, text='Efficient', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=0, item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', output_index=0, part=ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[]), sequence_number=7, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', content=[ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[])], role='assistant', status='completed', type='message'), output_index=0, sequence_number=8, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_09667f14142ea18b0068c80bc16c8481a1988be17cf34e1823', created_at=1757940673.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', content=[ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=In: 16; Out: 3; Total: 19, user=None, store=True), sequence_number=9, type='response.completed')
r = await cli.responses.create(model=mods.oai, input=mk_msgs("openai async streaming"), stream=True)
async for ch in r: print(ch)
ResponseCreatedEvent(response=Response(id='resp_09667f14142ea18b0068c80bc16c8481a1988be17cf34e1823', created_at=1757940673.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_09667f14142ea18b0068c80bc16c8481a1988be17cf34e1823', created_at=1757940673.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='auto', status='in_progress', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=None, user=None, store=True), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', content=[], role='assistant', status='in_progress', type='message'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=0, delta='Eff', item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', logprobs=[], output_index=0, sequence_number=4, type='response.output_text.delta', obfuscation='bCNpIYrqdSk3w')
ResponseTextDeltaEvent(content_index=0, delta='icient', item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', logprobs=[], output_index=0, sequence_number=5, type='response.output_text.delta', obfuscation='wHhRAwEPuj')
ResponseTextDoneEvent(content_index=0, item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', logprobs=[], output_index=0, sequence_number=6, text='Efficient', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=0, item_id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', output_index=0, part=ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[]), sequence_number=7, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', content=[ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[])], role='assistant', status='completed', type='message'), output_index=0, sequence_number=8, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_09667f14142ea18b0068c80bc16c8481a1988be17cf34e1823', created_at=1757940673.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4o-2024-08-06', object='response', output=[ResponseOutputMessage(id='msg_09667f14142ea18b0068c80bc1c07081a1913a81eae5c426d3', content=[ResponseOutputText(annotations=[], text='Efficient', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, conversation=None, max_output_tokens=None, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=Reasoning(effort=None, generate_summary=None, summary=None), safety_identifier=None, service_tier='default', status='completed', text=ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium'), top_logprobs=0, truncation='disabled', usage=In: 16; Out: 3; Total: 19, user=None, store=True), sequence_number=9, type='response.completed')

Anthropic

from anthropic import Anthropic
cli = Anthropic()
r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync"))
r

Coordination

  • id: msg_01AgKMiEZKSXzQYnSYWaducJ
  • content: [{'citations': None, 'text': 'Coordination', 'type': 'text'}]
  • model: claude-sonnet-4-20250514
  • role: assistant
  • stop_reason: end_turn
  • stop_sequence: None
  • type: message
  • usage: {'cache_creation': {'ephemeral_1h_input_tokens': 0, 'ephemeral_5m_input_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'input_tokens': 15, 'output_tokens': 5, 'server_tool_use': None, 'service_tier': 'standard'}
r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync"))
r

Coordination

  • id: msg_01AgKMiEZKSXzQYnSYWaducJ
  • content: [{'citations': None, 'text': 'Coordination', 'type': 'text'}]
  • model: claude-sonnet-4-20250514
  • role: assistant
  • stop_reason: end_turn
  • stop_sequence: None
  • type: message
  • usage: {'cache_creation': {'ephemeral_1h_input_tokens': 0, 'ephemeral_5m_input_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'input_tokens': 15, 'output_tokens': 5, 'server_tool_use': None, 'service_tier': 'standard'}

Let’s test streaming.

r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync streaming"), stream=True)
for ch in r: print(ch)
RawMessageStartEvent(message=Message(id='msg_01Jnrv7itVfTgQGFkbGnoi6k', content=[], model='claude-sonnet-4-20250514', role='assistant', stop_reason=None, stop_sequence=None, type='message', usage=In: 16; Out: 1; Cache create: 0; Cache read: 0; Total Tokens: 17; Search: 0), type='message_start')
RawContentBlockStartEvent(content_block=TextBlock(citations=None, text='', type='text'), index=0, type='content_block_start')
RawContentBlockDeltaEvent(delta=TextDelta(text='**', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text='Buffering**', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockStopEvent(index=0, type='content_block_stop')
RawMessageDeltaEvent(delta=Delta(stop_reason='end_turn', stop_sequence=None), type='message_delta', usage=MessageDeltaUsage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=16, output_tokens=8, server_tool_use=None))
RawMessageStopEvent(type='message_stop')
r = cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant sync streaming"), stream=True)
for ch in r: print(ch)
RawMessageStartEvent(message=Message(id='msg_01Jnrv7itVfTgQGFkbGnoi6k', content=[], model='claude-sonnet-4-20250514', role='assistant', stop_reason=None, stop_sequence=None, type='message', usage=In: 16; Out: 1; Cache create: 0; Cache read: 0; Total Tokens: 17; Search: 0), type='message_start')
RawContentBlockStartEvent(content_block=TextBlock(citations=None, text='', type='text'), index=0, type='content_block_start')
RawContentBlockDeltaEvent(delta=TextDelta(text='**', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text='Buffering**', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockStopEvent(index=0, type='content_block_stop')
RawMessageDeltaEvent(delta=Delta(stop_reason='end_turn', stop_sequence=None), type='message_delta', usage=MessageDeltaUsage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=16, output_tokens=8, server_tool_use=None))
RawMessageStopEvent(type='message_stop')

Let’s test async.

from anthropic import AsyncAnthropic
cli = AsyncAnthropic()
r = await cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant async"))
r

concurrent

  • id: msg_01R33rKFKM5BqeJprT6A6DVM
  • content: [{'citations': None, 'text': '**concurrent**', 'type': 'text'}]
  • model: claude-sonnet-4-20250514
  • role: assistant
  • stop_reason: end_turn
  • stop_sequence: None
  • type: message
  • usage: {'cache_creation': {'ephemeral_1h_input_tokens': 0, 'ephemeral_5m_input_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'input_tokens': 15, 'output_tokens': 6, 'server_tool_use': None, 'service_tier': 'standard'}
r = await cli.messages.create(model=mods.ant, max_tokens=1024, messages=mk_msgs("ant async"))
r

concurrent

  • id: msg_01R33rKFKM5BqeJprT6A6DVM
  • content: [{'citations': None, 'text': '**concurrent**', 'type': 'text'}]
  • model: claude-sonnet-4-20250514
  • role: assistant
  • stop_reason: end_turn
  • stop_sequence: None
  • type: message
  • usage: {'cache_creation': {'ephemeral_1h_input_tokens': 0, 'ephemeral_5m_input_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'input_tokens': 15, 'output_tokens': 6, 'server_tool_use': None, 'service_tier': 'standard'}

Let’s test async streaming.

r = await cli.messages.create(model=mods.ant,max_tokens=1024,messages=mk_msgs("ant async streaming"), stream=True)
async for ch in r.response.aiter_bytes(): print(ch.decode())
event: message_start
data: {"type":"message_start","message":{"id":"msg_019CQEuc6f7D2ewx7Co1PcTW","type":"message","role":"assistant","model":"claude-sonnet-4-20250514","content":[],"stop_reason":null,"stop_sequence":null,"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"cache_creation":{"ephemeral_5m_input_tokens":0,"ephemeral_1h_input_tokens":0},"output_tokens":1,"service_tier":"standard"}}       }

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}     }

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Async"}             }

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Iterable"}}

event: ping
data: {"type": "ping"}

event: content_block_stop
data: {"type":"content_block_stop","index":0     }

event: ping
data: {"type": "ping"}

event: ping
data: {"type": "ping"}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"output_tokens":6}              }

event: message_stop
data: {"type":"message_stop"}

r = await cli.messages.create(model=mods.ant,max_tokens=1024,messages=mk_msgs("ant async streaming"), stream=True)
async for ch in r.response.aiter_bytes(): print(ch.decode())
event: message_start
data: {"type":"message_start","message":{"id":"msg_019CQEuc6f7D2ewx7Co1PcTW","type":"message","role":"assistant","model":"claude-sonnet-4-20250514","content":[],"stop_reason":null,"stop_sequence":null,"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"cache_creation":{"ephemeral_5m_input_tokens":0,"ephemeral_1h_input_tokens":0},"output_tokens":1,"service_tier":"standard"}}       }

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}     }

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Async"}             }

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Iterable"}}

event: ping
data: {"type": "ping"}

event: content_block_stop
data: {"type":"content_block_stop","index":0     }

event: ping
data: {"type": "ping"}

event: ping
data: {"type": "ping"}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"input_tokens":16,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"output_tokens":6}              }

event: message_stop
data: {"type":"message_stop"}

LiteLLM

Let’s test the LiteLLM SDK by running sync/async calls with(out) streaming for OpenAI, Anthropic, & Gemini.

We’ll also double-check tool calls and citations.

Sync Tests

from litellm import completion

Let’s define a helper function to display a streamed response.

def _stream(r): 
    for ch in r: print(ch.choices[0].delta.content or "")
Anthropic

Let’s test claude-sonnet-x.

r = completion(model=mods.ant, messages=mk_msgs("lite: ant sync..."))
r
ModelResponse(id='chatcmpl-b2bea735-71f6-46bb-8645-09b96052d3e2', created=1757941046, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='**Streamlined**', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=8, prompt_tokens=18, total_tokens=26, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))
r = completion(model=mods.ant, messages=mk_msgs("lite: ant sync..."))
r
ModelResponse(id='chatcmpl-c31df5f6-2278-4f92-9fab-5756ec4855e5', created=1757941046, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='**Streamlined**', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=8, prompt_tokens=18, total_tokens=26, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))

Now, with streaming enabled.

r = completion(model=mods.ant, messages=mk_msgs("lite: ant sync stream..."), stream=True)
_stream(r)
**
Lightweight**
r = completion(model=mods.ant, messages=mk_msgs("lite: ant sync stream..."), stream=True)
_stream(r)
**
Lightweight**
OpenAI

Let’s test gpt-4o.

r = completion(model=mods.oai, messages=mk_msgs("lite: oai sync..."))
r
ModelResponse(id='chatcmpl-CG2xxEWT6Y33MdJvMnIRDuqlOQUvB', created=1757940685, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_f33640a400', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Efficiency', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=1, prompt_tokens=18, total_tokens=19, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')
r = completion(model=mods.oai, messages=mk_msgs("lite: oai sync..."))
r
ModelResponse(id='chatcmpl-CG2xxEWT6Y33MdJvMnIRDuqlOQUvB', created=1757940685, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_f33640a400', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Efficiency', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=1, prompt_tokens=18, total_tokens=19, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')

Now, with streaming enabled.

r = completion(model=mods.oai, messages=mk_msgs("lite: oai sync stream..."), stream=True)
_stream(r)
Synchronization

r = completion(model=mods.oai, messages=mk_msgs("lite: oai sync stream..."), stream=True)
_stream(r)
Synchronization

Gemini

Let’s test 2.0-flash.

r = completion(model=mods.gem, messages=mk_msgs("lite: gem sync..."))
r
ModelResponse(id='zwvIaMfgK7CBvdIPsqDQqQY', created=1757941047, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Synchronization.\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=3, prompt_tokens=10, total_tokens=13, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=10, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])
r = completion(model=mods.gem, messages=mk_msgs("lite: gem sync..."))
r
ModelResponse(id='zwvIaMfgK7CBvdIPsqDQqQY', created=1757941047, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Synchronization.\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=3, prompt_tokens=10, total_tokens=13, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=10, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])

Now, with streaming enabled.

r = completion(model=mods.gem, messages=mk_msgs("lite: gem sync stream..."), stream=True)
_stream(r)
Eff
ortless

r = completion(model=mods.gem, messages=mk_msgs("lite: gem sync stream..."), stream=True)
_stream(r)
Eff
ortless

Async Tests

from litellm import acompletion
async def _astream(r):
    async for chunk in r: print(chunk.choices[0].delta.content or "")
Anthropic

Let’s test claude-sonnet-x.

r = await acompletion(model=mods.ant, messages=mk_msgs("lite: ant async..."))
r
ModelResponse(id='chatcmpl-66487276-199b-4045-b980-eeb2c83148e9', created=1757941047, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='**coroutine**', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=8, prompt_tokens=18, total_tokens=26, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))
r = await acompletion(model=mods.ant, messages=mk_msgs("lite: ant async..."))
r
ModelResponse(id='chatcmpl-a1334d94-646c-4933-8b92-58178b067a88', created=1757941047, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='**coroutine**', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=8, prompt_tokens=18, total_tokens=26, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))

Now, with streaming enabled.

r = await acompletion(model=mods.ant, messages=mk_msgs("lite: ant async stream..."), stream=True)
await(_astream(r))
**
reactive**
r = await acompletion(model=mods.ant, messages=mk_msgs("lite: ant async stream..."), stream=True)
await(_astream(r))
**
reactive**
OpenAI

Let’s test gpt-4o.

r = await acompletion(model=mods.oai, messages=mk_msgs("lite: oai async..."))
r
ModelResponse(id='chatcmpl-CG2y4f3zgvJK7cENDO4UDN0hHZ6fH', created=1757940692, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_f33640a400', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Fast.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')
r = await acompletion(model=mods.oai, messages=mk_msgs("lite: oai async..."))
r
ModelResponse(id='chatcmpl-CG2y4f3zgvJK7cENDO4UDN0hHZ6fH', created=1757940692, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_f33640a400', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Fast.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')

Now, with streaming enabled.

r = await acompletion(model=mods.oai, messages=mk_msgs("lite: oai async stream..."), stream=True)
await(_astream(r))
Eff
icient
.

r = await acompletion(model=mods.oai, messages=mk_msgs("lite: oai async stream..."), stream=True)
await(_astream(r))
Eff
icient
.

Gemini

Let’s test 2.0-flash.

r = await acompletion(model=mods.gem, messages=mk_msgs("lite: gem async..."))
r
ModelResponse(id='1gvIaLDAGe2kvdIPsY6--Q0', created=1757941047, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Concurrency\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=2, prompt_tokens=10, total_tokens=12, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=10, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])
r = await acompletion(model=mods.gem, messages=mk_msgs("lite: gem async..."))
r
ModelResponse(id='1gvIaLDAGe2kvdIPsY6--Q0', created=1757941047, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Concurrency\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=2, prompt_tokens=10, total_tokens=12, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=10, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])

Now, with streaming enabled.

r = await acompletion(model=mods.gem, messages=mk_msgs("lite: gem async stream..."), stream=True)
await(_astream(r))
Concurrency
r = await acompletion(model=mods.gem, messages=mk_msgs("lite: gem async stream..."), stream=True)
await(_astream(r))
Concurrency

Tool Calls

As a sanity check let’s confirm that tool calls work.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type":"string", "description":"The city e.g. Reims"},
                    "unit": {"type":"string", "enum":["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            }
        }
    }
]
r = completion(model=mods.ant, messages=mk_msgs("Is it raining in Reims?"), tools=tools)
r
ModelResponse(id='chatcmpl-78f754da-9ec9-490e-86f5-07049d0446e5', created=1757941047, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='tool_calls', index=0, message=Message(content=None, role='assistant', tool_calls=[ChatCompletionMessageToolCall(index=0, function=Function(arguments='{"location": "Reims"}', name='get_current_weather'), id='toolu_0182nVBg1pTYTadKxS5qgCt4', type='function')], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=57, prompt_tokens=427, total_tokens=484, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))
r = completion(model=mods.ant, messages=mk_msgs("Is it raining in Reims?"), tools=tools)
r
ModelResponse(id='chatcmpl-73aac225-ac1b-42cc-80cf-61cefb999295', created=1757941047, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='tool_calls', index=0, message=Message(content=None, role='assistant', tool_calls=[ChatCompletionMessageToolCall(index=0, function=Function(arguments='{"location": "Reims"}', name='get_current_weather'), id='toolu_0182nVBg1pTYTadKxS5qgCt4', type='function')], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=57, prompt_tokens=427, total_tokens=484, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))