def find_block(r:abc.Mapping, # The message to look in
              ):
    "Find the content in `r`."
    m = nested_idx(r, 'candidates', 0)
    if not m: return m
    if hasattr(m, 'content'): return m.content
    else: return m
find_block(r)
parts {
text: "Hi Faisal! It\'s nice to meet you. I\'m an AI, and I\'m here to help. What can I do for you today?\n"
}
role: "model"
def contents(r):
    "Helper to get the contents from response `r`."
    blk = find_block(r)
    if not blk: return r
    if hasattr(blk, 'parts'): return getattr(blk,'parts')[0].text
    return blk
contents(r)
"Hi Faisal! It's nice to meet you. I'm an AI, and I'm here to help. What can I do for you today?\n"
Exported source
@patch()
def _repr_markdown_(self:GenerateContentResponse):
    met = list(self.to_dict()['candidates'][0].items()) + list(self.to_dict()['usage_metadata'].items())
    det = '\n- '.join(f'{k}: {v}' for k,v in met)
    res = contents(self)
    if not res: return f"- {det}"
    return f"""{contents(self)}\n<details>\n\n- {det}\n\n</details>"""
r
Hi Faisal! It’s nice to meet you. I’m an AI, and I’m here to help. What can I do for you today?
content: {‘parts’: [{‘text’: “Hi Faisal! It’s nice to meet you. I’m an AI, and I’m here to help. What can I do for you today?”}], ‘role’: ‘model’}
def usage(inp=0, # Number of input tokens
          out=0  # Number of output tokens
         ):
    "Slightly more concise version of `Usage`."
    return UsageMetadata(prompt_token_count=inp, candidates_token_count=out)
Add together each of input_tokens and output_tokens
Exported source
@patch
def __add__(self:UsageMetadata, b):
    "Add together each of `input_tokens` and `output_tokens`"
    return usage(self.prompt_token_count+b.prompt_token_count, self.candidates_token_count+b.candidates_token_count)
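As a quick sanity check of the addition (a hedged illustration using the `usage` helper defined above):

usage(5, 1) + usage(3, 2)   # -> UsageMetadata with prompt_token_count=8, candidates_token_count=3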
msgs = [mk_msg(prompt), mk_msg(r), mk_msg('I forgot my name. Can you remind me please?')]
msgs
[{'role': 'user', 'parts': ["I'm Faisal"]},
{'role': 'model',
'parts': ["Hi Faisal, it's nice to meet you! Is there anything I can help you with today?\n"]},
{'role': 'user', 'parts': ['I forgot my name. Can you remind me please?']}]
Okay, Faisal, I can help with that! You just told me your name is Faisal. Don’t worry, it happens to everyone sometimes! Is there anything else I can help you with?
content: {‘parts’: [{‘text’: “Okay, Faisal, I can help with that! You just told me your name is Faisal. Don’t worry, it happens to everyone sometimes! Is there anything else I can help you with?”}], ‘role’: ‘model’}
Helper to set ‘assistant’ role on alternate messages.
Exported source
def mk_msgs(msgs:list, **kw):
    "Helper to set 'assistant' role on alternate messages."
    if isinstance(msgs,str): msgs=[msgs]
    return [mk_msg(o, ('user','model')[i%2], **kw) for i,o in enumerate(msgs)]
msgs = mk_msgs(["Hi, I'm Faisal!", r, "I forgot my name. Can you remind me please?"]); msgs
[{'role': 'user', 'parts': ["Hi, I'm Faisal!"]},
{'role': 'model',
'parts': ["Hi Faisal, it's nice to meet you! Is there anything I can help you with today?\n"]},
{'role': 'user', 'parts': ['I forgot my name. Can you remind me please?']}]
class Client:
    def __init__(self, model, cli=None, sp=None):
        "Basic LLM messages client."
        self.model,self.use = model,usage(0,0)
        self.sp = sp
        self.c = (cli or genai.GenerativeModel(model, system_instruction=sp))
c = Client(model)
c.use
In: 0; Out: 0; Total: 0
Exported source
@patch
def _r(self:Client, r:GenerateContentResponse):
    "Store the result of the message and accrue total usage."
    self.result = r
    if getattr(r,'usage_metadata',None): self.use += r.usage_metadata
    return r
@patch
@delegates(genai.GenerativeModel.generate_content)
def __call__(self:Client,
             msgs:list, # List of messages in the dialog
             sp:str=None, # System prompt
             maxtok=4096, # Maximum tokens
             stream:bool=False, # Stream response?
             **kwargs):
    "Make a call to LLM."
    if sp: self._set_sp(sp)
    msgs = self._precall(msgs)
    gc_params = inspect.signature(GenerationConfig.__init__).parameters
    gc_kwargs = {k: v for k, v in kwargs.items() if k in gc_params}
    gen_config = GenerationConfig(max_output_tokens=maxtok, **gc_kwargs)
    gen_params = inspect.signature(self.c.generate_content).parameters
    gen_kwargs = {k: v for k, v in kwargs.items() if k in gen_params}
    r = self.c.generate_content(contents=msgs, generation_config=gen_config, stream=stream, **gen_kwargs)
    if not stream: return self._r(r)
    else: return get_stream(map(self._r, r))
c.c.generate_content('hi').text
'Hi there! How can I help you today?\n'
msgs = ['hi']
c(msgs)
Hi there! How can I help you today?
content: {‘parts’: [{‘text’: ‘Hi there! How can I help you today?’}], ‘role’: ‘model’}
The Gemini client requires passing the system prompt when creating the client, but we didn't pass one at creation time. Let's make sure that it gets set properly when we call the client later.
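The `_set_sp` helper handles this (it's called from `__call__` above). A minimal sketch of what it might look like, assuming the same `genai.GenerativeModel` constructor used in `__init__` (the library's actual implementation may differ):

@patch
def _set_sp(self:Client, sp:str):
    "Sketch: rebuild the underlying client when a new system prompt is supplied."
    if sp and sp != self.sp:
        self.sp = sp
        self.c = genai.GenerativeModel(self.model, system_instruction=sp)  # assumed to mirror __init__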
We've shown the token usage, but what we really care about is pricing. Let's extract the latest pricing from Google into a pricing dict. Currently, the experimental Gemini 2.0 Flash is free, and the pricing of other experimental models is not entirely clear. Since rumors suggest they are two experimental versions of the upcoming Gemini 2.0 Pro, we price them as 1.5 Pro. Better safe than sorry.
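An illustrative sketch of such a pricing dict, with (input, output) costs in USD per token; the figures and model names here are assumptions and should be checked against Google's current pricing page:

pricing = {
    'gemini-1.5-flash':     (0.075/1e6, 0.30/1e6),  # assumed <=128k-token prompt tier
    'gemini-1.5-pro':       (1.25/1e6,  5.00/1e6),  # assumed <=128k-token prompt tier
    'gemini-2.0-flash-exp': (0.0,       0.0),       # experimental, currently free
    'gemini-exp-1206':      (1.25/1e6,  5.00/1e6),  # assumed 2.0 Pro preview, priced as 1.5 Pro
}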
def sums(
    a:int, # First thing to sum
    b:int  # Second thing to sum
) -> int:  # The sum of the inputs
    "Adds a + b."
    print(f"Finding the sum of {a} and {b}")
    return a + b
sysp ="You are a helpful assistant. When using tools, be sure to pass all required parameters, at minimum."
a,b = 604542,6458932
pr = f"What is {a}+{b}?"
Google's Genai API handles schema extraction under the hood, so we can just directly pass the functions.
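For instance, a hedged sketch of the kind of call that produces the tool-use response inspected below, passing the plain `sums` function straight to the client defined above with no manual schema:

r = c(mk_msgs([pr]), sp=sysp, tools=[sums])
r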
def contents(r):
    "Helper to get the contents from response `r`."
    blk = find_block(r)
    if not blk: return r
    if hasattr(blk, 'parts'):
        part = blk.parts[0]
        if 'text' in part: return part.text
        else: return part
    return blk
contents(r)
function_call {
name: "sums"
args {
fields {
key: "b"
value {
number_value: 6458932
}
}
fields {
key: "a"
value {
number_value: 604542
}
}
}
}
def mk_toolres(
    r:abc.Mapping, # Tool use request response
    ns, # Namespace to search for tools
    ):
    "Create a `tool_result` message from response `r`."
    parts = find_block(r).parts
    tcs = [p.function_call for p in parts if hasattr(p, 'function_call')]
    res = [mk_msg(r)]
    tc_res = []
    for func in (tcs or []):
        if not func: continue
        func = convert_func(func)
        cts = call_func(func.name, func.inputs, ns=ns)
        tc_res.append(FunctionResponse(name=func.name, response={'result': cts}))
    if tc_res: res.append(mk_msg(tc_res))
    return res
Helper to set ‘assistant’ role on alternate messages.
Exported source
def mk_msgs(msgs:list, **kw):
    "Helper to set 'assistant' role on alternate messages."
    if isinstance(msgs,str): msgs=[msgs]
    return [mk_msg(o, ('user','model')[i%2], **kw) for i,o in enumerate(msgs)]
We can also use tool calling to force the model to return structured outputs.
@patch
@delegates(Client.__call__)
def structured(self:Client,
               msgs:list, # The prompt or list of prompts
               tools:list, # Namespace to search for tools
               **kwargs):
    "Return the value of all tool calls (generally used for structured outputs)"
    if not isinstance(msgs, list): msgs = [msgs]
    if not isinstance(tools, list): tools = [tools]
    kwargs['tools'] = tools
    kwargs['tool_config'] = mk_tool_config(tools)
    res = self(msgs, **kwargs)
    ns = mk_ns(*tools)
    parts = find_block(res).parts
    funcs = [convert_func(p.function_call) for p in parts if hasattr(p, 'function_call')]
    tcs = [call_func(func.name, func.inputs, ns=ns) for func in funcs]
    return tcs
class Recipe(BasicRepr):
    "A structure for representing recipes."
    def __init__(self, recipe_name: str, ingredients: list[str]): store_attr()
Gemini API schema extraction doesn't work very well for class definitions, so we define a factory method as a workaround.
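The factory used later is `cls2tool`; here is a minimal sketch of the idea, with illustrative logic that may well differ from the library's actual implementation:

import inspect

def cls2tool(cls):
    "Sketch: expose a class's constructor as a plain function so schema extraction can read it."
    if not isinstance(cls, type): return cls            # plain functions pass through unchanged
    def _f(**kwargs): return cls(**kwargs)              # calling the 'tool' builds an instance
    _f.__name__,_f.__doc__ = cls.__name__,cls.__doc__
    params = list(inspect.signature(cls.__init__).parameters.values())[1:]  # drop `self`
    _f.__signature__ = inspect.Signature(params)
    return _f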
pr ="Give me a receipe for chocolate chip cookies"recipe = c.structured(pr, tools=[Recipe], sp=sysp)[0]; recipe
name: "Conversation"
description: "A conversation between two people"
parameters {
type_: OBJECT
properties {
key: "turns"
value {
type_: ARRAY
items {
type_: OBJECT
properties {
key: "msg_b"
value {
type_: STRING
}
}
properties {
key: "msg_a"
value {
type_: STRING
}
}
required: "msg_a"
required: "msg_b"
}
}
}
required: "turns"
}
result = gen_model.generate_content(pr, tool_config={'function_calling_config':'ANY'}); result
function_call { name: “create_convo” args { fields { key: “turns” value { list_value { values { struct_value { fields { key: “msg_b” value { string_value: “Okay, here is a recipe:\n\nIngredients:\n\n1 cup (2 sticks) unsalted butter, softened\n1 cup granulated sugar\n1 cup packed brown sugar\n2 teaspoons pure vanilla extract\n2 large eggs\n3 cups all-purpose flour\n1 teaspoon baking soda\n1 teaspoon salt\n2 cups chocolate chips\n\nInstructions\n\nPreheat oven to 375 degrees F (190 degrees C). Line baking sheets with parchment paper.\nIn a large bowl, cream together the butter, granulated sugar, and brown sugar until light and fluffy.\nBeat in the vanilla extract, then the eggs one at a time.\nIn a separate bowl, whisk together the flour, baking soda, and salt.\nGradually add the dry ingredients to the wet ingredients, mixing until just combined.\nStir in the chocolate chips.\nDrop by rounded tablespoons onto the prepared baking sheets.\nBake for 9-11 minutes, or until golden brown.Let cool on baking sheets for a few minutes before transferring to a wire rack to cool completely.” } } fields { key: “msg_a” value { string_value: “Hey, I'd love a chocolate chip cookie recipe.” } } } } } } } } }
content: {‘parts’: [{‘function_call’: {‘name’: ‘create_convo’, ‘args’: {‘turns’: [{‘msg_b’: ‘Okay, here is a recipe:\n\nIngredients:\n\n1 cup (2 sticks) unsalted butter, softened\n1 cup granulated sugar\n1 cup packed brown sugar\n2 teaspoons pure vanilla extract\n2 large eggs\n3 cups all-purpose flour\n1 teaspoon baking soda\n1 teaspoon salt\n2 cups chocolate chips\n\nInstructions\n\nPreheat oven to 375 degrees F (190 degrees C). Line baking sheets with parchment paper.\nIn a large bowl, cream together the butter, granulated sugar, and brown sugar until light and fluffy.\nBeat in the vanilla extract, then the eggs one at a time.\nIn a separate bowl, whisk together the flour, baking soda, and salt.\nGradually add the dry ingredients to the wet ingredients, mixing until just combined.\nStir in the chocolate chips.\nDrop by rounded tablespoons onto the prepared baking sheets.\nBake for 9-11 minutes, or until golden brown.Let cool on baking sheets for a few minutes before transferring to a wire rack to cool completely.’, ‘msg_a’: “Hey, I’d love a chocolate chip cookie recipe.”}]}}}], ‘role’: ‘model’}
name: "create_convo"
args {
fields {
key: "turns"
value {
list_value {
values {
struct_value {
fields {
key: "msg_b"
value {
string_value: "Okay, here is a recipe:\\n\\nIngredients:\\n\\n1 cup (2 sticks) unsalted butter, softened\\n1 cup granulated sugar\\n1 cup packed brown sugar\\n2 teaspoons pure vanilla extract\\n2 large eggs\\n3 cups all-purpose flour\\n1 teaspoon baking soda\\n1 teaspoon salt\\n2 cups chocolate chips\\n\\nInstructions\\n\\nPreheat oven to 375 degrees F (190 degrees C). Line baking sheets with parchment paper.\\nIn a large bowl, cream together the butter, granulated sugar, and brown sugar until light and fluffy.\\nBeat in the vanilla extract, then the eggs one at a time.\\nIn a separate bowl, whisk together the flour, baking soda, and salt.\\nGradually add the dry ingredients to the wet ingredients, mixing until just combined.\\nStir in the chocolate chips.\\nDrop by rounded tablespoons onto the prepared baking sheets.\\nBake for 9-11 minutes, or until golden brown.Let cool on baking sheets for a few minutes before transferring to a wire rack to cool completely."
}
}
fields {
key: "msg_a"
value {
string_value: "Hey, I\'d love a chocolate chip cookie recipe."
}
}
}
}
}
}
}
}
def _convert_proto(o):
    "Convert proto objects to Python dicts and lists"
    if isinstance(o, (dict,MapComposite)): return {k:_convert_proto(v) for k,v in o.items()}
    elif isinstance(o, (list,RepeatedComposite)): return [_convert_proto(v) for v in o]
    elif hasattr(o, 'DESCRIPTOR'): return {k.name:_convert_proto(getattr(o,k.name)) for k in o.DESCRIPTOR.fields}
    return o
Exported source
def mk_args(args):
    if isinstance(args, MapComposite): return _convert_proto(args)
    return {k: v for k,v in args.items()}
args = mk_args(func.args); args
{'turns': [{'msg_b': 'Okay, here is a recipe:\\n\\nIngredients:\\n\\n1 cup (2 sticks) unsalted butter, softened\\n1 cup granulated sugar\\n1 cup packed brown sugar\\n2 teaspoons pure vanilla extract\\n2 large eggs\\n3 cups all-purpose flour\\n1 teaspoon baking soda\\n1 teaspoon salt\\n2 cups chocolate chips\\n\\nInstructions\\n\\nPreheat oven to 375 degrees F (190 degrees C). Line baking sheets with parchment paper.\\nIn a large bowl, cream together the butter, granulated sugar, and brown sugar until light and fluffy.\\nBeat in the vanilla extract, then the eggs one at a time.\\nIn a separate bowl, whisk together the flour, baking soda, and salt.\\nGradually add the dry ingredients to the wet ingredients, mixing until just combined.\\nStir in the chocolate chips.\\nDrop by rounded tablespoons onto the prepared baking sheets.\\nBake for 9-11 minutes, or until golden brown.Let cool on baking sheets for a few minutes before transferring to a wire rack to cool completely.',
'msg_a': "Hey, I'd love a chocolate chip cookie recipe."}]}
def mk_tool_config(choose: list)->dict:
    return {"function_calling_config": {"mode": "ANY", "allowed_function_names": [x.__name__ if hasattr(x, '__name__') else x.name for x in choose]}}
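For instance, forcing the model to call `sums`:

mk_tool_config([sums])
# {'function_calling_config': {'mode': 'ANY', 'allowed_function_names': ['sums']}}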
Return the value of all tool calls (generally used for structured outputs)
| | Type | Default | Details |
|---|---|---|---|
| msgs | list | | The prompt or list of prompts |
| tools | list | | Namespace to search for tools |
| sp | str | None | System prompt |
| maxtok | int | 4096 | Maximum tokens |
| stream | bool | False | Stream response? |
| generation_config | generation_types.GenerationConfigType \| None | None | |
| safety_settings | safety_types.SafetySettingOptions \| None | None | |
| tool_config | content_types.ToolConfigType \| None | None | |
| request_options | helper_types.RequestOptionsType \| None | None | |
Exported source
@patch
@delegates(Client.__call__)
def structured(self:Client,
               msgs:list, # The prompt or list of prompts
               tools:list, # Namespace to search for tools
               **kwargs):
    "Return the value of all tool calls (generally used for structured outputs)"
    if not isinstance(msgs, list): msgs = [msgs]
    if not isinstance(tools, list): tools = [tools]
    kwargs['tools'] = [cls2tool(x) for x in tools]
    kwargs['tool_config'] = mk_tool_config(kwargs['tools'])
    res = self(msgs, **kwargs)
    ns = mk_ns(*tools)
    parts = find_block(res).parts
    funcs = [convert_func(p.function_call) for p in parts if hasattr(p, 'function_call')]
    tcs = [call_func(func.name, func.inputs, ns=ns) for func in funcs]
    return tcs
pr ="Create a conversation between Albert Einstein and Robert J. Oppenheimer"convo = c.structured(pr, tools=[Conversation], sp=sysp)[0];print(convo)
Conversation(turns=[{'msg_b': 'Robert J. Oppenheimer: Indeed, Professor Einstein. Its mysteries both fascinate and, at times, trouble me. Especially those we are now starting to unravel with the atom.', 'msg_a': "Albert Einstein: The universe is a wondrous place, wouldn't you agree?"}, {'msg_b': "Robert J. Oppenheimer: Precisely. The potential for both creation and destruction weighs heavily on my mind. It's a double-edged sword.", 'msg_a': "Albert Einstein: Ah yes, the atom. A source of immense power, but also grave responsibility, wouldn't you say?"}, {'msg_b': 'Robert J. Oppenheimer: I concur. Yet, the forces of politics and conflict often seem to overshadow reason and progress. I fear our work might be used for ends we never intended.', 'msg_a': 'Albert Einstein: It is our duty, as scientists, to guide humanity toward the beneficial application of these discoveries, to prevent it from turning against itself.'}, {'msg_b': 'Robert J. Oppenheimer: I wish I had your certainty, Professor. The path ahead feels fraught with peril. Still, I am grateful for the pursuit of knowledge, even amidst the uncertainty.', 'msg_a': 'Albert Einstein: We must never lose hope that reason and understanding will prevail. The universe operates on elegant principles; if we remain dedicated to truth, we will find our way. '}, {'msg_b': 'Robert J. Oppenheimer: Perhaps, Professor. Perhaps.', 'msg_a': 'Albert Einstein: As am I, my friend. For it is through the exploration of these great mysteries that we glimpse the true nature of the universe and perhaps, our place within it.'}])
Chat
We’ll create a Chat class that will handle formatting of messages and passing along system prompts and tools, so we don’t have to worry about doing that manually each time.
class Chat:
    def __init__(self,
                 model:Optional[str]=None, # Model to use (leave empty if passing `cli`)
                 cli:Optional[Client]=None, # Client to use (leave empty if passing `model`)
                 sp=None, # Optional system prompt
                 tools:Optional[list]=None, # List of tools to make available
                 tool_config:Optional[str]=None): # Forced tool choice
        "Gemini chat client."
        assert model or cli
        self.c = (cli or Client(model, sp=sp))
        self.h,self.sp,self.tools,self.tool_config = [],sp,tools,tool_config
    @property
    def use(self): return self.c.use
    @property
    def cost(self): return self.c.cost
sp ="Never mention what tools you use."chat = Chat(model, sp=sp)chat.use, chat.h
@patch
def _post_pr(self:Chat, pr, prev_role):
    if pr is None and prev_role == 'assistant':
        raise ValueError("Prompt must be given after assistant completion, or use `self.cont_pr`.")
    if pr: self.h.append(mk_msg(pr))
Exported source
@patch
def _append_pr(self:Chat,
               pr=None, # Prompt / message
              ):
    prev_role = nested_idx(self.h, -1, 'role') if self.h else 'assistant' # First message should be 'user'
    if pr and prev_role == 'user': self() # already user request pending
    self._post_pr(pr, prev_role)
Exported source
@patch
@delegates(genai.GenerativeModel.generate_content)
def __call__(self:Chat,
             pr=None,      # Prompt / message
             temp=0,       # Temperature
             maxtok=4096,  # Maximum tokens
             stream=False, # Stream response?
             **kwargs):
    if isinstance(pr,str): pr = pr.strip()
    self._append_pr(pr)
    if self.tools: kwargs['tools'] = self.tools
    # NOTE: Gemini specifies tool_choice via tool_config
    if self.tool_config: kwargs['tool_config'] = mk_tool_config(self.tool_config)
    res = self.c(self.h, stream=stream, sp=self.sp, temp=temp, maxtok=maxtok, **kwargs)
    if stream: return self._stream(res)
    self.h += mk_toolres(self.c.result, ns=self.tools)
    return res
We can check our usage with the `use` property. As you can see, it keeps track of the history of the conversation.
chat.use
In: 43; Out: 17; Total: 60
Let's make a nice markdown representation of our chat object for our docs and Jupyter notebooks.
@patch
def _repr_markdown_(self:Chat):
    if not hasattr(self.c, 'result'): return 'No results yet'
    last_msg = contents(self.c.result)
    history = '\n\n'.join(f"**{m['role']}**: {m['parts'][0] if isinstance(m['parts'][0],str) else m['parts'][0].text}" for m in self.h if m['role'] in ('user','model'))
    det = self.c._repr_markdown_().split('\n\n')[-1]
    return f"""{last_msg}<details><summary>History</summary>{history}</details>{det}"""
chat
Your name is Faisal.
History
user: I’m Faisal
model: It’s nice to meet you, Faisal.
user: What’s my name?
model: Your name is Faisal.
| Metric | Count | Cost (USD) |
|---|---|---|
| Input tokens | 43 | 0.000000 |
| Output tokens | 17 | 0.000000 |
| Cache tokens | 0 | 0.000000 |
| Total | 60 | $0.000000 |
Let’s also make sure that streaming works correctly with the Chat interface
chat = Chat(model, sp=sp)
for o in chat("I'm Faisal", stream=True):
    o = contents(o)
    if o and isinstance(o, str): print(o, end='')
It's nice to meet you, Faisal.
Let’s also make sure that tool use works with the Chat interface
pr =f"What is {a}+{b}?"; pr
'What is 604542+6458932?'
sp ="You are a helpful assistant. When using tools, be sure to pass all required parameters, at minimum."chat = Chat(model, sp=sp, tools=[sums])r = chat("I'm Faisal")r
Hello Faisal, how can I help you today?
content: {‘parts’: [{‘text’: ‘Hello Faisal, how can I help you today?’}], ‘role’: ‘model’}
If we inspect the history, we can see that the result of the function call has already been added. We can simply call chat() to pass this to the model and get a response.
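Concretely, a hedged illustration of that flow: calling the chat with no new prompt sends the pending tool result back to the model.

r = chat()   # the FunctionResponse already appended to chat.h is passed along; the model should reply with the sum
r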
Now let’s make sure that tool_config works correctly by forcing the model to pick a particular function.
def diff(
    a:int, # The number to subtract from
    b:int  # The amount to subtract
) -> int:  # Result of subtracting b from a
    "Returns a - b."
    print(f"Finding the diff of {a} and {b}")
    return a - b
sp ="You are a helpful assistant. When using tools, be sure to pass all required parameters, at minimum."chat = Chat(model, sp=sp, tools=[sums, diff], tool_config=[diff])r = chat(f"What is {a}+{b}?")r
We can see that the model calls the function specified by tool_config even though the prompt asks for a summation, which is the expected behavior in this case.
Now that we are passing more than just text, we'll need a helper function to upload media using Gemini's File API, which is the recommended way of passing media to the model.
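A minimal sketch of an upload with the File API; the library's `media_msg` helper (its updated source is shown later) wraps this and also handles raw bytes and polling while the file is processing. The file path here is just an illustrative assumption:

f = genai.upload_file('./samples/puppy.jpg')   # hypothetical local file
part = {'file_data': {'mime_type': f.mime_type, 'file_uri': f.uri}}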
And finally, let's add a helper function for making content that correctly handles text and other media.
Exported source
def _mk_content(src):
    "Create appropriate content data structure based on type of content"
    if isinstance(src,str): return text_msg(src)
    if isinstance(src,FunctionResponse): return src
    else: return media_msg(src)
Now let’s make sure it properly handles text vs. Path objects for media
def mk_msg(content, role='user', **kw):
    if isinstance(content, GenerateContentResponse): role,content = 'model',contents(content)
    if isinstance(content, dict): role,content = content['role'],content['parts']
    if not isinstance(content, list): content=[content]
    if role == 'user':
        if content: ## Gemini errors if the message contains only media and no text
            if len(content) == 1 and not isinstance(content[0], str): content.append(' ')
            content = [_mk_content(o) for o in content]
        else: content = ''
    return dict(role=role, parts=content, **kw)
Helper to set ‘assistant’ role on alternate messages.
Exported source
def mk_msgs(msgs:list, **kw):
    "Helper to set 'assistant' role on alternate messages."
    if isinstance(msgs,str): msgs=[msgs]
    return [mk_msg(o, ('user','model')[i%2], **kw) for i,o in enumerate(msgs)]
q ="In brief, what color flowers are in this image?"mk_msgs([fn, q])
[{'role': 'user',
'parts': [{'file_data': {'mime_type': 'image/jpeg',
'file_uri': 'https://generativelanguage.googleapis.com/v1beta/files/kmbr0020l5k6'}},
{'text': ' '}]},
{'role': 'model',
'parts': ['In brief, what color flowers are in this image?']}]
mk_msgs(['Hi', 'Nice, to meet you. How can I help?', [fn, q]])
[{'role': 'user', 'parts': [{'text': 'Hi'}]},
{'role': 'model', 'parts': ['Nice, to meet you. How can I help?']},
{'role': 'user',
'parts': [{'file_data': {'mime_type': 'image/jpeg',
'file_uri': 'https://generativelanguage.googleapis.com/v1beta/files/sfxeop0j911a'}},
{'text': 'In brief, what color flowers are in this image?'}]}]
Now, we should just be able to pass a list of multimedia content to our Chat client and it should be able to handle it all under the hood. Let’s test it out.
chat(fn)
This picture features an adorable Cavalier King Charles Spaniel puppy. It has beautiful brown and white fur, and is resting on a grassy area next to a patch of purple flowers. The puppy is looking directly at the camera with a sweet expression.
content: {‘parts’: [{‘text’: ‘This picture features an adorable Cavalier King Charles Spaniel puppy. It has beautiful brown and white fur, and is resting on a grassy area next to a patch of purple flowers. The puppy is looking directly at the camera with a sweet expression.’}], ‘role’: ‘model’}
Hooray! That works, let’s double check the history to make sure that everything is properly formatted and stored.
mk_msgs(chat.h)
[{'role': 'user',
'parts': [{'file_data': {'mime_type': 'image/jpeg',
'file_uri': 'https://generativelanguage.googleapis.com/v1beta/files/y1u2m1yaahq'}},
{'text': ' '}]},
{'role': 'model',
'parts': ["This is an adorable image! The puppy is a Cavalier King Charles Spaniel with a beautiful coat of white and chestnut brown. It's nestled amongst some lovely purple flowers and green grass. The puppy's big, dark eyes and sweet expression make it irresistibly cute."]},
{'role': 'user',
'parts': [{'file_data': {'mime_type': 'image/jpeg',
'file_uri': 'https://generativelanguage.googleapis.com/v1beta/files/xpvi2ey0087o'}},
{'text': ' '}]},
{'role': 'model',
'parts': ['The image shows a charming Cavalier King Charles Spaniel puppy, with its distinctive white and chestnut coat, positioned amidst some blooming purple flowers. The pup is resting on the grass, and its gaze is directed towards the viewer. Its soft fur and gentle expression evoke a sense of warmth and tenderness.']},
{'role': 'user',
'parts': [{'file_data': {'mime_type': 'image/jpeg',
'file_uri': 'https://generativelanguage.googleapis.com/v1beta/files/q2k4qztinipo'}},
{'text': 'In brief, what color flowers are in this image?'}]},
{'role': 'model', 'parts': ['The flowers in the image are purple.\n']}]
While we are at it, let’s update our markdown representation to handle the new messages.
Exported source
@patch
def _repr_markdown_(self:Chat):
    if not hasattr(self.c, 'result'): return 'No results yet'
    last_msg = contents(self.c.result)
    def fmt_part(ps):
        if len(ps) == 1: return fmt_single(ps[0])
        return '\n' + '\n'.join(f'- {fmt_single(p)}' for p in ps)
    def fmt_single(p):
        if 'text' in p: return p['text']
        if 'file_data' in p: return f"uploaded media: {p['file_data']['mime_type']}"
        return str(p)
    history = '\n\n'.join(f"**{m['role']}**: {fmt_part(m['parts'])}" for m in self.h if m['role'] in ('user','model'))
    det = self.c._repr_markdown_().split('\n\n')[-1]
    return f"""{last_msg}<details><summary>History</summary>{history}</details>{det}"""
chat
The flowers in the image are purple.
History
user: - uploaded media: image/jpeg - In brief, what color flowers are in this image?
model: The flowers in the image are purple.
| Metric | Count | Cost (USD) |
|---|---|---|
| Input tokens | 270 | 0.000000 |
| Output tokens | 9 | 0.000000 |
| Cache tokens | 0 | 0.000000 |
| Total | 279 | $0.000000 |
Other Media (audio, video, etc.)
Unlike ChatGPT and Claude, Gemini models can also handle audio and video inputs. Since we're using Gemini's File API for handling multimedia content, what we have should just work, except we'll need to make one small modification to the media_msg function. While we are at it, let's also allow users to pass in the bytes of the content instead of a path, to be consistent with our other LLM provider libraries.
Handle media input as either Path or bytes, returning dict for Gemini API
| | Type | Default | Details |
|---|---|---|---|
| media | | | Media to process (Path\|bytes\|dict) |
| mime | NoneType | None | Optional mime type |
| Returns | dict | | Dict for Gemini API |
Exported source
def media_msg(
    media,     # Media to process (Path|bytes|dict)
    mime=None  # Optional mime type
)->dict:       # Dict for Gemini API
    "Handle media input as either Path or bytes, returning dict for Gemini API"
    if isinstance(media, dict): return media # Already processed
    def _upload(f, mime=None):
        f = genai.upload_file(f, mime_type=mime)
        while f.state.name == "PROCESSING": time.sleep(2); f = genai.get_file(f.name)
        return {'file_data': {'mime_type': f.mime_type, 'file_uri': f.uri}}
    if isinstance(media, (str,Path)): return _upload(media)
    if isinstance(media, bytes) and mime is None: mime = ft.guess(media).mime
    return _upload(io.BytesIO(media if isinstance(media, bytes) else media.encode()), mime)
Since we're uploading potentially larger files, we need to wait for the upload and processing to complete so that the media is ready to be consumed by the model.
chat = Chat(model)
img = fn.read_bytes()
chat([img, q])
Certainly! The flowers in the image are a light purple color.
content: {‘parts’: [{‘text’: ‘Certainly! The flowers in the image are a light purple color.’}], ‘role’: ‘model’}
# We'll test this with the example from Gemini's docs
video_fn = Path('./samples/selective_attention_test.mp4')
prompt = "Answer the question in the video"
chat = Chat(model)
chat([video_fn, prompt])
The players wearing white pass the basketball 13 times.
content: {‘parts’: [{‘text’: ‘The players wearing white pass the basketball 13 times.’}], ‘role’: ‘model’}
Takes a little while, but works like a charm! Now, let’s try an audio file to make sure it also works.
audio_fn = Path('./samples/attention_is_all_you_need.mp3')
audio = audio_fn.read_bytes()
prompt = "What is the audio about?"
chat([audio, prompt])
The audio is a podcast discussion about a groundbreaking research paper titled “Attention is All You Need” by Vaswani et al. The podcast features a machine learning expert who explains the core ideas, motivation, and architecture of the Transformer model introduced in the paper. They discuss the significance of attention mechanisms, how the Transformer differs from previous approaches like RNNs, its remarkable performance on machine translation and other sequence transduction tasks, and the broader implications of the research for machine learning and NLP. They also touch upon the limitations and future directions.
content: {‘parts’: [{‘text’: ‘The audio is a podcast discussion about a groundbreaking research paper titled “Attention is All You Need” by Vaswani et al. The podcast features a machine learning expert who explains the core ideas, motivation, and architecture of the Transformer model introduced in the paper. They discuss the significance of attention mechanisms, how the Transformer differs from previous approaches like RNNs, its remarkable performance on machine translation and other sequence transduction tasks, and the broader implications of the research for machine learning and NLP. They also touch upon the limitations and future directions.’}], ‘role’: ‘model’}
Finally, let’s check to make sure pdfs work as well.
pdf_fn = Path('./samples/attention_is_all_you_need.pdf')
prompt = "What's mentioned in this pdf that's not mentioned in the previous podcast?"
chat([pdf_fn, prompt])
Okay, here’s a breakdown of what’s in the PDF that wasn’t covered in the podcast:
Technical Details of the Transformer:
Detailed Architecture: The PDF provides a much more detailed breakdown of the Transformer architecture, including the specific arrangement of encoder and decoder layers, sub-layers, residual connections, and layer normalization (see Figure 1 and Section 3.1). The podcast gave a high-level overview, whereas the PDF is more specific about the components used to build the model.
Scaled Dot-Product Attention: The document explains the mechanics of “Scaled Dot-Product Attention” (section 3.2.1, Figure 2), and its advantages over additive attention and specifically mentions that dot products are scaled by 1/sqrt(dk). This isn’t mentioned in the podcast.
Multi-Head Attention: The PDF delves into the purpose of “Multi-Head Attention”, stating that it enables the model to attend to different representation subspaces (section 3.2.2, Figure 2). It also explicitly states that the projections are parameter matrices such as WQ ∈ R^(d_model x d_k), WK ∈ R^(d_model x d_k), etc.. and gives the values for the number of attention layers used. The podcast explained what multi-head attention was, but didn’t delve into the math or specific values used.
Positional Encodings: The PDF describes the specific sine and cosine functions used for positional encoding (section 3.5) and why they chose this method. This level of detail about this topic isn’t mentioned in the podcast.
Point-wise Feed-Forward Networks: The podcast doesn’t go into the details of this topic, but the PDF notes that feed-forward networks in each layer are position-wise, fully connected, and consist of two linear transformations with a ReLU activation in between. It also specifies the dimensionality of input and output as well as the inner layer (section 3.3).
Embeddings and Softmax: The PDF states how learned embeddings are used to convert input/output tokens to vectors (section 3.4), and that the same weight matrix is shared between embeddings and the pre-softmax linear transformation. The embeddings are multiplied by the square root of d_model. These details are not mentioned in the podcast.
Training and Evaluation:
Training Details: The PDF explains the datasets (WMT 2014 English-German and English-French), the use of byte-pair encoding and word-piece vocabularies, and batching based on approximate sequence lengths (section 5.1). It also specifies how the model is trained using a specific number of training steps, using GPUs, the Adam optimizer and a specific formula for varying the learning rate (sections 5.2, 5.3). The podcast briefly mentions training time, but doesn’t give specific dataset, architecture, or learning details.
Regularization: The PDF explicitly mentions the use of residual dropout and label smoothing during training and the parameters used (section 5.4). The podcast didn’t discuss this at all.
Performance Metrics: The PDF presents specific BLEU scores for various models on the English-German and English-French translation tasks and lists the training cost in FLOPS (section 6.1, Table 2). It also mentions beam search details like beam size and a length penalty. The podcast only mentions the end results of BLEU score.
Model Variations: The PDF explores various variations of the base model (section 6.2, Table 3), testing different parameters (number of heads, key/value dimensions, dropout rates, etc.), and mentions which variations performed well or poorly. The podcast doesn’t mention any of these experiments.
English Constituency Parsing Results: The PDF mentions that the transformer performs well on english constituency parsing with results from the Penn Treebank (section 6.3, Table 4) and compares its results against other models. This is only briefly touched on in the podcast, and none of the results from this section is covered.
Other Points
Computational Complexity Analysis: The PDF includes a table (Table 1) that compares the computational complexity, sequential operations and maximum path lengths of self-attention, recurrent and convolutional layers. This is not covered in the podcast.
Attention Visualizations: The PDF includes visualization examples of attention heads for certain words (Figure 3, 4, and 5), showcasing the behavior and long range dependencies. These visualizations are not shown nor covered in the podcast.
References and Acknowledgements: The document has a complete list of references and acknowledgements. This was not part of the podcast.
In summary, while the podcast provided a good overview, the PDF offers a deeper technical dive into the Transformer model, along with details about its training, experiments, and evaluations that aren’t touched upon in the audio discussion. The PDF provides specifics that an audience interested in implementing the model or performing experiments would be interested in.
content: {‘parts’: [{‘text’: ’Okay, here's a breakdown of what's in the PDF that wasn't covered in the podcast:*Technical Details of the Transformer:Detailed Architecture: The PDF provides a much more detailed breakdown of the Transformer architecture, including the specific arrangement of encoder and decoder layers, sub-layers, residual connections, and layer normalization (see Figure 1 and Section 3.1). The podcast gave a high-level overview, whereas the PDF is more specific about the components used to build the model.Scaled Dot-Product Attention: The document explains the mechanics of “Scaled Dot-Product Attention” (section 3.2.1, Figure 2), and its advantages over additive attention and specifically mentions that dot products are scaled by 1/sqrt(dk). This isn't mentioned in the podcast.Multi-Head Attention: The PDF delves into the purpose of “Multi-Head Attention”, stating that it enables the model to attend to different representation subspaces (section 3.2.2, Figure 2). It also explicitly states that the projections are parameter matrices such as WQ ∈ R^(d_model x d_k), WK ∈ R^(d_model x d_k), etc.. and gives the values for the number of attention layers used. The podcast explained what multi-head attention was, but didn't delve into the math or specific values used.Positional Encodings: The PDF describes the specific sine and cosine functions used for positional encoding (section 3.5) and why they chose this method. This level of detail about this topic isn't mentioned in the podcast.Point-wise Feed-Forward Networks: The podcast doesn't go into the details of this topic, but the PDF notes that feed-forward networks in each layer are position-wise, fully connected, and consist of two linear transformations with a ReLU activation in between. It also specifies the dimensionality of input and output as well as the inner layer (section 3.3).Embeddings and Softmax:** The PDF states how learned embeddings are used to convert input/output tokens to vectors (section 3.4), and that the same weight matrix is shared between embeddings and the pre-softmax linear transformation. The embeddings are multiplied by the square root of d_model. These details are not mentioned in the podcast.*Training and Evaluation:Training Details: The PDF explains the datasets (WMT 2014 English-German and English-French), the use of byte-pair encoding and word-piece vocabularies, and batching based on approximate sequence lengths (section 5.1). It also specifies how the model is trained using a specific number of training steps, using GPUs, the Adam optimizer and a specific formula for varying the learning rate (sections 5.2, 5.3). The podcast briefly mentions training time, but doesn't give specific dataset, architecture, or learning details.Regularization: The PDF explicitly mentions the use of residual dropout and label smoothing during training and the parameters used (section 5.4). The podcast didn't discuss this at all.Performance Metrics: The PDF presents specific BLEU scores for various models on the English-German and English-French translation tasks and lists the training cost in FLOPS (section 6.1, Table 2). It also mentions beam search details like beam size and a length penalty. The podcast only mentions the end results of BLEU score.Model Variations: The PDF explores various variations of the base model (section 6.2, Table 3), testing different parameters (number of heads, key/value dimensions, dropout rates, etc.), and mentions which variations performed well or poorly. 
The podcast doesn't mention any of these experiments.English Constituency Parsing Results:** The PDF mentions that the transformer performs well on english constituency parsing with results from the Penn Treebank (section 6.3, Table 4) and compares its results against other models. This is only briefly touched on in the podcast, and none of the results from this section is covered.*Other PointsComputational Complexity Analysis: The PDF includes a table (Table 1) that compares the computational complexity, sequential operations and maximum path lengths of self-attention, recurrent and convolutional layers. This is not covered in the podcast.Attention Visualizations: The PDF includes visualization examples of attention heads for certain words (Figure 3, 4, and 5), showcasing the behavior and long range dependencies. These visualizations are not shown nor covered in the podcast.References and Acknowledgements:** The document has a complete list of references and acknowledgements. This was not part of the podcast.*In summary, while the podcast provided a good overview, the PDF offers a deeper technical dive into the Transformer model, along with details about its training, experiments, and evaluations that aren't touched upon in the audio discussion.** The PDF provides specifics that an audience interested in implementing the model or performing experiments would be interested in.’}], ‘role’: ‘model’}
pr ="Can you generate an exact transcript of the first minute or so of the podcast."chat(pr)
Okay, here’s the transcript of the first minute of the podcast, based on your request:
“Welcome to our podcast, where we dive into groundbreaking research papers. Today, we’re discussing ‘Attention is All You Need’ by Vaswani et al. Joining us is an expert in machine learning. Welcome. Thanks for having me. I’m excited to discuss this revolutionary paper. Let’s start with the core idea. What’s the main thrust of this research? The paper introduces a new model architecture called the Transformer, which is based entirely on attention mechanisms. It completely does away with recurrence and convolutions, which were staples in previous sequence transduction models. That sounds like a significant departure from previous approaches. What motivated this radical change? The main motivation was to address limitations in previous models, particularly the sequential nature of processing in RNNs. This sequential computation hindered parallelization and made it challenging to learn long-range dependencies in sequences. Could you explain what attention mechanisms are and why they’re so crucial in this model? Certainly. Attention allows the model to focus on different parts of the input sequence when producing each part of the output. In the Transformer, they use a specific type called scaled dot product attention and extend it to multi head attention, which lets the model jointly attend to information from different representation sub spaces. Fascinating.”
content: {‘parts’: [{‘text’: ‘Okay, here's the transcript of the first minute of the podcast, based on your request:“Welcome to our podcast, where we dive into groundbreaking research papers. Today, we're discussing 'Attention is All You Need' by Vaswani et al. Joining us is an expert in machine learning. Welcome. Thanks for having me. I'm excited to discuss this revolutionary paper. Let's start with the core idea. What's the main thrust of this research? The paper introduces a new model architecture called the Transformer, which is based entirely on attention mechanisms. It completely does away with recurrence and convolutions, which were staples in previous sequence transduction models. That sounds like a significant departure from previous approaches. What motivated this radical change? The main motivation was to address limitations in previous models, particularly the sequential nature of processing in RNNs. This sequential computation hindered parallelization and made it challenging to learn long-range dependencies in sequences. Could you explain what attention mechanisms are and why they're so crucial in this model? Certainly. Attention allows the model to focus on different parts of the input sequence when producing each part of the output. In the Transformer, they use a specific type called scaled dot product attention and extend it to multi head attention, which lets the model jointly attend to information from different representation sub spaces. Fascinating.”’}], ‘role’: ‘model’}
Okay, here’s the transcript of the first minute of the podcast, based on your request:
“Welcome to our podcast, where we dive into groundbreaking research papers. Today, we’re discussing ‘Attention is All You Need’ by Vaswani et al. Joining us is an expert in machine learning. Welcome. Thanks for having me. I’m excited to discuss this revolutionary paper. Let’s start with the core idea. What’s the main thrust of this research? The paper introduces a new model architecture called the Transformer, which is based entirely on attention mechanisms. It completely does away with recurrence and convolutions, which were staples in previous sequence transduction models. That sounds like a significant departure from previous approaches. What motivated this radical change? The main motivation was to address limitations in previous models, particularly the sequential nature of processing in RNNs. This sequential computation hindered parallelization and made it challenging to learn long-range dependencies in sequences. Could you explain what attention mechanisms are and why they’re so crucial in this model? Certainly. Attention allows the model to focus on different parts of the input sequence when producing each part of the output. In the Transformer, they use a specific type called scaled dot product attention and extend it to multi head attention, which lets the model jointly attend to information from different representation sub spaces. Fascinating.”
History
user: - uploaded media: video/mp4 - Answer the question in the video
model: The players wearing white pass the basketball 13 times.
user: - uploaded media: audio/mpeg - What is the audio about?
model: The audio is a podcast discussion about a groundbreaking research paper titled “Attention is All You Need” by Vaswani et al. The podcast features a machine learning expert who explains the core ideas, motivation, and architecture of the Transformer model introduced in the paper. They discuss the significance of attention mechanisms, how the Transformer differs from previous approaches like RNNs, its remarkable performance on machine translation and other sequence transduction tasks, and the broader implications of the research for machine learning and NLP. They also touch upon the limitations and future directions.
user: - uploaded media: application/pdf - What’s mentioned in this pdf that’s not mentioned in the previous podcast?
model: Okay, here’s a breakdown of what’s in the PDF that wasn’t covered in the podcast:
Technical Details of the Transformer:
Detailed Architecture: The PDF provides a much more detailed breakdown of the Transformer architecture, including the specific arrangement of encoder and decoder layers, sub-layers, residual connections, and layer normalization (see Figure 1 and Section 3.1). The podcast gave a high-level overview, whereas the PDF is more specific about the components used to build the model.
Scaled Dot-Product Attention: The document explains the mechanics of “Scaled Dot-Product Attention” (section 3.2.1, Figure 2), and its advantages over additive attention and specifically mentions that dot products are scaled by 1/sqrt(dk). This isn’t mentioned in the podcast.
Multi-Head Attention: The PDF delves into the purpose of “Multi-Head Attention”, stating that it enables the model to attend to different representation subspaces (section 3.2.2, Figure 2). It also explicitly states that the projections are parameter matrices such as WQ ∈ R^(d_model x d_k), WK ∈ R^(d_model x d_k), etc.. and gives the values for the number of attention layers used. The podcast explained what multi-head attention was, but didn’t delve into the math or specific values used.
Positional Encodings: The PDF describes the specific sine and cosine functions used for positional encoding (section 3.5) and why they chose this method. This level of detail about this topic isn’t mentioned in the podcast.
Point-wise Feed-Forward Networks: The podcast doesn’t go into the details of this topic, but the PDF notes that feed-forward networks in each layer are position-wise, fully connected, and consist of two linear transformations with a ReLU activation in between. It also specifies the dimensionality of input and output as well as the inner layer (section 3.3).
Embeddings and Softmax: The PDF states how learned embeddings are used to convert input/output tokens to vectors (section 3.4), and that the same weight matrix is shared between embeddings and the pre-softmax linear transformation. The embeddings are multiplied by the square root of d_model. These details are not mentioned in the podcast.
Training and Evaluation:
Training Details: The PDF explains the datasets (WMT 2014 English-German and English-French), the use of byte-pair encoding and word-piece vocabularies, and batching based on approximate sequence lengths (section 5.1). It also specifies how the model is trained using a specific number of training steps, using GPUs, the Adam optimizer and a specific formula for varying the learning rate (sections 5.2, 5.3). The podcast briefly mentions training time, but doesn’t give specific dataset, architecture, or learning details.
Regularization: The PDF explicitly mentions the use of residual dropout and label smoothing during training and the parameters used (section 5.4). The podcast didn’t discuss this at all.
Performance Metrics: The PDF presents specific BLEU scores for various models on the English-German and English-French translation tasks and lists the training cost in FLOPS (section 6.1, Table 2). It also mentions beam search details like beam size and a length penalty. The podcast only mentions the end results of BLEU score.
Model Variations: The PDF explores various variations of the base model (section 6.2, Table 3), testing different parameters (number of heads, key/value dimensions, dropout rates, etc.), and mentions which variations performed well or poorly. The podcast doesn’t mention any of these experiments.
English Constituency Parsing Results: The PDF mentions that the transformer performs well on english constituency parsing with results from the Penn Treebank (section 6.3, Table 4) and compares its results against other models. This is only briefly touched on in the podcast, and none of the results from this section is covered.
Other Points
Computational Complexity Analysis: The PDF includes a table (Table 1) that compares the computational complexity, sequential operations and maximum path lengths of self-attention, recurrent and convolutional layers. This is not covered in the podcast.
Attention Visualizations: The PDF includes visualization examples of attention heads for certain words (Figure 3, 4, and 5), showcasing the behavior and long range dependencies. These visualizations are not shown nor covered in the podcast.
References and Acknowledgements: The document has a complete list of references and acknowledgements. This was not part of the podcast.
In summary, while the podcast provided a good overview, the PDF offers a deeper technical dive into the Transformer model, along with details about its training, experiments, and evaluations that aren’t touched upon in the audio discussion. The PDF provides specifics that an audience interested in implementing the model or performing experiments would be interested in.
user: Can you generate an exact transcript of the first minute or so of the podcast.
model: Okay, here’s the transcript of the first minute of the podcast, based on your request:
“Welcome to our podcast, where we dive into groundbreaking research papers. Today, we’re discussing ‘Attention is All You Need’ by Vaswani et al. Joining us is an expert in machine learning. Welcome. Thanks for having me. I’m excited to discuss this revolutionary paper. Let’s start with the core idea. What’s the main thrust of this research? The paper introduces a new model architecture called the Transformer, which is based entirely on attention mechanisms. It completely does away with recurrence and convolutions, which were staples in previous sequence transduction models. That sounds like a significant departure from previous approaches. What motivated this radical change? The main motivation was to address limitations in previous models, particularly the sequential nature of processing in RNNs. This sequential computation hindered parallelization and made it challenging to learn long-range dependencies in sequences. Could you explain what attention mechanisms are and why they’re so crucial in this model? Certainly. Attention allows the model to focus on different parts of the input sequence when producing each part of the output. In the Transformer, they use a specific type called scaled dot product attention and extend it to multi head attention, which lets the model jointly attend to information from different representation sub spaces. Fascinating.”
| Metric | Count | Cost (USD) |
|---|---|---|
| Input tokens | 96,895 | 0.000000 |
| Output tokens | 1,472 | 0.000000 |
| Cache tokens | 0 | 0.000000 |
| Total | 98,367 | $0.000000 |
All of these also work with `Client` and can be combined with `structured` to get structured responses using multimedia data.
class AudioMetadata(BasicRepr):
    """Class to hold metadata for audio files"""
    def __init__(self,
                 n_speakers:int, # Number of speakers
                 topic:str, # Topic discussed
                 summary:str, # 100 word summary
                 transcript:list[str], # Transcript of the audio segmented by speaker
                ): store_attr()

pr = "Extract the necessary information from the audio."
c = Client(model)
audio_md = c.structured(mk_msgs([[audio_fn, pr]]), tools=[AudioMetadata])[0]
print(f'Number of speakers: {audio_md.n_speakers}')
print(f'Topic: {audio_md.topic}')
print(f'Summary: {audio_md.summary}')
transcript = '\n-'.join(list(audio_md.transcript)[:10])
print(f'Transcript: {transcript}')
Number of speakers: 2.0
Topic: Machine Learning, Natural Language Processing
Summary: This podcast discusses the Attention is All You Need research paper, focusing on the Transformer model's architecture, attention mechanisms, performance, and broader implications.
Transcript: Welcome to our podcast, where we dive into groundbreaking research papers. Today, we're discussing attention is all you need by Vaswani at all. Joining us is an expert in machine learning. Welcome.
-Thanks for having me. I'm excited to discuss this revolutionary paper.
-Let's start with the core idea. What's the main thrust of this research?
-The paper introduces a new model architecture called the Transformer, which is based entirely on attention mechanisms. It completely does away with recurrence and convolutions, which were staples in previous sequence transduction models.
-That sounds like a significant departure from previous approaches. What motivated this radical change?
-The main motivation was to address limitations in previous models, particularly the sequential nature of processing in RNNs. This sequential computation hindered parallelization and made it challenging to learn long-range dependencies in sequences.
-Could you explain what attention mechanisms are and why they're so crucial in this model?
-Certainly. Attention allows the model to focus on different parts of the input sequence when producing each part of the output. In the Transformer, they use a specific type called scaled dot-product attention and extend it to multi-head attention, which lets the model jointly attend to information from different representation sub-spaces.
-Fascinating. How does the Transformer's architecture differ from previous models?
-The Transformer uses a stack of identical layers for both the encoder and decoder. Each layer has two main components: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. This structure allows for more parallelization and efficient computation.