xml source

from copy import deepcopy

JSON to XML


json_to_xml

 json_to_xml (d:dict, rnm:str)

Convert d to XML.

Type Details
d dict JSON dictionary to convert
rnm str Root name
Returns str

JSON doesn’t map as nicely to XML as the data structure used in fastcore.xml, but for simple XML trees it can be convenient – for example:

a = dict(surname='Howard', firstnames=['Jeremy','Peter'],
         address=dict(state='Queensland',country='Australia'))
hl_md(json_to_xml(a, 'person'))
<person>
  <surname>Howard</surname>
  <firstnames>
    <item>Jeremy</item>
    <item>Peter</item>
  </firstnames>
  <address>
    <state>Queensland</state>
    <country>Australia</country>
  </address>
</person>
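
To make the mapping concrete, here is a minimal standard-library sketch of the same idea. It is not the implementation used by json_to_xml: the name `json_to_xml_sketch` is hypothetical, and the rule that list entries become `<item>` elements is taken from the output shown above.

```python
import xml.etree.ElementTree as ET

def json_to_xml_sketch(d: dict, rnm: str) -> str:
    "Hypothetical stdlib-only analogue of json_to_xml, for illustration only."
    def build(parent, value):
        if isinstance(value, dict):
            # Each key becomes a child element named after the key
            for k, v in value.items(): build(ET.SubElement(parent, k), v)
        elif isinstance(value, list):
            # Each list entry becomes an <item> child, as in the output above
            for v in value: build(ET.SubElement(parent, 'item'), v)
        else:
            parent.text = str(value)
    root = ET.Element(rnm)
    build(root, d)
    ET.indent(root)  # pretty-print with two-space indents (Python 3.9+)
    return ET.tostring(root, encoding='unicode')

a = dict(surname='Howard', firstnames=['Jeremy', 'Peter'],
         address=dict(state='Queensland', country='Australia'))
xml_text = json_to_xml_sketch(a, 'person')
print(xml_text)
```

Keys must be valid XML element names for this to produce well-formed output, which is one reason the approach only suits simple trees.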

Including documents

Notebooks

nbp = Path('00_xml.ipynb')
nb = dict2obj(nbp.read_json())
cells = nb.cells
cell = cells[-1]
cell
{ 'cell_type': 'code',
  'execution_count': {},
  'id': '1e9ee5c1',
  'metadata': {'time_run': '2025-12-24T11:44:29.900555+00:00'},
  'outputs': [],
  'source': ['#|hide\n', '#|eval: false\n', 'from nbdev.doclinks import nbdev_export\n', 'nbdev_export()']}

get_mime_text

 get_mime_text (data)

Get text from MIME bundle, preferring markdown over plain
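
As a rough illustration of the preference order described above, the following sketch picks `text/markdown` when present and falls back to `text/plain`. The name `get_mime_text_sketch` is hypothetical, and joining list values is an assumption based on the notebook format allowing multi-line strings to be stored as lists.

```python
def get_mime_text_sketch(data: dict) -> str:
    "Hypothetical sketch: prefer markdown over plain text in a MIME bundle."
    for mime in ('text/markdown', 'text/plain'):
        if mime in data:
            val = data[mime]
            # Notebook JSON may store multi-line text as a list of strings
            return ''.join(val) if isinstance(val, list) else val
    return ''  # assumption: no text representation yields an empty string

bundle = {'text/plain': 'plain version', 'text/markdown': ['**bold**\n', 'text']}
md_text = get_mime_text_sketch(bundle)
plain_text = get_mime_text_sketch({'text/plain': 'only plain'})
no_text = get_mime_text_sketch({'image/png': '...'})
```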


cell2out

 cell2out (o)

Convert single notebook output to XML format

for o in cell.outputs: print(to_xml(cell2out(o)))

cell2xml

 cell2xml (cell, out=True)

Convert notebook cell to concise XML format

cell2xml(cell)
<code><source>#|hide
#|eval: false
from nbdev.doclinks import nbdev_export
nbdev_export()</code>
cell2xml(cell, out=False)
<code>#|hide
#|eval: false
from nbdev.doclinks import nbdev_export
nbdev_export()</code>

nb2xml

 nb2xml (fname=None, nb=None, out=True)

Convert notebook to XML format

nbsml = deepcopy(nb)
del(nbsml.cells[2:])

print(nb2xml(nb=nbsml))
<notebook><code><source>#|default_exp xml</code><md><source># xml source</md></notebook>
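
The overall shape of the conversion can be sketched with the standard library alone. This is an illustration under assumptions, not the code behind nb2xml: the helper names are hypothetical, the notebook is a plain dict rather than the object returned by dict2obj, and outputs are ignored, so the result resembles the `out=False` form.

```python
import xml.etree.ElementTree as ET

def cell_to_element(cell: dict) -> ET.Element:
    "Hypothetical helper: one notebook cell becomes one element."
    # Markdown cells use the short tag 'md', matching the output shown above
    tag = 'md' if cell['cell_type'] == 'markdown' else cell['cell_type']
    el = ET.Element(tag)
    src = cell['source']
    el.text = ''.join(src) if isinstance(src, list) else src
    return el

def nb_to_xml_sketch(nb: dict) -> str:
    "Hypothetical helper: wrap all cells in a <notebook> root."
    root = ET.Element('notebook')
    root.extend([cell_to_element(c) for c in nb['cells']])
    return ET.tostring(root, encoding='unicode')

nb_dict = {'cells': [
    {'cell_type': 'code', 'source': ['#|default_exp xml'], 'outputs': []},
    {'cell_type': 'markdown', 'source': ['# xml source']},
]}
result = nb_to_xml_sketch(nb_dict)
print(result)
```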

Documents

According to Anthropic, “it’s essential to structure your prompts in a way that clearly separates the input data from the instructions”. They recommend using something like the following:

Here are some documents for you to reference for your task:
    
<documents>
<document index="1">
<source>
(URL, file name, hash, etc)
</source>
<document_content>
(the text content)
</document_content>
</document>
</documents>

We will create some small helper functions to make it easier to generate context in this format, although we’ll use <src> instead of <source> to avoid conflict with that HTML tag. Although it’s based on Anthropic’s recommendation, it’s likely to work well with other models too.

We’ll use doctype to store our pairs.

Since Anthropic’s example shows newlines before and after each tag, we’ll do the same.

to_xml(Src('a'))
'<src>a</src>'
to_xml(Document('a'))
'<document>a</document>'

mk_doctype

 mk_doctype (content:str, src:Optional[str]=None)

Create a doctype named tuple

Type Default Details
content str The document content
src Optional None URL, filename, etc; defaults to md5(content) if not provided
Returns namedtuple

This is a convenience wrapper to ensure that a doctype has the needed information in the right format.

doc = 'This is a "sample"'
mk_doctype(doc)
doctype(src='\n47e19350\n', content='\nThis is a "sample"\n')
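
The behaviour shown above can be approximated as follows. This is a sketch, not the library's code: the names `doctype_sketch` and `mk_doctype_sketch` are hypothetical, and truncating the MD5 hex digest to 8 characters is an assumption based on the length of the src value in the output.

```python
import hashlib
from collections import namedtuple

# Hypothetical stand-in for the doctype named tuple
doctype_sketch = namedtuple('doctype_sketch', ['src', 'content'])

def mk_doctype_sketch(content: str, src=None):
    "Hypothetical sketch: fill in a hash-based src and wrap fields in newlines."
    if src is None:
        # Assumption: a short prefix of the MD5 hex digest identifies the content
        src = hashlib.md5(content.encode()).hexdigest()[:8]
    return doctype_sketch(f'\n{str(src).strip()}\n', f'\n{content.strip()}\n')

hashed = mk_doctype_sketch('This is a "sample"')
named = mk_doctype_sketch('And another one', 'doc.txt')
print(hashed)
print(named)
```

Deriving the default src from the content means the same text always gets the same label, without the caller having to invent one.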

mk_doc

 mk_doc (index:int, content:str, src:Optional[str]=None, **kwargs)

Create an ft format tuple for a single doc in Anthropic’s recommended format

Type Default Details
index int The document index
content str The document content
src Optional None URL, filename, etc; defaults to md5(content) if not provided
kwargs VAR_KEYWORD
Returns tuple

We can now generate XML for one document in the suggested format:

mk_doc(1, doc, title="test")
<document index="1" title="test"><src>
47e19350
</src><document-content>
This is a "sample"
</document-content></document>

docs_xml

 docs_xml (docs:list[str], srcs:Optional[list]=None, prefix:bool=True,
           details:Optional[list]=None, title:str=None)

Create an XML string containing docs in Anthropic’s recommended format

Type Default Details
docs list The content of each document
srcs Optional None URLs, filenames, etc; each one defaults to md5(content) if not provided
prefix bool True Include Anthropic’s suggested prose intro?
details Optional None Optional list of dicts with additional attrs for each doc
title str None Optional title attr for Documents element
Returns str

Putting it all together, we have our final XML format:

docs = [doc, 'And another one']
srcs = [None, 'doc.txt']
print(docs_xml(docs, srcs))
Here are some documents for you to reference for your task:

<documents><document index="1"><src>
47e19350
</src><document-content>
This is a "sample"
</document-content></document><document index="2"><src>
doc.txt
</src><document-content>
And another one
</document-content></document></documents>
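
For readers who want to see how such a string could be assembled without this library, here is a standard-library sketch that produces the same layout. The function name `docs_xml_sketch` is hypothetical, and the 8-character MD5 prefix used for a missing src is an assumption rather than a confirmed detail of docs_xml.

```python
import hashlib
import xml.etree.ElementTree as ET

def docs_xml_sketch(docs, srcs=None, prefix=True) -> str:
    "Hypothetical sketch of building the <documents> wrapper shown above."
    srcs = srcs or [None] * len(docs)
    root = ET.Element('documents')
    for i, (content, src) in enumerate(zip(docs, srcs), start=1):
        if src is None:
            # Assumption: label unnamed documents with a short content hash
            src = hashlib.md5(content.encode()).hexdigest()[:8]
        doc = ET.SubElement(root, 'document', index=str(i))
        # Newlines before and after each value, as in the output above
        ET.SubElement(doc, 'src').text = f'\n{src}\n'
        ET.SubElement(doc, 'document-content').text = f'\n{content}\n'
    xml_text = ET.tostring(root, encoding='unicode')
    intro = 'Here are some documents for you to reference for your task:\n\n'
    return intro + xml_text if prefix else xml_text

ctx = docs_xml_sketch(['This is a "sample"', 'And another one'], [None, 'doc.txt'])
bare = docs_xml_sketch(['And another one'], ['doc.txt'], prefix=False)
print(ctx)
```

Unlike plain string formatting, ElementTree escapes `<`, `>` and `&` in the content, which may or may not be what you want when the documents themselves contain markup.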

Context creation

Now that we can generate Anthropic’s XML format, let’s make it easy for a few common cases.

File list to context

For generating XML context from files, we’ll just read them as text and use the file names as src.


read_file

 read_file (fname, out=True, max_size=None)

Read file content, converting notebooks to XML if needed


files2ctx

 files2ctx (fnames:list[typing.Union[str,pathlib.Path]], prefix:bool=True,
            out:bool=True, srcs:Optional[list]=None, title:str=None,
            max_size:int=None)

Convert files to XML context, handling notebooks

Type Default Details
fnames list List of file names to add to context
prefix bool True Include Anthropic’s suggested prose intro?
out bool True Include notebook cell outputs?
srcs Optional None Use the labels instead of fnames
title str None Optional title attr for Documents element
max_size int None Skip files larger than this (bytes)
Returns str XML for LM context
fnames = ['samples/sample_core.py', 'samples/sample_styles.css']
hl_md(files2ctx(fnames, max_size=120))
Here are some documents for you to reference for your task:

<documents><document index="1"><src>
samples/sample_core.py
</src><document-content>
[Skipped: sample_core.py exceeds 120 bytes]
</document-content></document><document index="2"><src>
samples/sample_styles.css
</src><document-content>
.cell { margin-bottom: 1rem; }
.cell > .sourceCode { margin-bottom: 0; }
.cell-output > pre { margin-bottom: 0; }
</document-content></document></documents>
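
The size check illustrated above can be sketched in a few lines. This is an illustration only: `read_file_sketch` is a hypothetical name, the placeholder text is copied from the output shown, and notebook handling is left out.

```python
import tempfile
from pathlib import Path

def read_file_sketch(fname, max_size=None) -> str:
    "Hypothetical sketch: return file text, or a placeholder if it is too large."
    p = Path(fname)
    if max_size is not None and p.stat().st_size > max_size:
        return f'[Skipped: {p.name} exceeds {max_size} bytes]'
    return p.read_text()

# Demonstrate with two temporary files, one above and one below the limit
with tempfile.TemporaryDirectory() as tmp:
    small = Path(tmp) / 'small.css'
    big = Path(tmp) / 'big.py'
    small.write_text('.cell { margin-bottom: 1rem; }\n')
    big.write_text('x = 1\n' * 100)  # 600 bytes, above the 120-byte limit
    results = [read_file_sketch(f, max_size=120) for f in (big, small)]

print(results)
```

Emitting a placeholder instead of silently dropping the file keeps the document indices stable and tells the model that something was omitted.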

Folder to context


folder2ctx

 folder2ctx (folder:Union[str,pathlib.Path], prefix:bool=True,
             out:bool=True, include_base:bool=True, title:str=None,
             max_size:int=100000, recursive:bool=True, symlinks:bool=True,
             file_glob:str=None, file_re:str=None, folder_re:str=None,
             skip_file_glob:str=None, skip_file_re:str=None,
             skip_folder_re:str=None, func:callable=<function join>,
             ret_folders:bool=False, sort:bool=True, exts:list|str=None)

Convert folder contents to XML context, handling notebooks

Type Default Details
folder Union Folder name containing files to add to context
prefix bool True Include Anthropic’s suggested prose intro?
out bool True Include notebook cell outputs?
include_base bool True Include full path in src?
title str None Optional title attr for Documents element
max_size int 100000 Skip files larger than this (bytes)
recursive bool True search subfolders
symlinks bool True follow symlinks?
file_glob str None Only include files matching glob
file_re str None Only include files matching regex
folder_re str None Only enter folders matching regex
skip_file_glob str None Skip files matching glob
skip_file_re str None Skip files matching regex
skip_folder_re str None Skip folders matching regex
func callable join function to apply to each matched file
ret_folders bool False return folders, not just files
sort bool True sort files by name within each folder
exts list | str None
Returns str XML for LM context
print(folder2ctx('samples', prefix=False, file_glob='*.py'))
<documents><document index="1"><src>
samples/sample_core.py
</src><document-content>
import inspect
empty = inspect.Parameter.empty
models = 'claude-3-opus-20240229','claude-3-sonnet-20240229','claude-3-haiku-20240307'
</document-content></document></documents>
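
The file selection step can be pictured with `pathlib` alone. The sketch below is an assumption-laden illustration rather than the library's traversal code: `list_files_sketch` is a hypothetical name, and it covers only a glob filter, the recursive switch and sorting, not the regex or extension filters.

```python
import tempfile
from pathlib import Path

def list_files_sketch(folder, file_glob='*', recursive=True) -> list:
    "Hypothetical sketch: sorted relative paths of files matching a glob."
    root = Path(folder)
    matches = root.rglob(file_glob) if recursive else root.glob(file_glob)
    files = sorted(p for p in matches if p.is_file())
    # Relative POSIX-style paths make compact, portable src labels
    return [p.relative_to(root).as_posix() for p in files]

# Build a small throwaway folder to demonstrate the filters
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'sub').mkdir()
    (root / 'a.py').write_text('x = 1\n')
    (root / 'b.css').write_text('.cell {}\n')
    (root / 'sub' / 'c.py').write_text('y = 2\n')
    py_files = list_files_sketch(root, file_glob='*.py')
    top_only = list_files_sketch(root, file_glob='*.py', recursive=False)

print(py_files, top_only)
```

Each selected path would then be read and passed, together with its label, to a function like files2ctx.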

repo2ctx

 repo2ctx (owner:str, repo:str, ref:str=None, prefix:bool=True,
           out:bool=True, include_base:bool=True, title:str=None,
           max_size:int=100000, recursive:bool=True, symlinks:bool=True,
           file_glob:str=None, file_re:str=None, folder_re:str=None,
           skip_file_glob:str=None, skip_file_re:str=None,
           skip_folder_re:str=None, func:callable=<function join>,
           ret_folders:bool=False, sort:bool=True, exts:list|str=None)

Convert GitHub repo to XML context without cloning

Type Default Details
owner str GitHub repo owner
repo str GitHub repo name
ref str None Git ref (branch/tag/sha); defaults to repo’s default branch
prefix bool True Include Anthropic’s suggested prose intro?
out bool True Include notebook cell outputs?
include_base bool True Include full path in src?
title str None Optional title attr for Documents element
max_size int 100000 Skip files larger than this (bytes)
recursive bool True search subfolders
symlinks bool True follow symlinks?
file_glob str None Only include files matching glob
file_re str None Only include files matching regex
folder_re str None Only enter folders matching regex
skip_file_glob str None Skip files matching glob
skip_file_re str None Skip files matching regex
skip_folder_re str None Skip folders matching regex
func callable join function to apply to each matched file
ret_folders bool False return folders, not just files
sort bool True sort files by name within each folder
exts list | str None
Returns str XML for LM context
print(repo2ctx('answerdotai', 'toolslm', exts=('md','py'), skip_file_re='^_', prefix=False, out=False)[:330])
<documents title="GitHub repository contents from answerdotai/toolslm at ref 'main' (filters applied: exts: md, py | skip_file_re: ^_)"><document index="1"><src>
CHANGELOG.md
</src><document-content>
# Release notes

<!-- do not remove -->

## 0.3.8

### New Features

- Add `repo2ctx` ([#52](https://github.com/AnswerDotAI/toolsl
Tip

After you install toolslm, folder2ctx becomes available from the command line.

!folder2ctx -h
usage: folder2ctx [-h] [--recursive] [--symlinks] [--file_glob FILE_GLOB]
                  [--file_re FILE_RE] [--folder_re FOLDER_RE]
                  [--skip_file_glob SKIP_FILE_GLOB]
                  [--skip_file_re SKIP_FILE_RE]
                  [--skip_folder_re SKIP_FOLDER_RE] [--func FUNC]
                  [--ret_folders] [--sort] [--exts EXTS] [--prefix] [--out]
                  [--include_base] [--title TITLE] [--max_size MAX_SIZE]
                  folder

CLI to convert folder contents to XML context, handling notebooks

positional arguments:
  folder                           Folder name containing files to add to
                                   context

options:
  -h, --help                       show this help message and exit
  --recursive                      search subfolders (default: False)
  --symlinks                       follow symlinks? (default: False)
  --file_glob FILE_GLOB            Only include files matching glob
  --file_re FILE_RE                Only include files matching regex
  --folder_re FOLDER_RE            Only enter folders matching regex
  --skip_file_glob SKIP_FILE_GLOB  Skip files matching glob
  --skip_file_re SKIP_FILE_RE      Skip files matching regex
  --skip_folder_re SKIP_FOLDER_RE  Skip folders matching regex,
  --func FUNC                      function to apply to each matched file
                                   (default: <function join>)
  --ret_folders                    return folders, not just files (default:
                                   False)
  --sort                           sort files by name within each folder
                                   (default: False)
  --exts EXTS
  --prefix                         Include Anthropic's suggested prose intro?
                                   (default: False)
  --out                            Include notebook cell outputs? (default:
                                   False)
  --include_base                   Include full path in src? (default: False)
  --title TITLE                    Optional title attr for Documents element
  --max_size MAX_SIZE              Skip files larger than this (bytes)