Building a Copilot #2: Parsing & Converting On The Fly

Welcome back: we’re discussing how to write an AI coding / copilot service named The Owl. This is the post I have been most looking forward to writing, because it describes a logical but semi-crazy sequence of incremental streamed parsing layers.

At the end of Part 1, we had a Python server that streamed text to a client using Server-Sent Events (SSE). In the client, this let us display text as it arrived, word by word:

But if you look closely, this isn’t plain text: it’s formatted. In fact, it’s Markdown.

AIs can easily return Markdown-formatted text, but the response from the AI server is not just that text: it’s a complex object (ie, JSON). Code reading the AI’s response needs to handle partially complete data, and process it as it arrives piece by piece.

But this is not the only layer of parsing and streaming on the fly. In fact, the Owl has four layers of parsing incomplete streams of data:

  1. The AI library returns a stream of data; this includes the actual AI response. The response grows as each token is returned, so it is parsed chunk by chunk
  2. That AI response is, at the same time as it arrives and is parsed, streamed to the client. That is, the incomplete response is wrapped in JSON, using fields specific to the Owl, and streamed token by token as each token is received from the server
  3. The client gets that JSON, and parses it as it arrives to also extract the AI response. At this stage the client has the streaming text of the AI… through a few layers.
  4. The AI response is Markdown-formatted. So this is parsed on the fly to provide rendering information using a formatting state machine that is updated with each token

This is four different levels of data arriving via a stream and being parsed and handled on the fly, token by token as it arrives. Of them all, the most complex to implement (to me) was the Markdown parsing.

If your client talks directly to an AI, step 2 is not required; you can go straight from the AI server to the client. And, there are JavaScript libraries to present AI chats so you don’t need to implement any of this yourself. But if you’re writing your own middleware and client, you’ll need to handle some form of on-the-fly parsing of incomplete data.

Besides, it’s fun!

Contents

This post describes, first, how an AI library is wrapped in the Owl’s server code, and then how all of the above works in practice, with code samples. It uses both Python and Delphi, but even if you don’t use either of those languages it may still be useful for you.

Wrapping an AI library

The Owl provides its own REST API, but under the hood it can be implemented with multiple AI engines: currently, it supports OpenAI, Mistral, and Ollama. The engines share a pretty basic base class:

The APIs are implemented via get_response_json_sse(), because it streams. But it can be handy when coding to get the full response in one go, which is what get_response_json() is for. get_response_json_sse() is a Python generator function, ie, it will yield results as it is iterated, and this allows us to build and send the JSON string as tokens arrive from the AI.
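As a sketch of what such a base class might look like (the two method names are from the post; everything else here is an assumption, not the Owl’s actual code):

```python
from abc import ABC, abstractmethod
from typing import Iterator

class AICallBase(ABC):
    """Base class for the engine wrappers (OpenAI, Mistral, Ollama)."""

    @abstractmethod
    def get_response_json_sse(self, prompt: str) -> Iterator[str]:
        """Yield the Owl's JSON response piece by piece, as tokens arrive."""

    def get_response_json(self, prompt: str) -> str:
        """Convenience method: collect the whole streamed response in one go."""
        return "".join(self.get_response_json_sse(prompt))
```

Each engine only has to implement the streaming method; the one-shot version falls out for free by joining the stream.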

The comments here describe the current format of the Owl’s JSON sent to the client: currently a series of simple key/value pairs for the response itself and some other data.

Let’s look at how this is done using OpenAI’s Python library.

Asynchronous OpenAI chat

This code is straight out of the library documentation:

It creates an async chat completion (note you should use a secrets manager to avoid leaking your API key). This is iterable, and this is where the magic happens. Let’s implement an OpenAI descendant of the AICallBase class.

First, get the completion object per above, and check if it has an ok status code:

I return JSON with the error, but for a while it just raised an exception. SSEventFromText() is a tiny wrapper function to return a Falcon SSEvent from a string.

If there’s a failure, it returns a single event with JSON containing a failure flag and info on what the error was.
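The Owl wraps this in Falcon’s SSEvent type; as a rough sketch of what the failure path amounts to on the wire (the payload field names here are hypothetical, and the real SSEventFromText() returns a Falcon object rather than a raw string):

```python
import json

def sse_from_text(text: str) -> str:
    # The raw Server-Sent Events wire format: a "data:" field, then a blank line.
    return f"data: {text}\n\n"

def failure_event(message: str) -> str:
    # A single event with a failure flag and info on what the error was.
    return sse_from_text(json.dumps({"success": "false", "error": message}))
```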

However, if all is ok, we can iterate over the chunks in the response. Remember, this is happening as the server streams the response to us:

I tend to find loops with flags like ‘is_first’ to be a code smell, but honestly I can’t really find a better way to achieve this here.

The chat response can contain a number of choices: think of these as alternative replies. Since we didn’t ask the AI for more than one, there’s a single choice, thus the [0] index. Then we get the delta (what’s new, which will be one or more tokens that have arrived) and its content, which is the text of that delta.

Remember from our method documentation that we want to return JSON in the format:

Plus, our method is a generator: it will yield content. Thus, the first time it’s called we want to yield the start of the JSON (that’s the ugly ‘is_first’ flag). Then, we want to yield the AI’s response until that runs out; finally, yield the last of the JSON.
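Here’s that shape as a standalone sketch. The list of deltas stands in for the chunks arriving from the AI library, and the envelope’s field names are illustrative, not the Owl’s actual schema:

```python
import json

def stream_owl_json(deltas):
    """Wrap streamed AI tokens in a JSON envelope, piece by piece."""
    is_first = True
    for delta in deltas:
        if is_first:
            # First token: yield the start of the JSON (the ugly flag).
            yield '{"success": "true", "response": "'
            is_first = False
        # json.dumps escapes quotes and newlines; strip its surrounding
        # quotes so we emit a bare string fragment.
        yield json.dumps(delta)[1:-1]
    if is_first:  # the AI returned nothing at all
        yield '{"success": "true", "response": "'
    # Finally, yield the last of the JSON.
    yield '"}'
```

Joining everything yielded gives valid JSON, but each piece can be streamed on to the client the moment it is produced.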

Using this generator for the web server

Remember part 1, where we used the Falcon web framework and wrote a Hello World generator to return text?

Simply use this generator instead:

Now you’re getting the AI server’s response piece by piece, and streaming it as JSON to your client.

Client-side

We’re implementing our client as a plugin for the Delphi IDE. While Delphi has excellent Python support, we’re going to use Delphi’s native HTTP components to communicate with the server. Last year (2023, version 12.0?) the HTTP components were updated to support SSE (streaming) responses, so an object representing our request can receive the latest chunks one by one as they come in:

Delphi has multiple inbuilt JSON libraries, but all of them assume the JSON is valid and complete. We cannot use them to parse incomplete JSON that is arriving on the fly, token by token. However, in order to display the AI’s response word by word as it comes in, we want to parse the streaming partial JSON, which (remember) will be an incomplete or possibly complete string of the form:

to get the response as it arrives (error handling is out of scope for this blog post.) That response will be Markdown and is what we want to display.

Incremental JSON

JSON supports nested objects, but to keep this simple the Owl’s JSON that it returns — which we wrote above — is flat: it uses JSON as a series of key/value pairs.

We do not want to re-parse the JSON every time a new piece of text is added. Instead, we want to iteratively parse, to be as efficient as possible. This is in the TIncrementalJSONParser class.

Each time a new piece of text comes in, it is added via AddChunk to the buffer, FBuffer. The text is sanitised from any SSE formatting before being appended to the buffer. It boils down to:

FCursor is a cursor into the buffer (effectively an index or pointer), which stores the point up to which we last successfully read, and each key/value pair is stored in the FData dictionary. ParseBuffer() does the work: it incrementally parses.

As a user of the class, you can always get the value for any key via the Data[] property, noting that the value might be incomplete, but perhaps more useful is the OnIncrementalValue event: this is fired every time a value is changed, as it changes. So if you have a long key/value pair like an AI response, you will get this event fired many times as each token arrives in the response, and the JSON string is parsed another word along.

Values are always strings.

How it works

Using JSON as a flat list of key/value pairs, rather than nested objects, makes this considerably simpler. Both the keys and values are strings surrounded by double quotation marks.

The internal ExtractString method starts at a position in the buffer, and scans for the next " mark. From there, it scans until the second " mark (handling escaped \" sequences), or hits the end of the buffer. It copies the string between those two points and returns it, along with a variable indicating if it found the ending " or not. If not, we know the value is incomplete.

Using this, you can build a simple parser. It does re-read strings until the end of a string is found, but it’s still reasonably performant. Consider how the following JSON will be parsed:

This will come in piece by piece, such as:

It skips past the opening brace and then looks for a string with ExtractString. Here, it finds a complete one. Using the FIsKeyComplete and FIsValueComplete flags, we can track if we’re reading a key or a value; here it will set FIsKeyComplete to true and move on to parsing the value, storing the current key for which it’s attempting to read a value in FLastKey.

But there is no value yet, so the FData dictionary contains an entry with “key” mapping to an empty string. FCursor is updated to the last point it successfully parsed, in this case the end of the key string.

When more data comes in:

It knows it has a complete key but incomplete value, so tries to read the string again via ExtractString. This time, it will get “the val” but will return that it is incomplete. FData is updated to map “key” to “the val”, the event is called to notify you, but FIsValueComplete is still false meaning next time a chunk comes in, it will try to read the value again.

More data comes in:

FCursor is still pointing to the start of the value, because last time it read, the value was incomplete. This time, ExtractString finds the end quote, so it returns both the value and a flag that the value is complete. We know what key we’re parsing the value for via FLastKey, so the dictionary is updated, the event is fired, and state is reset to look for a key: FIsKeyComplete and FIsValueComplete are both false, and FLastKey is empty.
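The walkthrough above translates fairly directly into code. Here is a Python sketch of the same machinery (the actual Owl client is Delphi; the names mirror the fields described above, and SSE sanitising is omitted):

```python
class IncrementalJSONParser:
    """Sketch of TIncrementalJSONParser: parse flat key/value JSON that
    arrives chunk by chunk, re-reading only the still-incomplete value."""

    def __init__(self, on_incremental_value=None):
        self.buffer = ""      # FBuffer: everything received so far
        self.cursor = 0       # FCursor: last successfully parsed point
        self.data = {}        # FData: key -> (possibly partial) value
        self.last_key = ""    # FLastKey: key whose value is being read
        self.key_complete = False
        self.on_incremental_value = on_incremental_value

    def _extract_string(self, start):
        """Scan from `start` for an opening quote, then the closing quote
        (skipping escaped \\"), returning (text, end_index, complete)."""
        i = self.buffer.find('"', start)
        if i == -1:
            return None, start, False
        out, j = [], i + 1
        while j < len(self.buffer):
            c = self.buffer[j]
            if c == "\\" and j + 1 < len(self.buffer):
                out.append(self.buffer[j + 1]); j += 2; continue
            if c == '"':
                return "".join(out), j + 1, True
            out.append(c); j += 1
        return "".join(out), i, False   # hit end of buffer: incomplete

    def add_chunk(self, chunk):
        self.buffer += chunk
        while True:
            if not self.key_complete:
                key, end, done = self._extract_string(self.cursor)
                if not done:
                    return                 # wait for the rest of the key
                self.last_key, self.cursor, self.key_complete = key, end, True
                self.data[key] = ""
            value, end, done = self._extract_string(self.cursor)
            if value is not None:
                self.data[self.last_key] = value
                if self.on_incremental_value:
                    self.on_incremental_value(self.last_key, value, done)
            if not done:
                return                     # re-read this value next chunk
            self.cursor = end              # value finished: look for a key
            self.key_complete, self.last_key = False, ""
```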

Incremental Markdown

But we’re still not done. We’ve parsed the incrementally arriving JSON to get the AI’s response, but that’s Markdown-formatted text.

Now, you could just use a Markdown control and set the entire value to the text each time. While it would work, that kind of approach tends to lead to bugs like flicker, or making it impossible to select text (because it’s constantly being rewritten underneath the mouse), or impossible to scroll (when resetting the text it will often reset the scroll position.)

Besides, Delphi has no inbuilt Markdown renderer — the IDE uses a custom version of the one linked above, but it’s not exposed to plugin writers via the ToolsAPI. And due to clashes, a plugin can’t easily include its own version of a component that users might also have installed: only one will be able to be loaded. There are workarounds, but each of them requires maintaining a customised version of the component (which is exactly what the IDE does.)

So we’re going to separate parsing from rendering, and either render by converting the Markdown to HTML or maybe RTF; either way, new content will simply be appended.

TMarkdownIncrementalParser

Markdown is considerably harder to parse than a simple flat list of JSON key/value pairs. It has nested state: you might have bold text, with some of that itself being italic (or a heading, with a few words marked as code in that heading.) Some of the markup symbols have duplicate meanings: an asterisk * can start an unordered list, could be starting an italicised section of text, or can be part of ** to start a bold section of text.

But because this is streamed, there is one assumption: you must always be able to figure out the state (eg, italic) before the text to which it applies. This means that this style of headings:

will not be supported. Only the form of headings where they are prefixed with hash marks # or ## will be parsed.

State and events

We’re going to parse markdown into a series of events. These are not Delphi events but are akin to instructions: enable bold, change colour to red, append text ‘hello’, etc. These can then be used to create HTML or rich text.

Nested state like the italic and bold above can be represented well, via a stack. The current formatting is the result of applying the entire stack: when something new is encountered, just push a new format onto the stack.

Each of these events has a start, content which is text, and an end. This corresponds to pushing and popping state on the stack, plus the text for which it applies.

We will support headings (of any level), inline code, code blocks (large multi-line pieces of code), and both unordered and numbered lists. This could be extended in future to support links or other formatting.

Then, we need to store the parser context: a buffer, a cursor into that buffer (following the same idea as the JSON parser: the cursor points to the last location that was successfully parsed), a bit of other info, plus an event you can hook into in order to do the final HTML/etc generation. You get a class definition that starts like this:

Some elements are referred to as ‘inline’: bold, italic, and inline code. Some are not, such as headings. This is used to track where they are valid.
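As an illustrative sketch of that kind of class definition in Python (the names here are assumptions based on the description above, not the actual Delphi declaration):

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List, Optional

class MdElement(Enum):
    TEXT = auto()
    HEADING = auto()
    BOLD = auto()
    ITALIC = auto()
    INLINE_CODE = auto()
    CODE_BLOCK = auto()
    UNORDERED_LIST = auto()
    ORDERED_LIST = auto()

# Bold, italic, and inline code are valid inside other elements
# (eg inside a heading); headings, lists, and code blocks are not.
INLINE_ELEMENTS = {MdElement.BOLD, MdElement.ITALIC, MdElement.INLINE_CODE}

@dataclass
class MarkdownParserState:
    buffer: str = ""        # text received so far
    cursor: int = 0         # last successfully parsed position
    stack: List[MdElement] = field(default_factory=list)  # nested formatting
    on_event: Optional[Callable] = None  # hook for HTML/RTF generation
```

The current formatting at any point is the result of applying the whole stack, which is what makes the nested state manageable.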

Let’s incrementally parse:

In short: append to the buffer. DetectElement() does the main work. If the cursor is moving, keep going and try to parse even more; if it didn’t move (ie DetectElement() or the methods it calls did not successfully parse) then stop and wait for the next chunk.

This checks the current state. If it’s in any form of code block, continue it (it will keep emitting code until it gets to the end backtick(s).) If it’s in a heading or list, or there is no state, try to parse.

BeginElementDetection is the monster.

Starting at the current cursor, it scans ahead to find formatting marks. These can be # for headings, * or - or + for unordered lists, * or _ for italic or bold, any number for an ordered list, or a backtick to start inline code or a code block, plus line breaks.

The closest of these to the cursor (the first one found) then decides what to do. This is all based on the current state: for example, if currently inside a heading, it does not parse for another # symbol, but it does parse for inline code, inline bold, or inline italic. Text found before the next formatting change is simply emitted as content.

This boils down to a switch statement for what kind of content to attempt to parse:

One of the simplest to parse is headings, which begin with any number of # marks, eg # is a level 1 heading and ### is a level 3 heading:

This:

  • Validates it can parse a heading. Finding # is not enough; there might be more #s coming, so we can only know the heading level once there’s whitespace.
  • Updates the current state: inside a heading, and element is a heading
  • Emits the event. This is key. Each event has the element type, whether it starts, ends, or has content, and element-specific data: for headings, it stores the heading level in HTML-like notation
  • Updates the cursor indicating the last place it successfully parsed to
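The “wait for whitespace” rule is the incremental subtlety here. A hypothetical helper (not the Owl’s actual code) makes it concrete:

```python
def try_parse_heading(buffer: str, cursor: int):
    """Return (level, index_after_marks) for a heading starting at `cursor`,
    or None if this isn't a heading or more data is needed. The level is
    only known once whitespace follows the # marks, so a buffer ending in
    '##' is still ambiguous: another # might arrive in the next chunk."""
    i = cursor
    while i < len(buffer) and buffer[i] == "#":
        i += 1
    level = i - cursor
    if level == 0:            # no # marks here: not a heading
        return None
    if i >= len(buffer):      # buffer ends in #s: wait for the next chunk
        return None
    if buffer[i] != " ":      # '#text' with no space isn't a heading
        return None
    return level, i + 1       # skip the space after the marks
```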

For Markdown like:

This gets a sequence of events like:

  • heading, start, 'h1'
  • heading, content, 'Heading here'
  • heading, end, ''
  • text, start, ''
  • text, content, 'hello '
  • italic, start, ''
  • italic, content, 'world'
  • italic, end, ''
  • text, start, ''
  • text, content, '!'
  • text, end, ''

And those can be used to convert to HTML or rich text.
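As an illustration, a heavily simplified, non-incremental Python sketch produces that same event stream for this fragment. The real parser does the same work chunk by chunk against a buffer and cursor, and handles far more cases; this whole-string version only shows the event-emitting idea:

```python
def parse_markdown(text):
    """Emit (element, kind, data) events for a tiny subset of Markdown:
    # headings, *italic* spans, and plain text."""
    events, i, n = [], 0, len(text)
    while i < n:
        if text.startswith("#", i) and (i == 0 or text[i - 1] == "\n"):
            level = 0                      # count the # marks for the level
            while i < n and text[i] == "#":
                level += 1; i += 1
            i += 1                         # skip the space after the marks
            end = text.find("\n", i)
            end = n if end == -1 else end
            events += [("heading", "start", f"h{level}"),
                       ("heading", "content", text[i:end]),
                       ("heading", "end", "")]
            i = end + 1
        elif text[i] == "*":               # italic span up to the closing *
            end = text.find("*", i + 1)
            end = n if end == -1 else end
            events += [("italic", "start", ""),
                       ("italic", "content", text[i + 1:end]),
                       ("italic", "end", "")]
            i = end + 1
        else:                              # plain text up to the next mark
            j = i
            while j < n and text[j] not in "*\n":
                j += 1
            events += [("text", "start", ""),
                       ("text", "content", text[i:j]),
                       ("text", "end", "")]
            i = j + 1 if j < n and text[j] == "\n" else j
    return events
```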

Tests

There were far more edge cases and more state to track than I had expected: Markdown is a simple format, but it has gotchas.

Unit tests. Unit tests. Unit tests.

It was quite common to break something while adding support for more complex cases. Writing tests ahead of time for the output I was aiming for helped, but so did a growing set of regression tests, built up day after day, against known working Markdown and its expected output.

The nice thing about emitting a stream of events is that unit tests can verify against a list of specific events and their data very cleanly.

Rendering Markdown

We’re almost there.

The Owl, inside Delphi, currently uses a rich text field. This is not super pleasant to code against, but it’s there mainly because embedding the Edge browser is really hard to get working, and I eventually fell back to using a TRichEdit instead.

This is not ideal: it’s not as nice to work with; it’s much harder to get to respond to dark mode changing; and in high DPI the rich edit seems to have a bug where list marks (a dot, number, etc) may or may not draw at the right size, seemingly randomly.

If you know a reliable technique to dynamically create an embedded Edge browser, ensuring it finds the right DLL, etc etc, I would be very grateful to hear about it!

I have two builder classes, THTMLBuilder and TRichTextBuilder. HTML is much nicer than RTF so for simplicity’s sake I’ll share that code:

It’s very simple: it gets a Markdown event and appends something. This could be an opening tag, a closing tag, or content. For example:

Ending a tag or just adding text is even simpler.
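A Python sketch of the same idea, consuming events of the kind shown earlier (the tag map and class names are illustrative, not the actual THTMLBuilder):

```python
import html

# Which HTML tag each inline Markdown element maps to (illustrative).
TAGS = {"italic": "em", "bold": "strong", "inline_code": "code"}

class HTMLBuilder:
    """Consumes Markdown events and appends the corresponding HTML."""

    def __init__(self):
        self.parts = []
        self.open_tags = []  # stack of tags to close on 'end' events

    def handle_event(self, element, kind, data):
        if kind == "start":
            # Headings carry their tag ('h1', 'h2', ...) in the event data.
            tag = data if element == "heading" else TAGS.get(element)
            if tag:
                self.parts.append(f"<{tag}>")
            self.open_tags.append(tag)   # remember what to close (or None)
        elif kind == "content":
            self.parts.append(html.escape(data))  # just append escaped text
        elif kind == "end":
            tag = self.open_tags.pop()
            if tag:
                self.parts.append(f"</{tag}>")

    def result(self):
        return "".join(self.parts)
```

Because new content is only ever appended, the rendered view never needs to be rewritten from scratch, which avoids the flicker and scroll-reset problems mentioned above.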

How many layers?

So there we have it.

  1. The AI server streams to the Owl server
  2. The Owl server parses on the fly and streams JSON to the client
  3. The client parses on the fly and extracts Markdown from the JSON
  4. The Markdown is also parsed on the fly, and emitted as a stream of events
  5. Those events are used to generate/build HTML or rich text

And yet it looks so simple in practice!

Animated GIF – you may need to click to play.

And see that note the AI says about a conversation we had six minutes before? That’s because it has a memory. More on that in another post…

And for reference, this still only gets us up to March in terms of development, five months ago.

Happy coding!

Code license: in general, like most blogs, code is for information and education. There’s no error checking, and it’s not suitable for production. Also, don’t use this to create anything competitive to the Owl, ie an AI service or copilot targeting Delphi/C++Builder.

If you do want to use it for some kind of copilot for RAD Studio, drop me an email (I’ll likely say yes, just don’t want surprises.) I prefer working with people :)

And on that note if you want to build an AI service, get in touch!