4  Structured Output

In this chapter, we continue to talk about {ellmer}. You see, {ellmer} still has a pretty cool trick up its sleeves. And that trick has to do with structured output. Namely, {ellmer} (or rather the LLMs it talks with) allows you to extract data from unstructured texts in a structured format.

That’s huge! One of the most fundamental problems with regular prompting is that you can never be sure that you get output in the form that you wish. But for these kinds of situations, OpenAI and other vendors nowadays allow you to use so-called “structured output”.

And the best part: {ellmer} has built-in functionalities that allow you to use this structured mode really easily. In this chapter, I’ll show you how. And as always: There’s a video associated with this book chapter:

< video will be inserted after video release >

4.1 A ground truth

To test the structured output, we first need some data that we can extract things from. So let’s do something fun. Let’s come up with a little hero story (yes these details are absolutely GPT generated):

  • Description: Dax Ironfist is a stocky, gruff dwarf with soot-covered hands, a thick brown beard, and a pair of goggles perpetually resting on his forehead. He despises adventure but is an unmatched engineer, constantly tinkering with mechanical inventions in his underground workshop.
  • Journey: When his latest creation, an experimental automaton, malfunctions and escapes, Dax is forced to leave his comfort zone and track it down. His pursuit takes him into an ancient dwarven ruin filled with dangerous traps, where he meets a young adventurer who needs his expertise. Through their trials, Dax realizes that his skills can help more than just himself.
  • Lesson Learned: Innovation isn’t just about personal success—it’s about using one’s talents to improve the lives of others.

What a beautiful character! Now we can let GPT generate a multi-page story that involves these details. This little stunt probably cost me 10 cents or so in GPT tokens. But it’s worth it. We now have something that we can ask GPT.

Anyway, I have saved this story in a file called story.txt:

library(tidyverse)
read_lines('dat/story.txt')[1:5] |> 
  stringr::str_wrap(width = 50) |> 
  paste(collapse = '\n') |> 
  paste('...') |> 
  cat()
## Title: Dax Ironfist – The Reluctant Inventor
## 
## Dax Ironfist let out a frustrated grunt as a
## spark flew from his latest creation, a small,
## gear-driven automaton designed to assist with
## forging. The stout dwarf wiped soot from his brow,
## his thick fingers carefully adjusting a valve
## on the automaton's chest. "Blasted thing," he
## muttered, tightening a bolt.
## 
## His underground workshop, carved deep into the
## rocky heart of the mountain, was cluttered with
## gears, wrenches, and half-finished inventions. Dax
## had no interest in adventure—he had sworn long ago
## to let others chase glory while he built marvels
## in the safety of his forge. But fate, as it turned
## out, had other plans. ...

Here, I’ve only printed the first few lines of the story. If you want to read the full story, you can do so here.

4.2 Extract data

Now, in order to ask ChatGPT about specifics of our story, we can create a chat with a suitable system prompt:

library(ellmer)
chat <- chat_openai(
  system_prompt = "
  You are an information extractor AI.
  The user will give you a story and you will 
  provide the following information in the response:
  
  Name of main character: <Insert name here>
  
  Names of supporting characters: <Insert comma-
    separated list of names here>
  
  One-Sentence Summary: <Insert summary here. 
    Use one sentence only>
  
  Lesson learned: <Summarize what the hero 
    learned in two sentences>
  "
)
## Using model = "gpt-4o".

And now, we can feed in the story as a user prompt.

txt_full_story <- read_lines('dat/story.txt') |> 
  paste(collapse = '\n')
chat$chat(txt_full_story) 
## Name of main character: Dax Ironfist
## 
## Names of supporting characters: Ryn
## 
## One-Sentence Summary: Dax Ironfist, a reluctant inventor, teams up with a young
## adventurer named Ryn to recover his rogue automaton within ancient dwarven 
## tunnels, learning the value of his creations beyond personal projects.
## 
## Lesson learned: Dax learned that his inventions have the potential to aid 
## others beyond his own confines and that there is more to his work than solitary
## craftsmanship. By collaborating with Ryn, he discovers a sense of purpose in 
## using his skills to contribute to a larger community.

Just so that you can read the actual reply, here’s also the reply in a narrower format. Don’t worry about the code here.

chat$last_turn('assistant')@text |> 
  stringr::str_split_1('\n') |> 
  stringr::str_wrap(width = 50) |> 
  paste(collapse = '\n') |> 
  cat()
## Name of main character: Dax Ironfist
## 
## Names of supporting characters: Ryn
## 
## One-Sentence Summary: Dax Ironfist, a reluctant
## inventor, teams up with a young adventurer named
## Ryn to recover his rogue automaton within ancient
## dwarven tunnels, learning the value of his
## creations beyond personal projects.
## 
## Lesson learned: Dax learned that his inventions
## have the potential to aid others beyond his own
## confines and that there is more to his work than
## solitary craftsmanship. By collaborating with
## Ryn, he discovers a sense of purpose in using his
## skills to contribute to a larger community.

4.3 Use a structure

Sweet. This worked pretty great. But the thing with LLMs is that we can never be sure that it will actually adhere to this specific format.

For example, it’s easy to imagine that the LLM might also add introductory sentences like “Sure, here’s your data”. That might not be great if we rely on the structure of the reply.

Enter structured outputs. With the extract_data() method, we can enforce a certain output. For that, let’s create a new chat with a shorter system prompt.

chat <- chat_openai(
  system_prompt = "
  You are an information extractor AI.
  The user will give you a story and you will 
  provide the requested information in the response.
  "
)
## Using model = "gpt-4o".

Then, we can use the extract_data() method. Inside of this function, we can specify the so-called type. That’s the desired output format. It is specified using type_*() functions.

output_structure <- type_object(
  name_main = type_string(
    description = 'Name of main character'
  ),
  name_support = type_string(
    description = 'Names of supporting characters, 
    comma-separated list'
  ),
  summary = type_string(
    description = 'One-Sentence Summary'
  ),
  lesson_learned = type_string(
    description = 'Lesson learned, summarize what 
    the hero learned in two sentences'
  )
)

And with this, we can use the extract_data() method in combination with the story text.

chat$extract_data(
  txt_full_story, 
  type = output_structure
)
## $name_main
## [1] "Dax Ironfist"
## 
## $name_support
## [1] "Ryn"
## 
## $summary
## [1] "Dax Ironfist, a dwarf inventor, chases after his rogue automaton through ancient dwarven tunnels and teams up with a young adventurer named Ryn to retrieve it, ultimately discovering a new purpose for his inventions."
## 
## $lesson_learned
## [1] "Dax learned that his inventing skills have potential beyond his own needs and can help others. By working with Ryn and overcoming the automaton's threat, Dax realized that adventure could help him see his inventions in a new light, sparking a desire to contribute to a greater good."

Sweet! Now, we have an R list that we can work with.

4.4 Access the data

And in case you’re wondering how to get to the data from the chat object itself, here’s how to do it. First, you have to get the last turn.

last_turn <- chat$last_turn('assistant')
last_turn
## <ellmer::Turn>
##  @ role    : chr "assistant"
##  @ contents:List of 1
##  .. $ : <ellmer::ContentJson>
##  ..  ..@ value:List of 1
##  .. .. .. $ wrapper:List of 4
##  .. .. ..  ..$ name_main     : chr "Dax Ironfist"
##  .. .. ..  ..$ name_support  : chr "Ryn"
##  .. .. ..  ..$ summary       : chr "Dax Ironfist, a dwarf inventor, chases after his rogue automaton through ancient dwarven tunnels and teams up w"| __truncated__
##  .. .. ..  ..$ lesson_learned: chr "Dax learned that his inventing skills have potential beyond his own needs and can help others. By working with "| __truncated__
##  @ json    :List of 8
##  .. $ id                : chr "chatcmpl-AyoKyFN6K2W0VNradLu1bv7zvUb3A"
##  .. $ object            : chr "chat.completion"
##  .. $ created           : int 1739056540
##  .. $ model             : chr "gpt-4o-2024-08-06"
##  .. $ choices           :List of 1
##  ..  ..$ :List of 4
##  ..  .. ..$ index        : int 0
##  ..  .. ..$ message      :List of 3
##  ..  .. .. ..$ role   : chr "assistant"
##  ..  .. .. ..$ content: chr "{\"wrapper\":{\"name_main\":\"Dax Ironfist\",\"name_support\":\"Ryn\",\"summary\":\"Dax Ironfist, a dwarf inven"| __truncated__
##  ..  .. .. ..$ refusal: NULL
##  ..  .. ..$ logprobs     : NULL
##  ..  .. ..$ finish_reason: chr "stop"
##  .. $ usage             :List of 5
##  ..  ..$ prompt_tokens            : int 1321
##  ..  ..$ completion_tokens        : int 132
##  ..  ..$ total_tokens             : int 1453
##  ..  ..$ prompt_tokens_details    :List of 2
##  ..  .. ..$ cached_tokens: int 1152
##  ..  .. ..$ audio_tokens : int 0
##  ..  ..$ completion_tokens_details:List of 4
##  ..  .. ..$ reasoning_tokens          : int 0
##  ..  .. ..$ audio_tokens              : int 0
##  ..  .. ..$ accepted_prediction_tokens: int 0
##  ..  .. ..$ rejected_prediction_tokens: int 0
##  .. $ service_tier      : chr "default"
##  .. $ system_fingerprint: chr "fp_4691090a87"
##  @ tokens  : int [1:2] 1321 132
##  @ text    : chr ""

And from there you can follow the hierarchy all the way down to wrapper deep inside the contents field. Just make sure that you use @, $ and brackets appropriately.

last_turn@contents[[1]]@value$wrapper
## $name_main
## [1] "Dax Ironfist"
## 
## $name_support
## [1] "Ryn"
## 
## $summary
## [1] "Dax Ironfist, a dwarf inventor, chases after his rogue automaton through ancient dwarven tunnels and teams up with a young adventurer named Ryn to retrieve it, ultimately discovering a new purpose for his inventions."
## 
## $lesson_learned
## [1] "Dax learned that his inventing skills have potential beyond his own needs and can help others. By working with Ryn and overcoming the automaton's threat, Dax realized that adventure could help him see his inventions in a new light, sparking a desire to contribute to a greater good."

And of course, we could turn this thing into a tibble.

last_turn@contents[[1]]@value$wrapper |> 
  dplyr::as_tibble()
## # A tibble: 1 × 4
##   name_main    name_support summary                               lesson_learned
##   <chr>        <chr>        <chr>                                 <chr>         
## 1 Dax Ironfist Ryn          Dax Ironfist, a dwarf inventor, chas… Dax learned t…

4.5 Named entitity recognition

Alright, let’s do one more example. Let’s try to extract all named entities from this story. This is a common thing people want to do in text analysis.

Here, we can do this by creating another type. For now, let’s just go with name and description of the entity.

type_entity <- type_object(
  .description = 'A named entity that was found 
  in our story',
  name = type_string('Name of the entity'),
  desc = type_string('One sentence description 
                     of entity')
)

But of course we don’t want to find just a single entity. We want our LLM to extract all entities. Thus, let’s make sure that we ask our LLM to extract an array (or “vector” as we’d say in the R world).

chat$extract_data(
  txt_full_story, 
  type = type_array(
    # items is the argument name of 
    # the array content type
    items = type_entity
  )
)
##                       name
## 1             Dax Ironfist
## 2                      Ryn
## 3    Gear-driven Automaton
## 4 Dwarven Stronghold Ruins
## 5            Ancient Forge
##                                                                                                                                                                               desc
## 1 A stout dwarf and reluctant inventor focused on building automated creations.\n                    His latest invention is a gear-driven automaton meant to assist with forging.
## 2        A young adventurer clad in mismatched armor in search of treasure who assists\n                    Dax in navigating the dangerous ruins where his automaton had run off.
## 3                 A small mechanical creation built by Dax to aid in forging but which goes\n                    rogue, causing trouble by trying to integrate with ancient magic.
## 4                       Ancient tunnels of a collapsed dwarven settlement filled with traps and\n                    remnants of the past, where Dax chases his runaway automaton.
## 5        A grand chamber within the dwarven ruins, containing a forge with pulsing runes\n                    that the automaton attempts to interact with, feeding off old magic.

And as always I prefer a tibble. Here, we have multiple rows (with two columns each), so we need to use bind_rows() to turn this into a tibble.

last_turn <- chat$last_turn('assistant')
last_turn@contents[[1]]@value$wrapper |> 
  bind_rows()
## # A tibble: 5 × 2
##   name                     desc                                                 
##   <chr>                    <chr>                                                
## 1 Dax Ironfist             "A stout dwarf and reluctant inventor focused on bui…
## 2 Ryn                      "A young adventurer clad in mismatched armor in sear…
## 3 Gear-driven Automaton    "A small mechanical creation built by Dax to aid in …
## 4 Dwarven Stronghold Ruins "Ancient tunnels of a collapsed dwarven settlement f…
## 5 Ancient Forge            "A grand chamber within the dwarven ruins, containin…

4.6 Adding classification

Finally, let’s complete this example by also showing you how these types can be used for automations. The trick is to modify our type_entity with yet another entry using type_enum().

This particular type corresponds to a categorical variable. It’s what you’d call a factor in the R world.

type_entity <- type_object(
  .description = 'A named entity that was found
  in our story',
  name = type_string('Name of the entity'),
  type = type_enum(
    description = 'Classification of entity',
    values = c('Protagonist', 'Antagonist', 'Other')
  ),
  desc = type_string('One sentence description 
                     of entity')
)

And with that, we can run our entity recognition again.

chat$extract_data(
  txt_full_story, 
  type = type_array(
    items = type_entity
  )
)
last_turn <- chat$last_turn('assistant')
last_turn@contents[[1]]@value$wrapper |> 
  bind_rows()
# # A tibble: 3 × 3
#   name         type        desc  
#   <chr>        <chr>       <chr>
# 1 Dax Ironfist Protagonist A stout dwarf inventor who prefers the safety o…
# 2 Ryn          Other       A young, enthusiastic adventurer clad in mismat…
# 3 Automaton    Other       Dax's creation, a small, gear-driven machine de…

Nice! So with that you should have a good idea of how structured outputs work. If you’re looking for more examples, then the {ellmer} docs have a few more fantastic examples.


Enjoyed this chapter?

Here are three other ways I can help you:

3 Minutes Wednesdays

Every week, I share bite-sized R tips & tricks. Reading time less than 3 minutes. Delivered straight to your inbox. You can sign up for free weekly tips online.

Data Cleaning With R Master Class

This in-depth video course teaches you everything you need to know about becoming better & more efficient at cleaning up messy data. This includes Excel & JSON files, text data and working with times & dates. If you want to get better at data cleaning, check out the course page.

Insightful Data Visualizations for "Uncreative" R Users

This video course teaches you how to leverage {ggplot2} to make charts that communicate effectively without being a design expert. Course information can be found on the course page.