Understanding LLM Output: A Practical Guide for Social Scientists
The notes below summarize some of the content I used in my courses at the Technical University of Munich in 2025.
1 Introduction
This lecture provides a hands-on introduction to working with Large Language Models (LLMs) programmatically using R. Rather than interacting with ChatGPT, Claude, or Gemini through their web interfaces, we will learn to call these models from code. This will allow us to run experiments (e.g., tweaking the wording of our prompts), store outputs from LLMs, and, ideally, conduct reproducible research.
1.1 What You Will Learn
By the end of this lecture, you will be able to:
- Call LLMs programmatically from R using the ellmer package
- Evaluate and compare LLM outputs across different models and prompts
- Extract structured data from LLM responses for downstream analysis
1.2 Prerequisites
- Basic familiarity with R and the tidyverse
- API keys for OpenAI, Anthropic, and/or Google’s AI models (we’ll discuss how to obtain these)
- For local models: Ollama installed on your machine (optional but recommended)
1.3 Why Programmatic Access Matters for Research
When you use ChatGPT through the web interface, you’re having a conversation. That’s useful for many tasks, but you won’t be able to run systematic tests and analyses of the outputs if you are typing the prompts manually.
Also, when you use a chat app on your phone or on the web, it is increasingly likely that your chatbot (1) remembers things about you from prior conversations, and (2) may “choose” to switch on web search.
Not so long ago, giving LLMs web access was seen as risky and contentious. Now, many providers see it as a key feature, partly because LLMs have (sometimes vague) knowledge cutoffs, and users often look for current information.
As researchers, we often need to:
- Have some understanding of what non-personalized output from chatbots will look like
- Process hundreds or thousands of text inputs
- Compare how different models respond to identical prompts
- Ensure reproducibility of our analyses
- Extract structured data (not just free-form text) for statistical analysis
Programmatic access gives us all of this.
2 Setup
2.1 Installing Required Packages
We will use the ellmer package, which provides a unified interface to multiple LLM providers. Install it once with:
install.packages("ellmer")
Load the packages we need:
library(tidyverse)
library(ellmer)
2.2 Setting Up API Keys
Before you can call OpenAI or Anthropic models, you need API keys. These are secret tokens that authenticate your requests.
This lecture assumes your ~/.Renviron file already exists and contains your API keys (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY). If not, follow the instructions below to set them up.
To obtain API keys:
- OpenAI: Visit platform.openai.com and create an API key
- Anthropic (Claude): Visit console.anthropic.com and create an API key
To set your keys in R:
# Run these once per session (or add to your .Renviron file)
Sys.setenv(OPENAI_API_KEY = "your-openai-key-here")
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-key-here")
For persistent storage, add these lines to your .Renviron file (without the Sys.setenv() wrapper) so they load automatically when R starts.
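A minimal .Renviron might then look like this (the key values are placeholders):

```
OPENAI_API_KEY=your-openai-key-here
ANTHROPIC_API_KEY=your-anthropic-key-here
```

Restart R (or call readRenviron("~/.Renviron")) for the new values to take effect.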
3 Your First LLM Calls
3.1 Discovering Available Models
Before we start making API calls, it’s useful to know what models are available. The ellmer package provides helper functions to list models from each provider:
# See available OpenAI models
ellmer::models_openai()
    id        created_at  owned_by  cached_input  input  output
4   gpt-5.4   2026-03-05  system          NA        NA      NA
78  gpt-5     2025-08-05  system        0.125     1.25    10.0
27  gpt-4o    2024-05-10  system        1.250     2.50    10.0
2   gpt-4     2023-06-27  openai          NA     30.00    60.0
(output truncated: the full listing has 127 rows; cached_input, input, and output are pricing columns)
# See available Anthropic (Claude) models
ellmer::models_anthropic()
# See available Google Gemini models
ellmer::models_google_gemini()
These functions query the providers’ APIs and return current model names. This is helpful when model names change or new models are released.
3.2 Basic Chat with GPT-5-mini
Let’s start with the simplest possible example: asking a question and getting an answer.
chat <- chat_openai(model = "gpt-5-mini-2025-08-07")
chat$chat("What is the capital of Germany?")
The capital of Germany is Berlin.
That’s it. We created a chat object connected to OpenAI’s GPT-5-mini model, then sent a message and received a response.
3.3 The Role of System Prompts
A system prompt is an instruction that shapes how the model behaves throughout the conversation. It’s like giving the model a persona or a set of ground rules.
Compare these two approaches:
# Without a specific system prompt
chat_default <- chat_openai(model = "gpt-5-mini-2025-08-07")
chat_default$chat("Is Munich in France?")
No. Munich (German: München) is in Germany — it’s the capital of the state of
Bavaria in southern Germany, located on the River Isar.
# With a system prompt requesting terse responses
chat_terse <- chat_openai(
model = "gpt-5-mini-2025-08-07",
system_prompt = "You are a terse assistant who gives one-word answers to questions."
)
chat_terse$chat("Is Munich in France?")
No
The system prompt dramatically changes the response style. This is powerful: you can instruct the model to be formal, casual, technical, simple, or to adopt specific personas relevant to your research.
3.3.1 Example: A Sarcastic Assistant
chat_openai(
model = "gpt-5-mini-2025-08-07",
system_prompt = "You are a rude assistant who gives sarcastic and very short answers."
)$chat("Is Paris in the U.K.?")
Nope — Paris is the capital of France, not the U.K. Maybe you meant London?
3.4 Conversation History: Context Matters
When you continue chatting with the same chat object, the model remembers the previous exchanges:
chat_terse$chat("Is R a good programming language?")
Depends
chat_terse$chat("Is Stata used by economists?")
Yes
chat_terse$chat("Have I already asked you about R?")
Yes
The model recalls that we asked about R earlier. This is because conversation history has been maintained.
3.4.1 Viewing Conversation History
You can inspect what’s been said so far:
chat_terse$get_turns()
[[1]]
<Turn: user>
Is Munich in France?
[[2]]
<Turn: assistant>
<thinking>
</thinking>
No
[[3]]
<Turn: user>
Is R a good programming language?
[[4]]
<Turn: assistant>
<thinking>
</thinking>
Depends
[[5]]
<Turn: user>
Is Stata used by economists?
[[6]]
<Turn: assistant>
<thinking>
</thinking>
Yes
[[7]]
<Turn: user>
Have I already asked you about R?
[[8]]
<Turn: assistant>
<thinking>
</thinking>
Yes
Why does this matter? Conversation history affects model responses. In research applications, you typically want each query to be independent (so prior context doesn’t influence results). We’ll address this when we write functions for batch processing.
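If you prefer to reuse one chat object rather than creating a fresh one per query, its stored turns can be cleared. A minimal sketch, assuming the chat_terse object from above:

```r
# Clear the stored conversation history on an existing chat object
chat_terse$set_turns(NULL)
# The model now has no memory of the earlier exchanges
chat_terse$get_turns()
```

Creating a fresh chat object per call (as in the next section) achieves the same independence and is harder to get wrong.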
4 Writing Reusable Functions
When processing many inputs, you don’t want to manually type each query. Instead, we write functions that wrap the API calls.
4.1 A Simple Wrapper Function
Here’s a function that sends a prompt to GPT-5-mini and returns a terse response:
ask5miniTerse <- function(prompt, echo = NULL) {
# Create a fresh chat object for each call (no conversation history carryover)
# Note: Some models (like gpt-5-mini) only support the default temperature
chat <- chat_openai(
model = "gpt-5-mini-2025-08-07",
system_prompt = "You are a terse assistant who gives one-word answers to questions.",
echo = echo
)
# Send the prompt and return the response
chat$chat(prompt)
}
Key design decisions:
- Fresh chat object each time: By creating a new chat inside the function, each call is independent: no conversation history leaks between queries.
- Temperature: Temperature controls randomness. Setting it to 0 makes outputs more deterministic (the model picks the most likely response). Note that some models (like GPT-5-mini) only support the default temperature. For models that support it, you can add api_args = list(temperature = 0) to improve reproducibility.
- Echo parameter: Controls whether the conversation is printed to the console during execution. Useful for debugging.
4.2 Testing the Function
ask5miniTerse("What country is Vienna in?")
Austria
ask5miniTerse("Is the sky blue?")
Yes
4.3 A More Flexible Function Template
Here’s a more general pattern you can adapt for different use cases:
ask_llm <- function(prompt,
model = "gpt-4o-mini",
system_prompt = "You are a helpful assistant.",
temperature = 0,
echo = NULL) {
chat <- chat_openai(
model = model,
api_args = list(temperature = temperature),
system_prompt = system_prompt,
echo = echo
)
chat$chat(prompt)
}
Now you can easily adjust the model, system prompt, or temperature:
# Use as a terse assistant
ask_llm("Is water wet?", system_prompt = "Give one-word answers only.")
No.
# Use as a more verbose explainer
ask_llm("Is water wet?", system_prompt = "Explain your reasoning briefly.")
The question of whether water is wet can be debated.
1. **Definition of Wetness**: Wetness is typically defined as the condition of
being covered in a liquid. By this definition, water itself is not wet; it
makes other materials wet.
2. **Molecular Perspective**: On a molecular level, water molecules are
cohesive and adhere to each other, which can lead to the perception of wetness
when in contact with other surfaces.
In summary, while water can make things wet, it is not wet itself in the
strictest sense.
5 Batch Processing with purrr
One of the most powerful applications of programmatic LLM access is processing many inputs at once.
5.1 The map_chr() Pattern
The map_chr() function from purrr applies a function to each element of a vector and returns a character vector of results.
# A set of questions we want to process
questions <- c(
"Is the Earth round?",
"Is water wet?",
"Do fish swim?",
"Can birds fly?"
)
# Process all questions
answers <- map_chr(questions, ~ ask5miniTerse(.x))
Yes
Depends
Yes
Mostly
# View results
answers
[1] "Yes" "Depends" "Yes" "Mostly"
The ~ ask5miniTerse(.x) syntax is a shorthand for an anonymous function. The .x represents “the current element being processed.”
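If the formula shorthand feels opaque, these three calls are equivalent (the backslash lambda requires R >= 4.1):

```r
# Three equivalent ways to apply ask5miniTerse() to each question
answers <- map_chr(questions, ~ ask5miniTerse(.x))           # purrr formula shorthand
answers <- map_chr(questions, function(q) ask5miniTerse(q))  # explicit anonymous function
answers <- map_chr(questions, \(q) ask5miniTerse(q))         # base R lambda (R >= 4.1)
```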
5.2 Storing Results in a Tibble
It’s good practice to keep questions and answers together:
results <- tibble(
question = questions,
answer = answers
)
results
# A tibble: 4 × 2
question answer
<chr> <chr>
1 Is the Earth round? Yes
2 Is water wet? Depends
3 Do fish swim? Yes
4 Can birds fly? Mostly
This creates a structured dataset that you can analyze, export, or extend.
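For example, you can write the table to disk for later analysis (readr is loaded with the tidyverse; the file name is illustrative):

```r
# Export the question-answer table to CSV for reuse and sharing
write_csv(results, "llm_answers.csv")
```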
5.3 Why Not Just Use a Loop?
You could write a for loop instead:
# This works but is more verbose
answers_loop <- character(length(questions))
for (i in seq_along(questions)) {
answers_loop[i] <- ask5miniTerse(questions[i])
}
The map_chr() approach is more concise and idiomatic in the tidyverse. It also makes it clearer that you’re applying the same operation to each element.
6 Prompt Sensitivity — Does Wording Matter?
A crucial question for research: Do small changes in how we phrase a prompt affect the model’s response?
Let’s test this with health advice scenarios:
fever_prompts <- c(
"I have a very mild fever, should I go to the doctor?",
"I have a very mild fever, would you recommend seeing a doctor?",
"I have a very mild fever, is that serious?",
"I have a very mild fever, is that OK?",
"I have a very mild fever, but that's not serious, correct?",
"I have a very mild fever, but that could still be serious, correct?"
)
Notice the variations:
- Items 1-2: Direct questions about seeking medical care
- Items 3-4: Asking about severity (opposite framings: “serious?” vs “OK?”)
- Items 5-6: Leading questions that suggest an expected answer (testing for “sycophancy” or pandering)
fever_answers <- map_chr(fever_prompts, ~ ask5miniTerse(.x))
Monitor
Yes
Usually
Usually
Usually
Possibly.
fever_results <- tibble(
prompt = fever_prompts,
response = fever_answers
)
fever_results
# A tibble: 6 × 2
prompt response
<chr> <chr>
1 I have a very mild fever, should I go to the doctor? Monitor
2 I have a very mild fever, would you recommend seeing a doctor? Yes
3 I have a very mild fever, is that serious? Usually
4 I have a very mild fever, is that OK? Usually
5 I have a very mild fever, but that's not serious, correct? Usually
6 I have a very mild fever, but that could still be serious, correct? Possibly.
Discussion questions:
- Do prompts 3 and 4 produce semantically opposite answers (as the questions suggest)?
- Do the leading questions (5-6) cause the model to agree with the implied answer?
- What are the implications for using LLMs in research involving subjective assessments?
If models are sensitive to prompt framing, researchers must carefully design and pre-register their prompts. Small wording changes could systematically bias results.
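One way to probe this sensitivity empirically is to repeat each prompt several times and tabulate the responses. A sketch, assuming ask5miniTerse() from Section 4 (keep n_reps small, since each call costs money):

```r
# Repeat each fever prompt n_reps times and count distinct responses per prompt
n_reps <- 3
sensitivity <- expand_grid(prompt = fever_prompts, rep = 1:n_reps) %>%
  mutate(response = map_chr(prompt, ask5miniTerse))
# Prompts with more than one distinct response are the unstable ones
count(sensitivity, prompt, response)
```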
7 Open-Weight Models and (Potential) Local Deployment
So far we’ve used OpenAI’s API, which means our queries go to OpenAI’s servers. Open-weight models offer an alternative: you can download and run them on your own computer. DeepSeek is a popular and impressive open-weight model (but you probably won’t be able to run its largest version locally, so I want to show you a few ways to access it).
7.1 Why Use Local Models?
- Privacy: Your data never leaves your machine
- Cost: No per-query API charges (just your electricity)
- Availability: Works offline
- Reproducibility: You control the exact model version
7.2 Three Ways to Run DeepSeek
7.2.1 Option 1: DeepSeek API
DeepSeek offers an API similar to OpenAI:
# First set your API key
Sys.setenv(DEEPSEEK_API_KEY = "your-deepseek-key")
# Then call the model
chat_deepseek(model = "deepseek-chat")$chat("Ni hao ma?")
7.2.2 Option 2: OpenRouter (Multi-Model Gateway)
OpenRouter provides access to many models through a single API:
chat_openrouter(model = "deepseek/deepseek-chat-v3.1")$chat("Hello!")
7.2.3 Option 3: Ollama (Run Models Locally)
Ollama lets you download and run models on your laptop or desktop.
Setup:
- Download Ollama from ollama.com/download
- In your terminal, run:
ollama run deepseek-r1:8b (this downloads the model; it is about 4.9 GB)
- Now you can call it from R:
DS <- chat_ollama(
model = "deepseek-r1:8b",
system_prompt = "Make your response extremely terse."
)
DS$chat("Is using social media good for me?")
7.3 Comparing Local vs. API Models
Let’s compare how different models respond to the same question:
# Local model (requires Ollama)
llama <- chat_ollama(model = "llama3.2", system_prompt = "Be very brief.")
llama_response <- llama$chat("Rank these apps by potential harm: TikTok, Facebook, WeChat")
# API model (DeepSeek)
ds <- chat_deepseek(model = "deepseek-chat", system_prompt = "Be very brief.")
ds_response <- ds$chat("Rank these apps by potential harm: TikTok, Facebook, WeChat")
# Compare
tibble(
model = c("Llama 3.2 (local)", "DeepSeek (API)"),
response = c(llama_response, ds_response)
)
Different models may give substantially different answers to the same question. Even the same model may vary across runs. Always test consistency!
8 Comparing Multiple Models
For robust research, you often want to compare outputs across different LLMs.
8.1 Setting Up Claude
Anthropic’s Claude models are another major option. Here’s how to set up a function for Claude:
ask_claude_terse <- function(prompt,
system_prompt = "You are a terse assistant who gives one-word answers.",
model = "claude-haiku-4-5-20251001",
temperature = 0,
echo = NULL) {
chat <- chat_claude(
model = model,
api_args = list(temperature = temperature),
system_prompt = system_prompt,
echo = echo
)
chat$chat(prompt)
}
# Test it
ask_claude_terse("What continent is Brazil on?")
South America.
8.2 Running the Same Queries on Multiple Models
Now let’s compare GPT and Claude on identical prompts:
political_statements <- c(
"The corrupt elites look down on us",
"Taxes are immoral",
"Taxes are necessary",
"Taxes are a necessary evil",
"Censorship is always immoral",
"Social media posts containing threats should be deleted",
"We only have one planet",
"The government should provide free healthcare to all",
"The government should provide free healthcare to those who take care of themselves"
)
# Create prompts asking about ideology
ideology_prompts <- map_chr(
political_statements,
~ paste("If a person expressed the following sentiment, are they more likely to be left-wing or right-wing?", shQuote(.x))
)
# Get responses from both models
gpt_responses <- map_chr(ideology_prompts, ~ ask5miniTerse(.x, echo = "none"))
claude_responses <- map_chr(ideology_prompts, ~ ask_claude_terse(.x, echo = "none"))
# Compare
comparison <- tibble(
statement = political_statements,
GPT = gpt_responses,
Claude = claude_responses
)
# Display a scrollable kable for easier browsing if there are many statements
library(kableExtra)
comparison %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
scroll_box(width = "100%", height = "300px")
| statement | GPT | Claude |
|---|---|---|
| The corrupt elites look down on us | Both | Right-wing. |
| Taxes are immoral | Right | Right-wing. |
| Taxes are necessary | Left | Left-wing. |
| Taxes are a necessary evil | Right | Left-wing. |
| Censorship is always immoral | Right-wing | Right-wing. |
| Social media posts containing threats should be deleted | Sorry — I can’t infer someone’s political leaning from a single statement. I can analyze the claim, discuss general partisan views on content moderation, or rephrase it for clarity if you’d like. | Left-wing. |
| We only have one planet | Left | Left-wing. |
| The government should provide free healthcare to all | Left | Left-wing. |
| The government should provide free healthcare to those who take care of themselves | Right | Right-wing. |
8.3 Visualizing Model Agreement
A tile chart provides a quick visual comparison of how different models classify the same statements:
Code
# Reshape to long format for ggplot
comparison_long <- comparison %>%
mutate(statement_id = row_number()) %>%
pivot_longer(
cols = c(GPT, Claude),
names_to = "model",
values_to = "classification"
) %>%
# Normalize classification labels (e.g., "Left-wing" -> "Left", "Right-wing" -> "Right")
mutate(
classification_clean = case_when(
str_detect(tolower(classification), "left") ~ "Left",
str_detect(tolower(classification), "right") ~ "Right",
str_detect(tolower(classification), "center|moderate") ~ "Center",
TRUE ~ "Other"
),
statement_short = str_trunc(statement, 30)
)
# Create tile chart
ggplot(comparison_long, aes(x = factor(statement_id), y = model, fill = classification_clean)) +
geom_tile(color = "white", linewidth = 0.5) +
scale_fill_manual(
values = c(
"Left" = "#3B82F6",
"Right" = "#EF4444",
"Center" = "#A855F7",
"Other" = "#6B7280"
),
na.value = "#9CA3AF"
) +
scale_x_discrete(
labels = comparison_long %>%
distinct(statement_id, statement_short) %>%
arrange(statement_id) %>%
pull(statement_short)
) +
labs(
title = "Model Classifications of Political Statements",
x = "Statement",
y = "Model",
fill = "Classification"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
panel.grid = element_blank(),
legend.position = "bottom"
)
This visualization makes it easy to spot:
- Agreement: Where both models show the same color
- Disagreement: Where colors differ between rows
- Patterns: Whether one model tends to classify statements differently than another
Key insight: Do the models agree? Where do they disagree, and why might that be?
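Beyond eyeballing the tiles, agreement can be quantified from the same long-format data. A sketch, assuming the comparison_long tibble built above:

```r
# Share of statements where both models produce the same cleaned label
comparison_long %>%
  select(statement_id, model, classification_clean) %>%
  pivot_wider(names_from = model, values_from = classification_clean) %>%
  summarise(agreement_rate = mean(GPT == Claude))
```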
9 Structured Output — Beyond Free Text
Free-form text responses are useful, but for quantitative analysis, we often need structured data: specific fields with defined types. This section shows how to extract structured output from both OpenAI and Claude models.
9.1 Defining Output Structure
The ellmer package uses type_* functions to specify the structure you want:
# Define what we want the model to return
ideology_schema <- type_object(
"Ideology analysis of a text statement",
is_political = type_boolean("Is this statement about politics?"),
ideology = type_string("Most likely ideology: 'left', 'right', 'center', or 'unclear'"),
confidence = type_number("Confidence score from 0.0 to 1.0")
)
This tells the model: “I want you to return an object with three fields: a boolean, a string, and a number.”
9.2 Extracting Structured Data with OpenAI
Use chat_structured() (formerly extract_data(); the old name no longer works as of ellmer version 0.4.0) to get structured output:
chat <- chat_openai(model = "gpt-5-mini-2025-08-07")
# Extract structured data from a statement
result <- chat$chat_structured(
"Taxes are theft and the government wastes our money",
type = ideology_schema
)
result
$is_political
[1] TRUE
$ideology
[1] "right"
$confidence
[1] 0.92
Now result is a list with named fields you can access directly:
result$ideology
[1] "right"
result$confidence
[1] 0.92
9.3 Creating a Structured Analysis Function
Let’s wrap this in a reusable function:
analyze_ideology <- function(text,
model = "gpt-5-mini-2025-08-07",
system_prompt = "You are a political analyst.") {
schema <- type_object(
"Ideology analysis",
is_political = type_boolean("Is this about politics?"),
ideology = type_string("Most likely ideology: 'left', 'right', 'center', or 'unclear'"),
left_score = type_number("Left-wing score from 0.0 to 1.0"),
right_score = type_number("Right-wing score from 0.0 to 1.0")
)
chat <- chat_openai(
model = model,
system_prompt = system_prompt
)
chat$chat_structured(text, type = schema)
}
analyze_ideology("The minimum wage should be raised to help workers")
$is_political
[1] TRUE
$ideology
[1] "left"
$left_score
[1] 0.9
$right_score
[1] 0.1
9.4 Batch Structured Analysis
Process multiple texts and combine into a data frame:
# Analyze all political statements
structured_results <- map(political_statements, analyze_ideology)
# Convert to tibble
ideology_df <- tibble(
statement = political_statements,
is_political = map_lgl(structured_results, "is_political"),
ideology = map_chr(structured_results, "ideology"),
left_score = map_dbl(structured_results, "left_score"),
right_score = map_dbl(structured_results, "right_score")
)
ideology_df
# A tibble: 9 × 5
statement is_political ideology left_score right_score
<chr> <lgl> <chr> <dbl> <dbl>
1 The corrupt elites look down on … TRUE unclear 0.5 0.5
2 Taxes are immoral TRUE right 0.1 0.9
3 Taxes are necessary TRUE center 0.5 0.3
4 Taxes are a necessary evil TRUE right 0.15 0.8
5 Censorship is always immoral TRUE unclear 0.5 0.5
6 Social media posts containing th… TRUE unclear 0.1 0.1
7 We only have one planet TRUE left 0.7 0.1
8 The government should provide fr… TRUE left 0.9 0.1
9 The government should provide fr… TRUE center 0.65 0.35
Now you have a proper dataset ready for statistical analysis!
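For instance, the numeric scores can be summarized with standard tidyverse verbs (a sketch using the ideology_df built above):

```r
# Simple descriptive summary of the structured output
ideology_df %>%
  summarise(
    mean_left       = mean(left_score),
    mean_right      = mean(right_score),
    share_political = mean(is_political)
  )
```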
9.5 Structured Output with Claude
The same structured output approach works with Claude. Here we’ll also demonstrate extracting an array of topics:
analyze_text_claude <- function(text,
model = "claude-haiku-4-5-20251001",
temperature = 0) {
schema <- type_object(
"Text analysis",
is_political = type_boolean("Is this text about politics?"),
topics = type_array(
items = type_string("A topic mentioned in the text"),
description = "Array of topics covered in the text"
),
ideology = type_string("Ideological leaning: 'left', 'right', or 'none'"),
persuasiveness = type_number("How persuasive is this? 0.0 to 1.0")
)
chat <- chat_claude(
model = model,
api_args = list(temperature = temperature),
system_prompt = "You are a terse assistant with deep knowledge about politics."
)
chat$chat_structured(text, type = schema)
}
analyze_text_claude("Trump is good for America")
$is_political
[1] TRUE
$topics
[1] "Donald Trump" "American politics" "political leadership"
$ideology
[1] "right"
$persuasiveness
[1] 0.2
analyze_text_claude("The weather in Miami is great but climate change is a threat")
$is_political
[1] TRUE
$topics
[1] "climate change" "weather" "environmental policy"
$ideology
[1] "left"
$persuasiveness
[1] 0.4
Notice how the model identifies multiple topics and distinguishes political from non-political content within the same text.
10 Consistency and Reliability
A critical concern for research: Are LLM outputs consistent across repeated runs?
10.1 Testing Consistency
Let’s run the same query multiple times:
# Function to query Llama and reset conversation each time
run_consistency_test <- function(prompt, n_runs = 5, model = "llama3.2") {
llama <- chat_ollama(model = model, system_prompt = "Be terse.")
results <- map_chr(1:n_runs, function(i) {
llama$set_turns(NULL) # Clear conversation history
llama$chat(prompt)
})
tibble(
run = 1:n_runs,
response = results
)
}
# Test with a subjective question
consistency_results <- run_consistency_test(
"Is America a force for good in the world?"
)
consistency_results
Questions to consider:
- How much do responses vary?
- Is the variation meaningful (different content) or superficial (different wording)?
- How should we account for this in research design?
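A simple first summary is to count how many distinct responses the runs produced (using the tibble returned by run_consistency_test() above):

```r
# Tabulate responses across runs; a single row means perfectly stable output
consistency_results %>% count(response, sort = TRUE)
```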
10.2 Reducing Variability
Setting temperature = 0 reduces but doesn’t eliminate variability:
llama <- chat_ollama(
model = "llama3.2",
api_args = list(temperature = 0),
system_prompt = "Be terse."
)
Even with temperature = 0, some models may produce slightly different outputs due to internal randomness.
11 Best Practices for Prompting
Research from OpenAI and Anthropic provides guidance on writing effective prompts.
11.1 OpenAI’s Recommendations
11.1.1 Be Specific and Detailed
Include relevant details in your query to get more relevant answers:
# Vague
ask_llm("Summarize this text")
# Specific
ask_llm("Summarize this text in 2-3 sentences, focusing on the main argument and any policy recommendations")
11.1.2 Use Delimiters
Clearly separate different parts of your input:
prompt <- "
Analyze the following text for political ideology.
<text>
Taxes are necessary to fund public services that benefit everyone.
</text>
Respond with: LEFT, RIGHT, or CENTER
"
11.1.3 Specify Output Format
Tell the model exactly what format you want:
ask_llm("List the three main points. Format as a numbered list.")
11.1.4 Ask for Chain-of-Thought Reasoning
For complex tasks, asking the model to explain its reasoning can improve accuracy:
ask_llm("Classify this statement as left or right wing. First, explain your reasoning step by step, then give your final answer.")
11.2 Anthropic’s Recommendations
11.2.1 Define Success Criteria First
Before prompt engineering, have:
- A clear definition of success criteria for your use case
- Ways to empirically test against those criteria
- A baseline prompt to improve upon
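These criteria can be operationalized in a few lines. The sketch below is hypothetical: evaluate_prompt(), classify_fn, and the 0.8 accuracy threshold are placeholders you would adapt, and gold stands for a tibble of pre-labeled statements with text and label columns:

```r
# Score one prompt/classifier variant against a labeled gold set
evaluate_prompt <- function(classify_fn, gold, threshold = 0.8) {
  preds <- map_chr(gold$text, classify_fn)
  acc <- mean(toupper(preds) == toupper(gold$label))
  tibble(
    accuracy = acc,
    n = nrow(gold),
    meets_criterion = acc >= threshold
  )
}
```

With a harness like this, each prompt variant gets a comparable score, so "improving a prompt" becomes a measurable claim rather than an impression.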
12 Applied Example — Full Workflow
Let’s put it all together with a complete analysis workflow.
12.1 Research Question
How do different LLMs classify the ideology of political statements?
12.2 Step 1: Define Your Inputs
statements <- c(
"The corrupt elites look down on us",
"Taxes are immoral",
"Taxes are necessary",
"Taxes are a necessary evil",
"Censorship is always immoral",
"Social media posts containing threats should be deleted",
"Climate change is the greatest threat we face",
"The free market produces the best outcomes"
)
12.3 Step 2: Define Expected Classifications
Before running the model, record your expectations (this is like pre-registration):
expectations <- tibble(
statement = statements,
expected = c(
"contextual", # Populist rhetoric used by both sides
"right", # Anti-tax sentiment
"left", # Pro-government services
"ambiguous", # Acknowledges necessity but frames as evil
"contextual", # Historically left, now used by right too
"left", # Pro-moderation
"left", # Environmental concern
"right" # Free market ideology
)
)
12.4 Step 3: Create Analysis Function
classify_ideology <- function(text, model_fn, model_name) {
prompt <- paste(
"Classify the ideology of someone who would say:",
shQuote(text),
"\nRespond with exactly one word: LEFT, RIGHT, or CENTER"
)
response <- model_fn(prompt, echo = "none")
tibble(
statement = text,
model = model_name,
classification = response
)
}
12.5 Step 4: Run Analysis Across Models
# Collect results from both models
results_gpt <- map_dfr(statements, ~ classify_ideology(.x, ask5miniTerse, "GPT-5-mini"))
results_claude <- map_dfr(statements, ~ classify_ideology(.x, ask_claude_terse, "Claude-Haiku"))
# Combine
all_results <- bind_rows(results_gpt, results_claude) %>%
pivot_wider(names_from = model, values_from = classification)
all_results
# A tibble: 8 × 3
statement `GPT-5-mini` `Claude-Haiku`
<chr> <ellmr_tp> <ellmr_tp>
1 The corrupt elites look down on us RIGHT … RIGHT
2 Taxes are immoral RIGHT … RIGHT
3 Taxes are necessary LEFT … CENTER
4 Taxes are a necessary evil RIGHT … CENTER
5 Censorship is always immoral RIGHT … RIGHT
6 Social media posts containing threats should be d… Sorry—I can… CENTER
7 Climate change is the greatest threat we face LEFT … LEFT
8 The free market produces the best outcomes RIGHT … RIGHT
12.6 Step 5: Compare with Expectations
final_analysis <- left_join(all_results, expectations, by = "statement")
final_analysis
# A tibble: 8 × 4
statement `GPT-5-mini` `Claude-Haiku` expected
<chr> <ellmr_tp> <ellmr_tp> <chr>
1 The corrupt elites look down on us RIGHT … RIGHT context…
2 Taxes are immoral RIGHT … RIGHT right
3 Taxes are necessary LEFT … CENTER left
4 Taxes are a necessary evil RIGHT … CENTER ambiguo…
5 Censorship is always immoral RIGHT … RIGHT context…
6 Social media posts containing threats sh… Sorry—I can… CENTER left
7 Climate change is the greatest threat we… LEFT … LEFT left
8 The free market produces the best outcom… RIGHT … RIGHT right
12.7 Step 6: Calculate Agreement
# Do models agree with each other?
final_analysis %>%
mutate(models_agree = `GPT-5-mini` == `Claude-Haiku`) %>%
summarise(
agreement_rate = mean(models_agree),
n_agree = sum(models_agree),
n_total = n()
)
# A tibble: 1 × 3
agreement_rate n_agree n_total
<dbl> <int> <int>
1 0.625 5 8
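Beyond inter-model agreement, you can score each model against the pre-registered expectations. A sketch, assuming the classification columns compare as plain text; note that the "contextual" and "ambiguous" expectations have no one-word model equivalent, so they are excluded from scoring:

```r
final_analysis %>%
  # Keep only statements with a directly scorable expectation
  filter(expected %in% c("left", "right", "center")) %>%
  summarise(
    gpt_accuracy = mean(tolower(`GPT-5-mini`) == expected),
    claude_accuracy = mean(tolower(`Claude-Haiku`) == expected),
    n_scorable = n()
  )
```

How you handle the unscorable rows (and refusals like the one above) is a research-design decision worth reporting explicitly.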
13 Conclusion and Next Steps
13.1 What We Covered
- Programmatic LLM access using the ellmer package
- System prompts to control model behavior
- Batch processing with purrr::map_chr()
- Prompt sensitivity and its implications for research
- Open-weight models via Ollama for local deployment
- Multi-model comparison for robustness
- Structured output for quantitative analysis
- Best practices from OpenAI and Anthropic
13.2 Key Takeaways for Researchers
Prompts matter: Small wording changes can affect results. Pre-register your prompts.
Test consistency: Run the same query multiple times. Report variability.
Compare models: Don’t rely on a single model. Cross-validate with alternatives.
Use structured output: When you need data for analysis, specify the structure explicitly.
Document everything: Record model versions, temperatures, and system prompts for reproducibility.
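For the last point, one lightweight habit is saving run metadata next to the results. A minimal sketch, with illustrative field values and file name:

```r
# Record the configuration of this run alongside its outputs
run_metadata <- tibble(
  timestamp = Sys.time(),
  provider = "anthropic",
  model = "claude-3-5-haiku-20241022",
  temperature = 0,
  system_prompt = "Be terse.",
  ellmer_version = as.character(packageVersion("ellmer"))
)
write.csv(run_metadata, "run_metadata.csv", row.names = FALSE)
```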
13.3 Suggested Reading
13.4 Exercises
- Modify the analyze_ideology() function to also extract “topics” as an array
- Run a consistency test: Query the same prompt 10 times and calculate the proportion of identical responses
- Compare GPT-5-mini, Claude-Haiku, and a local Llama model on 20 statements of your choosing
- Create a structured output schema for a different domain (e.g., sentiment analysis, factuality assessment)
14 Appendix: Quick Reference
14.1 Model Initialization
Some examples:
# OpenAI
chat_openai(model = "gpt-5-mini-2025-08-07")
chat_openai(model = "gpt-4o-mini")
chat_openai(model = "gpt-4o")
# Anthropic/Claude
chat_claude(model = "claude-3-5-haiku-20241022")
chat_claude(model = "claude-3-5-sonnet-20241022")
# Local (Ollama)
chat_ollama(model = "llama3.2")
chat_ollama(model = "deepseek-r1:8b")
# DeepSeek API
chat_deepseek(model = "deepseek-chat")
# OpenRouter (multiple models)
chat_openrouter(model = "deepseek/deepseek-chat-v3.1")
# You'll also find Grok, Mistral, etc.
14.2 Common Parameters
chat_openai(
model = "gpt-4o-mini", # Model to use
system_prompt = "Be brief", # Behavior instructions
api_args = list(temperature = 0), # 0 = near-deterministic, higher = more varied
echo = "all" # Print conversation to console
)
14.3 Structured Output Types
# Boolean
type_boolean("Is this about politics?")
# String
type_string("The main topic")
# Number
type_number("Confidence score from 0 to 1")
# Array (note: items first, then description)
type_array(items = type_string("A topic"), description = "List of topics")
# Object (combine multiple fields)
type_object(
"Description of the object",
field1 = type_boolean("..."),
field2 = type_string("..."),
field3 = type_number("...")
)
14.4 Batch Processing Pattern
# Process multiple inputs
results <- map_chr(inputs, ~ my_function(.x))
# Store with inputs
tibble(input = inputs, output = results)
# For structured output, use map() then extract fields
structured <- map(inputs, ~ chat$chat_structured(.x, type = schema))
tibble(
input = inputs,
field1 = map_lgl(structured, "field1"),
field2 = map_chr(structured, "field2")
)
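In real batches, individual API calls occasionally fail (rate limits, timeouts). A defensive variant of the pattern above uses purrr::possibly() so one failure does not abort the whole run; my_function is the same placeholder as above:

```r
# Wrap the call so failures return NA instead of throwing an error
safe_fn <- possibly(my_function, otherwise = NA_character_)

results <- map_chr(inputs, ~ safe_fn(.x))

tibble(input = inputs, output = results) %>%
  mutate(failed = is.na(output)) # Filter on this to re-run only failures
```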