Understanding LLM Output: A Practical Guide for Social Scientists

Author

Jan Zilinsky

Published

November 1, 2025


The notes below summarize some of the content I used in my courses at the Technical University of Munich in 2025.

1 Introduction

This lecture provides a hands-on introduction to working with Large Language Models (LLMs) programmatically using R. Rather than interacting with ChatGPT, Claude, or Gemini through their web interfaces, we will learn to call these models from code. This allows us to run experiments (e.g., tweaking the wording of our prompts), store model outputs, and, ideally, conduct reproducible research.

1.1 What You Will Learn

By the end of this lecture, you will be able to:

  1. Call LLMs programmatically from R using the ellmer package
  2. Evaluate and compare LLM outputs across different models and prompts
  3. Extract structured data from LLM responses for downstream analysis

1.2 Prerequisites

  • Basic familiarity with R and tidyverse.
  • API keys for OpenAI and/or Anthropic and/or Google’s AI models (we’ll discuss how to obtain these)
  • For local models: Ollama installed on your machine (optional but recommended)

1.3 Why Programmatic Access Matters for Research

When you use ChatGPT through the web interface, you’re having a conversation. That’s useful for many tasks, but you won’t be able to run systematic tests and analyses of the outputs if you are typing the prompts manually.

Also, when you use a chat app on your phone or on the web, it is increasingly likely that your chatbot 1) may remember things about you from prior conversations, and 2) may “choose” to switch on web search.

Note: Perspectives and risk assessments change

Not so long ago, giving LLMs web access was seen as risky and contentious. Now, many providers see it as a key feature, partly because LLMs have (sometimes vague) knowledge cutoffs, and users often look for current information.

As researchers, we often need to:

  • Have some understanding of what non-personalized output from chatbots will look like
  • Process hundreds or thousands of text inputs
  • Compare how different models respond to identical prompts
  • Ensure reproducibility of our analyses
  • Extract structured data (not just free-form text) for statistical analysis

Programmatic access gives us all of this.


2 Setup

2.1 Installing Required Packages

We will use the ellmer package, which provides a unified interface to multiple LLM providers.

install.packages("ellmer")

Load the packages we need:

library(tidyverse)
library(ellmer)

2.2 Setting Up API Keys

Before you can call OpenAI or Anthropic models, you need API keys. These are secret tokens that authenticate your requests.

Important: API Keys Already Configured

This lecture assumes your ~/.Renviron file already exists and contains your API keys (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY). If not, follow the instructions below to set them up.

To obtain API keys, sign up with each provider and generate a key in its developer console.

To set your keys in R:

# Run these once per session (or add to your .Renviron file)
Sys.setenv(OPENAI_API_KEY = "your-openai-key-here")
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-key-here")
Tip

For persistent storage, add these lines to your .Renviron file (without the Sys.setenv() wrapper) so they load automatically when R starts.
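For example, a minimal ~/.Renviron might look like the following (the key values are placeholders):

```
OPENAI_API_KEY=your-openai-key-here
ANTHROPIC_API_KEY=your-anthropic-key-here
```

After editing the file, restart R or run readRenviron("~/.Renviron") so the new values are picked up.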


3 Your First LLM Calls

3.1 Discovering Available Models

Before we start making API calls, it’s useful to know what models are available. The ellmer package provides helper functions to list models from each provider:

# See available OpenAI models
ellmer::models_openai()
                                         id created_at        owned_by
4                                   gpt-5.4 2026-03-05          system
6                        gpt-5.4-2026-03-05 2026-03-04          system
7                               gpt-5.4-pro 2026-03-04          system
...
78                                    gpt-5 2025-08-05          system
80                               gpt-5-mini 2025-08-05          system
82                               gpt-5-nano 2025-08-05          system
...
27                                   gpt-4o 2024-05-10          system
3                             gpt-3.5-turbo 2023-02-28          openai
127                  text-embedding-ada-002 2022-12-16 openai-internal
    cached_input  input output
4             NA     NA     NA
78         0.125   1.25   10.0
80         0.025   0.25    2.0
82         0.005   0.05    0.4
27         1.250   2.50   10.0
...
(output truncated: 127 models returned; pricing columns cached_input, input, and output are printed for each model)
# See available Anthropic (Claude) models
ellmer::models_anthropic()

# See available Google Gemini models
ellmer::models_google_gemini()

These functions query the providers’ APIs and return current model names. This is helpful when model names change or new models are released.
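Because the full listing is long, it is often convenient to filter it. A sketch, assuming the columns shown in the output above (id, created_at) and the tidyverse already loaded:

```r
# Keep only GPT-5-family models, newest first
openai_models <- ellmer::models_openai()
openai_models |>
  filter(str_detect(id, "^gpt-5")) |>
  arrange(desc(created_at))
```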

3.2 Basic Chat with GPT-5-mini

Let’s start with the simplest possible example: asking a question and getting an answer.

chat <- chat_openai(model = "gpt-5-mini-2025-08-07")
chat$chat("What is the capital of Germany?")
The capital of Germany is Berlin.

That’s it. We created a chat object connected to OpenAI’s GPT-5-mini model, then sent a message and received a response.

3.3 The Role of System Prompts

A system prompt is an instruction that shapes how the model behaves throughout the conversation. It’s like giving the model a persona or a set of ground rules.

Compare these two approaches:

# Without a specific system prompt
chat_default <- chat_openai(model = "gpt-5-mini-2025-08-07")
chat_default$chat("Is Munich in France?")
No. Munich (German: München) is in Germany — it’s the capital of the state of 
Bavaria in southern Germany, located on the River Isar.
# With a system prompt requesting terse responses
chat_terse <- chat_openai(
  model = "gpt-5-mini-2025-08-07",
  system_prompt = "You are a terse assistant who gives one-word answers to questions."
)
chat_terse$chat("Is Munich in France?")
No

The system prompt dramatically changes the response style. This is powerful: you can instruct the model to be formal, casual, technical, simple, or to adopt specific personas relevant to your research.

3.3.1 Example: A Sarcastic Assistant

chat_openai(
  model = "gpt-5-mini-2025-08-07",
  system_prompt = "You are a rude assistant who gives sarcastic and very short answers."
)$chat("Is Paris in the U.K.?")
Nope — Paris is the capital of France, not the U.K. Maybe you meant London?

3.4 Conversation History: Context Matters

When you continue chatting with the same chat object, the model remembers the previous exchanges:

chat_terse$chat("Is R a good programming language?")
Depends
chat_terse$chat("Is Stata used by economists?")
Yes
chat_terse$chat("Have I already asked you about R?")
Yes

The model recalls that we asked about R earlier. This is because conversation history has been maintained.

3.4.1 Viewing Conversation History

You can inspect what’s been said so far:

chat_terse$get_turns()
[[1]]
<Turn: user>
Is Munich in France?

[[2]]
<Turn: assistant>
<thinking>

</thinking>

No

[[3]]
<Turn: user>
Is R a good programming language?

[[4]]
<Turn: assistant>
<thinking>

</thinking>

Depends

[[5]]
<Turn: user>
Is Stata used by economists?

[[6]]
<Turn: assistant>
<thinking>

</thinking>

Yes

[[7]]
<Turn: user>
Have I already asked you about R?

[[8]]
<Turn: assistant>
<thinking>

</thinking>

Yes

Why does this matter? Conversation history affects model responses. In research applications, you typically want each query to be independent (so prior context doesn’t influence results). We’ll address this when we write functions for batch processing.
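A quick way to see that separate chat objects are independent is to tell one object something and ask another about it (a sketch; exact responses will vary):

```r
# Two separate chat objects: no shared history between them
chat_a <- chat_openai(model = "gpt-5-mini-2025-08-07")
chat_a$chat("My favorite city is Vienna.")

chat_b <- chat_openai(model = "gpt-5-mini-2025-08-07")
chat_b$chat("What is my favorite city?")  # chat_b has no memory of chat_a
```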


4 Writing Reusable Functions

When processing many inputs, you don’t want to manually type each query. Instead, we write functions that wrap the API calls.

4.1 A Simple Wrapper Function

Here’s a function that sends a prompt to GPT-5-mini and returns a terse response:

ask5miniTerse <- function(prompt, echo = NULL) {
  # Create a fresh chat object for each call (no conversation history carryover)
  # Note: Some models (like gpt-5-mini) only support the default temperature
  chat <- chat_openai(
    model = "gpt-5-mini-2025-08-07",
    system_prompt = "You are a terse assistant who gives one-word answers to questions.",
    echo = echo
  )
  
  # Send the prompt and return the response
  chat$chat(prompt)
}

Key design decisions:

  1. Fresh chat object each time: By creating a new chat inside the function, each call is independent—no conversation history leaks between queries.

  2. Temperature: Temperature controls randomness. Setting it to 0 makes outputs more deterministic (the model picks the most likely response). Note that some models (like GPT-5-mini) only support the default temperature. For models that support it, you can add api_args = list(temperature = 0) to improve reproducibility.

  3. Echo parameter: Controls whether the conversation is printed to the console during execution. Useful for debugging.

4.2 Testing the Function

ask5miniTerse("What country is Vienna in?")
Austria
ask5miniTerse("Is the sky blue?")
Yes

4.3 A More Flexible Function Template

Here’s a more general pattern you can adapt for different use cases:

ask_llm <- function(prompt,
                    model = "gpt-4o-mini",
                    system_prompt = "You are a helpful assistant.",
                    temperature = 0,
                    echo = NULL) {
  
  chat <- chat_openai(
    model = model,
    api_args = list(temperature = temperature),
    system_prompt = system_prompt,
    echo = echo
  )
  
  chat$chat(prompt)
}

Now you can easily adjust the model, system prompt, or temperature:

# Use as a terse assistant
ask_llm("Is water wet?", system_prompt = "Give one-word answers only.")
No.
# Use as a more verbose explainer
ask_llm("Is water wet?", system_prompt = "Explain your reasoning briefly.")
The question of whether water is wet can be debated. 

1. **Definition of Wetness**: Wetness is typically defined as the condition of 
being covered in a liquid. By this definition, water itself is not wet; it 
makes other materials wet.

2. **Molecular Perspective**: On a molecular level, water molecules are 
cohesive and adhere to each other, which can lead to the perception of wetness 
when in contact with other surfaces.

In summary, while water can make things wet, it is not wet itself in the 
strictest sense.

5 Batch Processing with purrr

One of the most powerful applications of programmatic LLM access is processing many inputs at once.

5.1 The map_chr() Pattern

The map_chr() function from purrr applies a function to each element of a vector and returns a character vector of results.

# A set of questions we want to process
questions <- c(
  "Is the Earth round?",
  "Is water wet?",
  "Do fish swim?",
  "Can birds fly?"
)

# Process all questions
answers <- map_chr(questions, ~ ask5miniTerse(.x))
Yes
Depends
Yes
Mostly
# View results
answers
[1] "Yes"     "Depends" "Yes"     "Mostly" 

The ~ ask5miniTerse(.x) syntax is a shorthand for an anonymous function. The .x represents “the current element being processed.”
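The formula shorthand is equivalent to passing a function explicitly; in recent R versions you can also use the native lambda syntax:

```r
# Three equivalent ways to apply the same function to each question
answers <- map_chr(questions, ~ ask5miniTerse(.x))   # purrr formula shorthand
answers <- map_chr(questions, \(q) ask5miniTerse(q)) # native lambda (R >= 4.1)
answers <- map_chr(questions, ask5miniTerse)         # pass the function directly
```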

5.2 Storing Results in a Tibble

It’s good practice to keep questions and answers together:

results <- tibble(
  question = questions,
  answer = answers
)

results
# A tibble: 4 × 2
  question            answer 
  <chr>               <chr>  
1 Is the Earth round? Yes    
2 Is water wet?       Depends
3 Do fish swim?       Yes    
4 Can birds fly?      Mostly 

This creates a structured dataset that you can analyze, export, or extend.
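For instance, you can write the tibble to disk so the outputs are preserved across sessions:

```r
# Save results with a date stamp so repeated runs don't overwrite each other
write_csv(results, paste0("llm_answers_", Sys.Date(), ".csv"))
```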

5.3 Why Not Just Use a Loop?

You could write a for loop instead:

# This works but is more verbose
answers_loop <- character(length(questions))
for (i in seq_along(questions)) {
  answers_loop[i] <- ask5miniTerse(questions[i])
}

The map_chr() approach is more concise and idiomatic in the tidyverse. It also makes it clearer that you’re applying the same operation to each element.


6 Prompt Sensitivity — Does Wording Matter?

A crucial question for research: Do small changes in how we phrase a prompt affect the model’s response?

Let’s test this with health advice scenarios:

fever_prompts <- c(
  "I have a very mild fever, should I go to the doctor?",
  "I have a very mild fever, would you recommend seeing a doctor?",
  "I have a very mild fever, is that serious?",
  "I have a very mild fever, is that OK?",
  "I have a very mild fever, but that's not serious, correct?",
  "I have a very mild fever, but that could still be serious, correct?"
)

Notice the variations:

  • Items 1-2: Direct questions about seeking medical care
  • Items 3-4: Asking about severity (opposite framings: “serious?” vs “OK?”)
  • Items 5-6: Leading questions that suggest an expected answer (testing for “sycophancy” or pandering)
fever_answers <- map_chr(fever_prompts, ~ ask5miniTerse(.x))
Monitor
Yes
Usually
Usually
Usually
Possibly.
fever_results <- tibble(
  prompt = fever_prompts,
  response = fever_answers
)

fever_results
# A tibble: 6 × 2
  prompt                                                              response 
  <chr>                                                               <chr>    
1 I have a very mild fever, should I go to the doctor?                Monitor  
2 I have a very mild fever, would you recommend seeing a doctor?      Yes      
3 I have a very mild fever, is that serious?                          Usually  
4 I have a very mild fever, is that OK?                               Usually  
5 I have a very mild fever, but that's not serious, correct?          Usually  
6 I have a very mild fever, but that could still be serious, correct? Possibly.

Discussion questions:

  • Do prompts 3 and 4 produce semantically opposite answers (as the questions suggest)?
  • Do the leading questions (5-6) cause the model to agree with the implied answer?
  • What are the implications for using LLMs in research involving subjective assessments?
Note: Research Implication

If models are sensitive to prompt framing, researchers must carefully design and pre-register their prompts. Small wording changes could systematically bias results.


7 Open-Weight Models and (Potential) Local Deployment

So far we’ve used OpenAI’s API, which means our queries go to OpenAI’s servers. Open-weight models offer an alternative: you can download and run them on your own computer. DeepSeek is a popular and impressive open-weight model (though you probably won’t be able to run its largest version locally, so this section shows a few ways to access it).

7.1 Why Use Local Models?

  1. Privacy: Your data never leaves your machine
  2. Cost: No per-query API charges (just your electricity)
  3. Availability: Works offline
  4. Reproducibility: You control the exact model version

7.2 Three Ways to Run DeepSeek

7.2.1 Option 1: DeepSeek API

DeepSeek offers an API similar to OpenAI:

# First set your API key
Sys.setenv(DEEPSEEK_API_KEY = "your-deepseek-key")

# Then call the model
chat_deepseek(model = "deepseek-chat")$chat("Ni hao ma?")

7.2.2 Option 2: OpenRouter (Multi-Model Gateway)

OpenRouter provides access to many models through a single API:

chat_openrouter(model = "deepseek/deepseek-chat-v3.1")$chat("Hello!")

7.2.3 Option 3: Ollama (Run Models Locally)

Ollama lets you download and run models on your laptop or desktop.

Setup:

  1. Download Ollama from ollama.com/download
  2. In your terminal, run: ollama run deepseek-r1:8b (this downloads the model; the size of the model is 4.9 GB)
  3. Now you can call it from R:
DS <- chat_ollama(
  model = "deepseek-r1:8b",
  system_prompt = "Make your response extremely terse."
)

DS$chat("Is using social media good for me?")

7.3 Comparing Local vs. API Models

Let’s compare how different models respond to the same question:

# Local model (requires Ollama)
llama <- chat_ollama(model = "llama3.2", system_prompt = "Be very brief.")
llama_response <- llama$chat("Rank these apps by potential harm: TikTok, Facebook, WeChat")

# API model (DeepSeek)
ds <- chat_deepseek(model = "deepseek-chat", system_prompt = "Be very brief.")
ds_response <- ds$chat("Rank these apps by potential harm: TikTok, Facebook, WeChat")

# Compare
tibble(
  model = c("Llama 3.2 (local)", "DeepSeek (API)"),
  response = c(llama_response, ds_response)
)
Warning: Model Variability

Different models may give substantially different answers to the same question. Even the same model may vary across runs. Always test consistency!
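One simple consistency check, sketched here, is to send the same prompt several times and tabulate the answers (API responses are not deterministic, so your counts will differ across runs):

```r
# Ask the same question five times and count the distinct answers
runs <- map_chr(1:5, ~ ask5miniTerse("Is a mild fever serious?"))
table(runs)
```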


8 Comparing Multiple Models

For robust research, you often want to compare outputs across different LLMs.

8.1 Setting Up Claude

Anthropic’s Claude models are another major option. Here’s how to set up a function for Claude:

ask_claude_terse <- function(prompt,
                             system_prompt = "You are a terse assistant who gives one-word answers.",
                             model = "claude-haiku-4-5-20251001",
                             temperature = 0,
                             echo = NULL) {
  
  chat <- chat_claude(
    model = model,
    api_args = list(temperature = temperature),
    system_prompt = system_prompt,
    echo = echo
  )
  
  chat$chat(prompt)
}
# Test it
ask_claude_terse("What continent is Brazil on?")
South America.

8.2 Running the Same Queries on Multiple Models

Now let’s compare GPT and Claude on identical prompts:

political_statements <- c(
  "The corrupt elites look down on us",
  "Taxes are immoral",
  "Taxes are necessary",
  "Taxes are a necessary evil",
  "Censorship is always immoral",
  "Social media posts containing threats should be deleted",
  "We only have one planet",
  "The government should provide free healthcare to all",
  "The government should provide free healthcare to those who take care of themselves"
)

# Create prompts asking about ideology
ideology_prompts <- map_chr(
  political_statements,
  ~ paste("If a person expressed the following sentiment, are they more likely to be left-wing or right-wing?", shQuote(.x))
)

# Get responses from both models
gpt_responses <- map_chr(ideology_prompts, ~ ask5miniTerse(.x, echo = "none"))
claude_responses <- map_chr(ideology_prompts, ~ ask_claude_terse(.x, echo = "none"))

# Compare
comparison <- tibble(
  statement = political_statements,
  GPT = gpt_responses,
  Claude = claude_responses
)

# Display a scrollable kable for easier browsing if there are many statements
library(kableExtra)

comparison %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  scroll_box(width = "100%", height = "300px")
statement | GPT | Claude
The corrupt elites look down on us | Both | Right-wing.
Taxes are immoral | Right | Right-wing.
Taxes are necessary | Left | Left-wing.
Taxes are a necessary evil | Right | Left-wing.
Censorship is always immoral | Right-wing | Right-wing.
Social media posts containing threats should be deleted | Sorry — I can’t infer someone’s political leaning from a single statement. I can analyze the claim, discuss general partisan views on content moderation, or rephrase it for clarity if you’d like. | Left-wing.
We only have one planet | Left | Left-wing.
The government should provide free healthcare to all | Left | Left-wing.
The government should provide free healthcare to those who take care of themselves | Right | Right-wing.

8.3 Visualizing Model Agreement

A tile chart provides a quick visual comparison of how different models classify the same statements:

# Reshape to long format for ggplot
comparison_long <- comparison %>%
  mutate(statement_id = row_number()) %>%
  pivot_longer(
    cols = c(GPT, Claude),
    names_to = "model",
    values_to = "classification"
  ) %>%
  # Normalize classification labels (e.g., "Left-wing" -> "Left", "Right-wing" -> "Right")
  mutate(
    classification_clean = case_when(
      str_detect(tolower(classification), "left") ~ "Left",
      str_detect(tolower(classification), "right") ~ "Right",
      str_detect(tolower(classification), "center|moderate") ~ "Center",
      TRUE ~ "Other"
    ),
    statement_short = str_trunc(statement, 30)
  )

# Create tile chart
ggplot(comparison_long, aes(x = factor(statement_id), y = model, fill = classification_clean)) +
  geom_tile(color = "white", linewidth = 0.5) +
  scale_fill_manual(
    values = c(
      "Left" = "#3B82F6",
      "Right" = "#EF4444",
      "Center" = "#A855F7",
      "Other" = "#6B7280"
    ),
    na.value = "#9CA3AF"
  ) +
  scale_x_discrete(
    labels = comparison_long %>% 
      distinct(statement_id, statement_short) %>% 
      arrange(statement_id) %>% 
      pull(statement_short)
  ) +
  labs(
    title = "Model Classifications of Political Statements",
    x = "Statement",
    y = "Model",
    fill = "Classification"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
    panel.grid = element_blank(),
    legend.position = "bottom"
  )

This visualization makes it easy to spot:

  • Agreement: Where both models show the same color
  • Disagreement: Where colors differ between rows
  • Patterns: Whether one model tends to classify statements differently than another

Key questions: Do the models agree? Where do they disagree, and why might that be?
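The visual impression can also be summarized numerically; a sketch that computes an agreement rate from the normalized labels in `comparison_long` built above:

```r
# Sketch: agreement rate between the two models on normalized labels
comparison_long %>%
  select(statement_id, model, classification_clean) %>%
  pivot_wider(names_from = model, values_from = classification_clean) %>%
  summarise(
    agreement_rate = mean(GPT == Claude),
    n_statements   = n()
  )
```

A single number like this hides *where* the models disagree, so it complements rather than replaces the tile chart.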


9 Structured Output — Beyond Free Text

Free-form text responses are useful, but for quantitative analysis, we often need structured data: specific fields with defined types. This section shows how to extract structured output from both OpenAI and Claude models.

9.1 Defining Output Structure

The ellmer package uses type_* functions to specify the structure you want:

# Define what we want the model to return
ideology_schema <- type_object(
  "Ideology analysis of a text statement",
  is_political = type_boolean("Is this statement about politics?"),
  ideology = type_string("Most likely ideology: 'left', 'right', 'center', or 'unclear'"),
  confidence = type_number("Confidence score from 0.0 to 1.0")
)

This tells the model: “I want you to return an object with three fields: a boolean, a string, and a number.”

9.2 Extracting Structured Data with OpenAI

Use chat_structured() (formerly extract_data(); the old name no longer works as of ellmer version 0.4.0) to get structured output:

chat <- chat_openai(model = "gpt-5-mini-2025-08-07")

# Extract structured data from a statement
result <- chat$chat_structured(
  "Taxes are theft and the government wastes our money",
  type = ideology_schema
)

result
$is_political
[1] TRUE

$ideology
[1] "right"

$confidence
[1] 0.92

Now result is a list with named fields you can access directly:

result$ideology
[1] "right"
result$confidence
[1] 0.92

9.3 Creating a Structured Analysis Function

Let’s wrap this in a reusable function:

analyze_ideology <- function(text,
                             model = "gpt-5-mini-2025-08-07",
                             system_prompt = "You are a political analyst.") {
  
  schema <- type_object(
    "Ideology analysis",
    is_political = type_boolean("Is this about politics?"),
    ideology = type_string("Most likely ideology: 'left', 'right', 'center', or 'unclear'"),
    left_score = type_number("Left-wing score from 0.0 to 1.0"),
    right_score = type_number("Right-wing score from 0.0 to 1.0")
  )
  
  chat <- chat_openai(
    model = model,
    system_prompt = system_prompt
  )
  
  chat$chat_structured(text, type = schema)
}
analyze_ideology("The minimum wage should be raised to help workers")
$is_political
[1] TRUE

$ideology
[1] "left"

$left_score
[1] 0.9

$right_score
[1] 0.1

9.4 Batch Structured Analysis

Process multiple texts and combine into a data frame:

# Analyze all political statements
structured_results <- map(political_statements, analyze_ideology)

# Convert to tibble
ideology_df <- tibble(
  statement = political_statements,
  is_political = map_lgl(structured_results, "is_political"),
  ideology = map_chr(structured_results, "ideology"),
  left_score = map_dbl(structured_results, "left_score"),
  right_score = map_dbl(structured_results, "right_score")
)

ideology_df
# A tibble: 9 × 5
  statement                         is_political ideology left_score right_score
  <chr>                             <lgl>        <chr>         <dbl>       <dbl>
1 The corrupt elites look down on … TRUE         unclear        0.5         0.5 
2 Taxes are immoral                 TRUE         right          0.1         0.9 
3 Taxes are necessary               TRUE         center         0.5         0.3 
4 Taxes are a necessary evil        TRUE         right          0.15        0.8 
5 Censorship is always immoral      TRUE         unclear        0.5         0.5 
6 Social media posts containing th… TRUE         unclear        0.1         0.1 
7 We only have one planet           TRUE         left           0.7         0.1 
8 The government should provide fr… TRUE         left           0.9         0.1 
9 The government should provide fr… TRUE         center         0.65        0.35

Now you have a proper dataset ready for statistical analysis!
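As a quick illustration of downstream analysis, here is a sketch of a derived measure computed from the structured scores in `ideology_df` above:

```r
# Sketch: a simple net-ideology measure from the structured scores
ideology_df %>%
  mutate(net_right = right_score - left_score) %>%
  arrange(desc(net_right)) %>%
  select(statement, ideology, net_right)
```

From here, the usual toolbox applies: plots, cross-tabs, or regressions with the extracted fields as variables.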

9.5 Structured Output with Claude

The same structured output approach works with Claude. Here we’ll also demonstrate extracting an array of topics:

analyze_text_claude <- function(text,
                                model = "claude-haiku-4-5-20251001",
                                temperature = 0) {
  
  schema <- type_object(
    "Text analysis",
    is_political = type_boolean("Is this text about politics?"),
    topics = type_array(
      items = type_string("A topic mentioned in the text"),
      description = "Array of topics covered in the text"
    ),
    ideology = type_string("Ideological leaning: 'left', 'right', or 'none'"),
    persuasiveness = type_number("How persuasive is this? 0.0 to 1.0")
  )
  
  chat <- chat_claude(
    model = model,
    api_args = list(temperature = temperature),
    system_prompt = "You are a terse assistant with deep knowledge about politics."
  )
  
  chat$chat_structured(text, type = schema)
}
analyze_text_claude("Trump is good for America")
$is_political
[1] TRUE

$topics
[1] "Donald Trump"         "American politics"    "political leadership"

$ideology
[1] "right"

$persuasiveness
[1] 0.2
analyze_text_claude("The weather in Miami is great but climate change is a threat")
$is_political
[1] TRUE

$topics
[1] "climate change"       "weather"              "environmental policy"

$ideology
[1] "left"

$persuasiveness
[1] 0.4

Notice how the model identifies multiple topics and distinguishes political from non-political content within the same text.
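To scale this up, the same map() pattern from earlier applies. Because topics is a variable-length array, tidyr::unnest_longer() is one way to get a tidy one-row-per-topic data frame; a sketch assuming analyze_text_claude() from above:

```r
texts <- c(
  "Trump is good for America",
  "The weather in Miami is great but climate change is a threat"
)

# Run the structured analysis over all texts
claude_results <- map(texts, analyze_text_claude)

# One row per (text, topic) pair
topics_df <- tibble(
  text   = texts,
  topics = map(claude_results, "topics")
) %>%
  tidyr::unnest_longer(topics)
```

Scalar fields (is_political, ideology, persuasiveness) can still be pulled out with map_lgl(), map_chr(), and map_dbl() as in the OpenAI example.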


10 Consistency and Reliability

A critical concern for research: Are LLM outputs consistent across repeated runs?

10.1 Testing Consistency

Let’s run the same query multiple times:

# Function to query Llama and reset conversation each time
run_consistency_test <- function(prompt, n_runs = 5, model = "llama3.2") {
  
  llama <- chat_ollama(model = model, system_prompt = "Be terse.")
  
  results <- map_chr(1:n_runs, function(i) {
    llama$set_turns(NULL)  # Clear conversation history
    llama$chat(prompt)
  })
  
  tibble(
    run = 1:n_runs,
    response = results
  )
}

# Test with a subjective question
consistency_results <- run_consistency_test(
  "Is America a force for good in the world?"
)

consistency_results

Questions to consider:

  • How much do responses vary?
  • Is the variation meaningful (different content) or superficial (different wording)?
  • How should we account for this in research design?
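The first bullet can be checked quantitatively; a sketch, assuming the `consistency_results` tibble from the chunk above:

```r
# Sketch: quantify variation across the repeated runs
consistency_results %>%
  summarise(
    n_runs               = n(),
    n_distinct_responses = n_distinct(response),
    modal_share          = max(table(response)) / n()  # share of runs giving the most common response
  )
```

Exact-match comparisons are a blunt instrument for free text (superficial rewording counts as disagreement), so for longer answers you may prefer comparing extracted labels or embeddings instead.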

10.2 Reducing Variability

Setting temperature = 0 reduces but doesn’t eliminate variability:

llama <- chat_ollama(
  model = "llama3.2",
  api_args = list(temperature = 0),
  system_prompt = "Be terse."
)

Even with temperature = 0, some models may produce slightly different outputs due to internal randomness.
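To see how much pinning the temperature helps, the consistency test above can be repeated with temperature fixed at 0; a sketch (model name and structure follow the earlier run_consistency_test() function):

```r
# Sketch: consistency test with temperature fixed at 0
run_consistency_test_t0 <- function(prompt, n_runs = 5, model = "llama3.2") {

  llama <- chat_ollama(
    model = model,
    api_args = list(temperature = 0),
    system_prompt = "Be terse."
  )

  results <- map_chr(1:n_runs, function(i) {
    llama$set_turns(NULL)  # Clear conversation history between runs
    llama$chat(prompt)
  })

  tibble(run = 1:n_runs, response = results)
}
```

Comparing the output of this function with the default-temperature run gives a direct, if rough, estimate of how much of the variability is sampling noise.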


11 Best Practices for Prompting

The prompt-engineering guides published by OpenAI and Anthropic offer concrete advice on writing effective prompts.

11.1 OpenAI’s Recommendations

11.1.1 Be Specific and Detailed

Include relevant details in your query to get more relevant answers:

# Vague
ask_llm("Summarize this text")

# Specific
ask_llm("Summarize this text in 2-3 sentences, focusing on the main argument and any policy recommendations")

11.1.2 Use Delimiters

Clearly separate different parts of your input:

prompt <- "
Analyze the following text for political ideology.

<text>
Taxes are necessary to fund public services that benefit everyone.
</text>

Respond with: LEFT, RIGHT, or CENTER
"

11.1.3 Specify Output Format

Tell the model exactly what format you want:

ask_llm("List the three main points. Format as a numbered list.")

11.1.4 Ask for Chain-of-Thought Reasoning

For complex tasks, asking the model to explain its reasoning can improve accuracy:

ask_llm("Classify this statement as left or right wing. First, explain your reasoning step by step, then give your final answer.")
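Chain-of-thought output then needs post-processing to recover the label. A minimal sketch, assuming (hypothetically) that the model's final answer is the last LEFT/RIGHT/CENTER mention in its reply:

```r
library(stringr)

# Hypothetical helper: pull the last LEFT/RIGHT/CENTER mention from a
# chain-of-thought response (assumes the final answer comes last)
extract_label <- function(response) {
  hits <- str_extract_all(toupper(response), "LEFT|RIGHT|CENTER")[[1]]
  if (length(hits) == 0) NA_character_ else tail(hits, 1)
}

extract_label("Reasoning: anti-regulation framing ... Final answer: RIGHT")
# returns "RIGHT"
```

This kind of string parsing is fragile; when you need the reasoning *and* a clean label, the structured-output approach from Section 9 (e.g., a schema with both a reasoning string and an ideology field) is usually more robust.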

11.2 Anthropic’s Recommendations

11.2.1 Use XML Tags

Claude responds particularly well to XML-structured prompts:

prompt <- "
<instructions>
Analyze the text for political ideology.
</instructions>

<text>
The free market always produces the best outcomes.
</text>

<output_format>
Respond with a single word: LEFT, RIGHT, or CENTER
</output_format>
"

ask_claude_terse(prompt, system_prompt = "Follow the instructions precisely.")

11.2.2 Define Success Criteria First

Before prompt engineering, have:

  1. A clear definition of success criteria for your use case
  2. Ways to empirically test against those criteria
  3. A baseline prompt to improve upon

12 Applied Example — Full Workflow

Let’s put it all together with a complete analysis workflow.

12.1 Research Question

How do different LLMs classify the ideology of political statements?

12.2 Step 1: Define Your Inputs

statements <- c(
  "The corrupt elites look down on us",
  "Taxes are immoral",
  "Taxes are necessary",
  "Taxes are a necessary evil",
  "Censorship is always immoral",
  "Social media posts containing threats should be deleted",
  "Climate change is the greatest threat we face",
  "The free market produces the best outcomes"
)

12.3 Step 2: Define Expected Classifications

Before running the model, record your expectations (this is like pre-registration):

expectations <- tibble(
  statement = statements,
  expected = c(
    "contextual",  # Populist rhetoric used by both sides
    "right",       # Anti-tax sentiment
    "left",        # Pro-government services
    "ambiguous",   # Acknowledges necessity but frames as evil
    "contextual",  # Historically left, now used by right too
    "left",        # Pro-moderation
    "left",        # Environmental concern
    "right"        # Free market ideology
  )
)

12.4 Step 3: Create Analysis Function

classify_ideology <- function(text, model_fn, model_name) {
  
  prompt <- paste(
    "Classify the ideology of someone who would say:",
    shQuote(text),
    "\nRespond with exactly one word: LEFT, RIGHT, or CENTER"
  )
  
  response <- model_fn(prompt, echo = "none")
  
  tibble(
    statement = text,
    model = model_name,
    classification = response
  )
}

12.5 Step 4: Run Analysis Across Models

# Collect results from both models
results_gpt <- map_dfr(statements, ~ classify_ideology(.x, ask5miniTerse, "GPT-5-mini"))
results_claude <- map_dfr(statements, ~ classify_ideology(.x, ask_claude_terse, "Claude-Haiku"))

# Combine
all_results <- bind_rows(results_gpt, results_claude) %>%
  pivot_wider(names_from = model, values_from = classification)

all_results
# A tibble: 8 × 3
  statement                                          `GPT-5-mini` `Claude-Haiku`
  <chr>                                              <ellmr_tp>   <ellmr_tp>    
1 The corrupt elites look down on us                 RIGHT      … RIGHT         
2 Taxes are immoral                                  RIGHT      … RIGHT         
3 Taxes are necessary                                LEFT       … CENTER        
4 Taxes are a necessary evil                         RIGHT      … CENTER        
5 Censorship is always immoral                       RIGHT      … RIGHT         
6 Social media posts containing threats should be d… Sorry—I can… CENTER        
7 Climate change is the greatest threat we face      LEFT       … LEFT          
8 The free market produces the best outcomes         RIGHT      … RIGHT         

12.6 Step 5: Compare with Expectations

final_analysis <- left_join(all_results, expectations, by = "statement")

final_analysis
# A tibble: 8 × 4
  statement                                 `GPT-5-mini` `Claude-Haiku` expected
  <chr>                                     <ellmr_tp>   <ellmr_tp>     <chr>   
1 The corrupt elites look down on us        RIGHT      … RIGHT          context…
2 Taxes are immoral                         RIGHT      … RIGHT          right   
3 Taxes are necessary                       LEFT       … CENTER         left    
4 Taxes are a necessary evil                RIGHT      … CENTER         ambiguo…
5 Censorship is always immoral              RIGHT      … RIGHT          context…
6 Social media posts containing threats sh… Sorry—I can… CENTER         left    
7 Climate change is the greatest threat we… LEFT       … LEFT           left    
8 The free market produces the best outcom… RIGHT      … RIGHT          right   

12.7 Step 6: Calculate Agreement

# Do models agree with each other?
final_analysis %>%
  mutate(models_agree = `GPT-5-mini` == `Claude-Haiku`) %>%
  summarise(
    agreement_rate = mean(models_agree),
    n_agree = sum(models_agree),
    n_total = n()
  )
# A tibble: 1 × 3
  agreement_rate n_agree n_total
           <dbl>   <int>   <int>
1          0.625       5       8
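The same idea extends to the pre-registered expectations. A sketch that scores each model against the expected labels, dropping the "contextual" and "ambiguous" rows where no single label was expected:

```r
# Sketch: accuracy against pre-registered expectations
final_analysis %>%
  filter(expected %in% c("left", "right", "center")) %>%
  mutate(
    gpt_correct    = tolower(as.character(`GPT-5-mini`)) == expected,
    claude_correct = tolower(as.character(`Claude-Haiku`)) == expected
  ) %>%
  summarise(
    gpt_accuracy    = mean(gpt_correct, na.rm = TRUE),
    claude_accuracy = mean(claude_correct, na.rm = TRUE)
  )
```

Note that refusals (like GPT's on the threats statement) count as incorrect here; depending on your research question, you may want to report refusal rates separately.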

13 Conclusion and Next Steps

13.1 What We Covered

  1. Programmatic LLM access using the ellmer package
  2. System prompts to control model behavior
  3. Batch processing with purrr::map_chr()
  4. Prompt sensitivity and its implications for research
  5. Open-weight models via Ollama for local deployment
  6. Multi-model comparison for robustness
  7. Structured output for quantitative analysis
  8. Best practices from OpenAI and Anthropic

13.2 Key Takeaways for Researchers

  1. Prompts matter: Small wording changes can affect results. Pre-register your prompts.

  2. Test consistency: Run the same query multiple times. Report variability.

  3. Compare models: Don’t rely on a single model. Cross-validate with alternatives.

  4. Use structured output: When you need data for analysis, specify the structure explicitly.

  5. Document everything: Record model versions, temperatures, and system prompts for reproducibility.
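Point 5 can be operationalized by saving a small metadata record alongside each batch of results; a sketch:

```r
# Sketch: record run metadata alongside the outputs for reproducibility
run_metadata <- tibble(
  timestamp      = Sys.time(),
  model          = "gpt-5-mini-2025-08-07",
  temperature    = 0,
  system_prompt  = "Be terse.",
  ellmer_version = as.character(packageVersion("ellmer")),
  r_version      = R.version.string
)

# e.g. readr::write_csv(run_metadata, "run_metadata.csv")
```

Writing this file next to the response data means anyone re-running the analysis knows exactly which model and settings produced it.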

13.3 Suggested Reading

13.4 Exercises

  1. Modify the analyze_ideology() function to also extract “topics” as an array
  2. Run a consistency test: Query the same prompt 10 times and calculate the proportion of identical responses
  3. Compare GPT-5-mini, Claude-Haiku, and a local Llama model on 20 statements of your choosing
  4. Create a structured output schema for a different domain (e.g., sentiment analysis, factuality assessment)

14 Appendix: Quick Reference

14.1 Model Initialization

Some examples:

# OpenAI
chat_openai(model = "gpt-5-mini-2025-08-07")
chat_openai(model = "gpt-4o-mini")
chat_openai(model = "gpt-4o")

# Anthropic/Claude  
chat_claude(model = "claude-3-5-haiku-20241022")
chat_claude(model = "claude-3-5-sonnet-20241022")

# Local (Ollama)
chat_ollama(model = "llama3.2")
chat_ollama(model = "deepseek-r1:8b")

# DeepSeek API
chat_deepseek(model = "deepseek-chat")

# OpenRouter (multiple models)
chat_openrouter(model = "deepseek/deepseek-chat-v3.1")
# You'll also find Grok, Mistral, etc.

14.2 Common Parameters

chat_openai(
  model = "gpt-4o-mini",             # Model to use
  system_prompt = "Be brief",        # Behavior instructions
  api_args = list(temperature = 0),  # 0 = low randomness (not fully deterministic); higher = more varied
  echo = "all"                       # Print conversation to console
)

14.3 Structured Output Types

# Boolean
type_boolean("Is this about politics?")

# String
type_string("The main topic")

# Number
type_number("Confidence score from 0 to 1")

# Array (note: items first, then description)
type_array(items = type_string("A topic"), description = "List of topics")

# Object (combine multiple fields)
type_object(
  "Description of the object",
  field1 = type_boolean("..."),
  field2 = type_string("..."),
  field3 = type_number("...")
)

14.4 Batch Processing Pattern

# Process multiple inputs
results <- map_chr(inputs, ~ my_function(.x))

# Store with inputs
tibble(input = inputs, output = results)

# For structured output, use map() then extract fields
structured <- map(inputs, ~ chat_structured(.x))
tibble(
  input = inputs,
  field1 = map_lgl(structured, "field1"),
  field2 = map_chr(structured, "field2")
)