Advanced R Tips and Tricks

A collection of practical solutions for common data visualization and analysis situations.

Based on: github.com/zilinskyjan/R-snippets

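A slide deck version is available at https://rawcdn.githack.com/zilinskyjan/DataViz/2e38f94acb5465e7bfe1e7a438de6701f9cd344b/slides/r-snippets-teaching.html#/title-slide.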

Summary of Tips

Tip	Solution
Wrap axis labels	`scale_x_discrete(labels = scales::label_wrap(20))`
Align title with plot	`theme(plot.title.position = "plot")`
Remove padding	`scale_x_continuous(expand = c(0, 0))`
Remove gridlines	`theme(panel.grid.major.y = element_blank())`
Format dates	`scale_x_date(labels = scales::label_date("%b '%y"))`
Fix legend order	`guides(fill = guide_legend(reverse = TRUE))`
Wrap facet labels	`facet_wrap(~ var, labeller = label_wrap_gen(width = 25))`
Hide legend	`geom_point(show.legend = FALSE)`
Transparent colors	`scales::alpha("blue", 0.5)`
Subset in ggplot	`data = . %>% filter(condition)`

Setup: Load Data and Create Summary

We’ll use a dataset of news headlines about ChatGPT/AI collected from Media Cloud to demonstrate each tip. First, let’s load the data and create a summary tibble.

Code

library(tidyverse)
library(scales)

# Load the headlines data
headlines <- read_csv("data/mc-onlinenews-mediacloud-20250604193447-chatgpt-headlines.csv")

# Create a summary: count articles by media outlet
outlet_summary <- headlines %>%
  mutate(outlet = str_remove(media_url, "\\.com$|\\.org$|\\.net$")) %>%
  count(outlet, name = "n_articles") %>%
  slice_max(n_articles, n = 12) %>%
  mutate(
    outlet_label = case_when(
      outlet == "theguardian" ~ "The Guardian",
      outlet == "forbes" ~ "Forbes",
      outlet == "cnet" ~ "CNET",
      outlet == "techcrunch" ~ "TechCrunch Startup Coverage",
      outlet == "zdnet" ~ "ZDNet",
      outlet == "businessinsider" ~ "Business Insider Financial News",
      outlet == "techradar" ~ "TechRadar",
      outlet == "theconversation" ~ "The Conversation",
      outlet == "cbsnews" ~ "CBS News",
      outlet == "cnbc" ~ "CNBC",
      outlet == "reuters" ~ "Reuters",
      TRUE ~ outlet
    )
  )

# Preview the data
outlet_summary

# A tibble: 12 x 3
   outlet          n_articles outlet_label                   
   <chr>                <int> <chr>                          
 1 forbes                7036 Forbes                         
 2 benzinga              4206 benzinga                       
 3 businessinsider       3147 Business Insider Financial News
 4 techradar             2902 TechRadar                      
 5 zdnet                 2867 ZDNet                          
 6 cnet                  2270 CNET                           
 7 techcrunch            2025 TechCrunch Startup Coverage    
 8 nytimes               1530 nytimes                        
 9 ibtimes               1484 ibtimes                        
10 theguardian           1213 The Guardian                   
11 fortune               1168 fortune                        
12 cnbc                  1098 CNBC

We’ll also create a time series summary:

Code

# Parse dates and count by day
daily_counts <- headlines %>%
  mutate(
    date = mdy(publish_date)
  ) %>%
  filter(!is.na(date), date >= as.Date("2024-11-01")) %>%
  count(date, name = "n_articles")

head(daily_counts)

# A tibble: 6 x 2
  date       n_articles
  <date>          <int>
1 2024-11-01         80
2 2024-11-02         31
3 2024-11-03         25
4 2024-11-04         51
5 2024-11-05         51
6 2024-11-06         47

Part 1: Data Visualization with ggplot2

Wrapping Long Axis Labels

Problem: Category names overlap on axes

Without the tip:

Code

ggplot(outlet_summary, aes(y = n_articles, x = outlet_label)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "News Coverage of ChatGPT by Media Outlet",
    y = "Number of Articles",
    x = NULL
  )

The labels are cut off or overlap because they’re too long.

With the tip applied:

Code

ggplot(outlet_summary, aes(y = n_articles, x = outlet_label)) +
  geom_col(fill = "steelblue") +
  scale_x_discrete(labels = scales::label_wrap(15)) +
  labs(
    title = "News Coverage of ChatGPT by Media Outlet",
    y = "Number of Articles",
    x = NULL
  )

The scales::label_wrap(15) function breaks long labels onto multiple lines, wrapping at about 15 characters.
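To see what the labeller does on its own, it can be called directly on a string. This is a small sketch using one of the long labels from the setup; the variable names are illustrative only.

```r
library(scales)

# label_wrap() returns a function; apply it to a label to see the line breaks
wrap_15 <- label_wrap(15)
long_label <- "Business Insider Financial News"
wrapped <- wrap_15(long_label)

# cat() prints the embedded newlines as actual line breaks
cat(wrapped, "\n")

# Length of each resulting line
line_lengths <- nchar(strsplit(wrapped, "\n", fixed = TRUE)[[1]])
line_lengths
```

Words are never split, so a single word longer than the width stays on one line.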

Bonus: Format numbers nicely:

Code

scale_y_continuous(labels = scales::comma)
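As a quick check of what this formatter produces, the sketch below applies it to two counts from the summary table. It uses label_comma(), which recent versions of scales document as the current form of comma(); treat that choice as an assumption about your installed version.

```r
library(scales)

# Two article counts taken from the outlet summary
counts <- c(7036, 1098)

# label_comma() returns a formatting function; call it on the numbers
formatted <- label_comma()(counts)
formatted
```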

Alternative (a classic solution, though it often looks worse):

We could use:

Code

theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

The result:

Code

ggplot(outlet_summary, aes(y = n_articles, x = outlet_label)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "News Coverage of ChatGPT by Media Outlet",
    y = "Number of Articles",
    x = NULL
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

Getting Rid of Awkward Padding

Problem: Unwanted padding around your plot

Code

ggplot(outlet_summary, aes(x = n_articles, y = reorder(outlet, n_articles))) +
  geom_col(fill = "steelblue") +
  labs(x = "Number of Articles", y = NULL)

Notice the gap between the bars and the y-axis.

With the tip applied:

Code

ggplot(outlet_summary, aes(x = n_articles, y = reorder(outlet, n_articles))) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(expand = c(0, 0)) +
  labs(x = "Number of Articles", y = NULL)

The bars now start directly at the axis with expand = c(0, 0).
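Setting expand = c(0, 0) removes the padding at both ends of the axis, so the longest bar touches the right edge of the panel. A possible variation, sketched here with made-up counts, uses expansion() to drop only the left padding and keep 5% on the right.

```r
library(ggplot2)

# Made-up counts for illustration only
toy <- data.frame(
  outlet = c("a", "b", "c"),
  n_articles = c(100, 60, 30)
)

p <- ggplot(toy, aes(x = n_articles, y = reorder(outlet, n_articles))) +
  geom_col(fill = "steelblue") +
  # mult = c(lower, upper): no padding on the left, 5% of the range on the right
  scale_x_continuous(expand = expansion(mult = c(0, 0.05))) +
  labs(x = "Number of Articles", y = NULL)

# Inspect the x-axis range that ggplot2 computed for the panel
x_range <- ggplot_build(p)$layout$panel_params[[1]]$x.range
x_range
```

The same expansion(mult = c(0, ...)) pattern is useful whenever text labels need room beyond the end of the bars.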

Better Title Positioning

Problem: Default title positioning leaves awkward spacing

Code

ggplot(outlet_summary, aes(x = n_articles, y = reorder(outlet, n_articles))) +
  geom_col(fill = "steelblue") +
  labs(
    title = "News Coverage of ChatGPT by Media Outlet",
    subtitle = "Top 12 outlets by article count",
    x = "Number of Articles",
    y = NULL
  )

Notice how the title starts at the left edge of the panel (at the y-axis line) rather than at the left edge of the whole plot.

With the tip applied:

Code

ggplot(outlet_summary, aes(x = n_articles, y = reorder(outlet, n_articles))) +
  geom_col(fill = "steelblue") +
  labs(
    title = "News Coverage of ChatGPT by Media Outlet",
    subtitle = "Top 12 outlets by article count",
    x = "Number of Articles",
    y = NULL
  ) +
  theme(plot.title.position = "plot")

The title now aligns with the left edge of the entire plot area.
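If every figure in a document should use this alignment, one option is to store the setting in a theme object and add it to each plot. The sketch below is an assumption about workflow, not part of the original snippets; plot.caption.position is a related ggplot2 theme setting that does the same for captions, and theme_aligned is a name made up for this example.

```r
library(ggplot2)

# Reusable theme: align title, subtitle, and caption with the whole plot
theme_aligned <- theme_minimal() +
  theme(
    plot.title.position = "plot",
    plot.caption.position = "plot"
  )

# Made-up data for illustration only
toy <- data.frame(outlet = c("a", "b"), n_articles = c(100, 60))

p <- ggplot(toy, aes(x = n_articles, y = outlet)) +
  geom_col(fill = "steelblue") +
  labs(title = "Title aligned with the plot edge", caption = "Caption too") +
  theme_aligned
```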

Getting Rid of (Some) Gridlines

Problem: Too many gridlines clutter the plot

Usual theme_minimal() rendering:

Code

ggplot(outlet_summary, aes(x = n_articles, y = reorder(outlet, n_articles))) +
  geom_col(fill = "steelblue") +
  theme_minimal() +
  labs(x = "Number of Articles", y = NULL)

With the tip applied:

Code

ggplot(outlet_summary, aes(x = n_articles, y = reorder(outlet, n_articles))) +
  geom_col(fill = "steelblue") +
  theme_minimal() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank()
  ) +
  labs(x = "Number of Articles", y = NULL)

Removing the horizontal gridlines makes the chart cleaner when bars already provide visual alignment.

Custom Date Axis Formatting

Problem: Default date formatting doesn’t meet your needs

Code

ggplot(daily_counts, aes(x = date, y = n_articles)) +
  geom_line(color = "steelblue") +
  labs(
    title = "Daily ChatGPT News Coverage",
    x = NULL,
    y = "Articles"
  )

The date axis uses default formatting which may not be ideal.

With the tip applied, the x-axis shows abbreviated month names with a two-digit year (for example, Nov '24):

Code

ggplot(daily_counts, aes(x = date, y = n_articles)) +
  geom_line(color = "steelblue") +
  scale_x_date(
    name = NULL,
    breaks = scales::breaks_width("1 month"),
    labels = scales::label_date("%b '%y")
  ) +
  labs(
    title = "Daily ChatGPT News Coverage",
    y = "Articles"
  )
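The format string follows strptime conventions: %b is the abbreviated month name and %y the two-digit year. A quick way to preview a format before putting it on an axis is to call the labeller directly, as in this sketch (month names depend on the session locale).

```r
library(scales)

dates <- as.Date(c("2024-11-01", "2025-01-15"))

# label_date() returns a function that formats dates with the given pattern
fmt <- label_date("%b '%y")
labels <- fmt(dates)
labels
```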

Fix Legend Order Mismatches

Problem: Legend colors don’t match your data order

First, let’s create data that shows this issue:

Code

# Create category data
outlet_categories <- outlet_summary %>%
  mutate(
    category = case_when(
      outlet %in% c("forbes", "businessinsider", "cnbc") ~ "Business",
      outlet %in% c("techcrunch", "zdnet", "cnet", "techradar") ~ "Technology",
      TRUE ~ "General News"
    )
  ) %>%
  group_by(category) %>%
  summarise(total_articles = sum(n_articles)) %>%
  arrange(desc(total_articles))

outlet_categories

# A tibble: 3 x 2
  category     total_articles
  <chr>                 <int>
1 Business              11281
2 Technology            10064
3 General News           9601

The legend order (alphabetical) doesn’t match the bar order (by value):

Code

ggplot(outlet_categories, aes(x = total_articles, y = reorder(category, total_articles), fill = category)) +
  geom_col() +
  labs(x = "Total Articles", y = NULL)

To guarantee the legend colors actually match the bar order, you must relevel the factor for category before plotting, using the ordering variable. This sets the order for both the bars and the legend:

Code

# Order the category factor levels by total_articles (ascending, the fct_reorder default)
outlet_categories <- outlet_categories %>%
  mutate(category = forcats::fct_reorder(category, total_articles))

ggplot(outlet_categories, aes(x = total_articles, y = category, fill = category)) +
  geom_col() +
  labs(x = "Total Articles", y = NULL) +
  guides(fill = guide_legend(reverse = TRUE))

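Because fct_reorder() sorts levels in ascending order, the largest category is drawn at the top of the y-axis; guide_legend(reverse = TRUE) flips the legend so it reads top to bottom in the same order as the bars.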
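The level order can be checked without drawing anything. This sketch rebuilds the three category totals printed above as plain vectors (the vector names are illustrative) and shows the order that the bars and the reversed legend share.

```r
library(forcats)

# Category totals as printed above
category <- c("Business", "Technology", "General News")
total_articles <- c(11281, 10064, 9601)

# fct_reorder() sorts levels by the second argument, ascending by default
category_f <- fct_reorder(category, total_articles)
levels(category_f)

# On a discrete y-axis the first level is drawn at the bottom, so the bars read
# top to bottom in the reverse of levels(); reverse = TRUE gives the legend
# this same order
legend_order <- rev(levels(category_f))
legend_order
```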

Wrapping Long Labels in Facets

Problem: Long facet labels overlap or look messy

Code

# Create data with long category names for faceting
outlet_by_category <- headlines %>%
  mutate(outlet = str_remove(media_url, "\\.com$|\\.org$|\\.net$")) %>%
  mutate(
    category = case_when(
      outlet %in% c("forbes", "businessinsider", "cnbc", "thestreet") ~ 
        "Business and Financial News Coverage",
      outlet %in% c("techcrunch", "zdnet", "cnet", "techradar", "wired") ~ 
        "Technology Industry Publications",
      outlet %in% c("theguardian", "nytimes", "washingtonpost") ~ 
        "Major National Newspapers",
      TRUE ~ "Other Media Sources"
    )
  ) %>%
  count(category, outlet) %>%
  group_by(category) %>%
  slice_max(n, n = 5)

Code

ggplot(outlet_by_category, aes(x = n, y = reorder(outlet, n))) +
  geom_col(fill = "steelblue") +
  facet_wrap(~ category, scales = "free_y") +
  labs(x = "Number of Articles", y = NULL) +
  theme_bw(base_size=20)

The facet titles are cut off because they’re too long.

With the tip applied:

Code

ggplot(outlet_by_category, aes(x = n, y = reorder(outlet, n))) +
  geom_col(fill = "steelblue") +
  facet_wrap(~ category, scales = "free_y", labeller = label_wrap_gen(width = 25)) +
  labs(x = "Number of Articles", y = NULL) +
  theme_bw(base_size = 20) +
  # Tweak after theme_bw(): a complete theme replaces any earlier theme() settings
  theme(strip.text = element_text(size = 9))

The label_wrap_gen(width = 25) wraps the facet titles at around 25 characters.
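label_wrap_gen() builds a labeller, which ggplot2 calls with a data frame of facet values. Calling it by hand, as in this sketch, shows where the breaks fall; the one-row data frame is constructed only for the demonstration.

```r
library(ggplot2)

wrap_25 <- label_wrap_gen(width = 25)

# A labeller receives a data frame with one column per faceting variable
facet_values <- data.frame(category = "Business and Financial News Coverage")
wrapped <- wrap_25(facet_values)[[1]]

# cat() prints the embedded newline as an actual line break
cat(wrapped, "\n")

# Length of each resulting line
line_lengths <- nchar(strsplit(wrapped, "\n", fixed = TRUE)[[1]])
line_lengths
```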

Control Legend Visibility

Problem: Unwanted legend entries from specific geoms

Code

top_outlets <- outlet_summary %>% slice_max(n_articles, n = 6)

ggplot(top_outlets, aes(x = n_articles, y = reorder(outlet, n_articles), fill = outlet)) +
  geom_col() +
  geom_text(aes(label = n_articles, color = outlet), hjust = -0.2) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(x = "Number of Articles", y = NULL)

Both the bars and text create legend entries, which is redundant.

With the tip applied:

Code

ggplot(top_outlets, aes(x = n_articles, y = reorder(outlet, n_articles), fill = outlet)) +
  geom_col() +
  geom_text(aes(label = n_articles), hjust = -0.2, show.legend = FALSE) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(x = "Number of Articles", y = NULL)

Using show.legend = FALSE keeps the text layer out of the legend, so only the fill legend for the bars remains.

Transparent Colors

Problem: Overlapping points or areas obscure underlying data

Code

# Create some overlapping data
set.seed(42)
scatter_data <- tibble(
  x = rnorm(2500, mean = 100, sd = 30),
  y = x + rnorm(2500, mean = 0, sd = 20)
)

Without the tip:

Code

ggplot(scatter_data, aes(x = x, y = y)) +
  geom_point(color = "steelblue", size = 3) +
  labs(title = "Overlapping points obscure density")

It’s hard to see where points are concentrated.

With the tip applied:

Code

ggplot(scatter_data, aes(x = x, y = y)) +
  geom_point(color = scales::alpha("steelblue", 0.1), size = 2) +
  labs(title = "Transparency reveals density patterns")

Using scales::alpha("steelblue", 0.1) makes each point 10% opaque, so regions where many points overlap appear darker.
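scales::alpha() returns a colour string with an alpha channel appended, which is why it can be used anywhere a colour is accepted. The sketch below inspects that string on made-up points; setting the alpha argument of the geom, as in the last lines, is an alternative that should give a similar result.

```r
library(ggplot2)
library(scales)

# A hex colour with two extra digits at the end that encode the opacity
faded <- alpha("steelblue", 0.1)
faded

# Made-up points for illustration only
set.seed(42)
toy <- data.frame(x = rnorm(200), y = rnorm(200))

# Alternative: pass alpha as its own argument instead of baking it into the colour
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(color = "steelblue", alpha = 0.1, size = 2)
```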

Part 2: Data Manipulation

Subset Data Within ggplot

Problem: Need different data filtering for specific plot layers

Code

# Add some outlier detection
daily_with_outliers <- daily_counts %>%
  mutate(is_outlier = n_articles > quantile(n_articles, 0.95))

ggplot(daily_with_outliers, aes(x = date, y = n_articles)) +
  geom_line(color = "gray70") +
  geom_point(
    data = . %>% filter(is_outlier),
    color = "red", size = 2
  ) +
  labs(
    title = "Daily Coverage with High-Volume Days Highlighted",
    subtitle = "Red points show days above 95th percentile",
    x = NULL, y = "Articles"
  )

The data = . %>% filter(is_outlier) syntax creates a function that ggplot2 applies to the plot's data, so only the points layer is filtered.
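The shortcut relies on the magrittr pipe: an expression that starts with a dot defines a function rather than running immediately. The sketch below, on made-up daily counts, shows that an ordinary function passed as the layer's data argument behaves the same way; keep_outliers is a name chosen for this example.

```r
library(dplyr)
library(ggplot2)

# Made-up daily counts with one obvious spike
daily <- tibble(
  date = as.Date("2024-11-01") + 0:19,
  n_articles = c(rep(40, 18), 200, 300)
) %>%
  mutate(is_outlier = n_articles > quantile(n_articles, 0.95))

# `. %>% filter(...)` is a function of one argument...
keep_outliers <- . %>% filter(is_outlier)
is.function(keep_outliers)

# ...so an ordinary function works the same way as the layer's data argument
p <- ggplot(daily, aes(x = date, y = n_articles)) +
  geom_line(color = "gray70") +
  geom_point(data = function(d) filter(d, is_outlier), color = "red", size = 2)

# Number of rows that reach the points layer
n_points <- nrow(layer_data(p, 2))
n_points
```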

Summary Statistics with Confidence Intervals

Problem: Need to show uncertainty in group means

Code

# Calculate mean and CI by outlet category
category_stats <- headlines %>%
  mutate(
    outlet = str_remove(media_url, "\\.com$|\\.org$|\\.net$"),
    title_length = nchar(title)
  ) %>%
  mutate(
    category = case_when(
      outlet %in% c("forbes", "businessinsider", "cnbc") ~ "Business",
      outlet %in% c("techcrunch", "zdnet", "cnet", "techradar") ~ "Tech",
      outlet %in% c("theguardian", "nytimes", "washingtonpost") ~ "News",
      TRUE ~ "Other"
    )
  ) %>%
  group_by(category) %>%
  summarise(
    M = mean(title_length, na.rm = TRUE),
    sd = sd(title_length, na.rm = TRUE),
    n = sum(!is.na(title_length)),
    se = sd / sqrt(n)
  )

ggplot(category_stats, aes(x = M, y = reorder(category, M))) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_errorbar(
    aes(xmin = M - 1.96*se, xmax = M + 1.96*se),
    width = 0.2
  ) +
  labs(
    title = "Average Headline Length by Outlet Category",
    subtitle = "Error bars show 95% confidence intervals",
    x = "Mean Title Length (characters)",
    y = NULL
  )
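The error bars use the normal approximation: the standard error is sd / sqrt(n), and the interval is the mean plus or minus 1.96 standard errors. The sketch below reproduces that arithmetic on made-up title lengths; the lower and upper column names are illustrative.

```r
library(dplyr)

# Made-up title lengths for two groups
toy <- tibble(
  category = rep(c("Tech", "News"), each = 5),
  title_length = c(50, 60, 70, 80, 90, 40, 45, 50, 55, 60)
)

ci_table <- toy %>%
  group_by(category) %>%
  summarise(
    M = mean(title_length),
    se = sd(title_length) / sqrt(n()),
    lower = M - 1.96 * se,   # normal-approximation 95% interval
    upper = M + 1.96 * se
  )

ci_table
```

With only a handful of observations per group, a t-based multiplier such as qt(0.975, n - 1) gives a wider and more appropriate interval than 1.96.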

Filter by Minimum Observations

Problem: Need to calculate statistics only for groups with sufficient sample sizes

Code

# Only calculate stats for outlets with enough articles
outlet_stats <- headlines %>%
  mutate(
    outlet = str_remove(media_url, "\\.com$|\\.org$|\\.net$"),
    title_length = nchar(title)
  ) %>%
  group_by(outlet) %>%
  summarize(
    n_articles = n(),
    mean_length = ifelse(
      n_articles >= 50,
      mean(title_length, na.rm = TRUE),
      NA
    )
  ) %>%
  filter(!is.na(mean_length)) %>%
  slice_max(n_articles, n = 10)

outlet_stats

# A tibble: 10 x 3
   outlet          n_articles mean_length
   <chr>                <int>       <dbl>
 1 forbes                7036        63.5
 2 benzinga              4206       100. 
 3 businessinsider       3147        94.4
 4 techradar             2902        76.4
 5 zdnet                 2867        59.4
 6 cnet                  2270        56.0
 7 techcrunch            2025        66.5
 8 nytimes               1530        53.2
 9 ibtimes               1484        63.1
10 theguardian           1213        75.3

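Using ifelse(n_articles >= 50, ...) sets mean_length to NA for outlets with fewer than 50 articles, and the following filter() drops those rows, so statistics are reported only for outlets with at least 50 articles.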
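An equivalent approach, sketched here on made-up data with a threshold of 2 instead of 50, drops the small groups before summarising, so no mean is computed for them at all. The names toy, min_n, and outlet_means are illustrative.

```r
library(dplyr)

# Made-up data: outlet "a" has three articles, outlet "b" has one
toy <- tibble(
  outlet = c("a", "a", "a", "b"),
  title_length = c(60, 70, 80, 100)
)

min_n <- 2  # illustrative threshold; the example above uses 50

outlet_means <- toy %>%
  group_by(outlet) %>%
  filter(n() >= min_n) %>%      # keep only groups with enough rows
  summarise(
    n_articles = n(),
    mean_length = mean(title_length, na.rm = TRUE)
  )

outlet_means
```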

Part 3: External Data Access

Harvard Dataverse Integration

Problem: Need to access publicly available research datasets

Solution:

Code

# Set up connection to Harvard Dataverse
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

# Download dataset directly into R
dataset <- dataverse::get_dataframe_by_name(
  "filename.tab",
  "doi:10.7910/DVN/XXXXXX"  # Replace with actual DOI
)

Resources: CRAN dataverse package vignette (https://cran.r-project.org/web/packages/dataverse/vignettes/A-introduction.html)

Additional Resources

This collection: github.com/zilinskyjan/R-snippets
Allison Koh’s collection: github.com/allisonkoh/helpful-code-stuff
Silvia Kim’s workshop notes: https://sysilviakim.com/learningR/
