Advanced R Tips and Tricks

A collection of practical solutions for common data visualization and analysis situations.

Based on: github.com/zilinskyjan/R-snippets

A slide deck version is available here.

Summary of Tips

Tip                   Solution
Wrap axis labels      scale_x_discrete(labels = scales::label_wrap(20))
Align title with plot theme(plot.title.position = "plot")
Remove padding        scale_x_continuous(expand = c(0, 0))
Remove gridlines      theme(panel.grid.major.y = element_blank())
Format dates          scale_x_date(labels = scales::label_date("%b '%y"))
Fix legend order      guides(fill = guide_legend(reverse = TRUE))
Wrap facet labels     facet_wrap(~ var, labeller = label_wrap_gen(width = 25))
Hide legend           geom_point(show.legend = FALSE)
Transparent colors    scales::alpha("blue", 0.5)
Subset in ggplot      data = . %>% filter(condition)

Setup: Load Data and Create Summary

We’ll use a dataset of news headlines about ChatGPT/AI collected from Media Cloud to demonstrate each tip. First, let’s load the data and create a summary tibble.

library(tidyverse)
library(scales)

# Load the headlines data
headlines <- read_csv("data/mc-onlinenews-mediacloud-20250604193447-chatgpt-headlines.csv")

# Create a summary: count articles by media outlet
outlet_summary <- headlines %>%
  mutate(outlet = str_remove(media_url, "\\.com$|\\.org$|\\.net$")) %>%
  count(outlet, name = "n_articles") %>%
  slice_max(n_articles, n = 12) %>%
  mutate(
    outlet_label = case_when(
      outlet == "theguardian" ~ "The Guardian",
      outlet == "forbes" ~ "Forbes",
      outlet == "cnet" ~ "CNET",
      outlet == "techcrunch" ~ "TechCrunch Startup Coverage",
      outlet == "zdnet" ~ "ZDNet",
      outlet == "businessinsider" ~ "Business Insider Financial News",
      outlet == "techradar" ~ "TechRadar",
      outlet == "theconversation" ~ "The Conversation",
      outlet == "cbsnews" ~ "CBS News",
      outlet == "cnbc" ~ "CNBC",
      outlet == "reuters" ~ "Reuters",
      TRUE ~ outlet
    )
  )

# Preview the data
outlet_summary
# A tibble: 12 x 3
   outlet          n_articles outlet_label                   
   <chr>                <int> <chr>                          
 1 forbes                7036 Forbes                         
 2 benzinga              4206 benzinga                       
 3 businessinsider       3147 Business Insider Financial News
 4 techradar             2902 TechRadar                      
 5 zdnet                 2867 ZDNet                          
 6 cnet                  2270 CNET                           
 7 techcrunch            2025 TechCrunch Startup Coverage    
 8 nytimes               1530 nytimes                        
 9 ibtimes               1484 ibtimes                        
10 theguardian           1213 The Guardian                   
11 fortune               1168 fortune                        
12 cnbc                  1098 CNBC                           

We’ll also create a time series summary:

# Parse dates and count by day
daily_counts <- headlines %>%
  mutate(
    date = mdy(publish_date)
  ) %>%
  filter(!is.na(date), date >= as.Date("2024-11-01")) %>%
  count(date, name = "n_articles")

head(daily_counts)
# A tibble: 6 x 2
  date       n_articles
  <date>          <int>
1 2024-11-01         80
2 2024-11-02         31
3 2024-11-03         25
4 2024-11-04         51
5 2024-11-05         51
6 2024-11-06         47

Part 1: Data Visualization with ggplot2

Wrapping Long Axis Labels

Problem: Category names overlap on axes

Without the tip:

ggplot(outlet_summary, aes(y = n_articles, x = outlet_label)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "News Coverage of ChatGPT by Media Outlet",
    y = "Number of Articles",
    x = NULL
  )

The labels are cut off or overlap because they’re too long.

With the tip applied:

ggplot(outlet_summary, aes(y = n_articles, x = outlet_label)) +
  geom_col(fill = "steelblue") +
  scale_x_discrete(labels = scales::label_wrap(15)) +
  labs(
    title = "News Coverage of ChatGPT by Media Outlet",
    y = "Number of Articles",
    x = NULL
  )

The scales::label_wrap(15) function breaks long labels into multiple lines at around 15 characters.
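To see what the labeller does on its own, it can be called directly on a character vector. This is a minimal sketch using two labels from the summary above; the exact break positions depend on how the underlying text wrapping counts characters, so treat the width as approximate.

```r
library(scales)

# label_wrap() returns a function; apply it to a character vector
wrap_15 <- label_wrap(15)
labels <- c("Forbes", "Business Insider Financial News")
wrapped <- wrap_15(labels)

# Short labels come back unchanged; long ones gain "\n" line breaks
cat(wrapped, sep = "\n---\n")
```

Because the result is an ordinary function, the same wrapper can be reused across several scales in one plot.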

Alternative (a classic solution, though it often does not look great):

We could use:

theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

The result:

ggplot(outlet_summary, aes(y = n_articles, x = outlet_label)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "News Coverage of ChatGPT by Media Outlet",
    y = "Number of Articles",
    x = NULL
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

Bonus: Format numbers nicely. In the plots above the counts are on the y-axis, so the continuous scale to modify is the y scale:

scale_y_continuous(labels = scales::comma)
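The formatter can be tried on its own before adding it to a scale. A small sketch using two counts from the summary table; label_comma() is the function-factory form of the same formatter in the scales package:

```r
library(scales)

# label_comma() builds a formatter function that inserts thousands separators
fmt <- label_comma()
formatted <- fmt(c(7036, 1213))
print(formatted)
# [1] "7,036" "1,213"
```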

Getting Rid of Awkward Padding

Problem: Unwanted padding around your plot

ggplot(outlet_summary, aes(x = n_articles, y = reorder(outlet, n_articles))) +
  geom_col(fill = "steelblue") +
  labs(x = "Number of Articles", y = NULL)

Notice the gap between the bars and the y-axis.

With the tip applied:

ggplot(outlet_summary, aes(x = n_articles, y = reorder(outlet, n_articles))) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(expand = c(0, 0)) +
  labs(x = "Number of Articles", y = NULL)

The bars now start directly at the axis with expand = c(0, 0).
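Setting both expansion values to zero also removes the space beyond the end of the longest bar. If only the gap at the axis should go, ggplot2's expansion() helper (used later in this document) can set the two sides separately. A sketch with a small hypothetical data frame; the 5% figure is an arbitrary choice for illustration:

```r
library(ggplot2)

# Hypothetical counts, for illustration only
toy <- data.frame(outlet = c("a", "b", "c"), n_articles = c(30, 20, 10))

# No padding on the left (at zero), 5% padding on the right
pad <- expansion(mult = c(0, 0.05))

p <- ggplot(toy, aes(x = n_articles, y = reorder(outlet, n_articles))) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(expand = pad) +
  labs(x = "Number of Articles", y = NULL)

p
```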

Better Title Positioning

Problem: Default title positioning leaves awkward spacing

ggplot(outlet_summary, aes(x = n_articles, y = reorder(outlet, n_articles))) +
  geom_col(fill = "steelblue") +
  labs(
    title = "News Coverage of ChatGPT by Media Outlet",
    subtitle = "Top 12 outlets by article count",
    x = "Number of Articles",
    y = NULL
  )

Notice how the title starts at the y-axis, where the panel begins, rather than at the left edge of the whole plot, so it floats to the right of the long axis labels.

With the tip applied:

ggplot(outlet_summary, aes(x = n_articles, y = reorder(outlet, n_articles))) +
  geom_col(fill = "steelblue") +
  labs(
    title = "News Coverage of ChatGPT by Media Outlet",
    subtitle = "Top 12 outlets by article count",
    x = "Number of Articles",
    y = NULL
  ) +
  theme(plot.title.position = "plot")

The title now aligns with the left edge of the entire plot area.

Getting Rid of (Some) Gridlines

Problem: Too many gridlines clutter the plot

Usual theme_minimal() rendering:

ggplot(outlet_summary, aes(x = n_articles, y = reorder(outlet, n_articles))) +
  geom_col(fill = "steelblue") +
  theme_minimal() +
  labs(x = "Number of Articles", y = NULL)

With the tip applied:

ggplot(outlet_summary, aes(x = n_articles, y = reorder(outlet, n_articles))) +
  geom_col(fill = "steelblue") +
  theme_minimal() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank()
  ) +
  labs(x = "Number of Articles", y = NULL)

Removing the horizontal gridlines makes the chart cleaner when bars already provide visual alignment.

Custom Date Axis Formatting

Problem: Default date formatting doesn’t meet your needs

ggplot(daily_counts, aes(x = date, y = n_articles)) +
  geom_line(color = "steelblue") +
  labs(
    title = "Daily ChatGPT News Coverage",
    x = NULL,
    y = "Articles"
  )

The date axis uses default formatting which may not be ideal.

With the tip applied, the x-axis has one break per month, labelled with the month abbreviation and two-digit year:

ggplot(daily_counts, aes(x = date, y = n_articles)) +
  geom_line(color = "steelblue") +
  scale_x_date(
    name = NULL,
    breaks = scales::breaks_width("1 month"),
    labels = scales::label_date("%b '%y")
  ) +
  labs(
    title = "Daily ChatGPT News Coverage",
    y = "Articles"
  )
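The format string follows the usual strptime conventions (%b for the abbreviated month name, %y for the two-digit year). A quick sketch of what the labeller returns for two arbitrary example dates; the month abbreviations depend on the session locale, so no exact output is shown:

```r
library(scales)

# label_date() returns a function that formats Date values
fmt <- label_date("%b '%y")
out <- fmt(as.Date(c("2024-11-01", "2025-01-15")))
print(out)  # month abbreviation, then an apostrophe and the two-digit year
```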

Fix Legend Order Mismatches

Problem: Legend colors don’t match your data order

First, let’s create data that shows this issue:

# Create category data
outlet_categories <- outlet_summary %>%
  mutate(
    category = case_when(
      outlet %in% c("forbes", "businessinsider", "cnbc") ~ "Business",
      outlet %in% c("techcrunch", "zdnet", "cnet", "techradar") ~ "Technology",
      TRUE ~ "General News"
    )
  ) %>%
  group_by(category) %>%
  summarise(total_articles = sum(n_articles)) %>%
  arrange(desc(total_articles))

outlet_categories
# A tibble: 3 x 2
  category     total_articles
  <chr>                 <int>
1 Business              11281
2 Technology            10064
3 General News           9601

The legend order (alphabetical) doesn’t match the bar order (by value):

ggplot(outlet_categories, aes(x = total_articles, y = reorder(category, total_articles), fill = category)) +
  geom_col() +
  labs(x = "Total Articles", y = NULL)

To guarantee the legend colors actually match the bar order, you must relevel the factor for category before plotting, using the ordering variable. This sets the order for both the bars and the legend:

# Order the category factor levels by total_articles (ascending, the fct_reorder default)
outlet_categories <- outlet_categories %>%
  mutate(category = forcats::fct_reorder(category, total_articles))

ggplot(outlet_categories, aes(x = total_articles, y = category, fill = category)) +
  geom_col() +
  labs(x = "Total Articles", y = NULL) +
  guides(fill = guide_legend(reverse = TRUE))

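Now the legend order and colors match the bar order: fct_reorder() sorts the levels in ascending order, and guide_legend(reverse = TRUE) flips the legend so the largest category is listed first, like the top bar.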
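The level order that fct_reorder() produces can be checked directly. A minimal sketch using the three totals from the table above:

```r
library(forcats)

# Totals copied from the outlet_categories table
category <- c("Business", "Technology", "General News")
total_articles <- c(11281, 10064, 9601)

# fct_reorder() sorts levels by ascending value of the second argument
f <- fct_reorder(category, total_articles)
levels(f)
# [1] "General News" "Technology"   "Business"
```

The first level is drawn at the bottom of a discrete y-axis, which is why the smallest category ends up as the lowest bar.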

Wrapping Long Labels in Facets

Problem: Long facet labels overlap or look messy

# Create data with long category names for faceting
outlet_by_category <- headlines %>%
  mutate(outlet = str_remove(media_url, "\\.com$|\\.org$|\\.net$")) %>%
  mutate(
    category = case_when(
      outlet %in% c("forbes", "businessinsider", "cnbc", "thestreet") ~ 
        "Business and Financial News Coverage",
      outlet %in% c("techcrunch", "zdnet", "cnet", "techradar", "wired") ~ 
        "Technology Industry Publications",
      outlet %in% c("theguardian", "nytimes", "washingtonpost") ~ 
        "Major National Newspapers",
      TRUE ~ "Other Media Sources"
    )
  ) %>%
  count(category, outlet) %>%
  group_by(category) %>%
  slice_max(n, n = 5) %>%
  ungroup()  # drop the grouping so later steps see a plain tibble

ggplot(outlet_by_category, aes(x = n, y = reorder(outlet, n))) +
  geom_col(fill = "steelblue") +
  facet_wrap(~ category, scales = "free_y") +
  labs(x = "Number of Articles", y = NULL) +
  theme_bw(base_size=20)

The facet titles are cut off because they’re too long.

Solution

ggplot(outlet_by_category, aes(x = n, y = reorder(outlet, n))) +
  geom_col(fill = "steelblue") +
  facet_wrap(~ category, scales = "free_y", labeller = label_wrap_gen(width = 25)) +
  labs(x = "Number of Articles", y = NULL) +
  theme_bw(base_size = 20) +
  # theme() must come after the complete theme, otherwise theme_bw() overrides it
  theme(strip.text = element_text(size = 9))

The label_wrap_gen(width = 25) wraps the facet titles at around 25 characters.
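A labeller is just a function that receives a data frame of facet labels and returns the strings to display, so it can be called by hand to preview the wrapping. A sketch under that assumption, using two of the category names defined above:

```r
library(ggplot2)

# Build the labeller, then call it on a data frame of facet labels
wrap_25 <- label_wrap_gen(width = 25)
facet_labels <- data.frame(
  category = c("Business and Financial News Coverage", "Other Media Sources")
)
wrapped <- wrap_25(facet_labels)

# The long label gains a line break; the short one is unchanged
cat(wrapped$category, sep = "\n---\n")
```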

Control Legend Visibility

Problem: Unwanted legend entries from specific geoms

top_outlets <- outlet_summary %>% slice_max(n_articles, n = 6)

ggplot(top_outlets, aes(x = n_articles, y = reorder(outlet, n_articles), fill = outlet)) +
  geom_col() +
  geom_text(aes(label = n_articles, color = outlet), hjust = -0.2) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(x = "Number of Articles", y = NULL)

Both the bars and text create legend entries, which is redundant.

With the tip applied:

ggplot(top_outlets, aes(x = n_articles, y = reorder(outlet, n_articles), fill = outlet)) +
  geom_col() +
  geom_text(aes(label = n_articles), hjust = -0.2, show.legend = FALSE) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(x = "Number of Articles", y = NULL)

Using show.legend = FALSE removes unnecessary legend entries.

Transparent Colors

Problem: Overlapping points or areas obscure underlying data

# Create some overlapping data
set.seed(42)
scatter_data <- tibble(
  x = rnorm(2500, mean = 100, sd = 30),
  y = x + rnorm(2500, mean = 0, sd = 20)
)

Without the tip:

ggplot(scatter_data, aes(x = x, y = y)) +
  geom_point(color = "steelblue", size = 3) +
  labs(title = "Overlapping points obscure density")

It’s hard to see where points are concentrated.

With the tip applied:

ggplot(scatter_data, aes(x = x, y = y)) +
  geom_point(color = scales::alpha("steelblue", 0.1), size = 2) +
  labs(title = "Transparency reveals density patterns")

Using scales::alpha("steelblue", 0.1) makes each point nearly transparent, so regions where many points overlap appear darker and the density becomes visible.
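What alpha() returns is simply a color string with an opacity channel appended, which is why it can be used anywhere a color is accepted. A short sketch; the exact alpha digits are not shown because they depend on rounding:

```r
library(scales)

# alpha() returns a hex colour with an added alpha (opacity) channel
faded <- alpha("steelblue", 0.1)
print(faded)  # "#" plus 6 hex digits of colour and 2 of alpha

# A similar effect is available through the geom's own alpha argument:
# geom_point(color = "steelblue", alpha = 0.1)
```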

Part 2: Data Manipulation

Subset Data Within ggplot

Problem: Need different data filtering for specific plot layers

# Add some outlier detection
daily_with_outliers <- daily_counts %>%
  mutate(is_outlier = n_articles > quantile(n_articles, 0.95))

ggplot(daily_with_outliers, aes(x = date, y = n_articles)) +
  geom_line(color = "gray70") +
  geom_point(
    data = . %>% filter(is_outlier),
    color = "red", size = 2
  ) +
  labs(
    title = "Daily Coverage with High-Volume Days Highlighted",
    subtitle = "Red points show days above 95th percentile",
    x = NULL, y = "Articles"
  )

The data = . %>% filter(is_outlier) syntax filters data for just the points layer.
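This works because a pipeline that starts with a dot builds a function rather than running immediately, and ggplot2 accepts a function as a layer's data argument, calling it on the plot's default data. A sketch with a small hypothetical tibble:

```r
library(dplyr)

# `. %>% filter(...)` does not run a pipeline; it builds a one-argument function
keep_outliers <- . %>% filter(is_outlier)

# Hypothetical mini data set, for illustration only
toy <- tibble(
  n_articles = c(10, 12, 95, 11),
  is_outlier = c(FALSE, FALSE, TRUE, FALSE)
)

is.function(keep_outliers)  # TRUE
keep_outliers(toy)          # same rows as filter(toy, is_outlier)
```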

Summary Statistics with Confidence Intervals

Problem: Need to show uncertainty in group means

# Calculate mean and CI by outlet category
category_stats <- headlines %>%
  mutate(
    outlet = str_remove(media_url, "\\.com$|\\.org$|\\.net$"),
    title_length = nchar(title)
  ) %>%
  mutate(
    category = case_when(
      outlet %in% c("forbes", "businessinsider", "cnbc") ~ "Business",
      outlet %in% c("techcrunch", "zdnet", "cnet", "techradar") ~ "Tech",
      outlet %in% c("theguardian", "nytimes", "washingtonpost") ~ "News",
      TRUE ~ "Other"
    )
  ) %>%
  group_by(category) %>%
  summarise(
    M = mean(title_length, na.rm = TRUE),
    sd = sd(title_length, na.rm = TRUE),
    n = sum(!is.na(title_length)),
    se = sd / sqrt(n)
  )

ggplot(category_stats, aes(x = M, y = reorder(category, M))) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_errorbar(
    aes(xmin = M - 1.96*se, xmax = M + 1.96*se),
    width = 0.2
  ) +
  labs(
    title = "Average Headline Length by Outlet Category",
    subtitle = "Error bars show 95% confidence intervals",
    x = "Mean Title Length (characters)",
    y = NULL
  )
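The interval arithmetic behind the error bars can be worked through on a single vector. A sketch with hypothetical headline lengths; the t-based variant is an addition for comparison and is not used in the plot above:

```r
# Hypothetical headline lengths (characters), for illustration only
title_length <- c(60, 70, 80, 90, 100)

n  <- length(title_length)
M  <- mean(title_length)            # 80
se <- sd(title_length) / sqrt(n)    # standard error of the mean

# Normal-approximation 95% interval, as in the plot code
ci_normal <- c(lower = M - 1.96 * se, upper = M + 1.96 * se)

# t-based interval, wider when n is small
t_crit <- qt(0.975, df = n - 1)
ci_t <- c(lower = M - t_crit * se, upper = M + t_crit * se)

ci_normal
ci_t
```

With thousands of headlines per category the two intervals are practically identical, so the 1.96 multiplier is a reasonable shortcut here.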

Filter by Minimum Observations

Problem: Need to calculate statistics only for groups with sufficient sample sizes

# Only calculate stats for outlets with enough articles
outlet_stats <- headlines %>%
  mutate(
    outlet = str_remove(media_url, "\\.com$|\\.org$|\\.net$"),
    title_length = nchar(title)
  ) %>%
  group_by(outlet) %>%
  summarize(
    n_articles = n(),
    mean_length = ifelse(
      n_articles >= 50,
      mean(title_length, na.rm = TRUE),
      NA
    )
  ) %>%
  filter(!is.na(mean_length)) %>%
  slice_max(n_articles, n = 10)

outlet_stats
# A tibble: 10 x 3
   outlet          n_articles mean_length
   <chr>                <int>       <dbl>
 1 forbes                7036        63.5
 2 benzinga              4206       100. 
 3 businessinsider       3147        94.4
 4 techradar             2902        76.4
 5 zdnet                 2867        59.4
 6 cnet                  2270        56.0
 7 techcrunch            2025        66.5
 8 nytimes               1530        53.2
 9 ibtimes               1484        63.1
10 theguardian           1213        75.3

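Using ifelse(n_articles >= 50, ...) returns NA for outlets with fewer than 50 articles, and the filter() step that follows drops those rows, so statistics are reported only for outlets with at least 50 articles.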
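An alternative that some may find easier to read is to drop the small groups before summarising, using a grouped filter(). A sketch on a hypothetical tibble, with a threshold of 2 instead of 50 so the toy data can show the effect:

```r
library(dplyr)

# Hypothetical data: outlet "a" has 3 articles, outlet "b" has 1
toy <- tibble(
  outlet = c("a", "a", "a", "b"),
  title_length = c(50, 60, 70, 40)
)

# Drop small groups first, then summarise only what remains
outlet_means <- toy %>%
  group_by(outlet) %>%
  filter(n() >= 2) %>%   # threshold of 2 here; the example above uses 50
  summarise(
    n_articles = n(),
    mean_length = mean(title_length, na.rm = TRUE)
  )

outlet_means
```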

Part 3: External Data Access

Harvard Dataverse Integration

Problem: Need to access publicly available research datasets

Solution:

# Set up connection to Harvard Dataverse
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

# Download dataset directly into R
dataset <- dataverse::get_dataframe_by_name(
  "filename.tab",
  "doi:10.7910/DVN/XXXXXX"  # Replace with actual DOI
)

Resources: CRAN dataverse package vignette

Additional Resources