Tidyverse Tips / Refresher
Reading CSV files: read_csv() vs. read.csv()
# Base R
data_base <- read.csv("my_data.csv")
# Tidyverse (readr)
library(tidyverse)
data_tidy <- read_csv("my_data.csv")
Why prefer read_csv()?
- Guesses column types (numeric, date, etc.), reports the guessed column specification, and shows a progress bar on large files.
- Returns a tibble, which prints more cleanly (truncated rows/columns).
- Typically faster than read.csv() on large files.
Tip: To write back out, use write_csv(data, "out.csv").
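A minimal round-trip sketch, assuming readr is installed. The toy data frame, its column names, and the temporary file are hypothetical and only there to make the example self-contained. It writes a small table with write_csv(), reads it back with read_csv(), and states the column types explicitly through col_types rather than relying on guessing.

```r
library(readr)

# Hypothetical toy data; column names are illustrative only
toy <- data.frame(
  country = c("DEU", "FRA"),
  year    = c(2000L, 2001L),
  deaths  = c(10.5, 20.25)
)

# Write to a temporary file so the example leaves nothing behind
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# Read it back, stating the expected type of each column explicitly
toy_back <- read_csv(
  path,
  col_types = cols(
    country = col_character(),
    year    = col_integer(),
    deaths  = col_double()
  )
)

class(toy_back)  # a tibble (which is also a data.frame)
spec(toy_back)   # the column specification that was used
```

Stating col_types is optional, but it turns a silent wrong guess into a visible parsing problem.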
Inspecting Your Data Immediately
head(data_tidy) # first 6 rows
tail(data_tidy) # last 6 rows
str(data_tidy) # structure: types & sample values
glimpse(data_tidy) # tidyverse-friendly structure
dim(data_tidy) # rows, columns
names(data_tidy) # column names
Why inspect early?
- Check that columns were imported with the expected names and types
- Spot missing values or parsing problems
- Get a sense of dataset size and structure
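One possible way to spot missing values and parsing problems right after import, sketched under the assumption that readr is installed. The CSV content below is made up: it contains one empty cell and one value ("abc") that cannot be parsed as a number.

```r
library(readr)

# Hypothetical CSV content: one empty cell and one unparseable number
path <- tempfile(fileext = ".csv")
writeLines(
  c("country,year,deaths",
    "DEU,2000,10",
    "FRA,2001,",
    "ITA,2002,abc"),
  path
)

# Declaring deaths as a number makes the bad value show up as a parsing problem
# (readr emits a warning here and stores NA for the value it could not parse)
df <- read_csv(
  path,
  col_types = cols(
    country = col_character(),
    year    = col_integer(),
    deaths  = col_double()
  )
)

# Count missing values per column
na_counts <- colSums(is.na(df))
na_counts

# List the cells readr could not parse
problems(df)
```

Both the empty cell and the failed parse end up as NA in deaths, so the NA count alone does not tell them apart; problems() does.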
Dealing with Spaces or Special Characters in Names
Sometimes column names contain spaces or punctuation. You can’t refer to them directly without backticks.
# Suppose “Number of deaths” was imported:
data$`Number of deaths`
# Better: rename immediately
data <- data %>%
  rename(NumberOfDeaths = `Number of deaths`)
Tip: Use janitor’s clean_names() to automatically convert all names to snake_case:
library(janitor)
data <- data %>% clean_names()
# “Number of Deaths” → number_of_deaths
Renaming Columns
data <- data %>%
  rename(
    deaths_total = `Number of deaths`,
    country_code = CountryCode
  )
Rename multiple columns in one call using new_name = old_name syntax.
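A self-contained sketch of the same idea on made-up data, assuming dplyr is installed. The tibble and its values are hypothetical. It also shows rename_with(), which applies a function to the names instead of listing them one by one.

```r
library(dplyr)

# Hypothetical data with an awkward column name
raw <- tibble(
  `Number of deaths` = c(5, 7),
  CountryCode        = c("DEU", "FRA")
)

# new_name = old_name, several columns in one call
renamed <- raw %>%
  rename(
    deaths_total = `Number of deaths`,
    country_code = CountryCode
  )
names(renamed)

# Apply a function to every column name
upper <- renamed %>% rename_with(toupper)
names(upper)
```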
Merging (Joining) Two Datasets
Why We Merge
- Merge (join) means adding columns by matching rows.
- Requires a common identifier (key) that uniquely matches rows across datasets.
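To illustrate why the key should match rows uniquely, here is a small sketch on made-up data, assuming dplyr is installed. The values and the accidental duplicate are hypothetical. A key combination that occurs twice in the dataset being added makes the join return more rows than the main dataset had.

```r
library(dplyr)

main_data <- tibble(
  country      = c("DEU", "FRA"),
  year         = c(2000, 2000),
  deaths_total = c(10, 20)
)

# Hypothetical mistake: DEU/2000 appears twice in the data to add
data_to_add <- tibble(
  country    = c("DEU", "DEU", "FRA"),
  year       = c(2000, 2000, 2000),
  population = c(80, 81, 60)
)

# Key combinations that occur more than once
dup_keys <- data_to_add %>%
  count(country, year) %>%
  filter(n > 1)
dup_keys

merged <- main_data %>%
  left_join(data_to_add, by = c("country", "year"))

nrow(main_data)  # 2
nrow(merged)     # 3: the DEU row was matched twice
```

Comparing row counts before and after a join is a cheap sanity check.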
Join Functions
# Simple one-key join
merged <- main_data %>%
  left_join(data_to_add, by = "country_year_id")
# Two-key join
merged <- main_data %>%
  left_join(data_to_add, by = c("country", "year"))
# Other types:
# inner_join(): only keep rows present in both
# right_join(): keep all from data_to_add
Tip: Before joining, ensure keys have the same type and values:
unique(main_data$country)
unique(data_to_add$country)
Tip: If one dataset uses “DEU” and the other “Germany,” recode or create a lookup table before joining.
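The lookup-table approach could look roughly like this, assuming dplyr is installed. All data, the lookup table, and the column name country_name are hypothetical. anti_join() is used before and after to list the keys in main_data that find no match.

```r
library(dplyr)

main_data   <- tibble(country = c("DEU", "FRA"), deaths_total = c(10, 20))
data_to_add <- tibble(country = c("Germany", "France"), population = c(83, 67))

# Rows of main_data whose key has no match in data_to_add
unmatched_before <- anti_join(main_data, data_to_add, by = "country")

# Hypothetical lookup table translating names to codes
lookup <- tibble(
  country_name = c("Germany", "France"),
  country      = c("DEU", "FRA")
)

# Attach the code to data_to_add, then keep the code as the key
data_to_add_fixed <- data_to_add %>%
  rename(country_name = country) %>%
  left_join(lookup, by = "country_name") %>%
  select(country, population)

merged <- left_join(main_data, data_to_add_fixed, by = "country")
unmatched_after <- anti_join(main_data, data_to_add_fixed, by = "country")

nrow(unmatched_before)  # 2: nothing matched
nrow(unmatched_after)   # 0: every key now has a match
```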
Appending (Binding) Rows
This is how we would stack observations:
# Recommended
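total <- bind_rows(data_for_germany,
                   data_for_france)
# Base R equivalent (only when both data frames have the same columns):
total2 <- rbind(data_for_germany, data_for_france)
Note: bind_rows() will fill in missing columns with NA if one data frame has extra columns, whereas rbind() stops with an error when the columns differ.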
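A small sketch of the NA-filling behaviour on made-up data, assuming dplyr is installed. The values and the source column are hypothetical. Naming the inputs and passing .id records which data frame each row came from.

```r
library(dplyr)

data_for_germany <- tibble(country = "DEU", year = 2000, deaths_total = 10)

# Hypothetical: the French data has an extra population column
data_for_france <- tibble(country = "FRA", year = 2000,
                          deaths_total = 20, population = 60)

total <- bind_rows(
  germany = data_for_germany,
  france  = data_for_france,
  .id     = "source"
)

total
names(total)      # source, country, year, deaths_total, population
total$population  # NA for the German row, 60 for the French row
```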
Common Operations
- Chaining with the pipe %>%
result <- data %>%
  filter(year >= 2000) %>%
  select(country, year, deaths_total) %>%
  arrange(desc(deaths_total))
- Use mutate() to create or transform columns
data <- data %>%
  mutate(
    deaths_per_100k = deaths_total / population * 100000,
    log_deaths = log(deaths_total)
  )
- Quick summaries with group_by() + summarise()
summary <- data %>%
  group_by(country) %>%
  summarise(
    total_deaths = sum(deaths_total, na.rm = TRUE),
    avg_deaths = mean(deaths_total, na.rm = TRUE)
  )
- Check for duplicates (especially before joining or binding)
data %>%
  add_count(country, year) %>%
  filter(n > 1)
- Convert to factors or dates cleanly
data <- data %>%
  mutate(
    country = as_factor(country),
    date = lubridate::ymd(date_string)
  )
- Use glimpse() for a prettier, horizontally oriented overview.
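The operations above can be tried end to end on a toy table. This is only a sketch, assuming dplyr is installed; the numbers are invented and the FRA/2010 row is duplicated on purpose so the duplicate check has something to find.

```r
library(dplyr)

# Hypothetical data; FRA/2010 is entered twice
data <- tibble(
  country      = c("DEU", "DEU", "FRA", "FRA", "FRA"),
  year         = c(1999, 2005, 2005, 2010, 2010),
  deaths_total = c(100, 120, 90, 95, 95),
  population   = c(82e6, 82e6, 63e6, 65e6, 65e6)
)

# Duplicate check: rows whose country/year combination occurs more than once
dupes <- data %>%
  add_count(country, year) %>%
  filter(n > 1)
dupes

# Drop exact duplicate rows, then filter, transform, and summarise
result <- data %>%
  distinct() %>%
  filter(year >= 2000) %>%
  mutate(deaths_per_100k = deaths_total / population * 100000) %>%
  group_by(country) %>%
  summarise(
    total_deaths = sum(deaths_total, na.rm = TRUE),
    avg_rate     = mean(deaths_per_100k, na.rm = TRUE),
    .groups      = "drop"
  )
result
```

Note that distinct() only removes rows that are identical in every column; duplicates that differ in some value need a decision about which row to keep.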
Useful reference
See Grant McDermott’s excellent slides