Tidyverse Tips / Refresher
Reading CSV files: read_csv() vs. read.csv()
# Base R
data_base <- read.csv("my_data.csv")
# Tidyverse (readr)
library(tidyverse)
data_tidy <- read_csv("my_data.csv")
Why prefer read_csv()?
- Automatically parses column types (numeric, date, etc.) and shows a progress bar.
- Returns a tibble, which prints more cleanly (truncated rows/columns).
- Faster for large files.
Tip: To write back out, use write_csv(data, "out.csv").
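When the guessed column types matter, they can also be declared up front. The following is a minimal sketch, assuming a small made-up data frame (`toy`) written to a temporary file; the column names are illustrative only and do not come from any real dataset.

```r
library(readr)

# Hypothetical toy data, written to a temporary file so the sketch is self-contained
toy <- data.frame(
  country = c("DEU", "FRA"),
  year    = c(2001L, 2002L),
  date    = c("2001-03-01", "2002-07-15")
)
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# Declare the column types explicitly instead of relying on guessing
toy_back <- read_csv(
  path,
  col_types = cols(
    country = col_character(),
    year    = col_integer(),
    date    = col_date(format = "%Y-%m-%d")
  )
)

spec(toy_back)      # the column specification that was applied
problems(toy_back)  # parsing failures, if any (zero rows when all went well)
```

Declaring types this way turns a silent mis-guess into a visible parsing problem, which is usually easier to debug.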
Inspecting Your Data Immediately
head(data_tidy) # first 6 rows
tail(data_tidy) # last 6 rows
str(data_tidy) # structure: types & sample values
glimpse(data_tidy) # tidyverse-friendly structure
dim(data_tidy) # rows, columns
names(data_tidy) # column names
Why inspect early?
- Check that columns imported with correct (or at least expected) names and types
- Spot missing values or parsing problems
- Get a sense of dataset size and structure
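The checks listed above can be made a little more systematic. A rough sketch, assuming a made-up tibble (`toy`) in which one numeric column was imported as text:

```r
library(dplyr)

# Hypothetical toy data: one missing year, and deaths_total stored as text
toy <- tibble(
  country      = c("DEU", "FRA", "ITA"),
  year         = c(2001, 2002, NA),
  deaths_total = c("10", "n/a", "30")
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
na_counts

# Column classes: a character class on a numeric-looking column hints at a parsing problem
col_classes <- sapply(toy, class)
col_classes

# Which rows hold entries that do not convert to a number?
bad <- toy %>%
  filter(is.na(suppressWarnings(as.numeric(deaths_total))))
bad
```

Here the stray "n/a" entry is what forces the whole column to character; passing it through the `na` argument of read_csv() at import time would be one way to avoid that.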
Dealing with Spaces or Special Characters in Names
Sometimes column names contain spaces or punctuation. You can’t refer to them directly without backticks.
# Suppose “Number of deaths” was imported:
data$`Number of deaths`
# Better: rename immediately
data <- data %>%
  rename(NumberOfDeaths = `Number of deaths`)
Tip: Use janitor’s clean_names() to automatically convert all names to snake_case:
library(janitor)
data <- data %>% clean_names()
# “Number of Deaths” → number_of_deaths
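If janitor is not installed, a similar effect can be approximated with dplyr's rename_with(). This is only a sketch: `to_snake()` is a hypothetical helper written for this example, and unlike clean_names() it does not split CamelCase names or resolve duplicate names.

```r
library(dplyr)

# Hypothetical toy data with awkward column names
toy <- tibble(`Number of deaths` = c(1, 2), `Country Code` = c("DEU", "FRA"))

# Hypothetical helper: lower-case, then replace runs of non-alphanumerics with "_"
to_snake <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_|_$", "", x)  # trim a leading or trailing underscore
}

# rename_with() applies the function to every column name
toy_clean <- toy %>% rename_with(to_snake)
names(toy_clean)
```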
Renaming Columns
data <- data %>%
  rename(
    deaths_total = `Number of deaths`,
    country_code = CountryCode
  )
Rename multiple columns in one call using new_name = old_name syntax.
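When the same renaming has to be applied to several files, the new_name = old_name pairs can be stored once in a named character vector. A minimal sketch, assuming made-up data; `lookup` and the "Population" column are illustrative names only.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(`Number of deaths` = c(1, 2), CountryCode = c("DEU", "FRA"))

# Names are the new column names, values are the old ones
lookup <- c(deaths_total = "Number of deaths", country_code = "CountryCode")

# all_of() raises an error if any old name is missing from the data
renamed <- toy %>% rename(all_of(lookup))
names(renamed)

# any_of() silently skips old names that are absent (here: "Population")
lookup_extra <- c(lookup, pop = "Population")
renamed2 <- toy %>% rename(any_of(lookup_extra))
names(renamed2)
```

all_of() is the stricter choice and tends to be safer when a missing column should stop the script.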
Merging (Joining) Two Datasets
Why We Merge
- Merge (join) means adding columns by matching rows.
- Requires a common identifier (key) that uniquely matches rows across datasets.
Join Functions
# Simple one-key join
merged <- main_data %>%
  left_join(data_to_add, by = "country_year_id")
# Two-key join
merged <- main_data %>%
  left_join(data_to_add, by = c("country", "year"))
# Other types:
# inner_join(): only keep rows present in both
# right_join(): keep all from data_to_add
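To see how the join types differ, it can help to run them on a tiny example. The sketch below uses made-up values for main_data and data_to_add, and adds anti_join() as one way to list rows that found no match.

```r
library(dplyr)

# Hypothetical toy data: ITA appears only in main_data, ESP only in data_to_add
main_data <- tibble(
  country      = c("DEU", "FRA", "ITA"),
  year         = c(2000, 2000, 2000),
  deaths_total = c(10, 20, 30)
)
data_to_add <- tibble(
  country    = c("DEU", "FRA", "ESP"),
  year       = c(2000, 2000, 2000),
  population = c(82, 60, 40)
)

keys <- c("country", "year")

left  <- left_join(main_data, data_to_add, by = keys)   # all rows of main_data; ITA gets NA population
inner <- inner_join(main_data, data_to_add, by = keys)  # only DEU and FRA
right <- right_join(main_data, data_to_add, by = keys)  # all rows of data_to_add; ESP gets NA deaths_total

# Rows of main_data without a partner in data_to_add
unmatched <- anti_join(main_data, data_to_add, by = keys)
unmatched
```

Comparing nrow() before and after a join is a cheap sanity check: a left join on a unique key should return exactly as many rows as main_data.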
Tip: Before joining, ensure keys have the same type and values:
unique(main_data$country)
unique(data_to_add$country)
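Checking types matters as much as checking values, because a join on a numeric key and a character key fails. A small sketch, assuming made-up data in which year was imported as text in one of the two tables:

```r
library(dplyr)

# Hypothetical toy data: year is numeric in one table and character in the other
main_data   <- tibble(country = c("DEU", "FRA"), year = c(2000, 2001))
data_to_add <- tibble(country = c("DEU", "FRA"), year = c("2000", "2001"),
                      population = c(82, 60))

class(main_data$year)    # "numeric"
class(data_to_add$year)  # "character"

# Align the types before joining
data_to_add <- data_to_add %>% mutate(year = as.numeric(year))

# Key values present in one table but not in the other (both empty here)
only_in_main <- setdiff(main_data$country, data_to_add$country)
only_in_add  <- setdiff(data_to_add$country, main_data$country)

merged <- left_join(main_data, data_to_add, by = c("country", "year"))
```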
Tip: If one dataset uses “DEU” and the other “Germany,” recode or create a lookup table before joining.
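One possible way to handle such a mismatch is a small lookup table that is itself joined on. The sketch below uses made-up values; `lookup`, `country_name`, and `iso3` are illustrative names, not part of any particular dataset.

```r
library(dplyr)

# Hypothetical toy data: codes in one table, full names in the other
main_data   <- tibble(country = c("DEU", "FRA"), deaths_total = c(10, 20))
data_to_add <- tibble(country = c("Germany", "France"), population = c(82, 60))

# Hypothetical lookup table mapping country names to three-letter codes
lookup <- tibble(
  country_name = c("Germany", "France"),
  iso3         = c("DEU", "FRA")
)

# Attach the code, then keep it as the new key column
data_to_add_coded <- data_to_add %>%
  left_join(lookup, by = c("country" = "country_name")) %>%
  select(country = iso3, population)

merged <- left_join(main_data, data_to_add_coded, by = "country")
merged
```

Keeping the mapping in a table rather than in a chain of recode statements makes it easy to spot names that have no code yet: they show up as NA in iso3 after the first join.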
Appending (Binding) Rows
This is how we would stack observations:
# Recommended
total <- bind_rows(data_for_germany,
                   data_for_france)
# Base R equivalent:
total2 <- rbind(data_for_germany, data_for_france)
Note: bind_rows() fills in missing columns with NA if one data frame has extra columns, whereas rbind() stops with an error when the columns do not match.
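The difference is easiest to see on a toy example. This sketch assumes two made-up tibbles, one of which has an extra population column; the `.id` argument is used here to record which input each row came from.

```r
library(dplyr)

# Hypothetical toy data: the French table has an extra column
data_for_germany <- tibble(year = c(2000, 2001), deaths_total = c(10, 12))
data_for_france  <- tibble(year = c(2000, 2001), deaths_total = c(20, 21),
                           population = c(60, 61))

# Naming the inputs and setting .id adds a column that identifies the source
total <- bind_rows(
  germany = data_for_germany,
  france  = data_for_france,
  .id     = "source"
)
total  # population is NA for the rows that came from data_for_germany

# rbind() refuses to stack data frames whose columns differ
rbind_failed <- inherits(
  try(rbind(data_for_germany, data_for_france), silent = TRUE),
  "try-error"
)
rbind_failed
```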
Common Operations
- Chaining with the pipe %>%
result <- data %>%
  filter(year >= 2000) %>%
  select(country, year, deaths_total) %>%
  arrange(desc(deaths_total))
- Use mutate() to create or transform columns
data <- data %>%
  mutate(
    deaths_per_100k = deaths_total / population * 100000,
    log_deaths = log(deaths_total)
  )
- Quick summaries with group_by() + summarise()
summary <- data %>%
  group_by(country) %>%
  summarise(
    total_deaths = sum(deaths_total, na.rm = TRUE),
    avg_deaths = mean(deaths_total, na.rm = TRUE)
  )
- Check for duplicates (especially before joining or binding)
data %>%
  add_count(country, year) %>%
  filter(n > 1)
- Convert to factors or dates cleanly
data <- data %>%
  mutate(
    country = as_factor(country),
    date = lubridate::ymd(date_string)
  )
- Use glimpse() for a compact, transposed overview: each column is printed on its own line with its type and first few values.
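The operations listed above can be combined into one small workflow. The following is a sketch on made-up data (the values are invented and contain one deliberately duplicated row), not a recommended fixed recipe.

```r
library(dplyr)

# Hypothetical toy data; the last two rows are exact duplicates
data <- tibble(
  country      = c("DEU", "DEU", "FRA", "FRA", "FRA"),
  year         = c(1999, 2000, 2000, 2001, 2001),
  deaths_total = c(10, 12, 20, 21, 21),
  population   = c(82e6, 82e6, 60e6, 61e6, 61e6)
)

# 1. Find country-year pairs that occur more than once
dupes <- data %>%
  add_count(country, year) %>%
  filter(n > 1)

# 2. Drop exact duplicate rows, restrict the period, add a rate
clean <- data %>%
  distinct() %>%
  filter(year >= 2000) %>%
  mutate(deaths_per_100k = deaths_total / population * 100000)

# 3. Collapse to one row per country, largest total first
by_country <- clean %>%
  group_by(country) %>%
  summarise(total_deaths = sum(deaths_total, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_deaths))
by_country
```

distinct() only removes rows that are identical in every column; duplicates that share a key but differ in values need a deliberate decision about which row to keep.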
Useful reference
See Grant McDermott’s excellent slides