45:00
6 Application 1: Data Manipulation
6.1 A real world example
In this application, we will practice some of the coding skills learned in the training so far. Whereas the examples used in the textbook use toy datasets which are already clean and well-formatted, the datasets used in our applications may require some initial cleaning prior to analysis. This is likely closer to tasks you might encounter in your everyday work at KNBS.
In this particular application, we would like you to produce 3 simple pieces of analysis using Kenya’s 2019 census data:
- Breakdown of Kenya by religious belief
- Share of people who are migrants in each county
- Average working hours for men and women in rural vs. urban areas.
There are two stages to this task. The first is to read in the data and prepare it to be analysed. This will involve reading the data in properly, fixing any column name issues, reducing the size of the dataset if it is too large, and finally dealing with any missing values, or NAs, that are found in the data.
The second is to perform the analysis on your dataset. This will involve creating new columns, filtering out certain data, and creating summaries.
6.2 Preparing your dataset for analysis
Read the dataset using
read_csv
and clean the column names usingjanitor::clean_names
. The 2019 Census data you need is found at this link. The survey information is hereThere are lots of columns in this dataset, making it slow to perform analysis. Select the columns required for the three pieces of analysis (e.g., p17 for religion) using
select
. Rename the selected columns to be more descriptive (e.g., religion) usingrename
.Examine your data. For each task, are there any NA’s or other strange values in your columns? Think about how to handle any NA’s and use
mutate
andcase_when
, orfilter
to address them.
6.3 Analysis
6.3.1 Breakdown of Kenya by religious belief
- Analyze the religious breakdown: Group by the religion column using
group_by
and count the number of individuals in each category usingsummarise(total = n())
andmutate.
6.3.3 Average working hours for men and women in rural vs. urban areas.
Filter the data to include only the working-age population or adults (e.g., age >= 18) using filter.
Create binary indicator columns for rural (based on ea_type) and female (based on sex) using
mutate
andcase_when
Create an hours_worked column where missing values (NA) in the original hours worked column (hours_worked_if_work) are replaced with 0, using
mutate
andreplace_na.
Calculate the average working hours: Group by the rural and female columns using group_by and calculate the mean of hours_worked and hours_worked_if_work using
summarise(mean())
. Remember to handle potential NAs in the mean calculation (e.g., using na.rm = TRUE).
Solution: Preparing your dataset for analysis
library(tidyverse)
library(janitor)
<- read_csv("../intro_R-main/data/census.csv") |>
census ::clean_names()
janitor
<- census |>
census_small select(county, subcounty_code, ea_type, p11, p12, p17, p19, p52) |>
rename(
religion = p17,
age = p12,
sex = p11,
hours_worked_if_work = p52
)
<- census_small |>
census_small mutate(religion = case_when(religion == 99 ~ NA, TRUE ~ religion))
Solution: Breakdown of Kenya by religious belief
<- census_small |>
religion_by_county group_by(county, religion) |>
summarise(total = n()) |>
mutate(share = total/sum(total))
Solution: Share of people who are migrants in each county
<- census_small |>
migration_prop mutate(migrant = if_else(birthplace == county, 0, 1)) |>
summarise(migration_prop = mean(migrant))
Solution: Average working hours for men and women in rural vs. urban areas
<- census_small |> mutate(
census_small rural = if_else(ea_type == 1, 1, 0),
female = if_else(sex == 2, 1, 0),
hours_worked = if_else(is.na(hours_worked_if_work), 0, hours_worked_if_work)
)
<- census_small |> filter(age >= 18) |>
working_hours_mean_adult group_by(rural, female) |>
summarise(
hours_worked = mean(hours_worked, na.rm = TRUE),
hours_worked_if_work = mean(hours_worked_if_work, na.rm = TRUE)
)