6 Application 1: Data Manipulation

45:00

6.1 A real world example

In this application, we will practice some of the coding skills learned in the training so far. Whereas the examples used in the textbook use toy datasets which are already clean and well-formatted, the datasets used in our applications may require some initial cleaning prior to analysis. This is likely closer to tasks you might encounter in your everyday work at KNBS.

In this particular application, we would like you to produce 3 simple pieces of analysis using Kenya’s 2019 census data:

Breakdown of Kenya by religious belief
Share of people who are migrants in each county
Average working hours for men and women in rural vs. urban areas.

There are two stages to this task. The first is to read in the data and prepare it to be analysed. This will involve reading the data in properly, fixing any column name issues, reducing the size of the dataset if it is too large, and finally dealing with any missing values, or NAs, that are found in the data.

The second is to perform the analysis on your dataset. This will involve creating new columns, filtering out certain data, and creating summaries.

6.2 Preparing your dataset for analysis

Read the dataset using read_csv and clean the column names using janitor::clean_names. The 2019 Census data you need is found at this link. The survey information is here
There are lots of columns in this dataset, making it slow to perform analysis. Select the columns required for the three pieces of analysis (e.g., p17 for religion) using select. Rename the selected columns to be more descriptive (e.g., religion) using rename.
Examine your data. For each task, are there any NA’s or other strange values in your columns? Think about how to handle any NA’s and use mutate and case_when, or filter to address them.

6.3 Analysis

6.3.1 Breakdown of Kenya by religious belief

Analyze the religious breakdown: Group by the religion column using group_by and count the number of individuals in each category using summarise(total = n()) and mutate.

6.3.3 Average working hours for men and women in rural vs. urban areas.

Filter the data to include only the working-age population or adults (e.g., age >= 18) using filter.
Create binary indicator columns for rural (based on ea_type) and female (based on sex) using mutate and case_when
Create an hours_worked column where missing values (NA) in the original hours worked column (hours_worked_if_work) are replaced with 0, using mutate and replace_na.
Calculate the average working hours: Group by the rural and female columns using group_by and calculate the mean of hours_worked and hours_worked_if_work using summarise(mean()). Remember to handle potential NAs in the mean calculation (e.g., using na.rm = TRUE).

Solution: Preparing your dataset for analysis

library(tidyverse)
library(janitor)

census <- read_csv("../intro_R-main/data/census.csv") |>
  janitor::clean_names()

census_small <- census |>
  select(county, subcounty_code, ea_type, p11, p12, p17, p19, p52) |>
  rename(
    religion = p17,
    age = p12,
    sex = p11,
    hours_worked_if_work = p52
  )

census_small <- census_small |>
  mutate(religion = case_when(religion == 99 ~ NA, TRUE ~ religion))

Solution: Breakdown of Kenya by religious belief

religion_by_county <- census_small |>
  group_by(county, religion) |>
  summarise(total = n()) |>
  mutate(share = total/sum(total))

Solution: Share of people who are migrants in each county

migration_prop <- census_small |>
  mutate(migrant = if_else(birthplace == county, 0, 1)) |>
  summarise(migration_prop = mean(migrant))

Solution: Average working hours for men and women in rural vs. urban areas

census_small <- census_small |> mutate(
  rural = if_else(ea_type == 1, 1, 0),
  female = if_else(sex == 2, 1, 0),
  hours_worked = if_else(is.na(hours_worked_if_work), 0, hours_worked_if_work)
)

working_hours_mean_adult <- census_small |> filter(age >= 18) |>
  group_by(rural, female) |>
  summarise(
    hours_worked = mean(hours_worked, na.rm = TRUE),
    hours_worked_if_work = mean(hours_worked_if_work, na.rm = TRUE)
  )

6.1 A real world example

6.2 Preparing your dataset for analysis

6.3 Analysis

6.3.1 Breakdown of Kenya by religious belief

6.3.2 Share of people who are migrants in each county

6.3.3 Average working hours for men and women in rural vs. urban areas.