3  Importing Data

So far, we have only made use of data which is pre-loaded into R via packages, but it is also possible to load your own data in from a variety of sources. We will focus on two different file types;

3.1 Local Reading

As described in Chapter 1, the bottom right pane of RStudio allows you to view files that are within your own personal filesystem. You are free to create new folders in this area, using the New Folder button.

05:00
  1. Navigate to the Files at the bottom right pane of your RStudio
  2. Create a new folder called data
  3. Save this GCP dataset to this folder as gcp.csv. You will need to press the download button in the top right corner.
gcp <- read_csv("data/gcp.csv")

Great, we will come back to using these two files, but first let’s discuss how to read in csv files.

3.2 CSV

Although there is a read.csv() function in base R, like most things there is a better tidyverse alternative! read_csv() from the readr package reads CSVs in as a tibble (which has additional features compared to a standard data frame), is much faster (~10X), and allows you to specify how you read data in more easily.

As always, let’s read the function documentation using ?read_csv. This tells us we need to provide a path to the file. This path can be either local or remote; so it will work equally well for data inside your project or from the internet.

To read in a local file, you have to specify the exact location of the file. You can do this as either an absolute filepath, which starts from the drive name right through to the final file (e.g. C:/Documents/My_work/file.csv), or as a relative file path. A relative file path just gives the location of the file starting from your current working environment. You can check what your current working environment is using the command getwd(). The advantage of using relative file paths is if someone duplicates your project from Github, the code will still work on their own computer.

We will start by reading in some local data, which contains details of Kenyan Gross County Product by economic activity for 2017:

gcp <- read_csv("data/gcp.csv")
New names:
Rows: 48 Columns: 22
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(3): Coun- ty Code, County Name, FISIM1 num (18): Agriculture, forestry and
fishing, Mining and quarrying, Manufactu... lgl (1): ...22
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...22`

Notice that the file is inside the data folder inside the current working directory.

gcp will now show in your environment. The environment viewer (top right) shows you basic information about the data that has been loaded in. You can also click on any object to view it in your script window.

You can also read in data directly from the web using the same function. For example, with the same GCP data:

gcp <- read_csv("https://raw.githubusercontent.com/cmbsteinberg/knbs_intro/refs/heads/main/data/gcp.csv")
New names:
Rows: 48 Columns: 22
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(3): Coun- ty Code, County Name, FISIM1 num (18): Agriculture, forestry and
fishing, Mining and quarrying, Manufactu... lgl (1): ...22
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...22`

This works exactly the same way as reading in local data, and the object you have created will appear in your environment (top right).

3.2.1 Exercise

05:00
  1. Read in the frogs dataset found here:‘https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-08-02/frogs.csv’ as an object called frogs
Solution
frogs <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-08-02/frogs.csv")

3.2.2 Inspecting the dataset

As noted in the previous section, you can see by looking in the environment window that the eurovision dataset has 48 observations and 22 variables. You can also return this information (and more) about datasets programatically, using the glimpse() function, again from the dplyr package:

glimpse(gcp)
Rows: 48
Columns: 22
$ `Coun- ty Code`                                        <chr> "01", "02", "03…
$ `County Name`                                          <chr> "MOMBASA", "KWA…
$ `Agriculture, forestry and fishing`                    <dbl> 1459, 39610, 38…
$ `Mining and quarrying`                                 <dbl> 1158, 1447, 333…
$ Manufacturing                                          <dbl> 47348, 300, 845…
$ `Electricity supply`                                   <dbl> 20546, 730, 147…
$ `Water supply; waste collection`                       <dbl> 1078, 513, 1353…
$ Construction                                           <dbl> 37168, 3184, 22…
$ `Wholesale and retail trade; repair of motor vehicles` <dbl> 36912, 5051, 60…
$ `Transport and storage`                                <dbl> 88308, 4198, 11…
$ `Accom- modation and food service activities`          <dbl> 12780, 6699, 11…
$ `Informa- tion and communi- cation`                    <dbl> 5413, 879, 2423…
$ `Financial and insurance activities`                   <dbl> 31155, 4941, 86…
$ `Real estate activities`                               <dbl> 35526, 5733, 14…
$ `Profes- sional, technical and support services`       <dbl> 7124, 349, 286,…
$ `Public admin- istration and defence`                  <dbl> 12024, 4566, 64…
$ Education                                              <dbl> 4229, 5384, 921…
$ `Human health and social work activities`              <dbl> 4539, 1575, 253…
$ `Other service activities`                             <dbl> 2379, 1647, 281…
$ FISIM1                                                 <chr> "(17,026)", "(5…
$ Total                                                  <dbl> 332122, 86278, …
$ ...22                                                  <lgl> NA, NA, NA, NA,…

As well as returning the number of rows and columns in the data, the glimpse function also shows you the names of the columns, the column classes (indicated in ), and an example of the first few rows of data.

3.2.3 Exercise

05:00
  1. Use the glimpse and View functions to examine the frogs dataset. How many rows and columns does it have?
Solution
View(frogs)

glimpse(frogs)

3.3 Excel

Reading excel files works in much the same way as CSV files. However, due to the difference in underlying structures we require the function read_excel() from a different package called readxl.

The main difference when reading excel files is three additional arguments that we can set;

  • sheet which allows us to specify which sheet to read. It can take the form of a string (the name of the sheet) or an integer (the position of the sheet); and
  • range which allows us to specify a cell range. It takes a typical cell reference like “B3:D10”.
  • skip an alternative to specifying a cell range, you can simply indicate how many rows to skip at the start of the sheet. This is ideal if you want to read in a sheet with an unknown number of columns and/or rows, but know there are several lines of metadata at the top of the sheet.

If we don’t set any of these arguments it will assume our data is in the first row of the first sheet (and it becomes almost identical to read_csv above).

# One option is to download the file 
tourism <- read_excel("data/tourism.xlsx")

3.3.1 Exercise

10:00
  1. Install and load the readxl package.
  2. Download this dataset published by KNBS about the tourism sector, saving it into /data
  3. Read it in, specifying the sheet name you want to read in.
  4. Examine the data you have read in; are the column names what you want? Work out how to skip these and only read in the data, with the correct column names.
Solution
url = "https://www.knbs.or.ke/wp-content/uploads/2024/04/2023-Economic-Survey-Kenya-Tourism-Sector.xlsx"
download_first <- download.file(url,destfile = "data/tourism.xlsx")
tourism <- read_excel("data/tourism.xlsx", sheet = "Table 12.5", skip = 2)

3.4 Rio

Sometimes you may want to read a selection of files of all different types. This is where Rio can come in handy. Rio is a wrapper around the libraries we’ve used above and many more, which lets you use import() to read almost any file in. This isn’t always useful, when you want to do very specific things with a certain file, but can be much cleaner.

3.5 Column Names

In the previous section we looked at reading data into R and also inspecting it. In this section we are going to look at our first steps once it’s read in.

In the previous session we stated that every column in a data frame is a variable and it is good practice to not have spaces within variable names, as spaces makes it harder for us to call on the variables when we need to use them.

When you enter data in Excel, you most often don’t think too much about what you call each column. After all, you just label them once and as long as they are meaningful to you, what does it matter if the column name is a long combination of CAPITALLETTERS, lowercaseletters, and numbers?

When you are working with variables in R though, you need to type the name of each variable, every time you want to work with it. So, it makes sense to make your column names as simple, but meaningful as possible.

Ideally, they should also be consistently formatted.

For example if we wanted to pick the name of Passenger column from the Titanic dataset.

Titanic$name Of Passenger

To get around this we enclose name of passenger with back ticks like the code below - this is the key above the tab key on the left hand side of your keyboard.

# Selecting data using the $ symbol
# note this now works because of the back ticks
Titanic$`name Of Passenger`

If your column names have spaces and you don’t get rid of them, you must use backticks.

However its good practise to remove spaces and symbols.

We can see the column names by using the names() function to access the name attribute of the data.

# Getting the column names using the names function

names(Titanic)
NULL

As we can see our column names have spaces and some start with capital letters and some with small letters, we can clean the names using the janitor package.

3.5.1 Cleaning Column Names

The janitor package offers many functions used to manipulate data, for example removing empty rows and columns, finding duplicates within a data frame. In this session we will use the library to to clean our data set names.

We can clean the names of our dataset with the janitor::clean_names() function as shown below.

We are overwriting the original Titanic data frame with a version with the column names cleaned.

# Cleaning the column names using the janitor
# package and the clean_names() function.
# This will put all names in lower case letters and 
# replace blank spaces with underscores.

titanic <- janitor::clean_names(Titanic)

# Getting the column names of the dataset

names(titanic)
NULL

clean_names() removes spaces, symbols, changes characters to lower case and makes all columns start with letters.

This is the default setting, there are many other options such as snake, lower_camel and all_caps. These can be put inside the clean_names() function as shown below:

# Specifying the case within the clean_names function

janitor::clean_names(Titanic, case = "snake")
, , Age = child, Survived = no

      Sex
Class  male female
  x1st    0      0
  x2nd    0      0
  x3rd   35     17
  crew    0      0

, , Age = adult, Survived = no

      Sex
Class  male female
  x1st  118      4
  x2nd  154     13
  x3rd  387     89
  crew  670      3

, , Age = child, Survived = yes

      Sex
Class  male female
  x1st    5      1
  x2nd   11     13
  x3rd   13     14
  crew    0      0

, , Age = adult, Survived = yes

      Sex
Class  male female
  x1st   57    140
  x2nd   14     80
  x3rd   75     76
  crew  192     20