05:00
3 Importing Data
So far, we have only made use of data which is pre-loaded into R via packages, but it is also possible to load your own data in from a variety of sources. We will focus on two different file types;
- CSV
- Excel
3.1 Local Reading
As described in Chapter 1, the bottom right pane of RStudio allows you to view files that are within your own personal filesystem. You are free to create new folders in this area, using the New Folder
button.
- Navigate to the
Files
at the bottom right pane of your RStudio - Create a new folder called
data
- Save this GCP dataset to this folder as
gcp.csv
. You will need to press the download button in the top right corner.
<- read_csv("data/gcp.csv") gcp
Great, we will come back to using these two files, but first let’s discuss how to read in csv files.
3.2 CSV
Although there is a read.csv()
function in base R, like most things there is a better tidyverse
alternative! read_csv()
from the readr
package reads CSVs in as a tibble (which has additional features compared to a standard data frame), is much faster (~10X), and allows you to specify how you read data in more easily.
As always, let’s read the function documentation using ?read_csv
. This tells us we need to provide a path to the file. This path can be either local or remote; so it will work equally well for data inside your project or from the internet.
To read in a local file, you have to specify the exact location of the file. You can do this as either an absolute filepath, which starts from the drive name right through to the final file (e.g. C:/Documents/My_work/file.csv), or as a relative file path. A relative file path just gives the location of the file starting from your current working environment. You can check what your current working environment is using the command getwd()
. The advantage of using relative file paths is if someone duplicates your project from Github, the code will still work on their own computer.
We will start by reading in some local data, which contains details of Kenyan Gross County Product by economic activity for 2017:
<- read_csv("data/gcp.csv") gcp
New names:
Rows: 48 Columns: 22
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(3): Coun- ty Code, County Name, FISIM1 num (18): Agriculture, forestry and
fishing, Mining and quarrying, Manufactu... lgl (1): ...22
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...22`
Notice that the file is inside the data folder inside the current working directory.
gcp
will now show in your environment. The environment viewer (top right) shows you basic information about the data that has been loaded in. You can also click on any object to view it in your script window.
You can also read in data directly from the web using the same function. For example, with the same GCP data:
<- read_csv("https://raw.githubusercontent.com/cmbsteinberg/knbs_intro/refs/heads/main/data/gcp.csv") gcp
New names:
Rows: 48 Columns: 22
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(3): Coun- ty Code, County Name, FISIM1 num (18): Agriculture, forestry and
fishing, Mining and quarrying, Manufactu... lgl (1): ...22
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...22`
This works exactly the same way as reading in local data, and the object you have created will appear in your environment (top right).
3.2.1 Exercise
05:00
- Read in the frogs dataset found here:‘https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-08-02/frogs.csv’ as an object called
frogs
Solution
<- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-08-02/frogs.csv") frogs
3.2.2 Inspecting the dataset
As noted in the previous section, you can see by looking in the environment window that the eurovision dataset has 48 observations and 22 variables. You can also return this information (and more) about datasets programatically, using the glimpse()
function, again from the dplyr package:
glimpse(gcp)
Rows: 48
Columns: 22
$ `Coun- ty Code` <chr> "01", "02", "03…
$ `County Name` <chr> "MOMBASA", "KWA…
$ `Agriculture, forestry and fishing` <dbl> 1459, 39610, 38…
$ `Mining and quarrying` <dbl> 1158, 1447, 333…
$ Manufacturing <dbl> 47348, 300, 845…
$ `Electricity supply` <dbl> 20546, 730, 147…
$ `Water supply; waste collection` <dbl> 1078, 513, 1353…
$ Construction <dbl> 37168, 3184, 22…
$ `Wholesale and retail trade; repair of motor vehicles` <dbl> 36912, 5051, 60…
$ `Transport and storage` <dbl> 88308, 4198, 11…
$ `Accom- modation and food service activities` <dbl> 12780, 6699, 11…
$ `Informa- tion and communi- cation` <dbl> 5413, 879, 2423…
$ `Financial and insurance activities` <dbl> 31155, 4941, 86…
$ `Real estate activities` <dbl> 35526, 5733, 14…
$ `Profes- sional, technical and support services` <dbl> 7124, 349, 286,…
$ `Public admin- istration and defence` <dbl> 12024, 4566, 64…
$ Education <dbl> 4229, 5384, 921…
$ `Human health and social work activities` <dbl> 4539, 1575, 253…
$ `Other service activities` <dbl> 2379, 1647, 281…
$ FISIM1 <chr> "(17,026)", "(5…
$ Total <dbl> 332122, 86278, …
$ ...22 <lgl> NA, NA, NA, NA,…
As well as returning the number of rows and columns in the data, the glimpse function also shows you the names of the columns, the column classes (indicated in
3.2.3 Exercise
05:00
- Use the glimpse and View functions to examine the frogs dataset. How many rows and columns does it have?
Solution
View(frogs)
glimpse(frogs)
3.3 Excel
Reading excel files works in much the same way as CSV files. However, due to the difference in underlying structures we require the function read_excel()
from a different package called readxl
.
The main difference when reading excel files is three additional arguments that we can set;
sheet
which allows us to specify which sheet to read. It can take the form of a string (the name of the sheet) or an integer (the position of the sheet); andrange
which allows us to specify a cell range. It takes a typical cell reference like “B3:D10”.skip
an alternative to specifying a cell range, you can simply indicate how many rows to skip at the start of the sheet. This is ideal if you want to read in a sheet with an unknown number of columns and/or rows, but know there are several lines of metadata at the top of the sheet.
If we don’t set any of these arguments it will assume our data is in the first row of the first sheet (and it becomes almost identical to read_csv above).
# One option is to download the file
<- read_excel("data/tourism.xlsx") tourism
3.3.1 Exercise
10:00
- Install and load the readxl package.
- Download this dataset published by KNBS about the tourism sector, saving it into /data
- Read it in, specifying the sheet name you want to read in.
- Examine the data you have read in; are the column names what you want? Work out how to skip these and only read in the data, with the correct column names.
Solution
= "https://www.knbs.or.ke/wp-content/uploads/2024/04/2023-Economic-Survey-Kenya-Tourism-Sector.xlsx"
url <- download.file(url,destfile = "data/tourism.xlsx")
download_first <- read_excel("data/tourism.xlsx", sheet = "Table 12.5", skip = 2) tourism
3.4 Rio
Sometimes you may want to read a selection of files of all different types. This is where Rio
can come in handy. Rio
is a wrapper around the libraries we’ve used above and many more, which lets you use import()
to read almost any file in. This isn’t always useful, when you want to do very specific things with a certain file, but can be much cleaner.
3.5 Column Names
In the previous section we looked at reading data into R and also inspecting it. In this section we are going to look at our first steps once it’s read in.
In the previous session we stated that every column in a data frame is a variable and it is good practice to not have spaces within variable names, as spaces makes it harder for us to call on the variables when we need to use them.
When you enter data in Excel, you most often don’t think too much about what you call each column. After all, you just label them once and as long as they are meaningful to you, what does it matter if the column name is a long combination of CAPITALLETTERS, lowercaseletters, and numbers?
When you are working with variables in R though, you need to type the name of each variable, every time you want to work with it. So, it makes sense to make your column names as simple, but meaningful as possible.
Ideally, they should also be consistently formatted.
For example if we wanted to pick the name of Passenger
column from the Titanic dataset.
$name Of Passenger Titanic
To get around this we enclose name of passenger
with back ticks like the code below - this is the key above the tab key on the left hand side of your keyboard.
# Selecting data using the $ symbol
# note this now works because of the back ticks
$`name Of Passenger` Titanic
If your column names have spaces and you don’t get rid of them, you must use backticks.
However its good practise to remove spaces and symbols.
We can see the column names by using the names()
function to access the name
attribute of the data.
# Getting the column names using the names function
names(Titanic)
NULL
As we can see our column names have spaces and some start with capital letters and some with small letters, we can clean the names using the janitor
package.
3.5.1 Cleaning Column Names
The janitor
package offers many functions used to manipulate data, for example removing empty rows and columns, finding duplicates within a data frame. In this session we will use the library to to clean our data set names.
We can clean the names of our dataset with the janitor::clean_names()
function as shown below.
We are overwriting the original Titanic data frame with a version with the column names cleaned.
# Cleaning the column names using the janitor
# package and the clean_names() function.
# This will put all names in lower case letters and
# replace blank spaces with underscores.
<- janitor::clean_names(Titanic)
titanic
# Getting the column names of the dataset
names(titanic)
NULL
clean_names()
removes spaces, symbols, changes characters to lower case and makes all columns start with letters.
This is the default setting, there are many other options such as snake
, lower_camel
and all_caps
. These can be put inside the clean_names()
function as shown below:
# Specifying the case within the clean_names function
::clean_names(Titanic, case = "snake") janitor
, , Age = child, Survived = no
Sex
Class male female
x1st 0 0
x2nd 0 0
x3rd 35 17
crew 0 0
, , Age = adult, Survived = no
Sex
Class male female
x1st 118 4
x2nd 154 13
x3rd 387 89
crew 670 3
, , Age = child, Survived = yes
Sex
Class male female
x1st 5 1
x2nd 11 13
x3rd 13 14
crew 0 0
, , Age = adult, Survived = yes
Sex
Class male female
x1st 57 140
x2nd 14 80
x3rd 75 76
crew 192 20