Importing Data


Up to this point in the course we have worked only with toy data that we either typed into the console or made up. It is now time to work on getting data into R from many different sources.

Where do we get data from?

We get data from many different sources. Some of these sources are:

Built in Data

.csv, .txt, .xls, …

SPSS, SAS, Stata

Web Scraping

Databases

Getting Started with Built in Data

Many packages in R ship with built in data. Package authors use these data to demonstrate what their functions can do, which makes them a great resource for us while we learn how to work with data.

If you would like to see what data you have in R right now, run the following command:

data()

In RStudio a window will open listing the available datasets and the package each one belongs to.

We can also list the data from a specific package. Once you have many packages installed in R, you will want to name the package explicitly:

data(package="tidyr")
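As a brief illustration of working with built in data, the sketch below lists what the datasets package offers and then loads one dataset from it. The choice of mtcars is ours for the example and is not prescribed by the course.

```r
# List the datasets shipped with the 'datasets' package.
# data() returns an object whose $results element is a character matrix.
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])

# Load one dataset by name and take a first look at it
data("mtcars", package = "datasets")
str(mtcars)
head(mtcars, 3)
```

Naming the package in the call avoids ambiguity when two installed packages provide datasets with the same name.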

On Your own:

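Use the install.packages() function to install any package you are missing.

Note that the datasets package ships with R, so it should already be available without installing it.

Explore the different datasets in this package.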

Getting Data from Delimited Files

Much of the data we download or receive from researchers is in the form of delimited files. Whether the file is comma separated (csv) or tab delimited, there are multiple functions that can read these data into R.

We will mostly load these data with functions from the tidyverse packages, to stay consistent with everything else we do, but be aware these are not the only methods. The worked example later in this section uses the base R equivalents.

readr in Tidyverse

The first tidyverse package we will use is called readr. It provides a collection of reading functions:

read_csv(): comma separated (CSV) files

read_tsv(): tab separated files

read_delim(): general delimited files

read_fwf(): fixed width files

read_table(): tabular files where columns are separated by white-space.

read_log(): web log files

Excel files are handled by a separate tidyverse package, readxl.
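To make the list above concrete, here is a minimal sketch of read_csv() and read_delim() reading the same file. The file contents, the temporary path, and the column specification are made up for this illustration.

```r
library(readr)

# Write a small comma separated file to a temporary location (illustrative data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("subject,sex,size", "1,M,7", "2,F,NA", "3,F,9", "4,M,11"), csv_path)

# Declaring the column types makes the parsing explicit and avoids guessing
spec <- cols(
  subject = col_integer(),
  sex = col_character(),
  size = col_integer()
)

# read_csv() assumes a comma delimiter and returns a tibble
csv_data <- read_csv(csv_path, col_types = spec)
csv_data

# read_delim() reads the same file once it is told the delimiter
delim_data <- read_delim(csv_path, delim = ",", col_types = spec)
```

By default both functions treat the string NA and empty fields as missing values.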

To show an example, we will first create a simple dataset. Consider the output below, produced with the base R read.table() function:

##   subject sex size
## 1       1   M    7
## 2       2   F   NA
## 3       3   F    9
## 4       4   M   11

This function is able to interpret the text inside the quotation marks as the rows and columns of a dataset. If you have data separated by spaces, this command is a great way to load the data in.
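The call that produced the dataset above is not shown. A minimal sketch that would give the same result, assuming the data were passed inline through the text argument, might look like this:

```r
# Build a small data frame from inline text.
# header = TRUE uses the first non-blank line as column names;
# columns are separated by white space.
data <- read.table(header = TRUE, text = "
 subject sex size
       1   M    7
       2   F   NA
       3   F    9
       4   M   11
")
data
```

The object is named data here because the write.csv() calls later in this section refer to it by that name.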

Let’s say that we wish to load a csv file into R now. We will take the data that we already have loaded in and create a simple csv file.

We write the csv file as shown below:

# Write to a file, suppress row names
write.csv(data, "data1.csv", row.names=FALSE)

# Same, except that instead of "NA", output blank cells
write.csv(data, "data2.csv", row.names=FALSE, na="")

# Use tabs, suppress row names and column names
write.table(data, "data3.tab", sep="\t", row.names=FALSE, col.names=FALSE)

Each of these calls creates a different file, which we will now read into R. First, we can see what each of these files looks like:

readLines("data1.csv")
## [1] "\"subject\",\"sex\",\"size\"" "1,\"M\",7"                   
## [3] "2,\"F\",NA"                   "3,\"F\",9"                   
## [5] "4,\"M\",11"

We can see that in the above file we have commas separating all of the data elements. We also have NA where the data was missing.

readLines("data2.csv")
## [1] "\"subject\",\"sex\",\"size\"" "1,\"M\",7"                   
## [3] "2,\"F\","                     "3,\"F\",9"                   
## [5] "4,\"M\",11"

In this file there is no NA. Instead, R has written the missing value as an empty field.

readLines("data3.tab")
## [1] "1\t\"M\"\t7"  "2\t\"F\"\tNA" "3\t\"F\"\t9"  "4\t\"M\"\t11"

The third file has no commas. Instead, each \t represents a tab character, and there is no header row because we suppressed the column names.

Reading the Data

We can read the csv files with the read.csv() function:

data1 <- read.csv("data1.csv")
data1
##   subject sex size
## 1       1   M    7
## 2       2   F   NA
## 3       3   F    9
## 4       4   M   11
data2 <- read.csv("data2.csv")
data2
##   subject sex size
## 1       1   M    7
## 2       2   F   NA
## 3       3   F    9
## 4       4   M   11
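The blank cell in data2.csv came back as NA because it sits in a numeric column. Blank cells in text columns behave differently by default. The sketch below uses a made-up file and a temporary path to show the difference, along with the na.strings argument of read.csv().

```r
# Illustrative file with a blank text field and a blank numeric field (subject 2)
path <- tempfile(fileext = ".csv")
writeLines(c("subject,sex,size", "1,M,7", "2,,", "3,F,9"), path)

# Default behaviour: the blank numeric field becomes NA,
# but the blank text field is kept as an empty string
default_read <- read.csv(path)
default_read$sex
default_read$size

# Treat empty fields as missing in every column
strict_read <- read.csv(path, na.strings = c("", "NA"))
strict_read$sex
```

Setting na.strings explicitly is a reasonable habit when a file mixes blank cells and literal NA markers.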

With the tab delimited file we use the more general read.delim() function. Note that sep="\t" specifies which separator was used, and header=FALSE tells R that the file has no header row.

data3 <- read.delim("data3.tab", sep="\t", header=FALSE)
data3
##   V1 V2 V3
## 1  1  M  7
## 2  2  F NA
## 3  3  F  9
## 4  4  M 11
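Because the file has no header row, R falls back to the names V1, V2 and V3. One way to avoid this, sketched below with a temporary copy of the file, is to pass the names through the col.names argument.

```r
# Recreate a headerless tab delimited file (illustrative temporary path)
tab_path <- tempfile(fileext = ".tab")
writeLines(c("1\tM\t7", "2\tF\tNA", "3\tF\t9", "4\tM\t11"), tab_path)

# Supply the column names at read time instead of accepting V1, V2, V3
data3_named <- read.delim(
  tab_path,
  sep = "\t",
  header = FALSE,
  col.names = c("subject", "sex", "size")
)
data3_named
```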

We could also use read.delim() to read a csv file by using sep=",".
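As a quick check of that claim, the sketch below writes a small csv file to a temporary path (the data frame is a stand-in for the one used above) and reads it with both functions.

```r
# Stand-in data written to a temporary csv file
csv_path <- tempfile(fileext = ".csv")
example <- data.frame(
  subject = 1:4,
  sex = c("M", "F", "F", "M"),
  size = c(7, NA, 9, 11)
)
write.csv(example, csv_path, row.names = FALSE)

# Read the same file with both functions
via_csv <- read.csv(csv_path)
via_delim <- read.delim(csv_path, sep = ",")

# Compare the two results
all.equal(via_csv, via_delim)
```

The two functions are wrappers around read.table() that differ mainly in their default separator, which is why changing sep is enough here.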

Importing From Other Software

R can read data from more than just delimited files or internal datasets. R can also read files from all other major statistical software:

SAS

Stata

SPSS

Enter Haven Package

Haven is another R package that is part of the tidyverse. It is designed to bring in data from other statistical software. We can also use this package to write data back to these same formats.

For SAS

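For SAS files we can read and write them in the following manner. Note that recent haven releases deprecate write_sas() and recommend write_xpt() for output meant for SAS: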

read_sas(data_file, catalog_file = NULL, encoding = NULL)

write_sas(data, path)

For Stata

For Stata files, we can read and write them in the following manner:

read_dta(file, encoding = NULL)

read_stata(file, encoding = NULL)

write_dta(data, path, version = 14)

For SPSS

For SPSS files, we can read and write them in the following manner:

read_sav(file, user_na = FALSE)

read_por(file, user_na = FALSE)

write_sav(data, path)

read_spss(file, user_na = FALSE)
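The signatures above can be tried out without access to the other software by writing a file and reading it back. The sketch below does this for the Stata and SPSS formats. The data frame and the temporary paths are made up for the illustration.

```r
library(haven)

# Small illustrative dataset
survey <- data.frame(
  subject = 1:4,
  sex = c("M", "F", "F", "M"),
  size = c(7, NA, 9, 11),
  stringsAsFactors = FALSE
)

# Temporary files so nothing is left behind in the working directory
dta_path <- tempfile(fileext = ".dta")
sav_path <- tempfile(fileext = ".sav")

# Write the data in Stata and SPSS formats
write_dta(survey, dta_path)
write_sav(survey, sav_path)

# Read the files back in; both calls return tibbles
from_stata <- read_dta(dta_path)
from_spss <- read_sav(sav_path)
from_stata
```

A round trip like this is also a convenient way to check how missing values and column types survive the conversion before sharing a file with colleagues.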

Other Data Sources

We will cover other data sources, such as web scraping and databases, in another course.