Data Frames in R


In statistics, the data structure we use most often is the data frame. A data frame looks like a matrix but, like a list, it can hold different types of data in different columns. Each column must contain a single type of data; if you mix types within a column, R coerces the values to a common type, usually character. Becoming proficient with data frames is essential for data analysis in R.
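As a quick illustration of the one-type-per-column rule, here is a minimal sketch. The names `mixed`, `demo`, and `column_types` are made up for this example and are not used elsewhere in the lesson.

```r
# Mixing types in a single vector forces R to pick one common type.
mixed <- c(27, "thirty-six", TRUE)
class(mixed)

# Different columns of a data frame may still have different types.
# stringsAsFactors = FALSE keeps text as character in every R version.
demo <- data.frame(
  id = 1:3,
  mixed = mixed,
  flag = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Report the class of each column.
column_types <- sapply(demo, class)
column_types
```

The `mixed` column ends up as character because a number, a string, and a logical cannot share any narrower type.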

Creating Data Frames

We usually create a data frame from vectors of equal length, one vector per column.

names <- c("Angela", "Shondra")
ages <- c(27, 36)
insurance <- c(TRUE, TRUE)
patients <- data.frame(names, ages, insurance)
patients
##     names ages insurance
## 1  Angela   27      TRUE
## 2 Shondra   36      TRUE

We may wish to add rows or columns to our data. We can do this with:

rbind(), which binds new rows onto a data frame

cbind(), which binds new columns onto a data frame

For example, returning to our patient data, suppose we wish to add another patient. We could try the following:

# c() coerces every value to character, so l is a character vector
l <- c(names = "Liu Jie", ages = 45, insurance = TRUE)
rbind(patients, l)
## Warning in `[<-.factor`(`*tmp*`, ri, value = "Liu Jie"): invalid factor
## level, NA generated
##     names ages insurance
## 1  Angela   27      TRUE
## 2 Shondra   36      TRUE
## 3    <NA>   45      TRUE

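This warning is a reminder to always know what your data types are. In R versions before 4.0.0, data.frame() converts character vectors to factors by default, so names was stored as a factor, and "Liu Jie" is not one of its existing levels. (From R 4.0.0 onward the default is stringsAsFactors = FALSE, so names stays a character column and the warning does not appear.) We can convert the column to character and try again: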

patients$names <- as.character(patients$names)
patients <- rbind(patients, l)
patients
##     names ages insurance
## 1  Angela   27      TRUE
## 2 Shondra   36      TRUE
## 3 Liu Jie   45      TRUE
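One caveat with the approach above: c() can hold only one type, so l is a character vector, and binding it on can quietly turn ages and insurance into character columns even though the printed table looks unchanged. A minimal sketch of an alternative, assuming we are free to build the new row as a one-row data frame (the names `patients_demo`, `new_patient`, and `types_after` are invented for this example):

```r
# Rebuild the two-patient data frame, keeping names as character.
patients_demo <- data.frame(
  names = c("Angela", "Shondra"),
  ages = c(27, 36),
  insurance = c(TRUE, TRUE),
  stringsAsFactors = FALSE
)

# A one-row data frame keeps each value in its own type.
new_patient <- data.frame(
  names = "Liu Jie",
  ages = 45,
  insurance = TRUE,
  stringsAsFactors = FALSE
)

patients_demo <- rbind(patients_demo, new_patient)

# Check that the column types survived the rbind().
types_after <- sapply(patients_demo, class)
types_after
```

Checking the column classes after any rbind() is a cheap way to catch this kind of silent coercion.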

Finally, if we decided to add another column of data, we could first create it as a vector:

# Next appointments, stored as text
next.appt <- c("09/23/2016", "04/14/2016", "02/25/2016")

# Let R know these are dates in month/day/year format
next.appt <- as.Date(next.appt, format = "%m/%d/%Y")
next.appt
## [1] "2016-09-23" "2016-04-14" "2016-02-25"

We now have a vector of dates, which we can attach to the data frame with cbind().

patients <- cbind(patients, next.appt)
patients
##     names ages insurance  next.appt
## 1  Angela   27      TRUE 2016-09-23
## 2 Shondra   36      TRUE 2016-04-14
## 3 Liu Jie   45      TRUE 2016-02-25
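Converting next.appt with as.Date() matters because R can then treat the values as points on a calendar rather than as text. A small sketch of what that allows, using made-up variable names (`appt_date`, `span_days`, `after_march`):

```r
appt_date <- as.Date(c("09/23/2016", "04/14/2016", "02/25/2016"),
                     format = "%m/%d/%Y")

# Dates sort in calendar order.
sort(appt_date)

# Date arithmetic: number of days between the earliest and latest appointment.
span_days <- as.numeric(max(appt_date) - min(appt_date))
span_days

# Dates can be compared with other dates.
after_march <- appt_date > as.Date("2016-03-01")
after_march
```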

Accessing Data Frames

To practice accessing data frames, we will use one of the data sets built into R.

library(datasets)
titanic <- data.frame(Titanic)

We can look at the different columns that we have in the data set:

colnames(titanic)
## [1] "Class"    "Sex"      "Age"      "Survived" "Freq"

We can use the same indexing that we used with arrays to look at the first 2 rows of data:

titanic[1:2,]
##   Class  Sex   Age Survived Freq
## 1   1st Male Child       No    0
## 2   2nd Male Child       No    0

A simple function for looking at the start of the data is the head() function:

head(titanic)
##   Class    Sex   Age Survived Freq
## 1   1st   Male Child       No    0
## 2   2nd   Male Child       No    0
## 3   3rd   Male Child       No   35
## 4  Crew   Male Child       No    0
## 5   1st Female Child       No    0
## 6   2nd Female Child       No    0

We can look at the last few rows with tail():

tail(titanic)
##    Class    Sex   Age Survived Freq
## 27   3rd   Male Adult      Yes   75
## 28  Crew   Male Adult      Yes  192
## 29   1st Female Adult      Yes  140
## 30   2nd Female Adult      Yes   80
## 31   3rd Female Adult      Yes   76
## 32  Crew Female Adult      Yes   20
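Both head() and tail() accept an n argument that controls how many rows are returned. The sketch below works on its own copy of the data, called `titanic_demo` here (a name chosen for this example), and also shows dim() and str() for a quick overview.

```r
titanic_demo <- as.data.frame(Titanic)

first_three <- head(titanic_demo, n = 3)  # first 3 rows
last_two <- tail(titanic_demo, n = 2)     # last 2 rows

first_three
last_two

dim(titanic_demo)  # number of rows, then number of columns
str(titanic_demo)  # the type of each column and its first few values
```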

If we wished to access the age information, we could do this by the column number:

titanic[,3]
##  [1] Child Child Child Child Child Child Child Child Adult Adult Adult
## [12] Adult Adult Adult Adult Adult Child Child Child Child Child Child
## [23] Child Child Adult Adult Adult Adult Adult Adult Adult Adult
## Levels: Child Adult

or, more commonly, we would use the column name instead:

titanic[, "Age"]
##  [1] Child Child Child Child Child Child Child Child Adult Adult Adult
## [12] Adult Adult Adult Adult Adult Child Child Child Child Child Child
## [23] Child Child Adult Adult Adult Adult Adult Adult Adult Adult
## Levels: Child Adult

So we can access data by row or column number, but we can also use names. For large data frames, accessing columns by name is key: a name still points to the right column even if the column order changes.
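There are several equivalent ways to pull out a column by name. The following sketch compares them on a separate copy of the data (`titanic_demo` and the `age_*` names are invented for this example):

```r
titanic_demo <- as.data.frame(Titanic)

age_by_position <- titanic_demo[, 3]
age_by_name <- titanic_demo[, "Age"]
age_by_dollar <- titanic_demo$Age
age_by_double <- titanic_demo[["Age"]]

# All four return the same factor vector.
identical(age_by_position, age_by_name)
identical(age_by_name, age_by_dollar)
identical(age_by_dollar, age_by_double)

# Single brackets without a comma keep the result as a one-column data frame.
age_as_df <- titanic_demo["Age"]
class(age_as_df)
```

The `$` form is the one used in the rest of this lesson.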

Further Indexing

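Let’s say we wish to know information about a particular class. A first attempt might be to index the rows by the class label, but this asks for a row named "1st". No row has that name, so R returns a row of NA: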

titanic["1st", ]
##    Class  Sex  Age Survived Freq
## NA  <NA> <NA> <NA>     <NA>   NA
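To see why the lookup above returns NA, it helps to inspect the row names directly. This is a small sketch on a separate copy of the data; `titanic_demo`, `has_row_named_1st`, and `has_class_1st` are names made up for the illustration.

```r
titanic_demo <- as.data.frame(Titanic)

# The default row names are just the row numbers stored as text.
head(rownames(titanic_demo), 3)

# "1st" is a value in the Class column, not a row name.
has_row_named_1st <- "1st" %in% rownames(titanic_demo)
has_class_1st <- "1st" %in% titanic_demo$Class
has_row_named_1st
has_class_1st

# Indexing by a row name that does exist works as expected.
titanic_demo["3", ]
```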

Instead, we can select rows with a logical condition on one of the factor columns and pull out the column we need:

first.class.freq <- titanic[titanic$Class=="1st", "Freq"]
first.class.freq
## [1]   0   0 118   4   5   1  57 140
male.freq <- titanic[titanic$Sex=="Male", "Freq"]
male.freq
##  [1]   0   0  35   0 118 154 387 670   5  11  13   0  57  14  75 192

Then we can add up the selected values:

sum(first.class.freq)
## [1] 325
sum(male.freq)
## [1] 1731
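Conditions can be combined, and totals for every level of a factor can be computed in one step. The sketch below uses base R's tapply() as one possible approach; the names `titanic_demo`, `class_totals`, and `first_class_males` are assumptions for this example.

```r
titanic_demo <- as.data.frame(Titanic)

# Total frequency for each class in a single call.
class_totals <- tapply(titanic_demo$Freq, titanic_demo$Class, sum)
class_totals

# Combine two conditions with & to narrow the selection further.
first_class_males <- titanic_demo[
  titanic_demo$Class == "1st" & titanic_demo$Sex == "Male",
  "Freq"
]
sum(first_class_males)
```

The "1st" entry of class_totals should match sum(first.class.freq) above.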

Quick Check Practice


set.seed(1234)
example <- data.frame(c1 = runif(50), c2 = rnorm(50), c3 = runif(50))


# 1. How many observations are there in example?
# 2. How many variables are there in example?
# 3. What are the names of the variables in example?
# 4. Create a data frame with only the observations where c1 > 0.2. Name it c1_gr_02.
# 5. Create a data frame with only the observations where c1 > 0.2 and c2 > 0.2. Name it c1_c2_gr_02.


# Solutions
# 1. How many observations are there in example?
dim(example)[1]
# 2. How many variables are there in example?
dim(example)[2]
# 3. What are the names of the variables in example?
names(example)
# 4. Create a data frame with only the observations where c1 > 0.2. Name it c1_gr_02.
c1_gr_02 <- example[example$c1 > 0.2, ]
# 5. Create a data frame with only the observations where c1 > 0.2 and c2 > 0.2. Name it c1_c2_gr_02.
c1_c2_gr_02 <- example[example$c1 > 0.2 & example$c2 > 0.2, ]


Use your knowledge of data frames to answer the questions above.
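The same answers can be reached with a few other base R helpers. This is only a sketch of alternatives, not the expected solution; it recreates the data under the made-up name `example_demo`.

```r
set.seed(1234)
example_demo <- data.frame(c1 = runif(50), c2 = rnorm(50), c3 = runif(50))

# nrow() and ncol() are shortcuts for dim(x)[1] and dim(x)[2].
n_obs <- nrow(example_demo)
n_vars <- ncol(example_demo)

# subset() lets us write the condition without repeating the data frame name.
by_brackets <- example_demo[example_demo$c1 > 0.2 & example_demo$c2 > 0.2, ]
by_subset <- subset(example_demo, c1 > 0.2 & c2 > 0.2)

# Both approaches select the same rows.
all.equal(by_brackets, by_subset)
```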

Adding New Variables

Suppose we want to know not only the frequency in each group but also the proportion of all passengers that the group represents. We can ask R to calculate this and add it to our data as a new column.

titanic$surv_p <- titanic$Freq/sum(titanic$Freq)
head(titanic,4)
##   Class  Sex   Age Survived Freq     surv_p
## 1   1st Male Child       No    0 0.00000000
## 2   2nd Male Child       No    0 0.00000000
## 3   3rd Male Child       No   35 0.01590186
## 4  Crew Male Child       No    0 0.00000000

Perhaps we are not pleased with the decimal places and want to show this as a percentage. We can overwrite the column with the rescaled values.

titanic$surv_p <- titanic$surv_p*100
head(titanic,4)
##   Class  Sex   Age Survived Freq   surv_p
## 1   1st Male Child       No    0 0.000000
## 2   2nd Male Child       No    0 0.000000
## 3   3rd Male Child       No   35 1.590186
## 4  Crew Male Child       No    0 0.000000
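Note that surv_p, as computed above, is each row's share of all passengers. The sketch below shows two variations under assumed names (`share_pct`, `survived`, `died`, `survival_rate`): rounding the percentage when it is created, and computing the overall share of passengers who survived.

```r
titanic_demo <- as.data.frame(Titanic)
total <- sum(titanic_demo$Freq)

# Each row's share of all passengers, as a percentage rounded to 2 decimals.
titanic_demo$share_pct <- round(100 * titanic_demo$Freq / total, 2)
head(titanic_demo, 4)

# Overall survival rate: passengers who survived divided by all passengers.
survived <- sum(titanic_demo$Freq[titanic_demo$Survived == "Yes"])
died <- sum(titanic_demo$Freq[titanic_demo$Survived == "No"])
survival_rate <- survived / total
survival_rate
```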

In the future we will be performing many more operations on data frames.

On Your Own: Swirl Practice

In order to learn R you must do R. Follow the steps below in your RStudio console:

1. Run this command to pick the course:

library(swirl)
swirl()

You will be prompted to choose a course. Type the number shown in front of 02 Getting Data. This will take you to a menu of lessons. For now we will just use lesson 5. Type 5 to choose Matrices and Dataframes, then follow all the instructions until you are finished.

Once you have finished the lesson, come back to this course and continue.