Dataframes in R
In statistics, the data structure we are most likely to use is the data frame. It looks like a matrix, but like a list it can hold different types of data in different columns. Each column must contain a single type of data; if values of different types are combined in one column, R coerces them to a common type, most often character. Becoming proficient with data frames is essential for data analysis.
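As a minimal sketch of the point about column types, `str()` reports the type of each column, and `c()` shows how R coerces mixed values to a common type. The data frame `df` and its column names are made up for illustration.

```r
# Hypothetical data frame with three columns of different types
df <- data.frame(
  id    = c(1, 2, 3),
  label = c("a", "b", "c"),
  flag  = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep text as character in every R version
)

# Each column holds exactly one type
str(df)
col_types <- sapply(df, class)
print(col_types)

# Mixing types in one vector forces a common type: here, character
mixed <- c(1, "two", TRUE)
print(class(mixed))
```

Checking `str()` right after building or importing a data frame is a quick way to catch a column that was coerced unexpectedly.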
Creating Data Frames
We usually create a data frame with vectors.
names <- c("Angela", "Shondra")
ages <- c(27,36)
insurance <- c(TRUE, T)
patients <- data.frame(names, ages, insurance)
patients
##     names ages insurance
## 1  Angela   27      TRUE
## 2 Shondra   36      TRUE
We may wish to add rows or columns to our data. We can do this with:
rbind()
cbind()
For example, to add another patient to our patient data, we could do the following:
# Use a list so each value keeps its own type; c() would coerce everything to character
l <- list(names="Liu Jie", ages=45, insurance=TRUE)
rbind(patients, l)
## Warning in `[<-.factor`(`*tmp*`, ri, value = "Liu Jie"): invalid factor
## level, NA generated
##     names ages insurance
## 1  Angela   27      TRUE
## 2 Shondra   36      TRUE
## 3    <NA>   45      TRUE
This warning serves as a reminder to always know what type your data is. R has stored the names column as a factor (the default behaviour of data.frame() before R 4.0.0), so a name that is not an existing level becomes NA. We want the column to be character instead.
patients$names <- as.character(patients$names)
patients <- rbind(patients, l)
patients
##     names ages insurance
## 1  Angela   27      TRUE
## 2 Shondra   36      TRUE
## 3 Liu Jie   45      TRUE
Finally, if we decided to add another column of data, we could do the following:
# Next appointments
next.appt <- c("09/23/2016", "04/14/2016", "02/25/2016")
#Lets R know these are dates
next.appt <- as.Date(next.appt, "%m/%d/%Y")
next.appt
## [1] "2016-09-23" "2016-04-14" "2016-02-25"
We then have a vector of dates which we can cbind in R.
patients <- cbind(patients, next.appt)
patients
##     names ages insurance  next.appt
## 1  Angela   27      TRUE 2016-09-23
## 2 Shondra   36      TRUE 2016-04-14
## 3 Liu Jie   45      TRUE 2016-02-25
Accessing Data Frames
To practise accessing data frames, we will use a data set that is built into R.
library(datasets)
titanic <- data.frame(Titanic)
We can look at the different columns that we have in the data set:
colnames(titanic)
## [1] "Class"    "Sex"      "Age"      "Survived" "Freq"
We can use the notion of indexing that we did with arrays to look at the first 2 rows of data:
titanic[1:2,]
##   Class  Sex   Age Survived Freq
## 1   1st Male Child       No    0
## 2   2nd Male Child       No    0
A simple function for looking at the start of the data is the head() function:
head(titanic)
##   Class    Sex   Age Survived Freq
## 1   1st   Male Child       No    0
## 2   2nd   Male Child       No    0
## 3   3rd   Male Child       No   35
## 4  Crew   Male Child       No    0
## 5   1st Female Child       No    0
## 6   2nd Female Child       No    0
We can also look at the last few rows:
tail(titanic)
##    Class    Sex   Age Survived Freq
## 27   3rd   Male Adult      Yes   75
## 28  Crew   Male Adult      Yes  192
## 29   1st Female Adult      Yes  140
## 30   2nd Female Adult      Yes   80
## 31   3rd Female Adult      Yes   76
## 32  Crew Female Adult      Yes   20
If we wished to access the age information, we could do this by the column number:
titanic[,3]
##  [1] Child Child Child Child Child Child Child Child Adult Adult Adult
## [12] Adult Adult Adult Adult Adult Child Child Child Child Child Child
## [23] Child Child Adult Adult Adult Adult Adult Adult Adult Adult
## Levels: Child Adult
Or, more frequently, we would use the column name instead:
titanic[, "Age"]
##  [1] Child Child Child Child Child Child Child Child Adult Adult Adult
## [12] Adult Adult Adult Adult Adult Child Child Child Child Child Child
## [23] Child Child Adult Adult Adult Adult Adult Adult Adult Adult
## Levels: Child Adult
This means we can access data with a column or row number. More importantly, we can use the name. For large data frames, accessing by name is key.
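Column names also work with `$` and `[[ ]]`. The sketch below uses the same built-in `Titanic` data and assumes only the `datasets` package that ships with R; the variable names `age_a`, `age_b`, `age_c` and `class_freq` are made up for illustration.

```r
library(datasets)
titanic <- data.frame(Titanic)

# Three equivalent ways to pull out one column by name
age_a <- titanic[, "Age"]
age_b <- titanic$Age
age_c <- titanic[["Age"]]
identical(age_a, age_b) && identical(age_b, age_c)

# Several columns at once: pass a character vector of names
class_freq <- titanic[1:3, c("Class", "Freq")]
print(class_freq)
```

The `$` form is the shortest to type, while the bracket forms accept a name stored in a variable.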
Further Indexing
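Let’s say we want information about a particular class. Indexing the rows by the class label does not work:
titanic["1st", ]
##    Class  Sex  Age Survived Freq
## NA  <NA> <NA> <NA>     <NA>   NA
The result is a row of NA values because a character row index is matched against the row names, and no row is named "1st"; the class is stored in the Class column. Instead, we can select rows by testing the values of the factor columns: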
first.class.freq <- titanic[titanic$Class=="1st", "Freq"]
first.class.freq
## [1]   0   0 118   4   5   1  57 140
male.freq <- titanic[titanic$Sex=="Male", "Freq"]
male.freq
##  [1]   0   0  35   0 118 154 387 670   5  11  13   0  57  14  75 192
Then we can add up the new values:
sum(first.class.freq)
## [1] 325
sum(male.freq)
## [1] 1731
Quick Check Practice
set.seed(1234)
example <- data.frame(c1=runif(50), c2=rnorm(50), c3=runif(50))
# 1. How many observations are there in example?
# 2. How many variables are there in example?
# 3. What are the names of the variables in example?
# 4. Create a data frame with only the observations where c1 > 0.2. Name this c1_gr_02.
# 5. Create a data frame with only the observations where c1 > 0.2 and c2 > 0.2. Name this c1_c2_gr_02.
# Solutions
# 1. How many observations are there in example?
dim(example)[1]
# 2. How many variables are there in example?
dim(example)[2]
# 3. What are the names of the variables in example?
names(example)
# 4. Create a data frame with only the observations where c1 > 0.2. Name this c1_gr_02.
c1_gr_02 <- example[example$c1>0.2,]
# 5. Create a data frame with only the observations where c1 > 0.2 and c2 > 0.2. Name this c1_c2_gr_02.
c1_c2_gr_02 <- example[example$c1>0.2 & example$c2>0.2,]
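The logical-index filters used in the answers can also be written with base R's `subset()`, which some readers find easier to read. This is an alternative sketch rather than part of the exercise, and the `_alt` names are made up here.

```r
set.seed(1234)
example <- data.frame(c1 = runif(50), c2 = rnorm(50), c3 = runif(50))

# Logical indexing, as in the exercise answers
c1_gr_02 <- example[example$c1 > 0.2, ]

# Same rows with subset(); column names are looked up inside the data frame
c1_gr_02_alt <- subset(example, c1 > 0.2)
c1_c2_gr_02_alt <- subset(example, c1 > 0.2 & c2 > 0.2)

# nrow() and ncol() are shortcuts for dim(x)[1] and dim(x)[2]
nrow(c1_gr_02_alt)
ncol(c1_gr_02_alt)
```

One difference worth knowing: `subset()` silently drops rows where the condition is NA, whereas logical indexing returns a row of NA values for them.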
Adding New Variables
Suppose we want to know not only the frequency in each row but also the proportion of all passengers that it represents. We can ask R to calculate this and add it to our data as a new column.
titanic$surv_p <- titanic$Freq/sum(titanic$Freq)
head(titanic,4)
##   Class  Sex   Age Survived Freq     surv_p
## 1   1st Male Child       No    0 0.00000000
## 2   2nd Male Child       No    0 0.00000000
## 3   3rd Male Child       No   35 0.01590186
## 4  Crew Male Child       No    0 0.00000000
Perhaps we are not pleased with the decimal places and want to have this as a percentage. We can overwrite the values to change this.
titanic$surv_p <- titanic$surv_p*100
head(titanic,4)
##   Class  Sex   Age Survived Freq   surv_p
## 1   1st Male Child       No    0 0.000000
## 2   2nd Male Child       No    0 0.000000
## 3   3rd Male Child       No   35 1.590186
## 4  Crew Male Child       No    0 0.000000
In the future we will be performing many more operations on data frames.
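As one example of such an operation, a per-class share can be added with base R's `ave()`, which returns a group summary aligned with the original rows. This is only a sketch; the names `class_total` and `class_p` are made up for illustration.

```r
library(datasets)
titanic <- data.frame(Titanic)

# Total passengers in each class, repeated on every row of that class
class_total <- ave(titanic$Freq, titanic$Class, FUN = sum)

# Share of its own class that each row represents
titanic$class_p <- titanic$Freq / class_total

# Check: the shares within each class should add up to 1
tapply(titanic$class_p, titanic$Class, sum)
```

Dividing by a per-group total instead of `sum(titanic$Freq)` changes the question from "what share of everyone on board" to "what share of this class".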
On Your Own: Swirl Practice
In order to learn R you must do R. Follow the steps below in your RStudio console:
1. Run this command to pick the course:
swirl()
You will be prompted to choose a course. Type the number shown in front of 02 Getting Data. This will take you to a menu of lessons. For now we will just use lesson 5. Type 5 to choose Matrices and Dataframes, then follow the instructions until you are finished.
Once you are finished with the lesson come back to this course and continue.