Back to: Introduction to R
The first tidyr
function we will look into is the spread()
function. With spread()
it does similar to what you would expect. We have a data frame where some of the rows contain information that is really a variable name. This means the columns are a combination of variable names as well as some data. The picture below displays this:
We can consider the following data which is table 2:
## # A tibble: 12 × 4
## country year key value
## <fctr> <int> <fctr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
Notice that in the column of key
, instead of there being values we see the following variable names:
cases
population
In order to use this data we need to have it so the data frame looks like this instead:
## # A tibble: 6 × 4
## country year cases population
## * <fctr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Now we can see that we have all the columns representing the variables we are interested in and each of the rows is now a complete observation.
In order to do this we need to learn about the spread()
function:
spread(data, key, value)
Where
data
is your dataframe of interest.
key
is the column whose values will become variable names.
value
is the column where values will fill in under the new variables created from key
.
If we consider piping, we can write this as:
data %>%
spread(key, value)
spread()
Example
Now if we consider table2 , we can see that we have:
## # A tibble: 12 × 4
## country year key value
## <fctr> <int> <fctr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
Now this table was made for this example so key is the key
in our spread()
function and value is the value
in our spread()
function. We can fix this with the following code:
table2 %>%
spread(key,value)
## # A tibble: 6 × 4
## country year cases population
## * <fctr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
We can now see that we have a variable named cases
and a variable named population
. This is much most tidy.
On Your Own: RStudio Practice
We first will load tidyverse. If you have not installed it run the following code:
install.packages("tidyverse")
Then load this package:
library(tidyverse)
In this example we will use the dataset population
that is part of tidyverse. Print this data:
## # A tibble: 1 × 3
## country year population
## <chr> <int> <int>
## 1 Afghanistan 1995 17586073
You should see the table that we have above, now We have a variable named year
, assume that we wish to actually have each year as its own variable. Using the spread()
function, redo this data so that each year is a variable. Your data will look like this at the end:
## # A tibble: 219 × 20
## country `1995` `1996` `1997` `1998` `1999`
## * <chr> <int> <int> <int> <int> <int>
## 1 Afghanistan 17586073 18415307 19021226 19496836 19987071
## 2 Albania 3357858 3341043 3331317 3325456 3317941
## 3 Algeria 29315463 29845208 30345466 30820435 31276295
## 4 American Samoa 52874 53926 54942 55899 56768
## 5 Andorra 63854 64274 64090 63799 64084
## 6 Angola 12104952 12451945 12791388 13137542 13510616
## 7 Anguilla 9807 10063 10305 10545 10797
## 8 Antigua and Barbuda 68349 70245 72232 74206 76041
## 9 Argentina 34833168 35264070 35690778 36109342 36514558
## 10 Armenia 3223173 3173425 3137652 3112958 3093820
## # ... with 209 more rows, and 14 more variables: `2000` <int>,
## # `2001` <int>, `2002` <int>, `2003` <int>, `2004` <int>, `2005` <int>,
## # `2006` <int>, `2007` <int>, `2008` <int>, `2009` <int>, `2010` <int>,
## # `2011` <int>, `2012` <int>, `2013` <int>
// add bootstrap table styles to pandoc tables function bootstrapStylePandocTables() { $('tr.header').parent('thead').parent('table').addClass('t
able table-condensed'); } $(document).ready(function () { bootstrapStylePandocTables(); });