Spread


The first tidyr function we will look into is the spread() function. With spread() it does similar to what you would expect. We have a data frame where some of the rows contain information that is really a variable name. This means the columns are a combination of variable names as well as some data. The picture below displays this:

We can consider the following data which is table 2:

## # A tibble: 12 × 4
##        country  year        key      value
##         <fctr> <int>     <fctr>      <int>
## 1  Afghanistan  1999      cases        745
## 2  Afghanistan  1999 population   19987071
## 3  Afghanistan  2000      cases       2666
## 4  Afghanistan  2000 population   20595360
## 5       Brazil  1999      cases      37737
## 6       Brazil  1999 population  172006362
## 7       Brazil  2000      cases      80488
## 8       Brazil  2000 population  174504898
## 9        China  1999      cases     212258
## 10       China  1999 population 1272915272
## 11       China  2000      cases     213766
## 12       China  2000 population 1280428583

Notice that in the column of key, instead of there being values we see the following variable names:

cases

population

In order to use this data we need to have it so the data frame looks like this instead:

## # A tibble: 6 × 4
##       country  year  cases population
## *      <fctr> <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

Now we can see that we have all the columns representing the variables we are interested in and each of the rows is now a complete observation.

In order to do this we need to learn about the spread() function:

spread(data, key, value)

Where

data is your dataframe of interest.

key is the column whose values will become variable names.

value is the column where values will fill in under the new variables created from key.

If we consider piping, we can write this as:

data %>%
  spread(key, value)

spread() Example

Now if we consider table2 , we can see that we have:

## # A tibble: 12 × 4
##        country  year        key      value
##         <fctr> <int>     <fctr>      <int>
## 1  Afghanistan  1999      cases        745
## 2  Afghanistan  1999 population   19987071
## 3  Afghanistan  2000      cases       2666
## 4  Afghanistan  2000 population   20595360
## 5       Brazil  1999      cases      37737
## 6       Brazil  1999 population  172006362
## 7       Brazil  2000      cases      80488
## 8       Brazil  2000 population  174504898
## 9        China  1999      cases     212258
## 10       China  1999 population 1272915272
## 11       China  2000      cases     213766
## 12       China  2000 population 1280428583

Now this table was made for this example so key is the key in our spread() function and value is the valuein our spread() function. We can fix this with the following code:

table2 %>%
    spread(key,value)
## # A tibble: 6 × 4
##       country  year  cases population
## *      <fctr> <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

We can now see that we have a variable named cases and a variable named population. This is much most tidy.

 

On Your Own: RStudio Practice

We first will load tidyverse. If you have not installed it run the following code:

install.packages("tidyverse")

Then load this package:

library(tidyverse)

In this example we will use the dataset population that is part of tidyverse. Print this data:

## # A tibble: 1 × 3
##       country  year population
##         <chr> <int>      <int>
## 1 Afghanistan  1995   17586073

You should see the table that we have above, now We have a variable named year, assume that we wish to actually have each year as its own variable. Using the spread() function, redo this data so that each year is a variable. Your data will look like this at the end:

## # A tibble: 219 × 20
##                country   `1995`   `1996`   `1997`   `1998`   `1999`
## *                <chr>    <int>    <int>    <int>    <int>    <int>
## 1          Afghanistan 17586073 18415307 19021226 19496836 19987071
## 2              Albania  3357858  3341043  3331317  3325456  3317941
## 3              Algeria 29315463 29845208 30345466 30820435 31276295
## 4       American Samoa    52874    53926    54942    55899    56768
## 5              Andorra    63854    64274    64090    63799    64084
## 6               Angola 12104952 12451945 12791388 13137542 13510616
## 7             Anguilla     9807    10063    10305    10545    10797
## 8  Antigua and Barbuda    68349    70245    72232    74206    76041
## 9            Argentina 34833168 35264070 35690778 36109342 36514558
## 10             Armenia  3223173  3173425  3137652  3112958  3093820
## # ... with 209 more rows, and 14 more variables: `2000` <int>,
## #   `2001` <int>, `2002` <int>, `2003` <int>, `2004` <int>, `2005` <int>,
## #   `2006` <int>, `2007` <int>, `2008` <int>, `2009` <int>, `2010` <int>,
## #   `2011` <int>, `2012` <int>, `2013` <int>