What is Tidy Data?


There are many ways in which we can organize data. Some of these ways can make for easy data analysis. Others lead to a lot of frustration. This is where tidy data comes in. Tidy data is a concept from Hadley Wickham’s 2014 paper Tidy Data.

In the tidy data framework, every row is an observation, every column is a variable, and every cell of the data frame holds a single value. R for Data Science sums this up with the following graphic:

For us to work with data in this way, all three of these features must line up. Consider the following datasets:

#table1
# A tibble: 6 × 4
      country  year  cases population
       <fctr> <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583
#table2
# A tibble: 12 × 4
       country  year        key      value
        <fctr> <int>     <fctr>      <int>
1  Afghanistan  1999      cases        745
2  Afghanistan  1999 population   19987071
3  Afghanistan  2000      cases       2666
4  Afghanistan  2000 population   20595360
5       Brazil  1999      cases      37737
6       Brazil  1999 population  172006362
7       Brazil  2000      cases      80488
8       Brazil  2000 population  174504898
9        China  1999      cases     212258
10       China  1999 population 1272915272
11       China  2000      cases     213766
12       China  2000 population 1280428583
#table3
# A tibble: 6 × 3
      country  year              rate
       <fctr> <int>             <chr>
1 Afghanistan  1999      745/19987071
2 Afghanistan  2000     2666/20595360
3      Brazil  1999   37737/172006362
4      Brazil  2000   80488/174504898
5       China  1999 212258/1272915272
6       China  2000 213766/1280428583

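Of the tables above, only table1 is actually tidy. In table2 each observation is spread across two rows, and in table3 the rate column packs two values (cases and population) into a single cell. We will consider how to create tidy data from the other two, as well as from some other examples, as we move through this unit.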
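To see why the tidy layout is convenient, here is a minimal sketch that rebuilds table1 by hand and adds a rate column in one step. It assumes the `tibble` and `dplyr` packages are installed. The name `table1_rate` and the choice to scale the rate per 10,000 people are illustrative, and country is stored as character rather than as a factor for simplicity.

```r
library(tibble)
library(dplyr)

# Rebuild table1 from the values printed above
# (country kept as character here, not factor)
table1 <- tibble(
  country    = rep(c("Afghanistan", "Brazil", "China"), each = 2),
  year       = rep(c(1999L, 2000L), times = 3),
  cases      = c(745L, 2666L, 37737L, 80488L, 212258L, 213766L),
  population = c(19987071L, 20595360L, 172006362L,
                 174504898L, 1272915272L, 1280428583L)
)

# Because each variable is its own column, a new variable is one mutate() call.
# Scaling per 10,000 people is an illustrative choice.
table1_rate <- mutate(table1, rate = cases / population * 10000)

print(table1_rate)
```

The same calculation on table2 or table3 would first require reshaping or splitting columns, which is what the rest of this unit covers.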

To get the data set ready we will use the `tidyr` package. To then transform the data so we can model and graph it, we will use the `dplyr` package. Both are part of the `tidyverse`.

tidyr Functions

For the tidyr package we will focus on the following 4 functions:

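1. gather(): collects several columns into key-value pairs, making wide data longer

2. spread(): spreads a key-value pair across multiple columns, making long data wider

3. separate(): splits one column into several columns

4. unite(): combines several columns into one column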
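As a rough preview of these four functions, the sketch below rebuilds table2 and table3 by hand from the values shown earlier and tidies each one. It assumes the `tidyr` and `tibble` packages are installed. The object names `tidy2`, `long2`, `tidy3`, and `back3` are illustrative, and country is stored as character rather than as a factor.

```r
library(tibble)
library(tidyr)

# table2: each observation is split across a "cases" row and a "population" row
table2 <- tibble(
  country = rep(c("Afghanistan", "Brazil", "China"), each = 4),
  year    = rep(rep(c(1999L, 2000L), each = 2), times = 3),
  key     = rep(c("cases", "population"), times = 6),
  value   = c(745L, 19987071L, 2666L, 20595360L,
              37737L, 172006362L, 80488L, 174504898L,
              212258L, 1272915272L, 213766L, 1280428583L)
)

# spread(): turn the key/value pair into one column per key
tidy2 <- spread(table2, key = key, value = value)

# gather(): the reverse step, collapsing cases and population back into key/value
long2 <- gather(tidy2, key = "key", value = "value", cases, population)

# table3: the rate column holds two values in a single cell
table3 <- tibble(
  country = rep(c("Afghanistan", "Brazil", "China"), each = 2),
  year    = rep(c(1999L, 2000L), times = 3),
  rate    = c("745/19987071", "2666/20595360", "37737/172006362",
              "80488/174504898", "212258/1272915272", "213766/1280428583")
)

# separate(): split rate at the slash; convert = TRUE turns the text into numbers
tidy3 <- separate(table3, rate, into = c("cases", "population"),
                  sep = "/", convert = TRUE)

# unite(): the reverse step, pasting the two columns back into one
back3 <- unite(tidy3, rate, cases, population, sep = "/")

print(tidy2)
print(tidy3)
```

Both `tidy2` and `tidy3` should have the same four columns as table1. Newer versions of tidyr also offer `pivot_longer()` and `pivot_wider()` as successors to `gather()` and `spread()`, but the older functions used in this unit still work.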

 

On Your Own: Swirl Practice

In order to learn R you must do R. Follow the steps below in your RStudio console:

1. Run this command to pick the course:

library(swirl)
swirl()

You will be prompted to choose a course. Type the number shown in front of 02 Getting Data. This takes you to a menu of lessons. For now we will use only lesson 6: type 6 to choose Looking at Data, then follow the instructions until you are finished.

Once you have finished the lesson, come back to this course and continue.