The Data Analysis Workflow


We begin by looking at the data analysis workflow. The version above is a concept created by Hadley Wickham and adapted from Bradley Boehmke.

The diagram shows the natural flow of how we work with data and perform research. We will begin to explore what this means as we continue.

Data Wrangling

The first step in any data analysis is data wrangling. Before we can do any kind of analysis, we need to be able to collect our data. Sometimes the data comes from a single source, but often it comes from multiple sources.

Once we have this data, we find that it is very rarely in a useful form. In fact, Dasu and Johnson suggest that this data preparation and cleaning may take up to 80% of the time.

This is where R has great power, and by the end of this course you will be able to work with multiple data sources and wrangle data like a pro.

Importing Data

When it comes to importing your data, R is very powerful. R can grab data from many sources, including:

.csv, .txt, .xls, …

SPSS, SAS, Stata

Web Scraping

Databases

MySQL

Oracle

PostgreSQL

MongoDB
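
As a minimal sketch of what importing looks like, the example below reads a small .csv file with base R's read.csv(). The file and its columns (id, score) are made up for illustration and written to a temporary location so that the example is self-contained. The other sources listed above usually need an add-on package, for example readxl for Excel files, haven for SPSS, SAS and Stata files, or DBI for databases.

```r
# Write a small, hypothetical CSV to a temporary file so the example
# does not depend on any file already on disk.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,90", "2,85", "3,77"), csv_path)

# read.csv() ships with base R and returns a data frame.
scores <- read.csv(csv_path)

# Inspect the structure: column names, types, and the first values.
str(scores)
```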

Tidying Data

Tidying data is the process of making data useful. In tidy data, each column represents a variable and each row represents a single observation. This format is quite useful for data analysis. In this course we will rely heavily on the tidyr package.
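
To make the idea of one variable per column and one observation per row concrete, here is a small sketch using tidyr's pivot_longer(). The data frame and its column names (store, jan, feb) are invented for illustration. In the wide version the month is hidden in the column names, and tidying moves it into a column of its own.

```r
library(tidyr)

# Hypothetical "wide" data: one column per month, so the month
# variable lives in the column names rather than in a column.
wide <- data.frame(
  store = c("A", "B"),
  jan   = c(10, 20),
  feb   = c(12, 25)
)

# pivot_longer() gathers the month columns into a month/sales pair,
# giving one row per store-month observation.
tidy <- pivot_longer(
  wide,
  cols      = c(jan, feb),
  names_to  = "month",
  values_to = "sales"
)

tidy
```

After tidying, each row describes exactly one store in one month, which is the shape most analysis and plotting functions expect.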

Transforming Data

Once we have the data in R and have begun to tidy it, we usually need to transform multiple aspects of the data. R has many tools that allow a user to manipulate and transform data.
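
A few typical transformations can be sketched with base R alone. The example uses the mtcars data set that ships with R. The derived column name kpl and the choice of filter are ours, picked only for illustration.

```r
# mtcars ships with R; mpg is fuel economy in miles per US gallon.
cars <- mtcars

# Add a derived column: convert miles per gallon to kilometres per litre
# (1 mile per US gallon is roughly 0.4251 km per litre).
cars$kpl <- cars$mpg * 0.4251

# Filter rows: keep only the cars with 4 cylinders.
four_cyl <- subset(cars, cyl == 4)

# Summarise: mean fuel economy for each cylinder count.
kpl_by_cyl <- aggregate(kpl ~ cyl, data = cars, FUN = mean)
kpl_by_cyl
```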

Data Exploration and Analysis

R is one of the most capable languages for exploring and analyzing data. With over 10,000 packages available, it can be hard to find a model or plot that does not already have multiple functions in R.

Visualizing Data

There are multiple ways to visualize data in R. The base graphics are easy to use and outperform Stata, SAS and SPSS. In this course we will focus on using the ggplot2 package. This package is actually a language for graphics, and once you become proficient you can create graphs like the one shown below, which was created by the Harvard Institute for Quantitative Social Science. The original plot came from The Economist:

The version below can be created with ggplot2 without any outside graphics software.
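
For a first taste of the ggplot2 syntax, here is a much simpler plot than the one discussed above. It is only a sketch, using the built-in mtcars data, and the labels are our own. It does show how a plot is built up in layers: data and aesthetics first, then points, then a fitted line.

```r
library(ggplot2)

# Start from the data and map columns to the x and y axes,
# then add one layer at a time with `+`.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                              # one point per car
  geom_smooth(method = "lm", se = FALSE) +    # straight-line fit, no band
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    title = "Fuel economy against weight"
  )

# Printing the object draws the plot.
print(p)
```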

Modeling Data

Once data has been visualized, it is important to model the data. R can handle anything from a simple t-test to working with data that is over a terabyte in size.
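
At the simple end of that range, a t-test takes a single line of base R. The sketch below compares fuel economy between automatic and manual cars in the built-in mtcars data. The choice of data set and variables is ours, purely for illustration.

```r
# In mtcars, am is 0 for automatic and 1 for manual transmission.
# t.test() with a formula runs a Welch two-sample t-test by default,
# comparing mean mpg between the two transmission groups.
result <- t.test(mpg ~ am, data = mtcars)

# The result holds the test statistic, p-value, confidence interval
# and group means.
result
```

Individual pieces can be pulled out by name, for example result$p.value or result$conf.int.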

Data Collaboration and Publishing

In every field it is key to be able to communicate what we learn and to publish this work so that it can benefit others. It is also important that statisticians can collaborate with others. RStudio provides many tools to do these things:

RStudio connects easily to:

GitHub

RPubs

RStudio Connect

shiny provides an interactive data visualization and JavaScript environment.

plotly allows for interactive graphics for webpages.

RStudio can make webpages, books, slides, and many other documents that can help relay data. (In fact, this course was built entirely in RStudio using GitHub.)