Summarize


As you have seen in your own work, being able to summarize information is crucial. We need to be able to take our data and summarize it as well. We will do this using the summarise() function.

As in the rest of these lessons, let’s first consider what happens when we try to do this in base R. We will:

1. Create a table grouped by dest.

2. Summarize each group by taking the mean of arr_delay.

head(with(flights, tapply(arr_delay, dest, mean, na.rm=TRUE)))
head(aggregate(arr_delay ~ dest, flights, mean))

I will not explain the logic of the base R code or exactly what R is doing there. Instead, let’s see how the same task looks with the summarise() function.

Enter the summarise() Function

The summarise() function is:

summarise(.data, ...)

where

.data is the tibble of interest.

... is one or more name-value pairs of summary functions, for example avg_delay = mean(arr_delay).

Such as:

mean()

median()

var()

sd()

min()

max()

...

Note: summarise() is primarily useful with data that has been grouped by one or more variables.
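To make the note above concrete, here is a minimal sketch contrasting an ungrouped call with a grouped one. It assumes that flights is the table shipped in the nycflights13 package (the lesson does not say where it comes from); the names overall, by_origin, avg_delay and sd_delay are illustrative only.

```r
library(dplyr)
library(nycflights13)  # assumption: source of the flights table

# Without grouping, summarise() collapses the whole table to a single row.
overall <- flights %>%
    summarise(avg_delay = mean(arr_delay, na.rm = TRUE),
              sd_delay  = sd(arr_delay, na.rm = TRUE))

# With grouping, it returns one row per group (here, per origin airport).
by_origin <- flights %>%
    group_by(origin) %>%
    summarise(avg_delay = mean(arr_delay, na.rm = TRUE),
              sd_delay  = sd(arr_delay, na.rm = TRUE))
```

Each name = function pair becomes one column of the result, so several summaries can be computed in a single call.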

Our example:

flights %>%
    group_by(dest) %>%
    summarise(avg_delay = mean(arr_delay, na.rm=TRUE))

Consider the logic here:

1. Group flights by destination

2. Find the average arrival delay within each group and call it avg_delay.

This is much easier to understand than the Base R code.

## # A tibble: 105 × 2
##     dest avg_delay
##    <chr>     <dbl>
## 1    ABQ  4.381890
## 2    ACK  4.852273
## 3    ALB 14.397129
## 4    ANC -2.500000
## 5    ATL 11.300113
## 6    AUS  6.019909
## 7    AVL  8.003831
## 8    BDL  7.048544
## 9    BGR  8.027933
## 10   BHM 16.877323
## # ... with 95 more rows

Another Example

Let’s say that we would like more than just the averages: we want the minimum and the maximum of every delay column (departure and arrival) by carrier:

flights %>%
    group_by(carrier) %>%
    summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), matches("delay"))
## # A tibble: 16 × 5
##    carrier dep_delay_min arr_delay_min dep_delay_max arr_delay_max
##      <chr>         <dbl>         <dbl>         <dbl>         <dbl>
## 1       9E           -24           -68           747           744
## 2       AA           -24           -75          1014          1007
## 3       AS           -21           -74           225           198
## 4       B6           -43           -71           502           497
## 5       DL           -33           -71           960           931
## 6       EV           -32           -62           548           577
## 7       F9           -27           -47           853           834
## 8       FL           -22           -44           602           572
## 9       HA           -16           -70          1301          1272
## 10      MQ           -26           -53          1137          1127
## 11      OO           -14           -26           154           157
## 12      UA           -20           -75           483           455
## 13      US           -19           -70           500           492
## 14      VX           -20           -86           653           676
## 15      WN           -13           -58           471           453
## 16      YV           -16           -46           387           381
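In newer dplyr releases, summarise_each() and funs() are deprecated. A rough equivalent using across() might look like the sketch below. It assumes dplyr 1.0.0 or later and that flights comes from the nycflights13 package; delay_range is an illustrative name.

```r
library(dplyr)         # across() requires dplyr 1.0.0 or later
library(nycflights13)  # assumption: source of the flights table

delay_range <- flights %>%
    group_by(carrier) %>%
    # Apply both summary functions to every column whose name contains "delay".
    summarise(across(matches("delay"),
                     list(min = ~ min(.x, na.rm = TRUE),
                          max = ~ max(.x, na.rm = TRUE))))
```

The values should agree with the table above, but across() orders the result columns by variable rather than by function: dep_delay_min, dep_delay_max, arr_delay_min, arr_delay_max.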

On Your Own: RStudio Practice

The following is a new function:

The helper function n_distinct(vector) counts the number of unique items in that vector.
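As a quick illustration of how n_distinct() differs from a plain count, consider the small sketch below. The vector tails is made up for this example and is not taken from the flights data.

```r
library(dplyr)

# Hypothetical vector of tail numbers: three entries, two of them identical.
tails <- c("N101", "N202", "N101")

length(tails)      # 3: total number of items
n_distinct(tails)  # 2: number of unique items
```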

Then, for each destination:

count the total number of flights

count the number of distinct planes that flew there

Your answer will look like:

## # A tibble: 105 × 3
##     dest flight_count plane_count
##    <chr>        <int>       <int>
## 1    ABQ          254         108
## 2    ACK          265          58
## 3    ALB          439         172
## 4    ANC            8           6
## 5    ATL        17215        1180
## 6    AUS         2439         993
## 7    AVL          275         159
## 8    BDL          443         186
## 9    BGR          375          46
## 10   BHM          297          45
## # ... with 95 more rows