As you have seen in your own work, being able to summarize information is crucial. We need to be able to take our data and summarize it as well. We will do this using the summarise() function.
As in the rest of these lessons, let’s first consider what happens when we try to do this in base R. We will:
1. Create a table grouped by dest.
2. Summarize each group by taking the mean of arr_delay.
head(with(flights, tapply(arr_delay, dest, mean, na.rm=TRUE)))
head(aggregate(arr_delay ~ dest, flights, mean))
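To make the two base R calls easier to follow, here is a minimal sketch on a tiny made-up table. The data frame toy and its values are hypothetical stand-ins for flights, chosen so the group means can be checked by hand.

```r
# Toy data standing in for flights (hypothetical values, not real flight data)
toy <- data.frame(
  dest      = c("ABQ", "ABQ", "ACK", "ACK"),
  arr_delay = c(10, NA, 4, 6)
)

# tapply() splits arr_delay by dest and applies mean() to each piece;
# na.rm = TRUE is passed through to mean()
by_dest <- with(toy, tapply(arr_delay, dest, mean, na.rm = TRUE))
print(by_dest)

# aggregate() with a formula drops rows containing NA by default
# (na.action = na.omit), so no na.rm argument is needed here
agg <- aggregate(arr_delay ~ dest, toy, mean)
print(agg)
```

Both calls give the same group means, but one returns a named array and the other a data frame, which is part of why the base R versions are harder to read and combine.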
I will not explain exactly what R is doing in the base R code. Instead, let’s consider the same task with the summarise() function.
Enter the summarise() Function
The summarise() function is:
summarise(.data, ...)
where
.data is the tibble of interest.
... is a list of name-value pairs, where each value is a summary function applied to a column.
Such as:
mean()
median()
var()
sd()
min()
max()
...
Note: summarise() is primarily useful with data that has been grouped by one or more variables.
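As a minimal sketch of the name-value pairs, the following applies summarise() to an ungrouped table. The tibble toy and the summary names avg_delay, max_delay, and sd_delay are hypothetical, chosen only for illustration.

```r
library(dplyr)

# Toy data (hypothetical values, not real flight data)
toy <- tibble(
  dest      = c("ABQ", "ABQ", "ACK", "ACK"),
  arr_delay = c(10, NA, 4, 6)
)

# Without group_by(), summarise() collapses the whole table to a single row;
# each argument has the form new_column_name = summary_function(column)
overall <- toy %>%
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    max_delay = max(arr_delay, na.rm = TRUE),
    sd_delay  = sd(arr_delay, na.rm = TRUE)
  )
print(overall)
```

Because the table is not grouped, the result has exactly one row, which is why summarise() is usually paired with group_by().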
Our example:
flights %>%
group_by(dest) %>%
summarise(avg_delay = mean(arr_delay, na.rm=TRUE))
Consider the logic here:
1. Group flights by destination.
2. Find the average delay of each group and call it avg_delay.
This is much easier to understand than the base R code.
## # A tibble: 105 × 2
## dest avg_delay
## <chr> <dbl>
## 1 ABQ 4.381890
## 2 ACK 4.852273
## 3 ALB 14.397129
## 4 ANC -2.500000
## 5 ATL 11.300113
## 6 AUS 6.019909
## 7 AVL 8.003831
## 8 BDL 7.048544
## 9 BGR 8.027933
## 10 BHM 16.877323
## # ... with 95 more rows
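The na.rm=TRUE argument matters because some flights have no recorded arrival delay. A small sketch, using a hypothetical tibble toy with one missing value, shows the difference it makes within each group:

```r
library(dplyr)

# Toy data (hypothetical values); one ABQ flight has a missing delay
toy <- tibble(
  dest      = c("ABQ", "ABQ", "ACK", "ACK"),
  arr_delay = c(10, NA, 4, 6)
)

by_dest <- toy %>%
  group_by(dest) %>%
  summarise(
    avg_with_na = mean(arr_delay),               # NA for any group containing an NA
    avg_delay   = mean(arr_delay, na.rm = TRUE)  # NAs dropped before averaging
  )
print(by_dest)
```

Only the ABQ group contains a missing value, so only its avg_with_na is NA, while avg_delay is defined for both groups.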
Another Example
Let’s say that we would like more than just the averages: we want the minimum and the maximum of both the departure and arrival delays by carrier:
flights %>%
group_by(carrier) %>%
summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), matches("delay"))
## # A tibble: 16 × 5
## carrier dep_delay_min arr_delay_min dep_delay_max arr_delay_max
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 9E -24 -68 747 744
## 2 AA -24 -75 1014 1007
## 3 AS -21 -74 225 198
## 4 B6 -43 -71 502 497
## 5 DL -33 -71 960 931
## 6 EV -32 -62 548 577
## 7 F9 -27 -47 853 834
## 8 FL -22 -44 602 572
## 9 HA -16 -70 1301 1272
## 10 MQ -26 -53 1137 1127
## 11 OO -14 -26 154 157
## 12 UA -20 -75 483 455
## 13 US -19 -70 500 492
## 14 VX -20 -86 653 676
## 15 WN -13 -58 471 453
## 16 YV -16 -46 387 381
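Note that summarise_each() and funs() are deprecated in recent versions of dplyr. A sketch of the same idea using across(), assuming dplyr 1.0.0 or later, is shown below on a hypothetical tibble toy. The column order of the result may differ from the output above, since across() groups the results by column rather than by function.

```r
library(dplyr)

# Toy data (hypothetical values, not real flight data)
toy <- tibble(
  carrier   = c("AA", "AA", "UA", "UA"),
  dep_delay = c(-5, 30, NA, 12),
  arr_delay = c(-10, 25, 3, NA)
)

delay_range <- toy %>%
  group_by(carrier) %>%
  summarise(across(
    matches("delay"),                    # every column whose name contains "delay"
    list(
      min = ~ min(.x, na.rm = TRUE),     # becomes <column>_min
      max = ~ max(.x, na.rm = TRUE)      # becomes <column>_max
    )
  ))
print(delay_range)
```

The names given in the list (min and max) are appended to each column name, which gives the same dep_delay_min and arr_delay_max style of naming seen above.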
On Your Own: RStudio Practice
This exercise uses a new helper function: n_distinct(vector) counts the number of unique items in that vector.
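As a quick illustration of n_distinct() on its own, here is a sketch with a hypothetical vector of tail numbers (the values are made up and one of them repeats):

```r
library(dplyr)

# Hypothetical tail numbers; "N101" appears twice
tailnums <- c("N101", "N202", "N101", "N303")

total_items  <- length(tailnums)      # counts every item, repeats included
unique_items <- n_distinct(tailnums)  # counts each distinct value once

print(total_items)
print(unique_items)
```

Here the vector has four items but only three distinct values, so the two counts differ.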
Then, for each destination:
1. Count the total number of flights.
2. Count the number of distinct planes that flew there.
Your answer will look like:
## # A tibble: 105 × 3
## dest flight_count plane_count
## <chr> <int> <int>
## 1 ABQ 254 108
## 2 ACK 265 58
## 3 ALB 439 172
## 4 ANC 8 6
## 5 ATL 17215 1180
## 6 AUS 2439 993
## 7 AVL 275 159
## 8 BDL 443 186
## 9 BGR 375 46
## 10 BHM 297 45
## # ... with 95 more rows