Further Summarizing

So far we have discussed how to find the basic numerical summaries:

mean

median

standard deviation

variance

minimum

maximum

However, there are many more operations you may wish to use for summarizing data. Many of the following examples work especially well with categorical data, for which the summaries above do not always make sense.

We will consider:

1. Grouping and Counting

2. Grouping, Counting and Sorting

3. Grouping with Other Functions

4. Counting Groups

Grouping and Counting

We have seen the functions `tally()` and `count()` in a previous lesson. Both can be used for grouping and counting, and both are very concise to call.

For example if we wished to know how many flights there were by month, we would use `tally()` in this manner:

``````flights %>%
group_by(month) %>%
tally()``````
``````## # A tibble: 12 × 2
##    month     n
##    <int> <int>
## 1      1 27004
## 2      2 24951
## 3      3 28834
## 4      4 28330
## 5      5 28796
## 6      6 28243
## 7      7 29425
## 8      8 29327
## 9      9 27574
## 10    10 28889
## 11    11 27268
## 12    12 28135``````

We could do the same thing with `count()`:

``````flights %>%
count(month)``````
``````## # A tibble: 12 × 2
##    month     n
##    <int> <int>
## 1      1 27004
## 2      2 24951
## 3      3 28834
## 4      4 28330
## 5      5 28796
## 6      6 28243
## 7      7 29425
## 8      8 29327
## 9      9 27574
## 10    10 28889
## 11    11 27268
## 12    12 28135``````

Notice: `count()` accepts `month` directly as an argument, removing the need for the `group_by()` function.
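The equivalence of the two calls can be checked on a small hand-made table. This is only a minimal sketch: `toy_flights` is a hypothetical stand-in for `flights` with just a `month` column, and the names `by_tally` and `by_count` are made up for illustration.

```r
library(dplyr)

# Hypothetical miniature stand-in for the flights data (not the real dataset)
toy_flights <- tibble(month = c(1, 1, 2, 3, 3, 3))

# Explicit grouping followed by tally()
by_tally <- toy_flights %>%
  group_by(month) %>%
  tally()

# count() does the grouping and the tallying in one call
by_count <- toy_flights %>%
  count(month)

# Compare the per-month counts produced by the two approaches
identical(by_tally$n, by_count$n)
```

On this toy table both approaches report two rows for month 1, one for month 2 and three for month 3.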

Grouping, Counting and Sorting

Both `tally()` and `count()` have an argument called `sort`. Setting `sort = TRUE` lets you go one step further and group, count and sort (largest count first) in a single call. For `tally()` this would be:

``flights %>% group_by(month) %>% tally(sort=TRUE)``
``````## # A tibble: 12 × 2
##    month     n
##    <int> <int>
## 1      7 29425
## 2      8 29327
## 3     10 28889
## 4      3 28834
## 5      5 28796
## 6      4 28330
## 7      6 28243
## 8     12 28135
## 9      9 27574
## 10    11 27268
## 11     1 27004
## 12     2 24951``````

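For `count()` we would have:

``flights %>% count(month, sort=TRUE)``
``````## # A tibble: 12 × 2
##    month     n
##    <int> <int>
## 1      7 29425
## 2      8 29327
## 3     10 28889
## 4      3 28834
## 5      5 28796
## 6      4 28330
## 7      6 28243
## 8     12 28135
## 9      9 27574
## 10    11 27268
## 11     1 27004
## 12     2 24951``````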
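To see what `sort = TRUE` does, it can help to compare it with sorting the counts by hand. The sketch below assumes a hypothetical `toy_flights` table in place of `flights`, and uses `arrange(desc(n))` as the manual equivalent.

```r
library(dplyr)

# Hypothetical miniature stand-in for the flights data (not the real dataset)
toy_flights <- tibble(month = c(1, 1, 2, 3, 3, 3))

# Count and sort in a single call
sorted_count <- toy_flights %>%
  count(month, sort = TRUE)

# The same result built in two steps: count, then order by descending n
manual_sort <- toy_flights %>%
  count(month) %>%
  arrange(desc(n))

# Months ordered from most to least frequent
sorted_count$month
```

Here month 3 comes first because it has the most rows, followed by month 1 and then month 2.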

Grouping with other functions

We can also sum over other values rather than just counting the rows as in the examples above. For example, suppose we were interested in the total distance flown in each month. We could do this with the `summarise()` function, the `tally()` function or the `count()` function:

``````flights %>%
group_by(month) %>%
summarise(dist = sum(distance))``````
``````## # A tibble: 12 × 2
##    month     dist
##    <int>    <dbl>
## 1      1 27188805
## 2      2 24975509
## 3      3 29179636
## 4      4 29427294
## 5      5 29974128
## 6      6 29856388
## 7      7 31149199
## 8      8 31149334
## 9      9 28711426
## 10    10 30012086
## 11    11 28639718
## 12    12 29954084``````

We take `flights`, group by month, and then create a new variable called `dist` that holds the sum of `distance` within each month. The new column appears in the output above. For `tally()` we could do:

``````flights %>%
group_by(month) %>%
tally(wt = distance)``````
``````## # A tibble: 12 × 2
##    month        n
##    <int>    <dbl>
## 1      1 27188805
## 2      2 24975509
## 3      3 29179636
## 4      4 29427294
## 5      5 29974128
## 6      6 29856388
## 7      7 31149199
## 8      8 31149334
## 9      9 28711426
## 10    10 30012086
## 11    11 28639718
## 12    12 29954084``````

Note: in `tally()`, `wt` stands for weight. When `wt` is supplied, `tally()` sums that variable within each group instead of counting rows, so here `n` holds the total distance per month. With the `count()` function we also use `wt`:

``flights %>% count(month, wt = distance)``
``````## # A tibble: 12 × 2
##    month        n
##    <int>    <dbl>
## 1      1 27188805
## 2      2 24975509
## 3      3 29179636
## 4      4 29427294
## 5      5 29974128
## 6      6 29856388
## 7      7 31149199
## 8      8 31149334
## 9      9 28711426
## 10    10 30012086
## 11    11 28639718
## 12    12 29954084``````
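The three approaches can be compared side by side on a small example. This is a sketch under stated assumptions: `toy_flights` and its values are invented for illustration and are not taken from the `flights` data.

```r
library(dplyr)

# Hypothetical miniature stand-in for the flights data (values are made up)
toy_flights <- tibble(
  month    = c(1, 1, 2, 3, 3),
  distance = c(100, 250, 400, 50, 75)
)

# Explicit grouped sum
via_summarise <- toy_flights %>%
  group_by(month) %>%
  summarise(dist = sum(distance))

# Weighted tally: sums distance within each group instead of counting rows
via_tally <- toy_flights %>%
  group_by(month) %>%
  tally(wt = distance)

# Weighted count: grouping and weighted sum in one call
via_count <- toy_flights %>%
  count(month, wt = distance)

# Check that all three give the same totals per month
all.equal(via_summarise$dist, via_tally$n)
all.equal(via_tally$n, via_count$n)
```

The only visible difference is the column name: `summarise()` lets you choose it (`dist` here), while `tally()` and `count()` call it `n` by default.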

Counting Groups

We may want to know how large our groups are. To do this we can use the following functions:

`group_size()` returns the number of rows in each group.

`n_groups()` returns the number of groups.

So if we wanted to count the number of flights by month, we could group by month and find the group sizes using `group_size()`:

``````flights %>%
group_by(month) %>%
group_size()``````
``````##  [1] 27004 24951 28834 28330 28796 28243 29425 29327 27574 28889 27268
## [12] 28135``````

If we just wished to know how many months were represented in our data, we could use the `n_groups()` function:

``````flights %>%
group_by(month) %>%
n_groups()``````
``## [1] 12``
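The relationship between the two functions can be sketched on a small table. The example assumes a hypothetical `toy_flights` data frame, and the names `grouped`, `sizes` and `n_months` are made up for illustration.

```r
library(dplyr)

# Hypothetical miniature stand-in for the flights data (not the real dataset)
toy_flights <- tibble(month = c(1, 1, 2, 3, 3, 3))

grouped <- toy_flights %>% group_by(month)

# One entry per group: the number of rows in that group
sizes <- group_size(grouped)

# A single number: how many groups there are
n_months <- n_groups(grouped)

# There is one size per group, and the sizes add up to the total row count
length(sizes) == n_months
sum(sizes) == nrow(toy_flights)
```

Unlike `tally()` and `count()`, both functions return plain values rather than a tibble, so the group labels themselves are not included in the result.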