Further Summarizing


So far we have discussed how to find the basic numerical summaries:

mean

median

standard deviation

variance

minimum

maximum

However, there are many more operations you may wish to use when summarizing data. Many of the following examples are especially useful for categorical data, for which summaries such as the mean or standard deviation do not always make sense.

We will consider:

1. Grouping and Counting

2. Grouping, Counting and Sorting

3. Grouping with Other Functions

4. Counting Groups

Grouping and Counting

We have seen the functions tally() and count() in a previous lesson. Both can be used for grouping and counting, and both are very concise to call.

For example, if we wished to know how many flights there were by month, we would use tally() in this manner:

flights %>%
  group_by(month) %>% 
  tally()
## # A tibble: 12 × 2
##    month     n
##    <int> <int>
## 1      1 27004
## 2      2 24951
## 3      3 28834
## 4      4 28330
## 5      5 28796
## 6      6 28243
## 7      7 29425
## 8      8 29327
## 9      9 27574
## 10    10 28889
## 11    11 27268
## 12    12 28135

We could do the same thing with count():

flights %>% 
  count(month)
## # A tibble: 12 × 2
##    month     n
##    <int> <int>
## 1      1 27004
## 2      2 24951
## 3      3 28834
## 4      4 28330
## 5      5 28796
## 6      6 28243
## 7      7 29425
## 8      8 29327
## 9      9 27574
## 10    10 28889
## 11    11 27268
## 12    12 28135

Notice: count() accepts month directly as an argument, removing the need for the group_by() function.
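As a rough sketch of what these shortcuts do, count() behaves much like group_by() followed by summarise(n = n()). The small trips table below is made up for illustration (it is not part of flights), and its carrier column shows that the same approach works for categorical data.

```r
library(dplyr)

# Hypothetical toy data: one row per trip, with a categorical carrier column
trips <- tibble(
  carrier = c("AA", "UA", "AA", "DL", "UA", "AA"),
  month   = c(1L, 1L, 2L, 2L, 2L, 3L)
)

# Long form: group, then count the rows in each group with n()
by_summarise <- trips %>%
  group_by(carrier) %>%
  summarise(n = n())

# Short forms: tally() on grouped data, or count() with the column named inside
by_tally <- trips %>% group_by(carrier) %>% tally()
by_count <- trips %>% count(carrier)

by_count
```

All three results should hold the same counts here: 3 rows for AA, 1 for DL and 2 for UA.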

Grouping, Counting and Sorting

Both tally() and count() have an argument called sort. Setting sort = TRUE lets you go one step further and group, count, and sort (largest counts first) in a single call. For tally() this would be:

flights %>% group_by(month) %>% tally(sort=TRUE)
## # A tibble: 12 × 2
##    month     n
##    <int> <int>
## 1      7 29425
## 2      8 29327
## 3     10 28889
## 4      3 28834
## 5      5 28796
## 6      4 28330
## 7      6 28243
## 8     12 28135
## 9      9 27574
## 10    11 27268
## 11     1 27004
## 12     2 24951

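Then for count() we would have:

flights %>% count(month, sort = TRUE)

This returns the same sorted table as the tally() call above. Be careful to use count() rather than count_(): the underscore version is an older, deprecated variant that does not accept a bare column name, so count_(month, sort=TRUE) fails with the error object 'month' not found.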
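To make the role of the sort argument concrete, the sketch below compares sort = TRUE with an explicit arrange(desc(n)) step. The trips table is invented for illustration and is not part of the flights data.

```r
library(dplyr)

# Hypothetical toy data with a categorical column
trips <- tibble(carrier = c("AA", "UA", "AA", "DL", "UA", "AA"))

# Count and sort in one call
sorted_count <- trips %>% count(carrier, sort = TRUE)

# The same result spelled out: count first, then order by n, largest first
sorted_manual <- trips %>%
  count(carrier) %>%
  arrange(desc(n))

sorted_count
```

In both cases the rows should come back in the order AA (3), UA (2), DL (1).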

Grouping with Other Functions

We can also sum other values rather than just counting rows as in the examples above. For example, suppose we were interested in the total distance flown in a given month. We could compute this with the summarise(), tally(), or count() function:

flights %>% 
  group_by(month) %>% 
  summarise(dist = sum(distance))
## # A tibble: 12 × 2
##    month     dist
##    <int>    <dbl>
## 1      1 27188805
## 2      2 24975509
## 3      3 29179636
## 4      4 29427294
## 5      5 29974128
## 6      6 29856388
## 7      7 31149199
## 8      8 31149334
## 9      9 28711426
## 10    10 30012086
## 11    11 28639718
## 12    12 29954084

We take flights, group by month, and then create a new variable called dist that holds the sum of distance within each month. For tally() we could do:

flights %>% 
  group_by(month) %>% 
  tally(wt = distance)
## # A tibble: 12 × 2
##    month        n
##    <int>    <dbl>
## 1      1 27188805
## 2      2 24975509
## 3      3 29179636
## 4      4 29427294
## 5      5 29974128
## 6      6 29856388
## 7      7 31149199
## 8      8 31149334
## 9      9 28711426
## 10    10 30012086
## 11    11 28639718
## 12    12 29954084

Note: in tally(), wt stands for weight. When wt is supplied, tally() sums the values of that variable within each group instead of counting rows, so here n is the total distance per month. With the count() function we also use wt:

flights %>% count(month, wt = distance)
## # A tibble: 12 × 2
##    month        n
##    <int>    <dbl>
## 1      1 27188805
## 2      2 24975509
## 3      3 29179636
## 4      4 29427294
## 5      5 29974128
## 6      6 29856388
## 7      7 31149199
## 8      8 31149334
## 9      9 28711426
## 10    10 30012086
## 11    11 28639718
## 12    12 29954084
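The effect of wt is easier to check by hand on a tiny table. The sketch below uses a made-up legs table (not the flights data) to contrast an unweighted count with a weighted one, and with the equivalent summarise() call.

```r
library(dplyr)

# Hypothetical toy data: five flight legs across two months
legs <- tibble(
  month    = c(1L, 1L, 2L, 2L, 2L),
  distance = c(100, 250, 300, 50, 150)
)

# Without wt: n is the number of rows in each month
unweighted <- legs %>% count(month)

# With wt: n is the sum of distance in each month
weighted <- legs %>% count(month, wt = distance)

# The same totals written with summarise()
by_sum <- legs %>%
  group_by(month) %>%
  summarise(n = sum(distance))

weighted
```

Here the unweighted counts are 2 and 3 rows, while the weighted version gives totals of 350 and 500, matching the summarise() result.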

Counting Groups

We may want to know how large our groups are. To do this we can use the following functions:

group_size() returns the number of rows in each group.

n_groups() returns the number of groups.

So if we wanted to count the number of flights by month, we could group by month and find the group sizes using group_size():

flights %>% 
  group_by(month) %>% 
  group_size()
##  [1] 27004 24951 28834 28330 28796 28243 29425 29327 27574 28889 27268
## [12] 28135

If we just wished to know how many months were represented in our data, we could use the n_groups() function:

flights %>% 
  group_by(month) %>% 
  n_groups()
## [1] 12
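As a small check on how these two functions relate to the earlier ones, the sketch below runs them on a made-up legs table (not the flights data). It assumes only dplyr; n_distinct() is used as one alternative way to count the groups.

```r
library(dplyr)

# Hypothetical toy data: six rows spread over three months
legs <- tibble(month = c(1L, 1L, 2L, 2L, 2L, 3L))

grouped <- legs %>% group_by(month)

# A plain vector with one entry per group (rows per month)
sizes <- group_size(grouped)

# A single number: how many groups there are
n_months <- n_groups(grouped)

# The same information from functions seen earlier
sizes_from_count <- legs %>% count(month) %>% pull(n)
n_months_distinct <- n_distinct(legs$month)

sizes
n_months
```

The group sizes here are 2, 3 and 1, and there are 3 groups. Unlike count(), group_size() returns a bare vector, so the month labels are not attached to the sizes.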