So far we have discussed how to find the basic numerical summaries:
mean
median
standard deviation
variance
minimum
maximum
However, there are many more operations you may wish to perform when summarizing data. In fact, many of the following examples work well for categorical data, for which the summaries above do not always make sense.
We will consider:
1. Grouping and Counting
2. Grouping, Counting and Sorting
3. Other Groupings
4. Counting Groups
Grouping and Counting
We have seen the functions tally() and count() in a previous lesson. Both can be used for grouping and counting, and both are concise to call.
For example, if we wished to know how many flights there were in each month, we would use tally() in this manner:
library(dplyr)
library(nycflights13)  # provides the flights data

flights %>%
  group_by(month) %>%
  tally()
## # A tibble: 12 × 2
##    month     n
##    <int> <int>
##  1     1 27004
##  2     2 24951
##  3     3 28834
##  4     4 28330
##  5     5 28796
##  6     6 28243
##  7     7 29425
##  8     8 29327
##  9     9 27574
## 10    10 28889
## 11    11 27268
## 12    12 28135
We could do the same thing with count():
flights %>%
  count(month)
## # A tibble: 12 × 2
##    month     n
##    <int> <int>
##  1     1 27004
##  2     2 24951
##  3     3 28834
##  4     4 28330
##  5     5 28796
##  6     6 28243
##  7     7 29425
##  8     8 29327
##  9     9 27574
## 10    10 28889
## 11    11 27268
## 12    12 28135
Note: count() accepts month directly as an argument, which removes the need for a separate group_by() call.
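To check that the two calls really are interchangeable, here is a minimal sketch on a small made-up data frame. The toy_flights tibble and its carrier values are hypothetical and exist only for illustration; the example also shows that counting works naturally on a categorical column.

```r
library(dplyr)

# Hypothetical toy data (not the real flights table)
toy_flights <- tibble(
  carrier = c("AA", "UA", "AA", "DL", "UA", "AA"),
  month   = c(1L, 1L, 2L, 2L, 2L, 3L)
)

# Explicit grouping followed by tally()
by_tally <- toy_flights %>%
  group_by(carrier) %>%
  tally()

# count() does the grouping and the counting in one call
by_count <- toy_flights %>%
  count(carrier)

# Both give one row per carrier with the same counts in column n
identical(by_tally$n, by_count$n)
```

In practice count() is the shorter choice when you only need the counts, while group_by() followed by tally() is convenient when the data is already grouped for other reasons.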
Grouping, Counting and Sorting
Both tally() and count() have an argument called sort. Setting sort = TRUE lets you go one step further and group, count and sort in descending order of the count, all at the same time. For tally() this would be:
flights %>% group_by(month) %>% tally(sort = TRUE)
## # A tibble: 12 × 2
##    month     n
##    <int> <int>
##  1     7 29425
##  2     8 29327
##  3    10 28889
##  4     3 28834
##  5     5 28796
##  6     4 28330
##  7     6 28243
##  8    12 28135
##  9     9 27574
## 10    11 27268
## 11     1 27004
## 12     2 24951
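For count() we would have:
flights %>% count(month, sort = TRUE)
## # A tibble: 12 × 2
##    month     n
##    <int> <int>
##  1     7 29425
##  2     8 29327
##  3    10 28889
##  4     3 28834
##  5     5 28796
##  6     4 28330
##  7     6 28243
##  8    12 28135
##  9     9 27574
## 10    11 27268
## 11     1 27004
## 12     2 24951
Note: use count() here, not count_(). The underscore version is an older standard-evaluation variant that does not accept a bare column name, so count_(month, sort=TRUE) fails with "object 'month' not found".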
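The effect of sort = TRUE can be checked on a small made-up data frame. This is only a sketch: toy_flights is hypothetical, and the step-by-step version with arrange(desc(n)) is included to show what the argument amounts to.

```r
library(dplyr)

# Hypothetical toy data: one row per flight
toy_flights <- tibble(month = c(1L, 2L, 2L, 3L, 3L, 3L))

# Count and sort in a single call
sorted_count <- toy_flights %>%
  count(month, sort = TRUE)

# The same result written out step by step
manual_sort <- toy_flights %>%
  count(month) %>%
  arrange(desc(n))

# Months ordered from the most rows to the fewest
sorted_count$month
```

Here month 3 has the most rows and month 1 the fewest, so the sorted result lists the months in the order 3, 2, 1.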
Grouping with other functions
We can also sum over other values rather than just counting rows as in the examples above. For example, say we were interested in the total distance flown in each month. We could do this with the summarise() function, the tally() function or the count() function:
flights %>%
  group_by(month) %>%
  summarise(dist = sum(distance))
## # A tibble: 12 × 2
##    month     dist
##    <int>    <dbl>
##  1     1 27188805
##  2     2 24975509
##  3     3 29179636
##  4     4 29427294
##  5     5 29974128
##  6     6 29856388
##  7     7 31149199
##  8     8 31149334
##  9     9 28711426
## 10    10 30012086
## 11    11 28639718
## 12    12 29954084
We take flights, group by month, and then create a new variable called dist, which holds the sum of distance within each month. To see the new column for yourself, please run this on your computer. For tally() we could do:
flights %>%
  group_by(month) %>%
  tally(wt = distance)
## # A tibble: 12 × 2
##    month        n
##    <int>    <dbl>
##  1     1 27188805
##  2     2 24975509
##  3     3 29179636
##  4     4 29427294
##  5     5 29974128
##  6     6 29856388
##  7     7 31149199
##  8     8 31149334
##  9     9 28711426
## 10    10 30012086
## 11    11 28639718
## 12    12 29954084
Note: in tally(), wt stands for weight. When wt is supplied, tally() sums that variable within each group instead of counting rows, so here n is the total distance per month. With the count() function we also use wt:
flights %>% count(month, wt = distance)
## # A tibble: 12 × 2
##    month        n
##    <int>    <dbl>
##  1     1 27188805
##  2     2 24975509
##  3     3 29179636
##  4     4 29427294
##  5     5 29974128
##  6     6 29856388
##  7     7 31149199
##  8     8 31149334
##  9     9 28711426
## 10    10 30012086
## 11    11 28639718
## 12    12 29954084
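The three approaches can be compared side by side on a tiny made-up data frame where the totals are easy to verify by hand. This is a sketch only: toy_flights and its distances are hypothetical values chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data: two flights in month 1, one flight in month 2
toy_flights <- tibble(
  month    = c(1L, 1L, 2L),
  distance = c(100, 250, 400)
)

# Three ways to get the total distance per month
via_summarise <- toy_flights %>%
  group_by(month) %>%
  summarise(dist = sum(distance))

via_tally <- toy_flights %>%
  group_by(month) %>%
  tally(wt = distance)

via_count <- toy_flights %>%
  count(month, wt = distance)

# Only the column name differs: dist for summarise(), n for tally() and count()
all.equal(via_summarise$dist, via_tally$n)
all.equal(via_tally$n, via_count$n)
```

Month 1 totals 100 + 250 = 350 and month 2 totals 400 in all three results. summarise() is the most flexible of the three because it lets you name the column and use functions other than sum().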
Counting Groups
We may want to know how large our groups are. To do this we can use the following functions:
group_size() returns the number of rows in each group.
n_groups() returns the number of groups.
So if we wanted to count the number of flights by month, we could group by month and find the group sizes using group_size():
flights %>%
  group_by(month) %>%
  group_size()
## [1] 27004 24951 28834 28330 28796 28243 29425 29327 27574 28889 27268
## [12] 28135
If we just wished to know how many months were represented in our data, we could use the n_groups() function:
flights %>%
  group_by(month) %>%
  n_groups()
## [1] 12
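As a quick check on how these two functions relate to the counting functions from earlier, here is a minimal sketch on a made-up data frame. The toy_flights tibble is hypothetical; the point is that group_size() returns a plain vector of counts, while count() returns the same counts as a column of a table.

```r
library(dplyr)

# Hypothetical toy data: months 1, 2 and 3 with 2, 1 and 3 rows
toy_flights <- tibble(month = c(1L, 1L, 2L, 3L, 3L, 3L))

grouped <- toy_flights %>% group_by(month)

sizes    <- group_size(grouped)  # one count per group, in group order
n_months <- n_groups(grouped)    # a single number: how many groups exist

# The group sizes match the n column produced by count()
counts <- toy_flights %>% count(month)
all(sizes == counts$n)
```

Because group_size() drops the group labels, count() or tally() is usually the better choice when you need to know which count belongs to which group.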