Arrange


We often need to make sure the data is ordered in a particular way. In dplyr this is done with the arrange() function. The same result is possible in base R, but the code is not always as easy to read.

Let’s say that we wish to look at only carriers and departure delays, and we wish to order the departure delays from smallest to largest. In base R we would have to run the following command:

flights[order(flights$dep_delay), c("carrier", "dep_delay")]

In this command, order() sorts the rows by dep_delay, and the column vector keeps only carrier and dep_delay in the result.
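To see what the two pieces of the base R command do, here is a minimal sketch on a small made-up data frame. The name toy and its values are assumptions for illustration only and are not part of the flights data.

```r
# Toy data frame standing in for flights (values are made up for illustration)
toy <- data.frame(
  carrier   = c("AA", "B6", "DL", "UA"),
  dep_delay = c(12, -5, 30, 0),
  distance  = c(500, 300, 800, 650)
)

# order() returns the row positions that would sort dep_delay ascending
idx <- order(toy$dep_delay)
idx

# Use those positions as the row index and name the columns to keep
sorted <- toy[idx, c("carrier", "dep_delay")]
sorted
```

The row index and the column selection sit inside the same pair of brackets, which is part of why the base R version can be harder to read than a pipeline.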

Enter the arrange() Function

We could do this more simply using the arrange() function:

arrange(.data, ...)

Where

.data is a data frame of interest.

... are the variables you wish to sort by.

flights %>%
    select(carrier, dep_delay) %>%
    arrange(dep_delay)
## # A tibble: 336,776 × 2
##    carrier dep_delay
##      <chr>     <dbl>
## 1       B6       -43
## 2       DL       -33
## 3       EV       -32
## 4       DL       -30
## 5       F9       -27
## 6       MQ       -26
## 7       EV       -25
## 8       MQ       -25
## 9       9E       -24
## 10      B6       -24
## # ... with 336,766 more rows

Here we first use select() to pick only the columns that we want and then use arrange() to sort by dep_delay. If we had wished to order them in a descending manner, we could have simply wrapped the variable in the desc() function:

flights %>%
    select(carrier, dep_delay) %>%
    arrange(desc(dep_delay))
## # A tibble: 336,776 × 2
##    carrier dep_delay
##      <chr>     <dbl>
## 1       HA      1301
## 2       MQ      1137
## 3       MQ      1126
## 4       AA      1014
## 5       MQ      1005
## 6       DL       960
## 7       DL       911
## 8       DL       899
## 9       DL       898
## 10      AA       896
## # ... with 336,766 more rows
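Because the ... argument accepts more than one variable, arrange() can also break ties with a second sort key. The following is a minimal sketch on a made-up tibble (toy and its values are assumptions, not taken from flights), sorting by carrier and then by the largest delay within each carrier.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  carrier   = c("DL", "AA", "DL", "AA", "B6"),
  dep_delay = c(10, 45, 30, 5, 30)
)

# Sort by carrier first; within each carrier, largest delay first
by_carrier <- toy %>%
  arrange(carrier, desc(dep_delay))
by_carrier
```

Variables are applied left to right, so later variables only matter where the earlier ones are tied.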

More Complex Arrange

Let’s say that we wish to look at the top 3 departure delays for each day and then order them from largest to smallest departure delay. We need to do the following:

1. Group by month and day

2. Pick the top 3 departure delays

3. Order them from largest to smallest

This can be done in the following manner:

flights %>% 
  group_by(month, day) %>%  
  top_n(3, dep_delay) %>% 
  arrange(desc(dep_delay))
## Source: local data frame [1,108 x 19]
## Groups: month, day [365]
## 
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     9      641            900      1301     1242
## 2   2013     6    15     1432           1935      1137     1607
## 3   2013     1    10     1121           1635      1126     1239
## 4   2013     9    20     1139           1845      1014     1457
## 5   2013     7    22      845           1600      1005     1044
## 6   2013     4    10     1100           1900       960     1342
## 7   2013     3    17     2321            810       911      135
## 8   2013     6    27      959           1900       899     1236
## 9   2013     7    22     2257            759       898      121
## 10  2013    12     5      756           1700       896     1058
## # ... with 1,098 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Where

group_by() is a way to group data so that later operations are performed within each group. Here the top 3 delays are found within each combination of month and day.

top_n() takes a tibble and returns the requested number of rows ranked by a chosen variable, here the 3 rows with the largest dep_delay in each group.
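The same pipeline is easier to follow on a handful of rows. This is a minimal sketch on a made-up tibble (toy, its columns, and its values are assumptions for illustration), keeping the top 2 delays per day instead of the top 3.

```r
library(dplyr)

# Made-up data: three flights on each of two days
toy <- tibble(
  day       = c(1, 1, 1, 2, 2, 2),
  dep_delay = c(5, 40, 20, 90, 15, 60)
)

top2 <- toy %>%
  group_by(day) %>%
  top_n(2, dep_delay) %>%    # keep the 2 largest delays within each day
  arrange(desc(dep_delay))   # sorts the whole table; groups are ignored by default
top2
```

Two points are worth keeping in mind. top_n() keeps all tied rows, so a group can return more rows than requested, which would explain a result with more than 3 rows per day. In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), so slice_max(dep_delay, n = 2) is the more current way to write that step.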

On Your Own: RStudio Practice

Perform the following operations:

Group by month and day.

Use sample_n() to pick 1 observation per day.

Arrange from largest to smallest departure delay.

Your answer may look like:

flights %>%
  group_by(month, day) %>%
  sample_n(1) %>%
  arrange(desc(dep_delay))
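Because sample_n() draws rows at random, your rows will usually differ from someone else's. If you want a repeatable draw, one option is to call set.seed() first. The sketch below uses a made-up tibble (toy and the seed value 42 are arbitrary assumptions for illustration).

```r
library(dplyr)

# Made-up data: three flights on each of two days
toy <- tibble(
  day       = c(1, 1, 1, 2, 2, 2),
  dep_delay = c(5, 40, 20, 90, 15, 60)
)

set.seed(42)  # arbitrary seed so the random draw can be repeated

picked <- toy %>%
  group_by(day) %>%
  sample_n(1) %>%             # one random row from each day
  arrange(desc(dep_delay))    # largest sampled delay first
picked
```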