We often need to make sure the data is ordered in a certain way. This can be done easily in R with the arrange() function. Again, we can do this in base R, but the base R approach is not always easy to read.
Let’s say that we wish to look at only carriers and departure delays, and we wish to order the departure delays from smallest to largest. In base R we would have to run the following command:
flights[order(flights$dep_delay), c("carrier", "dep_delay")]
In this command we order the rows by dep_delay and then keep only the carrier and dep_delay columns.
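To see what order() is doing inside that command, here is a minimal sketch on a small made-up data frame. The name toy_flights and its values are invented for illustration and are not taken from the flights data.

```r
# Hypothetical toy data, invented for illustration
toy_flights <- data.frame(
  carrier   = c("UA", "AA", "B6", "DL"),
  dep_delay = c(2, -5, 40, -1),
  stringsAsFactors = FALSE
)

# order() returns the row positions that would sort the vector
idx <- order(toy_flights$dep_delay)
idx

# Using those positions as the row index reorders the data frame,
# and the column vector keeps only the columns we name
sorted_base <- toy_flights[idx, c("carrier", "dep_delay")]
sorted_base
```

The sorting and the column selection are packed into a single pair of square brackets, which is part of why the base R version can be harder to read.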
Enter the arrange() Function
We can do this more simply with the arrange() function:
arrange(.data, ...)
Where
.data is a data frame of interest.
... are the variables you wish to sort by.
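Because ... accepts more than one variable, later variables can be used to break ties in earlier ones. The sketch below assumes dplyr is installed and uses a made-up tibble (toy_flights is a hypothetical name, not part of the flights data).

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy_flights <- tibble(
  carrier   = c("UA", "AA", "B6", "AA"),
  dep_delay = c(2, -5, 40, 10)
)

# Sort by carrier first; rows with the same carrier are then
# ordered by dep_delay
by_two <- toy_flights %>%
  arrange(carrier, dep_delay)
by_two
```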
flights %>%
select(carrier, dep_delay) %>%
arrange(dep_delay)
## # A tibble: 336,776 × 2
## carrier dep_delay
## <chr> <dbl>
## 1 B6 -43
## 2 DL -33
## 3 EV -32
## 4 DL -30
## 5 F9 -27
## 6 MQ -26
## 7 EV -25
## 8 MQ -25
## 9 9E -24
## 10 B6 -24
## # ... with 336,766 more rows
With arrange(), we first use select() to pick only the columns that we want, and then we arrange by dep_delay. If we had wished to order them in a descending manner, we could have simply used the desc() function:
flights %>%
select(carrier, dep_delay) %>%
arrange(desc(dep_delay))
## # A tibble: 336,776 × 2
## carrier dep_delay
## <chr> <dbl>
## 1 HA 1301
## 2 MQ 1137
## 3 MQ 1126
## 4 AA 1014
## 5 MQ 1005
## 6 DL 960
## 7 DL 911
## 8 DL 899
## 9 DL 898
## 10 AA 896
## # ... with 336,766 more rows
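One detail worth knowing when a column has missing values: arrange() places NA rows at the end, whether or not desc() is used. The sketch below illustrates this on a made-up tibble (toy_flights and its values are invented for illustration).

```r
library(dplyr)

# Hypothetical toy data with one missing delay
toy_flights <- tibble(
  carrier   = c("UA", "AA", "B6", "DL"),
  dep_delay = c(2, NA, 40, -1)
)

# Ascending: smallest delay first, NA last
ascending <- toy_flights %>% arrange(dep_delay)
ascending

# Descending: largest delay first, NA still last
descending <- toy_flights %>% arrange(desc(dep_delay))
descending
```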
More Complex Arrange
Let’s say that we wish to look at the top 3 departure delays for each day and then order them from largest to smallest departure delay. We need to do the following:
1. Group by month and day.
2. Pick the top 3 departure delays.
3. Order them from largest to smallest.
This can be done in the following manner:
flights %>%
group_by(month, day) %>%
top_n(3, dep_delay) %>%
arrange(desc(dep_delay))
## Source: local data frame [1,108 x 19]
## Groups: month, day [365]
##
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## 7 2013 3 17 2321 810 911 135
## 8 2013 6 27 959 1900 899 1236
## 9 2013 7 22 2257 759 898 121
## 10 2013 12 5 756 1700 896 1058
## # ... with 1,098 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
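Where
group_by() is a way to group data, so that later operations are performed on each group separately. Here the top 3 delays are picked within each combination of month and day.
top_n() takes a tibble and returns the requested number of rows, ranked by a chosen variable. Ties are kept, so a group can return more than 3 rows; this is why the output above has 1,108 rows rather than 3 × 365 = 1,095.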
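Note that in the output above the rows are sorted across all days, not within each day: arrange() ignores the grouping unless it is asked otherwise. The sketch below, on a made-up tibble (toy_flights is a hypothetical name with invented values), contrasts the default with the .by_group = TRUE argument.

```r
library(dplyr)

# Hypothetical toy data: three flights on each of two days
toy_flights <- tibble(
  month     = c(1, 1, 1, 1, 1, 1),
  day       = c(1, 1, 1, 2, 2, 2),
  dep_delay = c(5, 30, 12, 50, -3, 8)
)

# Keep the 2 largest delays per day
top2 <- toy_flights %>%
  group_by(month, day) %>%
  top_n(2, dep_delay)

# Default: grouping is ignored, rows are sorted across all days
overall <- top2 %>% arrange(desc(dep_delay))
overall

# .by_group = TRUE: sort by month and day first, then by delay within each day
within_day <- top2 %>% arrange(desc(dep_delay), .by_group = TRUE)
within_day
```

In dplyr 1.0.0 and later, top_n() is marked as superseded in favour of slice_max(); top_n() is used here only to stay consistent with the example above.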
On Your Own: RStudio Practice
Perform the following operations:
Group by month and day.
Use sample_n() to pick 1 observation per day.
Arrange from largest to smallest departure delay.
Your answer may look like:
flights %>%
group_by(month,day) %>%
sample_n(1) %>%
arrange(desc(dep_delay))
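Since sample_n() draws rows at random, your result will differ from run to run. If you want a repeatable answer, one option is to set a seed first. The sketch below shows this on a made-up tibble; toy_flights, its values, and the seed 42 are arbitrary choices for illustration.

```r
library(dplyr)

# Hypothetical toy data: two flights on each of two days
toy_flights <- tibble(
  month     = c(1, 1, 1, 1),
  day       = c(1, 1, 2, 2),
  dep_delay = c(5, 30, 50, -3)
)

set.seed(42)  # arbitrary seed so the random draw can be repeated

one_per_day <- toy_flights %>%
  group_by(month, day) %>%
  sample_n(1) %>%                 # one random row per day
  arrange(desc(dep_delay))        # largest delay first
one_per_day
```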