Filtering


We will focus on working with our data more. At this point we will consider how we pick the rows of the data that we wish to work with. If you consider many modern data sets, we have so much information that we may not want to bring it all in at once.

We have not discussed exactly how R works at this point, however R brings data into the RAM of your computer. This means you can be limited for what size data you can bring in at once. Very rarely do you need the entire data set. We will focus on how to pick the rows or observations we want now.

Enter the filter() Function

The filter() function chooses rows that meet a specific criteria. We can do this with Base R functions or with dplyr`.

Let’s say that we want to look at the flights data but we are only interested in the data from the first day of the year. We could do this without learning a new command and use indexing which we learned in dataframes:

flights[flights$month==1 & flights$day==1, ]
## # A tibble: 842 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      533            529         4      850
## 3   2013     1     1      542            540         2      923
## 4   2013     1     1      544            545        -1     1004
## 5   2013     1     1      554            600        -6      812
## 6   2013     1     1      554            558        -4      740
## 7   2013     1     1      555            600        -5      913
## 8   2013     1     1      557            600        -3      709
## 9   2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Now this is not very difficult to do, what we have is that we are working with flights and we only want to keep the rows of data there month==1 and day==1. However we could use the filter() function to do this in a much easier to read format:

filter(.data, ...)

where

.data is a tibble.

... is a set of arguments the data you want returned needs to meet.

This means in our example we could perform the following:

filter(flights, month==1, day==1)
## # A tibble: 842 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      533            529         4      850
## 3   2013     1     1      542            540         2      923
## 4   2013     1     1      544            545        -1     1004
## 5   2013     1     1      554            600        -6      812
## 6   2013     1     1      554            558        -4      740
## 7   2013     1     1      555            600        -5      913
## 8   2013     1     1      557            600        -3      709
## 9   2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

We could instead do the same command with piping to make it even easier to read:

flights %>%
    filter(month==1, day==1)

Finally we could also only do one filtering at a time and chain it:

flights %>%
    filter(month==1) %>%
    filter(day==1)

These will all lead to the same output. Make sure you verify this on your own screen.

Further Filtering

filter() supports the use of multiple conditions where we can use Boolean. For example if we wanted to consider only flights that depart between 0600 and 0605 we could do the following:

flights %>% filter(dep_time >= 600, dep_time <= 605)
## # A tibble: 2,460 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      600            600         0      851
## 2   2013     1     1      600            600         0      837
## 3   2013     1     1      601            600         1      844
## 4   2013     1     1      602            610        -8      812
## 5   2013     1     1      602            605        -3      821
## 6   2013     1     2      600            600         0      814
## 7   2013     1     2      600            605        -5      751
## 8   2013     1     2      600            600         0      819
## 9   2013     1     2      600            600         0      846
## 10  2013     1     2      600            600         0      737
## # ... with 2,450 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

We would then get 2,460 different flights. Another way to accomplish this would be to use what is called a helper function. We can use between() which lets us pick specific numeric ranges:

flights %>%
  filter(between(dep_time,600,605))

We can also use the filter() function to remove missing data for us. Previously we learned about a class of functions called is.foo() where foo represents a data type. We could choose to only use flights that have a departure time. That means we wish to not have missing data for departure time:

flights %>% filter(!is.na(dep_time))
## # A tibble: 328,521 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      533            529         4      850
## 3   2013     1     1      542            540         2      923
## 4   2013     1     1      544            545        -1     1004
## 5   2013     1     1      554            600        -6      812
## 6   2013     1     1      554            558        -4      740
## 7   2013     1     1      555            600        -5      913
## 8   2013     1     1      557            600        -3      709
## 9   2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 328,511 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

On Your Own: RStudio Practice

Using the filter() function and chaining:

Choose only rows associated with

United Airlines (UA)

American Airlines (AA)

Your end result should be:

## # A tibble: 91,394 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      533            529         4      850
## 3   2013     1     1      542            540         2      923
## 4   2013     1     1      554            558        -4      740
## 5   2013     1     1      558            600        -2      753
## 6   2013     1     1      558            600        -2      924
## 7   2013     1     1      558            600        -2      923
## 8   2013     1     1      559            600        -1      941
## 9   2013     1     1      559            600        -1      854
## 10  2013     1     1      606            610        -4      858
## # ... with 91,384 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>