Choosing Columns


We have learned how to select the rows that we are interested in. The next logical step would be to select the columns we want as well. Many times we have so many columns that we are no interested in for a particular analysis. Instead of slowing down your analysis by continuing to run through extra data, we could just select the columns we care about.

Enter the select() Function

The select() function chooses columns that we specify. Again we can do this with base functions or with dplyr. We feel that as you continue on with your R usage that you will most likely want to go the route of dplyr functions instead.

Let’s say that we want to look at the flights data but we are really only interested in the arrival time, departure time and the particular flight number. This seems reasonable if we are a customer and wanted to only know these pieces of information. We could do this with indexing which we learned in dataframes:

flights[, c("dep_time", "arr_time", "flight")]
## # A tibble: 336,776 × 3
##    dep_time arr_time flight
##       <int>    <int>  <int>
## 1       517      830   1545
## 2       533      850   1714
## 3       542      923   1141
## 4       544     1004    725
## 5       554      812    461
## 6       554      740   1696
## 7       555      913    507
## 8       557      709   5708
## 9       557      838     79
## 10      558      753    301
## # ... with 336,766 more rows

We have indexed based on columns and picked the three columns which we wish to have information on. We could do this with piping in dplyr instead though:

select(.data, ...)

where

.data is a tibble.

... are the columns that you wish to have in bare (no quotations)

We could then do the following

flights %>%
  filter(dep_time, arr_time, flight)
## # A tibble: 328,063 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      533            529         4      850
## 3   2013     1     1      542            540         2      923
## 4   2013     1     1      544            545        -1     1004
## 5   2013     1     1      554            600        -6      812
## 6   2013     1     1      554            558        -4      740
## 7   2013     1     1      555            600        -5      913
## 8   2013     1     1      557            600        -3      709
## 9   2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 328,053 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Further Uses of select() Function

Removing Columns

We may wish to pick certain columns that we wish to have but we also may want to remove certain columns. It is quite common to de-identify a dataset before actually distributing it to a research team. The select()function will also remove columns. Lets say that we wished to remove the month and day of the flights:

flights %>%
  select(-month,-day)
## # A tibble: 336,776 × 17
##     year dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int>    <int>          <int>     <dbl>    <int>          <int>
## 1   2013      517            515         2      830            819
## 2   2013      533            529         4      850            830
## 3   2013      542            540         2      923            850
## 4   2013      544            545        -1     1004           1022
## 5   2013      554            600        -6      812            837
## 6   2013      554            558        -4      740            728
## 7   2013      555            600        -5      913            854
## 8   2013      557            600        -3      709            723
## 9   2013      557            600        -3      838            846
## 10  2013      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

We also could use a vector for this:

cols <- c("month", "day")
flights %>%
  select(-one_of(cols))

Note: one_of is a helper function, it basically suggest that you take elements from a character vector.

We can also remove columns that contain a certain phrase in the name. If we were interested in removing any columns that had to do with time we could search for the word “time” in the data and remove them:

flights %>%
  select(-contains("time"))
## # A tibble: 336,776 × 13
##     year month   day dep_delay arr_delay carrier flight tailnum origin
##    <int> <int> <int>     <dbl>     <dbl>   <chr>  <int>   <chr>  <chr>
## 1   2013     1     1         2        11      UA   1545  N14228    EWR
## 2   2013     1     1         4        20      UA   1714  N24211    LGA
## 3   2013     1     1         2        33      AA   1141  N619AA    JFK
## 4   2013     1     1        -1       -18      B6    725  N804JB    JFK
## 5   2013     1     1        -6       -25      DL    461  N668DN    LGA
## 6   2013     1     1        -4        12      UA   1696  N39463    EWR
## 7   2013     1     1        -5        19      B6    507  N516JB    EWR
## 8   2013     1     1        -3       -14      EV   5708  N829AS    LGA
## 9   2013     1     1        -3        -8      B6     79  N593JB    JFK
## 10  2013     1     1        -2         8      AA    301  N3ALAA    LGA
## # ... with 336,766 more rows, and 4 more variables: dest <chr>,
## #   distance <dbl>, hour <dbl>, minute <dbl>

Note: contains() is another helper function, it suggests that whatever inside the function is a literal string. In other words it needs to match a name exactly.

Renaming Columns

Many times we may wish to rename a column so that it makes more sense to us. The select() function can rename things for us as well. For example, there is a variable called tailnum in the flights data. This actually refers to the tailnumber of the plane. We could rename this to be:

flights %>%
  select(tail = tailnum)
## # A tibble: 336,776 × 1
##      tail
##     <chr>
## 1  N14228
## 2  N24211
## 3  N619AA
## 4  N804JB
## 5  N668DN
## 6  N39463
## 7  N516JB
## 8  N829AS
## 9  N593JB
## 10 N3ALAA
## # ... with 336,766 more rows
flights %>% select(tail = tailnum)
## # A tibble: 336,776 × 1
##      tail
##     <chr>
## 1  N14228
## 2  N24211
## 3  N619AA
## 4  N804JB
## 5  N668DN
## 6  N39463
## 7  N516JB
## 8  N829AS
## 9  N593JB
## 10 N3ALAA
## # ... with 336,766 more rows

Note: We only kept the column of data that we renamed.

Unique Observations

Many times we have a lot of repeats in our data. If we just would like to have an account of all things included then we can use the unique() command. Lets assume that we wish to know the origin of a flight and ist destination. We do not want to have every flight listed over and over again so we ask for unique values:

flights %>% 
  select(origin, dest) %>% 
  unique()
## # A tibble: 224 × 2
##    origin  dest
##     <chr> <chr>
## 1     EWR   IAH
## 2     LGA   IAH
## 3     JFK   MIA
## 4     JFK   BQN
## 5     LGA   ATL
## 6     EWR   ORD
## 7     EWR   FLL
## 8     LGA   IAD
## 9     JFK   MCO
## 10    LGA   ORD
## # ... with 214 more rows

On Your Own: RStudio Practice

Consider the flights data: flights.

1. Select all but the year column.

2. Remove the month and day from them.

3. Select values which contain “time” in them.

4. Chain these together so that you run a command and it does all of these things.

Your answer should look like:

## # A tibble: 336,776 × 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time
##       <int>          <int>    <int>          <int>    <dbl>
## 1       517            515      830            819      227
## 2       533            529      850            830      227
## 3       542            540      923            850      160
## 4       544            545     1004           1022      183
## 5       554            600      812            837      116
## 6       554            558      740            728      150
## 7       555            600      913            854      158
## 8       557            600      709            723       53
## 9       557            600      838            846      140
## 10      558            600      753            745      138
## # ... with 336,766 more rows, and 1 more variables: time_hour <dttm>