2 Working with data - Part 1

In this chapter we will learn how to manipulate and summarise data using the dplyr package (with a little help from the tidyr package too).

Learning Objectives

At the end of this chapter, learners should be able to:

Use the pipe (%>%) to chain multiple functions together
Design chains of dplyr functions to manipulate data frames
Understand how to identify and handle missing values in a data frame

Both dplyr and tidyr are contained within the tidyverse (along with readr) so we can load all of these packages at once using library(tidyverse):

# don't forget to load tidyverse!
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

2.1 Chaining functions together with pipes

Pipes are a powerful feature of the tidyverse that allow you to chain multiple functions together. Pipes are useful because they allow you to break down complex operations into smaller steps that are easier to read and understand.

For example, take the following code:

my_vector <- c(1, 2, 3, 4, 5)
as.character(round(mean(my_vector)))

[1] "3"

What do you think this code does? It calculates the mean of my_vector, rounds the result to the nearest whole number, and then converts the result to a character. But the code is a bit hard to read because you have to start from the inside of the brackets and work your way out.

Instead, we can use the pipe operator (%>%) to chain these functions together in a more readable way:

my_vector <- 1:5
my_vector %>% mean() %>% round() %>% as.character()

[1] "3"

See how the code reads naturally from left to right? You can think of the pipe as being like the phrase “and then”. Here, we’re telling R: “Take my_vector, and then calculate the mean, and then round the result, and then convert it to a character.”

You’ll notice that we didn’t need to specify the input to each function. That’s because the pipe automatically passes the output of the previous function as the first input to the next function. We can still specify additional arguments to each function if we need to. For example, if we wanted to round the mean to 2 decimal places, we could do this:

my_vector %>% mean() %>% round(digits = 2) %>% as.character()

[1] "3"

R is clever enough to know that the first argument to round() is still the output of mean(), even though we’ve now specified the digits argument.

Plenty of pipes

There is another style of pipe in R, called the ‘base R pipe’ |>, which is available in R version 4.1.0 and later. The base R pipe works in a similar way to the magrittr pipe (%>%) that we use in this course, but it is not as flexible. We recommend using the magrittr pipe for now.

Fun fact: the magrittr package is named after the artist René Magritte, who made a famous painting of a pipe.

To type the pipe operator more easily, you can use the keyboard shortcut Cmd-shift-M (although once you get used to it, you might find it easier to type %>% manually).

Practice exercises

Try these practice questions to test your understanding

1. What is NOT a valid way to re-write the following code using the pipe operator: round(sqrt(sum(1:10)), 1). If you’re not sure, try running the different options in the console to see which one gives the same answer.

✗1:10 %>% sum() %>% sqrt() %>% round(1)

✔sum(1:10) %>% sqrt(1) %>% round()

✗1:10 %>% sum() %>% sqrt() %>% round(digits = 1)

✗sum(1:10) %>% sqrt() %>% round(digits = 1)

2. What is the output of the following code? letters %>% head() %>% toupper() Try to guess it before copy-pasting into R.

✗"A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

✗"a" "b" "c" "d" "e" "f"

✗An error

✔"A" "B" "C" "D" "E" "F"

Solutions

The invalid option is sum(1:10) %>% sqrt(1) %>% round(). This is because the sqrt() function only takes one argument, so you can’t specify 1 as an argument in addition to what is being piped in from sum(1:10). Note that some options used the pipe to send 1:10 to sum() (like 1:10 %>% sum()), and others just used sum(1:10) directly. Both are valid ways to use the pipe, it’s just a matter of personal preference.
The output of the code letters %>% head() %>% toupper() is "A" "B" "C" "D" "E" "F". The letters vector contains the lowercase alphabet, and the head() function returns the first 6 elements of the vector. Finally, the toupper() function then converts these elements to uppercase.

2.2 Basic data manipulation

To really see the power of the pipe, we will use it together with the dplyr package that provides a set of functions to easily filter, sort, select, modify and summarise data frames. These functions are designed to work well with the pipe, so you can chain them together to create complex data manipulations in a readable format.

For example, even though we haven’t covered the dplyr functions yet, you can probably guess what the following code does:

# use the pipe to chain together our data manipulation steps
m_dose %>%
  filter(cage_number == "3E") %>%
  drop_na(weight_lost_g) %>% 
  pull(weight_lost_g) %>%
  mean()

This code filters the m_dose data frame to only include data from cage 3E, then pulls out the weight_lost_g column, and finally calculates the mean of the values in that column. The first argument to each function is the output of the previous function, and any additional arguments (like the column name in pull()) are specified in the brackets (like round(digits = 2) from the previous example).

We also used the enter key after each pipe %>% to break up the code into multiple lines to make it easier to read. This isn’t required, but is a popular style in the R community, so all the code examples in this chapter will follow this format.

We will now introduce some of the most commonly used dplyr functions for manipulating data frames. To showcase these, we will use the m_dose that we practiced reading in last chapter. This imaginary dataset contains information on the weight lost by different strains of mice after being treated with different doses of MouseZempic®.

# read in the data, like we did in chapter 1
m_dose <- read_delim("data/mousezempic_dosage_data.csv")

Rows: 344 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): mouse_strain, cage_number, replicate, sex
dbl (5): weight_lost_g, drug_dose_g, tail_length_mm, initial_weight_g, id_num

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Before we start, let’s use what we learned in the previous chapter to take a look at m_dose:

# it's a tibble, so prints nicely
m_dose

# A tibble: 344 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.75 rep1      male       0.00181
 2 CD-1         1A                   3.8  rep1      female     0.00186
 3 CD-1         1A                   3.25 rep1      female     0.00195
 4 CD-1         1A                  NA    rep1      <NA>      NA      
 5 CD-1         1A                   3.45 rep1      female     0.00193
 6 CD-1         1A                   3.65 rep1      male       0.0019 
 7 CD-1         1A                   3.62 rep1      female     0.00181
 8 CD-1         1A                   4.68 rep1      male       0.00195
 9 CD-1         1A                   3.48 rep1      <NA>       0.00193
10 CD-1         1A                   4.25 rep1      <NA>       0.0019 
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

You might also like to use View() to open the data in a separate window and get a closer look.

Using RStudio autocomplete

Although it’s great to give our data a descriptive name like m_dose, it can be a bit of a pain to type out every time. Luckily, RStudio has a handy autocomplete feature that can solve this problem. Just start typing the name of the object, and you’ll see it will popup:

You can then press Tab to autocomplete it. If there are multiple objects that start with the same letters, you can use the arrow keys to cycle through the options.

Try using autocomplete in this chapter to save yourself some typing!

2.2.1 Sorting data

Often, one of the first things you might want to do with a dataset is sort it. In dplyr, this is called ‘arranging’ and is done with the arrange() function.

Arrange orders rows by their values in one or more columns

By default, arrange() sorts in ascending order (smallest values first). For example, let’s sort the m_dose data frame by the weight_lost_g column:

m_dose %>%
  arrange(weight_lost_g)

# A tibble: 344 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 BALB C       2B                   2.7  rep2      female     0.00192
 2 CD-1         3E                   2.85 rep2      female     0.00181
 3 CD-1         3E                   2.85 rep2      female     0.00184
 4 CD-1         3E                   2.9  rep2      female     0.00187
 5 CD-1         2B                   2.9  rep2      female     0.00178
 6 CD-1         1A                   2.9  rep3      female     0.00188
 7 BALB C       2B                   2.9  rep1      female     0.00187
 8 CD-1         3E                   2.92 rep3      female     0.00193
 9 CD-1         2B                   2.98 rep1      <NA>       0.00179
10 CD-1         2B                   3    rep1      female     0.00185
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

If we compare this to when we just printed our data above, we can see that the rows are now sorted so that the mice that lost the least weight are at the top.

Sometimes you might want to sort in descending order instead (largest values first). You can do this by putting the desc() function around your column name, inside arrange():

m_dose %>%
  # put desc() around the column name to sort in descending order
  arrange(desc(weight_lost_g))

# A tibble: 344 × 9
   mouse_strain cage_number weight_lost_g replicate sex   drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>       <dbl>
 1 Black 6      3E                   6.3  rep1      male      0.00221
 2 Black 6      3E                   6.05 rep1      male      0.0023 
 3 Black 6      3E                   6    rep2      male      0.0022 
 4 Black 6      3E                   6    rep3      male      0.00222
 5 Black 6      3E                   5.95 rep2      male      0.00223
 6 Black 6      3E                   5.95 rep3      male      0.00229
 7 Black 6      3E                   5.85 rep1      male      0.00213
 8 Black 6      3E                   5.85 rep1      male      0.00217
 9 Black 6      3E                   5.85 rep3      male      0.0023 
10 Black 6      3E                   5.8  rep2      male      0.00229
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

Now we can see the mice that lost the most weight are at the top.

Comments and pipes

Notice how in the previous example we have written a comment in the middle of the pipe chain. This is a good practice to help you remember what each step is doing, especially when you have a long chain of functions, and won’t cause any errors as long as you make sure that the comment is on its own line.

You can also write comments at the end of the line, just make sure it’s after the pipe operator %>%.

For example, these comments are allowed:

m_dose %>% # a comment here is fine
  # a comment here is fine
  arrange(desc(weight_lost_g))

# A tibble: 344 × 9
   mouse_strain cage_number weight_lost_g replicate sex   drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>       <dbl>
 1 Black 6      3E                   6.3  rep1      male      0.00221
 2 Black 6      3E                   6.05 rep1      male      0.0023 
 3 Black 6      3E                   6    rep2      male      0.0022 
 4 Black 6      3E                   6    rep3      male      0.00222
 5 Black 6      3E                   5.95 rep2      male      0.00223
 6 Black 6      3E                   5.95 rep3      male      0.00229
 7 Black 6      3E                   5.85 rep1      male      0.00213
 8 Black 6      3E                   5.85 rep1      male      0.00217
 9 Black 6      3E                   5.85 rep3      male      0.0023 
10 Black 6      3E                   5.8  rep2      male      0.00229
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

But this will cause an error, because the # is before the pipe, so R treats it as part of the comment (notice how the %>% has changed colour?) and doesn’t know how the two lines relate to each other. It tries to run them separately, which for the first line is ok (it will just print m_dose):

m_dose # this comment will cause an error %>%

# A tibble: 344 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.75 rep1      male       0.00181
 2 CD-1         1A                   3.8  rep1      female     0.00186
 3 CD-1         1A                   3.25 rep1      female     0.00195
 4 CD-1         1A                  NA    rep1      <NA>      NA      
 5 CD-1         1A                   3.45 rep1      female     0.00193
 6 CD-1         1A                   3.65 rep1      male       0.0019 
 7 CD-1         1A                   3.62 rep1      female     0.00181
 8 CD-1         1A                   4.68 rep1      male       0.00195
 9 CD-1         1A                   3.48 rep1      <NA>       0.00193
10 CD-1         1A                   4.25 rep1      <NA>       0.0019 
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

  arrange(desc(weight_lost_g))

Error: object 'weight_lost_g' not found

But for the second line, there is an error that R doesn’t know what the weight_lost_g object is. That’s because it’s a column in the m_dose data frame, so R only knows what it is in the context of the pipe chain containing that data frame.

You can also sort by multiple columns by passing multiple column names to arrange(). For example, to sort by the strain first and then by the amount of weight lost:

# sort by strain first, then by weight lost
m_dose %>%
  arrange(mouse_strain, weight_lost_g)

# A tibble: 344 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 BALB C       2B                   2.7  rep2      female     0.00192
 2 BALB C       2B                   2.9  rep1      female     0.00187
 3 BALB C       2B                   3.2  rep2      female     0.00187
 4 BALB C       2B                   3.25 rep1      female     0.00178
 5 BALB C       2B                   3.25 rep3      male       0.00187
 6 BALB C       2B                   3.25 rep3      female     0.00191
 7 BALB C       2B                   3.3  rep1      male       0.00197
 8 BALB C       2B                   3.3  rep1      female     0.00195
 9 BALB C       2B                   3.32 rep3      female     0.00199
10 BALB C       2B                   3.35 rep2      female     0.00187
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

This will sort the data frame by strain (according to alphabetical order, as it is a character column), and within each strain, they are then sorted by the amount of weight lost.

Piping into View()

In the above example, we sorted the data by strain and then by weight lost, but because there are so many mice in each strain, the preview shown in our console doesn’t allow us to see the full effect of the sorting.

One handy trick you can use with pipes is to add View() at the end of your chain to open the data in a separate window. Try running this code, and you’ll be able to scroll through the full dataset to check that the other mouse strains have also been sorted correctly:

# sort by strain first, then by weight lost
m_dose %>%
  arrange(mouse_strain, weight_lost_g) %>%
  View()

This is a great way to check that your code has actually done what you intended!

2.2.1.1 Extracting rows with the smallest or largest values

Slice functions are used to select rows based on their position in the data frame. The slice_min() and slice_max() functions are particularly useful, because they allow you to select the rows with the smallest or largest values in a particular column.

This is equivalent to using arrange() followed by head(), but is more concise:

# get the 10 mice with the lowest drug dose
m_dose %>%
  # slice_min() requires the column to sort by, and n = the number of rows to keep
  slice_min(drug_dose_g, n = 10)

# A tibble: 13 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         3E                   3.15 rep1      female     0.00172
 2 CD-1         3E                   3.4  rep1      female     0.00174
 3 CD-1         1A                   3.45 rep3      female     0.00176
 4 CD-1         2B                   3.25 rep1      female     0.00178
 5 CD-1         2B                   3.9  rep1      male       0.00178
 6 CD-1         2B                   2.9  rep2      female     0.00178
 7 BALB C       2B                   3.25 rep1      female     0.00178
 8 CD-1         2B                   2.98 rep1      <NA>       0.00179
 9 CD-1         1A                   3.7  rep1      <NA>       0.0018 
10 CD-1         3E                   3.6  rep1      male       0.0018 
11 CD-1         3E                   3.8  rep1      male       0.0018 
12 CD-1         3E                   3.95 rep1      male       0.0018 
13 CD-1         2B                   3.55 rep1      female     0.0018 
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

# get the top 5 mice that lost the most weight
m_dose %>%
  # slice_max() has the same arguments as slice_min()
  slice_max(weight_lost_g, n = 5)

# A tibble: 6 × 9
  mouse_strain cage_number weight_lost_g replicate sex   drug_dose_g
  <chr>        <chr>               <dbl> <chr>     <chr>       <dbl>
1 Black 6      3E                   6.3  rep1      male      0.00221
2 Black 6      3E                   6.05 rep1      male      0.0023 
3 Black 6      3E                   6    rep2      male      0.0022 
4 Black 6      3E                   6    rep3      male      0.00222
5 Black 6      3E                   5.95 rep2      male      0.00223
6 Black 6      3E                   5.95 rep3      male      0.00229
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

But wait— neither of those pieces of code actually gave the number of rows we asked for! In the first example, we asked for the 10 mice with the lowest drug dose, but we got 13. And in the second example, we asked for the top 5 mice that lost the most weight, but we got 6. Why aren’t the slice_ functions behaving as expected?

If we take a look at the help page (type ?slice_min in the console), we learn that slice_min() and slice_max() have an argument called with_ties that is set to TRUE by default. If we want to make sure we only get the number of rows we asked for, we would have to set it to FALSE, like so:

# get the top 5 mice that lost the most weight
m_dose %>%
  # no ties allowed!
  slice_max(weight_lost_g, n = 5, with_ties = FALSE)

# A tibble: 5 × 9
  mouse_strain cage_number weight_lost_g replicate sex   drug_dose_g
  <chr>        <chr>               <dbl> <chr>     <chr>       <dbl>
1 Black 6      3E                   6.3  rep1      male      0.00221
2 Black 6      3E                   6.05 rep1      male      0.0023 
3 Black 6      3E                   6    rep2      male      0.0022 
4 Black 6      3E                   6    rep3      male      0.00222
5 Black 6      3E                   5.95 rep2      male      0.00223
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

This is an important lesson: sometimes functions will behave in a way that is unexpected, and you might need to read their help page or use other guides/google/AI to understand why.

Practice exercises

Try these practice questions to test your understanding

1. Which code would you use to sort the m_dose data frame from biggest to smallest initial weight?

✗m_dose %>% sort(initial_weight_g)

✗m_dose %>% arrange(initial_weight_g)

✗m_dose %>% sort(descending(initial_weight_g))

✔m_dose %>% arrange(desc(initial_weight_g))

2. Which code would you use to extract the 3 mice with the highest initial weight from the m_dose data frame?

✔m_dose %>% slice_max(initial_weight_g, n = 3)

✗m_dose %>% arrange(desc(initial_weight_g))

✗m_dose %>% slice_min(initial_weight_g, n = 3)

✗m_dose %>% arrange(initial_weight_g)

3. I’ve written the below code, but one of the comments is messing it up! Which one?

# comment A
m_dose # comment B %>%
  # comment C
  slice_max(weight_lost_g, n = 5, with_ties = FALSE) # comment D

✗Comment A

✔Comment B

✗Comment C

✗Comment D

Solutions

The correct code to sort the m_dose data frame from biggest to smallest initial weight is m_dose %>% arrange(desc(initial_weight_g)). The arrange() function is used to sort the data frame (although there is a sort() function in R, that’s not part of dplyr and won’t work the same way), and the desc() function is used to sort in descending order.
The correct code to extract the 3 mice with the highest initial weight from the m_dose data frame is m_dose %>% slice_max(initial_weight_g, n = 3). The slice_max() function is used to select the rows with the largest values in the initial_weight_g column, and the n = 3 argument specifies that we want to keep 3 rows. The arrange() function is not needed in this case, because slice_max() will automatically sort the data frame by the specified column.
The comment that is messing up the code is Comment B. The # symbol is before the pipe operator %>%, so R treats it as part of the comment and this breaks our chain of pipes. The other comments are fine, because they are either at the end of the line or on their own line. Basically, if a comment is changing the colour of the pipe operator (or any other bits of your code), it’s in the wrong place!

2.2.2 Filtering data (rows)

Filter allows you to filter rows using a logical test

In dplyr, the filter() function is used to subset rows based on their values. You provide a logical test, and filter() will keep the rows where the test is TRUE. We can write these tests using the comparison operators we learned in the previous chapter (e.g. ==, < and !=, see Section 1.3).

For example, to filter the m_dose data frame to only include mice that lost more than 6g:

m_dose %>%
  filter(weight_lost_g > 6)

# A tibble: 2 × 9
  mouse_strain cage_number weight_lost_g replicate sex   drug_dose_g
  <chr>        <chr>               <dbl> <chr>     <chr>       <dbl>
1 Black 6      3E                   6.3  rep1      male      0.00221
2 Black 6      3E                   6.05 rep1      male      0.0023 
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

Or to only include mice from cage 3E:

m_dose %>%
  # remember that == is used for testing equality
  filter(cage_number == "3E") # don't forget the quotes either!

# A tibble: 168 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         3E                   3.4  rep1      female     0.00174
 2 CD-1         3E                   3.6  rep1      male       0.0018 
 3 CD-1         3E                   3.8  rep1      female     0.00189
 4 CD-1         3E                   3.95 rep1      male       0.00185
 5 CD-1         3E                   3.8  rep1      male       0.0018 
 6 CD-1         3E                   3.8  rep1      female     0.00187
 7 CD-1         3E                   3.55 rep1      male       0.00183
 8 CD-1         3E                   3.2  rep1      female     0.00187
 9 CD-1         3E                   3.15 rep1      female     0.00172
10 CD-1         3E                   3.95 rep1      male       0.0018 
# ℹ 158 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

2.2.2.1 Combining logical tests

Sometimes we want to filter based on multiple conditions. Here we will show some more advanced operators that can be used to combine logical tests.

The & operator is used to combine two logical tests with an ‘and’ condition. For example, to filter the data frame to only include mice that have a tail length greater than 19mm and are female:

m_dose %>%
  filter(tail_length_mm > 19 & sex == "female")

# A tibble: 5 × 9
  mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
  <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
1 CD-1         1A                   3.45 rep1      female     0.00193
2 CD-1         3E                   3.8  rep1      female     0.00189
3 CD-1         2B                   3.3  rep1      female     0.00181
4 CD-1         3E                   3.9  rep3      female     0.00191
5 BALB C       2B                   3.52 rep3      female     0.00194
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

The | operator is used to combine two logical tests with an ‘or’ condition. For example, to filter the data frame to only include mice that have an initial weight less than 35g or a tail length less than 14mm:

m_dose %>%
  filter(initial_weight_g < 35 | tail_length_mm < 14)

# A tibble: 30 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.48 rep1      <NA>       0.00193
 2 CD-1         1A                   4.4  rep1      male       0.00198
 3 CD-1         1A                   3.32 rep1      female     0.00184
 4 CD-1         3E                   2.9  rep2      female     0.00187
 5 CD-1         1A                   3.6  rep2      female     0.0019 
 6 CD-1         1A                   3.2  rep2      female     0.00189
 7 CD-1         2B                   3.4  rep2      female     0.00185
 8 CD-1         2B                   2.9  rep2      female     0.00178
 9 CD-1         2B                   3.05 rep3      female     0.00188
10 Black 6      3E                   4.5  rep1      female     0.00211
# ℹ 20 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

The %in% operator can be used to filter based on a vector of multiple values (c(x, y)). It’s particularly useful when you have a few character values you want to filter on, as it is shorter to type than | (or).

For example, to filter the data frame to only include mice from cages 3E or 1A, we could use | like this:

m_dose %>%
  filter(cage_number == "3E" | cage_number == "1A")

# A tibble: 220 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.75 rep1      male       0.00181
 2 CD-1         1A                   3.8  rep1      female     0.00186
 3 CD-1         1A                   3.25 rep1      female     0.00195
 4 CD-1         1A                  NA    rep1      <NA>      NA      
 5 CD-1         1A                   3.45 rep1      female     0.00193
 6 CD-1         1A                   3.65 rep1      male       0.0019 
 7 CD-1         1A                   3.62 rep1      female     0.00181
 8 CD-1         1A                   4.68 rep1      male       0.00195
 9 CD-1         1A                   3.48 rep1      <NA>       0.00193
10 CD-1         1A                   4.25 rep1      <NA>       0.0019 
# ℹ 210 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

Or we could use %in% like this:

m_dose %>%
  filter(cage_number %in% c("3E", "1A"))

# A tibble: 220 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.75 rep1      male       0.00181
 2 CD-1         1A                   3.8  rep1      female     0.00186
 3 CD-1         1A                   3.25 rep1      female     0.00195
 4 CD-1         1A                  NA    rep1      <NA>      NA      
 5 CD-1         1A                   3.45 rep1      female     0.00193
 6 CD-1         1A                   3.65 rep1      male       0.0019 
 7 CD-1         1A                   3.62 rep1      female     0.00181
 8 CD-1         1A                   4.68 rep1      male       0.00195
 9 CD-1         1A                   3.48 rep1      <NA>       0.00193
10 CD-1         1A                   4.25 rep1      <NA>       0.0019 
# ℹ 210 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

Practice exercises

Try these practice questions to test your understanding

1. Which code would you use to filter the m_dose data frame to only include mice from replicate 2?

✗m_dose %>% filter(replicate == 2)

✗m_dose %>% filter(replicate == rep2)

✔m_dose %>% filter(replicate == "rep2")

✗m_dose %>% filter(replicate = "rep2")

2. What is NOT a valid way to filter the m_dose data frame to only include mice that lost more than 4g, and have an initial weight less than 40g?

✗m_dose %>% filter(weight_lost_g > 4) %>% filter(initial_weight_g < 40)

✔m_dose %>% filter(weight_lost_g > 4) %>% (initial_weight_g < 40)

✗m_dose %>% filter(weight_lost_g > 4 & initial_weight_g < 40)

✗m_dose %>% filter(initial_weight_g < 40) %>% filter(weight_lost_g > 4)

3. Which option correctly describes what the following code is doing?

m_dose %>%
  filter(mouse_strain %in% c("BALB C", "Black 6")) %>%
  filter(weight_lost_g > 3 & weight_lost_g < 5) %>%
  arrange(desc(drug_dose_g))

✗Filters the data frame to remove mice from the “BALB C” and “Black 6” strains, who only lost between 3 and 5g of weight, and then sorts the data frame by drug dose.

✗Filters the data frame to remove mice from the “BALB C” and “Black 6” strains, that lost between 3 and 5g of weight, and then sorts the data frame by drug dose in descending order.

✗Filters the data frame to only include mice from the “BALB C” and “Black 6” strains, that lost between 3 and 5g of weight, and then sorts the data frame by drug dose.

✔Filters the data frame to only include mice from the “BALB C” and “Black 6” strains, that lost between 3 and 5g of weight, and then sorts the data frame by drug dose in descending order.

Solutions

The correct code to filter the m_dose data frame to only include mice from replicate 2 is m_dose %>% filter(replicate == "rep2"). Option A is incorrect because 2 is not a value of replicate (when filtering you need to know what values are actually in your columns! So make sure to View() your data first). Option B is incorrect because the replicate column is a character column, so you need to use quotes around the value you are filtering on. Option D is incorrect because = is not the correct way to test for equality, you need to use ==.
The invalid option is m_dose %>% filter(weight_lost_g > 4) %>% (initial_weight_g < 40). This is because the second filtering step is missing the name of the filter function, so R doesn’t know what to do with (initial_weight_g < 40). The other options are valid ways to filter the data frame based on the specified conditions; note that we can use multiple filter() functions in a row to apply multiple conditions, or the & operator to combine them into a single filter() function. It’s just a matter of personal preference.
The correct description of the code is that it filters the data frame to only include mice from the “BALB C” and “Black 6” strains, then filters those further to only those that lost between 3 and 5g of weight, and finally sorts the data frame by drug dose in descending order.

2.2.3 Dealing with missing values

Missing values are a common problem in real-world datasets. In R, missing values are represented by NA. In fact, if you look at the m_dose data frame we’ve been using, you’ll see that some of the cells contain NA: try spotting them with the View() function.

You can also find missing values in a data frame using the is.na() function in combination with filter(). For example, to find all the rows in the m_dose data frame that have a missing value for the drug_dose_g column:

m_dose %>%
  filter(is.na(drug_dose_g))

# A tibble: 2 × 9
  mouse_strain cage_number weight_lost_g replicate sex   drug_dose_g
  <chr>        <chr>               <dbl> <chr>     <chr>       <dbl>
1 CD-1         1A                     NA rep1      <NA>           NA
2 Black 6      3E                     NA rep3      <NA>           NA
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

The problem with missing values is that they can cause problems when you try to perform calculations on your data. For example, if you try to calculate the mean of a column that contains even a single missing value, the result will also be NA:

# try to calculate the mean of the drug_dose_g column
# remember from chapter 1 that we can use $ to access columns in a data frame
m_dose$drug_dose_g %>% mean()

[1] NA

NA values in R are therefore referred to as ‘contagious’: if you put an NA in you usually get an NA out. If you think about it, that makes sense— when we don’t know the value of a particular mouse’s drug dose, how can we calculate the average? That missing value could be anything.

For this reason, it’s important to deal with missing values before performing calculations. Many functions in R will have an argument called na.rm that you can set to TRUE to remove missing values before performing the calculation. For example, to calculate the mean of the drug_dose_g column with the missing values excluded:

# try to calculate the mean of the drug_dose_g column
# remember from chapter 1 that we can use $ to access columns in a data frame
m_dose$drug_dose_g %>% mean(na.rm = TRUE)

[1] 0.002009152

This time, the result is a number, because the missing values have been removed before the calculation.

But not all functions have an na.rm argument. In these cases, you can remove rows with missing values. This can be done for a single column, using the filter() function together with is.na():

# remove rows with missing values in the drug_dose_g column
m_dose %>%
  # remember the ! means 'not', it negates the result of is.na()
  filter(!is.na(drug_dose_g))

# A tibble: 342 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.75 rep1      male       0.00181
 2 CD-1         1A                   3.8  rep1      female     0.00186
 3 CD-1         1A                   3.25 rep1      female     0.00195
 4 CD-1         1A                   3.45 rep1      female     0.00193
 5 CD-1         1A                   3.65 rep1      male       0.0019 
 6 CD-1         1A                   3.62 rep1      female     0.00181
 7 CD-1         1A                   4.68 rep1      male       0.00195
 8 CD-1         1A                   3.48 rep1      <NA>       0.00193
 9 CD-1         1A                   4.25 rep1      <NA>       0.0019 
10 CD-1         1A                   3.3  rep1      <NA>       0.00186
# ℹ 332 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

Or, you can remove rows with missing values in any column using the na.omit() or drop_na() function:

# remove rows with missing values in any column
m_dose %>%
  na.omit()

# A tibble: 333 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.75 rep1      male       0.00181
 2 CD-1         1A                   3.8  rep1      female     0.00186
 3 CD-1         1A                   3.25 rep1      female     0.00195
 4 CD-1         1A                   3.45 rep1      female     0.00193
 5 CD-1         1A                   3.65 rep1      male       0.0019 
 6 CD-1         1A                   3.62 rep1      female     0.00181
 7 CD-1         1A                   4.68 rep1      male       0.00195
 8 CD-1         1A                   3.2  rep1      female     0.00182
 9 CD-1         1A                   3.8  rep1      male       0.00191
10 CD-1         1A                   4.4  rep1      male       0.00198
# ℹ 323 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

m_dose %>%
  drop_na()

# A tibble: 333 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.75 rep1      male       0.00181
 2 CD-1         1A                   3.8  rep1      female     0.00186
 3 CD-1         1A                   3.25 rep1      female     0.00195
 4 CD-1         1A                   3.45 rep1      female     0.00193
 5 CD-1         1A                   3.65 rep1      male       0.0019 
 6 CD-1         1A                   3.62 rep1      female     0.00181
 7 CD-1         1A                   4.68 rep1      male       0.00195
 8 CD-1         1A                   3.2  rep1      female     0.00182
 9 CD-1         1A                   3.8  rep1      male       0.00191
10 CD-1         1A                   4.4  rep1      male       0.00198
# ℹ 323 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

Sometimes, instead of removing rows with missing values, you might want to replace them with a specific value. This can be done using the replace_na() function from the tidyr package. replace_na() takes a list() which contains each of the column names you want to edit, and the value that should be used.

For example, to replace missing values in the weight_lost_g columns with 0, replace missing values in the sex column with ‘unknown’ and leave the rest of the data frame unchanged:

# replace missing values in the drug_dose_g column with 0
m_dose %>%
  # here we need to provide the column_names = values_to_replace
  # this needs to be contained within a list()
  replace_na(list(weight_lost_g = 0, sex = "unknown"))

# A tibble: 344 × 9
   mouse_strain cage_number weight_lost_g replicate sex     drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>         <dbl>
 1 CD-1         1A                   3.75 rep1      male        0.00181
 2 CD-1         1A                   3.8  rep1      female      0.00186
 3 CD-1         1A                   3.25 rep1      female      0.00195
 4 CD-1         1A                   0    rep1      unknown    NA      
 5 CD-1         1A                   3.45 rep1      female      0.00193
 6 CD-1         1A                   3.65 rep1      male        0.0019 
 7 CD-1         1A                   3.62 rep1      female      0.00181
 8 CD-1         1A                   4.68 rep1      male        0.00195
 9 CD-1         1A                   3.48 rep1      unknown     0.00193
10 CD-1         1A                   4.25 rep1      unknown     0.0019 
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

When deciding how to handle missing values, you might have prior knowledge that NA should be replaced with a specific value, or you might decide that removing rows with NA is the best approach for your analysis.

For example, maybe we knew that the mice were given a weight_lost_g of NA if they didn’t lose any weight, it would then make sense to replace those with 0 (as we did in the code above). However, if the drug_dose_g column was missing simply because the data was lost, we might choose to remove those rows entirely.

It’s important to think carefully about how missing values should be handled in your analysis.

2.2.4 Selecting columns

Select allows you to select only certain columns

While filter() is used to subset rows, select() is used to subset columns. You can use select() to keep only the columns you’re interested in, or to drop columns you don’t need.

The select() function takes the names of the columns that you want to keep/remove (no vector notation c() or quotation marks "" necessary). For example, to select only the mouse_strain, initial_weight_g, and weight_lost_g columns from the m_dose data frame:

m_dose %>%
  select(mouse_strain, initial_weight_g, weight_lost_g)

# A tibble: 344 × 3
   mouse_strain initial_weight_g weight_lost_g
   <chr>                   <dbl>         <dbl>
 1 CD-1                     39.1          3.75
 2 CD-1                     39.5          3.8 
 3 CD-1                     40.3          3.25
 4 CD-1                     NA           NA   
 5 CD-1                     36.7          3.45
 6 CD-1                     39.3          3.65
 7 CD-1                     38.9          3.62
 8 CD-1                     39.2          4.68
 9 CD-1                     34.1          3.48
10 CD-1                     42            4.25
# ℹ 334 more rows

We can see that all the other columns have been removed from the data frame.

If you want to keep all columns except for a few, you can use - to drop columns. For example, to keep all columns except for cage_number and sex:

m_dose %>%
  select(-cage_number, -sex)

# A tibble: 344 × 7
   mouse_strain weight_lost_g replicate drug_dose_g tail_length_mm
   <chr>                <dbl> <chr>           <dbl>          <dbl>
 1 CD-1                  3.75 rep1          0.00181           18.7
 2 CD-1                  3.8  rep1          0.00186           17.4
 3 CD-1                  3.25 rep1          0.00195           18  
 4 CD-1                 NA    rep1         NA                 NA  
 5 CD-1                  3.45 rep1          0.00193           19.3
 6 CD-1                  3.65 rep1          0.0019            20.6
 7 CD-1                  3.62 rep1          0.00181           17.8
 8 CD-1                  4.68 rep1          0.00195           19.6
 9 CD-1                  3.48 rep1          0.00193           18.1
10 CD-1                  4.25 rep1          0.0019            20.2
# ℹ 334 more rows
# ℹ 2 more variables: initial_weight_g <dbl>, id_num <dbl>

There are also some helper functions that can be used to select columns based on their names :

There are several helper functions that can be used with the select function
Function	Description	Example
`starts_with()`	select column(s) that start with a certain string	select all columns starting with the letter i `select(starts_with("i"))`
`ends_with()`	select column(s) that end with a certain string	select all columns ending with _g `select(ends_with("_g"))`
`contains()`	select column(s) that contain a certain string	select all columns containing the word ‘weight’ `select(contains("weight"))`

You need to use quotation marks around the arguments in these helper functions, as they aren’t full column names, just strings of characters.

Try using these helper functions to select columns from the m_dose data frame!

Reordering columns

Relocate allows you to move columns around

We can reorder columns using the relocate() function, which works similarly to select() (except it just moves columns around rather than dropping/keeping them). For example, to move the sex column to before the cage_number column:

m_dose %>%
  # first the name of the column to move, then where it should go
  relocate(sex, .before = cage_number)

# A tibble: 344 × 9
   mouse_strain sex    cage_number weight_lost_g replicate drug_dose_g
   <chr>        <chr>  <chr>               <dbl> <chr>           <dbl>
 1 CD-1         male   1A                   3.75 rep1          0.00181
 2 CD-1         female 1A                   3.8  rep1          0.00186
 3 CD-1         female 1A                   3.25 rep1          0.00195
 4 CD-1         <NA>   1A                  NA    rep1         NA      
 5 CD-1         female 1A                   3.45 rep1          0.00193
 6 CD-1         male   1A                   3.65 rep1          0.0019 
 7 CD-1         female 1A                   3.62 rep1          0.00181
 8 CD-1         male   1A                   4.68 rep1          0.00195
 9 CD-1         <NA>   1A                   3.48 rep1          0.00193
10 CD-1         <NA>   1A                   4.25 rep1          0.0019 
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

Without a specific position ( .before / .after), this function will place the chosen column(s) as the first / left-most columns.

Two further useful helper functions for relocate() are the everything() and last_col() functions, which can be used to move columns to the start/end of the data frame.

# move id_num to the front
m_dose %>%
  relocate(id_num, .before = everything()) # don't forget the brackets

# A tibble: 344 × 9
   id_num mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
    <dbl> <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1      1 CD-1         1A                   3.75 rep1      male       0.00181
 2      2 CD-1         1A                   3.8  rep1      female     0.00186
 3      3 CD-1         1A                   3.25 rep1      female     0.00195
 4      4 CD-1         1A                  NA    rep1      <NA>      NA      
 5      5 CD-1         1A                   3.45 rep1      female     0.00193
 6      6 CD-1         1A                   3.65 rep1      male       0.0019 
 7      7 CD-1         1A                   3.62 rep1      female     0.00181
 8      8 CD-1         1A                   4.68 rep1      male       0.00195
 9      9 CD-1         1A                   3.48 rep1      <NA>       0.00193
10     10 CD-1         1A                   4.25 rep1      <NA>       0.0019 
# ℹ 334 more rows
# ℹ 2 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>

# move mouse_strain to the end
m_dose %>%
  relocate(mouse_strain, .after = last_col())

# A tibble: 344 × 9
   cage_number weight_lost_g replicate sex    drug_dose_g tail_length_mm
   <chr>               <dbl> <chr>     <chr>        <dbl>          <dbl>
 1 1A                   3.75 rep1      male       0.00181           18.7
 2 1A                   3.8  rep1      female     0.00186           17.4
 3 1A                   3.25 rep1      female     0.00195           18  
 4 1A                  NA    rep1      <NA>      NA                 NA  
 5 1A                   3.45 rep1      female     0.00193           19.3
 6 1A                   3.65 rep1      male       0.0019            20.6
 7 1A                   3.62 rep1      female     0.00181           17.8
 8 1A                   4.68 rep1      male       0.00195           19.6
 9 1A                   3.48 rep1      <NA>       0.00193           18.1
10 1A                   4.25 rep1      <NA>       0.0019            20.2
# ℹ 334 more rows
# ℹ 3 more variables: initial_weight_g <dbl>, id_num <dbl>, mouse_strain <chr>

Re-ordering columns isn’t necessary, but it makes it easier to see the data you’re most interested in within the console (since often not all of the columns will fit on the screen at once). For example, if we are doing a lot of computation on the initial_weight_g column, we’d probably like to have that near the start so we can easily check it.

Note that the output of the select() function is a new data frame, even if you only select a single column:

# select the mouse_strain column
m_dose %>%
  select(mouse_strain) %>%
  # recall from chapter 1 that class() tells us the type of an object
  class()

[1] "tbl_df"     "tbl"        "data.frame"

Sometimes, we instead want to get the values of a column as a vector.

Pull allows you to pull a column out of a data frame as a vector

We can do this by using the pull() function, which extracts a single column from a data frame as a vector:

# get the mouse_strain column as a vector
m_dose %>%
  pull(mouse_strain) %>%
  class()

[1] "character"

We can see that the class of the output is now a vector, rather than a data frame. This is important because some functions only accept vectors, not data frames, like mean() for example:

# this will give an error
m_dose %>% select(initial_weight_g) %>% mean(na.rm = TRUE)

Warning in mean.default(., na.rm = TRUE): argument is not numeric or logical:
returning NA

[1] NA

# this will work
m_dose %>% pull(initial_weight_g) %>% mean(na.rm = TRUE)

[1] 43.92193

Note how both times we used na.rm = TRUE to remove missing values before calculating the mean.

You might remember that we used the $ operator in the previous chapter to extract a single column from a data frame, so why use pull() instead? The main reason is that pull() works within a chain of pipes, whereas $ doesn’t.

For example, let’s say we want to know the average initial weight of mice that lost at least 4g. We can do this by chaining filter() and pull() together:

m_dose %>%
  # filter to mice that lost at least 4g
  filter(weight_lost_g >= 4) %>%
  # get the initial_weight_g column as a vector
  pull(initial_weight_g) %>%
  # calculate mean, removing NA values
  mean(na.rm = TRUE)

[1] 46.48023

2.3 Summary

Here’s what we’ve covered in this chapter:

The pipe operator %>% and how we can use it to chain together multiple function calls, making our code more readable and easier to understand.
The basic dplyr verbs arrange(), filter() and select()

Why does data need to be tidy anyway?

In this chapter, we’ve been focusing on making our data ‘tidy’: that is, structured in a consistent way that makes it easy to work with. A nice visual illustration of tidy data and its importance can be found here.

2.3.1 Practice questions

What is the purpose of the pipe operator %>%? Keeping this in mind, re-write the following code to use the pipe.
1. round(mean(c(1, 2, 3, 4, 5)))
2. print(as.character(1 + 10))
What would be the result of evaluating the following expressions? You don’t need to know these off the top of your head, use R to help! (Hint: some expressions might give an error. Try to think about why)
1. m_dose %>% filter(weight_lost_g > 10)
2. m_dose %>% select(tail_length_mm, weight_lost_g)
3. m_dose %>% arrange(tail_length_mm)
4. m_dose %>% filter(initial_Weight_g > 10) %>% arrange(mouse_strain)
5. m_dose %>% relocate(mouse_strain, .after = cage_number)
6. m_dose %>% pull(weight_lost_g)
7. m_dose %>% filter(!is.na(weight_lost_g))
8. m_dose %>% replace_na(list(weight_lost_g = 0))
What is a missing value in R? What are two ways to deal with missing values in a data frame?

Solutions

The pipe operator %>% is used to chain together multiple function calls, passing the result of one function to the next. Here’s how you could re-write the code to use the pipe:
1. c(1, 2, 3, 4, 5) %>% mean() %>% round()
2. as.character(1 + 10) %>% print()
The result of evaluating the expressions would be:
1. A data frame containing only the rows where weight_lost_g is greater than 10.
2. A data frame containing only the tail_length_mm and weight_lost_g columns.
3. A data frame sorted by tail_length_mm, in ascending order.
4. An error because initial_Weight_g is not a column in the data frame.
5. A data frame with the mouse_strain column moved to be after the cage_number column.
6. A vector containing the values of the weight_lost_g column.
7. A data frame containing only the rows where weight_lost_g is not NA.
8. A data frame with missing values in the weight_lost_g column replaced with 0.
A missing value in R is represented by NA. Two ways to deal with missing values in a data frame are to remove them using filter(!is.na(column_name)) or to replace them with a specific value using replace_na(list(column_name = value)).