A common problem when reading data into R is having multiple header rows in the source excel file. Take a look at this messy excel file from sheet 2 in the
clippy.xlsx example file, which ships with readxl:
Problem: The column names are right, but the first row of data is actually not data- it is additional metadata. This row of metadata is also causing all the columns to import as character. In reality,
death is a datetime and
weight is numeric.
Solution: We can use the
read_excel() function to read in the same file twice. In Step 1, we’ll create a character vector of the column names only. In Step 2, we’ll read in the actual data and skip the multiple header rows at the top. When we do this, we lose the column names, so we use the character vector of column names we created in Step 1 instead.
In this step, we read in the first row only (by setting
n_max = 0), extract the names from that row (using the
names() function), and assign those to a character vector called
cnames. This object now contains the correct column names that we’ll need in Step 2.
Now we’ll read in all the rows except for the first two rows (using
skip = 2), which contained the variable names and variable descriptions, and set the column names to
cnames, which we created in Step 1. Now our column types are guessed correctly.
If you want a way to save that metadata without polluting your actual data, you can do a third read using
n_max = 1:
This vignette has been adapted from a post on Alison Hill’s blog.