I’m pleased to announce that readxl 1.0.0 is available on CRAN. readxl makes it easy to bring tabular data out of Excel and into R, for modern .xlsx files and the legacy .xls format. readxl does not have any tricky external dependencies, such as Java or Perl, and is easy to install and use on Mac, Windows, and Linux.

You can install it with:

As well as fixing many bugs, this release:

  • Allows you to target specific cells for reading, in a variety of ways
  • Adds two new column types: "logical" and "list", for data of disparate type
  • Is more resilient to the wondrous diversity in spreadsheets, e.g., those written by 3rd party tools

You can see a full list of changes in the release notes. This is the first release maintained by Jenny Bryan.

Specifying the data rectangle

In an ideal world, data would live in a neat rectangle in the upper left corner of a spreadsheet. But spreadsheets often serve multiple purposes for users with different priorities. It is common to encounter several rows of notes above or below the data, for example. The new range argument provides a flexible interface for describing the data rectangle, including Excel-style ranges and row- or column-only ranges.

library(readxl)
read_excel(
  readxl_example("deaths.xlsx"),
  range = "arts!A5:F15"
)
#> # A tibble: 10 x 6
#>   Name       Profession   Age `Has kids` `Date of birth`     `Date of death`    
#>   <chr>      <chr>      <dbl> <lgl>      <dttm>              <dttm>             
#> 1 David Bow… musician      69 TRUE       1947-01-08 00:00:00 2016-01-10 00:00:00
#> 2 Carrie Fi… actor         60 TRUE       1956-10-21 00:00:00 2016-12-27 00:00:00
#> 3 Chuck Ber… musician      90 TRUE       1926-10-18 00:00:00 2017-03-18 00:00:00
#> 4 Bill Paxt… actor         61 TRUE       1955-05-17 00:00:00 2017-02-25 00:00:00
#> # … with 6 more rows
read_excel(
  readxl_example("deaths.xlsx"),
  sheet = "other",
  range = cell_rows(5:15)
)
#> # A tibble: 10 x 6
#>   Name       Profession   Age `Has kids` `Date of birth`     `Date of death`    
#>   <chr>      <chr>      <dbl> <lgl>      <dttm>              <dttm>             
#> 1 Vera Rubin scientist     88 TRUE       1928-07-23 00:00:00 2016-12-25 00:00:00
#> 2 Mohamed A… athlete       74 TRUE       1942-01-17 00:00:00 2016-06-03 00:00:00
#> 3 Morley Sa… journalist    84 TRUE       1931-11-08 00:00:00 2016-05-19 00:00:00
#> 4 Fidel Cas… politician    90 TRUE       1926-08-13 00:00:00 2016-11-25 00:00:00
#> # … with 6 more rows

There is also a new argument n_max that limits the number of data rows read from the sheet. It is an example of readxl’s evolution towards a readr-like interface. The Sheet Geometry vignette goes over all the options.

Column typing

The new ability to target cells for reading means that readxl’s automatic column typing will “just work”" for most sheets, most of the time. Above, the Has kids column is automatically detected as logical, which is a new column type for readxl.

You can still specify column type explicitly via col_types, which gets a couple new features. If you provide exactly one type, it is recycled to the necessary length. The new type "guess" can be mixed with explicit types to specify some types, while leaving others to be guessed.

read_excel(
  readxl_example("deaths.xlsx"),
  range = "arts!A5:C15",
  col_types = c("guess", "skip", "numeric")
)
#> # A tibble: 10 x 2
#>   Name            Age
#>   <chr>         <dbl>
#> 1 David Bowie      69
#> 2 Carrie Fisher    60
#> 3 Chuck Berry      90
#> 4 Bill Paxton      61
#> # … with 6 more rows

The new argument guess_max limits the rows used for type guessing. Leading and trailing whitespace is trimmed when the new trim_ws argument is TRUE, which is the default. Finally, thanks to Jonathan Marshall, multiple na values are accepted. The Cell and Column Types vignette has more detail.

"list" columns

Thanks to Greg Freedman Ellis we now have a "list" column type. This is useful if you want to bring truly disparate data into R without the coercion required by atomic vector types.

(df <- read_excel(
  readxl_example("clippy.xlsx"),
  col_types = c("text", "list")
))
#> # A tibble: 4 x 2
#>   name                 value     
#>   <chr>                <list>    
#> 1 Name                 <chr [1]> 
#> 2 Species              <chr [1]> 
#> 3 Approx date of death <dttm [1]>
#> 4 Weight in grams      <dbl [1]>
tibble::deframe(df)
#> $Name
#> [1] "Clippy"
#> 
#> $Species
#> [1] "paperclip"
#> 
#> $`Approx date of death`
#> [1] "2007-01-01 UTC"
#> 
#> $`Weight in grams`
#> [1] 0.9

Everything else

To learn more, read the vignettes and articles or release notes. Highlights include:

  • General rationalization of sheet geometry, including detection and treatment of empty rows and columns.
  • Improved behavior and messaging around coercion and mismatched cell and column types.
  • Improved handling of datetimes with respect to 3rd party software, rounding, and the Lotus 1-2-3 leap year bug.
  • read_xls() and read_xlsx() are now exposed, so that files without an .xls or .xlsx extension can be read. Thanks Jirka Lewandowski!
  • readxl Workflows showcases patterns that reduce tedium and increase reproducibility when raw data arrives in a spreadsheet.