useR to programmeR

Iteration 2

Emma Rand and Ian Lyttle

Learning objectives

This session is (mostly) about functional programming:

  • Aside: managing file paths within your project
  • Example: read a bunch of files, then put them in a single data frame
  • Fundamental paradigms in {purrr}: map, keep, and reduce
  • Adverbs to handle failure
  • More generally, using functions as arguments to functions 🤯

For coding, we will use r-programming-exercises:

  • Open R/iteration-02-01-reading-files.R.
  • Restart R.

Aside: {here} package

For me, here::here() is a truly magical function:

  • useful in scripts: .R files (like today!)
  • useful in documents: .Rmd and .qmd files

If you need to:

  • refer to a file, and
  • it’s in a fixed place within your project

here() can make your life much simpler!

Here: Example

/Users/ijlyttle/repos/r-programming-exercises/ 
|-- r-programming-exercises.Rproj 
|-- README.md
|-- LICENCE.md
|-- data/
    |-- gapminder/
        |-- 1952.xlsx
        |-- ...
    |-- ...
|-- R/ 
    |-- iteration-02-01-reading-files.R 
    |-- ...

Within iteration-02-01-reading-files.R:

  • here("data/gapminder/1952.xlsx")

Works just as well for .Rmd, .qmd files.

Here: Searches

/Users/ijlyttle/repos/r-programming-exercises/ 
|-- r-programming-exercises.Rproj 
|-- README.md
|-- LICENCE.md
|-- data/
    |-- gapminder/
        |-- 1952.xlsx
        |-- ...
    |-- ...
|-- R/ 
    |-- iteration-02-01-reading-files.R 
    |-- ...

  • Looks in directory for an .Rproj file (simplified)
  • Doesn’t find one

Here: Moves up and searches

/Users/ijlyttle/repos/r-programming-exercises/ 
|-- r-programming-exercises.Rproj 
|-- README.md
|-- LICENCE.md
|-- data/
    |-- gapminder/
        |-- 1952.xlsx
        |-- ...
    |-- ...
|-- R/ 
    |-- iteration-02-01-reading-files.R 
    |-- ...

  • Moves up one directory
  • Looks again

Here: Finds .Rproj

/Users/ijlyttle/repos/r-programming-exercises/ 
|-- r-programming-exercises.Rproj 
|-- README.md
|-- LICENCE.md
|-- data/
    |-- gapminder/
        |-- 1952.xlsx
        |-- ...
    |-- ...
|-- R/ 
    |-- iteration-02-01-reading-files.R 
    |-- ...

Here: Flags project-root

/Users/ijlyttle/repos/r-programming-exercises/ 
|-- r-programming-exercises.Rproj 
|-- README.md
|-- LICENCE.md
|-- data/
    |-- gapminder/
        |-- 1952.xlsx
        |-- ...
    |-- ...
|-- R/ 
    |-- iteration-02-01-reading-files.R 
    |-- ...

/Users/ijlyttle/repos/r-programming-exercises/
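
The search walked through on these slides can be sketched in a few lines of base R. This is a simplified illustration, not how {here} is implemented: the name find_project_root() is hypothetical, and it looks only for an .Rproj file, whereas the package itself checks more criteria than that.

```r
# Hypothetical helper: walk up from `start` until a directory holds an .Rproj file.
find_project_root <- function(start = getwd()) {
  dir <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    # look in the current directory for an .Rproj file
    if (length(list.files(dir, pattern = "\\.Rproj$")) > 0) {
      return(dir)
    }
    # not found: move up one directory, stopping at the filesystem root
    parent <- dirname(dir)
    if (identical(parent, dir)) {
      stop("No .Rproj file found at or above: ", start)
    }
    dir <- parent
  }
}

# Demo in a throwaway project (assumed layout that mirrors the tree above)
root <- tempfile("demo-project-")
dir.create(file.path(root, "R"), recursive = TRUE)
invisible(file.create(file.path(root, "demo-project.Rproj")))

# Start in R/, find nothing, move up, find the .Rproj file
found <- find_project_root(file.path(root, "R"))
found
```

Starting from the R/ directory, the loop fails once, moves up, and returns the project root.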

Here: Returns full path

/Users/ijlyttle/repos/r-programming-exercises/ 
|-- r-programming-exercises.Rproj 
|-- README.md
|-- LICENCE.md
|-- data/
    |-- gapminder/
        |-- 1952.xlsx
        |-- ...
    |-- ...
|-- R/ 
    |-- iteration-02-01-reading-files.R 
    |-- ...

here("data/gapminder/1952.xlsx")

/Users/ijlyttle/repos/r-programming-exercises/data/gapminder/1952.xlsx


here() returns a string that represents a path.

It makes no guarantee that the path exists.
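
Because here() only builds a string, it can be worth checking the result before reading. A minimal sketch, assuming {here} is installed; the file name comes from the example above and may not exist on your machine:

```r
library(here)

# Path components can be passed separately; here() joins them to the project root.
path <- here("data", "gapminder", "1952.xlsx")

# The result is just a character string ...
is.character(path)

# ... so check for the file yourself if a missing file would be a problem.
if (!file.exists(path)) {
  message("No file at: ", path)
}
```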

Here: Epilogue

here() works especially well if you need to rearrange your source (e.g. .R) files.

However, if you move target files (e.g. .xlsx files), you need to modify your calls to here().


The here way:

read_excel(here("data/gapminder/1952.xlsx"))

🧐 Where here() can help

read_excel("../data/gapminder/1952.xlsx")

🔥 Meme Alert

Do not do this:

setwd("/Users/ijlyttle/repos/r-programming-exercises/data/gapminder")

read_excel("1952.xlsx")

Reading multiple files

Iteration functions in {purrr} can help with repetitive tasks.

Example

Read Excel files from a directory, then combine into a single data-frame.

Our turn: Reading data manually

Here’s our starting code:

data1952 <- read_excel(here("data/gapminder/1952.xlsx"))
data1957 <- read_excel(here("data/gapminder/1957.xlsx"))
data1962 <- read_excel(here("data/gapminder/1952.xlsx"))
data1967 <- read_excel(here("data/gapminder/1967.xlsx"))

data_manual <- bind_rows(data1952, data1957, data1962, data1967)

What problems do you see?

(I see two real problems, and one philosophical problem)

Run this example code, discuss with your neighbor.

Our turn: Make list of paths

I see this as a two-step problem:

  • make a named list of paths, name is year
  • use list of paths to read data frames, combine

Let’s work together to improve this code to get paths:

paths <-
  # get the filepaths from the directory
  fs::dir_ls(here("data/gapminder")) |>
  # convert to list
  # extract the year as names
  print()

Our turn: Read data

Let’s work together to improve this code to read data:

data <-
  paths |>
  # read each file from excel, into data frame
  # keep only non-null elements
  # set list-names as column `year`
  # bind into single data-frame
  # convert year to number
  print()
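
One possible way to fill in these steps is sketched below. To keep it self-contained, it makes some assumptions that are not in the exercise: two small CSV files in a temporary directory stand in for the Excel files, read.csv() stands in for read_excel(), and list.files() stands in for fs::dir_ls(). It also assumes {purrr} 1.0.0 or later, for list_rbind().

```r
library(purrr)

# Assumed stand-in data: two tiny files named by year
dir <- tempfile("gapminder-demo-")
dir.create(dir)
write.csv(data.frame(country = "A", pop = 1L), file.path(dir, "1952.csv"), row.names = FALSE)
write.csv(data.frame(country = "A", pop = 2L), file.path(dir, "1957.csv"), row.names = FALSE)

# Step 1: named list of paths, name is the year
files <- list.files(dir, full.names = TRUE)
paths <- set_names(as.list(files), tools::file_path_sans_ext(basename(files)))

# Step 2: read, drop NULLs, bind
data <-
  paths |>
  map(\(path) read.csv(path)) |>      # read each file into a data frame
  keep(\(df) !is.null(df)) |>         # keep only non-null elements
  list_rbind(names_to = "year")       # list-names become column `year`, then bind

data$year <- as.numeric(data$year)    # convert year to number
data
```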

Fundamental paradigms

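Functional programming has three fundamental paradigms; they act on lists or vectors:

  • map(): make a new list, operating on each element
  • keep() (filter, elsewhere): make a new list, a subset of the old list
  • reduce(): make a new “thing”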

Each of these takes a function as an argument, to tell the operator what to do.


For coding, we will use r-programming-exercises:

  • Open R/iteration-02-02-fundamental-paradigms.R.
  • Restart R.

Map: Intro

num <- 1:4
num |> map(\(x) x + 1)

map() takes:

  • list or atomic vector
  • function to apply to each member of the vector

Map: Intermediate result

num <- 1:4
num |> map(\(x) x + 1)

Input  Result
1      2
2
3
4

Map: Result

num <- 1:4
num |> map(\(x) x + 1)

Input  Result
1      2
2      3
3      4
4      5

Map: Atomic variants

map() always returns a list:

num <- 1:4
num |> map(\(x) x + 1)
[[1]]
[1] 2

[[2]]
[1] 3

[[3]]
[1] 4

[[4]]
[1] 5

Use an atomic variant to specify type:

num <- 1:4
num |> map_int(\(x) x + 1)
[1] 2 3 4 5
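
The other atomic variants follow the same pattern. A short sketch, using made-up variable names; each variant promises the type of the result and fails if the function returns something else:

```r
library(purrr)

num <- 1:4

squares <- num |> map_dbl(\(x) x^2)                  # double vector
labels  <- num |> map_chr(\(x) paste0("item-", x))   # character vector
is_even <- num |> map_lgl(\(x) x %% 2 == 0)          # logical vector

squares
labels
is_even
```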

Keep: Intro

num <- 1:4
num |> keep(\(x) x %% 2 == 0)

Outside {purrr}, this operation is usually called filter(), but {dplyr} took that name first.

keep() takes:

  • list or vector
  • function, when applied to each member, returns TRUE or FALSE
    • this is called a predicate function

Keep: Intermediate result

num <- 1:4
num |> keep(\(x) x %% 2 == 0)

Input  Evaluation  Result
1      FALSE
2      TRUE        2
3
4

Keep: Result

num <- 1:4
num |> keep(\(x) x %% 2 == 0)

Input  Evaluation  Result
1      FALSE
2      TRUE        2
3      FALSE
4      TRUE        4
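
A predicate can also be a named function, and discard() is the complement of keep(). A small sketch; the helper name is_even() is made up for illustration:

```r
library(purrr)

# Predicate function: returns a single TRUE or FALSE for each element
is_even <- function(x) x %% 2 == 0

num <- 1:4

kept    <- num |> keep(is_even)      # elements where the predicate is TRUE
dropped <- num |> discard(is_even)   # elements where the predicate is FALSE

kept
dropped
```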

Reduce: Intro

num <- 1:4
num |> reduce(\(acc, x) acc + x)

reduce() takes:

  • a list or vector
  • a reducer function, which takes two arguments:
    • the accumulated value
    • the “next” value of the input

Reduce: First result

num <- 1:4
num |> reduce(\(acc, x) acc + x)

Input  Result
1      1
2
3
4

Reduce: Intermediate result

num <- 1:4
num |> reduce(\(acc, x) acc + x)

Input  Result
1
2      3
3
4

Reduce: Result

num <- 1:4
num |> reduce(\(acc, x) acc + x)

Input  Result
1
2
3
4      10

Reduce: Initialize

num <- 1:4
num |> reduce(\(acc, x) acc + x, .init = 1)

Input  Result
1
2
3
4      11
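
To see the intermediate values of the accumulator, rather than only the final one, purrr::accumulate() takes the same arguments as reduce() and keeps every step. A brief sketch:

```r
library(purrr)

num <- 1:4

# Same reducer as above, but every intermediate accumulator value is kept
running <- num |> accumulate(\(acc, x) acc + x)

# With .init, the initial value is the first element of the result
running_init <- num |> accumulate(\(acc, x) acc + x, .init = 1)

running        # 1 3 6 10
running_init   # 1 2 4 7 11
```

The last element of each result matches what reduce() returns.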

Reduce: Use existing functions

num <- 1:4
num |> reduce(sum, .init = 1)

Input  Result
1
2
3
4      11

Additional arguments

num <- c(1, 2, 3, NA, 4)
num |> reduce(sum)
[1] NA

The default behavior for sum() is not to remove NA values.


To change the behavior, use an anonymous function:

num |> reduce(\(acc, x) sum(acc, x, na.rm = TRUE))
[1] 10

No longer recommended

num |> reduce(sum, na.rm = TRUE)

Using an anonymous function will:

  • make it more explicit which argument goes to which function.
  • tend to yield better error messages.

Variants and adverbs

Some useful variants, can mix and match:

Adverbs modify verbs (functions):

Handling failures with adverbs

If we have a failure, we may not want to stop everything.

library("readr")
read_csv("not/a/file.csv")
Error: 'not/a/file.csv' does not exist in current working directory ('/home/runner/work/r-programming/r-programming').

For coding, we will use r-programming-exercises:

  • Open R/iteration-02-03-adverbs.R.
  • Restart R.

Function operators a.k.a. adverbs

Function operators:

  • take a function
  • return a modified function

library("purrr")

possibly_read_csv <- possibly(read_csv, otherwise = NULL, quiet = FALSE)

possibly_read_csv("not/a/file.csv")
Error: 'not/a/file.csv' does not exist in current working directory ('/home/runner/work/r-programming/r-programming').
NULL

possibly_read_csv(I("a, b\n 1, 2"), col_types = "dd")
# A tibble: 1 × 2
      a     b
  <dbl> <dbl>
1     1     2
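
The same idea works inside map(): wrap the function once, then a single failure no longer stops the whole iteration. A minimal sketch, where risky_sqrt() is a made-up function that fails on negative input:

```r
library(purrr)

# Hypothetical function that throws an error for some inputs
risky_sqrt <- function(x) {
  if (x < 0) stop("negative input: ", x)
  sqrt(x)
}

# possibly() returns a modified function; on error it returns `otherwise`
possibly_sqrt <- possibly(risky_sqrt, otherwise = NA_real_)

results <- c(4, -1, 9) |> map_dbl(possibly_sqrt)
results   # 2 NA 3
```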

Our turn: Handle failure

In the r-programming-exercises repository:

  • look at data/gapminder_party/
  • try running your script using this directory

Create a new function:

possibly_read_excel <- possibly() # we do the rest

Use this function in your script.

Our turn: Re-implement list_rbind()

Re-implement list_rbind() using functional-programming techniques:

data_reimplemented <-
  paths_party |>
  map(possibly_read_excel) |>
  # keep(negate(is.null)) |>
  # imap(\(df, name) mutate(df, "year" := parse_number(name))) |>
  # reduce(rbind) |>
  print()
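
Here is the same pipeline on a tiny, made-up list, so each step can be tried without the Excel files. This is a sketch under assumptions: the data frames are invented, and base cbind() stands in for mutate() to avoid extra packages.

```r
library(purrr)

# Assumed input: a named list where one "file" failed and came back as NULL
frames <- list(
  `1952` = data.frame(pop = 1),
  `1957` = NULL,
  `1962` = data.frame(pop = 3)
)

combined <-
  frames |>
  keep(negate(is.null)) |>                                    # drop the failures
  imap(\(df, name) cbind(df, year = as.numeric(name))) |>     # list-name becomes column `year`
  reduce(rbind)                                               # bind rows, two frames at a time

combined
```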

Let’s run this, uncommenting one line at a time.

Functions as arguments

We have seen functions as arguments in map(), keep(), reduce(), and possibly().

Using functions, themselves, as arguments takes a little getting used to.

Once you wrap your mind around it, it’s like seeing in more dimensions.


For coding, we will use r-programming-exercises:

  • Open R/iteration-02-04-functions-as-arguments.R.
  • Restart R.

Labelling scales

library("tidyverse")
library("palmerpenguins")
library("conflicted")
conflicts_prefer(palmerpenguins::penguins)

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point()

What if we want lower-case names for the species?

Specify labels 🧐

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() + 
  scale_color_discrete(labels = c("adelie", "chinstrap", "gentoo"))

We can do it manually, but what if we get a dataset with more species?

Use a labelling function 😎

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() + 
  scale_color_discrete(labels = tolower) # tolower is a function

Look at ?discrete_scale: labels can take a function.

Function factories

Function operators (adverbs) return modified functions 🤯

Function factories return functions “out of thin air” 🤯🤯


{scales}, used by {ggplot2}, is full of these function factories!
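
To make the idea concrete, here is a hand-written function factory in base R. It is only a sketch: make_percent_labeller() is a made-up name, and it is far simpler than anything in {scales}.

```r
# Hypothetical function factory: returns a new labelling function
make_percent_labeller <- function(digits = 0) {
  # the returned function remembers `digits` from the call that created it
  function(x) {
    paste0(formatC(100 * x, format = "f", digits = digits), "%")
  }
}

label_whole   <- make_percent_labeller()             # no decimal places
label_decimal <- make_percent_labeller(digits = 1)   # one decimal place

label_whole(c(0, 0.01, 0.1, 1))
label_decimal(0.5)
```

Each call to the factory creates a separate function, with its own remembered setting.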

Our turn: Labeller

## use scales:: notation, vs. library(), to help autocomplete
percent_labeller <- scales::label_percent(accuracy = 1)

# percent_labeller is a function
percent_labeller(c(0, 0.01, 0.1, 1))
[1] "0%"   "1%"   "10%"  "100%"

Play around with:

  • accuracy
  • values sent to percent_labeller
  • whatever else seems interesting to you

Your turn: Labeller

ggplot(penguins, aes(x = bill_length_mm, color = species)) +
  stat_ecdf()

Add scale_y_continuous() to this plot, to use a percentage-labeller.

Your turn: Labeller (solution)

ggplot(penguins, aes(x = bill_length_mm, color = species)) +
  stat_ecdf() +
  scale_y_continuous(labels = scales::label_percent(accuracy = 1))

To me, this is a cleaner solution than mutating the data from decimal to percent.

Declarative vs. Imperative Programming

Let’s say you wanted to double this:

original <- 1:4


Declarative

Focus on what:

double <- original |> map_dbl(\(x) 2 * x)
double
[1] 2 4 6 8

Imperative

Focus on how:

double = numeric(length(original))
for (i in seq_along(original)) {
  double[i] = original[i] * 2
}
double
[1] 2 4 6 8


Of course, base R has the ultimate declarative approach:

2 * original
[1] 2 4 6 8

ui.dev has a very accessible article on the two approaches.

If we have time

Three fundamental paradigms in functional programming

Given a list and a function:

  • filter, keep(): make a new list, subset of old list
  • map(): make a new list, operating on each element
  • reduce(): make a new “thing”

For coding, we will use r-programming-exercises:

  • Open R/iteration-02-05-dpurrr.R.
  • Restart R.

dplyr using purrr?

We can use purrr::keep(), purrr::map(), and purrr::reduce() to “implement” dplyr::filter(), dplyr::mutate(), and dplyr::summarise(), respectively.

I claim it’s possible, I don’t claim it’s a good idea.

Our turn: Simplified penguins

library("conflicted")
library("palmerpenguins")
library("dplyr")
library("purrr")

conflicts_prefer(palmerpenguins::penguins)

# simplify penguins (Sorry Allison!)
penguins_local <-
  penguins |>
  mutate(across(where(is.factor), as.character)) |> # use strings, not factors
  select(species, island, body_mass_g, sex) |>      # fewer columns
  print()
# A tibble: 344 × 4
   species island    body_mass_g sex   
   <chr>   <chr>           <int> <chr> 
 1 Adelie  Torgersen        3750 male  
 2 Adelie  Torgersen        3800 female
 3 Adelie  Torgersen        3250 female
 4 Adelie  Torgersen          NA <NA>  
 5 Adelie  Torgersen        3450 female
 6 Adelie  Torgersen        3650 male  
 7 Adelie  Torgersen        3625 female
 8 Adelie  Torgersen        4675 male  
 9 Adelie  Torgersen        3475 <NA>  
10 Adelie  Torgersen        4250 <NA>  
# ℹ 334 more rows

Tabular data: Two perspectives

  • column-based: named list of column vectors

    {
      "species": ["Adelie", "Adelie", ...],
      "island": ["Torgersen", "Torgersen", ...],
      "body_mass_g": [3750, 3800, ...],
      "sex": ["male", "female", ...]
    }
  • row-based: collection of rows, each a named list

    [
      {"species": "Adelie", "island": "Torgersen", "body_mass_g": 3750, "sex": "male"}, 
      {"species": "Adelie", "island": "Torgersen", "body_mass_g": 3800, "sex": "female"}, 
      ...
    ]

Our turn: Helper functions

We have a couple of helper functions to convert to:

  • Data frames: column-based

    #' @param .d unnamed list of named lists, i.e. transposed data frame
    #'
    #' @return tibble
    dpurrr_to_tibble <- function(.d) {
      .d |>
        purrr::list_transpose() |>
        tibble::as_tibble()
    }  
  • Lists of lists: row-based

    #' @param .data data frame or tibble
    #'
    #' @return unnamed list of named lists, i.e. transposed data frame
    dpurrr_to_list <- function(.data) {
      .data |>
        as.list() |>
        purrr::list_transpose(simplify = FALSE)
    }

Our turn: Experiment

# experiment with helpers
penguins_local |>
  head(2) |>
  dpurrr_to_list() |>
  # dpurrr_to_tibble() |>
  str()
List of 2
 $ :List of 4
  ..$ species    : chr "Adelie"
  ..$ island     : chr "Torgersen"
  ..$ body_mass_g: int 3750
  ..$ sex        : chr "male"
 $ :List of 4
  ..$ species    : chr "Adelie"
  ..$ island     : chr "Torgersen"
  ..$ body_mass_g: int 3800
  ..$ sex        : chr "female"

Comment and change lines as you see fit.

Our turn: dpurrr filter (first element)

# filter is just purrr::keep()
penguins_local |>
  dpurrr_to_list() |>
  keep(\(d) d$sex == "female" && !is.na(d$sex)) |>
  dpurrr_to_tibble() 

Predicate function acts on each “row”, d, which is a list:

List of 4
 $ species    : chr "Adelie"
 $ island     : chr "Torgersen"
 $ body_mass_g: int 3750
 $ sex        : chr "male"
List of 4
 $ species    : chr "Adelie"
 $ island     : chr "Torgersen"
 $ body_mass_g: int 3800
 $ sex        : chr "female"
List of 4
 $ species    : chr "Adelie"
 $ island     : chr "Torgersen"
 $ body_mass_g: int 3250
 $ sex        : chr "female"

Our turn: dpurrr filter (result)

# filter is just purrr::keep()
penguins_local |>
  dpurrr_to_list() |>
  keep(\(d) d$sex == "female" && !is.na(d$sex)) |>
  dpurrr_to_tibble() 

Re-assembled into a tibble:

# A tibble: 165 × 4
   species island    body_mass_g sex   
   <chr>   <chr>           <int> <chr> 
 1 Adelie  Torgersen        3800 female
 2 Adelie  Torgersen        3250 female
 3 Adelie  Torgersen        3450 female
 4 Adelie  Torgersen        3625 female
 5 Adelie  Torgersen        3200 female
 6 Adelie  Torgersen        3700 female
 7 Adelie  Torgersen        3450 female
 8 Adelie  Torgersen        3325 female
 9 Adelie  Biscoe           3400 female
10 Adelie  Biscoe           3800 female
# ℹ 155 more rows

Our turn: dpurrr mutate

#' @param .d unnamed list of named lists, i.e. transposed data frame
#' @param mapper function applied to each member of `.d`
#' 
#' @return unnamed list of named lists, i.e. transposed data frame
dpurrr_mutate <- function(.d, mapper) {
  # modifyList() used to keep current elements
  .d |> purrr::map(\(d) modifyList(d, mapper(d)))
}

This version of mutate operates on every “row”, modifying its list.

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_mutate(\(d) list(body_mass_kg = d$body_mass_g / 1000)) |>
  dpurrr_to_tibble() |>
  print()
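
The work inside dpurrr_mutate() is done by modifyList(), from base R's {utils}. A quick sketch of how it treats a single “row”; the values are taken from the first penguin above:

```r
row <- list(species = "Adelie", body_mass_g = 3750L)

# New names are appended; existing elements are kept
updated <- modifyList(row, list(body_mass_kg = row$body_mass_g / 1000))

# Existing names are overwritten
replaced <- modifyList(row, list(body_mass_g = 4000L))

# A NULL value removes the element
removed <- modifyList(row, list(body_mass_g = NULL))

str(updated)
str(replaced)
str(removed)
```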

Our turn: dpurrr mutate (start)

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_mutate(\(d) list(body_mass_kg = d$body_mass_g / 1000)) |>
  dpurrr_to_tibble()

# A tibble: 344 × 4
   species island    body_mass_g sex   
   <chr>   <chr>           <int> <chr> 
 1 Adelie  Torgersen        3750 male  
 2 Adelie  Torgersen        3800 female
 3 Adelie  Torgersen        3250 female
 4 Adelie  Torgersen          NA <NA>  
 5 Adelie  Torgersen        3450 female
 6 Adelie  Torgersen        3650 male  
 7 Adelie  Torgersen        3625 female
 8 Adelie  Torgersen        4675 male  
 9 Adelie  Torgersen        3475 <NA>  
10 Adelie  Torgersen        4250 <NA>  
# ℹ 334 more rows

Our turn: dpurrr mutate (by row, before)

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_mutate(\(d) list(body_mass_kg = d$body_mass_g / 1000)) |>
  dpurrr_to_tibble()

List of 4
 $ species    : chr "Adelie"
 $ island     : chr "Torgersen"
 $ body_mass_g: int 3750
 $ sex        : chr "male"
List of 4
 $ species    : chr "Adelie"
 $ island     : chr "Torgersen"
 $ body_mass_g: int 3800
 $ sex        : chr "female"
List of 4
 $ species    : chr "Adelie"
 $ island     : chr "Torgersen"
 $ body_mass_g: int 3250
 $ sex        : chr "female"

Our turn: dpurrr mutate (by row, after)

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_mutate(\(d) list(body_mass_kg = d$body_mass_g / 1000)) |>
  dpurrr_to_tibble()

List of 5
 $ species     : chr "Adelie"
 $ island      : chr "Torgersen"
 $ body_mass_g : int 3750
 $ sex         : chr "male"
 $ body_mass_kg: num 3.75
List of 5
 $ species     : chr "Adelie"
 $ island      : chr "Torgersen"
 $ body_mass_g : int 3800
 $ sex         : chr "female"
 $ body_mass_kg: num 3.8
List of 5
 $ species     : chr "Adelie"
 $ island      : chr "Torgersen"
 $ body_mass_g : int 3250
 $ sex         : chr "female"
 $ body_mass_kg: num 3.25

Our turn: dpurrr mutate (result)

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_mutate(\(d) list(body_mass_kg = d$body_mass_g / 1000)) |>
  dpurrr_to_tibble()

# A tibble: 344 × 5
   species island    body_mass_g sex    body_mass_kg
   <chr>   <chr>           <int> <chr>         <dbl>
 1 Adelie  Torgersen        3750 male           3.75
 2 Adelie  Torgersen        3800 female         3.8 
 3 Adelie  Torgersen        3250 female         3.25
 4 Adelie  Torgersen          NA <NA>          NA   
 5 Adelie  Torgersen        3450 female         3.45
 6 Adelie  Torgersen        3650 male           3.65
 7 Adelie  Torgersen        3625 female         3.62
 8 Adelie  Torgersen        4675 male           4.68
 9 Adelie  Torgersen        3475 <NA>           3.48
10 Adelie  Torgersen        4250 <NA>           4.25
# ℹ 334 more rows

Our turn: dpurrr summarise

#' @param .d unnamed list of named lists, i.e. transposed data frame
#' @param reducer function applied to the accumulator and to each member of `.d`
#' @param .init initial value of accumulator, if empty: first element of `.d`
#' @param ... other arguments passed to `purrr::reduce()`
#'
#' @return unnamed list of named lists, i.e. transposed data frame
dpurrr_summarise <- function(.d, reducer, .init, ...) {
  # wrap result in a list, to return a transposed data frame
  .d |> purrr::reduce(reducer, .init = .init, ...) |> list()
}

Takes a transposed data frame, returns a transposed data frame with a single “row”.

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_summarise(
    \(acc, d) list(
      body_mass_g_min = min(acc$body_mass_g_min, d$body_mass_g, na.rm = TRUE),
      body_mass_g_max = max(acc$body_mass_g_max, d$body_mass_g, na.rm = TRUE)
    )
  ) |>
  dpurrr_to_tibble() |>
  print()
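
One detail worth noticing: when .init is not supplied, reduce() uses the first element as the starting accumulator, so the first row's value is never passed to the reducer as d. The sketch below uses made-up rows, chosen so that the first row holds the maximum, to show how seeding the accumulator changes the result. The names min_max(), seeded, and unseeded are for illustration only.

```r
library(purrr)

# Assumed data: the first "row" holds the largest value
rows <- list(list(mass = 5), list(mass = 3), list(mass = 4))

min_max <- function(acc, d) {
  list(
    mass_min = min(acc$mass_min, d$mass),
    mass_max = max(acc$mass_max, d$mass)
  )
}

# Without .init: the first row becomes `acc`, and its `mass` is never compared
unseeded <- rows |> reduce(min_max)

# With .init: every row, including the first, passes through the reducer
seeded <- rows |> reduce(min_max, .init = list(mass_min = Inf, mass_max = -Inf))

str(unseeded)   # mass_max is 4
str(seeded)     # mass_max is 5
```

For the penguins example the result is unaffected, because the first row is neither the minimum nor the maximum.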

Our turn: dpurrr summarise (start)

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_summarise(
    \(acc, d) list(
      body_mass_g_min = min(acc$body_mass_g_min, d$body_mass_g, na.rm = TRUE),
      body_mass_g_max = max(acc$body_mass_g_max, d$body_mass_g, na.rm = TRUE)
    )
  ) |>
  dpurrr_to_tibble() |>
  print()

# A tibble: 344 × 4
   species island    body_mass_g sex   
   <chr>   <chr>           <int> <chr> 
 1 Adelie  Torgersen        3750 male  
 2 Adelie  Torgersen        3800 female
 3 Adelie  Torgersen        3250 female
 4 Adelie  Torgersen          NA <NA>  
 5 Adelie  Torgersen        3450 female
 6 Adelie  Torgersen        3650 male  
 7 Adelie  Torgersen        3625 female
 8 Adelie  Torgersen        4675 male  
 9 Adelie  Torgersen        3475 <NA>  
10 Adelie  Torgersen        4250 <NA>  
# ℹ 334 more rows

Our turn: dpurrr summarise (by row, before)

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_summarise(
    \(acc, d) list(
      body_mass_g_min = min(acc$body_mass_g_min, d$body_mass_g, na.rm = TRUE),
      body_mass_g_max = max(acc$body_mass_g_max, d$body_mass_g, na.rm = TRUE)
    )
  ) |>
  dpurrr_to_tibble() |>
  print()

List of 4
 $ species    : chr "Adelie"
 $ island     : chr "Torgersen"
 $ body_mass_g: int 3750
 $ sex        : chr "male"
List of 4
 $ species    : chr "Adelie"
 $ island     : chr "Torgersen"
 $ body_mass_g: int 3800
 $ sex        : chr "female"
List of 4
 $ species    : chr "Adelie"
 $ island     : chr "Torgersen"
 $ body_mass_g: int 3250
 $ sex        : chr "female"

Our turn: dpurrr summarise (by row, after)

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_summarise(
    \(acc, d) list(
      body_mass_g_min = min(acc$body_mass_g_min, d$body_mass_g, na.rm = TRUE),
      body_mass_g_max = max(acc$body_mass_g_max, d$body_mass_g, na.rm = TRUE)
    )
  ) |>
  dpurrr_to_tibble() |>
  print()

List of 2
 $ body_mass_g_min: int 2700
 $ body_mass_g_max: int 6300

Our turn: dpurrr summarise (result)

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_summarise(
    \(acc, d) list(
      body_mass_g_min = min(acc$body_mass_g_min, d$body_mass_g, na.rm = TRUE),
      body_mass_g_max = max(acc$body_mass_g_max, d$body_mass_g, na.rm = TRUE)
    )
  ) |>
  dpurrr_to_tibble() |>
  print()

# A tibble: 1 × 2
  body_mass_g_min body_mass_g_max
            <int>           <int>
1            2700            6300

Our turn: dpurrr summarise with grouping

We need a couple more functions, one to split and one to combine, plus a named version of our reducer:

#' @param .d unnamed list of named lists, i.e. transposed data frame
#' @param name string, name of variable on which to split
#'
#' @return named list of transposed data frames, names: values of split variable
dpurrr_split <- function(.d, name) {
  # uses purrr::map(), purrr::set_names(), purrr::keep()
}

#' @param .nd named list of transposed data frames
#' @param name string, name of variable to put into combined list
#'
#' @return transposed data frame
dpurrr_combine <- function(.nd, name) {
  # uses purrr::imap(), purrr::reduce()
}

body_mass_g_min_max <- function(acc, d) {
  list(
    body_mass_g_min = min(acc$body_mass_g_min, d$body_mass_g, na.rm = TRUE),
    body_mass_g_max = max(acc$body_mass_g_max, d$body_mass_g, na.rm = TRUE)
  )
}
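
The bodies of dpurrr_split() and dpurrr_combine() are left open above. One possible way to fill them in, using the purrr functions named in the comments, is sketched here; it is an assumption about how they could work, not necessarily the version in the exercise files. The demo rows and the row-counting step are made up.

```r
library(purrr)

dpurrr_split <- function(.d, name) {
  # unique values of the split variable, used as both names and values
  values <- .d |> map_chr(\(d) d[[name]]) |> unique()
  values |>
    set_names() |>
    map(\(value) keep(.d, \(d) identical(d[[name]], value)))
}

dpurrr_combine <- function(.nd, name) {
  .nd |>
    # put the group name back into every "row" of its group
    imap(\(.d, value) map(.d, \(d) modifyList(d, set_names(list(value), name)))) |>
    # concatenate the groups into one transposed data frame
    reduce(c)
}

# Demo with made-up rows
rows <- list(
  list(species = "Adelie", body_mass_g = 3750L),
  list(species = "Gentoo", body_mass_g = 4500L),
  list(species = "Adelie", body_mass_g = 3800L)
)

groups <- dpurrr_split(rows, "species")

# summarise each group to a single "row" (here: just count the rows)
combined <-
  groups |>
  map(\(d) list(list(n = length(d)))) |>
  dpurrr_combine("species")

str(combined)
```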

Our turn: dpurrr summarise with grouping (start)

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_split("species") |>
  map(\(d) d |> dpurrr_summarise(body_mass_g_min_max)) |>
  dpurrr_combine("species") |>
  dpurrr_to_tibble() |>
  print()

# A tibble: 344 × 4
   species island    body_mass_g sex   
   <chr>   <chr>           <int> <chr> 
 1 Adelie  Torgersen        3750 male  
 2 Adelie  Torgersen        3800 female
 3 Adelie  Torgersen        3250 female
 4 Adelie  Torgersen          NA <NA>  
 5 Adelie  Torgersen        3450 female
 6 Adelie  Torgersen        3650 male  
 7 Adelie  Torgersen        3625 female
 8 Adelie  Torgersen        4675 male  
 9 Adelie  Torgersen        3475 <NA>  
10 Adelie  Torgersen        4250 <NA>  
# ℹ 334 more rows

Our turn: dpurrr summarise with grouping (by row)

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_split("species") |>
  map(\(d) d |> dpurrr_summarise(body_mass_g_min_max)) |>
  dpurrr_combine("species") |>
  dpurrr_to_tibble() |>
  print()

  List of 4
   $ species    : chr "Adelie"
   $ island     : chr "Torgersen"
   $ body_mass_g: int 3750
   $ sex        : chr "male"
  List of 4
   $ species    : chr "Adelie"
   $ island     : chr "Torgersen"
   $ body_mass_g: int 3800
   $ sex        : chr "female"
  List of 4
   $ species    : chr "Gentoo"
   $ island     : chr "Biscoe"
   $ body_mass_g: int 4500
   $ sex        : chr "female"
  List of 4
   $ species    : chr "Gentoo"
   $ island     : chr "Biscoe"
   $ body_mass_g: int 5700
   $ sex        : chr "male"
  List of 4
   $ species    : chr "Chinstrap"
   $ island     : chr "Dream"
   $ body_mass_g: int 3500
   $ sex        : chr "female"
  List of 4
   $ species    : chr "Chinstrap"
   $ island     : chr "Dream"
   $ body_mass_g: int 3900
   $ sex        : chr "male"

Our turn: dpurrr summarise with grouping (split)

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_split("species") |>
  map(\(d) d |> dpurrr_summarise(body_mass_g_min_max)) |>
  dpurrr_combine("species") |>
  dpurrr_to_tibble() |>
  print()

$Adelie
  List of 4
   $ species    : chr "Adelie"
   $ island     : chr "Torgersen"
   $ body_mass_g: int 3750
   $ sex        : chr "male"
  List of 4
   $ species    : chr "Adelie"
   $ island     : chr "Torgersen"
   $ body_mass_g: int 3800
   $ sex        : chr "female"
$Gentoo
  List of 4
   $ species    : chr "Gentoo"
   $ island     : chr "Biscoe"
   $ body_mass_g: int 4500
   $ sex        : chr "female"
  List of 4
   $ species    : chr "Gentoo"
   $ island     : chr "Biscoe"
   $ body_mass_g: int 5700
   $ sex        : chr "male"
$Chinstrap
  List of 4
   $ species    : chr "Chinstrap"
   $ island     : chr "Dream"
   $ body_mass_g: int 3500
   $ sex        : chr "female"
  List of 4
   $ species    : chr "Chinstrap"
   $ island     : chr "Dream"
   $ body_mass_g: int 3900
   $ sex        : chr "male"

Our turn: dpurrr summarise with grouping (summarise)

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_split("species") |>
  map(\(d) d |> dpurrr_summarise(body_mass_g_min_max)) |>
  dpurrr_combine("species") |>
  dpurrr_to_tibble() |>
  print()

$Adelie
  List of 2
   $ body_mass_g_min: int 2850
   $ body_mass_g_max: int 4775
$Gentoo
  List of 2
   $ body_mass_g_min: int 3950
   $ body_mass_g_max: int 6300
$Chinstrap
  List of 2
   $ body_mass_g_min: int 2700
   $ body_mass_g_max: int 4800

Our turn: dpurrr summarise with grouping (combine 🌗)

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_split("species") |>
  map(\(d) d |> dpurrr_summarise(body_mass_g_min_max)) |>
  dpurrr_combine("species") |>
  dpurrr_to_tibble() |>
  print()

$Adelie
  List of 3
   $ body_mass_g_min: int 2850
   $ body_mass_g_max: int 4775
   $ species        : chr "Adelie"
$Gentoo
  List of 3
   $ body_mass_g_min: int 3950
   $ body_mass_g_max: int 6300
   $ species        : chr "Gentoo"
$Chinstrap
  List of 3
   $ body_mass_g_min: int 2700
   $ body_mass_g_max: int 4800
   $ species        : chr "Chinstrap"

Our turn: dpurrr summarise with grouping (combine 🌕)

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_split("species") |>
  map(\(d) d |> dpurrr_summarise(body_mass_g_min_max)) |>
  dpurrr_combine("species") |>
  dpurrr_to_tibble() |>
  print()

  List of 3
   $ body_mass_g_min: int 2850
   $ body_mass_g_max: int 4775
   $ species        : chr "Adelie"
  List of 3
   $ body_mass_g_min: int 3950
   $ body_mass_g_max: int 6300
   $ species        : chr "Gentoo"
  List of 3
   $ body_mass_g_min: int 2700
   $ body_mass_g_max: int 4800
   $ species        : chr "Chinstrap"

Our turn: dpurrr summarise with grouping (result)

penguins_local |>
  dpurrr_to_list() |>
  dpurrr_split("species") |>
  map(\(d) d |> dpurrr_summarise(body_mass_g_min_max)) |>
  dpurrr_combine("species") |>
  dpurrr_to_tibble() |>
  print()

# A tibble: 3 × 3
  body_mass_g_min body_mass_g_max species  
            <int>           <int> <chr>    
1            2850            4775 Adelie   
2            3950            6300 Gentoo   
3            2700            4800 Chinstrap

Our turn: Finally…

We can agree this presents no danger to dplyr.

In JavaScript, data frames are often arrays of objects (lists); you can use tools like tidyjs:

Screenshot of tidyjs Observable page

Summary

  • {here} can help you manage file paths within projects.
  • Functional programming has three fundamental paradigms: map, keep (filter), and reduce.
  • {purrr} offers variants and adverbs.
  • Adverbs can help you handle failure.
  • Functions can be used as arguments to functions.
  • Functions can be returned by functions.
  • Another view of data frames (if we had time).

Wrap-up

Please go to pos.it/conf-workshop-survey.

Your feedback is crucial!

Data from the survey informs curriculum and format decisions for future conf workshops, and we really appreciate you taking the time to provide it.


Thank you!

  • Emma
  • Garrett
  • Mine Çetinkaya-Rundel, Posit
  • You 🤗