dplyr::coalesce() in action

Data wrangling
tidyverse
code
Published

October 27, 2023

Viable shortcut to missing-value-free data set

Introduction

Missing data can be a major headache when handling data in R; more especially if these NA’s are scattered across multiple columns representing your one variable of interest.

The coalesce() function from dplyr package comes in handy in such situations. In this short article, I am going to show you how powerful the function coalesce() is and how it can make your data manipulation task easy. To put in perspective, the function extracts the first non-missing (non-NA) value from a set of columns or values, making your data cleaner and analysis-ready, it does this by prioritizing a designated column as we will see from the worked out examples below.

Create fictious data

# load packages
pacman::p_load(
  tidyverse,
  flextable
)

# Create fictitious data
fict_data <- tribble(
 ~PTID, ~ HT_ADM, ~ HT_DISCH, ~ HT_D45, ~ HT_D90,       # patient ID and Height measurements at different time-points
  "7001",    100,      100.1,        100,       NA_real_,
  "7002",    97,       98,         NA_real_,  97.5,
  "7003",    97.6,     NA_real_,    97,      97.5,
  "7004",    NA_real_,  99,        99.8,     100,
  "7005",    79.5,  NA_real_,   78.9,     NA_real_,
  "7006",    NA_real_,  NA_real_,   NA_real_,  102.1,
  "7007",    78.5,      79,         NA_real_,    NA_real_,
  "7008",    NA_real_,      98,        98,       NA_real_)
# qflextable-it
fict_data %>% 
  qflextable()
Table 1:

Ficticious patient data

PTID

HT_ADM

HT_DISCH

HT_D45

HT_D90

7001

100.0

100.1

100.0

7002

97.0

98.0

97.5

7003

97.6

97.0

97.5

7004

99.0

99.8

100.0

7005

79.5

78.9

7006

102.1

7007

78.5

79.0

7008

98.0

98.0

The data in Table 1 is a made up one, representing height measurements of 8 patients across 4 time-points i.e during admission, discharge, day 45 and day 90 follow up visits respectively. As you can see, there is no single column with complete data. The assumption is that the measurement are subject to some absolute error and therefore any column can be used as measured height in an analysis; the problem is that no column is complete and this is what we are going to tackle using coalesce()

Creating Height column

The aim is to use coalesce() to create a column HT_cm with non-missing data using. We can try a few combination, but I am confident that if you look carefully, there are probably a few combination of two columns that can give you that.

NOTE: This is an illustration of how coalesce() works, your use-case might be different, just think how this function fits in.

HT_ADM vs HT_D90

Here coalesce() gives priority to HT_ADM and if this column has some NA’s while HT_D90 has values in the same row; the values of HT_D90 get pulled/filled in place of NA’s to form HT_CM.

fict_data %>% 
  mutate(
    HT_CM = coalesce(HT_ADM, HT_D90)    # prioritize _ADM over _D90
  ) %>% 
  flextable()

PTID

HT_ADM

HT_DISCH

HT_D45

HT_D90

HT_CM

7001

100.0

100.1

100.0

100.0

7002

97.0

98.0

97.5

97.0

7003

97.6

97.0

97.5

97.6

7004

99.0

99.8

100.0

100.0

7005

79.5

78.9

79.5

7006

102.1

102.1

7007

78.5

79.0

78.5

7008

98.0

98.0

But we still have one more missing value for patient 7008

HT_ADM vs HT_DISH

Take a combination of measurement at admission and discharge for instance.

fict_data %>% 
  mutate(
    HT_CM = coalesce(HT_ADM, HT_DISCH) # prioritize _ADM over _DISCH
  ) %>% 
  flextable()

PTID

HT_ADM

HT_DISCH

HT_D45

HT_D90

HT_CM

7001

100.0

100.1

100.0

100.0

7002

97.0

98.0

97.5

97.0

7003

97.6

97.0

97.5

97.6

7004

99.0

99.8

100.0

99.0

7005

79.5

78.9

79.5

7006

102.1

7007

78.5

79.0

78.5

7008

98.0

98.0

98.0

Same problem, we still have a missing cell for patient 7006 on the HT_CM column

HT_D90 vs HT_DISCH

We can reverse the natural order of the visits and the logic still remains. In the below example, coalesce() prioritizes HT_D90 over HT_DISCH but we still end up a missing cell in patient 7005 since both measurements are NA at discharge and at day 90.

fict_data %>% 
  mutate(
    HT_CM = coalesce(HT_D90, HT_DISCH)   # prioritize _D90 over _DISCH
  ) %>%  
  flextable()

PTID

HT_ADM

HT_DISCH

HT_D45

HT_D90

HT_CM

7001

100.0

100.1

100.0

100.1

7002

97.0

98.0

97.5

97.5

7003

97.6

97.0

97.5

97.5

7004

99.0

99.8

100.0

100.0

7005

79.5

78.9

7006

102.1

102.1

7007

78.5

79.0

79.0

7008

98.0

98.0

98.0

Use all columns

The ideal is to use all the columns. coalesce() prioritizes form “left to right” (in this case HT_ADM) filling in the missing cells on the left with the first non-missing value it encounters to the right. This is how it takes care of the non-missing business in your data.

fict_data %>% 
  mutate(
    HT_CM = coalesce(HT_ADM, HT_DISCH, HT_D45, HT_D90)
  ) %>% 
  flextable()

PTID

HT_ADM

HT_DISCH

HT_D45

HT_D90

HT_CM

7001

100.0

100.1

100.0

100.0

7002

97.0

98.0

97.5

97.0

7003

97.6

97.0

97.5

97.6

7004

99.0

99.8

100.0

99.0

7005

79.5

78.9

79.5

7006

102.1

102.1

7007

78.5

79.0

78.5

7008

98.0

98.0

98.0

Conclusion

I hope the examples helped you understand this simple, yet powerful function to handle missing data in your data wrangling efforts.

Session info

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31)
 os       macOS Sonoma 14.6.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/London
 date     2024-08-21
 pandoc   3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package           * version date (UTC) lib source
 askpass             1.2.0   2023-09-03 [1] CRAN (R 4.3.0)
 cli                 3.6.3   2024-06-21 [1] CRAN (R 4.3.3)
 colorspace          2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
 crayon              1.5.3   2024-06-20 [1] CRAN (R 4.3.3)
 crul                1.4.2   2024-04-09 [1] CRAN (R 4.3.1)
 curl                5.2.1   2024-03-01 [1] CRAN (R 4.3.1)
 data.table          1.15.4  2024-03-30 [1] CRAN (R 4.3.1)
 digest              0.6.36  2024-06-23 [1] CRAN (R 4.3.3)
 dplyr             * 1.1.4   2023-11-17 [1] CRAN (R 4.3.1)
 evaluate            0.24.0  2024-06-10 [1] CRAN (R 4.3.3)
 fansi               1.0.6   2023-12-08 [1] CRAN (R 4.3.1)
 fastmap             1.2.0   2024-05-15 [1] CRAN (R 4.3.3)
 flextable         * 0.9.6   2024-05-05 [1] CRAN (R 4.3.1)
 fontBitstreamVera   0.1.1   2017-02-01 [1] CRAN (R 4.3.0)
 fontLiberation      0.1.0   2016-10-15 [1] CRAN (R 4.3.0)
 fontquiver          0.2.1   2017-02-01 [1] CRAN (R 4.3.0)
 forcats           * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
 gdtools             0.3.7   2024-03-05 [1] CRAN (R 4.3.1)
 generics            0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
 gfonts              0.2.0   2023-01-08 [1] CRAN (R 4.3.0)
 ggplot2           * 3.5.1   2024-04-23 [1] CRAN (R 4.3.1)
 glue                1.7.0   2024-01-09 [1] CRAN (R 4.3.1)
 gtable              0.3.5   2024-04-22 [1] CRAN (R 4.3.1)
 hms                 1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
 htmltools           0.5.8.1 2024-04-04 [1] CRAN (R 4.3.1)
 htmlwidgets         1.6.4   2023-12-06 [1] CRAN (R 4.3.1)
 httpcode            0.3.0   2020-04-10 [1] CRAN (R 4.3.0)
 httpuv              1.6.15  2024-03-26 [1] CRAN (R 4.3.1)
 jsonlite            1.8.8   2023-12-04 [1] CRAN (R 4.3.1)
 knitr               1.48    2024-07-07 [1] CRAN (R 4.3.3)
 later               1.3.2   2023-12-06 [1] CRAN (R 4.3.1)
 lifecycle           1.0.4   2023-11-07 [1] CRAN (R 4.3.1)
 lubridate         * 1.9.3   2023-09-27 [1] CRAN (R 4.3.1)
 magrittr            2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 mime                0.12    2021-09-28 [1] CRAN (R 4.3.0)
 munsell             0.5.1   2024-04-01 [1] CRAN (R 4.3.1)
 officer             0.6.6   2024-05-05 [1] CRAN (R 4.3.1)
 openssl             2.2.0   2024-05-16 [1] CRAN (R 4.3.3)
 pacman              0.5.1   2019-03-11 [1] CRAN (R 4.3.0)
 pillar              1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig           2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
 promises            1.3.0   2024-04-05 [1] CRAN (R 4.3.1)
 purrr             * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
 R6                  2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 ragg                1.3.2   2024-05-15 [1] CRAN (R 4.3.3)
 Rcpp                1.0.12  2024-01-09 [1] CRAN (R 4.3.1)
 readr             * 2.1.5   2024-01-10 [1] CRAN (R 4.3.1)
 rlang               1.1.4   2024-06-04 [1] CRAN (R 4.3.3)
 rmarkdown           2.27    2024-05-17 [1] CRAN (R 4.3.3)
 rstudioapi          0.16.0  2024-03-24 [1] CRAN (R 4.3.1)
 scales              1.3.0   2023-11-28 [1] CRAN (R 4.3.1)
 sessioninfo         1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 shiny               1.8.1.1 2024-04-02 [1] CRAN (R 4.3.1)
 stringi             1.8.4   2024-05-06 [1] CRAN (R 4.3.1)
 stringr           * 1.5.1   2023-11-14 [1] CRAN (R 4.3.1)
 systemfonts         1.1.0   2024-05-15 [1] CRAN (R 4.3.3)
 textshaping         0.4.0   2024-05-24 [1] CRAN (R 4.3.3)
 tibble            * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidyr             * 1.3.1   2024-01-24 [1] CRAN (R 4.3.1)
 tidyselect          1.2.1   2024-03-11 [1] CRAN (R 4.3.1)
 tidyverse         * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
 timechange          0.3.0   2024-01-18 [1] CRAN (R 4.3.1)
 tzdb                0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
 utf8                1.2.4   2023-10-22 [1] CRAN (R 4.3.1)
 uuid                1.2-0   2024-01-14 [1] CRAN (R 4.3.1)
 vctrs               0.6.5   2023-12-01 [1] CRAN (R 4.3.1)
 withr               3.0.0   2024-01-16 [1] CRAN (R 4.3.1)
 xfun                0.45    2024-06-16 [1] CRAN (R 4.3.3)
 xml2                1.3.6   2023-12-04 [1] CRAN (R 4.3.1)
 xtable              1.8-4   2019-04-21 [1] CRAN (R 4.3.0)
 yaml                2.3.9   2024-07-05 [1] CRAN (R 4.3.3)
 zip                 2.3.1   2024-01-27 [1] CRAN (R 4.3.1)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────