Viable shortcut to missing-value-free data set

Introduction

Missing data can be a major headache when handling data in R; more especially if these NA’s are scattered across multiple columns representing your one variable of interest.

The coalesce() function from dplyr package comes in handy in such situations. In this short article, I am going to show you how powerful the function coalesce() is and how it can make your data manipulation task easy. To put in perspective, the function extracts the first non-missing (non-NA) value from a set of columns or values, making your data cleaner and analysis-ready, it does this by prioritizing a designated column as we will see from the worked out examples below.

Create fictious data

# load packages
pacman::p_load(
  tidyverse,
  flextable
)

# Create fictitious data
fict_data <- tribble(
 ~PTID, ~ HT_ADM, ~ HT_DISCH, ~ HT_D45, ~ HT_D90,       # patient ID and Height measurements at different time-points
  "7001",    100,      100.1,        100,       NA_real_,
  "7002",    97,       98,         NA_real_,  97.5,
  "7003",    97.6,     NA_real_,    97,      97.5,
  "7004",    NA_real_,  99,        99.8,     100,
  "7005",    79.5,  NA_real_,   78.9,     NA_real_,
  "7006",    NA_real_,  NA_real_,   NA_real_,  102.1,
  "7007",    78.5,      79,         NA_real_,    NA_real_,
  "7008",    NA_real_,      98,        98,       NA_real_)

# qflextable-it
fict_data %>% 
  qflextable()

Table 1:
Ficticious patient data
PTID	HT_ADM	HT_DISCH	HT_D45	HT_D90
7001	100.0	100.1	100.0
7002	97.0	98.0		97.5
7003	97.6		97.0	97.5
7004		99.0	99.8	100.0
7005	79.5		78.9
7006				102.1
7007	78.5	79.0
7008		98.0	98.0

The data in Table 1 is a made up one, representing height measurements of 8 patients across 4 time-points i.e during admission, discharge, day 45 and day 90 follow up visits respectively. As you can see, there is no single column with complete data. The assumption is that the measurement are subject to some absolute error and therefore any column can be used as measured height in an analysis; the problem is that no column is complete and this is what we are going to tackle using coalesce()

Creating Height column

The aim is to use coalesce() to create a column HT_cm with non-missing data using. We can try a few combination, but I am confident that if you look carefully, there are probably a few combination of two columns that can give you that.

NOTE: This is an illustration of how coalesce() works, your use-case might be different, just think how this function fits in.

HT_ADM vs HT_D90

Here coalesce() gives priority to HT_ADM and if this column has some NA’s while HT_D90 has values in the same row; the values of HT_D90 get pulled/filled in place of NA’s to form HT_CM.

fict_data %>% 
  mutate(
    HT_CM = coalesce(HT_ADM, HT_D90)    # prioritize _ADM over _D90
  ) %>% 
  flextable()

PTID	HT_ADM	HT_DISCH	HT_D45	HT_D90	HT_CM
7001	100.0	100.1	100.0		100.0
7002	97.0	98.0		97.5	97.0
7003	97.6		97.0	97.5	97.6
7004		99.0	99.8	100.0	100.0
7005	79.5		78.9		79.5
7006				102.1	102.1
7007	78.5	79.0			78.5
7008		98.0	98.0

But we still have one more missing value for patient 7008

HT_ADM vs HT_DISH

Take a combination of measurement at admission and discharge for instance.

fict_data %>% 
  mutate(
    HT_CM = coalesce(HT_ADM, HT_DISCH) # prioritize _ADM over _DISCH
  ) %>% 
  flextable()

PTID	HT_ADM	HT_DISCH	HT_D45	HT_D90	HT_CM
7001	100.0	100.1	100.0		100.0
7002	97.0	98.0		97.5	97.0
7003	97.6		97.0	97.5	97.6
7004		99.0	99.8	100.0	99.0
7005	79.5		78.9		79.5
7006				102.1
7007	78.5	79.0			78.5
7008		98.0	98.0		98.0

Same problem, we still have a missing cell for patient 7006 on the HT_CM column

HT_D90 vs HT_DISCH

We can reverse the natural order of the visits and the logic still remains. In the below example, coalesce() prioritizes HT_D90 over HT_DISCH but we still end up a missing cell in patient 7005 since both measurements are NA at discharge and at day 90.

fict_data %>% 
  mutate(
    HT_CM = coalesce(HT_D90, HT_DISCH)   # prioritize _D90 over _DISCH
  ) %>%  
  flextable()

PTID	HT_ADM	HT_DISCH	HT_D45	HT_D90	HT_CM
7001	100.0	100.1	100.0		100.1
7002	97.0	98.0		97.5	97.5
7003	97.6		97.0	97.5	97.5
7004		99.0	99.8	100.0	100.0
7005	79.5		78.9
7006				102.1	102.1
7007	78.5	79.0			79.0
7008		98.0	98.0		98.0

Use all columns

The ideal is to use all the columns. coalesce() prioritizes form “left to right” (in this case HT_ADM) filling in the missing cells on the left with the first non-missing value it encounters to the right. This is how it takes care of the non-missing business in your data.

fict_data %>% 
  mutate(
    HT_CM = coalesce(HT_ADM, HT_DISCH, HT_D45, HT_D90)
  ) %>% 
  flextable()

PTID	HT_ADM	HT_DISCH	HT_D45	HT_D90	HT_CM
7001	100.0	100.1	100.0		100.0
7002	97.0	98.0		97.5	97.0
7003	97.6		97.0	97.5	97.6
7004		99.0	99.8	100.0	99.0
7005	79.5		78.9		79.5
7006				102.1	102.1
7007	78.5	79.0			78.5
7008		98.0	98.0		98.0

Conclusion

I hope the examples helped you understand this simple, yet powerful function to handle missing data in your data wrangling efforts.

Session info

sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31)
 os       macOS Sonoma 14.6.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/London
 date     2024-08-21
 pandoc   3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package           * version date (UTC) lib source
 askpass             1.2.0   2023-09-03 [1] CRAN (R 4.3.0)
 cli                 3.6.3   2024-06-21 [1] CRAN (R 4.3.3)
 colorspace          2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
 crayon              1.5.3   2024-06-20 [1] CRAN (R 4.3.3)
 crul                1.4.2   2024-04-09 [1] CRAN (R 4.3.1)
 curl                5.2.1   2024-03-01 [1] CRAN (R 4.3.1)
 data.table          1.15.4  2024-03-30 [1] CRAN (R 4.3.1)
 digest              0.6.36  2024-06-23 [1] CRAN (R 4.3.3)
 dplyr             * 1.1.4   2023-11-17 [1] CRAN (R 4.3.1)
 evaluate            0.24.0  2024-06-10 [1] CRAN (R 4.3.3)
 fansi               1.0.6   2023-12-08 [1] CRAN (R 4.3.1)
 fastmap             1.2.0   2024-05-15 [1] CRAN (R 4.3.3)
 flextable         * 0.9.6   2024-05-05 [1] CRAN (R 4.3.1)
 fontBitstreamVera   0.1.1   2017-02-01 [1] CRAN (R 4.3.0)
 fontLiberation      0.1.0   2016-10-15 [1] CRAN (R 4.3.0)
 fontquiver          0.2.1   2017-02-01 [1] CRAN (R 4.3.0)
 forcats           * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
 gdtools             0.3.7   2024-03-05 [1] CRAN (R 4.3.1)
 generics            0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
 gfonts              0.2.0   2023-01-08 [1] CRAN (R 4.3.0)
 ggplot2           * 3.5.1   2024-04-23 [1] CRAN (R 4.3.1)
 glue                1.7.0   2024-01-09 [1] CRAN (R 4.3.1)
 gtable              0.3.5   2024-04-22 [1] CRAN (R 4.3.1)
 hms                 1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
 htmltools           0.5.8.1 2024-04-04 [1] CRAN (R 4.3.1)
 htmlwidgets         1.6.4   2023-12-06 [1] CRAN (R 4.3.1)
 httpcode            0.3.0   2020-04-10 [1] CRAN (R 4.3.0)
 httpuv              1.6.15  2024-03-26 [1] CRAN (R 4.3.1)
 jsonlite            1.8.8   2023-12-04 [1] CRAN (R 4.3.1)
 knitr               1.48    2024-07-07 [1] CRAN (R 4.3.3)
 later               1.3.2   2023-12-06 [1] CRAN (R 4.3.1)
 lifecycle           1.0.4   2023-11-07 [1] CRAN (R 4.3.1)
 lubridate         * 1.9.3   2023-09-27 [1] CRAN (R 4.3.1)
 magrittr            2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 mime                0.12    2021-09-28 [1] CRAN (R 4.3.0)
 munsell             0.5.1   2024-04-01 [1] CRAN (R 4.3.1)
 officer             0.6.6   2024-05-05 [1] CRAN (R 4.3.1)
 openssl             2.2.0   2024-05-16 [1] CRAN (R 4.3.3)
 pacman              0.5.1   2019-03-11 [1] CRAN (R 4.3.0)
 pillar              1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig           2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
 promises            1.3.0   2024-04-05 [1] CRAN (R 4.3.1)
 purrr             * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
 R6                  2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 ragg                1.3.2   2024-05-15 [1] CRAN (R 4.3.3)
 Rcpp                1.0.12  2024-01-09 [1] CRAN (R 4.3.1)
 readr             * 2.1.5   2024-01-10 [1] CRAN (R 4.3.1)
 rlang               1.1.4   2024-06-04 [1] CRAN (R 4.3.3)
 rmarkdown           2.27    2024-05-17 [1] CRAN (R 4.3.3)
 rstudioapi          0.16.0  2024-03-24 [1] CRAN (R 4.3.1)
 scales              1.3.0   2023-11-28 [1] CRAN (R 4.3.1)
 sessioninfo         1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 shiny               1.8.1.1 2024-04-02 [1] CRAN (R 4.3.1)
 stringi             1.8.4   2024-05-06 [1] CRAN (R 4.3.1)
 stringr           * 1.5.1   2023-11-14 [1] CRAN (R 4.3.1)
 systemfonts         1.1.0   2024-05-15 [1] CRAN (R 4.3.3)
 textshaping         0.4.0   2024-05-24 [1] CRAN (R 4.3.3)
 tibble            * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidyr             * 1.3.1   2024-01-24 [1] CRAN (R 4.3.1)
 tidyselect          1.2.1   2024-03-11 [1] CRAN (R 4.3.1)
 tidyverse         * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
 timechange          0.3.0   2024-01-18 [1] CRAN (R 4.3.1)
 tzdb                0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
 utf8                1.2.4   2023-10-22 [1] CRAN (R 4.3.1)
 uuid                1.2-0   2024-01-14 [1] CRAN (R 4.3.1)
 vctrs               0.6.5   2023-12-01 [1] CRAN (R 4.3.1)
 withr               3.0.0   2024-01-16 [1] CRAN (R 4.3.1)
 xfun                0.45    2024-06-16 [1] CRAN (R 4.3.3)
 xml2                1.3.6   2023-12-04 [1] CRAN (R 4.3.1)
 xtable              1.8-4   2019-04-21 [1] CRAN (R 4.3.0)
 yaml                2.3.9   2024-07-05 [1] CRAN (R 4.3.3)
 zip                 2.3.1   2024-01-27 [1] CRAN (R 4.3.1)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────