# load packages
::p_load(
pacman
tidyverse,
flextable
)
# Create fictitious data
<- tribble(
fict_data ~PTID, ~ HT_ADM, ~ HT_DISCH, ~ HT_D45, ~ HT_D90, # patient ID and Height measurements at different time-points
"7001", 100, 100.1, 100, NA_real_,
"7002", 97, 98, NA_real_, 97.5,
"7003", 97.6, NA_real_, 97, 97.5,
"7004", NA_real_, 99, 99.8, 100,
"7005", 79.5, NA_real_, 78.9, NA_real_,
"7006", NA_real_, NA_real_, NA_real_, 102.1,
"7007", 78.5, 79, NA_real_, NA_real_,
"7008", NA_real_, 98, 98, NA_real_)
Viable shortcut to missing-value-free data set
Introduction
Missing data can be a major headache when handling data in R; more especially if these NA’s are scattered across multiple columns representing your one variable of interest.
The coalesce() function from dplyr package comes in handy in such situations. In this short article, I am going to show you how powerful the function coalesce()
is and how it can make your data manipulation task easy. To put in perspective, the function extracts the first non-missing (non-NA) value from a set of columns or values, making your data cleaner and analysis-ready, it does this by prioritizing a designated column as we will see from the worked out examples below.
Create fictious data
# qflextable-it
%>%
fict_data qflextable()
PTID | HT_ADM | HT_DISCH | HT_D45 | HT_D90 |
---|---|---|---|---|
7001 | 100.0 | 100.1 | 100.0 | |
7002 | 97.0 | 98.0 | 97.5 | |
7003 | 97.6 | 97.0 | 97.5 | |
7004 | 99.0 | 99.8 | 100.0 | |
7005 | 79.5 | 78.9 | ||
7006 | 102.1 | |||
7007 | 78.5 | 79.0 | ||
7008 | 98.0 | 98.0 |
The data in Table 1 is a made up one, representing height measurements of 8 patients across 4 time-points i.e during admission, discharge, day 45 and day 90 follow up visits respectively. As you can see, there is no single column with complete data. The assumption is that the measurement are subject to some absolute error and therefore any column can be used as measured height in an analysis; the problem is that no column is complete and this is what we are going to tackle using coalesce()
Creating Height column
The aim is to use coalesce()
to create a column HT_cm with non-missing data using. We can try a few combination, but I am confident that if you look carefully, there are probably a few combination of two columns that can give you that.
NOTE: This is an illustration of how coalesce()
works, your use-case might be different, just think how this function fits in.
HT_ADM vs HT_D90
Here coalesce()
gives priority to HT_ADM
and if this column has some NA’s while HT_D90
has values in the same row; the values of HT_D90
get pulled/filled in place of NA’s to form HT_CM
.
%>%
fict_data mutate(
HT_CM = coalesce(HT_ADM, HT_D90) # prioritize _ADM over _D90
%>%
) flextable()
PTID | HT_ADM | HT_DISCH | HT_D45 | HT_D90 | HT_CM |
---|---|---|---|---|---|
7001 | 100.0 | 100.1 | 100.0 | 100.0 | |
7002 | 97.0 | 98.0 | 97.5 | 97.0 | |
7003 | 97.6 | 97.0 | 97.5 | 97.6 | |
7004 | 99.0 | 99.8 | 100.0 | 100.0 | |
7005 | 79.5 | 78.9 | 79.5 | ||
7006 | 102.1 | 102.1 | |||
7007 | 78.5 | 79.0 | 78.5 | ||
7008 | 98.0 | 98.0 |
But we still have one more missing value for patient 7008
HT_ADM vs HT_DISH
Take a combination of measurement at admission and discharge for instance.
%>%
fict_data mutate(
HT_CM = coalesce(HT_ADM, HT_DISCH) # prioritize _ADM over _DISCH
%>%
) flextable()
PTID | HT_ADM | HT_DISCH | HT_D45 | HT_D90 | HT_CM |
---|---|---|---|---|---|
7001 | 100.0 | 100.1 | 100.0 | 100.0 | |
7002 | 97.0 | 98.0 | 97.5 | 97.0 | |
7003 | 97.6 | 97.0 | 97.5 | 97.6 | |
7004 | 99.0 | 99.8 | 100.0 | 99.0 | |
7005 | 79.5 | 78.9 | 79.5 | ||
7006 | 102.1 | ||||
7007 | 78.5 | 79.0 | 78.5 | ||
7008 | 98.0 | 98.0 | 98.0 |
Same problem, we still have a missing cell for patient 7006 on the HT_CM column
HT_D90 vs HT_DISCH
We can reverse the natural order of the visits and the logic still remains. In the below example, coalesce()
prioritizes HT_D90
over HT_DISCH
but we still end up a missing cell in patient 7005 since both measurements are NA at discharge and at day 90.
%>%
fict_data mutate(
HT_CM = coalesce(HT_D90, HT_DISCH) # prioritize _D90 over _DISCH
%>%
) flextable()
PTID | HT_ADM | HT_DISCH | HT_D45 | HT_D90 | HT_CM |
---|---|---|---|---|---|
7001 | 100.0 | 100.1 | 100.0 | 100.1 | |
7002 | 97.0 | 98.0 | 97.5 | 97.5 | |
7003 | 97.6 | 97.0 | 97.5 | 97.5 | |
7004 | 99.0 | 99.8 | 100.0 | 100.0 | |
7005 | 79.5 | 78.9 | |||
7006 | 102.1 | 102.1 | |||
7007 | 78.5 | 79.0 | 79.0 | ||
7008 | 98.0 | 98.0 | 98.0 |
Use all columns
The ideal is to use all the columns. coalesce()
prioritizes form “left to right” (in this case HT_ADM
) filling in the missing cells on the left with the first non-missing value it encounters to the right. This is how it takes care of the non-missing business in your data.
%>%
fict_data mutate(
HT_CM = coalesce(HT_ADM, HT_DISCH, HT_D45, HT_D90)
%>%
) flextable()
PTID | HT_ADM | HT_DISCH | HT_D45 | HT_D90 | HT_CM |
---|---|---|---|---|---|
7001 | 100.0 | 100.1 | 100.0 | 100.0 | |
7002 | 97.0 | 98.0 | 97.5 | 97.0 | |
7003 | 97.6 | 97.0 | 97.5 | 97.6 | |
7004 | 99.0 | 99.8 | 100.0 | 99.0 | |
7005 | 79.5 | 78.9 | 79.5 | ||
7006 | 102.1 | 102.1 | |||
7007 | 78.5 | 79.0 | 78.5 | ||
7008 | 98.0 | 98.0 | 98.0 |
Conclusion
I hope the examples helped you understand this simple, yet powerful function to handle missing data in your data wrangling efforts.
Session info
::session_info() sessioninfo
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.2 (2023-10-31)
os macOS Sonoma 14.6.1
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/London
date 2024-08-21
pandoc 3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
askpass 1.2.0 2023-09-03 [1] CRAN (R 4.3.0)
cli 3.6.3 2024-06-21 [1] CRAN (R 4.3.3)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
crayon 1.5.3 2024-06-20 [1] CRAN (R 4.3.3)
crul 1.4.2 2024-04-09 [1] CRAN (R 4.3.1)
curl 5.2.1 2024-03-01 [1] CRAN (R 4.3.1)
data.table 1.15.4 2024-03-30 [1] CRAN (R 4.3.1)
digest 0.6.36 2024-06-23 [1] CRAN (R 4.3.3)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.1)
evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.3.3)
fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.1)
fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.3.3)
flextable * 0.9.6 2024-05-05 [1] CRAN (R 4.3.1)
fontBitstreamVera 0.1.1 2017-02-01 [1] CRAN (R 4.3.0)
fontLiberation 0.1.0 2016-10-15 [1] CRAN (R 4.3.0)
fontquiver 0.2.1 2017-02-01 [1] CRAN (R 4.3.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)
gdtools 0.3.7 2024-03-05 [1] CRAN (R 4.3.1)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
gfonts 0.2.0 2023-01-08 [1] CRAN (R 4.3.0)
ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.3.1)
glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.1)
gtable 0.3.5 2024-04-22 [1] CRAN (R 4.3.1)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.3.1)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.3.1)
httpcode 0.3.0 2020-04-10 [1] CRAN (R 4.3.0)
httpuv 1.6.15 2024-03-26 [1] CRAN (R 4.3.1)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.1)
knitr 1.48 2024-07-07 [1] CRAN (R 4.3.3)
later 1.3.2 2023-12-06 [1] CRAN (R 4.3.1)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1)
lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.3.1)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.3.0)
munsell 0.5.1 2024-04-01 [1] CRAN (R 4.3.1)
officer 0.6.6 2024-05-05 [1] CRAN (R 4.3.1)
openssl 2.2.0 2024-05-16 [1] CRAN (R 4.3.3)
pacman 0.5.1 2019-03-11 [1] CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
promises 1.3.0 2024-04-05 [1] CRAN (R 4.3.1)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
ragg 1.3.2 2024-05-15 [1] CRAN (R 4.3.3)
Rcpp 1.0.12 2024-01-09 [1] CRAN (R 4.3.1)
readr * 2.1.5 2024-01-10 [1] CRAN (R 4.3.1)
rlang 1.1.4 2024-06-04 [1] CRAN (R 4.3.3)
rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.3.3)
rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.3.1)
scales 1.3.0 2023-11-28 [1] CRAN (R 4.3.1)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
shiny 1.8.1.1 2024-04-02 [1] CRAN (R 4.3.1)
stringi 1.8.4 2024-05-06 [1] CRAN (R 4.3.1)
stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.3.1)
systemfonts 1.1.0 2024-05-15 [1] CRAN (R 4.3.3)
textshaping 0.4.0 2024-05-24 [1] CRAN (R 4.3.3)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.3.1)
tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.3.1)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)
timechange 0.3.0 2024-01-18 [1] CRAN (R 4.3.1)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)
utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.1)
uuid 1.2-0 2024-01-14 [1] CRAN (R 4.3.1)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.1)
withr 3.0.0 2024-01-16 [1] CRAN (R 4.3.1)
xfun 0.45 2024-06-16 [1] CRAN (R 4.3.3)
xml2 1.3.6 2023-12-04 [1] CRAN (R 4.3.1)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.0)
yaml 2.3.9 2024-07-05 [1] CRAN (R 4.3.3)
zip 2.3.1 2024-01-27 [1] CRAN (R 4.3.1)
[1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
──────────────────────────────────────────────────────────────────────────────