Summarise numerical variables — summarise_numerical_variables

Summarises numerical variables with repeated measurements either by field (i.e. across all available measurements) or by instance (i.e. across all measurements taken at each assessment visit). Currently available summary options are mean, minimum, maximum, sum and number of non-missing values.
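
The difference between the two grouping modes can be illustrated with base R on a toy data frame. This is only a sketch of the idea, not the package's implementation: the column names are hypothetical and simply assume the f<FieldID>_<instance>_<array> suffix seen in the examples on this page.

```r
# Toy data: FieldID 4080, two instances (visits), two readings per visit.
# Column names are hypothetical, for illustration only.
toy <- data.frame(
  eid = 1:3,
  bp_f4080_0_0 = c(134, 146, NA),
  bp_f4080_0_1 = c(136, 144, NA),
  bp_f4080_1_0 = c(130, NA, NA),
  bp_f4080_1_1 = c(132, 140, NA)
)

# By field: pool every reading for FieldID 4080, regardless of visit.
field_cols <- grep("_f4080_", names(toy), value = TRUE)
by_field <- unname(rowMeans(toy[field_cols], na.rm = TRUE))

# By instance: pool readings within each visit separately.
instance_0_cols <- grep("_f4080_0_", names(toy), value = TRUE)
instance_1_cols <- grep("_f4080_1_", names(toy), value = TRUE)
by_instance_0 <- unname(rowMeans(toy[instance_0_cols], na.rm = TRUE))
by_instance_1 <- unname(rowMeans(toy[instance_1_cols], na.rm = TRUE))
```

For participant 1 the field-level mean is 133, while the instance-level means are 135 (instance 0) and 131 (instance 1). Summarising by field therefore yields one new column per field, and summarising by instance yields one per field and visit.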

Usage

summarise_numerical_variables(
  ukb_main,
  data_dict = NULL,
  ukb_data_dict = get_ukb_data_dict(),
  summary_function = "mean",
  summarise_by = "Field",
  .drop = FALSE
)

Arguments

ukb_main: A UK Biobank main dataset data frame. Column names must match those under the descriptive_colnames column in data_dict.
data_dict: A data dictionary specific to the UKB main dataset file, created by make_data_dict.
ukb_data_dict: The UKB data dictionary (available online at the UK Biobank data showcase). This should be a data frame where all columns are of type character.
summary_function: The summary function to be applied. Options: "mean", "min", "max", "sum" or "n_values".
summarise_by: Whether to summarise by "Field" or by "Instance".
.drop: If TRUE, removes the original numerical variables from the result. Default value is FALSE.

Value

A data frame with new columns summarising numerical variables. The names of these new columns are prefixed with the value of summary_function and end with 'x' followed by the FieldID and, when summarising by instance, the instance number. For example, summarising the mean of FieldID 4080 at instance 0 creates a new column named 'mean_systolic_blood_pressure_automated_reading_x4080_0'.
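
The naming convention can be mimicked with a small helper. Note that summary_colname() below is hypothetical and not part of the package; it is a sketch that assumes original column names end in _f<FieldID>_<instance>_<array>, as in the example output on this page.

```r
# Hypothetical helper, for illustration only (not a package function).
# Assumes the input column name ends in _f<FieldID>_<instance>_<array>.
summary_colname <- function(colname,
                            summary_function = "mean",
                            summarise_by = "Field") {
  pattern <- "_f([0-9]+)_([0-9]+)_[0-9]+$"
  # Keep the instance number only when summarising by instance.
  replacement <- if (summarise_by == "Instance") "_x\\1_\\2" else "_x\\1"
  paste0(summary_function, "_", sub(pattern, replacement, colname))
}

original <- "systolic_blood_pressure_automated_reading_f4080_0_0"

by_field_name <- summary_colname(original)
by_instance_name <- summary_colname(original, "min", "Instance")
```

Here by_field_name is 'mean_systolic_blood_pressure_automated_reading_x4080' and by_instance_name is 'min_systolic_blood_pressure_automated_reading_x4080_0', matching the column names shown in the example output.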

Details

Note that when summary_function = "sum", missing values are converted to zero. Therefore, if a set of values is entirely missing, the sum will be summarised as 0. See the documentation for rowSums for further details.
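
The behaviour can be reproduced with base R. The sketch below assumes the sum behaves like rowSums() with na.rm = TRUE, as the reference to rowSums suggests; the toy column names are made up. It also shows one possible way to tell a true zero apart from an all-missing row by counting non-missing values.

```r
# Toy readings: row 3 has no measurements at all.
readings <- data.frame(
  reading_0 = c(134, NA, NA),
  reading_1 = c(136, 145, NA)
)

# With na.rm = TRUE, an all-missing row sums to 0 rather than NA.
sums <- unname(rowSums(readings, na.rm = TRUE))

# The mean of an all-missing row is NaN instead.
means <- unname(rowMeans(readings, na.rm = TRUE))

# Number of non-missing values per row.
n_values <- unname(rowSums(!is.na(readings)))

# Optional post-processing: restore NA where nothing was measured.
sums_with_na <- ifelse(n_values == 0, NA, sums)
```

In this toy data, sums is 270, 145 and 0, whereas n_values is 2, 1 and 0. Pairing a "sum" summary with an "n_values" summary is one way to identify rows where a sum of 0 reflects missing data rather than measured zeros.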

Examples

library(magrittr)
# get dummy UKB data and data dictionary
dummy_ukb_data_dict <- get_ukb_dummy("dummy_Data_Dictionary_Showcase.tsv")
dummy_ukb_codings <- get_ukb_dummy("dummy_Codings.tsv")

dummy_ukb_main <- read_ukb(
  path = get_ukb_dummy("dummy_ukb_main.tsv", path_only = TRUE),
  ukb_data_dict = dummy_ukb_data_dict,
  ukb_codings = dummy_ukb_codings
) %>%
  dplyr::select(eid, tidyselect::contains("systolic_blood_pressure")) %>%
  tibble::as_tibble()
#> Creating data dictionary
#> STEP 1 of 3
#> Reading data into R
#> STEP 2 of 3
#> Renaming with descriptive column names
#> STEP 3 of 3
#> Applying variable and value labels
#> Labelling dataset
#> Time taken: 0 minutes, 0 seconds.

# summarise mean values by Field, keep original variables
summarise_numerical_variables(
  dummy_ukb_main,
  ukb_data_dict = dummy_ukb_data_dict
)
#> Number of summary columns to make: 1
#> Time taken: 0 minutes, 0 seconds.
#> # A tibble: 10 × 10
#>      eid systolic_blood_pressure…¹ systolic_blood_press…² systolic_blood_press…³
#>    <int>                     <int>                  <int>                  <int>
#>  1     1                        NA                    134                    134
#>  2     2                       146                    145                    145
#>  3     3                       143                    123                    123
#>  4     4                        NA                     NA                     NA
#>  5     5                        NA                     NA                     NA
#>  6     6                        NA                     NA                     NA
#>  7     7                        NA                     NA                     NA
#>  8     8                        NA                     NA                     NA
#>  9     9                        NA                     NA                     NA
#> 10    10                        NA                     NA                     NA
#> # ℹ abbreviated names: ¹systolic_blood_pressure_automated_reading_f4080_0_0,
#> #   ²systolic_blood_pressure_automated_reading_f4080_0_1,
#> #   ³systolic_blood_pressure_automated_reading_f4080_0_2
#> # ℹ 6 more variables:
#> #   systolic_blood_pressure_automated_reading_f4080_0_3 <int>,
#> #   systolic_blood_pressure_automated_reading_f4080_1_0 <int>,
#> #   systolic_blood_pressure_automated_reading_f4080_1_1 <int>, …

# summarise mean values by Field, drop original variables
summarise_numerical_variables(
  dummy_ukb_main,
  ukb_data_dict = dummy_ukb_data_dict,
  .drop = TRUE
)
#> Number of summary columns to make: 1
#> Time taken: 0 minutes, 0 seconds.
#> # A tibble: 10 × 2
#>      eid mean_systolic_blood_pressure_automated_reading_x4080
#>    <int>                                                <dbl>
#>  1     1                                                 138.
#>  2     2                                                 143.
#>  3     3                                                 130.
#>  4     4                                                 NaN 
#>  5     5                                                 NaN 
#>  6     6                                                 NaN 
#>  7     7                                                 NaN 
#>  8     8                                                 NaN 
#>  9     9                                                 NaN 
#> 10    10                                                 NaN 

# summarise min values by instance, dropping original variables
summarise_numerical_variables(
  dummy_ukb_main,
  ukb_data_dict = dummy_ukb_data_dict,
  summary_function = "min",
  summarise_by = "Instance",
  .drop = TRUE
)
#> Number of summary columns to make: 2
#> Time taken: 0 minutes, 0 seconds.
#> # A tibble: 10 × 3
#>      eid min_systolic_blood_pressure_automated_reading_…¹ min_systolic_blood_p…²
#>    <int>                                            <int>                  <int>
#>  1     1                                              134                    134
#>  2     2                                              145                    129
#>  3     3                                              123                    123
#>  4     4                                               NA                     NA
#>  5     5                                               NA                     NA
#>  6     6                                               NA                     NA
#>  7     7                                               NA                     NA
#>  8     8                                               NA                     NA
#>  9     9                                               NA                     NA
#> 10    10                                               NA                     NA
#> # ℹ abbreviated names: ¹min_systolic_blood_pressure_automated_reading_x4080_0,
#> #   ²min_systolic_blood_pressure_automated_reading_x4080_1