Read a UK Biobank main dataset file — read_ukb

Reads a UK Biobank main dataset file into R using either fread or read_dta. Optionally renames variables with descriptive names, adds variable labels, and labels coded values of type character as factors.

Usage

read_ukb(
  path,
  delim = "auto",
  data_dict = NULL,
  ukb_data_dict = get_ukb_data_dict(),
  ukb_codings = get_ukb_codings(),
  descriptive_colnames = TRUE,
  label = TRUE,
  max_n_labels = 30,
  na.strings = c("", "NA"),
  nrows = Inf,
  ...
)

Arguments

path: The path to a UK Biobank main dataset file.
delim: Delimiter for the UKB main dataset file. Default is "auto" (see data.table::fread()). Ignored if the file name ends with .dta (i.e. it is a STATA file).
data_dict: A data dictionary specific to the UKB main dataset file, generated by make_data_dict. To load only a selection of columns, supply a filtered copy of this data dictionary containing only the required variables. If NULL (default) then all fields will be read.
ukb_data_dict: The UKB data dictionary (available online at the UK Biobank data showcase). This should be a data frame where all columns are of type character.
ukb_codings: The UKB codings file (available online at the UK Biobank data showcase). This should be a data frame where all columns are of type character.
descriptive_colnames: If TRUE, rename columns with longer descriptive names derived from the UK Biobank's data dictionary 'Field' column.
label: If TRUE, apply variable labels and label coded values as factors.
max_n_labels: Coded variables whose number of associated value labels is less than or equal to this threshold will be labelled as factors. If NULL, all value labels will be applied. Default value is 30.
na.strings: A character vector of strings which are to be interpreted as NA values. By default, ",," for columns of all types, including type character is read as NA for consistency. ,"", is unambiguous and read as an empty string. To read ,NA, as NA, set na.strings="NA". To read ,, as blank string "", set na.strings=NULL. When they occur in the file, the strings in na.strings should not appear quoted since that is how the string literal ,"NA", is distinguished from ,NA,, for example, when na.strings="NA".
nrows: The maximum number of rows to read. Unlike read.table, you do not need to set this to an estimate of the number of rows in the file for better speed because that is already automatically determined by fread almost instantly using the large sample of lines. nrows=0 returns the column names and typed empty columns determined by the large sample; useful for a dry run of a large file or to quickly check format consistency of a set of files before starting to read any of them.
...: Additional parameters passed on to either fread or read_dta.
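
The max_n_labels rule described above can be pictured with a small base R sketch. This is only an illustration of the thresholding idea, not the package's internal code: label_if_few() and its arguments are hypothetical names introduced here, and the 0 = Female, 1 = Male coding is taken from the example output further down this page.

```r
# Hypothetical helper (not part of ukbwranglr): convert a coded character
# vector to a factor only when the number of value labels does not exceed
# max_n_labels. A NULL threshold means "always apply the labels".
label_if_few <- function(x, values, meanings, max_n_labels = 30) {
  n_labels <- length(values)
  if (!is.null(max_n_labels) && n_labels > max_n_labels) {
    return(x) # too many labels: leave the coded values unchanged
  }
  factor(x, levels = values, labels = meanings)
}

sex_raw <- c("0", "0", "1", NA)

# Two labels, default threshold of 30: values become a factor
sex_labelled <- label_if_few(
  sex_raw,
  values = c("0", "1"),
  meanings = c("Female", "Male")
)

# Threshold of 1 is below the number of labels: values stay as characters
sex_unlabelled <- label_if_few(
  sex_raw,
  values = c("0", "1"),
  meanings = c("Female", "Male"),
  max_n_labels = 1
)
```

Under this reading, a field with many codes would stay as a plain character column unless max_n_labels is raised or set to NULL.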

Value

A UK Biobank phenotype dataset as a data table with human-readable variable labels and data values.

Details

Note that na.strings is not recognised by read_dta. Reading in a STATA file may therefore require careful checking for empty strings that need converting to NA.
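
One way to carry out that check is sketched below in base R. The helper empty_strings_to_na() is a hypothetical name introduced for illustration and is not part of ukbwranglr; the toy data frame merely stands in for a dataset read from a .dta file.

```r
# Hypothetical helper (not part of ukbwranglr): replace empty strings with NA
# in every character column, leaving all other columns untouched.
empty_strings_to_na <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    if (is.character(x)) {
      x[x %in% ""] <- NA_character_
      df[[col]] <- x
    }
  }
  df
}

# Toy data standing in for the result of reading a STATA file
toy <- data.frame(
  eid = 1:3,
  `31-0.0` = c("0", "", "1"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

toy_clean <- empty_strings_to_na(toy)

# Number of missing values per column after the conversion
colSums(is.na(toy_clean))
```

Counting missing values before and after the conversion gives a quick indication of how many empty strings were present.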

Examples

library(magrittr)
# get dummy UKB data dictionary and codings
dummy_ukb_data_dict <- get_ukb_dummy("dummy_Data_Dictionary_Showcase.tsv")
dummy_ukb_codings <- get_ukb_dummy("dummy_Codings.tsv")

# file path to dummy UKB main dataset
dummy_ukb_main_path <- get_ukb_dummy("dummy_ukb_main.tsv", path_only = TRUE)

# read dummy UKB main dataset into R
read_ukb(
  path = dummy_ukb_main_path,
  ukb_data_dict = dummy_ukb_data_dict,
  ukb_codings = dummy_ukb_codings
) %>%
  # (convert to tibble for concise print method)
  tibble::as_tibble()
#> Creating data dictionary
#> STEP 1 of 3
#> Reading data into R
#> STEP 2 of 3
#> Renaming with descriptive column names
#> STEP 3 of 3
#> Applying variable and value labels
#> Labelling dataset
#> Time taken: 0 minutes, 0 seconds.
#> # A tibble: 10 × 71
#>      eid sex_f31_0_0 year_of_birth_f34_0_0 month_of_birth_f52_0_0
#>    <int> <fct>                       <int> <fct>                 
#>  1     1 Female                       1952 August                
#>  2     2 Female                       1946 March                 
#>  3     3 Male                         1951 April                 
#>  4     4 Female                       1956 September             
#>  5     5 NA                             NA April                 
#>  6     6 Male                         1948 February              
#>  7     7 Female                       1949 December              
#>  8     8 Male                         1956 October               
#>  9     9 Female                       1962 April                 
#> 10    10 Male                         1953 February              
#> # ℹ 67 more variables: ethnic_background_f21000_0_0 <fct>,
#> #   ethnic_background_f21000_1_0 <fct>, ethnic_background_f21000_2_0 <fct>,
#> #   body_mass_index_bmi_f21001_0_0 <dbl>, body_mass_index_bmi_f21001_1_0 <dbl>,
#> #   body_mass_index_bmi_f21001_2_0 <dbl>,
#> #   systolic_blood_pressure_automated_reading_f4080_0_0 <int>,
#> #   systolic_blood_pressure_automated_reading_f4080_0_1 <int>,
#> #   systolic_blood_pressure_automated_reading_f4080_0_2 <int>, …

# to read only a subset of variables, create a data dictionary and filter
# for selected variables, then supply to `read_ukb()`
data_dict_selected <- make_data_dict(
  ukb_main = dummy_ukb_main_path,
  ukb_data_dict = dummy_ukb_data_dict
) %>%
  dplyr::filter(FieldID %in% c("eid", "31", "34", "21001"))

read_ukb(
  path = dummy_ukb_main_path,
  data_dict = data_dict_selected,
  ukb_data_dict = dummy_ukb_data_dict,
  ukb_codings = dummy_ukb_codings
)
#> STEP 1 of 3
#> Reading data into R
#> STEP 2 of 3
#> Renaming with descriptive column names
#> STEP 3 of 3
#> Applying variable and value labels
#> Labelling dataset
#> Time taken: 0 minutes, 0 seconds.
#>       eid sex_f31_0_0 year_of_birth_f34_0_0 body_mass_index_bmi_f21001_0_0
#>     <int>      <fctr>                 <int>                          <num>
#>  1:     1      Female                  1952                        20.1115
#>  2:     2      Female                  1946                        30.1536
#>  3:     3        Male                  1951                        22.8495
#>  4:     4      Female                  1956                             NA
#>  5:     5        <NA>                    NA                        29.2752
#>  6:     6        Male                  1948                        28.2567
#>  7:     7      Female                  1949                             NA
#>  8:     8        Male                  1956                             NA
#>  9:     9      Female                  1962                        25.4016
#> 10:    10        Male                  1953                             NA
#>     body_mass_index_bmi_f21001_1_0 body_mass_index_bmi_f21001_2_0
#>                              <num>                          <num>
#>  1:                        20.8640                             NA
#>  2:                        20.2309                        27.4936
#>  3:                        26.7929                        27.6286
#>  4:                             NA                             NA
#>  5:                        19.7576                        14.6641
#>  6:                        30.2860                        27.3534
#>  7:                             NA                             NA
#>  8:                             NA                             NA
#>  9:                        21.9371                        24.4897
#> 10:                        25.1579                        30.0483

# set `descriptive_colnames` and `label` to FALSE to read the raw dataset as is
read_ukb(
  path = dummy_ukb_main_path,
  data_dict = data_dict_selected,
  ukb_data_dict = dummy_ukb_data_dict,
  ukb_codings = dummy_ukb_codings,
  descriptive_colnames = FALSE,
  label = FALSE
)
#> STEP 1 of 1
#> Reading data into R
#> Time taken: 0 minutes, 0 seconds.
#>       eid 31-0.0 34-0.0 21001-0.0 21001-1.0 21001-2.0
#>     <int> <char>  <int>     <num>     <num>     <num>
#>  1:     1      0   1952   20.1115   20.8640        NA
#>  2:     2      0   1946   30.1536   20.2309   27.4936
#>  3:     3      1   1951   22.8495   26.7929   27.6286
#>  4:     4      0   1956        NA        NA        NA
#>  5:     5   <NA>     NA   29.2752   19.7576   14.6641
#>  6:     6      1   1948   28.2567   30.2860   27.3534
#>  7:     7      0   1949        NA        NA        NA
#>  8:     8      1   1956        NA        NA        NA
#>  9:     9      0   1962   25.4016   21.9371   24.4897
#> 10:    10      1   1953        NA   25.1579   30.0483