Reads a UK Biobank main dataset file into R using either fread or read_dta. Optionally renames variables with descriptive names, adds variable labels and labels coded values of type character as factors.
Usage
read_ukb(
  path,
  delim = "auto",
  data_dict = NULL,
  ukb_data_dict = get_ukb_data_dict(),
  ukb_codings = get_ukb_codings(),
  descriptive_colnames = TRUE,
  label = TRUE,
  max_n_labels = 30,
  na.strings = c("", "NA"),
  nrows = Inf,
  ...
)
Arguments
- path
The path to a UK Biobank main dataset file.
- delim
Delimiter for the UKB main dataset file. Default is "auto" (see data.table::fread()). Ignored if the file name ends with .dta (i.e. is a STATA file) or if ukb_main is a data frame.
- data_dict
A data dictionary specific to the UKB main dataset file, generated by make_data_dict. To load only a selection of columns, supply a filtered copy of this data dictionary containing only the required variables. If NULL (default) then all fields will be read.
- ukb_data_dict
The UKB data dictionary (available online at the UK Biobank data showcase). This should be a data frame where all columns are of type character.
- ukb_codings
The UKB codings file (available online at the UK Biobank data showcase). This should be a data frame where all columns are of type character.
- descriptive_colnames
If TRUE, rename columns with longer descriptive names derived from the 'Field' column of the UK Biobank data dictionary.
- label
If TRUE, apply variable labels and label coded values as factors.
- max_n_labels
Coded variables whose number of associated value labels is less than or equal to this threshold will be labelled as factors. If NULL, then all value labels will be applied. Default value is 30.
- na.strings
A character vector of strings which are to be interpreted as NA values. By default, ",," for columns of all types, including type character, is read as NA for consistency. ,"", is unambiguous and read as an empty string. To read ,NA, as NA, set na.strings="NA". To read ,, as blank string "", set na.strings=NULL. When they occur in the file, the strings in na.strings should not appear quoted, since that is how the string literal ,"NA", is distinguished from ,NA, when, for example, na.strings="NA".
- nrows
The maximum number of rows to read. Unlike read.table, you do not need to set this to an estimate of the number of rows in the file for better speed, because that is already determined automatically by fread almost instantly using a large sample of lines. nrows=0 returns the column names and typed empty columns determined by the large sample; this is useful for a dry run of a large file or to quickly check format consistency of a set of files before starting to read any of them.
- ...
Additional parameters are passed on to either fread or read_dta.
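As a rough illustration of the na.strings and nrows arguments, the sketch below calls data.table::fread() directly on a few in-memory lines rather than on a UK Biobank file. The toy data and the names toy_lines, dt and header_only are made up for this example, and it assumes the data.table package is installed.

```r
library(data.table)

# Made-up toy data (not UK Biobank data): one blank field and one unquoted NA
toy_lines <- c("id,x", "1,a", "2,", "3,NA")

# Same `na.strings` value as the `read_ukb()` default:
# both blank fields and unquoted NA are read as missing
dt <- fread(text = toy_lines, na.strings = c("", "NA"))
is.na(dt$x)

# `nrows = 0` returns the column names with zero rows (a quick dry run)
header_only <- fread(text = toy_lines, nrows = 0)
names(header_only)
nrow(header_only)
```

Whether read_ukb() itself handles nrows = 0 in the same way is not shown here; the sketch only covers the underlying fread() behaviour.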
Value
A UK Biobank phenotype dataset as a data table with human-readable variable labels and data values.
Details
Note that na.strings is not recognised by read_dta. Reading in a STATA file may therefore require careful checking for empty strings that need converting to NA.
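One possible way to do this check after reading a STATA file is to convert empty strings to NA in the character columns. The following is a minimal base R sketch, not part of the package: dta_like is a made-up stand-in for data read from a .dta file, and empty_to_na() is a hypothetical helper written for a plain data frame.

```r
# Made-up stand-in for a dataset read from a .dta file (not real UKB data)
dta_like <- data.frame(
  eid = 1:3,
  sex = c("0", "", "1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace empty strings with NA in character columns only
empty_to_na <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) {
    x[x %in% ""] <- NA_character_
    x
  })
  df
}

cleaned <- empty_to_na(dta_like)
cleaned$sex
```

Non-character columns are left untouched, so numeric and integer fields keep their types.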
Examples
library(magrittr)
# get dummy UKB data dictionary and codings
dummy_ukb_data_dict <- get_ukb_dummy("dummy_Data_Dictionary_Showcase.tsv")
dummy_ukb_codings <- get_ukb_dummy("dummy_Codings.tsv")
# file path to dummy UKB main dataset
dummy_ukb_main_path <- get_ukb_dummy("dummy_ukb_main.tsv", path_only = TRUE)
# read dummy UKB main dataset into R
read_ukb(
  path = dummy_ukb_main_path,
  ukb_data_dict = dummy_ukb_data_dict,
  ukb_codings = dummy_ukb_codings
) %>%
  # (convert to tibble for concise print method)
  tibble::as_tibble()
#> Creating data dictionary
#> STEP 1 of 3
#> Reading data into R
#> STEP 2 of 3
#> Renaming with descriptive column names
#> STEP 3 of 3
#> Applying variable and value labels
#> Labelling dataset
#> Time taken: 0 minutes, 0 seconds.
#> # A tibble: 10 × 71
#> eid sex_f31_0_0 year_of_birth_f34_0_0 month_of_birth_f52_0_0
#> <int> <fct> <int> <fct>
#> 1 1 Female 1952 August
#> 2 2 Female 1946 March
#> 3 3 Male 1951 April
#> 4 4 Female 1956 September
#> 5 5 NA NA April
#> 6 6 Male 1948 February
#> 7 7 Female 1949 December
#> 8 8 Male 1956 October
#> 9 9 Female 1962 April
#> 10 10 Male 1953 February
#> # ℹ 67 more variables: ethnic_background_f21000_0_0 <fct>,
#> # ethnic_background_f21000_1_0 <fct>, ethnic_background_f21000_2_0 <fct>,
#> # body_mass_index_bmi_f21001_0_0 <dbl>, body_mass_index_bmi_f21001_1_0 <dbl>,
#> # body_mass_index_bmi_f21001_2_0 <dbl>,
#> # systolic_blood_pressure_automated_reading_f4080_0_0 <int>,
#> # systolic_blood_pressure_automated_reading_f4080_0_1 <int>,
#> # systolic_blood_pressure_automated_reading_f4080_0_2 <int>, …
# to read only a subset of variables, create a data dictionary and filter
# for selected variables, then supply to `read_ukb()`
data_dict_selected <- make_data_dict(
  ukb_main = dummy_ukb_main_path,
  ukb_data_dict = dummy_ukb_data_dict
) %>%
  dplyr::filter(FieldID %in% c("eid", "31", "34", "21001"))
read_ukb(
  path = dummy_ukb_main_path,
  data_dict = data_dict_selected,
  ukb_data_dict = dummy_ukb_data_dict,
  ukb_codings = dummy_ukb_codings
)
#> STEP 1 of 3
#> Reading data into R
#> STEP 2 of 3
#> Renaming with descriptive column names
#> STEP 3 of 3
#> Applying variable and value labels
#> Labelling dataset
#> Time taken: 0 minutes, 0 seconds.
#> eid sex_f31_0_0 year_of_birth_f34_0_0 body_mass_index_bmi_f21001_0_0
#> <int> <fctr> <int> <num>
#> 1: 1 Female 1952 20.1115
#> 2: 2 Female 1946 30.1536
#> 3: 3 Male 1951 22.8495
#> 4: 4 Female 1956 NA
#> 5: 5 <NA> NA 29.2752
#> 6: 6 Male 1948 28.2567
#> 7: 7 Female 1949 NA
#> 8: 8 Male 1956 NA
#> 9: 9 Female 1962 25.4016
#> 10: 10 Male 1953 NA
#> body_mass_index_bmi_f21001_1_0 body_mass_index_bmi_f21001_2_0
#> <num> <num>
#> 1: 20.8640 NA
#> 2: 20.2309 27.4936
#> 3: 26.7929 27.6286
#> 4: NA NA
#> 5: 19.7576 14.6641
#> 6: 30.2860 27.3534
#> 7: NA NA
#> 8: NA NA
#> 9: 21.9371 24.4897
#> 10: 25.1579 30.0483
# set `descriptive_colnames` and `label` to FALSE to read the raw dataset as is
read_ukb(
  path = dummy_ukb_main_path,
  data_dict = data_dict_selected,
  ukb_data_dict = dummy_ukb_data_dict,
  ukb_codings = dummy_ukb_codings,
  descriptive_colnames = FALSE,
  label = FALSE
)
#> STEP 1 of 1
#> Reading data into R
#> Time taken: 0 minutes, 0 seconds.
#> eid 31-0.0 34-0.0 21001-0.0 21001-1.0 21001-2.0
#> <int> <char> <int> <num> <num> <num>
#> 1: 1 0 1952 20.1115 20.8640 NA
#> 2: 2 0 1946 30.1536 20.2309 27.4936
#> 3: 3 1 1951 22.8495 26.7929 27.6286
#> 4: 4 0 1956 NA NA NA
#> 5: 5 <NA> NA 29.2752 19.7576 14.6641
#> 6: 6 1 1948 28.2567 30.2860 27.3534
#> 7: 7 0 1949 NA NA NA
#> 8: 8 1 1956 NA NA NA
#> 9: 9 0 1962 25.4016 21.9371 24.4897
#> 10: 10 1 1953 NA 25.1579 30.0483