
A convenience function that returns a data frame for a main UK Biobank dataset with a unique ID column, created by concatenating values from a selection of variables. Manual validation of any subsequent linkage is strongly advised.
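The idea of building an ID by concatenating values can be sketched in base R. This is an illustration only, not the package's implementation: the toy data, the column names and the "_" separator are all assumptions made for the example.

```r
# Toy data standing in for a UK Biobank main dataset (all values invented).
toy <- data.frame(
  sex = c("Female", "Male", "Female"),
  year_of_birth = c(1950L, 1950L, 1962L),
  month_of_birth = c("May", "May", "June"),
  stringsAsFactors = FALSE
)

# Columns whose values are combined into the ID (hypothetical selection).
id_source_cols <- c("sex", "year_of_birth", "month_of_birth")

# Paste the selected columns together row by row; the "_" separator is an
# assumption made for this sketch.
toy[["..unique_id"]] <- do.call(paste, c(toy[id_source_cols], sep = "_"))

# TRUE only if every row received a distinct ID.
ids_are_unique <- !anyDuplicated(toy[["..unique_id"]])
```

With few source columns, two participants can easily share the same combination of values, which is one reason manual validation of any linkage is advised.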

Usage

create_unique_id_df(
  path,
  delim = "\t",
  ukb_data_dict = ukbwranglr::get_ukb_data_dict(),
  ukb_codings = ukbwranglr::get_ukb_codings(),
  descriptive_colnames = TRUE,
  label = FALSE,
  field_ids = c("31", "52", "34", "21000", "53", "96", "50"),
  instances = "0",
  id_col = "..unique_id",
  remove = TRUE,
  .ignore_duplicate_ids = FALSE
)

Arguments

path

The path to a UK Biobank main dataset file.

delim

Delimiter for the UKB main dataset file. Default is "\t" (tab); see data.table::fread() for accepted values. Ignored if the file name ends with .dta (i.e. is a Stata file).

ukb_data_dict

The UKB data dictionary (available online at the UK Biobank data showcase). This should be a data frame where all columns are of type character.

ukb_codings

The UKB codings file (available online at the UK Biobank data showcase). This should be a data frame where all columns are of type character.

descriptive_colnames

If TRUE, rename columns with longer descriptive names derived from the UK Biobank's data dictionary 'Field' column.

label

If TRUE, apply variable labels and label coded values as factors.

field_ids

A character vector of field IDs that will be used to create the new unique ID column. These should match the values under column 'FieldID' in the UK Biobank data dictionary.

instances

A character vector of instances to include when generating the new unique ID column. Should contain one or more of the following digits: '0', '1', '2', '3'. Note that more recent datasets may include instances that are not present in older datasets. By default only the first instance is used.
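As a rough sketch of how field IDs and instances might jointly pick out columns, the base R snippet below filters raw column names. It assumes names of the form `<field>-<instance>.<array>` (for example "53-1.0"); the names used here are invented, and the package's own matching logic may differ.

```r
# Hypothetical raw column names in "<field>-<instance>.<array>" form.
raw_colnames <- c(
  "eid", "31-0.0", "34-0.0", "53-0.0", "53-1.0",
  "50-0.0", "50-1.0", "21000-2.0"
)

field_ids <- c("31", "34", "53")
instances <- "0"

# Split each name into its field ID and instance parts (NA for "eid",
# which does not follow the pattern).
pattern <- "^([0-9]+)-([0-9]+)\\.[0-9]+$"
matched <- grepl(pattern, raw_colnames)
col_field <- ifelse(matched, sub(pattern, "\\1", raw_colnames), NA_character_)
col_instance <- ifelse(matched, sub(pattern, "\\2", raw_colnames), NA_character_)

# Keep only columns whose field ID and instance were both requested.
selected <- raw_colnames[col_field %in% field_ids & col_instance %in% instances]
```

Here "53-1.0" is dropped because instance "1" was not requested, and the field 50 and 21000 columns are dropped because those field IDs are not listed.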

id_col

Name of the new column to be created.

remove

If TRUE, remove input columns from output data frame.

.ignore_duplicate_ids

If TRUE, allow duplicate ID values and raise a warning if any are found. May be helpful for debugging. By default this is FALSE.
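The strict and lenient behaviours can be sketched with a small base R helper. `check_unique_ids()` is a hypothetical function written for this example, not part of ukbwranglr, and it assumes that the default (strict) behaviour is to raise an error.

```r
# Hypothetical helper: error on duplicate IDs unless told to ignore them,
# in which case only a warning is raised.
check_unique_ids <- function(ids, ignore_duplicates = FALSE) {
  n_dup <- sum(duplicated(ids))
  if (n_dup > 0) {
    msg <- paste0(n_dup, " duplicated ID value(s) found")
    if (ignore_duplicates) warning(msg) else stop(msg)
  }
  invisible(ids)
}

ids <- c("a_1", "b_2", "a_1")

# Strict mode: the duplicate triggers an error, caught here for inspection.
strict_result <- tryCatch(check_unique_ids(ids), error = function(e) "error")

# Lenient mode: the same input only triggers a warning.
lenient_result <- tryCatch(
  check_unique_ids(ids, ignore_duplicates = TRUE),
  warning = function(w) "warning"
)
```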

Value

A data frame containing the new unique ID column (named by id_col).

See also