Skip to contents
library(epiUtils)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)

Introduction

The epiUtils package provides essential tools for epidemiological analysis, including:

  • Age-standardised rate calculations using direct standardisation
  • Person-years computation for cohort studies with age stratification
  • Built-in WHO 2000-2025 World Standard Population data

This vignette demonstrates the key functions using practical examples.

WHO World Standard Population

The package includes the WHO 2000-2025 World Standard Population data, which is commonly used for age standardisation in epidemiological studies.

# Load the WHO standard population data
data("who_2000_2025_standard_population")

# View the structure
head(who_2000_2025_standard_population)
#> # A tibble: 6 × 6
#>   age_group who_world_standard_perc…¹ recalculation_to_add…² rounded_to_integers
#>   <chr>                         <dbl>                  <dbl>               <int>
#> 1 [0-5)                          8.86                 88569.               88569
#> 2 [5-10)                         8.69                 86870.               86870
#> 3 [10-15)                        8.6                  85970.               85970
#> 4 [15-20)                        8.47                 84670.               84670
#> 5 [20-25)                        8.22                 82171.               82171
#> 6 [25-30)                        7.93                 79272.               79272
#> # ℹ abbreviated names: ¹​who_world_standard_percent,
#> #   ²​recalculation_to_add_to_1million
#> # ℹ 2 more variables: standard_for_seer_stat <int>, lower_age_limit <int>

# Summary of the data
str(who_2000_2025_standard_population)
#> tibble [21 × 6] (S3: tbl_df/tbl/data.frame)
#>  $ age_group                       : chr [1:21] "[0-5)" "[5-10)" "[10-15)" "[15-20)" ...
#>  $ who_world_standard_percent      : num [1:21] 8.86 8.69 8.6 8.47 8.22 7.93 7.61 7.15 6.59 6.04 ...
#>  $ recalculation_to_add_to_1million: num [1:21] 88569 86870 85970 84670 82171 ...
#>  $ rounded_to_integers             : int [1:21] 88569 86870 85970 84670 82171 79272 76073 71475 65877 60379 ...
#>  $ standard_for_seer_stat          : int [1:21] 88569 86870 85970 84670 82171 79272 76073 71475 65877 60379 ...
#>  $ lower_age_limit                 : int [1:21] 0 5 10 15 20 25 30 35 40 45 ...

Example 1: Calculate Person-Years by Age Strata

Let’s start with an example of calculating person-years for a cohort study. The default age groups now match the WHO 2000-2025 Standard Population, making it easy to calculate age-standardised rates.

# Create example cohort data
set.seed(123)
n_patients <- 100

cohort_data <- data.frame(
  patient_id = 1:n_patients,
  dob = as.Date("1950-01-01") + sample(-7300:7300, n_patients, replace = TRUE), # Random dates ±20 years
  study_entry_date = as.Date("2020-01-01") + sample(0:365, n_patients, replace = TRUE),
  study_exit_date = as.Date("2023-01-01") + sample(0:730, n_patients, replace = TRUE),
  event_status = sample(c(0L, 1L), n_patients, replace = TRUE, prob = c(0.8, 0.2))
)

# Ensure study_exit_date is after study_entry_date
cohort_data$study_exit_date <- pmax(
  cohort_data$study_exit_date, 
  cohort_data$study_entry_date + 30
)

# View first few rows
head(cohort_data)
#>   patient_id        dob study_entry_date study_exit_date event_status
#> 1          1 1936-10-03       2020-10-12      2023-05-05            0
#> 2          2 1936-11-20       2020-06-07      2023-09-22            0
#> 3          3 1958-07-16       2020-04-30      2023-07-05            1
#> 4          4 1953-11-18       2020-04-19      2024-07-26            0
#> 5          5 1964-03-10       2020-06-06      2023-09-09            0
#> 6          6 1938-03-10       2020-03-04      2024-04-02            0

# Calculate person-years by 5-year age groups (default - matches WHO age groups)
person_years_who <- calculate_person_years_by_age_strata(cohort_data)
print("Person-years by WHO-compatible age groups:")
#> [1] "Person-years by WHO-compatible age groups:"
print(person_years_who)
#> # A tibble: 21 × 4
#>    age_group person_years events     n
#>    <fct>            <dbl>  <int> <int>
#>  1 [0-5)                0      0     0
#>  2 [5-10)               0      0     0
#>  3 [10-15)              0      0     0
#>  4 [15-20)              0      0     0
#>  5 [20-25)              0      0     0
#>  6 [25-30)              0      0     0
#>  7 [30-35)              0      0     0
#>  8 [35-40)              0      0     0
#>  9 [40-45)              0      0     0
#> 10 [45-50)              0      0     0
#> # ℹ 11 more rows

# Calculate person-years using custom age groups (10-year groups for comparison)
person_years_10yr <- calculate_person_years_by_age_strata(
  cohort_data, 
  age_cut_points = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150)
)
print("Person-years by 10-year age groups:")
#> [1] "Person-years by 10-year age groups:"
print(person_years_10yr)
#> # A tibble: 11 × 4
#>    age_group person_years events     n
#>    <fct>            <dbl>  <int> <int>
#>  1 [0-10)             0        0     0
#>  2 [10-20)            0        0     0
#>  3 [20-30)            0        0     0
#>  4 [30-40)            0        0     0
#>  5 [40-50)            0        0     0
#>  6 [50-60)           77.0      5    24
#>  7 [60-70)          117.       7    39
#>  8 [70-80)           70.3      4    30
#>  9 [80-90)           74.7      1    28
#> 10 [90-100)          16.3      1     7
#> 11 [100+)             0        0     0

Example 2: Age-Standardised Rates with WHO Standard Population

Now let’s calculate age-standardised incidence rates using the WHO standard population. With the updated defaults, this is now much simpler since the age groups match automatically.

# Method 1: Direct calculation using WHO-compatible age groups (recommended)
# The default age groups now match WHO 2000-2025 Standard Population exactly

# First, let's verify the age groups match
print("Age groups from our person-years calculation:")
#> [1] "Age groups from our person-years calculation:"
print(levels(person_years_who$age_group))
#>  [1] "[0-5)"    "[5-10)"   "[10-15)"  "[15-20)"  "[20-25)"  "[25-30)" 
#>  [7] "[30-35)"  "[35-40)"  "[40-45)"  "[45-50)"  "[50-55)"  "[55-60)" 
#> [13] "[60-65)"  "[65-70)"  "[70-75)"  "[75-80)"  "[80-85)"  "[85-90)" 
#> [19] "[90-95)"  "[95-100)" "[100+)"

print("WHO 2000-2025 Standard Population age groups:")
#> [1] "WHO 2000-2025 Standard Population age groups:"
print(who_2000_2025_standard_population$age_group)
#>  [1] "[0-5)"    "[5-10)"   "[10-15)"  "[15-20)"  "[20-25)"  "[25-30)" 
#>  [7] "[30-35)"  "[35-40)"  "[40-45)"  "[45-50)"  "[50-55)"  "[55-60)" 
#> [13] "[60-65)"  "[65-70)"  "[70-75)"  "[75-80)"  "[80-85)"  "[85-90)" 
#> [19] "[90-95)"  "[95-100)" "[100+)"

print("Age groups match perfectly:", 
      identical(levels(person_years_who$age_group), who_2000_2025_standard_population$age_group))
#> [1] "Age groups match perfectly:"

# Merge with WHO standard population data directly
asr_data_who <- person_years_who |>
  left_join(
    who_2000_2025_standard_population |>
      select(age_group, standard_pop = standard_for_seer_stat),
    by = "age_group"
  ) |>
  filter(person_years > 0) |>  # Ensure positive person-years for all groups
  select(age_group, events, person_years, standard_pop)

print("Data prepared for ASR calculation (direct WHO matching):")
#> [1] "Data prepared for ASR calculation (direct WHO matching):"
print(asr_data_who)
#> # A tibble: 9 × 4
#>   age_group events person_years standard_pop
#>   <chr>      <int>        <dbl>        <int>
#> 1 [50-55)        0         28.5        53681
#> 2 [55-60)        5         48.5        45484
#> 3 [60-65)        4         50.6        37187
#> 4 [65-70)        3         66.5        29590
#> 5 [70-75)        0         42.7        22092
#> 6 [75-80)        4         27.6        15195
#> 7 [80-85)        0         42.2         9097
#> 8 [85-90)        1         32.5         4398
#> 9 [90-95)        1         16.3         1500

# Calculate age-standardised rates
asr_result_who <- calculate_asr_direct(asr_data_who)
#> Warning: ! Found age groups with small/zero case counts:
#>  zero cases: [50-55), [70-75), [80-85) and < 5 cases: [60-65) (4 cases),
#>   [65-70) (3 cases), [75-80) (4 cases), [85-90) (1 case), [90-95) (1 case)
#>  Small and zero case counts may produce unstable rate estimates
#>  Consider wider age groups, longer follow-up, or data quality checks

print("Age-Standardised Rate Results (WHO age groups):")
#> [1] "Age-Standardised Rate Results (WHO age groups):"
print(paste("Crude rate:", round(asr_result_who$crude_rate_scaled, 2), "per 100,000"))
#> [1] "Crude rate: 5064.91 per 100,000"
print(paste("Age-standardised rate:", round(asr_result_who$asr_scaled, 2), "per 100,000"))
#> [1] "Age-standardised rate: 5223.67 per 100,000"
print(paste("95% CI:", round(asr_result_who$asr_ci_lower_scaled, 2), "-", round(asr_result_who$asr_ci_upper_scaled, 2)))
#> [1] "95% CI: 2953.5 - 9563.6"
print(paste("Total events:", asr_result_who$total_events))
#> [1] "Total events: 18"
print(paste("Total person-years:", asr_result_who$total_person_years))
#> [1] "Total person-years: 355.386721423682"

# Method 2: Using custom age groups (for comparison)
# If you need custom age groups, you can still aggregate WHO data to match

# Create a mapping to match our 10-year age groups to WHO standard population
who_aggregated <- who_2000_2025_standard_population |>
  mutate(
    age_group_10yr = case_when(
      lower_age_limit < 10 ~ "[0-10)",
      lower_age_limit < 20 ~ "[10-20)",
      lower_age_limit < 30 ~ "[20-30)",
      lower_age_limit < 40 ~ "[30-40)",
      lower_age_limit < 50 ~ "[40-50)",
      lower_age_limit < 60 ~ "[50-60)",
      lower_age_limit < 70 ~ "[60-70)",
      lower_age_limit < 80 ~ "[70-80)",
      lower_age_limit < 90 ~ "[80-90)",
      lower_age_limit >= 90 ~ "[90+)"
    )
  ) |>
  group_by(age_group_10yr) |>
  summarise(
    standard_pop = sum(standard_for_seer_stat),
    .groups = "drop"
  )

# Merge with our 10-year person-years data
asr_data_10yr <- person_years_10yr |>
  left_join(who_aggregated, by = c("age_group" = "age_group_10yr")) |>
  filter(!is.na(standard_pop)) |>  # Remove any age groups not in WHO data
  filter(person_years > 0) |>  # Ensure positive person-years for all groups
  select(age_group, events, person_years, standard_pop)

print("Data prepared for ASR calculation (10-year age groups):")
#> [1] "Data prepared for ASR calculation (10-year age groups):"
print(asr_data_10yr)
#> # A tibble: 4 × 4
#>   age_group events person_years standard_pop
#>   <chr>      <int>        <dbl>        <int>
#> 1 [50-60)        5         77.0        99165
#> 2 [60-70)        7        117.         66777
#> 3 [70-80)        4         70.3        37287
#> 4 [80-90)        1         74.7        13495

# Calculate age-standardised rates using 10-year age groups
asr_result_10yr <- calculate_asr_direct(asr_data_10yr)
#> Warning: ! Found age groups with small/zero case counts:
#>  < 5 cases: [70-80) (4 cases), [80-90) (1 case)
#>  Small and zero case counts may produce unstable rate estimates
#>  Consider wider age groups, longer follow-up, or data quality checks

# Display comparison between WHO and 10-year age group results
print("Age-Standardised Rate Results (10-year age groups):")
#> [1] "Age-Standardised Rate Results (10-year age groups):"
print(paste("Crude rate:", round(asr_result_10yr$crude_rate_scaled, 2), "per 100,000"))
#> [1] "Crude rate: 5013.2 per 100,000"
print(paste("Age-standardised rate:", round(asr_result_10yr$asr_scaled, 2), "per 100,000"))
#> [1] "Age-standardised rate: 5877.1 per 100,000"
print(paste("95% CI:", round(asr_result_10yr$asr_ci_lower_scaled, 2), "-", round(asr_result_10yr$asr_ci_upper_scaled, 2)))
#> [1] "95% CI: 3198.72 - 10177.88"

# Compare the two approaches
print("Comparison of age standardisation approaches:")
#> [1] "Comparison of age standardisation approaches:"
print(paste("WHO age groups ASR:", round(asr_result_who$asr_scaled, 2), "per 100,000"))
#> [1] "WHO age groups ASR: 5223.67 per 100,000"
print(paste("10-year age groups ASR:", round(asr_result_10yr$asr_scaled, 2), "per 100,000"))
#> [1] "10-year age groups ASR: 5877.1 per 100,000"

# View age-specific details from WHO age groups
print("Age-specific details (WHO age groups):")
#> [1] "Age-specific details (WHO age groups):"
print(head(asr_result_who$age_specific_data[[1]], 10))
#> # A tibble: 9 × 8
#>   age_group events person_years standard_pop age_specific_rate std_weight
#>   <chr>      <int>        <dbl>        <int>             <dbl>      <dbl>
#> 1 [50-55)        0         28.5        53681            0         0.246  
#> 2 [55-60)        5         48.5        45484            0.103     0.208  
#> 3 [60-65)        4         50.6        37187            0.0791    0.170  
#> 4 [65-70)        3         66.5        29590            0.0451    0.136  
#> 5 [70-75)        0         42.7        22092            0         0.101  
#> 6 [75-80)        4         27.6        15195            0.145     0.0696 
#> 7 [80-85)        0         42.2         9097            0         0.0417 
#> 8 [85-90)        1         32.5         4398            0.0308    0.0202 
#> 9 [90-95)        1         16.3         1500            0.0614    0.00687
#> # ℹ 2 more variables: age_specific_rate_scaled <dbl>, contribution_to_asr <dbl>

# Note: The WHO age groups provide more precise age standardisation
# while custom age groups may be useful for specific analytical needs

Handling Small Case Counts

The calculate_asr_direct() function includes built-in warnings for small case counts, which help users make informed decisions about age grouping strategies. The warnings are raised regardless of the confidence interval method used.

# Create example data with small case counts to demonstrate warnings
small_cases_data <- data.frame(
  age_group = c("0-20", "20-40", "40-60", "60-80"),
  events = c(0L, 2L, 4L, 15L),  # Mix of zero and small case counts
  person_years = c(10000, 15000, 20000, 18000),
  standard_pop = c(25000, 30000, 20000, 15000)
)

print("Data with small case counts:")
#> [1] "Data with small case counts:"
print(small_cases_data)
#>   age_group events person_years standard_pop
#> 1      0-20      0        10000        25000
#> 2     20-40      2        15000        30000
#> 3     40-60      4        20000        20000
#> 4     60-80     15        18000        15000

# Calculate ASR - will show warnings about small case counts
print("With warnings (default):")
#> [1] "With warnings (default):"
result_with_warnings <- calculate_asr_direct(small_cases_data)
#> Warning: ! Found age groups with small/zero case counts:
#>  zero cases: 0-20 and < 5 cases: 20-40 (2 cases), 40-60 (4 cases)
#>  Small and zero case counts may produce unstable rate estimates
#>  Consider wider age groups, longer follow-up, or data quality checks

# Calculate ASR with warnings disabled
print("With warnings disabled:")
#> [1] "With warnings disabled:"
result_no_warnings <- calculate_asr_direct(small_cases_data, warn_small_cases = FALSE)

# Results are identical regardless of warning setting
print(paste("ASR:", round(result_with_warnings$asr_scaled, 2), "per 100,000"))
#> [1] "ASR: 22.78 per 100,000"

Confidence Interval Methods

The function supports two methods for calculating confidence intervals, controlled by the ci_method parameter. Understanding the differences helps you choose the most appropriate method for your analysis.

Method Overview

  1. “gamma” (default): Uses gamma distribution approach
  2. “byars”: Uses Byar’s approximation with Dobson adjustment

Technical Differences

Gamma Method:

  • Based on the gamma distribution, which naturally models positive rate data
  • Automatically ensures confidence intervals are always positive (cannot be negative)
  • Particularly robust with small case counts and low rates
  • Consistent with epitools::ageadjust.direct()
  • Uses shape and scale parameters derived from the observed data

Byar’s/Dobson Method:

  • Traditional epidemiological approach dating to the 1970s-80s
  • Uses normal approximation with continuity corrections for larger counts
  • Uses exact Poisson-based intervals for very small counts (<10 total cases)
  • Consistent with PHEindicatormethods::calculate_dsr()
# Compare confidence interval methods using the same data
print("Gamma method (default):")
#> [1] "Gamma method (default):"
result_gamma <- calculate_asr_direct(small_cases_data, warn_small_cases = FALSE)
print(paste("ASR:", round(result_gamma$asr_scaled, 2), 
            "95% CI:", round(result_gamma$asr_ci_lower_scaled, 2), "-", 
            round(result_gamma$asr_ci_upper_scaled, 2)))
#> [1] "ASR: 22.78 95% CI: 13.66 - 38.47"

print("Byar's/Dobson method:")
#> [1] "Byar's/Dobson method:"
result_byars <- calculate_asr_direct(small_cases_data, ci_method = "byars", warn_small_cases = FALSE)
print(paste("ASR:", round(result_byars$asr_scaled, 2), 
            "95% CI:", round(result_byars$asr_ci_lower_scaled, 2), "-", 
            round(result_byars$asr_ci_upper_scaled, 2)))
#> [1] "ASR: 22.78 95% CI: 12.2 - 33.36"

# The point estimates are identical, but confidence intervals may differ
print("Point estimates are identical:", 
      round(result_gamma$asr_scaled, 10) == round(result_byars$asr_scaled, 10))
#> [1] "Point estimates are identical:"

# Calculate the difference in CI width
gamma_width <- result_gamma$asr_ci_upper_scaled - result_gamma$asr_ci_lower_scaled
byars_width <- result_byars$asr_ci_upper_scaled - result_byars$asr_ci_lower_scaled
print(paste("CI width difference:", round(abs(gamma_width - byars_width), 2)))
#> [1] "CI width difference: 3.64"

Practical Implications

When results differ:

  • Differences are typically small for moderate to large case counts (>20 total cases)
  • More noticeable differences may occur with very small case counts or low rates
  • The gamma method tends to be slightly more conservative (wider intervals) with extremely small counts
  • Both methods are statistically valid and widely accepted

Method selection guidance:

  • Use gamma (default) for most analyses, especially with small case counts or low rates

  • Use byars when:

    • Comparing with historical studies that used Byar’s method
    • Institutional requirements specify this traditional approach
  • Both methods converge to similar results as sample sizes increase

Example 3: Using WHO Standard Population Directly

For a more straightforward example, let’s create some synthetic cancer incidence data that directly matches the WHO standard population age groups.

# Create synthetic cancer incidence data using WHO age groups
set.seed(456)

# Use WHO age groups directly (first 15 age groups, up to 75+)
cancer_data <- who_2000_2025_standard_population |>
  slice(1:15) |>  # Use first 15 age groups (0-74)
  mutate(
    # Simulate population sizes (person-years)
    person_years = round(runif(n(), min = 50000, max = 200000)),
    # Simulate cancer cases with age-specific rates
    # Higher rates in older age groups
    age_specific_rate = case_when(
      lower_age_limit < 20 ~ runif(1, 0.00001, 0.00005),
      lower_age_limit < 40 ~ runif(1, 0.00005, 0.0002),
      lower_age_limit < 60 ~ runif(1, 0.0002, 0.001),
      TRUE ~ runif(1, 0.001, 0.003)
    ),
    events = as.integer(round(person_years * age_specific_rate))
  ) |>
  select(age_group, events, person_years, standard_pop = standard_for_seer_stat)

print("Synthetic cancer incidence data:")
#> [1] "Synthetic cancer incidence data:"
print(cancer_data)
#> # A tibble: 15 × 4
#>    age_group events person_years standard_pop
#>    <chr>      <int>        <dbl>        <int>
#>  1 [0-5)          2        63433        88569
#>  2 [5-10)         3        81577        86870
#>  3 [10-15)        6       159943        85970
#>  4 [15-20)        6       177820        84670
#>  5 [20-25)       30       168260        82171
#>  6 [25-30)       18        99794        79272
#>  7 [30-35)       11        62365        76073
#>  8 [35-40)       16        92829        71475
#>  9 [40-45)       48        85625        65877
#> 10 [45-50)       61       107785        60379
#> 11 [50-55)       60       105942        53681
#> 12 [55-60)       47        82686        45484
#> 13 [60-65)      397       163266        37187
#> 14 [65-70)      422       173252        29590
#> 15 [70-75)      340       139838        22092

# Calculate age-standardised incidence rates
cancer_asr <- calculate_asr_direct(cancer_data)
#> Warning: ! Found age groups with small/zero case counts:
#>  < 5 cases: [0-5) (2 cases), [5-10) (3 cases)
#>  Small and zero case counts may produce unstable rate estimates
#>  Consider wider age groups, longer follow-up, or data quality checks

print("Cancer Incidence Results:")
#> [1] "Cancer Incidence Results:"
print(paste("Crude incidence rate:", round(cancer_asr$crude_rate_scaled, 1), "per 100,000"))
#> [1] "Crude incidence rate: 83.1 per 100,000"
print(paste("Age-standardised incidence rate:", round(cancer_asr$asr_scaled, 1), "per 100,000"))
#> [1] "Age-standardised incidence rate: 42.3 per 100,000"
print(paste("95% CI:", round(cancer_asr$asr_ci_lower_scaled, 1), "-", round(cancer_asr$asr_ci_upper_scaled, 1), "per 100,000"))
#> [1] "95% CI: 39.7 - 45.2 per 100,000"

Visualization

Let’s create some visualizations to better understand our results.

# Plot age-specific rates
age_specific_data <- cancer_asr$age_specific_data[[1]]

# Age-specific rates plot
p1 <- ggplot(age_specific_data, aes(x = age_group, y = age_specific_rate_scaled)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Age-Specific Cancer Incidence Rates",
    x = "Age Group",
    y = "Rate per 100,000"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p1)


# Contribution to ASR plot
p2 <- ggplot(age_specific_data, aes(x = age_group, y = contribution_to_asr * 100000)) +
  geom_col(fill = "darkgreen", alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Contribution of Each Age Group to Age-Standardised Rate",
    x = "Age Group",
    y = "Contribution per 100,000"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p2)


# Compare crude vs age-standardised rates
comparison_data <- data.frame(
  Rate_Type = c("Crude Rate", "Age-Standardised Rate"),
  Rate = c(cancer_asr$crude_rate_scaled, cancer_asr$asr_scaled),
  CI_Lower = c(NA, cancer_asr$asr_ci_lower_scaled),
  CI_Upper = c(NA, cancer_asr$asr_ci_upper_scaled)
)

p3 <- ggplot(comparison_data, aes(x = Rate_Type, y = Rate)) +
  geom_col(fill = c("orange", "purple"), alpha = 0.7, width = 0.5) +
  geom_errorbar(
    aes(ymin = CI_Lower, ymax = CI_Upper), 
    width = 0.2, 
    na.rm = TRUE
  ) +
  theme_minimal() +
  labs(
    title = "Crude vs Age-Standardised Cancer Incidence Rates",
    x = "Rate Type",
    y = "Rate per 100,000"
  )

print(p3)

Example 4: Age-Standardised Prevalence Rates

The calculate_asr_direct() function can also be used to calculate age-standardised prevalence rates. For prevalence studies, the key differences are:

  • events: Number of individuals with the condition at a point in time
  • person_years: Total number of individuals examined/surveyed (use population counts, not person-years of follow-up)
  • multiplier: Use 100 to express results as percentages (%) instead of rates per 100,000
# Example: Age-standardised prevalence of diabetes in a population survey
set.seed(789)

# Create synthetic diabetes prevalence data
diabetes_data <- data.frame(
  age_group = c("18-29", "30-39", "40-49", "50-59", "60-69", "70+"),
  
  # Number of individuals with diabetes (events)
  events = c(12L, 45L, 98L, 156L, 203L, 167L),
  
  # Total number of individuals surveyed in each age group (person_years = N individuals)
  person_years = c(1200, 1500, 1800, 1400, 1100, 800),
  
  # WHO standard population weights for corresponding age groups
  standard_pop = c(2435, 2021, 1826, 1485, 1118, 763)  # Simplified WHO weights
)

print("Diabetes prevalence survey data:")
#> [1] "Diabetes prevalence survey data:"
print(diabetes_data)
#>   age_group events person_years standard_pop
#> 1     18-29     12         1200         2435
#> 2     30-39     45         1500         2021
#> 3     40-49     98         1800         1826
#> 4     50-59    156         1400         1485
#> 5     60-69    203         1100         1118
#> 6       70+    167          800          763

# Calculate age-standardised prevalence using multiplier = 100 for percentages
diabetes_prevalence <- calculate_asr_direct(
  .df = diabetes_data,
  multiplier = 100,  # Use 100 to get percentages instead of per 100,000
  warn_small_cases = FALSE
)

print("Diabetes Prevalence Results:")
#> [1] "Diabetes Prevalence Results:"
print(paste("Crude prevalence:", round(diabetes_prevalence$crude_rate_scaled, 2), "%"))
#> [1] "Crude prevalence: 8.73 %"
print(paste("Age-standardised prevalence:", round(diabetes_prevalence$asr_scaled, 2), "%"))
#> [1] "Age-standardised prevalence: 7.42 %"
print(paste("95% CI:", round(diabetes_prevalence$asr_ci_lower_scaled, 2), "% -", 
            round(diabetes_prevalence$asr_ci_upper_scaled, 2), "%"))
#> [1] "95% CI: 6.86 % - 8.01 %"

# View age-specific prevalence rates
age_specific_prev <- diabetes_prevalence$age_specific_data[[1]]
print("Age-specific prevalence rates:")
#> [1] "Age-specific prevalence rates:"
print(age_specific_prev[, c("age_group", "events", "person_years", "age_specific_rate_scaled")])
#>   age_group events person_years age_specific_rate_scaled
#> 1     18-29     12         1200                 1.000000
#> 2     30-39     45         1500                 3.000000
#> 3     40-49     98         1800                 5.444444
#> 4     50-59    156         1400                11.142857
#> 5     60-69    203         1100                18.454545
#> 6       70+    167          800                20.875000

# Plot age-specific prevalence
library(ggplot2)
p_prev <- ggplot(age_specific_prev, aes(x = age_group, y = age_specific_rate_scaled)) +
  geom_col(fill = "darkred", alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Age-Specific Diabetes Prevalence",
    x = "Age Group",
    y = "Prevalence (%)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p_prev)

Key Points for Prevalence Studies:

  • Cross-sectional data: Use counts of individuals with/without condition at a specific time point
  • Population denominator: Use total number of individuals examined, not person-years of follow-up
  • Percentage scale: Set multiplier = 100 to express results as percentages (%)
  • Interpretation: Results represent the proportion of the population with the condition, standardised for age differences

Key Features Summary

calculate_person_years_by_age_strata()

  • Calculates person-years of follow-up by age groups
  • Default age groups now match WHO 2000-2025 Standard Population exactly
  • Handles age transitions during follow-up period automatically
  • Flexible age group definitions for custom analyses
  • Final age group displays as “[100+)” to match WHO format
  • Returns events and person-years by age strata

calculate_asr_direct()

  • Calculates age-standardised rates using direct standardisation
  • Works seamlessly with WHO age groups from person-years function
  • Supports both incidence rates and prevalence calculations
  • Two confidence interval methods: gamma distribution (default) and Byar’s/Dobson approximation
  • Smart warnings for small/zero case counts with user control
  • Returns comprehensive results including crude rates, ASRs, confidence intervals, and totals

WHO Standard Population Data

  • Ready-to-use WHO 2000-2025 World Standard Population
  • Multiple standard population formats available
  • Easily integrates with the package functions

This vignette demonstrates the basic usage of the key functions in epiUtils. For more detailed information about specific functions, see their individual help pages.