Announcement

Acknowledgement

Dr. Hua Zhou’s slides

brolgar tutorial

rm(list = ls()) # clean-up workspace

Longitudinal data

brolgar is an R package that helps browse longitudinal data graphically and analytically in R.

individuals repeatedly measured through time

Installing brolgar

Install from GitHub:

# install.packages("remotes")
remotes::install_github("njtierney/brolgar")
## Skipping install of 'brolgar' from a github remote, the SHA1 (df18fa72) has not changed since last install.
##   Use `force = TRUE` to force installation

An example longitudinal data

Load tidyverse, brolgar and gghighlight:

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(brolgar)
library(gghighlight)

The csv file wages_pp.csv contains a data set from the textbook Applied Longitudinal Data Analysis (2003) by Singer and Willett. It contains measurements of hourly wages by years in the workforce, with education and race as covariates.

head wages_pp.csv
## "id","lnw","exper","ged","postexp","black","hispanic","hgc","hgc.9","uerate","ue.7","ue.centert1","ue.mean","ue.person.cen","ue1"
## 31,1.491,0.015,1,0.015,0,1,8,-1,3.215,-3.785,0,3.215,0,3.215
## 31,1.433,0.715,1,0.715,0,1,8,-1,3.215,-3.785,0,3.215,0,3.215
## 31,1.469,1.734,1,1.734,0,1,8,-1,3.215,-3.785,0,3.215,0,3.215
## 31,1.749,2.773,1,2.773,0,1,8,-1,3.295,-3.705,0.08,3.215,0.08,3.215
## 31,1.931,3.927,1,3.927,0,1,8,-1,2.895,-4.105,-0.32,3.215,-0.32,3.215
## 31,1.709,4.946,1,4.946,0,1,8,-1,2.495,-4.505,-0.72,3.215,-0.72,3.215
## 31,2.086,5.965,1,5.965,0,1,8,-1,2.595,-4.405,-0.62,3.215,-0.62,3.215
## 31,2.129,6.984,1,6.984,0,1,8,-1,4.795,-2.205,1.58,3.215,1.58,3.215
## 36,1.982,0.315,1,0.315,0,0,9,0,4.895,-2.105,0,5.0965,-0.201500000000000,4.895
The brolgar package also ships a cleaned-up version of this data set as the built-in tsibble wages:

wages
## # A tsibble: 6,402 x 9 [!]
## # Key:       id [888]
##       id ln_wages    xp   ged xp_since_ged black hispanic high_grade
##    <int>    <dbl> <dbl> <int>        <dbl> <int>    <int>      <int>
##  1    31     1.49 0.015     1        0.015     0        1          8
##  2    31     1.43 0.715     1        0.715     0        1          8
##  3    31     1.47 1.73      1        1.73      0        1          8
##  4    31     1.75 2.77      1        2.77      0        1          8
##  5    31     1.93 3.93      1        3.93      0        1          8
##  6    31     1.71 4.95      1        4.95      0        1          8
##  7    31     2.09 5.96      1        5.96      0        1          8
##  8    31     2.13 6.98      1        6.98      0        1          8
##  9    36     1.98 0.315     1        0.315     0        0          9
## 10    36     1.80 0.983     1        0.983     0        0          9
## # … with 6,392 more rows, and 1 more variable: unemploy_rate <dbl>

Read the following variables into a tibble:
- id: subject identifier, 1-888.
- lnw: natural log of wages, adjusted for inflation to 1990 dollars.
- exper: experience, i.e., the length of time in the workforce (in years). The number and values of the time points can differ across subjects.
- ged: when/if a graduate equivalency diploma is obtained.
- postexp: change in experience since getting a GED (if one was obtained).
- black: categorical indicator of race = black.
- hispanic: categorical indicator of race = hispanic.
- hgc: highest grade completed.
- uerate: unemployment rate in the local geographic region at each measurement time.

wages <- read_csv("wages_pp.csv") %>% select(1:8, 10)
## Parsed with column specification:
## cols(
##   id = col_double(),
##   lnw = col_double(),
##   exper = col_double(),
##   ged = col_double(),
##   postexp = col_double(),
##   black = col_double(),
##   hispanic = col_double(),
##   hgc = col_double(),
##   hgc.9 = col_double(),
##   uerate = col_double(),
##   ue.7 = col_double(),
##   ue.centert1 = col_double(),
##   ue.mean = col_double(),
##   ue.person.cen = col_double(),
##   ue1 = col_double()
## )
wages %>% print(width = Inf)
## # A tibble: 6,402 x 9
##       id   lnw exper   ged postexp black hispanic   hgc uerate
##    <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>    <dbl> <dbl>  <dbl>
##  1    31  1.49 0.015     1   0.015     0        1     8   3.22
##  2    31  1.43 0.715     1   0.715     0        1     8   3.22
##  3    31  1.47 1.73      1   1.73      0        1     8   3.22
##  4    31  1.75 2.77      1   2.77      0        1     8   3.30
##  5    31  1.93 3.93      1   3.93      0        1     8   2.90
##  6    31  1.71 4.95      1   4.95      0        1     8   2.50
##  7    31  2.09 5.96      1   5.96      0        1     8   2.60
##  8    31  2.13 6.98      1   6.98      0        1     8   4.80
##  9    36  1.98 0.315     1   0.315     0        0     9   4.89
## 10    36  1.80 0.983     1   0.983     0        0     9   7.4 
## # … with 6,392 more rows

Turn data frame into a tsibble

We turn a regular data frame/tibble into a tidy time series data frame (tsibble) by identifying
1. The key: the variable that identifies individuals.
2. The index: the time component of the data.
3. The regularity of the time interval (index). Longitudinal data typically have irregular time periods between measurements, but can have regular measurements.

wages <- as_tsibble(x = wages,
                    key = id,
                    index = exper,
                    regular = FALSE)
wages
## # A tsibble: 6,402 x 9 [!]
## # Key:       id [888]
##       id   lnw exper   ged postexp black hispanic   hgc uerate
##    <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>    <dbl> <dbl>  <dbl>
##  1    31  1.49 0.015     1   0.015     0        1     8   3.22
##  2    31  1.43 0.715     1   0.715     0        1     8   3.22
##  3    31  1.47 1.73      1   1.73      0        1     8   3.22
##  4    31  1.75 2.77      1   2.77      0        1     8   3.30
##  5    31  1.93 3.93      1   3.93      0        1     8   2.90
##  6    31  1.71 4.95      1   4.95      0        1     8   2.50
##  7    31  2.09 5.96      1   5.96      0        1     8   2.60
##  8    31  2.13 6.98      1   6.98      0        1     8   4.80
##  9    36  1.98 0.315     1   0.315     0        0     9   4.89
## 10    36  1.80 0.983     1   0.983     0        0     9   7.4 
## # … with 6,392 more rows
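
As a quick sanity check (a sketch; tsibble::n_keys() is assumed here since brolgar builds on tsibble), the number of keys should match the 888 individuals reported in the printed header:

# count the number of keys (individuals) in the tsibble;
# the header above reports id [888]
tsibble::n_keys(wages)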

Display trajectories of a few individuals

Use sample_n_keys() and sample_frac_keys() to take a random sample of keys.

set.seed(203)
wages %>%
  sample_n_keys(size = 5) %>%
  ggplot(aes(x = exper,
             y = lnw,
             group = id)) + 
  geom_line()
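
sample_frac_keys() works the same way but samples a fraction of the keys instead of a fixed number. A minimal sketch (the 1% fraction is arbitrary):

set.seed(203)
wages %>%
  sample_frac_keys(size = 0.01) %>% # keep roughly 1% of the 888 individuals
  ggplot(aes(x = exper,
             y = lnw,
             group = id)) + 
  geom_line()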

facet_sample() allows you to specify the number of keys per facet and the number of facets with the n_per_facet and n_facets arguments. By default, it splits the data into 12 facets with 3 keys per facet:

set.seed(203)
ggplot(wages,
       aes(x = exper,
           y = lnw,
           group = id)) +
  geom_line() +
  facet_sample()
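
To override the defaults, pass n_per_facet and n_facets explicitly. A sketch (the particular numbers are arbitrary):

set.seed(203)
ggplot(wages,
       aes(x = exper,
           y = lnw,
           group = id)) +
  geom_line() +
  facet_sample(n_per_facet = 5, # 5 individuals per panel
               n_facets = 6)    # 6 panels in total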

Find features in longitudinal data

Five number summaries of wages for each individual:

wages.summary <- wages %>%
  features(lnw, feat_five_num)
wages.summary %>% print(width = Inf)
## # A tibble: 888 x 6
##       id   min   q25   med   q75   max
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1    31 1.43   1.48  1.73  2.02  2.13
##  2    36 1.80   1.97  2.32  2.59  2.93
##  3    53 1.54   1.58  1.71  1.89  3.24
##  4   122 0.763  2.10  2.19  2.46  2.92
##  5   134 2.00   2.28  2.36  2.79  2.93
##  6   145 1.48   1.58  1.77  1.89  2.04
##  7   155 1.54   1.83  2.22  2.44  2.64
##  8   173 1.56   1.68  2.00  2.05  2.34
##  9   206 2.03   2.07  2.30  2.45  2.48
## 10   207 1.58   1.87  2.15  2.26  2.66
## # … with 878 more rows
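
These per-individual summaries are ordinary columns, so they can be plotted directly. For example, a sketch of the distribution of the per-individual median log-wages (med in the output above):

wages.summary %>%
  ggplot(aes(x = med)) + # med = per-individual median of lnw
  geom_histogram(bins = 30)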

Other feature functions include feat_monotonic, feat_ranges, feat_spread, feat_three_num, and n_obs. For example, find those whose wages only increase or decrease with feat_monotonic:

wages %>%
  features(lnw, feat_monotonic)
## # A tibble: 888 x 5
##       id increase decrease unvary monotonic
##    <dbl> <lgl>    <lgl>    <lgl>  <lgl>    
##  1    31 FALSE    FALSE    FALSE  FALSE    
##  2    36 FALSE    FALSE    FALSE  FALSE    
##  3    53 FALSE    FALSE    FALSE  FALSE    
##  4   122 FALSE    FALSE    FALSE  FALSE    
##  5   134 FALSE    FALSE    FALSE  FALSE    
##  6   145 FALSE    FALSE    FALSE  FALSE    
##  7   155 FALSE    FALSE    FALSE  FALSE    
##  8   173 FALSE    FALSE    FALSE  FALSE    
##  9   206 TRUE     FALSE    FALSE  TRUE     
## 10   207 FALSE    FALSE    FALSE  FALSE    
## # … with 878 more rows
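
The other feature functions follow the same pattern: swap in the feature set you want. A sketch using feat_ranges and feat_spread (the columns they return follow the brolgar documentation):

# range-type summaries (e.g. min and max) of lnw for each individual
wages %>%
  features(lnw, feat_ranges)

# spread-type summaries (e.g. variance) of lnw for each individual
wages %>%
  features(lnw, feat_spread)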

Find the number of observations per individual and summarize them with a bar plot:

wages %>%
  features(lnw, n_obs) %>%
  ggplot(aes(x = n_obs)) + 
    geom_bar()  
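
A common follow-up is to drop individuals with very few measurements. A sketch using brolgar's add_n_obs(), which appends an n_obs column; the threshold of 3 observations is chosen here only for illustration:

wages %>%
  add_n_obs() %>%        # append n_obs, the number of measurements per id
  filter(n_obs >= 3) %>% # keep individuals measured at least 3 times
  ggplot(aes(x = exper,
             y = lnw,
             group = id)) + 
  geom_line()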

Linking individuals back to the data

Once you have created these features, you can join them back to the data with a left_join:

wages %>%
  features(lnw, feat_monotonic) %>%
  left_join(wages, by = "id") %>%
  ggplot(aes(x = exper,
             y = lnw,
             group = id)) +
  geom_line() + 
  gghighlight(increase)
## Warning: Tried to calculate with group_by(), but the calculation failed.
## Falling back to ungrouped filter operation...
## label_key: id
## Too many data series, skip labeling
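
The same pattern works for any logical feature. For example, a sketch highlighting the individuals whose wages only decrease:

wages %>%
  features(lnw, feat_monotonic) %>%
  left_join(wages, by = "id") %>%
  ggplot(aes(x = exper,
             y = lnw,
             group = id)) +
  geom_line() + 
  gghighlight(decrease) # highlight ids flagged as monotonically decreasing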