HW1 graded
HW2 posted, due Oct 9
broglar is an R package that helps browse over longitudinal data graphically and analytically in R.
individuals repeatedly measured through time
brolgar
Install from GitHub:
# install.packages("remotes")
remotes::install_github("njtierney/brolgar")
## Skipping install of 'brolgar' from a github remote, the SHA1 (df18fa72) has not changed since last install.
## Use `force = TRUE` to force installation
Load tidyverse
, brolgar
and gghighlight
:
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(brolgar)
library(gghighlight)
The csv file wages_pp.csv
contains a data set from the textbook Applied Longitudical Data Analysis (2003) by Singer and Willett. This data contains measurements on hourly wages by years in the workforce, with education and race as covariates.
head wages_pp.csv
## "id","lnw","exper","ged","postexp","black","hispanic","hgc","hgc.9","uerate","ue.7","ue.centert1","ue.mean","ue.person.cen","ue1"
## 31,1.491,0.015,1,0.015,0,1,8,-1,3.215,-3.785,0,3.215,0,3.215
## 31,1.433,0.715,1,0.715,0,1,8,-1,3.215,-3.785,0,3.215,0,3.215
## 31,1.469,1.734,1,1.734,0,1,8,-1,3.215,-3.785,0,3.215,0,3.215
## 31,1.749,2.773,1,2.773,0,1,8,-1,3.295,-3.705,0.08,3.215,0.08,3.215
## 31,1.931,3.927,1,3.927,0,1,8,-1,2.895,-4.105,-0.32,3.215,-0.32,3.215
## 31,1.709,4.946,1,4.946,0,1,8,-1,2.495,-4.505,-0.72,3.215,-0.72,3.215
## 31,2.086,5.965,1,5.965,0,1,8,-1,2.595,-4.405,-0.62,3.215,-0.62,3.215
## 31,2.129,6.984,1,6.984,0,1,8,-1,4.795,-2.205,1.58,3.215,1.58,3.215
## 36,1.982,0.315,1,0.315,0,0,9,0,4.895,-2.105,0,5.0965,-0.201500000000000,4.895
wages
## # A tsibble: 6,402 x 9 [!]
## # Key: id [888]
## id ln_wages xp ged xp_since_ged black hispanic high_grade
## <int> <dbl> <dbl> <int> <dbl> <int> <int> <int>
## 1 31 1.49 0.015 1 0.015 0 1 8
## 2 31 1.43 0.715 1 0.715 0 1 8
## 3 31 1.47 1.73 1 1.73 0 1 8
## 4 31 1.75 2.77 1 2.77 0 1 8
## 5 31 1.93 3.93 1 3.93 0 1 8
## 6 31 1.71 4.95 1 4.95 0 1 8
## 7 31 2.09 5.96 1 5.96 0 1 8
## 8 31 2.13 6.98 1 6.98 0 1 8
## 9 36 1.98 0.315 1 0.315 0 0 9
## 10 36 1.80 0.983 1 0.983 0 0 9
## # … with 6,392 more rows, and 1 more variable: unemploy_rate <dbl>
Read following variables into a tibble:
- id
: 1-888, for each subject.
- lnw
: natural log of wages, adjusted for inflation, to 1990 dollars.
- exper
: Experience - the length of time in the workforce (in years). The number of time points and values of time points for each subject can differ.
- ged
: when/if a graduate equivalency diploma is obtained.
- postexp
: change in experience since getting a ged (if they get one).
- black
: categorical indicator of race = black.
- hispanic
: categorical indicator of race = hispanic.
- hgc
: highest grade completed.
- uerate
: unemployment rates in the local geographic region at each measurement time.
wages <- read_csv("wages_pp.csv") %>% select(1:8, 10)
## Parsed with column specification:
## cols(
## id = col_double(),
## lnw = col_double(),
## exper = col_double(),
## ged = col_double(),
## postexp = col_double(),
## black = col_double(),
## hispanic = col_double(),
## hgc = col_double(),
## hgc.9 = col_double(),
## uerate = col_double(),
## ue.7 = col_double(),
## ue.centert1 = col_double(),
## ue.mean = col_double(),
## ue.person.cen = col_double(),
## ue1 = col_double()
## )
wages %>% print(width = Inf)
## # A tibble: 6,402 x 9
## id lnw exper ged postexp black hispanic hgc uerate
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 31 1.49 0.015 1 0.015 0 1 8 3.22
## 2 31 1.43 0.715 1 0.715 0 1 8 3.22
## 3 31 1.47 1.73 1 1.73 0 1 8 3.22
## 4 31 1.75 2.77 1 2.77 0 1 8 3.30
## 5 31 1.93 3.93 1 3.93 0 1 8 2.90
## 6 31 1.71 4.95 1 4.95 0 1 8 2.50
## 7 31 2.09 5.96 1 5.96 0 1 8 2.60
## 8 31 2.13 6.98 1 6.98 0 1 8 4.80
## 9 36 1.98 0.315 1 0.315 0 0 9 4.89
## 10 36 1.80 0.983 1 0.983 0 0 9 7.4
## # … with 6,392 more rows
tsibble
We turn a regular data frame/tibble into a tidy time series data frame tsibble
by identifying
1. The key variable is the identifier of individuals.
2. The index variable is the time component of your data.
3. The regularity of the time interveral (index). Longitudinal data typically has irregular time periods between measurements, but can have regular measurements.
wages <- as_tsibble(x = wages,
key = id,
index = exper,
regular = FALSE)
wages
## # A tsibble: 6,402 x 9 [!]
## # Key: id [888]
## id lnw exper ged postexp black hispanic hgc uerate
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 31 1.49 0.015 1 0.015 0 1 8 3.22
## 2 31 1.43 0.715 1 0.715 0 1 8 3.22
## 3 31 1.47 1.73 1 1.73 0 1 8 3.22
## 4 31 1.75 2.77 1 2.77 0 1 8 3.30
## 5 31 1.93 3.93 1 3.93 0 1 8 2.90
## 6 31 1.71 4.95 1 4.95 0 1 8 2.50
## 7 31 2.09 5.96 1 5.96 0 1 8 2.60
## 8 31 2.13 6.98 1 6.98 0 1 8 4.80
## 9 36 1.98 0.315 1 0.315 0 0 9 4.89
## 10 36 1.80 0.983 1 0.983 0 0 9 7.4
## # … with 6,392 more rows
sample_n_keys()
and sample_frac_keys()
to take a random sample of keys.
set.seed(203)
wages %>%
sample_n_keys(size = 5) %>%
ggplot(aes(x = exper,
y = lnw,
group = id)) +
geom_line()
facet_sample()
allows you to specify the number of keys per facet, and the number of facets with n_per_facet
and n_facets
. By default, it splits the data into 12 facets with 3 per facet:
set.seed(203)
ggplot(wages,
aes(x = exper,
y = lnw,
group = id)) +
geom_line() +
facet_sample()
Five number summaries of wages for each individual:
wages.summary <- wages %>%
features(lnw, feat_five_num)
wages.summary %>% print(width = Inf)
## # A tibble: 888 x 6
## id min q25 med q75 max
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 31 1.43 1.48 1.73 2.02 2.13
## 2 36 1.80 1.97 2.32 2.59 2.93
## 3 53 1.54 1.58 1.71 1.89 3.24
## 4 122 0.763 2.10 2.19 2.46 2.92
## 5 134 2.00 2.28 2.36 2.79 2.93
## 6 145 1.48 1.58 1.77 1.89 2.04
## 7 155 1.54 1.83 2.22 2.44 2.64
## 8 173 1.56 1.68 2.00 2.05 2.34
## 9 206 2.03 2.07 2.30 2.45 2.48
## 10 207 1.58 1.87 2.15 2.26 2.66
## # … with 878 more rows
Other feature functions include feat_monotonic
, feat_ranges
, feat_spread
, feat_three_num
, n_obs
. For example, find those whose wages only increase or decrease with feat_monotonic:
wages %>%
features(lnw, feat_monotonic)
## # A tibble: 888 x 5
## id increase decrease unvary monotonic
## <dbl> <lgl> <lgl> <lgl> <lgl>
## 1 31 FALSE FALSE FALSE FALSE
## 2 36 FALSE FALSE FALSE FALSE
## 3 53 FALSE FALSE FALSE FALSE
## 4 122 FALSE FALSE FALSE FALSE
## 5 134 FALSE FALSE FALSE FALSE
## 6 145 FALSE FALSE FALSE FALSE
## 7 155 FALSE FALSE FALSE FALSE
## 8 173 FALSE FALSE FALSE FALSE
## 9 206 TRUE FALSE FALSE TRUE
## 10 207 FALSE FALSE FALSE FALSE
## # … with 878 more rows
Find the number of observations per individual and summarize by bar plot:
wages %>%
features(lnw, n_obs) %>%
ggplot(aes(x = n_obs)) +
geom_bar()
Once you have created these features, you can join them back to the data with a left_join
:
wages %>%
features(lnw, feat_monotonic) %>%
left_join(wages, by = "id") %>%
ggplot(aes(x = exper,
y = lnw,
group = id)) +
geom_line() +
gghighlight(increase)
## Warning: Tried to calculate with group_by(), but the calculation failed.
## Falling back to ungrouped filter operation...
## label_key: id
## Too many data series, skip labeling