window.Rmd
Time series come with a strict temporal order that dictate the type of operations that can be done. An example of operation is the moving average, where a window slides over the time order and averages of the response are computed on the temporal subset. The tsibble package provides three moving window operations, called verbs that operate on temporal data objects (nouns):
slide()
/slide2()
/pslide()
: sliding window with overlapping observations.tile()
/tile2()
/ptile()
: tiling window without overlapping observations.stretch()
/stretch2()
/pstretch()
: fixing an initial window and expanding to include more observations.These functions handle all sorts of objects and feature purrr-like interface. In this vignette, I will walk you through the slide()
and its variants, but the example snippets are also applicable to tile()
and stretch()
.
In spirit of purrr::map()
, slide()
accepts one input, slide2()
two inputs, and pslide()
multiple inputs, all of which always return lists for the sake of type stability. Other variants including slide_lgl()
, slide_int()
, slide_dbl()
, slide_chr()
return vectors of the corresponding type, as well as slide_dfr()
and slide_dfc()
for row-binding and column-binding data frames respectively. This full-fledged window family empowers users to build window-related workflows in all sorts of ways, from fixed window size to calendar periods, and from moving average to model fitting.
The pedestrian
dataset includes hourly pedestrian counts in the city of Melbourne, with Sensor
as key and Date_Time
as index. These windowed functions are index-based rolling for tackling general problems, rather than time indexed. Implicit missing values are thereby made explicit using fill_gaps()
, and .full = TRUE
warrants the equal time length of each sensor. This prepares the data inputs in the expected order.
library(dplyr)
library(tidyr)
library(tsibble)
pedestrian_full <- pedestrian %>%
fill_gaps(.full = TRUE)
pedestrian_full
#> # A tsibble: 70,176 x 5 [1h] <Australia/Melbourne>
#> # Key: Sensor [4]
#> Sensor Date_Time Date Time Count
#> <chr> <dttm> <date> <int> <int>
#> 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01 0 1630
#> 2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01 1 826
#> 3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01 2 567
#> 4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01 3 264
#> 5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01 4 139
#> # … with 7.017e+04 more rows
Moving average is one of the common techniques to smooth time series. We can apply daily window smoother (a fixed window size of 24) easily for each sensor. slide()
returns an output the same length as the input with .fill = NA
(by default) and .align = "center-left"
padded at both sides of the data range, so that the result fits into mutate()
in harmony. slide_dbl()
produces the numeric vector returned by mean()
.
pedestrian_full %>%
group_by_key() %>%
mutate(Daily_MA = slide_dbl(Count,
mean, na.rm = TRUE, .size = 24, .align = "center-left"
))
#> # A tsibble: 70,176 x 6 [1h] <Australia/Melbourne>
#> # Key: Sensor [4]
#> # Groups: Sensor [4]
#> Sensor Date_Time Date Time Count Daily_MA
#> <chr> <dttm> <date> <int> <int> <dbl>
#> 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01 0 1630 NA
#> 2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01 1 826 NA
#> 3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01 2 567 NA
#> 4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01 3 264 NA
#> 5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01 4 139 NA
#> # … with 7.017e+04 more rows
To make this even-order moving average symmetric, a second moving average with .size = 2
should be applied to Daily_MA
.
What if the time period we’d like to slide over happens not to be a fixed window size, for example sliding over three months. The preprocessing step is to wrap observations into monthly subsets (a list of tsibbles) using nest()
.
pedestrian_mth <- pedestrian_full %>%
mutate(YrMth = yearmonth(Date_Time)) %>%
nest(-Sensor, -YrMth)
pedestrian_mth
#> # A tibble: 96 x 3
#> Sensor YrMth data
#> <chr> <mth> <list>
#> 1 Birrarung Marr 2015 Jan <tsibble [744 × 4]>
#> 2 Birrarung Marr 2015 Feb <tsibble [672 × 4]>
#> 3 Birrarung Marr 2015 Mar <tsibble [744 × 4]>
#> 4 Birrarung Marr 2015 Apr <tsibble [721 × 4]>
#> 5 Birrarung Marr 2015 May <tsibble [744 × 4]>
#> # … with 91 more rows
Now it’s ready to (rock and) roll. When setting .size = 1
in slide()
, it behaves exactly the same as purrr::map()
, mapping over each element in the object. However, (1)
a bundle of 3 subsets (.size = 3
) needs to be binded first and then computed for average counts; (2)
alternatively, .bind = TRUE
takes care of binding data frames by row. The nicely-glued simple operations facilitate complex tasks in an easier-to-comprehend manner.
pedestrian_mth %>%
group_by(Sensor) %>%
# (1)
# mutate(Monthly_MA = slide_dbl(data,
# ~ mean(bind_rows(.)$Count, na.rm = TRUE), .size = 3, .align = "center"
# ))
# (2) equivalent to (1)
mutate(Monthly_MA = slide_dbl(data,
~ mean(.$Count, na.rm = TRUE), .size = 3, .align = "center", .bind = TRUE
))
#> # A tibble: 96 x 4
#> # Groups: Sensor [4]
#> Sensor YrMth data Monthly_MA
#> <chr> <mth> <list> <dbl>
#> 1 Birrarung Marr 2015 Jan <tsibble [744 × 4]> NA
#> 2 Birrarung Marr 2015 Feb <tsibble [672 × 4]> 634.
#> 3 Birrarung Marr 2015 Mar <tsibble [744 × 4]> 546.
#> 4 Birrarung Marr 2015 Apr <tsibble [721 × 4]> 554.
#> 5 Birrarung Marr 2015 May <tsibble [744 × 4]> 397.
#> # … with 91 more rows
We have had a glimpse at row-oriented workflow to slide over consecutive months using nest()
in the preceding example. To leverage this workflow more, we can fit a linear model for each sensor simultaneously but independently, and in turn obtain its fitted values and residuals over weekly rolling windows. This is where pslide()
comes to play. It takes a list or a data frame (multiple inputs) and apply the custom function my_diag()
to every rolling block. We start with a tsibble and end up with a diagnostic tibble of relatively larger size.
my_diag <- function(...) {
data <- tibble(...)
fit <- lm(Count ~ Time, data = data)
list(fitted = fitted(fit), resid = residuals(fit))
}
pedestrian %>%
filter_index(~ "2015-03") %>%
group_by_key() %>%
nest() %>%
mutate(diag = purrr::map(data, ~ pslide_dfr(., my_diag, .size = 24 * 7)))
#> # A tibble: 4 x 3
#> Sensor data diag
#> <chr> <list> <list>
#> 1 Birrarung Marr <tsibble [2,160 × 4]> <tibble [334,825 × 2…
#> 2 Bourke Street Mall (North) <tsibble [1,032 × 4]> <tibble [145,321 × 2…
#> 3 QV Market-Elizabeth St (West) <tsibble [2,160 × 4]> <tibble [334,825 × 2…
#> 4 Southern Cross Station <tsibble [2,160 × 4]> <tibble [334,825 × 2…
Why slide()
not working for this case? It is intended to work with a list, i.e. column-by-column for a data frame. However, here we perform a row-wise sliding over multiple columns of a data frame at one, pslide()
does the job for handling multiple lists. This example running a bit longer? Time to kick furrr in for parallel processing. Their multiprocessing counterparts are all prefixed with future_
.
library(furrr)
plan(multiprocess)
pedestrian %>%
filter_index(~ "2015-03") %>%
group_by_key() %>%
nest() %>%
mutate(diag = future_map(data, ~ future_pslide_dfr(., my_diag, .size = 24 * 7)))
The slide()
examples default to sliding over complete sets. In some cases, you may find partial sliding more appropriate, which can be enabled by .partial = TRUE
. Additionally, as opposed to moving window forward by a positive .size
, a negative one moves window backward.