ZeitZeiger is a method for supervised learning on high-dimensional data from an oscillatory system. This vignette goes through an example of how to use ZeitZeiger to train and test a predictor, perform cross-validation, and plot the results.

## Load the necessary packages

First let’s load the necessary packages.

```
library('data.table')
library('ggplot2')
library('zeitzeiger')
doParallel::registerDoParallel(cores = 2) # register a parallel backend to minimize runtime
```

## Generate example data

Now we’ll simulate data from an oscillatory system. Our simulated data will have 100 observations, with measurements of 200 features and 10 time-points per period. Each feature will have Gaussian noise. We’ll call the periodic variable here “time,” which will go between 0 and 1.

```
nObs = 100
nFeatures = 200
amp = 0.5
time = rep((0:9) / 10, length.out = nObs)
set.seed(42)
x = matrix(rnorm(nObs * nFeatures, sd = 0.7), nrow = nObs)
```

The oscillator will produce three patterns: a sine, a cosine, and a cosine with twice the frequency. For each of the three patterns, two features will have relatively strong signal and eight more will have relatively weak signal.

```
baseSignal = sin(time * 2 * pi)
x[, 1] = x[, 1] + 3 * baseSignal
x[, 2] = x[, 2] - 0.5 * baseSignal
for (ii in 3:10) {
x[, ii] = x[, ii] + runif(1, -amp, amp) * baseSignal}
baseSignal = cos(time * 2 * pi)
x[, 11] = x[, 11] + 2 * baseSignal
x[, 12] = x[, 12] - baseSignal
for (ii in 13:20) {
x[, ii] = x[, ii] + runif(1, -amp, amp) * baseSignal}
baseSignal = cos(time * 4 * pi + pi / 6)
x[, 21] = x[, 21] + baseSignal
x[, 22] = x[, 22] - baseSignal
for (ii in 23:30) {
x[, ii] = x[, ii] + runif(1, -amp, amp) * baseSignal}
```

## Create training and testing observations

Now we’ll split the dataset into training and test sets.

```
idxTrain = 1:round(nObs * 0.6)
xTrain = x[idxTrain, ]
timeTrain = time[idxTrain]
xTest = x[-idxTrain, ]
timeTest = time[-idxTrain]
```

## Train and test a ZeitZeiger predictor

ZeitZeiger has three main steps: `zeitzeigerFit`

,
`zeitzeigerSpc`

, and `zeitzeigerPredict`

. Each
function has several options, please see the help and Hughey et al. (2016) for
more details. The preferred way to use them is to call each function
separately. This lets you change the parameters for a later step without
re-running the earlier steps.

`zeitzeigerFit`

uses the training data to fit a periodic
smoothing spline to the behavior of each feature as a function of time.
`zeitzeigerSpc`

then uses the spline fits to calculate sparse
principal components (SPCs) for how the features change over time.
`zeitzeigerPredict`

then uses the training data and the SPCs
to predict the corresponding time for each test observation. The two
main parameters of ZeitZeiger are `sumabsv`

, which controls
the amount of regularization, and `nSPC`

, which controls how
many SPCs are used for prediction.

```
fitResult = zeitzeigerFit(xTrain, timeTrain)
spcResult = zeitzeigerSpc(fitResult$xFitMean, fitResult$xFitResid, sumabsv = 1)
predResult = zeitzeigerPredict(xTrain, timeTrain, xTest, spcResult, nSpc = 2)
```

Alternatively, you can run all three steps in one go using the
`zeitzeiger`

function, which returns a list consisting of the
results of the above three functions.

`zzFit = zeitzeiger(xTrain, timeTrain, xTest, sumabsv = 1, nSpc = 2)`

## Plot the periodic spline fits

Before looking at the predictions, it’s a good idea to check the quality of the spline fits.

```
timeRange = seq(0, 1 - 0.01, 0.01)
jj = 11
df1 = data.table(timeTrain, xTrain = xTrain[, jj])
df2 = data.table(
timeRange = rep(timeRange, 2),
parameter = rep(c('Observed mean', 'Predicted mean'), each = length(timeRange)),
value = c(predictIntensity(fitResult$xFitMean[jj, , drop = FALSE], timeRange),
2 * cos(timeRange * 2 * pi)))
```

For feature 11, the fit looks good. In a real-world dataset, we wouldn’t know the observed mean, so we could only compare the predicted mean to the observations.

## Plot prediction error on the test set

We can calculate the difference between predicted time and observed
time, i.e., the error, using `getCircDiff`

. This function
accounts for the fact that time is periodic, so `time = 0`

is
equivalent to `time = 1`

. The error can range from -0.5 to
0.5.

```
dfTest = data.frame(
timeObs = timeTest, timePred = predResult$timePred,
timeError = getCircDiff(predResult$timePred, timeTest))
```

There are many ways to examine prediction accuracy. A simple one is
to plot predicted time vs. observed time. Ideally, all points would lie
on the line `y = x`

. These predictions are pretty good. A
couple test observations with an observed time of 0 have a predicted
time slightly less than 1, which is ok.

```
ggplot(dfTest) +
geom_point(aes(x = timeObs, y = timePred), size = 2, shape = 1) +
geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
scale_x_continuous(limits = c(0, 1)) + scale_y_continuous(limits = c(0, 1)) +
labs(x = 'Observed time', y = 'Predicted time') + theme_bw()
```

When we plot error vs. observed time, we see those observations have a small negative error.

```
ggplot(dfTest) +
geom_point(aes(x = timeObs, y = timeError), size = 2, shape = 1) +
scale_x_continuous(limits = c(0, 1)) +
scale_y_continuous(limits = c(-0.2, 0.2)) +
labs(x = 'Observed time', y = 'Error') +
theme_bw()
```

## Plot a test observation’s time-dependent likelihood

The predicted time for a test observation is the time with maximum
likelihood. Another output of `zeitzeigerPredict`

, however,
is the likelihood across the entire range of time. By default, this
likelihood is calculated at 100 discrete time-points, the same as
`timeRange`

that we defined earlier (the predicted time is
not restricted to being one of these time-points).

```
obs = 2
dfLike = data.table(timeRange, likelihood = predResult$timeDepLike[obs, 1, ])
dfVert = data.table(
type = c('Observed time', 'Predicted time'),
xint = c(timeTest[obs], predResult$timePred[obs]))
```

For this observation, notice how the likelihood starts increasing at times slightly less than 1, then continues to increase at times greater than 0.

```
ggplot(dfLike) +
geom_point(aes(x = timeRange, y = likelihood), size = 2, shape = 1) +
geom_vline(aes(xintercept = xint, linetype = type), data = dfVert,
show.legend = TRUE) +
scale_linetype_manual(values = c('solid', 'dashed')) +
labs(x = 'Time', y = 'Likelihood') + theme_bw() +
theme(legend.position = c(0.7, 0.8), legend.title = element_blank())
```

## Run cross-validation

To determine the best parameters for training a ZeitZeiger predictor,
we can run cross-validation. The cross-validation functions can process
the folds in parallel, if `registerDoParallel`

is called
beforehand. Because the three steps of ZeitZeiger are separated, we need
to first randomly assign each observation to a fold, then use the same
assignments for `zeitzeigerFitCv`

and
`zeitzeigerPredictCv`

.

Here we’ll run cross-validation over three values of
`sumabsv`

and four values of `nSPC`

. Typically a
few values of `sumabsv`

will suffice, and `nSPC`

in this example can only range from 1 to 10 (because 10 is the default
value of `nTime`

).

```
sumabsv = c(1, 1.5, 3)
nSpc = 1:4
nFolds = 10
foldid = sample(rep(1:nFolds, length.out = nObs))
fitResultList = zeitzeigerFitCv(x, time, foldid)
spcResultList = list()
for (ii in seq_len(length(sumabsv))) {
spcResultList[[ii]] = zeitzeigerSpcCv(fitResultList, sumabsv = sumabsv[ii])}
predResultList = list()
for (ii in seq_len(length(sumabsv))) {
predResultList[[ii]] = zeitzeigerPredictCv(
x, time, foldid, spcResultList[[ii]], nSpc = nSpc)}
```

## Plot the error for each set of parameter values

Before plotting, we need to reorganize the output, making a
`data.frame`

with the information for each prediction.

```
timePredList = lapply(predResultList, function(a) a$timePred)
cvResult = data.table(
do.call(rbind, timePredList),
timeObs = rep(time, length(sumabsv)),
sumabsv = rep(sumabsv, each = length(time)),
obs = rep(1:nObs, length(sumabsv)))
cvResultMelt = melt(
cvResult, id.vars = c('obs', 'timeObs', 'sumabsv'), variable.name = 'nSpc',
value.name = 'timePred', variable.factor = FALSE)
cvResultMelt[, nSpc := as.integer(substr(nSpc, 2, 2))]
cvResultMelt[, sumabsv := factor(sumabsv)]
cvResultMelt[, timeError := getCircDiff(timePred, timeObs)]
```

Now calculate the median absolute error for each set of parameter values.

In this example dataset, the best accuracy seems to be at
`sumabsv = 1`

and `nSpc = 3`

.

## Train a model on the full dataset

Now we can train a predictor on the full dataset using the almost
optimal `sumabsv`

(to keep things interesting) from
cross-validation (all values of `nSpc`

come along for free),
and look at what ZeitZeiger has learned.

```
fitResultFinal = zeitzeigerFit(x, time)
spcResultFinal = zeitzeigerSpc(
fitResultFinal$xFitMean, fitResultFinal$xFitResid, sumabsv = 1.5)
dfVar = data.table(
spc = seq_len(length(spcResultFinal$d)),
propVar = spcResultFinal$d^2 / sum(spcResultFinal$d^2))
```

Only the first three SPCs explain any appreciable amount of the
variance in how the features change over time. This makes sense, as
accuracy stopped improving or got worse at `nSpc = 4`

.

```
ggplot(dfVar) +
geom_point(aes(x = spc, y = propVar), size = 2, shape = 1) +
scale_x_continuous(breaks = seq(1, 10)) +
labs(x = 'SPC', y = 'Proportion of\nvariance explained') + theme_bw()
```

## Plot the behavior of the SPCs over time

Now we can project the observations from feature-space to SPC-space, to look at how the three SPCs behave over time.

```
z = x %*% spcResultFinal$v[, 1:3]
colnames(z) = c('SPC 1', 'SPC 2', 'SPC 3')
zMelt = melt(
data.table(z, obs = 1:nObs, Time = time, check.names = FALSE),
id.vars = c('obs', 'Time'), variable.name = 'SPC', value.name = 'Abundance')
```

We see that each SPC corresponds to one of the three signals that we created in the dataset. The signs are reversed, but that’s ok. So these are the patterns that ZeitZeiger was using to predict the time.

```
ggplot(zMelt) +
facet_grid(vars(SPC), scales = 'free_y') +
geom_point(aes(x = Time, y = Abundance), size = 2, shape = 1) + theme_bw()
```

## Plot the coefficients of the features for the SPCs

Finally, we can look at which features contribute to which SPCs.

```
v = data.frame(spcResultFinal$v[, 1:3])
colnames(v) = c('SPC 1', 'SPC 2', 'SPC 3')
v = v[apply(v, 1, function(r) any(r != 0)), ]
v[v == 0] = NA
v = v[do.call(order, v), ]
v$feature = rownames(v)
vMelt = melt(setDT(v), id.vars = 'feature', variable.name = 'spc',
value.name = 'Coefficient')
vMelt[, feature := factor(feature, rev(v$feature))]
```

Sure enough, ZeitZeiger found all the features with strong signal. In addition, the sign and value of the coefficient for each feature in its respective SPC correspond to what we simulated.

```
ggplot(vMelt) +
facet_wrap(vars(spc), nrow = 1) +
geom_bar(aes(x = feature, y = Coefficient), stat = 'identity') +
labs(x = 'Feature') + coord_flip() +
theme_bw() + theme(panel.spacing = unit(1.2, 'lines'))
#> Warning: Removed 50 rows containing missing values (`position_stack()`).
```