Deep Learning
for Biomedical
Unstructured
Time-series
1D Convolutional neural
networks (CNNs) for time
series analysis, and
inspiration from beyond
biomedical field
Petteri Teikari, PhD
Singapore Eye Research Institute (SERI)
Visual Neurosciences group
http://petteri-teikari.com/
Version "Wed 17 April 2019"
Time Series Analysis: A Very Short Intro
Time Series Basics
Regular time series vs. irregular time series
https://mediatum.ub.tum.de/doc/1444158/78684.pdf
Unstructured Biomedical 1D Time Series
Time-frequency visualization
https://doi.org/10.3389/fnhum.2016.00605
Time series with discrete "states"
Sleep stages inferred from univariate or multivariate (multiple EEG electrode locations), multimodal (EEG with ECG/EMG, etc.) dense 1D time series
Many types of ground truths possible also for 1D time series: segmentation, classification, regression
https://arxiv.org/abs/1801.05394
Time Series Stationarity
Non-stationarities significantly distort short-term spectral, symbolic and entropy heart rate variability indices. Physiological Measurement 32(11):1775-86, November 2011. DOI: 10.1088/0967-3334/32/11/S05
Tests of Stationarity
https://stats.stackexchange.com/questions/182764/stationarity-tests-in-r-checking-mean-variance-and-covariance
Stationarity of order 2: For everyday use we often consider time series that have (instead of strict stationarity): https://people.maths.bris.ac.uk/~magpn/Research/LSTS/TOS.html
● a constant mean
● a constant variance
● an autocovariance that does not depend on time.
Such time series are known as second-order stationary, or stationary of order 2.
Examples of non-stationary processes are random walk with or without a drift (a slow steady change) and deterministic trends (trends that are constant, positive or negative, independent of time for the whole life of the series). https://www.investopedia.com/articles/trading/07/stationary.asp
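A minimal sketch (not from the slides) of how second-order stationarity is often checked in practice: inspect rolling mean/variance and run an augmented Dickey-Fuller unit-root test; assumes numpy, pandas and statsmodels are available.

```python
# Rolling statistics + ADF test on a random walk vs. its first difference.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
x = pd.Series(np.cumsum(rng.normal(size=1000)))   # random walk: non-stationary
y = x.diff().dropna()                             # first difference: ~stationary

for name, s in [("random walk", x), ("differenced", y)]:
    roll_mean = s.rolling(100).mean().dropna()
    roll_var = s.rolling(100).var().dropna()
    adf_stat, p_value, *_ = adfuller(s)
    print(f"{name}: mean drift={roll_mean.iloc[-1] - roll_mean.iloc[0]:.2f}, "
          f"variance change={roll_var.iloc[-1] - roll_var.iloc[0]:.2f}, "
          f"ADF p-value={p_value:.3f}")
```

A small p-value in the ADF test rejects the unit-root (non-stationary) hypothesis; the rolling statistics make drifting mean or variance visible directly.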
Time Series Analysis: Literature Overview
Representation vs. Similarity
https://arxiv.org/abs/1704.00794: "Time series
analysis approaches can be broadly categorized
into two families: (i) representation methods,
which provide high-level features for representing
properties of the time series at hand, and (ii)
similarity measures, which yield a meaningful
similarity between different time series for further
analysis.“
Classic representation methods are for instance
Fourier transforms, wavelets, singular value
decomposition, symbolic aggregate approximation,
and piecewise aggregate approximation.
Time series may also be represented through the
parameters of model-based methods such as
Gaussian mixture models (GMM), Markov models and
hidden Markov models (HMMs), time series bitmaps
and variants of ARIMA.
An advantage with parametric models is that they
can be naturally extended to the multivariate
case. For detailed overviews on representation
methods, we refer the interested reader to e.g.
Wang et al. (2013).
https://arxiv.org/abs/1704.00794: "Similarity-based approaches: once defined, such similarities between pairs of time series may be utilized in a wide range of applications, such as classification, clustering, and anomaly detection. Time series similarity measures include for example dynamic time warping (DTW), the longest common subsequence (LCSS), the extended Frobenius norm (Eros), and the Edit Distance with Real sequences (EDR), and represent state-of-the-art performance in univariate time series (UTS) prediction.
Attempts have been made to design kernels from non-metric distances such as DTW, of which the global alignment kernel (GAK) is an example. There are also promising works on deriving kernels from parametric models, such as the probability product kernel, Fisher kernel, and reservoir based kernels. Common to all these methods is however a strong dependence on a correct hyperparameter tuning, which is difficult to obtain in an unsupervised setting.
Moreover, many of these methods cannot naturally be extended to deal with multivariate time series (MTS), as they only capture the similarities between individual attributes and do not model the dependencies between multiple attributes. Equally important, these methods are not designed to handle missing data, an important limitation in many existing scenarios, such as clinical data where MTS originating from Electronic Health Records (EHRs) often contain missing data."
In this work, we propose a surgical site infection detection framework for
patients undergoing colorectal cancer surgery that is completely
unsupervised, hence alleviating the problem of getting access to labelled
training data. The framework is based on powerful kernels for multivariate
time series that account for missing data when computing similarities.
https://arxiv.org/abs/1803.07879
Analysis with Similarity Measures
Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data
Karl Øyvind Mikalsen, Filippo Maria Bianchi, Cristina Soguero-Ruiz, Robert Jenssen (last revised 29 Jun 2017)
https://arxiv.org/abs/1704.00794 | https://github.com/kmi010/Time-series-cluster-kernel-TCK- (The TCK was implemented in R and Matlab)
Similarity-based approaches represent a
promising direction for time series analysis.
However, many such methods rely on
parameter tuning, and some have
shortcomings if the time series are
multivariate (MTS), due to dependencies
between attributes, or the time series
contain missing data.
In this paper, we address these challenges
within the powerful context of kernel
methods by proposing the robust time
series cluster kernel (TCK). The approach
taken leverages the missing data
handling properties of Gaussian
mixture models (GMM) augmented with
informative prior distributions. An ensemble
learning approach is exploited to ensure
robustness to parameters by combining the
clustering results of many GMM to
form the final kernel.
The experimental results demonstrated that the TCK
(1) is robust to hyperparameter settings, (2) is
competitive to established methods on prediction
tasks without missing data and (3) is better than
established methods on prediction tasks with missing
data.
In future works we plan to investigate whether the
use of more general covariance structures in the
GMM, or the use of HMMs as base probabilistic
models, could improve TCK.
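A minimal sketch in the spirit of TCK, but not the authors' code: an ensemble of GMMs clusters the (vectorized, equal-length) series and the kernel accumulates agreement between posterior cluster assignments. The real TCK additionally handles missing data via informative priors, which this sketch does not.

```python
# Ensemble-of-GMM similarity between time series (TCK-like, simplified).
import numpy as np
from sklearn.mixture import GaussianMixture

def ensemble_gmm_kernel(X, n_gmms=10, max_components=5, seed=0):
    """X: (n_series, length) array of equal-length univariate series."""
    rng = np.random.default_rng(seed)
    K = np.zeros((len(X), len(X)))
    for i in range(n_gmms):
        c = int(rng.integers(2, max_components + 1))      # random model size per member
        gmm = GaussianMixture(n_components=c, covariance_type="diag",
                              random_state=i).fit(X)
        P = gmm.predict_proba(X)                          # posterior assignments
        K += P @ P.T                                      # agreement between series
    return K / n_gmms

X = np.random.default_rng(1).normal(size=(30, 50))
K = ensemble_gmm_kernel(X)
print(K.shape)  # (30, 30) similarity matrix usable by kernel methods
```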
Wavelets → Shapelets: "1D Gabors" #1
Fast classification of univariate and multivariate time series through shapelet discovery
https://doi.org/10.1007/s10115-015-0905-9
Josif Grabocka, Martin Wistuba, Lars Schmidt-Thieme
A Shapelet Selection Algorithm for Time Series Classification: New Directions
https://doi.org/10.1016/j.procs.2018.03.025
The high time complexity of the shapelet selection process hinders its application in real-time data processing. To overcome this, in this paper we propose a fast shapelet selection algorithm (FSS), which sharply reduces the time consumption of shapelet selection.
https://slideplayer.com/slide/8370683/
For example, a class of abnormal ECG measurement may be characterised by an unusual pattern that only occurs occasionally at any point during the measurement. Shapelets are subseries that capture this type of characteristic. They allow for the detection of phase-independent localised similarity between series within the same class.
The great time series classification bakeoff: a review and experimental evaluation of recent algorithmic advances
Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, Eamonn Keogh (May 2017)
https://doi.org/10.1007/s10618-016-0483-9 | https://bitbucket.org/TonyBagnall/time-series-classification
Wavelets → Shapelets: "1D Gabors" #2
A fast shapelet selection algorithm for time series classification
https://doi.org/10.1016/j.comnet.2018.11.031
The training time of shapelet-based algorithms is high, even though it is computed off-line, and the authors aim to make it more efficient.
Shapelet transformation algorithms have attracted a great deal of attention in the last decade. However, the time complexity of the shapelet selection process in shapelet transformation algorithms is too high. To accelerate the shapelet selection process with no reduction in accuracy, we presented FSS for ST.
The experimental results demonstrate that our proposed FSS was thousands of times faster than the original shapelet transformation method with no reduction in accuracy. Our results also demonstrate that our method was the fastest method among shapelet methods that have the leading level of accuracy.
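A minimal sketch (assuming equal-length univariate series, not the cited papers' code) of the core shapelet primitive: the minimum z-normalized distance of a short subseries to any position of a longer series. Shapelet discovery then searches for candidates whose distances separate classes; the shapelet transform uses these distances as features.

```python
# Phase-independent match of a shapelet against a series.
import numpy as np

def znorm(a, eps=1e-8):
    return (a - a.mean()) / (a.std() + eps)

def shapelet_distance(series, shapelet):
    """Smallest distance between the shapelet and any window of the series."""
    L = len(shapelet)
    s = znorm(shapelet)
    dists = [np.linalg.norm(znorm(series[i:i + L]) - s)
             for i in range(len(series) - L + 1)]
    return min(dists)

rng = np.random.default_rng(0)
ts = rng.normal(size=200)
candidate = ts[40:60]                     # a candidate subsequence
print(shapelet_distance(ts, candidate))   # ~0, since the shapelet occurs in ts
```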
Representation Learning with deep learning #1
Towards a Universal Neural Network Encoder for Time Series
Joan Serrà, Santiago Pascual, Alexandros Karatzoglou (Submitted on 10 May 2018)
https://arxiv.org/abs/1805.03908
We have studied the use of a universal encoder for time
series in the specific case of classifying an out-of-sample data
set of an unseen data type. We have considered the cases of
no-adaptation, mapping adaptation, and full adaptation.
In all cases we achieve performances that are competitive with
the state-of-the-art that, in addition, involve a compact reusable
representation and few training iterations. We have also studied
the effect of the representation dimensionality, showing that
small representations have an impact on no-adaptation and mapping adaptation approaches, but not much on full adaptation ones.
In the future, we plan to refine the encoder architecture, as well
as optimizing some of the parameters we empirically use in our
experiments. A very interesting direction for future research is
the adoption of one-shot learning schemas (Snell et al. 2017; Sutskever et al. 2014), which we find very suitable for the
current setting in time series classification problems.
A further option to enhance the performance of a universal encoder is data augmentation, especially considering recent linear instance/class interpolation approaches (Zhang et al. 2018).
In order to have sufficient knowledge to accomplish any task, and in order to be
applicable in the absence of labeled data or even without adaptation/re-training,
researchers have been increasingly adopting the generic concept of universal
encoders, especially within the text processing domain (note that related concepts also exist in other domains).
The basic idea is to train a model (the encoder) that learns a common representation
which is useful for a variety of tasks and that, at the same time, can be reused for
novel tasks with minimal or no adaptation. While it would seem that classical
autoencoders and other unsupervised models should perfectly fit this purpose, recent
research in sentence encoding shows that, with current means, encoders learnt with a
sufficiently large set of supervised tasks, or mixing supervised and
unsupervised data, consistently outperform their purely unsupervised counterparts.
Representation Learning with deep learning #2
One Deep Music Representation to Rule Them All? A comparative analysis of different representation learning strategies
Jaehun Kim, Julian Urbano, Cynthia C. S. Liem, Alan Hanjalic (Submitted on 13 Feb 2018)
https://arxiv.org/abs/1802.04051
Our work will address the following research questions:
– RQ1: Given a set of common learning tasks that can be used to train a network, what is the influence of the number and type of the tasks on the effectiveness of the learned deep representation?
– RQ2: How do various degrees of information sharing in the deep architecture affect the ultimate success of a learned deep representation?
– RQ3: What is the best way to assess the effectiveness of a deep representation?
Simplified illustration of the conceptual difference between traditional deep transfer learning (DTL) based on a single
learning task (above) and multi-task based deep transfer learning (MTDTL) (below). The same color used for a
learning and an unseen task indicates that the tasks have commonalities, which implies that the learned representation is
likely to be informative for the unseen task. At the same time, this representation may not be that informative to another
unseen task, leading to a low transfer learning performance. The hypothesis behind MTDTL is that relying on more
learning tasksincreasesrobustness of thelearned representationand itsusabilityfor abroadersetof unseen tasks.
Representation Learning with deep learning #3
Learning Finer-class Networks for Universal Representations
https://arxiv.org/abs/1810.02126
https://arxiv.org/abs/1712.09708
Julien Girard, Youssef Tamaazousti, Hervé Le Borgne, Céline Hudelot (Submitted on 4 Oct 2018)
Many real-world visual recognition use-cases can not directly benefit from
state-of-the-art CNN-based approaches because of the lack of many
annotated data. The usual approach to deal with this is to transfer a
representation pre-learned on a large annotated source-task onto a target-
task of interest. This raises the question of how well the original
representation is "universal", that is to say directly adapted to many
different target-tasks. To improve such universality, the state-of-the-art
consists in training networks on a diversified source problem, that is
modified either by adding generic or specific categories to the initial set of
categories.
We propose two methods to improve universality, but pay special attention
to limit the need of annotated data. We also propose a unified
framework of the methods based on the diversifying of the training
problem. Finally, to better match Atkinson's cognitive study about
universal human representations, we proposed to rely on the
transfer-learning scheme as well as a new metric to evaluate universality.
We show that our method learns more universal representations than state-of-the-art, leading to significantly better results on 10 target-tasks from multiple domains, using several network architectures, either alone or combined with networks learned at a coarser semantic level.
Representation Learning with deep learning #4
Improving Clinical Predictions through Unsupervised Time Series Representation Learning
https://arxiv.org/abs/1812.00490
Xinrui Lyu, Matthias Hüser, Stephanie L. Hyland, George Zerveas, Gunnar Rätsch (Submitted on 2 Dec 2018)
Machine Learning for Health (ML4H) Workshop at NeurIPS 2018.
We empirically showed that in scenarios
where labeled medical time series data is
scarce, training classifiers on unsupervised
representations provides performance gains
over end-to-end supervised learning using
raw input signals, thus making effective use
of information available in a separate,
unlabeled training set.
The proposed model, explored for the first
time in the context of unsupervised patient
representation learning, produces
representations with the highest
performance in future signal prediction
and clinical outcome prediction,
exceeding several baselines.
The idea behind applying attention mechanisms to time series forecasting is to enable the decoder to preferentially "attend" to specific parts of the input sequence during decoding. This allows particularly relevant events (e.g. drastic changes in heart rate) to contribute more to the generation of different points in the output sequence.
Representation Learning with deep learning #5
Unsupervised Scalable Representation Learning for Multivariate Time Series
https://arxiv.org/abs/1901.10738
https://github.com/White-Link/UnsupervisedScalableRepresentationLearningTimeSeries (PyTorch)
Jean-Yves Franceschi, Aymeric Dieuleveut, Martin Jaggi (Submitted on 30 Jan 2019)
Hence, we propose in the following an unsupervised
method to learn general-purpose representations for
multivariate time series that comply with the issues of
varying and potentially high lengths of the studied time
series. To this end, we adapt recognized deep learning tools and introduce a novel unsupervised loss. Our representations are computed by a deep convolutional neural network with dilated convolutions (i.e. TCNs).
This network is then trained unsupervised, using the first
specifically designed triplet loss in the literature of
time series, taking advantage of the encoder resilience to
time series of unequal lengths.
We leave as future work the applicability of our method to other tasks like forecasting, and the study of its impact if it were to be added in powerful ensemble methods.
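A minimal sketch in the spirit of this approach, not the authors' code: a dilated 1D-CNN encoder with global max pooling (so inputs of unequal length map to fixed-size embeddings) trained with a triplet-style loss where the positive is a subseries of the anchor and the negative comes from another series. Symmetric padding is used here for brevity; the original work uses causal convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedEncoder(nn.Module):
    def __init__(self, in_ch=1, hidden=32, out_dim=16, n_layers=4):
        super().__init__()
        layers, ch = [], in_ch
        for i in range(n_layers):
            d = 2 ** i                                   # exponentially growing dilation
            layers += [nn.Conv1d(ch, hidden, kernel_size=3, dilation=d, padding=d),
                       nn.ReLU()]
            ch = hidden
        self.conv = nn.Sequential(*layers)
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, x):                                # x: (batch, channels, time)
        h = self.conv(x)
        h = torch.max(h, dim=-1).values                  # length-independent pooling
        return self.out(h)

enc = DilatedEncoder()
anchor = torch.randn(8, 1, 200)
positive = anchor[..., 50:150]                           # subseries of the anchor
negative = torch.randn(8, 1, 120)                        # subseries of other series
za, zp, zn = enc(anchor), enc(positive), enc(negative)
loss = (-F.logsigmoid((za * zp).sum(-1)).mean()
        - F.logsigmoid(-(za * zn).sum(-1)).mean())
loss.backward()
```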
Representation Learning with deep learning #6
Unsupervised speech representation learning using WaveNet autoencoder
https://arxiv.org/abs/1812.00490
Jan Chorowski, Ron J. Weiss, Samy Bengio, Aaron van den Oord (Submitted on 25 Jan 2019)
We consider the task of unsupervised extraction of
meaningful latent representations of speech by applying
autoencoding neural networks to speech waveforms. The
goal is to learn a representation able to capture high level
semantic content from the signal, e.g. phoneme identities,
while being invariant to confounding low level details in the
signal such as the underlying pitch contour or background
noise. The behavior of autoencoder models depends on the
kind of constraint that is applied to the latent representation.
Our best models used MFCCs (mel-frequency cepstral
coefficient) as the encoder input, but reconstructed raw
waveforms at the decoder output. We used standard 13
MFCC features extracted every 10ms (i.e., at a rate of 100 Hz)
and augmented with their temporal first and second
derivatives. Such features were originally designed for
speech recognition and are mostly invariant to pitch and
similar confounding detail in the audio signal.
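A minimal sketch of the described input features (an assumption using librosa, not the paper's code): 13 MFCCs at a 10 ms hop (100 Hz frame rate), augmented with first and second temporal derivatives; the bundled example clip is only a placeholder for any mono waveform.

```python
import librosa
import numpy as np

wav, sr = librosa.load(librosa.example("trumpet"), sr=16000)  # any mono waveform
hop = sr // 100                                               # 10 ms hop -> 100 Hz
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13, hop_length=hop)
d1 = librosa.feature.delta(mfcc)                              # first derivative
d2 = librosa.feature.delta(mfcc, order=2)                     # second derivative
features = np.vstack([mfcc, d1, d2])                          # (39, n_frames)
print(features.shape)
```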
Representation Learning with deep learning #7
A Tale of Two Time Series Methods: Representation Learning for Improved Distance and Risk Metrics
https://dspace.mit.edu/bitstream/handle/1721.1/119575/1076345253-MIT.pdf
Divya Shanmugam (June 2018)
Architecture of the proposed model. A single convolutional layer
extracts local features from the input, which a strided maxpool
layer reduces to a fixed-size vector. A fully connected layer
with ReLU activation carries out further, nonlinear dimensionality
reduction to yield the embedding. A softmax layer is added at
training time.
We introduce the multiple instance learning paradigm to risk
stratification. Risk stratification models aim to identify patients
at high risk for a given outcome so that doctors may intervene, with
the attempt of avoiding that outcome. Machine learning has led to
improved risk stratification models for a number of outcomes,
including stroke, cancer and treatment resistance [55]. To the best of
our knowledge, this is the first application of multiple instance learning
to risk stratification.
The extension of Jiffy to multi-label classification and unsupervised
learning poses a challenging but necessary task. The availability of
unlabeled time series data eclipses the availability of its annotated
counterpart. Thus, a simple network-based method for representation learning on multivariate time series in the absence of labels is an important line of work. There is also potential to further increase Jiffy's speed by replacing the fully connected layer with a structured [Bojarski et al. 2016] or binarized [Rastegari et al. 2016] matrix.
The proposed risk stratification model extends naturally to a range of adverse
outcomes. The model is not limited to operating on ECG signals - it is
worth exploring whether the multiple instance learning approach may be
successful in other modalities of medical data, including voice. On a
theoretical level, strong generalization guarantees for distinguishing bags with relative witness rates do not exist and are worth exploring as these models are applied in the real world.
Intro to methods #1a
Highly comparative time-series analysis: the empirical structure of time series and their methods
http://doi.org/10.1098/rsif.2013.0048
Ben D. Fulcher, Max A. Little, Nick S. Jones
Intro to methods #1b
Highly comparative time-series analysis: the empirical structure of time series and their methods
http://doi.org/10.1098/rsif.2013.0048
Ben D. Fulcher, Max A. Little, Nick S. Jones
Structure in a library of 8651 time-series analysis operations. (a) A summary of the four main classes of operations in our library, as determined by a k-medoids clustering, reflects a crude but intuitive overview of the time-series analysis literature. (b) A network representation of the operations in our library that are most similar to the approximate entropy algorithm, ApEn(2,0.2) [7], which were retrieved from our library automatically. Each node in the network represents an operation and links encode distances between them (computed using a normalized mutual information-based distance metric, cf. electronic supplementary material, §S1.3.1). Annotated scatter plots show the outputs of ApEn(2,0.2) (horizontal axis) against a representative member of each shaded community (indicated by a heavily outlined node, vertical axis). Similar pictures can be produced by targeting any given operation in our library, thereby connecting different time-series analysis methods that nevertheless display similar behaviour across empirical time series.
Key scientific questions that can be addressed by representing time series by their properties (measured by many types of analysis methods) and operations by their behaviour (across many types of time-series data). We show that this representation facilitates a range of versatile techniques for addressing scientific time-series analysis problems, which are illustrated schematically in this figure.
The representations of time series (rows of the data matrix, figure 1a) and operations (columns of the data matrix, figure 1b) serve as empirical fingerprints, and are shown in the top panel. Coloured borders are used to label different classes of time series and operations, and other figures in this paper that explicitly demonstrate each technique are given in the bottom right-hand corner of each panel.
(a) Time-series datasets can be organized automatically, revealing the structure in a given dataset (cf. figures 4a,b and 5a). (b) Collections of scientific methods can be organized automatically, highlighting relationships between methods developed in different fields (cf. figures 3a and 5b). (c) Real-world and model-generated data with similar properties to a specific time-series target can be identified (cf. figure 4c,d). (d) Given a specific operation, alternatives from across science can be retrieved (cf. figure 3b). (e) Regression: the behaviour of operations in our library can be compared to find operations that vary with a target characteristic assigned to time series in a dataset (cf. figure 5d). (f) Classification: operations can be selected based on their classification performance to build useful classifiers and gain insights into the differences between classes of labelled time-series datasets (cf. figure 5e).
Intro to methods #1c
Highly comparative time-series analysis: the empirical structure of time series and their methods
http://doi.org/10.1098/rsif.2013.0048
Ben D. Fulcher, Max A. Little, Nick S. Jones
Highly comparative techniques for time-series analysis tasks. We draw on our full library of time-series analysis methods to: (a) structure datasets in meaningful ways, and retrieve and organize useful operations for (b,e) classification and (c,d) regression tasks. (a) Five classes of EEG signals are structured meaningfully in a two-dimensional principal components space of our library of operations. (b) Pairwise linear correlation coefficients measured between the 60 most successful operations for classifying congestive heart failure and normal sinus rhythm RR interval series. Clustering reveals that most operations are organized into one of three groups (indicated by dashed boxes).
Most of the time when people talk about time series and deep learning, they are most likely talking of Sequences (e.g. language) instead of unstructured time series (e.g. voice waveform).
"Sequences" vs. "Time Series"
"Dense Time Series" at video frame rate
Ice hockey as a game can be simplified to discrete events (sequences)
https://arxiv.org/abs/1808.04063
Not always so black-and-white, but in our case time series are mainly dense 1D biosignals with ambiguous or missing discrete states.
Time Series: RNNs for sequences
The Unreasonable Effectiveness of Recurrent Neural Networks
May 21, 2015 | Andrej Karpathy
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences
Daniel Quang, Xiaohui Xie. Nucleic Acids Research, Volume 44, Issue 11, 20 June 2016, Pages e107,
https://doi.org/10.1093/nar/gkw226
Deep Learning for Understanding Consumer Histories
by Tobias Lang - 25 Oct 2016
https://jobs.zalando.com/tech/blog/deep-learning-for-understanding-consumer-histories/?gh_src=4n3gxh1
Sequences. Depending on your background you might be wondering: What makes Recurrent Networks so special?
Time Series: LSTM, upgraded RNNs
Time Series: LSTMs Applied
DeepAir | UC Berkeley School of Information
https://www.ischool.berkeley.edu/projects/2017/deep-air
This project investigates the use of the LSTM recurrent neural network (RNN) as a
framework for forecasting in the future, based on time series data of pollution and
meteorological information in Beijing. Our results show that the LSTM framework
produces equivalent accuracy when predicting future time stamps compared to the
baseline support vector regression for a single time stamp. Using our LSTM framework,
we can now extend the prediction from a single time stamp out to 5 to 10 hours in the
future.
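A minimal sketch (not the project's code) of this kind of multi-step LSTM forecaster: a window of past multivariate readings is encoded by an LSTM and a linear head predicts several future hourly values at once.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, n_features=8, hidden=64, horizon=5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon)       # predict `horizon` future steps

    def forward(self, x):                             # x: (batch, time, features)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])                       # (batch, horizon)

model = LSTMForecaster()
past = torch.randn(16, 48, 8)                         # 48 h of 8 measurements
future = model(past)                                  # 5-hour-ahead forecast
loss = nn.functional.mse_loss(future, torch.randn(16, 5))  # dummy targets
loss.backward()
```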
Overview of our self-supervised approach for posture and sequence representation learning using CNN-LSTM. After the initial training with motion-based detections we retrain our model for enhancing the learning of the representations. https://doi.org/10.1109/CVPR.2017.399
Piano Genie: An Intelligent Musical Interface
Oct 15, 2018 | https://magenta.tensorflow.org/pianogenie
Chris Donahue, Ian Simon, Sander Dieleman
A bidirectional LSTM encoder maps a sequence of piano notes to a sequence of controller buttons (shown as 4 in the above figure, 8 in the actual system). A unidirectional LSTM decoder then decodes these controller sequences back into piano performances. After training, the encoder is discarded and controller sequences are provided by user input.
Time Series: RNN/LSTMs are outdated? #1
The fall of RNN / LSTM
Eugenio Culurciello
https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0
Combining multiple neural attention modules, comes the "hierarchical neural attention encoder"… Notice there is a hierarchy of attention modules here, very similar to the hierarchy of neural networks. This is also similar to the Temporal convolutional network (TCN).
Shapelets → Attention Models, e.g. Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction
Maha Elbayad, Laurent Besacier, Jakob Verbeek (Submitted on 11 Aug 2018)
https://arxiv.org/abs/1808.03867 | https://github.com/elbayadm/attn2d
Time Series: RNN/LSTMs are outdated? #2
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Shaojie Bai, J. Zico Kolter, Vladlen Koltun (Revised 19 Apr 2018)
https://arxiv.org/abs/1803.01271 | http://github.com/locuslab/TCN
For most deep learning practitioners, sequence modeling is
synonymous with recurrent networks. Yet recent results
indicate that convolutional architectures can outperform recurrent
networks on tasks such as audio synthesis and machine translation.
Given a new sequence modeling task or dataset, which architecture
should one use?
We conduct a systematic evaluation of generic convolutional and
recurrent architectures for sequence modeling. The models are
evaluated across a broad range of standard tasks that are commonly
used to benchmark recurrent networks. Our results indicate that a
simple convolutional architecture outperforms canonical
recurrent networks such as LSTMs across a diverse range of
tasks and datasets, while demonstrating longer effective memory. We
conclude that the common association between sequence modeling
and recurrent networks should be reconsidered, and convolutional
networks should be regarded as a natural starting point for sequence
modeling tasks.
The preeminence enjoyed by recurrent networks in sequence modeling
may be largely a vestige of history. Until recently, before the introduction of
architectural elements such as dilated convolutions and residual
connections, convolutional architectures were indeed weaker. Our
results indicate that with these elements, a simple convolutional
architecture is more effective across diverse sequence modeling tasks
than recurrent architectures such as LSTMs. Due to the comparable
clarity and simplicity of TCNs, we conclude that convolutional
networks should be regarded as a natural starting point and a
powerful toolkit for sequence modeling.
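A minimal sketch (an assumption, not the reference TCN implementation) of the architectural elements credited above: one residual block with causal, dilated 1D convolutions. Stacking blocks with dilations 1, 2, 4, ... grows the receptive field exponentially, which is what gives TCNs their long effective memory.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left-pad only => causal
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)
        self.down = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.relu = nn.ReLU()

    def causal(self, conv, x):
        return conv(nn.functional.pad(x, (self.pad, 0)))   # pad the past, not the future

    def forward(self, x):                                  # x: (batch, channels, time)
        h = self.relu(self.causal(self.conv1, x))
        h = self.relu(self.causal(self.conv2, h))
        return self.relu(h + self.down(x))                 # residual connection

tcn = nn.Sequential(*[TemporalBlock(1 if i == 0 else 16, 16, dilation=2 ** i)
                      for i in range(4)])
out = tcn(torch.randn(8, 1, 300))                          # same length as the input
print(out.shape)                                           # torch.Size([8, 16, 300])
```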
Time Series: RNN/LSTMs are outdated? #3
Dilated Temporal Fully-Convolutional Network for Semantic Segmentation of Motion Capture Data
Noshaba Cheema, Somayeh Hosseini, Janis Sprenger, Erik Herrmann, Han Du, Klaus Fischer, Philipp Slusallek (Submitted on 24 Jun 2018)
https://arxiv.org/abs/1806.09174
Semantic segmentation of motion capture sequences
plays a key part in many data-driven motion synthesis
frameworks. It is a preprocessing step in which long
recordings of motion capture sequences are partitioned
into smaller segments. Afterwards, additional methods like
statistical modeling can be applied to each group of
structurally-similar segments to learn an abstract motion
manifold. The segmentation task however often remains a manual task, which increases the effort and cost of generating large-scale motion databases.
We therefore propose an automatic framework for
semantic segmentation of motion capture data using a
dilated temporal fully-convolutional network. Our
model outperforms a state-of-the-art model in action
segmentation, as well as three networks for sequence
modeling.
Time Series: RNN/LSTMs are outdated? #4
Temporal Convolutional Networks and Dynamic Time Warping can Drastically Improve the Early Prediction of Sepsis
Michael Moor, Max Horn, Bastian Rieck, Damian Roqueiro and Karsten Borgwardt (Submitted on 7 Feb 2019)
https://arxiv.org/abs/1902.01659
https://osf.io/av5yx/?view_only=a6e3442634b34d53ba6e59c4a956b318
For future work, we aim to extend our analysis to more types of data
sources arising from the ICU. Futoma et al. (2017b) already
employed a subset of baseline covariates, medication effects, and
missingness indicator variables. However, a multitude of feature
classes still remain to be explored and properly integrated. For
instance, the combination of sequential and non-sequential
features has previously been handled by feeding non-sequential
data into the sequential model (Futoma et al.,2017a).
We hypothesize that this could be handled more efficiently by
using a more modular architecture that incorporates both
sequential and non-sequential parts. Furthermore, we aim to obtain
a better understanding of the time series features utilized by the
model. Specifically, we are interested in assessing the
interpretability of the learned filters of the MGPTCN framework
and evaluate how much the activity of an individual filter contributes
to a prediction. This endeavor is somewhat facilitated by our use of a
convolutional architecture. The extraction of short per-channel
signals could prove very relevant for supporting diagnoses made by
clinical practitioners.
Overview of our model. The raw, irregularly spaced time series are provided to the Multi-task Gaussian Process
(MGP) patient by patient. The MGP then draws from a posterior distribution (given the observed data) at evenly
spaced grid times (each hour). This grid is then fed into a temporal convolutional network (TCN) which after a forward
pass returns a loss. Its gradient is then computed by backpropagating through the computational graph including
both the TCN and the MGP (green arrows). Both the MGP and TCN parameters are learned end-to-end during
training.
We evaluate all methods using Area under the Precision–Recall Curve
(AUPRC) and additionally display the (less informative) Area under the
Receiver Operator Characteristic (AUC). The current state-of-the-art
method, MGP-RNN, is shown in blue. The two approaches for early
detection of sepsis that were introduced in this paper, i.e. MGP-TCN and
DTW-KNN ensemble, are shown in pink and red, respectively. By using three
random splits for all measures and methods, we depict the mean (line) and
standard deviation error bars (shaded area).
Clinical notes and text report understanding: Words as the sequences
Structuring Clinical Text
Comparative effectiveness of convolutional neural network (CNN) and recurrent neural network (RNN) architectures for radiology text report classification (2018)
https://doi.org/10.1016/j.artmed.2018.11.004
Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
This paper explores cutting-edge deep learning methods for
information extraction from medical imaging free text
reports at a multi-institutional scale and compares them to the
state-of-the-art domain-specific rule-based system – PEFinder
andtraditionalmachinelearning methods– SVMandAdaboost.
Visualization methods have been developed to identify the impact of input words on the output decision for both deep learning models.
Domain Phrase Attention-based Hierarchical Neural Network (DPA-HNN) architecture.
Clinical Text + Images
Unsupervised Multimodal Representation Learning across Medical Images and Reports
(Machine Learning for Health (ML4H) Workshop at NeurIPS 2018)
https://arxiv.org/abs/1811.08615 MIT CSAIL
associated radiology reports have the potential to offer
significant benefits to the clinical community, ranging from cross-
domain retrieval to conditional generation of reports to the
broader goals of multimodal representation learning. In this work,
we establish baseline joint embedding results measured via both
local and global retrieval methods on the soon to be released
MIMIC-CXR dataset consisting of both chest X-ray images and
the associated radiology reports.
We establish baseline results using supervised and unsupervised joint embedding
methods along with local (direct pairs) and global (ICD-9 code groupings) retrieval
evaluation metrics. Results show a possibility of incorporating more unsupervised data
into training for minimal-effort performance increase. A further study of joint embeddings between these modalities may enable significant applications, such as text/image generation or the incorporation of other EMR modalities.
Electronic Health Records
Visits as sequences; each sequence can contain 1D biosignals
EHR Mining: Risk Prediction model
Risk Prediction on Electronic Health Records with Prior Medical Knowledge (2018)
https://doi.org/10.1145/3219819.3220020
We propose a novel and general framework called PRIME for
risk prediction task, which can successfully incorporate
discrete prior medical knowledge into all of the state-of-the-
art predictive models using posterior regularization technique.
Different from traditional posterior regularization, we do not need
to manually set a bound for each piece of prior medical
knowledge when modeling desired distribution of the target
disease on patients. Moreover, the proposed PRIME can
automatically learn the importance of different prior knowledge
with a log-linear model.
The limitation of this work is that the proposed PRIME is only
effective for common diseases. For rare and emerging
diseases, since there is little medical knowledge about them, it
is hard to incorporate any prior knowledge into deep learning
predictive models. Thus, the proposed PRIME may achieve
similar performance to the state-of-the-art baselines. In our
future work, we will focus on how to improve predictive
performance of risk prediction for rare diseases.
Preprocessing: Cleaning
Intro to cleaning
In the preprocessing component, the main purpose is to clean the data, filter the unusual points and make it suitable as the input to the CNN. Besides the normal steps including timestamp alignment, normalization and missing data imputation for time series data with trend, the most important operation to improve the data quality is the outlier detection, interpolation and filtering, in particular for clinical data. Because in the clinical data of glucose time series, there are many missing or outlier data points due to errors in calibration, measurements, and/or mistakes in the process of data collection and transmission. Here, several methods are introduced to handle these scenarios [36].
● Dimension Reduction Model: the time series can be projected into lower dimensions using linear correlations such as principal component analysis (PCA), and data with large residual errors can be considered as outliers (see the sketch after this list).
● Proximity-based Model: the data are determined by nearest neighbour analysis, cluster or density. Thus the data instances that are isolated from the majority are considered as outliers.
● Probabilistic Stochastic Filters: different filters for the signals, such as Gaussian mixture models optimized using expectation-maximization. In our case the filter can be implemented before the CNN, due to the continuous characteristic of the input glycaemic time series data.
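A minimal sketch of the first bullet (an assumption, not the cited work's code): project sliding windows of the series onto a few principal components and flag windows with large reconstruction residuals as outliers.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_outlier_scores(series, window=24, n_components=3):
    # Slide a window over the series to build a (n_windows, window) matrix.
    W = np.lib.stride_tricks.sliding_window_view(series, window)
    pca = PCA(n_components=n_components).fit(W)
    recon = pca.inverse_transform(pca.transform(W))
    return np.linalg.norm(W - recon, axis=1)             # residual per window

rng = np.random.default_rng(0)
glucose = np.sin(np.linspace(0, 20, 500)) + 0.05 * rng.normal(size=500)
glucose[250] += 4.0                                       # inject one outlier
scores = pca_outlier_scores(glucose)
print(scores.argmax())                                    # window containing the spike
```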
A convolutional neural network for ECG annotation as the basis for classification of cardiac rhythms
Philipp Sodmann et al 2018 Physiol. Meas. in press
https://doi.org/10.1088/1361-6579/aae304
Signal cleaning:
In the data preprocessing, we performed resampling and signal denoising. We resampled all ECGs to 300 Hz using the fast Fourier transform in order to pass ECG segments of equal length onto the CNN.
To filter noisy components in the signal such as baseline wandering, respiration effects, or powerline interference, we applied a discrete wavelet transform (DWT) which works as a band-pass filter. For this, we used the Daubechies wavelet transform (Db4).
Before re-composition, each coefficient of the transform was multiplied by a factor according to tabulated values. Afterwards, a 15%-trimmed mean with a window size of 33 samples was applied to remove the persistent baseline.
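A minimal sketch of this pipeline (assuming scipy and PyWavelets; the zeroed coefficient bands are illustrative, whereas the paper rescales coefficients by tabulated factors): FFT-based resampling to 300 Hz followed by Db4 wavelet band-pass filtering.

```python
import numpy as np
import pywt
from scipy.signal import resample

def preprocess_ecg(ecg, fs_in, fs_out=300):
    x = resample(ecg, int(len(ecg) * fs_out / fs_in))      # FFT-based resampling
    coeffs = pywt.wavedec(x, "db4", level=8)
    coeffs[0] = np.zeros_like(coeffs[0])                   # damp baseline wander
    coeffs[-1] = np.zeros_like(coeffs[-1])                  # damp high-frequency noise
    return pywt.waverec(coeffs, "db4")[:len(x)]

ecg = np.random.default_rng(0).normal(size=5000)            # stand-in for a real ECG
clean = preprocess_ecg(ecg, fs_in=500)
print(len(clean))                                            # samples at 300 Hz
```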
https://doi.org/10.3389/fnins.2013.00267
MEG and EEG data analysis with MNE-Python
Preprocessing: Transformations
Time Series Invariances
A complexity-invariant distance measure for time series
https://doi.org/10.1137/1.9781611972818.60
Gustavo E A P A Batista, Xiaoyue Wang, and Eamonn J Keogh.
In Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pages 699–710. SIAM, 2011. Cited by 216
Time Series: DTW, the classical method
https://doi.org/10.1145/2888451.2888456
Stock Price Prediction with Fluctuation Patterns Using Indexing Dynamic Time Warping and k*-Nearest Neighbors
Kei Nakagawa, Mitsuyoshi Imamura, Kenichi Yoshida (2018)
https://doi.org/10.1007/978-3-319-93794-6_7
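A minimal sketch (not from the cited papers) of classical dynamic time warping between two univariate series via the standard O(nm) dynamic program; phase-shifted but similarly shaped series obtain a much smaller DTW cost than a plain point-wise distance.

```python
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

t = np.linspace(0, 2 * np.pi, 80)
x, y = np.sin(t), np.sin(t + 0.5)                # same shape, phase-shifted
print(dtw_distance(x, y), np.abs(x - y).sum())   # DTW cost < point-wise distance
```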
Learning invariances #1a
Learning to Exploit Invariances in Clinical Time-Series Data using Sequence Transformer Networks
Jeeheh Oh, Jiaxuan Wang, Jenna Wiens (Submitted on 21 Aug 2018)
https://arxiv.org/abs/1808.06725
Recently, researchers have started applying convolutional neural
networks (CNNs) with 1D convolutions to clinical tasks
involving time-series data. This is due, in part, to their
computational efficiency, relative to recurrent neural networks
and their ability to efficiently exploit certain temporal invariances (e.g., phase invariance).
However, it is well-established that clinical data may exhibit many
other types of invariances (e.g., scaling). While preprocessing
techniques, (e.g., dynamic time warping) may successfully
transform and align inputs, their use often requires one to identify
the types of invariances in advance.
In contrast, we propose the use of Sequence Transformer
Networks, an end-to-end trainable architecture that learns to
identify and account for invariances in clinical time-series data.
Applied to the task of predicting in-hospital mortality, our proposed approach achieves an improvement in the AUROC.
To address these challenges, we propose Sequence Transformer Networks, an approach for learning task-specific invariances related to amplitude, offset, and scale invariances directly from the data. Applied to clinical time-series data, Sequence Transformer Networks learn input- and task-dependent transformations. In contrast to data augmentation approaches, our proposed approach makes limited assumptions about the presence of invariances in the data.
Learning invariances #1b
Learning to Exploit Invariances in Clinical Time-Series Data using Sequence Transformer Networks
Jeeheh Oh, Jiaxuan Wang, Jenna Wiens (Submitted on 21 Aug 2018)
https://arxiv.org/abs/1808.06725
The proposed approach is not without limitation. More specifically, in its current form the Sequence Transformer applies the same transformation across all features within an example, instead of learning feature-specific transformations. Despite this limitation, the learned transformations still lead to an increase in intra-class similarity. In conclusion, we are encouraged by these preliminary results. Overall, this work represents a starting point on which others can build. In particular, we hypothesize that the ability to capture local invariances and feature-specific invariances could lead to further improvements in performance.
Learning invariances #2
Autowarp: Learning a Warping Distance from Unlabeled Time Series Using Sequence Autoencoders
Abubakar Abid, James Zou, Stanford University (Submitted on 23 Oct 2018)
https://arxiv.org/abs/1810.10107
Domain experts typically hand-craft or manually select a specific metric, such as dynamic time
warping (DTW), to apply on their data. In this paper, we propose Autowarp, an end-to-end
algorithm that optimizes and learns a good metric given unlabeled trajectories.
We define a flexible and differentiable family of warping metrics, which encompasses common
metrics such as DTW, Euclidean, and edit distance. Autowarp then leverages the representation
power of sequence autoencoders to optimize for a member of this warping distance
family. The output is a metric which is easy to interpret and can be robustly learned from relatively
few trajectories.
Future work will extend these results to more challenging time series data, such as those with higher dimensionality or heterogeneous data.
Learning invariances #3
NeuralWarp: Time-Series Similarity with Warping Networks
Josif Grabocka, Lars Schmidt-Thieme (Submitted on 20 Dec 2018)
https://arxiv.org/abs/1812.08306
In this paper we propose to learn a warping function for aligning the indices of time series in a deep latent representation. We compared the suggested architecture with two types of encoders (CNN, or RNN) and a deep forward network as a warping function. Experimental comparisons to non-parametric and un-warped Siamese networks demonstrated that the proposed elastic deep similarity measure is more accurate than prior models.
Preprocessing: Class Imbalances
SMOTE for imbalanced classes
SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification
Progress in Artificial Intelligence, December 2017, Volume 6, Issue 4, pp 347–354
https://doi.org/10.1007/s13748-017-0128-2
Considering a binary problem with a majority class and a minority class, it is likely that a learning algorithm ignores the latter and still achieves a high accuracy. There are three main ways of dealing with these situations [16]:
● Algorithmic modification: Modifying learning algorithms in order to tackle the problem by design.
● Cost-sensitive learning: Introducing costs for misclassification of the minority class at data or algorithmic level.
● Data sampling: Preprocessing the data in order to reduce the breach between the number of instances of each class.
The SMOTE technique is based on the idea of neighborhood of the k-nearest neighbor (kNN) rule.
The area under the ROC curve results show that the use of
oversampling methods improves the detection of the minority
class in Big Data datasets. We have also shown how our design can
successfully work on a wide range of devices, including a laptop,
while requiring reasonable times, around 25 min on high-end devices,
and less than 2 h on the laptop, for the most time-demanding
experiment.
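A minimal usage sketch (assuming the imbalanced-learn package): SMOTE synthesizes minority-class examples along segments between each minority point and its k nearest minority neighbours, before a classifier is trained on features extracted from the time series.

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                               # feature vectors
y = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]   # 95/5 class imbalance

X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))   # minority class synthetically balanced
```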
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary (2018)
https://doi.org/10.1613/jair.1.11192
● GS4 (Moutafis & Kakadiaris, 2014), SEG-SSC (Triguero et al., 2015) and OCHS-SSC (Dong et al., 2016) generate synthetic examples to diminish the drawbacks produced by the absence of labeled examples. Several learning techniques were checked and some properties such as the common hidden space between labeled samples and the synthetic sample were exploited.
● The technique proposed by Park et al. (2014) is a semi-supervised active learning method in which labels are incrementally obtained and applied using a clustering algorithm.
In the context of current challenges outlined, we highlighted the need for enhancing the treatment of small disjuncts, noise, lack of data, overlapping, dataset shift and the curse of dimensionality. To do so, the theoretical properties of SMOTE regarding these data characteristics, and its relationship with the new synthetic instances, must be further analyzed in depth. Finally, we also posited that it is important to focus on data sampling and pre-processing approaches (such as SMOTE and its extension) within the framework of Big Data and real-time processing.
Outlier detection: What to impute?
Types of Anomalies
global anomalies (x1, x2), local anomaly x3, micro-cluster c3
A simple two-dimensional example
"This simple example already illustrates that anomalies are not always obvious and a score is much more useful than a binary label assignment."
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data (2016)
Markus Goldstein, Seiichi Uchida
https://doi.org/10.1371/journal.pone.0152173
Three types of anomaly schemes:
● point anomaly detection
● collective anomaly
● contextual anomalies
State-of-the-art: 2 years old cutting edge #1
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data (2016)
Markus Goldstein, Seiichi Uchida
https://doi.org/10.1371/journal.pone.0152173
Dozens of algorithms have been proposed in this area, but unfortunately
the research community still lacks a comparative universal evaluation as
well as common publicly available datasets.
These shortcomings are addressed in this study, where 19 different
unsupervised anomaly detection algorithms are evaluated on 10
different datasets from multiple application domains.
By publishing the source code and the datasets, this paper aims to
be a new well-funded basis for unsupervised anomaly detection
research. Additionally, this evaluation reveals the strengths and
weaknesses of the different approaches for the first time.
As a general summary for algorithm selection, we recommend to use
nearest-neighbor based methods, in particular k-NN for global tasks
and LOF for local tasks instead of clustering-based methods. If
computation time is essential, HBOS is a good candidate, especially for
larger datasets. A special attention should be paid to the nature of the dataset when applying local algorithms, and if local anomalies are of interest at all in this case.
Different anomaly detection modes
dependingon the availability of labels
in the dataset.
(a) Supervised anomaly detection uses a
fully labeled dataset for training. (b) Semi-
supervised anomaly detection uses an
anomaly-free training dataset. Afterwards,
deviations in the test data from that normal
model are used to detect anomalies. (c)
Unsupervised anomaly detection
algorithms use only intrinsic information of
the data in order to detect instances
deviating from the majority of the data.
State-of-the-art: 2 years old cutting edge #2
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data (2016). Markus Goldstein, Seiichi Uchida
https://doi.org/10.1371/journal.pone.0152173
A visualization of the results of the k-NN global
anomaly detection algorithm. The anomaly score is
represented by the bubble size whereas the color shows the
labels of the artificially generated dataset.
Comparing Influenced Outlierness (INFLO) with Local Outlier Factor (LOF) shows the usefulness of the reverse neighborhood set.
For the red instance, LOF takes only the neighbors in the gray
area into account resulting in a high anomaly score. INFLO
additionally takes the blue instances into account (reverse
neighbors) and thus scores the red instance more normal.
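A minimal sketch (an assumption, not the paper's code) of the k-NN global anomaly score used above: each point's mean distance to its k nearest neighbours, where large scores correspond to the larger bubbles in the described figure.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_anomaly_scores(X, k=10):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)       # +1: the query point itself is returned
    dist, _ = nn.kneighbors(X)
    return dist[:, 1:].mean(axis=1)                       # drop the zero self-distance

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(300, 2)),                 # normal cluster
               rng.normal(6, 0.3, size=(5, 2))])          # small anomalous micro-cluster
scores = knn_anomaly_scores(X)
print(np.argsort(scores)[-5:])                            # indices of the strongest outliers
```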
Anomaly detection: Cyber-physical systems
Anomaly Detection with Generative Adversarial Networks for Multivariate Time Series (2018)
Dan Li, Dacheng Chen, Jonathan Goh, and See-Kiong Ng
Institute of Data Science, National University of Singapore
https://arxiv.org/abs/1809.04758
Unsupervised machine learning techniques can be used to model the system behaviour and classify deviant behaviours as possible attacks. In this work, we proposed a novel Generative Adversarial Networks-based Anomaly Detection (GAN-AD) method for such complex networked CPSs. We used LSTM-RNN in our GAN to capture the distribution of the multivariate time series of the sensors and actuators under normal working conditions of a CPS.
Instead of treating each sensor's and actuator's time series independently, we model the time series of multiple sensors and actuators in the CPS concurrently to take into account potential latent interactions between them.
To exploit both the generator and the discriminator of our GAN, we deployed the GAN-trained discriminator together with the residuals between generator-reconstructed data and the actual samples to detect possible anomalies in the complex CPS.
We will also conduct further research on feature selection for multivariate anomaly detection, and investigate principled methods for choosing the latent dimension and PC dimension with theoretical guarantees.
Anomaly detection: Financial time series
Modeling approaches for time series forecasting and anomaly detection (2018)
Du, Shuyang; Pandey, Madhulima; Xing, Cuiqun
http://cs229.stanford.edu/proj2017/final-reports/5244275.pdf
This project focuses on prediction of time series data for Wikipedia
page accesses for a period of over twenty-four months. The methods
explored here are K-nearest neighbors (KNN), Long short-term memory
network (LSTM), and Sequence to Sequence with Convolution Neural
Network (CNN) and we will compare predicted values to actual web traffic.
The predictions can help us in anomaly detection in the series.
Pre-processing: "There are many series in which values are zero. This could be a missing value, or actual lack of web page access. In addition, there are significant spikes in the data, where values have a broad range from 1 to hundreds/thousands for several web pages. We normalize this data by adding 1 to all entries, taking the log of the values, and setting the mean to zero and variance to one. We have the results of Fourier analysis for exploring periodicity on a weekly/monthly/quarterly basis."
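A minimal sketch of the normalization just described: log-transform the counts (adding 1 to handle zeros) and standardize each series.

```python
import numpy as np

def normalize_views(views):
    logv = np.log(views + 1.0)                 # equivalently np.log1p(views)
    return (logv - logv.mean()) / (logv.std() + 1e-8)

traffic = np.array([0, 3, 10, 2500, 7, 0, 12], dtype=float)
print(normalize_views(traffic))               # zero mean, unit variance
```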
Our approaches to time series prediction depends on features extracted
from the the time series data itself. Our models learn periodicity, ramp and
other regular trends quite well. However, none of our models are able to
capture spikes or outliers that arise from external sources. Enhancing
the performance of the models will require augmenting our feature set from
other sources such as news events and weather.
"Special Outliers": Disguised missing values
FAHES: A Robust Disguised Missing Values Detector
Qatar Computing Research Institute, HBKU, Doha, Qatar
https://doi.org/10.1145/3219819.3220109
Missing values are common in real-world data and may
seriously affect data analytics such as simple statistics
and hypothesis testing. Generally speaking, there are
two types of missing values: explicitly missing
values (i.e. NULL values), and implicitly missing values
(a.k.a. disguised missing values (DMVs)) such as
"11111111" for a phone number and "Some college" for
education. While detecting explicitly missing values is
trivial, detecting DMVs is not; the essential challenge is
the lack of standardization about how DMVs are
generated.
One future work we are planning to perform is to improve FAHES to detect the DMVs that are generated randomly within the range of the data. For example, when a child tries to create an account on a domain that has a minimum age restriction, the child fakes their age with a random value that allows them to create the account. Such random fake values are hard, if not impossible, to detect. Moreover, although DMVs are the focus of this paper, more types of errors are found in the wild. Many of the principles and techniques we have used to detect DMVs can be leveraged to detect other types of errors, so a natural next step is to extend the infrastructure we have built to detect those. This opens new challenges related to the robust identification of errors that could be interpreted differently by different modules.
Deep Learning Outlier Detection: overview
Uncertainty and Novelty detection #1a
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018)
Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729
What makes this problem different from a typical supervised learning setting is that we cannot model the diversity of out-of-distribution samples in
practice. The distribution of outliers used in training may not be the same as
the distribution of outliers encountered in the application. Therefore,
classical approaches that learn inliers vs. outliers with only two datasets
can yield optimistic results. We introduce OD-test, a three-dataset
evaluation scheme as a practical and more reliable strategy to assess
progress on this problem. The OD-test benchmark provides a
straightforward means of comparison for methods that address the out-of-distribution sample detection problem.
In real-life deployment of products that use complex machinery such as deep neural networks (DNNs), we would have very little control over the input. In the absence of extrapolation guarantees, when the independently and identically distributed (IID) assumption is violated, the behaviour of the pipeline may be unpredictable. From a quality assurance
perspective, it is desirable to detect and prevent these scenarios
automatically.
A reliable pipeline would first determine whether it can process a
given sample, then it would use the prediction of the target neural
network. The unfortunate incident that mislabeled people as non-human, for instance, is a clear example of OOD extrapolation that could have been prevented by such a decision scheme: the model simply did not know that it did not know. While incidents of similar nature have fueled research on de-biasing the datasets and the deep learning machinery, we still would need to identify the limitations of our models.
The application is not limited to fortifying large-scale user-facing products. Successful detection of such violations could
also be used in active learning, unsupervised learning, learning with
noisy data, or simply be a condition to invoking transfer learning
strategies. In this work, we are interested in evaluating mechanisms
that detect OOD samples.
Uncertainty and Novelty detection #1b
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018) Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
The Uncertainty View. A commonly invoked strategy in addressing similar problems is to characterize a notion of uncertainty.
The literature distinguishes aleatoric uncertainty, the uncertainty inherent
to the process (the known unknowns, like flipping a coin), from epistemic
uncertainty, the uncertainty that can be eliminated with more information
(the unknown unknowns). The Bayesian approach to epistemic
uncertainty estimation is to measure the degree of disagreement among
the potentially viable models (the posterior).
The MC-Dropout approach is often advertised as a feasible method to estimate uncertainty for a variety of applications. Similarly, we can adopt a non-Bayesian approach by training independent models and then measuring the disagreement. Lakshminarayanan et al. show an ensemble of five neural networks (DeepEnsemble) that are trained with an adversarial sample-augmented strategy is sufficient to provide a non-Bayesian alternative to capturing predictive uncertainty. We evaluate DeepEnsemble and MC-Dropout.
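A minimal sketch of MC-Dropout (an assumption, not the paper's setup): keep dropout stochastic at test time and measure disagreement across several forward passes; a deep ensemble would instead average over independently trained models.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5),
                      nn.Linear(64, 3))

def mc_dropout_predict(model, x, n_samples=50):
    model.train()                                   # keep dropout active at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    return probs.mean(0), probs.var(0)              # predictive mean and disagreement

x = torch.randn(4, 20)
mean_p, var_p = mc_dropout_predict(model, x)
print(mean_p.argmax(-1), var_p.max(-1).values)      # predictions and their uncertainty
```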
* The Abstention View
* The Anomaly View: AEThreshold, PixelCNN++, K-NNSVM
* The Novelty View: OpenMax
We train these architectures with a cross-entropy loss (CE), and a k-way logistic
regression loss (KL). CE loss is the typical choice for k-way classification tasks – it enforces
mutual exclusion in the predictions. KL loss is the typical choice for attribute prediction tasks –
it does not enforce mutual exclusivity of the predictions.
We test these two loss functions to see if the exclusivity assumption of CE has an adverse effect
on the ability to predict OOD samples. CE loss cannot make a None prediction without an
explicitly defined None class, but KL loss can make None predictions through low activations of
all the classes.
Uncertainty and Novelty detection #1c
VGG-backed and Resnet-backed methods significantly differ in accuracy. The gap indicates the sensitivity of the methods to the underlying networks.
This means that the image classification accuracy may not be the only relevant factor in performance of these methods. ODIN is less sensitive to the underlying network.
Despite not enforcing mutual exclusivity, training the networks with KL loss instead of CE loss consistently reduces the accuracy of OOD detection methods on average.
Uncertainty and Novelty detection #1d
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of
“Outlier” Detectors (2018) Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test [PyTorch]
Related work in deep learning can be categorized into two broad groups based on the underlying assumptions:
(i) in-distribution techniques, and (ii) out-of-distribution techniques.
Guo et al. (2017) observed that
modern neural networks tend to
be overconfident in their
predictions. They show that
temperature scaling in the
softmax operator, also known as
Platt scaling, can be used to
calibrate the output probabilities of
a neural network to empirically
align the accuracy of a prediction
with its probability. Their efforts fall
under the uncertainty estimation
approaches.
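A small sketch of temperature (Platt) scaling as summarized above: a single scalar T is fitted on a held-out validation set to minimize the negative log-likelihood of temperature-scaled logits, which calibrates the softmax probabilities without changing the predicted class. PyTorch, with hypothetical `val_logits`/`val_labels` tensors.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    """Learn a single scalar T > 0 that minimizes NLL on a validation set."""
    log_t = torch.zeros(1, requires_grad=True)     # T = exp(log_t) stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# At test time the calibrated probabilities are softmax(logits / T):
# T = fit_temperature(val_logits, val_labels)
# calibrated = F.softmax(test_logits / T, dim=-1)
```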
Geifman and El-Yaniv (2017)
present a framework for selective
classification with deep neural
networks that follows the
abstention view. A selection
function decides whether to
make a prediction or not. For
the choice of selection function,
they experiment with MC-Dropout
and the softmax output. They
provide an analytical trade-off
between risk and coverage within
their formulation.
Input perturbation serves as a way to assess how the network would behave near the given
input. When the temperature is 1 and the perturbation step is 0 we simply recover the
PbThreshold method. ODIN, the state-of-the-art at the time of this writing, is reported to
outperform the previous work [8] by a significant margin. We also assess the performance of ODIN
in our work.
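A rough sketch of the ODIN scoring rule as summarized above: temperature-scale the logits, nudge the input by a small step in the direction that increases the top softmax probability, then threshold the resulting maximum probability. PyTorch; the step size, temperature, and decision threshold are illustrative assumptions rather than tuned values.

```python
import torch
import torch.nn.functional as F

def odin_score(model, x, temperature=1000.0, epsilon=0.0014):
    """ODIN-style OOD score: input perturbation + temperature-scaled max softmax."""
    x = x.clone().requires_grad_(True)
    logits = model(x) / temperature
    # Perturb the input so that the top-class softmax probability increases
    top_logprob = F.log_softmax(logits, dim=-1).max(dim=-1).values.sum()
    top_logprob.backward()
    x_perturbed = x + epsilon * x.grad.sign()
    with torch.no_grad():
        probs = F.softmax(model(x_perturbed) / temperature, dim=-1)
    return probs.max(dim=-1).values   # higher score => more likely in-distribution
```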
These methods provide an abstract idea which depends on the successful training of GANs. To
the best of our knowledge, training GANs is itself an active area of research, and it is not apparent
what design decisions would be appropriate to implement these ideas in practice. Furthermore,
some of these ideas are prohibitively expensive to execute at the time of this writing.
Uncertainty and Novelty detection #1e
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of
“Outlier” Detectors (2018) Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
Datasets.
We extend the previous work by evaluating over a broader set
of datasets with varying levels of complexity. The
variation in complexity allows for a fine-grained evaluation of
the techniques. Since OOD detection is closely related to the
problem of density estimation, the dimensionality of the
input image will be of vital importance in practical
assessments. As the input dimensionality increases, we
expect the task to become much more difficult.
Therefore, to provide a more accurate picture of performance,
it is crucial to evaluate the methods on high-dimensional data.
MC-Dropout
In low-dimensional
datasets, K-NNSVM performs
similarly or better
than the other
methods.
The top-performing method, ODIN, is influenced by the
number of classes in the dataset. Similar to PbThreshold, ODIN
depends on the maximum signal in the class predictions,
therefore the increased number of classes would directly affect
both of the methods. Furthermore, neither of them consistently
prefers VGG over Resnet within all datasets. Overall, ODIN
consistently outperforms others in high-dimensional
settings, but all the methods have a relatively low average
accuracy in the 60%-78% range.
Uncertainty and Novelty detection #1f
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of
“Outlier” Detectors (2018) Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
Uncertainty and Novelty detection #2
To Trust Or Not To Trust A Classifier
Heinrich Jiang, Been Kim, Maya Gupta (2018)
Google Research; Google Brain
https://arxiv.org/abs/1805.11783
We propose a new score, called the trust
score, which measures the agreement
between the classifier and a modified
nearest-neighbor classifier on the testing
example. We show empirically that high
(low) trust scores produce surprisingly high
precision at identifying correctly (incorrectly)
classified examples, consistently
outperforming the classifier’s confidence
score as well as many other baselines.
Two example datasets and models. Predicting correctness (top row) and
incorrectness (bottom). The vertical dotted black line indicates accuracy level of the
classifier. The trust score consistently attains a higher precision for each given percentile
of classifier decision-rejection. Furthermore, the trust score generally shows increasing
precision as the percentile level increases, but surprisingly, many of the comparison
baselines do not.
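A simplified sketch of the trust-score idea described above (my reading of the paper, not its reference implementation): for a test point, compare the distance to the nearest training example of the predicted class with the distance to the nearest training example of any other class; a large ratio indicates the classifier's prediction agrees with a nearest-neighbor view of the data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def trust_scores(X_train, y_train, X_test, y_pred):
    """Ratio of distance-to-nearest-other-class over distance-to-predicted-class."""
    classes = np.unique(y_train)
    nn_per_class = {c: NearestNeighbors(n_neighbors=1).fit(X_train[y_train == c])
                    for c in classes}
    # distance from each test point to its nearest neighbour in every class
    d = np.stack([nn_per_class[c].kneighbors(X_test)[0][:, 0] for c in classes], axis=1)
    idx_pred = np.searchsorted(classes, y_pred)
    d_pred = d[np.arange(len(X_test)), idx_pred]
    d_other = np.where(np.arange(len(classes)) == idx_pred[:, None], np.inf, d).min(axis=1)
    return d_other / (d_pred + 1e-12)   # higher = more trustworthy prediction
```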
Uncertainty and Novelty detection #3
Interpreting Neural Networks With Nearest
Neighbors
Eric Wallace, Shi Feng, Jordan Boyd-Graber
https://arxiv.org/abs/1809.02847
Local model interpretation methods explain individual
predictions by assigning an importance value to each
input feature. This value is often determined by
measuring the change in confidence when a feature is
removed. However, the confidence of neural networks is
not a robust measure of model uncertainty.
This issue makes reliably judging the importance of the
input features difficult. We address this by changing
the test-time behavior of neural networks using
Deep k-Nearest Neighbors. Without harming text
classification accuracy, this algorithm provides a more
robust uncertainty metric which we use to generate
feature importance values.
The resulting interpretations better align with human
perception than baseline methods. Finally, we use our
interpretation method to analyze model predictions on
dataset annotation artifacts.
Deep k-Nearest Neighbors: Towards Confident,
Interpretable and Robust Deep Learning
Nicolas Papernot and Patrick D. McDaniel (2018)
https://arxiv.org/abs/1803.04765
Debugging ResNet model biases—This illustrates how the
DkNN algorithm helps to understand a bias identified by Stock and
Cisse [105] in the ResNet model for ImageNet. The image at the
bottom of each column is the test input presented to the DkNN.
Each test input is cropped slightly differently to include (left) or
exclude (right) the football. Images shown at the top are nearest
neighbors in the predicted class according to the representation
output by the last hidden layer. This comparison suggests that the
“basketball” prediction may have been a consequence of the ball
being in the picture. Also note how the white apparel color and
general arm positions of players often match the test image of
Barack Obama.
Uncertainty and Novelty detection #4
AND: Autoregressive Novelty Detectors
Davide Abati, Angelo Porrello, Simone Calderara, Rita Cucchiara
(Submitted on 4 Jul 2018)
https://arxiv.org/abs/1807.01653
We propose an unsupervised model for novelty
detection. The subject is treated as a density estimation
problem, in which a deep neural network is employed to learn a
parametric function that maximizes probabilities of training
samples. This is achieved by equipping an autoencoder with a
novel module, responsible for the maximization of
compressed codes' likelihood by means of autoregression. We
illustrate design choices and proper layers to perform
autoregressive density estimation when dealing with both
image and video inputs. Despite a very general formulation, our
model shows promising results in diverse one-class novelty
detection and video anomaly detection benchmarks.
The structure of the proposed autoencoder. Paired with a standard compression-reconstruction
network, a density estimation module learns the distribution of latent codes, via autoregression.
Anomaly detection with GANs #1
Anomaly detection with Wasserstein GAN
Ilyass Haloui, Jayant Sen Gupta, and Vincent Feuillard
(Submitted on 11 Dec 2018)
https://arxiv.org/pdf/1812.02463
In this paper, we investigate GANs to perform anomaly detection on
a time series dataset. In order to achieve this goal, a bibliography is
made focusing on theoretical properties of GANs and GANs used for
anomaly detection. A Wasserstein GAN has been chosen to learn the
representation of the normal data distribution, and an encoder stacked with
the generator performs the anomaly detection. W-GAN with an encoder
seems to produce state-of-the-art anomaly detection scores on the MNIST
dataset and we investigate its usage on multivariate time series.
Based on this literature review, we chose to perform anomaly detection
using a Wasserstein Generative Adversarial Network. The main
reason is that the Wasserstein GAN does not collapse, contrary to the
classical GAN, which needs to be heavily tuned in order to avoid this
problem. Mode collapse can be blocking if we need to perform
anomaly detection: if a subset of our data distribution is not learned by the
generator, then all samples that are similar to this subset might end up
classified as abnormal. Another added value of the Wasserstein GAN
version compared to a standard GAN is the possibility of using the loss
function of the discriminator to evaluate convergence, since it is an
approximation of the Wasserstein distance between P_r and P_θ.
A future improvement consists of considering CNNs for both
the generator and discriminator in order to detect anomalies from
raw time series data. 1-D convolutions are needed and will be
investigated to produce good visual representations of time
series samples. A more thorough study of the impact of the
architecture should also be done.
Anomaly detection with GANs #2
MAD-GAN: Multivariate Anomaly Detection for Time Series
Data with Generative Adversarial Networks
Dan Li, Dacheng Chen, Lei Shi, Baihong Jin, Jonathan Goh, and See-Kiong Ng
(Submitted on 15 Jan 2019) Institute of Data Science, National University of Singapore
https://arxiv.org/abs/1901.04997
In this work, we propose a novel Multivariate Anomaly Detection
strategy with GAN (MAD-GAN) to model the complex multivariate
correlations among the multiple data streams to detect
anomalies using both the GAN-trained generator and discriminator.
Unlike traditional classification methods, the GAN-trained discriminator
learns to detect fake data from real data in an unsupervised fashion,
making it an attractive unsupervised machine learning technique for
anomaly detection.
Given that this is an early attempt at multivariate anomaly detection on
time series data using GAN, there are interesting issues that await further
investigation. For example, we have noted the issues of determining the
optimal subsequence length as well as the potential model instability of
the GAN approaches.
For future work, we plan to conduct further research on feature
selection for multivariate anomaly detection, and investigate principled
methods for choosing the latent dimension and PC dimension
with theoretical guarantees. We also hope to perform a detailed study on
the stability of the detection model. In terms of applications, we plan to
explore the use of MAD-GAN for other anomaly detection applications
such as predictive maintenance and fault diagnosis for smart buildings
and machinery.
Uncertainty Insights from NLP
Quantifying Uncertainties in Natural Language
Processing Tasks
Yijun Xiao and William Yang Wang (Submitted on 18 May 2018)
https://arxiv.org/abs/1811.07253
In this paper, we propose novel methods to study the
benefits of characterizing model and data
uncertainties for natural language processing (NLP)
tasks. With empirical experiments on sentiment analysis,
named entity recognition, and language modeling using
convolutional and recurrent neural network models, we
show that explicitly modeling uncertainties is not only
necessary to measure output confidence levels, but also
useful for enhancing model performance in various
NLP tasks.
1. We mathematically define model and data
uncertainties via the law of total variance;
2. Our empirical experiments show that by accounting
for model and data uncertainties, we observe
significant improvements in three important NLP tasks;
3. We show that our model outputs higher data
uncertainties for more difficult predictions in sentiment
analysis and named entity recognition tasks.
Uncertainty CNNs + Gaussian Processes
Calibrating Deep Convolutional Gaussian Processes
Gia-Lac Tran, Edwin V. Bonilla, John P. Cunningham, Pietro Michiardi, Maurizio
Filippone. (Submitted on 26 May 2018)
https://arxiv.org/abs/1805.10522
Despite the considerable interest in combining CNNs
with GPs, little attention has been devoted to
understand the implications in terms of the ability of
these models to accurately quantify the level of
uncertainty in predictions.
This is the first work that highlights the issues of
calibration of these models, showing that GPs cannot
cure the issues of miscalibration in CNNs. We
have proposed a novel combination of CNNs and GPs
where the resulting model becomes a particular form of
a Bayesian CNN for which inference using variational
inference is straightforward.
However, our results also indicate that combining CNNs
and GPs does not significantly improve the
performance of standard CNNs. This can serve as
a motivation for investigating new approximation
methods for scalable inference in GP models and
combinationswithCNNs.
Calibration of Convolutional Networks:
The issue of calibration of classifiers in machine learning was popularized in the 90's with the use of
support vector machines for probabilistic classification. Calibration techniques aim to learn a
transformation of the output using a validation set in order for the transformed output to give a reliable
account of the actual probability of class labels; interestingly, calibration can be applied regardless
of the probabilistic nature of the untransformed output of the classifier. Popular calibration techniques
include Platt scaling and isotonic regression. Classifiers based on Deep Neural Networks (DNNs)
have been shown to be well-calibrated. The reason is that the optimization of the cross-entropy
loss promotes calibrated output. The same loss is used in Platt scaling and it corresponds to the
correct multinomial likelihood for class labels. Recent studies on the calibration of CNNs, which are a
particular case of DNNs, however, show that depth has a negative impact on calibration, despite
the use of a cross-entropy loss, and that regularization improves the calibration properties of
classifiers [Guo et al. 2017].
Combinations of ConvNets and Gaussian Processes:
Thinking of Bayesian priors as a form of regularization, it is natural to assume that Bayesian
CNNs can "cure" the miscalibration of modern CNNs. Despite the abundant literature on Bayesian
DNNs, far less attention has been devoted to Bayesian CNNs, and the calibration properties of these
approaches have not been investigated. In this work, we propose an alternative way to combine CNNs
and GPs, where GPs are approximated using random feature expansions. The random feature
expansion approximation amounts to replacing the original kernel matrix with a low-rank approximation,
turning GPs into Bayesian linear models. Combining this with CNNs leads to a particular form of
Bayesian CNNs, much like GPs and DGPs are particular forms of Bayesian DNNs. Inference in Bayesian
CNNs is intractable and requires some form of approximation. In this work, we draw on the interpretation
of dropout as variational inference, employing the so-called Monte Carlo Dropout (MCD) to obtain a
practical way of combining CNNs and GPs.
Uncertainty in timestamps, modeling for clinical use #1
Time-Discounting Convolution for Event Sequences
with Ambiguous Timestamps
(Submitted on 6 Dec 2018)
https://arxiv.org/abs/1812.02395
This paper proposes a method for modeling event
sequences with ambiguous timestamps, a time-
discounting convolution. Unlike in ordinary time series,
time intervals are not constant, small time-shifts
have no significant effect, and inputting timestamps or
time durations into a model is not effective. The criteria
that we require for the modeling are providing
robustness against time-shifts or timestamps
uncertainty as well as maintaining the essential
capabilities of time-series models, i.e., forgetting
meaningless past information and handling infinite
sequences.
The proposed method handles them with a
convolutional mechanism across time with specific
parameterizations, which efficiently represents the event
dependencies in a time-shift invariant manner while
discounting the effect of past events, and a dynamic
pooling mechanism, which provides robustness
against the uncertainty in timestamps and enhances the
time-discounting capability by dynamically changing the
pooling window size.
Imputation Literature Review
Types of Missing Values
Feldman et al. (2018): "Rubin (1976) discusses three possible
mechanisms for the formation of missing values, each reflecting a
different form of missing-data probabilities and relationships between the
measured variables, and each may lead to different imputation methods
(Luengo et al., 2012)"
Missing Completely at Random (MCAR): a missing value that cannot be
related to the value itself or to other variable values in that record. This is a
completely unsystematic missing pattern and therefore the observed data
can be thought of as a random, unbiased sample of a complete dataset.
Missing at Random (MAR): cases in which a missing value is related to
other variable values in that record, but not to the value itself (e.g., a person with
a "marital status" value "single" has a missing value in the "spouse name"
attribute). In other words, in MAR scenarios, incomplete data can be partially
explained and the actual value can possibly be predicted from other variable
values.
Missing Not at Random (MNAR): the missing value is not random and
depends on the actual value itself; hence, it cannot be explained by other values
(e.g., an overweight person is reluctant to provide the "weight" value in a
survey). MNAR scenarios are the most difficult to analyze and handle, as the
missing data cannot be associated with other data items that are available in
the dataset.
https://statistical-programming.com/missing-data/
Missing in action: the dangers of ignoring missing data
https://doi.org/10.1016/j.tree.2008.06.014
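To make the three mechanisms concrete, the hedged sketch below simulates each one on a toy two-column dataset (numpy; the thresholds and missingness rates are arbitrary illustration choices): MCAR drops values uniformly at random, MAR makes missingness in one column depend on the other, observed column, and MNAR makes it depend on the value being dropped itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(40, 10, n)
weight = rng.normal(75, 12, n)

# MCAR: every weight value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: weight is more likely to be missing for older people (depends on observed age)
mar_mask = rng.random(n) < np.where(age > 50, 0.5, 0.1)

# MNAR: heavier people are more likely to withhold their weight (depends on the value itself)
mnar_mask = rng.random(n) < np.where(weight > 90, 0.6, 0.1)

weight_mcar = np.where(mcar_mask, np.nan, weight)
weight_mar = np.where(mar_mask, np.nan, weight)
weight_mnar = np.where(mnar_mask, np.nan, weight)
```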
Intro to imputation methods
Comparison of Estimating Missing Values in IoT Time
Series Data Using Different Interpolation Algorithms
August 2018
https://doi.org/10.1007/s10766-018-0595-5
"When collecting Internet of Things data using various sensors or
other devices, it may be possible to miss several kinds of values of
interest. In this paper, we focus on estimating the missing values in IoT
time series data using three interpolation algorithms, including
(1) Radial Basis Functions, (2) Moving Least Squares (MLS), and (3)
Adaptive Inverse Distance Weighted."
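As a concrete baseline for the interpolation-style imputation discussed above, here is a minimal pandas sketch; the RBF, MLS, and IDW variants from the paper are not reproduced, so plain linear, time-aware, and spline interpolation stand in for them on a toy IoT-style series.

```python
import numpy as np
import pandas as pd

# A toy sensor series with gaps
idx = pd.date_range("2019-01-01", periods=10, freq="H")
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0, np.nan, 10.0], index=idx)

linear = s.interpolate(method="linear")            # straight lines between observed points
timed = s.interpolate(method="time")               # accounts for irregular timestamps
spline = s.interpolate(method="spline", order=2)   # smoother fill, requires scipy
```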
On the choice of the best imputation methods for missing values
considering three groups of classification methods
June 2011
https://doi.org/10.1007/s10115-011-0424-2 | https://sci2s.ugr.es/MVDM
"In this work, we focus on a classification task with twenty-three classification methods
and fourteen different imputation approaches to missing values treatment that
are presented and analyzed. The analysis involves a group-based approach, in which
we distinguish between three different categories of classification methods.
Each category behaves differently, and the evidence obtained shows that the use of
determined missing values imputation methods could improve the accuracy obtained
for these methods. In this study, the convenience of using imputation methods
for preprocessing data sets with missing values is stated. The analysis suggests
that the use of particular imputation methods conditioned to the groups is required."
We have discovered that the
Combined Multivariate Collapsing
(CMC) and Event Covering (EC)
methods show good behavior for
these two measures, and they are
two methods that provide good
results for an important range of
learning methods, as we have
previously analyzed. In short, these
two approaches introduce less
noise and maintain the mutual
information better.
Class center based approach for missing value
imputation (2018)
https://doi.org/10.1016/j.knosys.2018.03.026
A novel missing value imputation is introduced, which is composed of
two modules. Each class center and its distances from the other
observed data are measured to identify a threshold. Then, the
identified threshold is used for missing value imputation. The
proposed approach outperforms the other approaches for both
numerical and mixed datasets. It requires much less imputation
time than the machine learning based methods.
Imputation with Deep Learning #1
BRITS: Bidirectional Recurrent Imputation for Time
Series
Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, Yitan Li
(Submitted on 27 May 2018) https://arxiv.org/abs/1805.10572
https://github.com/NIPS-BRITS/BRITS
Existing imputation methods often impose strong
assumptions on the underlying data generating process,
such as linear dynamics in the state space. In this paper, we
propose BRITS, a novel method based on recurrent neural
networks for missing value imputation in time series data.
Our proposed method directly learns the missing
values in a bidirectional recurrent dynamical system, without
any specific assumption. The imputed values are treated as
variables of the RNN graph and can be effectively updated during
backpropagation. We simultaneously perform missing
value imputation and classification/regression of applications
jointly in one neural graph.
BRITS has three advantages: (a) it can handle multiple
correlated missing values in time series; (b) it generalizes
to time series with nonlinear dynamics underlying; (c) it
provides a data-driven imputation procedure and
applies to general settings with missing data.
We evaluate the imputation performance in terms of
mean absolute error (MAE) and mean relative error
(MRE).
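A small sketch of the two imputation metrics mentioned above, computed only over the positions that were actually missing (numpy). The MRE form here, sum of absolute errors divided by the sum of absolute true values, is one common convention and should be checked against the paper before comparing numbers.

```python
import numpy as np

def imputation_errors(x_true, x_imputed, missing_mask):
    """MAE and MRE evaluated only where values were missing (missing_mask == True)."""
    err = np.abs(x_true[missing_mask] - x_imputed[missing_mask])
    mae = err.mean()
    mre = err.sum() / np.abs(x_true[missing_mask]).sum()
    return mae, mre
```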
Imputation with Deep Learning #2
End-to-End Time Series Imputation via Residual Short Paths
Lifeng Shen, Qianli Ma, Sen Li (2018)
http://proceedings.mlr.press/v95/shen18a.html
We propose an end-to-end imputation network with residual
short paths, called Residual IMPutation LSTM (RIMP-LSTM), a
flexible combination of residual short paths with graph-based
temporal dependencies. We construct a residual sum unit (RSU),
which enables RIMP-LSTM to make full use of previous revealed
information to model incomplete time series and reduce the
negative impact of missing values. Moreover, a switch unit is
designed to detect the missing values and a new loss function is
then developed to train our model with time series in the presence of
missing values in an end-to-end way, which also allows
simultaneous imputation and prediction.
RIMP-LSTM combines the merits of graph-based models with
explicitly modeled temporal dependencies via weighted
residual connection between nodes, with the ones of LSTM that can
accumulate historical residual information and learn the underlying
patterns of incomplete time series automatically.
On the other hand, compared with IMP-LSTM, RIMP-LSTM has
better performance as it is good at modeling temporal
dependencies with weighted residual short paths, which
demonstrates the reasonability of using these weighted residual
paths to model graph-like temporal dependencies for imputation.
Imputation with Deep Learning #3
A context encoder for audio inpainting
Andres Marafioti, Nathanael Perraudin, Nicki Holighaus, and Piotr Majdak (Submitted on 29 Oct 2018)
https://arxiv.org/abs/1810.12138
http://www.github.com/andimarafioti/audioContextEncoder
(Python, Matlab)
We studied the ability of deep neural networks (DNNs) to restore missing audio
content based on its context, a process usually referred to as audio inpainting.
We focused on gaps in the range of tens of milliseconds, a condition which has
not received much attention yet. The proposed DNN structure was trained on
audio signals containing music and musical instruments, separately, with 64-ms
long gaps.
Here, the STFT features, meant as a reasonable first choice,
provided a decent performance. In the future, we expect more
hearing-related features to provide even better reconstructions. In
particular, an investigation of Audlet frames, i.e., invertible time-
frequency systems adapted to perceptual frequency scales, as
features for audio inpainting presents intriguing opportunities.
Here, preferred architectures are those not relying on a
predetermined target and input feature length, e.g., a recurrent
network. Recent advances in generative networks will provide
other interesting alternatives for analyzing and processing audio
data as well. These approaches are yet to be fully explored.
Finally, music data can be highly complex and it is unreasonable to
expect a single trained model to accurately inpaint a large number
of musical styles and instruments at once. Thus, instead of training
on a very general dataset, we expect significantly improved
performance for more specialized networks that could be
trained by restricting the training data to specific genres or
instrumentation. Applied to a complex mixture and potentially
preceded by a source-separation algorithm, the resulting
models could be used jointly in a mixture-of-experts approach.
Imputation with Deep Learning #4: GANs
NAOMI: Non-Autoregressive Multiresolution Sequence Imputation
Yukai Liu, Rose Yu, Stephan Zheng, Eric Zhan, Yisong Yue (Submitted on 30 Jan 2019)
https://arxiv.org/abs/1901.10946
Leveraging multiresolution modeling and adversarial training, NAOMI is able to
learn the conditional distribution given very few known observations and
achieves superior performance in various experiments of both deterministic and
stochastic dynamics. Future work will investigate how to infer the
underlying distribution when complete training data is unavailable. The trade-
off between partial observations and external constraints is another direction for
deep generative imputation models.
Effect of missing values on classification performance
A methodology for quantifying the effect of missing data on decision quality in
classification problems
Received 09 Mar 2016, Accepted 22 Dec 2016, Accepted author version posted online: 13 Jan 2017,
https://doi.org/10.1080/03610926.2016.1277752
"This study suggests that the negative impact of poor data quality (DQ) on decision making is often
mediated by biased model estimation. To highlight this perspective, we develop an analytical framework
that links three quality levels – data, model, and decision. The general framework is first developed at a
high level."
Evolutionary Machine Learning for
Classification with Incomplete Data
Tran, Cao Truong (2018, PhD Thesis)
http://hdl.handle.net/10063/7639
"The thesis develops approaches for
improving imputation for
classification with incomplete data by
integrating clustering and feature
selection with imputation. The approaches
improve both the effectiveness and the
efficiency of using imputation for
classification with incomplete data.
The thesis develops interval genetic
programming to directly evolve classifiers
for incomplete data. The results show that
classifiers generated by interval genetic
programming can be more effective and
efficient than classifiers generated by the
combination of imputation and traditional
genetic programming. Interval genetic
programming is also more effective than
common classification algorithms able to
work directly with incomplete data."
Imputation and Classification
Missing Data Imputation for Supervised Learning
August 2018
https://doi.org/10.1080/08839514.2018.1448143
"This paper compares methods for imputing missing
categorical data for supervised classification tasks."
The results of the present study show that perturbation can help increase predictive accuracy
for imputed models, but not one-hot encoded models. Future work can identify the conditions
under which missing-data perturbation can improve prediction accuracy. Interesting
extensions of this paper include evaluating the benefits of using missing-data
perturbation over more popular regularization techniques such as dropout training.
Error rates on the Adult test set with (bottom) and without (top) missing data imputation, for various levels of MCAR-perturbed categorical training features (x-axis).
The Adult dataset contains N = 48,842 examples
and 14 features (6 continuous and 8 categorical). The
prediction task is to determine whether a person
makes over $50,000 a year.
Decomposition Literature Review
CEEMD Empirical Mode Decomposition
Empirical mode decomposition for
seismic time-frequency analysis
Jiajun Han and Mirko van der Baan
Geophysics (2013) 78 (2):O9-O19.
https://doi.org/10.1190/geo2012-0199.1
Complete ensemble empirical mode
decomposition decomposes a
seismic signal into a sum of
oscillatory components, with
guaranteed positive and smoothly
varying instantaneous frequencies.
Analysis on synthetic and real data
demonstrates that this method
promises higher spectral-spatial
resolution than the short-time
Fourier transform or wavelet
transform. Application on field data
thus offers the potential of
highlighting subtle geologic
structures that might otherwise
escape unnoticed.
CEEMD is a robust extension of EMD methods. It
solves not only the mode mixing problem, but also leads to
complete signal reconstructions. After CEEMD,
instantaneous frequency spectra manifest visibly higher
time-frequency resolution than short-time Fourier and
wavelet transforms on synthetic and field data examples.
These characteristics render the technique highly
promising for seismic processing and interpretation.
Introducing libeemd: A program package for performing the
ensemble empirical mode decomposition (July 2015)
Computational Statistics 31(2):1-13, P.J.J. Luukko, Jouni Helske, E.
Räsänen. C, R and Python.
http://doi.org/10.1007/s00180-015-0603-9
https://bitbucket.org/luukko/libeemd
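As a hedged illustration of running a CEEMD-style decomposition on a 1D signal from Python, the sketch below uses the PyEMD package (the `EMD-signal` distribution on PyPI) rather than the libeemd bindings cited above; the API names come from PyEMD and the signal is synthetic, so treat the parameter choices as assumptions.

```python
import numpy as np
from PyEMD import CEEMDAN  # pip install EMD-signal

# Synthetic non-stationary signal: a chirp-like component plus a slow drift and noise
t = np.linspace(0, 10, 2000)
signal = np.sin(2 * np.pi * (1 + 0.3 * t) * t) + 0.5 * t + 0.1 * np.random.randn(t.size)

ceemdan = CEEMDAN(trials=100)        # number of noise realizations in the ensemble
imfs = ceemdan(signal)               # rows are intrinsic mode functions, fast to slow
residual = signal - imfs.sum(axis=0)
print(imfs.shape)                    # (n_imfs, n_samples)
```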
Source Separation ”signal decomposition” #1
Wave-U-Net: A Multi-Scale Neural Network for
End-to-End Audio Source Separation
Daniel Stoller, Sebastian Ewert, Simon Dixon
Queen Mary University of London, Spotify
(Submitted on 8 Jun 2018)
https://arxiv.org/abs/1806.03185 | https://github.com/f90/Wave-U-Net
“Models for audio source separation usually operate on the
magnitude spectrum, which ignores phase information and
makes separation performance dependent on hyper-parameters
for the spectral front-end. Therefore, we investigate end-to-end
source separation in the time domain, which allows
modelling phase information and avoids fixed spectral
transformations. Due to high sampling rates for audio, employing a
long temporal input context on the sample level is difficult, but
required for high quality separation results because of long-range
temporal correlations.
In this context, we propose the Wave-U-Net, an adaptation of the
U-Net to the one-dimensional time domain, which repeatedly
resamples feature maps to compute and combine features at
different time scales. We introduce further architectural
improvements, including an output layer that enforces source
additivity, an upsampling technique and a context-aware
prediction framework to reduce output artifacts.
Experiments for singing voice separation indicate that our
architecture yields a performance comparable to a state-of-the-
art spectrogram-based U-Net architecture, given the same data.”
75 tracks from the training partition of the MUSDB
multi-track database are randomly assigned to
our training set. For singing voice separation, we
also add the whole CCMixter database to the
training set. No further data preprocessing is performed, only a
conversion to mono (except for stereo models) and downsampling to
22050 Hz.
For future work, we could investigate to
which extent our model performs a
spectral analysis, and how to incorporate
computations similar to those in a multi-
scale filterbank, or to explicitly compute
a decomposition of the input signal into a
hierarchical set of basis signals and
weightings on which to perform the
separation, similar to the TasNet [12].
Furthermore, better loss functions for
raw audio prediction should be investigated,
such as the ones provided by generative
adversarial networks [3, 21], since the MSE
might not reflect the perceived loss of
quality well.
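To ground the idea of repeatedly resampling 1D feature maps with skip connections, here is a very small PyTorch sketch of one downsampling/upsampling level in the spirit of (but far simpler than) the Wave-U-Net; the layer sizes and kernel widths are arbitrary illustration choices, not the published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveUNet(nn.Module):
    """One encoder/decoder level of a 1D U-Net-style network (illustrative only)."""
    def __init__(self, channels=24):
        super().__init__()
        self.down = nn.Conv1d(1, channels, kernel_size=15, padding=7)
        self.bottleneck = nn.Conv1d(channels, channels, kernel_size=15, padding=7)
        self.up = nn.Conv1d(2 * channels, 1, kernel_size=5, padding=2)

    def forward(self, x):                                        # x: (batch, 1, time)
        skip = F.leaky_relu(self.down(x))
        h = F.leaky_relu(self.bottleneck(skip[:, :, ::2]))       # decimate time axis by 2
        h = F.interpolate(h, size=skip.shape[-1], mode="linear", align_corners=False)
        return self.up(torch.cat([h, skip], dim=1))              # concat skip, map to source

# y = TinyWaveUNet()(torch.randn(8, 1, 16384))
```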
Source Separation ”signal decomposition” #2
TasNet: Surpassing Ideal Time-Frequency
Masking for Speech Separation
Yi Luo, Nima Mesgarani
(Submitted on 21 Sep 2018)
https://arxiv.org/abs/1809.07454
“TasNet uses a convolutional encoder to create a representation
of the signal that is optimized for extracting individual speakers.
Speaker extraction is achieved by applying a weighting
function (mask) to the encoder output. The modified encoder
representation is then inverted to the sound waveform using a
linear decoder. A linear deconvolution layer serves as a decoder
by inverting the encoder output back to the sound waveform. This
encoder-decoder framework is similar to the ICA method when
a nonnegative mixing matrix (NMF) is used [Wang et al. 2009] and
to the semi-nonnegative matrix factorization method (semi-NMF)
[Ding et al. 2008], where the basis signals are the parameters of
the decoder.
The masks are found using a temporal convolutional network
(TCN) consisting of dilated convolutions, which allow the
network to model the long-term dependencies of the speech
signal. This end-to-end speech separation algorithm significantly
outperforms previous time-frequency methods in terms
of separating speakers in mixed audio, even when compared to
the separation accuracy achieved with the ideal time-frequency
mask of the speakers. In addition, TasNet has a smaller model size
and a shorter minimum latency, making it a suitable solution for
both offline and real-time speech separation applications.“
Source Separation ”signal decomposition” #3
Disentangling Correlated Speaker and Noise for
Speech Synthesis via Data Augmentation and
Adversarial Factorization
Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Yu-An Chung, Yuxuan Wang,
Yonghui Wu, James Glass.
32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.
https://openreview.net/pdf?id=Bkg9ZeBB37
“To leverage crowd-sourced data to train multi-speaker text-
to-speech (TTS) models that can synthesize clean speech
for all speakers, it is essential to learn disentangled
representations which can independently control the
speaker identity and background noise in generated signals.
However, learning such representations can be challenging,
due to the lack of labels describing the recording conditions of
each training example, and the fact that speakers and
recording conditions are often correlated, e.g. since users
often make many recordings using the same equipment.
This paper proposes three components to address this
problem by: (1) formulating a conditional generative model
with factorized latent variables, (2) using data augmentation
to add noise that is not correlated with speaker identity and
whose label is known during training, and (3) using
adversarial factorization to improve disentanglement.
Experimental results demonstrate that the proposed method
can disentangle speaker and noise attributes even if
they are correlated in the training data, and can be used to
consistently synthesize clean speech for all speakers.”
Decompose High and Low frequencies
Drop an Octave: Reducing Spatial Redundancy in
Convolutional Neural Networks with Octave
Convolution
Yunpeng Chen, Haoqi Fang, Bing Xu, Zhicheng Yan, Yannis Kalantidis,
Marcus Rohrbach, Shuicheng Yan, Jiashi Feng
(Submitted on 10 Apr 2019)
https://export.arxiv.org/abs/1904.05049
In this work, we propose to factorize the mixed feature maps by
their frequencies and design a novel Octave Convolution
(OctConv) operation to store and process feature maps that vary
spatially "slower" at a lower spatial resolution, reducing both memory
and computation cost. Unlike existing multi-scale methods,
OctConv is formulated as a single, generic, plug-and-play
convolutional unit that can be used as a direct
replacement of (vanilla) convolutions without any
adjustments in the network architecture. It is also orthogonal and
complementary to methods that suggest better topologies or
reduce channel-wise redundancy like group or depth-wise
convolutions. We experimentally show that by simply replacing
convolutions with OctConv, we can consistently boost
accuracy for both image and video recognition tasks, while reducing
memory and computational cost.
Decompose Signal and the Noise
Deep learning of dynamics and signal-noise
decomposition with time-stepping constraints
Samuel H. Rudy, J. Nathan Kutz, Steven L. Brunton
Department of Applied Mathematics / Mechanical Engineering, University of Washington, Seattle,
last revised 22 Aug 2018
https://arxiv.org/abs/1808.02578
https://github.com/snagcliffs/RKNN
“We propose a novel paradigm for data-driven modeling that
simultaneously learns the dynamics and estimates the
measurement noise at each observation. By constraining our
learning algorithm, our method explicitly accounts for measurement
error in the map between observations, treating both the
measurement error and the dynamics as unknowns to be
identified, rather than assuming idealized noiseless trajectories.
We also discuss issues with the generalizability of neural network
models for dynamical systems and provide open-source code for
all examples.”
The combination of neural networks and numerical time-stepping
schemes suggests a number of high-priority research
directions in system identification and data-driven forecasting.
Future extensions of this work include considering systems with
process noise, a more rigorous analysis of the specific method for
interpolating f, including time delay coordinates to accommodate
latent variables, and generalizing the method to identify
partial differential equations. Rapid advances in hardware and
the ease of writing software for deep learning will enable these
innovations through fast turnover in developing and testing
methods.
Signal Restoration Literature Review
Super-resolution: Insights from audio
Time-frequency networks for audio super-
resolution
Teck Yian Lim et al. (2018)
http://isle.illinois.edu/sst/pubs/2018/lim18icassp.pdf
http://tlim11.web.engr.illinois.edu/
“Audio super-resolution (a.k.a. bandwidth extension) is
the challenging task of increasing the temporal resolution of
audio signals. Recent deep network approaches achieved
promising results by modeling the task as a regression
problem in either the time or frequency domain. In this paper,
we introduced the Time-Frequency Network (TFNet), a
deep network that utilizes supervision in both the time and
frequency domain. We proposed a novel model architecture
which allows the two domains to be jointly optimized.”
Spectrogram corresponding to
the LR input (frequencies above
4 kHz missing), the HR
reconstruction, and the HR
ground truth. Our approach
successfully recovers the high
frequency components from the
LR audio signal.
GANs Also for time-series denoising #1a
Denoising Time Series Data Using
Asymmetric Generative Adversarial
Networks
Sunil Gandhi; Tim Oates; Tinoosh Mohsenin and David
Hairston (2018)
https://doi.org/10.1007/978-3-319-93040-4_23
“In this paper, we explicitly learn to remove
noise from time series data without
assuming a prior distribution of noise.
We propose an online, fully automated, end-
to-end system for denoising time series data.
Our model for denoising time series is trained
using unpaired training corpora and does
not need information about the source of the
noise or how it is manifested in the time series.
We propose a new architecture called
AsymmetricGAN that uses a generative
adversarial network for denoising time series
data.”
Consider, for example, a widely used method for time series featurization called Symbolic Aggregate
approXimation (SAX) that assumes time series are generated from a single normal distribution. As
has been shown, this assumption does not hold in several real-life time series datasets. Other techniques
assume noise comes from a Gaussian distribution and estimate the parameters of that distribution. This
assumption does not hold for data sources like electroencephalography (EEG), where noise can have diverse
characteristics and originate from different sources. Hence, in this work, we focus on learning the
characteristics of noise in EEG data and removing it as a preprocessing step. ICA has high
computational complexity and large memory requirements, making it unsuitable for real-time applications.
For training of our network, we only need a set of clean signals and a set of noisy signals. We do not need
paired training data, i.e., we do not need clean versions of the noisy data. This is particularly useful for
applications like artifact removal in EEG data as we cannot record clean versions of noisy EEG.
GANs Also for time-series denoising #1b
Denoising Time Series Data Using
Asymmetric Generative Adversarial
Networks
Sunil Gandhi; Tim Oates; Tinoosh Mohsenin and David
Hairston (2018)
https://doi.org/10.1007/978-3-319-93040-4_23
Pre-processing
The DC component in EEG data is different for each
recording. We normalize every window of clean and
noisy data to remove the DC offset from the data. We
remove the DC offset by subtracting the median of the
data in the window.
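The pre-processing step described above is essentially a one-liner; a small numpy sketch for windowed 1D data (the window length is an arbitrary choice here, and for multi-channel EEG it would be applied per channel):

```python
import numpy as np

def remove_dc_offset(x, window=512):
    """Subtract the per-window median from a 1D signal."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    for start in range(0, len(x), window):
        seg = out[start:start + window]   # view into `out`
        seg -= np.median(seg)             # in-place median removal for this window
    return out
```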
Evaluation of EEG data is challenging as the
ground truth noiseless signals are not
known. Multiple approaches to evaluation
have been proposed in recent years,
however, authors do not agree on a single
mechanism for evaluating artifact removal.
GANs Also for speech denoising
SEGAN: Speech enhancement generative
adversarial network.
Santiago Pascual, Antonio Bonafonte, and Joan Serra (2017)
https://arxiv.org/abs/1703.09452
https://github.com/santi-pdp/segan
“For the purpose of speech enhancement
and denoising, the SEGAN was developed,
employing a neural network with an encoder and
decoder pathway that successively halves and
doubles the resolution of feature maps in each
layer, respectively, and features skip connections
between encoder and decoder layers.
The model works as an encoder-decoder fully-
convolutional structure, which makes it fast to
operate for denoising waveform chunks. The
results show that, not only is the method viable, but it
can also represent an effective alternative to current
approaches.
Possible future work involves the exploration of
better convolutional structures and the inclusion of
perceptual weightings in the adversarial training,
so that we reduce possible high frequency artifacts
that might be introduced by the current model.
Further experiments need to be done to compare
SEGAN with other competitive approaches.” The dataset is a selection of 30 speakers
from the VoiceBank corpus.
GANs Also for multichannel audio denoising
Multi-View Networks for Denoising of Arbitrary
Numbers of Channels
Jonah Casebeer, Brian Luc and Paris Smaragdis (July 2018)
https://arxiv.org/abs/1806.05296
“We propose a set of denoising neural networks capable
of operating on an arbitrary number of channels at
runtime, irrespective of how many channels they were
trained on. We coin the proposed models multi-view
networks since they operate using multiple views of the
same data.
We explore two such architectures and show how they
outperform traditional denoising models in multi-channel
scenarios. Additionally, we demonstrate how multi-
view networks can leverage information
provided by additional recordings to make
better predictions, and how they are able to
generalize to a number of recordings not seen in
training.”
GANs for generative models of time series
On the evaluation of generative models in music
Li-Chia Yang, Alexander Lerch (October 2018)
https://doi.org/10.1007/s00521-018-3849-7
https://github.com/RichardYang40148/mgeval
Therefore, we propose a set of simple
musically informed objective metrics
enabling an objective and reproducible
way of evaluating and comparing the
output of music generative systems.
We demonstrate the usefulness of the
proposed metrics with several
experiments on real-world data.
We have released the evaluation
framework as an open-source toolbox
which implements the demonstrated
evaluation and analysis methods along
with visualization tools. Our future work
will include the extension of the current
toolbox with additional dimensions (e.g.,
dynamics) and to expand it toward
polyphonic music.
Analysis and Classification Literature Review
Classification, non-DL algorithms: COTE
The great time series classification bake off:
a review and experimental evaluation of
recent algorithmic advances
Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large,
Eamonn Keogh (May 2017)
https://doi.org/10.1007/s10618-016-0483-9
https://bitbucket.org/TonyBagnall/time-series-classification
“We have implemented 18 recently proposed algorithms in a
common Java framework (Weka) and compared them
against two standard benchmark classifiers (and each other)
by performing 100 resampling experiments on each of the 85
datasets. We use these results to test several hypotheses
relating to whether the algorithms are significantly more
accurate than the benchmarks and each other. Our results
indicate that only nine of these algorithms are significantly
more accurate than both benchmarks and that one classifier,
the collective of transformation ensembles, is significantly
more accurate than all of the others.”
Summary of the time and space complexity of the
18 TSC algorithms considered
However, our conclusion is that using COTE
(Bagnall et al. 2015; cited by 91) will probably give you
the most accurate model. If a simpler approach is needed
and the discriminatory features are likely to be embedded in
subseries, then we would recommend using TSF or ST if the
features are in the time domain (depending on whether they
are phase dependent or not) or BOSS if they are in the
frequency domain. If a whole series elastic measure seems
appropriate, then using EE is likely to lead to better predictions
than using just DTW.
Time series Intro of DNN use #1A
Deep learning for time series classification: a review
Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, Pierre-Alain Muller (Submitted on 12 Sep 2018)
https://arxiv.org/abs/1809.04356 | https://github.com/hfawaz/dl-4-tsc
In this article, we study the current
state of the art performance of deep
learning algorithms for Time Series
Classification (TSC) by presenting
an empirical study of the most recent
DNN architectures for TSC. We give
an overview of the most successful
deep learning applications in various
time series domains under a
unified taxonomy of DNNs for
TSC. We also provide an open
source deep learning
framework to the TSC community
where we implemented each of the
compared approaches and
evaluated them on a univariate TSC
benchmark (the UCR archive) and
12 multivariate time series datasets.
By training 8,730 deep learning
models on 97 time series
datasets, we propose the most
exhaustive study of DNNs for TSC to
date.
COTE is currently considered the state of the art for time series classification (Bagnall et al., 2017)
when evaluated over the 85 datasets from the UCR archive (Chen et al., 2015b).
Finally, adding to the huge runtime of COTE, the decision taken by 35 classifiers cannot be interpreted
easily by domain experts, since researchers already struggle with understanding the decisions taken by
an individual classifier.
● What is the current state-of-the-art DNN for TSC?
● Is there a current DNN approach that reaches state-of-the-art performance for TSC and is less
complex than COTE?
● What type of DNN architecture works best for the TSC task?
● And finally: could the black-box effect of DNNs be avoided to provide interpretability?
Given that the latter questions have not been addressed by the TSC community, it is surprising that
recent papers have neglected the possibility that TSC problems could be solved using a pure
feature learning algorithm.
Time series Intro of DNN use #1B
The result of applying a learned
discriminative convolution on the GunPoint
dataset
Deep learning for time series classification: a review
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
Time series Intro of DNN use #1C
Given the aforementioned limitations of
generative models, we decided to limit our
experimental evaluation to discriminative
deep learning models for TSC.
Second, since we cannot cover an empirical study of
all approaches validated in all TSC domains, we
decided to only include approaches that were validated
on the whole (or a subset of) the univariate time
series UCR archive and/or on the MTS archive
(Baydogan, 2015).
Finally, we chose to work with approaches that do not try to
solve a sub-task of the TSC problem such as in Geng and
Luo (2018) where CNNs were modified to solve the task of
classifying imbalanced time series datasets. Another sub-
task that has been at the center of recent studies is early time
series classification (Wang et al., 2016a) where deep CNNs
were modified to include an early classification of time series.
More recently, a deep reinforcement learning approach was
also proposed for the early TSC task (Martinez et al., 2018).
For further details, we refer the interested reader to a recent
survey on deep learning for early time series
classification (Santos and Kern, 2017).
The third and final proposed architecture in Wang et al. (2017) is a relatively deep Residual Network
(ResNet). For TSC, this is the deepest architecture with 11 layers, of which the first 9 layers are
convolutional, followed by a GAP layer that averages the time series across the time dimension.
Deep learning for time series classification: a review
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
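A hedged Keras sketch of the ResNet-for-TSC shape described above (9 convolutional layers in 3 residual blocks, followed by global average pooling and a softmax output); the filter counts (64, 128, 128) and kernel sizes (8, 5, 3) follow the configuration commonly attributed to Wang et al. (2017) and the dl-4-tsc repository, so treat them as assumptions rather than a verified reproduction.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, n_filters):
    """Three conv-BN-ReLU layers with a 1x1 shortcut, in the ResNet-for-TSC style."""
    y = x
    for k in (8, 5, 3):
        y = layers.Conv1D(n_filters, k, padding="same")(y)
        y = layers.BatchNormalization()(y)
        y = layers.Activation("relu")(y)
    shortcut = layers.Conv1D(n_filters, 1, padding="same")(x)
    shortcut = layers.BatchNormalization()(shortcut)
    return layers.Activation("relu")(layers.add([y, shortcut]))

def resnet_tsc(input_length, n_channels, n_classes):
    inp = layers.Input(shape=(input_length, n_channels))
    x = residual_block(inp, 64)
    x = residual_block(x, 128)
    x = residual_block(x, 128)
    x = layers.GlobalAveragePooling1D()(x)   # GAP across the time dimension
    out = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)

# model = resnet_tsc(input_length=500, n_channels=1, n_classes=5)
# model.compile(optimizer="adam", loss="categorical_crossentropy")
```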
Time series Intro of DNN use #1D
Deep learning for time series classification: a review
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
Given the huge number of models [8,730 experiments for the 85 univariate TSC datasets] that
needed to be trained, we ran our experiments on a cluster of 60 GPUs. These GPUs were a mix
of four types of Nvidia graphic cards: GTX 1080 Ti, Tesla K20, K40 and K80.
The total sequential running time was approximately 100 days, that is, if the computation had
been done on a single GPU. However, by leveraging the cluster of 60 GPUs, we managed to
obtain the results in less than one month. We implemented our framework using the open-
source deep learning library Keras with the TensorFlow back-end.
Figure 1 shows the critical difference diagram (Demšar, 2006, cited by 6414),
where a thick horizontal line shows a group of classifiers (a
clique) that are not significantly different in terms of accuracy.
→ An Extension on "Statistical Comparisons of Classifiers over Multiple
Data Sets" for all Pairwise Comparisons
Time series Intro of DNN use #1E: ResNet the Top Dog
Deep learning for time series classification: a review
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
Time series Intro of DNN use #1F: ResNets vs. Traditional
Deep learning for time series classification: a review
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
We give two potential reasons for
the high generalization
capabilities of deep CNNs on
TSC tasks.
First, having seen the success of
convolutions in classification tasks
that require learning features that
are spatially invariant in a two
dimensional space (such as width
and height in images), it is only
natural to think that discovering
patterns in a one dimensional
space (time) should be an easier
task for CNNs thus requiring less
data to learn from.
The other more direct reason
behind the high accuracies of
deep CNNs on time series data is
its success in other sequential data
such as speech recognition and
sentence classification where text
and audio, similarly to time series
data, exhibit a natural temporal
ordering.
We compared ResNet (the most accurate DNN of our study) with the current state-of-the-art classifiers evaluated on the UCR
archive in the great time series classification bake off (Bagnall et al. (2017)). Note that our empirical study strongly
suggests to use ResNet instead of any other deep learning algorithm.
Out of the 18 classifiers evaluated by Bagnall et al. (2017), we have chosen the four best performing algorithms:
(1) Elastic Ensemble (EE) proposed by Lines and Bagnall (2015) is an ensemble of nearest neighbor classifiers with 11
different time series similarity measures; (2) Bag-of-SFA-Symbols (BOSS) published in Schäfer (2015) forms a
discriminative bag of words by discretizing the time series using a Discrete Fourier Transform and then building a nearest
neighbor classifier with a bespoke distance measure; (3) Shapelet Transform (ST) developed by Hills et al. (2014)
extracts discriminative subsequences (shapelets) and builds a new representation of the time series that is fed to an
ensemble of 8 classifiers; (4) Collective of Transformation-based Ensembles (COTE) proposed by
Bagnall et al. (2017) is basically a weighted ensemble of 35 TSC algorithms including EE and ST. Finally, we added a recent
approach named Proximity Forest (PF) which is similar to Random Forest but replaces the attribute-based splitting
criteria by a random similarity measure chosen out of EE's elastic distances (Lucas et al., 2018).
Although COTE is still the
most accurate classifier (when
evaluated on the UEA archive), its
use in a real data mining
application is limited due to its
huge training time
complexity, which is O(N^2 · T^4).
Time series Intro of DNN use #1G:
Deep learning for time series classification: a review
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
Again, we can clearly see the dominance of ResNet as the
best performing approach across different domains. One
exception is the electrocardiography (ECG) datasets
(7 in total) where ResNet was drastically beaten by the FCN
model in 71.4% of ECG datasets.
THEMES
One might expect that the relatively
short filters (3) might affect the
performance of ResNet and FCN since
longer patterns cannot be captured by
short filters. However, since increasing
the number of convolutional layers will
increase the path length viewed
(receptive field) by the CNN model
(Vaswani et al., 2017), ResNet and FCN
managed to outperform other
approaches whose filter length is longer
(21), such as Encoder.
SIGNAL LENGTH
Wang et al. (2017) later introduced a one-
dimensional CAM with an application to TSC. This
method explains the classification of a certain deep
learning model by highlighting the subsequences that
contributed the most to a certain classification.
An interesting observation would be to compare the
discriminative regions identified by a deep learning model with
the most discriminative shapelets extracted by other shapelet-
based approaches. This observation would also be backed up
by the mathematical proof provided by Cui et al. (2016), that
showed how the learned filters in a CNN can be
considered a generic form of shapelets extracted by the
learning shapelets algorithm (Grabocka et al., 2014).
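For intuition, here is a hedged numpy sketch of the one-dimensional Class Activation Map idea mentioned above: with a GAP layer followed by a dense softmax layer (as in the FCN/ResNet architectures of the review), the class-c activation map is the sum of the last conv layer's feature maps weighted by the dense-layer weights for class c. The variable names are assumptions, not tied to a specific implementation.

```python
import numpy as np

def cam_1d(last_conv_feats, dense_weights, class_idx):
    """1D Class Activation Map.

    last_conv_feats: (time_steps, n_filters) output of the final conv layer
    dense_weights:   (n_filters, n_classes) weights of the softmax layer after GAP
    Returns a (time_steps,) importance profile for the requested class.
    """
    cam = last_conv_feats @ dense_weights[:, class_idx]
    cam -= cam.min()
    return cam / (cam.max() + 1e-12)   # normalized to [0, 1] for plotting over the series
```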
Time series Intro of DNN use #1H: Future
Deep learning for time series classification: a review
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
Although we have
conducted an extensive
experimental evaluation,
deep learning for Time
Series Classification,
unlike for computer vision
and NLP tasks, still lacks a
thorough study of data
augmentation (Ismail
Fawaz et al., 2018a;
Forestier et al., 2017) and
transfer learning.
Furthermore, we think
that the effect of z-
normalization (and
other normalization
methods) on the learning
capabilities of DNNs
should also be thoroughly
explored.
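Since z-normalization keeps coming up as a preprocessing choice for TSC, here is a tiny numpy sketch of the usual per-series variant (each series is standardized independently; the epsilon guards against constant series):

```python
import numpy as np

def z_normalize(series, eps=1e-8):
    """Standardize one time series to zero mean and unit variance."""
    series = np.asarray(series, dtype=float)
    return (series - series.mean()) / (series.std() + eps)

def z_normalize_batch(X, eps=1e-8):
    """Normalize each row of an (n_series, length) array independently."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + eps)
```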
What makes ImageNet good for
transfer learning?
Minyoung Huh, Pulkit Agrawal, Alexei A. Efros
https://arxiv.org/abs/1608.08614
“Our results might indicate that researchers have
been overestimating the amount of data required
for learning good general CNN features. If that is the
case, it might suggest that CNN training is not as
data-hungry as previously thought. It would also
suggest that beating ImageNet-trained features with
models trained on a much bigger data corpus will be
much harder than once thought.”
AutoAugment: Learning Augmentation Policies from Data
Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, Quoc V. Le (9 Oct 2018)
https://arxiv.org/abs/1805.09501
https://github.com/tensorflow/models/tree/master/research/autoaugment
“We describe a simple procedure called
AutoAugment to search for improved data
augmentation policies”
Albumentations: fast and flexible image augmentations
Alexander Buslaev, Alex Parinov, Eugene Khvedchenya, Vladimir I. Iglovikov, Alexandr A. Kalinin (18 Sep 2018)
https://arxiv.org/abs/1809.06839
https://github.com/albu/albumentations
“We present Albumentations, a fast and
flexible library for image augmentations with
many various image transform operations
available, that is also an easy-to-use
wrapper (based on highly-optimized
OpenCV library) around other augmentation
libraries.”
Combining raw and normalized data in multivariate time series classification with dynamic time warping
Łuczak, Maciej (2018)
http://doi.org/10.3233/JIFS-171393
Time series Intro of DNN use #1H2: Transfer Learning
Transfer learning for time series classification
Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar and Pierre-Alain Muller
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
When observing the heatmap in Fig. 4, one can easily see that fine-tuning a pre-trained model almost never hurts the performance of the CNN.
In our future work, we aim again to reduce
the deep neural network’s overfitting
phenomena by generating synthetic data
using a Weighted DTW Barycenter
Averaging method [Forestier et al. 2017], since
the latter distance gave encouraging
results in guiding a complex deep learning
tool such as transfer learning. Finally, with
big data repositories becoming more
frequent, leveraging existing source
datasets that are similar to, but not
exactly the same as a target dataset of
interest, makes a transfer learning method
an enticing approach.
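A minimal Keras sketch of the fine-tuning setup this slide discusses: a model pre-trained on a source TSC dataset is reloaded, its softmax layer is swapped for the target task, and all weights are updated with a small learning rate. The file path, layer index and class count are placeholders, not values from the paper.

```python
from tensorflow.keras import layers, models, optimizers

# load a model previously trained on a source time-series dataset (path is a placeholder)
source_model = models.load_model("source_task_model.h5")

# drop the source softmax and attach a new head for the target classes
features = source_model.layers[-2].output       # penultimate layer, e.g. global pooling
n_target_classes = 5                            # illustrative assumption
outputs = layers.Dense(n_target_classes, activation="softmax")(features)
target_model = models.Model(source_model.input, outputs)

# fine-tune every layer with a small learning rate so pre-trained filters are only nudged
target_model.compile(optimizer=optimizers.Adam(1e-4),
                     loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# target_model.fit(X_target, y_target, epochs=20, batch_size=16)
```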
Time series Intro of DNNs #2: Why do ResNets work?
Wang et al., 2017 https://doi.org/10.1109/IJCNN.2017.7966039
Why and When Can Deep -- but Not Shallow -- Networks Avoid the Curse of Dimensionality: a Review
Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, Qianli Liao - https://arxiv.org/abs/1611.00740
The paper characterizes classes of functions for which deep learning can be exponentially better than shallow learning.
http://www.telesens.co/2019/01/16/neural-network-loss-visualization/
http://www.telesens.co/loss-landscape-viz/viewer.html
Visualizing the Loss Landscape of Neural Nets
http://papers.nips.cc/paper/7875-visualizing-the-loss-landscape-of-neural-nets
H. Li et al. (2017)
Time series Intro of DNN use #2A
CNN Approaches for Time Series classification
Lamyaa Sadouk
https://cdn.intechopen.com/pdfs/64216.pdf (2018)
Instead of employing the FFT which is restricted to a predefined fixed
window length, we choose to adopt the Stockwell transform (ST)
as our preprocessing method for CNN training. The advantage of the ST over the FFT is its ability to adaptively capture spectral changes over time without windowing of data, resulting in a better time-frequency resolution for non-stationary signals [Stockwell 1996].
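For orientation, a compact numpy sketch of the discrete Stockwell transform (the standard FFT-based formulation from Stockwell et al., 1996) that could generate such time-frequency images for CNN training; the function name and return layout are our choices, not from the chapter.

```python
import numpy as np

def stockwell_transform(x):
    """Minimal discrete Stockwell transform.

    Returns a (N//2 + 1, N) complex array: rows are frequencies from 0 to Nyquist,
    columns are time samples. Row 0 (zero frequency) is just the signal mean.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    X = np.fft.fft(x)                       # spectrum of the signal
    n_freqs = N // 2 + 1
    S = np.zeros((n_freqs, N), dtype=complex)
    S[0, :] = x.mean()                      # DC row: no localisation possible
    m = np.fft.fftfreq(N) * N               # frequency sample indices
    for n in range(1, n_freqs):
        # Gaussian localizing window in the frequency domain, width proportional to n
        gauss = np.exp(-2.0 * np.pi ** 2 * m ** 2 / n ** 2)
        # shift the spectrum by n, apply the window, and go back to the time domain
        S[n, :] = np.fft.ifft(np.roll(X, -n) * gauss)
    return S
```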
While works [17, 24] transformed the time series signals (by applying down-sampling, slicing, or warping) so as to help the convolutional filters (especially the 1st convolutional layer filters) capture entire peaks (i.e., whole peaks) and fluctuations within the signals, the work of [18] proposed to keep time series data unchanged and rather feed them into three branches, each having a different 1st convolutional filter size, in order to capture the whole fluctuations within signals. An alternative is to find an adaptive 1st convolutional layer filter which has the most optimal size and is able to capture most of the entire peaks present in the input signals. The question of how to compute this adaptive 1st convolutional layer filter is addressed in [4].
Therefore, the most optimal size of the 1st convolutional filter is equal to the sample median of signal peak lengths, suggesting that 0.1 is the best time span of the 1st convolutional layer to retrieve the whole acceleration peaks and the best acceleration changes. Similarly, in the frequency domain, the 1st convolutional layer kernel yielding the highest F1-score is the one with size 10, which is simply the sample median (Me(x) = 10).
(a) and (b): Histograms and boxplots of the frequency distribution of 30 peak lengths present within 30 randomly selected time- and frequency-domain signals, respectively.
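A small sketch of the adaptive kernel-size heuristic described above: estimate typical peak widths on a sample of training signals and use the sample median as the first convolutional layer's filter length. scipy's generic peak detector stands in for whatever peak definition the chapter actually uses, and the prominence value is illustrative.

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths

def median_peak_length(signals, prominence=0.5):
    """Return the sample median of peak widths (in samples) over a set of 1D signals."""
    widths = []
    for x in signals:
        peaks, _ = find_peaks(x, prominence=prominence)
        if len(peaks):
            w, _, _, _ = peak_widths(x, peaks, rel_height=0.5)
            widths.extend(w)
    return int(np.median(widths))

# first_kernel_size = median_peak_length(train_signals)   # e.g. ends up around 10 samples
# conv1 = tf.keras.layers.Conv1D(filters=32, kernel_size=first_kernel_size, padding="same")
```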
Time series Intro of DNN use #2B
CNN Approaches for Time Series classification
Lamyaa Sadouk
https://cdn.intechopen.com/pdfs/64216.pdf (2018)
Some fields, such as medicine, experience a lack of annotated data, as manually annotating a large set requires human expertise and is time consuming.
The conventional approach to deal with this kind of problem is to
perform data augmentation by applying transformations to the
existing data. Data augmentation achieves slightly better time series
classification rates but still the CNN is prone to overfitting. In this section,
we present another solution to this problem, a “knowledge transfer”
framework which is a global, fast and light-weight framework that
combines the transfer learning technique with an SVM classifier.
Transfer learning is a machine learning technique where a model trained on one task (a source domain) is re-purposed on a second related task (a target domain). Accordingly, the questions that arise are: (i) which source learning task should be used for pre-training the CNN model given a target learning task, and (ii) which parts (e.g., learned features) of this model are common between the source and target learning tasks.
In that sense, we propose a "Transfer learning with SVM read-out" framework which is composed of two parts: (i) the first part having first and intermediate layers' weights of a CNN already pre-trained on a source learning task (the last CNN layer being discarded), and (ii) the second part composed of a support vector machine (SVM) classifier with RBF kernel which is connected to the end of the first part.
Then, we feed the entire training dataset of the target task into this framework in order to train the SVM parameters. As opposed to training a CNN on the target task, which requires updating all hidden layers' weights for several iterations using a large training set for all these weights to converge, our framework computes weights of the last layer(s) only, in one iteration only.
Moreover, the advantage of using the SVM as the classifier is that it is fast and generally performs well on small training sets, since it only relies on the support vectors, which are the training samples that lie exactly on the hyperplanes used to define the margin. In addition, SVMs have the powerful RBF kernel, which allows mapping the data to a very high dimensional space in which the data can be separable by a hyperplane, hence guaranteeing convergence. Hence, our framework can be regarded as a global, fast and light-weight technique for time series classification where the target task has limited annotated/labeled data.
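A minimal sketch of this "transfer learning with SVM read-out" idea: freeze a CNN pre-trained on the source task, use its penultimate activations as features for the target data, and fit an RBF-kernel SVM on top. The model path, layer index and hyperparameters are placeholders, not the chapter's exact setup.

```python
from tensorflow.keras import models
from sklearn.svm import SVC

# pre-trained source CNN (placeholder path); its last (classification) layer is discarded
cnn = models.load_model("source_cnn.h5")
feature_extractor = models.Model(cnn.input, cnn.layers[-2].output)

def svm_readout(X_target_train, y_target_train, X_target_test):
    """Train an RBF-SVM on frozen CNN features of a small target dataset."""
    F_train = feature_extractor.predict(X_target_train)   # (n_samples, n_features)
    F_test = feature_extractor.predict(X_target_test)
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(F_train, y_target_train)
    return clf.predict(F_test)
```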
Time series Intro of DNN use #3
3D convolution recurrent neural networks for bird sound detection
Himawan, Ivan, Towsey, Michael, & Roe, Paul (2018)
https://eprints.qut.edu.au/122760/
https://github.com/himaivan/BAD2
We propose 3D convolutions for extracting long-term and short-term
information in frequency simultaneously. In order to leverage powerful
and compact features of 3D convolution, we employ separate recurrent
neural networks (RNN), acting on each filter of the last convolutional
layers rather than stacking the feature maps in the typical combined
convolutionandrecurrentarchitectures.
We split 10-second audio clip into 5 × 2-second clips. The 2- second
length is based on empirical analysis. A spectrogram (from 2-second clip)
computed from sequences of Short-Time Fourier Transform (STFT) of
overlappingwindowedsignalsisusedasthesoundrepresentation.
The 3D convolution highlights only frequency bands where the bird calls are located across the temporal dimension.
As a comparison, the 2D convolution in CNN+RNN highlights a few specific locations of the bird calls, and includes low-frequency regions with no bird calls.
This shows that 3D convolution is more capable of extracting long-term temporal information in bird calls.
In future work, we will investigate the method of generating labeled data via a
pseudo-labeling method where approximate labels are produced from unlabeled data.
This can be achieved, for example, using generative adversarial networks. Domain
adaptation using adversarial learning is another alternative to build a
discriminativemodelandinvarianttodomainatthesametime.
Early Time Series Classification
A Literature Survey of Early Time Series Classification and Deep Learning
Tiago Santos and Roman Kern (2017)
http://ceur-ws.org/Vol-1793/paper4.pdf
Early time series classification
aims to classify a time series with
as few temporal
observations as possible,
while keeping the loss of
classification accuracy at a
minimum. One of the first works on
the topic of early classification, as
defined over time series length,
was written by [31].
Prominent early classification
frameworks reviewed by this
paper include, but are not limited
to, ECTS, RelClass and
ECDIRE.
These works have shown that
early time series classification may
be feasible and performant, but
they also show room for
improvement.
ECDIRE https://doi.org/10.1007/s10618-016-0462-1
RelClass
https://dl.acm.org/citation.cfm?id=2627671
Early TSC with deep reinforcement learning
A deep reinforcement learning approach for early classification of time series
Martinez Coralie, Guillaume Perrin, E Ramasso, Michèle Rombaut
https://hal.archives-ouvertes.fr/hal-01825472/
We formulate the early classification problem in a
reinforcement learning framework: we introduce a suitable
set of states and actions but we also define a specific reward
function which aims at finding a compromise between earliness
and classification accuracy.
While most of the existing solutions do not explicitly take time into
account in the final decision, this solution allows the user to set this
trade-off in a more flexible way. In particular, we show
experimentally on datasets from the UCR time series archive that
this agent is able to continually adapt its behavior without
human intervention and progressively learn to compromise
between accurate and fast predictions.
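The trade-off can be made concrete with a toy reward of the following form; this is an illustrative reward, not the exact function defined by Martinez et al., and the names and weighting are our own.

```python
def early_classification_reward(correct, t, T, lam=0.5):
    """Hedged sketch of a reward trading off accuracy against earliness.

    correct : bool, whether the label emitted at decision time was right
    t       : int, number of observations consumed before deciding
    T       : int, full length of the time series
    lam     : float in [0, 1], user-chosen weight on earliness
    """
    accuracy_term = 1.0 if correct else -1.0
    earliness_term = -lam * (t / T)      # later decisions are penalised
    return accuracy_term + earliness_term
```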
Evolution of the early classifier agent behaviour on the Gun-Point dataset. The scatter plot shows the relationship between accuracy (in percentage) and average time of prediction of the agent over training. We evaluate the agent on the whole training set every 5,000 iterations. Each evaluation corresponds to one dot. Dot points are coloured according to iterations of training: blue dots correspond to early training while yellow dots correspond to the agent's performance after 100,000 iterations of training. We evaluate the agent's policy surrounded by the red star on the testing set and we report its performance in table I. In this experiment, the agent learned to slow its predictions down and improved its accuracy over training.
As future work, we plan to improve the proposed approach with a
dynamic adjustment of the reward function parameters over training
based on the user trade-off criteria. We will also propose a new
management of the agent’s replay memory which could be more suitable
forthe problem of early classification.
Early TSC for clinical use: ICU Mortality Prediction
Dynamic Prediction of ICU Mortality Risk Using Domain Adaptation
Tiago Alves, Alberto Laender, Adriano Veloso, Nivio Ziviani
https://homepages.dcc.ufmg.br/~nivio/papers/alves@bigdata18.pdf
Early recognition of risky trajectories during an Intensive
Care Unit (ICU) stay is one of the key steps towards improving
patient survival. Learning trajectories from physiological
signals continuously measured during an ICU stay requires
learning time-series features that are robust and discriminative
across diverse patient populations.
Mortality risk space for different ICU domains. Regions in red are risky. Each axis is a t-SNE non-linear combination of: (top row) physiological parameters, or (bottom row) features extracted by CNN-LSTM.
Biosignal Deep Learning
Deep learning for healthcare applications based on physiological signals: A review. SG authors: "We have cast the net into the ocean of knowledge t..."
Oliver Faust, Yuki Hagiwara, Tan Jen Hong, Oh Shu Lih, U Rajendra Acharya https://doi.org/10.1016/j.cmpb.2018.04.005
Once the architecture is chosen, the tuning
parameters must be adjusted. Both the structure
selection and parameter adjustment will basically
influence the model. Hence, it is necessary to have
many test runs. Shortening the training phase of
deep learning models is an active area of research
[159]. The challenge is speeding up the training
process in a parallel distributed processing system
[160]. The network between the individual
processors becomes the bottleneck [161].
Graphics Processing Unit (GPUs) can be used to
reduce the network latency [162]
ECG Classification #1
A convolutional neural network for ECG annotation as the basis for classification of cardiac rhythms
Philipp Sodmann et al. 2018 Physiol. Meas. in press https://doi.org/10.1088/1361-6579/aae304
https://github.com/MarcusVollmer/PhysioNet
222,202 R peaks, 192,200 P waves, 256,966 T waves, and 3,311,487 interbeat segments were extracted from the QT database.
In total, approximately 12,000,000 characteristic waveforms were used as input volume. The assigned annotation codes of the midpoint peak of each segment were used as output volume.
A major advantage of decision trees is that they directly provide information on feature importance.
ECG Classification #2
Detecting and interpreting myocardial infarctions using fully convolutional neural networks
Nils Strodthoff, Claas Strodthoff
(Submitted on 18 Jun 2018)
https://arxiv.org/abs/1806.07385
We consider the detection of myocardial infarction in
electrocardiography (ECG) data as provided by the PTB
ECG database without non-trivial preprocessing. The
classification is carried out using deep neural networks
in a comparative study involving convolutional as well as
recurrent neural network architectures. The best
architecture, an ensemble of fully convolutional
architectures, beats state-of-the-art results on this
dataset and reaches 93.3% sensitivity and 89.7%
specificity evaluated with 10-fold crossvalidation, which
is the performance level of human cardiologists for this
task.
We investigate questions relevant for clinical
applications such as the dependence of the
classification results on the considered data channels
and the considered subdiagnoses. Finally, we apply
attribution methods to gain an understanding of the
network's decision criteria on an exemplary basis.
Time series classification in a realistic setting has to be able to cope with time series that are so large that they cannot be used as input to a single neural network, or that cannot be downsampled to reach this state without losing too much information. At this point two different procedures are conceivable: either one uses attentional models that allow to focus on regions of interest, see e.g. Karim et al. 2018, or one extracts random subsequences from the original time series. For reasons of simplicity, and with real-time on-site analysis in mind, we explore only the latter possibility, which is only applicable for signals that exhibit a certain degree of periodicity. The assumption underlying this approach is that the characteristics leading to a certain classification are present in every random subsequence. We stress at this point that this procedure does not rely on the identification of beginning and end points of certain patterns in the window. The procedure leaves two hyperparameters: the choice of the window size and an optional downsampling rate to reduce the temporal input dimension for the neural network.
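A sketch of that random-subsequence procedure: crop fixed-length windows at random positions (optionally downsampled) from each long recording, under the assumption that the class-relevant pattern recurs in every window. Window size and downsampling rate are the two hyperparameters mentioned above; the function name is ours.

```python
import numpy as np

def random_subsequences(signal, window_size, n_windows, downsample=1, rng=None):
    """Extract random fixed-length crops from a long, roughly periodic 1D signal."""
    rng = np.random.default_rng() if rng is None else rng
    span = window_size * downsample
    starts = rng.integers(0, len(signal) - span, size=n_windows)
    return np.stack([signal[s:s + span:downsample] for s in starts])

# Training: feed crops with the label of their source recording.
# Inference: average the network's predictions over many crops of the same record.
```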
Moreover, we present a first exploratory study of
the application of interpretability methods in
this domain, which is a key requirement for
applications in the medical field. These methods
can not only help to gain an understanding and
thereby build trust in the network’s decision
process but could also lead to a data-driven
identification of important markers for certain
classification decisions in ECG data that might
even prove useful for human experts.
Here we identified common cardiologists’
decision rules in the network’s attribution maps
and outlined prospects for future studies in this
direction. Both such an analysis of attribution
maps and further improvements of the
classification performance would have to rely on
considerably larger databases such as for
quantitative precision. This would also allow
extension to further subdiagnoses and other
cardiac conditions such as other confounding and
non-exclusive diagnoses or irregular heart
rhythms.
ECG Classification #3
Automatic detection of sleep-disordered breathing events using recurrent neural networks from an electrocardiogram signal
Erdenebayar Urtnasan, Jong-Uk Park, Kyoung-Joung Lee
https://doi.org/10.1007/s00521-018-3833-2
In this study, we propose a novel method for
automatically detecting sleep-disordered
breathing (SDB) events using a recurrent neural
network (RNN) to analyze nocturnal electrocardiogram
(ECG) recordings. … Single-lead ECG recordings (200
Hz) were measured for an average 7.2-h duration and
segmented into 10-s events (2,000 samples). A
bandpass filter (5–11 Hz) was applied for data
preprocessing to removeundesired noisefrom theECG
signal.The dataset comprised a training dataset
(68,545 events) from 74 patients and test dataset
(17,157 events)from18patients
The proposed deep RNN model for automatic detection of SDB events was implemented by Keras' platform using a TensorFlow background (sic!).
ECG Classification #4
Arrhythmia detection using deep convolutional neural network with long duration ECG signals https://doi.org/10.1016/j.compbiomed.2018.09.009
Department of Cardiology, National Heart Centre Singapore, Singapore; Duke-NUS Medical School, Singapore
The goal of our research was to design a new method based on
deep learning (1D-CNN is employed) to efficiently and quickly
classify cardiac arrhythmias. Approach based on the analysis of
10-s ECG signal fragments (not a single QRS complex) is
applied (on average, 13 times less classifications/analysis). A
complete end-to-end structure was designed instead of the hand-
crafted feature extraction and selection used in traditional
methods. Can be used in tele-medicine especially in mobile
devices and cloud computing due to its low computational
complexity.
ECG Classification #5
Deep learning in the cross-time-frequency domain for sleep staging from a single lead electrocardiogram
https://doi.org/10.1088/1361-6579/aaf339
This study classifies sleep stages from a single lead
electrocardiogram (ECG) using beat detection,
cardiorespiratory coupling in the time-frequency domain and
a deep convolutional neural network (CNN).
An ECG-derived respiration (EDR) signal and
synchronous beat-to-beat heart rate variability (HRV) time
series were derived from the ECG using previously
described robust algorithms. A measure of
cardiorespiratory coupling (CRC) was extracted by
calculating the coherence and cross-spectrogram of
the EDR and HRV signal in five-minute windows.
A support vector machine (SVM) was then used to
combine the output of CNN with the other features derived
from the ECG, including phase-rectified signal averaging
(PRSA), sample entropy, as well as standard spectral and
temporal HRV measures.
The ECG signals were preprocessed by a finite impulse response (FIR) lowpass filter with a band stop at 22 Hz and a FIR highpass filter with a corner frequency of 1.2 Hz. A state-of-the-art QRS detector (jqrs) was used for ECG R-peak detection (Johnson et al. (2015)).
ECG Classification #6
Kalman-based Spectro-Temporal ECG Analysis using Deep Convolutional Networks for Atrial Fibrillation Detection
Zheng Zhao, Simo Särkkä, and Ali Bahrami Rad
https://arxiv.org/abs/1812.05555
For ECG signals, one can directly adopt 1D convolutional or recurrent network models for the classification task. However, transforming signals into the spectral domain (spectro-temporal features) is a promising alternative approach, knowing that the current state-of-the-art deep convolutional neural network (CNN) structures are typically designed for 2D images.
The contributions of this paper are: 1) We propose two extended models for spectro-temporal estimation using Kalman filter and smoother. We then combine them with deep convolutional networks for AF detection. 2) We test and compare the performance of the proposed approaches for spectro-temporal estimation on simulated data and AF detection with other popular estimation methods and different classifiers. 3) For AF detection, we evaluate the proposals using the PhysioNet/CinC 2017 dataset, which is considered to be a challenging dataset that resembles practical applications, and our results are in line with the state-of-the-art.
The key advantages of this kind of approach over other spectro-temporal methods are that we can apply them to both evenly and unevenly sampled signals [25] and they require no stationarity guarantees nor windowing.
In practice, the computational cost of Kalman filter and smoother can be extensive
when the length of the signal is very long. However, instead of the Fourier series state
space model in previous section, one can also derive an alternative representation
using stochastic oscillator differential equations. In this way, the dynamic and
measurement models become linear time-invariant (LTI) so that we can leverage a
stationary Kalman filter to reduce the time consumption. This kind of stochastic oscillator model was also considered in [33] and the link to periodic Gaussian process models was investigated in [35].
EEG Classification #1a
Deep learning with convolutional neural networks for EEG decoding and visualization https://doi.org/10.1002/hbm.23730
https://github.com/robintibor/braindecode/
There is increasing interest in using deep ConvNets for end-to-end EEG analysis, but a better understanding of how to design and train ConvNets for end-to-end EEG decoding, and how to visualize the informative EEG features the ConvNets learn, is still needed. Here, we studied deep ConvNets with a range of different architectures, designed for decoding imagined or executed tasks from raw EEG.
Our study thus shows how to design and train ConvNets to decode task-related information from the raw EEG without handcrafted features and highlights the potential of deep ConvNets combined with advanced visualization techniques for EEG-based brain mapping.
EEG Classification #1b
Deep learning with convolutional neural networks for EEG decoding and visualization
https://doi.org/10.1002/hbm.23730 → https://github.com/robintibor/braindecode/
Correlation between the mean squared envelope feature and unit output for a single subject at one electrode position (FCC4h).
Left: All correlations. Colors indicate the correlation between unit outputs per convolutional filter (x-axis) and mean squared
envelope in different frequency bands (y-axis). Filters are sorted by their correlation to the 7–13 Hz envelope (outlined by the
black rectangle). Note the large correlations/anticorrelations in the alpha/beta bands (7–31 Hz) and somewhat
weaker correlations/anticorrelations in the gamma band (around 75 Hz). Right: mean absolute values across units of all
convolutional filters for all correlation coefficients of the trained model, the untrained model and the difference between the
trained and untrained model. Peaks in the alpha, beta, and gamma bands are clearly visible.
CSP: common spatial patterns
EEG + ECG Classification
Use of features from RR-time series and EEG signals for automated classification of sleep stages in deep neural network framework
https://doi.org/10.1016/j.bbe.2018.05.005
The method uses iterative filtering (IF) based multiresolution analysis
approach for the decomposition of RR-time series into intrinsic mode
functions (IMFs). The recurrence quantification analysis (RQA) and
dispersion entropy (DE) based features are evaluated from the IMFs of the RR-time series. The dispersion entropy and the variance features are evaluated from the different bands of the EEG signal. The RR-time series features and the EEG features coupled with the deep neural network (DNN) are …
Stacked autoencoders with binary classifiers? Slightly confusing architecture. Engineered features with deep learning?
EMG Classification
EMG Pattern Recognition in the Era of Big Data and Deep Learning
Big Data Cogn. Comput. 2018, 2(3), 21;
https://doi.org/10.3390/bdcc2030021
We provide a review of recent research and development in EMG
pattern recognition methods that can be applied to big data
analytics.
These modern EMG signal analysis methods can be divided into two
main categories: (1) methods based on feature engineering
involving a promising big data exploration tool called topological data
analysis; and (2) methods based on feature learning with a special
emphasis on "deep learning".
Compared to other well-known bioelectrical signals (e.g.,
electrocardiogram, ECG; electrooculogram, EOG; and galvanic skin
response, GSR), however, the analysis of surface EMG signal is
more challenging given that it is stochastic in nature.
Due to the increasing availability of multi-modality sensing
systems, multi-modal analysis approaches are becoming a viable
option. Multiple modalities can be used to capture complementary
information which is not visible using a single modality, or to provide
context for others.
Even when two or more modalities capture similar information, their
combination can still improve the robustness of pattern
recognition systems when one of the modalities is missing or noisy.
Outside of prosthesis control, other applications of EMG pattern recognition for which multi-
modality data sets exist include, for example, sleep studies, such as the Cyclic Alternating
Pattern (CAP) Sleep Database [49] and the Sleep Heart Health Study (SHHS) Polysomnography
Database [50]; biomechanics, such as the cutting movement dataset [51] and the horse gait
dataset [52]; and brain computer interfaces, such as the Affective Pacman dataset [53] and the
emergency braking assistance dataset [54]. Recently, emotion recognition using multiple
physiological modalities has gained attention as another important application that has benefited
from the incorporation of surface EMG.
http://doi.org/10.3390/s17071622
Time series → 2D Recurrence Plots → Shapelets
This paper investigates the performance of Recurrence Plots (RP) [Eckmann et al. 1987] within the deep CNN model for TSC. RP provides a way to visualize the periodic nature of a trajectory through a phase space and enables us to investigate certain aspects of the m-dimensional phase space trajectory through a 2D representation. Because of the recent outstanding results by CNN on image recognition, we first encode time-series signals as 2D plots, and then treat the TSC problem as a texture recognition task. A CNN model with 2 hidden layers followed by a fully connected layer is used.
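A compact numpy sketch of the encoding step: embed the series in an m-dimensional phase space and threshold the pairwise distances, giving the binary recurrence image that is then fed to the CNN as a texture. Parameter names follow common RP conventions rather than the paper's exact settings.

```python
import numpy as np

def recurrence_plot(x, dim=3, delay=1, eps=None):
    """Binary recurrence plot of a univariate series (Eckmann et al., 1987).

    dim, delay : time-delay embedding parameters
    eps        : recurrence threshold; defaults to 10% of the maximum distance
    """
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * delay
    # time-delay embedding: each row is one phase-space point
    emb = np.stack([x[i:i + n] for i in range(0, dim * delay, delay)], axis=1)
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    eps = 0.1 * dists.max() if eps is None else eps
    return (dists <= eps).astype(np.uint8)        # (n, n) image for the CNN
```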
In particular, comparing models using RP with the traditional classification framework (e.g. SIFT, Gabor and LBP features with an SVM classifier [25, 26]) and other CNN-based time-series image classification (e.g. GAF-MTF images with CNN [23, 24]) demonstrates that using RP images with CNN in our proposed model obtains the better results.
As future work, CNN architectures with more feature representation layers should be investigated for more difficult TSC tasks (preferably with more data samples available). Large datasets are needed in order to train deeper architectures.
Therefore, adopting the proposed pipeline for TSC with small sample sizes can be another
interesting future direction. Exploring different ensemble learning methods for CNN can be also
interesting. We will particularly be investigating application of the output coding for CNNs.
Wavelets for deep learning TSC #1
Multilevel Wavelet Decomposition Network for Interpretable Time Series Analysis
https://doi.org/10.1145/3219819.3220060
To this end, we first designed a novel wavelet-based network structure called mWDN for
frequency learning of time series, which can then be seamlessly embedded into deep learning
frameworks by making all parameters trainable. We further designed two deep learning
models based on mWDN for time series classification and forecasting, respectively, and
the extensive experiments on abundant real-world datasets demonstrated their superiority to
state-of-the-art competitors. As a nice try for interpretable deep learning, we further propose an importance analysis method for identifying important factors for time series analysis, which in turn verifies the interpretability merit of mWDN.
Frequency Analysis of Time Series. Frequency analysis of time series data has been deeply studied by the signal processing community. Many classical methods, such as the Discrete Wavelet Transform, Discrete Fourier Transform, and Z-Transform, have been proposed to analyze the frequency pattern of time series signals. In existing TSC/TSF applications, however, transforms are usually used as an independent step in data preprocessing, which has no interactions with model training and therefore might not be optimized for TSC/TSF tasks from a global view. In recent years, some research works, such as Clockwork RNN [Koutnik et al. 2014] and SFM [Hao Hu and Guo-Jun Qi 2017], have begun to introduce the frequency analysis methodology into the deep learning framework. To our best knowledge, our study is among the very few works that embed wavelet time series transforms as a part of neural networks so as to achieve end-to-end learning.
Wavelets for deep learning TSC #2
Learning filter widths of spectral decompositions with wavelets
Haidar Khan and Bülent Yener. Rensselaer Polytechnic Institute
http://papers.nips.cc/paper/7711-learning-filter-widths-of-spectral-decompositions-with-wavelets.pdf
https://github.com/haidark/WaveletDeconv
We propose the wavelet deconvolution (WD) layer as an efficient alternative to this preprocessing step that eliminates a significant number of hyperparameters. The WD layer uses wavelet functions with adjustable scale parameters to learn the spectral decomposition directly from the signal.
Furthermore, the WD layer adds interpretability to the learned time series classifier by exploiting the properties of the wavelet transform.
As future work, we plan to investigate how to extend the WD layer to signals in higher dimensions, such as images and video, as well as generalizing the wavelet transform to empirical mode decompositions (EMDs).
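To make the idea of a layer with trainable filter widths concrete, here is an illustrative Keras re-creation: each output channel convolves the input with a Ricker (Mexican-hat) wavelet whose width is a trainable parameter, so the spectral decomposition is learned jointly with the rest of the network. The exact wavelet family, parameterization and initialisation used by Khan & Yener may differ; this is a sketch under those assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

class WaveletDeconv(layers.Layer):
    """Wavelet-deconvolution-style layer with trainable filter widths (sketch)."""
    def __init__(self, n_scales=8, kernel_size=65, **kwargs):
        super().__init__(**kwargs)
        self.n_scales = n_scales
        self.kernel_size = kernel_size

    def build(self, input_shape):
        # log-parameterised widths so they stay positive during training
        init = np.log(np.linspace(1.0, 8.0, self.n_scales)).astype("float32")
        self.log_widths = self.add_weight(
            name="log_widths", shape=(self.n_scales,),
            initializer=tf.constant_initializer(init), trainable=True)

    def call(self, x):                                    # x: (batch, time, 1)
        t = tf.range(self.kernel_size, dtype=tf.float32)
        t = t - tf.cast(self.kernel_size - 1, tf.float32) / 2.0
        widths = tf.exp(self.log_widths)                  # (n_scales,)
        a = (t[None, :] / widths[:, None]) ** 2           # (n_scales, kernel_size)
        ricker = (1.0 - a) * tf.exp(-a / 2.0)             # one wavelet per scale
        filters = tf.transpose(ricker)[:, None, :]        # (kernel_size, 1, n_scales)
        return tf.nn.conv1d(x, filters, stride=1, padding="SAME")

# usage sketch: y = WaveletDeconv()(tf.random.normal((4, 1024, 1)))  # -> (4, 1024, 8)
```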
Wavelets for deep learning TSC #3
Multilevel Wavelet Decomposition Network for Interpretable Time Series Analysis
https://doi.org/10.1145/3219819.3220060
In this paper we propose a wavelet-based neural network structure called multilevel Wavelet
Decomposition Network (mWDN) for building frequency-aware deep learning models for
time series analysis. mWDN preserves the advantage of multilevel discrete wavelet
decomposition in frequency learning while enables the fine-tuning of all parameters under a
deep neural network framework. Based on mWDN, we further propose two deep learning
models called Residual Classification Flow (RCF) and multi-frequency Long Short-Term
Memory (mLSTM) for time series classification and forecasting, respectively. The two models
take all or partial mWDN decomposed sub-series in different frequencies as input, and resort to
the back propagation algorithm to learn all the parameters globally, which enables seamless
embedding of wavelet-based frequency analysis into deep learning frameworks.
Multivariate time-series classification #1: CNN only
Temporal Convolutional Neural Network for the Classification of Satellite Image Time Series
Charlotte Pelletier, Geoffrey I. Webb and François Petitjean
(Submitted on 31 Jan 2019)
https://arxiv.org/abs/1811.10166
https://github.com/charlotte-pel/temporalCNN (Keras)
Note! Despite the name, the authors used traditional convolutional filters for time series, and not TCNs.
Multivariate time-series classification #2: CNN + LSTM
Multivariate LSTM-FCNs for Time Series Classification
Fazle Karim, Somshubra Majumdar, Houshang Darabi, Samuel Harford
(Submitted on 14 Jan 2018)
https://arxiv.org/abs/1801.04503
We propose augmenting the existing univariate time series classification models, LSTM-FCN and ALSTM-FCN, with a squeeze-and-excitation block to further improve performance.
The proposed models work efficiently on various complex multivariate time series classification tasks such as activity recognition or action recognition. Furthermore, the proposed models are highly efficient at test time and small enough to deploy on memory-constrained systems. For datasets with class imbalance, a class weighting scheme inspired by King et al. (2001) is used.
Multivariate time-series classification #3: CNN + GRU
Deep Gated Recurrent and Convolutional Network Hybrid Model for Univariate Time Series Classification
Nelly Elsayed, Anthony S. Maida and Magdy Bayoumi
(Submitted on 27 Dec 2018) https://arxiv.org/abs/1812.07683
https://github.com/NellyElsayed/GRU-FCN-model-for-univariate-time-series-classification
The proposed GRU-FCN classification model shows that
replacing the LSTM by a GRU enhances the classification
accuracy without needing extra algorithm enhancements
such as fine-tuning or attention algorithms. The GRU also
has a smaller architecture that requires fewer
computations than the LSTM. Moreover, the GRU-based
model requires smaller number of trainable parameters,
memory, and training time comparing to the LSTM-based
models.
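A minimal Keras sketch of the GRU-FCN idea discussed on these two slides: a GRU branch and a fully convolutional branch process the same input in parallel and their features are concatenated before the softmax. Layer sizes follow common LSTM-FCN/GRU-FCN configurations and the dimension-shuffle trick of the original LSTM-FCN is omitted, so this does not reproduce the papers exactly.

```python
from tensorflow.keras import layers, models

def build_gru_fcn(n_timesteps, n_classes, gru_units=8):
    """GRU-FCN: GRU branch + FCN branch (Conv-BN-ReLU x3 with global pooling)."""
    inp = layers.Input(shape=(n_timesteps, 1))

    # recurrent branch (GRU replaces the LSTM of LSTM-FCN)
    g = layers.GRU(gru_units)(inp)
    g = layers.Dropout(0.8)(g)

    # fully convolutional branch
    c = inp
    for filters, kernel in [(128, 8), (256, 5), (128, 3)]:
        c = layers.Conv1D(filters, kernel, padding="same")(c)
        c = layers.BatchNormalization()(c)
        c = layers.Activation("relu")(c)
    c = layers.GlobalAveragePooling1D()(c)

    out = layers.Dense(n_classes, activation="softmax")(layers.concatenate([g, c]))
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```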
Application for multivariate time series: Wearable sensors
WearableDL: Wearable Internet-of-Things and Deep Learning for Big Data Analytics—Concept, Literature, and Future
Aras R. Dargazany, Paolo Stegagno, and Kunal Mankodiya
(Submitted on 14 November 2018)
https://doi.org/10.1155/2018/8125126
This work introduces Wearable deep learning (WearableDL) that is a
unifying conceptual architecture inspired by the human nervous
system, offering the convergence of deep learning (DL), Internet-of-Things (IoT), and wearable technologies (WT).
Application for multivariate time series: Action Recognition
Sensor Data Acquisition and Multimodal Sensor Fusion for Human Activity Recognition Using Deep Learning
Published: 10 April 2019
(This article belongs to the Special Issue Deep Learning Based Sensing Technologies for Autonomous Vehicles)
Sensors2019, 19(7), 1716; https://doi.org/10.3390/s19071716
We develop a Long Short-Term Memory (LSTM) network framework to support
training of a deep learning model on human activity data, which is acquired in
both real-world and controlled environments. From the experiment results, we
identify that activity data with sampling rate as low as 10 Hz from four sensors at
both sides of wrists, right ankle, and waist is sufficient in recognizing Activities of
Daily Living (ADLs) including eating and driving activity. We adopt a two-level
ensemble model to combine class-probabilities of multiple sensor modalities,
and demonstrate that a classifier-level sensor fusion technique can improve the
classification performance. By analyzing the accuracy of each sensor on
different types of activity, we elaborate custom weights for multimodal
sensor fusion that reflect the characteristic of individual activities.
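A sketch of the classifier-level fusion described here: each sensor modality produces class probabilities from its own model, and a per-activity weight matrix (reflecting how reliable each sensor is for each activity) combines them. The weights below are placeholders; the paper derives them from its per-sensor accuracy analysis.

```python
import numpy as np

def fuse_class_probabilities(per_sensor_probs, sensor_weights):
    """Weighted classifier-level fusion of multimodal predictions.

    per_sensor_probs : array (n_sensors, n_samples, n_classes) of softmax outputs
    sensor_weights   : array (n_sensors, n_classes), e.g. derived from each
                       sensor's per-activity accuracy (illustrative assumption)
    Returns fused class predictions of shape (n_samples,).
    """
    weighted = per_sensor_probs * sensor_weights[:, None, :]   # broadcast over samples
    fused = weighted.sum(axis=0)                               # sum over sensors
    return fused.argmax(axis=1)
```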
Ensembling models for uni/multivariate time series as well
Deep Neural Network Ensembles for Time Series Classification
Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar and Pierre-Alain Muller
IRIMAS, Université Haute-Alsace, Mulhouse, France
https://arxiv.org/abs/1903.06602
In the future, we would like to consider a meta-learning approach where the output logistics of individual deep learning models are fed to a meta-network that learns to map these inputs to the correct prediction (e.g. Ju et al. 2019; 2018).
Segmentation Literature Review
Segmenting time series
BEATS: Blocks of Eigenvalues Algorithm for Time Series Segmentation
https://doi.org/10.1109/TKDE.2018.2817229 (2018)
https://github.com/auroragonzalez/BEATS implemented in R
The massive collection of data via emerging technologies like
the Internet of Things (IoT) requires finding optimal ways to reduce the observations in the time series analysis domain.
In this paper, we propose a segmentation algorithm that adapts
to unannounced mutations of the data (i.e., data drifts).
The algorithm splits the data streams into blocks and groups
them in square matrices, computes the Discrete Cosine Transform (DCT), and quantizes them.
The algorithm, called BEATS, is designed to tackle dynamic
IoT streams, whose distribution changes over time. We
implement experiments with six datasets combining real,
synthetic, real-world data, and data with drifts. Compared to
other segmentation methods like Symbolic Aggregate
approXimation (SAX), BEATS shows significant improvements.
Trying it with classification and clustering algorithms
it provides efficient results. BEATS is an effective mechanism to
work with dynamic and multi-variate data, making it suitable for IoT data sources.
By using BEATS, we are able to restructure the streaming data in a 2D way and then transform it into the frequency
domain using DCT. The algorithm finds a smaller sequence that contains the key information of the initial representative.
This aggregation provides an opportunity to eliminate repetitive content and similarities that can be found in the sequence
of data. The eigenvalue vectors are a homogeneous representation of the data streams in BEATS that allow us to
go one step further in understanding of the sequences and patterns that can be considered as the data structure of a data
series in an application domain (e.g. smart cities). Its applications can be extended to several other domains and various
patterns/activity monitoring and detection methods. The future work will focus on applying 3D cosine transform and
adaptive blocksize estimation.
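As an orientation aid, a numpy/scipy sketch of a BEATS-style feature per segment: group the stream into square blocks, apply a 2D DCT, quantize, and keep the eigenvalues of the quantized matrix. The block size, quantization rule and output format are illustrative assumptions rather than the authors' exact design (their R implementation is linked above).

```python
import numpy as np
from scipy.fft import dctn

def beats_like_features(stream, block=8, q_step=10.0):
    """Compact eigenvalue-based representation of consecutive block x block segments."""
    stream = np.asarray(stream, dtype=float)
    seg_len = block * block
    n_segments = len(stream) // seg_len
    features = []
    for i in range(n_segments):
        mat = stream[i * seg_len:(i + 1) * seg_len].reshape(block, block)
        coeffs = dctn(mat, norm="ortho")          # 2D Discrete Cosine Transform
        quantized = np.round(coeffs / q_step)     # coarse uniform quantization
        eigvals = np.linalg.eigvals(quantized)    # eigenvalues of the quantized block
        features.append(np.sort(np.abs(eigvals))[::-1])
    return np.array(features)                     # (n_segments, block)
```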
Regression/Forecasting Literature Review
Financial Forecasting with Deep Learning #1
Conditional Time Series Forecasting with Convolutional Neural Networks
Anastasia Borovykh, Sander Bohte, Cornelis W. Oosterlee
https://arxiv.org/abs/1703.04691 (2017)
We present a method for conditional time series forecasting
based on an adaptation of the recent deep convolutional
WaveNet architecture. The proposed network contains stacks of
dilated convolutions that allow it to access a broad range of
history when forecasting, a ReLU activation function and
conditioning is performed by applying multiple convolutional filters
in parallel to separate time series which allows for the fast
processing of data and the exploitation of the correlation
structure between the multivariate time series.
We show that a convolutional network is well-suited for
regression-type problems and is able to effectively learn
dependencies in and between the series without the need for long
historical time series, is a time-efficient and easy to implement
alternative to recurrent-type networks and tends to outperform
linear and recurrent models.
Effectively, we use multiple financial time series as input in a neural network, thus conditioning the forecast of a time series on both its own history as well as that of multiple other time series. Training a model on multiple stock series allows the network to exploit the correlation structure between these series so that the network can learn the market dynamics in shorter sequences of data.
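A minimal Keras sketch of the dilated, causal convolution stack at the heart of such a WaveNet-style forecaster, with the conditioning series simply supplied as extra input channels; the gating, residual and skip connections of the full architecture are omitted for brevity, and layer sizes are illustrative.

```python
from tensorflow.keras import layers, models

def build_dilated_forecaster(window, n_series, filters=32, n_layers=6):
    """Stack of causal Conv1D layers with exponentially growing dilation (1, 2, 4, ...).

    The receptive field grows as 2**n_layers, so long histories are seen without RNNs.
    """
    inp = layers.Input(shape=(window, n_series))
    x = inp
    for i in range(n_layers):
        x = layers.Conv1D(filters, kernel_size=2, dilation_rate=2 ** i,
                          padding="causal", activation="relu")(x)
    out = layers.Conv1D(1, kernel_size=1)(x)      # one-step-ahead value per time step
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mae")
    return model
```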
While on the relatively short time series the prediction time is negligible when compared to the training time, for longer time series the prediction of the autoregressive model may be sped up by implementing a recent variation that exploits the memorization structure of the network, or by speeding up the convolutions by working in the frequency domain employing Fourier transforms. Finally, it is well-known that correlations between data points are stronger on an intraday basis. Therefore, it might be interesting to test the model on intraday data to see if the ability of the model to learn long-term dependencies is even more valuable in that case.
Financial Forecasting with Deep Learning #2
Autoregressive Convolutional Neural Networks for Asynchronous Time Series
Mikołaj Bińkowski, Gautier Marti, Philippe Donnat
(Submitted on 12 Mar 2017 (v1), last revised 12 Jun 2018 (this version, v4))
https://arxiv.org/abs/1703.04122
https://github.com/mbinkowski/nntimeseries
We propose Significance-Offset Convolutional Neural
Network, a deep convolutional network architecture for
regressionofmultivariateasynchronoustimeseries.
Conclusion and discussion: In this article, we proposed a weighting mechanism that, coupled with convolutional networks, forms a new neural network architecture for time series prediction. The proposed architecture is designed for regression tasks on asynchronous signals in the presence of a high amount of noise. This approach has proved to be successful in forecasting several asynchronous time series, outperforming popular convolutional and recurrent networks.
The proposed model can be extended further by adding intermediate weighting layers of the same type in the network structure. Another possible generalization that requires further empirical studies can be obtained by leaving the assumption of independent offset values for each past observation, i.e. considering not only 1x1 convolutional kernels in the offset sub-network.
Finally, we aim at testing the performance of the proposed architecture on other real-life datasets with relevant characteristics. We observe that there exists a strong need for a common 'econometric' datasets benchmark and, more generally, for time series (stochastic processes) regression.
Financial Forecasting with Deep Learning #3
Multi-task Learning for Financial Forecasting
Tao Ma, Guolin Ke (27 Sep 2018)
https://arxiv.org/abs/1809.10336
Due to the strong connections among stocks,
the information valuable for forecasting is not only
included in individual stocks, but also included in
the stocks related to them. However, most previous
works focus on one single stock, which easily
ignore the valuable information in others. To
leverage more information, in this paper, we
propose a jointly forecasting approach to
process multiple time series of related stocks
simultaneously, using multi-task learning
framework (Ruder 2017).
Durichen et al. (2015) used multi-task Gaussian
processes to process physiological time series.
Jung (2015) proposed a multi-task learning
approach to learn the conditional independence
structure of stationary time series. Liu et al. (2016)
used multi-task multi-view learning to predict urban
water quality. Harutyunyan et al. (2017) used
recurrent LSTM neural networks and multi-task
learning to deal with clinical time series. And Li et al.
(2018) applied multi-task representation learning to
travel time estimation. Moreover, some methods
are proposed to learn the shared representation of
all the task-private information, e.g., Misra et al.
(2016) proposed cross-stitch networks to combine
multiple task-private latent features.
In future work, we would like to further improve SPA's ability to combine latent features. For DMTL, we would like to build hierarchical models to extract the shared information from all tasks more efficiently.
The contributions of this paper are multifold:
● To the best of our knowledge, the proposed multi-series jointly forecasting approach is the first work applying multi-task learning to time series forecasting for multiple related stocks.
● We propose a novel attention method to learn the optimized combination of shared and task-private latent features based on the idea of CAPM.
● We demonstrate in experiments on financial data that the proposed approach outperforms single-task baselines and other MTL based methods, which further improves the forecasting performance.
Financial Forecasting with Deep Learning #4
Multi-task Learning for Financial Forecasting
Tao Ma, Guolin Ke (27 Sep 2018)
https://arxiv.org/abs/1809.10336
In this paper, we empirically study the applicability of the latest deep structures with respect to the volatility modelling problem, through which we aim to provide an empirical guidance for the theoretical analysis of the marriage between deep learning techniques and financial applications in the future.
We examine both the traditional approaches and the
deep sequential models on the task of volatility
prediction, including the most recent variants of
convolutional and recurrent networks, such as the
dilated architecture.
Accordingly, experiments with real-world stock
price datasets are performed on a set of 1314 daily
stock series for 2018 days of transaction. The
evaluation and comparison are based on the negative log likelihood (NLL) of real-world stock price time series.
The result shows that the dilated neural models,
including dilated CNN and Dilated RNN, produce
most accurate estimation and prediction,
outperforming various widely-used deterministic
models in the GARCHfamily and several recently
proposed stochastic models. In addition, the high
flexibility and rich expressive power are validated in this
study.
Trading with Deep Learning #1a
DeepLOB: Deep Convolutional Neural Networks for Limit Order Books
Zihao Zhang, Stefan Zohren, Stephen Roberts (2018)
https://arxiv.org/abs/1808.03668
We develop a large-scale deep learning
model to predict price movements from limit
order book (LOB) data of cash equities. The
architecture utilises convolutional filters to
capture the spatial structure of the limit order
books as well as LSTM modules to capture
longer time dependencies.
Importantly, our model translates well to
instruments which were not part of the training set,
indicating the model’s ability to extract universal
features. In order to better understand these
features and to go beyond a “black box” model, we
perform a sensitivity analysis to understand the
rationale behind the model predictions and reveal
the components of LOBs that are most
relevant. The ability to extract robust features
which translate well to other instruments is an
important property of our model which has many
other applications.
We use standardisation (z-score) to normalise our
data, and use the mean and standard deviation of
the previous day’s data to normalise the current
day’s data (separate normalisation for each
instrument):
Because financial data is highly stochastic, if we simply compare p_t and p_{t+k} to decide the price movement, the resulting label set will be noisy. We adopt the idea in Tsantekidis et al. (2017) to introduce a smoothed labelling method.
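A sketch of the two preprocessing steps quoted above: per-instrument z-score normalisation of today's features with yesterday's mean and standard deviation, and a smoothed three-class labelling in the spirit of Tsantekidis et al. (2017), which compares the mean of the next k mid-prices with the current price against a threshold. The horizon k and threshold alpha are illustrative, not the paper's values.

```python
import numpy as np

def zscore_with_previous_day(today, yesterday):
    """Normalise today's features using yesterday's statistics (separately per instrument)."""
    mu, sigma = yesterday.mean(axis=0), yesterday.std(axis=0) + 1e-12
    return (today - mu) / sigma

def smoothed_labels(mid_price, k=10, alpha=0.002):
    """Three-class labels: mean of the next k mid-prices vs. the current price."""
    labels = np.full(len(mid_price), 1, dtype=int)            # 1 = stationary
    for t in range(len(mid_price) - k):
        m_plus = mid_price[t + 1:t + 1 + k].mean()
        change = (m_plus - mid_price[t]) / mid_price[t]
        if change > alpha:
            labels[t] = 2                                      # up
        elif change < -alpha:
            labels[t] = 0                                      # down
    return labels
```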
Trading with Deep Learning #1b
DeepLOB: Deep Convolutional Neural Networks for Limit Order Books
Zihao Zhang, Stefan Zohren, Stephen Roberts (2018)
https://arxiv.org/abs/1808.03668
To observe what convolutional layers do, we feed a single input to the trained model and plot
the intermediate outputs on the right of Figure 5. Since 16 filters are applied, we get 16
series after the “Conv” block. The convolution operations transform the original time-series
into signals that indicate time regions that have great impacts on final outputs. In our case, we
observe strong signals around t = 1, 20, 40, 70 time stamps, suggesting information at
these time stamps decide the final outputs.
In our case, we use LIME [Ribeiro et al. 2016] to reveal components of LOBs that are most important for predictions and to understand why the proposed model DeepLOB works better than the Ref model [Tsantekidis et al. (2017)]. LIME uses an interpretable model to approximate the prediction of a complex model on a given input. It locally perturbs the input and observes variations in the model's predictions, thus providing some measure of information regarding input importance and sensitivity.
Trading with Deep Learning #2
Developing Arbitrage Strategy in High-frequency Pairs Trading with Filterbank CNN Algorithm
Yu-Ying Chen; Wei-Lun Chen; Szu-Hao Huang (2018)
https://doi.org/10.1109/AGENTS.2018.8459920
This paper proposed a novel intelligent high-
frequency pairs trading system in Taiwan Stock
Index Futures (TX) and Mini Index Futures
(MTX) market based on deep learning
techniques.
This research utilized the improved time
series visualization method to transfer
historical volatilities with different time frames
into 2D images which are helpful in capturing
arbitrage signals.
Moreover, this research improved convolutional
neural networks (CNN) model by combining
the financial domain knowledge and
filterbank mechanism. We proposed
Filterbank CNN to extract high-quality features
by replacing the random-generating filters with
the arbitrage knowledge filters.
Algorithmic financial trading with deep convolutional neural networks: Time series to image conversion approach
Omer Berat Sezer and Ahmet Murat Ozbayoglu (2018)
https://doi.org/10.1016/j.asoc.2018.04.024
For future work, we will use more
Exchange-Traded Fund (ETFs)
and stocks in order to create
more data for the deep learning
models.
We will also analyze the
correlations between selected
indicators in order to create more
meaningful images so that the
learning models can better
associate the Buy–Sell–Hold
signals and come up with more
profitable trading models.
Trading with Deep Learning: GANs
Generative Adversarial Networks for Financial Trading Strategies Fine-Tuning and Combination
Adriano Koshiyama, Nick Firoozye, and Philip Treleaven
(Jan 2019) https://arxiv.org/abs/1901.01751
Systematic trading strategies are
algorithmic procedures that allocate
assets aiming to optimize a certain
performance criterion. To obtain an
edge in a highly competitive
environment, the analyst needs to
proper finetune its strategy, or discover
how to combine weak signals in
novel alpha creating manners.
Both aspects, namely fine-tuning and
combination, have been extensively
researched using several methods, but
emerging techniques such as
Generative Adversarial Networks can
have an impact into such aspects.
Therefore, our work proposes the use
of Conditional Generative
Adversarial Networks (cGANs) for
trading strategies calibration and
aggregation.
Stock Market Prediction on High-Frequency Data Using Generative Adversarial Nets
Xingyu Zhou et al. 2018
https://doi.org/10.1155/2018/4907423
In this paper, we propose a generic framework employing Long Short-Term Memory (LSTM) and convolutional neural network (CNN) for adversarial training to forecast the high-frequency stock market. This model takes the publicly available index provided by trading software as input to avoid complex financial theory research and difficult technical analysis, which provides convenience for the ordinary trader of non-financial specialty.
Based on the deep learning network, this model achieves prediction ability superior to other benchmark methods by means of adversarial training, minimizing direction prediction loss, and forecast error loss. Moreover, the effects of the model update cycles on the predictive capability are analyzed, and the experimental results show that a smaller model update cycle can obtain better prediction performance. In the future, we will attempt to integrate predictive models under multiscale conditions.
Glucose Prediction CNN-RNN Hybrid
Kezhi Li, John Daniels, Chengyuan Liu, Pau Herrero, Pantelis Georgiou
Department of Electronic and Electrical Engineering, Imperial College London
https://arxiv.org/abs/1807.03043
Current digital therapeutic approaches for subjects with Type 1 diabetes mellitus (T1DM) such as the artificial pancreas and
insulin bolus calculators leverage machine learning techniques for predicting subcutaneous glucose for improved
control.
In this work, we present a deep learning model that is capable of predicting glucose levels over a 30-minute horizon.
The prediction algorithm is implemented on an Android mobile phone (LG Nexus5 with Processor:2.26GHz quad-core,
RAM:2GB, 8-bit integer) , with an execution time of 6ms on a phone compared to an execution time of 780ms,on a laptop
(MacProwith Processor:3.1GHz Intel Core i5, RAM:8GB, 32-bit fp) inPython.
Given that learning is solely based on historical data, unexpected predictions may occur, given that correlations learned in the data may not imply causation. Thus hybrid approaches, whereby the deep learning model is used to make an accurate prediction and rules of meal/bolus supported by a physiological model avoid apparent errors that might result, are attractive. Based on the CRNN approach proposed in this paper, it is possible to develop the hybrid method, which may have the advantages of both conventional and DL algorithms.
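A minimal Keras sketch of a CRNN of the kind described here: causal 1D convolutions over the recent CGM history feed a recurrent layer, and a dense head regresses the glucose value 30 minutes ahead. Layer sizes and the input definition are illustrative, not the authors' exact configuration.

```python
from tensorflow.keras import layers, models

def build_crnn(history_len=24, n_channels=1):
    """Conv1D front end + LSTM + dense head for 30-min-ahead glucose regression."""
    inp = layers.Input(shape=(history_len, n_channels))   # e.g. 2 h of 5-min CGM samples
    x = layers.Conv1D(32, 4, padding="causal", activation="relu")(inp)
    x = layers.Conv1D(64, 4, padding="causal", activation="relu")(x)
    x = layers.LSTM(64)(x)
    out = layers.Dense(1)(x)                               # predicted glucose at t + 30 min
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model
```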
Survival models Literature Review
Clinical Survival Models: Cancer Survival
A Simple Discrete-Time Survival Model for Neural Networks
Michael F. Gensheimer and Balasubramanian Narasimhan
Stanford University (May 2018)
https://github.com/MGensheimer/nnet-survival Keras
It is recommended to use at least ten time intervals to avoid bias in the survival estimates [17]. Using narrow time intervals also helps avoid inaccurate parameter estimates if the effect of the input data varies rapidly with follow-up time (time-varying coefficients, in the language of survival analysis). In most of our experiments we have used 20-50 time intervals. We suggest choosing the cut-points so that around the same number of survival events fall into each time interval, which will help ensure reliable estimates for all time intervals.
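That cut-point rule can be implemented directly with event-time quantiles; the sketch below is a plain numpy illustration of it, with function and argument names of our own choosing rather than from the nnet-survival repository.

```python
import numpy as np

def choose_cutpoints(event_times, observed, n_intervals=25):
    """Pick interval boundaries so roughly equal numbers of observed events fall in each.

    event_times : array of follow-up times
    observed    : boolean array, True where the event occurred (not censored)
    """
    event_times = np.asarray(event_times, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    # quantiles of the *event* (non-censored) times define the interval edges
    qs = np.linspace(0.0, 1.0, n_intervals + 1)[1:-1]
    cuts = np.quantile(event_times[observed], qs)
    return np.concatenate([[0.0], cuts, [event_times.max()]])
```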
While the model has several advantages and we think it will be useful for a broad range of applications, it does have some drawbacks. The discretization of follow-up time results in a less smooth predicted survival curve compared to a parametric survival model such as a Weibull accelerated failure time model.
As long as a sufficient number of time intervals is used, this is not a large practical concern. Unlike a parametric survival model, the model does not provide survival predictions past the end of the last time interval, so it is recommended to extend the last interval past the last follow-up time of interest.
The advantages of parametric survival models and our discrete-time survival model could be combined in the future using a flexible parametric model, such as the cubic spline-based model of Royston and Parmar (2002), implemented in the flexsurv R package.
Complex non-proportional hazards models (see Katzman et al. 2018 for a proportional deep learning model) can be created in this way, and likely could be implemented in deep learning packages.
Clinical Survival Models: Sequential DL "recurrent"
Deep Recurrent Survival Analysis
Kan Ren, Jiarui Qin, Lei Zheng, Zhengyu Yang, Weinan Zhang, Lin Qiu, Yong Yu
Shanghai Jiao Tong University (Sept 2018) https://arxiv.org/abs/1809.02403
Recent advances of modern technology make redundant data collection available for time-to-event information, which facilitates observing and tracking the event of interest. However, due to different reasons, many events lose tracking during the observation period, which makes the data censored.
We only know that the true time to the occurrence of the event is larger or smaller than, or within, the observation time, which has been defined as survivorship bias categorized into right-censored, left-censored and interval-censored respectively (Lee and Wang 2003). Survival analysis, a.k.a. time-to-event analysis (Lee et al. 2018; DeepHit), is a typical statistical methodology for modeling time-to-event data while handling censorship, which is a traditional research problem and has been studied over decades.
Our model proposes a novel modeling view for survival analysis, which aims at
flexibly modeling the survival probability function rather than making any
assumptions about the distribution form. Specifically, DRSA creatively predicts
the conditional probability of the event at each time given that the event has not
occurred before, and combines them through the probability chain rule for
estimating both the probability density function and the cumulative distribution
function of the event over time, eventually forecasting the survival rate at
each time, which is more reasonable and mathematically efficient for survival
analysis. Through these modeling methods, our DRSA model can capture
the sequential patterns embedded in the feature space along the
time axis, and output more effective distributions for each individual sample at a
fine-grained level.
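The chain-rule bookkeeping behind DRSA is easy to make concrete. Assuming a recurrent model has already produced per-step conditional hazards h_t = P(event at t | no event before t), the survival, density and distribution functions follow directly (hypothetical numpy sketch, not the paper's code):

import numpy as np

h = np.array([0.02, 0.05, 0.10, 0.20, 0.30])       # hypothetical conditional hazards per time step

survival = np.cumprod(1.0 - h)                      # S(t) = prod_{s<=t} (1 - h_s)
pdf = h * np.concatenate(([1.0], survival[:-1]))    # p(t) = h_t * S(t-1)
cdf = 1.0 - survival                                # F(t) = 1 - S(t)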
Clinical Survival Models: Cardiac Motion Analysis
Deep learning cardiac motion analysis for human survival prediction
Ghalib A. Bello, Timothy J.W. Dawes, Jinming Duan, Carlo Biffi, Antonio de Marvao, Luke S.G.E. Howard, J. Simon R. Gibbs,
Martin R. Wilkins, Stuart A. Cook, Daniel Rueckert, Declan P. O'Regan (Submitted on 8 Oct 2018)
Imperial College London, National Heart Centre Singapore, Singapore, and Duke-NUS Graduate Medical School, Singapore
https://arxiv.org/abs/1810.03382
https://github.com/UK-Digital-Heart-Project/4Dsurvival
Making predictions about future events from the current state of a moving three-
dimensional (3D) scene depends on learning correspondences between patterns of
motion and subsequent outcomes. Such relationships are important in biological
systems which exhibit complex spatio-temporal behaviour in response to stimuli or
as a consequence of disease processes. Here we use recent advances in machine
learning for visual processing tasks to develop a generalisable approach for modelling
time-to-event outcomes from time-resolved 3D sensory input. We tested this on
the challenging task of predicting survival due to heart disease through analysis of
cardiac imaging.
The traditional paradigm of epidemiological research is to draw insight from large-scale clinical
studies through linear regression modelling of conventional explanatory variables, but this approach
does not embrace the dynamic physiological complexity of heart disease. Even objective quantification of
heart function by conventional analysis of cardiac imaging relies on crude measures of global contraction that
are only moderately reproducible and insensitive to the underlying disturbances of cardiovascular physiology.
While conventional autoencoders are used for unsupervised learning tasks, we extend recent proposals for
supervised autoencoders in which the learned representations are both reconstructive and
discriminative. We achieved this by adding a prediction branch to the network with a loss function for
survival inspired by the Cox proportional hazards model. A hybrid loss function, optimising the trade-
off between survival prediction and accurate input reconstruction, is calibrated during training. The
compressed representations of 3D motion predict survival more accurately than a composite measure
of conventional manually-derived parameters measured on the same images.
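A hedged sketch of the kind of hybrid objective described above: a weighted sum of an autoencoder reconstruction term and a Cox partial-likelihood term computed on a latent risk score. The weighting, tensor shapes and the requirement that the batch is pre-sorted by descending survival time are assumptions of this sketch, not the 4Dsurvival code:

import tensorflow as tf

def cox_partial_nll(risk, event):
    # risk, event: (batch,) tensors, sorted by descending survival time (ties ignored)
    log_risk_set = tf.math.log(tf.cumsum(tf.exp(risk)))   # log sum of exp(risk) over each risk set
    return -tf.reduce_sum((risk - log_risk_set) * event)

def hybrid_loss(x, x_recon, risk, event, alpha=0.7):
    recon = tf.reduce_mean(tf.square(x - x_recon))        # motion reconstruction term
    surv = cox_partial_nll(risk, event)                   # survival prediction term
    return alpha * recon + (1.0 - alpha) * surv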
Sequential Time-Series: Literature Review
Representation Learning for Sequences
Unified recurrent neural network for many feature types
Alexander Stec, Diego Klabjan, Jean Utke
(Submitted on 24 Sep 2018)
https://arxiv.org/abs/1809.08717
“There are time series that are amenable to recurrent neural
network (RNN) solutions when treated as sequences, but some
series, e.g. asynchronous time series, provide a richer
variation of feature types than current RNN cells take into
account.
In order to address such situations, we introduce a unified RNN that
handles five different feature types, each in a different manner.
Our RNN framework separates sequential features into two
groups dependent on their frequency, which we call sparse and
dense features, and which affect cell updates differently.
Further, we also incorporate time features at the sequential
level that relate to the time between specified events in the
sequence and are used to modify the cell's memory state. We also
include two types of static (whole sequence level) features, one
related to time and one not, which are combined with the encoder
output.“
For future work, it would be interesting to incorporate even more
feature types than the five covered in this work. One in particular is a
feature type that gives time information looking forward in the sequence.
All features in this work use time information related to past events, but
there are cases that can benefit from the utility of incorporating
future knowledge when available. One example of this is the time to
the prediction from the current time step, so the network can have direct
knowledge of its absolute time location in the sequence.
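Not the authors' cell, but a minimal sketch (under assumed shapes and an assumed exponential decay) of two of the ingredients: letting the elapsed time between events decay the LSTM memory state, and fusing static, whole-sequence features with the encoder output:

import tensorflow as tf

units = 64
cell = tf.keras.layers.LSTMCell(units)

def encode_with_time_decay(x_seq, dt_seq, static_feats):
    # x_seq: (batch, steps, feats) sequential event features
    # dt_seq: (batch, steps, 1) elapsed time since the previous event
    # static_feats: (batch, n_static) whole-sequence-level features
    batch, steps = x_seq.shape[0], x_seq.shape[1]
    h = tf.zeros((batch, units))
    c = tf.zeros((batch, units))
    out = h
    for t in range(steps):
        c = c * tf.exp(-tf.nn.relu(dt_seq[:, t, :]))   # long gaps between events decay the memory
        out, (h, c) = cell(x_seq[:, t, :], states=[h, c])
    return tf.concat([out, static_feats], axis=-1)     # static features joined to the encoder output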
In Medical Diagnostics: Sequence ≈ Patient Visits
ShortFuse: Biomedical Time Series
Representations in the Presence of
Structured Information
Madalina Fiterau, Suvrat Bhooshan, Jason Fries, Charles
Bournhonesque, Jennifer Hicks, Eni Halilaj, Christopher Ré, Scott Delp
(revised 16 May 2017) Stanford University
https://arxiv.org/abs/1705.04790 - Cited by 5
“In healthcare applications, temporal variables that
encode movement, health status and longitudinal patient
evolution are often accompanied by rich structured
information such as demographics, diagnostics and
medical exam data (constant along the temporal domain).
However, current methods do not jointly optimize over
structured covariates and time series in the feature extraction
process.
We present ShortFuse, a method that boosts the accuracy of
deep learning models for time series by explicitly
modeling temporal interactions and dependencies
with structured covariates.
ShortFuse introduces hybrid convolutional and LSTM
cells that incorporate the covariates via weights that are
shared across the temporal domain.“
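ShortFuse's hybrid cells are specific to the paper, but the general pattern of conditioning a temporal model on static covariates can be sketched simply, e.g. by broadcasting the covariates along the time axis and concatenating them to every step of a 1D CNN (an illustrative approximation with made-up layer sizes, not the paper's hybrid cells):

import tensorflow as tf

def covariate_conditioned_cnn(n_steps, n_channels, n_covariates, n_classes):
    signal = tf.keras.Input((n_steps, n_channels))          # temporal variables
    covs = tf.keras.Input((n_covariates,))                  # demographics, diagnostics, exam data
    covs_rep = tf.keras.layers.RepeatVector(n_steps)(covs)  # repeat static covariates along time
    x = tf.keras.layers.Concatenate(axis=-1)([signal, covs_rep])
    x = tf.keras.layers.Conv1D(32, kernel_size=7, padding="same", activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    out = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model([signal, covs], out)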
Sequences /+→ Network science (Graph inference)
Referral paths in the U.S. physician network
Chuankai An, A. James O'Malley, Daniel N. Rockmore
(December 2018)
https://doi.org/10.1007/s41109-018-0081-4
For a patient, a “referral path” (“patient journey”) records the
chronological sequence of physicians encountered by a patient
(subject to certain constraints on the times between encounters). It
provides a basic unit of analysis in a broader referral network that
encodes the flow of patients and information between
physicians in a healthcare system. We consider referral networks
defined over a range of interactions as well as the characteristics of
referral paths, producing a characterization of the various networks
as well as the physicians they comprise.
In this paper we study the more fine-scale patterns to be found in the
consideration of the referral paths and importantly link these
statistics to treatment outcomes in the particular setting of
cardiovascular disease. Referral paths and referral information
generally have been ignored as a factor in the important problem of
treatment outcome prediction.
An example referral path with three physicians A, B, C. The patient visits them five
times. Physicians A and C are from the same HRR/hospital (in blue), while physician B is from
another HRR/hospital (in red).
Visualization of a hospital (PHN) referral network with 30 physicians and 101 directed
edges in 2011. Red, yellow and light blue nodes represent physicians with positive, zero and
negative net patient flow (NPF), respectively. Targets of referrals are marked with shadow on
directed edges.
Handling Small Data
Small Data for deep learning
Small Sample Learning in Big Data Era
https://arxiv.org/abs/1808.04572
Jun Shu, Zongben Xu, Deyu Meng
(last revised 22 Aug 2018)
As a promising area in artificial intelligence, a new learning paradigm,
called Small Sample Learning (SSL), has been attracting
prominent research attention in recent years. In this paper, we aim
to present a survey to comprehensively introduce the current
techniques proposed on this topic. Specifically, current SSL
techniques can be mainly divided into two categories.
The first category of SSL approaches can be called "concept
learning", which emphasizes learning new concepts from only few
related observations. The purpose is mainly to simulate human
learning behaviors like recognition, generation, imagination, synthesis
and analysis. The second category is called "experience
learning", which usually co-exists with the large sample learning
manner of conventional machine learning. This category mainly
focuses on learning with insufficient samples, and can also be called
small data learning in some literature.
More extensive surveys on both categories of SSL techniques are
introduced, and some neuroscience evidence is provided to
clarify the rationality of the entire SSL regime and the relationship
with the human learning process. Some discussions on the main
challenges and possible future research directions along this line are
also presented.
The Fast and the Flexible: training neural
networks to learn to follow instructions from
small data
https://arxiv.org/abs/1809.06194
Rezka Leonandya, Elia Bruni, Dieuwke Hupkes,
Germán Kruszewski (Submitted on 17 Sep 2018)
Learning to follow human instructions is a challenging
task because while interpreting instructions requires
discovering arbitrary algorithms, humans typically
provide very few examples to learn from.
For learning from this data to be possible, strong
inductive biases are necessary. Work in the past has
relied on hand-coded components or manually
engineered features to provide such biases. In contrast,
here we seek to establish whether this knowledge can
be acquired automatically by a neural network system
through a two-phase training procedure: a (slow)
offline learning stage where the network learns about
the general structure of the task and a (fast) online
adaptation phase where the network learns the
language of a new given speaker.
Data augmentation for time series
T-CGAN: Conditional Generative Adversarial Network for Data
Augmentation in Noisy Time Series with Irregular Sampling
https://arxiv.org/abs/1811.08295
Giorgia Ramponi, Pavlos Protopapas, Marco Brambilla and Ryan Janssen (20
Nov 2018)
In this paper we propose a data augmentation method for time series with irregular
sampling, the Time-Conditional Generative Adversarial Network (T-CGAN).
Our approach is based on Conditional Generative Adversarial Networks (CGAN),
where the generative step is implemented by a deconvolutional NN and the
discriminative step by a convolutional NN. Both the generator and the discriminator are
conditioned on the sampling timestamps, to learn the hidden relationship between
data and timestamps, and consequently to generate new time series.
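For contrast with the GAN-based approach, the simple transformation-based augmentations that are often used as baselines for 1D biosignals can be written in a few lines of numpy (a generic sketch, unrelated to the T-CGAN code; parameter values are illustrative):

import numpy as np

def jitter(x, sigma=0.03):
    # additive Gaussian noise; x: (timesteps, channels)
    return x + np.random.normal(0.0, sigma, size=x.shape)

def scale(x, sigma=0.1):
    # multiply each channel by a random factor close to 1
    return x * np.random.normal(1.0, sigma, size=(1, x.shape[-1]))

def window_slice(x, ratio=0.9):
    # crop a random contiguous window and stretch it back to the original length
    n = x.shape[0]
    win = int(n * ratio)
    start = np.random.randint(0, n - win)
    idx = np.linspace(start, start + win - 1, num=n)
    return np.stack([np.interp(idx, np.arange(n), x[:, c]) for c in range(x.shape[-1])], axis=-1)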
Data augmentation from invariance modelling #1
Data Augmentation of Room Classifiers using Generative
Adversarial Networks
Constantinos Papayiannis, Christine Evers, Patrick A. Naylor
https://arxiv.org/abs/1901.03257 (Jan 2019)
Data augmentation from invariance modelling #2
Sinusoidal wave generating network based on adversarial
learning and its application: synthesizing frog sounds for data
augmentation
Sangwook Park, David K. Han, and Hanseok Ko
https://arxiv.org/abs/1901.02050 (Jan 2019)
Graphical comparisons of time-domain waveforms and spectrograms and quantitative comparisons using the inception score clearly showed that the synthetic data
closely resembles the target signal. Overall, it was demonstrated that the proposed approach of data augmentation by direct generation of synthetic audio
streams improved the CNN-based classification rate and its training efficiency when both the real and the synthetic data were used to train the classifier. These
results demonstrate that the proposed network generates an arbitrary signal that is composed of sinusoidal waveforms and can be used for training a deep network.
Transfer Learning with Time Series #1
Data augmentation using synthetic data for time series
classification with deep residual networks
Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane
Idoumghar, Pierre-Alain Muller (Submitted on 7 Aug 2018)
https://arxiv.org/abs/1808.02455
https://github.com/hfawaz/aaltd18
Unlike in image recognition problems, data augmentation techniques
have not yet been investigated thoroughly for the TSC task. This
is surprising as the accuracy of deep learning models for TSC could
potentially be improved, especially for small datasets that exhibit
overfitting, when a data augmentation method is adopted. In this paper,
we fill this gap by investigating the application of a recently proposed
data augmentation technique based on the Dynamic Time Warping
distance, for a deep learning model for TSC.
The data augmentation method is mainly based on a weighted form of the Dynamic Time
Warping (DTW) Barycentric Averaging (DBA) technique [Petitjean et al. 2016]. The latter
algorithm averages a set of time series in a DTW-induced space and, by leveraging a weighted
version of DBA, the method can thus create an infinite number of new time series from
a given set of time series by simply varying these weights. Three techniques were proposed
to select these weights, from which we chose only one in our approach for the sake of
simplicity, although we consider evaluating other techniques in our future work. The weighting
method is called Average Selected, which consists of selecting a subset of close time
series and filling their bounding boxes.
We did not test the effect of imbalanced classes in the training set and how it could
affect the model's generalization capabilities. Note that imbalanced time series classification is
a recent active area of research that merits an empirical study of its own [Geng et al. 2018]. At
last, we should add that the number of generated time series in our framework was chosen to
be equal to double the amount of time series in the most represented class (which is a
hyper-parameter of our approach that we aim to further investigate in our future work).
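A hedged sketch of the weighted-DBA idea, here with random convex weights rather than the paper's Average Selected scheme, and assuming tslearn's dtw_barycenter_averaging accepts per-series weights as documented:

import numpy as np
from tslearn.barycenters import dtw_barycenter_averaging

def weighted_dba_augment(series_set, n_new=10):
    # series_set: array of same-class time series, shape (n_series, length, channels)
    synthetic = []
    for _ in range(n_new):
        w = np.random.dirichlet(np.ones(len(series_set)))              # random convex weights
        synthetic.append(dtw_barycenter_averaging(series_set, weights=w))
    return np.stack(synthetic)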
Transfer Learning with Time Series #2
Physiological-signal-based mental workload estimation via
transfer dynamical autoencoders in a deep learning framework
Neurocomputing, available online 11 April 2019
https://arxiv.org/abs/1808.02455
In this study, we propose a new transfer dynamical autoencoder (TDAE)
to capture the dynamical properties of electroencephalograph (EEG) features
and the individual differences. The TDAE consists of three consecutively-
connected modules, which are termed the feature filter, abstraction filter, and
transferred MW classifier. The feature and abstraction filters introduce a
dynamical deep network to abstract the EEG features across adjacent time
steps into salient MW indicators. The transferred MW classifier exploits large-volume
EEG data from a source-domain EEG database recorded under emotional
stimuli to improve the model training stability.
The main limitation of the proposed TDAE deep learning framework for MW recognition lies in two
aspects. The computational cost for training the entire network is significantly higher than for classical
shallow and deep classifiers. It leads to a high time cost in selecting optimal hyper-parameters
of the model. Therefore, we employed the same value of the feature filter order to reduce the
computational burden. However, there is no doubt that the filter order should be feature-specific. Moreover,
there exists a prerequisite for knowledge transferring across two mental-task domains. That is, we
need to select exactly the same EEG channels for data preprocessing, and this leads to a
possibility that useful MW indicators are excluded. In future work, we will further investigate the deep
learning methods for MW assessment on these two aspects.
Active Learning with Time Series
Robust Active Learning for Electrocardiographic Signal
Classification
Xu Chen, Saratendu Sethi (Submitted on 21 Nov 2018)
https://arxiv.org/abs/1811.08919
Motivated by the fact that ECG data are usually heavily unbalanced
among different classes and the class labels are noisy as they are
manually labeled, this paper proposes a novel solution based on robust
active learning for addressing these challenges. The key idea is to first
apply clustering of the data in a low-dimensional embedded
space and then select the most informative instances within local
clusters. By selecting the most informative instances relying on local
average minimal distances, the algorithm tends to select the data for
labeling in a more diversified way.
The first stage of the RALS algorithm relies on label spreading. The label spreading
algorithm is a well-known graph-based semi-supervised learning algorithm. It
calculates the similarity measure and propagates the labels by the measure for
prediction. It also generates the label distribution matrix which consists of the
predicted probability for every class for each sample. In order to select the data from
different classes, here t-Distributed Stochastic Neighbor Embedding (t-SNE) is
applied to the label distribution matrix due to its good performance for high
dimensional datasets.
A novel noisy label reduction relying on an effective confidence score measure is
proposed, based on the criterion of best vs second best (BSVB), to enhance the
active learning performance. Typically, for each selected data sample after ranking,
the ratio of the largest estimated class probability to the second largest
estimated class probability is calculated, where this information can be retrieved
from the label distribution matrix. Subsequently, the ratio is compared to the user-set
threshold. The selected data are added into the labeled set if the ratio is larger than
the threshold.
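The BSVB confidence filter itself is a one-liner over the label distribution matrix (illustrative numpy sketch; the threshold value is a user choice):

import numpy as np

def bsvb_filter(label_distribution, threshold=2.0):
    # label_distribution: (n_samples, n_classes) predicted class probabilities
    sorted_p = np.sort(label_distribution, axis=1)
    ratio = sorted_p[:, -1] / (sorted_p[:, -2] + 1e-12)   # best vs second-best probability
    return ratio > threshold                              # mask of samples trusted enough to add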
Therefore, by adding the estimated labels passed from the noise reduction step into
the labeled dataset, the noisy labels in the selection are significantly reduced. The
new augmented labeled dataset, after adding the selected data samples, is applied
to the label spreading algorithm again to learn the next enhanced model.
Interpreting Time Series
Visualizing Audio Processing
Interpretable Convolutional Filters with SincNet
https://arxiv.org/abs/1811.08633
https://github.com/mravanelli/SincNet/
https://github.com/mravanelli/pytorch-kaldi/
Mirco Ravanelli and Yoshua Bengio
(NIPS 2018)
This paper summarizes our recent efforts to develop a more interpretable
neural model for directly processing speech from the raw waveform.
In particular, we propose SincNet, a novel Convolutional Neural Network
(CNN) that encourages the first layer to discover more meaningful filters by
exploiting parametrized sinc functions.
In contrast to standard CNNs, which learn all the elements of each filter, only
the low and high cutoff frequencies of band-pass filters are directly
learned from data. This inductive bias offers a very compact way to derive
a customized filter-bank front-end, that only depends on some parameters
with a clear physical meaning. Our experiments, conducted on both
speaker and speech recognition, show that the proposed architecture
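The parametrized band-pass filters are straightforward to reproduce: the ideal band-pass impulse response is the difference of two sinc low-pass responses, windowed to reduce ripple, and in SincNet only the two cutoffs would be learned (a numpy sketch with fixed cutoffs for illustration):

import numpy as np

def sinc_bandpass(f1, f2, kernel_size=251, fs=16000):
    # f1, f2: low/high cutoff frequencies in Hz (the only parameters SincNet would learn)
    t = (np.arange(kernel_size) - (kernel_size - 1) / 2) / fs
    low = 2 * f1 * np.sinc(2 * f1 * t)                   # np.sinc(x) = sin(pi x) / (pi x)
    high = 2 * f2 * np.sinc(2 * f2 * t)
    band = (high - low) * np.hamming(kernel_size)        # window to reduce spectral ripple
    return band / np.max(np.abs(band))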
Spatiotemporal activations
Compensated Integrated Gradients to Reliably
Interpret EEG Classification
https://arxiv.org/abs/1811.08633
Kazuki Tachikawa, Yuji Kawai, Jihoon Park, Minoru Asada
Machine Learning for Health (ML4H) Workshop at NeurIPS 2018.
Integrated gradients are widely employed to evaluate the contribution
of input features in classification models because it satisfies the
axioms for attribution of prediction. This method, however, requires an
appropriate baseline for reliable determination of the contributions.
We propose a compensated integrated gradients method that
does not require a baseline. In fact, the method compensates the
attributions calculated by integrated gradients at an arbitrary baseline
using Shapley sampling.
The classifier constraints decrease the classification accuracy of the temporal CNN. In contrast,
spatiotemporal CNNs exhibit higher classification accuracy but lower interpretation reliability
than the temporal CNNs. Therefore, classifier selection should depend on whether reliability or
classification accuracy is emphasized.
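For reference, plain integrated gradients for a 1D/EEG classifier look as below; the paper's contribution is precisely to remove the dependence on the baseline input via Shapley sampling, which this sketch does not implement (TensorFlow sketch with assumed input shapes):

import tensorflow as tf

def integrated_gradients(model, x, baseline, target_class, steps=50):
    # x, baseline: (timesteps, channels) input segment and reference input
    x = tf.convert_to_tensor(x, tf.float32)
    baseline = tf.convert_to_tensor(baseline, tf.float32)
    alphas = tf.linspace(0.0, 1.0, steps + 1)[:, None, None]
    path = baseline[None] + alphas * (x - baseline)[None]       # straight path from baseline to x
    with tf.GradientTape() as tape:
        tape.watch(path)
        probs = model(path)[:, target_class]
    grads = tape.gradient(probs, path)
    avg_grads = tf.reduce_mean((grads[:-1] + grads[1:]) / 2.0, axis=0)   # trapezoidal rule
    return (x - baseline) * avg_grads                           # per-sample, per-channel attributions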
Visualization and interpretation: Sleep Staging
Albert Vilamala, Kristoffer H. Madsen, Lars K. Hansen (Submitted on 2 Oct 2017)
https://arxiv.org/abs/1710.00633

More Related Content

PDF
Optical Designs for Fundus Cameras
PDF
OCT Monte Carlo & Deep Learning
PDF
Shallow introduction for Deep Learning Retinal Image Analysis
PDF
Multimodal RGB-D+RF-based sensing for human movement analysis
PDF
Beyond Broken Stick Modeling: R Tutorial for interpretable multivariate analysis
PDF
Advanced Retinal Imaging
PDF
Multispectral Purkinje Imaging
PDF
Geometric Deep Learning
Optical Designs for Fundus Cameras
OCT Monte Carlo & Deep Learning
Shallow introduction for Deep Learning Retinal Image Analysis
Multimodal RGB-D+RF-based sensing for human movement analysis
Beyond Broken Stick Modeling: R Tutorial for interpretable multivariate analysis
Advanced Retinal Imaging
Multispectral Purkinje Imaging
Geometric Deep Learning

What's hot (15)

PDF
Purkinje imaging for crystalline lens density measurement
PDF
Image Restoration for 3D Computer Vision
PDF
Instrumentation for in vivo intravital microscopy
PDF
Practical Considerations in the design of Embedded Ophthalmic Devices
PDF
Portable Multispectral Fundus Camera
PDF
Time-resolved biomedical sensing through scattering medium
PDF
Design of lighting systems for animal experiments
PDF
Hyperspectral Retinal Imaging
PDF
Labeling fundus images for classification models
PDF
Pupillometry Through the Eyelids
PDF
Data-driven Ophthalmology
PDF
Lighting design for Startup Offices
PDF
Short intro for retinal biomarkers of Alzheimer’s Disease
PDF
Future of Retinal Diagnostics
PDF
Smartphone-powered Ophthalmic Diagnostics
Purkinje imaging for crystalline lens density measurement
Image Restoration for 3D Computer Vision
Instrumentation for in vivo intravital microscopy
Practical Considerations in the design of Embedded Ophthalmic Devices
Portable Multispectral Fundus Camera
Time-resolved biomedical sensing through scattering medium
Design of lighting systems for animal experiments
Hyperspectral Retinal Imaging
Labeling fundus images for classification models
Pupillometry Through the Eyelids
Data-driven Ophthalmology
Lighting design for Startup Offices
Short intro for retinal biomarkers of Alzheimer’s Disease
Future of Retinal Diagnostics
Smartphone-powered Ophthalmic Diagnostics
Ad

Similar to Deep Learning for Biomedical Unstructured Time Series (20)

PDF
A Survey on Deep Learning for time series Forecasting
PDF
Quantitative and Qualitative Analysis of Time-Series Classification using Dee...
PDF
Combination of Similarity Measures for Time Series Classification using Genet...
PDF
Accurate time series classification using shapelets
PDF
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
PDF
Time series analysis : Refresher and Innovations
PDF
Kz2418571860
PDF
Lecture9_Time_Series_2024_and_data_analysis (1).pdf
PDF
Forecasting time series powerful and simple
PDF
Una introducción a la minería de series temporales
PDF
RDataMining slides-time-series-analysis
PPTX
Presentation On Time Series Analysis in Mechine Learning
PPTX
Gaussian Processes and Time Series.pptx
PPTX
time_series and the forecastring age of RNNS.pptx
PDF
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
PPTX
wasim 1
PDF
2204.01637.pdf
PPTX
Gde time series_modeling
PDF
R data mining-Time Series Analysis with R
PDF
A Survey on Deep Learning for time series Forecasting
Quantitative and Qualitative Analysis of Time-Series Classification using Dee...
Combination of Similarity Measures for Time Series Classification using Genet...
Accurate time series classification using shapelets
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
Time series analysis : Refresher and Innovations
Kz2418571860
Lecture9_Time_Series_2024_and_data_analysis (1).pdf
Forecasting time series powerful and simple
Una introducción a la minería de series temporales
RDataMining slides-time-series-analysis
Presentation On Time Series Analysis in Mechine Learning
Gaussian Processes and Time Series.pptx
time_series and the forecastring age of RNNS.pptx
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
wasim 1
2204.01637.pdf
Gde time series_modeling
R data mining-Time Series Analysis with R
Ad

More from PetteriTeikariPhD (16)

PDF
ML and Signal Processing for Lung Sounds
PDF
Next Gen Ophthalmic Imaging for Neurodegenerative Diseases and Oculomics
PDF
Next Gen Computational Ophthalmic Imaging for Neurodegenerative Diseases and ...
PDF
Wearable Continuous Acoustic Lung Sensing
PDF
Precision Medicine for personalized treatment of asthma
PDF
Two-Photon Microscopy Vasculature Segmentation
PDF
Skin temperature as a proxy for core body temperature (CBT) and circadian phase
PDF
Summary of "Precision strength training: The future of strength training with...
PDF
Precision strength training: The future of strength training with data-driven...
PDF
Intracerebral Hemorrhage (ICH): Understanding the CT imaging features
PDF
Hand Pose Tracking for Clinical Applications
PDF
Precision Physiotherapy & Sports Training: Part 1
PDF
Creativity as Science: What designers can learn from science and technology
PDF
Light Treatment Glasses
PDF
Efficient Data Labelling for Ocular Imaging
PDF
Dashboards for Business Intelligence
ML and Signal Processing for Lung Sounds
Next Gen Ophthalmic Imaging for Neurodegenerative Diseases and Oculomics
Next Gen Computational Ophthalmic Imaging for Neurodegenerative Diseases and ...
Wearable Continuous Acoustic Lung Sensing
Precision Medicine for personalized treatment of asthma
Two-Photon Microscopy Vasculature Segmentation
Skin temperature as a proxy for core body temperature (CBT) and circadian phase
Summary of "Precision strength training: The future of strength training with...
Precision strength training: The future of strength training with data-driven...
Intracerebral Hemorrhage (ICH): Understanding the CT imaging features
Hand Pose Tracking for Clinical Applications
Precision Physiotherapy & Sports Training: Part 1
Creativity as Science: What designers can learn from science and technology
Light Treatment Glasses
Efficient Data Labelling for Ocular Imaging
Dashboards for Business Intelligence

Recently uploaded (20)

PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
KodekX | Application Modernization Development
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPT
Teaching material agriculture food technology
PDF
Electronic commerce courselecture one. Pdf
PPTX
Big Data Technologies - Introduction.pptx
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
MYSQL Presentation for SQL database connectivity
PDF
cuic standard and advanced reporting.pdf
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
KodekX | Application Modernization Development
The AUB Centre for AI in Media Proposal.docx
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
20250228 LYD VKU AI Blended-Learning.pptx
Unlocking AI with Model Context Protocol (MCP)
Review of recent advances in non-invasive hemoglobin estimation
Reach Out and Touch Someone: Haptics and Empathic Computing
Teaching material agriculture food technology
Electronic commerce courselecture one. Pdf
Big Data Technologies - Introduction.pptx
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
MYSQL Presentation for SQL database connectivity
cuic standard and advanced reporting.pdf
Understanding_Digital_Forensics_Presentation.pptx
Digital-Transformation-Roadmap-for-Companies.pptx
Dropbox Q2 2025 Financial Results & Investor Presentation
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Diabetes mellitus diagnosis method based random forest with bat algorithm
Building Integrated photovoltaic BIPV_UPV.pdf

Deep Learning for Biomedical Unstructured Time Series

  • 1. Deep Learning for Biomedical Unstructured Time-series 1D Convolutional neural networks (CNNs) for time series analysis, and inspiration from beyond biomedical field Petteri Teikari, PhD Singapore Eye Research Institute (SERI) Visual Neurosciences group http://petteri-teikari.com/ Version “Wed 17 April 2019“
  • 3. TimeSeries Basics Regular time seriesvs. irregular timeseries https://mediatum.ub.tum.de/doc/1444158/78684.pdf UnstructuredBiomedical1DTimeSeries Time-Frequencyvisualization https://doi.org/10.3389/fnhum.2016.00605 Timeserieswithdiscrete“states” Sleepstagesinferredfromunivariateormultivariate(multipleEEGelectrodelocations,), multimodal(EEGwithECG/EMG,etc.)dense1Dtimeseries Manytypesof groundtruths possiblealsofor1Dtime series Segmentation,classification,regression https://arxiv.org/abs/1801.05394
  • 4. TimeSeries Stationarity Non-stationaritiessignificantly distort short-term spectral, symbolicand entropyheartrate variabilityindicesNovember 2011PhysiologicalMeasurement 32(11):1775-86 DOI: 10.1088/0967-3334/32/11/S05 Testsof Stationarity https://stats.stackexchange.com/questions/182764/stationarity-test s-in-r-checking-mean-variance-and-covariance Stationarity of order 2 For everyday use we often consider time series that have (instead of strictstationarity):https://people.maths.bris.ac.uk/~magpn/Research/LSTS/TOS.html ● aconstantmean ● aconstantvariance ● anautocovariancethatdoesnotdependontime. Suchtimeseriesareknownas second-orderstationary or stationaryoforder2. Examples of non-stationary processes are random walk with or without a drift (a slow steady change) and deterministic trends (trends that are constant, positive or negative, independent of time for the whole life of the series).https://www.investopedia.com/articles/trading/07/stationary.asp
  • 6. Representation vsSimilarity https://arxiv.org/abs/1704.00794: “Time series analysis approaches can be broadly categorized into two families: (i) representation methods, which provide high-level features for representing properties of the time series at hand, and (ii) similarity measures, which yield a meaningful similarity between different time series for further analysis.“ Classic representation methods are for instance Fourier transforms, wavelets, singular value decomposition, symbolic aggregate approximation, andpiecewiseaggregateapproximation. Time series may also be represented through the parameters of model-based methods such as Gaussian mixture models (GMM), Markov models and hidden Markov models (HMMs), time series bitmaps andvariantsofARIMA. An advantage with parametric models is that they can be naturally extended to the multivariate case. For detailed overviews on representation methods, we refer the interested reader to e.g. Wangetal.(2013). https://arxiv.org/abs/1704.00794: “Similarity-based approaches, once defined, such similarities between pairs of time series may be utilized in a wide range of applications, such as classification, clustering, and anomaly detection. Time series similarity measures include for example dynamic time warping (DTW, the longest common subsequence (LCSS), the extended Frobenius norm (Eros), and the Edit Distance with Real sequences (EDR), and representstate-of-the-artperformanceinunivariatetimeseries(UTS)prediction. Attempts have been made to design kernels from non-metric distances such as DTW, of which the global alignment kernel (GAK) is an example. There are also promising works on deriving kernels from parametric models, such as the probability product kernel, Fisher kernel, andreservoir basedkernels.Commontoallthese methodsishowever a strongdependence onacorrecthyperparametertuning,whichisdifficulttoobtaininanunsupervisedsetting. Moreover, many of these methods cannot naturally be extended to deal with multivariate time series (MTS), as they only capture the similarities between individual attributes and do not modelthe dependenciesbetweenmultiple attributes.Equallyimportant,thesemethodsare not designed to handle missing data, an important limitation in many existing scenarios, such as clinical data where MTS originating from Electronic Health Records (EHRs) often contain missingdata In this work, we propose a surgical site infection detection framework for patients undergoing colorectal cancer surgery that is completely unsupervised, hence alleviating the problem of getting access to labelled training data. The framework is based on powerful kernels for multivariate time series that account for missing data when computing similarities. https://arxiv.org/abs/1803.07879
  • 7. Analysis withSimilarityMeasures TimeSeriesClusterKernelforLearningSimilaritiesbetweenMultivariateTimeSerieswithMissingData KarlØyvindMikalsen,FilippoMariaBianchi,CristinaSoguero-Ruiz,RobertJenssen(lastrevised29Jun2017) https://arxiv.org/abs/1704.00794|https://github.com/kmi010/Time-series-cluster-kernel-TCK-(TheTCKwasimplementedinRandMatlab) Similarity-based approaches represent a promising direction for time series analysis. However, many such methods rely on parameter tuning, and some have shortcomings if the time series are multivariate (MTS), due to dependencies between attributes, or the time series containmissingdata. In this paper, we address these challenges within the powerful context of kernel methods by proposing the robust time series cluster kernel (TCK). The approach taken leverages the missing data handling properties of Gaussian mixture models (GMM) augmented with informative prior distributions. An ensemble learning approach is exploited to ensure robustness to parameters by combining the clustering results of many GMM to formthefinalkernel. The experimental results demonstrated that the TCK (1) is robust to hyperparameter settings, (2) is competitive to established methods on prediction tasks without missing data and (3) is better than established methods on prediction tasks with missing data. In future works we plan to investigate whether the use of more general covariance structures in the GMM, or the use of HMMs as base probabilistic models, could improve TCK.
  • 8. Wavelets Shapelets→ Shapelets ”1DGabors”#1 Fast classification of univariate and multivariate time seriesthrough shapelet discovery https://doi.org/10.1007/s10115-015-0905-9 Josif Grabocka, MartinWistuba, Lars Schmidt-Thieme A Shapelet Selection Algorithm forTime Series Classification: New Directions https://doi.org/10.1016/j.procs.2018.03.025 The high timecomplexityof shapelet selection processhindersitsapplication in real timedataprocession. Toovercome this, inthispaper we proposeafast shapelet selection algorithm (FSS), which sharply reducesthe time consumption ofshapeletselection. https://slideplayer.com/slide/8370683/ Forexample,aclassof abnormalECG measurementmaybe characterised by an unusualpatternthat onlyoccurs occasionallyatany point during the measurement.Shapelets aresubseriesthatcapture thistypeofcharacteristic. Theyallowforthe detection ofphase- independentlocalised similaritybetween series within thesameclass. Thegreattimeseriesclassificationbakeoff:areviewandexperimental evaluationof recentalgorithmicadvances Anthony Bagnall, Jason Lines, Aaron Bostrom,James Large, Eamonn Keoghs (May2017) https://doi.org/10.1007/s10618-016-0483-9 | https://bitbucket.org/TonyBagnall/time-series-classification
  • 9. Wavelets Shapelets→ Shapelets ”1DGabors”#2 Afastshapelet selectionalgorithmfortime series classification https://doi.org/10.1016/j.comnet.2018.11.031 Thetrainingtime ofshapelet based algorithmsishigh, eventhough itis computed off-line, and the authorsaim tomake it moreefficient Shapelet transformation algorithms have attracted a great deal of attention in the last decade. However, the timecomplexity of the shapelet selectionprocess in shapelet transformation algorithms is too high. To accelerate the shapelet selection process with noreductioninaccuracy,wepresentedFSSforST. The experimental results demonstrate that our proposed FSS was thousands of timesfasterthantheoriginalshapelettransformation methodwithnoreduction in accuracy. Our results also demonstrate that our method was the fastest method among shapeletmethodsthathavetheleadinglevelofaccuracy.
  • 10. RepresentationLearning with deeplearning #1 TowardsaUniversalNeuralNetworkEncoderforTime Series Joan Serrà,SantiagoPascual,AlexandrosKaratzoglou(Submitted on 10May 2018)https://arxiv.org/abs/1805.03908 We have studied the use of a universal encoder for time series in the specific case of classifying an out-of-sample data set of an unseen data type. We have considered the cases of no-adaptation,mappingadaptation,andfulladaptation. In all cases we achieve performances that are competitive with the state-of-the-art that, in addition, involve a compact reusable representation and few training iterations. We have also studied the effect of the representation dimensionality, showing that small representations have an impact to no-adaptation and mapping adaptation approaches,butnotmuch tofulladaptation ones. In the future, we plan to refine the encoder architecture, as well as optimizing some of the parameters we empirically use in our experiments. A very interesting direction for future research is the adoption of one-shot learning schemas (Snelletal.2017; Sutskeveretal.2014), which we find very suitable for the current setting in time series classification problems. A further option to enhance the performance of a universal encoder is data augmentation, specially considering recent linear instance/class interpolation approaches ( Zhangetal.2018). In order to have sufficient knowledge to accomplish any task, and in order to be applicable in the absence of labeled data or even without adaptation/re-training, researchers have been increasingly adopting the generic concept of universal encoders, specially within the text processing domain (note that related concepts also existinother domains). The basic idea is to train a model (the encoder) that learns a common representation which is useful for a variety of tasks and that, at the same time, can be reused for novel tasks with minimal or no adaptation. While it would seem that classical autoencoders and other unsupervised models should perfectly fit this purpose, recent research in sentence encoding shows that, with current means, encoders learnt with a sufficiently large set of supervised tasks, or mixing supervised and unsupervised data, consistentlyoutperformtheirpurelyunsupervisedcounterparts.
  • 11. RepresentationLearning with deeplearning #2 OneDeepMusicRepresentationtoRuleThem All? Acomparativeanalysisofdifferentrepresentationlearning strategies JaehunKim,JulianUrbano,CynthiaC. S.Liem,AlanHanjalic (Submittedon13Feb2018) https://arxiv.org/abs/1802.04051 Ourworkwilladdressthefollowing researchquestions: –RQ1:Givenasetofcommonlearningtasksthatcanbeusedtotrain anetwork,whatistheinfluenceofthenumberandtypeofthetaskson theeffectivenessofthelearneddeeprepresentation? –RQ2:Howdovariousdegreesofinformationsharinginthedeep architectureaffecttheultimatesuccessofalearneddeep representation? –RQ3:Whatisthebestwaytoassesstheeffectivenessofadeep representation? Simplified illustration of the conceptual difference between traditional deep transfer learning (DTL) based on a single learning task (above) and multi-task based deep transfer learning (MTDTL) (below). The same color used for a learning and an unseen task indicates that the tasks have commonalities, which implies that the learned representation is likely to be informative for the unseen task. At the same time, this representation may not be that informative to another unseen task, leading to a low transfer learning performance. The hypothesis behind MTDTL is that relying on more learning tasksincreasesrobustness of thelearned representationand itsusabilityfor abroadersetof unseen tasks.
  • 12. RepresentationLearning with deeplearning #3 LearningFiner-classNetworksforUniversal Representations https://arxiv.org/abs/1810.02126 https://arxiv.org/abs/1712.09708 JulienGirard,YoussefTamaazousti,HervéLeBorgne,Céline Hudelot(Submittedon4 Oct2018) Many real-world visual recognition use-cases can not directly benefit from state-of-the-art CNN-based approaches because of the lack of many annotated data. The usual approach to deal with this is to transfer a representation pre-learned on a large annotated source-task onto a target- task of interest. This raises the question of how well the original representation is "universal", that is to say directly adapted to many different target-tasks. To improve such universality, the state-of-the-art consists in training networks on a diversified source problem, that is modified either by adding generic or specific categories to the initial set of categories. We propose two methods to improve universality, but pay special attention to limit the need of annotated data. We also propose a unified framework of the methods based on the diversifying of the training problem. Finally, to better match Atkinson's cognitive study about universal human representations, we proposed to rely on the transfer-learningschemeas wellasa new metric toevaluateuniversality. We show thatourmethod learnsmore universal representationsthan state- of-the-art, leading to significantly better results on 10 target-tasks from multiple domains, using several network architectures, either alone or combinedwithnetworkslearnedat acoarsersemantic level.
  • 13. RepresentationLearning with deeplearning #4 ImprovingClinicalPredictionsthroughUnsupervised TimeSeriesRepresentationLearning https://arxiv.org/abs/1812.00490 XinruiLyu,MatthiasHüser,StephanieL.Hyland,GeorgeZerveas, Gunnar Rätsch(Submittedon2Dec2018) MachineLearningforHealth(ML4H)Workshop atNeurIPS2018. We empirically showed that in scenarios where labeled medical time series data is scarce, training classifiers on unsupervised representations provides performance gains over end-to-end supervised learning using raw input signals, thus making effective use of information available in a separate, unlabeled training set. The proposed model, explored for the first time in the context of unsupervised patient representation learning, produces representations with the highest performance in future signal prediction and clinical outcome prediction, exceeding several baselines. The idea behind applying attention mechanisms to time series forecasting is to enable the decoder to preferentially “attend” to specific parts of the input sequence during decoding. This allows for particularly relevant events (e.g. drastic changes in heart rate),tocontributemoretothegenerationofdifferentpointsintheoutputsequence.
  • 14. RepresentationLearning with deeplearning #5 UnsupervisedScalableRepresentationLearningforMultivariate TimeSeries https://arxiv.org/abs/1901.10738 https://github.com/White-Link/UnsupervisedScalableRepresentationLearni ngTimeSeries (PyTorch) Jean-YvesFranceschi,AymericDieuleveut,MartinJaggi (Submittedon30Jan2019) Hence, we propose in the following an unsupervised method to learn general-purpose representations for multivariate time series that comply with the issues of varying and potentially high lengths of the studied time series. To this end, we adaptrecognized deep learningtools and introduce a novel unsupervised loss. Our representations are computed by a deep convolutional neuralnetworkwithdilatedconvolutions(i.e.TCNs). This network is then trained unsupervised, using the first specifically designed triplet loss in the literature of time series, taking advantage of the encoder resilience to time seriesofunequallengths. We leave as future work the applicability of our method to other tasks like forecasting, and the study of its impact if it weretobeaddedinpowerful ensemblemethods.
  • 15. RepresentationLearning with deeplearning #6 Unsupervised speech representation learning using WaveNet autoencoder https://arxiv.org/abs/1812.00490 Jan Chorowski, Ron J. Weiss,Samy Bengio, Aaron van den Oord(Submitted on 25 Jan 2019) We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. The behavior of autoencoder models depends on the kind of constraintthatis applied tothelatentrepresentation. Our best models used MFCCs (mel-frequency cepstral coefficient) as the encoder input, but reconstructed raw waveforms at the decoder output. We used standard 13 MFCC features extracted every 10ms (i.e., at a rate of 100 Hz) and augmented with their temporal first and second derivatives. Such features were originally designed for speech recognition and are mostly invariant to pitch and similarconfoundingdetail in theaudiosignal. T
  • 16. RepresentationLearning with deeplearning #7 ATaleof Two Time Series Methods:Representation Learningfor Improved Distance and RiskMetrics https://dspace.mit.edu/bitstream/handle/1721.1/119575/1076 345253-MIT.pdf DivyaShanmugam (June2018) Architecture of the proposed model. A single convolutional layer extracts local features from the input, which a strided maxpool layer reduces to a fixed-size vector. A fully connected layer with ReLU activation carries out further, nonlinear dimensionality reduction to yield the embedding. A softmax layer is added at training time. We introduce the multiple instance learning paradigm to risk stratification. Risk stratification models aim to identify patients at high risk for a given outcome so that doctors may intervene, with the attempt of avoiding that outcome. Machine learning has led to improved risk stratification models for a number of outcomes, including stroke, cancer and treatment resistance [55]. To the best of our knowledge, this is the first application of multiple instance learning to risk stratification. The extension of Jiffy to multi-label classification and unsupervised learning poses a challenging but necessary task. The availability of unlabeled time series data eclipses the availability of its annotated counterpart. Thus, a simple network-based method for representation learning on multivariate timeseries inthe absence oflabels isan important line of work. There is also potential to further increase Jiffy’s speed by replacing the fully connected layer with a structured [Bojarskietal.2016] or binarized[Rastegariet al.2016] matrix. The proposed risk stratification model extends naturally to a range of adverse outcomes. The model is not limited to operating on ECG signals - it is worth exploring whether the multiple instance learning approach may be successful in other modalities of medical data, including voice. On a theoretical level, strong generalization guarantees for distinguishing bags with relative witnessratesdonotexistand are worth exploring asthese modelsare appliedintherealworld.
  • 17. Intro tomethods#1a Highlycomparative time-series analysis: theempirical structure of time series and their methods http://doi.org/10.1098/rsif.2013.0048 Ben D. Fulcher, Max A. Little, Nick S. Jones
  • 18. Intro tomethods#1b Highlycomparative time-series analysis: theempirical structure of time series and their methods http://doi.org/10.1098/rsif.2013.0048 Ben D. Fulcher, Max A. Little, Nick S. Jones Structure inalibrary of8651time-seriesanalysisoperations. (a) A summaryof thefourmainclassesof operationsin ourlibrary,asdetermined by a k-medoidsclustering,reflectsacrudebutintuitiveoverviewof thetime-series analysisliterature.(b)A network representation of theoperationsinour library thataremostsimilarto theapproximateentropy algorithm, ApEn(2,0.2)[7], which wereretrieved fromourlibraryautomatically.Each nodein thenetwork representsanoperationand linksencodedistancesbetweenthem(computed using a normalized mutual information-based distancemetric, cf.electronic supplementary material,§S1.3.1).Annotated scatterplotsshowtheoutputsof ApEn(2,0.2)(horizontal axis)againsta representativememberof each shaded community (indicated bya heavily outlined node, vertical axis). Similar pictures can beproduced by targeting anygivenoperationin our library, thereby connecting differenttime-seriesanalysismethodsthatneverthelessdisplay similar behaviour acrossempiricaltimeseries. Key scientific questions that can be addressed by representing time series by their properties (measured by many types of analysis methods) and operations by their behaviour (across many types of time-series data). We show that this representation facilitates a range of versatile techniquesfor addressingscientific time-seriesanalysisproblems, which are illustrated schematicallyin thisfigure. The representations of time series (rows of the data matrix, figure 1a) and operations (columns of the data matrix, figure 1b) serve as empirical fingerprints, and are shown in the top panel. Coloured borders are used to label different classes of time series and operations, and other figures in this paper that explicitly demonstrate each technique are given in the bottom right-hand corner of each panel. (a) Time-seriesdatasetscan be organized automatically, revealingthe structure in agiven dataset (cf. figures4a,b and 5a). (b)Collectionsof scientific methods can be organized automatically, highlighting relationships between methods developed in different fields (cf. figures 3a and 5b). (c) Real-world and model-generated datawith similar propertiesto aspecific time-seriestarget can be identified (cf. figure 4c,d). (d)Given aspecific operation, alternativesfrom acrossscience can be retrieved (cf. figure 3b). (e)Regression:the behaviour of operations in our library can be compared to find operations that vary with a target characteristic assigned to time series in a dataset (cf. figure 5d). (f) Classification: operations can be selected based on their classification performance to build useful classifiers and gain insights into the differencesbetween classesof labelled time-series datasets(cf. figure 5e).
  • 19. Intro tomethods#1c Highlycomparative time-series analysis: theempirical structure of time series and their methods http://doi.org/10.1098/rsif.2013.0048 Ben D. Fulcher, Max A. Little, Nick S. Jones Highlycomparativetechniquesfortime- seriesanalysistasks.Wedrawonourfull library oftime-seriesanalysismethodsto: (a) structure datasetsinmeaningfulways, andretrieveandorganizeusefuloperations for (b,e) classificationand(c,d) regression tasks.(a)Fiveclassesof EEG signalsare structuredmeaningfullyinatwo- dimensional principalcomponentsspaceof our libraryof operations.(b)Pairwise linear correlationcoefficientsmeasuredbetween the60mostsuccessful operationsfor classifyingcongestiveheartfailureand normalsinusrhythmRR intervalseries. Clusteringrevealsthatmostoperationsare organizedintooneof threegroups (indicatedbydashedboxes). 
  • 20. Most of the time when people talk about time series and deep learning, most likely they talking of Sequences (e.g. language) instead of unstructuredtime series (e.g. voice waveform)
  • 21. “Sequences” vs“TimeSeries” “DenseTimeSeries”at videoframerate Icehockeyas gamecan be simplifiedto discreteevents (sequences) https://arxiv.org/abs/1808.04063 Notalwayssoblack-white,butinourcasetime-seriesaremainlydense1DBiosignalswithambiguousormissingdiscretestates
  • 22. Time Series RNNsforsequences The Unreasonable Effectivenessof RecurrentNeuralNetworks May21,2015|AndrejKarpathy http://karpathy.github.io/2015/05/21/rnn-effectiveness/ DanQ:ahybridconvolutionaland recurrentdeepneuralnetworkfor quantifyingthefunctionofDNA sequences  Daniel Quang XiaohuiXieNucleic AcidsResearch,Volume44, Issue11,20June2016,Pagese107,  https://doi.org/10.1093/nar/gkw226 DeepLearningforUnderstandingConsumerHistories byTobiasLang- 25Oct2016 https://jobs.zalando.com/tech/blog/deep-learning-for-understanding-consumer-histories/?gh_src=4n3gxh1 Sequences. Depending on your background you mightbewondering:  WhatmakesRecurrentNetworkssospecial?
  • 24. TimeSeries LSTMsApplied DeepAir|UCBerkeleySchoolofInformation https://www.ischool.berkeley.edu/projects/2017/deep-air This project investigates the use of the LSTM recurrent neural network (RNN) as a framework for forecasting in the future, based on time series data of pollution and meteorological information in Beijing. Our results show that the LSTM framework produces equivalent accuracy when predicting future time stamps compared to the baseline support vector regression for a single time stamp. Using our LSTM framework, we can now extend the prediction from a single time stamp out to 5 to 10 hours in the future. Overview of our self-supervised approach for posture and sequence representation learning using CNNLSTM. After the initial training with motion-based detections we retrain our model for enhancingthe learningof therepresentations. https://doi.org/10.1109/CVPR.2017.399 PianoGenie:An IntelligentMusicalInterface Oct15,2018 |https://magenta.tensorflow.org/pianogenie Chris Donahue (  chrisdonahue ,  chrisdonahuey ) ;Ian Simon (  iansimon ,  iansimon ) ;Sander Dieleman (  benanne ,  sedielem ) A bidirectional LSTM encoder maps asequence of piano notestoasequence of controller buttons (shown as 4 in the above figure, 8 in the actual system). A unidirectional LSTM decoder then decodes these controller sequences back into piano performances. After training, the encoder isdiscarded and controller sequencesareprovided byuser input.
  • 25. Time Series RNN/LSTMsareoutdated?#1 ThefallofRNN/ LSTM EugenioCulurciello https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0 Combining multiple neural attention modules, comes the “hierarchical neural attention encoder”… Notice there is a hierarchy of attention modules here, very similar to the hierarchy of neural networks. This is also similar toTemporalconvolutionalnetwork(TCN) → Shapelets AttentionModels,e.g. Pervasive Attention: 2D Convolutional NeuralNetworksforSequence-to- SequencePrediction MahaElbayad,LaurentBesacier,JakobVerbeek (Submittedon11Aug 2018) https://arxiv.org/abs/1808.03867| https://github.com/elbayadm/attn2d
  • 26. Time Series RNN/LSTMsareoutdated?#2 AnEmpiricalEvaluationof GenericConvolutional and RecurrentNetworksforSequence Modeling ShaojieBai,J.ZicoKolter,VladlenKoltun (Revised19Apr2018) https://arxiv.org/abs/1803.01271 |http://github.com/locuslab/TCN For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modelingtasks. The preeminence enjoyed by recurrent networks in sequence modeling may be largely a vestige of history. Until recently, before the introduction of architectural elements such as dilated convolutions and residual connections, convolutional architectures were indeed weaker. Our results indicate that with these elements, a simple convolutional architecture is more effective across diverse sequence modeling tasks than recurrent architectures such as LSTMs. Due to the comparable clarity and simplicity of TCNs, we conclude that convolutional networks should be regarded as a natural starting point and a powerfultoolkit for sequence modeling
  • 27. Time Series RNN/LSTMsareoutdated?#3 Dilated Temporal Fully-Convolutional Networkfor Semantic Segmentation ofMotion CaptureData NoshabaCheema,Somayeh Hosseini, Janis Sprenger, Erik Herrmann,Han Du, Klaus Fischer, PhilippSlusallek (Submittedon 24Jun 2018) https://arxiv.org/abs/1806.09174 Semantic segmentation of motion capture sequences plays a key part in many data-driven motion synthesis frameworks. It is a preprocessing step in which long recordings of motion capture sequences are partitioned into smaller segments. Afterwards, additional methods like statistical modeling can be applied to each group of structurally-similar segments to learn an abstract motion manifold. The segmentation task however often remains a manual task, which increases the effort and costofgeneratinglarge-scalemotiondatabases. We therefore propose an automatic framework for semantic segmentation of motion capture data using a dilated temporal fully-convolutional network. Our model outperforms a state-of-the-art model in action segmentation, as well as three networks for sequence modeling.
• 28. Time Series: Are RNNs/LSTMs outdated? #4
Temporal Convolutional Networks and Dynamic Time Warping can Drastically Improve the Early Prediction of Sepsis. Michael Moor, Max Horn, Bastian Rieck, Damian Roqueiro and Karsten Borgwardt (submitted on 7 Feb 2019) https://arxiv.org/abs/1902.01659 https://osf.io/av5yx/?view_only=a6e3442634b34d53ba6e59c4a956b318
For future work, we aim to extend our analysis to more types of data sources arising from the ICU. Futoma et al. (2017b) already employed a subset of baseline covariates, medication effects, and missingness indicator variables. However, a multitude of feature classes still remain to be explored and properly integrated. For instance, the combination of sequential and non-sequential features has previously been handled by feeding non-sequential data into the sequential model (Futoma et al., 2017a). We hypothesize that this could be handled more efficiently by using a more modular architecture that incorporates both sequential and non-sequential parts. Furthermore, we aim to obtain a better understanding of the time series features utilized by the model. Specifically, we are interested in assessing the interpretability of the learned filters of the MGP-TCN framework and evaluating how much the activity of an individual filter contributes to a prediction. This endeavor is somewhat facilitated by our use of a convolutional architecture. The extraction of short per-channel signals could prove very relevant for supporting diagnoses made by clinical practitioners.
Overview of the model: the raw, irregularly spaced time series are provided to the Multi-task Gaussian Process (MGP) patient by patient. The MGP then draws from a posterior distribution (given the observed data) at evenly spaced grid times (each hour). This grid is then fed into a temporal convolutional network (TCN) which, after a forward pass, returns a loss. Its gradient is then computed by backpropagating through the computational graph including both the TCN and the MGP, and both sets of parameters are learned end-to-end during training. All methods are evaluated using the Area under the Precision–Recall Curve (AUPRC), additionally displaying the (less informative) Area under the Receiver Operating Characteristic (AUROC). The current state-of-the-art method, MGP-RNN, is shown in blue; the two approaches for early detection of sepsis introduced in this paper, MGP-TCN and the DTW-KNN ensemble, are shown in pink and red, respectively. Using three random splits for all measures and methods, the mean (line) and standard deviation (shaded area) are depicted.
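As a rough illustration of the "GP adapter + TCN" pipeline described above, the hedged sketch below regrids one irregularly sampled laboratory channel onto an hourly grid with a Gaussian process before handing it to a 1D CNN/TCN. For brevity it uses an independent single-task GP per channel via scikit-learn rather than the authors' end-to-end multi-task GP; kernel, horizon and the toy values are illustrative assumptions.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def to_hourly_grid(t_obs, y_obs, horizon_h=48):
    # Fit a GP to the sparse observations and read off its posterior on an hourly grid
    gp = GaussianProcessRegressor(kernel=1.0 * RBF(length_scale=4.0) + WhiteKernel(0.1),
                                  normalize_y=True)
    gp.fit(np.asarray(t_obs, dtype=float).reshape(-1, 1), np.asarray(y_obs, dtype=float))
    grid = np.arange(horizon_h, dtype=float).reshape(-1, 1)
    mean, std = gp.predict(grid, return_std=True)  # posterior mean and uncertainty per hour
    return mean, std

# Example: four sparse lactate measurements regridded to 48 hourly samples, ready for a TCN.
hourly_mean, hourly_std = to_hourly_grid([1.5, 7.0, 20.2, 33.8], [1.1, 1.4, 2.3, 3.0])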
• 30. Structuring Clinical Text
Comparative effectiveness of convolutional neural network (CNN) and recurrent neural network (RNN) architectures for radiology text report classification (2018) https://doi.org/10.1016/j.artmed.2018.11.004 Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
This paper explores cutting-edge deep learning methods for information extraction from medical imaging free-text reports at a multi-institutional scale and compares them to the state-of-the-art domain-specific rule-based system, PEFinder, and traditional machine learning methods, SVM and Adaboost. Visualization methods have been developed to identify the impact of input words on the output decision for both deep learning models. Domain Phrase Attention-based Hierarchical Neural Network (DPA-HNN) architecture.
• 31. Clinical Text + Images
Unsupervised Multimodal Representation Learning across Medical Images and Reports (Machine Learning for Health (ML4H) Workshop at NeurIPS 2018) https://arxiv.org/abs/1811.08615 MIT CSAIL
Joint embeddings between medical imaging modalities and associated radiology reports have the potential to offer significant benefits to the clinical community, ranging from cross-domain retrieval to conditional generation of reports to the broader goals of multimodal representation learning. In this work, we establish baseline joint embedding results measured via both local and global retrieval methods on the soon to be released MIMIC-CXR dataset, consisting of both chest X-ray images and the associated radiology reports. We establish baseline results using supervised and unsupervised joint embedding methods along with local (direct pairs) and global (ICD-9 code groupings) retrieval evaluation metrics. Results show a possibility of incorporating more unsupervised data into training for a minimal-effort performance increase. A further study of joint embeddings between these modalities may enable significant applications, such as text/image generation or the incorporation of other EMR modalities.
• 33. EHR Mining: Risk Prediction Model
Risk Prediction on Electronic Health Records with Prior Medical Knowledge (2018) https://doi.org/10.1145/3219819.3220020
We propose a novel and general framework called PRIME for the risk prediction task, which can successfully incorporate discrete prior medical knowledge into all of the state-of-the-art predictive models using a posterior regularization technique. Different from traditional posterior regularization, we do not need to manually set a bound for each piece of prior medical knowledge when modeling the desired distribution of the target disease on patients. Moreover, the proposed PRIME can automatically learn the importance of different prior knowledge with a log-linear model.
The limitation of this work is that the proposed PRIME is only effective for common diseases. For rare and emerging diseases, since there is little medical knowledge about them, it is hard to incorporate any prior knowledge into deep learning predictive models; thus, the proposed PRIME may achieve similar performance to the state-of-the-art baselines. In future work, we will focus on how to improve the predictive performance of risk prediction for rare diseases.
• 35. Intro to cleaning
In the preprocessing component, the main purpose is to clean the data, filter the unusual points and make it suitable as input to the CNN. Besides the normal steps, including timestamp alignment, normalization and missing-data imputation for time series with trend, the most important operations to improve data quality are outlier detection, interpolation and filtering, in particular for clinical data. In clinical glucose time series there are many missing or outlier data points due to errors in calibration and measurement, and/or mistakes in the process of data collection and transmission. Several methods are introduced to handle these scenarios [36]:
● Dimension Reduction Model: the time series can be projected into lower dimensions using linear correlations such as principal component analysis (PCA), and data with large residual errors can be considered outliers.
● Proximity-based Model: the data are assessed by nearest-neighbour analysis, clustering or density, so data instances that are isolated from the majority are considered outliers.
● Probabilistic Stochastic Filters: different filters for the signals, such as Gaussian mixture models optimized using expectation-maximization. In our case the filter can be implemented before the CNN, due to the continuous characteristic of the input glycaemic time series data.
A convolutional neural network for ECG annotation as the basis for classification of cardiac rhythms. Philipp Sodmann et al. 2018 Physiol. Meas., in press https://doi.org/10.1088/1361-6579/aae304
Signal cleaning: in the data preprocessing, we performed resampling and signal denoising. We resampled all ECGs to 300 Hz using the fast Fourier transform in order to pass ECG segments of equal length onto the CNN. To filter noisy components in the signal, such as baseline wandering, respiration effects, or powerline interference, we applied a discrete wavelet transform (DWT), which works as a band-pass filter. For this, we used the Daubechies wavelet (Db4). Before re-composition, each coefficient of the transform was multiplied by a factor according to tabulated values. Afterwards, a 15%-trimmed mean with a window size of 33 samples was applied to remove the persistent baseline.
MEG and EEG data analysis with MNE-Python https://doi.org/10.3389/fnins.2013.00267
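A hedged sketch of those ECG cleaning steps (FFT resampling to 300 Hz, a Daubechies-4 wavelet band-pass, and a sliding 15%-trimmed-mean baseline removal) is given below; the decomposition level, window handling and toy input are illustrative assumptions rather than the exact recipe of Sodmann et al.

import numpy as np
import pywt
from scipy.signal import resample
from scipy.stats import trim_mean

def clean_ecg(sig, fs_in, fs_out=300):
    # 1) FFT-based resampling so all segments share the same sampling rate
    sig = resample(sig, int(len(sig) * fs_out / fs_in))
    # 2) Wavelet band-pass with db4: damp the coarsest (baseline wander) and finest (noise) scales
    coeffs = pywt.wavedec(sig, "db4", level=6)
    coeffs[0] = np.zeros_like(coeffs[0])
    coeffs[-1] = np.zeros_like(coeffs[-1])
    sig = pywt.waverec(coeffs, "db4")[: len(sig)]
    # 3) Remove the persistent baseline with a 15%-trimmed mean over a 33-sample window
    half = 33 // 2
    baseline = np.array([trim_mean(sig[max(0, i - half): i + half + 1], 0.15)
                         for i in range(len(sig))])
    return sig - baseline

cleaned = clean_ecg(np.random.randn(3000), fs_in=360)  # e.g. a 360 Hz MIT-BIH style strip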
• 37. Time Series Invariances
A complexity-invariant distance measure for time series https://doi.org/10.1137/1.9781611972818.60 Gustavo E. A. P. A. Batista, Xiaoyue Wang, and Eamonn J. Keogh. In Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pages 699–710. SIAM, 2011. Cited by 216.
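The complexity-invariant distance itself is a one-line correction on top of the Euclidean distance; a hedged sketch following the definition in Batista et al. (2011), with variable names of our own choosing:

import numpy as np

def complexity(x):
    # "Complexity estimate": length of the series when stretched out flat
    return np.sqrt(np.sum(np.diff(x) ** 2))

def cid_distance(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    ed = np.linalg.norm(x - y)                           # plain Euclidean distance
    cx, cy = complexity(x), complexity(y)
    correction = max(cx, cy) / max(min(cx, cy), 1e-12)   # >= 1, penalizes complexity mismatch
    return ed * correction

t = np.linspace(0, 1, 100)
smooth, jagged = np.sin(2 * np.pi * t), np.sin(2 * np.pi * t) + 0.2 * np.random.randn(100)
print(cid_distance(smooth, jagged))  # larger than np.linalg.norm(smooth - jagged)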
• 38. Time Series: DTW, the classical method
https://doi.org/10.1145/2888451.2888456
Stock Price Prediction with Fluctuation Patterns Using Indexing Dynamic Time Warping and k*-Nearest Neighbors. Kei Nakagawa, Mitsuyoshi Imamura, Kenichi Yoshida (2018) https://doi.org/10.1007/978-3-319-93794-6_7
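For reference, classical DTW is a small dynamic program; the sketch below is the textbook O(nm) version with no warping-window constraint, so it is illustrative rather than the indexed variant used in the paper above (production code would typically add a Sakoe-Chiba band or call a library such as dtaidistance or tslearn).

import numpy as np

def dtw_distance(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],       # insertion
                                 D[i, j - 1],       # deletion
                                 D[i - 1, j - 1])   # match
    return D[n, m]

print(dtw_distance(np.sin(np.linspace(0, 6, 80)), np.sin(np.linspace(0.5, 6.5, 100))))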
• 39. Learning invariances #1a
Learning to Exploit Invariances in Clinical Time-Series Data using Sequence Transformer Networks. Jeeheh Oh, Jiaxuan Wang, Jenna Wiens (submitted on 21 Aug 2018) https://arxiv.org/abs/1808.06725
Recently, researchers have started applying convolutional neural networks (CNNs) with 1D convolutions to clinical tasks involving time-series data. This is due, in part, to their computational efficiency relative to recurrent neural networks, and to their ability to efficiently exploit certain temporal invariances (e.g., phase invariance). However, it is well-established that clinical data may exhibit many other types of invariances (e.g., scaling). While preprocessing techniques (e.g., dynamic time warping) may successfully transform and align inputs, their use often requires one to identify the types of invariances in advance. In contrast, we propose the use of Sequence Transformer Networks, an end-to-end trainable architecture that learns to identify and account for invariances in clinical time-series data. Applied to the task of predicting in-hospital mortality, our proposed approach achieves an improvement in the AUROC.
To address these challenges, we propose Sequence Transformer Networks, an approach for learning task-specific invariances related to amplitude, offset, and scale directly from the data. Applied to clinical time-series data, Sequence Transformer Networks learn input- and task-dependent transformations. In contrast to data augmentation approaches, our proposed approach makes limited assumptions about the presence of invariances in the data.
• 40. Learning invariances #1b
Learning to Exploit Invariances in Clinical Time-Series Data using Sequence Transformer Networks. Jeeheh Oh, Jiaxuan Wang, Jenna Wiens (submitted on 21 Aug 2018) https://arxiv.org/abs/1808.06725
The proposed approach is not without limitation. More specifically, in its current form the Sequence Transformer applies the same transformation across all features within an example, instead of learning feature-specific transformations. Despite this limitation, the learned transformations still lead to an increase in intra-class similarity. In conclusion, we are encouraged by these preliminary results. Overall, this work represents a starting point on which others can build. In particular, we hypothesize that the ability to capture local invariances and feature-specific invariances could lead to further improvements in performance.
• 41. Learning invariances #2
Autowarp: Learning a Warping Distance from Unlabeled Time Series Using Sequence Autoencoders. Abubakar Abid, James Zou, Stanford University (submitted on 23 Oct 2018) https://arxiv.org/abs/1810.10107
Domain experts typically hand-craft or manually select a specific metric, such as dynamic time warping (DTW), to apply to their data. In this paper, we propose Autowarp, an end-to-end algorithm that optimizes and learns a good metric given unlabeled trajectories. We define a flexible and differentiable family of warping metrics, which encompasses common metrics such as DTW, Euclidean, and edit distance. Autowarp then leverages the representation power of sequence autoencoders to optimize for a member of this warping distance family. The output is a metric which is easy to interpret and can be robustly learned from relatively few trajectories. Future work will extend these results to more challenging time series data, such as those with higher dimensionality or heterogeneous data.
• 42. Learning invariances #3
NeuralWarp: Time-Series Similarity with Warping Networks. Josif Grabocka, Lars Schmidt-Thieme (submitted on 20 Dec 2018) https://arxiv.org/abs/1812.08306 | Related articles
In this paper we propose to learn a warping function for aligning the indices of time series in a deep latent representation. We compared the suggested architecture with two types of encoders (CNN or RNN) and a deep forward network as a warping function. Experimental comparisons to non-parametric and un-warped Siamese networks demonstrated that the proposed elastic deep similarity measure is more accurate than prior models.
• 44. SMOTE for imbalanced classes
SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification. Progress in Artificial Intelligence, December 2017, Volume 6, Issue 4, pp 347–354 https://doi.org/10.1007/s13748-017-0128-2
Considering a binary problem with a majority class and a minority class, it is likely that a learning algorithm ignores the latter and still achieves a high accuracy. There are three main ways of dealing with these situations [16]:
● Algorithmic modification: modifying learning algorithms in order to tackle the problem by design.
● Cost-sensitive learning: introducing costs for misclassification of the minority class at the data or algorithmic level.
● Data sampling: preprocessing the data in order to reduce the gap between the number of instances of each class.
The SMOTE technique is based on the idea of the neighborhood of the k-nearest neighbor (kNN) rule. The area under the ROC curve results show that the use of oversampling methods improves the detection of the minority class in Big Data datasets. We have also shown how our design can successfully work on a wide range of devices, including a laptop, while requiring reasonable times: around 25 min on high-end devices, and less than 2 h on the laptop, for the most time-demanding experiment.
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary (2018) https://doi.org/10.1613/jair.1.11192
● GS4 (Moutafis & Kakadiaris, 2014), SEG-SSC (Triguero et al., 2015) and OCHS-SSC (Dong et al., 2016) generate synthetic examples to diminish the drawbacks produced by the absence of labeled examples. Several learning techniques were checked and some properties, such as the common hidden space between labeled samples and the synthetic samples, were exploited.
● The technique proposed by Park et al. (2014) is a semi-supervised active learning method in which labels are incrementally obtained and applied using a clustering algorithm.
In the context of the current challenges outlined, we highlighted the need for enhancing the treatment of small disjuncts, noise, lack of data, overlapping, dataset shift and the curse of dimensionality. To do so, the theoretical properties of SMOTE regarding these data characteristics, and its relationship with the new synthetic instances, must be further analyzed in depth. Finally, we also posited that it is important to focus on data sampling and preprocessing approaches (such as SMOTE and its extensions) within the framework of Big Data and real-time processing.
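A hedged usage sketch of SMOTE with the imbalanced-learn package is shown below; the synthetic 95/5 dataset and the random-forest classifier are illustrative assumptions, and resampling is applied to the training split only so the test distribution stays untouched.

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 20)
y = np.r_[np.zeros(950), np.ones(50)]  # heavily imbalanced binary labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X_tr, y_tr)  # oversample minority
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_res, y_res)
print("resampled class counts:", np.bincount(y_res.astype(int)))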
• 47. State of the art: 2-year-old cutting edge #1
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data (2016). Markus Goldstein, Seiichi Uchida https://doi.org/10.1371/journal.pone.0152173
Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to be a new well-funded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time.
As a general summary for algorithm selection, we recommend using nearest-neighbor based methods, in particular k-NN for global tasks and LOF for local tasks, instead of clustering-based methods. If computation time is essential, HBOS is a good candidate, especially for larger datasets. Special attention should be paid to the nature of the dataset when applying local algorithms, and to whether local anomalies are of interest at all in that case.
Different anomaly detection modes depending on the availability of labels in the dataset: (a) supervised anomaly detection uses a fully labeled dataset for training; (b) semi-supervised anomaly detection uses an anomaly-free training dataset, and deviations in the test data from that normal model are then used to detect anomalies; (c) unsupervised anomaly detection algorithms use only intrinsic information of the data in order to detect instances deviating from the majority of the data.
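The two recommended detectors are easy to reproduce with scikit-learn; the hedged sketch below computes a global k-NN anomaly score and the Local Outlier Factor on a toy 2D dataset (neighbourhood sizes and data are illustrative assumptions, not the paper's benchmark settings).

import numpy as np
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(300, 2), rng.randn(5, 2) * 0.2 + 6])  # one cluster plus a few outliers

# Global k-NN score: mean distance to the k nearest neighbours (larger = more anomalous)
k = 10
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
knn_score = dists[:, 1:].mean(axis=1)  # column 0 is the zero self-distance

# Local Outlier Factor: density of a point relative to its neighbours' density (local anomalies)
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
lof_score = -lof.negative_outlier_factor_  # higher = more anomalous

print("top-5 global outliers:", knn_score.argsort()[-5:])
print("top-5 local outliers: ", lof_score.argsort()[-5:])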
• 48. State of the art: 2-year-old cutting edge #2
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data (2016). Markus Goldstein, Seiichi Uchida https://doi.org/10.1371/journal.pone.0152173
A visualization of the results of the k-NN global anomaly detection algorithm: the anomaly score is represented by the bubble size, whereas the color shows the labels of the artificially generated dataset.
Comparing Influenced Outlierness (INFLO) with the Local Outlier Factor (LOF) shows the usefulness of the reverse neighborhood set. For the red instance, LOF takes only the neighbors in the gray area into account, resulting in a high anomaly score. INFLO additionally takes the blue instances (reverse neighbors) into account and thus scores the red instance as more normal.
• 49. Anomaly detection: Cyber-physical systems
Anomaly Detection with Generative Adversarial Networks for Multivariate Time Series (2018). Dan Li, Dacheng Chen, Jonathan Goh, and See-Kiong Ng, Institute of Data Science, National University of Singapore https://arxiv.org/abs/1809.04758
Unsupervised machine learning techniques can be used to model the system behaviour and classify deviant behaviours as possible attacks. In this work, we proposed a novel Generative Adversarial Networks-based Anomaly Detection (GAN-AD) method for such complex networked CPSs. We used an LSTM-RNN in our GAN to capture the distribution of the multivariate time series of the sensors and actuators under normal working conditions of a CPS. Instead of treating each sensor's and actuator's time series independently, we model the time series of multiple sensors and actuators in the CPS concurrently to take into account potential latent interactions between them. To exploit both the generator and the discriminator of our GAN, we deployed the GAN-trained discriminator together with the residuals between generator-reconstructed data and the actual samples to detect possible anomalies in the complex CPS. We will also conduct further research on feature selection for multivariate anomaly detection, and investigate principled methods for choosing the latent dimension and PC dimension with theoretical guarantees.
• 50. Anomaly detection: Financial time series
Modeling approaches for time series forecasting and anomaly detection (2018). Du, Shuyang; Pandey, Madhulima; Xing, Cuiqun http://cs229.stanford.edu/proj2017/final-reports/5244275.pdf
This project focuses on prediction of time series data for Wikipedia page accesses over a period of more than twenty-four months. The methods explored here are K-nearest neighbors (KNN), Long short-term memory networks (LSTM), and Sequence to Sequence with Convolutional Neural Networks (CNN), and we compare predicted values to actual web traffic. The predictions can help us in anomaly detection in the series.
Pre-processing: "There are many series in which values are zero. This could be a missing value, or an actual lack of web page access. In addition, there are significant spikes in the data, where values have a broad range from 1 to hundreds/thousands for several web pages. We normalize this data by adding 1 to all entries, taking the log of the values, and setting the mean to zero and variance to one. We have the results of Fourier analysis for exploring periodicity on a weekly/monthly/quarterly basis."
Our approaches to time series prediction depend on features extracted from the time series data itself. Our models learn periodicity, ramps and other regular trends quite well. However, none of our models are able to capture spikes or outliers that arise from external sources. Enhancing the performance of the models will require augmenting our feature set from other sources such as news events and weather.
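The quoted normalization is a one-liner worth spelling out; a hedged sketch with toy counts, and a small epsilon on the standard deviation added as our own safeguard against constant series:

import numpy as np

def normalize_series(counts):
    x = np.log1p(np.asarray(counts, dtype=float))  # log(1 + count) keeps zero-access days defined
    return (x - x.mean()) / (x.std() + 1e-8)       # zero mean, unit variance per series

daily_hits = np.array([0, 3, 12, 7, 0, 950, 15, 9])  # note the spike these models struggle with
print(normalize_series(daily_hits))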
• 51. "Special Outliers": Disguised missing values
FAHES: A Robust Disguised Missing Values Detector. Qatar Computing Research Institute, HBKU, Doha, Qatar https://doi.org/10.1145/3219819.3220109
Missing values are common in real-world data and may seriously affect data analytics such as simple statistics and hypothesis testing. Generally speaking, there are two types of missing values: explicitly missing values (i.e. NULL values), and implicitly missing values (a.k.a. disguised missing values, DMVs) such as "11111111" for a phone number and "Some college" for education. While detecting explicitly missing values is trivial, detecting DMVs is not; the essential challenge is the lack of standardization about how DMVs are generated.
One future work we are planning to perform is to improve FAHES to detect the DMVs that are generated randomly within the range of the data. For example, when a child tries to create an account on a domain that has a minimum-age restriction, the child fakes her age with a random value that allows her to create the account. Such random fake values are hard, if not impossible, to detect. Moreover, although DMVs are the focus of this paper, there are more types of errors found in the wild. Many of the principles and techniques we have used to detect DMVs can be leveraged to detect other types of errors, so a natural next step is to extend the infrastructure we have built to detect those. This opens new challenges related to the robust identification of errors that could be interpreted differently by different modules.
• 53. Uncertainty and Novelty detection #1a
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018). Alireza Shafaei, Mark Schmidt, and James J. Little https://arxiv.org/abs/1809.04729
What makes this problem different from a typical supervised learning setting is that we cannot model the diversity of out-of-distribution samples in practice. The distribution of outliers used in training may not be the same as the distribution of outliers encountered in the application. Therefore, classical approaches that learn inliers vs. outliers with only two datasets can yield optimistic results. We introduce OD-test, a three-dataset evaluation scheme, as a practical and more reliable strategy to assess progress on this problem. The OD-test benchmark provides a straightforward means of comparison for methods that address the out-of-distribution sample detection problem.
In real-life deployment of products that use complex machinery such as deep neural networks (DNNs), we would have very little control over the input. In the absence of extrapolation guarantees, when the independently and identically distributed (IID) assumption is violated, the behaviour of the pipeline may be unpredictable. From a quality assurance perspective, it is desirable to detect and prevent these scenarios automatically. A reliable pipeline would first determine whether it can process a given sample, and only then use the prediction of the target neural network. The unfortunate incident that mislabeled people as non-human, for instance, is a clear example of OOD extrapolation that could have been prevented by such a decision scheme: the model simply did not know that it did not know. While incidents of a similar nature have fueled research on de-biasing the datasets and the deep learning machinery, we still need to identify the limitations of our models. The application is not limited to fortifying large-scale user-facing products. Successful detection of such violations could also be used in active learning, unsupervised learning, learning with noisy data, or simply as a condition for invoking transfer learning strategies. In this work, we are interested in evaluating mechanisms that detect OOD samples.
• 54. Uncertainty and Novelty detection #1b
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018). Alireza Shafaei, Mark Schmidt, and James J. Little https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
The Uncertainty View: a commonly invoked strategy in addressing similar problems is to characterize a notion of uncertainty. The literature distinguishes aleatoric uncertainty, the uncertainty inherent to the process (the known unknowns, like flipping a coin), from epistemic uncertainty, the uncertainty that can be eliminated with more information (the unknown unknowns). The Bayesian approach to epistemic uncertainty estimation is to measure the degree of disagreement among the potentially viable models (the posterior). The MC-Dropout approach is often advertised as a feasible method to estimate uncertainty for a variety of applications. Similarly, we can adopt a non-Bayesian approach by training independent models and then measuring the disagreement. Lakshminarayanan et al. show that an ensemble of five neural networks (DeepEnsemble) trained with an adversarial-sample-augmented strategy is sufficient to provide a non-Bayesian alternative for capturing predictive uncertainty. We evaluate DeepEnsemble and MC-Dropout.
* The Abstention View
* The Anomaly View: AEThreshold, PixelCNN++, K-NNSVM
* The Novelty View: OpenMax
We train these architectures with a cross-entropy loss (CE) and a k-way logistic regression loss (KL). CE loss is the typical choice for k-way classification tasks; it enforces mutual exclusion in the predictions. KL loss is the typical choice for attribute prediction tasks; it does not enforce mutual exclusivity of the predictions. We test these two loss functions to see if the exclusivity assumption of CE has an adverse effect on the ability to predict OOD samples. CE loss cannot make a None prediction without an explicitly defined None class, but KL loss can make None predictions through low activations of all the classes.
• 56. Uncertainty and Novelty detection #1d
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018). Alireza Shafaei, Mark Schmidt, and James J. Little https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test [PyTorch]
Related work in deep learning can be categorized into two broad groups based on the underlying assumptions: (i) in-distribution techniques, and (ii) out-of-distribution techniques. Guo et al. (2017) observed that modern neural networks tend to be overconfident in their predictions. They show that temperature scaling in the softmax operator, also known as Platt scaling, can be used to calibrate the output probabilities of a neural network to empirically align the accuracy of a prediction with its probability. Their efforts fall under the uncertainty estimation approaches. Geifman and El-Yaniv (2017) present a framework for selective classification with deep neural networks that follows the abstention view. A selection function decides whether to make a prediction or not. For the choice of selection function, they experiment with MC-Dropout and the softmax output. They provide an analytical trade-off between risk and coverage within their formulation.
Input perturbation serves as a way to assess how the network would behave near the given input. When the temperature is 1 and the perturbation step is 0, we simply recover the PbThreshold method. ODIN, the state-of-the-art at the time of this writing, is reported to outperform the previous work [8] by a significant margin. We also assess the performance of ODIN in our work. These methods provide an abstract idea which depends on the successful training of GANs. To the best of our knowledge, training GANs is itself an active area of research, and it is not apparent what design decisions would be appropriate to implement these ideas in practice. Furthermore, some of these ideas are prohibitively expensive to execute at the time of this writing.
• 57. Uncertainty and Novelty detection #1e
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018). Alireza Shafaei, Mark Schmidt, and James J. Little https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
Datasets: we extend the previous work by evaluating over a broader set of datasets with varying levels of complexity. The variation in complexity allows for a fine-grained evaluation of the techniques. Since OOD detection is closely related to the problem of density estimation, the dimensionality of the input image will be of vital importance in practical assessments. As the input dimensionality increases, we expect the task to become much more difficult. Therefore, to provide a more accurate picture of performance, it is crucial to evaluate the methods on high-dimensional data.
In low-dimensional datasets, K-NNSVM performs similarly to or better than the other methods (including MC-Dropout). The top-performing method, ODIN, is influenced by the number of classes in the dataset. Similar to PbThreshold, ODIN depends on the maximum signal in the class predictions, so the increased number of classes directly affects both methods. Furthermore, neither of them consistently prefers VGG over ResNet across all datasets. Overall, ODIN consistently outperforms the others in high-dimensional settings, but all the methods have a relatively low average accuracy in the 60%–78% range.
• 58. Uncertainty and Novelty detection #1f
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018). Alireza Shafaei, Mark Schmidt, and James J. Little https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
• 59. Uncertainty and Novelty detection #2
To Trust Or Not To Trust A Classifier. Heinrich Jiang, Been Kim, Maya Gupta (2018), Google Research; Google Brain https://arxiv.org/abs/1805.11783
We propose a new score, called the trust score, which measures the agreement between the classifier and a modified nearest-neighbor classifier on the testing example. We show empirically that high (low) trust scores produce surprisingly high precision at identifying correctly (incorrectly) classified examples, consistently outperforming the classifier's confidence score as well as many other baselines.
Two example datasets and models: predicting correctness (top row) and incorrectness (bottom). The vertical dotted black line indicates the accuracy level of the classifier. The trust score consistently attains a higher precision for each given percentile of classifier decision-rejection. Furthermore, the trust score generally shows increasing precision as the percentile level increases but, surprisingly, many of the comparison baselines do not.
• 60. Uncertainty and Novelty detection #3
Interpreting Neural Networks With Nearest Neighbors. Eric Wallace, Shi Feng, Jordan Boyd-Graber https://arxiv.org/abs/1809.02847
Local model interpretation methods explain individual predictions by assigning an importance value to each input feature. This value is often determined by measuring the change in confidence when a feature is removed. However, the confidence of neural networks is not a robust measure of model uncertainty. This issue makes reliably judging the importance of the input features difficult. We address this by changing the test-time behavior of neural networks using Deep k-Nearest Neighbors. Without harming text classification accuracy, this algorithm provides a more robust uncertainty metric which we use to generate feature importance values. The resulting interpretations better align with human perception than baseline methods. Finally, we use our interpretation method to analyze model predictions on dataset annotation artifacts.
Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. Nicolas Papernot and Patrick D. McDaniel (2018) https://arxiv.org/abs/1803.04765
Debugging ResNet model biases: this illustrates how the DkNN algorithm helps to understand a bias identified by Stock and Cisse [105] in the ResNet model for ImageNet. The image at the bottom of each column is the test input presented to the DkNN. Each test input is cropped slightly differently to include (left) or exclude (right) the football. Images shown at the top are nearest neighbors in the predicted class according to the representation output by the last hidden layer. This comparison suggests that the "basketball" prediction may have been a consequence of the ball being in the picture. Also note how the white apparel color and general arm positions of players often match the test image of Barack Obama.
• 61. Uncertainty and Novelty detection #4
AND: Autoregressive Novelty Detectors. Davide Abati, Angelo Porrello, Simone Calderara, Rita Cucchiara (submitted on 4 Jul 2018) https://arxiv.org/abs/1807.01653
We propose an unsupervised model for novelty detection. The subject is treated as a density estimation problem, in which a deep neural network is employed to learn a parametric function that maximizes the probabilities of training samples. This is achieved by equipping an autoencoder with a novel module, responsible for the maximization of compressed codes' likelihood by means of autoregression. We illustrate design choices and proper layers to perform autoregressive density estimation when dealing with both image and video inputs. Despite a very general formulation, our model shows promising results in diverse one-class novelty detection and video anomaly detection benchmarks.
The structure of the proposed autoencoder: paired with a standard compression-reconstruction network, a density estimation module learns the distribution of latent codes via autoregression.
• 62. Anomaly detection with GANs #1
Anomaly detection with Wasserstein GAN. Ilyass Haloui, Jayant Sen Gupta, and Vincent Feuillard (submitted on 11 Dec 2018) https://arxiv.org/pdf/1812.02463
In this paper, we investigate GANs to perform anomaly detection on a time series dataset. In order to achieve this goal, a bibliography is made focusing on theoretical properties of GANs and on GANs used for anomaly detection. A Wasserstein GAN has been chosen to learn the representation of the normal data distribution, and a stacked encoder together with the generator performs the anomaly detection. W-GAN with an encoder seems to produce state-of-the-art anomaly detection scores on the MNIST dataset, and we investigate its usage on multivariate time series.
Based on this literature review, we chose to perform anomaly detection using a Wasserstein Generative Adversarial Network. The main reason is that the Wasserstein GAN does not collapse, contrarily to the classical GAN which needs to be heavily tuned in order to avoid this problem. Mode collapse can be blocking if we need to perform anomaly detection: if a subset of our data distribution is not learned by the generator, then all samples that are similar to this subset might end up classified as abnormal. Another added value of the Wasserstein GAN compared to a standard GAN is the possibility of using the loss function of the discriminator to evaluate convergence, since it is an approximation of the Wasserstein distance between Pr and Pθ.
A future improvement consists in considering CNNs for both the generator and the discriminator in order to detect anomalies from raw time series data. 1D convolutions are needed and will be investigated to produce good visual representations of time series samples. A more thorough study of the impact of the architecture should also be done.
• 63. Anomaly detection with GANs #2
MAD-GAN: Multivariate Anomaly Detection for Time Series Data with Generative Adversarial Networks. Dan Li, Dacheng Chen, Lei Shi, Baihong Jin, Jonathan Goh, and See-Kiong Ng (submitted on 15 Jan 2019), Institute of Data Science, National University of Singapore https://arxiv.org/abs/1901.04997
In this work, we propose a novel Multivariate Anomaly Detection strategy with GAN (MAD-GAN) to model the complex multivariate correlations among the multiple data streams and to detect anomalies using both the GAN-trained generator and discriminator. Unlike traditional classification methods, the GAN-trained discriminator learns to detect fake data from real data in an unsupervised fashion, making it an attractive unsupervised machine learning technique for anomaly detection.
Given that this is an early attempt at multivariate anomaly detection on time series data using GANs, there are interesting issues that await further investigation. For example, we have noted the issues of determining the optimal subsequence length as well as the potential model instability of the GAN approaches. For future work, we plan to conduct further research on feature selection for multivariate anomaly detection, and investigate principled methods for choosing the latent dimension and PC dimension with theoretical guarantees. We also hope to perform a detailed study on the stability of the detection model. In terms of applications, we plan to explore the use of MAD-GAN for other anomaly detection applications such as predictive maintenance and fault diagnosis for smart buildings and machinery.
• 64. Uncertainty: Insights from NLP uncertainty
Quantifying Uncertainties in Natural Language Processing Tasks. Yijun Xiao and William Yang Wang (submitted on 18 May 2018) https://arxiv.org/abs/1811.07253
In this paper, we propose novel methods to study the benefits of characterizing model and data uncertainties for natural language processing (NLP) tasks. With empirical experiments on sentiment analysis, named entity recognition, and language modeling using convolutional and recurrent neural network models, we show that explicitly modeling uncertainties is not only necessary to measure output confidence levels, but also useful for enhancing model performance in various NLP tasks.
1. We mathematically define model and data uncertainties via the law of total variance;
2. Our empirical experiments show that by accounting for model and data uncertainties, we observe significant improvements in three important NLP tasks;
3. We show that our model outputs higher data uncertainties for more difficult predictions in sentiment analysis and named entity recognition tasks.
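The decomposition referred to in point 1 is the standard law of total variance, which splits the predictive variance into a data (aleatoric) term and a model (epistemic) term; the notation below is ours, not copied from the paper:

\[
\operatorname{Var}(y \mid x)
  = \underbrace{\mathbb{E}_{\theta}\big[\operatorname{Var}(y \mid x, \theta)\big]}_{\text{data uncertainty}}
  \;+\; \underbrace{\operatorname{Var}_{\theta}\big(\mathbb{E}[y \mid x, \theta]\big)}_{\text{model uncertainty}}
\]

In practice both terms can be estimated by sampling model parameters θ (e.g. via dropout) and recording, per input, the predicted mean and the predicted noise variance across samples.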
• 65. Uncertainty: CNNs + Gaussian Processes
Calibrating Deep Convolutional Gaussian Processes. Gia-Lac Tran, Edwin V. Bonilla, John P. Cunningham, Pietro Michiardi, Maurizio Filippone (submitted on 26 May 2018) https://arxiv.org/abs/1805.10522
Despite the considerable interest in combining CNNs with GPs, little attention has been devoted to understanding the implications in terms of the ability of these models to accurately quantify the level of uncertainty in predictions. This is the first work that highlights the calibration issues of these models, showing that GPs cannot cure the miscalibration in CNNs. We have proposed a novel combination of CNNs and GPs where the resulting model becomes a particular form of a Bayesian CNN for which inference using variational inference is straightforward. However, our results also indicate that combining CNNs and GPs does not significantly improve the performance of standard CNNs. This can serve as a motivation for investigating new approximation methods for scalable inference in GP models and combinations with CNNs.
Calibration of Convolutional Networks: the issue of calibration of classifiers in machine learning was popularized in the 90s with the use of support vector machines for probabilistic classification. Calibration techniques aim to learn a transformation of the output using a validation set, so that the transformed output gives a reliable account of the actual probability of class labels; interestingly, calibration can be applied regardless of the probabilistic nature of the untransformed output of the classifier. Popular calibration techniques include Platt scaling and isotonic regression. Classifiers based on Deep Neural Networks (DNNs) have been shown to be well-calibrated; the reason is that the optimization of the cross-entropy loss promotes calibrated output. The same loss is used in Platt scaling and it corresponds to the correct multinomial likelihood for class labels. Recent studies on the calibration of CNNs, which are a particular case of DNNs, however, show that depth has a negative impact on calibration, despite the use of a cross-entropy loss, and that regularization improves the calibration properties of classifiers [Guo et al. 2017].
Combinations of ConvNets and Gaussian Processes: thinking of Bayesian priors as a form of regularization, it is natural to assume that Bayesian CNNs can "cure" the miscalibration of modern CNNs. Despite the abundant literature on Bayesian DNNs, far less attention has been devoted to Bayesian CNNs, and the calibration properties of these approaches have not been investigated. In this work, we propose an alternative way to combine CNNs and GPs, where GPs are approximated using random feature expansions. The random feature expansion approximation amounts to replacing the original kernel matrix with a low-rank approximation, turning GPs into Bayesian linear models. Combining this with CNNs leads to a particular form of Bayesian CNNs, much like GPs and DGPs are particular forms of Bayesian DNNs. Inference in Bayesian CNNs is intractable and requires some form of approximation. In this work, we draw on the interpretation of dropout as variational inference, employing the so-called Monte Carlo Dropout (MCD) to obtain a practical way of combining CNNs and GPs.
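Monte Carlo Dropout, which the paper leans on for approximate inference, amounts to keeping dropout active at prediction time and averaging several stochastic forward passes; the tiny 1D CNN below is a hedged illustration with made-up layer sizes, not the architecture from the paper.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(1, 16, kernel_size=5, padding=2)
        self.drop = nn.Dropout(0.2)
        self.fc = nn.Linear(16, n_classes)

    def forward(self, x):                      # x: (batch, 1, time)
        h = self.drop(torch.relu(self.conv(x)))
        return self.fc(h.mean(dim=-1))         # global average pooling over time

def mc_dropout_predict(model, x, n_samples=50):
    model.train()                              # keep dropout stochastic at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return probs.mean(0), probs.std(0)         # predictive mean and a simple spread estimate

mean_prob, spread = mc_dropout_predict(TinyCNN(), torch.randn(4, 1, 128))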
• 66. Uncertainty in timestamps, modeling for clinical use #1
Time-Discounting Convolution for Event Sequences with Ambiguous Timestamps (submitted on 6 Dec 2018) https://arxiv.org/abs/1812.02395
This paper proposes a method for modeling event sequences with ambiguous timestamps, a time-discounting convolution. Unlike in ordinary time series, time intervals are not constant, small time-shifts have no significant effect, and inputting timestamps or time durations into a model is not effective. The criteria that we require for the modeling are providing robustness against time-shifts or timestamp uncertainty, as well as maintaining the essential capabilities of time-series models, i.e., forgetting meaningless past information and handling infinite sequences. The proposed method handles them with a convolutional mechanism across time with specific parameterizations, which efficiently represents the event dependencies in a time-shift invariant manner while discounting the effect of past events, and a dynamic pooling mechanism, which provides robustness against the uncertainty in timestamps and enhances the time-discounting capability by dynamically changing the pooling window size.
• 68. Types of Missing Values
Feldman et al. (2018): "Rubin (1976) discusses three possible mechanisms for the formation of missing values, each reflecting a different form of missing-data probabilities and relationships between the measured variables, and each may lead to different imputation methods (Luengo et al., 2012)."
Missing Completely at Random (MCAR): a missing value that cannot be related to the value itself or to other variable values in that record. This is a completely unsystematic missing pattern, and therefore the observed data can be thought of as a random, unbiased sample of a complete dataset.
Missing at Random (MAR): cases in which a missing value is related to other variable values in that record, but not to the value itself (e.g., a person with a "marital status" value of "single" has a missing value in the "spouse name" attribute). In other words, in MAR scenarios incomplete data can be partially explained, and the actual value can possibly be predicted from other variable values.
Missing Not at Random (MNAR): the missing value is not random and depends on the actual value itself; hence, it cannot be explained by other values (e.g., an overweight person is reluctant to provide the "weight" value in a survey). MNAR scenarios are the most difficult to analyze and handle, as the missing data cannot be associated with other data items that are available in the dataset.
https://statistical-programming.com/missing-data/
Missing in action: the dangers of ignoring missing data https://doi.org/10.1016/j.tree.2008.06.014
• 69. Intro to imputation methods
Comparison of Estimating Missing Values in IoT Time Series Data Using Different Interpolation Algorithms (August 2018) https://doi.org/10.1007/s10766-018-0595-5
"When collecting Internet of Things data using various sensors or other devices, it may be possible to miss several kinds of values of interest. In this paper, we focus on estimating the missing values in IoT time series data using three interpolation algorithms: (1) Radial Basis Functions, (2) Moving Least Squares (MLS), and (3) Adaptive Inverse Distance Weighted."
On the choice of the best imputation methods for missing values considering three groups of classification methods (June 2011) https://doi.org/10.1007/s10115-011-0424-2 | https://sci2s.ugr.es/MVDM
"In this work, we focus on a classification task with twenty-three classification methods and fourteen different imputation approaches to missing values treatment that are presented and analyzed. The analysis involves a group-based approach, in which we distinguish between three different categories of classification methods. Each category behaves differently, and the evidence obtained shows that the use of determined missing-value imputation methods could improve the accuracy obtained for these methods. In this study, the convenience of using imputation methods for preprocessing data sets with missing values is stated. The analysis suggests that the use of particular imputation methods conditioned on the groups is required." We have discovered that the Combined Multivariate Collapsing (CMC) and Event Covering (EC) methods show good behavior for these two measures, and they provide good results for an important range of learning methods, as previously analyzed. In short, these two approaches introduce less noise and maintain the mutual information better.
Class center based approach for missing value imputation (2018) https://doi.org/10.1016/j.knosys.2018.03.026
A novel missing value imputation is introduced, composed of two modules. Each class center and its distances from the other observed data are measured to identify a threshold; the identified threshold is then used for missing value imputation. The proposed approach outperforms the other approaches for both numerical and mixed datasets, and requires much less imputation time than the machine learning based methods.
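Before the learned imputers on the following slides, simple interpolation remains the usual baseline; a hedged pandas sketch for a regularly sampled series (the methods and toy values are generic choices, not those benchmarked in the papers above):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 1.2, np.nan, np.nan, 2.1, 2.4, np.nan, 3.0],
              index=pd.date_range("2019-01-01", periods=8, freq="H"))

linear = s.interpolate(method="time")             # linear in time across the gaps
spline = s.interpolate(method="spline", order=2)  # smoother; requires scipy
print(pd.DataFrame({"raw": s, "linear": linear, "spline": spline}))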
• 70. Imputation with Deep Learning #1
BRITS: Bidirectional Recurrent Imputation for Time Series. Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, Yitan Li (submitted on 27 May 2018) https://arxiv.org/abs/1805.10572 https://github.com/NIPS-BRITS/BRITS
Existing imputation methods often impose strong assumptions on the underlying data-generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing-value imputation in time series data. Our proposed method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of the RNN graph and can be effectively updated during backpropagation. We simultaneously perform missing-value imputation and classification/regression jointly in one neural graph. BRITS has three advantages: (a) it can handle multiple correlated missing values in time series; (b) it generalizes to time series with nonlinear underlying dynamics; (c) it provides a data-driven imputation procedure and applies to general settings with missing data. We evaluate the imputation performance in terms of mean absolute error (MAE) and mean relative error (MRE).
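The two reported metrics are straightforward to compute over the imputed positions only; a hedged sketch with variable names of our own:

import numpy as np

def mae_mre(y_true, y_pred, missing_mask):
    # Evaluate only where values were actually missing and later imputed
    diff = np.abs(y_true - y_pred)[missing_mask]
    mae = diff.mean()
    mre = diff.sum() / (np.abs(y_true[missing_mask]).sum() + 1e-8)
    return mae, mre

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.5, 2.8, 4.0])
mask = np.array([False, True, True, False])  # the 2nd and 3rd values were imputed
print(mae_mre(y_true, y_pred, mask))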
• 71. Imputation with Deep Learning #2
End-to-End Time Series Imputation via Residual Short Paths. Lifeng Shen, Qianli Ma, Sen Li (2018) http://proceedings.mlr.press/v95/shen18a.html
We propose an end-to-end imputation network with residual short paths, called Residual IMPutation LSTM (RIMP-LSTM), a flexible combination of residual short paths with graph-based temporal dependencies. We construct a residual sum unit (RSU), which enables RIMP-LSTM to make full use of previously revealed information to model incomplete time series and reduce the negative impact of missing values. Moreover, a switch unit is designed to detect the missing values, and a new loss function is then developed to train our model with time series in the presence of missing values in an end-to-end way, which also allows simultaneous imputation and prediction. RIMP-LSTM combines the merits of graph-based models, with explicitly modeled temporal dependencies via weighted residual connections between nodes, with those of LSTMs, which can accumulate historical residual information and learn the underlying patterns of incomplete time series automatically. On the other hand, compared with IMP-LSTM, RIMP-LSTM has better performance as it is good at modeling temporal dependencies with weighted residual short paths, which demonstrates the reasonableness of using these weighted residual paths to model graph-like temporal dependencies for imputation.
• 72. Imputation with Deep Learning #3
A context encoder for audio inpainting. Andres Marafioti, Nathanael Perraudin, Nicki Holighaus, and Piotr Majdak (submitted on 29 Oct 2018) https://arxiv.org/abs/1810.12138 http://www.github.com/andimarafioti/audioContextEncoder (Python, Matlab)
We studied the ability of deep neural networks (DNNs) to restore missing audio content based on its context, a process usually referred to as audio inpainting. We focused on gaps in the range of tens of milliseconds, a condition which has not received much attention yet. The proposed DNN structure was trained on audio signals containing music and musical instruments, separately, with 64-ms long gaps.
Here, the STFT features, meant as a reasonable first choice, provided a decent performance. In the future, we expect more hearing-related features to provide even better reconstructions. In particular, an investigation of Audlet frames, i.e., invertible time-frequency systems adapted to perceptual frequency scales, as features for audio inpainting presents intriguing opportunities. Preferred architectures are those not relying on a predetermined target and input feature length, e.g., a recurrent network. Recent advances in generative networks will provide other interesting alternatives for analyzing and processing audio data as well; these approaches are yet to be fully explored. Finally, music data can be highly complex and it is unreasonable to expect a single trained model to accurately inpaint a large number of musical styles and instruments at once. Thus, instead of training on a very general dataset, we expect significantly improved performance for more specialized networks that could be trained by restricting the training data to specific genres or instrumentation. Applied to a complex mixture and potentially preceded by a source-separation algorithm, the resulting models could be used jointly in a mixture-of-experts approach.
• 73. Imputation with Deep Learning #4: GANs
NAOMI: Non-Autoregressive Multiresolution Sequence Imputation. Yukai Liu, Rose Yu, Stephan Zheng, Eric Zhan, Yisong Yue (submitted on 30 Jan 2019) https://arxiv.org/abs/1901.10946
Leveraging multiresolution modeling and adversarial training, NAOMI is able to learn the conditional distribution given very few known observations and achieves superior performance in various experiments on both deterministic and stochastic dynamics. Future work will investigate how to infer the underlying distribution when complete training data is unavailable. The trade-off between partial observations and external constraints is another direction for deep generative imputation models.
• 74. Effect of missing values on classification performance
A methodology for quantifying the effect of missing data on decision quality in classification problems. Received 09 Mar 2016, accepted 22 Dec 2016, accepted author version posted online 13 Jan 2017 https://doi.org/10.1080/03610926.2016.1277752
"This study suggests that the negative impact of poor data quality (DQ) on decision making is often mediated by biased model estimation. To highlight this perspective, we develop an analytical framework that links three quality levels: data, model, and decision. The general framework is first developed at a high level."
Evolutionary Machine Learning for Classification with Incomplete Data. Tran, Cao Truong (2018, PhD Thesis) http://hdl.handle.net/10063/7639
"The thesis develops approaches for improving imputation for classification with incomplete data by integrating clustering and feature selection with imputation. The approaches improve both the effectiveness and the efficiency of using imputation for classification with incomplete data. The thesis develops interval genetic programming to directly evolve classifiers for incomplete data. The results show that classifiers generated by interval genetic programming can be more effective and efficient than classifiers generated by the combination of imputation and traditional genetic programming. Interval genetic programming is also more effective than common classification algorithms able to work directly with incomplete data."
• 75. Imputation and Classification
Missing Data Imputation for Supervised Learning (August 2018) https://doi.org/10.1080/08839514.2018.1448143
"This paper compares methods for imputing missing categorical data for supervised classification tasks."
The results of the present study show that perturbation can help increase predictive accuracy for imputed models, but not for one-hot encoded models. Future work can identify the conditions under which missing-data perturbation can improve prediction accuracy. Interesting extensions of this paper include evaluating the benefits of using missing-data perturbation over more popular regularization techniques such as dropout training.
Error rates on the Adult test set with (bottom) and without (top) missing-data imputation, for various levels of MCAR-perturbed categorical training features (x-axis). The Adult dataset contains N = 48,842 examples and 14 features (6 continuous and 8 categorical); the prediction task is to determine whether a person makes over $50,000 a year.
• 77. CEEMD: Empirical Mode Decomposition
Empirical mode decomposition for seismic time-frequency analysis. Jiajun Han and Mirko van der Baan. Geophysics (2013) 78 (2): O9–O19. https://doi.org/10.1190/geo2012-0199.1
Complete ensemble empirical mode decomposition (CEEMD) decomposes a seismic signal into a sum of oscillatory components with guaranteed positive and smoothly varying instantaneous frequencies. Analysis on synthetic and real data demonstrates that this method promises higher spectral-spatial resolution than the short-time Fourier transform or the wavelet transform. Application on field data thus offers the potential of highlighting subtle geologic structures that might otherwise escape unnoticed. CEEMD is a robust extension of EMD methods: it solves not only the mode-mixing problem, but also leads to complete signal reconstructions. After CEEMD, instantaneous frequency spectra manifest visibly higher time-frequency resolution than short-time Fourier and wavelet transforms on synthetic and field data examples. These characteristics render the technique highly promising for seismic processing and interpretation.
Introducing libeemd: A program package for performing the ensemble empirical mode decomposition (July 2015). Computational Statistics 31(2):1-13. P. J. J. Luukko, Jouni Helske, E. Räsänen. C, R and Python. http://doi.org/10.1007/s00180-015-0603-9 https://bitbucket.org/luukko/libeemd
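A hedged sketch of ensemble EMD on a toy two-tone signal is given below, assuming the PyEMD package (published on PyPI as EMD-signal); the libeemd/pyeemd bindings cited above expose an equivalent eemd() routine, and the ensemble size and noise width here are illustrative.

import numpy as np
from PyEMD import EEMD

t = np.linspace(0, 1, 1000)
signal = (np.sin(2 * np.pi * 5 * t)            # slow oscillation
          + 0.5 * np.sin(2 * np.pi * 40 * t)   # fast oscillation
          + 0.1 * np.random.randn(t.size))     # noise

eemd = EEMD(trials=100, noise_width=0.05)      # noise-assisted ensemble decomposition
imfs = eemd.eemd(signal, t)                    # rows = intrinsic mode functions, fast to slow
print(imfs.shape)                              # e.g. (n_imfs, 1000); the Hilbert transform of
                                               # each IMF then gives the time-frequency picture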
• 78. Source Separation, "signal decomposition" #1
Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation. Daniel Stoller, Sebastian Ewert, Simon Dixon. Queen Mary University of London, Spotify (submitted on 8 Jun 2018) https://arxiv.org/abs/1806.03185 | https://github.com/f90/Wave-U-Net
"Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time domain, which allows modelling phase information and avoids fixed spectral transformations. Due to the high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high-quality separation results because of long-range temporal correlations. In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts. Experiments for singing voice separation indicate that our architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data."
75 tracks from the training partition of the MUSDB multi-track database are randomly assigned to the training set. For singing voice separation, the whole CCMixter database is also added to the training set. No further data preprocessing is performed, only a conversion to mono (except for stereo models) and downsampling to 22050 Hz. For future work, we could investigate to which extent our model performs a spectral analysis, and how to incorporate computations similar to those in a multi-scale filterbank, or to explicitly compute a decomposition of the input signal into a hierarchical set of basis signals and weightings on which to perform the separation, similar to TasNet [12]. Furthermore, better loss functions for raw audio prediction should be investigated, such as the ones provided by generative adversarial networks [3, 21], since the MSE might not reflect the perceived loss of quality well.
• 79. Source Separation "signal decomposition" #2. TasNet: Surpassing Ideal Time-Frequency Masking for Speech Separation. Yi Luo, Nima Mesgarani (Submitted on 21 Sep 2018) https://arxiv.org/abs/1809.07454 "TasNet uses a convolutional encoder to create a representation of the signal that is optimized for extracting individual speakers. Speaker extraction is achieved by applying a weighting function (mask) to the encoder output. The modified encoder representation is then inverted to the sound waveform using a linear decoder. A linear deconvolution layer serves as a decoder by inverting the encoder output back to the sound waveform. This encoder-decoder framework is similar to the ICA method when a nonnegative mixing matrix is used [Wang et al. 2009] and to the semi-nonnegative matrix factorization method (semi-NMF) [Ding et al. 2008], where the basis signals are the parameters of the decoder. The masks are found using a temporal convolutional network (TCN) consisting of dilated convolutions, which allow the network to model the long-term dependencies of the speech signal. This end-to-end speech separation algorithm significantly outperforms previous time-frequency methods in terms of separating speakers in mixed audio, even when compared to the separation accuracy achieved with the ideal time-frequency mask of the speakers. In addition, TasNet has a smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications."
  • 80. SourceSeparation ”signaldecomposition”#3 DisentanglingCorrelatedSpeakerandNoisefor SpeechSynthesis viaDataAugmentationand AdversarialFactorization Wei-NingHsu, Yu Zhang, Ron J. Weiss, Yu-An Chung, Yuxuan Wang, YonghuiWu, JamesGlass. 32nd ConferenceonNeural InformationProcessing Systems (NIPS 2018), Montréal, Canada. https://openreview.net/pdf?id=Bkg9ZeBB37 “To leverage crowd-sourced data to train multi-speaker text- to-speech (TTS) models that can synthesize clean speech for all speakers, it is essential to learn disentangled representations which can independently control the speaker identity and background noise in generated signals. However, learning such representations can be challenging, duetothe lackoflabelsdescribingtherecordingconditionsof each training example, and the fact that speakers and recording conditions are often correlated, e.g. since users oftenmakemanyrecordingsusingthesameequipment. This paper proposes three components to address this problem by: (1) formulating a conditional generative model with factorized latent variables, (2) using data augmentation to add noise that is not correlated with speaker identity and whose label is known during training, and (3) using adversarial factorization to improve disentanglement. Experimental results demonstrate that the proposed method can disentangle speaker and noise attributes even if they are correlated in the training data, and can be used to consistentlysynthesizecleanspeechforallspeakers.”
  • 81. Decompose HighandLow frequencies Drop anOctave:ReducingSpatialRedundancy in Convolutional Neural Networks withOctave Convolution YunpengChen, HaoqiFang, BingXu, ZhichengYan, YannisKalantidis, MarcusRohrbach, ShuichengYan, JiashiFeng (Submitted on 10 Apr 2019) https://export.arxiv.org/abs/1904.05049 In this work, we propose to factorize the mixed feature maps by their frequencies and design a novel Octave Convolution (OctConv) operation to store and process feature maps that vary spatially "slower" at a lower spatial resolution reducing both memory and computation cost. Unlike existing multi-scale meth-ods, OctConv is formulated as a single, generic, plug-and-play convolutional unit that can be used as a direct replacement of (vanilla) convolutions without any adjustments in the network architecture. It is also orthogonal and complementary to methods that suggest better topologies or reduce channel-wise redundancy like group or depth-wise convolutions. We experimentally show that by simply replacing con-volutions with OctConv, we can consistently boost accuracy for both image and video recognition tasks, while reducing memoryandcomputationalcost.
  • 82. Decompose Signalandthe Noise Deeplearningofdynamicsandsignal-noise decompositionwithtime-steppingconstraints Samuel H. Rudy, J. Nathan Kutz, Steven L. Brunton Department of Applied Mathematics/ Mechanical Engineering, Universityof Washington, Seattle, last revised 22 Aug2018 https://arxiv.org/abs/1808.02578 https://github.com/snagcliffs/RKNN “We propose a novel paradigm for data-driven modeling that simultaneously learns the dynamics and estimates the measurement noise at each observation. By constraining our learning algorithm, our method explicitly accounts for measurement error in the map between observations, treating both the measurement error and the dynamics as unknowns to be identified,ratherthan assumingidealizednoiselesstrajectories. We also discuss issues with the generalizability of neural network models for dynamicalsystemsand provide open-source code for allexamples.” The combination of neural networks and numerical time-stepping schemes suggests a number of high-priority research directions in system identification and data-driven forecasting. Future extensions of this work include considering systems with process noise, a more rigorous analysis of the specific method for interpolating f, including time delay coordinates to accommodate latent variables, and generalizing the method to identify partial differential equations. Rapid advances in hardware and the ease of writing software for deep learning will enable these innovations through fast turnover in developing and testing methods.
• 84. Super-resolution: Insights from audio. Time-frequency networks for audio super-resolution. Teck Yian Lim et al. (2018) http://isle.illinois.edu/sst/pubs/2018/lim18icassp.pdf http://tlim11.web.engr.illinois.edu/ "Audio super-resolution (a.k.a. bandwidth extension) is the challenging task of increasing the temporal resolution of audio signals. Recent deep network approaches achieved promising results by modeling the task as a regression problem in either the time or frequency domain. In this paper, we introduced the Time-Frequency Network (TFNet), a deep network that utilizes supervision in both the time and frequency domain. We proposed a novel model architecture which allows the two domains to be jointly optimized." Spectrogram corresponding to the LR input (frequencies above 4 kHz missing), the HR reconstruction, and the HR ground truth. Our approach successfully recovers the high frequency components from the LR audio signal.
• 85. GANs Also for time-series denoising #1a. Denoising Time Series Data Using Asymmetric Generative Adversarial Networks. Sunil Gandhi; Tim Oates; Tinoosh Mohsenin and David Hairston (2018) https://doi.org/10.1007/978-3-319-93040-4_23 "In this paper, we explicitly learn to remove noise from time series data without assuming a prior distribution of noise. We propose an online, fully automated, end-to-end system for denoising time series data. Our model for denoising time series is trained using unpaired training corpora and does not need information about the source of the noise or how it is manifested in the time series. We propose a new architecture called AsymmetricGAN that uses a generative adversarial network for denoising time series data." Consider, for example, a widely used method for time series featurization called Symbolic Aggregate approXimation (SAX) that assumes time series are generated from a single normal distribution. As shown in [...], this assumption does not hold in several real-life time series datasets. Other techniques assume noise comes from a Gaussian distribution and estimate the parameters of that distribution. This assumption does not hold for data sources like electroencephalography (EEG), where noise can have diverse characteristics and originate from different sources. Hence, in this work, we focus on learning the characteristics of noise in EEG data and removing it as a preprocessing step. ICA has high computational complexity and large memory requirements, making it unsuitable for real-time applications. For training of our network, we only need a set of clean signals and a set of noisy signals. We do not need paired training data, i.e., we do not need clean versions of the noisy data. This is particularly useful for applications like artifact removal in EEG data, as we cannot record clean versions of noisy EEG.
• 86. GANs Also for time-series denoising #1b. Denoising Time Series Data Using Asymmetric Generative Adversarial Networks. Sunil Gandhi; Tim Oates; Tinoosh Mohsenin and David Hairston (2018) https://doi.org/10.1007/978-3-319-93040-4_23 Pre-processing: The DC component in EEG data is different for each recording. We normalize every window of clean and noisy data to remove the DC offset from the data. We remove the DC offset by subtracting the median of the data in the window. Evaluation of EEG data is challenging as the ground truth noiseless signals are not known. Multiple approaches to evaluation have been proposed in recent years; however, authors do not agree on a single mechanism for evaluating artifact removal.
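A sketch of the per-window DC-offset removal described above, assuming a fixed window length (the paper's exact window size is not given on the slide):

```python
# Per-window DC-offset removal for EEG: subtract the median of each window.
import numpy as np

def remove_dc_offset(eeg, window_len=512):
    """eeg: 1D array (one EEG channel). Returns a median-centred copy."""
    eeg = np.asarray(eeg, dtype=float)
    out = eeg.copy()
    for start in range(0, len(eeg), window_len):
        win = slice(start, start + window_len)
        out[win] = eeg[win] - np.median(eeg[win])   # remove the per-window DC component
    return out
```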
  • 87. GANs Alsoforspeechdenoising Segan:Speechenhancementgenerative adversarialnetwork. SantiagoPascual, AntonioBonafonte, and Joan Serra (2017) https://arxiv.org/abs/1703.09452 https://github.com/santi-pdp/segan “For the purpose of speech enhancement and denoising, the SEGAN was developed, employing a neural network with an encoder and decoder pathway that successively halves and doubles the resolution of feature maps in each layer, respectively, and features skip connections betweenencoderanddecoderlayersa. The model works as an encoder-decoder fully- convolutional structure, which makes it fast to operate for denoising waveform chunks. The results show that, not only the method is viable, but it can also represent an effective alternative to current approaches. Possible future work involves the exploration of better convolutional structures and the inclusion of perceptual weightings in the adversarial training, so that we reduce possible high frequency artifacts that might be introduced by the current model. Further experiments need to be done to compare SEGANwithothercompetitiveapproaches.” Thedatasetisaselectionof30speakers fromtheVoiceBankcorpus
  • 88. GANs Alsoformultichannelaudiodenoising Multi-ViewNetworks forDenoisingofArbitrary NumbersofChannels Jonah Casebeer, Brian Luc and ParisSmaragdis (July2018) https://arxiv.org/abs/1806.05296 “We propose a set of denoising neural networks capable of operating on an arbitrary number of channels at runtime, irrespective of how many channels they were trained on. We coin the proposed models multi-view networks sincetheyoperateusingmultipleviewsofthe samedata. We explore two such architectures and show how they outperform traditional denoising models in multi-channel scenarios. Additionally, we demonstrate how multi- view networks can leverage information provided by additional recordings to make better predictions, and how they are able to generalize to a number of recordings not seen in training.”
  • 89. GANs forgenerativemodelsoftimeseries Ontheevaluationofgenerativemodels inmusic Li-ChiaYang, Alexander Lerch (October 2018) https://doi.org/10.1007/s00521-018-3849-7 https://github.com/RichardYang40148/mgeval Therefore, we propose a set of simple musically informed objective metrics enabling an objective and reproducible way of evaluating and comparing the output of music generative systems. We demonstrate the usefulness of the proposed metrics with several experiments on real-world data. We have released the evaluation framework as an open-source toolbox which implements the demonstrated evaluation and analysis methods along with visualization tools. Our future work will include the extension of the current toolbox with additional dimensions (e.g., dynamics) and to expand it toward polyphonic music.
• 91. Classification, non-DL algorithms: COTE. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, Eamonn Keogh (May 2017) https://doi.org/10.1007/s10618-016-0483-9 https://bitbucket.org/TonyBagnall/time-series-classification "We have implemented 18 recently proposed algorithms in a common Java framework (Weka) and compared them against two standard benchmark classifiers (and each other) by performing 100 resampling experiments on each of the 85 datasets. We use these results to test several hypotheses relating to whether the algorithms are significantly more accurate than the benchmarks and each other. Our results indicate that only nine of these algorithms are significantly more accurate than both benchmarks and that one classifier, the collective of transformation ensembles, is significantly more accurate than all of the others." Summary of the time and space complexity of the 18 TSC algorithms considered. However, our conclusion is that using COTE (Bagnall et al. 2015; cited by 91) will probably give you the most accurate model. If a simpler approach is needed and the discriminatory features are likely to be embedded in subseries, then we would recommend using TSF or ST if the features are in the time domain (depending on whether they are phase dependent or not), or BOSS if they are in the frequency domain. If a whole series elastic measure seems appropriate, then using EE is likely to lead to better predictions than using just DTW.
• 92. Time series Intro of DNN use #1A. Deep learning for time series classification: a review. Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, Pierre-Alain Muller (Submitted on 12 Sep 2018) https://arxiv.org/abs/1809.04356 | https://github.com/hfawaz/dl-4-tsc In this article, we study the current state-of-the-art performance of deep learning algorithms for Time Series Classification (TSC) by presenting an empirical study of the most recent DNN architectures for TSC. We give an overview of the most successful deep learning applications in various time series domains under a unified taxonomy of DNNs for TSC. We also provide an open source deep learning framework to the TSC community where we implemented each of the compared approaches and evaluated them on a univariate TSC benchmark (the UCR archive) and 12 multivariate time series datasets. By training 8,730 deep learning models on 97 time series datasets, we propose the most exhaustive study of DNNs for TSC to date. COTE is currently considered the state of the art for time series classification (Bagnall et al., 2017) when evaluated over the 85 datasets from the UCR archive (Chen et al., 2015b). Finally, adding to the huge runtime of COTE, the decision taken by 35 classifiers cannot be interpreted easily by domain experts, since researchers already struggle with understanding the decisions taken by an individual classifier.
● What is the current state-of-the-art DNN for TSC?
● Is there a current DNN approach that reaches state-of-the-art performance for TSC and is less complex than COTE?
● What type of DNN architecture works best for the TSC task?
● And finally: could the black-box effect of DNNs be avoided to provide interpretability?
Given that the latter questions have not been addressed by the TSC community, it is surprising how much recent papers have neglected the possibility that TSC problems could be solved using a pure feature learning algorithm.
• 93. Time series Intro of DNN use #1B. The result of applying a learned discriminative convolution on the GunPoint dataset. Deep learning for time series classification: a review https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
• 94. Time series Intro of DNN use #1C. Given the aforementioned limitations for generative models, we decided to limit our experimental evaluation to discriminative deep learning models for TSC. Second, since we cannot cover an empirical study of all approaches validated in all TSC domains, we decided to only include approaches that were validated on the whole (or a subset of) the univariate time series UCR archive and/or on the MTS archive (Baydogan, 2015). Finally, we chose to work with approaches that do not try to solve a sub-task of the TSC problem, such as in Geng and Luo (2018), where CNNs were modified to solve the task of classifying imbalanced time series datasets. Another sub-task that has been at the center of recent studies is early time series classification (Wang et al., 2016a), where deep CNNs were modified to include an early classification of time series. More recently, a deep reinforcement learning approach was also proposed for the early TSC task (Martinez et al., 2018). For further details, we refer the interested reader to a recent survey on deep learning for early time series classification (Santos and Kern, 2017). The third and final proposed architecture in Wang et al. (2017) is a relatively deep Residual Network (ResNet). For TSC, this is the deepest architecture, with 11 layers of which the first 9 are convolutional, followed by a GAP layer that averages the time series across the time dimension. Deep learning for time series classification: a review https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
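As a concrete reference point, a compact Keras sketch of such a residual 1D architecture ending in a GAP layer is shown below; the number of blocks and filter counts are illustrative, not the exact configuration of Wang et al. (2017):

```python
# Sketch of a ResNet-style TSC network: stacked 1D conv blocks with residual
# shortcuts, global average pooling over time, then a softmax classifier.
import tensorflow as tf
from tensorflow.keras import layers, Model

def residual_block_1d(x, filters):
    shortcut = layers.Conv1D(filters, 1, padding="same")(x)   # match channel count
    for k in (8, 5, 3):                                        # three conv layers per block
        x = layers.Conv1D(filters, k, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    return layers.Activation("relu")(layers.Add()([shortcut, x]))

def resnet_tsc(n_timesteps, n_channels, n_classes):
    inp = layers.Input(shape=(n_timesteps, n_channels))
    x = inp
    for filters in (64, 128, 128):                             # three residual blocks = 9 conv layers
        x = residual_block_1d(x, filters)
    x = layers.GlobalAveragePooling1D()(x)                     # average across the time dimension
    out = layers.Dense(n_classes, activation="softmax")(x)
    return Model(inp, out)

model = resnet_tsc(n_timesteps=128, n_channels=1, n_classes=2)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```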
• 95. Time series Intro of DNN use #1D. Deep learning for time series classification: a review https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc Given the huge number of models [8,730 experiments for the 85 univariate TSC datasets] that needed to be trained, we ran our experiments on a cluster of 60 GPUs. These GPUs were a mix of Nvidia graphics cards: GTX 1080 Ti, Tesla K20, K40 and K80. The total sequential running time was approximately 100 days, that is, if the computation had been done on a single GPU. However, by leveraging the cluster of 60 GPUs, we managed to obtain the results in less than one month. We implemented our framework using the open source deep learning library Keras with the TensorFlow back-end. Figure 1 shows the critical difference diagram (Demšar, 2006, cited by 6414), where a thick horizontal line shows a group of classifiers (a clique) that are not significantly different in terms of accuracy. → An extension of "Statistical Comparisons of Classifiers over Multiple Data Sets" for all pairwise comparisons.
• 96. Time series Intro of DNN use #1E: ResNet the Top Dog. Deep learning for time series classification: a review https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc Figure 1 shows the critical difference diagram (Demšar, 2006), where a thick horizontal line shows a group of classifiers (a clique) that are not significantly different in terms of accuracy; the experimental setup (8,730 models trained on a cluster of 60 GPUs, Keras with the TensorFlow back-end) is the same as on the previous slide.
• 97. Time series Intro of DNN use #1F: ResNets vs. Traditional. Deep learning for time series classification: a review https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc We give two potential reasons for the high generalization capabilities of deep CNNs on TSC tasks. First, having seen the success of convolutions in classification tasks that require learning features that are spatially invariant in a two-dimensional space (such as width and height in images), it is only natural to think that discovering patterns in a one-dimensional space (time) should be an easier task for CNNs, thus requiring less data to learn from. The other, more direct reason behind the high accuracies of deep CNNs on time series data is their success on other sequential data tasks such as speech recognition and sentence classification, where text and audio, similarly to time series data, exhibit a natural temporal ordering. We compared ResNet (the most accurate DNN of our study) with the current state-of-the-art classifiers evaluated on the UCR archive in the great time series classification bake off (Bagnall et al. (2017)). Note that our empirical study strongly suggests using ResNet instead of any other deep learning algorithm. Out of the 18 classifiers evaluated by Bagnall et al. (2017), we have chosen the four best performing algorithms: (1) Elastic Ensemble (EE) proposed by Lines and Bagnall (2015) is an ensemble of nearest neighbor classifiers with 11 different time series similarity measures; (2) Bag-of-SFA-Symbols (BOSS) published in Schäfer (2015) forms a discriminative bag of words by discretizing the time series using a Discrete Fourier Transform and then building a nearest neighbor classifier with a bespoke distance measure; (3) Shapelet Transform (ST) developed by Hills et al. (2014) extracts discriminative subsequences (shapelets) and builds a new representation of the time series that is fed to an ensemble of 8 classifiers; (4) Collective of Transformation-based Ensembles (COTE) proposed by Bagnall et al. (2017) is basically a weighted ensemble of 35 TSC algorithms including EE and ST. Finally, we added a recent approach named Proximity Forest (PF), which is similar to Random Forest but replaces the attribute-based splitting criteria by a random similarity measure chosen out of EE's elastic distances (Lucas et al., 2018). Although COTE is still the most accurate classifier (when evaluated on the UEA archive), its use in a real data mining application is limited due to its huge training time complexity, which is O(N²·T⁴).
• 98. Time series Intro of DNN use #1G. Deep learning for time series classification: a review https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc Again, we can clearly see the dominance of ResNet as the best performing approach across different domains. One exception is the electrocardiography (ECG) datasets (7 in total), where ResNet was drastically beaten by the FCN model in 71.4% of the ECG datasets. THEMES: One might expect that the relatively short filters (3) might affect the performance of ResNet and FCN, since longer patterns cannot be captured by short filters. However, since increasing the number of convolutional layers will increase the path length viewed (receptive field) by the CNN model (Vaswani et al., 2017), ResNet and FCN managed to outperform other approaches whose filter length is longer (21), such as Encoder. SIGNAL LENGTH: Wang et al. (2017) later introduced a one-dimensional CAM with an application to TSC. This method explains the classification of a certain deep learning model by highlighting the subsequences that contributed the most to a certain classification. An interesting observation would be to compare the discriminative regions identified by a deep learning model with the most discriminative shapelets extracted by shapelet-based approaches. This observation would also be backed up by the mathematical proof provided by Cui et al. (2016), which showed how the learned filters in a CNN can be considered a generic form of shapelets extracted by the learning shapelets algorithm (Grabocka et al., 2014).
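A one-dimensional CAM of the kind described can be sketched as follows for a GAP + softmax network such as the one above; the layer name argument is a placeholder, and the feature-map length is assumed to match the input length (no pooling before GAP):

```python
# Sketch of a 1D class activation map (CAM): the class-specific weights of the
# final dense layer re-weight the last convolutional feature maps over time,
# highlighting the subsequences that contributed most to the classification.
import numpy as np
from tensorflow.keras import Model

def cam_1d(model, x, class_idx, last_conv_layer_name):
    """x: array of shape (1, n_timesteps, n_channels). Returns a (n_timesteps,) map."""
    conv_layer = model.get_layer(last_conv_layer_name)
    feature_extractor = Model(model.input, conv_layer.output)
    feats = feature_extractor.predict(x)[0]              # (time, n_filters)
    w = model.layers[-1].get_weights()[0][:, class_idx]  # dense weights for this class
    cam = feats @ w                                       # weighted sum over filters
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam                                            # high values = discriminative regions
```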
• 99. Time series Intro of DNN use #1H: Future. Deep learning for time series classification: a review https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc Although we have conducted an extensive experimental evaluation, deep learning for Time Series Classification, unlike for computer vision and NLP tasks, still lacks a thorough study of data augmentation (Ismail Fawaz et al., 2018a; Forestier et al., 2017) and transfer learning. Furthermore, we think that the effect of z-normalization (and other normalization methods) on the learning capabilities of DNNs should also be thoroughly explored. What makes ImageNet good for transfer learning? Minyoung Huh, Pulkit Agrawal, Alexei A. Efros https://arxiv.org/abs/1608.08614 "Our results might indicate that researchers have been overestimating the amount of data required for learning good general CNN features. If that is the case, it might suggest that CNN training is not as data-hungry as previously thought. It would also suggest that beating ImageNet-trained features with models trained on a much bigger data corpus will be much harder than once thought." AutoAugment: Learning Augmentation Policies from Data. Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, Quoc V. Le (9 Oct 2018) https://arxiv.org/abs/1805.09501 https://github.com/tensorflow/models/tree/master/research/autoaugment "We describe a simple procedure called AutoAugment to search for improved data augmentation policies." Albumentations: fast and flexible image augmentations. Alexander Buslaev, Alex Parinov, Eugene Khvedchenya, Vladimir I. Iglovikov, Alexandr A. Kalinin (18 Sep 2018) https://arxiv.org/abs/1809.06839 https://github.com/albu/albumentations "We present Albumentations, a fast and flexible library for image augmentations with many various image transform operations available, that is also an easy-to-use wrapper (based on the highly-optimized OpenCV library) around other augmentation libraries." Combining raw and normalized data in multivariate time series classification with dynamic time warping. Łuczak, Maciej (2018) http://doi.org/10.3233/JIFS-171393
  • 100. Time series IntroofDNNuse#1H2:TransferLearning Transferlearningfortimeseriesclassification Hassan Ismail Fawaz, GermainForestier, Jonathan Weber, Lhassane Idoumgharand Pierre-Alain Muller https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc Whenobserving theheatmapinFig.4,onecaneasilysee that fine-tuning a pre-trained model almost never hurtstheperformanceoftheCNN. In our future work, we aim again to reduce the deep neural network’s overfitting phenomena by generating synthetic data using a Weighted DTW Barycenter Averaging method [Forestier etal.2017] , since the latter distance gave encouraging results in guiding a complex deep learning tool such as transfer learning. Finally, with big data repositories becoming more frequent, leveraging existing source datasets that are similar to, but not exactly the same as a target dataset of interest, makes a transfer learning method anenticing approach.
• 101. Time series Intro of DNNs #2: Why ResNets work? Wang et al., 2017 https://doi.org/10.1109/IJCNN.2017.7966039 Why and When Can Deep -- but Not Shallow -- Networks Avoid the Curse of Dimensionality: a Review. Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, Qianli Liao - https://arxiv.org/abs/1611.00740 The paper characterizes classes of functions for which deep learning can be exponentially better than shallow learning. http://www.telesens.co/2019/01/16/neural-network-loss-visualization/ http://www.telesens.co/loss-landscape-viz/viewer.html Visualizing the Loss Landscape of Neural Nets http://papers.nips.cc/paper/7875-visualizing-the-loss-landscape-of-neural-nets H. Li (2017)
• 102. Time series Intro of DNN use #2A. CNN Approaches for Time Series Classification. Lamyaa Sadouk https://cdn.intechopen.com/pdfs/64216.pdf (2018) Instead of employing the FFT, which is restricted to a predefined fixed window length, we choose to adopt the Stockwell transform (ST) as our preprocessing method for CNN training. The advantage of the ST over the FFT is its ability to adaptively capture spectral changes over time without windowing of data, resulting in a better time-frequency resolution for non-stationary signals [Stockwell 1996]. While works [17, 24] transformed the time series signals (by applying down-sampling, slicing, or warping) so as to help the convolutional filters (especially the 1st convolutional layer filters) capture entire peaks (i.e., whole peaks) and fluctuations within the signals, the work of [18] proposed to keep time series data unchanged and rather feed them into three branches, each having a different 1st convolutional filter size, in order to capture the whole fluctuations within signals. An alternative is to find an adaptive 1st convolutional layer filter which has the most optimal size and is able to capture most of the entire peaks present in the input signals. The question of how to compute this adaptive 1st convolutional layer filter is addressed in [4]. Therefore, the most optimal size of the 1st convolutional filter is equal to the sample median of signal peak lengths, suggesting that 0.1 is the best time span of the 1st convolutional layer to retrieve the whole acceleration peaks and the best acceleration changes. Similarly, in the frequency domain, the 1st convolutional layer kernel yielding the highest F1-score is the one with size 10, which is simply the sample median (Me(x) = 10). (a) and (b): Histograms and boxplots of the frequency distribution of 30 peak lengths present within 30 randomly selected time- and frequency-domain signals, respectively.
• 103. Time series Intro of DNN use #2B. CNN Approaches for Time Series Classification. Lamyaa Sadouk https://cdn.intechopen.com/pdfs/64216.pdf (2018) Some fields, such as medicine, experience a lack of annotated data, as manually annotating a large set requires human expertise and is time consuming. The conventional approach to deal with this kind of problem is to perform data augmentation by applying transformations to the existing data. Data augmentation achieves slightly better time series classification rates, but the CNN is still prone to overfitting. In this section, we present another solution to this problem, a "knowledge transfer" framework which is a global, fast and light-weight framework that combines the transfer learning technique with an SVM classifier. Transfer learning is a machine learning technique where a model trained on one task (a source domain) is re-purposed on a second related task (a target domain). Accordingly, the questions that arise are: (i) which source learning task should be used for pre-training the CNN model given a target learning task, and (ii) which parts (e.g., learned features) of this model are common between the source and target learning tasks. In that sense, we propose a "Transfer learning with SVM read-out" framework which is composed of two parts: (i) the first part having the first and intermediate layers' weights of a CNN already pre-trained on a source learning task (the last CNN layer being discarded), and (ii) the second part composed of a support vector machine (SVM) classifier with RBF kernel which is connected to the end of the first part. Then, we feed the entire training dataset of the target task into this framework in order to train the SVM parameters. As opposed to training a CNN on the target task, which requires updating all hidden layers' weights for several iterations using a large training set for all these weights to converge, our framework computes weights of the last layer(s) only, in one iteration only. Moreover, the advantage of using an SVM as the classifier is that it is fast and generally performs well on small training sets, since it only relies on the support vectors, which are the training samples that lie exactly on the hyperplanes used to define the margin. In addition, SVMs have the powerful RBF kernel, which allows mapping the data to a very high dimensional space in which the data can be separable by a hyperplane, hence guaranteeing convergence. Hence, our framework can be regarded as a global, fast and light-weight technique for time series classification where the target task has limited annotated/labeled data.
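A hedged sketch of such an SVM read-out on top of a frozen, pre-trained CNN (Keras feature extractor plus scikit-learn SVC with an RBF kernel); the choice of which layer to cut at and the SVM hyperparameters are assumptions:

```python
# Sketch of the "transfer learning with SVM read-out" idea: a CNN pre-trained
# on a source TSC task is truncated before its classification layer and used as
# a fixed feature extractor; an RBF-kernel SVM is trained on the target task.
import numpy as np
from tensorflow.keras import Model
from sklearn.svm import SVC

def svm_readout(pretrained_cnn, x_target_train, y_target_train):
    # Drop the source-task classification layer, keep the learned feature layers.
    feature_extractor = Model(pretrained_cnn.input, pretrained_cnn.layers[-2].output)
    features = feature_extractor.predict(x_target_train)      # (n_samples, n_features)
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")              # RBF-kernel read-out
    clf.fit(features, y_target_train)                          # no backpropagation needed
    return feature_extractor, clf
```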
• 104. Time series Intro of DNN use #3. 3D convolution recurrent neural networks for bird sound detection. Himawan, Ivan, Towsey, Michael, & Roe, Paul (2018) https://eprints.qut.edu.au/122760/ https://github.com/himaivan/BAD2 We propose 3D convolutions for extracting long-term and short-term information in frequency simultaneously. In order to leverage the powerful and compact features of 3D convolution, we employ separate recurrent neural networks (RNN), acting on each filter of the last convolutional layers rather than stacking the feature maps as in the typical combined convolution and recurrent architectures. We split each 10-second audio clip into 5 × 2-second clips. The 2-second length is based on empirical analysis. A spectrogram (from a 2-second clip) computed from sequences of Short-Time Fourier Transform (STFT) of overlapping windowed signals is used as the sound representation. The 3D convolution highlights only frequency bands where the bird calls are located across the temporal dimension. As a comparison, the 2D convolution in CNN+RNN highlights a few specific locations of the bird calls, and includes low-frequency regions with no bird calls. This shows that 3D convolution is more capable of extracting long-term time information in bird calls. In future work, we will investigate the method of generating labeled data via a pseudo-labeling method where approximate labels are produced from unlabeled data. This can be achieved, for example, using generative adversarial networks. Domain adaptation using adversarial learning is another alternative to build a model that is discriminative and invariant to domain at the same time.
  • 105. EarlyTimeSeriesClassification ALiteratureSurveyofEarlyTimeSeriesClassificationandDeepLearning TiagoSantosandRomanKern(2017) http://ceur-ws.org/Vol-1793/paper4.pdf Early time series classification aims to classify a time series with as few temporal observations as possible, while keeping the loss of classification accuracy at a minimum. One of the first works on the topic of early classification, as defined over time series length, waswrittenby[31]. Prominent early classification frameworks reviewed by this paper include, but are not limited to, ECTS, RelClass and ECDIRE. These works have shown that early time series classification may be feasible and performant, but they also show room for improvement. ECDIREhttps://doi.org/10.1007/s10618-016-0462-1 RelClass https://dl.acm.org/citation.cfm?id=2627671
  • 106. EarlyTSC with deepreinforcementlearning Adeepreinforcementlearningapproachforearly classificationoftimeseries Martinez Coralie, Guillaume Perrin, E Ramasso, Michèle Rombaut https://hal.archives-ouvertes.fr/hal-01825472/ We formulate the early classification problem in a reinforcement learning framework: we introduce a suitable set of states and actions but we also define a specific reward function which aims at finding a compromise between earliness andclassificationaccuracy. While most of the existing solutions do not explicitly take time into account in the final decision, this solution allows the user to set this trade-off in a more flexible way. In particular, we show experimentally on datasets from the UCR time series archive that this agent is able to continually adapt its behavior without human intervention and progressively learn to compromise between accurate and fast predictions. Evolution of the early classifier agent behaviour on Gun-Point dataset. The scatter plot shows the relationship between accuracy(in percentage)and averagetime ofprediction oftheagent over training. We evaluate the agent on the whole training set every 5,000 iterations. Each evaluation corresponds to one dot. Dot points are coloured according to iterations of training: blue dots correspond to early training while yellow dots correspond to the agent’s performance after 100,000 iterations of training. We evaluate the agent’s policy surrounded by the red star on the testing set and we report its performance in table I. In this experiment, the agent learned to slow its predictions down and improved itsaccuracyover training. As future work, we plan to improve the proposed approach with a dynamic adjustment of the reward function parameters over training based on the user trade-off criteria. We will also propose a new management of the agent’s replay memory which could be more suitable forthe problem of early classification.
  • 107. EarlyTSC for clinicaluse:ICUMortalityPrediction DynamicPredictionofICUMortalityRisk UsingDomainAdaptation TiagoAlves, AlbertoLaender, Adriano Veloso, NivioZiviani https://homepages.dcc.ufmg.br/~nivio/papers/alves@bigdata18.pdf Early recognition of risky trajectories during an Intensive Care Unit (ICU) stay is one of the key steps towards improving patient survival. Learning trajectories from physiological signals continuously measured during an ICU stay requires learning time-series features that are robust and discriminative acrossdiversepatientpopulations. Mortalityriskspacefor differentICUdomains.Regionsinredarerisky.Eachaxisisat-SNE non-linearcombinationof:(toprow)physiologicalparameters,or(bottomrow)features extractedbyCNN−LSTM.
• 108. Biosignal Deep Learning. Deep learning for healthcare applications based on physiological signals: A review. SG authors: "We have cast the net into the ocean of knowledge t..." Oliver Faust, Yuki Hagiwara, Tan Jen Hong, Oh Shu Lih, U Rajendra Acharya https://doi.org/10.1016/j.cmpb.2018.04.005 Once the architecture is chosen, the tuning parameters must be adjusted. Both the structure selection and parameter adjustment will basically influence the model. Hence, it is necessary to have many test runs. Shortening the training phase of deep learning models is an active area of research [159]. The challenge is speeding up the training process in a parallel distributed processing system [160]. The network between the individual processors becomes the bottleneck [161]. Graphics Processing Units (GPUs) can be used to reduce the network latency [162].
• 109. ECG Classification #1. A convolutional neural network for ECG annotation as the basis for classification of cardiac rhythms. Philipp Sodmann et al 2018 Physiol. Meas. in press https://doi.org/10.1088/1361-6579/aae304 https://github.com/MarcusVollmer/PhysioNet 222,202 R peaks, 192,200 P waves, 256,966 T waves, and 3,311,487 interbeat segments were extracted from the QT database. In total, approximately 12,000,000 characteristic waveforms were used as input volume. The assigned annotation codes of the midpoint peak of each segment were used as output volume. A major advantage of decision trees is that they directly provide information on feature importance.
• 110. ECG Classification #2. Detecting and interpreting myocardial infarctions using fully convolutional neural networks. Nils Strodthoff, Claas Strodthoff (Submitted on 18 Jun 2018) https://arxiv.org/abs/1806.07385 We consider the detection of myocardial infarction in electrocardiography (ECG) data as provided by the PTB ECG database without non-trivial preprocessing. The classification is carried out using deep neural networks in a comparative study involving convolutional as well as recurrent neural network architectures. The best architecture, an ensemble of fully convolutional architectures, beats state-of-the-art results on this dataset and reaches 93.3% sensitivity and 89.7% specificity evaluated with 10-fold cross-validation, which is the performance level of human cardiologists for this task. We investigate questions relevant for clinical applications, such as the dependence of the classification results on the considered data channels and the considered subdiagnoses. Finally, we apply attribution methods to gain an understanding of the network's decision criteria on an exemplary basis. Time series classification in a realistic setting has to be able to cope with time series that are so large that they cannot be used as input to a single neural network, or that cannot be downsampled to reach this state without losing too much information. At this point two different procedures are conceivable: either one uses attentional models that allow focusing on regions of interest, see e.g. Karim et al. 2018, or one extracts random subsequences from the original time series. For reasons of simplicity, and with real-time on-site analysis in mind, we explore only the latter possibility, which is only applicable for signals that exhibit a certain degree of periodicity. The assumption underlying this approach is that the characteristics leading to a certain classification are present in every random subsequence. We stress at this point that this procedure does not rely on the identification of beginning and end points of certain patterns in the window. The procedure leaves two hyperparameters: the choice of the window size and an optional downsampling rate to reduce the temporal input dimension for the neural network. Moreover, we present a first exploratory study of the application of interpretability methods in this domain, which is a key requirement for applications in the medical field. These methods can not only help to gain an understanding of, and thereby build trust in, the network's decision process, but could also lead to a data-driven identification of important markers for certain classification decisions in ECG data that might even prove useful for human experts. Here we identified common cardiologists' decision rules in the network's attribution maps and outlined prospects for future studies in this direction. Both such an analysis of attribution maps and further improvements of the classification performance would have to rely on considerably larger databases, such as for quantitative precision. This would also allow extension to further subdiagnoses and other cardiac conditions, such as other confounding and non-exclusive diagnoses or irregular heart rhythms.
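The random-subsequence procedure described above can be sketched as follows; window length, number of crops and downsampling factor are the illustrative hyperparameters mentioned in the text:

```python
# Sketch of random-subsequence extraction from a long ECG record: instead of
# feeding the full record to the network, fixed-size windows are drawn at random
# positions, assuming the relevant characteristics appear in every window.
import numpy as np

def random_subsequences(record, window_len=2048, n_windows=8, downsample=2):
    """record: 1D array (one ECG lead). Returns (n_windows, window_len // downsample)."""
    starts = np.random.randint(0, len(record) - window_len, size=n_windows)
    crops = np.stack([record[s:s + window_len] for s in starts])
    return crops[:, ::downsample]    # optional decimation to shrink the temporal input dimension
```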
• 111. ECG Classification #3. Automatic detection of sleep-disordered breathing events using recurrent neural networks from an electrocardiogram signal. Erdenebayar Urtnasan, Jong-Uk Park, Kyoung-Joung Lee https://doi.org/10.1007/s00521-018-3833-2 In this study, we propose a novel method for automatically detecting sleep-disordered breathing (SDB) events using a recurrent neural network (RNN) to analyze nocturnal electrocardiogram (ECG) recordings. … Single-lead ECG recordings (200 Hz) were measured for an average 7.2-h duration and segmented into 10-s events (2,000 samples). A bandpass filter (5–11 Hz) was applied for data preprocessing to remove undesired noise from the ECG signal. The dataset comprised a training dataset (68,545 events) from 74 patients and a test dataset (17,157 events) from 18 patients. The proposed deep RNN model for automatic detection of SDB events was implemented on the Keras platform using a TensorFlow background (sic!).
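A sketch of the stated preprocessing (5–11 Hz band-pass on 200 Hz single-lead ECG, then segmentation into 10-s events of 2,000 samples); the filter type and order are assumptions, since the slide does not specify them:

```python
# Band-pass filtering and fixed-length segmentation of a single-lead ECG.
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_ecg(ecg, fs=200, low=5.0, high=11.0, seg_seconds=10, order=4):
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, ecg)                 # zero-phase 5-11 Hz band-pass
    seg_len = fs * seg_seconds                     # 2,000 samples per 10-s event
    n_segments = len(filtered) // seg_len
    return filtered[: n_segments * seg_len].reshape(n_segments, seg_len)
```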
• 112. ECG Classification #4. Arrhythmia detection using deep convolutional neural network with long duration ECG signals https://doi.org/10.1016/j.compbiomed.2018.09.009 Department of Cardiology, National Heart Centre Singapore, Singapore; Duke-NUS Medical School, Singapore. The goal of our research was to design a new method based on deep learning (a 1D-CNN is employed) to efficiently and quickly classify cardiac arrhythmias. An approach based on the analysis of 10-s ECG signal fragments (not a single QRS complex) is applied (on average, 13 times fewer classifications/analyses). A complete end-to-end structure was designed instead of the hand-crafted feature extraction and selection used in traditional methods. It can be used in tele-medicine, especially on mobile devices and in cloud computing, due to its low computational complexity.
• 113. ECG Classification #5. Deep learning in the cross-time-frequency domain for sleep staging from a single lead electrocardiogram https://doi.org/10.1088/1361-6579/aaf339 This study classifies sleep stages from a single lead electrocardiogram (ECG) using beat detection, cardiorespiratory coupling in the time-frequency domain and a deep convolutional neural network (CNN). An ECG-derived respiration (EDR) signal and synchronous beat-to-beat heart rate variability (HRV) time series were derived from the ECG using previously described robust algorithms. A measure of cardiorespiratory coupling (CRC) was extracted by calculating the coherence and cross-spectrogram of the EDR and HRV signals in five-minute windows. A support vector machine (SVM) was then used to combine the output of the CNN with the other features derived from the ECG, including phase-rectified signal averaging (PRSA), sample entropy, as well as standard spectral and temporal HRV measures. The ECG signals were preprocessed by a finite impulse response (FIR) lowpass filter with a band stop at 22 Hz and a FIR highpass filter with a corner frequency of 1.2 Hz. A state-of-the-art QRS detector (jqrs) was used for ECG R-peak detection (Johnson et al. (2015)).
• 114. ECG Classification #6. Kalman-based Spectro-Temporal ECG Analysis using Deep Convolutional Networks for Atrial Fibrillation Detection. Zheng Zhao, Simo Särkkä, and Ali Bahrami Rad https://arxiv.org/abs/1812.05555 For ECG signals, one can directly adopt 1D convolutional or recurrent network models for the classification task. However, transforming signals into the spectral domain (spectro-temporal features) is a promising alternative approach, knowing that the current state-of-the-art deep convolutional neural network (CNN) structures are typically designed for 2D images. The contributions of this paper are: 1) We propose two extended models for spectro-temporal estimation using the Kalman filter and smoother. We then combine them with deep convolutional networks for AF detection. 2) We test and compare the performance of the proposed approaches for spectro-temporal estimation on simulated data and AF detection with other popular estimation methods and different classifiers. 3) For AF detection, we evaluate the proposals using the PhysioNet/CinC 2017 dataset, which is considered to be a challenging dataset that resembles practical applications, and our results are in line with the state of the art. The key advantages of this kind of approach over other spectro-temporal methods are that we can apply it to both evenly and unevenly sampled signals [25] and it requires no stationarity guarantees nor windowing. In practice, the computational cost of the Kalman filter and smoother can be extensive when the length of the signal is very long. However, instead of the Fourier series state space model in the previous section, one can also derive an alternative representation using stochastic oscillator differential equations. In this way, the dynamic and measurement models become linear time-invariant (LTI), so that we can leverage a stationary Kalman filter to reduce the time consumption. This kind of stochastic oscillator model was also considered in [33], and the link to periodic Gaussian process models was investigated in [35].
• 115. EEG Classification #1a. Deep learning with convolutional neural networks for EEG decoding and visualization https://doi.org/10.1002/hbm.23730 https://github.com/robintibor/braindecode/ There is increasing interest in using deep ConvNets for end-to-end EEG analysis, but a better understanding of how to design and train ConvNets for end-to-end EEG decoding, and how to visualize the informative EEG features the ConvNets learn, is still needed. Here, we studied deep ConvNets with a range of different architectures, designed for decoding imagined or executed tasks from raw EEG. Our study thus shows how to design and train ConvNets to decode task-related information from the raw EEG without handcrafted features, and highlights the potential of deep ConvNets combined with advanced visualization techniques for EEG-based brain mapping.
• 116. EEG Classification #1b. Deep learning with convolutional neural networks for EEG decoding and visualization https://doi.org/10.1002/hbm.23730 | https://github.com/robintibor/braindecode/ Correlation between the mean squared envelope feature and unit output for a single subject at one electrode position (FCC4h). Left: all correlations. Colors indicate the correlation between unit outputs per convolutional filter (x-axis) and mean squared envelope in different frequency bands (y-axis). Filters are sorted by their correlation to the 7–13 Hz envelope (outlined by the black rectangle). Note the large correlations/anticorrelations in the alpha/beta bands (7–31 Hz) and somewhat weaker correlations/anticorrelations in the gamma band (around 75 Hz). Right: mean absolute values across units of all convolutional filters for all correlation coefficients of the trained model, the untrained model, and the difference between the trained and untrained model. Peaks in the alpha, beta, and gamma bands are clearly visible. CSP = common spatial patterns.
• 117. EEG+ECG Classification. Use of features from RR-time series and EEG signals for automated classification of sleep stages in a deep neural network framework https://doi.org/10.1016/j.bbe.2018.05.005 The method uses an iterative filtering (IF) based multiresolution analysis approach for the decomposition of the RR-time series into intrinsic mode functions (IMFs). The recurrence quantification analysis (RQA) and dispersion entropy (DE) based features are evaluated from the IMFs of the RR-time series. The dispersion entropy and the variance features are evaluated from the different bands of the EEG signal. The RR-time series features and the EEG features coupled with the deep neural network (DNN) are used for sleep stage classification. Stacked autoencoders with binary classifiers? A slightly confusing architecture: engineered features combined with deep learning?
  • 118. EMGClassification EMGPatternRecognitionintheEraofBigDataandDeep Learning BigDataCogn.Comput.2018,2(3),21; https://doi.org/10.3390/bdcc2030021 We provide a review of recent research and development in EMG pattern recognition methods that can be applied to big data analytics. These modern EMG signal analysis methods can be divided into two main categories: (1) methods based on feature engineering involving a promising big data exploration tool called topological data analysis; and (2) methods based on feature learning with a special emphasison “deeplearning”. Compared to other well-known bioelectrical signals (e.g., electrocardiogram, ECG; electrooculogram, EOG; and galvanic skin response, GSR), however, the analysis of surface EMG signal is morechallenginggiventhatitisstochasticinnature. Due to the increasing availability of multi-modality sensing systems, multi-modal analysis approaches are becoming a viable option. Multiple modalities can be used to capture complementary information which is not visible using a single modality, or to provide contextfor others. Even when two or more modalities capture similar information, their combination can still improve the robustness of pattern recognitionsystemswhenoneofthemodalitiesismissingor noisy. Outside of prosthesis control, other applications of EMG pattern recognition for which multi- modality data sets exist include, for example, sleep studies, such as the Cyclic Alternating Pattern (CAP) Sleep Database [49] and the Sleep Heart Health Study (SHHS) Polysomnography Database [50]; biomechanics, such as the cutting movement dataset [51] and the horse gait dataset [52]; and brain computer interfaces, such as the Affective Pacman dataset [53] and the emergency braking assistance dataset [54]. Recently, emotion recognition using multiple physiological modalities has gained attention as another important application that has benefited fromtheincorporationofsurfaceEMG. http://doi.org/10.3390/s17071622
• 119. Time series 2D Recurrence Plots → Shapelets. This paper investigates the performance of Recurrence Plots (RP) [Eckmann et al. 1987] within the deep CNN model for TSC. RP provides a way to visualize the periodic nature of a trajectory through a phase space and enables us to investigate certain aspects of the m-dimensional phase space trajectory through a 2D representation. Because of the recent outstanding results by CNNs on image recognition, we first encode time-series signals as 2D plots, and then treat the TSC problem as a texture recognition task. A CNN model with 2 hidden layers followed by a fully connected layer is used. In particular, comparison with models using RP within the traditional classification framework (e.g. SIFT, Gabor and LBP features with an SVM classifier [25, 26]) and with other CNN-based time-series image classification (e.g. GAF-MTF images with CNN [23, 24]) demonstrates that using RP images with a CNN in our proposed model obtains better results. As future work, CNN architectures with more feature representation layers should be investigated for more difficult TSC tasks (preferably with more data samples available). Large datasets are needed in order to train deeper architectures. Therefore, adopting the proposed pipeline for TSC with small sample sizes can be another interesting future direction. Exploring different ensemble learning methods for CNNs can also be interesting. We will particularly be investigating application of output coding for CNNs.
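A minimal sketch of turning a univariate series into a recurrence-plot image that a 2D CNN can consume; embedding dimension, delay and the recurrence threshold are illustrative choices:

```python
# Sketch of a binary recurrence plot after time-delay embedding of a 1D series.
import numpy as np

def recurrence_plot(x, dim=3, tau=2, eps=0.1):
    """Binary recurrence matrix: 1 where two embedded states are closer than eps * max distance."""
    n = len(x) - (dim - 1) * tau
    # Phase-space trajectory: each row is a delay-embedded state vector.
    states = np.stack([x[i:i + (dim - 1) * tau + 1:tau] for i in range(n)])
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    return (dists <= eps * dists.max()).astype(np.uint8)

rp = recurrence_plot(np.sin(np.linspace(0, 20 * np.pi, 500)))
print(rp.shape)   # (n_states, n_states) image-like matrix, ready for a 2D CNN
```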
• 120. Wavelets for deep learning TSC #1. Multilevel Wavelet Decomposition Network for Interpretable Time Series Analysis https://doi.org/10.1145/3219819.3220060 To this end, we first designed a novel wavelet-based network structure called mWDN for frequency learning of time series, which can then be seamlessly embedded into deep learning frameworks by making all parameters trainable. We further designed two deep learning models based on mWDN for time series classification and forecasting, respectively, and the extensive experiments on abundant real-world datasets demonstrated their superiority to state-of-the-art competitors. As a nice try for interpretable deep learning, we further propose an importance analysis method for identifying important factors for time series analysis, which in turn verifies the interpretability merit of mWDN. Frequency Analysis of Time Series: Frequency analysis of time series data has been deeply studied by the signal processing community. Many classical methods, such as the Discrete Wavelet Transform, the Discrete Fourier Transform, and the Z-Transform, have been proposed to analyze the frequency patterns of time series signals. In existing TSC/TSF applications, however, transforms are usually used as an independent step in data preprocessing, which has no interaction with model training and therefore might not be optimized for TSC/TSF tasks from a global view. In recent years, some research works, such as Clockwork RNN [Koutnik et al. 2014] and SFM [Hao Hu and Guo-Jun Qi 2017], began to introduce the frequency analysis methodology into the deep learning framework. To the best of our knowledge, our study is among the very few works that embed wavelet time series transforms as a part of neural networks so as to achieve end-to-end learning.
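For contrast with mWDN's trainable decomposition, the conventional fixed preprocessing it replaces can be sketched with PyWavelets; the wavelet family and decomposition level are illustrative:

```python
# Fixed multilevel discrete wavelet decomposition as conventional preprocessing;
# mWDN makes the analogous filter weights trainable end to end inside the network.
import numpy as np
import pywt

x = np.random.randn(1024)                         # a univariate time series
coeffs = pywt.wavedec(x, wavelet="db4", level=3)  # [cA3, cD3, cD2, cD1]
approx, details = coeffs[0], coeffs[1:]
print([c.shape for c in coeffs])                  # coarse-to-fine frequency sub-series
```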
• 121. Wavelets for deep learning TSC #2. Learning filter widths of spectral decompositions with wavelets. Haidar Khan and Bülent Yener, Rensselaer Polytechnic Institute http://papers.nips.cc/paper/7711-learning-filter-widths-of-spectral-decompositions-with-wavelets.pdf https://github.com/haidark/WaveletDeconv We propose the wavelet deconvolution (WD) layer as an efficient alternative to this preprocessing step that eliminates a significant number of hyperparameters. The WD layer uses wavelet functions with adjustable scale parameters to learn the spectral decomposition directly from the signal. Furthermore, the WD layer adds interpretability to the learned time series classifier by exploiting the properties of the wavelet transform. As future work, we plan to investigate how to extend the WD layer to signals in higher dimensions, such as images and video, as well as generalizing the wavelet transform to empirical mode decompositions (EMDs).
  • 122. Wavelets fordeep learningTSC #3 MultilevelWaveletDecompositionNetworkforinterpretableTimeSeries Analysis https://doi.org/10.1145/3219819.3220060 In this paper we propose a wavelet-based neural network structure called multilevel Wavelet Decomposition Network (mWDN) for building frequency-aware deep learning models for time series analysis. mWDN preserves the advantage of multilevel discrete wavelet decomposition in frequency learning while enables the fine-tuning of all parameters under a deep neural network framework. Based on mWDN, we further propose two deep learning models called Residual Classification Flow (RCF) and multi-frequency Long Short-Term Memory (mLSTM) for time series classification and forecasting, respectively. The two models take all or partial mWDN decomposed sub-series in different frequencies as input, and resort to the back propagation algorithm to learn all the parameters globally, which enables seamless embeddingof wavelet-basedfrequencyanalysisintodeeplearningframeworks
• 123. Multivariate time-series classification #1: CNN only. Temporal Convolutional Neural Network for the Classification of Satellite Image Time Series. Charlotte Pelletier, Geoffrey I. Webb and François Petitjean (Submitted on 31 Jan 2019) https://arxiv.org/abs/1811.10166 https://github.com/charlotte-pel/temporalCNN (Keras) Note! Despite the name, the authors used traditional convolutional filters for time series, and not TCNs.
• 124. Multivariate time-series classification #2: CNN+LSTM. Multivariate LSTM-FCNs for Time Series Classification. Fazle Karim, Somshubra Majumdar, Houshang Darabi, Samuel Harford (submitted 14 Jan 2018). https://arxiv.org/abs/1801.04503
We propose augmenting the existing univariate time series classification models, LSTM-FCN and ALSTM-FCN, with a squeeze-and-excitation block to further improve performance. The proposed models work efficiently on various complex multivariate time series classification tasks such as activity recognition or action recognition. Furthermore, the proposed models are highly efficient at test time and small enough to deploy on memory-constrained systems. For datasets with class imbalance, a class weighting scheme inspired by King et al. (2001) is used.
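A squeeze-and-excitation block for 1D feature maps is small enough to show in full; the following Keras sketch is a generic version of the usual SE formulation, not the authors' exact code, and the reduction ratio is an assumed default.

```python
from tensorflow import keras
from tensorflow.keras import layers

def squeeze_excite_1d(x, ratio=16):
    """Channel recalibration: squeeze (global pool) -> bottleneck MLP -> sigmoid gate."""
    n_channels = x.shape[-1]
    s = layers.GlobalAveragePooling1D()(x)                       # squeeze over time
    s = layers.Dense(max(n_channels // ratio, 1), activation="relu")(s)
    s = layers.Dense(n_channels, activation="sigmoid")(s)        # excitation weights
    s = layers.Reshape((1, n_channels))(s)
    return layers.Multiply()([x, s])                             # rescale each channel

inp = keras.Input(shape=(128, 64))       # e.g. an FCN feature map: 128 steps, 64 channels
out = squeeze_excite_1d(inp)
print(keras.Model(inp, out).output_shape)   # (None, 128, 64)
```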
• 125. Multivariate time-series classification #3: CNN+GRU. Deep Gated Recurrent and Convolutional Network Hybrid Model for Univariate Time Series Classification. Nelly Elsayed, Anthony S. Maida and Magdy Bayoumi (submitted 27 Dec 2018). https://arxiv.org/abs/1812.07683 https://github.com/NellyElsayed/GRU-FCN-model-for-univariate-time-series-classification
The proposed GRU-FCN classification model shows that replacing the LSTM with a GRU enhances classification accuracy without needing extra enhancements such as fine-tuning or attention mechanisms. The GRU also has a smaller architecture that requires fewer computations than the LSTM. Moreover, the GRU-based model requires a smaller number of trainable parameters, less memory, and less training time compared to the LSTM-based models.
• 126. Application for multivariate time series: wearable sensors. WearableDL: Wearable Internet-of-Things and Deep Learning for Big Data Analytics — Concept, Literature, and Future. Aras R. Dargazany, Paolo Stegagno, and Kunal Mankodiya (submitted 14 November 2018). https://doi.org/10.1155/2018/8125126
This work introduces Wearable deep learning (WearableDL), a unifying conceptual architecture inspired by the human nervous system, offering the convergence of deep learning (DL), Internet-of-Things (IoT), and wearable technologies (WT).
• 127. Application for multivariate time series: action recognition. Sensor Data Acquisition and Multimodal Sensor Fusion for Human Activity Recognition Using Deep Learning. Published 10 April 2019 (Special Issue: Deep Learning Based Sensing Technologies for Autonomous Vehicles). Sensors 2019, 19(7), 1716; https://doi.org/10.3390/s19071716
We develop a Long Short-Term Memory (LSTM) network framework to support training of a deep learning model on human activity data, acquired in both real-world and controlled environments. From the experimental results, we identify that activity data with a sampling rate as low as 10 Hz from four sensors at both wrists, the right ankle, and the waist is sufficient for recognizing Activities of Daily Living (ADLs), including eating and driving. We adopt a two-level ensemble model to combine the class probabilities of multiple sensor modalities, and demonstrate that a classifier-level sensor fusion technique can improve the classification performance. By analyzing the accuracy of each sensor on different types of activity, we elaborate custom weights for multimodal sensor fusion that reflect the characteristics of individual activities.
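Classifier-level fusion of per-sensor class probabilities with activity-specific weights can be expressed in a few lines of NumPy; the sketch below is a simplified illustration of the idea, with made-up sensor names and weights rather than the values derived in the paper.

```python
import numpy as np

def fuse_sensor_probabilities(probs_per_sensor, sensor_weights):
    """Classifier-level fusion: per-class weighted average of each sensor's
    class-probability vectors. probs_per_sensor: dict sensor -> (n_samples, n_classes);
    sensor_weights: dict sensor -> (n_classes,) activity-specific weights."""
    fused, total = None, None
    for sensor, probs in probs_per_sensor.items():
        w = np.asarray(sensor_weights[sensor], dtype=float)
        contribution = probs * w
        fused = contribution if fused is None else fused + contribution
        total = w if total is None else total + w
    fused = fused / total                               # per-class weighted mean
    return fused / fused.sum(axis=1, keepdims=True)     # renormalise to probabilities

# hypothetical wrist/ankle classifiers over 4 activity classes
probs = {"wrist": np.random.dirichlet(np.ones(4), size=3),
         "ankle": np.random.dirichlet(np.ones(4), size=3)}
weights = {"wrist": [0.7, 0.5, 0.9, 0.4], "ankle": [0.3, 0.5, 0.1, 0.6]}
print(fuse_sensor_probabilities(probs, weights).round(2))
```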
• 128. Ensembling models for uni/multivariate time series as well. Deep Neural Network Ensembles for Time Series Classification. Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar and Pierre-Alain Muller. IRIMAS, Université Haute-Alsace, Mulhouse, France. https://arxiv.org/abs/1903.06602
In the future, we would like to consider a meta-learning approach where the outputs of individual deep learning models are fed to a meta-network that learns to map these inputs to the correct prediction (e.g. Ju et al. 2019; 2018).
• 130. Segmenting time series. BEATS: Blocks of Eigenvalues Algorithm for Time Series Segmentation. https://doi.org/10.1109/TKDE.2018.2817229 (2018) https://github.com/auroragonzalez/BEATS (implemented in R)
The massive collection of data via emerging technologies like the Internet of Things (IoT) requires finding optimal ways to reduce the number of observations in the time series analysis domain. In this paper, we propose a segmentation algorithm that adapts to unannounced mutations of the data (i.e., data drifts). The algorithm splits the data streams into blocks, groups them into square matrices, computes the Discrete Cosine Transform (DCT), and quantizes them. The algorithm, called BEATS, is designed to tackle dynamic IoT streams whose distribution changes over time. We run experiments with six datasets combining real-world data, synthetic data, and data with drifts. Compared to other segmentation methods like Symbolic Aggregate approXimation (SAX), BEATS shows significant improvements, and it provides efficient results when combined with classification and clustering algorithms. BEATS is an effective mechanism for working with dynamic and multivariate data, making it suitable for IoT data sources.
By using BEATS, we are able to restructure the streaming data in a 2D way and then transform it into the frequency domain using the DCT. The algorithm finds a smaller sequence that contains the key information of the initial representation. This aggregation provides an opportunity to eliminate repetitive content and similarities found in the data sequence. The eigenvalue vectors are a homogeneous representation of the data streams in BEATS that allow us to go one step further in understanding the sequences and patterns that can be considered the data structure of a data series in an application domain (e.g. smart cities). Its applications can be extended to several other domains and to various pattern/activity monitoring and detection methods. Future work will focus on applying the 3D cosine transform and adaptive block-size estimation.
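The core BEATS pipeline (block the stream into square matrices, take a 2-D DCT, quantise, keep eigenvalues) can be approximated in a few lines of NumPy/SciPy; the sketch below follows that outline loosely and omits the adaptive details of the published algorithm, with the block size and quantisation step chosen arbitrarily.

```python
import numpy as np
from scipy.fftpack import dct

def beats_like_features(series, block=8, quant_step=10.0):
    """Rough sketch of a BEATS-style transform: reshape the stream into block x block
    matrices, take the 2-D DCT, quantise the coefficients, and keep the (sorted,
    absolute) eigenvalues of each quantised block as a compact segment descriptor."""
    series = np.asarray(series, dtype=float)
    n_blocks = len(series) // (block * block)
    mats = series[: n_blocks * block * block].reshape(n_blocks, block, block)
    feats = []
    for m in mats:
        coeff = dct(dct(m, axis=0, norm="ortho"), axis=1, norm="ortho")  # 2-D DCT
        quantised = np.round(coeff / quant_step) * quant_step
        feats.append(np.sort(np.abs(np.linalg.eigvals(quantised)))[::-1])
    return np.array(feats)        # (n_blocks, block) eigenvalue vectors

print(beats_like_features(np.sin(np.linspace(0, 50, 640))).shape)   # (10, 8)
```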
• 132. Financial forecasting with deep learning #1. Conditional Time Series Forecasting with Convolutional Neural Networks. Anastasia Borovykh, Sander Bohte, Cornelis W. Oosterlee. https://arxiv.org/abs/1703.04691 (2017)
We present a method for conditional time series forecasting based on an adaptation of the recent deep convolutional WaveNet architecture. The proposed network contains stacks of dilated convolutions that allow it to access a broad range of history when forecasting, together with ReLU activations; conditioning is performed by applying multiple convolutional filters in parallel to separate time series, which allows for fast processing of the data and exploitation of the correlation structure between the multivariate time series. We show that a convolutional network is well suited for regression-type problems and is able to effectively learn dependencies in and between the series without the need for long historical time series; it is a time-efficient and easy-to-implement alternative to recurrent-type networks and tends to outperform linear and recurrent models.
Effectively, we use multiple financial time series as input to a neural network, thus conditioning the forecast of a time series on both its own history and that of multiple other time series. Training a model on multiple stock series allows the network to exploit the correlation structure between these series so that it can learn the market dynamics from shorter sequences of data. While for relatively short time series the prediction time is negligible compared to the training time, for longer time series the prediction of the autoregressive model may be sped up by implementing a recent variation that exploits the memorization structure of the network, or by speeding up the convolutions by working in the frequency domain employing Fourier transforms. Finally, it is well known that correlations between data points are stronger on an intraday basis. Therefore, it might be interesting to test the model on intraday data to see if the ability of the model to learn long-term dependencies is even more valuable in that case.
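The dilated-convolution backbone that gives such a network its long receptive field looks roughly like the Keras sketch below; filter counts, depth and the one-step-ahead output head are assumptions for illustration, not the authors' configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def dilated_forecaster(n_timesteps, n_series, n_layers=6):
    """WaveNet-flavoured sketch: a stack of causal dilated Conv1D layers whose
    receptive field doubles with every layer, ending in a 1x1 convolution that
    emits a one-step-ahead forecast for the target series, conditioned on all
    input series."""
    inp = keras.Input(shape=(n_timesteps, n_series))
    x = inp
    for i in range(n_layers):
        x = layers.Conv1D(32, kernel_size=2, dilation_rate=2 ** i,
                          padding="causal", activation="relu")(x)
    out = layers.Conv1D(1, kernel_size=1)(x)      # forecast at every time step
    return keras.Model(inp, out)

model = dilated_forecaster(n_timesteps=128, n_series=5)
print(model.output_shape)   # (None, 128, 1)
```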
• 133. Financial forecasting with deep learning #2. Autoregressive Convolutional Neural Networks for Asynchronous Time Series. Mikołaj Bińkowski, Gautier Marti, Philippe Donnat (submitted 12 Mar 2017 (v1), last revised 12 Jun 2018 (v4)). https://arxiv.org/abs/1703.04122 → Cited by 8. https://github.com/mbinkowski/nntimeseries
We propose the Significance-Offset Convolutional Neural Network, a deep convolutional network architecture for regression of multivariate asynchronous time series.
Conclusion and discussion: In this article, we proposed a weighting mechanism that, coupled with convolutional networks, forms a new neural network architecture for time series prediction. The proposed architecture is designed for regression tasks on asynchronous signals in the presence of a high amount of noise. This approach has proved successful in forecasting several asynchronous time series, outperforming popular convolutional and recurrent networks. The proposed model can be extended further by adding intermediate weighting layers of the same type in the network structure. Another possible generalization, which requires further empirical study, can be obtained by dropping the assumption of independent offset values for each past observation, i.e. considering not only 1x1 convolutional kernels in the offset sub-network. Finally, we aim to test the performance of the proposed architecture on other real-life datasets with relevant characteristics. We observe that there is a strong need for a common benchmark of 'econometric' datasets and, more generally, for time series (stochastic process) regression.
• 134. Financial forecasting with deep learning #3. Multi-task Learning for Financial Forecasting. Tao Ma, Guolin Ke (27 Sep 2018). https://arxiv.org/abs/1809.10336
Due to the strong connections among stocks, the information valuable for forecasting is not only included in individual stocks but also in the stocks related to them. However, most previous works focus on one single stock, which easily ignores the valuable information in others. To leverage more information, in this paper we propose a joint forecasting approach that processes multiple time series of related stocks simultaneously, using a multi-task learning framework (Ruder 2017).
Durichen et al. (2015) used multi-task Gaussian processes to process physiological time series. Jung (2015) proposed a multi-task learning approach to learn the conditional independence structure of stationary time series. Liu et al. (2016) used multi-task multi-view learning to predict urban water quality. Harutyunyan et al. (2017) used recurrent LSTM neural networks and multi-task learning to deal with clinical time series. Li et al. (2018) applied multi-task representation learning to travel time estimation. Moreover, some methods have been proposed to learn a shared representation of all the task-private information; e.g., Misra et al. (2016) proposed cross-stitch networks to combine multiple task-private latent features.
In future work, we would like to further improve SPA's ability to combine latent features, and for DMTL we would like to build hierarchical models to extract the shared information from all tasks more efficiently.
The contributions of this paper are multifold:
● To the best of our knowledge, the proposed multi-series joint forecasting approach is the first work applying multi-task learning to time series forecasting for multiple related stocks.
● We propose a novel attention method to learn the optimized combination of shared and task-private latent features based on the idea of CAPM.
● We demonstrate in experiments on financial data that the proposed approach outperforms single-task baselines and other MTL-based methods, which further improves the forecasting performance.
• 135. Financial forecasting with deep learning #4. Multi-task Learning for Financial Forecasting. Tao Ma, Guolin Ke (27 Sep 2018). https://arxiv.org/abs/1809.10336
In this paper, we empirically study the applicability of the latest deep structures to the volatility modelling problem, through which we aim to provide empirical guidance for the theoretical analysis of the marriage between deep learning techniques and financial applications in the future. We examine both traditional approaches and deep sequential models on the task of volatility prediction, including the most recent variants of convolutional and recurrent networks, such as dilated architectures. Experiments with real-world stock price datasets are performed on a set of 1314 daily stock series over 2018 days of transactions. The evaluation and comparison are based on the negative log-likelihood (NLL) of real-world stock price time series. The results show that the dilated neural models, including dilated CNNs and dilated RNNs, produce the most accurate estimates and predictions, outperforming various widely used deterministic models in the GARCH family as well as several recently proposed stochastic models. In addition, their high flexibility and rich expressive power are validated in this study.
• 136. Trading with deep learning #1a. DeepLOB: Deep Convolutional Neural Networks for Limit Order Books. Zihao Zhang, Stefan Zohren, Stephen Roberts (2018). https://arxiv.org/abs/1808.03668
We develop a large-scale deep learning model to predict price movements from limit order book (LOB) data of cash equities. The architecture utilises convolutional filters to capture the spatial structure of the limit order books as well as LSTM modules to capture longer time dependencies. Importantly, our model translates well to instruments which were not part of the training set, indicating the model's ability to extract universal features. In order to better understand these features and to go beyond a "black box" model, we perform a sensitivity analysis to understand the rationale behind the model predictions and reveal the components of LOBs that are most relevant. The ability to extract robust features which translate well to other instruments is an important property of our model which has many other applications.
We use standardisation (z-score) to normalise our data, using the mean and standard deviation of the previous day's data to normalise the current day's data (with separate normalisation for each instrument). Because financial data is highly stochastic, if we simply compare pt and pt+k to decide the price movement, the resulting label set will be noisy. We adopt the idea of Tsantekidis et al. (2017) and introduce a smoothed labelling method.
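Both preprocessing choices, previous-day z-scoring and smoothed directional labels, are easy to prototype; the NumPy sketch below illustrates them under simple assumptions (the horizon k, threshold alpha and exact window alignment are illustrative, not the paper's settings).

```python
import numpy as np

def zscore_by_previous_day(day_features, prev_day_features):
    """Normalise today's LOB features with yesterday's statistics, per instrument."""
    mu = prev_day_features.mean(axis=0)
    sigma = prev_day_features.std(axis=0) + 1e-8
    return (day_features - mu) / sigma

def smoothed_labels(mid_price, k=10, alpha=0.002):
    """Smoothed labelling in the spirit of Tsantekidis et al.: compare the mean of
    the next k mid-prices with the mean of the previous k; +1 up, -1 down, 0 flat."""
    means = np.convolve(mid_price, np.ones(k) / k, mode="valid")  # means[i] = mean(p[i:i+k])
    m_minus, m_plus = means[:-k], means[k:]      # past window vs future window
    change = (m_plus - m_minus) / m_minus
    return np.where(change > alpha, 1, np.where(change < -alpha, -1, 0))

prices = np.cumsum(np.random.randn(1000)) + 100.0
print(np.bincount(smoothed_labels(prices) + 1))   # counts of down / stationary / up
```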
• 137. Trading with deep learning #1b. DeepLOB: Deep Convolutional Neural Networks for Limit Order Books. Zihao Zhang, Stefan Zohren, Stephen Roberts (2018). https://arxiv.org/abs/1808.03668
To observe what the convolutional layers do, we feed a single input to the trained model and plot the intermediate outputs (right of Figure 5). Since 16 filters are applied, we get 16 series after the "Conv" block. The convolution operations transform the original time series into signals that indicate the time regions with the greatest impact on the final outputs. In our case, we observe strong signals around the t = 1, 20, 40, 70 time stamps, suggesting that information at these time stamps decides the final outputs.
In our case, we use LIME [Ribeiro et al. 2016; cited by 751] to reveal the components of LOBs that are most important for predictions and to understand why the proposed DeepLOB model works better than the reference model [Tsantekidis et al. 2017]. LIME uses an interpretable model to approximate the prediction of a complex model on a given input. It locally perturbs the input and observes variations in the model's predictions, thus providing some measure of input importance and sensitivity.
• 138. Trading with deep learning #2. Developing Arbitrage Strategy in High-Frequency Pairs Trading with Filterbank CNN Algorithm. Yu-Ying Chen, Wei-Lun Chen, Szu-Hao Huang (2018). https://doi.org/10.1109/AGENTS.2018.8459920
This paper proposed a novel intelligent high-frequency pairs trading system for the Taiwan Stock Index Futures (TX) and Mini Index Futures (MTX) markets based on deep learning techniques. This research utilized an improved time series visualization method to transfer historical volatilities at different time frames into 2D images, which are helpful in capturing arbitrage signals. Moreover, this research improved the convolutional neural network (CNN) model by combining financial domain knowledge with a filterbank mechanism. We proposed the Filterbank CNN to extract high-quality features by replacing the randomly generated filters with arbitrage-knowledge filters.
Algorithmic financial trading with deep convolutional neural networks: time series to image conversion approach. Omer Berat Sezer and Ahmet Murat Ozbayoglu (2018). https://doi.org/10.1016/j.asoc.2018.04.024
For future work, we will use more Exchange-Traded Funds (ETFs) and stocks in order to create more data for the deep learning models. We will also analyze the correlations between the selected indicators in order to create more meaningful images so that the learning models can better associate the Buy-Sell-Hold signals and come up with more profitable trading models.
• 139. Trading with deep learning: GANs. Generative Adversarial Networks for Financial Trading Strategies Fine-Tuning and Combination. Adriano Koshiyama, Nick Firoozye, and Philip Treleaven (Jan 2019). https://arxiv.org/abs/1901.01751
Systematic trading strategies are algorithmic procedures that allocate assets aiming to optimize a certain performance criterion. To obtain an edge in a highly competitive environment, the analyst needs to properly fine-tune the strategy, or discover how to combine weak signals in novel alpha-creating manners. Both aspects, namely fine-tuning and combination, have been extensively researched using several methods, but emerging techniques such as Generative Adversarial Networks can have an impact on both. Therefore, our work proposes the use of Conditional Generative Adversarial Networks (cGANs) for trading strategy calibration and aggregation.
Stock Market Prediction on High-Frequency Data Using Generative Adversarial Nets. Xingyu Zhou et al. (2018). https://doi.org/10.1155/2018/4907423
In this paper, we propose a generic framework employing a Long Short-Term Memory (LSTM) network and a convolutional neural network (CNN) for adversarial training to forecast the high-frequency stock market. This model takes the publicly available index provided by trading software as input, avoiding complex financial theory research and difficult technical analysis, which provides convenience for the ordinary trader without a financial specialty. Based on the deep learning network, this model achieves prediction ability superior to other benchmark methods by means of adversarial training, minimizing the direction prediction loss and the forecast error loss. Moreover, the effects of the model update cycle on predictive capability are analyzed, and the experimental results show that a smaller model update cycle can obtain better prediction performance. In the future, we will attempt to integrate predictive models under multiscale conditions.
• 140. Glucose prediction: CNN-RNN hybrid. Kezhi Li, John Daniels, Chengyuan Liu, Pau Herrero, Pantelis Georgiou. Department of Electronic and Electrical Engineering, Imperial College London. https://arxiv.org/abs/1807.03043
Current digital therapeutic approaches for subjects with Type 1 diabetes mellitus (T1DM), such as the artificial pancreas and insulin bolus calculators, leverage machine learning techniques for predicting subcutaneous glucose for improved control. In this work, we present a deep learning model that is capable of predicting glucose levels over a 30-minute horizon. The prediction algorithm is implemented on an Android mobile phone (LG Nexus 5; 2.26 GHz quad-core processor, 2 GB RAM, 8-bit integer arithmetic), with an execution time of 6 ms on the phone compared to an execution time of 780 ms in Python on a laptop (Mac Pro; 3.1 GHz Intel Core i5, 8 GB RAM, 32-bit floating point).
Given that learning is based solely on historical data, unexpected predictions may occur because correlations learned from the data may not imply causation. Hybrid approaches, whereby the deep learning model is used to make an accurate prediction while meal/bolus rules supported by a physiological model avoid apparent errors, can mitigate this. Based on the CRNN approach proposed in this paper, it is possible to develop such a hybrid method, which may have the advantages of both conventional and DL algorithms.
• 142. Clinical survival models: cancer survival. A Simple Discrete-Time Survival Model for Neural Networks. Michael F. Gensheimer and Balasubramanian Narasimhan, Stanford University (May 2018). https://arxiv.org/pdf/1805.00917.pdf https://github.com/MGensheimer/nnet-survival (Keras)
It is recommended to use at least ten time intervals to avoid bias in the survival estimates [17]. Using narrow time intervals also helps avoid inaccurate parameter estimates if the effect of the input data varies rapidly with follow-up time (time-varying coefficients, in the language of survival analysis). In most of our experiments we have used 20-50 time intervals. We suggest choosing the cut-points so that around the same number of survival events falls into each time interval, which helps ensure reliable estimates for all time intervals.
While the model has several advantages and we think it will be useful for a broad range of applications, it does have some drawbacks. The discretization of follow-up time results in a less smooth predicted survival curve compared to a parametric survival model such as a Weibull accelerated failure time model. As long as a sufficient number of time intervals is used, this is not a large practical concern. Unlike a parametric survival model, the model does not provide survival predictions past the end of the last time interval, so it is recommended to extend the last interval past the last follow-up time of interest. The advantages of parametric survival models and our discrete-time survival model could be combined in the future using a flexible parametric model, such as the cubic-spline-based model of Royston and Parmar (2002), implemented in the flexsurv R package. Complex non-proportional hazards models (see Katzman et al. 2018 for a proportional-hazards deep learning model) can be created in this way, and likely could be implemented in deep learning packages.
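The suggested cut-point rule (roughly the same number of observed events per interval) amounts to taking quantiles of the uncensored event times; a small NumPy sketch of that step follows, with simulated follow-up data standing in for a real cohort.

```python
import numpy as np

def event_quantile_cuts(event_times, observed, n_intervals=20):
    """Pick interval boundaries so that roughly the same number of observed
    (uncensored) events falls into each discrete time interval."""
    uncensored = np.sort(event_times[observed.astype(bool)])
    qs = np.linspace(0, 1, n_intervals + 1)[1:-1]        # interior quantile levels
    cuts = np.quantile(uncensored, qs)
    return np.concatenate(([0.0], cuts, [uncensored.max()]))

times = np.random.exponential(24.0, size=500)            # follow-up times (e.g. months)
events = np.random.binomial(1, 0.6, size=500)            # 1 = event observed, 0 = censored
breaks = event_quantile_cuts(times, events, n_intervals=10)
print(np.histogram(times[events == 1], bins=breaks)[0])  # roughly equal counts per interval
```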
• 143. Clinical survival models: sequential DL, "recurrent". Deep Recurrent Survival Analysis. Kan Ren, Jiarui Qin, Lei Zheng, Zhengyu Yang, Weinan Zhang, Lin Qiu, Yong Yu. Shanghai Jiao Tong University (Sept 2018). https://arxiv.org/abs/1809.02403
Recent advances in modern technology make abundant data collection available for time-to-event information, which facilitates observing and tracking the events of interest. However, for various reasons, many events lose tracking during the observation period, which leaves the data censored: we only know that the true time to the occurrence of the event is larger than, smaller than, or within the observation time, which has been defined as survivorship bias and is categorized into right-censored, left-censored and interval-censored data, respectively (Lee and Wang 2003). Survival analysis, a.k.a. time-to-event analysis (Lee et al. 2018; DeepHit), is a typical statistical methodology for modeling time-to-event data while handling censorship, a traditional research problem that has been studied for decades.
Our model proposes a novel modeling view for survival analysis, which aims at flexibly modeling the survival probability function rather than making any assumptions about its distributional form. Specifically, DRSA predicts the conditional probability of the event at each time step given that the event has not occurred before, and combines these through the probability chain rule to estimate both the probability density function and the cumulative distribution function of the event over time, eventually forecasting the survival rate at each time, which is more reasonable and mathematically efficient for survival analysis. Through this modeling approach, the DRSA model can capture the sequential patterns embedded in the feature space along time, and output more effective distributions for each individual sample at a fine-grained level.
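The probability chain rule at the heart of this formulation is easy to state in code: given the per-step conditional hazards predicted by the recurrent network, the survival and event distributions follow directly. The NumPy sketch below shows only this bookkeeping, not the recurrent model itself.

```python
import numpy as np

def survival_from_hazards(h):
    """Probability chain rule used by discrete/recurrent survival models:
    given conditional hazards h_t = P(event at t | no event before t),
    S(t) = prod_{u<=t} (1 - h_u) and the event pdf is p(t) = h_t * S(t-1)."""
    h = np.asarray(h, dtype=float)
    survival = np.cumprod(1.0 - h)                      # S(t) after each step
    pdf = h * np.concatenate(([1.0], survival[:-1]))    # p(t), with S(0) = 1
    return survival, pdf

S, p = survival_from_hazards([0.05, 0.10, 0.20, 0.30])
print(S.round(3), p.round(3), p.sum() + S[-1])          # pdf mass + remaining survival = 1
```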
• 144. Clinical survival models: cardiac motion analysis. Deep learning cardiac motion analysis for human survival prediction. Ghalib A. Bello, Timothy J.W. Dawes, Jinming Duan, Carlo Biffi, Antonio de Marvao, Luke S.G.E. Howard, J. Simon R. Gibbs, Martin R. Wilkins, Stuart A. Cook, Daniel Rueckert, Declan P. O'Regan (submitted 8 Oct 2018). Imperial College London; National Heart Centre Singapore; Duke-NUS Graduate Medical School, Singapore. https://arxiv.org/abs/1810.03382 https://github.com/UK-Digital-Heart-Project/4Dsurvival
Making predictions about future events from the current state of a moving three-dimensional (3D) scene depends on learning correspondences between patterns of motion and subsequent outcomes. Such relationships are important in biological systems, which exhibit complex spatio-temporal behaviour in response to stimuli or as a consequence of disease processes. Here we use recent advances in machine learning for visual processing tasks to develop a generalisable approach for modelling time-to-event outcomes from time-resolved 3D sensory input. We tested this on the challenging task of predicting survival due to heart disease through analysis of cardiac imaging.
The traditional paradigm of epidemiological research is to draw insight from large-scale clinical studies through linear regression modelling of conventional explanatory variables, but this approach does not embrace the dynamic physiological complexity of heart disease. Even objective quantification of heart function by conventional analysis of cardiac imaging relies on crude measures of global contraction that are only moderately reproducible and insensitive to the underlying disturbances of cardiovascular physiology.
While conventional autoencoders are used for unsupervised learning tasks, we extend recent proposals for supervised autoencoders in which the learned representations are both reconstructive and discriminative. We achieved this by adding a prediction branch to the network with a loss function for survival inspired by the Cox proportional hazards model. A hybrid loss function, optimising the trade-off between survival prediction and accurate input reconstruction, is calibrated during training. The compressed representations of 3D motion predict survival more accurately than a composite measure of conventional manually derived parameters measured on the same images.
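The survival branch of such a hybrid loss is typically a negative Cox partial log-likelihood computed on the network's scalar risk outputs; the PyTorch sketch below is a generic version of that term (ties are handled naively and the mixing weight alpha is an assumption), not the 4Dsurvival implementation.

```python
import torch

def cox_partial_likelihood_loss(risk, time, event):
    """Negative Cox partial log-likelihood for a mini-batch: risk is the network's
    scalar output per subject; subjects with earlier events should receive higher risk.
    The risk set of each event is every subject with an equal or later follow-up time."""
    order = torch.argsort(time, descending=True)     # so cumulative sums form risk sets
    risk, event = risk[order], event[order]
    log_cumsum = torch.logcumsumexp(risk, dim=0)     # log sum of exp(risk) over the risk set
    return -((risk - log_cumsum) * event).sum() / event.sum().clamp(min=1)

risk = torch.randn(32)                               # outputs of the prediction branch
time = torch.rand(32) * 10                           # follow-up times
event = (torch.rand(32) > 0.3).float()               # 1 = event observed, 0 = censored
print(cox_partial_likelihood_loss(risk, time, event))
# hybrid objective sketch: loss = alpha * reconstruction_mse + (1 - alpha) * cox_loss
```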
• 146. Representation learning for sequences. Unified recurrent neural network for many feature types. Alexander Stec, Diego Klabjan, Jean Utke (submitted 24 Sep 2018). https://arxiv.org/abs/1809.08717
"There are time series that are amenable to recurrent neural network (RNN) solutions when treated as sequences, but some series, e.g. asynchronous time series, provide a richer variation of feature types than current RNN cells take into account. In order to address such situations, we introduce a unified RNN that handles five different feature types, each in a different manner. Our RNN framework separates sequential features into two groups dependent on their frequency, which we call sparse and dense features, and which affect cell updates differently. Further, we also incorporate time features at the sequential level that relate to the time between specified events in the sequence and are used to modify the cell's memory state. We also include two types of static (whole-sequence-level) features, one related to time and one not, which are combined with the encoder output."
For future work, it would be interesting to incorporate even more feature types than the five covered in this work. One in particular is a feature type that gives time information looking forward in the sequence. All features in this work use time information related to past events, but there are cases that can benefit from incorporating future knowledge when available. One example is the time from the current time step to the prediction, so that the network can have direct knowledge of its absolute time location in the sequence.
• 147. In medical diagnostics: sequence ≈ patient visits. ShortFuse: Biomedical Time Series Representations in the Presence of Structured Information. Madalina Fiterau, Suvrat Bhooshan, Jason Fries, Charles Bournhonesque, Jennifer Hicks, Eni Halilaj, Christopher Ré, Scott Delp (revised 16 May 2017). Stanford University. https://arxiv.org/abs/1705.04790 - Cited by 5
"In healthcare applications, temporal variables that encode movement, health status and longitudinal patient evolution are often accompanied by rich structured information such as demographics, diagnostics and medical exam data (constant along the temporal domain). However, current methods do not jointly optimize over structured covariates and time series in the feature extraction process. We present ShortFuse, a method that boosts the accuracy of deep learning models for time series by explicitly modeling temporal interactions and dependencies with structured covariates. ShortFuse introduces hybrid convolutional and LSTM cells that incorporate the covariates via weights that are shared across the temporal domain."
• 148. Sequences → network science (graph inference). Referral paths in the U.S. physician network. Chuankai An, A. James O'Malley, Daniel N. Rockmore (December 2018). https://doi.org/10.1007/s41109-018-0081-4
For a patient, a "referral path" records (as a "patient journey") the chronological sequence of physicians encountered by the patient (subject to certain constraints on the times between encounters). It provides a basic unit of analysis in a broader referral network that encodes the flow of patients and information between physicians in a healthcare system. We consider referral networks defined over a range of interactions as well as the characteristics of referral paths, producing a characterization of the various networks as well as of the physicians they comprise. In this paper we study the finer-scale patterns found in referral paths and, importantly, link these statistics to treatment outcomes in the particular setting of cardiovascular disease, whereas referral path and referral information has generally been ignored as a factor in the important problem of treatment outcome prediction.
Figure captions: An example referral path with three physicians A, B, C; the patient visits them five times. Physicians A and C are from the same HRR/hospital (blue), while physician B is from another HRR/hospital (red). Visualization of a hospital (PHN) referral network with 30 physicians and 101 directed edges in 2011; red, yellow and light blue nodes represent physicians with positive, zero and negative net patient flow (NPF), respectively; targets of referrals are marked with shadow on directed edges.
• 150. Small data for deep learning. Small Sample Learning in Big Data Era. Jun Shu, Zongben Xu, Deyu Meng (last revised 22 Aug 2018). https://arxiv.org/abs/1808.04572
As a promising area in artificial intelligence, a new learning paradigm called Small Sample Learning (SSL) has been attracting prominent research attention in recent years. In this paper, we aim to present a survey that comprehensively introduces the current techniques proposed on this topic. Specifically, current SSL techniques can be mainly divided into two categories. The first category of SSL approaches can be called "concept learning", which emphasizes learning new concepts from only a few related observations; the purpose is mainly to simulate human learning behaviors such as recognition, generation, imagination, synthesis and analysis. The second category is called "experience learning", which usually co-exists with the large-sample learning manner of conventional machine learning; this category mainly focuses on learning with insufficient samples and is also called small data learning in some of the literature. More extensive surveys of both categories of SSL techniques are introduced, some neuroscience evidence is provided to clarify the rationality of the entire SSL regime and its relationship with human learning, and some discussion of the main challenges and possible future research directions along this line is also presented.
The Fast and the Flexible: training neural networks to learn to follow instructions from small data. Rezka Leonandya, Elia Bruni, Dieuwke Hupkes, Germán Kruszewski (submitted 17 Sep 2018). https://arxiv.org/abs/1809.06194
Learning to follow human instructions is a challenging task because, while interpreting instructions requires discovering arbitrary algorithms, humans typically provide very few examples to learn from. For learning from this data to be possible, strong inductive biases are necessary. Work in the past has relied on hand-coded components or manually engineered features to provide such biases. In contrast, here we seek to establish whether this knowledge can be acquired automatically by a neural network system through a two-phase training procedure: a (slow) offline learning stage where the network learns about the general structure of the task, and a (fast) online adaptation phase where the network learns the language of a new given speaker.
• 151. Data augmentation for time series. T-CGAN: Conditional Generative Adversarial Network for Data Augmentation in Noisy Time Series with Irregular Sampling. Giorgia Ramponi, Pavlos Protopapas, Marco Brambilla and Ryan Janssen (20 Nov 2018). https://arxiv.org/abs/1811.08295
In this paper we propose a data augmentation method for time series with irregular sampling, the Time-Conditional Generative Adversarial Network (T-CGAN). Our approach is based on Conditional Generative Adversarial Networks (CGAN), where the generative step is implemented by a deconvolutional NN and the discriminative step by a convolutional NN. Both the generator and the discriminator are conditioned on the sampling timestamps, in order to learn the hidden relationship between data and timestamps, and consequently to generate new time series.
• 152. Data augmentation from invariance modelling #1. Data Augmentation of Room Classifiers using Generative Adversarial Networks. Constantinos Papayiannis, Christine Evers, Patrick A. Naylor. https://arxiv.org/abs/1901.03257 (Jan 2019)
• 153. Data augmentation from invariance modelling #2. Sinusoidal wave generating network based on adversarial learning and its application: synthesizing frog sounds for data augmentation. Sangwook Park, David K. Han, and Hanseok Ko. https://arxiv.org/abs/1901.02050 (Jan 2019)
Graphical comparisons of time-domain waveforms and spectrograms, and quantitative comparisons using the inception score, clearly showed that the synthetic data closely resemble the target signal. Overall, it was demonstrated that the proposed approach of data augmentation by direct generation of synthetic audio streams improved the CNN-based classification rate and its training efficiency when both the real and the synthetic data were used to train the classifier. These results demonstrate that the proposed network generates an arbitrary signal composed of sinusoidal waveforms and can be used for training a deep network.
• 154. Transfer learning with time series #1. Data augmentation using synthetic data for time series classification with deep residual networks. Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, Pierre-Alain Muller (submitted 7 Aug 2018). https://arxiv.org/abs/1808.02455 https://github.com/hfawaz/aaltd18
Unlike in image recognition problems, data augmentation techniques have not yet been investigated thoroughly for the TSC task. This is surprising, as the accuracy of deep learning models for TSC could potentially be improved, especially for small datasets that exhibit overfitting, when a data augmentation method is adopted. In this paper, we fill this gap by investigating the application of a recently proposed data augmentation technique based on the Dynamic Time Warping distance for a deep learning model for TSC.
The data augmentation method is mainly based on a weighted form of the Dynamic Time Warping (DTW) Barycentric Averaging (DBA) technique [Petitjean et al. 2016]. The latter algorithm averages a set of time series in a DTW-induced space; by leveraging a weighted version of DBA, the method can create an infinite number of new time series from a given set simply by varying these weights. Three techniques were proposed to select these weights, of which we chose only one in our approach for the sake of simplicity, although we consider evaluating the other techniques in our future work. The chosen weighting method, called Average Selected, consists of selecting a subset of close time series and filling their bounding boxes.
We did not test the effect of imbalanced classes in the training set and how it could affect the model's generalization capabilities. Note that imbalanced time series classification is a recent active area of research that merits an empirical study of its own [Geng et al. 2018]. Finally, the number of generated time series in our framework was chosen to be double the number of time series in the most represented class (a hyper-parameter of our approach that we aim to investigate further in future work).
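A rough way to experiment with weighted-DBA augmentation is to draw random convex weights over the series of one class and compute a DTW barycenter for each draw; the sketch below assumes tslearn's dtw_barycenter_averaging (which, to the best of my knowledge, accepts a weights argument) and does not implement the paper's Average Selected weighting scheme.

```python
import numpy as np
from tslearn.barycenters import dtw_barycenter_averaging

def augment_with_weighted_dba(X_class, n_new=10, rng=None):
    """Generate synthetic series for one class by averaging its members in DTW
    space with random convex weights (a loose stand-in for the weighted-DBA
    augmentation described above, not the authors' weighting scheme)."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        w = rng.dirichlet(np.ones(len(X_class)))      # random convex weights
        synthetic.append(dtw_barycenter_averaging(X_class, weights=w))
    return np.stack(synthetic)

X_class = np.random.randn(20, 100, 1)                 # 20 univariate series of length 100
print(augment_with_weighted_dba(X_class, n_new=5).shape)   # (5, 100, 1)
```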
• 155. Transfer learning with time series #2. Physiological-signal-based mental workload estimation via transfer dynamical autoencoders in a deep learning framework. Neurocomputing, available online 11 April 2019. https://arxiv.org/abs/1808.02455
In this study, we propose a new transfer dynamical autoencoder (TDAE) to capture the dynamical properties of electroencephalograph (EEG) features and the individual differences. The TDAE consists of three consecutively connected modules, termed the feature filter, the abstraction filter, and the transferred mental workload (MW) classifier. The feature and abstraction filters introduce a dynamical deep network to abstract the EEG features across adjacent time steps into salient MW indicators. The transferred MW classifier exploits a large volume of EEG data from a source-domain EEG database recorded under emotional stimuli to improve the stability of model training.
The main limitation of the proposed TDAE deep learning framework for MW recognition lies in two aspects. First, the computational cost for training the entire network is significantly higher than for classical shallow and deep classifiers, which leads to a high time cost when selecting optimal hyper-parameters of the model; we therefore employed the same value of the feature filter order to reduce the computational burden, although the filter order should no doubt be feature-specific. Second, there is a prerequisite for knowledge transfer across the two mental-task domains: exactly the same EEG channels must be selected for data preprocessing, which raises the possibility that useful MW indicators are excluded. In future work, we will further investigate deep learning methods for MW assessment on these two aspects.
• 156. Active learning with time series. Robust Active Learning for Electrocardiographic Signal Classification. Xu Chen, Saratendu Sethi (submitted 21 Nov 2018). https://arxiv.org/abs/1811.08919
Motivated by the fact that ECG data are usually heavily unbalanced among the different classes and that the class labels are noisy because they are manually labeled, this paper proposes a novel solution based on robust active learning for addressing these challenges. The key idea is to first cluster the data in a low-dimensional embedded space and then select the most informative instances within local clusters. By selecting the most informative instances based on local average minimal distances, the algorithm tends to select the data for labeling in a more diversified way.
The first stage of the RALS algorithm relies on label spreading, a well-known graph-based semi-supervised learning algorithm. It calculates a similarity measure and propagates the labels by this measure for prediction. It also generates the label distribution matrix, which consists of the predicted probability of every class for each sample. In order to select data from different classes, t-Distributed Stochastic Neighbor Embedding (t-SNE) is applied to the label distribution matrix, due to its good performance on high-dimensional datasets.
A novel noisy-label reduction step relying on an effective confidence score measure is proposed, based on the best-versus-second-best (BSVB) criterion, to enhance the active learning performance. For each selected data sample after ranking, the ratio of the largest estimated class probability to the second largest estimated class probability is calculated (this information can be retrieved from the label distribution matrix) and compared to a user-set threshold. The selected data are added to the labeled set if the ratio is larger than the threshold. By adding the estimated labels passed from the noise-reduction step into the labeled dataset, the noisy labels in the selection are significantly reduced. The augmented labeled dataset, after adding the selected data samples, is fed to the label spreading algorithm again to learn the next, enhanced model.
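The label-spreading and best-versus-second-best (BSVB) filtering steps can be prototyped directly with scikit-learn; the sketch below is a simplified single-pass version (no t-SNE or clustering), with the kernel, neighbour count and threshold chosen arbitrarily.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

def bsvb_confident_labels(X, y_partial, threshold=3.0):
    """Best-vs-second-best filtering on top of label spreading: keep a propagated
    label only when the top class probability dominates the runner-up by a
    user-set ratio. y_partial uses -1 for unlabelled samples."""
    model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)
    probs = model.label_distributions_               # (n_samples, n_classes)
    top2 = np.sort(probs, axis=1)[:, -2:]            # second-best and best probability
    ratio = top2[:, 1] / (top2[:, 0] + 1e-12)
    confident = (y_partial == -1) & (ratio > threshold)
    return confident, model.transduction_[confident]

X = np.random.randn(200, 16)                         # e.g. embedded heartbeat features
y = -np.ones(200, dtype=int)
y[:20] = np.random.randint(0, 3, 20)                 # a small labelled seed set
mask, labels = bsvb_confident_labels(X, y)
print(mask.sum(), labels[:10])
```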
• 158. Visualizing audio processing. Interpretable Convolutional Filters with SincNet. Mirco Ravanelli and Yoshua Bengio (NIPS 2018). https://arxiv.org/abs/1811.08633 https://github.com/mravanelli/SincNet/ https://github.com/mravanelli/pytorch-kaldi/
This paper summarizes our recent efforts to develop a more interpretable neural model for directly processing speech from the raw waveform. In particular, we propose SincNet, a novel Convolutional Neural Network (CNN) that encourages the first layer to discover more meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the elements of each filter, only the low and high cutoff frequencies of band-pass filters are directly learned from data. This inductive bias offers a very compact way to derive a customized filter-bank front-end that only depends on a few parameters with a clear physical meaning. Our experiments, conducted on both speaker and speech recognition, show that the proposed architecture …
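The key trick, band-pass filters defined by two learnable cutoffs, can be sketched compactly in PyTorch; the version below is a simplified re-implementation of the idea (initialisation, windowing and normalisation are assumptions), not the official SincNet code linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincStyleConv(nn.Module):
    """Simplified SincNet-style front end: each filter is an ideal band-pass built
    as the difference of two sinc low-pass filters, so only the two cutoff
    frequencies per filter are learned instead of every filter tap."""
    def __init__(self, n_filters=16, kernel_size=101, sample_rate=16000):
        super().__init__()
        self.kernel_size, self.sr = kernel_size, sample_rate
        self.low_hz = nn.Parameter(torch.linspace(30, 4000, n_filters))      # low cutoffs (Hz)
        self.band_hz = nn.Parameter(torch.full((n_filters,), 400.0))         # bandwidths (Hz)

    def forward(self, x):                                                     # x: (batch, 1, time)
        t = torch.arange(self.kernel_size, device=x.device) - (self.kernel_size - 1) / 2
        t = t / self.sr
        low = torch.abs(self.low_hz).unsqueeze(1)
        high = (low + torch.abs(self.band_hz).unsqueeze(1)).clamp(max=self.sr / 2)
        # band-pass = (high-cutoff low-pass) - (low-cutoff low-pass)
        filt = 2 * high * torch.sinc(2 * high * t) - 2 * low * torch.sinc(2 * low * t)
        filt = filt * torch.hamming_window(self.kernel_size, device=x.device)
        filt = filt / filt.abs().max(dim=1, keepdim=True).values
        return F.conv1d(x, filt.unsqueeze(1), padding=self.kernel_size // 2)

print(SincStyleConv()(torch.randn(2, 1, 16000)).shape)    # torch.Size([2, 16, 16000])
```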
• 159. Spatiotemporal activations. Compensated Integrated Gradients to Reliably Interpret EEG Classification. Kazuki Tachikawa, Yuji Kawai, Jihoon Park, Minoru Asada. Machine Learning for Health (ML4H) Workshop at NeurIPS 2018. https://arxiv.org/abs/1811.08633
Integrated gradients are widely employed to evaluate the contribution of input features in classification models because they satisfy the axioms for attribution of a prediction. This method, however, requires an appropriate baseline for reliable determination of the contributions. We propose a compensated integrated gradients method that does not require a baseline; instead, the method compensates the attributions calculated by integrated gradients at an arbitrary baseline using Shapley sampling.
The classifier constraints decrease the classification accuracy of the temporal CNN. In contrast, spatiotemporal CNNs exhibit higher classification accuracy but lower interpretation reliability than the temporal CNNs. Therefore, classifier selection should depend on whether reliability or classification accuracy is emphasized.
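For reference, plain integrated gradients (the baseline-dependent method the paper sets out to fix) takes only a few lines of PyTorch; the sketch below shows the standard formulation with an arbitrary zero baseline and a toy classifier, and does not include the Shapley-sampling compensation proposed in the paper.

```python
import torch

def integrated_gradients(model, x, baseline=None, target=0, steps=50):
    """Plain integrated gradients: average the gradients of the target output along a
    straight path from the baseline to the input, then scale by (input - baseline)."""
    if baseline is None:
        baseline = torch.zeros_like(x)               # the arbitrary baseline the paper criticises
    alphas = torch.linspace(0, 1, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)        # (steps, *x.shape) interpolation path
    path.requires_grad_(True)
    out = model(path)[:, target].sum()
    grads = torch.autograd.grad(out, path)[0]
    return (x - baseline) * grads.mean(dim=0)

# toy EEG classifier: 8 channels x 128 samples flattened into a linear layer
net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(8 * 128, 3))
x = torch.randn(8, 128)
print(integrated_gradients(net, x, target=1).shape)   # torch.Size([8, 128])
```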