Deep Learning
for Biomedical
Unstructured
Time-series
1D Convolutional neural
networks (CNNs) for time
series analysis, and
inspiration from beyond
biomedical field
Petteri Teikari, PhD
Singapore Eye Research Institute (SERI)
Visual Neurosciences group
http://petteri-teikari.com/
Version "Wed 17 April 2019"
Time Series Analysis: A Very Short Intro
Time Series Basics
Regular time series vs. irregular time series
https://mediatum.ub.tum.de/doc/1444158/78684.pdf
Unstructured Biomedical 1D Time Series
Time-frequency visualization
https://doi.org/10.3389/fnhum.2016.00605
Time series with discrete "states"
Sleep stages inferred from univariate or multivariate (multiple EEG electrode locations), multimodal (EEG with ECG/EMG, etc.) dense 1D time series
Many types of ground truths possible also for 1D time series: segmentation, classification, regression
https://arxiv.org/abs/1801.05394
Time Series Stationarity
Non-stationarities significantly distort short-term spectral, symbolic and entropy heart rate variability indices. Physiological Measurement 32(11):1775-86, November 2011. DOI: 10.1088/0967-3334/32/11/S05
Tests of Stationarity
https://stats.stackexchange.com/questions/182764/stationarity-tests-in-r-checking-mean-variance-and-covariance
Stationarity of order 2: For everyday use we often consider time series that have (instead of strict stationarity): https://people.maths.bris.ac.uk/~magpn/Research/LSTS/TOS.html
● a constant mean
● a constant variance
● an autocovariance that does not depend on time.
Such time series are known as second-order stationary, or stationary of order 2.
Examples of non-stationary processes are random walk with or without a drift (a slow steady change) and deterministic trends (trends that are constant, positive or negative, independent of time for the whole life of the series). https://www.investopedia.com/articles/trading/07/stationary.asp
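A minimal sketch (not from the slides) of how second-order stationarity is often checked in practice: inspect rolling mean/variance and run an augmented Dickey-Fuller unit-root test; assumes numpy, pandas and statsmodels are available.

```python
# Rolling statistics + ADF test on a random walk vs. its first difference.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
x = pd.Series(np.cumsum(rng.normal(size=1000)))   # random walk: non-stationary
y = x.diff().dropna()                             # first difference: ~stationary

for name, s in [("random walk", x), ("differenced", y)]:
    roll_mean = s.rolling(100).mean().dropna()
    roll_var = s.rolling(100).var().dropna()
    adf_stat, p_value, *_ = adfuller(s)
    print(f"{name}: mean drift={roll_mean.iloc[-1] - roll_mean.iloc[0]:.2f}, "
          f"variance change={roll_var.iloc[-1] - roll_var.iloc[0]:.2f}, "
          f"ADF p-value={p_value:.3f}")
```

A small p-value in the ADF test rejects the unit-root (non-stationary) hypothesis; the rolling statistics make drifting mean or variance visible directly.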
Time Series Analysis: Literature Overview
Representation vs. Similarity
https://arxiv.org/abs/1704.00794: "Time series
analysis approaches can be broadly categorized
into two families: (i) representation methods,
which provide high-level features for representing
properties of the time series at hand, and (ii)
similarity measures, which yield a meaningful
similarity between different time series for further
analysis.“
Classic representation methods are for instance
Fourier transforms, wavelets, singular value
decomposition, symbolic aggregate approximation,
and piecewise aggregate approximation.
Time series may also be represented through the
parameters of model-based methods such as
Gaussian mixture models (GMM), Markov models and
hidden Markov models (HMMs), time series bitmaps
and variants of ARIMA.
An advantage with parametric models is that they
can be naturally extended to the multivariate
case. For detailed overviews on representation
methods, we refer the interested reader to e.g.
Wang et al. (2013).
https://arxiv.org/abs/1704.00794: "Similarity-based approaches: once defined, such similarities between pairs of time series may be utilized in a wide range of applications, such as classification, clustering, and anomaly detection. Time series similarity measures include for example dynamic time warping (DTW), the longest common subsequence (LCSS), the extended Frobenius norm (Eros), and the Edit Distance with Real sequences (EDR), and represent state-of-the-art performance in univariate time series (UTS) prediction.
Attempts have been made to design kernels from non-metric distances such as DTW, of which the global alignment kernel (GAK) is an example. There are also promising works on deriving kernels from parametric models, such as the probability product kernel, Fisher kernel, and reservoir based kernels. Common to all these methods is however a strong dependence on a correct hyperparameter tuning, which is difficult to obtain in an unsupervised setting.
Moreover, many of these methods cannot naturally be extended to deal with multivariate time series (MTS), as they only capture the similarities between individual attributes and do not model the dependencies between multiple attributes. Equally important, these methods are not designed to handle missing data, an important limitation in many existing scenarios, such as clinical data where MTS originating from Electronic Health Records (EHRs) often contain missing data."
In this work, we propose a surgical site infection detection framework for
patients undergoing colorectal cancer surgery that is completely
unsupervised, hence alleviating the problem of getting access to labelled
training data. The framework is based on powerful kernels for multivariate
time series that account for missing data when computing similarities.
https://arxiv.org/abs/1803.07879
Analysis with Similarity Measures
Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data
Karl Øyvind Mikalsen, Filippo Maria Bianchi, Cristina Soguero-Ruiz, Robert Jenssen (last revised 29 Jun 2017)
https://arxiv.org/abs/1704.00794 | https://github.com/kmi010/Time-series-cluster-kernel-TCK- (The TCK was implemented in R and Matlab)
Similarity-based approaches represent a
promising direction for time series analysis.
However, many such methods rely on
parameter tuning, and some have
shortcomings if the time series are
multivariate (MTS), due to dependencies
between attributes, or the time series
contain missing data.
In this paper, we address these challenges
within the powerful context of kernel
methods by proposing the robust time
series cluster kernel (TCK). The approach
taken leverages the missing data
handling properties of Gaussian
mixture models (GMM) augmented with
informative prior distributions. An ensemble
learning approach is exploited to ensure
robustness to parameters by combining the
clustering results of many GMM to
form the final kernel.
The experimental results demonstrated that the TCK
(1) is robust to hyperparameter settings, (2) is
competitive to established methods on prediction
tasks without missing data and (3) is better than
established methods on prediction tasks with missing
data.
In future works we plan to investigate whether the
use of more general covariance structures in the
GMM, or the use of HMMs as base probabilistic
models, could improve TCK.
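A minimal sketch in the spirit of TCK, but not the authors' code: an ensemble of GMMs clusters the (vectorized, equal-length) series and the kernel accumulates agreement between posterior cluster assignments. The real TCK additionally handles missing data via informative priors, which this sketch does not.

```python
# Ensemble-of-GMM similarity between time series (TCK-like, simplified).
import numpy as np
from sklearn.mixture import GaussianMixture

def ensemble_gmm_kernel(X, n_gmms=10, max_components=5, seed=0):
    """X: (n_series, length) array of equal-length univariate series."""
    rng = np.random.default_rng(seed)
    K = np.zeros((len(X), len(X)))
    for i in range(n_gmms):
        c = int(rng.integers(2, max_components + 1))      # random model size per member
        gmm = GaussianMixture(n_components=c, covariance_type="diag",
                              random_state=i).fit(X)
        P = gmm.predict_proba(X)                          # posterior assignments
        K += P @ P.T                                      # agreement between series
    return K / n_gmms

X = np.random.default_rng(1).normal(size=(30, 50))
K = ensemble_gmm_kernel(X)
print(K.shape)  # (30, 30) similarity matrix usable by kernel methods
```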
Wavelets → Shapelets: "1D Gabors" #1
Fast classification of univariate and multivariate time series through shapelet discovery
https://doi.org/10.1007/s10115-015-0905-9
Josif Grabocka, Martin Wistuba, Lars Schmidt-Thieme
A Shapelet Selection Algorithm for Time Series Classification: New Directions
https://doi.org/10.1016/j.procs.2018.03.025
The high time complexity of the shapelet selection process hinders its application in real-time data processing. To overcome this, in this paper we propose a fast shapelet selection algorithm (FSS), which sharply reduces the time consumption of shapelet selection.
https://slideplayer.com/slide/8370683/
For example, a class of abnormal ECG measurement may be characterised by an unusual pattern that only occurs occasionally at any point during the measurement. Shapelets are subseries that capture this type of characteristic. They allow for the detection of phase-independent localised similarity between series within the same class.
The great time series classification bakeoff: a review and experimental evaluation of recent algorithmic advances
Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, Eamonn Keogh (May 2017)
https://doi.org/10.1007/s10618-016-0483-9 | https://bitbucket.org/TonyBagnall/time-series-classification
Wavelets → Shapelets: "1D Gabors" #2
A fast shapelet selection algorithm for time series classification
https://doi.org/10.1016/j.comnet.2018.11.031
The training time of shapelet-based algorithms is high, even though it is computed off-line, and the authors aim to make it more efficient.
Shapelet transformation algorithms have attracted a great deal of attention in the last decade. However, the time complexity of the shapelet selection process in shapelet transformation algorithms is too high. To accelerate the shapelet selection process with no reduction in accuracy, we presented FSS for ST.
The experimental results demonstrate that our proposed FSS was thousands of times faster than the original shapelet transformation method with no reduction in accuracy. Our results also demonstrate that our method was the fastest method among shapelet methods that have the leading level of accuracy.
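A minimal sketch (assuming equal-length univariate series, not the cited papers' code) of the core shapelet primitive: the minimum z-normalized distance of a short subseries to any position of a longer series. Shapelet discovery then searches for candidates whose distances separate classes; the shapelet transform uses these distances as features.

```python
# Phase-independent match of a shapelet against a series.
import numpy as np

def znorm(a, eps=1e-8):
    return (a - a.mean()) / (a.std() + eps)

def shapelet_distance(series, shapelet):
    """Smallest distance between the shapelet and any window of the series."""
    L = len(shapelet)
    s = znorm(shapelet)
    dists = [np.linalg.norm(znorm(series[i:i + L]) - s)
             for i in range(len(series) - L + 1)]
    return min(dists)

rng = np.random.default_rng(0)
ts = rng.normal(size=200)
candidate = ts[40:60]                     # a candidate subsequence
print(shapelet_distance(ts, candidate))   # ~0, since the shapelet occurs in ts
```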
Representation Learning with deep learning #1
Towards a Universal Neural Network Encoder for Time Series
Joan Serrà, Santiago Pascual, Alexandros Karatzoglou (Submitted on 10 May 2018)
https://arxiv.org/abs/1805.03908
We have studied the use of a universal encoder for time
series in the specific case of classifying an out-of-sample data
set of an unseen data type. We have considered the cases of
no-adaptation, mapping adaptation, and full adaptation.
In all cases we achieve performances that are competitive with
the state-of-the-art that, in addition, involve a compact reusable
representation and few training iterations. We have also studied
the effect of the representation dimensionality, showing that
small representations have an impact on no-adaptation and mapping adaptation approaches, but not much on full adaptation ones.
In the future, we plan to refine the encoder architecture, as well
as optimizing some of the parameters we empirically use in our
experiments. A very interesting direction for future research is
the adoption of one-shot learning schemas (Snell et al. 2017; Sutskever et al. 2014), which we find very suitable for the
current setting in time series classification problems.
A further option to enhance the performance of a universal encoder is data augmentation, especially considering recent linear instance/class interpolation approaches (Zhang et al. 2018).
In order to have sufficient knowledge to accomplish any task, and in order to be
applicable in the absence of labeled data or even without adaptation/re-training,
researchers have been increasingly adopting the generic concept of universal
encoders, especially within the text processing domain (note that related concepts also exist in other domains).
The basic idea is to train a model (the encoder) that learns a common representation
which is useful for a variety of tasks and that, at the same time, can be reused for
novel tasks with minimal or no adaptation. While it would seem that classical
autoencoders and other unsupervised models should perfectly fit this purpose, recent
research in sentence encoding shows that, with current means, encoders learnt with a
sufficiently large set of supervised tasks, or mixing supervised and
unsupervised data, consistently outperform their purely unsupervised counterparts.
Representation Learning with deep learning #2
One Deep Music Representation to Rule Them All? A comparative analysis of different representation learning strategies
Jaehun Kim, Julian Urbano, Cynthia C. S. Liem, Alan Hanjalic (Submitted on 13 Feb 2018)
https://arxiv.org/abs/1802.04051
Our work will address the following research questions:
– RQ1: Given a set of common learning tasks that can be used to train a network, what is the influence of the number and type of the tasks on the effectiveness of the learned deep representation?
– RQ2: How do various degrees of information sharing in the deep architecture affect the ultimate success of a learned deep representation?
– RQ3: What is the best way to assess the effectiveness of a deep representation?
Simplified illustration of the conceptual difference between traditional deep transfer learning (DTL) based on a single
learning task (above) and multi-task based deep transfer learning (MTDTL) (below). The same color used for a
learning and an unseen task indicates that the tasks have commonalities, which implies that the learned representation is
likely to be informative for the unseen task. At the same time, this representation may not be that informative to another
unseen task, leading to a low transfer learning performance. The hypothesis behind MTDTL is that relying on more
learning tasksincreasesrobustness of thelearned representationand itsusabilityfor abroadersetof unseen tasks.
Representation Learning with deep learning #3
Learning Finer-class Networks for Universal Representations
https://arxiv.org/abs/1810.02126
https://arxiv.org/abs/1712.09708
Julien Girard, Youssef Tamaazousti, Hervé Le Borgne, Céline Hudelot (Submitted on 4 Oct 2018)
Many real-world visual recognition use-cases can not directly benefit from
state-of-the-art CNN-based approaches because of the lack of many
annotated data. The usual approach to deal with this is to transfer a
representation pre-learned on a large annotated source-task onto a target-
task of interest. This raises the question of how well the original
representation is "universal", that is to say directly adapted to many
different target-tasks. To improve such universality, the state-of-the-art
consists in training networks on a diversified source problem, that is
modified either by adding generic or specific categories to the initial set of
categories.
We propose two methods to improve universality, but pay special attention
to limit the need of annotated data. We also propose a unified
framework of the methods based on the diversifying of the training
problem. Finally, to better match Atkinson's cognitive study about
universal human representations, we proposed to rely on the
transfer-learning scheme as well as a new metric to evaluate universality.
We show that our method learns more universal representations than state-of-the-art, leading to significantly better results on 10 target-tasks from multiple domains, using several network architectures, either alone or combined with networks learned at a coarser semantic level.
Representation Learning with deep learning #4
Improving Clinical Predictions through Unsupervised Time Series Representation Learning
https://arxiv.org/abs/1812.00490
Xinrui Lyu, Matthias Hüser, Stephanie L. Hyland, George Zerveas, Gunnar Rätsch (Submitted on 2 Dec 2018)
Machine Learning for Health (ML4H) Workshop at NeurIPS 2018.
We empirically showed that in scenarios
where labeled medical time series data is
scarce, training classifiers on unsupervised
representations provides performance gains
over end-to-end supervised learning using
raw input signals, thus making effective use
of information available in a separate,
unlabeled training set.
The proposed model, explored for the first
time in the context of unsupervised patient
representation learning, produces
representations with the highest
performance in future signal prediction
and clinical outcome prediction,
exceeding several baselines.
The idea behind applying attention mechanisms to time series forecasting is to enable the decoder to preferentially "attend" to specific parts of the input sequence during decoding. This allows particularly relevant events (e.g. drastic changes in heart rate) to contribute more to the generation of different points in the output sequence.
Representation Learning with deep learning #5
Unsupervised Scalable Representation Learning for Multivariate Time Series
https://arxiv.org/abs/1901.10738
https://github.com/White-Link/UnsupervisedScalableRepresentationLearningTimeSeries (PyTorch)
Jean-Yves Franceschi, Aymeric Dieuleveut, Martin Jaggi (Submitted on 30 Jan 2019)
Hence, we propose in the following an unsupervised
method to learn general-purpose representations for
multivariate time series that comply with the issues of
varying and potentially high lengths of the studied time
series. To this end, we adapt recognized deep learning tools and introduce a novel unsupervised loss. Our representations are computed by a deep convolutional neural network with dilated convolutions (i.e. TCNs).
This network is then trained unsupervised, using the first
specifically designed triplet loss in the literature of
time series, taking advantage of the encoder resilience to
time series of unequal lengths.
We leave as future work the applicability of our method to other tasks like forecasting, and the study of its impact if it were to be added in powerful ensemble methods.
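A minimal sketch in the spirit of this approach, not the authors' code: a dilated 1D-CNN encoder with global max pooling (so inputs of unequal length map to fixed-size embeddings) trained with a triplet-style loss where the positive is a subseries of the anchor and the negative comes from another series. Symmetric padding is used here for brevity; the original work uses causal convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedEncoder(nn.Module):
    def __init__(self, in_ch=1, hidden=32, out_dim=16, n_layers=4):
        super().__init__()
        layers, ch = [], in_ch
        for i in range(n_layers):
            d = 2 ** i                                   # exponentially growing dilation
            layers += [nn.Conv1d(ch, hidden, kernel_size=3, dilation=d, padding=d),
                       nn.ReLU()]
            ch = hidden
        self.conv = nn.Sequential(*layers)
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, x):                                # x: (batch, channels, time)
        h = self.conv(x)
        h = torch.max(h, dim=-1).values                  # length-independent pooling
        return self.out(h)

enc = DilatedEncoder()
anchor = torch.randn(8, 1, 200)
positive = anchor[..., 50:150]                           # subseries of the anchor
negative = torch.randn(8, 1, 120)                        # subseries of other series
za, zp, zn = enc(anchor), enc(positive), enc(negative)
loss = (-F.logsigmoid((za * zp).sum(-1)).mean()
        - F.logsigmoid(-(za * zn).sum(-1)).mean())
loss.backward()
```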
Representation Learning with deep learning #6
Unsupervised speech representation learning using WaveNet autoencoder
https://arxiv.org/abs/1812.00490
Jan Chorowski, Ron J. Weiss, Samy Bengio, Aaron van den Oord (Submitted on 25 Jan 2019)
We consider the task of unsupervised extraction of
meaningful latent representations of speech by applying
autoencoding neural networks to speech waveforms. The
goal is to learn a representation able to capture high level
semantic content from the signal, e.g. phoneme identities,
while being invariant to confounding low level details in the
signal such as the underlying pitch contour or background
noise. The behavior of autoencoder models depends on the
kind of constraint that is applied to the latent representation.
Our best models used MFCCs (mel-frequency cepstral
coefficient) as the encoder input, but reconstructed raw
waveforms at the decoder output. We used standard 13
MFCC features extracted every 10ms (i.e., at a rate of 100 Hz)
and augmented with their temporal first and second
derivatives. Such features were originally designed for
speech recognition and are mostly invariant to pitch and
similar confounding detail in the audio signal.
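A minimal sketch of the described input features (an assumption using librosa, not the paper's code): 13 MFCCs at a 10 ms hop (100 Hz frame rate), augmented with first and second temporal derivatives; the bundled example clip is only a placeholder for any mono waveform.

```python
import librosa
import numpy as np

wav, sr = librosa.load(librosa.example("trumpet"), sr=16000)  # any mono waveform
hop = sr // 100                                               # 10 ms hop -> 100 Hz
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13, hop_length=hop)
d1 = librosa.feature.delta(mfcc)                              # first derivative
d2 = librosa.feature.delta(mfcc, order=2)                     # second derivative
features = np.vstack([mfcc, d1, d2])                          # (39, n_frames)
print(features.shape)
```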
Representation Learning with deep learning #7
A Tale of Two Time Series Methods: Representation Learning for Improved Distance and Risk Metrics
https://dspace.mit.edu/bitstream/handle/1721.1/119575/1076345253-MIT.pdf
Divya Shanmugam (June 2018)
Architecture of the proposed model. A single convolutional layer
extracts local features from the input, which a strided maxpool
layer reduces to a fixed-size vector. A fully connected layer
with ReLU activation carries out further, nonlinear dimensionality
reduction to yield the embedding. A softmax layer is added at
training time.
We introduce the multiple instance learning paradigm to risk
stratification. Risk stratification models aim to identify patients
at high risk for a given outcome so that doctors may intervene, with
the attempt of avoiding that outcome. Machine learning has led to
improved risk stratification models for a number of outcomes,
including stroke, cancer and treatment resistance [55]. To the best of
our knowledge, this is the first application of multiple instance learning
to risk stratification.
The extension of Jiffy to multi-label classification and unsupervised
learning poses a challenging but necessary task. The availability of
unlabeled time series data eclipses the availability of its annotated
counterpart. Thus, a simple network-based method for representation learning on multivariate time series in the absence of labels is an important line of work. There is also potential to further increase Jiffy's speed by replacing the fully connected layer with a structured [Bojarski et al. 2016] or binarized [Rastegari et al. 2016] matrix.
The proposed risk stratification model extends naturally to a range of adverse
outcomes. The model is not limited to operating on ECG signals - it is
worth exploring whether the multiple instance learning approach may be
successful in other modalities of medical data, including voice. On a
theoretical level, strong generalization guarantees for distinguishing bags with relative witness rates do not exist and are worth exploring as these models are applied in the real world.
Intro to methods #1a
Highly comparative time-series analysis: the empirical structure of time series and their methods
http://doi.org/10.1098/rsif.2013.0048
Ben D. Fulcher, Max A. Little, Nick S. Jones
Intro to methods #1b
Highly comparative time-series analysis: the empirical structure of time series and their methods
http://doi.org/10.1098/rsif.2013.0048
Ben D. Fulcher, Max A. Little, Nick S. Jones
Structure in a library of 8651 time-series analysis operations. (a) A summary of the four main classes of operations in our library, as determined by a k-medoids clustering, reflects a crude but intuitive overview of the time-series analysis literature. (b) A network representation of the operations in our library that are most similar to the approximate entropy algorithm, ApEn(2,0.2) [7], which were retrieved from our library automatically. Each node in the network represents an operation and links encode distances between them (computed using a normalized mutual information-based distance metric, cf. electronic supplementary material, §S1.3.1). Annotated scatter plots show the outputs of ApEn(2,0.2) (horizontal axis) against a representative member of each shaded community (indicated by a heavily outlined node, vertical axis). Similar pictures can be produced by targeting any given operation in our library, thereby connecting different time-series analysis methods that nevertheless display similar behaviour across empirical time series.
Key scientific questions that can be addressed by representing time series by their properties (measured by many types of analysis methods) and operations by their behaviour (across many types of time-series data). We show that this representation facilitates a range of versatile techniques for addressing scientific time-series analysis problems, which are illustrated schematically in this figure.
The representations of time series (rows of the data matrix, figure 1a) and operations (columns of the data matrix, figure 1b) serve as empirical fingerprints, and are shown in the top panel. Coloured borders are used to label different classes of time series and operations, and other figures in this paper that explicitly demonstrate each technique are given in the bottom right-hand corner of each panel.
(a) Time-series datasets can be organized automatically, revealing the structure in a given dataset (cf. figures 4a,b and 5a). (b) Collections of scientific methods can be organized automatically, highlighting relationships between methods developed in different fields (cf. figures 3a and 5b). (c) Real-world and model-generated data with similar properties to a specific time-series target can be identified (cf. figure 4c,d). (d) Given a specific operation, alternatives from across science can be retrieved (cf. figure 3b). (e) Regression: the behaviour of operations in our library can be compared to find operations that vary with a target characteristic assigned to time series in a dataset (cf. figure 5d). (f) Classification: operations can be selected based on their classification performance to build useful classifiers and gain insights into the differences between classes of labelled time-series datasets (cf. figure 5e).
Intro to methods #1c
Highly comparative time-series analysis: the empirical structure of time series and their methods
http://doi.org/10.1098/rsif.2013.0048
Ben D. Fulcher, Max A. Little, Nick S. Jones
Highly comparative techniques for time-series analysis tasks. We draw on our full library of time-series analysis methods to: (a) structure datasets in meaningful ways, and retrieve and organize useful operations for (b,e) classification and (c,d) regression tasks. (a) Five classes of EEG signals are structured meaningfully in a two-dimensional principal components space of our library of operations. (b) Pairwise linear correlation coefficients measured between the 60 most successful operations for classifying congestive heart failure and normal sinus rhythm RR interval series. Clustering reveals that most operations are organized into one of three groups (indicated by dashed boxes).
Most of the time when people talk about time series and deep learning, they are most likely talking of Sequences (e.g. language) instead of unstructured time series (e.g. voice waveform).
"Sequences" vs. "Time Series"
"Dense Time Series" at video frame rate
Ice hockey as a game can be simplified to discrete events (sequences)
https://arxiv.org/abs/1808.04063
Not always so black-and-white, but in our case time series are mainly dense 1D biosignals with ambiguous or missing discrete states.
Time Series: RNNs for sequences
The Unreasonable Effectiveness of Recurrent Neural Networks
May 21, 2015 | Andrej Karpathy
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences
Daniel Quang, Xiaohui Xie. Nucleic Acids Research, Volume 44, Issue 11, 20 June 2016, Pages e107,
https://doi.org/10.1093/nar/gkw226
Deep Learning for Understanding Consumer Histories
by Tobias Lang - 25 Oct 2016
https://jobs.zalando.com/tech/blog/deep-learning-for-understanding-consumer-histories/?gh_src=4n3gxh1
Sequences. Depending on your background you might be wondering: What makes Recurrent Networks so special?
Time Series: LSTM, upgraded RNNs
Time Series: LSTMs Applied
DeepAir | UC Berkeley School of Information
https://www.ischool.berkeley.edu/projects/2017/deep-air
This project investigates the use of the LSTM recurrent neural network (RNN) as a
framework for forecasting in the future, based on time series data of pollution and
meteorological information in Beijing. Our results show that the LSTM framework
produces equivalent accuracy when predicting future time stamps compared to the
baseline support vector regression for a single time stamp. Using our LSTM framework,
we can now extend the prediction from a single time stamp out to 5 to 10 hours in the
future.
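A minimal sketch (not the project's code) of this kind of multi-step LSTM forecaster: a window of past multivariate readings is encoded by an LSTM and a linear head predicts several future hourly values at once.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, n_features=8, hidden=64, horizon=5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon)       # predict `horizon` future steps

    def forward(self, x):                             # x: (batch, time, features)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])                       # (batch, horizon)

model = LSTMForecaster()
past = torch.randn(16, 48, 8)                         # 48 h of 8 measurements
future = model(past)                                  # 5-hour-ahead forecast
loss = nn.functional.mse_loss(future, torch.randn(16, 5))  # dummy targets
loss.backward()
```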
Overview of our self-supervised approach for posture and sequence representation learning using CNN-LSTM. After the initial training with motion-based detections we retrain our model for enhancing the learning of the representations. https://doi.org/10.1109/CVPR.2017.399
Piano Genie: An Intelligent Musical Interface
Oct 15, 2018 | https://magenta.tensorflow.org/pianogenie
Chris Donahue, Ian Simon, Sander Dieleman
A bidirectional LSTM encoder maps a sequence of piano notes to a sequence of controller buttons (shown as 4 in the above figure, 8 in the actual system). A unidirectional LSTM decoder then decodes these controller sequences back into piano performances. After training, the encoder is discarded and controller sequences are provided by user input.
Time Series: RNN/LSTMs are outdated? #1
The fall of RNN / LSTM
Eugenio Culurciello
https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0
Combining multiple neural attention modules, comes the "hierarchical neural attention encoder"… Notice there is a hierarchy of attention modules here, very similar to the hierarchy of neural networks. This is also similar to the Temporal convolutional network (TCN).
Shapelets → Attention Models, e.g. Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction
Maha Elbayad, Laurent Besacier, Jakob Verbeek (Submitted on 11 Aug 2018)
https://arxiv.org/abs/1808.03867 | https://github.com/elbayadm/attn2d
Time Series: RNN/LSTMs are outdated? #2
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Shaojie Bai, J. Zico Kolter, Vladlen Koltun (Revised 19 Apr 2018)
https://arxiv.org/abs/1803.01271 | http://github.com/locuslab/TCN
For most deep learning practitioners, sequence modeling is
synonymous with recurrent networks. Yet recent results
indicate that convolutional architectures can outperform recurrent
networks on tasks such as audio synthesis and machine translation.
Given a new sequence modeling task or dataset, which architecture
should one use?
We conduct a systematic evaluation of generic convolutional and
recurrent architectures for sequence modeling. The models are
evaluated across a broad range of standard tasks that are commonly
used to benchmark recurrent networks. Our results indicate that a
simple convolutional architecture outperforms canonical
recurrent networks such as LSTMs across a diverse range of
tasks and datasets, while demonstrating longer effective memory. We
conclude that the common association between sequence modeling
and recurrent networks should be reconsidered, and convolutional
networks should be regarded as a natural starting point for sequence
modeling tasks.
The preeminence enjoyed by recurrent networks in sequence modeling
may be largely a vestige of history. Until recently, before the introduction of
architectural elements such as dilated convolutions and residual
connections, convolutional architectures were indeed weaker. Our
results indicate that with these elements, a simple convolutional
architecture is more effective across diverse sequence modeling tasks
than recurrent architectures such as LSTMs. Due to the comparable
clarity and simplicity of TCNs, we conclude that convolutional
networks should be regarded as a natural starting point and a
powerful toolkit for sequence modeling.
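A minimal sketch (an assumption, not the reference TCN implementation) of the architectural elements credited above: one residual block with causal, dilated 1D convolutions. Stacking blocks with dilations 1, 2, 4, ... grows the receptive field exponentially, which is what gives TCNs their long effective memory.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left-pad only => causal
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)
        self.down = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.relu = nn.ReLU()

    def causal(self, conv, x):
        return conv(nn.functional.pad(x, (self.pad, 0)))   # pad the past, not the future

    def forward(self, x):                                  # x: (batch, channels, time)
        h = self.relu(self.causal(self.conv1, x))
        h = self.relu(self.causal(self.conv2, h))
        return self.relu(h + self.down(x))                 # residual connection

tcn = nn.Sequential(*[TemporalBlock(1 if i == 0 else 16, 16, dilation=2 ** i)
                      for i in range(4)])
out = tcn(torch.randn(8, 1, 300))                          # same length as the input
print(out.shape)                                           # torch.Size([8, 16, 300])
```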
Time Series: RNN/LSTMs are outdated? #3
Dilated Temporal Fully-Convolutional Network for Semantic Segmentation of Motion Capture Data
Noshaba Cheema, Somayeh Hosseini, Janis Sprenger, Erik Herrmann, Han Du, Klaus Fischer, Philipp Slusallek (Submitted on 24 Jun 2018)
https://arxiv.org/abs/1806.09174
Semantic segmentation of motion capture sequences
plays a key part in many data-driven motion synthesis
frameworks. It is a preprocessing step in which long
recordings of motion capture sequences are partitioned
into smaller segments. Afterwards, additional methods like
statistical modeling can be applied to each group of
structurally-similar segments to learn an abstract motion
manifold. The segmentation task however often remains a manual task, which increases the effort and cost of generating large-scale motion databases.
We therefore propose an automatic framework for
semantic segmentation of motion capture data using a
dilated temporal fully-convolutional network. Our
model outperforms a state-of-the-art model in action
segmentation, as well as three networks for sequence
modeling.
Time Series: RNN/LSTMs are outdated? #4
Temporal Convolutional Networks and Dynamic Time Warping can Drastically Improve the Early Prediction of Sepsis
Michael Moor, Max Horn, Bastian Rieck, Damian Roqueiro and Karsten Borgwardt (Submitted on 7 Feb 2019)
https://arxiv.org/abs/1902.01659
https://osf.io/av5yx/?view_only=a6e3442634b34d53ba6e59c4a956b318
For future work, we aim to extend our analysis to more types of data
sources arising from the ICU. Futoma et al. (2017b) already
employed a subset of baseline covariates, medication effects, and
missingness indicator variables. However, a multitude of feature
classes still remain to be explored and properly integrated. For
instance, the combination of sequential and non-sequential
features has previously been handled by feeding non-sequential
data into the sequential model (Futoma et al.,2017a).
We hypothesize that this could be handled more efficiently by
using a more modular architecture that incorporates both
sequential and non-sequential parts. Furthermore, we aim to obtain
a better understanding of the time series features utilized by the
model. Specifically, we are interested in assessing the
interpretability of the learned filters of the MGPTCN framework
and evaluate how much the activity of an individual filter contributes
to a prediction. This endeavor is somewhat facilitated by our use of a
convolutional architecture. The extraction of short per-channel
signals could prove very relevant for supporting diagnoses made by
clinical practitioners.
Overview of our model. The raw, irregularly spaced time series are provided to the Multi-task Gaussian Process
(MGP) patient by patient. The MGP then draws from a posterior distribution (given the observed data) at evenly
spaced grid times (each hour). This grid is then fed into a temporal convolutional network (TCN) which after a forward
pass returns a loss. Its gradient is then computed by backpropagating through the computational graph including
both the TCN and the MGP (green arrows). Both the MGP and TCN parameters are learned end-to-end during
training.
We evaluate all methods using Area under the Precision–Recall Curve
(AUPRC) and additionally display the (less informative) Area under the
Receiver Operator Characteristic (AUC). The current state-of-the-art
method, MGP-RNN, is shown in blue. The two approaches for early
detection of sepsis that were introduced in this paper, i.e. MGP-TCN and
DTW-KNN ensemble, are shown in pink and red, respectively. By using three
random splits for all measures and methods, we depict the mean (line) and
standard deviation error bars (shaded area).
Clinical notes and text report understanding: Words as the sequences
Structuring Clinical Text
Comparative effectiveness of convolutional neural network (CNN) and recurrent neural network (RNN) architectures for radiology text report classification (2018)
https://doi.org/10.1016/j.artmed.2018.11.004
Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
This paper explores cutting-edge deep learning methods for
information extraction from medical imaging free text
reports at a multi-institutional scale and compares them to the
state-of-the-art domain-specific rule-based system – PEFinder
andtraditionalmachinelearning methods– SVMandAdaboost.
Visualization methods have been developed to identify the impact of input words on the output decision for both deep learning models.
Domain Phrase Attention-based Hierarchical Neural Network (DPA-HNN) architecture.
Clinical Text + Images
Unsupervised Multimodal Representation Learning across Medical Images and Reports
(Machine Learning for Health (ML4H) Workshop at NeurIPS 2018)
https://arxiv.org/abs/1811.08615 MIT CSAIL
associated radiology reports have the potential to offer
significant benefits to the clinical community, ranging from cross-
domain retrieval to conditional generation of reports to the
broader goals of multimodal representation learning. In this work,
we establish baseline joint embedding results measured via both
local and global retrieval methods on the soon to be released
MIMIC-CXR dataset consisting of both chest X-ray images and
the associated radiology reports.
We establish baseline results using supervised and unsupervised joint embedding
methods along with local (direct pairs) and global (ICD-9 code groupings) retrieval
evaluation metrics. Results show a possibility of incorporating more unsupervised data
into training for minimal-effort performance increase. A further study of joint embeddings between these modalities may enable significant applications, such as text/image generation or the incorporation of other EMR modalities.
Electronic Health Records
Visits as sequences; each sequence can contain 1D biosignals
EHR Mining: Risk Prediction model
Risk Prediction on Electronic Health Records with Prior Medical Knowledge (2018)
https://doi.org/10.1145/3219819.3220020
We propose a novel and general framework called PRIME for
risk prediction task, which can successfully incorporate
discrete prior medical knowledge into all of the state-of-the-
art predictive models using posterior regularization technique.
Different from traditional posterior regularization, we do not need
to manually set a bound for each piece of prior medical
knowledge when modeling desired distribution of the target
disease on patients. Moreover, the proposed PRIME can
automatically learn the importance of different prior knowledge
with a log-linear model.
The limitation of this work is that the proposed PRIME is only
effective for common diseases. For rare and emerging
diseases, since there is little medical knowledge about them, it
is hard to incorporate any prior knowledge into deep learning
predictive models. Thus, the proposed PRIME may achieve
similar performance to the state-of-the-art baselines. In our
future work, we will focus on how to improve predictive
performance of risk prediction for rare diseases.
Preprocessing: Cleaning
Intro to cleaning
In the preprocessing component, the main purpose is to clean the data, filter the unusual points and make it suitable as the input to the CNN. Besides the normal steps including timestamp alignment, normalization and missing data imputation for time series data with trend, the most important operation to improve the data quality is the outlier detection, interpolation and filtering, in particular for clinical data. Because in the clinical data of glucose time series, there are many missing or outlier data points due to errors in calibration, measurements, and/or mistakes in the process of data collection and transmission. Here, several methods are introduced to handle these scenarios [36].
● Dimension Reduction Model: the time series can be projected into lower dimensions using linear correlations such as principal component analysis (PCA), and data with large residual errors can be considered as outliers (see the sketch after this list).
● Proximity-based Model: the data are determined by nearest neighbour analysis, cluster or density. Thus the data instances that are isolated from the majority are considered as outliers.
● Probabilistic Stochastic Filters: different filters for the signals, such as Gaussian mixture models optimized using expectation-maximization. In our case the filter can be implemented before the CNN, due to the continuous characteristic of the input glycaemic time series data.
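A minimal sketch of the first bullet (an assumption, not the cited work's code): project sliding windows of the series onto a few principal components and flag windows with large reconstruction residuals as outliers.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_outlier_scores(series, window=24, n_components=3):
    # Slide a window over the series to build a (n_windows, window) matrix.
    W = np.lib.stride_tricks.sliding_window_view(series, window)
    pca = PCA(n_components=n_components).fit(W)
    recon = pca.inverse_transform(pca.transform(W))
    return np.linalg.norm(W - recon, axis=1)             # residual per window

rng = np.random.default_rng(0)
glucose = np.sin(np.linspace(0, 20, 500)) + 0.05 * rng.normal(size=500)
glucose[250] += 4.0                                       # inject one outlier
scores = pca_outlier_scores(glucose)
print(scores.argmax())                                    # window containing the spike
```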
A convolutional neural network for ECG annotation as the basis for classification of cardiac rhythms
Philipp Sodmann et al 2018 Physiol. Meas. in press
https://doi.org/10.1088/1361-6579/aae304
Signal cleaning:
In the data preprocessing, we performed resampling and signal denoising. We resampled all ECGs to 300 Hz using the fast Fourier transform in order to pass ECG segments of equal length onto the CNN.
To filter noisy components in the signal such as baseline wandering, respiration effects, or powerline interference, we applied a discrete wavelet transform (DWT) which works as a band-pass filter. For this, we used the Daubechies wavelet transform (Db4).
Before re-composition, each coefficient of the transform was multiplied by a factor according to tabulated values. Afterwards, a 15%-trimmed mean with a window size of 33 samples was applied to remove the persistent baseline.
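A minimal sketch of this pipeline (assuming scipy and PyWavelets; the zeroed coefficient bands are illustrative, whereas the paper rescales coefficients by tabulated factors): FFT-based resampling to 300 Hz followed by Db4 wavelet band-pass filtering.

```python
import numpy as np
import pywt
from scipy.signal import resample

def preprocess_ecg(ecg, fs_in, fs_out=300):
    x = resample(ecg, int(len(ecg) * fs_out / fs_in))      # FFT-based resampling
    coeffs = pywt.wavedec(x, "db4", level=8)
    coeffs[0] = np.zeros_like(coeffs[0])                   # damp baseline wander
    coeffs[-1] = np.zeros_like(coeffs[-1])                  # damp high-frequency noise
    return pywt.waverec(coeffs, "db4")[:len(x)]

ecg = np.random.default_rng(0).normal(size=5000)            # stand-in for a real ECG
clean = preprocess_ecg(ecg, fs_in=500)
print(len(clean))                                            # samples at 300 Hz
```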
https://doi.org/10.3389/fnins.2013.00267
MEG and EEG data analysis with MNE-Python
Preprocessing: Transformations
Time Series Invariances
A complexity-invariant distance measure for time series
https://doi.org/10.1137/1.9781611972818.60
Gustavo E A P A Batista, Xiaoyue Wang, and Eamonn J Keogh.
In Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pages 699–710. SIAM, 2011. Cited by 216
Time Series: DTW, the classical method
https://doi.org/10.1145/2888451.2888456
Stock Price Prediction with Fluctuation Patterns Using Indexing Dynamic Time Warping and k*-Nearest Neighbors
Kei Nakagawa, Mitsuyoshi Imamura, Kenichi Yoshida (2018)
https://doi.org/10.1007/978-3-319-93794-6_7
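A minimal sketch (not from the cited papers) of classical dynamic time warping between two univariate series via the standard O(nm) dynamic program; phase-shifted but similarly shaped series obtain a much smaller DTW cost than a plain point-wise distance.

```python
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

t = np.linspace(0, 2 * np.pi, 80)
x, y = np.sin(t), np.sin(t + 0.5)                # same shape, phase-shifted
print(dtw_distance(x, y), np.abs(x - y).sum())   # DTW cost < point-wise distance
```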
Learning invariances #1a
Learning to Exploit Invariances in Clinical Time-Series Data using Sequence Transformer Networks
Jeeheh Oh, Jiaxuan Wang, Jenna Wiens (Submitted on 21 Aug 2018)
https://arxiv.org/abs/1808.06725
Recently, researchers have started applying convolutional neural
networks (CNNs) with 1D convolutions to clinical tasks
involving time-series data. This is due, in part, to their
computational efficiency, relative to recurrent neural networks
and their ability to efficiently exploit certain temporal invariances (e.g., phase invariance).
However, it is well-established that clinical data may exhibit many
other types of invariances (e.g., scaling). While preprocessing
techniques, (e.g., dynamic time warping) may successfully
transform and align inputs, their use often requires one to identify
the types of invariances in advance.
In contrast, we propose the use of Sequence Transformer
Networks, an end-to-end trainable architecture that learns to
identify and account for invariances in clinical time-series data.
Applied to the task of predicting in-hospital mortality, our proposed approach achieves an improvement in the AUROC.
To address these challenges, we propose Sequence Transformer Networks, an approach for learning task-specific invariances related to amplitude, offset, and scale invariances directly from the data. Applied to clinical time-series data, Sequence Transformer Networks learn input- and task-dependent transformations. In contrast to data augmentation approaches, our proposed approach makes limited assumptions about the presence of invariances in the data.
Learning invariances #1b
Learning to Exploit Invariances in Clinical Time-Series Data using Sequence Transformer Networks
Jeeheh Oh, Jiaxuan Wang, Jenna Wiens (Submitted on 21 Aug 2018)
https://arxiv.org/abs/1808.06725
The proposed approach is not without limitation. More specifically, in its current form the Sequence Transformer applies the same transformation across all features within an example, instead of learning feature-specific transformations. Despite this limitation, the learned transformations still lead to an increase in intra-class similarity. In conclusion, we are encouraged by these preliminary results. Overall, this work represents a starting point on which others can build. In particular, we hypothesize that the ability to capture local invariances and feature-specific invariances could lead to further improvements in performance.
Learning invariances #2
Autowarp: Learning a Warping Distance from Unlabeled Time Series Using Sequence Autoencoders
Abubakar Abid, James Zou, Stanford University (Submitted on 23 Oct 2018)
https://arxiv.org/abs/1810.10107
Domain experts typically hand-craft or manually select a specific metric, such as dynamic time
warping (DTW), to apply on their data. In this paper, we propose Autowarp, an end-to-end
algorithm that optimizes and learns a good metric given unlabeled trajectories.
We define a flexible and differentiable family of warping metrics, which encompasses common
metrics such as DTW, Euclidean, and edit distance. Autowarp then leverages the representation
power of sequence autoencoders to optimize for a member of this warping distance
family. The output is a metric which is easy to interpret and can be robustly learned from relatively
few trajectories.
Future work will extend these results to more challenging time series data, such as those with higher dimensionality or heterogeneous data.
Learning invariances #3
NeuralWarp: Time-Series Similarity with Warping Networks
Josif Grabocka, Lars Schmidt-Thieme (Submitted on 20 Dec 2018)
https://arxiv.org/abs/1812.08306
In this paper we propose to learn a warping function for aligning the indices of time series in a deep latent representation. We compared the suggested architecture with two types of encoders (CNN, or RNN) and a deep forward network as a warping function. Experimental comparisons to non-parametric and un-warped Siamese networks demonstrated that the proposed elastic deep similarity measure is more accurate than prior models.
Preprocessing: Class Imbalances
SMOTE for imbalanced classes
SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification
Progress in Artificial Intelligence, December 2017, Volume 6, Issue 4, pp 347–354
https://doi.org/10.1007/s13748-017-0128-2
Considering a binary problem with a majority class and a minority class, it is likely that a learning algorithm ignores the latter and still achieves a high accuracy. There are three main ways of dealing with these situations [16]:
● Algorithmic modification: Modifying learning algorithms in order to tackle the problem by design.
● Cost-sensitive learning: Introducing costs for misclassification of the minority class at data or algorithmic level.
● Data sampling: Preprocessing the data in order to reduce the breach between the number of instances of each class.
The SMOTE technique is based on the idea of neighborhood of the k-nearest neighbor (kNN) rule.
The area under the ROC curve results show that the use of
oversampling methods improves the detection of the minority
class in Big Data datasets. We have also shown how our design can
successfully work on a wide range of devices, including a laptop,
while requiring reasonable times, around 25 min on high-end devices,
and less than 2 h on the laptop, for the most time-demanding
experiment.
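A minimal usage sketch (assuming the imbalanced-learn package): SMOTE synthesizes minority-class examples along segments between each minority point and its k nearest minority neighbours, before a classifier is trained on features extracted from the time series.

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                               # feature vectors
y = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]   # 95/5 class imbalance

X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))   # minority class synthetically balanced
```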
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary (2018)
https://doi.org/10.1613/jair.1.11192
● GS4 (Moutafis & Kakadiaris, 2014), SEG-SSC (Triguero et al., 2015) and OCHS-SSC (Dong et al., 2016) generate synthetic examples to diminish the drawbacks produced by the absence of labeled examples. Several learning techniques were checked and some properties such as the common hidden space between labeled samples and the synthetic sample were exploited.
● The technique proposed by Park et al. (2014) is a semi-supervised active learning method in which labels are incrementally obtained and applied using a clustering algorithm.
In the context of current challenges outlined, we highlighted the need for enhancing the treatment of small disjuncts, noise, lack of data, overlapping, dataset shift and the curse of dimensionality. To do so, the theoretical properties of SMOTE regarding these data characteristics, and its relationship with the new synthetic instances, must be further analyzed in depth. Finally, we also posited that it is important to focus on data sampling and pre-processing approaches (such as SMOTE and its extension) within the framework of Big Data and real-time processing.
Outlier detection: What to impute?
Types of Anomalies
global anomalies (x1, x2), local anomaly x3, micro-cluster c3
A simple two-dimensional example
"This simple example already illustrates that anomalies are not always obvious and a score is much more useful than a binary label assignment."
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data (2016)
Markus Goldstein, Seiichi Uchida
https://doi.org/10.1371/journal.pone.0152173
Three types of anomaly schemes:
● point anomaly detection
● collective anomaly
● contextual anomalies
State-of-the-art: 2 years old cutting edge #1
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data (2016)
Markus Goldstein, Seiichi Uchida
https://doi.org/10.1371/journal.pone.0152173
Dozens of algorithms have been proposed in this area, but unfortunately
the research community still lacks a comparative universal evaluation as
well as common publicly available datasets.
These shortcomings are addressed in this study, where 19 different
unsupervised anomaly detection algorithms are evaluated on 10
different datasets from multiple application domains.
By publishing the source code and the datasets, this paper aims to
be a new well-funded basis for unsupervised anomaly detection
research. Additionally, this evaluation reveals the strengths and
weaknesses of the different approaches for the first time.
As a general summary for algorithm selection, we recommend to use
nearest-neighbor based methods, in particular k-NN for global tasks
and LOF for local tasks instead of clustering-based methods. If
computation time is essential, HBOS is a good candidate, especially for
larger datasets. A special attention should be paid to the nature of the dataset when applying local algorithms, and if local anomalies are of interest at all in this case.
Different anomaly detection modes
dependingon the availability of labels
in the dataset.
(a) Supervised anomaly detection uses a
fully labeled dataset for training. (b) Semi-
supervised anomaly detection uses an
anomaly-free training dataset. Afterwards,
deviations in the test data from that normal
model are used to detect anomalies. (c)
Unsupervised anomaly detection
algorithms use only intrinsic information of
the data in order to detect instances
deviating from the majority of the data.
State-of-the-art: 2 years old cutting edge #2
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data (2016). Markus Goldstein, Seiichi Uchida
https://doi.org/10.1371/journal.pone.0152173
A visualization of the results of the k-NN global
anomaly detection algorithm. The anomaly score is
represented by the bubble size whereas the color shows the
labels of the artificially generated dataset.
Comparing Influenced Outlierness (INFLO) with Local Outlier Factor (LOF) shows the usefulness of the reverse neighborhood set.
For the red instance, LOF takes only the neighbors in the gray
area into account resulting in a high anomaly score. INFLO
additionally takes the blue instances into account (reverse
neighbors) and thus scores the red instance more normal.
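A minimal sketch (an assumption, not the paper's code) of the k-NN global anomaly score used above: each point's mean distance to its k nearest neighbours, where large scores correspond to the larger bubbles in the described figure.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_anomaly_scores(X, k=10):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)       # +1: the query point itself is returned
    dist, _ = nn.kneighbors(X)
    return dist[:, 1:].mean(axis=1)                       # drop the zero self-distance

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(300, 2)),                 # normal cluster
               rng.normal(6, 0.3, size=(5, 2))])          # small anomalous micro-cluster
scores = knn_anomaly_scores(X)
print(np.argsort(scores)[-5:])                            # indices of the strongest outliers
```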
Anomaly detection: Cyber-physical systems
Anomaly Detection with Generative Adversarial Networks for Multivariate Time Series (2018)
Dan Li, Dacheng Chen, Jonathan Goh, and See-Kiong Ng
Institute of Data Science, National University of Singapore
https://arxiv.org/abs/1809.04758
Unsupervised machine learning techniques can be used to model the system behaviour and classify deviant behaviours as possible attacks. In this work, we proposed a novel Generative Adversarial Networks-based Anomaly Detection (GAN-AD) method for such complex networked CPSs. We used LSTM-RNN in our GAN to capture the distribution of the multivariate time series of the sensors and actuators under normal working conditions of a CPS.
Instead of treating each sensor's and actuator's time series independently, we model the time series of multiple sensors and actuators in the CPS concurrently to take into account potential latent interactions between them.
To exploit both the generator and the discriminator of our GAN, we deployed the GAN-trained discriminator together with the residuals between generator-reconstructed data and the actual samples to detect possible anomalies in the complex CPS.
We will also conduct further research on feature selection for multivariate anomaly detection, and investigate principled methods for choosing the latent dimension and PC dimension with theoretical guarantees.
Anomaly detection: Financial time series
Modeling approaches for time series forecasting and anomaly detection (2018)
Du, Shuyang; Pandey, Madhulima; Xing, Cuiqun
http://cs229.stanford.edu/proj2017/final-reports/5244275.pdf
This project focuses on prediction of time series data for Wikipedia
page accesses for a period of over twenty-four months. The methods
explored here are K-nearest neighbors (KNN), Long short-term memory
network (LSTM), and Sequence to Sequence with Convolution Neural
Network (CNN) and we will compare predicted values to actual web traffic.
The predictions can help us in anomaly detection in the series.
Pre-processing: "There are many series in which values are zero. This could be a missing value, or actual lack of web page access. In addition, there are significant spikes in the data, where values have a broad range from 1 to hundreds/thousands for several web pages. We normalize this data by adding 1 to all entries, taking the log of the values, and setting the mean to zero and variance to one. We have the results of Fourier analysis for exploring periodicity on a weekly/monthly/quarterly basis."
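A minimal sketch of the normalization just described: log-transform the counts (adding 1 to handle zeros) and standardize each series.

```python
import numpy as np

def normalize_views(views):
    logv = np.log(views + 1.0)                 # equivalently np.log1p(views)
    return (logv - logv.mean()) / (logv.std() + 1e-8)

traffic = np.array([0, 3, 10, 2500, 7, 0, 12], dtype=float)
print(normalize_views(traffic))               # zero mean, unit variance
```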
Our approaches to time series prediction depends on features extracted
from the the time series data itself. Our models learn periodicity, ramp and
other regular trends quite well. However, none of our models are able to
capture spikes or outliers that arise from external sources. Enhancing
the performance of the models will require augmenting our feature set from
other sources such as news events and weather.
"Special Outliers": Disguised missing values
FAHES: A Robust Disguised Missing Values Detector
Qatar Computing Research Institute, HBKU, Doha, Qatar
https://doi.org/10.1145/3219819.3220109
Missing values are common in real-world data and may
seriously affect data analytics such as simple statistics
and hypothesis testing. Generally speaking, there are
two types of missing values: explicitly missing
values (i.e. NULL values), and implicitly missing values
(a.k.a. disguised missing values (DMVs)) such as
"11111111" for a phone number and "Some college" for
education. While detecting explicitly missing values is
trivial, detecting DMVs is not; the essential challenge is
the lack of standardization about how DMVs are
generated.
One future work we are planning to perform is to improve FAHES to detect the DMVs that are generated randomly within the range of the data. For example, when a child tries to create an account on a domain that has a minimum age restriction, the child fakes their age with a random value that allows them to create the account. Such random fake values are hard, if not impossible, to detect. Moreover, although DMVs are the focus of this paper, more types of errors are found in the wild. Many of the principles and techniques we have used to detect DMVs can be leveraged to detect other types of errors, so a natural next step is to extend the infrastructure we have built to detect those. This opens new challenges related to the robust identification of errors that could be interpreted differently by different modules.
Deep Learning Outlier Detection: overview
Uncertainty and Novelty detection #1a
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018)
Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729
What makes this problem different from a typical supervised learning setting is that we cannot model the diversity of out-of-distribution samples in
practice. The distribution of outliers used in training may not be the same as
the distribution of outliers encountered in the application. Therefore,
classical approaches that learn inliers vs. outliers with only two datasets
can yield optimistic results. We introduce OD-test, a three-dataset
evaluation scheme as a practical and more reliable strategy to assess
progress on this problem. The OD-test benchmark provides a
straightforward means of comparison for methods that address the out-of-distribution sample detection problem.
In real-life deployment of products that use complex machinery such as deep neural networks (DNNs), we would have very little control over the input. In the absence of extrapolation guarantees, when the independently and identically distributed (IID) assumption is violated, the behaviour of the pipeline may be unpredictable. From a quality assurance
perspective, it is desirable to detect and prevent these scenarios
automatically.
A reliable pipeline would first determine whether it can process a
given sample, then it would use the prediction of the target neural
network. The unfortunate incident that mislabeled people as non-human, for instance, is a clear example of OOD extrapolation that could have been prevented by such a decision scheme: the model simply did not know that it did not know. While incidents of similar nature have fueled research on de-biasing the datasets and the deep learning machinery, we still would need to identify the limitations of our models.
The application is not limited to fortifying large-scale user-facing products. Successful detection of such violations could
also be used in active learning, unsupervised learning, learning with
noisy data, or simply be a condition to invoking transfer learning
strategies. In this work, we are interested in evaluating mechanisms
that detect OOD samples.
Uncertainty and Novelty detection #1b
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018) Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
The Uncertainty View. A commonly invoked strategy in addressing similar problems is to characterize a notion of uncertainty.
The literature distinguishes aleatoric uncertainty, the uncertainty inherent
to the process (the known unknowns, like flipping a coin), from epistemic
uncertainty, the uncertainty that can be eliminated with more information
(the unknown unknowns). The Bayesian approach to epistemic
uncertainty estimation is to measure the degree of disagreement among
the potentially viable models (the posterior).
The MC-Dropout approach is often advertised as a feasible method to estimate uncertainty for a variety of applications. Similarly, we can adopt a non-Bayesian approach by training independent models and then measuring the disagreement. Lakshminarayanan et al. show an ensemble of five neural networks (DeepEnsemble) that are trained with an adversarial sample-augmented strategy is sufficient to provide a non-Bayesian alternative to capturing predictive uncertainty. We evaluate DeepEnsemble and MC-Dropout.
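A minimal sketch of MC-Dropout (an assumption, not the paper's setup): keep dropout stochastic at test time and measure disagreement across several forward passes; a deep ensemble would instead average over independently trained models.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5),
                      nn.Linear(64, 3))

def mc_dropout_predict(model, x, n_samples=50):
    model.train()                                   # keep dropout active at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    return probs.mean(0), probs.var(0)              # predictive mean and disagreement

x = torch.randn(4, 20)
mean_p, var_p = mc_dropout_predict(model, x)
print(mean_p.argmax(-1), var_p.max(-1).values)      # predictions and their uncertainty
```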
* The Abstention View
* The Anomaly View: AEThreshold, PixelCNN++, K-NNSVM
* The Novelty View: OpenMax
We train these architectures with a cross-entropy loss (CE), and a k-way logistic
regression loss (KL). CE loss is the typical choice for k-way classification tasks – it enforces
mutual exclusion in the predictions. KL loss is the typical choice for attribute prediction tasks –
it does not enforce mutual exclusivity of the predictions.
We test these two loss functions to see if the exclusivity assumption of CE has an adverse effect
on the ability to predict OOD samples. CE loss cannot make a None prediction without an
explicitly defined None class, but KL loss can make None predictions through low activations of
all the classes.
Uncertainty and Novelty detection #1c
VGG-backed and Resnet-backed methods significantly differ in accuracy. The gap indicates the sensitivity of the methods to the underlying networks.
This means that the image classification accuracy may not be the only relevant factor in performance of these methods. ODIN is less sensitive to the underlying network.
Despite not enforcing mutual exclusivity, training the networks with KL loss instead of CE loss consistently reduces the accuracy of OOD detection methods on average.
Uncertainty and Novelty detection #1d
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of
“Outlier” Detectors (2018) Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test [PyTorch]
Related work in deep learning can be categorized into two broad groups based on the underlying assumptions:
(i) in-distribution techniques, and (ii) out-of-distribution techniques.
Guo et al. (2017) observed that
modern neural networks tend to
be overconfident in their
predictions. They show that
temperature scaling in the
softmax operator, also known as
Platt scaling, can be used to
calibrate the output probabilities of
a neural network to empirically
align the accuracy of a prediction
with its probability. Their efforts fall
under the uncertainty estimation
approaches.
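A small sketch of temperature (Platt) scaling as summarized above: a single scalar T is fitted on a held-out validation set to minimize the negative log-likelihood of temperature-scaled logits, which calibrates the softmax probabilities without changing the predicted class. PyTorch, with hypothetical `val_logits`/`val_labels` tensors.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    """Learn a single scalar T > 0 that minimizes NLL on a validation set."""
    log_t = torch.zeros(1, requires_grad=True)     # T = exp(log_t) stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# At test time the calibrated probabilities are softmax(logits / T):
# T = fit_temperature(val_logits, val_labels)
# calibrated = F.softmax(test_logits / T, dim=-1)
```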
Geifman and El-Yaniv (2017)
present a framework for selective
classification with deep neural
networks that follows the
abstention view. A selection
function decides whether to
make a prediction or not. For
the choice of selection function,
they experiment with MC-Dropout
and the softmax output. They
provide an analytical trade-off
between risk and coverage within
their formulation.
Input perturbation serves as a way to assess how the network would behave near the given
input. When the temperature is 1 and the perturbation step is 0 we simply recover the
PbThreshold method. ODIN, the state-of-the-art at the time of this writing, is reported to
outperform the previous work [8] by a significant margin. We also assess the performance of ODIN
in our work.
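A rough sketch of the ODIN scoring rule as summarized above: temperature-scale the logits, nudge the input by a small step in the direction that increases the top softmax probability, then threshold the resulting maximum probability. PyTorch; the step size, temperature, and decision threshold are illustrative assumptions rather than tuned values.

```python
import torch
import torch.nn.functional as F

def odin_score(model, x, temperature=1000.0, epsilon=0.0014):
    """ODIN-style OOD score: input perturbation + temperature-scaled max softmax."""
    x = x.clone().requires_grad_(True)
    logits = model(x) / temperature
    # Perturb the input so that the top-class softmax probability increases
    top_logprob = F.log_softmax(logits, dim=-1).max(dim=-1).values.sum()
    top_logprob.backward()
    x_perturbed = x + epsilon * x.grad.sign()
    with torch.no_grad():
        probs = F.softmax(model(x_perturbed) / temperature, dim=-1)
    return probs.max(dim=-1).values   # higher score => more likely in-distribution
```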
These methods provide an abstract idea which depends on the successful training of GANs. To
the best of our knowledge, training GANs is itself an active area of research, and it is not apparent
what design decisions would be appropriate to implement these ideas in practice. Furthermore,
some of these ideas are prohibitively expensive to execute at the time of this writing.
Uncertainty and Novelty detection #1e
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of
“Outlier” Detectors (2018) Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
Datasets.
We extend the previous work by evaluating over a broader set
of datasets with varying levels of complexity. The
variation in complexity allows for a fine-grained evaluation of
the techniques. Since OOD detection is closely related to the
problem of density estimation, the dimensionality of the
input image will be of vital importance in practical
assessments. As the input dimensionality increases, we
expect the task to become much more difficult.
Therefore, to provide a more accurate picture of performance,
it is crucial to evaluate the methods on high-dimensional data.
MC-Dropout
In low-dimensional
datasets, K-NNSVM performs
similarly or better
than the other
methods.
The top-performing method, ODIN, is influenced by the
number of classes in the dataset. Similar to PbThreshold, ODIN
depends on the maximum signal in the class predictions,
therefore the increased number of classes would directly affect
both of the methods. Furthermore, neither of them consistently
prefers VGG over Resnet within all datasets. Overall, ODIN
consistently outperforms others in high-dimensional
settings, but all the methods have a relatively low average
accuracy in the 60%-78% range.
Uncertainty and Novelty detection #1f
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of
“Outlier” Detectors (2018) Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
Uncertainty and Novelty detection #2
To Trust Or Not To Trust A Classifier
Heinrich Jiang, Been Kim, Maya Gupta (2018)
Google Research; Google Brain
https://arxiv.org/abs/1805.11783
We propose a new score, called the trust
score, which measures the agreement
between the classifier and a modified
nearest-neighbor classifier on the testing
example. We show empirically that high
(low) trust scores produce surprisingly high
precision at identifying correctly (incorrectly)
classified examples, consistently
outperforming the classifier’s confidence
score as well as many other baselines.
Two example datasets and models. Predicting correctness (top row) and
incorrectness (bottom). The vertical dotted black line indicates accuracy level of the
classifier. The trust score consistently attains a higher precision for each given percentile
of classifier decision-rejection. Furthermore, the trust score generally shows increasing
precision as the percentile level increases, but surprisingly, many of the comparison
baselines do not.
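A simplified sketch of the trust-score idea described above (my reading of the paper, not its reference implementation): for a test point, compare the distance to the nearest training example of the predicted class with the distance to the nearest training example of any other class; a large ratio indicates the classifier's prediction agrees with a nearest-neighbor view of the data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def trust_scores(X_train, y_train, X_test, y_pred):
    """Ratio of distance-to-nearest-other-class over distance-to-predicted-class."""
    classes = np.unique(y_train)
    nn_per_class = {c: NearestNeighbors(n_neighbors=1).fit(X_train[y_train == c])
                    for c in classes}
    # distance from each test point to its nearest neighbour in every class
    d = np.stack([nn_per_class[c].kneighbors(X_test)[0][:, 0] for c in classes], axis=1)
    idx_pred = np.searchsorted(classes, y_pred)
    d_pred = d[np.arange(len(X_test)), idx_pred]
    d_other = np.where(np.arange(len(classes)) == idx_pred[:, None], np.inf, d).min(axis=1)
    return d_other / (d_pred + 1e-12)   # higher = more trustworthy prediction
```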
Uncertainty and Novelty detection #3
Interpreting Neural Networks With Nearest
Neighbors
Eric Wallace, Shi Feng, Jordan Boyd-Graber
https://arxiv.org/abs/1809.02847
Local model interpretation methods explain individual
predictions by assigning an importance value to each
input feature. This value is often determined by
measuring the change in confidence when a feature is
removed. However, the confidence of neural networks is
not a robust measure of model uncertainty.
This issue makes reliably judging the importance of the
input features difficult. We address this by changing
the test-time behavior of neural networks using
Deep k-Nearest Neighbors. Without harming text
classification accuracy, this algorithm provides a more
robust uncertainty metric which we use to generate
feature importance values.
The resulting interpretations better align with human
perception than baseline methods. Finally, we use our
interpretation method to analyze model predictions on
dataset annotation artifacts.
Deep k-Nearest Neighbors: Towards Confident,
Interpretable and Robust Deep Learning
Nicolas Papernot and Patrick D. McDaniel (2018)
https://arxiv.org/abs/1803.04765
Debugging ResNet model biases—This illustrates how the
DkNN algorithm helps to understand a bias identified by Stock and
Cisse [105] in the ResNet model for ImageNet. The image at the
bottom of each column is the test input presented to the DkNN.
Each test input is cropped slightly differently to include (left) or
exclude (right) the football. Images shown at the top are nearest
neighbors in the predicted class according to the representation
output by the last hidden layer. This comparison suggests that the
“basketball” prediction may have been a consequence of the ball
being in the picture. Also note how the white apparel color and
general arm positions of players often match the test image of
Barack Obama.
Uncertainty and Novelty detection #4
AND: Autoregressive Novelty Detectors
Davide Abati, Angelo Porrello, Simone Calderara, Rita Cucchiara
(Submitted on 4 Jul 2018)
https://arxiv.org/abs/1807.01653
We propose an unsupervised model for novelty
detection. The subject is treated as a density estimation
problem, in which a deep neural network is employed to learn a
parametric function that maximizes probabilities of training
samples. This is achieved by equipping an autoencoder with a
novel module, responsible for the maximization of
compressed codes' likelihood by means of autoregression. We
illustrate design choices and proper layers to perform
autoregressive density estimation when dealing with both
image and video inputs. Despite a very general formulation, our
model shows promising results in diverse one-class novelty
detection and video anomaly detection benchmarks.
The structure of the proposed autoencoder. Paired with a standard compression-reconstruction
network, a density estimation module learns the distribution of latent codes, via autoregression.
Anomaly detection with GANs #1
Anomaly detection with Wasserstein GAN
Ilyass Haloui, Jayant Sen Gupta, and Vincent Feuillard
(Submitted on 11 Dec 2018)
https://arxiv.org/pdf/1812.02463
In this paper, we investigate GANs to perform anomaly detection on
a time series dataset. In order to achieve this goal, a bibliography is
made focusing on theoretical properties of GANs and GANs used for
anomaly detection. A Wasserstein GAN has been chosen to learn the
representation of the normal data distribution, and an encoder stacked with
the generator performs the anomaly detection. W-GAN with an encoder
seems to produce state-of-the-art anomaly detection scores on the MNIST
dataset and we investigate its usage on multivariate time series.
Based on this literature review, we chose to perform anomaly detection
using a Wasserstein Generative Adversarial Network. The main
reason is that the Wasserstein GAN does not collapse, contrary to the
classical GAN, which needs to be heavily tuned in order to avoid this
problem. Mode collapse can be blocking if we need to perform
anomaly detection: if a subset of our data distribution is not learned by the
generator, then all samples that are similar to this subset might end up
classified as abnormal. Another added value of the Wasserstein GAN
version compared to a standard GAN is the possibility of using the loss
function of the discriminator to evaluate convergence, since it is an
approximation of the Wasserstein distance between P_r and P_θ.
A future improvement consists of considering CNNs for both
the generator and discriminator in order to detect anomalies from
raw time series data. 1-D convolutions are needed and will be
investigated to produce good visual representations of time
series samples. A more thorough study of the impact of the
architecture should also be done.
Anomaly detection with GANs #2
MAD-GAN: Multivariate Anomaly Detection for Time Series
Data with Generative Adversarial Networks
Dan Li, Dacheng Chen, Lei Shi, Baihong Jin, Jonathan Goh, and See-Kiong Ng
(Submitted on 15 Jan 2019) Institute of Data Science, National University of Singapore
https://arxiv.org/abs/1901.04997
In this work, we propose a novel Multivariate Anomaly Detection
strategy with GAN (MAD-GAN) to model the complex multivariate
correlations among the multiple data streams to detect
anomalies using both the GAN-trained generator and discriminator.
Unlike traditional classification methods, the GAN-trained discriminator
learns to detect fake data from real data in an unsupervised fashion,
making it an attractive unsupervised machine learning technique for
anomaly detection.
Given that this is an early attempt at multivariate anomaly detection on
time series data using GAN, there are interesting issues that await further
investigation. For example, we have noted the issues of determining the
optimal subsequence length as well as the potential model instability of
the GAN approaches.
For future work, we plan to conduct further research on feature
selection for multivariate anomaly detection, and investigate principled
methods for choosing the latent dimension and PC dimension
with theoretical guarantees. We also hope to perform a detailed study on
the stability of the detection model. In terms of applications, we plan to
explore the use of MAD-GAN for other anomaly detection applications
such as predictive maintenance and fault diagnosis for smart buildings
and machinery.
Uncertainty Insights from NLP
Quantifying Uncertainties in Natural Language
Processing Tasks
Yijun Xiao and William Yang Wang (Submitted on 18 May 2018)
https://arxiv.org/abs/1811.07253
In this paper, we propose novel methods to study the
benefits of characterizing model and data
uncertainties for natural language processing (NLP)
tasks. With empirical experiments on sentiment analysis,
named entity recognition, and language modeling using
convolutional and recurrent neural network models, we
show that explicitly modeling uncertainties is not only
necessary to measure output confidence levels, but also
useful for enhancing model performance in various
NLP tasks.
1. We mathematically define model and data
uncertainties via the law of total variance;
2. Our empirical experiments show that by accounting
for model and data uncertainties, we observe
significant improvements in three important NLP tasks;
3. We show that our model outputs higher data
uncertainties for more difficult predictions in sentiment
analysis and named entity recognition tasks.
Uncertainty CNNs + Gaussian Processes
Calibrating Deep Convolutional Gaussian Processes
Gia-Lac Tran, Edwin V. Bonilla, John P. Cunningham, Pietro Michiardi, Maurizio
Filippone. (Submitted on 26 May 2018)
https://arxiv.org/abs/1805.10522
Despite the considerable interest in combining CNNs
with GPs, little attention has been devoted to
understand the implications in terms of the ability of
these models to accurately quantify the level of
uncertainty in predictions.
This is the first work that highlights the issues of
calibration of these models, showing that GPs cannot
cure the issues of miscalibration in CNNs. We
have proposed a novel combination of CNNs and GPs
where the resulting model becomes a particular form of
a Bayesian CNN for which inference using variational
inference is straightforward.
However, our results also indicate that combining CNNs
and GPs does not significantly improve the
performance of standard CNNs. This can serve as
a motivation for investigating new approximation
methods for scalable inference in GP models and
combinationswithCNNs.
Calibration of Convolutional Networks:
The issue of calibration of classifiers in machine learning was popularized in the 90's with the use of
support vector machines for probabilistic classification. Calibration techniques aim to learn a
transformation of the output using a validation set in order for the transformed output to give a reliable
account of the actual probability of class labels; interestingly, calibration can be applied regardless
of the probabilistic nature of the untransformed output of the classifier. Popular calibration techniques
include Platt scaling and isotonic regression. Classifiers based on Deep Neural Networks (DNNs)
have been shown to be well-calibrated. The reason is that the optimization of the cross-entropy
loss promotes calibrated output. The same loss is used in Platt scaling and it corresponds to the
correct multinomial likelihood for class labels. Recent studies on the calibration of CNNs, which are a
particular case of DNNs, however, show that depth has a negative impact on calibration, despite
the use of a cross-entropy loss, and that regularization improves the calibration properties of
classifiers [Guo et al. 2017].
Combinations of ConvNets and Gaussian Processes:
Thinking of Bayesian priors as a form of regularization, it is natural to assume that Bayesian
CNNs can "cure" the miscalibration of modern CNNs. Despite the abundant literature on Bayesian
DNNs, far less attention has been devoted to Bayesian CNNs, and the calibration properties of these
approaches have not been investigated. In this work, we propose an alternative way to combine CNNs
and GPs, where GPs are approximated using random feature expansions. The random feature
expansion approximation amounts to replacing the original kernel matrix with a low-rank approximation,
turning GPs into Bayesian linear models. Combining this with CNNs leads to a particular form of
Bayesian CNNs, much like GPs and DGPs are particular forms of Bayesian DNNs. Inference in Bayesian
CNNs is intractable and requires some form of approximation. In this work, we draw on the interpretation
of dropout as variational inference, employing the so-called Monte Carlo Dropout (MCD) to obtain a
practical way of combining CNNs and GPs.
Uncertainty in timestamps, modeling for clinical use #1
Time-Discounting Convolution for Event Sequences
with Ambiguous Timestamps
(Submitted on 6 Dec 2018)
https://arxiv.org/abs/1812.02395
This paper proposes a method for modeling event
sequences with ambiguous timestamps, a time-
discounting convolution. Unlike in ordinary time series,
time intervals are not constant, small time-shifts
have no significant effect, and inputting timestamps or
time durations into a model is not effective. The criteria
that we require for the modeling are providing
robustness against time-shifts or timestamps
uncertainty as well as maintaining the essential
capabilities of time-series models, i.e., forgetting
meaningless past information and handling infinite
sequences.
The proposed method handles them with a
convolutional mechanism across time with specific
parameterizations, which efficiently represents the event
dependencies in a time-shift invariant manner while
discounting the effect of past events, and a dynamic
pooling mechanism, which provides robustness
against the uncertainty in timestamps and enhances the
time-discounting capability by dynamically changing the
pooling window size.
Imputation Literature Review
Types of Missing Values
Feldman et al. (2018): "Rubin (1976) discusses three possible
mechanisms for the formation of missing values, each reflecting a
different form of missing-data probabilities and relationships between the
measured variables, and each may lead to different imputation methods
(Luengo et al., 2012)"
Missing Completely at Random (MCAR): a missing value that cannot be
related to the value itself or to other variable values in that record. This is a
completely unsystematic missing pattern and therefore the observed data
can be thought of as a random, unbiased sample of a complete dataset.
Missing at Random (MAR): cases in which a missing value is related to
other variable values in that record, but not to the value itself (e.g., a person with
a "marital status" value "single" has a missing value in the "spouse name"
attribute). In other words, in MAR scenarios, incomplete data can be partially
explained and the actual value can possibly be predicted from other variable
values.
Missing Not at Random (MNAR): the missing value is not random and
depends on the actual value itself; hence, it cannot be explained by other values
(e.g., an overweight person is reluctant to provide the "weight" value in a
survey). MNAR scenarios are the most difficult to analyze and handle, as the
missing data cannot be associated with other data items that are available in
the dataset.
https://statistical-programming.com/missing-data/
Missing in action: the dangers of ignoring missing data
https://doi.org/10.1016/j.tree.2008.06.014
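To make the three mechanisms concrete, the hedged sketch below simulates each one on a toy two-column dataset (numpy; the thresholds and missingness rates are arbitrary illustration choices): MCAR drops values uniformly at random, MAR makes missingness in one column depend on the other, observed column, and MNAR makes it depend on the value being dropped itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(40, 10, n)
weight = rng.normal(75, 12, n)

# MCAR: every weight value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: weight is more likely to be missing for older people (depends on observed age)
mar_mask = rng.random(n) < np.where(age > 50, 0.5, 0.1)

# MNAR: heavier people are more likely to withhold their weight (depends on the value itself)
mnar_mask = rng.random(n) < np.where(weight > 90, 0.6, 0.1)

weight_mcar = np.where(mcar_mask, np.nan, weight)
weight_mar = np.where(mar_mask, np.nan, weight)
weight_mnar = np.where(mnar_mask, np.nan, weight)
```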
Intro to imputation methods
Comparison of Estimating Missing Values in IoT Time
Series Data Using Different Interpolation Algorithms
August 2018
https://doi.org/10.1007/s10766-018-0595-5
"When collecting Internet of Things data using various sensors or
other devices, it may be possible to miss several kinds of values of
interest. In this paper, we focus on estimating the missing values in IoT
time series data using three interpolation algorithms, including
(1) Radial Basis Functions, (2) Moving Least Squares (MLS), and (3)
Adaptive Inverse Distance Weighted."
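As a concrete baseline for the interpolation-style imputation discussed above, here is a minimal pandas sketch; the RBF, MLS, and IDW variants from the paper are not reproduced, so plain linear, time-aware, and spline interpolation stand in for them on a toy IoT-style series.

```python
import numpy as np
import pandas as pd

# A toy sensor series with gaps
idx = pd.date_range("2019-01-01", periods=10, freq="H")
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0, np.nan, 10.0], index=idx)

linear = s.interpolate(method="linear")            # straight lines between observed points
timed = s.interpolate(method="time")               # accounts for irregular timestamps
spline = s.interpolate(method="spline", order=2)   # smoother fill, requires scipy
```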
On the choice of the best imputation methods for missing values
considering three groups of classification methods
June 2011
https://doi.org/10.1007/s10115-011-0424-2 | https://sci2s.ugr.es/MVDM
"In this work, we focus on a classification task with twenty-three classification methods
and fourteen different imputation approaches to missing values treatment that
are presented and analyzed. The analysis involves a group-based approach, in which
we distinguish between three different categories of classification methods.
Each category behaves differently, and the evidence obtained shows that the use of
determined missing values imputation methods could improve the accuracy obtained
for these methods. In this study, the convenience of using imputation methods
for preprocessing data sets with missing values is stated. The analysis suggests
that the use of particular imputation methods conditioned to the groups is required."
We have discovered that the
Combined Multivariate Collapsing
(CMC) and Event Covering (EC)
methods show good behavior for
these two measures, and they are
two methods that provide good
results for an important range of
learning methods, as we have
previously analyzed. In short, these
two approaches introduce less
noise and maintain the mutual
information better.
Class center based approach for missing value
imputation (2018)
https://doi.org/10.1016/j.knosys.2018.03.026
A novel missing value imputation is introduced, which is composed of
two modules. Each class center and its distances from the other
observed data are measured to identify a threshold. Then, the
identified threshold is used for missing value imputation. The
proposed approach outperforms the other approaches for both
numerical and mixed datasets. It requires much less imputation
time than the machine learning based methods.
Imputation with Deep Learning #1
BRITS: Bidirectional Recurrent Imputation for Time
Series
Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, Yitan Li
(Submitted on 27 May 2018) https://arxiv.org/abs/1805.10572
https://github.com/NIPS-BRITS/BRITS
Existing imputation methods often impose strong
assumptions on the underlying data generating process,
such as linear dynamics in the state space. In this paper, we
propose BRITS, a novel method based on recurrent neural
networks for missing value imputation in time series data.
Our proposed method directly learns the missing
values in a bidirectional recurrent dynamical system, without
any specific assumption. The imputed values are treated as
variables of the RNN graph and can be effectively updated during
backpropagation. We simultaneously perform missing
value imputation and classification/regression of applications
jointly in one neural graph.
BRITS has three advantages: (a) it can handle multiple
correlated missing values in time series; (b) it generalizes
to time series with nonlinear dynamics underlying; (c) it
provides a data-driven imputation procedure and
applies to general settings with missing data.
We evaluate the imputation performance in terms of
mean absolute error (MAE) and mean relative error
(MRE).
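A small sketch of the two imputation metrics mentioned above, computed only over the positions that were actually missing (numpy). The MRE form here, sum of absolute errors divided by the sum of absolute true values, is one common convention and should be checked against the paper before comparing numbers.

```python
import numpy as np

def imputation_errors(x_true, x_imputed, missing_mask):
    """MAE and MRE evaluated only where values were missing (missing_mask == True)."""
    err = np.abs(x_true[missing_mask] - x_imputed[missing_mask])
    mae = err.mean()
    mre = err.sum() / np.abs(x_true[missing_mask]).sum()
    return mae, mre
```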
Imputation with Deep Learning #2
End-to-End Time Series Imputation via Residual Short Paths
Lifeng Shen, Qianli Ma, Sen Li (2018)
http://proceedings.mlr.press/v95/shen18a.html
We propose an end-to-end imputation network with residual
short paths, called Residual IMPutation LSTM (RIMP-LSTM), a
flexible combination of residual short paths with graph-based
temporal dependencies. We construct a residual sum unit (RSU),
which enables RIMP-LSTM to make full use of previous revealed
information to model incomplete time series and reduce the
negative impact of missing values. Moreover, a switch unit is
designed to detect the missing values and a new loss function is
then developed to train our model with time series in the presence of
missing values in an end-to-end way, which also allows
simultaneous imputation and prediction.
RIMP-LSTM combines the merits of graph-based models with
explicitly modeled temporal dependencies via weighted
residual connection between nodes, with the ones of LSTM that can
accumulate historical residual information and learn the underlying
patterns of incomplete time series automatically.
On the other hand, compared with IMP-LSTM, RIMP-LSTM has
better performance as it is good at modeling temporal
dependencies with weighted residual short paths, which
demonstrates the reasonability of using these weighted residual
paths to model graph-like temporal dependencies for imputation.
Imputation with Deep Learning #3
A context encoder for audio inpainting
Andres Marafioti, Nathanael Perraudin, Nicki Holighaus, and Piotr Majdak (Submitted on 29 Oct 2018)
https://arxiv.org/abs/1810.12138
http://www.github.com/andimarafioti/audioContextEncoder
(Python, Matlab)
We studied the ability of deep neural networks (DNNs) to restore missing audio
content based on its context, a process usually referred to as audio inpainting.
We focused on gaps in the range of tens of milliseconds, a condition which has
not received much attention yet. The proposed DNN structure was trained on
audio signals containing music and musical instruments, separately, with 64-ms
long gaps.
Here, the STFT features, meant as a reasonable first choice,
provided a decent performance. In the future, we expect more
hearing-related features to provide even better reconstructions. In
particular, an investigation of Audlet frames, i.e., invertible time-
frequency systems adapted to perceptual frequency scales, as
features for audio inpainting presents intriguing opportunities.
Here, preferred architectures are those not relying on a
predetermined target and input feature length, e.g., a recurrent
network. Recent advances in generative networks will provide
other interesting alternatives for analyzing and processing audio
data as well. These approaches are yet to be fully explored.
Finally, music data can be highly complex and it is unreasonable to
expect a single trained model to accurately inpaint a large number
of musical styles and instruments at once. Thus, instead of training
on a very general dataset, we expect significantly improved
performance for more specialized networks that could be
trained by restricting the training data to specific genres or
instrumentation. Applied to a complex mixture and potentially
preceded by a source-separation algorithm, the resulting
models could be used jointly in a mixture-of-experts approach.
Imputation with Deep Learning #4: GANs
NAOMI: Non-Autoregressive Multiresolution Sequence Imputation
Yukai Liu, Rose Yu, Stephan Zheng, Eric Zhan, Yisong Yue (Submitted on 30 Jan 2019)
https://arxiv.org/abs/1901.10946
Leveraging multiresolution modeling and adversarial training, NAOMI is able to
learn the conditional distribution given very few known observations and
achieves superior performance in various experiments of both deterministic and
stochastic dynamics. Future work will investigate how to infer the
underlying distribution when complete training data is unavailable. The trade-
off between partial observations and external constraints is another direction for
deep generative imputation models.
Effect of missing values on classification performance
A methodology for quantifying the effect of missing data on decision quality in
classification problems
Received 09 Mar 2016, Accepted 22 Dec 2016, Accepted author version posted online: 13 Jan 2017,
https://doi.org/10.1080/03610926.2016.1277752
"This study suggests that the negative impact of poor data quality (DQ) on decision making is often
mediated by biased model estimation. To highlight this perspective, we develop an analytical framework
that links three quality levels – data, model, and decision. The general framework is first developed at a
high level."
Evolutionary Machine Learning for
Classification with Incomplete Data
Tran, Cao Truong (2018, PhD Thesis)
http://hdl.handle.net/10063/7639
"The thesis develops approaches for
improving imputation for
classification with incomplete data by
integrating clustering and feature
selection with imputation. The approaches
improve both the effectiveness and the
efficiency of using imputation for
classification with incomplete data.
The thesis develops interval genetic
programming to directly evolve classifiers
for incomplete data. The results show that
classifiers generated by interval genetic
programming can be more effective and
efficient than classifiers generated by the
combination of imputation and traditional
genetic programming. Interval genetic
programming is also more effective than
common classification algorithms able to
work directly with incomplete data."
Imputation and Classification
Missing Data Imputation for Supervised Learning
August 2018
https://doi.org/10.1080/08839514.2018.1448143
"This paper compares methods for imputing missing
categorical data for supervised classification tasks."
The results of the present study show that perturbation can help increase predictive accuracy
for imputed models, but not one-hot encoded models. Future work can identify the conditions
under which missing-data perturbation can improve prediction accuracy. Interesting
extensions of this paper include evaluating the benefits of using missing-data
perturbation over more popular regularization techniques such as dropout training.
Error rates on the Adult test set with (bottom) and without (top) missing data imputation, for various levels of MCAR-perturbed categorical training features (x-axis).
The Adult dataset contains N = 48,842 examples
and 14 features (6 continuous and 8 categorical). The
prediction task is to determine whether a person
makes over $50,000 a year.
Decomposition Literature Review
CEEMD Empirical Mode Decomposition
Empirical mode decomposition for
seismic time-frequency analysis
Jiajun Han and Mirko van der Baan
Geophysics (2013) 78 (2):O9-O19.
https://doi.org/10.1190/geo2012-0199.1
Complete ensemble empirical mode
decomposition decomposes a
seismic signal into a sum of
oscillatory components, with
guaranteed positive and smoothly
varying instantaneous frequencies.
Analysis on synthetic and real data
demonstrates that this method
promises higher spectral-spatial
resolution than the short-time
Fourier transform or wavelet
transform. Application on field data
thus offers the potential of
highlighting subtle geologic
structures that might otherwise
escape unnoticed.
CEEMD is a robust extension of EMD methods. It
solves not only the mode mixing problem, but also leads to
complete signal reconstructions. After CEEMD,
instantaneous frequency spectra manifest visibly higher
time-frequency resolution than short-time Fourier and
wavelet transforms on synthetic and field data examples.
These characteristics render the technique highly
promising for seismic processing and interpretation.
Introducing libeemd: A program package for performing the
ensemble empirical mode decomposition (July 2015)
Computational Statistics 31(2):1-13, P.J.J. Luukko, Jouni Helske, E.
Räsänen. C, R and Python.
http://doi.org/10.1007/s00180-015-0603-9
https://bitbucket.org/luukko/libeemd
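As a hedged illustration of running a CEEMD-style decomposition on a 1D signal from Python, the sketch below uses the PyEMD package (the `EMD-signal` distribution on PyPI) rather than the libeemd bindings cited above; the API names come from PyEMD and the signal is synthetic, so treat the parameter choices as assumptions.

```python
import numpy as np
from PyEMD import CEEMDAN  # pip install EMD-signal

# Synthetic non-stationary signal: a chirp-like component plus a slow drift and noise
t = np.linspace(0, 10, 2000)
signal = np.sin(2 * np.pi * (1 + 0.3 * t) * t) + 0.5 * t + 0.1 * np.random.randn(t.size)

ceemdan = CEEMDAN(trials=100)        # number of noise realizations in the ensemble
imfs = ceemdan(signal)               # rows are intrinsic mode functions, fast to slow
residual = signal - imfs.sum(axis=0)
print(imfs.shape)                    # (n_imfs, n_samples)
```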
Source Separation ”signal decomposition” #1
Wave-U-Net: A Multi-Scale Neural Network for
End-to-End Audio Source Separation
Daniel Stoller, Sebastian Ewert, Simon Dixon
Queen Mary University of London, Spotify
(Submitted on 8 Jun 2018)
https://arxiv.org/abs/1806.03185 | https://github.com/f90/Wave-U-Net
“Models for audio source separation usually operate on the
magnitude spectrum, which ignores phase information and
makes separation performance dependent on hyper-parameters
for the spectral front-end. Therefore, we investigate end-to-end
source separation in the time domain, which allows
modelling phase information and avoids fixed spectral
transformations. Due to high sampling rates for audio, employing a
long temporal input context on the sample level is difficult, but
required for high quality separation results because of long-range
temporal correlations.
In this context, we propose the Wave-U-Net, an adaptation of the
U-Net to the one-dimensional time domain, which repeatedly
resamples feature maps to compute and combine features at
different time scales. We introduce further architectural
improvements, including an output layer that enforces source
additivity, an upsampling technique and a context-aware
prediction framework to reduce output artifacts.
Experiments for singing voice separation indicate that our
architecture yields a performance comparable to a state-of-the-
art spectrogram-based U-Net architecture, given the same data.”
75 tracks from the training partition of the MUSDB
multi-track database are randomly assigned to
our training set. For singing voice separation, we
also add the whole CCMixter database to the
training set. No further data preprocessing is performed, only a
conversion to mono (except for stereo models) and downsampling to
22050 Hz.
For future work, we could investigate to
which extent our model performs a
spectral analysis, and how to incorporate
computations similar to those in a multi-
scale filterbank, or to explicitly compute
a decomposition of the input signal into a
hierarchical set of basis signals and
weightings on which to perform the
separation, similar to the TasNet [12].
Furthermore, better loss functions for
raw audio prediction should be investigated,
such as the ones provided by generative
adversarial networks [3, 21], since the MSE
might not reflect the perceived loss of
quality well.
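To ground the idea of repeatedly resampling 1D feature maps with skip connections, here is a very small PyTorch sketch of one downsampling/upsampling level in the spirit of (but far simpler than) the Wave-U-Net; the layer sizes and kernel widths are arbitrary illustration choices, not the published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveUNet(nn.Module):
    """One encoder/decoder level of a 1D U-Net-style network (illustrative only)."""
    def __init__(self, channels=24):
        super().__init__()
        self.down = nn.Conv1d(1, channels, kernel_size=15, padding=7)
        self.bottleneck = nn.Conv1d(channels, channels, kernel_size=15, padding=7)
        self.up = nn.Conv1d(2 * channels, 1, kernel_size=5, padding=2)

    def forward(self, x):                                        # x: (batch, 1, time)
        skip = F.leaky_relu(self.down(x))
        h = F.leaky_relu(self.bottleneck(skip[:, :, ::2]))       # decimate time axis by 2
        h = F.interpolate(h, size=skip.shape[-1], mode="linear", align_corners=False)
        return self.up(torch.cat([h, skip], dim=1))              # concat skip, map to source

# y = TinyWaveUNet()(torch.randn(8, 1, 16384))
```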
Source Separation ”signal decomposition” #2
TasNet: Surpassing Ideal Time-Frequency
Masking for Speech Separation
Yi Luo, Nima Mesgarani
(Submitted on 21 Sep 2018)
https://arxiv.org/abs/1809.07454
“TasNet uses a convolutional encoder to create a representation
of the signal that is optimized for extracting individual speakers.
Speaker extraction is achieved by applying a weighting
function (mask) to the encoder output. The modified encoder
representation is then inverted to the sound waveform using a
linear decoder. A linear deconvolution layer serves as a decoder
by inverting the encoder output back to the sound waveform. This
encoder-decoder framework is similar to the ICA method when
a nonnegative mixing matrix (NMF) is used [Wang et al. 2009] and
to the semi-nonnegative matrix factorization method (semi-NMF)
[Ding et al. 2008], where the basis signals are the parameters of
the decoder.
The masks are found using a temporal convolutional network
(TCN) consisting of dilated convolutions, which allow the
network to model the long-term dependencies of the speech
signal. This end-to-end speech separation algorithm significantly
outperforms previous time-frequency methods in terms
of separating speakers in mixed audio, even when compared to
the separation accuracy achieved with the ideal time-frequency
mask of the speakers. In addition, TasNet has a smaller model size
and a shorter minimum latency, making it a suitable solution for
both offline and real-time speech separation applications.“
Source Separation ”signal decomposition” #3
Disentangling Correlated Speaker and Noise for
Speech Synthesis via Data Augmentation and
Adversarial Factorization
Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Yu-An Chung, Yuxuan Wang,
Yonghui Wu, James Glass.
32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.
https://openreview.net/pdf?id=Bkg9ZeBB37
“To leverage crowd-sourced data to train multi-speaker text-
to-speech (TTS) models that can synthesize clean speech
for all speakers, it is essential to learn disentangled
representations which can independently control the
speaker identity and background noise in generated signals.
However, learning such representations can be challenging,
due to the lack of labels describing the recording conditions of
each training example, and the fact that speakers and
recording conditions are often correlated, e.g. since users
often make many recordings using the same equipment.
This paper proposes three components to address this
problem by: (1) formulating a conditional generative model
with factorized latent variables, (2) using data augmentation
to add noise that is not correlated with speaker identity and
whose label is known during training, and (3) using
adversarial factorization to improve disentanglement.
Experimental results demonstrate that the proposed method
can disentangle speaker and noise attributes even if
they are correlated in the training data, and can be used to
consistently synthesize clean speech for all speakers.”
Decompose High and Low frequencies
Drop an Octave: Reducing Spatial Redundancy in
Convolutional Neural Networks with Octave
Convolution
Yunpeng Chen, Haoqi Fang, Bing Xu, Zhicheng Yan, Yannis Kalantidis,
Marcus Rohrbach, Shuicheng Yan, Jiashi Feng
(Submitted on 10 Apr 2019)
https://export.arxiv.org/abs/1904.05049
In this work, we propose to factorize the mixed feature maps by
their frequencies and design a novel Octave Convolution
(OctConv) operation to store and process feature maps that vary
spatially "slower" at a lower spatial resolution, reducing both memory
and computation cost. Unlike existing multi-scale methods,
OctConv is formulated as a single, generic, plug-and-play
convolutional unit that can be used as a direct
replacement of (vanilla) convolutions without any
adjustments in the network architecture. It is also orthogonal and
complementary to methods that suggest better topologies or
reduce channel-wise redundancy like group or depth-wise
convolutions. We experimentally show that by simply replacing
convolutions with OctConv, we can consistently boost
accuracy for both image and video recognition tasks, while reducing
memory and computational cost.
Decompose Signal and the Noise
Deep learning of dynamics and signal-noise
decomposition with time-stepping constraints
Samuel H. Rudy, J. Nathan Kutz, Steven L. Brunton
Department of Applied Mathematics / Mechanical Engineering, University of Washington, Seattle,
last revised 22 Aug 2018
https://arxiv.org/abs/1808.02578
https://github.com/snagcliffs/RKNN
“We propose a novel paradigm for data-driven modeling that
simultaneously learns the dynamics and estimates the
measurement noise at each observation. By constraining our
learning algorithm, our method explicitly accounts for measurement
error in the map between observations, treating both the
measurement error and the dynamics as unknowns to be
identified, rather than assuming idealized noiseless trajectories.
We also discuss issues with the generalizability of neural network
models for dynamical systems and provide open-source code for
all examples.”
The combination of neural networks and numerical time-stepping
schemes suggests a number of high-priority research
directions in system identification and data-driven forecasting.
Future extensions of this work include considering systems with
process noise, a more rigorous analysis of the specific method for
interpolating f, including time delay coordinates to accommodate
latent variables, and generalizing the method to identify
partial differential equations. Rapid advances in hardware and
the ease of writing software for deep learning will enable these
innovations through fast turnover in developing and testing
methods.
Signal Restoration Literature Review
Super-resolution: Insights from audio
Time-frequency networks for audio super-
resolution
Teck Yian Lim et al. (2018)
http://isle.illinois.edu/sst/pubs/2018/lim18icassp.pdf
http://tlim11.web.engr.illinois.edu/
“Audio super-resolution (a.k.a. bandwidth extension) is
the challenging task of increasing the temporal resolution of
audio signals. Recent deep network approaches achieved
promising results by modeling the task as a regression
problem in either the time or frequency domain. In this paper,
we introduced the Time-Frequency Network (TFNet), a
deep network that utilizes supervision in both the time and
frequency domain. We proposed a novel model architecture
which allows the two domains to be jointly optimized.”
Spectrogram corresponding to
the LR input (frequencies above
4 kHz missing), the HR
reconstruction, and the HR
ground truth. Our approach
successfully recovers the high
frequency components from the
LR audio signal.
GANs Also for time-series denoising #1a
Denoising Time Series Data Using
Asymmetric Generative Adversarial
Networks
Sunil Gandhi; Tim Oates; Tinoosh Mohsenin and David
Hairston (2018)
https://doi.org/10.1007/978-3-319-93040-4_23
“In this paper, we explicitly learn to remove
noise from time series data without
assuming a prior distribution of noise.
We propose an online, fully automated, end-
to-end system for denoising time series data.
Our model for denoising time series is trained
using unpaired training corpora and does
not need information about the source of the
noise or how it is manifested in the time series.
We propose a new architecture called
AsymmetricGAN that uses a generative
adversarial network for denoising time series
data.”
Consider, for example, a widely used method for time series featurization called Symbolic Aggregate
approXimation (SAX) that assumes time series are generated from a single normal distribution. As
has been shown, this assumption does not hold in several real-life time series datasets. Other techniques
assume noise comes from a Gaussian distribution and estimate the parameters of that distribution. This
assumption does not hold for data sources like electroencephalography (EEG), where noise can have diverse
characteristics and originate from different sources. Hence, in this work, we focus on learning the
characteristics of noise in EEG data and removing it as a preprocessing step. ICA has high
computational complexity and large memory requirements, making it unsuitable for real-time applications.
For training of our network, we only need a set of clean signals and a set of noisy signals. We do not need
paired training data, i.e., we do not need clean versions of the noisy data. This is particularly useful for
applications like artifact removal in EEG data as we cannot record clean versions of noisy EEG.
GANs Also for time-series denoising #1b
Denoising Time Series Data Using
Asymmetric Generative Adversarial
Networks
Sunil Gandhi; Tim Oates; Tinoosh Mohsenin and David
Hairston (2018)
https://doi.org/10.1007/978-3-319-93040-4_23
Pre-processing
The DC component in EEG data is different for each
recording. We normalize every window of clean and
noisy data to remove the DC offset from the data. We
remove the DC offset by subtracting the median of the
data in the window.
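The pre-processing step described above is essentially a one-liner; a small numpy sketch for windowed 1D data (the window length is an arbitrary choice here, and for multi-channel EEG it would be applied per channel):

```python
import numpy as np

def remove_dc_offset(x, window=512):
    """Subtract the per-window median from a 1D signal."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    for start in range(0, len(x), window):
        seg = out[start:start + window]   # view into `out`
        seg -= np.median(seg)             # in-place median removal for this window
    return out
```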
Evaluation of EEG data is challenging as the
ground truth noiseless signals are not
known. Multiple approaches to evaluation
have been proposed in recent years,
however, authors do not agree on a single
mechanism for evaluating artifact removal.
GANs Also for speech denoising
SEGAN: Speech enhancement generative
adversarial network.
Santiago Pascual, Antonio Bonafonte, and Joan Serra (2017)
https://arxiv.org/abs/1703.09452
https://github.com/santi-pdp/segan
“For the purpose of speech enhancement
and denoising, the SEGAN was developed,
employing a neural network with an encoder and
decoder pathway that successively halves and
doubles the resolution of feature maps in each
layer, respectively, and features skip connections
between encoder and decoder layers.
The model works as an encoder-decoder fully-
convolutional structure, which makes it fast to
operate for denoising waveform chunks. The
results show that, not only is the method viable, but it
can also represent an effective alternative to current
approaches.
Possible future work involves the exploration of
better convolutional structures and the inclusion of
perceptual weightings in the adversarial training,
so that we reduce possible high frequency artifacts
that might be introduced by the current model.
Further experiments need to be done to compare
SEGAN with other competitive approaches.” The dataset is a selection of 30 speakers
from the VoiceBank corpus.
GANs Also for multichannel audio denoising
Multi-View Networks for Denoising of Arbitrary
Numbers of Channels
Jonah Casebeer, Brian Luc and Paris Smaragdis (July 2018)
https://arxiv.org/abs/1806.05296
“We propose a set of denoising neural networks capable
of operating on an arbitrary number of channels at
runtime, irrespective of how many channels they were
trained on. We coin the proposed models multi-view
networks since they operate using multiple views of the
same data.
We explore two such architectures and show how they
outperform traditional denoising models in multi-channel
scenarios. Additionally, we demonstrate how multi-
view networks can leverage information
provided by additional recordings to make
better predictions, and how they are able to
generalize to a number of recordings not seen in
training.”
GANs for generative models of time series
On the evaluation of generative models in music
Li-Chia Yang, Alexander Lerch (October 2018)
https://doi.org/10.1007/s00521-018-3849-7
https://github.com/RichardYang40148/mgeval
Therefore, we propose a set of simple
musically informed objective metrics
enabling an objective and reproducible
way of evaluating and comparing the
output of music generative systems.
We demonstrate the usefulness of the
proposed metrics with several
experiments on real-world data.
We have released the evaluation
framework as an open-source toolbox
which implements the demonstrated
evaluation and analysis methods along
with visualization tools. Our future work
will include the extension of the current
toolbox with additional dimensions (e.g.,
dynamics) and to expand it toward
polyphonic music.
Analysis and Classification Literature Review
Classification, non-DL algorithms: COTE
The great time series classification bake off:
a review and experimental evaluation of
recent algorithmic advances
Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large,
Eamonn Keogh (May 2017)
https://doi.org/10.1007/s10618-016-0483-9
https://bitbucket.org/TonyBagnall/time-series-classification
“We have implemented 18 recently proposed algorithms in a
common Java framework (Weka) and compared them
against two standard benchmark classifiers (and each other)
by performing 100 resampling experiments on each of the 85
datasets. We use these results to test several hypotheses
relating to whether the algorithms are significantly more
accurate than the benchmarks and each other. Our results
indicate that only nine of these algorithms are significantly
more accurate than both benchmarks and that one classifier,
the collective of transformation ensembles, is significantly
more accurate than all of the others.”
Summary of the time and space complexity of the
18 TSC algorithms considered
However, our conclusion is that using COTE
(Bagnall et al. 2015; cited by 91) will probably give you
the most accurate model. If a simpler approach is needed
and the discriminatory features are likely to be embedded in
subseries, then we would recommend using TSF or ST if the
features are in the time domain (depending on whether they
are phase dependent or not) or BOSS if they are in the
frequency domain. If a whole series elastic measure seems
appropriate, then using EE is likely to lead to better predictions
than using just DTW.
Time series Intro of DNN use #1A
Deep learning for time series classification: a review
Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, Pierre-Alain Muller (Submitted on 12 Sep 2018)
https://arxiv.org/abs/1809.04356 | https://github.com/hfawaz/dl-4-tsc
In this article, we study the current
state of the art performance of deep
learning algorithms for Time Series
Classification (TSC) by presenting
an empirical study of the most recent
DNN architectures for TSC. We give
an overview of the most successful
deep learning applications in various
time series domains under a
unified taxonomy of DNNs for
TSC. We also provide an open
source deep learning
framework to the TSC community
where we implemented each of the
compared approaches and
evaluated them on a univariate TSC
benchmark (the UCR archive) and
12 multivariate time series datasets.
By training 8,730 deep learning
models on 97 time series
datasets, we propose the most
exhaustive study of DNNs for TSC to
date.
COTE is currently considered the state of the art for time series classification (Bagnall et al., 2017)
when evaluated over the 85 datasets from the UCR archive (Chen et al., 2015b).
Finally, adding to the huge runtime of COTE, the decision taken by 35 classifiers cannot be interpreted
easily by domain experts, since researchers already struggle with understanding the decisions taken by
an individual classifier.
● What is the current state-of-the-art DNN for TSC?
● Is there a current DNN approach that reaches state-of-the-art performance for TSC and is less
complex than COTE?
● What type of DNN architecture works best for the TSC task?
● And finally: could the black-box effect of DNNs be avoided to provide interpretability?
Given that the latter questions have not been addressed by the TSC community, it is surprising that
recent papers have neglected the possibility that TSC problems could be solved using a pure
feature learning algorithm.
Time series Intro of DNN use #1B
The result of applying a learned
discriminative convolution on the GunPoint
dataset
Deep learning for time series classification: a review
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
Time series Intro of DNN use #1C
Given the aforementioned limitations of
generative models, we decided to limit our
experimental evaluation to discriminative
deep learning models for TSC.
Second, since we cannot cover an empirical study of
all approaches validated in all TSC domains, we
decided to only include approaches that were validated
on the whole (or a subset of) the univariate time
series UCR archive and/or on the MTS archive
(Baydogan, 2015).
Finally, we chose to work with approaches that do not try to
solve a sub-task of the TSC problem such as in Geng and
Luo (2018) where CNNs were modified to solve the task of
classifying imbalanced time series datasets. Another sub-
task that has been at the center of recent studies is early time
series classification (Wang et al., 2016a) where deep CNNs
were modified to include an early classification of time series.
More recently, a deep reinforcement learning approach was
also proposed for the early TSC task (Martinez et al., 2018).
For further details, we refer the interested reader to a recent
survey on deep learning for early time series
classification (Santos and Kern, 2017).
The third and final proposed architecture in Wang et al. (2017) is a relatively deep Residual Network
(ResNet). For TSC, this is the deepest architecture with 11 layers, of which the first 9 layers are
convolutional, followed by a GAP layer that averages the time series across the time dimension.
Deep learning for time series classification: a review
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
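A hedged Keras sketch of the ResNet-for-TSC shape described above (9 convolutional layers in 3 residual blocks, followed by global average pooling and a softmax output); the filter counts (64, 128, 128) and kernel sizes (8, 5, 3) follow the configuration commonly attributed to Wang et al. (2017) and the dl-4-tsc repository, so treat them as assumptions rather than a verified reproduction.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, n_filters):
    """Three conv-BN-ReLU layers with a 1x1 shortcut, in the ResNet-for-TSC style."""
    y = x
    for k in (8, 5, 3):
        y = layers.Conv1D(n_filters, k, padding="same")(y)
        y = layers.BatchNormalization()(y)
        y = layers.Activation("relu")(y)
    shortcut = layers.Conv1D(n_filters, 1, padding="same")(x)
    shortcut = layers.BatchNormalization()(shortcut)
    return layers.Activation("relu")(layers.add([y, shortcut]))

def resnet_tsc(input_length, n_channels, n_classes):
    inp = layers.Input(shape=(input_length, n_channels))
    x = residual_block(inp, 64)
    x = residual_block(x, 128)
    x = residual_block(x, 128)
    x = layers.GlobalAveragePooling1D()(x)   # GAP across the time dimension
    out = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)

# model = resnet_tsc(input_length=500, n_channels=1, n_classes=5)
# model.compile(optimizer="adam", loss="categorical_crossentropy")
```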
Time series Intro of DNN use #1D
Deep learning for time series classification: a review
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
Given the huge number of models [8,730 experiments for the 85 univariate TSC datasets] that
needed to be trained, we ran our experiments on a cluster of 60 GPUs. These GPUs were a mix
of four types of Nvidia graphic cards: GTX 1080 Ti, Tesla K20, K40 and K80.
The total sequential running time was approximately 100 days, that is, if the computation had
been done on a single GPU. However, by leveraging the cluster of 60 GPUs, we managed to
obtain the results in less than one month. We implemented our framework using the open-
source deep learning library Keras with the TensorFlow back-end.
Figure 1 shows the critical difference diagram (Demšar, 2006, cited by 6414),
where a thick horizontal line shows a group of classifiers (a
clique) that are not significantly different in terms of accuracy.
→ An Extension on "Statistical Comparisons of Classifiers over Multiple
Data Sets" for all Pairwise Comparisons
Time series Intro of DNN use #1E: ResNet the Top Dog
Deep learning for time series classification: a review
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
Time series Intro of DNN use #1F: ResNets vs. Traditional
Deep learning for time series classification: a review
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
We give two potential reasons for
the high generalization
capabilities of deep CNNs on
TSC tasks.
First, having seen the success of
convolutions in classification tasks
that require learning features that
are spatially invariant in a two
dimensional space (such as width
and height in images), it is only
natural to think that discovering
patterns in a one dimensional
space (time) should be an easier
task for CNNs thus requiring less
data to learn from.
The other more direct reason
behind the high accuracies of
deep CNNs on time series data is
its success in other sequential data
such as speech recognition and
sentence classification where text
and audio, similarly to time series
data, exhibit a natural temporal
ordering.
We compared ResNet (the most accurate DNN of our study) with the current state-of-the-art classifiers evaluated on the UCR
archive in the great time series classification bake off (Bagnall et al. (2017)). Note that our empirical study strongly
suggests to use ResNet instead of any other deep learning algorithm.
Out of the 18 classifiers evaluated by Bagnall et al. (2017), we have chosen the four best performing algorithms:
(1) Elastic Ensemble (EE) proposed by Lines and Bagnall (2015) is an ensemble of nearest neighbor classifiers with 11
different time series similarity measures; (2) Bag-of-SFA-Symbols (BOSS) published in Schäfer (2015) forms a
discriminative bag of words by discretizing the time series using a Discrete Fourier Transform and then building a nearest
neighbor classifier with a bespoke distance measure; (3) Shapelet Transform (ST) developed by Hills et al. (2014)
extracts discriminative subsequences (shapelets) and builds a new representation of the time series that is fed to an
ensemble of 8 classifiers; (4) Collective of Transformation-based Ensembles (COTE) proposed by
Bagnall et al. (2017) is basically a weighted ensemble of 35 TSC algorithms including EE and ST. Finally, we added a recent
approach named Proximity Forest (PF) which is similar to Random Forest but replaces the attribute-based splitting
criteria by a random similarity measure chosen out of EE's elastic distances (Lucas et al., 2018).
Although COTE is still the
most accurate classifier (when
evaluated on the UEA archive), its
use in a real data mining
application is limited due to its
huge training time
complexity, which is O(N^2 · T^4).
Time series Intro of DNN use #1G:
Deep learning for time series classification: a review
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
Again, we can clearly see the dominance of ResNet as the
best performing approach across different domains. One
exception is the electrocardiography (ECG) datasets
(7 in total) where ResNet was drastically beaten by the FCN
model in 71.4% of ECG datasets.
THEMES
One might expect that the relatively
short filters (3) might affect the
performance of ResNet and FCN since
longer patterns cannot be captured by
short filters. However, since increasing
the number of convolutional layers will
increase the path length viewed
(receptive field) by the CNN model
(Vaswani et al., 2017), ResNet and FCN
managed to outperform other
approaches whose filter length is longer
(21), such as Encoder.
SIGNAL LENGTH
Wang et al. (2017) later introduced a one-
dimensional CAM with an application to TSC. This
method explains the classification of a certain deep
learning model by highlighting the subsequences that
contributed the most to a certain classification.
An interesting observation would be to compare the
discriminative regions identified by a deep learning model with
the most discriminative shapelets extracted by other shapelet-
based approaches. This observation would also be backed up
by the mathematical proof provided by Cui et al. (2016), that
showed how the learned filters in a CNN can be
considered a generic form of shapelets extracted by the
learning shapelets algorithm (Grabocka et al., 2014).
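For intuition, here is a hedged numpy sketch of the one-dimensional Class Activation Map idea mentioned above: with a GAP layer followed by a dense softmax layer (as in the FCN/ResNet architectures of the review), the class-c activation map is the sum of the last conv layer's feature maps weighted by the dense-layer weights for class c. The variable names are assumptions, not tied to a specific implementation.

```python
import numpy as np

def cam_1d(last_conv_feats, dense_weights, class_idx):
    """1D Class Activation Map.

    last_conv_feats: (time_steps, n_filters) output of the final conv layer
    dense_weights:   (n_filters, n_classes) weights of the softmax layer after GAP
    Returns a (time_steps,) importance profile for the requested class.
    """
    cam = last_conv_feats @ dense_weights[:, class_idx]
    cam -= cam.min()
    return cam / (cam.max() + 1e-12)   # normalized to [0, 1] for plotting over the series
```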
Time series Intro of DNN use #1H: Future
Deep learning for time series classification: a review
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
Although we have
conducted an extensive
experimental evaluation,
deep learning for Time
Series Classification,
unlike for computer vision
and NLP tasks, still lacks a
thorough study of data
augmentation (Ismail
Fawaz et al., 2018a;
Forestier et al., 2017) and
transfer learning.
Furthermore, we think
that the effect of z-
normalization (and
other normalization
methods) on the learning
capabilities of DNNs
should also be thoroughly
explored.
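Since z-normalization keeps coming up as a preprocessing choice for TSC, here is a tiny numpy sketch of the usual per-series variant (each series is standardized independently; the epsilon guards against constant series):

```python
import numpy as np

def z_normalize(series, eps=1e-8):
    """Standardize one time series to zero mean and unit variance."""
    series = np.asarray(series, dtype=float)
    return (series - series.mean()) / (series.std() + eps)

def z_normalize_batch(X, eps=1e-8):
    """Normalize each row of an (n_series, length) array independently."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + eps)
```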
What makes ImageNet good for
transfer learning?
Minyoung Huh, Pulkit Agrawal, Alexei A. Efros
https://arxiv.org/abs/1608.08614
“Our results might indicate that researchers have
been overestimating the amount of data required
for learning good general CNN features. If that is the
case, it might suggest that CNN training is not as
data-hungry as previously thought. It would also
suggest that beating ImageNet-trained features with
models trained on a much bigger data corpus will be
much harder than once thought.”
AutoAugment: Learning Augmentation Policies from Data
Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, Quoc V. Le (9 Oct 2018)
https://arxiv.org/abs/1805.09501
https://github.com/tensorflow/models/tree/master/research/autoaugment
“We describe a simple procedure called
AutoAugment to search for improved data
augmentation policies”
Albumentations: fast and flexible image augmentations
Alexander Buslaev, Alex Parinov, Eugene Khvedchenya, Vladimir I. Iglovikov, Alexandr A. Kalinin (18 Sep 2018)
https://arxiv.org/abs/1809.06839
https://github.com/albu/albumentations
“We present Albumentations, a fast and
flexible library for image augmentations with
many various image transform operations
available, that is also an easy-to-use
wrapper (based on highly-optimized
OpenCV library) around other augmentation
libraries.”
Combining raw and normalized data in multivariate time series classification with dynamic time warping
Łuczak, Maciej (2018)
http://doi.org/10.3233/JIFS-171393
Time series Intro of DNN use #1H2: Transfer Learning
Transfer learning for time series classification
Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar and Pierre-Alain Muller
https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
When observing the heatmap in Fig. 4, one can easily see that fine-tuning a pre-trained model almost never hurts the performance of the CNN.
In our future work, we aim again to reduce
the deep neural network’s overfitting
phenomena by generating synthetic data
using a Weighted DTW Barycenter
Averaging method [Forestier et al. 2017], since
the latter distance gave encouraging
results in guiding a complex deep learning
tool such as transfer learning. Finally, with
big data repositories becoming more
frequent, leveraging existing source
datasets that are similar to, but not
exactly the same as a target dataset of
interest, makes a transfer learning method
an enticing approach.
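A minimal Keras sketch of the fine-tuning setup this slide discusses: a model pre-trained on a source TSC dataset is reloaded, its softmax layer is swapped for the target task, and all weights are updated with a small learning rate. The file path, layer index and class count are placeholders, not values from the paper.

```python
from tensorflow.keras import layers, models, optimizers

# load a model previously trained on a source time-series dataset (path is a placeholder)
source_model = models.load_model("source_task_model.h5")

# drop the source softmax and attach a new head for the target classes
features = source_model.layers[-2].output       # penultimate layer, e.g. global pooling
n_target_classes = 5                            # illustrative assumption
outputs = layers.Dense(n_target_classes, activation="softmax")(features)
target_model = models.Model(source_model.input, outputs)

# fine-tune every layer with a small learning rate so pre-trained filters are only nudged
target_model.compile(optimizer=optimizers.Adam(1e-4),
                     loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# target_model.fit(X_target, y_target, epochs=20, batch_size=16)
```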
Time series Intro of DNNs #2: Why do ResNets work?
Wang et al., 2017 https://doi.org/10.1109/IJCNN.2017.7966039
Why and When Can Deep -- but Not Shallow -- Networks Avoid the Curse of Dimensionality: a Review
Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, Qianli Liao - https://arxiv.org/abs/1611.00740
The paper characterizes classes of functions for which deep learning can be exponentially better than shallow learning.
http://www.telesens.co/2019/01/16/neural-network-loss-visualization/
http://www.telesens.co/loss-landscape-viz/viewer.html
Visualizing the Loss Landscape of Neural Nets
http://papers.nips.cc/paper/7875-visualizing-the-loss-landscape-of-neural-nets
H. Li et al. (2017)
Time series Intro of DNN use #2A
CNN Approaches for Time Series classification
Lamyaa Sadouk
https://cdn.intechopen.com/pdfs/64216.pdf (2018)
Instead of employing the FFT which is restricted to a predefined fixed
window length, we choose to adopt the Stockwell transform (ST)
as our preprocessing method for CNN training. The advantage of the ST over the FFT is its ability to adaptively capture spectral changes over time without windowing of data, resulting in a better time-frequency resolution for non-stationary signals [Stockwell 1996].
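For orientation, a compact numpy sketch of the discrete Stockwell transform (the standard FFT-based formulation from Stockwell et al., 1996) that could generate such time-frequency images for CNN training; the function name and return layout are our choices, not from the chapter.

```python
import numpy as np

def stockwell_transform(x):
    """Minimal discrete Stockwell transform.

    Returns a (N//2 + 1, N) complex array: rows are frequencies from 0 to Nyquist,
    columns are time samples. Row 0 (zero frequency) is just the signal mean.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    X = np.fft.fft(x)                       # spectrum of the signal
    n_freqs = N // 2 + 1
    S = np.zeros((n_freqs, N), dtype=complex)
    S[0, :] = x.mean()                      # DC row: no localisation possible
    m = np.fft.fftfreq(N) * N               # frequency sample indices
    for n in range(1, n_freqs):
        # Gaussian localizing window in the frequency domain, width proportional to n
        gauss = np.exp(-2.0 * np.pi ** 2 * m ** 2 / n ** 2)
        # shift the spectrum by n, apply the window, and go back to the time domain
        S[n, :] = np.fft.ifft(np.roll(X, -n) * gauss)
    return S
```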
While works [17, 24] transformed the time series signals (by applying down-sampling, slicing, or warping) so as to help the convolutional filters (especially the 1st convolutional layer filters) capture entire peaks (i.e., whole peaks) and fluctuations within the signals, the work of [18] proposed to keep time series data unchanged and rather feed them into three branches, each having a different 1st convolutional filter size, in order to capture the whole fluctuations within signals. An alternative is to find an adaptive 1st convolutional layer filter which has the most optimal size and is able to capture most of the entire peaks present in the input signals. The question of how to compute this adaptive 1st convolutional layer filter is addressed in [4].
Therefore, the most optimal size of the 1st convolutional filter is equal to the sample median of signal peak lengths, suggesting that 0.1 is the best time span of the 1st convolutional layer to retrieve the whole acceleration peaks and the best acceleration changes. Similarly, in the frequency domain, the 1st convolutional layer kernel yielding the highest F1-score is the one with size 10, which is simply the sample median (Me(x) = 10).
(a) and (b): Histograms and boxplots of the frequency distribution of 30 peak lengths present within 30 randomly selected time- and frequency-domain signals, respectively.
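A small sketch of the adaptive kernel-size heuristic described above: estimate typical peak widths on a sample of training signals and use the sample median as the first convolutional layer's filter length. scipy's generic peak detector stands in for whatever peak definition the chapter actually uses, and the prominence value is illustrative.

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths

def median_peak_length(signals, prominence=0.5):
    """Return the sample median of peak widths (in samples) over a set of 1D signals."""
    widths = []
    for x in signals:
        peaks, _ = find_peaks(x, prominence=prominence)
        if len(peaks):
            w, _, _, _ = peak_widths(x, peaks, rel_height=0.5)
            widths.extend(w)
    return int(np.median(widths))

# first_kernel_size = median_peak_length(train_signals)   # e.g. ends up around 10 samples
# conv1 = tf.keras.layers.Conv1D(filters=32, kernel_size=first_kernel_size, padding="same")
```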
Time series Intro of DNN use #2B
CNN Approaches for Time Series classification
Lamyaa Sadouk
https://cdn.intechopen.com/pdfs/64216.pdf (2018)
Some fields, such as medicine, experience a lack of annotated data, as manually annotating a large set requires human expertise and is time consuming.
The conventional approach to deal with this kind of problem is to
perform data augmentation by applying transformations to the
existing data. Data augmentation achieves slightly better time series
classification rates but still the CNN is prone to overfitting. In this section,
we present another solution to this problem, a “knowledge transfer”
framework which is a global, fast and light-weight framework that
combines the transfer learning technique with an SVM classifier.
Transfer learning is a machine learning technique where a model trained on one task (a source domain) is re-purposed on a second related task (a target domain). Accordingly, the questions that arise are: (i) which source learning task should be used for pre-training the CNN model given a target learning task, and (ii) which parts (e.g., learned features) of this model are common between the source and target learning tasks.
In that sense, we propose a "Transfer learning with SVM read-out" framework which is composed of two parts: (i) the first part having first and intermediate layers' weights of a CNN already pre-trained on a source learning task (the last CNN layer being discarded), and (ii) the second part composed of a support vector machine (SVM) classifier with RBF kernel which is connected to the end of the first part.
Then, we feed the entire training dataset of the target task into this framework in order to train the SVM parameters. As opposed to training a CNN on the target task, which requires updating all hidden layers' weights for several iterations using a large training set for all these weights to converge, our framework computes weights of the last layer(s) only, in one iteration only.
Moreover, the advantage of using the SVM as the classifier is that it is fast and generally performs well on small training sets, since it only relies on the support vectors, which are the training samples that lie exactly on the hyperplanes used to define the margin. In addition, SVMs have the powerful RBF kernel, which allows mapping the data to a very high dimensional space in which the data can be separable by a hyperplane, hence guaranteeing convergence. Hence, our framework can be regarded as a global, fast and light-weight technique for time series classification where the target task has limited annotated/labeled data.
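A minimal sketch of this "transfer learning with SVM read-out" idea: freeze a CNN pre-trained on the source task, use its penultimate activations as features for the target data, and fit an RBF-kernel SVM on top. The model path, layer index and hyperparameters are placeholders, not the chapter's exact setup.

```python
from tensorflow.keras import models
from sklearn.svm import SVC

# pre-trained source CNN (placeholder path); its last (classification) layer is discarded
cnn = models.load_model("source_cnn.h5")
feature_extractor = models.Model(cnn.input, cnn.layers[-2].output)

def svm_readout(X_target_train, y_target_train, X_target_test):
    """Train an RBF-SVM on frozen CNN features of a small target dataset."""
    F_train = feature_extractor.predict(X_target_train)   # (n_samples, n_features)
    F_test = feature_extractor.predict(X_target_test)
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(F_train, y_target_train)
    return clf.predict(F_test)
```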
Time series Intro of DNN use #3
3D convolution recurrent neural networks for bird sound detection
Himawan, Ivan, Towsey, Michael, & Roe, Paul (2018)
https://eprints.qut.edu.au/122760/
https://github.com/himaivan/BAD2
We propose 3D convolutions for extracting long-term and short-term
information in frequency simultaneously. In order to leverage powerful
and compact features of 3D convolution, we employ separate recurrent
neural networks (RNN), acting on each filter of the last convolutional
layers rather than stacking the feature maps in the typical combined
convolutionandrecurrentarchitectures.
We split 10-second audio clip into 5 × 2-second clips. The 2- second
length is based on empirical analysis. A spectrogram (from 2-second clip)
computed from sequences of Short-Time Fourier Transform (STFT) of
overlappingwindowedsignalsisusedasthesoundrepresentation.
The 3D convolution highlights only frequency bands where the bird calls are located across the temporal dimension.
As a comparison, the 2D convolution in CNN+RNN highlights a few specific locations of the bird calls, and includes low-frequency regions with no bird calls.
This shows that 3D convolution is more capable of extracting long-term temporal information in bird calls.
In future work, we will investigate the method of generating labeled data via a
pseudo-labeling method where approximate labels are produced from unlabeled data.
This can be achieved, for example, using generative adversarial networks. Domain
adaptation using adversarial learning is another alternative to build a
discriminativemodelandinvarianttodomainatthesametime.
Early Time Series Classification
A Literature Survey of Early Time Series Classification and Deep Learning
Tiago Santos and Roman Kern (2017)
http://ceur-ws.org/Vol-1793/paper4.pdf
Early time series classification
aims to classify a time series with
as few temporal
observations as possible,
while keeping the loss of
classification accuracy at a
minimum. One of the first works on
the topic of early classification, as
defined over time series length,
was written by [31].
Prominent early classification
frameworks reviewed by this
paper include, but are not limited
to, ECTS, RelClass and
ECDIRE.
These works have shown that
early time series classification may
be feasible and performant, but
they also show room for
improvement.
ECDIRE https://doi.org/10.1007/s10618-016-0462-1
RelClass
https://dl.acm.org/citation.cfm?id=2627671
Early TSC with deep reinforcement learning
A deep reinforcement learning approach for early classification of time series
Martinez Coralie, Guillaume Perrin, E Ramasso, Michèle Rombaut
https://hal.archives-ouvertes.fr/hal-01825472/
We formulate the early classification problem in a
reinforcement learning framework: we introduce a suitable
set of states and actions but we also define a specific reward
function which aims at finding a compromise between earliness
and classification accuracy.
While most of the existing solutions do not explicitly take time into
account in the final decision, this solution allows the user to set this
trade-off in a more flexible way. In particular, we show
experimentally on datasets from the UCR time series archive that
this agent is able to continually adapt its behavior without
human intervention and progressively learn to compromise
between accurate and fast predictions.
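The trade-off can be made concrete with a toy reward of the following form; this is an illustrative reward, not the exact function defined by Martinez et al., and the names and weighting are our own.

```python
def early_classification_reward(correct, t, T, lam=0.5):
    """Hedged sketch of a reward trading off accuracy against earliness.

    correct : bool, whether the label emitted at decision time was right
    t       : int, number of observations consumed before deciding
    T       : int, full length of the time series
    lam     : float in [0, 1], user-chosen weight on earliness
    """
    accuracy_term = 1.0 if correct else -1.0
    earliness_term = -lam * (t / T)      # later decisions are penalised
    return accuracy_term + earliness_term
```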
Evolution of the early classifier agent behaviour on the Gun-Point dataset. The scatter plot shows the relationship between accuracy (in percentage) and average time of prediction of the agent over training. We evaluate the agent on the whole training set every 5,000 iterations. Each evaluation corresponds to one dot. Dot points are coloured according to iterations of training: blue dots correspond to early training while yellow dots correspond to the agent's performance after 100,000 iterations of training. We evaluate the agent's policy surrounded by the red star on the testing set and we report its performance in table I. In this experiment, the agent learned to slow its predictions down and improved its accuracy over training.
As future work, we plan to improve the proposed approach with a
dynamic adjustment of the reward function parameters over training
based on the user trade-off criteria. We will also propose a new
management of the agent’s replay memory which could be more suitable
forthe problem of early classification.
Early TSC for clinical use: ICU Mortality Prediction
Dynamic Prediction of ICU Mortality Risk Using Domain Adaptation
Tiago Alves, Alberto Laender, Adriano Veloso, Nivio Ziviani
https://homepages.dcc.ufmg.br/~nivio/papers/alves@bigdata18.pdf
Early recognition of risky trajectories during an Intensive
Care Unit (ICU) stay is one of the key steps towards improving
patient survival. Learning trajectories from physiological
signals continuously measured during an ICU stay requires
learning time-series features that are robust and discriminative
across diverse patient populations.
Mortality risk space for different ICU domains. Regions in red are risky. Each axis is a t-SNE non-linear combination of: (top row) physiological parameters, or (bottom row) features extracted by CNN-LSTM.
Biosignal Deep Learning
Deep learning for healthcare applications based on physiological signals: A review. SG authors: "We have cast the net into the ocean of knowledge t..."
Oliver Faust, Yuki Hagiwara, Tan Jen Hong, Oh Shu Lih, U Rajendra Acharya https://doi.org/10.1016/j.cmpb.2018.04.005
Once the architecture is chosen, the tuning
parameters must be adjusted. Both the structure
selection and parameter adjustment will basically
influence the model. Hence, it is necessary to have
many test runs. Shortening the training phase of
deep learning models is an active area of research
[159]. The challenge is speeding up the training
process in a parallel distributed processing system
[160]. The network between the individual
processors becomes the bottleneck [161].
Graphics Processing Unit (GPUs) can be used to
reduce the network latency [162]
ECG Classification #1
A convolutional neural network for ECG annotation as the basis for classification of cardiac rhythms
Philipp Sodmann et al. 2018 Physiol. Meas. in press https://doi.org/10.1088/1361-6579/aae304
https://github.com/MarcusVollmer/PhysioNet
222,202 R peaks, 192,200 P waves, 256,966 T waves, and 3,311,487 interbeat segments were extracted from the QT database.
In total, approximately 12,000,000 characteristic waveforms were used as input volume. The assigned annotation codes of the midpoint peak of each segment were used as output volume.
A major advantage of decision trees is that they directly provide information on feature importance.
ECG Classification #2
Detecting and interpreting myocardial infarctions using fully convolutional neural networks
Nils Strodthoff, Claas Strodthoff
(Submitted on 18 Jun 2018)
https://arxiv.org/abs/1806.07385
We consider the detection of myocardial infarction in
electrocardiography (ECG) data as provided by the PTB
ECG database without non-trivial preprocessing. The
classification is carried out using deep neural networks
in a comparative study involving convolutional as well as
recurrent neural network architectures. The best
architecture, an ensemble of fully convolutional
architectures, beats state-of-the-art results on this
dataset and reaches 93.3% sensitivity and 89.7%
specificity evaluated with 10-fold crossvalidation, which
is the performance level of human cardiologists for this
task.
We investigate questions relevant for clinical
applications such as the dependence of the
classification results on the considered data channels
and the considered subdiagnoses. Finally, we apply
attribution methods to gain an understanding of the
network's decision criteria on an exemplary basis.
Time series classification in a realistic setting has to be able to cope with time series that are so large that they cannot be used as input to a single neural network, or that cannot be downsampled to reach this state without losing too much information. At this point two different procedures are conceivable: either one uses attentional models that allow to focus on regions of interest, see e.g. Karim et al. 2018, or one extracts random subsequences from the original time series. For reasons of simplicity, and with real-time on-site analysis in mind, we explore only the latter possibility, which is only applicable for signals that exhibit a certain degree of periodicity. The assumption underlying this approach is that the characteristics leading to a certain classification are present in every random subsequence. We stress at this point that this procedure does not rely on the identification of beginning and end points of certain patterns in the window. The procedure leaves two hyperparameters: the choice of the window size and an optional downsampling rate to reduce the temporal input dimension for the neural network.
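A sketch of that random-subsequence procedure: crop fixed-length windows at random positions (optionally downsampled) from each long recording, under the assumption that the class-relevant pattern recurs in every window. Window size and downsampling rate are the two hyperparameters mentioned above; the function name is ours.

```python
import numpy as np

def random_subsequences(signal, window_size, n_windows, downsample=1, rng=None):
    """Extract random fixed-length crops from a long, roughly periodic 1D signal."""
    rng = np.random.default_rng() if rng is None else rng
    span = window_size * downsample
    starts = rng.integers(0, len(signal) - span, size=n_windows)
    return np.stack([signal[s:s + span:downsample] for s in starts])

# Training: feed crops with the label of their source recording.
# Inference: average the network's predictions over many crops of the same record.
```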
Moreover, we present a first exploratory study of
the application of interpretability methods in
this domain, which is a key requirement for
applications in the medical field. These methods
can not only help to gain an understanding and
thereby build trust in the network’s decision
process but could also lead to a data-driven
identification of important markers for certain
classification decisions in ECG data that might
even prove useful for human experts.
Here we identified common cardiologists’
decision rules in the network’s attribution maps
and outlined prospects for future studies in this
direction. Both such an analysis of attribution
maps and further improvements of the
classification performance would have to rely on
considerably larger databases such as for
quantitative precision. This would also allow
extension to further subdiagnoses and other
cardiac conditions such as other confounding and
non-exclusive diagnoses or irregular heart
rhythms.
ECG Classification #3
Automatic detection of sleep-disordered breathing events using recurrent neural networks from an electrocardiogram signal
Erdenebayar Urtnasan, Jong-Uk Park, Kyoung-Joung Lee
https://doi.org/10.1007/s00521-018-3833-2
In this study, we propose a novel method for
automatically detecting sleep-disordered
breathing (SDB) events using a recurrent neural
network (RNN) to analyze nocturnal electrocardiogram
(ECG) recordings. … Single-lead ECG recordings (200
Hz) were measured for an average 7.2-h duration and
segmented into 10-s events (2,000 samples). A
bandpass filter (5–11 Hz) was applied for data
preprocessing to removeundesired noisefrom theECG
signal.The dataset comprised a training dataset
(68,545 events) from 74 patients and test dataset
(17,157 events)from18patients
The proposed deep RNN model for automatic detection of SDB events was implemented by Keras' platform using a TensorFlow background (sic!).
ECG Classification #4
Arrhythmia detection using deep convolutional neural network with long duration ECG signals https://doi.org/10.1016/j.compbiomed.2018.09.009
Department of Cardiology, National Heart Centre Singapore, Singapore; Duke-NUS Medical School, Singapore
The goal of our research was to design a new method based on
deep learning (1D-CNN is employed) to efficiently and quickly
classify cardiac arrhythmias. Approach based on the analysis of
10-s ECG signal fragments (not a single QRS complex) is
applied (on average, 13 times less classifications/analysis). A
complete end-to-end structure was designed instead of the hand-
crafted feature extraction and selection used in traditional
methods. Can be used in tele-medicine especially in mobile
devices and cloud computing due to its low computational
complexity.
ECG Classification #5
Deep learning in the cross-time-frequency domain for sleep staging from a single lead electrocardiogram
https://doi.org/10.1088/1361-6579/aaf339
This study classifies sleep stages from a single lead
electrocardiogram (ECG) using beat detection,
cardiorespiratory coupling in the time-frequency domain and
a deep convolutional neural network (CNN).
An ECG-derived respiration (EDR) signal and
synchronous beat-to-beat heart rate variability (HRV) time
series were derived from the ECG using previously
described robust algorithms. A measure of
cardiorespiratory coupling (CRC) was extracted by
calculating the coherence and cross-spectrogram of
the EDR and HRV signal in five-minute windows.
A support vector machine (SVM) was then used to
combine the output of CNN with the other features derived
from the ECG, including phase-rectified signal averaging
(PRSA), sample entropy, as well as standard spectral and
temporal HRV measures.
The ECG signals were preprocessed by a finite impulse response (FIR) lowpass filter with a band stop at 22 Hz and a FIR highpass filter with a corner frequency of 1.2 Hz. A state-of-the-art QRS detector (jqrs) was used for ECG R-peak detection (Johnson et al. (2015)).
ECG Classification #6
Kalman-based Spectro-Temporal ECG Analysis using Deep Convolutional Networks for Atrial Fibrillation Detection
Zheng Zhao, Simo Särkkä, and Ali Bahrami Rad
https://arxiv.org/abs/1812.05555
For ECG signals, one can directly adopt 1D convolutional or recurrent network models for the classification task. However, transforming signals into the spectral domain (spectro-temporal features) is a promising alternative approach, knowing that the current state-of-the-art deep convolutional neural network (CNN) structures are typically designed for 2D images.
The contributions of this paper are: 1) We propose two extended models for spectro-temporal estimation using Kalman filter and smoother. We then combine them with deep convolutional networks for AF detection. 2) We test and compare the performance of the proposed approaches for spectro-temporal estimation on simulated data and AF detection with other popular estimation methods and different classifiers. 3) For AF detection, we evaluate the proposals using the PhysioNet/CinC 2017 dataset, which is considered to be a challenging dataset that resembles practical applications, and our results are in line with the state-of-the-art.
The key advantages of this kind of approach over other spectro-temporal methods are that we can apply them to both evenly and unevenly sampled signals [25] and they require no stationarity guarantees nor windowing.
In practice, the computational cost of Kalman filter and smoother can be extensive
when the length of the signal is very long. However, instead of the Fourier series state
space model in previous section, one can also derive an alternative representation
using stochastic oscillator differential equations. In this way, the dynamic and
measurement models become linear time-invariant (LTI) so that we can leverage a
stationary Kalman filter to reduce the time consumption. This kind of stochastic oscillator model was also considered in [33] and the link to periodic Gaussian process models was investigated in [35].
EEG Classification #1a
Deep learning with convolutional neural networks for EEG decoding and visualization https://doi.org/10.1002/hbm.23730
https://github.com/robintibor/braindecode/
There is increasing interest in using deep ConvNets for end-to-end EEG analysis, but a better understanding of how to design and train ConvNets for end-to-end EEG decoding, and how to visualize the informative EEG features the ConvNets learn, is still needed. Here, we studied deep ConvNets with a range of different architectures, designed for decoding imagined or executed tasks from raw EEG.
Our study thus shows how to design and train ConvNets to decode task-related information from the raw EEG without handcrafted features and highlights the potential of deep ConvNets combined with advanced visualization techniques for EEG-based brain mapping.
EEG Classification #1b
Deep learning with convolutional neural networks for EEG decoding and visualization
https://doi.org/10.1002/hbm.23730 → https://github.com/robintibor/braindecode/
Correlation between the mean squared envelope feature and unit output for a single subject at one electrode position (FCC4h).
Left: All correlations. Colors indicate the correlation between unit outputs per convolutional filter (x-axis) and mean squared
envelope in different frequency bands (y-axis). Filters are sorted by their correlation to the 7–13 Hz envelope (outlined by the
black rectangle). Note the large correlations/anticorrelations in the alpha/beta bands (7–31 Hz) and somewhat
weaker correlations/anticorrelations in the gamma band (around 75 Hz). Right: mean absolute values across units of all
convolutional filters for all correlation coefficients of the trained model, the untrained model and the difference between the
trained and untrained model. Peaks in the alpha, beta, and gamma bands are clearly visible.
CSP: common spatial patterns
EEG + ECG Classification
Use of features from RR-time series and EEG signals for automated classification of sleep stages in deep neural network framework
https://doi.org/10.1016/j.bbe.2018.05.005
The method uses iterative filtering (IF) based multiresolution analysis
approach for the decomposition of RR-time series into intrinsic mode
functions (IMFs). The recurrence quantification analysis (RQA) and
dispersion entropy (DE) based features are evaluated from the IMFs of the RR-time series. The dispersion entropy and the variance features are evaluated from the different bands of the EEG signal. The RR-time series features and the EEG features coupled with the deep neural network (DNN) are …
Stacked autoencoders with binary classifiers? Slightly confusing architecture. Engineered features with deep learning?
EMG Classification
EMG Pattern Recognition in the Era of Big Data and Deep Learning
Big Data Cogn. Comput. 2018, 2(3), 21;
https://doi.org/10.3390/bdcc2030021
We provide a review of recent research and development in EMG
pattern recognition methods that can be applied to big data
analytics.
These modern EMG signal analysis methods can be divided into two
main categories: (1) methods based on feature engineering
involving a promising big data exploration tool called topological data
analysis; and (2) methods based on feature learning with a special
emphasis on "deep learning".
Compared to other well-known bioelectrical signals (e.g.,
electrocardiogram, ECG; electrooculogram, EOG; and galvanic skin
response, GSR), however, the analysis of surface EMG signal is
more challenging given that it is stochastic in nature.
Due to the increasing availability of multi-modality sensing
systems, multi-modal analysis approaches are becoming a viable
option. Multiple modalities can be used to capture complementary
information which is not visible using a single modality, or to provide
context for others.
Even when two or more modalities capture similar information, their
combination can still improve the robustness of pattern
recognition systems when one of the modalities is missing or noisy.
Outside of prosthesis control, other applications of EMG pattern recognition for which multi-
modality data sets exist include, for example, sleep studies, such as the Cyclic Alternating
Pattern (CAP) Sleep Database [49] and the Sleep Heart Health Study (SHHS) Polysomnography
Database [50]; biomechanics, such as the cutting movement dataset [51] and the horse gait
dataset [52]; and brain computer interfaces, such as the Affective Pacman dataset [53] and the
emergency braking assistance dataset [54]. Recently, emotion recognition using multiple
physiological modalities has gained attention as another important application that has benefited
from the incorporation of surface EMG.
http://doi.org/10.3390/s17071622
Time series → 2D Recurrence Plots → Shapelets
This paper investigates the performance of Recurrence Plots (RP) [Eckmann et al. 1987] within the deep CNN model for TSC. RP provides a way to visualize the periodic nature of a trajectory through a phase space and enables us to investigate certain aspects of the m-dimensional phase space trajectory through a 2D representation. Because of the recent outstanding results by CNN on image recognition, we first encode time-series signals as 2D plots, and then treat the TSC problem as a texture recognition task. A CNN model with 2 hidden layers followed by a fully connected layer is used.
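A compact numpy sketch of the encoding step: embed the series in an m-dimensional phase space and threshold the pairwise distances, giving the binary recurrence image that is then fed to the CNN as a texture. Parameter names follow common RP conventions rather than the paper's exact settings.

```python
import numpy as np

def recurrence_plot(x, dim=3, delay=1, eps=None):
    """Binary recurrence plot of a univariate series (Eckmann et al., 1987).

    dim, delay : time-delay embedding parameters
    eps        : recurrence threshold; defaults to 10% of the maximum distance
    """
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * delay
    # time-delay embedding: each row is one phase-space point
    emb = np.stack([x[i:i + n] for i in range(0, dim * delay, delay)], axis=1)
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    eps = 0.1 * dists.max() if eps is None else eps
    return (dists <= eps).astype(np.uint8)        # (n, n) image for the CNN
```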
In particular, comparing models using RP with the traditional classification framework (e.g. SIFT, Gabor and LBP features with an SVM classifier [25, 26]) and other CNN-based time-series image classification (e.g. GAF-MTF images with CNN [23, 24]) demonstrates that using RP images with CNN in our proposed model obtains the better results.
As future work, CNN architectures with more feature representation layers should be investigated for more difficult TSC tasks (preferably with more data samples available). Large datasets are needed in order to train deeper architectures.
Therefore, adopting the proposed pipeline for TSC with small sample sizes can be another
interesting future direction. Exploring different ensemble learning methods for CNN can be also
interesting. We will particularly be investigating application of the output coding for CNNs.
Wavelets for deep learning TSC #1
Multilevel Wavelet Decomposition Network for Interpretable Time Series Analysis
https://doi.org/10.1145/3219819.3220060
To this end, we first designed a novel wavelet-based network structure called mWDN for
frequency learning of time series, which can then be seamlessly embedded into deep learning
frameworks by making all parameters trainable. We further designed two deep learning
models based on mWDN for time series classification and forecasting, respectively, and
the extensive experiments on abundant real-world datasets demonstrated their superiority to
state-of-the-art competitors. As a nice try for interpretable deep learning, we further propose an importance analysis method for identifying important factors for time series analysis, which in turn verifies the interpretability merit of mWDN.
Frequency Analysis of Time Series. Frequency analysis of time series data has been deeply studied by the signal processing community. Many classical methods, such as the Discrete Wavelet Transform, Discrete Fourier Transform, and Z-Transform, have been proposed to analyze the frequency pattern of time series signals. In existing TSC/TSF applications, however, transforms are usually used as an independent step in data preprocessing, which has no interactions with model training and therefore might not be optimized for TSC/TSF tasks from a global view. In recent years, some research works, such as Clockwork RNN [Koutnik et al. 2014] and SFM [Hao Hu and Guo-Jun Qi 2017], have begun to introduce the frequency analysis methodology into the deep learning framework. To our best knowledge, our study is among the very few works that embed wavelet time series transforms as a part of neural networks so as to achieve end-to-end learning.
Wavelets for deep learning TSC #2
Learning filter widths of spectral decompositions with wavelets
Haidar Khan and Bülent Yener. Rensselaer Polytechnic Institute
http://papers.nips.cc/paper/7711-learning-filter-widths-of-spectral-decompositions-with-wavelets.pdf
https://github.com/haidark/WaveletDeconv
We propose the wavelet deconvolution (WD) layer as an efficient alternative to this preprocessing step that eliminates a significant number of hyperparameters. The WD layer uses wavelet functions with adjustable scale parameters to learn the spectral decomposition directly from the signal.
Furthermore, the WD layer adds interpretability to the learned time series classifier by exploiting the properties of the wavelet transform.
As future work, we plan to investigate how to extend the WD layer to signals in higher dimensions, such as images and video, as well as generalizing the wavelet transform to empirical mode decompositions (EMDs).
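To make the idea of a layer with trainable filter widths concrete, here is an illustrative Keras re-creation: each output channel convolves the input with a Ricker (Mexican-hat) wavelet whose width is a trainable parameter, so the spectral decomposition is learned jointly with the rest of the network. The exact wavelet family, parameterization and initialisation used by Khan & Yener may differ; this is a sketch under those assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

class WaveletDeconv(layers.Layer):
    """Wavelet-deconvolution-style layer with trainable filter widths (sketch)."""
    def __init__(self, n_scales=8, kernel_size=65, **kwargs):
        super().__init__(**kwargs)
        self.n_scales = n_scales
        self.kernel_size = kernel_size

    def build(self, input_shape):
        # log-parameterised widths so they stay positive during training
        init = np.log(np.linspace(1.0, 8.0, self.n_scales)).astype("float32")
        self.log_widths = self.add_weight(
            name="log_widths", shape=(self.n_scales,),
            initializer=tf.constant_initializer(init), trainable=True)

    def call(self, x):                                    # x: (batch, time, 1)
        t = tf.range(self.kernel_size, dtype=tf.float32)
        t = t - tf.cast(self.kernel_size - 1, tf.float32) / 2.0
        widths = tf.exp(self.log_widths)                  # (n_scales,)
        a = (t[None, :] / widths[:, None]) ** 2           # (n_scales, kernel_size)
        ricker = (1.0 - a) * tf.exp(-a / 2.0)             # one wavelet per scale
        filters = tf.transpose(ricker)[:, None, :]        # (kernel_size, 1, n_scales)
        return tf.nn.conv1d(x, filters, stride=1, padding="SAME")

# usage sketch: y = WaveletDeconv()(tf.random.normal((4, 1024, 1)))  # -> (4, 1024, 8)
```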
Wavelets for deep learning TSC #3
Multilevel Wavelet Decomposition Network for Interpretable Time Series Analysis
https://doi.org/10.1145/3219819.3220060
In this paper we propose a wavelet-based neural network structure called multilevel Wavelet
Decomposition Network (mWDN) for building frequency-aware deep learning models for
time series analysis. mWDN preserves the advantage of multilevel discrete wavelet
decomposition in frequency learning while enables the fine-tuning of all parameters under a
deep neural network framework. Based on mWDN, we further propose two deep learning
models called Residual Classification Flow (RCF) and multi-frequency Long Short-Term
Memory (mLSTM) for time series classification and forecasting, respectively. The two models
take all or partial mWDN decomposed sub-series in different frequencies as input, and resort to
the back propagation algorithm to learn all the parameters globally, which enables seamless
embedding of wavelet-based frequency analysis into deep learning frameworks.
Multivariate time-series classification #1: CNN only
Temporal Convolutional Neural Network for the Classification of Satellite Image Time Series
Charlotte Pelletier, Geoffrey I. Webb and François Petitjean
(Submitted on 31 Jan 2019)
https://arxiv.org/abs/1811.10166
https://github.com/charlotte-pel/temporalCNN (Keras)
Note! Despite the name, the authors used traditional convolutional filters for time series, and not TCNs.
Multivariate time-series classification #2: CNN + LSTM
Multivariate LSTM-FCNs for Time Series Classification
Fazle Karim, Somshubra Majumdar, Houshang Darabi, Samuel Harford
(Submitted on 14 Jan 2018)
https://arxiv.org/abs/1801.04503
We propose augmenting the existing univariate time series classification models, LSTM-FCN and ALSTM-FCN, with a squeeze-and-excitation block to further improve performance.
The proposed models work efficiently on various complex multivariate time series classification tasks such as activity recognition or action recognition. Furthermore, the proposed models are highly efficient at test time and small enough to deploy on memory-constrained systems. For datasets with class imbalance, a class weighting scheme inspired by King et al. (2001) is used.
Multivariate time-series classification #3: CNN + GRU
Deep Gated Recurrent and Convolutional Network Hybrid Model for Univariate Time Series Classification
Nelly Elsayed, Anthony S. Maida and Magdy Bayoumi
(Submitted on 27 Dec 2018) https://arxiv.org/abs/1812.07683
https://github.com/NellyElsayed/GRU-FCN-model-for-univariate-time-series-classification
The proposed GRU-FCN classification model shows that
replacing the LSTM by a GRU enhances the classification
accuracy without needing extra algorithm enhancements
such as fine-tuning or attention algorithms. The GRU also
has a smaller architecture that requires fewer
computations than the LSTM. Moreover, the GRU-based
model requires smaller number of trainable parameters,
memory, and training time comparing to the LSTM-based
models.
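A minimal Keras sketch of the GRU-FCN idea discussed on these two slides: a GRU branch and a fully convolutional branch process the same input in parallel and their features are concatenated before the softmax. Layer sizes follow common LSTM-FCN/GRU-FCN configurations and the dimension-shuffle trick of the original LSTM-FCN is omitted, so this does not reproduce the papers exactly.

```python
from tensorflow.keras import layers, models

def build_gru_fcn(n_timesteps, n_classes, gru_units=8):
    """GRU-FCN: GRU branch + FCN branch (Conv-BN-ReLU x3 with global pooling)."""
    inp = layers.Input(shape=(n_timesteps, 1))

    # recurrent branch (GRU replaces the LSTM of LSTM-FCN)
    g = layers.GRU(gru_units)(inp)
    g = layers.Dropout(0.8)(g)

    # fully convolutional branch
    c = inp
    for filters, kernel in [(128, 8), (256, 5), (128, 3)]:
        c = layers.Conv1D(filters, kernel, padding="same")(c)
        c = layers.BatchNormalization()(c)
        c = layers.Activation("relu")(c)
    c = layers.GlobalAveragePooling1D()(c)

    out = layers.Dense(n_classes, activation="softmax")(layers.concatenate([g, c]))
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```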
Application for multivariate time series: Wearable sensors
WearableDL: Wearable Internet-of-Things and Deep Learning for Big Data Analytics—Concept, Literature, and Future
Aras R. Dargazany, Paolo Stegagno, and Kunal Mankodiya
(Submitted on 14 November 2018)
https://doi.org/10.1155/2018/8125126
This work introduces Wearable deep learning (WearableDL) that is a
unifying conceptual architecture inspired by the human nervous
system, offering the convergence of deep learning (DL), Internet-of-Things (IoT), and wearable technologies (WT).
Application for multivariate time series: Action Recognition
Sensor Data Acquisition and Multimodal Sensor Fusion for Human Activity Recognition Using Deep Learning
Published: 10 April 2019
(This article belongs to the Special Issue Deep Learning Based Sensing Technologies for Autonomous Vehicles)
Sensors2019, 19(7), 1716; https://doi.org/10.3390/s19071716
We develop a Long Short-Term Memory (LSTM) network framework to support
training of a deep learning model on human activity data, which is acquired in
both real-world and controlled environments. From the experiment results, we
identify that activity data with sampling rate as low as 10 Hz from four sensors at
both sides of wrists, right ankle, and waist is sufficient in recognizing Activities of
Daily Living (ADLs) including eating and driving activity. We adopt a two-level
ensemble model to combine class-probabilities of multiple sensor modalities,
and demonstrate that a classifier-level sensor fusion technique can improve the
classification performance. By analyzing the accuracy of each sensor on
different types of activity, we elaborate custom weights for multimodal
sensor fusion that reflect the characteristic of individual activities.
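A sketch of the classifier-level fusion described here: each sensor modality produces class probabilities from its own model, and a per-activity weight matrix (reflecting how reliable each sensor is for each activity) combines them. The weights below are placeholders; the paper derives them from its per-sensor accuracy analysis.

```python
import numpy as np

def fuse_class_probabilities(per_sensor_probs, sensor_weights):
    """Weighted classifier-level fusion of multimodal predictions.

    per_sensor_probs : array (n_sensors, n_samples, n_classes) of softmax outputs
    sensor_weights   : array (n_sensors, n_classes), e.g. derived from each
                       sensor's per-activity accuracy (illustrative assumption)
    Returns fused class predictions of shape (n_samples,).
    """
    weighted = per_sensor_probs * sensor_weights[:, None, :]   # broadcast over samples
    fused = weighted.sum(axis=0)                               # sum over sensors
    return fused.argmax(axis=1)
```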
Ensembling models for uni/multivariate time series as well
Deep Neural Network Ensembles for Time Series Classification
Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar and Pierre-Alain Muller
IRIMAS, Université Haute-Alsace, Mulhouse, France
https://arxiv.org/abs/1903.06602
In the future, we would like to consider a meta-learning approach where the output logistics of individual deep learning models are fed to a meta-network that learns to map these inputs to the correct prediction (e.g. Ju et al. 2019; 2018).
Segmentation Literature Review
Segmenting time series
BEATS: Blocks of Eigenvalues Algorithm for Time Series Segmentation
https://doi.org/10.1109/TKDE.2018.2817229 (2018)
https://github.com/auroragonzalez/BEATS implemented in R
The massive collection of data via emerging technologies like
the Internet of Things (IoT) requires finding optimal ways to reduce the observations in the time series analysis domain.
In this paper, we propose a segmentation algorithm that adapts
to unannounced mutations of the data (i.e., data drifts).
The algorithm splits the data streams into blocks and groups
them in square matrices, computes the Discrete Cosine Transform (DCT), and quantizes them.
The algorithm, called BEATS, is designed to tackle dynamic
IoT streams, whose distribution changes over time. We
implement experiments with six datasets combining real,
synthetic, real-world data, and data with drifts. Compared to
other segmentation methods like Symbolic Aggregate
approXimation (SAX), BEATS shows significant improvements.
Trying it with classification and clustering algorithms
it provides efficient results. BEATS is an effective mechanism to
work with dynamic and multi-variate data, making it suitable for IoT data sources.
By using BEATS, we are able to restructure the streaming data in a 2D way and then transform it into the frequency
domain using DCT. The algorithm finds a smaller sequence that contains the key information of the initial representative.
This aggregation provides an opportunity to eliminate repetitive content and similarities that can be found in the sequence
of data. The eigenvalue vectors are a homogeneous representation of the data streams in BEATS that allow us to
go one step further in understanding of the sequences and patterns that can be considered as the data structure of a data
series in an application domain (e.g. smart cities). Its applications can be extended to several other domains and various
patterns/activity monitoring and detection methods. The future work will focus on applying 3D cosine transform and
adaptive blocksize estimation.
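As an orientation aid, a numpy/scipy sketch of a BEATS-style feature per segment: group the stream into square blocks, apply a 2D DCT, quantize, and keep the eigenvalues of the quantized matrix. The block size, quantization rule and output format are illustrative assumptions rather than the authors' exact design (their R implementation is linked above).

```python
import numpy as np
from scipy.fft import dctn

def beats_like_features(stream, block=8, q_step=10.0):
    """Compact eigenvalue-based representation of consecutive block x block segments."""
    stream = np.asarray(stream, dtype=float)
    seg_len = block * block
    n_segments = len(stream) // seg_len
    features = []
    for i in range(n_segments):
        mat = stream[i * seg_len:(i + 1) * seg_len].reshape(block, block)
        coeffs = dctn(mat, norm="ortho")          # 2D Discrete Cosine Transform
        quantized = np.round(coeffs / q_step)     # coarse uniform quantization
        eigvals = np.linalg.eigvals(quantized)    # eigenvalues of the quantized block
        features.append(np.sort(np.abs(eigvals))[::-1])
    return np.array(features)                     # (n_segments, block)
```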
Regression/Forecasting Literature Review
Financial Forecasting with Deep Learning #1
Conditional Time Series Forecasting with Convolutional Neural Networks
Anastasia Borovykh, Sander Bohte, Cornelis W. Oosterlee
https://arxiv.org/abs/1703.04691 (2017)
We present a method for conditional time series forecasting
based on an adaptation of the recent deep convolutional
WaveNet architecture. The proposed network contains stacks of
dilated convolutions that allow it to access a broad range of
history when forecasting, a ReLU activation function and
conditioning is performed by applying multiple convolutional filters
in parallel to separate time series which allows for the fast
processing of data and the exploitation of the correlation
structure between the multivariate time series.
We show that a convolutional network is well-suited for
regression-type problems and is able to effectively learn
dependencies in and between the series without the need for long
historical time series, is a time-efficient and easy to implement
alternative to recurrent-type networks and tends to outperform
linear and recurrent models.
Effectively, we use multiple financial time series as input in a neural network, thus conditioning the forecast of a time series on both its own history as well as that of multiple other time series. Training a model on multiple stock series allows the network to exploit the correlation structure between these series so that the network can learn the market dynamics in shorter sequences of data.
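A minimal Keras sketch of the dilated, causal convolution stack at the heart of such a WaveNet-style forecaster, with the conditioning series simply supplied as extra input channels; the gating, residual and skip connections of the full architecture are omitted for brevity, and layer sizes are illustrative.

```python
from tensorflow.keras import layers, models

def build_dilated_forecaster(window, n_series, filters=32, n_layers=6):
    """Stack of causal Conv1D layers with exponentially growing dilation (1, 2, 4, ...).

    The receptive field grows as 2**n_layers, so long histories are seen without RNNs.
    """
    inp = layers.Input(shape=(window, n_series))
    x = inp
    for i in range(n_layers):
        x = layers.Conv1D(filters, kernel_size=2, dilation_rate=2 ** i,
                          padding="causal", activation="relu")(x)
    out = layers.Conv1D(1, kernel_size=1)(x)      # one-step-ahead value per time step
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mae")
    return model
```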
While on the relatively short time series the prediction time is negligible when compared to the training time, for longer time series the prediction of the autoregressive model may be sped up by implementing a recent variation that exploits the memorization structure of the network, or by speeding up the convolutions by working in the frequency domain employing Fourier transforms. Finally, it is well-known that correlations between data points are stronger on an intraday basis. Therefore, it might be interesting to test the model on intraday data to see if the ability of the model to learn long-term dependencies is even more valuable in that case.
Financial Forecasting with Deep Learning #2
Autoregressive Convolutional Neural Networks for Asynchronous Time Series
Mikołaj Bińkowski, Gautier Marti, Philippe Donnat
(Submitted on 12 Mar 2017 (v1), last revised 12 Jun 2018 (this version, v4))
https://arxiv.org/abs/1703.04122
https://github.com/mbinkowski/nntimeseries
We propose Significance-Offset Convolutional Neural
Network, a deep convolutional network architecture for
regressionofmultivariateasynchronoustimeseries.
Conclusion and discussion: In this article, we proposed a weighting mechanism that, coupled with convolutional networks, forms a new neural network architecture for time series prediction. The proposed architecture is designed for regression tasks on asynchronous signals in the presence of a high amount of noise. This approach has proved to be successful in forecasting several asynchronous time series, outperforming popular convolutional and recurrent networks.
The proposed model can be extended further by adding intermediate weighting layers of the same type in the network structure. Another possible generalization that requires further empirical studies can be obtained by leaving the assumption of independent offset values for each past observation, i.e. considering not only 1x1 convolutional kernels in the offset sub-network.
Finally, we aim at testing the performance of the proposed architecture on other real-life datasets with relevant characteristics. We observe that there exists a strong need for a common 'econometric' datasets benchmark and, more generally, for time series (stochastic processes) regression.
Financial Forecasting with Deep Learning #3
Multi-task Learning for Financial Forecasting
Tao Ma, Guolin Ke (27 Sep 2018)
https://arxiv.org/abs/1809.10336
Due to the strong connections among stocks,
the information valuable for forecasting is not only
included in individual stocks, but also included in
the stocks related to them. However, most previous
works focus on one single stock, which easily
ignore the valuable information in others. To
leverage more information, in this paper, we
propose a jointly forecasting approach to
process multiple time series of related stocks
simultaneously, using multi-task learning
framework (Ruder 2017).
Durichen et al. (2015) used multi-task Gaussian
processes to process physiological time series.
Jung (2015) proposed a multi-task learning
approach to learn the conditional independence
structure of stationary time series. Liu et al. (2016)
used multi-task multi-view learning to predict urban
water quality. Harutyunyan et al. (2017) used
recurrent LSTM neural networks and multi-task
learning to deal with clinical time series. And Li et al.
(2018) applied multi-task representation learning to
travel time estimation. Moreover, some methods
are proposed to learn the shared representation of
all the task-private information, e.g., Misra et al.
(2016) proposed cross-stitch networks to combine
multiple task-private latent features.
In future work, we would like to further improve SPA's ability to combine latent features. For DMTL, we would like to build hierarchical models to extract the shared information from all tasks more efficiently.
The contributions of this paper are multifold:
● To the best of our knowledge, the proposed multi-series jointly forecasting approach is the first work applying multi-task learning to time series forecasting for multiple related stocks.
● We propose a novel attention method to learn the optimized combination of shared and task-private latent features based on the idea of CAPM.
● We demonstrate in experiments on financial data that the proposed approach outperforms single-task baselines and other MTL based methods, which further improves the forecasting performance.
Financial Forecasting with Deep Learning #4
Multi-task Learning for Financial Forecasting
Tao Ma, Guolin Ke (27 Sep 2018)
https://arxiv.org/abs/1809.10336
In this paper, we empirically study the applicability of the latest deep structures with respect to the volatility modelling problem, through which we aim to provide an empirical guidance for the theoretical analysis of the marriage between deep learning techniques and financial applications in the future.
We examine both the traditional approaches and the
deep sequential models on the task of volatility
prediction, including the most recent variants of
convolutional and recurrent networks, such as the
dilated architecture.
Accordingly, experiments with real-world stock
price datasets are performed on a set of 1314 daily
stock series for 2018 days of transaction. The
evaluation and comparison are based on the negative log likelihood (NLL) of real-world stock price time series.
The result shows that the dilated neural models,
including dilated CNN and Dilated RNN, produce
most accurate estimation and prediction,
outperforming various widely-used deterministic
models in the GARCHfamily and several recently
proposed stochastic models. In addition, the high
flexibility and rich expressive power are validated in this
study.
Trading with Deep Learning #1a
DeepLOB: Deep Convolutional Neural Networks for Limit Order Books
Zihao Zhang, Stefan Zohren, Stephen Roberts (2018)
https://arxiv.org/abs/1808.03668
We develop a large-scale deep learning
model to predict price movements from limit
order book (LOB) data of cash equities. The
architecture utilises convolutional filters to
capture the spatial structure of the limit order
books as well as LSTM modules to capture
longer time dependencies.
Importantly, our model translates well to
instruments which were not part of the training set,
indicating the model’s ability to extract universal
features. In order to better understand these
features and to go beyond a “black box” model, we
perform a sensitivity analysis to understand the
rationale behind the model predictions and reveal
the components of LOBs that are most
relevant. The ability to extract robust features
which translate well to other instruments is an
important property of our model which has many
other applications.
We use standardisation (z-score) to normalise our
data, and use the mean and standard deviation of
the previous day’s data to normalise the current
day’s data (separate normalisation for each
instrument):
Because financial data is highly stochastic, if we simply compare p_t and p_{t+k} to decide the price movement, the resulting label set will be noisy. We adopt the idea in Tsantekidis et al. (2017) to introduce a smoothed labelling method.
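A sketch of the two preprocessing steps quoted above: per-instrument z-score normalisation of today's features with yesterday's mean and standard deviation, and a smoothed three-class labelling in the spirit of Tsantekidis et al. (2017), which compares the mean of the next k mid-prices with the current price against a threshold. The horizon k and threshold alpha are illustrative, not the paper's values.

```python
import numpy as np

def zscore_with_previous_day(today, yesterday):
    """Normalise today's features using yesterday's statistics (separately per instrument)."""
    mu, sigma = yesterday.mean(axis=0), yesterday.std(axis=0) + 1e-12
    return (today - mu) / sigma

def smoothed_labels(mid_price, k=10, alpha=0.002):
    """Three-class labels: mean of the next k mid-prices vs. the current price."""
    labels = np.full(len(mid_price), 1, dtype=int)            # 1 = stationary
    for t in range(len(mid_price) - k):
        m_plus = mid_price[t + 1:t + 1 + k].mean()
        change = (m_plus - mid_price[t]) / mid_price[t]
        if change > alpha:
            labels[t] = 2                                      # up
        elif change < -alpha:
            labels[t] = 0                                      # down
    return labels
```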
Trading with Deep Learning #1b
DeepLOB: Deep Convolutional Neural Networks for Limit Order Books
Zihao Zhang, Stefan Zohren, Stephen Roberts (2018)
https://arxiv.org/abs/1808.03668
To observe what convolutional layers do, we feed a single input to the trained model and plot
the intermediate outputs on the right of Figure 5. Since 16 filters are applied, we get 16
series after the “Conv” block. The convolution operations transform the original time-series
into signals that indicate time regions that have great impacts on final outputs. In our case, we
observe strong signals around t = 1, 20, 40, 70 time stamps, suggesting information at
these time stamps decide the final outputs.
In our case, we use LIME [Ribeiro et al. 2016] to reveal components of LOBs that are most important for predictions and to understand why the proposed model DeepLOB works better than the Ref model [Tsantekidis et al. (2017)]. LIME uses an interpretable model to approximate the prediction of a complex model on a given input. It locally perturbs the input and observes variations in the model's predictions, thus providing some measure of information regarding input importance and sensitivity.
Trading with Deep Learning #2
Developing Arbitrage Strategy in High-frequency Pairs Trading with Filterbank CNN Algorithm
Yu-Ying Chen; Wei-Lun Chen; Szu-Hao Huang (2018)
https://doi.org/10.1109/AGENTS.2018.8459920
This paper proposed a novel intelligent high-
frequency pairs trading system in Taiwan Stock
Index Futures (TX) and Mini Index Futures
(MTX) market based on deep learning
techniques.
This research utilized the improved time
series visualization method to transfer
historical volatilities with different time frames
into 2D images which are helpful in capturing
arbitrage signals.
Moreover, this research improved convolutional
neural networks (CNN) model by combining
the financial domain knowledge and
filterbank mechanism. We proposed
Filterbank CNN to extract high-quality features
by replacing the random-generating filters with
the arbitrage knowledge filters.
Algorithmic financial trading with deep convolutional neural networks: Time series to image conversion approach
Omer Berat Sezer and Ahmet Murat Ozbayoglu (2018)
https://doi.org/10.1016/j.asoc.2018.04.024
For future work, we will use more
Exchange-Traded Fund (ETFs)
and stocks in order to create
more data for the deep learning
models.
We will also analyze the
correlations between selected
indicators in order to create more
meaningful images so that the
learning models can better
associate the Buy–Sell–Hold
signals and come up with more
profitable trading models.
Trading with Deep Learning: GANs
Generative Adversarial Networks for Financial Trading Strategies Fine-Tuning and Combination
Adriano Koshiyama, Nick Firoozye, and Philip Treleaven
(Jan 2019) https://arxiv.org/abs/1901.01751
Systematic trading strategies are
algorithmic procedures that allocate
assets aiming to optimize a certain
performance criterion. To obtain an
edge in a highly competitive
environment, the analyst needs to
proper finetune its strategy, or discover
how to combine weak signals in
novel alpha creating manners.
Both aspects, namely fine-tuning and
combination, have been extensively
researched using several methods, but
emerging techniques such as
Generative Adversarial Networks can
have an impact into such aspects.
Therefore, our work proposes the use
of Conditional Generative
Adversarial Networks (cGANs) for
trading strategies calibration and
aggregation.
Stock Market Prediction on High-Frequency Data Using Generative Adversarial Nets
Xingyu Zhou et al. 2018
https://doi.org/10.1155/2018/4907423
In this paper, we propose a generic framework employing Long Short-Term Memory (LSTM) and convolutional neural network (CNN) for adversarial training to forecast the high-frequency stock market. This model takes the publicly available index provided by trading software as input to avoid complex financial theory research and difficult technical analysis, which provides convenience for the ordinary trader of non-financial specialty.
Based on the deep learning network, this model achieves prediction ability superior to other benchmark methods by means of adversarial training, minimizing direction prediction loss, and forecast error loss. Moreover, the effects of the model update cycles on the predictive capability are analyzed, and the experimental results show that a smaller model update cycle can obtain better prediction performance. In the future, we will attempt to integrate predictive models under multiscale conditions.
Glucose Prediction CNN-RNN Hybrid
Kezhi Li, John Daniels, Chengyuan Liu, Pau Herrero, Pantelis Georgiou
Department of Electronic and Electrical Engineering, Imperial College London
https://arxiv.org/abs/1807.03043
Current digital therapeutic approaches for subjects with Type 1 diabetes mellitus (T1DM) such as the artificial pancreas and
insulin bolus calculators leverage machine learning techniques for predicting subcutaneous glucose for improved
control.
In this work, we present a deep learning model that is capable of predicting glucose levels over a 30-minute horizon.
The prediction algorithm is implemented on an Android mobile phone (LG Nexus5 with Processor:2.26GHz quad-core,
RAM:2GB, 8-bit integer) , with an execution time of 6ms on a phone compared to an execution time of 780ms,on a laptop
(MacProwith Processor:3.1GHz Intel Core i5, RAM:8GB, 32-bit fp) inPython.
Given that learning is solely based on historical data, unexpected predictions may occur, given that correlations learned in the data may not imply causation. Thus hybrid approaches, whereby the deep learning model is used to make an accurate prediction and rules of meal/bolus supported by a physiological model avoid apparent errors that might result, are attractive. Based on the CRNN approach proposed in this paper, it is possible to develop the hybrid method, which may have the advantages of both conventional and DL algorithms.
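A minimal Keras sketch of a CRNN of the kind described here: causal 1D convolutions over the recent CGM history feed a recurrent layer, and a dense head regresses the glucose value 30 minutes ahead. Layer sizes and the input definition are illustrative, not the authors' exact configuration.

```python
from tensorflow.keras import layers, models

def build_crnn(history_len=24, n_channels=1):
    """Conv1D front end + LSTM + dense head for 30-min-ahead glucose regression."""
    inp = layers.Input(shape=(history_len, n_channels))   # e.g. 2 h of 5-min CGM samples
    x = layers.Conv1D(32, 4, padding="causal", activation="relu")(inp)
    x = layers.Conv1D(64, 4, padding="causal", activation="relu")(x)
    x = layers.LSTM(64)(x)
    out = layers.Dense(1)(x)                               # predicted glucose at t + 30 min
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model
```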
Survival models Literature Review
Clinical Survival Models: Cancer Survival
A Simple Discrete-Time Survival Model for Neural Networks
Michael F. Gensheimer and Balasubramanian Narasimhan
Stanford University (May 2018)
https://github.com/MGensheimer/nnet-survival Keras
It is recommended to use at least ten time intervals to avoid bias in the survival estimates [17]. Using narrow time intervals also helps avoid inaccurate parameter estimates if the effect of the input data varies rapidly with follow-up time (time-varying coefficients, in the language of survival analysis). In most of our experiments we have used 20-50 time intervals. We suggest choosing the cut-points so that around the same number of survival events fall into each time interval, which will help ensure reliable estimates for all time intervals.
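That cut-point rule can be implemented directly with event-time quantiles; the sketch below is a plain numpy illustration of it, with function and argument names of our own choosing rather than from the nnet-survival repository.

```python
import numpy as np

def choose_cutpoints(event_times, observed, n_intervals=25):
    """Pick interval boundaries so roughly equal numbers of observed events fall in each.

    event_times : array of follow-up times
    observed    : boolean array, True where the event occurred (not censored)
    """
    event_times = np.asarray(event_times, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    # quantiles of the *event* (non-censored) times define the interval edges
    qs = np.linspace(0.0, 1.0, n_intervals + 1)[1:-1]
    cuts = np.quantile(event_times[observed], qs)
    return np.concatenate([[0.0], cuts, [event_times.max()]])
```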
While the model has several advantages and we think it will be useful for a broad range of applications, it does have some drawbacks. The discretization of follow-up time results in a less smooth predicted survival curve compared to a parametric survival model such as a Weibull accelerated failure time model.
As long as a sufficient number of time intervals is used, this is not a large practical concern. Unlike a parametric survival model, the model does not provide survival predictions past the end of the last time interval, so it is recommended to extend the last interval past the last follow-up time of interest.
The advantages of parametric survival models and our discrete-time survival model could be combined in the future using a flexible parametric model, such as the cubic spline-based model of Royston and Parmar (2002), implemented in the flexsurv R package.
Complex non-proportional hazards models (see Katzman et al. 2018 for a proportional deep learning model) can be created in this way, and likely could be implemented in deep learning packages.
Clinical Survival Models: Sequential DL "recurrent"
Deep Recurrent Survival Analysis
Kan Ren, Jiarui Qin, Lei Zheng, Zhengyu Yang, Weinan Zhang, Lin Qiu, Yong Yu
Shanghai Jiao Tong University (Sept 2018) https://arxiv.org/abs/1809.02403
Recent advances of modern technology make redundant data collection available for time-to-event information, which facilitates observing and tracking the event of interest. However, due to different reasons, many events lose tracking during the observation period, which makes the data censored.
We only know that the true time to the occurrence of the event is larger or smaller than, or within, the observation time, which has been defined as survivorship bias categorized into right-censored, left-censored and interval-censored respectively (Lee and Wang 2003). Survival analysis, a.k.a. time-to-event analysis (Lee et al. 2018; DeepHit), is a typical statistical methodology for modeling time-to-event data while handling censorship, which is a traditional research problem and has been studied over decades.
Our model proposes a novel modeling view for survival analysis, which aims at
flexibly modeling the survival probability function rather than making any
assumptions about the distribution form. Specifically, DRSA creatively predicts
the conditional probability of the event at each time given that the event has not
occurred before, and combines them through the probability chain rule for
estimating both the probability density function and the cumulative distribution
function of the event over time, eventually forecasting the survival rate at
each time, which is more reasonable and mathematically efficient for survival
analysis. Through these modeling methods, our DRSA model can capture
the sequential patterns embedded in the feature space along the
time axis, and output more effective distributions for each individual sample at a
fine-grained level.
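The chain-rule bookkeeping behind DRSA is easy to make concrete. Assuming a recurrent model has already produced per-step conditional hazards h_t = P(event at t | no event before t), the survival, density and distribution functions follow directly (hypothetical numpy sketch, not the paper's code):

import numpy as np

h = np.array([0.02, 0.05, 0.10, 0.20, 0.30])       # hypothetical conditional hazards per time step

survival = np.cumprod(1.0 - h)                      # S(t) = prod_{s<=t} (1 - h_s)
pdf = h * np.concatenate(([1.0], survival[:-1]))    # p(t) = h_t * S(t-1)
cdf = 1.0 - survival                                # F(t) = 1 - S(t)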
Clinical Survival Models: Cardiac Motion Analysis
Deep learning cardiac motion analysis for human survival prediction
Ghalib A. Bello, Timothy J.W. Dawes, Jinming Duan, Carlo Biffi, Antonio de Marvao, Luke S.G.E. Howard, J. Simon R. Gibbs,
Martin R. Wilkins, Stuart A. Cook, Daniel Rueckert, Declan P. O'Regan (Submitted on 8 Oct 2018)
Imperial College London, National Heart Centre Singapore, Singapore, and Duke-NUS Graduate Medical School, Singapore
https://arxiv.org/abs/1810.03382
https://github.com/UK-Digital-Heart-Project/4Dsurvival
Making predictions about future events from the current state of a moving three-
dimensional (3D) scene depends on learning correspondences between patterns of
motion and subsequent outcomes. Such relationships are important in biological
systems which exhibit complex spatio-temporal behaviour in response to stimuli or
as a consequence of disease processes. Here we use recent advances in machine
learning for visual processing tasks to develop a generalisable approach for modelling
time-to-event outcomes from time-resolved 3D sensory input. We tested this on
the challenging task of predicting survival due to heart disease through analysis of
cardiac imaging.
The traditional paradigm of epidemiological research is to draw insight from large-scale clinical
studies through linear regression modelling of conventional explanatory variables, but this approach
does not embrace the dynamic physiological complexity of heart disease. Even objective quantification of
heart function by conventional analysis of cardiac imaging relies on crude measures of global contraction that
are only moderately reproducible and insensitive to the underlying disturbances of cardiovascular physiology.
While conventional autoencoders are used for unsupervised learning tasks, we extend recent proposals for
supervised autoencoders in which the learned representations are both reconstructive and
discriminative. We achieved this by adding a prediction branch to the network with a loss function for
survival inspired by the Cox proportional hazards model. A hybrid loss function, optimising the trade-
off between survival prediction and accurate input reconstruction, is calibrated during training. The
compressed representations of 3D motion predict survival more accurately than a composite measure
of conventional manually-derived parameters measured on the same images.
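A hedged sketch of the kind of hybrid objective described above: a weighted sum of an autoencoder reconstruction term and a Cox partial-likelihood term computed on a latent risk score. The weighting, tensor shapes and the requirement that the batch is pre-sorted by descending survival time are assumptions of this sketch, not the 4Dsurvival code:

import tensorflow as tf

def cox_partial_nll(risk, event):
    # risk, event: (batch,) tensors, sorted by descending survival time (ties ignored)
    log_risk_set = tf.math.log(tf.cumsum(tf.exp(risk)))   # log sum of exp(risk) over each risk set
    return -tf.reduce_sum((risk - log_risk_set) * event)

def hybrid_loss(x, x_recon, risk, event, alpha=0.7):
    recon = tf.reduce_mean(tf.square(x - x_recon))        # motion reconstruction term
    surv = cox_partial_nll(risk, event)                   # survival prediction term
    return alpha * recon + (1.0 - alpha) * surv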
Sequential Time-Series: Literature Review
Representation Learning for Sequences
Unified recurrent neural network for many feature types
Alexander Stec, Diego Klabjan, Jean Utke
(Submitted on 24 Sep 2018)
https://arxiv.org/abs/1809.08717
“There are time series that are amenable to recurrent neural
network (RNN) solutions when treated as sequences, but some
series, e.g. asynchronous time series, provide a richer
variation of feature types than current RNN cells take into
account.
In order to address such situations, we introduce a unified RNN that
handles five different feature types, each in a different manner.
Our RNN framework separates sequential features into two
groups dependent on their frequency, which we call sparse and
dense features, and which affect cell updates differently.
Further, we also incorporate time features at the sequential
level that relate to the time between specified events in the
sequence and are used to modify the cell's memory state. We also
include two types of static (whole sequence level) features, one
related to time and one not, which are combined with the encoder
output.“
For future work, it would be interesting to incorporate even more
feature types than the five covered in this work. One in particular is a
feature type that gives time information looking forward in the sequence.
All features in this work use time information related to past events, but
there are cases that can benefit from the utility of incorporating
future knowledge when available. One example of this is the time to
the prediction from the current time step, so the network can have direct
knowledge of its absolute time location in the sequence.
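Not the authors' cell, but a minimal sketch (under assumed shapes and an assumed exponential decay) of two of the ingredients: letting the elapsed time between events decay the LSTM memory state, and fusing static, whole-sequence features with the encoder output:

import tensorflow as tf

units = 64
cell = tf.keras.layers.LSTMCell(units)

def encode_with_time_decay(x_seq, dt_seq, static_feats):
    # x_seq: (batch, steps, feats) sequential event features
    # dt_seq: (batch, steps, 1) elapsed time since the previous event
    # static_feats: (batch, n_static) whole-sequence-level features
    batch, steps = x_seq.shape[0], x_seq.shape[1]
    h = tf.zeros((batch, units))
    c = tf.zeros((batch, units))
    out = h
    for t in range(steps):
        c = c * tf.exp(-tf.nn.relu(dt_seq[:, t, :]))   # long gaps between events decay the memory
        out, (h, c) = cell(x_seq[:, t, :], states=[h, c])
    return tf.concat([out, static_feats], axis=-1)     # static features joined to the encoder output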
In Medical Diagnostics: Sequence ≈ Patient Visits
ShortFuse: Biomedical Time Series
Representations in the Presence of
Structured Information
Madalina Fiterau, Suvrat Bhooshan, Jason Fries, Charles
Bournhonesque, Jennifer Hicks, Eni Halilaj, Christopher Ré, Scott Delp
(revised 16 May 2017) Stanford University
https://arxiv.org/abs/1705.04790 - Cited by 5
“In healthcare applications, temporal variables that
encode movement, health status and longitudinal patient
evolution are often accompanied by rich structured
information such as demographics, diagnostics and
medical exam data (constant along the temporal domain).
However, current methods do not jointly optimize over
structured covariates and time series in the feature extraction
process.
We present ShortFuse, a method that boosts the accuracy of
deep learning models for time series by explicitly
modeling temporal interactions and dependencies
with structured covariates.
ShortFuse introduces hybrid convolutional and LSTM
cells that incorporate the covariates via weights that are
shared across the temporal domain.“
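ShortFuse's hybrid cells are specific to the paper, but the general pattern of conditioning a temporal model on static covariates can be sketched simply, e.g. by broadcasting the covariates along the time axis and concatenating them to every step of a 1D CNN (an illustrative approximation with made-up layer sizes, not the paper's hybrid cells):

import tensorflow as tf

def covariate_conditioned_cnn(n_steps, n_channels, n_covariates, n_classes):
    signal = tf.keras.Input((n_steps, n_channels))          # temporal variables
    covs = tf.keras.Input((n_covariates,))                  # demographics, diagnostics, exam data
    covs_rep = tf.keras.layers.RepeatVector(n_steps)(covs)  # repeat static covariates along time
    x = tf.keras.layers.Concatenate(axis=-1)([signal, covs_rep])
    x = tf.keras.layers.Conv1D(32, kernel_size=7, padding="same", activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    out = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model([signal, covs], out)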
Sequences /+→ Network science (Graph inference)
Referral paths in the U.S. physician network
Chuankai An, A. James O'Malley, Daniel N. Rockmore
(December 2018)
https://doi.org/10.1007/s41109-018-0081-4
For a patient, a “referral path” (“patient journey”) records the
chronological sequence of physicians encountered by a patient
(subject to certain constraints on the times between encounters). It
provides a basic unit of analysis in a broader referral network that
encodes the flow of patients and information between
physicians in a healthcare system. We consider referral networks
defined over a range of interactions as well as the characteristics of
referral paths, producing a characterization of the various networks
as well as the physicians they comprise.
In this paper we study the more fine-scale patterns to be found in the
consideration of the referral paths and importantly link these
statistics to treatment outcomes in the particular setting of
cardiovascular disease. Referral paths and referral information
generally have been ignored as a factor in the important problem of
treatment outcome prediction.
An example referral path with three physicians A, B, C. The patient visits them five
times. Physicians A and C are from the same HRR/hospital (in blue), while physician B is from
another HRR/hospital (in red).
Visualization of a hospital (PHN) referral network with 30 physicians and 101 directed
edges in 2011. Red, yellow and light blue nodes represent physicians with positive, zero and
negative net patient flow (NPF), respectively. Targets of referrals are marked with shadow on
directed edges.
Handling Small Data
Small Data for deep learning
Small Sample Learning in Big Data Era
https://arxiv.org/abs/1808.04572
Jun Shu, Zongben Xu, Deyu Meng
(last revised 22 Aug 2018)
As a promising area in artificial intelligence, a new learning paradigm,
called Small Sample Learning (SSL), has been attracting
prominent research attention in recent years. In this paper, we aim
to present a survey to comprehensively introduce the current
techniques proposed on this topic. Specifically, current SSL
techniques can be mainly divided into two categories.
The first category of SSL approaches can be called "concept
learning", which emphasizes learning new concepts from only few
related observations. The purpose is mainly to simulate human
learning behaviors like recognition, generation, imagination, synthesis
and analysis. The second category is called "experience
learning", which usually co-exists with the large sample learning
manner of conventional machine learning. This category mainly
focuses on learning with insufficient samples, and can also be called
small data learning in some literature.
More extensive surveys on both categories of SSL techniques are
introduced, and some neuroscience evidence is provided to
clarify the rationality of the entire SSL regime and the relationship
with the human learning process. Some discussions on the main
challenges and possible future research directions along this line are
also presented.
The Fast and the Flexible: training neural
networks to learn to follow instructions from
small data
https://arxiv.org/abs/1809.06194
Rezka Leonandya, Elia Bruni, Dieuwke Hupkes,
Germán Kruszewski (Submitted on 17 Sep 2018)
Learning to follow human instructions is a challenging
task because while interpreting instructions requires
discovering arbitrary algorithms, humans typically
provide very few examples to learn from.
For learning from this data to be possible, strong
inductive biases are necessary. Work in the past has
relied on hand-coded components or manually
engineered features to provide such biases. In contrast,
here we seek to establish whether this knowledge can
be acquired automatically by a neural network system
through a two-phase training procedure: a (slow)
offline learning stage where the network learns about
the general structure of the task and a (fast) online
adaptation phase where the network learns the
language of a new given speaker.
Data augmentation for time series
T-CGAN: Conditional Generative Adversarial Network for Data
Augmentation in Noisy Time Series with Irregular Sampling
https://arxiv.org/abs/1811.08295
Giorgia Ramponi, Pavlos Protopapas, Marco Brambilla and Ryan Janssen (20
Nov 2018)
In this paper we propose a data augmentation method for time series with irregular
sampling, the Time-Conditional Generative Adversarial Network (T-CGAN).
Our approach is based on Conditional Generative Adversarial Networks (CGAN),
where the generative step is implemented by a deconvolutional NN and the
discriminative step by a convolutional NN. Both the generator and the discriminator are
conditioned on the sampling timestamps, to learn the hidden relationship between
data and timestamps, and consequently to generate new time series.
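For contrast with the GAN-based approach, the simple transformation-based augmentations that are often used as baselines for 1D biosignals can be written in a few lines of numpy (a generic sketch, unrelated to the T-CGAN code; parameter values are illustrative):

import numpy as np

def jitter(x, sigma=0.03):
    # additive Gaussian noise; x: (timesteps, channels)
    return x + np.random.normal(0.0, sigma, size=x.shape)

def scale(x, sigma=0.1):
    # multiply each channel by a random factor close to 1
    return x * np.random.normal(1.0, sigma, size=(1, x.shape[-1]))

def window_slice(x, ratio=0.9):
    # crop a random contiguous window and stretch it back to the original length
    n = x.shape[0]
    win = int(n * ratio)
    start = np.random.randint(0, n - win)
    idx = np.linspace(start, start + win - 1, num=n)
    return np.stack([np.interp(idx, np.arange(n), x[:, c]) for c in range(x.shape[-1])], axis=-1)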
Data augmentation from invariance modelling #1
Data Augmentation of Room Classifiers using Generative
Adversarial Networks
Constantinos Papayiannis, Christine Evers, Patrick A. Naylor
https://arxiv.org/abs/1901.03257 (Jan 2019)
Data augmentation from invariance modelling #2
Sinusoidal wave generating network based on adversarial
learning and its application: synthesizing frog sounds for data
augmentation
Sangwook Park, David K. Han, and Hanseok Ko
https://arxiv.org/abs/1901.02050 (Jan 2019)
Graphical comparisons of time-domain waveforms and spectrograms and quantitative comparisons using the inception score clearly showed that the synthetic data
closely resembles the target signal. Overall, it was demonstrated that the proposed approach of data augmentation by direct generation of synthetic audio
streams improved the CNN-based classification rate and its training efficiency when both the real and the synthetic data were used to train the classifier. These
results demonstrate that the proposed network generates an arbitrary signal that is composed of sinusoidal waveforms and can be used for training a deep network.
Transfer Learning with Time Series #1
Data augmentation using synthetic data for time series
classification with deep residual networks
Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane
Idoumghar, Pierre-Alain Muller (Submitted on 7 Aug 2018)
https://arxiv.org/abs/1808.02455
https://github.com/hfawaz/aaltd18
Unlike in image recognition problems, data augmentation techniques
have not yet been investigated thoroughly for the TSC task. This
is surprising as the accuracy of deep learning models for TSC could
potentially be improved, especially for small datasets that exhibit
overfitting, when a data augmentation method is adopted. In this paper,
we fill this gap by investigating the application of a recently proposed
data augmentation technique based on the Dynamic Time Warping
distance, for a deep learning model for TSC.
The data augmentation method is mainly based on a weighted form of the Dynamic Time
Warping (DTW) Barycentric Averaging (DBA) technique [Petitjean et al. 2016]. The latter
algorithm averages a set of time series in a DTW-induced space and, by leveraging a weighted
version of DBA, the method can thus create an infinite number of new time series from
a given set of time series by simply varying these weights. Three techniques were proposed
to select these weights, from which we chose only one in our approach for the sake of
simplicity, although we consider evaluating other techniques in our future work. The weighting
method is called Average Selected, which consists of selecting a subset of close time
series and filling their bounding boxes.
We did not test the effect of imbalanced classes in the training set and how it could
affect the model's generalization capabilities. Note that imbalanced time series classification is
a recent active area of research that merits an empirical study of its own [Geng et al. 2018]. At
last, we should add that the number of generated time series in our framework was chosen to
be equal to double the amount of time series in the most represented class (which is a
hyper-parameter of our approach that we aim to further investigate in our future work).
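A hedged sketch of the weighted-DBA idea, here with random convex weights rather than the paper's Average Selected scheme, and assuming tslearn's dtw_barycenter_averaging accepts per-series weights as documented:

import numpy as np
from tslearn.barycenters import dtw_barycenter_averaging

def weighted_dba_augment(series_set, n_new=10):
    # series_set: array of same-class time series, shape (n_series, length, channels)
    synthetic = []
    for _ in range(n_new):
        w = np.random.dirichlet(np.ones(len(series_set)))              # random convex weights
        synthetic.append(dtw_barycenter_averaging(series_set, weights=w))
    return np.stack(synthetic)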
Transfer Learning with Time Series #2
Physiological-signal-based mental workload estimation via
transfer dynamical autoencoders in a deep learning framework
Neurocomputing, available online 11 April 2019
https://arxiv.org/abs/1808.02455
In this study, we propose a new transfer dynamical autoencoder (TDAE)
to capture the dynamical properties of electroencephalograph (EEG) features
and the individual differences. The TDAE consists of three consecutively-
connected modules, which are termed the feature filter, abstraction filter, and
transferred MW classifier. The feature and abstraction filters introduce a
dynamical deep network to abstract the EEG features across adjacent time
steps into salient MW indicators. The transferred MW classifier exploits large-volume
EEG data from a source-domain EEG database recorded under emotional
stimuli to improve the model training stability.
The main limitation of the proposed TDAE deep learning framework for MW recognition lies in two
aspects. The computational cost for training the entire network is significantly higher than for classical
shallow and deep classifiers. It leads to a high time cost in selecting optimal hyper-parameters
of the model. Therefore, we employed the same value of the feature filter order to reduce the
computational burden. However, there is no doubt that the filter order should be feature-specific. Moreover,
there exists a prerequisite for knowledge transferring across two mental-task domains. That is, we
need to select exactly the same EEG channels for data preprocessing, and this leads to a
possibility that useful MW indicators are excluded. In future work, we will further investigate the deep
learning methods for MW assessment on these two aspects.
Active Learning with Time Series
Robust Active Learning for Electrocardiographic Signal
Classification
Xu Chen, Saratendu Sethi (Submitted on 21 Nov 2018)
https://arxiv.org/abs/1811.08919
Motivated by the fact that ECG data are usually heavily unbalanced
among different classes and the class labels are noisy as they are
manually labeled, this paper proposes a novel solution based on robust
active learning for addressing these challenges. The key idea is to first
apply clustering of the data in a low-dimensional embedded
space and then select the most informative instances within local
clusters. By selecting the most informative instances relying on local
average minimal distances, the algorithm tends to select the data for
labeling in a more diversified way.
The first stage of the RALS algorithm relies on label spreading. The label spreading
algorithm is a well-known graph-based semi-supervised learning algorithm. It
calculates the similarity measure and propagates the labels by the measure for
prediction. It also generates the label distribution matrix which consists of the
predicted probability for every class for each sample. In order to select the data from
different classes, here t-Distributed Stochastic Neighbor Embedding (t-SNE) is
applied to the label distribution matrix due to its good performance for high
dimensional datasets.
A novel noisy label reduction relying on an effective confidence score measure is
proposed, based on the criterion of best vs second best (BSVB), to enhance the
active learning performance. Typically, for each selected data sample after ranking,
the ratio of the largest estimated class probability to the second largest
estimated class probability is calculated, where this information can be retrieved
from the label distribution matrix. Subsequently, the ratio is compared to the user-set
threshold. The selected data are added into the labeled set if the ratio is larger than
the threshold.
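The BSVB confidence filter itself is a one-liner over the label distribution matrix (illustrative numpy sketch; the threshold value is a user choice):

import numpy as np

def bsvb_filter(label_distribution, threshold=2.0):
    # label_distribution: (n_samples, n_classes) predicted class probabilities
    sorted_p = np.sort(label_distribution, axis=1)
    ratio = sorted_p[:, -1] / (sorted_p[:, -2] + 1e-12)   # best vs second-best probability
    return ratio > threshold                              # mask of samples trusted enough to add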
Therefore, by adding the estimated labels passed from the noise reduction step into
the labeled dataset, the noisy labels in the selection are significantly reduced. The
new augmented labeled dataset, after adding the selected data samples, is applied
to the label spreading algorithm again to learn the next enhanced model.
Interpreting Time Series
Visualizing Audio Processing
Interpretable Convolutional Filters with SincNet
https://arxiv.org/abs/1811.08633
https://github.com/mravanelli/SincNet/
https://github.com/mravanelli/pytorch-kaldi/
Mirco Ravanelli and Yoshua Bengio
(NIPS 2018)
This paper summarizes our recent efforts to develop a more interpretable
neural model for directly processing speech from the raw waveform.
In particular, we propose SincNet, a novel Convolutional Neural Network
(CNN) that encourages the first layer to discover more meaningful filters by
exploiting parametrized sinc functions.
In contrast to standard CNNs, which learn all the elements of each filter, only
the low and high cutoff frequencies of band-pass filters are directly
learned from data. This inductive bias offers a very compact way to derive
a customized filter-bank front-end, that only depends on some parameters
with a clear physical meaning. Our experiments, conducted on both
speaker and speech recognition, show that the proposed architecture
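The parametrized band-pass filters are straightforward to reproduce: the ideal band-pass impulse response is the difference of two sinc low-pass responses, windowed to reduce ripple, and in SincNet only the two cutoffs would be learned (a numpy sketch with fixed cutoffs for illustration):

import numpy as np

def sinc_bandpass(f1, f2, kernel_size=251, fs=16000):
    # f1, f2: low/high cutoff frequencies in Hz (the only parameters SincNet would learn)
    t = (np.arange(kernel_size) - (kernel_size - 1) / 2) / fs
    low = 2 * f1 * np.sinc(2 * f1 * t)                   # np.sinc(x) = sin(pi x) / (pi x)
    high = 2 * f2 * np.sinc(2 * f2 * t)
    band = (high - low) * np.hamming(kernel_size)        # window to reduce spectral ripple
    return band / np.max(np.abs(band))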
Spatiotemporal activations
Compensated Integrated Gradients to Reliably
Interpret EEG Classification
https://arxiv.org/abs/1811.08633
Kazuki Tachikawa, Yuji Kawai, Jihoon Park, Minoru Asada
Machine Learning for Health (ML4H) Workshop at NeurIPS 2018.
Integrated gradients are widely employed to evaluate the contribution
of input features in classification models because it satisfies the
axioms for attribution of prediction. This method, however, requires an
appropriate baseline for reliable determination of the contributions.
We propose a compensated integrated gradients method that
does not require a baseline. In fact, the method compensates the
attributions calculated by integrated gradients at an arbitrary baseline
using Shapley sampling.
The classifier constraints decrease the classification accuracy of the temporal CNN. In contrast,
spatiotemporal CNNs exhibit higher classification accuracy but lower interpretation reliability
than the temporal CNNs. Therefore, classifier selection should depend on whether reliability or
classification accuracy is emphasized.
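For reference, plain integrated gradients for a 1D/EEG classifier look as below; the paper's contribution is precisely to remove the dependence on the baseline input via Shapley sampling, which this sketch does not implement (TensorFlow sketch with assumed input shapes):

import tensorflow as tf

def integrated_gradients(model, x, baseline, target_class, steps=50):
    # x, baseline: (timesteps, channels) input segment and reference input
    x = tf.convert_to_tensor(x, tf.float32)
    baseline = tf.convert_to_tensor(baseline, tf.float32)
    alphas = tf.linspace(0.0, 1.0, steps + 1)[:, None, None]
    path = baseline[None] + alphas * (x - baseline)[None]       # straight path from baseline to x
    with tf.GradientTape() as tape:
        tape.watch(path)
        probs = model(path)[:, target_class]
    grads = tape.gradient(probs, path)
    avg_grads = tf.reduce_mean((grads[:-1] + grads[1:]) / 2.0, axis=0)   # trapezoidal rule
    return (x - baseline) * avg_grads                           # per-sample, per-channel attributions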
Visualization and interpretation: Sleep Staging
Albert Vilamala, Kristoffer H. Madsen, Lars K. Hansen (Submitted on 2 Oct 2017)
https://arxiv.org/abs/1710.00633

More Related Content

PDF
Optical Designs for Fundus Cameras
PDF
OCT Monte Carlo & Deep Learning
PDF
Shallow introduction for Deep Learning Retinal Image Analysis
PDF
Multimodal RGB-D+RF-based sensing for human movement analysis
PDF
Beyond Broken Stick Modeling: R Tutorial for interpretable multivariate analysis
PDF
Advanced Retinal Imaging
PDF
Multispectral Purkinje Imaging
PDF
Geometric Deep Learning
Optical Designs for Fundus Cameras
OCT Monte Carlo & Deep Learning
Shallow introduction for Deep Learning Retinal Image Analysis
Multimodal RGB-D+RF-based sensing for human movement analysis
Beyond Broken Stick Modeling: R Tutorial for interpretable multivariate analysis
Advanced Retinal Imaging
Multispectral Purkinje Imaging
Geometric Deep Learning

What's hot (15)

PDF
Purkinje imaging for crystalline lens density measurement
PDF
Image Restoration for 3D Computer Vision
PDF
Instrumentation for in vivo intravital microscopy
PDF
Practical Considerations in the design of Embedded Ophthalmic Devices
PDF
Portable Multispectral Fundus Camera
PDF
Time-resolved biomedical sensing through scattering medium
PDF
Design of lighting systems for animal experiments
PDF
Hyperspectral Retinal Imaging
PDF
Labeling fundus images for classification models
PDF
Pupillometry Through the Eyelids
PDF
Data-driven Ophthalmology
PDF
Lighting design for Startup Offices
PDF
Short intro for retinal biomarkers of Alzheimer’s Disease
PDF
Future of Retinal Diagnostics
PDF
Smartphone-powered Ophthalmic Diagnostics
Purkinje imaging for crystalline lens density measurement
Image Restoration for 3D Computer Vision
Instrumentation for in vivo intravital microscopy
Practical Considerations in the design of Embedded Ophthalmic Devices
Portable Multispectral Fundus Camera
Time-resolved biomedical sensing through scattering medium
Design of lighting systems for animal experiments
Hyperspectral Retinal Imaging
Labeling fundus images for classification models
Pupillometry Through the Eyelids
Data-driven Ophthalmology
Lighting design for Startup Offices
Short intro for retinal biomarkers of Alzheimer’s Disease
Future of Retinal Diagnostics
Smartphone-powered Ophthalmic Diagnostics
Ad

Similar to Deep Learning for Biomedical Unstructured Time Series (20)

PDF
A Survey on Deep Learning for time series Forecasting
PDF
Quantitative and Qualitative Analysis of Time-Series Classification using Dee...
PDF
Combination of Similarity Measures for Time Series Classification using Genet...
PDF
Accurate time series classification using shapelets
PDF
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
PDF
Time series analysis : Refresher and Innovations
PDF
Kz2418571860
PDF
Lecture9_Time_Series_2024_and_data_analysis (1).pdf
PDF
Forecasting time series powerful and simple
PDF
Una introducción a la minería de series temporales
PDF
RDataMining slides-time-series-analysis
PPTX
Presentation On Time Series Analysis in Mechine Learning
PPTX
Gaussian Processes and Time Series.pptx
PPTX
time_series and the forecastring age of RNNS.pptx
PDF
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
PPTX
wasim 1
PDF
2204.01637.pdf
PPTX
Gde time series_modeling
PDF
R data mining-Time Series Analysis with R
PDF
A Survey on Deep Learning for time series Forecasting
Quantitative and Qualitative Analysis of Time-Series Classification using Dee...
Combination of Similarity Measures for Time Series Classification using Genet...
Accurate time series classification using shapelets
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
Time series analysis : Refresher and Innovations
Kz2418571860
Lecture9_Time_Series_2024_and_data_analysis (1).pdf
Forecasting time series powerful and simple
Una introducción a la minería de series temporales
RDataMining slides-time-series-analysis
Presentation On Time Series Analysis in Mechine Learning
Gaussian Processes and Time Series.pptx
time_series and the forecastring age of RNNS.pptx
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
wasim 1
2204.01637.pdf
Gde time series_modeling
R data mining-Time Series Analysis with R
Ad

More from PetteriTeikariPhD (16)

PDF
ML and Signal Processing for Lung Sounds
PDF
Next Gen Ophthalmic Imaging for Neurodegenerative Diseases and Oculomics
PDF
Next Gen Computational Ophthalmic Imaging for Neurodegenerative Diseases and ...
PDF
Wearable Continuous Acoustic Lung Sensing
PDF
Precision Medicine for personalized treatment of asthma
PDF
Two-Photon Microscopy Vasculature Segmentation
PDF
Skin temperature as a proxy for core body temperature (CBT) and circadian phase
PDF
Summary of "Precision strength training: The future of strength training with...
PDF
Precision strength training: The future of strength training with data-driven...
PDF
Intracerebral Hemorrhage (ICH): Understanding the CT imaging features
PDF
Hand Pose Tracking for Clinical Applications
PDF
Precision Physiotherapy & Sports Training: Part 1
PDF
Creativity as Science: What designers can learn from science and technology
PDF
Light Treatment Glasses
PDF
Efficient Data Labelling for Ocular Imaging
PDF
Dashboards for Business Intelligence
ML and Signal Processing for Lung Sounds
Next Gen Ophthalmic Imaging for Neurodegenerative Diseases and Oculomics
Next Gen Computational Ophthalmic Imaging for Neurodegenerative Diseases and ...
Wearable Continuous Acoustic Lung Sensing
Precision Medicine for personalized treatment of asthma
Two-Photon Microscopy Vasculature Segmentation
Skin temperature as a proxy for core body temperature (CBT) and circadian phase
Summary of "Precision strength training: The future of strength training with...
Precision strength training: The future of strength training with data-driven...
Intracerebral Hemorrhage (ICH): Understanding the CT imaging features
Hand Pose Tracking for Clinical Applications
Precision Physiotherapy & Sports Training: Part 1
Creativity as Science: What designers can learn from science and technology
Light Treatment Glasses
Efficient Data Labelling for Ocular Imaging
Dashboards for Business Intelligence

Recently uploaded (20)

PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
KodekX | Application Modernization Development
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPT
Teaching material agriculture food technology
PDF
Electronic commerce courselecture one. Pdf
PPTX
Big Data Technologies - Introduction.pptx
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
MYSQL Presentation for SQL database connectivity
PDF
cuic standard and advanced reporting.pdf
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
KodekX | Application Modernization Development
The AUB Centre for AI in Media Proposal.docx
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
20250228 LYD VKU AI Blended-Learning.pptx
Unlocking AI with Model Context Protocol (MCP)
Review of recent advances in non-invasive hemoglobin estimation
Reach Out and Touch Someone: Haptics and Empathic Computing
Teaching material agriculture food technology
Electronic commerce courselecture one. Pdf
Big Data Technologies - Introduction.pptx
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
MYSQL Presentation for SQL database connectivity
cuic standard and advanced reporting.pdf
Understanding_Digital_Forensics_Presentation.pptx
Digital-Transformation-Roadmap-for-Companies.pptx
Dropbox Q2 2025 Financial Results & Investor Presentation
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Diabetes mellitus diagnosis method based random forest with bat algorithm
Building Integrated photovoltaic BIPV_UPV.pdf

Deep Learning for Biomedical Unstructured Time Series

  • 1. Deep Learning for Biomedical Unstructured Time-series 1D Convolutional neural networks (CNNs) for time series analysis, and inspiration from beyond biomedical field Petteri Teikari, PhD Singapore Eye Research Institute (SERI) Visual Neurosciences group http://petteri-teikari.com/ Version “Wed 17 April 2019“
  • 3. TimeSeries Basics Regular time seriesvs. irregular timeseries https://mediatum.ub.tum.de/doc/1444158/78684.pdf UnstructuredBiomedical1DTimeSeries Time-Frequencyvisualization https://doi.org/10.3389/fnhum.2016.00605 Timeserieswithdiscrete“states” Sleepstagesinferredfromunivariateormultivariate(multipleEEGelectrodelocations,), multimodal(EEGwithECG/EMG,etc.)dense1Dtimeseries Manytypesof groundtruths possiblealsofor1Dtime series Segmentation,classification,regression https://arxiv.org/abs/1801.05394
  • 4. TimeSeries Stationarity Non-stationaritiessignificantly distort short-term spectral, symbolicand entropyheartrate variabilityindicesNovember 2011PhysiologicalMeasurement 32(11):1775-86 DOI: 10.1088/0967-3334/32/11/S05 Testsof Stationarity https://stats.stackexchange.com/questions/182764/stationarity-test s-in-r-checking-mean-variance-and-covariance Stationarity of order 2 For everyday use we often consider time series that have (instead of strictstationarity):https://people.maths.bris.ac.uk/~magpn/Research/LSTS/TOS.html ● aconstantmean ● aconstantvariance ● anautocovariancethatdoesnotdependontime. Suchtimeseriesareknownas second-orderstationary or stationaryoforder2. Examples of non-stationary processes are random walk with or without a drift (a slow steady change) and deterministic trends (trends that are constant, positive or negative, independent of time for the whole life of the series).https://www.investopedia.com/articles/trading/07/stationary.asp
  • 6. Representation vsSimilarity https://arxiv.org/abs/1704.00794: “Time series analysis approaches can be broadly categorized into two families: (i) representation methods, which provide high-level features for representing properties of the time series at hand, and (ii) similarity measures, which yield a meaningful similarity between different time series for further analysis.“ Classic representation methods are for instance Fourier transforms, wavelets, singular value decomposition, symbolic aggregate approximation, andpiecewiseaggregateapproximation. Time series may also be represented through the parameters of model-based methods such as Gaussian mixture models (GMM), Markov models and hidden Markov models (HMMs), time series bitmaps andvariantsofARIMA. An advantage with parametric models is that they can be naturally extended to the multivariate case. For detailed overviews on representation methods, we refer the interested reader to e.g. Wangetal.(2013). https://arxiv.org/abs/1704.00794: “Similarity-based approaches, once defined, such similarities between pairs of time series may be utilized in a wide range of applications, such as classification, clustering, and anomaly detection. Time series similarity measures include for example dynamic time warping (DTW, the longest common subsequence (LCSS), the extended Frobenius norm (Eros), and the Edit Distance with Real sequences (EDR), and representstate-of-the-artperformanceinunivariatetimeseries(UTS)prediction. Attempts have been made to design kernels from non-metric distances such as DTW, of which the global alignment kernel (GAK) is an example. There are also promising works on deriving kernels from parametric models, such as the probability product kernel, Fisher kernel, andreservoir basedkernels.Commontoallthese methodsishowever a strongdependence onacorrecthyperparametertuning,whichisdifficulttoobtaininanunsupervisedsetting. Moreover, many of these methods cannot naturally be extended to deal with multivariate time series (MTS), as they only capture the similarities between individual attributes and do not modelthe dependenciesbetweenmultiple attributes.Equallyimportant,thesemethodsare not designed to handle missing data, an important limitation in many existing scenarios, such as clinical data where MTS originating from Electronic Health Records (EHRs) often contain missingdata In this work, we propose a surgical site infection detection framework for patients undergoing colorectal cancer surgery that is completely unsupervised, hence alleviating the problem of getting access to labelled training data. The framework is based on powerful kernels for multivariate time series that account for missing data when computing similarities. https://arxiv.org/abs/1803.07879
  • 7. Analysis withSimilarityMeasures TimeSeriesClusterKernelforLearningSimilaritiesbetweenMultivariateTimeSerieswithMissingData KarlØyvindMikalsen,FilippoMariaBianchi,CristinaSoguero-Ruiz,RobertJenssen(lastrevised29Jun2017) https://arxiv.org/abs/1704.00794|https://github.com/kmi010/Time-series-cluster-kernel-TCK-(TheTCKwasimplementedinRandMatlab) Similarity-based approaches represent a promising direction for time series analysis. However, many such methods rely on parameter tuning, and some have shortcomings if the time series are multivariate (MTS), due to dependencies between attributes, or the time series containmissingdata. In this paper, we address these challenges within the powerful context of kernel methods by proposing the robust time series cluster kernel (TCK). The approach taken leverages the missing data handling properties of Gaussian mixture models (GMM) augmented with informative prior distributions. An ensemble learning approach is exploited to ensure robustness to parameters by combining the clustering results of many GMM to formthefinalkernel. The experimental results demonstrated that the TCK (1) is robust to hyperparameter settings, (2) is competitive to established methods on prediction tasks without missing data and (3) is better than established methods on prediction tasks with missing data. In future works we plan to investigate whether the use of more general covariance structures in the GMM, or the use of HMMs as base probabilistic models, could improve TCK.
  • 8. Wavelets Shapelets→ Shapelets ”1DGabors”#1 Fast classification of univariate and multivariate time seriesthrough shapelet discovery https://doi.org/10.1007/s10115-015-0905-9 Josif Grabocka, MartinWistuba, Lars Schmidt-Thieme A Shapelet Selection Algorithm forTime Series Classification: New Directions https://doi.org/10.1016/j.procs.2018.03.025 The high timecomplexityof shapelet selection processhindersitsapplication in real timedataprocession. Toovercome this, inthispaper we proposeafast shapelet selection algorithm (FSS), which sharply reducesthe time consumption ofshapeletselection. https://slideplayer.com/slide/8370683/ Forexample,aclassof abnormalECG measurementmaybe characterised by an unusualpatternthat onlyoccurs occasionallyatany point during the measurement.Shapelets aresubseriesthatcapture thistypeofcharacteristic. Theyallowforthe detection ofphase- independentlocalised similaritybetween series within thesameclass. Thegreattimeseriesclassificationbakeoff:areviewandexperimental evaluationof recentalgorithmicadvances Anthony Bagnall, Jason Lines, Aaron Bostrom,James Large, Eamonn Keoghs (May2017) https://doi.org/10.1007/s10618-016-0483-9 | https://bitbucket.org/TonyBagnall/time-series-classification
  • 9. Wavelets Shapelets→ Shapelets ”1DGabors”#2 Afastshapelet selectionalgorithmfortime series classification https://doi.org/10.1016/j.comnet.2018.11.031 Thetrainingtime ofshapelet based algorithmsishigh, eventhough itis computed off-line, and the authorsaim tomake it moreefficient Shapelet transformation algorithms have attracted a great deal of attention in the last decade. However, the timecomplexity of the shapelet selectionprocess in shapelet transformation algorithms is too high. To accelerate the shapelet selection process with noreductioninaccuracy,wepresentedFSSforST. The experimental results demonstrate that our proposed FSS was thousands of timesfasterthantheoriginalshapelettransformation methodwithnoreduction in accuracy. Our results also demonstrate that our method was the fastest method among shapeletmethodsthathavetheleadinglevelofaccuracy.
  • 10. RepresentationLearning with deeplearning #1 TowardsaUniversalNeuralNetworkEncoderforTime Series Joan Serrà,SantiagoPascual,AlexandrosKaratzoglou(Submitted on 10May 2018)https://arxiv.org/abs/1805.03908 We have studied the use of a universal encoder for time series in the specific case of classifying an out-of-sample data set of an unseen data type. We have considered the cases of no-adaptation,mappingadaptation,andfulladaptation. In all cases we achieve performances that are competitive with the state-of-the-art that, in addition, involve a compact reusable representation and few training iterations. We have also studied the effect of the representation dimensionality, showing that small representations have an impact to no-adaptation and mapping adaptation approaches,butnotmuch tofulladaptation ones. In the future, we plan to refine the encoder architecture, as well as optimizing some of the parameters we empirically use in our experiments. A very interesting direction for future research is the adoption of one-shot learning schemas (Snelletal.2017; Sutskeveretal.2014), which we find very suitable for the current setting in time series classification problems. A further option to enhance the performance of a universal encoder is data augmentation, specially considering recent linear instance/class interpolation approaches ( Zhangetal.2018). In order to have sufficient knowledge to accomplish any task, and in order to be applicable in the absence of labeled data or even without adaptation/re-training, researchers have been increasingly adopting the generic concept of universal encoders, specially within the text processing domain (note that related concepts also existinother domains). The basic idea is to train a model (the encoder) that learns a common representation which is useful for a variety of tasks and that, at the same time, can be reused for novel tasks with minimal or no adaptation. While it would seem that classical autoencoders and other unsupervised models should perfectly fit this purpose, recent research in sentence encoding shows that, with current means, encoders learnt with a sufficiently large set of supervised tasks, or mixing supervised and unsupervised data, consistentlyoutperformtheirpurelyunsupervisedcounterparts.
  • 11. RepresentationLearning with deeplearning #2 OneDeepMusicRepresentationtoRuleThem All? Acomparativeanalysisofdifferentrepresentationlearning strategies JaehunKim,JulianUrbano,CynthiaC. S.Liem,AlanHanjalic (Submittedon13Feb2018) https://arxiv.org/abs/1802.04051 Ourworkwilladdressthefollowing researchquestions: –RQ1:Givenasetofcommonlearningtasksthatcanbeusedtotrain anetwork,whatistheinfluenceofthenumberandtypeofthetaskson theeffectivenessofthelearneddeeprepresentation? –RQ2:Howdovariousdegreesofinformationsharinginthedeep architectureaffecttheultimatesuccessofalearneddeep representation? –RQ3:Whatisthebestwaytoassesstheeffectivenessofadeep representation? Simplified illustration of the conceptual difference between traditional deep transfer learning (DTL) based on a single learning task (above) and multi-task based deep transfer learning (MTDTL) (below). The same color used for a learning and an unseen task indicates that the tasks have commonalities, which implies that the learned representation is likely to be informative for the unseen task. At the same time, this representation may not be that informative to another unseen task, leading to a low transfer learning performance. The hypothesis behind MTDTL is that relying on more learning tasksincreasesrobustness of thelearned representationand itsusabilityfor abroadersetof unseen tasks.
  • 12. RepresentationLearning with deeplearning #3 LearningFiner-classNetworksforUniversal Representations https://arxiv.org/abs/1810.02126 https://arxiv.org/abs/1712.09708 JulienGirard,YoussefTamaazousti,HervéLeBorgne,Céline Hudelot(Submittedon4 Oct2018) Many real-world visual recognition use-cases can not directly benefit from state-of-the-art CNN-based approaches because of the lack of many annotated data. The usual approach to deal with this is to transfer a representation pre-learned on a large annotated source-task onto a target- task of interest. This raises the question of how well the original representation is "universal", that is to say directly adapted to many different target-tasks. To improve such universality, the state-of-the-art consists in training networks on a diversified source problem, that is modified either by adding generic or specific categories to the initial set of categories. We propose two methods to improve universality, but pay special attention to limit the need of annotated data. We also propose a unified framework of the methods based on the diversifying of the training problem. Finally, to better match Atkinson's cognitive study about universal human representations, we proposed to rely on the transfer-learningschemeas wellasa new metric toevaluateuniversality. We show thatourmethod learnsmore universal representationsthan state- of-the-art, leading to significantly better results on 10 target-tasks from multiple domains, using several network architectures, either alone or combinedwithnetworkslearnedat acoarsersemantic level.
  • 13. RepresentationLearning with deeplearning #4 ImprovingClinicalPredictionsthroughUnsupervised TimeSeriesRepresentationLearning https://arxiv.org/abs/1812.00490 XinruiLyu,MatthiasHüser,StephanieL.Hyland,GeorgeZerveas, Gunnar Rätsch(Submittedon2Dec2018) MachineLearningforHealth(ML4H)Workshop atNeurIPS2018. We empirically showed that in scenarios where labeled medical time series data is scarce, training classifiers on unsupervised representations provides performance gains over end-to-end supervised learning using raw input signals, thus making effective use of information available in a separate, unlabeled training set. The proposed model, explored for the first time in the context of unsupervised patient representation learning, produces representations with the highest performance in future signal prediction and clinical outcome prediction, exceeding several baselines. The idea behind applying attention mechanisms to time series forecasting is to enable the decoder to preferentially “attend” to specific parts of the input sequence during decoding. This allows for particularly relevant events (e.g. drastic changes in heart rate),tocontributemoretothegenerationofdifferentpointsintheoutputsequence.
  • 14. RepresentationLearning with deeplearning #5 UnsupervisedScalableRepresentationLearningforMultivariate TimeSeries https://arxiv.org/abs/1901.10738 https://github.com/White-Link/UnsupervisedScalableRepresentationLearni ngTimeSeries (PyTorch) Jean-YvesFranceschi,AymericDieuleveut,MartinJaggi (Submittedon30Jan2019) Hence, we propose in the following an unsupervised method to learn general-purpose representations for multivariate time series that comply with the issues of varying and potentially high lengths of the studied time series. To this end, we adaptrecognized deep learningtools and introduce a novel unsupervised loss. Our representations are computed by a deep convolutional neuralnetworkwithdilatedconvolutions(i.e.TCNs). This network is then trained unsupervised, using the first specifically designed triplet loss in the literature of time series, taking advantage of the encoder resilience to time seriesofunequallengths. We leave as future work the applicability of our method to other tasks like forecasting, and the study of its impact if it weretobeaddedinpowerful ensemblemethods.
  • 15. RepresentationLearning with deeplearning #6 Unsupervised speech representation learning using WaveNet autoencoder https://arxiv.org/abs/1812.00490 Jan Chorowski, Ron J. Weiss,Samy Bengio, Aaron van den Oord(Submitted on 25 Jan 2019) We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. The behavior of autoencoder models depends on the kind of constraintthatis applied tothelatentrepresentation. Our best models used MFCCs (mel-frequency cepstral coefficient) as the encoder input, but reconstructed raw waveforms at the decoder output. We used standard 13 MFCC features extracted every 10ms (i.e., at a rate of 100 Hz) and augmented with their temporal first and second derivatives. Such features were originally designed for speech recognition and are mostly invariant to pitch and similarconfoundingdetail in theaudiosignal. T
  • 16. RepresentationLearning with deeplearning #7 ATaleof Two Time Series Methods:Representation Learningfor Improved Distance and RiskMetrics https://dspace.mit.edu/bitstream/handle/1721.1/119575/1076 345253-MIT.pdf DivyaShanmugam (June2018) Architecture of the proposed model. A single convolutional layer extracts local features from the input, which a strided maxpool layer reduces to a fixed-size vector. A fully connected layer with ReLU activation carries out further, nonlinear dimensionality reduction to yield the embedding. A softmax layer is added at training time. We introduce the multiple instance learning paradigm to risk stratification. Risk stratification models aim to identify patients at high risk for a given outcome so that doctors may intervene, with the attempt of avoiding that outcome. Machine learning has led to improved risk stratification models for a number of outcomes, including stroke, cancer and treatment resistance [55]. To the best of our knowledge, this is the first application of multiple instance learning to risk stratification. The extension of Jiffy to multi-label classification and unsupervised learning poses a challenging but necessary task. The availability of unlabeled time series data eclipses the availability of its annotated counterpart. Thus, a simple network-based method for representation learning on multivariate timeseries inthe absence oflabels isan important line of work. There is also potential to further increase Jiffy’s speed by replacing the fully connected layer with a structured [Bojarskietal.2016] or binarized[Rastegariet al.2016] matrix. The proposed risk stratification model extends naturally to a range of adverse outcomes. The model is not limited to operating on ECG signals - it is worth exploring whether the multiple instance learning approach may be successful in other modalities of medical data, including voice. On a theoretical level, strong generalization guarantees for distinguishing bags with relative witnessratesdonotexistand are worth exploring asthese modelsare appliedintherealworld.
  • 17. Intro tomethods#1a Highlycomparative time-series analysis: theempirical structure of time series and their methods http://doi.org/10.1098/rsif.2013.0048 Ben D. Fulcher, Max A. Little, Nick S. Jones
  • 18. Intro tomethods#1b Highlycomparative time-series analysis: theempirical structure of time series and their methods http://doi.org/10.1098/rsif.2013.0048 Ben D. Fulcher, Max A. Little, Nick S. Jones Structure inalibrary of8651time-seriesanalysisoperations. (a) A summaryof thefourmainclassesof operationsin ourlibrary,asdetermined by a k-medoidsclustering,reflectsacrudebutintuitiveoverviewof thetime-series analysisliterature.(b)A network representation of theoperationsinour library thataremostsimilarto theapproximateentropy algorithm, ApEn(2,0.2)[7], which wereretrieved fromourlibraryautomatically.Each nodein thenetwork representsanoperationand linksencodedistancesbetweenthem(computed using a normalized mutual information-based distancemetric, cf.electronic supplementary material,§S1.3.1).Annotated scatterplotsshowtheoutputsof ApEn(2,0.2)(horizontal axis)againsta representativememberof each shaded community (indicated bya heavily outlined node, vertical axis). Similar pictures can beproduced by targeting anygivenoperationin our library, thereby connecting differenttime-seriesanalysismethodsthatneverthelessdisplay similar behaviour acrossempiricaltimeseries. Key scientific questions that can be addressed by representing time series by their properties (measured by many types of analysis methods) and operations by their behaviour (across many types of time-series data). We show that this representation facilitates a range of versatile techniquesfor addressingscientific time-seriesanalysisproblems, which are illustrated schematicallyin thisfigure. The representations of time series (rows of the data matrix, figure 1a) and operations (columns of the data matrix, figure 1b) serve as empirical fingerprints, and are shown in the top panel. Coloured borders are used to label different classes of time series and operations, and other figures in this paper that explicitly demonstrate each technique are given in the bottom right-hand corner of each panel. (a) Time-seriesdatasetscan be organized automatically, revealingthe structure in agiven dataset (cf. figures4a,b and 5a). (b)Collectionsof scientific methods can be organized automatically, highlighting relationships between methods developed in different fields (cf. figures 3a and 5b). (c) Real-world and model-generated datawith similar propertiesto aspecific time-seriestarget can be identified (cf. figure 4c,d). (d)Given aspecific operation, alternativesfrom acrossscience can be retrieved (cf. figure 3b). (e)Regression:the behaviour of operations in our library can be compared to find operations that vary with a target characteristic assigned to time series in a dataset (cf. figure 5d). (f) Classification: operations can be selected based on their classification performance to build useful classifiers and gain insights into the differencesbetween classesof labelled time-series datasets(cf. figure 5e).
  • 19. Intro tomethods#1c Highlycomparative time-series analysis: theempirical structure of time series and their methods http://doi.org/10.1098/rsif.2013.0048 Ben D. Fulcher, Max A. Little, Nick S. Jones Highlycomparativetechniquesfortime- seriesanalysistasks.Wedrawonourfull library oftime-seriesanalysismethodsto: (a) structure datasetsinmeaningfulways, andretrieveandorganizeusefuloperations for (b,e) classificationand(c,d) regression tasks.(a)Fiveclassesof EEG signalsare structuredmeaningfullyinatwo- dimensional principalcomponentsspaceof our libraryof operations.(b)Pairwise linear correlationcoefficientsmeasuredbetween the60mostsuccessful operationsfor classifyingcongestiveheartfailureand normalsinusrhythmRR intervalseries. Clusteringrevealsthatmostoperationsare organizedintooneof threegroups (indicatedbydashedboxes). 
  • 20. Most of the time when people talk about time series and deep learning, most likely they talking of Sequences (e.g. language) instead of unstructuredtime series (e.g. voice waveform)
  • 21. “Sequences” vs“TimeSeries” “DenseTimeSeries”at videoframerate Icehockeyas gamecan be simplifiedto discreteevents (sequences) https://arxiv.org/abs/1808.04063 Notalwayssoblack-white,butinourcasetime-seriesaremainlydense1DBiosignalswithambiguousormissingdiscretestates
  • 22. Time Series RNNsforsequences The Unreasonable Effectivenessof RecurrentNeuralNetworks May21,2015|AndrejKarpathy http://karpathy.github.io/2015/05/21/rnn-effectiveness/ DanQ:ahybridconvolutionaland recurrentdeepneuralnetworkfor quantifyingthefunctionofDNA sequences  Daniel Quang XiaohuiXieNucleic AcidsResearch,Volume44, Issue11,20June2016,Pagese107,  https://doi.org/10.1093/nar/gkw226 DeepLearningforUnderstandingConsumerHistories byTobiasLang- 25Oct2016 https://jobs.zalando.com/tech/blog/deep-learning-for-understanding-consumer-histories/?gh_src=4n3gxh1 Sequences. Depending on your background you mightbewondering:  WhatmakesRecurrentNetworkssospecial?
  • 24. TimeSeries LSTMsApplied DeepAir|UCBerkeleySchoolofInformation https://www.ischool.berkeley.edu/projects/2017/deep-air This project investigates the use of the LSTM recurrent neural network (RNN) as a framework for forecasting in the future, based on time series data of pollution and meteorological information in Beijing. Our results show that the LSTM framework produces equivalent accuracy when predicting future time stamps compared to the baseline support vector regression for a single time stamp. Using our LSTM framework, we can now extend the prediction from a single time stamp out to 5 to 10 hours in the future. Overview of our self-supervised approach for posture and sequence representation learning using CNNLSTM. After the initial training with motion-based detections we retrain our model for enhancingthe learningof therepresentations. https://doi.org/10.1109/CVPR.2017.399 PianoGenie:An IntelligentMusicalInterface Oct15,2018 |https://magenta.tensorflow.org/pianogenie Chris Donahue (  chrisdonahue ,  chrisdonahuey ) ;Ian Simon (  iansimon ,  iansimon ) ;Sander Dieleman (  benanne ,  sedielem ) A bidirectional LSTM encoder maps asequence of piano notestoasequence of controller buttons (shown as 4 in the above figure, 8 in the actual system). A unidirectional LSTM decoder then decodes these controller sequences back into piano performances. After training, the encoder isdiscarded and controller sequencesareprovided byuser input.
  • 25. Time Series RNN/LSTMsareoutdated?#1 ThefallofRNN/ LSTM EugenioCulurciello https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0 Combining multiple neural attention modules, comes the “hierarchical neural attention encoder”… Notice there is a hierarchy of attention modules here, very similar to the hierarchy of neural networks. This is also similar toTemporalconvolutionalnetwork(TCN) → Shapelets AttentionModels,e.g. Pervasive Attention: 2D Convolutional NeuralNetworksforSequence-to- SequencePrediction MahaElbayad,LaurentBesacier,JakobVerbeek (Submittedon11Aug 2018) https://arxiv.org/abs/1808.03867| https://github.com/elbayadm/attn2d
  • 26. Time Series RNN/LSTMsareoutdated?#2 AnEmpiricalEvaluationof GenericConvolutional and RecurrentNetworksforSequence Modeling ShaojieBai,J.ZicoKolter,VladlenKoltun (Revised19Apr2018) https://arxiv.org/abs/1803.01271 |http://github.com/locuslab/TCN For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modelingtasks. The preeminence enjoyed by recurrent networks in sequence modeling may be largely a vestige of history. Until recently, before the introduction of architectural elements such as dilated convolutions and residual connections, convolutional architectures were indeed weaker. Our results indicate that with these elements, a simple convolutional architecture is more effective across diverse sequence modeling tasks than recurrent architectures such as LSTMs. Due to the comparable clarity and simplicity of TCNs, we conclude that convolutional networks should be regarded as a natural starting point and a powerfultoolkit for sequence modeling
  • 27. Time Series RNN/LSTMsareoutdated?#3 Dilated Temporal Fully-Convolutional Networkfor Semantic Segmentation ofMotion CaptureData NoshabaCheema,Somayeh Hosseini, Janis Sprenger, Erik Herrmann,Han Du, Klaus Fischer, PhilippSlusallek (Submittedon 24Jun 2018) https://arxiv.org/abs/1806.09174 Semantic segmentation of motion capture sequences plays a key part in many data-driven motion synthesis frameworks. It is a preprocessing step in which long recordings of motion capture sequences are partitioned into smaller segments. Afterwards, additional methods like statistical modeling can be applied to each group of structurally-similar segments to learn an abstract motion manifold. The segmentation task however often remains a manual task, which increases the effort and costofgeneratinglarge-scalemotiondatabases. We therefore propose an automatic framework for semantic segmentation of motion capture data using a dilated temporal fully-convolutional network. Our model outperforms a state-of-the-art model in action segmentation, as well as three networks for sequence modeling.
• 28. Time Series: Are RNNs/LSTMs outdated? #4
Temporal Convolutional Networks and Dynamic Time Warping can Drastically Improve the Early Prediction of Sepsis. Michael Moor, Max Horn, Bastian Rieck, Damian Roqueiro and Karsten Borgwardt (submitted on 7 Feb 2019) https://arxiv.org/abs/1902.01659 https://osf.io/av5yx/?view_only=a6e3442634b34d53ba6e59c4a956b318
For future work, we aim to extend our analysis to more types of data sources arising from the ICU. Futoma et al. (2017b) already employed a subset of baseline covariates, medication effects, and missingness indicator variables. However, a multitude of feature classes still remain to be explored and properly integrated. For instance, the combination of sequential and non-sequential features has previously been handled by feeding non-sequential data into the sequential model (Futoma et al., 2017a). We hypothesize that this could be handled more efficiently by using a more modular architecture that incorporates both sequential and non-sequential parts. Furthermore, we aim to obtain a better understanding of the time series features utilized by the model. Specifically, we are interested in assessing the interpretability of the learned filters of the MGP-TCN framework and evaluating how much the activity of an individual filter contributes to a prediction. This endeavor is somewhat facilitated by our use of a convolutional architecture. The extraction of short per-channel signals could prove very relevant for supporting diagnoses made by clinical practitioners.
Overview of the model: the raw, irregularly spaced time series are provided to the Multi-task Gaussian Process (MGP) patient by patient. The MGP then draws from a posterior distribution (given the observed data) at evenly spaced grid times (each hour). This grid is then fed into a temporal convolutional network (TCN) which, after a forward pass, returns a loss. Its gradient is then computed by backpropagating through the computational graph including both the TCN and the MGP, and both sets of parameters are learned end-to-end during training. All methods are evaluated using the Area under the Precision–Recall Curve (AUPRC), additionally displaying the (less informative) Area under the Receiver Operating Characteristic (AUROC). The current state-of-the-art method, MGP-RNN, is shown in blue; the two approaches for early detection of sepsis introduced in this paper, MGP-TCN and the DTW-KNN ensemble, are shown in pink and red, respectively. Using three random splits for all measures and methods, the mean (line) and standard deviation (shaded area) are depicted.
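As a rough illustration of the "GP adapter + TCN" pipeline described above, the hedged sketch below regrids one irregularly sampled laboratory channel onto an hourly grid with a Gaussian process before handing it to a 1D CNN/TCN. For brevity it uses an independent single-task GP per channel via scikit-learn rather than the authors' end-to-end multi-task GP; kernel, horizon and the toy values are illustrative assumptions.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def to_hourly_grid(t_obs, y_obs, horizon_h=48):
    # Fit a GP to the sparse observations and read off its posterior on an hourly grid
    gp = GaussianProcessRegressor(kernel=1.0 * RBF(length_scale=4.0) + WhiteKernel(0.1),
                                  normalize_y=True)
    gp.fit(np.asarray(t_obs, dtype=float).reshape(-1, 1), np.asarray(y_obs, dtype=float))
    grid = np.arange(horizon_h, dtype=float).reshape(-1, 1)
    mean, std = gp.predict(grid, return_std=True)  # posterior mean and uncertainty per hour
    return mean, std

# Example: four sparse lactate measurements regridded to 48 hourly samples, ready for a TCN.
hourly_mean, hourly_std = to_hourly_grid([1.5, 7.0, 20.2, 33.8], [1.1, 1.4, 2.3, 3.0])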
• 30. Structuring Clinical Text
Comparative effectiveness of convolutional neural network (CNN) and recurrent neural network (RNN) architectures for radiology text report classification (2018) https://doi.org/10.1016/j.artmed.2018.11.004 Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
This paper explores cutting-edge deep learning methods for information extraction from medical imaging free-text reports at a multi-institutional scale and compares them to the state-of-the-art domain-specific rule-based system, PEFinder, and traditional machine learning methods, SVM and Adaboost. Visualization methods have been developed to identify the impact of input words on the output decision for both deep learning models. Domain Phrase Attention-based Hierarchical Neural Network (DPA-HNN) architecture.
• 31. Clinical Text + Images
Unsupervised Multimodal Representation Learning across Medical Images and Reports (Machine Learning for Health (ML4H) Workshop at NeurIPS 2018) https://arxiv.org/abs/1811.08615 MIT CSAIL
Joint embeddings between medical imaging modalities and associated radiology reports have the potential to offer significant benefits to the clinical community, ranging from cross-domain retrieval to conditional generation of reports to the broader goals of multimodal representation learning. In this work, we establish baseline joint embedding results measured via both local and global retrieval methods on the soon to be released MIMIC-CXR dataset, consisting of both chest X-ray images and the associated radiology reports. We establish baseline results using supervised and unsupervised joint embedding methods along with local (direct pairs) and global (ICD-9 code groupings) retrieval evaluation metrics. Results show a possibility of incorporating more unsupervised data into training for a minimal-effort performance increase. A further study of joint embeddings between these modalities may enable significant applications, such as text/image generation or the incorporation of other EMR modalities.
• 33. EHR Mining: Risk Prediction Model
Risk Prediction on Electronic Health Records with Prior Medical Knowledge (2018) https://doi.org/10.1145/3219819.3220020
We propose a novel and general framework called PRIME for the risk prediction task, which can successfully incorporate discrete prior medical knowledge into all of the state-of-the-art predictive models using a posterior regularization technique. Different from traditional posterior regularization, we do not need to manually set a bound for each piece of prior medical knowledge when modeling the desired distribution of the target disease on patients. Moreover, the proposed PRIME can automatically learn the importance of different prior knowledge with a log-linear model.
The limitation of this work is that the proposed PRIME is only effective for common diseases. For rare and emerging diseases, since there is little medical knowledge about them, it is hard to incorporate any prior knowledge into deep learning predictive models; thus, the proposed PRIME may achieve similar performance to the state-of-the-art baselines. In future work, we will focus on how to improve the predictive performance of risk prediction for rare diseases.
• 35. Intro to cleaning
In the preprocessing component, the main purpose is to clean the data, filter the unusual points and make it suitable as input to the CNN. Besides the normal steps, including timestamp alignment, normalization and missing-data imputation for time series with trend, the most important operations to improve data quality are outlier detection, interpolation and filtering, in particular for clinical data. In clinical glucose time series there are many missing or outlier data points due to errors in calibration and measurement, and/or mistakes in the process of data collection and transmission. Several methods are introduced to handle these scenarios [36]:
● Dimension Reduction Model: the time series can be projected into lower dimensions using linear correlations such as principal component analysis (PCA), and data with large residual errors can be considered outliers.
● Proximity-based Model: the data are assessed by nearest-neighbour analysis, clustering or density, so data instances that are isolated from the majority are considered outliers.
● Probabilistic Stochastic Filters: different filters for the signals, such as Gaussian mixture models optimized using expectation-maximization. In our case the filter can be implemented before the CNN, due to the continuous characteristic of the input glycaemic time series data.
A convolutional neural network for ECG annotation as the basis for classification of cardiac rhythms. Philipp Sodmann et al. 2018 Physiol. Meas., in press https://doi.org/10.1088/1361-6579/aae304
Signal cleaning: in the data preprocessing, we performed resampling and signal denoising. We resampled all ECGs to 300 Hz using the fast Fourier transform in order to pass ECG segments of equal length onto the CNN. To filter noisy components in the signal, such as baseline wandering, respiration effects, or powerline interference, we applied a discrete wavelet transform (DWT), which works as a band-pass filter. For this, we used the Daubechies wavelet (Db4). Before re-composition, each coefficient of the transform was multiplied by a factor according to tabulated values. Afterwards, a 15%-trimmed mean with a window size of 33 samples was applied to remove the persistent baseline.
MEG and EEG data analysis with MNE-Python https://doi.org/10.3389/fnins.2013.00267
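A hedged sketch of those ECG cleaning steps (FFT resampling to 300 Hz, a Daubechies-4 wavelet band-pass, and a sliding 15%-trimmed-mean baseline removal) is given below; the decomposition level, window handling and toy input are illustrative assumptions rather than the exact recipe of Sodmann et al.

import numpy as np
import pywt
from scipy.signal import resample
from scipy.stats import trim_mean

def clean_ecg(sig, fs_in, fs_out=300):
    # 1) FFT-based resampling so all segments share the same sampling rate
    sig = resample(sig, int(len(sig) * fs_out / fs_in))
    # 2) Wavelet band-pass with db4: damp the coarsest (baseline wander) and finest (noise) scales
    coeffs = pywt.wavedec(sig, "db4", level=6)
    coeffs[0] = np.zeros_like(coeffs[0])
    coeffs[-1] = np.zeros_like(coeffs[-1])
    sig = pywt.waverec(coeffs, "db4")[: len(sig)]
    # 3) Remove the persistent baseline with a 15%-trimmed mean over a 33-sample window
    half = 33 // 2
    baseline = np.array([trim_mean(sig[max(0, i - half): i + half + 1], 0.15)
                         for i in range(len(sig))])
    return sig - baseline

cleaned = clean_ecg(np.random.randn(3000), fs_in=360)  # e.g. a 360 Hz MIT-BIH style strip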
• 37. Time Series Invariances
A complexity-invariant distance measure for time series https://doi.org/10.1137/1.9781611972818.60 Gustavo E. A. P. A. Batista, Xiaoyue Wang, and Eamonn J. Keogh. In Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pages 699–710. SIAM, 2011. Cited by 216.
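The complexity-invariant distance itself is a one-line correction on top of the Euclidean distance; a hedged sketch following the definition in Batista et al. (2011), with variable names of our own choosing:

import numpy as np

def complexity(x):
    # "Complexity estimate": length of the series when stretched out flat
    return np.sqrt(np.sum(np.diff(x) ** 2))

def cid_distance(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    ed = np.linalg.norm(x - y)                           # plain Euclidean distance
    cx, cy = complexity(x), complexity(y)
    correction = max(cx, cy) / max(min(cx, cy), 1e-12)   # >= 1, penalizes complexity mismatch
    return ed * correction

t = np.linspace(0, 1, 100)
smooth, jagged = np.sin(2 * np.pi * t), np.sin(2 * np.pi * t) + 0.2 * np.random.randn(100)
print(cid_distance(smooth, jagged))  # larger than np.linalg.norm(smooth - jagged)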
• 38. Time Series: DTW, the classical method
https://doi.org/10.1145/2888451.2888456
Stock Price Prediction with Fluctuation Patterns Using Indexing Dynamic Time Warping and k*-Nearest Neighbors. Kei Nakagawa, Mitsuyoshi Imamura, Kenichi Yoshida (2018) https://doi.org/10.1007/978-3-319-93794-6_7
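For reference, classical DTW is a small dynamic program; the sketch below is the textbook O(nm) version with no warping-window constraint, so it is illustrative rather than the indexed variant used in the paper above (production code would typically add a Sakoe-Chiba band or call a library such as dtaidistance or tslearn).

import numpy as np

def dtw_distance(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],       # insertion
                                 D[i, j - 1],       # deletion
                                 D[i - 1, j - 1])   # match
    return D[n, m]

print(dtw_distance(np.sin(np.linspace(0, 6, 80)), np.sin(np.linspace(0.5, 6.5, 100))))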
• 39. Learning invariances #1a
Learning to Exploit Invariances in Clinical Time-Series Data using Sequence Transformer Networks. Jeeheh Oh, Jiaxuan Wang, Jenna Wiens (submitted on 21 Aug 2018) https://arxiv.org/abs/1808.06725
Recently, researchers have started applying convolutional neural networks (CNNs) with 1D convolutions to clinical tasks involving time-series data. This is due, in part, to their computational efficiency relative to recurrent neural networks, and to their ability to efficiently exploit certain temporal invariances (e.g., phase invariance). However, it is well-established that clinical data may exhibit many other types of invariances (e.g., scaling). While preprocessing techniques (e.g., dynamic time warping) may successfully transform and align inputs, their use often requires one to identify the types of invariances in advance. In contrast, we propose the use of Sequence Transformer Networks, an end-to-end trainable architecture that learns to identify and account for invariances in clinical time-series data. Applied to the task of predicting in-hospital mortality, our proposed approach achieves an improvement in the AUROC.
To address these challenges, we propose Sequence Transformer Networks, an approach for learning task-specific invariances related to amplitude, offset, and scale directly from the data. Applied to clinical time-series data, Sequence Transformer Networks learn input- and task-dependent transformations. In contrast to data augmentation approaches, our proposed approach makes limited assumptions about the presence of invariances in the data.
• 40. Learning invariances #1b
Learning to Exploit Invariances in Clinical Time-Series Data using Sequence Transformer Networks. Jeeheh Oh, Jiaxuan Wang, Jenna Wiens (submitted on 21 Aug 2018) https://arxiv.org/abs/1808.06725
The proposed approach is not without limitation. More specifically, in its current form the Sequence Transformer applies the same transformation across all features within an example, instead of learning feature-specific transformations. Despite this limitation, the learned transformations still lead to an increase in intra-class similarity. In conclusion, we are encouraged by these preliminary results. Overall, this work represents a starting point on which others can build. In particular, we hypothesize that the ability to capture local invariances and feature-specific invariances could lead to further improvements in performance.
• 41. Learning invariances #2
Autowarp: Learning a Warping Distance from Unlabeled Time Series Using Sequence Autoencoders. Abubakar Abid, James Zou, Stanford University (submitted on 23 Oct 2018) https://arxiv.org/abs/1810.10107
Domain experts typically hand-craft or manually select a specific metric, such as dynamic time warping (DTW), to apply to their data. In this paper, we propose Autowarp, an end-to-end algorithm that optimizes and learns a good metric given unlabeled trajectories. We define a flexible and differentiable family of warping metrics, which encompasses common metrics such as DTW, Euclidean, and edit distance. Autowarp then leverages the representation power of sequence autoencoders to optimize for a member of this warping distance family. The output is a metric which is easy to interpret and can be robustly learned from relatively few trajectories. Future work will extend these results to more challenging time series data, such as those with higher dimensionality or heterogeneous data.
• 42. Learning invariances #3
NeuralWarp: Time-Series Similarity with Warping Networks. Josif Grabocka, Lars Schmidt-Thieme (submitted on 20 Dec 2018) https://arxiv.org/abs/1812.08306 | Related articles
In this paper we propose to learn a warping function for aligning the indices of time series in a deep latent representation. We compared the suggested architecture with two types of encoders (CNN or RNN) and a deep forward network as a warping function. Experimental comparisons to non-parametric and un-warped Siamese networks demonstrated that the proposed elastic deep similarity measure is more accurate than prior models.
• 44. SMOTE for imbalanced classes
SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification. Progress in Artificial Intelligence, December 2017, Volume 6, Issue 4, pp 347–354 https://doi.org/10.1007/s13748-017-0128-2
Considering a binary problem with a majority class and a minority class, it is likely that a learning algorithm ignores the latter and still achieves a high accuracy. There are three main ways of dealing with these situations [16]:
● Algorithmic modification: modifying learning algorithms in order to tackle the problem by design.
● Cost-sensitive learning: introducing costs for misclassification of the minority class at the data or algorithmic level.
● Data sampling: preprocessing the data in order to reduce the gap between the number of instances of each class.
The SMOTE technique is based on the idea of the neighborhood of the k-nearest neighbor (kNN) rule. The area under the ROC curve results show that the use of oversampling methods improves the detection of the minority class in Big Data datasets. We have also shown how our design can successfully work on a wide range of devices, including a laptop, while requiring reasonable times: around 25 min on high-end devices, and less than 2 h on the laptop, for the most time-demanding experiment.
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary (2018) https://doi.org/10.1613/jair.1.11192
● GS4 (Moutafis & Kakadiaris, 2014), SEG-SSC (Triguero et al., 2015) and OCHS-SSC (Dong et al., 2016) generate synthetic examples to diminish the drawbacks produced by the absence of labeled examples. Several learning techniques were checked and some properties, such as the common hidden space between labeled samples and the synthetic samples, were exploited.
● The technique proposed by Park et al. (2014) is a semi-supervised active learning method in which labels are incrementally obtained and applied using a clustering algorithm.
In the context of the current challenges outlined, we highlighted the need for enhancing the treatment of small disjuncts, noise, lack of data, overlapping, dataset shift and the curse of dimensionality. To do so, the theoretical properties of SMOTE regarding these data characteristics, and its relationship with the new synthetic instances, must be further analyzed in depth. Finally, we also posited that it is important to focus on data sampling and preprocessing approaches (such as SMOTE and its extensions) within the framework of Big Data and real-time processing.
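A hedged usage sketch of SMOTE with the imbalanced-learn package is shown below; the synthetic 95/5 dataset and the random-forest classifier are illustrative assumptions, and resampling is applied to the training split only so the test distribution stays untouched.

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 20)
y = np.r_[np.zeros(950), np.ones(50)]  # heavily imbalanced binary labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X_tr, y_tr)  # oversample minority
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_res, y_res)
print("resampled class counts:", np.bincount(y_res.astype(int)))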
• 47. State of the art: 2-year-old cutting edge #1
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data (2016). Markus Goldstein, Seiichi Uchida https://doi.org/10.1371/journal.pone.0152173
Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to be a new well-funded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time.
As a general summary for algorithm selection, we recommend using nearest-neighbor based methods, in particular k-NN for global tasks and LOF for local tasks, instead of clustering-based methods. If computation time is essential, HBOS is a good candidate, especially for larger datasets. Special attention should be paid to the nature of the dataset when applying local algorithms, and to whether local anomalies are of interest at all in that case.
Different anomaly detection modes depending on the availability of labels in the dataset: (a) supervised anomaly detection uses a fully labeled dataset for training; (b) semi-supervised anomaly detection uses an anomaly-free training dataset, and deviations in the test data from that normal model are then used to detect anomalies; (c) unsupervised anomaly detection algorithms use only intrinsic information of the data in order to detect instances deviating from the majority of the data.
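The two recommended detectors are easy to reproduce with scikit-learn; the hedged sketch below computes a global k-NN anomaly score and the Local Outlier Factor on a toy 2D dataset (neighbourhood sizes and data are illustrative assumptions, not the paper's benchmark settings).

import numpy as np
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(300, 2), rng.randn(5, 2) * 0.2 + 6])  # one cluster plus a few outliers

# Global k-NN score: mean distance to the k nearest neighbours (larger = more anomalous)
k = 10
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
knn_score = dists[:, 1:].mean(axis=1)  # column 0 is the zero self-distance

# Local Outlier Factor: density of a point relative to its neighbours' density (local anomalies)
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
lof_score = -lof.negative_outlier_factor_  # higher = more anomalous

print("top-5 global outliers:", knn_score.argsort()[-5:])
print("top-5 local outliers: ", lof_score.argsort()[-5:])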
• 48. State of the art: 2-year-old cutting edge #2
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data (2016). Markus Goldstein, Seiichi Uchida https://doi.org/10.1371/journal.pone.0152173
A visualization of the results of the k-NN global anomaly detection algorithm: the anomaly score is represented by the bubble size, whereas the color shows the labels of the artificially generated dataset.
Comparing Influenced Outlierness (INFLO) with the Local Outlier Factor (LOF) shows the usefulness of the reverse neighborhood set. For the red instance, LOF takes only the neighbors in the gray area into account, resulting in a high anomaly score. INFLO additionally takes the blue instances (reverse neighbors) into account and thus scores the red instance as more normal.
• 49. Anomaly detection: Cyber-physical systems
Anomaly Detection with Generative Adversarial Networks for Multivariate Time Series (2018). Dan Li, Dacheng Chen, Jonathan Goh, and See-Kiong Ng, Institute of Data Science, National University of Singapore https://arxiv.org/abs/1809.04758
Unsupervised machine learning techniques can be used to model the system behaviour and classify deviant behaviours as possible attacks. In this work, we proposed a novel Generative Adversarial Networks-based Anomaly Detection (GAN-AD) method for such complex networked CPSs. We used an LSTM-RNN in our GAN to capture the distribution of the multivariate time series of the sensors and actuators under normal working conditions of a CPS. Instead of treating each sensor's and actuator's time series independently, we model the time series of multiple sensors and actuators in the CPS concurrently to take into account potential latent interactions between them. To exploit both the generator and the discriminator of our GAN, we deployed the GAN-trained discriminator together with the residuals between generator-reconstructed data and the actual samples to detect possible anomalies in the complex CPS. We will also conduct further research on feature selection for multivariate anomaly detection, and investigate principled methods for choosing the latent dimension and PC dimension with theoretical guarantees.
• 50. Anomaly detection: Financial time series
Modeling approaches for time series forecasting and anomaly detection (2018). Du, Shuyang; Pandey, Madhulima; Xing, Cuiqun http://cs229.stanford.edu/proj2017/final-reports/5244275.pdf
This project focuses on prediction of time series data for Wikipedia page accesses over a period of more than twenty-four months. The methods explored here are K-nearest neighbors (KNN), Long short-term memory networks (LSTM), and Sequence to Sequence with Convolutional Neural Networks (CNN), and we compare predicted values to actual web traffic. The predictions can help us in anomaly detection in the series.
Pre-processing: "There are many series in which values are zero. This could be a missing value, or an actual lack of web page access. In addition, there are significant spikes in the data, where values have a broad range from 1 to hundreds/thousands for several web pages. We normalize this data by adding 1 to all entries, taking the log of the values, and setting the mean to zero and variance to one. We have the results of Fourier analysis for exploring periodicity on a weekly/monthly/quarterly basis."
Our approaches to time series prediction depend on features extracted from the time series data itself. Our models learn periodicity, ramps and other regular trends quite well. However, none of our models are able to capture spikes or outliers that arise from external sources. Enhancing the performance of the models will require augmenting our feature set from other sources such as news events and weather.
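The quoted normalization is a one-liner worth spelling out; a hedged sketch with toy counts, and a small epsilon on the standard deviation added as our own safeguard against constant series:

import numpy as np

def normalize_series(counts):
    x = np.log1p(np.asarray(counts, dtype=float))  # log(1 + count) keeps zero-access days defined
    return (x - x.mean()) / (x.std() + 1e-8)       # zero mean, unit variance per series

daily_hits = np.array([0, 3, 12, 7, 0, 950, 15, 9])  # note the spike these models struggle with
print(normalize_series(daily_hits))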
• 51. "Special Outliers": Disguised missing values
FAHES: A Robust Disguised Missing Values Detector. Qatar Computing Research Institute, HBKU, Doha, Qatar https://doi.org/10.1145/3219819.3220109
Missing values are common in real-world data and may seriously affect data analytics such as simple statistics and hypothesis testing. Generally speaking, there are two types of missing values: explicitly missing values (i.e. NULL values), and implicitly missing values (a.k.a. disguised missing values, DMVs) such as "11111111" for a phone number and "Some college" for education. While detecting explicitly missing values is trivial, detecting DMVs is not; the essential challenge is the lack of standardization about how DMVs are generated.
One future work we are planning to perform is to improve FAHES to detect the DMVs that are generated randomly within the range of the data. For example, when a child tries to create an account on a domain that has a minimum-age restriction, the child fakes her age with a random value that allows her to create the account. Such random fake values are hard, if not impossible, to detect. Moreover, although DMVs are the focus of this paper, there are more types of errors found in the wild. Many of the principles and techniques we have used to detect DMVs can be leveraged to detect other types of errors, so a natural next step is to extend the infrastructure we have built to detect those. This opens new challenges related to the robust identification of errors that could be interpreted differently by different modules.
• 53. Uncertainty and Novelty detection #1a
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018). Alireza Shafaei, Mark Schmidt, and James J. Little https://arxiv.org/abs/1809.04729
What makes this problem different from a typical supervised learning setting is that we cannot model the diversity of out-of-distribution samples in practice. The distribution of outliers used in training may not be the same as the distribution of outliers encountered in the application. Therefore, classical approaches that learn inliers vs. outliers with only two datasets can yield optimistic results. We introduce OD-test, a three-dataset evaluation scheme, as a practical and more reliable strategy to assess progress on this problem. The OD-test benchmark provides a straightforward means of comparison for methods that address the out-of-distribution sample detection problem.
In real-life deployment of products that use complex machinery such as deep neural networks (DNNs), we would have very little control over the input. In the absence of extrapolation guarantees, when the independently and identically distributed (IID) assumption is violated, the behaviour of the pipeline may be unpredictable. From a quality assurance perspective, it is desirable to detect and prevent these scenarios automatically. A reliable pipeline would first determine whether it can process a given sample, and only then use the prediction of the target neural network. The unfortunate incident that mislabeled people as non-human, for instance, is a clear example of OOD extrapolation that could have been prevented by such a decision scheme: the model simply did not know that it did not know. While incidents of a similar nature have fueled research on de-biasing the datasets and the deep learning machinery, we still need to identify the limitations of our models. The application is not limited to fortifying large-scale user-facing products. Successful detection of such violations could also be used in active learning, unsupervised learning, learning with noisy data, or simply as a condition for invoking transfer learning strategies. In this work, we are interested in evaluating mechanisms that detect OOD samples.
• 54. Uncertainty and Novelty detection #1b
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018). Alireza Shafaei, Mark Schmidt, and James J. Little https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
The Uncertainty View: a commonly invoked strategy in addressing similar problems is to characterize a notion of uncertainty. The literature distinguishes aleatoric uncertainty, the uncertainty inherent to the process (the known unknowns, like flipping a coin), from epistemic uncertainty, the uncertainty that can be eliminated with more information (the unknown unknowns). The Bayesian approach to epistemic uncertainty estimation is to measure the degree of disagreement among the potentially viable models (the posterior). The MC-Dropout approach is often advertised as a feasible method to estimate uncertainty for a variety of applications. Similarly, we can adopt a non-Bayesian approach by training independent models and then measuring the disagreement. Lakshminarayanan et al. show that an ensemble of five neural networks (DeepEnsemble) trained with an adversarial-sample-augmented strategy is sufficient to provide a non-Bayesian alternative for capturing predictive uncertainty. We evaluate DeepEnsemble and MC-Dropout.
* The Abstention View
* The Anomaly View: AEThreshold, PixelCNN++, K-NNSVM
* The Novelty View: OpenMax
We train these architectures with a cross-entropy loss (CE) and a k-way logistic regression loss (KL). CE loss is the typical choice for k-way classification tasks; it enforces mutual exclusion in the predictions. KL loss is the typical choice for attribute prediction tasks; it does not enforce mutual exclusivity of the predictions. We test these two loss functions to see if the exclusivity assumption of CE has an adverse effect on the ability to predict OOD samples. CE loss cannot make a None prediction without an explicitly defined None class, but KL loss can make None predictions through low activations of all the classes.
• 56. Uncertainty and Novelty detection #1d
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018). Alireza Shafaei, Mark Schmidt, and James J. Little https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test [PyTorch]
Related work in deep learning can be categorized into two broad groups based on the underlying assumptions: (i) in-distribution techniques, and (ii) out-of-distribution techniques. Guo et al. (2017) observed that modern neural networks tend to be overconfident in their predictions. They show that temperature scaling in the softmax operator, also known as Platt scaling, can be used to calibrate the output probabilities of a neural network to empirically align the accuracy of a prediction with its probability. Their efforts fall under the uncertainty estimation approaches. Geifman and El-Yaniv (2017) present a framework for selective classification with deep neural networks that follows the abstention view. A selection function decides whether to make a prediction or not. For the choice of selection function, they experiment with MC-Dropout and the softmax output. They provide an analytical trade-off between risk and coverage within their formulation.
Input perturbation serves as a way to assess how the network would behave near the given input. When the temperature is 1 and the perturbation step is 0, we simply recover the PbThreshold method. ODIN, the state-of-the-art at the time of this writing, is reported to outperform the previous work [8] by a significant margin. We also assess the performance of ODIN in our work. These methods provide an abstract idea which depends on the successful training of GANs. To the best of our knowledge, training GANs is itself an active area of research, and it is not apparent what design decisions would be appropriate to implement these ideas in practice. Furthermore, some of these ideas are prohibitively expensive to execute at the time of this writing.
• 57. Uncertainty and Novelty detection #1e
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018). Alireza Shafaei, Mark Schmidt, and James J. Little https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
Datasets: we extend the previous work by evaluating over a broader set of datasets with varying levels of complexity. The variation in complexity allows for a fine-grained evaluation of the techniques. Since OOD detection is closely related to the problem of density estimation, the dimensionality of the input image will be of vital importance in practical assessments. As the input dimensionality increases, we expect the task to become much more difficult. Therefore, to provide a more accurate picture of performance, it is crucial to evaluate the methods on high-dimensional data.
In low-dimensional datasets, K-NNSVM performs similarly to or better than the other methods (including MC-Dropout). The top-performing method, ODIN, is influenced by the number of classes in the dataset. Similar to PbThreshold, ODIN depends on the maximum signal in the class predictions, so the increased number of classes directly affects both methods. Furthermore, neither of them consistently prefers VGG over ResNet across all datasets. Overall, ODIN consistently outperforms the others in high-dimensional settings, but all the methods have a relatively low average accuracy in the 60%–78% range.
• 58. Uncertainty and Novelty detection #1f
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018). Alireza Shafaei, Mark Schmidt, and James J. Little https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
• 59. Uncertainty and Novelty detection #2
To Trust Or Not To Trust A Classifier. Heinrich Jiang, Been Kim, Maya Gupta (2018), Google Research; Google Brain https://arxiv.org/abs/1805.11783
We propose a new score, called the trust score, which measures the agreement between the classifier and a modified nearest-neighbor classifier on the testing example. We show empirically that high (low) trust scores produce surprisingly high precision at identifying correctly (incorrectly) classified examples, consistently outperforming the classifier's confidence score as well as many other baselines.
Two example datasets and models: predicting correctness (top row) and incorrectness (bottom). The vertical dotted black line indicates the accuracy level of the classifier. The trust score consistently attains a higher precision for each given percentile of classifier decision-rejection. Furthermore, the trust score generally shows increasing precision as the percentile level increases but, surprisingly, many of the comparison baselines do not.
• 60. Uncertainty and Novelty detection #3
Interpreting Neural Networks With Nearest Neighbors. Eric Wallace, Shi Feng, Jordan Boyd-Graber https://arxiv.org/abs/1809.02847
Local model interpretation methods explain individual predictions by assigning an importance value to each input feature. This value is often determined by measuring the change in confidence when a feature is removed. However, the confidence of neural networks is not a robust measure of model uncertainty. This issue makes reliably judging the importance of the input features difficult. We address this by changing the test-time behavior of neural networks using Deep k-Nearest Neighbors. Without harming text classification accuracy, this algorithm provides a more robust uncertainty metric which we use to generate feature importance values. The resulting interpretations better align with human perception than baseline methods. Finally, we use our interpretation method to analyze model predictions on dataset annotation artifacts.
Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. Nicolas Papernot and Patrick D. McDaniel (2018) https://arxiv.org/abs/1803.04765
Debugging ResNet model biases: this illustrates how the DkNN algorithm helps to understand a bias identified by Stock and Cisse [105] in the ResNet model for ImageNet. The image at the bottom of each column is the test input presented to the DkNN. Each test input is cropped slightly differently to include (left) or exclude (right) the football. Images shown at the top are nearest neighbors in the predicted class according to the representation output by the last hidden layer. This comparison suggests that the "basketball" prediction may have been a consequence of the ball being in the picture. Also note how the white apparel color and general arm positions of players often match the test image of Barack Obama.
• 61. Uncertainty and Novelty detection #4
AND: Autoregressive Novelty Detectors. Davide Abati, Angelo Porrello, Simone Calderara, Rita Cucchiara (submitted on 4 Jul 2018) https://arxiv.org/abs/1807.01653
We propose an unsupervised model for novelty detection. The subject is treated as a density estimation problem, in which a deep neural network is employed to learn a parametric function that maximizes the probabilities of training samples. This is achieved by equipping an autoencoder with a novel module, responsible for the maximization of compressed codes' likelihood by means of autoregression. We illustrate design choices and proper layers to perform autoregressive density estimation when dealing with both image and video inputs. Despite a very general formulation, our model shows promising results in diverse one-class novelty detection and video anomaly detection benchmarks.
The structure of the proposed autoencoder: paired with a standard compression-reconstruction network, a density estimation module learns the distribution of latent codes via autoregression.
• 62. Anomaly detection with GANs #1
Anomaly detection with Wasserstein GAN. Ilyass Haloui, Jayant Sen Gupta, and Vincent Feuillard (submitted on 11 Dec 2018) https://arxiv.org/pdf/1812.02463
In this paper, we investigate GANs to perform anomaly detection on a time series dataset. In order to achieve this goal, a bibliography is made focusing on theoretical properties of GANs and on GANs used for anomaly detection. A Wasserstein GAN has been chosen to learn the representation of the normal data distribution, and a stacked encoder together with the generator performs the anomaly detection. W-GAN with an encoder seems to produce state-of-the-art anomaly detection scores on the MNIST dataset, and we investigate its usage on multivariate time series.
Based on this literature review, we chose to perform anomaly detection using a Wasserstein Generative Adversarial Network. The main reason is that the Wasserstein GAN does not collapse, contrarily to the classical GAN which needs to be heavily tuned in order to avoid this problem. Mode collapse can be blocking if we need to perform anomaly detection: if a subset of our data distribution is not learned by the generator, then all samples that are similar to this subset might end up classified as abnormal. Another added value of the Wasserstein GAN compared to a standard GAN is the possibility of using the loss function of the discriminator to evaluate convergence, since it is an approximation of the Wasserstein distance between Pr and Pθ.
A future improvement consists in considering CNNs for both the generator and the discriminator in order to detect anomalies from raw time series data. 1D convolutions are needed and will be investigated to produce good visual representations of time series samples. A more thorough study of the impact of the architecture should also be done.
• 63. Anomaly detection with GANs #2
MAD-GAN: Multivariate Anomaly Detection for Time Series Data with Generative Adversarial Networks. Dan Li, Dacheng Chen, Lei Shi, Baihong Jin, Jonathan Goh, and See-Kiong Ng (submitted on 15 Jan 2019), Institute of Data Science, National University of Singapore https://arxiv.org/abs/1901.04997
In this work, we propose a novel Multivariate Anomaly Detection strategy with GAN (MAD-GAN) to model the complex multivariate correlations among the multiple data streams and to detect anomalies using both the GAN-trained generator and discriminator. Unlike traditional classification methods, the GAN-trained discriminator learns to detect fake data from real data in an unsupervised fashion, making it an attractive unsupervised machine learning technique for anomaly detection.
Given that this is an early attempt at multivariate anomaly detection on time series data using GANs, there are interesting issues that await further investigation. For example, we have noted the issues of determining the optimal subsequence length as well as the potential model instability of the GAN approaches. For future work, we plan to conduct further research on feature selection for multivariate anomaly detection, and investigate principled methods for choosing the latent dimension and PC dimension with theoretical guarantees. We also hope to perform a detailed study on the stability of the detection model. In terms of applications, we plan to explore the use of MAD-GAN for other anomaly detection applications such as predictive maintenance and fault diagnosis for smart buildings and machinery.
• 64. Uncertainty: Insights from NLP uncertainty
Quantifying Uncertainties in Natural Language Processing Tasks. Yijun Xiao and William Yang Wang (submitted on 18 May 2018) https://arxiv.org/abs/1811.07253
In this paper, we propose novel methods to study the benefits of characterizing model and data uncertainties for natural language processing (NLP) tasks. With empirical experiments on sentiment analysis, named entity recognition, and language modeling using convolutional and recurrent neural network models, we show that explicitly modeling uncertainties is not only necessary to measure output confidence levels, but also useful for enhancing model performance in various NLP tasks.
1. We mathematically define model and data uncertainties via the law of total variance;
2. Our empirical experiments show that by accounting for model and data uncertainties, we observe significant improvements in three important NLP tasks;
3. We show that our model outputs higher data uncertainties for more difficult predictions in sentiment analysis and named entity recognition tasks.
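The decomposition referred to in point 1 is the standard law of total variance, which splits the predictive variance into a data (aleatoric) term and a model (epistemic) term; the notation below is ours, not copied from the paper:

\[
\operatorname{Var}(y \mid x)
  = \underbrace{\mathbb{E}_{\theta}\big[\operatorname{Var}(y \mid x, \theta)\big]}_{\text{data uncertainty}}
  \;+\; \underbrace{\operatorname{Var}_{\theta}\big(\mathbb{E}[y \mid x, \theta]\big)}_{\text{model uncertainty}}
\]

In practice both terms can be estimated by sampling model parameters θ (e.g. via dropout) and recording, per input, the predicted mean and the predicted noise variance across samples.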
• 65. Uncertainty: CNNs + Gaussian Processes
Calibrating Deep Convolutional Gaussian Processes. Gia-Lac Tran, Edwin V. Bonilla, John P. Cunningham, Pietro Michiardi, Maurizio Filippone (submitted on 26 May 2018) https://arxiv.org/abs/1805.10522
Despite the considerable interest in combining CNNs with GPs, little attention has been devoted to understanding the implications in terms of the ability of these models to accurately quantify the level of uncertainty in predictions. This is the first work that highlights the calibration issues of these models, showing that GPs cannot cure the miscalibration in CNNs. We have proposed a novel combination of CNNs and GPs where the resulting model becomes a particular form of a Bayesian CNN for which inference using variational inference is straightforward. However, our results also indicate that combining CNNs and GPs does not significantly improve the performance of standard CNNs. This can serve as a motivation for investigating new approximation methods for scalable inference in GP models and combinations with CNNs.
Calibration of Convolutional Networks: the issue of calibration of classifiers in machine learning was popularized in the 90s with the use of support vector machines for probabilistic classification. Calibration techniques aim to learn a transformation of the output using a validation set, so that the transformed output gives a reliable account of the actual probability of class labels; interestingly, calibration can be applied regardless of the probabilistic nature of the untransformed output of the classifier. Popular calibration techniques include Platt scaling and isotonic regression. Classifiers based on Deep Neural Networks (DNNs) have been shown to be well-calibrated; the reason is that the optimization of the cross-entropy loss promotes calibrated output. The same loss is used in Platt scaling and it corresponds to the correct multinomial likelihood for class labels. Recent studies on the calibration of CNNs, which are a particular case of DNNs, however, show that depth has a negative impact on calibration, despite the use of a cross-entropy loss, and that regularization improves the calibration properties of classifiers [Guo et al. 2017].
Combinations of ConvNets and Gaussian Processes: thinking of Bayesian priors as a form of regularization, it is natural to assume that Bayesian CNNs can "cure" the miscalibration of modern CNNs. Despite the abundant literature on Bayesian DNNs, far less attention has been devoted to Bayesian CNNs, and the calibration properties of these approaches have not been investigated. In this work, we propose an alternative way to combine CNNs and GPs, where GPs are approximated using random feature expansions. The random feature expansion approximation amounts to replacing the original kernel matrix with a low-rank approximation, turning GPs into Bayesian linear models. Combining this with CNNs leads to a particular form of Bayesian CNNs, much like GPs and DGPs are particular forms of Bayesian DNNs. Inference in Bayesian CNNs is intractable and requires some form of approximation. In this work, we draw on the interpretation of dropout as variational inference, employing the so-called Monte Carlo Dropout (MCD) to obtain a practical way of combining CNNs and GPs.
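Monte Carlo Dropout, which the paper leans on for approximate inference, amounts to keeping dropout active at prediction time and averaging several stochastic forward passes; the tiny 1D CNN below is a hedged illustration with made-up layer sizes, not the architecture from the paper.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(1, 16, kernel_size=5, padding=2)
        self.drop = nn.Dropout(0.2)
        self.fc = nn.Linear(16, n_classes)

    def forward(self, x):                      # x: (batch, 1, time)
        h = self.drop(torch.relu(self.conv(x)))
        return self.fc(h.mean(dim=-1))         # global average pooling over time

def mc_dropout_predict(model, x, n_samples=50):
    model.train()                              # keep dropout stochastic at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return probs.mean(0), probs.std(0)         # predictive mean and a simple spread estimate

mean_prob, spread = mc_dropout_predict(TinyCNN(), torch.randn(4, 1, 128))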
• 66. Uncertainty in timestamps, modeling for clinical use #1
Time-Discounting Convolution for Event Sequences with Ambiguous Timestamps (submitted on 6 Dec 2018) https://arxiv.org/abs/1812.02395
This paper proposes a method for modeling event sequences with ambiguous timestamps, a time-discounting convolution. Unlike in ordinary time series, time intervals are not constant, small time-shifts have no significant effect, and inputting timestamps or time durations into a model is not effective. The criteria that we require for the modeling are providing robustness against time-shifts or timestamp uncertainty, as well as maintaining the essential capabilities of time-series models, i.e., forgetting meaningless past information and handling infinite sequences. The proposed method handles them with a convolutional mechanism across time with specific parameterizations, which efficiently represents the event dependencies in a time-shift invariant manner while discounting the effect of past events, and a dynamic pooling mechanism, which provides robustness against the uncertainty in timestamps and enhances the time-discounting capability by dynamically changing the pooling window size.
• 68. Types of Missing Values
Feldman et al. (2018): "Rubin (1976) discusses three possible mechanisms for the formation of missing values, each reflecting a different form of missing-data probabilities and relationships between the measured variables, and each may lead to different imputation methods (Luengo et al., 2012)."
Missing Completely at Random (MCAR): a missing value that cannot be related to the value itself or to other variable values in that record. This is a completely unsystematic missing pattern, and therefore the observed data can be thought of as a random, unbiased sample of a complete dataset.
Missing at Random (MAR): cases in which a missing value is related to other variable values in that record, but not to the value itself (e.g., a person with a "marital status" value of "single" has a missing value in the "spouse name" attribute). In other words, in MAR scenarios incomplete data can be partially explained, and the actual value can possibly be predicted from other variable values.
Missing Not at Random (MNAR): the missing value is not random and depends on the actual value itself; hence, it cannot be explained by other values (e.g., an overweight person is reluctant to provide the "weight" value in a survey). MNAR scenarios are the most difficult to analyze and handle, as the missing data cannot be associated with other data items that are available in the dataset.
https://statistical-programming.com/missing-data/
Missing in action: the dangers of ignoring missing data https://doi.org/10.1016/j.tree.2008.06.014
• 69. Intro to imputation methods
Comparison of Estimating Missing Values in IoT Time Series Data Using Different Interpolation Algorithms (August 2018) https://doi.org/10.1007/s10766-018-0595-5
"When collecting Internet of Things data using various sensors or other devices, it may be possible to miss several kinds of values of interest. In this paper, we focus on estimating the missing values in IoT time series data using three interpolation algorithms: (1) Radial Basis Functions, (2) Moving Least Squares (MLS), and (3) Adaptive Inverse Distance Weighted."
On the choice of the best imputation methods for missing values considering three groups of classification methods (June 2011) https://doi.org/10.1007/s10115-011-0424-2 | https://sci2s.ugr.es/MVDM
"In this work, we focus on a classification task with twenty-three classification methods and fourteen different imputation approaches to missing values treatment that are presented and analyzed. The analysis involves a group-based approach, in which we distinguish between three different categories of classification methods. Each category behaves differently, and the evidence obtained shows that the use of determined missing-value imputation methods could improve the accuracy obtained for these methods. In this study, the convenience of using imputation methods for preprocessing data sets with missing values is stated. The analysis suggests that the use of particular imputation methods conditioned on the groups is required." We have discovered that the Combined Multivariate Collapsing (CMC) and Event Covering (EC) methods show good behavior for these two measures, and they provide good results for an important range of learning methods, as previously analyzed. In short, these two approaches introduce less noise and maintain the mutual information better.
Class center based approach for missing value imputation (2018) https://doi.org/10.1016/j.knosys.2018.03.026
A novel missing value imputation is introduced, composed of two modules. Each class center and its distances from the other observed data are measured to identify a threshold; the identified threshold is then used for missing value imputation. The proposed approach outperforms the other approaches for both numerical and mixed datasets, and requires much less imputation time than the machine learning based methods.
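Before the learned imputers on the following slides, simple interpolation remains the usual baseline; a hedged pandas sketch for a regularly sampled series (the methods and toy values are generic choices, not those benchmarked in the papers above):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 1.2, np.nan, np.nan, 2.1, 2.4, np.nan, 3.0],
              index=pd.date_range("2019-01-01", periods=8, freq="H"))

linear = s.interpolate(method="time")             # linear in time across the gaps
spline = s.interpolate(method="spline", order=2)  # smoother; requires scipy
print(pd.DataFrame({"raw": s, "linear": linear, "spline": spline}))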
• 70. Imputation with Deep Learning #1
BRITS: Bidirectional Recurrent Imputation for Time Series. Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, Yitan Li (submitted on 27 May 2018) https://arxiv.org/abs/1805.10572 https://github.com/NIPS-BRITS/BRITS
Existing imputation methods often impose strong assumptions on the underlying data-generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing-value imputation in time series data. Our proposed method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of the RNN graph and can be effectively updated during backpropagation. We simultaneously perform missing-value imputation and classification/regression jointly in one neural graph. BRITS has three advantages: (a) it can handle multiple correlated missing values in time series; (b) it generalizes to time series with nonlinear underlying dynamics; (c) it provides a data-driven imputation procedure and applies to general settings with missing data. We evaluate the imputation performance in terms of mean absolute error (MAE) and mean relative error (MRE).
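The two reported metrics are straightforward to compute over the imputed positions only; a hedged sketch with variable names of our own:

import numpy as np

def mae_mre(y_true, y_pred, missing_mask):
    # Evaluate only where values were actually missing and later imputed
    diff = np.abs(y_true - y_pred)[missing_mask]
    mae = diff.mean()
    mre = diff.sum() / (np.abs(y_true[missing_mask]).sum() + 1e-8)
    return mae, mre

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.5, 2.8, 4.0])
mask = np.array([False, True, True, False])  # the 2nd and 3rd values were imputed
print(mae_mre(y_true, y_pred, mask))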
• 71. Imputation with Deep Learning #2
End-to-End Time Series Imputation via Residual Short Paths. Lifeng Shen, Qianli Ma, Sen Li (2018) http://proceedings.mlr.press/v95/shen18a.html
We propose an end-to-end imputation network with residual short paths, called Residual IMPutation LSTM (RIMP-LSTM), a flexible combination of residual short paths with graph-based temporal dependencies. We construct a residual sum unit (RSU), which enables RIMP-LSTM to make full use of previously revealed information to model incomplete time series and reduce the negative impact of missing values. Moreover, a switch unit is designed to detect the missing values, and a new loss function is then developed to train our model with time series in the presence of missing values in an end-to-end way, which also allows simultaneous imputation and prediction. RIMP-LSTM combines the merits of graph-based models, with explicitly modeled temporal dependencies via weighted residual connections between nodes, with those of LSTMs, which can accumulate historical residual information and learn the underlying patterns of incomplete time series automatically. On the other hand, compared with IMP-LSTM, RIMP-LSTM has better performance as it is good at modeling temporal dependencies with weighted residual short paths, which demonstrates the reasonableness of using these weighted residual paths to model graph-like temporal dependencies for imputation.
• 72. Imputation with Deep Learning #3
A context encoder for audio inpainting. Andres Marafioti, Nathanael Perraudin, Nicki Holighaus, and Piotr Majdak (submitted on 29 Oct 2018) https://arxiv.org/abs/1810.12138 http://www.github.com/andimarafioti/audioContextEncoder (Python, Matlab)
We studied the ability of deep neural networks (DNNs) to restore missing audio content based on its context, a process usually referred to as audio inpainting. We focused on gaps in the range of tens of milliseconds, a condition which has not received much attention yet. The proposed DNN structure was trained on audio signals containing music and musical instruments, separately, with 64-ms long gaps.
Here, the STFT features, meant as a reasonable first choice, provided a decent performance. In the future, we expect more hearing-related features to provide even better reconstructions. In particular, an investigation of Audlet frames, i.e., invertible time-frequency systems adapted to perceptual frequency scales, as features for audio inpainting presents intriguing opportunities. Preferred architectures are those not relying on a predetermined target and input feature length, e.g., a recurrent network. Recent advances in generative networks will provide other interesting alternatives for analyzing and processing audio data as well; these approaches are yet to be fully explored. Finally, music data can be highly complex and it is unreasonable to expect a single trained model to accurately inpaint a large number of musical styles and instruments at once. Thus, instead of training on a very general dataset, we expect significantly improved performance for more specialized networks that could be trained by restricting the training data to specific genres or instrumentation. Applied to a complex mixture and potentially preceded by a source-separation algorithm, the resulting models could be used jointly in a mixture-of-experts approach.
• 73. Imputation with Deep Learning #4: GANs
NAOMI: Non-Autoregressive Multiresolution Sequence Imputation. Yukai Liu, Rose Yu, Stephan Zheng, Eric Zhan, Yisong Yue (submitted on 30 Jan 2019) https://arxiv.org/abs/1901.10946
Leveraging multiresolution modeling and adversarial training, NAOMI is able to learn the conditional distribution given very few known observations and achieves superior performance in various experiments on both deterministic and stochastic dynamics. Future work will investigate how to infer the underlying distribution when complete training data is unavailable. The trade-off between partial observations and external constraints is another direction for deep generative imputation models.
• 74. Effect of missing values on classification performance
A methodology for quantifying the effect of missing data on decision quality in classification problems. Received 09 Mar 2016, accepted 22 Dec 2016, accepted author version posted online 13 Jan 2017 https://doi.org/10.1080/03610926.2016.1277752
"This study suggests that the negative impact of poor data quality (DQ) on decision making is often mediated by biased model estimation. To highlight this perspective, we develop an analytical framework that links three quality levels: data, model, and decision. The general framework is first developed at a high level."
Evolutionary Machine Learning for Classification with Incomplete Data. Tran, Cao Truong (2018, PhD Thesis) http://hdl.handle.net/10063/7639
"The thesis develops approaches for improving imputation for classification with incomplete data by integrating clustering and feature selection with imputation. The approaches improve both the effectiveness and the efficiency of using imputation for classification with incomplete data. The thesis develops interval genetic programming to directly evolve classifiers for incomplete data. The results show that classifiers generated by interval genetic programming can be more effective and efficient than classifiers generated by the combination of imputation and traditional genetic programming. Interval genetic programming is also more effective than common classification algorithms able to work directly with incomplete data."
• 75. Imputation and Classification
Missing Data Imputation for Supervised Learning (August 2018) https://doi.org/10.1080/08839514.2018.1448143
"This paper compares methods for imputing missing categorical data for supervised classification tasks."
The results of the present study show that perturbation can help increase predictive accuracy for imputed models, but not for one-hot encoded models. Future work can identify the conditions under which missing-data perturbation can improve prediction accuracy. Interesting extensions of this paper include evaluating the benefits of using missing-data perturbation over more popular regularization techniques such as dropout training.
Error rates on the Adult test set with (bottom) and without (top) missing-data imputation, for various levels of MCAR-perturbed categorical training features (x-axis). The Adult dataset contains N = 48,842 examples and 14 features (6 continuous and 8 categorical); the prediction task is to determine whether a person makes over $50,000 a year.
• 77. CEEMD: Empirical Mode Decomposition
Empirical mode decomposition for seismic time-frequency analysis. Jiajun Han and Mirko van der Baan. Geophysics (2013) 78 (2): O9–O19. https://doi.org/10.1190/geo2012-0199.1
Complete ensemble empirical mode decomposition (CEEMD) decomposes a seismic signal into a sum of oscillatory components with guaranteed positive and smoothly varying instantaneous frequencies. Analysis on synthetic and real data demonstrates that this method promises higher spectral-spatial resolution than the short-time Fourier transform or the wavelet transform. Application on field data thus offers the potential of highlighting subtle geologic structures that might otherwise escape unnoticed. CEEMD is a robust extension of EMD methods: it solves not only the mode-mixing problem, but also leads to complete signal reconstructions. After CEEMD, instantaneous frequency spectra manifest visibly higher time-frequency resolution than short-time Fourier and wavelet transforms on synthetic and field data examples. These characteristics render the technique highly promising for seismic processing and interpretation.
Introducing libeemd: A program package for performing the ensemble empirical mode decomposition (July 2015). Computational Statistics 31(2):1-13. P. J. J. Luukko, Jouni Helske, E. Räsänen. C, R and Python. http://doi.org/10.1007/s00180-015-0603-9 https://bitbucket.org/luukko/libeemd
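A hedged sketch of ensemble EMD on a toy two-tone signal is given below, assuming the PyEMD package (published on PyPI as EMD-signal); the libeemd/pyeemd bindings cited above expose an equivalent eemd() routine, and the ensemble size and noise width here are illustrative.

import numpy as np
from PyEMD import EEMD

t = np.linspace(0, 1, 1000)
signal = (np.sin(2 * np.pi * 5 * t)            # slow oscillation
          + 0.5 * np.sin(2 * np.pi * 40 * t)   # fast oscillation
          + 0.1 * np.random.randn(t.size))     # noise

eemd = EEMD(trials=100, noise_width=0.05)      # noise-assisted ensemble decomposition
imfs = eemd.eemd(signal, t)                    # rows = intrinsic mode functions, fast to slow
print(imfs.shape)                              # e.g. (n_imfs, 1000); the Hilbert transform of
                                               # each IMF then gives the time-frequency picture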
• 78. Source Separation, "signal decomposition" #1
Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation. Daniel Stoller, Sebastian Ewert, Simon Dixon. Queen Mary University of London, Spotify (submitted on 8 Jun 2018) https://arxiv.org/abs/1806.03185 | https://github.com/f90/Wave-U-Net
"Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time domain, which allows modelling phase information and avoids fixed spectral transformations. Due to the high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high-quality separation results because of long-range temporal correlations. In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts. Experiments for singing voice separation indicate that our architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data."
75 tracks from the training partition of the MUSDB multi-track database are randomly assigned to the training set. For singing voice separation, the whole CCMixter database is also added to the training set. No further data preprocessing is performed, only a conversion to mono (except for stereo models) and downsampling to 22050 Hz. For future work, we could investigate to which extent our model performs a spectral analysis, and how to incorporate computations similar to those in a multi-scale filterbank, or to explicitly compute a decomposition of the input signal into a hierarchical set of basis signals and weightings on which to perform the separation, similar to TasNet [12]. Furthermore, better loss functions for raw audio prediction should be investigated, such as the ones provided by generative adversarial networks [3, 21], since the MSE might not reflect the perceived loss of quality well.
• 79. Source Separation "signal decomposition" #2. TasNet: Surpassing Ideal Time-Frequency Masking for Speech Separation. Yi Luo, Nima Mesgarani (Submitted on 21 Sep 2018) https://arxiv.org/abs/1809.07454 "TasNet uses a convolutional encoder to create a representation of the signal that is optimized for extracting individual speakers. Speaker extraction is achieved by applying a weighting function (mask) to the encoder output. The modified encoder representation is then inverted to the sound waveform using a linear decoder. A linear deconvolution layer serves as a decoder by inverting the encoder output back to the sound waveform. This encoder-decoder framework is similar to the ICA method when a nonnegative mixing matrix is used [Wang et al. 2009] and to the semi-nonnegative matrix factorization method (semi-NMF) [Ding et al. 2008], where the basis signals are the parameters of the decoder. The masks are found using a temporal convolutional network (TCN) consisting of dilated convolutions, which allow the network to model the long-term dependencies of the speech signal. This end-to-end speech separation algorithm significantly outperforms previous time-frequency methods in terms of separating speakers in mixed audio, even when compared to the separation accuracy achieved with the ideal time-frequency mask of the speakers. In addition, TasNet has a smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications."
  • 80. SourceSeparation ”signaldecomposition”#3 DisentanglingCorrelatedSpeakerandNoisefor SpeechSynthesis viaDataAugmentationand AdversarialFactorization Wei-NingHsu, Yu Zhang, Ron J. Weiss, Yu-An Chung, Yuxuan Wang, YonghuiWu, JamesGlass. 32nd ConferenceonNeural InformationProcessing Systems (NIPS 2018), Montréal, Canada. https://openreview.net/pdf?id=Bkg9ZeBB37 “To leverage crowd-sourced data to train multi-speaker text- to-speech (TTS) models that can synthesize clean speech for all speakers, it is essential to learn disentangled representations which can independently control the speaker identity and background noise in generated signals. However, learning such representations can be challenging, duetothe lackoflabelsdescribingtherecordingconditionsof each training example, and the fact that speakers and recording conditions are often correlated, e.g. since users oftenmakemanyrecordingsusingthesameequipment. This paper proposes three components to address this problem by: (1) formulating a conditional generative model with factorized latent variables, (2) using data augmentation to add noise that is not correlated with speaker identity and whose label is known during training, and (3) using adversarial factorization to improve disentanglement. Experimental results demonstrate that the proposed method can disentangle speaker and noise attributes even if they are correlated in the training data, and can be used to consistentlysynthesizecleanspeechforallspeakers.”
  • 81. Decompose HighandLow frequencies Drop anOctave:ReducingSpatialRedundancy in Convolutional Neural Networks withOctave Convolution YunpengChen, HaoqiFang, BingXu, ZhichengYan, YannisKalantidis, MarcusRohrbach, ShuichengYan, JiashiFeng (Submitted on 10 Apr 2019) https://export.arxiv.org/abs/1904.05049 In this work, we propose to factorize the mixed feature maps by their frequencies and design a novel Octave Convolution (OctConv) operation to store and process feature maps that vary spatially "slower" at a lower spatial resolution reducing both memory and computation cost. Unlike existing multi-scale meth-ods, OctConv is formulated as a single, generic, plug-and-play convolutional unit that can be used as a direct replacement of (vanilla) convolutions without any adjustments in the network architecture. It is also orthogonal and complementary to methods that suggest better topologies or reduce channel-wise redundancy like group or depth-wise convolutions. We experimentally show that by simply replacing con-volutions with OctConv, we can consistently boost accuracy for both image and video recognition tasks, while reducing memoryandcomputationalcost.
  • 82. Decompose Signalandthe Noise Deeplearningofdynamicsandsignal-noise decompositionwithtime-steppingconstraints Samuel H. Rudy, J. Nathan Kutz, Steven L. Brunton Department of Applied Mathematics/ Mechanical Engineering, Universityof Washington, Seattle, last revised 22 Aug2018 https://arxiv.org/abs/1808.02578 https://github.com/snagcliffs/RKNN “We propose a novel paradigm for data-driven modeling that simultaneously learns the dynamics and estimates the measurement noise at each observation. By constraining our learning algorithm, our method explicitly accounts for measurement error in the map between observations, treating both the measurement error and the dynamics as unknowns to be identified,ratherthan assumingidealizednoiselesstrajectories. We also discuss issues with the generalizability of neural network models for dynamicalsystemsand provide open-source code for allexamples.” The combination of neural networks and numerical time-stepping schemes suggests a number of high-priority research directions in system identification and data-driven forecasting. Future extensions of this work include considering systems with process noise, a more rigorous analysis of the specific method for interpolating f, including time delay coordinates to accommodate latent variables, and generalizing the method to identify partial differential equations. Rapid advances in hardware and the ease of writing software for deep learning will enable these innovations through fast turnover in developing and testing methods.
• 84. Super-resolution: Insights from audio. Time-frequency networks for audio super-resolution. Teck Yian Lim et al. (2018) http://isle.illinois.edu/sst/pubs/2018/lim18icassp.pdf http://tlim11.web.engr.illinois.edu/ "Audio super-resolution (a.k.a. bandwidth extension) is the challenging task of increasing the temporal resolution of audio signals. Recent deep network approaches achieved promising results by modeling the task as a regression problem in either the time or frequency domain. In this paper, we introduced the Time-Frequency Network (TFNet), a deep network that utilizes supervision in both the time and frequency domain. We proposed a novel model architecture which allows the two domains to be jointly optimized." Spectrogram corresponding to the LR input (frequencies above 4 kHz missing), the HR reconstruction, and the HR ground truth. Our approach successfully recovers the high frequency components from the LR audio signal.
• 85. GANs Also for time-series denoising #1a. Denoising Time Series Data Using Asymmetric Generative Adversarial Networks. Sunil Gandhi; Tim Oates; Tinoosh Mohsenin and David Hairston (2018) https://doi.org/10.1007/978-3-319-93040-4_23 "In this paper, we explicitly learn to remove noise from time series data without assuming a prior distribution of noise. We propose an online, fully automated, end-to-end system for denoising time series data. Our model for denoising time series is trained using unpaired training corpora and does not need information about the source of the noise or how it is manifested in the time series. We propose a new architecture called AsymmetricGAN that uses a generative adversarial network for denoising time series data." Consider, for example, a widely used method for time series featurization called Symbolic Aggregate approXimation (SAX) that assumes time series are generated from a single normal distribution. As shown in [...], this assumption does not hold in several real-life time series datasets. Other techniques assume noise comes from a Gaussian distribution and estimate the parameters of that distribution. This assumption does not hold for data sources like electroencephalography (EEG), where noise can have diverse characteristics and originate from different sources. Hence, in this work, we focus on learning the characteristics of noise in EEG data and removing it as a preprocessing step. ICA has high computational complexity and large memory requirements, making it unsuitable for real-time applications. For training of our network, we only need a set of clean signals and a set of noisy signals. We do not need paired training data, i.e., we do not need clean versions of the noisy data. This is particularly useful for applications like artifact removal in EEG data, as we cannot record clean versions of noisy EEG.
• 86. GANs Also for time-series denoising #1b. Denoising Time Series Data Using Asymmetric Generative Adversarial Networks. Sunil Gandhi; Tim Oates; Tinoosh Mohsenin and David Hairston (2018) https://doi.org/10.1007/978-3-319-93040-4_23 Pre-processing: The DC component in EEG data is different for each recording. We normalize every window of clean and noisy data to remove the DC offset from the data. We remove the DC offset by subtracting the median of the data in the window. Evaluation of EEG data is challenging as the ground truth noiseless signals are not known. Multiple approaches to evaluation have been proposed in recent years; however, authors do not agree on a single mechanism for evaluating artifact removal.
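A sketch of the per-window DC-offset removal described above, assuming a fixed window length (the paper's exact window size is not given on the slide):

```python
# Per-window DC-offset removal for EEG: subtract the median of each window.
import numpy as np

def remove_dc_offset(eeg, window_len=512):
    """eeg: 1D array (one EEG channel). Returns a median-centred copy."""
    eeg = np.asarray(eeg, dtype=float)
    out = eeg.copy()
    for start in range(0, len(eeg), window_len):
        win = slice(start, start + window_len)
        out[win] = eeg[win] - np.median(eeg[win])   # remove the per-window DC component
    return out
```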
  • 87. GANs Alsoforspeechdenoising Segan:Speechenhancementgenerative adversarialnetwork. SantiagoPascual, AntonioBonafonte, and Joan Serra (2017) https://arxiv.org/abs/1703.09452 https://github.com/santi-pdp/segan “For the purpose of speech enhancement and denoising, the SEGAN was developed, employing a neural network with an encoder and decoder pathway that successively halves and doubles the resolution of feature maps in each layer, respectively, and features skip connections betweenencoderanddecoderlayersa. The model works as an encoder-decoder fully- convolutional structure, which makes it fast to operate for denoising waveform chunks. The results show that, not only the method is viable, but it can also represent an effective alternative to current approaches. Possible future work involves the exploration of better convolutional structures and the inclusion of perceptual weightings in the adversarial training, so that we reduce possible high frequency artifacts that might be introduced by the current model. Further experiments need to be done to compare SEGANwithothercompetitiveapproaches.” Thedatasetisaselectionof30speakers fromtheVoiceBankcorpus
  • 88. GANs Alsoformultichannelaudiodenoising Multi-ViewNetworks forDenoisingofArbitrary NumbersofChannels Jonah Casebeer, Brian Luc and ParisSmaragdis (July2018) https://arxiv.org/abs/1806.05296 “We propose a set of denoising neural networks capable of operating on an arbitrary number of channels at runtime, irrespective of how many channels they were trained on. We coin the proposed models multi-view networks sincetheyoperateusingmultipleviewsofthe samedata. We explore two such architectures and show how they outperform traditional denoising models in multi-channel scenarios. Additionally, we demonstrate how multi- view networks can leverage information provided by additional recordings to make better predictions, and how they are able to generalize to a number of recordings not seen in training.”
  • 89. GANs forgenerativemodelsoftimeseries Ontheevaluationofgenerativemodels inmusic Li-ChiaYang, Alexander Lerch (October 2018) https://doi.org/10.1007/s00521-018-3849-7 https://github.com/RichardYang40148/mgeval Therefore, we propose a set of simple musically informed objective metrics enabling an objective and reproducible way of evaluating and comparing the output of music generative systems. We demonstrate the usefulness of the proposed metrics with several experiments on real-world data. We have released the evaluation framework as an open-source toolbox which implements the demonstrated evaluation and analysis methods along with visualization tools. Our future work will include the extension of the current toolbox with additional dimensions (e.g., dynamics) and to expand it toward polyphonic music.
• 91. Classification, non-DL algorithms: COTE. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, Eamonn Keogh (May 2017) https://doi.org/10.1007/s10618-016-0483-9 https://bitbucket.org/TonyBagnall/time-series-classification "We have implemented 18 recently proposed algorithms in a common Java framework (Weka) and compared them against two standard benchmark classifiers (and each other) by performing 100 resampling experiments on each of the 85 datasets. We use these results to test several hypotheses relating to whether the algorithms are significantly more accurate than the benchmarks and each other. Our results indicate that only nine of these algorithms are significantly more accurate than both benchmarks and that one classifier, the collective of transformation ensembles, is significantly more accurate than all of the others." Summary of the time and space complexity of the 18 TSC algorithms considered. However, our conclusion is that using COTE (Bagnall et al. 2015; cited by 91) will probably give you the most accurate model. If a simpler approach is needed and the discriminatory features are likely to be embedded in subseries, then we would recommend using TSF or ST if the features are in the time domain (depending on whether they are phase dependent or not), or BOSS if they are in the frequency domain. If a whole series elastic measure seems appropriate, then using EE is likely to lead to better predictions than using just DTW.
• 92. Time series Intro of DNN use #1A. Deep learning for time series classification: a review. Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, Pierre-Alain Muller (Submitted on 12 Sep 2018) https://arxiv.org/abs/1809.04356 | https://github.com/hfawaz/dl-4-tsc In this article, we study the current state-of-the-art performance of deep learning algorithms for Time Series Classification (TSC) by presenting an empirical study of the most recent DNN architectures for TSC. We give an overview of the most successful deep learning applications in various time series domains under a unified taxonomy of DNNs for TSC. We also provide an open source deep learning framework to the TSC community where we implemented each of the compared approaches and evaluated them on a univariate TSC benchmark (the UCR archive) and 12 multivariate time series datasets. By training 8,730 deep learning models on 97 time series datasets, we propose the most exhaustive study of DNNs for TSC to date. COTE is currently considered the state of the art for time series classification (Bagnall et al., 2017) when evaluated over the 85 datasets from the UCR archive (Chen et al., 2015b). Finally, adding to the huge runtime of COTE, the decision taken by 35 classifiers cannot be interpreted easily by domain experts, since researchers already struggle with understanding the decisions taken by an individual classifier.
● What is the current state-of-the-art DNN for TSC?
● Is there a current DNN approach that reaches state-of-the-art performance for TSC and is less complex than COTE?
● What type of DNN architecture works best for the TSC task?
● And finally: could the black-box effect of DNNs be avoided to provide interpretability?
Given that the latter questions have not been addressed by the TSC community, it is surprising how much recent papers have neglected the possibility that TSC problems could be solved using a pure feature learning algorithm.
• 93. Time series Intro of DNN use #1B. The result of applying a learned discriminative convolution on the GunPoint dataset. Deep learning for time series classification: a review https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
• 94. Time series Intro of DNN use #1C. Given the aforementioned limitations for generative models, we decided to limit our experimental evaluation to discriminative deep learning models for TSC. Second, since we cannot cover an empirical study of all approaches validated in all TSC domains, we decided to only include approaches that were validated on the whole (or a subset of) the univariate time series UCR archive and/or on the MTS archive (Baydogan, 2015). Finally, we chose to work with approaches that do not try to solve a sub-task of the TSC problem, such as in Geng and Luo (2018), where CNNs were modified to solve the task of classifying imbalanced time series datasets. Another sub-task that has been at the center of recent studies is early time series classification (Wang et al., 2016a), where deep CNNs were modified to include an early classification of time series. More recently, a deep reinforcement learning approach was also proposed for the early TSC task (Martinez et al., 2018). For further details, we refer the interested reader to a recent survey on deep learning for early time series classification (Santos and Kern, 2017). The third and final proposed architecture in Wang et al. (2017) is a relatively deep Residual Network (ResNet). For TSC, this is the deepest architecture, with 11 layers of which the first 9 are convolutional, followed by a GAP layer that averages the time series across the time dimension. Deep learning for time series classification: a review https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc
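As a concrete reference point, a compact Keras sketch of such a residual 1D architecture ending in a GAP layer is shown below; the number of blocks and filter counts are illustrative, not the exact configuration of Wang et al. (2017):

```python
# Sketch of a ResNet-style TSC network: stacked 1D conv blocks with residual
# shortcuts, global average pooling over time, then a softmax classifier.
import tensorflow as tf
from tensorflow.keras import layers, Model

def residual_block_1d(x, filters):
    shortcut = layers.Conv1D(filters, 1, padding="same")(x)   # match channel count
    for k in (8, 5, 3):                                        # three conv layers per block
        x = layers.Conv1D(filters, k, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    return layers.Activation("relu")(layers.Add()([shortcut, x]))

def resnet_tsc(n_timesteps, n_channels, n_classes):
    inp = layers.Input(shape=(n_timesteps, n_channels))
    x = inp
    for filters in (64, 128, 128):                             # three residual blocks = 9 conv layers
        x = residual_block_1d(x, filters)
    x = layers.GlobalAveragePooling1D()(x)                     # average across the time dimension
    out = layers.Dense(n_classes, activation="softmax")(x)
    return Model(inp, out)

model = resnet_tsc(n_timesteps=128, n_channels=1, n_classes=2)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```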
• 95. Time series Intro of DNN use #1D. Deep learning for time series classification: a review https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc Given the huge number of models [8,730 experiments for the 85 univariate TSC datasets] that needed to be trained, we ran our experiments on a cluster of 60 GPUs. These GPUs were a mix of Nvidia graphics cards: GTX 1080 Ti, Tesla K20, K40 and K80. The total sequential running time was approximately 100 days, that is, if the computation had been done on a single GPU. However, by leveraging the cluster of 60 GPUs, we managed to obtain the results in less than one month. We implemented our framework using the open source deep learning library Keras with the TensorFlow back-end. Figure 1 shows the critical difference diagram (Demšar, 2006, cited by 6414), where a thick horizontal line shows a group of classifiers (a clique) that are not significantly different in terms of accuracy. → An extension of "Statistical Comparisons of Classifiers over Multiple Data Sets" for all pairwise comparisons.
• 96. Time series Intro of DNN use #1E: ResNet the Top Dog. Deep learning for time series classification: a review https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc Figure 1 shows the critical difference diagram (Demšar, 2006), where a thick horizontal line shows a group of classifiers (a clique) that are not significantly different in terms of accuracy; the experimental setup (8,730 models trained on a cluster of 60 GPUs, Keras with the TensorFlow back-end) is the same as on the previous slide.
• 97. Time series Intro of DNN use #1F: ResNets vs. Traditional. Deep learning for time series classification: a review https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc We give two potential reasons for the high generalization capabilities of deep CNNs on TSC tasks. First, having seen the success of convolutions in classification tasks that require learning features that are spatially invariant in a two-dimensional space (such as width and height in images), it is only natural to think that discovering patterns in a one-dimensional space (time) should be an easier task for CNNs, thus requiring less data to learn from. The other, more direct reason behind the high accuracies of deep CNNs on time series data is their success on other sequential data tasks such as speech recognition and sentence classification, where text and audio, similarly to time series data, exhibit a natural temporal ordering. We compared ResNet (the most accurate DNN of our study) with the current state-of-the-art classifiers evaluated on the UCR archive in the great time series classification bake off (Bagnall et al. (2017)). Note that our empirical study strongly suggests using ResNet instead of any other deep learning algorithm. Out of the 18 classifiers evaluated by Bagnall et al. (2017), we have chosen the four best performing algorithms: (1) Elastic Ensemble (EE) proposed by Lines and Bagnall (2015) is an ensemble of nearest neighbor classifiers with 11 different time series similarity measures; (2) Bag-of-SFA-Symbols (BOSS) published in Schäfer (2015) forms a discriminative bag of words by discretizing the time series using a Discrete Fourier Transform and then building a nearest neighbor classifier with a bespoke distance measure; (3) Shapelet Transform (ST) developed by Hills et al. (2014) extracts discriminative subsequences (shapelets) and builds a new representation of the time series that is fed to an ensemble of 8 classifiers; (4) Collective of Transformation-based Ensembles (COTE) proposed by Bagnall et al. (2017) is basically a weighted ensemble of 35 TSC algorithms including EE and ST. Finally, we added a recent approach named Proximity Forest (PF), which is similar to Random Forest but replaces the attribute-based splitting criteria by a random similarity measure chosen out of EE's elastic distances (Lucas et al., 2018). Although COTE is still the most accurate classifier (when evaluated on the UEA archive), its use in a real data mining application is limited due to its huge training time complexity, which is O(N²·T⁴).
• 98. Time series Intro of DNN use #1G. Deep learning for time series classification: a review https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc Again, we can clearly see the dominance of ResNet as the best performing approach across different domains. One exception is the electrocardiography (ECG) datasets (7 in total), where ResNet was drastically beaten by the FCN model in 71.4% of the ECG datasets. THEMES: One might expect that the relatively short filters (3) might affect the performance of ResNet and FCN, since longer patterns cannot be captured by short filters. However, since increasing the number of convolutional layers will increase the path length viewed (receptive field) by the CNN model (Vaswani et al., 2017), ResNet and FCN managed to outperform other approaches whose filter length is longer (21), such as Encoder. SIGNAL LENGTH: Wang et al. (2017) later introduced a one-dimensional CAM with an application to TSC. This method explains the classification of a certain deep learning model by highlighting the subsequences that contributed the most to a certain classification. An interesting observation would be to compare the discriminative regions identified by a deep learning model with the most discriminative shapelets extracted by shapelet-based approaches. This observation would also be backed up by the mathematical proof provided by Cui et al. (2016), which showed how the learned filters in a CNN can be considered a generic form of shapelets extracted by the learning shapelets algorithm (Grabocka et al., 2014).
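A one-dimensional CAM of the kind described can be sketched as follows for a GAP + softmax network such as the one above; the layer name argument is a placeholder, and the feature-map length is assumed to match the input length (no pooling before GAP):

```python
# Sketch of a 1D class activation map (CAM): the class-specific weights of the
# final dense layer re-weight the last convolutional feature maps over time,
# highlighting the subsequences that contributed most to the classification.
import numpy as np
from tensorflow.keras import Model

def cam_1d(model, x, class_idx, last_conv_layer_name):
    """x: array of shape (1, n_timesteps, n_channels). Returns a (n_timesteps,) map."""
    conv_layer = model.get_layer(last_conv_layer_name)
    feature_extractor = Model(model.input, conv_layer.output)
    feats = feature_extractor.predict(x)[0]              # (time, n_filters)
    w = model.layers[-1].get_weights()[0][:, class_idx]  # dense weights for this class
    cam = feats @ w                                       # weighted sum over filters
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam                                            # high values = discriminative regions
```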
• 99. Time series Intro of DNN use #1H: Future. Deep learning for time series classification: a review https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc Although we have conducted an extensive experimental evaluation, deep learning for Time Series Classification, unlike for computer vision and NLP tasks, still lacks a thorough study of data augmentation (Ismail Fawaz et al., 2018a; Forestier et al., 2017) and transfer learning. Furthermore, we think that the effect of z-normalization (and other normalization methods) on the learning capabilities of DNNs should also be thoroughly explored. What makes ImageNet good for transfer learning? Minyoung Huh, Pulkit Agrawal, Alexei A. Efros https://arxiv.org/abs/1608.08614 "Our results might indicate that researchers have been overestimating the amount of data required for learning good general CNN features. If that is the case, it might suggest that CNN training is not as data-hungry as previously thought. It would also suggest that beating ImageNet-trained features with models trained on a much bigger data corpus will be much harder than once thought." AutoAugment: Learning Augmentation Policies from Data. Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, Quoc V. Le (9 Oct 2018) https://arxiv.org/abs/1805.09501 https://github.com/tensorflow/models/tree/master/research/autoaugment "We describe a simple procedure called AutoAugment to search for improved data augmentation policies." Albumentations: fast and flexible image augmentations. Alexander Buslaev, Alex Parinov, Eugene Khvedchenya, Vladimir I. Iglovikov, Alexandr A. Kalinin (18 Sep 2018) https://arxiv.org/abs/1809.06839 https://github.com/albu/albumentations "We present Albumentations, a fast and flexible library for image augmentations with many various image transform operations available, that is also an easy-to-use wrapper (based on the highly-optimized OpenCV library) around other augmentation libraries." Combining raw and normalized data in multivariate time series classification with dynamic time warping. Łuczak, Maciej (2018) http://doi.org/10.3233/JIFS-171393
  • 100. Time series IntroofDNNuse#1H2:TransferLearning Transferlearningfortimeseriesclassification Hassan Ismail Fawaz, GermainForestier, Jonathan Weber, Lhassane Idoumgharand Pierre-Alain Muller https://arxiv.org/abs/1809.04356 (2018) https://github.com/hfawaz/dl-4-tsc Whenobserving theheatmapinFig.4,onecaneasilysee that fine-tuning a pre-trained model almost never hurtstheperformanceoftheCNN. In our future work, we aim again to reduce the deep neural network’s overfitting phenomena by generating synthetic data using a Weighted DTW Barycenter Averaging method [Forestier etal.2017] , since the latter distance gave encouraging results in guiding a complex deep learning tool such as transfer learning. Finally, with big data repositories becoming more frequent, leveraging existing source datasets that are similar to, but not exactly the same as a target dataset of interest, makes a transfer learning method anenticing approach.
• 101. Time series Intro of DNNs #2: Why ResNets work? Wang et al., 2017 https://doi.org/10.1109/IJCNN.2017.7966039 Why and When Can Deep -- but Not Shallow -- Networks Avoid the Curse of Dimensionality: a Review. Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, Qianli Liao - https://arxiv.org/abs/1611.00740 The paper characterizes classes of functions for which deep learning can be exponentially better than shallow learning. http://www.telesens.co/2019/01/16/neural-network-loss-visualization/ http://www.telesens.co/loss-landscape-viz/viewer.html Visualizing the Loss Landscape of Neural Nets http://papers.nips.cc/paper/7875-visualizing-the-loss-landscape-of-neural-nets H. Li (2017)
• 102. Time series Intro of DNN use #2A. CNN Approaches for Time Series Classification. Lamyaa Sadouk https://cdn.intechopen.com/pdfs/64216.pdf (2018) Instead of employing the FFT, which is restricted to a predefined fixed window length, we choose to adopt the Stockwell transform (ST) as our preprocessing method for CNN training. The advantage of the ST over the FFT is its ability to adaptively capture spectral changes over time without windowing of data, resulting in a better time-frequency resolution for non-stationary signals [Stockwell 1996]. While works [17, 24] transformed the time series signals (by applying down-sampling, slicing, or warping) so as to help the convolutional filters (especially the 1st convolutional layer filters) capture entire peaks (i.e., whole peaks) and fluctuations within the signals, the work of [18] proposed to keep time series data unchanged and rather feed them into three branches, each having a different 1st convolutional filter size, in order to capture the whole fluctuations within signals. An alternative is to find an adaptive 1st convolutional layer filter which has the most optimal size and is able to capture most of the entire peaks present in the input signals. The question of how to compute this adaptive 1st convolutional layer filter is addressed in [4]. Therefore, the most optimal size of the 1st convolutional filter is equal to the sample median of signal peak lengths, suggesting that 0.1 is the best time span of the 1st convolutional layer to retrieve the whole acceleration peaks and the best acceleration changes. Similarly, in the frequency domain, the 1st convolutional layer kernel yielding the highest F1-score is the one with size 10, which is simply the sample median (Me(x) = 10). (a) and (b): Histograms and boxplots of the frequency distribution of 30 peak lengths present within 30 randomly selected time- and frequency-domain signals, respectively.
• 103. Time series Intro of DNN use #2B. CNN Approaches for Time Series Classification. Lamyaa Sadouk https://cdn.intechopen.com/pdfs/64216.pdf (2018) Some fields, such as medicine, experience a lack of annotated data, as manually annotating a large set requires human expertise and is time consuming. The conventional approach to deal with this kind of problem is to perform data augmentation by applying transformations to the existing data. Data augmentation achieves slightly better time series classification rates, but the CNN is still prone to overfitting. In this section, we present another solution to this problem, a "knowledge transfer" framework which is a global, fast and light-weight framework that combines the transfer learning technique with an SVM classifier. Transfer learning is a machine learning technique where a model trained on one task (a source domain) is re-purposed on a second related task (a target domain). Accordingly, the questions that arise are: (i) which source learning task should be used for pre-training the CNN model given a target learning task, and (ii) which parts (e.g., learned features) of this model are common between the source and target learning tasks. In that sense, we propose a "Transfer learning with SVM read-out" framework which is composed of two parts: (i) the first part having the first and intermediate layers' weights of a CNN already pre-trained on a source learning task (the last CNN layer being discarded), and (ii) the second part composed of a support vector machine (SVM) classifier with RBF kernel which is connected to the end of the first part. Then, we feed the entire training dataset of the target task into this framework in order to train the SVM parameters. As opposed to training a CNN on the target task, which requires updating all hidden layers' weights for several iterations using a large training set for all these weights to converge, our framework computes weights of the last layer(s) only, in one iteration only. Moreover, the advantage of using an SVM as the classifier is that it is fast and generally performs well on small training sets, since it only relies on the support vectors, which are the training samples that lie exactly on the hyperplanes used to define the margin. In addition, SVMs have the powerful RBF kernel, which allows mapping the data to a very high dimensional space in which the data can be separable by a hyperplane, hence guaranteeing convergence. Hence, our framework can be regarded as a global, fast and light-weight technique for time series classification where the target task has limited annotated/labeled data.
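A hedged sketch of such an SVM read-out on top of a frozen, pre-trained CNN (Keras feature extractor plus scikit-learn SVC with an RBF kernel); the choice of which layer to cut at and the SVM hyperparameters are assumptions:

```python
# Sketch of the "transfer learning with SVM read-out" idea: a CNN pre-trained
# on a source TSC task is truncated before its classification layer and used as
# a fixed feature extractor; an RBF-kernel SVM is trained on the target task.
import numpy as np
from tensorflow.keras import Model
from sklearn.svm import SVC

def svm_readout(pretrained_cnn, x_target_train, y_target_train):
    # Drop the source-task classification layer, keep the learned feature layers.
    feature_extractor = Model(pretrained_cnn.input, pretrained_cnn.layers[-2].output)
    features = feature_extractor.predict(x_target_train)      # (n_samples, n_features)
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")              # RBF-kernel read-out
    clf.fit(features, y_target_train)                          # no backpropagation needed
    return feature_extractor, clf
```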
• 104. Time series Intro of DNN use #3. 3D convolution recurrent neural networks for bird sound detection. Himawan, Ivan, Towsey, Michael, & Roe, Paul (2018) https://eprints.qut.edu.au/122760/ https://github.com/himaivan/BAD2 We propose 3D convolutions for extracting long-term and short-term information in frequency simultaneously. In order to leverage the powerful and compact features of 3D convolution, we employ separate recurrent neural networks (RNN), acting on each filter of the last convolutional layers rather than stacking the feature maps as in the typical combined convolution and recurrent architectures. We split each 10-second audio clip into 5 × 2-second clips. The 2-second length is based on empirical analysis. A spectrogram (from a 2-second clip) computed from sequences of Short-Time Fourier Transform (STFT) of overlapping windowed signals is used as the sound representation. The 3D convolution highlights only frequency bands where the bird calls are located across the temporal dimension. As a comparison, the 2D convolution in CNN+RNN highlights a few specific locations of the bird calls, and includes low-frequency regions with no bird calls. This shows that 3D convolution is more capable of extracting long-term time information in bird calls. In future work, we will investigate the method of generating labeled data via a pseudo-labeling method where approximate labels are produced from unlabeled data. This can be achieved, for example, using generative adversarial networks. Domain adaptation using adversarial learning is another alternative to build a model that is discriminative and invariant to domain at the same time.
  • 105. EarlyTimeSeriesClassification ALiteratureSurveyofEarlyTimeSeriesClassificationandDeepLearning TiagoSantosandRomanKern(2017) http://ceur-ws.org/Vol-1793/paper4.pdf Early time series classification aims to classify a time series with as few temporal observations as possible, while keeping the loss of classification accuracy at a minimum. One of the first works on the topic of early classification, as defined over time series length, waswrittenby[31]. Prominent early classification frameworks reviewed by this paper include, but are not limited to, ECTS, RelClass and ECDIRE. These works have shown that early time series classification may be feasible and performant, but they also show room for improvement. ECDIREhttps://doi.org/10.1007/s10618-016-0462-1 RelClass https://dl.acm.org/citation.cfm?id=2627671
  • 106. EarlyTSC with deepreinforcementlearning Adeepreinforcementlearningapproachforearly classificationoftimeseries Martinez Coralie, Guillaume Perrin, E Ramasso, Michèle Rombaut https://hal.archives-ouvertes.fr/hal-01825472/ We formulate the early classification problem in a reinforcement learning framework: we introduce a suitable set of states and actions but we also define a specific reward function which aims at finding a compromise between earliness andclassificationaccuracy. While most of the existing solutions do not explicitly take time into account in the final decision, this solution allows the user to set this trade-off in a more flexible way. In particular, we show experimentally on datasets from the UCR time series archive that this agent is able to continually adapt its behavior without human intervention and progressively learn to compromise between accurate and fast predictions. Evolution of the early classifier agent behaviour on Gun-Point dataset. The scatter plot shows the relationship between accuracy(in percentage)and averagetime ofprediction oftheagent over training. We evaluate the agent on the whole training set every 5,000 iterations. Each evaluation corresponds to one dot. Dot points are coloured according to iterations of training: blue dots correspond to early training while yellow dots correspond to the agent’s performance after 100,000 iterations of training. We evaluate the agent’s policy surrounded by the red star on the testing set and we report its performance in table I. In this experiment, the agent learned to slow its predictions down and improved itsaccuracyover training. As future work, we plan to improve the proposed approach with a dynamic adjustment of the reward function parameters over training based on the user trade-off criteria. We will also propose a new management of the agent’s replay memory which could be more suitable forthe problem of early classification.
  • 107. EarlyTSC for clinicaluse:ICUMortalityPrediction DynamicPredictionofICUMortalityRisk UsingDomainAdaptation TiagoAlves, AlbertoLaender, Adriano Veloso, NivioZiviani https://homepages.dcc.ufmg.br/~nivio/papers/alves@bigdata18.pdf Early recognition of risky trajectories during an Intensive Care Unit (ICU) stay is one of the key steps towards improving patient survival. Learning trajectories from physiological signals continuously measured during an ICU stay requires learning time-series features that are robust and discriminative acrossdiversepatientpopulations. Mortalityriskspacefor differentICUdomains.Regionsinredarerisky.Eachaxisisat-SNE non-linearcombinationof:(toprow)physiologicalparameters,or(bottomrow)features extractedbyCNN−LSTM.
• 108. Biosignal Deep Learning. Deep learning for healthcare applications based on physiological signals: A review. SG authors: "We have cast the net into the ocean of knowledge t..." Oliver Faust, Yuki Hagiwara, Tan Jen Hong, Oh Shu Lih, U Rajendra Acharya https://doi.org/10.1016/j.cmpb.2018.04.005 Once the architecture is chosen, the tuning parameters must be adjusted. Both the structure selection and parameter adjustment will basically influence the model. Hence, it is necessary to have many test runs. Shortening the training phase of deep learning models is an active area of research [159]. The challenge is speeding up the training process in a parallel distributed processing system [160]. The network between the individual processors becomes the bottleneck [161]. Graphics Processing Units (GPUs) can be used to reduce the network latency [162].
• 109. ECG Classification #1. A convolutional neural network for ECG annotation as the basis for classification of cardiac rhythms. Philipp Sodmann et al 2018 Physiol. Meas. in press https://doi.org/10.1088/1361-6579/aae304 https://github.com/MarcusVollmer/PhysioNet 222,202 R peaks, 192,200 P waves, 256,966 T waves, and 3,311,487 interbeat segments were extracted from the QT database. In total, approximately 12,000,000 characteristic waveforms were used as input volume. The assigned annotation codes of the midpoint peak of each segment were used as output volume. A major advantage of decision trees is that they directly provide information on feature importance.
• 110. ECG Classification #2. Detecting and interpreting myocardial infarctions using fully convolutional neural networks. Nils Strodthoff, Claas Strodthoff (Submitted on 18 Jun 2018) https://arxiv.org/abs/1806.07385 We consider the detection of myocardial infarction in electrocardiography (ECG) data as provided by the PTB ECG database without non-trivial preprocessing. The classification is carried out using deep neural networks in a comparative study involving convolutional as well as recurrent neural network architectures. The best architecture, an ensemble of fully convolutional architectures, beats state-of-the-art results on this dataset and reaches 93.3% sensitivity and 89.7% specificity evaluated with 10-fold cross-validation, which is the performance level of human cardiologists for this task. We investigate questions relevant for clinical applications, such as the dependence of the classification results on the considered data channels and the considered subdiagnoses. Finally, we apply attribution methods to gain an understanding of the network's decision criteria on an exemplary basis. Time series classification in a realistic setting has to be able to cope with time series that are so large that they cannot be used as input to a single neural network, or that cannot be downsampled to reach this state without losing too much information. At this point two different procedures are conceivable: either one uses attentional models that allow focusing on regions of interest, see e.g. Karim et al. 2018, or one extracts random subsequences from the original time series. For reasons of simplicity, and with real-time on-site analysis in mind, we explore only the latter possibility, which is only applicable for signals that exhibit a certain degree of periodicity. The assumption underlying this approach is that the characteristics leading to a certain classification are present in every random subsequence. We stress at this point that this procedure does not rely on the identification of beginning and end points of certain patterns in the window. The procedure leaves two hyperparameters: the choice of the window size and an optional downsampling rate to reduce the temporal input dimension for the neural network. Moreover, we present a first exploratory study of the application of interpretability methods in this domain, which is a key requirement for applications in the medical field. These methods can not only help to gain an understanding of, and thereby build trust in, the network's decision process, but could also lead to a data-driven identification of important markers for certain classification decisions in ECG data that might even prove useful for human experts. Here we identified common cardiologists' decision rules in the network's attribution maps and outlined prospects for future studies in this direction. Both such an analysis of attribution maps and further improvements of the classification performance would have to rely on considerably larger databases, such as for quantitative precision. This would also allow extension to further subdiagnoses and other cardiac conditions, such as other confounding and non-exclusive diagnoses or irregular heart rhythms.
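The random-subsequence procedure described above can be sketched as follows; window length, number of crops and downsampling factor are the illustrative hyperparameters mentioned in the text:

```python
# Sketch of random-subsequence extraction from a long ECG record: instead of
# feeding the full record to the network, fixed-size windows are drawn at random
# positions, assuming the relevant characteristics appear in every window.
import numpy as np

def random_subsequences(record, window_len=2048, n_windows=8, downsample=2):
    """record: 1D array (one ECG lead). Returns (n_windows, window_len // downsample)."""
    starts = np.random.randint(0, len(record) - window_len, size=n_windows)
    crops = np.stack([record[s:s + window_len] for s in starts])
    return crops[:, ::downsample]    # optional decimation to shrink the temporal input dimension
```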
• 111. ECG Classification #3. Automatic detection of sleep-disordered breathing events using recurrent neural networks from an electrocardiogram signal. Erdenebayar Urtnasan, Jong-Uk Park, Kyoung-Joung Lee https://doi.org/10.1007/s00521-018-3833-2 In this study, we propose a novel method for automatically detecting sleep-disordered breathing (SDB) events using a recurrent neural network (RNN) to analyze nocturnal electrocardiogram (ECG) recordings. … Single-lead ECG recordings (200 Hz) were measured for an average 7.2-h duration and segmented into 10-s events (2,000 samples). A bandpass filter (5–11 Hz) was applied for data preprocessing to remove undesired noise from the ECG signal. The dataset comprised a training dataset (68,545 events) from 74 patients and a test dataset (17,157 events) from 18 patients. The proposed deep RNN model for automatic detection of SDB events was implemented on the Keras platform using a TensorFlow background (sic!).
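A sketch of the stated preprocessing (5–11 Hz band-pass on 200 Hz single-lead ECG, then segmentation into 10-s events of 2,000 samples); the filter type and order are assumptions, since the slide does not specify them:

```python
# Band-pass filtering and fixed-length segmentation of a single-lead ECG.
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_ecg(ecg, fs=200, low=5.0, high=11.0, seg_seconds=10, order=4):
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, ecg)                 # zero-phase 5-11 Hz band-pass
    seg_len = fs * seg_seconds                     # 2,000 samples per 10-s event
    n_segments = len(filtered) // seg_len
    return filtered[: n_segments * seg_len].reshape(n_segments, seg_len)
```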
• 112. ECG Classification #4. Arrhythmia detection using deep convolutional neural network with long duration ECG signals https://doi.org/10.1016/j.compbiomed.2018.09.009 Department of Cardiology, National Heart Centre Singapore, Singapore; Duke-NUS Medical School, Singapore. The goal of our research was to design a new method based on deep learning (a 1D-CNN is employed) to efficiently and quickly classify cardiac arrhythmias. An approach based on the analysis of 10-s ECG signal fragments (not a single QRS complex) is applied (on average, 13 times fewer classifications/analyses). A complete end-to-end structure was designed instead of the hand-crafted feature extraction and selection used in traditional methods. It can be used in tele-medicine, especially on mobile devices and in cloud computing, due to its low computational complexity.
• 113. ECG Classification #5. Deep learning in the cross-time-frequency domain for sleep staging from a single lead electrocardiogram https://doi.org/10.1088/1361-6579/aaf339 This study classifies sleep stages from a single lead electrocardiogram (ECG) using beat detection, cardiorespiratory coupling in the time-frequency domain and a deep convolutional neural network (CNN). An ECG-derived respiration (EDR) signal and synchronous beat-to-beat heart rate variability (HRV) time series were derived from the ECG using previously described robust algorithms. A measure of cardiorespiratory coupling (CRC) was extracted by calculating the coherence and cross-spectrogram of the EDR and HRV signals in five-minute windows. A support vector machine (SVM) was then used to combine the output of the CNN with the other features derived from the ECG, including phase-rectified signal averaging (PRSA), sample entropy, as well as standard spectral and temporal HRV measures. The ECG signals were preprocessed by a finite impulse response (FIR) lowpass filter with a band stop at 22 Hz and a FIR highpass filter with a corner frequency of 1.2 Hz. A state-of-the-art QRS detector (jqrs) was used for ECG R-peak detection (Johnson et al. (2015)).
• 114. ECG Classification #6. Kalman-based Spectro-Temporal ECG Analysis using Deep Convolutional Networks for Atrial Fibrillation Detection. Zheng Zhao, Simo Särkkä, and Ali Bahrami Rad https://arxiv.org/abs/1812.05555 For ECG signals, one can directly adopt 1D convolutional or recurrent network models for the classification task. However, transforming signals into the spectral domain (spectro-temporal features) is a promising alternative approach, knowing that the current state-of-the-art deep convolutional neural network (CNN) structures are typically designed for 2D images. The contributions of this paper are: 1) We propose two extended models for spectro-temporal estimation using the Kalman filter and smoother. We then combine them with deep convolutional networks for AF detection. 2) We test and compare the performance of the proposed approaches for spectro-temporal estimation on simulated data and AF detection with other popular estimation methods and different classifiers. 3) For AF detection, we evaluate the proposals using the PhysioNet/CinC 2017 dataset, which is considered to be a challenging dataset that resembles practical applications, and our results are in line with the state of the art. The key advantages of this kind of approach over other spectro-temporal methods are that we can apply it to both evenly and unevenly sampled signals [25] and it requires no stationarity guarantees nor windowing. In practice, the computational cost of the Kalman filter and smoother can be extensive when the length of the signal is very long. However, instead of the Fourier series state space model in the previous section, one can also derive an alternative representation using stochastic oscillator differential equations. In this way, the dynamic and measurement models become linear time-invariant (LTI), so that we can leverage a stationary Kalman filter to reduce the time consumption. This kind of stochastic oscillator model was also considered in [33], and the link to periodic Gaussian process models was investigated in [35].
• 115. EEG Classification #1a. Deep learning with convolutional neural networks for EEG decoding and visualization https://doi.org/10.1002/hbm.23730 https://github.com/robintibor/braindecode/ There is increasing interest in using deep ConvNets for end-to-end EEG analysis, but a better understanding of how to design and train ConvNets for end-to-end EEG decoding, and how to visualize the informative EEG features the ConvNets learn, is still needed. Here, we studied deep ConvNets with a range of different architectures, designed for decoding imagined or executed tasks from raw EEG. Our study thus shows how to design and train ConvNets to decode task-related information from the raw EEG without handcrafted features, and highlights the potential of deep ConvNets combined with advanced visualization techniques for EEG-based brain mapping.
• 116. EEG Classification #1b. Deep learning with convolutional neural networks for EEG decoding and visualization https://doi.org/10.1002/hbm.23730 | https://github.com/robintibor/braindecode/ Correlation between the mean squared envelope feature and unit output for a single subject at one electrode position (FCC4h). Left: all correlations. Colors indicate the correlation between unit outputs per convolutional filter (x-axis) and mean squared envelope in different frequency bands (y-axis). Filters are sorted by their correlation to the 7–13 Hz envelope (outlined by the black rectangle). Note the large correlations/anticorrelations in the alpha/beta bands (7–31 Hz) and somewhat weaker correlations/anticorrelations in the gamma band (around 75 Hz). Right: mean absolute values across units of all convolutional filters for all correlation coefficients of the trained model, the untrained model, and the difference between the trained and untrained model. Peaks in the alpha, beta, and gamma bands are clearly visible. CSP = common spatial patterns.
• 117. EEG+ECG Classification. Use of features from RR-time series and EEG signals for automated classification of sleep stages in a deep neural network framework https://doi.org/10.1016/j.bbe.2018.05.005 The method uses an iterative filtering (IF) based multiresolution analysis approach for the decomposition of the RR-time series into intrinsic mode functions (IMFs). The recurrence quantification analysis (RQA) and dispersion entropy (DE) based features are evaluated from the IMFs of the RR-time series. The dispersion entropy and the variance features are evaluated from the different bands of the EEG signal. The RR-time series features and the EEG features coupled with the deep neural network (DNN) are used for sleep stage classification. Stacked autoencoders with binary classifiers? A slightly confusing architecture: engineered features combined with deep learning?
  • 118. EMGClassification EMGPatternRecognitionintheEraofBigDataandDeep Learning BigDataCogn.Comput.2018,2(3),21; https://doi.org/10.3390/bdcc2030021 We provide a review of recent research and development in EMG pattern recognition methods that can be applied to big data analytics. These modern EMG signal analysis methods can be divided into two main categories: (1) methods based on feature engineering involving a promising big data exploration tool called topological data analysis; and (2) methods based on feature learning with a special emphasison “deeplearning”. Compared to other well-known bioelectrical signals (e.g., electrocardiogram, ECG; electrooculogram, EOG; and galvanic skin response, GSR), however, the analysis of surface EMG signal is morechallenginggiventhatitisstochasticinnature. Due to the increasing availability of multi-modality sensing systems, multi-modal analysis approaches are becoming a viable option. Multiple modalities can be used to capture complementary information which is not visible using a single modality, or to provide contextfor others. Even when two or more modalities capture similar information, their combination can still improve the robustness of pattern recognitionsystemswhenoneofthemodalitiesismissingor noisy. Outside of prosthesis control, other applications of EMG pattern recognition for which multi- modality data sets exist include, for example, sleep studies, such as the Cyclic Alternating Pattern (CAP) Sleep Database [49] and the Sleep Heart Health Study (SHHS) Polysomnography Database [50]; biomechanics, such as the cutting movement dataset [51] and the horse gait dataset [52]; and brain computer interfaces, such as the Affective Pacman dataset [53] and the emergency braking assistance dataset [54]. Recently, emotion recognition using multiple physiological modalities has gained attention as another important application that has benefited fromtheincorporationofsurfaceEMG. http://doi.org/10.3390/s17071622
• 119. Time series 2D Recurrence Plots → Shapelets. This paper investigates the performance of Recurrence Plots (RP) [Eckmann et al. 1987] within the deep CNN model for TSC. RP provides a way to visualize the periodic nature of a trajectory through a phase space and enables us to investigate certain aspects of the m-dimensional phase space trajectory through a 2D representation. Because of the recent outstanding results by CNNs on image recognition, we first encode time-series signals as 2D plots, and then treat the TSC problem as a texture recognition task. A CNN model with 2 hidden layers followed by a fully connected layer is used. In particular, comparison with models using RP within the traditional classification framework (e.g. SIFT, Gabor and LBP features with an SVM classifier [25, 26]) and with other CNN-based time-series image classification (e.g. GAF-MTF images with CNN [23, 24]) demonstrates that using RP images with a CNN in our proposed model obtains better results. As future work, CNN architectures with more feature representation layers should be investigated for more difficult TSC tasks (preferably with more data samples available). Large datasets are needed in order to train deeper architectures. Therefore, adopting the proposed pipeline for TSC with small sample sizes can be another interesting future direction. Exploring different ensemble learning methods for CNNs can also be interesting. We will particularly be investigating application of output coding for CNNs.
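A minimal sketch of turning a univariate series into a recurrence-plot image that a 2D CNN can consume; embedding dimension, delay and the recurrence threshold are illustrative choices:

```python
# Sketch of a binary recurrence plot after time-delay embedding of a 1D series.
import numpy as np

def recurrence_plot(x, dim=3, tau=2, eps=0.1):
    """Binary recurrence matrix: 1 where two embedded states are closer than eps * max distance."""
    n = len(x) - (dim - 1) * tau
    # Phase-space trajectory: each row is a delay-embedded state vector.
    states = np.stack([x[i:i + (dim - 1) * tau + 1:tau] for i in range(n)])
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    return (dists <= eps * dists.max()).astype(np.uint8)

rp = recurrence_plot(np.sin(np.linspace(0, 20 * np.pi, 500)))
print(rp.shape)   # (n_states, n_states) image-like matrix, ready for a 2D CNN
```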
• 120. Wavelets for deep learning TSC #1. Multilevel Wavelet Decomposition Network for Interpretable Time Series Analysis https://doi.org/10.1145/3219819.3220060 To this end, we first designed a novel wavelet-based network structure called mWDN for frequency learning of time series, which can then be seamlessly embedded into deep learning frameworks by making all parameters trainable. We further designed two deep learning models based on mWDN for time series classification and forecasting, respectively, and the extensive experiments on abundant real-world datasets demonstrated their superiority to state-of-the-art competitors. As a nice try for interpretable deep learning, we further propose an importance analysis method for identifying important factors for time series analysis, which in turn verifies the interpretability merit of mWDN. Frequency Analysis of Time Series: Frequency analysis of time series data has been deeply studied by the signal processing community. Many classical methods, such as the Discrete Wavelet Transform, the Discrete Fourier Transform, and the Z-Transform, have been proposed to analyze the frequency patterns of time series signals. In existing TSC/TSF applications, however, transforms are usually used as an independent step in data preprocessing, which has no interaction with model training and therefore might not be optimized for TSC/TSF tasks from a global view. In recent years, some research works, such as Clockwork RNN [Koutnik et al. 2014] and SFM [Hao Hu and Guo-Jun Qi 2017], began to introduce the frequency analysis methodology into the deep learning framework. To the best of our knowledge, our study is among the very few works that embed wavelet time series transforms as a part of neural networks so as to achieve end-to-end learning.
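For contrast with mWDN's trainable decomposition, the conventional fixed preprocessing it replaces can be sketched with PyWavelets; the wavelet family and decomposition level are illustrative:

```python
# Fixed multilevel discrete wavelet decomposition as conventional preprocessing;
# mWDN makes the analogous filter weights trainable end to end inside the network.
import numpy as np
import pywt

x = np.random.randn(1024)                         # a univariate time series
coeffs = pywt.wavedec(x, wavelet="db4", level=3)  # [cA3, cD3, cD2, cD1]
approx, details = coeffs[0], coeffs[1:]
print([c.shape for c in coeffs])                  # coarse-to-fine frequency sub-series
```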
• 121. Wavelets for deep learning TSC #2. Learning filter widths of spectral decompositions with wavelets. Haidar Khan and Bülent Yener, Rensselaer Polytechnic Institute http://papers.nips.cc/paper/7711-learning-filter-widths-of-spectral-decompositions-with-wavelets.pdf https://github.com/haidark/WaveletDeconv We propose the wavelet deconvolution (WD) layer as an efficient alternative to this preprocessing step that eliminates a significant number of hyperparameters. The WD layer uses wavelet functions with adjustable scale parameters to learn the spectral decomposition directly from the signal. Furthermore, the WD layer adds interpretability to the learned time series classifier by exploiting the properties of the wavelet transform. As future work, we plan to investigate how to extend the WD layer to signals in higher dimensions, such as images and video, as well as generalizing the wavelet transform to empirical mode decompositions (EMDs).
  • 122. Wavelets fordeep learningTSC #3 MultilevelWaveletDecompositionNetworkforinterpretableTimeSeries Analysis https://doi.org/10.1145/3219819.3220060 In this paper we propose a wavelet-based neural network structure called multilevel Wavelet Decomposition Network (mWDN) for building frequency-aware deep learning models for time series analysis. mWDN preserves the advantage of multilevel discrete wavelet decomposition in frequency learning while enables the fine-tuning of all parameters under a deep neural network framework. Based on mWDN, we further propose two deep learning models called Residual Classification Flow (RCF) and multi-frequency Long Short-Term Memory (mLSTM) for time series classification and forecasting, respectively. The two models take all or partial mWDN decomposed sub-series in different frequencies as input, and resort to the back propagation algorithm to learn all the parameters globally, which enables seamless embeddingof wavelet-basedfrequencyanalysisintodeeplearningframeworks
• 123. Multivariate time-series classification #1: CNN only. Temporal Convolutional Neural Network for the Classification of Satellite Image Time Series. Charlotte Pelletier, Geoffrey I. Webb and François Petitjean (Submitted on 31 Jan 2019) https://arxiv.org/abs/1811.10166 https://github.com/charlotte-pel/temporalCNN (Keras) Note! Despite the name, the authors used traditional convolutional filters for time series, and not TCNs.
• 124. Multivariate time-series classification #2: CNN+LSTM. Multivariate LSTM-FCNs for Time Series Classification. Fazle Karim, Somshubra Majumdar, Houshang Darabi, Samuel Harford (submitted 14 Jan 2018). https://arxiv.org/abs/1801.04503
We propose augmenting the existing univariate time series classification models, LSTM-FCN and ALSTM-FCN, with a squeeze-and-excitation block to further improve performance. The proposed models work efficiently on various complex multivariate time series classification tasks such as activity recognition or action recognition. Furthermore, the proposed models are highly efficient at test time and small enough to deploy on memory-constrained systems. For datasets with class imbalance, a class weighting scheme inspired by King et al. (2001) is used.
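A squeeze-and-excitation block for 1D feature maps is small enough to show in full; the following Keras sketch is a generic version of the usual SE formulation, not the authors' exact code, and the reduction ratio is an assumed default.

```python
from tensorflow import keras
from tensorflow.keras import layers

def squeeze_excite_1d(x, ratio=16):
    """Channel recalibration: squeeze (global pool) -> bottleneck MLP -> sigmoid gate."""
    n_channels = x.shape[-1]
    s = layers.GlobalAveragePooling1D()(x)                       # squeeze over time
    s = layers.Dense(max(n_channels // ratio, 1), activation="relu")(s)
    s = layers.Dense(n_channels, activation="sigmoid")(s)        # excitation weights
    s = layers.Reshape((1, n_channels))(s)
    return layers.Multiply()([x, s])                             # rescale each channel

inp = keras.Input(shape=(128, 64))       # e.g. an FCN feature map: 128 steps, 64 channels
out = squeeze_excite_1d(inp)
print(keras.Model(inp, out).output_shape)   # (None, 128, 64)
```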
• 125. Multivariate time-series classification #3: CNN+GRU. Deep Gated Recurrent and Convolutional Network Hybrid Model for Univariate Time Series Classification. Nelly Elsayed, Anthony S. Maida and Magdy Bayoumi (submitted 27 Dec 2018). https://arxiv.org/abs/1812.07683 https://github.com/NellyElsayed/GRU-FCN-model-for-univariate-time-series-classification
The proposed GRU-FCN classification model shows that replacing the LSTM with a GRU enhances classification accuracy without needing extra enhancements such as fine-tuning or attention mechanisms. The GRU also has a smaller architecture that requires fewer computations than the LSTM. Moreover, the GRU-based model requires a smaller number of trainable parameters, less memory, and less training time compared to the LSTM-based models.
• 126. Application for multivariate time series: wearable sensors. WearableDL: Wearable Internet-of-Things and Deep Learning for Big Data Analytics — Concept, Literature, and Future. Aras R. Dargazany, Paolo Stegagno, and Kunal Mankodiya (submitted 14 November 2018). https://doi.org/10.1155/2018/8125126
This work introduces Wearable deep learning (WearableDL), a unifying conceptual architecture inspired by the human nervous system, offering the convergence of deep learning (DL), Internet-of-Things (IoT), and wearable technologies (WT).
• 127. Application for multivariate time series: action recognition. Sensor Data Acquisition and Multimodal Sensor Fusion for Human Activity Recognition Using Deep Learning. Published 10 April 2019 (Special Issue: Deep Learning Based Sensing Technologies for Autonomous Vehicles). Sensors 2019, 19(7), 1716; https://doi.org/10.3390/s19071716
We develop a Long Short-Term Memory (LSTM) network framework to support training of a deep learning model on human activity data, acquired in both real-world and controlled environments. From the experimental results, we identify that activity data with a sampling rate as low as 10 Hz from four sensors at both wrists, the right ankle, and the waist is sufficient for recognizing Activities of Daily Living (ADLs), including eating and driving. We adopt a two-level ensemble model to combine the class probabilities of multiple sensor modalities, and demonstrate that a classifier-level sensor fusion technique can improve the classification performance. By analyzing the accuracy of each sensor on different types of activity, we elaborate custom weights for multimodal sensor fusion that reflect the characteristics of individual activities.
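Classifier-level fusion of per-sensor class probabilities with activity-specific weights can be expressed in a few lines of NumPy; the sketch below is a simplified illustration of the idea, with made-up sensor names and weights rather than the values derived in the paper.

```python
import numpy as np

def fuse_sensor_probabilities(probs_per_sensor, sensor_weights):
    """Classifier-level fusion: per-class weighted average of each sensor's
    class-probability vectors. probs_per_sensor: dict sensor -> (n_samples, n_classes);
    sensor_weights: dict sensor -> (n_classes,) activity-specific weights."""
    fused, total = None, None
    for sensor, probs in probs_per_sensor.items():
        w = np.asarray(sensor_weights[sensor], dtype=float)
        contribution = probs * w
        fused = contribution if fused is None else fused + contribution
        total = w if total is None else total + w
    fused = fused / total                               # per-class weighted mean
    return fused / fused.sum(axis=1, keepdims=True)     # renormalise to probabilities

# hypothetical wrist/ankle classifiers over 4 activity classes
probs = {"wrist": np.random.dirichlet(np.ones(4), size=3),
         "ankle": np.random.dirichlet(np.ones(4), size=3)}
weights = {"wrist": [0.7, 0.5, 0.9, 0.4], "ankle": [0.3, 0.5, 0.1, 0.6]}
print(fuse_sensor_probabilities(probs, weights).round(2))
```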
• 128. Ensembling models for uni/multivariate time series as well. Deep Neural Network Ensembles for Time Series Classification. Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar and Pierre-Alain Muller. IRIMAS, Université Haute-Alsace, Mulhouse, France. https://arxiv.org/abs/1903.06602
In the future, we would like to consider a meta-learning approach where the outputs of individual deep learning models are fed to a meta-network that learns to map these inputs to the correct prediction (e.g. Ju et al. 2019; 2018).
• 130. Segmenting time series. BEATS: Blocks of Eigenvalues Algorithm for Time Series Segmentation. https://doi.org/10.1109/TKDE.2018.2817229 (2018) https://github.com/auroragonzalez/BEATS (implemented in R)
The massive collection of data via emerging technologies like the Internet of Things (IoT) requires finding optimal ways to reduce the number of observations in the time series analysis domain. In this paper, we propose a segmentation algorithm that adapts to unannounced mutations of the data (i.e., data drifts). The algorithm splits the data streams into blocks, groups them into square matrices, computes the Discrete Cosine Transform (DCT), and quantizes them. The algorithm, called BEATS, is designed to tackle dynamic IoT streams whose distribution changes over time. We run experiments with six datasets combining real-world data, synthetic data, and data with drifts. Compared to other segmentation methods like Symbolic Aggregate approXimation (SAX), BEATS shows significant improvements, and it provides efficient results when combined with classification and clustering algorithms. BEATS is an effective mechanism for working with dynamic and multivariate data, making it suitable for IoT data sources.
By using BEATS, we are able to restructure the streaming data in a 2D way and then transform it into the frequency domain using the DCT. The algorithm finds a smaller sequence that contains the key information of the initial representation. This aggregation provides an opportunity to eliminate repetitive content and similarities found in the data sequence. The eigenvalue vectors are a homogeneous representation of the data streams in BEATS that allow us to go one step further in understanding the sequences and patterns that can be considered the data structure of a data series in an application domain (e.g. smart cities). Its applications can be extended to several other domains and to various pattern/activity monitoring and detection methods. Future work will focus on applying the 3D cosine transform and adaptive block-size estimation.
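The core BEATS pipeline (block the stream into square matrices, take a 2-D DCT, quantise, keep eigenvalues) can be approximated in a few lines of NumPy/SciPy; the sketch below follows that outline loosely and omits the adaptive details of the published algorithm, with the block size and quantisation step chosen arbitrarily.

```python
import numpy as np
from scipy.fftpack import dct

def beats_like_features(series, block=8, quant_step=10.0):
    """Rough sketch of a BEATS-style transform: reshape the stream into block x block
    matrices, take the 2-D DCT, quantise the coefficients, and keep the (sorted,
    absolute) eigenvalues of each quantised block as a compact segment descriptor."""
    series = np.asarray(series, dtype=float)
    n_blocks = len(series) // (block * block)
    mats = series[: n_blocks * block * block].reshape(n_blocks, block, block)
    feats = []
    for m in mats:
        coeff = dct(dct(m, axis=0, norm="ortho"), axis=1, norm="ortho")  # 2-D DCT
        quantised = np.round(coeff / quant_step) * quant_step
        feats.append(np.sort(np.abs(np.linalg.eigvals(quantised)))[::-1])
    return np.array(feats)        # (n_blocks, block) eigenvalue vectors

print(beats_like_features(np.sin(np.linspace(0, 50, 640))).shape)   # (10, 8)
```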
• 132. Financial forecasting with deep learning #1. Conditional Time Series Forecasting with Convolutional Neural Networks. Anastasia Borovykh, Sander Bohte, Cornelis W. Oosterlee. https://arxiv.org/abs/1703.04691 (2017)
We present a method for conditional time series forecasting based on an adaptation of the recent deep convolutional WaveNet architecture. The proposed network contains stacks of dilated convolutions that allow it to access a broad range of history when forecasting, together with ReLU activations; conditioning is performed by applying multiple convolutional filters in parallel to separate time series, which allows for fast processing of the data and exploitation of the correlation structure between the multivariate time series. We show that a convolutional network is well suited for regression-type problems and is able to effectively learn dependencies in and between the series without the need for long historical time series; it is a time-efficient and easy-to-implement alternative to recurrent-type networks and tends to outperform linear and recurrent models.
Effectively, we use multiple financial time series as input to a neural network, thus conditioning the forecast of a time series on both its own history and that of multiple other time series. Training a model on multiple stock series allows the network to exploit the correlation structure between these series so that it can learn the market dynamics from shorter sequences of data. While for relatively short time series the prediction time is negligible compared to the training time, for longer time series the prediction of the autoregressive model may be sped up by implementing a recent variation that exploits the memorization structure of the network, or by speeding up the convolutions by working in the frequency domain employing Fourier transforms. Finally, it is well known that correlations between data points are stronger on an intraday basis. Therefore, it might be interesting to test the model on intraday data to see if the ability of the model to learn long-term dependencies is even more valuable in that case.
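The dilated-convolution backbone that gives such a network its long receptive field looks roughly like the Keras sketch below; filter counts, depth and the one-step-ahead output head are assumptions for illustration, not the authors' configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def dilated_forecaster(n_timesteps, n_series, n_layers=6):
    """WaveNet-flavoured sketch: a stack of causal dilated Conv1D layers whose
    receptive field doubles with every layer, ending in a 1x1 convolution that
    emits a one-step-ahead forecast for the target series, conditioned on all
    input series."""
    inp = keras.Input(shape=(n_timesteps, n_series))
    x = inp
    for i in range(n_layers):
        x = layers.Conv1D(32, kernel_size=2, dilation_rate=2 ** i,
                          padding="causal", activation="relu")(x)
    out = layers.Conv1D(1, kernel_size=1)(x)      # forecast at every time step
    return keras.Model(inp, out)

model = dilated_forecaster(n_timesteps=128, n_series=5)
print(model.output_shape)   # (None, 128, 1)
```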
• 133. Financial forecasting with deep learning #2. Autoregressive Convolutional Neural Networks for Asynchronous Time Series. Mikołaj Bińkowski, Gautier Marti, Philippe Donnat (submitted 12 Mar 2017 (v1), last revised 12 Jun 2018 (v4)). https://arxiv.org/abs/1703.04122 → Cited by 8. https://github.com/mbinkowski/nntimeseries
We propose the Significance-Offset Convolutional Neural Network, a deep convolutional network architecture for regression of multivariate asynchronous time series.
Conclusion and discussion: In this article, we proposed a weighting mechanism that, coupled with convolutional networks, forms a new neural network architecture for time series prediction. The proposed architecture is designed for regression tasks on asynchronous signals in the presence of a high amount of noise. This approach has proved successful in forecasting several asynchronous time series, outperforming popular convolutional and recurrent networks. The proposed model can be extended further by adding intermediate weighting layers of the same type in the network structure. Another possible generalization, which requires further empirical study, can be obtained by dropping the assumption of independent offset values for each past observation, i.e. considering not only 1x1 convolutional kernels in the offset sub-network. Finally, we aim to test the performance of the proposed architecture on other real-life datasets with relevant characteristics. We observe that there is a strong need for a common benchmark of 'econometric' datasets and, more generally, for time series (stochastic process) regression.
• 134. Financial forecasting with deep learning #3. Multi-task Learning for Financial Forecasting. Tao Ma, Guolin Ke (27 Sep 2018). https://arxiv.org/abs/1809.10336
Due to the strong connections among stocks, the information valuable for forecasting is not only included in individual stocks but also in the stocks related to them. However, most previous works focus on one single stock, which easily ignores the valuable information in others. To leverage more information, in this paper we propose a joint forecasting approach that processes multiple time series of related stocks simultaneously, using a multi-task learning framework (Ruder 2017).
Durichen et al. (2015) used multi-task Gaussian processes to process physiological time series. Jung (2015) proposed a multi-task learning approach to learn the conditional independence structure of stationary time series. Liu et al. (2016) used multi-task multi-view learning to predict urban water quality. Harutyunyan et al. (2017) used recurrent LSTM neural networks and multi-task learning to deal with clinical time series. Li et al. (2018) applied multi-task representation learning to travel time estimation. Moreover, some methods have been proposed to learn a shared representation of all the task-private information; e.g., Misra et al. (2016) proposed cross-stitch networks to combine multiple task-private latent features.
In future work, we would like to further improve SPA's ability to combine latent features, and for DMTL we would like to build hierarchical models to extract the shared information from all tasks more efficiently.
The contributions of this paper are multifold:
● To the best of our knowledge, the proposed multi-series joint forecasting approach is the first work applying multi-task learning to time series forecasting for multiple related stocks.
● We propose a novel attention method to learn the optimized combination of shared and task-private latent features based on the idea of CAPM.
● We demonstrate in experiments on financial data that the proposed approach outperforms single-task baselines and other MTL-based methods, which further improves the forecasting performance.
• 135. Financial forecasting with deep learning #4. Multi-task Learning for Financial Forecasting. Tao Ma, Guolin Ke (27 Sep 2018). https://arxiv.org/abs/1809.10336
In this paper, we empirically study the applicability of the latest deep structures to the volatility modelling problem, through which we aim to provide empirical guidance for the theoretical analysis of the marriage between deep learning techniques and financial applications in the future. We examine both traditional approaches and deep sequential models on the task of volatility prediction, including the most recent variants of convolutional and recurrent networks, such as dilated architectures. Experiments with real-world stock price datasets are performed on a set of 1314 daily stock series over 2018 days of transactions. The evaluation and comparison are based on the negative log-likelihood (NLL) of real-world stock price time series. The results show that the dilated neural models, including dilated CNNs and dilated RNNs, produce the most accurate estimates and predictions, outperforming various widely used deterministic models in the GARCH family as well as several recently proposed stochastic models. In addition, their high flexibility and rich expressive power are validated in this study.
• 136. Trading with deep learning #1a. DeepLOB: Deep Convolutional Neural Networks for Limit Order Books. Zihao Zhang, Stefan Zohren, Stephen Roberts (2018). https://arxiv.org/abs/1808.03668
We develop a large-scale deep learning model to predict price movements from limit order book (LOB) data of cash equities. The architecture utilises convolutional filters to capture the spatial structure of the limit order books as well as LSTM modules to capture longer time dependencies. Importantly, our model translates well to instruments which were not part of the training set, indicating the model's ability to extract universal features. In order to better understand these features and to go beyond a "black box" model, we perform a sensitivity analysis to understand the rationale behind the model predictions and reveal the components of LOBs that are most relevant. The ability to extract robust features which translate well to other instruments is an important property of our model which has many other applications.
We use standardisation (z-score) to normalise our data, using the mean and standard deviation of the previous day's data to normalise the current day's data (with separate normalisation for each instrument). Because financial data is highly stochastic, if we simply compare pt and pt+k to decide the price movement, the resulting label set will be noisy. We adopt the idea of Tsantekidis et al. (2017) and introduce a smoothed labelling method.
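Both preprocessing choices, previous-day z-scoring and smoothed directional labels, are easy to prototype; the NumPy sketch below illustrates them under simple assumptions (the horizon k, threshold alpha and exact window alignment are illustrative, not the paper's settings).

```python
import numpy as np

def zscore_by_previous_day(day_features, prev_day_features):
    """Normalise today's LOB features with yesterday's statistics, per instrument."""
    mu = prev_day_features.mean(axis=0)
    sigma = prev_day_features.std(axis=0) + 1e-8
    return (day_features - mu) / sigma

def smoothed_labels(mid_price, k=10, alpha=0.002):
    """Smoothed labelling in the spirit of Tsantekidis et al.: compare the mean of
    the next k mid-prices with the mean of the previous k; +1 up, -1 down, 0 flat."""
    means = np.convolve(mid_price, np.ones(k) / k, mode="valid")  # means[i] = mean(p[i:i+k])
    m_minus, m_plus = means[:-k], means[k:]      # past window vs future window
    change = (m_plus - m_minus) / m_minus
    return np.where(change > alpha, 1, np.where(change < -alpha, -1, 0))

prices = np.cumsum(np.random.randn(1000)) + 100.0
print(np.bincount(smoothed_labels(prices) + 1))   # counts of down / stationary / up
```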
• 137. Trading with deep learning #1b. DeepLOB: Deep Convolutional Neural Networks for Limit Order Books. Zihao Zhang, Stefan Zohren, Stephen Roberts (2018). https://arxiv.org/abs/1808.03668
To observe what the convolutional layers do, we feed a single input to the trained model and plot the intermediate outputs (right of Figure 5). Since 16 filters are applied, we get 16 series after the "Conv" block. The convolution operations transform the original time series into signals that indicate the time regions with the greatest impact on the final outputs. In our case, we observe strong signals around the t = 1, 20, 40, 70 time stamps, suggesting that information at these time stamps decides the final outputs.
In our case, we use LIME [Ribeiro et al. 2016; cited by 751] to reveal the components of LOBs that are most important for predictions and to understand why the proposed DeepLOB model works better than the reference model [Tsantekidis et al. 2017]. LIME uses an interpretable model to approximate the prediction of a complex model on a given input. It locally perturbs the input and observes variations in the model's predictions, thus providing some measure of input importance and sensitivity.
• 138. Trading with deep learning #2. Developing Arbitrage Strategy in High-Frequency Pairs Trading with Filterbank CNN Algorithm. Yu-Ying Chen, Wei-Lun Chen, Szu-Hao Huang (2018). https://doi.org/10.1109/AGENTS.2018.8459920
This paper proposed a novel intelligent high-frequency pairs trading system for the Taiwan Stock Index Futures (TX) and Mini Index Futures (MTX) markets based on deep learning techniques. This research utilized an improved time series visualization method to transfer historical volatilities at different time frames into 2D images, which are helpful in capturing arbitrage signals. Moreover, this research improved the convolutional neural network (CNN) model by combining financial domain knowledge with a filterbank mechanism. We proposed the Filterbank CNN to extract high-quality features by replacing the randomly generated filters with arbitrage-knowledge filters.
Algorithmic financial trading with deep convolutional neural networks: time series to image conversion approach. Omer Berat Sezer and Ahmet Murat Ozbayoglu (2018). https://doi.org/10.1016/j.asoc.2018.04.024
For future work, we will use more Exchange-Traded Funds (ETFs) and stocks in order to create more data for the deep learning models. We will also analyze the correlations between the selected indicators in order to create more meaningful images so that the learning models can better associate the Buy-Sell-Hold signals and come up with more profitable trading models.
• 139. Trading with deep learning: GANs. Generative Adversarial Networks for Financial Trading Strategies Fine-Tuning and Combination. Adriano Koshiyama, Nick Firoozye, and Philip Treleaven (Jan 2019). https://arxiv.org/abs/1901.01751
Systematic trading strategies are algorithmic procedures that allocate assets aiming to optimize a certain performance criterion. To obtain an edge in a highly competitive environment, the analyst needs to properly fine-tune the strategy, or discover how to combine weak signals in novel alpha-creating manners. Both aspects, namely fine-tuning and combination, have been extensively researched using several methods, but emerging techniques such as Generative Adversarial Networks can have an impact on both. Therefore, our work proposes the use of Conditional Generative Adversarial Networks (cGANs) for trading strategy calibration and aggregation.
Stock Market Prediction on High-Frequency Data Using Generative Adversarial Nets. Xingyu Zhou et al. (2018). https://doi.org/10.1155/2018/4907423
In this paper, we propose a generic framework employing a Long Short-Term Memory (LSTM) network and a convolutional neural network (CNN) for adversarial training to forecast the high-frequency stock market. This model takes the publicly available index provided by trading software as input, avoiding complex financial theory research and difficult technical analysis, which provides convenience for the ordinary trader without a financial specialty. Based on the deep learning network, this model achieves prediction ability superior to other benchmark methods by means of adversarial training, minimizing the direction prediction loss and the forecast error loss. Moreover, the effects of the model update cycle on predictive capability are analyzed, and the experimental results show that a smaller model update cycle can obtain better prediction performance. In the future, we will attempt to integrate predictive models under multiscale conditions.
• 140. Glucose prediction: CNN-RNN hybrid. Kezhi Li, John Daniels, Chengyuan Liu, Pau Herrero, Pantelis Georgiou. Department of Electronic and Electrical Engineering, Imperial College London. https://arxiv.org/abs/1807.03043
Current digital therapeutic approaches for subjects with Type 1 diabetes mellitus (T1DM), such as the artificial pancreas and insulin bolus calculators, leverage machine learning techniques for predicting subcutaneous glucose for improved control. In this work, we present a deep learning model that is capable of predicting glucose levels over a 30-minute horizon. The prediction algorithm is implemented on an Android mobile phone (LG Nexus 5; 2.26 GHz quad-core processor, 2 GB RAM, 8-bit integer arithmetic), with an execution time of 6 ms on the phone compared to an execution time of 780 ms in Python on a laptop (Mac Pro; 3.1 GHz Intel Core i5, 8 GB RAM, 32-bit floating point).
Given that learning is based solely on historical data, unexpected predictions may occur because correlations learned from the data may not imply causation. Hybrid approaches, whereby the deep learning model is used to make an accurate prediction while meal/bolus rules supported by a physiological model avoid apparent errors, can mitigate this. Based on the CRNN approach proposed in this paper, it is possible to develop such a hybrid method, which may have the advantages of both conventional and DL algorithms.
• 142. Clinical survival models: cancer survival. A Simple Discrete-Time Survival Model for Neural Networks. Michael F. Gensheimer and Balasubramanian Narasimhan, Stanford University (May 2018). https://arxiv.org/pdf/1805.00917.pdf https://github.com/MGensheimer/nnet-survival (Keras)
It is recommended to use at least ten time intervals to avoid bias in the survival estimates [17]. Using narrow time intervals also helps avoid inaccurate parameter estimates if the effect of the input data varies rapidly with follow-up time (time-varying coefficients, in the language of survival analysis). In most of our experiments we have used 20-50 time intervals. We suggest choosing the cut-points so that around the same number of survival events falls into each time interval, which helps ensure reliable estimates for all time intervals.
While the model has several advantages and we think it will be useful for a broad range of applications, it does have some drawbacks. The discretization of follow-up time results in a less smooth predicted survival curve compared to a parametric survival model such as a Weibull accelerated failure time model. As long as a sufficient number of time intervals is used, this is not a large practical concern. Unlike a parametric survival model, the model does not provide survival predictions past the end of the last time interval, so it is recommended to extend the last interval past the last follow-up time of interest. The advantages of parametric survival models and our discrete-time survival model could be combined in the future using a flexible parametric model, such as the cubic-spline-based model of Royston and Parmar (2002), implemented in the flexsurv R package. Complex non-proportional hazards models (see Katzman et al. 2018 for a proportional-hazards deep learning model) can be created in this way, and likely could be implemented in deep learning packages.
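The suggested cut-point rule (roughly the same number of observed events per interval) amounts to taking quantiles of the uncensored event times; a small NumPy sketch of that step follows, with simulated follow-up data standing in for a real cohort.

```python
import numpy as np

def event_quantile_cuts(event_times, observed, n_intervals=20):
    """Pick interval boundaries so that roughly the same number of observed
    (uncensored) events falls into each discrete time interval."""
    uncensored = np.sort(event_times[observed.astype(bool)])
    qs = np.linspace(0, 1, n_intervals + 1)[1:-1]        # interior quantile levels
    cuts = np.quantile(uncensored, qs)
    return np.concatenate(([0.0], cuts, [uncensored.max()]))

times = np.random.exponential(24.0, size=500)            # follow-up times (e.g. months)
events = np.random.binomial(1, 0.6, size=500)            # 1 = event observed, 0 = censored
breaks = event_quantile_cuts(times, events, n_intervals=10)
print(np.histogram(times[events == 1], bins=breaks)[0])  # roughly equal counts per interval
```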
• 143. Clinical survival models: sequential DL, "recurrent". Deep Recurrent Survival Analysis. Kan Ren, Jiarui Qin, Lei Zheng, Zhengyu Yang, Weinan Zhang, Lin Qiu, Yong Yu. Shanghai Jiao Tong University (Sept 2018). https://arxiv.org/abs/1809.02403
Recent advances in modern technology make abundant data collection available for time-to-event information, which facilitates observing and tracking the events of interest. However, for various reasons, many events lose tracking during the observation period, which leaves the data censored: we only know that the true time to the occurrence of the event is larger than, smaller than, or within the observation time, which has been defined as survivorship bias and is categorized into right-censored, left-censored and interval-censored data, respectively (Lee and Wang 2003). Survival analysis, a.k.a. time-to-event analysis (Lee et al. 2018; DeepHit), is a typical statistical methodology for modeling time-to-event data while handling censorship, a traditional research problem that has been studied for decades.
Our model proposes a novel modeling view for survival analysis, which aims at flexibly modeling the survival probability function rather than making any assumptions about its distributional form. Specifically, DRSA predicts the conditional probability of the event at each time step given that the event has not occurred before, and combines these through the probability chain rule to estimate both the probability density function and the cumulative distribution function of the event over time, eventually forecasting the survival rate at each time, which is more reasonable and mathematically efficient for survival analysis. Through this modeling approach, the DRSA model can capture the sequential patterns embedded in the feature space along time, and output more effective distributions for each individual sample at a fine-grained level.
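The probability chain rule at the heart of this formulation is easy to state in code: given the per-step conditional hazards predicted by the recurrent network, the survival and event distributions follow directly. The NumPy sketch below shows only this bookkeeping, not the recurrent model itself.

```python
import numpy as np

def survival_from_hazards(h):
    """Probability chain rule used by discrete/recurrent survival models:
    given conditional hazards h_t = P(event at t | no event before t),
    S(t) = prod_{u<=t} (1 - h_u) and the event pdf is p(t) = h_t * S(t-1)."""
    h = np.asarray(h, dtype=float)
    survival = np.cumprod(1.0 - h)                      # S(t) after each step
    pdf = h * np.concatenate(([1.0], survival[:-1]))    # p(t), with S(0) = 1
    return survival, pdf

S, p = survival_from_hazards([0.05, 0.10, 0.20, 0.30])
print(S.round(3), p.round(3), p.sum() + S[-1])          # pdf mass + remaining survival = 1
```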
• 144. Clinical survival models: cardiac motion analysis. Deep learning cardiac motion analysis for human survival prediction. Ghalib A. Bello, Timothy J.W. Dawes, Jinming Duan, Carlo Biffi, Antonio de Marvao, Luke S.G.E. Howard, J. Simon R. Gibbs, Martin R. Wilkins, Stuart A. Cook, Daniel Rueckert, Declan P. O'Regan (submitted 8 Oct 2018). Imperial College London; National Heart Centre Singapore; Duke-NUS Graduate Medical School, Singapore. https://arxiv.org/abs/1810.03382 https://github.com/UK-Digital-Heart-Project/4Dsurvival
Making predictions about future events from the current state of a moving three-dimensional (3D) scene depends on learning correspondences between patterns of motion and subsequent outcomes. Such relationships are important in biological systems, which exhibit complex spatio-temporal behaviour in response to stimuli or as a consequence of disease processes. Here we use recent advances in machine learning for visual processing tasks to develop a generalisable approach for modelling time-to-event outcomes from time-resolved 3D sensory input. We tested this on the challenging task of predicting survival due to heart disease through analysis of cardiac imaging.
The traditional paradigm of epidemiological research is to draw insight from large-scale clinical studies through linear regression modelling of conventional explanatory variables, but this approach does not embrace the dynamic physiological complexity of heart disease. Even objective quantification of heart function by conventional analysis of cardiac imaging relies on crude measures of global contraction that are only moderately reproducible and insensitive to the underlying disturbances of cardiovascular physiology.
While conventional autoencoders are used for unsupervised learning tasks, we extend recent proposals for supervised autoencoders in which the learned representations are both reconstructive and discriminative. We achieved this by adding a prediction branch to the network with a loss function for survival inspired by the Cox proportional hazards model. A hybrid loss function, optimising the trade-off between survival prediction and accurate input reconstruction, is calibrated during training. The compressed representations of 3D motion predict survival more accurately than a composite measure of conventional manually derived parameters measured on the same images.
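The survival branch of such a hybrid loss is typically a negative Cox partial log-likelihood computed on the network's scalar risk outputs; the PyTorch sketch below is a generic version of that term (ties are handled naively and the mixing weight alpha is an assumption), not the 4Dsurvival implementation.

```python
import torch

def cox_partial_likelihood_loss(risk, time, event):
    """Negative Cox partial log-likelihood for a mini-batch: risk is the network's
    scalar output per subject; subjects with earlier events should receive higher risk.
    The risk set of each event is every subject with an equal or later follow-up time."""
    order = torch.argsort(time, descending=True)     # so cumulative sums form risk sets
    risk, event = risk[order], event[order]
    log_cumsum = torch.logcumsumexp(risk, dim=0)     # log sum of exp(risk) over the risk set
    return -((risk - log_cumsum) * event).sum() / event.sum().clamp(min=1)

risk = torch.randn(32)                               # outputs of the prediction branch
time = torch.rand(32) * 10                           # follow-up times
event = (torch.rand(32) > 0.3).float()               # 1 = event observed, 0 = censored
print(cox_partial_likelihood_loss(risk, time, event))
# hybrid objective sketch: loss = alpha * reconstruction_mse + (1 - alpha) * cox_loss
```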
• 146. Representation learning for sequences. Unified recurrent neural network for many feature types. Alexander Stec, Diego Klabjan, Jean Utke (submitted 24 Sep 2018). https://arxiv.org/abs/1809.08717
"There are time series that are amenable to recurrent neural network (RNN) solutions when treated as sequences, but some series, e.g. asynchronous time series, provide a richer variation of feature types than current RNN cells take into account. In order to address such situations, we introduce a unified RNN that handles five different feature types, each in a different manner. Our RNN framework separates sequential features into two groups dependent on their frequency, which we call sparse and dense features, and which affect cell updates differently. Further, we also incorporate time features at the sequential level that relate to the time between specified events in the sequence and are used to modify the cell's memory state. We also include two types of static (whole-sequence-level) features, one related to time and one not, which are combined with the encoder output."
For future work, it would be interesting to incorporate even more feature types than the five covered in this work. One in particular is a feature type that gives time information looking forward in the sequence. All features in this work use time information related to past events, but there are cases that can benefit from incorporating future knowledge when available. One example is the time from the current time step to the prediction, so that the network can have direct knowledge of its absolute time location in the sequence.
• 147. In medical diagnostics: sequence ≈ patient visits. ShortFuse: Biomedical Time Series Representations in the Presence of Structured Information. Madalina Fiterau, Suvrat Bhooshan, Jason Fries, Charles Bournhonesque, Jennifer Hicks, Eni Halilaj, Christopher Ré, Scott Delp (revised 16 May 2017). Stanford University. https://arxiv.org/abs/1705.04790 - Cited by 5
"In healthcare applications, temporal variables that encode movement, health status and longitudinal patient evolution are often accompanied by rich structured information such as demographics, diagnostics and medical exam data (constant along the temporal domain). However, current methods do not jointly optimize over structured covariates and time series in the feature extraction process. We present ShortFuse, a method that boosts the accuracy of deep learning models for time series by explicitly modeling temporal interactions and dependencies with structured covariates. ShortFuse introduces hybrid convolutional and LSTM cells that incorporate the covariates via weights that are shared across the temporal domain."
• 148. Sequences → network science (graph inference). Referral paths in the U.S. physician network. Chuankai An, A. James O'Malley, Daniel N. Rockmore (December 2018). https://doi.org/10.1007/s41109-018-0081-4
For a patient, a "referral path" records (as a "patient journey") the chronological sequence of physicians encountered by the patient (subject to certain constraints on the times between encounters). It provides a basic unit of analysis in a broader referral network that encodes the flow of patients and information between physicians in a healthcare system. We consider referral networks defined over a range of interactions as well as the characteristics of referral paths, producing a characterization of the various networks as well as of the physicians they comprise. In this paper we study the finer-scale patterns found in referral paths and, importantly, link these statistics to treatment outcomes in the particular setting of cardiovascular disease, whereas referral path and referral information has generally been ignored as a factor in the important problem of treatment outcome prediction.
Figure captions: An example referral path with three physicians A, B, C; the patient visits them five times. Physicians A and C are from the same HRR/hospital (blue), while physician B is from another HRR/hospital (red). Visualization of a hospital (PHN) referral network with 30 physicians and 101 directed edges in 2011; red, yellow and light blue nodes represent physicians with positive, zero and negative net patient flow (NPF), respectively; targets of referrals are marked with shadow on directed edges.
• 150. Small data for deep learning. Small Sample Learning in Big Data Era. Jun Shu, Zongben Xu, Deyu Meng (last revised 22 Aug 2018). https://arxiv.org/abs/1808.04572
As a promising area in artificial intelligence, a new learning paradigm called Small Sample Learning (SSL) has been attracting prominent research attention in recent years. In this paper, we aim to present a survey that comprehensively introduces the current techniques proposed on this topic. Specifically, current SSL techniques can be mainly divided into two categories. The first category of SSL approaches can be called "concept learning", which emphasizes learning new concepts from only a few related observations; the purpose is mainly to simulate human learning behaviors such as recognition, generation, imagination, synthesis and analysis. The second category is called "experience learning", which usually co-exists with the large-sample learning manner of conventional machine learning; this category mainly focuses on learning with insufficient samples and is also called small data learning in some of the literature. More extensive surveys of both categories of SSL techniques are introduced, some neuroscience evidence is provided to clarify the rationality of the entire SSL regime and its relationship with human learning, and some discussion of the main challenges and possible future research directions along this line is also presented.
The Fast and the Flexible: training neural networks to learn to follow instructions from small data. Rezka Leonandya, Elia Bruni, Dieuwke Hupkes, Germán Kruszewski (submitted 17 Sep 2018). https://arxiv.org/abs/1809.06194
Learning to follow human instructions is a challenging task because, while interpreting instructions requires discovering arbitrary algorithms, humans typically provide very few examples to learn from. For learning from this data to be possible, strong inductive biases are necessary. Work in the past has relied on hand-coded components or manually engineered features to provide such biases. In contrast, here we seek to establish whether this knowledge can be acquired automatically by a neural network system through a two-phase training procedure: a (slow) offline learning stage where the network learns about the general structure of the task, and a (fast) online adaptation phase where the network learns the language of a new given speaker.
• 151. Data augmentation for time series. T-CGAN: Conditional Generative Adversarial Network for Data Augmentation in Noisy Time Series with Irregular Sampling. Giorgia Ramponi, Pavlos Protopapas, Marco Brambilla and Ryan Janssen (20 Nov 2018). https://arxiv.org/abs/1811.08295
In this paper we propose a data augmentation method for time series with irregular sampling, the Time-Conditional Generative Adversarial Network (T-CGAN). Our approach is based on Conditional Generative Adversarial Networks (CGAN), where the generative step is implemented by a deconvolutional NN and the discriminative step by a convolutional NN. Both the generator and the discriminator are conditioned on the sampling timestamps, in order to learn the hidden relationship between data and timestamps, and consequently to generate new time series.
• 152. Data augmentation from invariance modelling #1. Data Augmentation of Room Classifiers using Generative Adversarial Networks. Constantinos Papayiannis, Christine Evers, Patrick A. Naylor. https://arxiv.org/abs/1901.03257 (Jan 2019)
• 153. Data augmentation from invariance modelling #2. Sinusoidal wave generating network based on adversarial learning and its application: synthesizing frog sounds for data augmentation. Sangwook Park, David K. Han, and Hanseok Ko. https://arxiv.org/abs/1901.02050 (Jan 2019)
Graphical comparisons of time-domain waveforms and spectrograms, and quantitative comparisons using the inception score, clearly showed that the synthetic data closely resemble the target signal. Overall, it was demonstrated that the proposed approach of data augmentation by direct generation of synthetic audio streams improved the CNN-based classification rate and its training efficiency when both the real and the synthetic data were used to train the classifier. These results demonstrate that the proposed network generates an arbitrary signal composed of sinusoidal waveforms and can be used for training a deep network.
• 154. Transfer learning with time series #1. Data augmentation using synthetic data for time series classification with deep residual networks. Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, Pierre-Alain Muller (submitted 7 Aug 2018). https://arxiv.org/abs/1808.02455 https://github.com/hfawaz/aaltd18
Unlike in image recognition problems, data augmentation techniques have not yet been investigated thoroughly for the TSC task. This is surprising, as the accuracy of deep learning models for TSC could potentially be improved, especially for small datasets that exhibit overfitting, when a data augmentation method is adopted. In this paper, we fill this gap by investigating the application of a recently proposed data augmentation technique based on the Dynamic Time Warping distance for a deep learning model for TSC.
The data augmentation method is mainly based on a weighted form of the Dynamic Time Warping (DTW) Barycentric Averaging (DBA) technique [Petitjean et al. 2016]. The latter algorithm averages a set of time series in a DTW-induced space; by leveraging a weighted version of DBA, the method can create an infinite number of new time series from a given set simply by varying these weights. Three techniques were proposed to select these weights, of which we chose only one in our approach for the sake of simplicity, although we consider evaluating the other techniques in our future work. The chosen weighting method, called Average Selected, consists of selecting a subset of close time series and filling their bounding boxes.
We did not test the effect of imbalanced classes in the training set and how it could affect the model's generalization capabilities. Note that imbalanced time series classification is a recent active area of research that merits an empirical study of its own [Geng et al. 2018]. Finally, the number of generated time series in our framework was chosen to be double the number of time series in the most represented class (a hyper-parameter of our approach that we aim to investigate further in future work).
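A rough way to experiment with weighted-DBA augmentation is to draw random convex weights over the series of one class and compute a DTW barycenter for each draw; the sketch below assumes tslearn's dtw_barycenter_averaging (which, to the best of my knowledge, accepts a weights argument) and does not implement the paper's Average Selected weighting scheme.

```python
import numpy as np
from tslearn.barycenters import dtw_barycenter_averaging

def augment_with_weighted_dba(X_class, n_new=10, rng=None):
    """Generate synthetic series for one class by averaging its members in DTW
    space with random convex weights (a loose stand-in for the weighted-DBA
    augmentation described above, not the authors' weighting scheme)."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        w = rng.dirichlet(np.ones(len(X_class)))      # random convex weights
        synthetic.append(dtw_barycenter_averaging(X_class, weights=w))
    return np.stack(synthetic)

X_class = np.random.randn(20, 100, 1)                 # 20 univariate series of length 100
print(augment_with_weighted_dba(X_class, n_new=5).shape)   # (5, 100, 1)
```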
• 155. Transfer learning with time series #2. Physiological-signal-based mental workload estimation via transfer dynamical autoencoders in a deep learning framework. Neurocomputing, available online 11 April 2019. https://arxiv.org/abs/1808.02455
In this study, we propose a new transfer dynamical autoencoder (TDAE) to capture the dynamical properties of electroencephalograph (EEG) features and the individual differences. The TDAE consists of three consecutively connected modules, termed the feature filter, the abstraction filter, and the transferred mental workload (MW) classifier. The feature and abstraction filters introduce a dynamical deep network to abstract the EEG features across adjacent time steps into salient MW indicators. The transferred MW classifier exploits a large volume of EEG data from a source-domain EEG database recorded under emotional stimuli to improve the stability of model training.
The main limitation of the proposed TDAE deep learning framework for MW recognition lies in two aspects. First, the computational cost for training the entire network is significantly higher than for classical shallow and deep classifiers, which leads to a high time cost when selecting optimal hyper-parameters of the model; we therefore employed the same value of the feature filter order to reduce the computational burden, although the filter order should no doubt be feature-specific. Second, there is a prerequisite for knowledge transfer across the two mental-task domains: exactly the same EEG channels must be selected for data preprocessing, which raises the possibility that useful MW indicators are excluded. In future work, we will further investigate deep learning methods for MW assessment on these two aspects.
• 156. Active learning with time series. Robust Active Learning for Electrocardiographic Signal Classification. Xu Chen, Saratendu Sethi (submitted 21 Nov 2018). https://arxiv.org/abs/1811.08919
Motivated by the fact that ECG data are usually heavily unbalanced among the different classes and that the class labels are noisy because they are manually labeled, this paper proposes a novel solution based on robust active learning for addressing these challenges. The key idea is to first cluster the data in a low-dimensional embedded space and then select the most informative instances within local clusters. By selecting the most informative instances based on local average minimal distances, the algorithm tends to select the data for labeling in a more diversified way.
The first stage of the RALS algorithm relies on label spreading, a well-known graph-based semi-supervised learning algorithm. It calculates a similarity measure and propagates the labels by this measure for prediction. It also generates the label distribution matrix, which consists of the predicted probability of every class for each sample. In order to select data from different classes, t-Distributed Stochastic Neighbor Embedding (t-SNE) is applied to the label distribution matrix, due to its good performance on high-dimensional datasets.
A novel noisy-label reduction step relying on an effective confidence score measure is proposed, based on the best-versus-second-best (BSVB) criterion, to enhance the active learning performance. For each selected data sample after ranking, the ratio of the largest estimated class probability to the second largest estimated class probability is calculated (this information can be retrieved from the label distribution matrix) and compared to a user-set threshold. The selected data are added to the labeled set if the ratio is larger than the threshold. By adding the estimated labels passed from the noise-reduction step into the labeled dataset, the noisy labels in the selection are significantly reduced. The augmented labeled dataset, after adding the selected data samples, is fed to the label spreading algorithm again to learn the next, enhanced model.
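The label-spreading and best-versus-second-best (BSVB) filtering steps can be prototyped directly with scikit-learn; the sketch below is a simplified single-pass version (no t-SNE or clustering), with the kernel, neighbour count and threshold chosen arbitrarily.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

def bsvb_confident_labels(X, y_partial, threshold=3.0):
    """Best-vs-second-best filtering on top of label spreading: keep a propagated
    label only when the top class probability dominates the runner-up by a
    user-set ratio. y_partial uses -1 for unlabelled samples."""
    model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)
    probs = model.label_distributions_               # (n_samples, n_classes)
    top2 = np.sort(probs, axis=1)[:, -2:]            # second-best and best probability
    ratio = top2[:, 1] / (top2[:, 0] + 1e-12)
    confident = (y_partial == -1) & (ratio > threshold)
    return confident, model.transduction_[confident]

X = np.random.randn(200, 16)                         # e.g. embedded heartbeat features
y = -np.ones(200, dtype=int)
y[:20] = np.random.randint(0, 3, 20)                 # a small labelled seed set
mask, labels = bsvb_confident_labels(X, y)
print(mask.sum(), labels[:10])
```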
• 158. Visualizing audio processing. Interpretable Convolutional Filters with SincNet. Mirco Ravanelli and Yoshua Bengio (NIPS 2018). https://arxiv.org/abs/1811.08633 https://github.com/mravanelli/SincNet/ https://github.com/mravanelli/pytorch-kaldi/
This paper summarizes our recent efforts to develop a more interpretable neural model for directly processing speech from the raw waveform. In particular, we propose SincNet, a novel Convolutional Neural Network (CNN) that encourages the first layer to discover more meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the elements of each filter, only the low and high cutoff frequencies of band-pass filters are directly learned from data. This inductive bias offers a very compact way to derive a customized filter-bank front-end that only depends on a few parameters with a clear physical meaning. Our experiments, conducted on both speaker and speech recognition, show that the proposed architecture …
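The key trick, band-pass filters defined by two learnable cutoffs, can be sketched compactly in PyTorch; the version below is a simplified re-implementation of the idea (initialisation, windowing and normalisation are assumptions), not the official SincNet code linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincStyleConv(nn.Module):
    """Simplified SincNet-style front end: each filter is an ideal band-pass built
    as the difference of two sinc low-pass filters, so only the two cutoff
    frequencies per filter are learned instead of every filter tap."""
    def __init__(self, n_filters=16, kernel_size=101, sample_rate=16000):
        super().__init__()
        self.kernel_size, self.sr = kernel_size, sample_rate
        self.low_hz = nn.Parameter(torch.linspace(30, 4000, n_filters))      # low cutoffs (Hz)
        self.band_hz = nn.Parameter(torch.full((n_filters,), 400.0))         # bandwidths (Hz)

    def forward(self, x):                                                     # x: (batch, 1, time)
        t = torch.arange(self.kernel_size, device=x.device) - (self.kernel_size - 1) / 2
        t = t / self.sr
        low = torch.abs(self.low_hz).unsqueeze(1)
        high = (low + torch.abs(self.band_hz).unsqueeze(1)).clamp(max=self.sr / 2)
        # band-pass = (high-cutoff low-pass) - (low-cutoff low-pass)
        filt = 2 * high * torch.sinc(2 * high * t) - 2 * low * torch.sinc(2 * low * t)
        filt = filt * torch.hamming_window(self.kernel_size, device=x.device)
        filt = filt / filt.abs().max(dim=1, keepdim=True).values
        return F.conv1d(x, filt.unsqueeze(1), padding=self.kernel_size // 2)

print(SincStyleConv()(torch.randn(2, 1, 16000)).shape)    # torch.Size([2, 16, 16000])
```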
• 159. Spatiotemporal activations. Compensated Integrated Gradients to Reliably Interpret EEG Classification. Kazuki Tachikawa, Yuji Kawai, Jihoon Park, Minoru Asada. Machine Learning for Health (ML4H) Workshop at NeurIPS 2018. https://arxiv.org/abs/1811.08633
Integrated gradients are widely employed to evaluate the contribution of input features in classification models because they satisfy the axioms for attribution of a prediction. This method, however, requires an appropriate baseline for reliable determination of the contributions. We propose a compensated integrated gradients method that does not require a baseline; instead, the method compensates the attributions calculated by integrated gradients at an arbitrary baseline using Shapley sampling.
The classifier constraints decrease the classification accuracy of the temporal CNN. In contrast, spatiotemporal CNNs exhibit higher classification accuracy but lower interpretation reliability than the temporal CNNs. Therefore, classifier selection should depend on whether reliability or classification accuracy is emphasized.
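For reference, plain integrated gradients (the baseline-dependent method the paper sets out to fix) takes only a few lines of PyTorch; the sketch below shows the standard formulation with an arbitrary zero baseline and a toy classifier, and does not include the Shapley-sampling compensation proposed in the paper.

```python
import torch

def integrated_gradients(model, x, baseline=None, target=0, steps=50):
    """Plain integrated gradients: average the gradients of the target output along a
    straight path from the baseline to the input, then scale by (input - baseline)."""
    if baseline is None:
        baseline = torch.zeros_like(x)               # the arbitrary baseline the paper criticises
    alphas = torch.linspace(0, 1, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)        # (steps, *x.shape) interpolation path
    path.requires_grad_(True)
    out = model(path)[:, target].sum()
    grads = torch.autograd.grad(out, path)[0]
    return (x - baseline) * grads.mean(dim=0)

# toy EEG classifier: 8 channels x 128 samples flattened into a linear layer
net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(8 * 128, 3))
x = torch.randn(8, 128)
print(integrated_gradients(net, x, target=1).shape)   # torch.Size([8, 128])
```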