Clustering Financial Time Series and
Evidences of Memory Effects
Facoltà di Scienze Matematiche, Fisiche e Naturali
Corso di Laurea Magistrale in Fisica
Candidate
Gabriele Pompa
ID number 1146901
Thesis Advisor
Prof. Luciano Pietronero
Academic Year 2011/2012
Clustering Financial Time Series and Evidences of Memory Effects
Master thesis. Sapienza – University of Rome
© 2012 Gabriele Pompa. All rights reserved
This thesis has been typeset with LaTeX and the Sapthesis class.
Author’s email: gabriele.pompa@gmail.com
Non scholae, sed vitae discimus.
dedicated to my mother, for teaching me commitment,
to my father, for making me love it,
and to Lilla, for bearing all this with love.
Contents
Introduction vii
1 Financial Markets 1
1.1 Efficient Market Hypothesis . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Random-Walk Models . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Stylized Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Technical trading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 Main assumptions and Skepticism . . . . . . . . . . . . . . . 8
1.4.2 Feed-back Effect and Common Figures . . . . . . . . . . . . . 9
2 Pattern Recognition 11
2.1 From the Iris Dataset to Economic Taxonomy . . . . . . . . . . . . . 11
2.2 Supervised and Unsupervised Learning and Classification . . . . . . 16
2.3 Bayesian learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Bayes Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 Bayesian Model Selection . . . . . . . . . . . . . . . . . . . . 18
2.4 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.1 Definition and Distinctions . . . . . . . . . . . . . . . . . . . 20
2.4.2 Time Series Clustering . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Distance and Similarity . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.1 Information Theoretic Interpretation . . . . . . . . . . . . . . 23
3 Monte Carlo Framework 25
3.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Static Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.1 Rejection Sampling . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.2 Hit-or-Miss Sampling: a numerical experiment . . . . . . . . 28
3.4 Dynamic Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.1 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.2 MCMC and Metropolis-Hastings Algorithm . . . . . . . . . . 32
4 Memory Effects: Bounce Analysis 35
4.1 The Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.1 T Seconds Rescaling . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Bounce: Critical Discussion About Definition . . . . . . . . . . . . . 37
4.3 Consistent Random Walks . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4 Memory Effects in Bounce Probability . . . . . . . . . . . . . . . . . 45
4.5 Window Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5.1 Recurrence Time . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5.2 Window Size . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5.3 Fluctuations within Window . . . . . . . . . . . . . . . . . . 55
5 The Clustering Model 59
5.1 Structure of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . 59
5.2 Toy Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3 Real Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4 Best Partition: Bayesian Characterization . . . . . . . . . . . . . . . 65
5.4.1 Gaussian Cost Prior . . . . . . . . . . . . . . . . . . . . . . . 67
5.4.2 Gaussian Likelihood . . . . . . . . . . . . . . . . . . . . . . . 68
5.5 MCMC Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.6 Splitting Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.6.1 RANDOM of SPLITTING . . . . . . . . . . . . . . . . . . . . 78
5.7 Merging Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.7.1 RANDOM of MERGING . . . . . . . . . . . . . . . . . . . . . 82
6 The Clustering Results 87
6.1 Role of Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.1.1 Noise Dependency of RANDOM Thresholds . . . . . . . . . 89
6.2 Toy Model Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.1 Insights of Convergence . . . . . . . . . . . . . . . . . . . . . 92
6.2.2 σ_prior Analysis and Sub-Optimal Partitions . . . . . . . . . . 95
6.2.3 Results of the Entire 3-Steps Procedure . . . . . . . . . . . . 100
6.3 Real Series Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3.1 Missteps: Granularity and Short Series Effects . . . . . . . . . 102
6.3.2 Correct Clustering Results . . . . . . . . . . . . . . . . . . . . 103
6.3.3 Attempt of Cause-Effect Analysis . . . . . . . . . . . . . . . . 108
6.4 Conclusions and Further Analysis . . . . . . . . . . . . . . . . . . . . 111
A Noise Dependency of Merging Threshold - List of Plots 113
B Clustering Results - List of Plots 119
C Cause-Effect Clustering - List of Plots 141
C.1 Half-Series Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
C.2 Cause-Effect Relations . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Introduction
Among the community of investors, technical trading is a growing school of thought: every day more and more investors rely on technical indicators.
Standard economic theory regards the "efficiency" of the market (Fama, 1970) as its cornerstone.
This assumption postulates that riskless investment strategies cannot exist, and it would endorse a simplistic stochastic modeling of the market in terms of a random walk of the price.
There is, however, a body of empirical evidence, known as "stylized facts", which seriously restricts the validity of totally random models in explaining the behavior of the market.
Recently the problem has been approached from a different point of view, asking whether the behavior of the market shows evidence of investment strategies that are familiar to, and simultaneously adopted by, a large community of investors [57].
Focusing on "technical" indicators, such as Supports and Resistances, the probability that such indicators give a correct prediction was studied conditional on the number of times they had previously been exploited, and evidence of memory effects was found.
In this thesis a new method to investigate regularities in the market is developed. The problem is approached by looking for "similarity" in the behavior of the price around the points where a level of Support or Resistance had already been identified.
The procedure, defined and refined over several months, led to the design of an original algorithm for the clustering of time series.
The thesis consists of 6 chapters and 3 appendices: the core of the results is presented in chapters 4 to 6, the first three chapters provide the necessary background, and the appendices list all the plots omitted from the text for clarity but necessary for completeness. The structure of the chapters is the following:
• Chapter 1: a rapid review of standard economic theory, mainly focused on the Efficient Market Hypothesis, the main random-walk models designed to explain market behavior and their drawbacks. The chapter ends with an introduction to the philosophy of technical trading and provides examples of common technical indicators, such as Supports and Resistances.
• Chapter 2: provides the essential background of the theory of Pattern Recognition, together with examples of applications from the early and the recent literature, then specializes to the statistical framework built on the Bayes rule and finally introduces the central concept of Clustering, stressing the aspects concerning time series clustering.
• Chapter 3: introduces the numerical instruments adopted, namely those of Monte Carlo sampling theory. The MC theory is briefly revised, considering static and dynamic methods separately. The chapter ends with the essential features of the theory of Markov Chains, necessary to contextualize the Markov Chain Monte Carlo (MCMC) methods widely adopted in the numerical simulations performed.
• Chapter 4: critically reviews the results previously obtained on the analysis of the rebounds on Support and Resistance levels. The bounce analysis is then extended to the statistical properties of the characteristic times describing bounces and of the typical fluctuations of the price around those events.
• Chapter 5: this completely original chapter introduces the Bayesian algorithm adopted in the subsequent clustering analysis. After stating the 3-step structure of the procedure, each step is analyzed in detail, both to provide a mathematical basis and to make the results more easily reproducible.
• Chapter 6: reports all the results obtained via the clustering procedure, both those obtained with the toy model used to test the algorithm and those obtained with the real financial time series, with the aim of reporting as objectively as possible the weaknesses as well as the positive aspects of the algorithm designed for clustering purposes.
Although this thesis represents, in my opinion, only the beginning of this original and fascinating analysis, evidence of structural regularities among the time series analyzed is effectively found even at this early stage.
Chapter 1
Financial Markets
1.1 Efficient Market Hypothesis
Since 1970, with Fama's work [1], the dominant assumption on capital markets has been the Efficient Market Hypothesis (EMH). Under this hypothesis the market is viewed as an open system instantly processing all the available information. The concept of available information, or information set θ_t, describes the corpus of knowledge on which traders base their investment decisions. These decisions can be based on public or private information, such as the expected profit of a company, interest rates and expectations on dividends [2, Chap. 8].
Jensen (1978): "A market is efficient with respect to information set θ_t if it is impossible to make economic profits by trading on the basis of information set θ_t" [3].
The efficiency of the market is expressed by the absence of arbitrage, namely the impossibility of realizing riskless strategies that rely only on the time needed by the price to return to its fundamental value after an operation, the fundamental value being the one expected on the basis of θ_t.
This automatic self-organization of the market leads to prices always fully reflecting the available information, so that price increments are substantially random.
In finance, the variable related to the price increment over a lapse τ of time, from price p_t to price p_{t+τ}, is called the return r_τ(t) and can be defined in various ways.
Let p_t be the price of a financial asset at time t. Then possible definitions of the return are:
• Linear Returns
r_τ(t) = p_{t+τ} − p_t    (1.1)
which has the advantage of being linear, but directly depends on the currency.
• Relative Returns
r_τ(t) = (p_{t+τ} − p_t) / p_t    (1.2)
which takes into account only the percentage changes. However, with this definition two consecutive and opposite variations are not equivalent to a null variation¹.
• Logarithmic Returns
r_τ(t) = log(p_{t+τ}) − log(p_t) ≈ (p_{t+τ} − p_t) / p_t    (1.3)
where the approximation is valid for high-frequency data (more details in section 4.1), for which the absolute variation |p_{t+τ} − p_t| of the price is much smaller than the value p_t.
¹ Two consecutive increments such as a gain of +1% and a loss of −1% on a price p_t = 100 $ will not leave the price at its initial value.
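As a concrete illustration of these three definitions, the following minimal sketch (mine, not part of the thesis code; numpy is assumed) computes them for an arbitrary price series:

```python
import numpy as np

def returns(p, tau=1, kind="log"):
    """Return series r_tau(t) of a price array p, for a lag of tau samples.

    kind: "linear"   -> p[t+tau] - p[t]               (eq. 1.1)
          "relative" -> (p[t+tau] - p[t]) / p[t]      (eq. 1.2)
          "log"      -> log(p[t+tau]) - log(p[t])     (eq. 1.3)
    """
    p = np.asarray(p, dtype=float)
    if kind == "linear":
        return p[tau:] - p[:-tau]
    if kind == "relative":
        return (p[tau:] - p[:-tau]) / p[:-tau]
    if kind == "log":
        return np.log(p[tau:]) - np.log(p[:-tau])
    raise ValueError("unknown kind: %s" % kind)

# toy usage: for high-frequency (small) variations, log and relative returns nearly coincide
prices = 100.0 * np.exp(np.cumsum(0.0002 * np.random.randn(1000)))
print(np.allclose(returns(prices, 5, "log"), returns(prices, 5, "relative"), atol=1e-4))
```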
Market efficiency can be mathematically formalized by the martingale property
E[p_{t+1} | p_0, p_1, ..., p_t] = p_t    (1.4)
to be satisfied by the price time series, which states that the expected future price, conditional on the entire price history, equals the current price. This condition corresponds exactly to what is defined as a perfect market [2].
In practice, an efficient market needs a finite time to self-organize, a time quantified via the two-point autocorrelation function
ρ_τ(t, t′) = ( E[r_τ(t) r_τ(t′)] − E[r_τ(t)] E[r_τ(t′)] ) / ( E[r_τ²(t)] − E[r_τ(t)]² )    (1.5)
which is indeed always zero, except on very short scales, up to a few minutes (figures (1.1) and (1.2)).
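This statement can be checked directly on data by estimating (1.5) from a return series; the sketch below (an illustration of mine, assuming stationarity so that ρ_τ depends only on the lag) does so with numpy:

```python
import numpy as np

def autocorrelation(r, max_lag):
    """Sample estimate of (1.5) for a stationary return series r, as a function of the lag."""
    r = np.asarray(r, dtype=float) - np.mean(r)
    var = np.mean(r ** 2)
    return np.array([np.mean(r[:r.size - lag] * r[lag:]) / var
                     for lag in range(1, max_lag + 1)])

# for i.i.d. noise the estimate fluctuates around zero at every lag,
# as empirical returns do beyond the first few minutes
rho = autocorrelation(np.random.randn(10000), max_lag=20)
print(np.round(rho, 3))
```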
1.2 Random-Walk Models
The efficient market hypothesis led to a random-walk modelization of price time series. The first attempt to formalize the dependence between the price variation x and the time t was made by Bachelier in his doctoral thesis [4]. He proposed a Gaussian form for the distribution of the price change x at time t:
x ∼ N(0, σ²),   σ ∼ √t,   where x = p_{t+τ} − p_τ
Figure 1.1. Autocorrelation function (1.5) of BNP Paribas (BNPP.PA) logarithmic returns (1.3), over periods τ = 1 and 5 minutes, as a function of the lag t − t′. Source: Chakraborti et al., Econophysics: Empirical Facts and Agent-Based Models [13].
Figure 1.2. Autocorrelation function (1.5) of the return time series of a stock of the New York Stock Exchange (NYSE) from 1966 to 1998 (main plot) and of a stock of the London Stock Exchange (LSE) during one day of trading (inset). The lag-time unit of the inset is the event time, or tick, i.e. the number of transactions (more details on the meaning of this choice can be found in section (4.1)). Note that the exact definition of the returns is not relevant here because the graphs refer to high-frequency data. Source: Cristelli M., Pietronero L. and Zaccaria A. (2001): Critical Overview of Agent Based Models for Economics [12].
The expected value of a common stock's price change is always zero,
E[x] = 0,
thus reflecting the martingale property (1.4), but Bachelier's model assigns a finite probability to negative values of the stock price, increasing with time, since ⟨x²⟩ ∼ t.
Citing Samuelson (1973) [5]:
"Seminal as the Bachelier model is, it leads to ridiculous results.
[. . . ] An ordinary random walk of price, even if it is unbiased, will result
in price becoming negative with a probability that goes to 1/2 as t → ∞.
This contradicts the limited liability feature of modern stocks and bonds.
The General Motors stock I buy for 100 $ today can at most drop in
value to zero, at which point I tear up my certificate and never look back.
[. . . ] The absolute-Brownian motion or absolute random-walk model
must be abandoned as absurd."
The random-walk paradigm was actually introduced to the economic community by Samuelson's work as the geometric Brownian motion model, providing a geometric random-walk dynamics of the price, which is log-normally distributed, with a normal distribution of returns:
r_τ(t) ∼ N(µτ, σ√τ),   r_τ(t) = log(p_{t+τ}) − log(p_t)    (1.6)
In order to provide evidence that the price behavior is substantially not predictable, supporting the random-walk hypothesis, figure (1.3) compares the performance of a real stock with the simulation of a suitable random walk.
Figure 1.3. On the left: price time series of the Vodafone (VOD) stock in the 110th trading day of the year 2002. On the right: comparison with the consistent random walk p_{t+1} = p_t + N(µ, σ), where µ = −1.2 · 10⁻⁵ is the mean linear return (1.1) of Vodafone in the case considered and σ = 0.02 is the corresponding dispersion. The detailed definition of the consistent random walk and its meaning can be found in section (4.3).
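For completeness, the additive walk of the right panel can be simulated in a few lines (a sketch of mine, using the µ and σ quoted in the caption; the actual procedure of section (4.3) may differ in its details):

```python
import numpy as np

def consistent_random_walk(p0, mu, sigma, n_steps, seed=None):
    """Additive walk p_{t+1} = p_t + N(mu, sigma), with mu and sigma matching the
    empirical mean and dispersion of the linear returns of the real stock."""
    rng = np.random.default_rng(seed)
    steps = rng.normal(mu, sigma, size=n_steps)
    return p0 + np.concatenate(([0.0], np.cumsum(steps)))

# mu and sigma quoted in figure 1.3 for Vodafone (VOD); p0 is arbitrary here
path = consistent_random_walk(p0=1.0, mu=-1.2e-5, sigma=0.02, n_steps=5000, seed=0)
print(path[:5])
```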
1.3 Stylized Facts
The geometric Brownian motion model circumvents the difficulties of the absolute random-walk model but still has several drawbacks, summarized in the empirical evidence known as Stylized Facts (SF) [2, Chap. 5] [12]:
• Fat-tailed empirical distribution of returns: very large price fluctuations are more likely than in a Gaussian distribution (figure (1.4)).
Figure 1.4. Empirical probability density function of BNP Paribas (BNPP.PA) unnormalized logarithmic returns (1.3) over a period of time τ = 5 minutes. The graph is computed by sampling a set of tick-by-tick data from 9:05 am till 5:20 pm between January 1st, 2007 and May 30th, 2008, i.e. 356 days of trading. Continuous and dashed lines are respectively Gaussian and Student-t fits. Source: Chakraborti et al., Econophysics: Empirical Facts and Agent-Based Models [13].
• Absence of simple arbitrage: the sign of the next price variation is unpredictable on average, namely ⟨r_τ(t) r_τ(t + T)⟩ is substantially zero. Figure (1.5) reports the autocorrelation function of the returns of the DAX index² on the scale τ = 15 minutes. It is noteworthy that up to a lag time of 53′′ the correlation is positive, whereas up to 9.4′ there is anti-correlation; nevertheless it is really weak.
² The DAX index is a blue-chip stock market index consisting of the 30 major German companies trading on the Frankfurt Stock Exchange. According to the New York Stock Exchange (NYSE), a blue chip is stock in a corporation with a national reputation for quality, reliability and the ability to operate profitably in good times and bad [60].
Figure 1.5. Autocorrelation of the return time series of the DAX index. Returns are evaluated on a period of τ = 15 minutes and the autocorrelation function (1.5) is plotted against the lag time T = t − t′. Error bars correspond to a confidence interval of 3σ and the continuous line is the fit. Source: S. Dresdel (2001), "Modellierung von Aktienmarkten durch stochastische Prozesse", Diplomarbeit, Universitat Bayreuth [9].
• Volatility Clustering: intermittent behavior of price fluctuations, regardless of their sign.
Figure 1.6. Return time series of a stock of the New York Stock Exchange (NYSE) from 1966 to 1998. At the top, linear returns (1.1) r_τ(t) = p_{t+τ} − p_t are reported, whereas at the bottom logarithmic returns (1.3) are plotted. Mainly from the lower plot it is evident that price changes tend to be clustered, i.e. to move coherently regardless of the sign. Source: Cristelli M., Pietronero L. and Zaccaria A. (2001): Critical Overview of Agent Based Models for Economics [12].
Even if the efficiency condition, ⟨r_τ(t) r_τ(t + T)⟩ negligible, is substantially satisfied, non-linear correlations of absolute ⟨|r_τ(t)||r_τ(t + T)|⟩ and squared ⟨r_τ²(t) r_τ²(t + T)⟩ returns are still present due to volatility clustering [13]:
Mandelbrot (1963) [6]: "Large changes tend to be followed by
large changes of either sign and small changes tend to be followed
by small changes" (figure (1.6)).
Figure 1.7. Autocorrelation function (1.5) of BNP Paribas (BNPP.PA) absolute logarithmic returns (1.3), over periods τ = 1 and 5 minutes, as a function of the lag time. Source: Chakraborti et al., Econophysics: Empirical Facts and Agent-Based Models [13].
Absolute or squared returns exhibit a long-range, slowly decaying autocorrelation, compatible with a stochastic process whose increments are uncorrelated but not independent (figures (1.7) and (1.8)).
Figure 1.8. Autocorrelation function of τ = 1 minute returns, squared returns and absolute returns of the Vodafone (VOD) stock in the 110th trading day of the year 2002 (the price time series is shown on the left in figure (1.3)). The lag time T is in 1-minute units.
1.4 Technical trading
Technical analysis is a method of forecasting price movements using past prices. A
leading technical trader defines his field:
Pring (2002) [14]: "The technical approach to investment is essentially
a reflection of the idea that prices move in trends that are determined
by the changing attitudes of investors toward a variety of real and
psychological forces" .
Here real and psychological forces should be understood as exogenous and endogenous information: economic and political news on one side, and the past price series, interpreted as technical patterns, on the other.
1.4.1 Main assumptions and Skepticism
Among traders the knowledge of random-walk theory is rather widespread, so the motivations underpinning technical analysis seem even more peculiar:
1. the market discounts everything: technical traders believe that the price itself is the only information set θ_t needed to make decisions. The price at the present time reflects all the possible causes of its future movements.
2. price moves in trends: the trend is the "behaviour" of the price time series; it can be bullish, bearish or sideways and it is more likely to persist than to end.
3. history repeats itself: if a particular kind of figure or pattern has anticipated the same bullish or bearish behavior in the past, technical analysis argues it will happen again. Investors react in the same way to similar conditions.
These assumptions are in open contrast with the EMH, as they rely directly on the price time series as it appears, instead of on the unknown stochastic process generating the price movements.
Despite the widespread use of technical instruments among traders, the academic community tends to be skeptical about technical analysis, mainly for these reasons:
• Acceptance of the EMH
• Linguistic and methodological barriers
Assuming the EMH and the full rationality of agents, no speculative opportunities should be present in the market. Operators would base their investment decisions only on the market information set θ_t, i.e. only "fundamentalists" would be present. However, especially in the case of bubbles and crashes [58], the influence of fundamental data is not so strong as to rule out the possibility of other, possibly endogenous, influences, i.e. the presence of investors who rely on past price histories in order to make their investment decisions: "technical traders" (or simply "chartists") [2, Chap. 8].
On the other side, the linguistic barrier can be illustrated by contrasting this statement in technical jargon [15]:
The presence of clearly identified Supports and Resistance levels,
coupled with a one-third retracement parameter when prices lie between
them, suggest the presence of strong buying and selling opportunities in
the near term.
with this one:
The magnitude and decay patterns of the first twelve autocorrelations
and the statistical significance of the Box-Pierce Q-Statistic suggest the
presence of a high frequency predictable component in stock returns.
The last barrier I mention is of a methodological nature: technical analysis is primarily visual, employing the tools of geometry and pattern recognition, whereas quantitative finance is mainly algebraic and numerical [16].
1.4.2 Feed-back Effect and Common Figures
One of the main reasons why technical analysis should work is that a huge number of investors rely on it:
Gehrig and Menkhoff (2003) [17]: "[...] technical analysis dominates foreign exchange and most foreign exchange traders seem to be chartist now".
This mass behavior results in a feedback effect of the investors' own decisions: if a common figure is known in the literature to anticipate a specific trend, then even if the future price movement would have gone in the opposite direction, the huge amount of capital moved affects the price history, making it fulfil the investors' expectations.
In this perspective, I briefly review two of the main technical figures³.
• Moving Average: P_M(t) is the average of the past n price values, defined as
P_M(t) = Σ_{k=0}^{n} w_k P(t − k)
where the w_k are the weights. The moving average is an example of a technical indicator, as it signals the inversion of the trend when it crosses the price graph. In technical words, P_M(t) acts as a dynamical support in a bullish trend and as a resistance during a bearish one (figure (1.9)); a minimal code sketch of this indicator is given at the end of this section.
³ There are more technical figures that could be mentioned, such as Head-and-Shoulders, Inverse-Head-and-Shoulders, Broadening tops and bottoms and so on [18]. Their definitions are slightly more involved than the two mentioned in the text and will not be discussed in this thesis.
Figure 1.9. Moving average of the price time series of the Unilever (ULVR) stock in the 66th trading day of the year 2002. The average is computed with constant weights w_k over the last 5 minutes of trading. The two squared points represent (in the author's opinion) respectively selling and buying signals.
• Supports and Resistances: these levels represent local minima and maxima, and the price is expected to be more likely to bounce off them than to cross them. In investors' psychology, at a Support level an investor relying on this indicator believes that the demand is strong enough to overcome the supply, preventing a further decrease of the price, and vice versa for a Resistance level (figure (1.10)).
Figure 1.10. Example of support and resistance levels for the Eli Lilly & Co. (LLY) stock, traded at the New York Stock Exchange (NYSE) on February 2nd, 2000. Source: stockcharts.com [61].
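As anticipated in the discussion of the moving average, here is the minimal code sketch of that indicator (my own illustration, with constant weights w_k = 1/n; reading the price/moving-average crossings as buy or sell signals is a simplification of how the figure is used in practice):

```python
import numpy as np

def moving_average(prices, n):
    """P_M(t): average of the last n prices with constant weights w_k = 1/n."""
    return np.convolve(np.asarray(prices, dtype=float), np.ones(n) / n, mode="valid")

def crossing_signals(prices, n):
    """Indices (relative to the aligned series) where the price crosses its n-point
    moving average: an upward crossing is read as a buy signal, a downward one as a sell."""
    pm = moving_average(prices, n)
    p = np.asarray(prices, dtype=float)[n - 1:]      # align price with P_M(t)
    above = p > pm
    return np.where(above[1:] != above[:-1])[0] + 1  # positions where the side changes

prices = 100.0 + np.cumsum(0.1 * np.random.randn(500))
print(crossing_signals(prices, n=20)[:10])
```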
Chapter 2
Pattern Recognition
In this chapter the main features of the theory of Pattern Recognition are reviewed, in order to provide a broad context for the algorithm that will be adopted for the clustering of financial time series.
2.1 From the Iris Dataset to Economic Taxonomy
The term Pattern Recognition (PR) refers to the task of assigning an object to the correct class based on measurements about the object [19]. The objects to be recognized, the measurements and the possible classes can be almost anything, so there are very different PR tasks.
A spam (junk-mail) filter, a recycling machine, a speech recognizer or an Optical Character Recognition (OCR) protocol are all Pattern Recognition systems, and they play a central role in everyday life.
The beginning of the modern discipline probably dates back to the paper by R.A. Fisher, "The use of multiple measurements in taxonomic problems" [20, 1936]. He considered as an example the Iris dataset, collected by E. Anderson in 1935, which contains sepal width, sepal length, petal width and petal length measurements from 150 irises belonging to three different sub-species (figure (2.1)).
Figure 2.1. Correct classification of the Iris dataset: 150 observations of sepal and petal width and length. The irises belong to three different subspecies: Setosa, Versicolor and Virginica. Classification is performed through a 5-Nearest-Neighbor algorithm with 30 training observations (see section (2.2)), assigning each new Iris observation to the subspecies most common among its 5 nearest neighbors.
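A classification of this kind can be reproduced in a few lines; the sketch below (illustrative only, assuming scikit-learn and its bundled copy of the Iris data) uses 30 training observations and a 5-nearest-neighbour rule as in the figure:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()                       # 150 observations, 4 features, 3 subspecies
rng = np.random.default_rng(0)
train = rng.choice(len(iris.data), size=30, replace=False)
test = np.setdiff1d(np.arange(len(iris.data)), train)

knn = KNeighborsClassifier(n_neighbors=5).fit(iris.data[train], iris.target[train])
print("fraction of correctly classified irises:",
      knn.score(iris.data[test], iris.target[test]))
```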
In order to provide a basis for comparison with current developments and applications of the theory, we can consider Economic Taxonomy, which is a system of classification of economic activity, including products, companies and industries [25] [64].
In 1999 Mantegna [21] considered the portfolio of stocks used to compute the
Standard and Poor’s 500 (S&P 500) index and the 30 stocks considered in the Dow
Jones Industrial Average (DJIA) over the time period from July 1989 to October
1995 and introduced a method for finding a hierarchical arrangement of stocks traded
in financial markets.
He compared the synchronous time evolution of pairs (i, j) of daily stock prices through the correlation coefficient ρ_ij of the logarithmic returns, defined in (1.3), evaluated over a period of τ = 1 trading day,
r^{stock=i}_{τ=1 day}(day = t) ≝ r_i
and then, defining an appropriate metric¹ d_ij = d_ij(ρ_ij), he quantified the degree of similarity between stocks as:
ρ_ij = ( ⟨r_i r_j⟩ − ⟨r_i⟩⟨r_j⟩ ) / √( (⟨r_i²⟩ − ⟨r_i⟩²)(⟨r_j²⟩ − ⟨r_j⟩²) )    (2.1)
d_ij = √( 2(1 − ρ_ij) )    (2.2)
Having defined the distance matrix² d_ij, he used it to determine the minimal spanning tree (MST) [22] of the stocks in the portfolio.
To provide a simple description of the constructive procedure defining the MST, I will refer to figure (2.2), from Mantegna's work, which describes the MST of the portfolio of stocks considered in computing the DJIA index [62].
Stocks are labelled by their official tick symbols, whose coding can be found at www.forbes.com; recent updates of the labels can be found at http://en.wikipedia.org/wiki/List_of_S%26P_500_companies. The MST of a set of n elements is a graph with n − 1 links [23]. Here the nodes are represented by the stocks and the links between them are weighted by the d_ij's.
In building the MST, one first fills a list, sorted in ascending order, with the non-diagonal elements of the distance matrix d_ij:
{d_1st, d_2nd, d_3rd, ..., d_[n(n−1)/2]th}   where   d_1st < d_2nd < d_3rd < ... < d_[n(n−1)/2]th
then the two nearest stocks, here Chevron (CHV) and Texaco (TX), are added as a start:
d_1st = d_CHV−TX = 0.949    (2.3)
The growth of the tree follows the aforementioned list: d_2nd = d_TX−XON = 0.962, so the Exxon company (XON) is added to the MST and linked to the TX stock.
¹ In section (2.5) I will stress further the concept of metric/distance and its role in the theory.
² By definition d_ij is a symmetric matrix with d_11 = ... = d_nn = 0, so that only n(n−1)/2 entries are relevant and need to be computed.
Figure 2.2. Minimal spanning tree connecting the 30 stocks used to compute the Dow
Jones Industrial Average (DJIA) in the period July 1989 to October 1995. The 30
stocks are labeled by their tick symbols. See text for details. The red oval (author’s
graphical modification) encloses Chevron and Texaco stocks, the nearest stocks. Texaco
was acquired by Chevron on October 9, 2001. Original source: R.N. Mantegna (1999),
Hierarchical structure in financial markets.
At the next step, d_3rd = d_KO−PG = 1.040, and these two stocks (Coca-Cola and Procter & Gamble) are both added to the tree because neither of them has been included yet.
The MST keeps growing in this way, discarding d_kth = d_uv if and only if both the u and v stocks are already nodes of the tree.
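The whole chain, correlation coefficient (2.1), distance (2.2) and minimal spanning tree, can be sketched as follows (my own illustration; scipy's minimum_spanning_tree routine replaces the explicit ordered-list procedure described above, but builds the same tree):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def correlation_distance(returns):
    """returns: (n_stocks, n_days) array of daily log-returns r_i.
    Implements rho_ij (2.1) and d_ij = sqrt(2 (1 - rho_ij)) (2.2)."""
    rho = np.corrcoef(returns)
    return np.sqrt(np.clip(2.0 * (1.0 - rho), 0.0, None))

def mst_edges(returns):
    """The n_stocks - 1 links (i, j, d_ij) of the minimal spanning tree, shortest first."""
    tree = minimum_spanning_tree(correlation_distance(returns)).tocoo()
    return sorted(zip(tree.row, tree.col, tree.data), key=lambda e: e[2])

# toy data: 30 hypothetical stocks observed over 1500 trading days
r = 0.01 * np.random.randn(30, 1500)
print(mst_edges(r)[:3])   # the shortest links, analogous to the CHV-TX link above
```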
Looking back at (2.3) and at the red oval in figure (2.2), the following news sounds less surprising:
NEW YORK, October 3, 2001 /PRNewswire/ – Standard & Poor's [...] will make the following changes in the S&P 500, [...] after the close of trading on Tuesday, October 9, 2001 [...] Equity Office Properties (NYSE: EOP) will replace Texaco Inc. (NYSE: TX) in the S&P 500 Index. Texaco is being acquired by S&P 500 Index component Chevron (NYSE: CHV) in a transaction expected to close on that date. [63]
The logic behind the minimal spanning tree is that of an arrangement of elements
which selects the most relevant connections of each element of the set, based on the
distance matrix dij.
With this MST Mantegna provided a taxonomy of the well defined group of stocks
considered.
The link between Mantegna's work and Fisher's pioneering work lies in the study of the meaningfulness of this taxonomy.
Mantegna compared his results with the reference grouping of stocks provided by Forbes [65], which assigns each stock to one of 12 business sectors and 51 industries (figure (2.3)).
Figure 2.3. Minimal spanning tree of 116 S&P 500 stocks returns. Data extend from the beginning of 1982 to the end of 2000. Links are weighted according to Mantegna's correlation-based distance (2.2). Business sectors are indicated according to Forbes [65]. Source: Onnela et al. (2003), Dynamics of market correlations: Taxonomy and portfolio analysis [26].
In assessing the meaningfulness of the taxonomy provided by Mantegna's method,
the classification of Forbes represents a reference in the same way in which the three
subspecies, Setosa, Virginica and Versicolor, were the benchmark for the classification
of the irises.
Of course, while there may be common agreement on how many and which the iris sub-species are, when coping with economic taxonomy one could alternatively have used different kinds of classification, e.g. the Global Industry Classification Standard (GICS) [24]: when dealing with the classification of real data there is often neither a unique nor a proper way to classify. The classification system makes the difference.
2.2 Supervised and Unsupervised Learning and Classification
Usually most of the data describing objects is useless for classification, so the classification task is preceded by a feature extraction step that selects only those features that best represent the data for classification.
The classifier takes as input the feature vector x of the object to be classified and assigns the feature vector (i.e. the object) to the class that is the most appropriate one.
x_iris = (sepal length = 5.9, sepal width = 4.2, petal width = 3.0, petal length = 1.5)
x_iris → Versicolor subspecies
In this thesis we shall deal with statistical pattern recognition, in which the classes and the objects within the classes are modeled statistically.
Formally, feature vectors such as x_iris belong to a feature space F and classes are denoted by {ω_1, ω_2, ..., ω_c}.
The classifier α can be thought of as a mapping from the feature space to the set of possible classes:
α : F → {ω_1, ω_2, ..., ω_c},   α(x) = ω_k
The classifier is usually a machine or a computer; for this reason the theory is also known as machine learning.
Depending on the task and on the data available, we can broadly distinguish two kinds of learning: supervised and unsupervised, or clustering.
• In supervised classification, examples of correct classification are presented to the classifier as a training set
D_i = {x_1, ..., x_{n_i}}   of class ω_i
where n_i is the number of training samples from the class ω_i. Based on these prototypes, the goal of the classifier is to infer the class of a never-seen object.
• In Clustering, there is no explicit teacher nor training samples. The classification of the feature vectors must be based only on the similarity between them; whether any two feature vectors are similar depends on the application.
2.3 Bayesian learning
2.3.1 Bayes Rule
Let's begin with a general description of Bayesian reasoning [29]: consider the universe of events Ω, a measured event E ∈ Ω and a complete class of hypotheses H_i that "explain" E. By definition, these hypotheses must be exhaustive and mutually exclusive:
H_i ∩ H_j = ∅  (i ≠ j),    ∪_{i=1}^{n} H_i = Ω
The conditional probability of the hypothesis H_i given the measurement of E is:
P(H_i|E) = P(E|H_i) P(H_i) / P(E)
and, by the complete-class property satisfied by the H_i's, the probability of the event E can be decomposed over the entire set {H_i}_{i=1}^{n} of classes:
P(H_i|E) = P(E|H_i) P(H_i) / Σ_j P(E|H_j) P(H_j)    (2.4)
This Bayes Rule, introduced by Thomas Bayes (1702-1761) to provide a solution to a problem of inverse probability³, was presented in "An Essay towards solving a Problem in the Doctrine of Chances", read at the Royal Society in 1763, after Bayes's death [27]. His definition of probability was stated as follows:
T. Bayes - "The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon its happening" [28].
Let's analyze the terms in the Bayes Rule (2.4) [29]:
• P(H_i) is the initial or prior probability, namely the probability of the hypothesis H_i conditioned on all preliminary hypotheses, excluding the occurrence or non-occurrence of E.
• P(E|H_i) is called the likelihood and is a measure of how likely E is in the light of H_i. In terms of cause and effect, it expresses how easily the cause H_i may produce the effect E.
• P(H_i|E) is simply the final or posterior probability, namely the probability of H_i re-evaluated in the light of the fact that E has occurred.
3
The "inverse" probability is the approach which endeavours to reason from observed events to
the probabilities of the hypotheses which may explain them, as distinct from "direct" probability,
which reasons deductively from given probabilities to the probabilities of contingent events [66].
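As a minimal numerical illustration of (2.4) (mine, not taken from the thesis), consider a complete class of three hypotheses with given priors and likelihoods:

```python
import numpy as np

def posterior(priors, likelihoods):
    """Bayes rule (2.4): P(H_i | E) for a complete class of hypotheses H_i.

    priors:      P(H_i), summing to one
    likelihoods: P(E | H_i)
    """
    joint = np.asarray(likelihoods, dtype=float) * np.asarray(priors, dtype=float)
    return joint / joint.sum()   # the denominator is P(E) = sum_j P(E|H_j) P(H_j)

# three exhaustive, mutually exclusive hypotheses (values chosen only for illustration)
print(posterior(priors=[0.5, 0.3, 0.2], likelihoods=[0.1, 0.7, 0.2]))
```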
Bayesian probability theory can be used to represent degrees of belief in uncertain propositions. Cox (1946) [30] and Jaynes (2003) [31] showed that if one has to represent numerically the strength of one's beliefs about the world, then the only reasonable and coherent way of manipulating these beliefs is to have them satisfy the rules of probability, such as the Kolmogorov axioms [32], together with the Bayes Rule.
In order to motivate the use of the rules of probability to encode degrees of belief, there is also a game-theoretic result in the form of the
Dutch Book Theorem [33]: if you are willing to accept bets with odds based on your degrees of confidence, then unless your beliefs are coherent, in the sense that they satisfy the rules of probability, there exists a set of simultaneous bets (called a "Dutch Book") which you will accept and which is guaranteed to lose you money, no matter what the outcome. The only way to ensure that no Dutch Book can be made against you is to have degrees of belief that satisfy the Bayes rule and the other rules of probability theory.
2.3.2 Bayesian Model Selection
From the Bayes rule a framework for machine learning can be derived. Here we adopt the term model as a synonym for class because in this way the statistical interpretation of the classification problem can be stressed clearly: to find the statistical model m that best describes the data.
Consider a set of N data points, D = {x_1, x_2, ..., x_N}, belonging to some model m ∈ M, the universe of models. The machine (i.e. the classifier) starts with some prior beliefs over models, such that
Σ_m P(m) = 1
In this respect a model m is represented by a probability distribution over data points, i.e. P(D|m).
The classification/model-selection goal is achieved by computing the posterior distribution over "all" m ∈ M:
P(m|D) = P(D|m) P(m) / P(D)    (2.5)
In almost all cases the space M is a huge infinite-dimensional space, so some kind of approximation is required; this point will be considered later when discussing Monte Carlo methods.
However the principle is simple: the model m which results in the highest posterior P(m|D) over our dataset D is selected as the best model for our data.
Often models are defined by a parametric distribution, so that P(D|m_k) can actually be written in terms of a P(D|θ_k, m_k), θ_k being the set of parameters defining the model m_k. In the case of Gaussian models, for example:
P(D|θ_k, m_k) = N(D|µ_k, Σ_k)
Given the prior preferences P(m_k) over the models, the only term necessary to compare models is the marginal likelihood term:
P(D|m_k) = ∫ P(D|θ_k, m_k) P(θ_k|m_k) dθ_k    (2.6)
where:
• P(θ_k|m_k) is the prior over the parameters of m_k, namely the probability that its parameters take the values θ_k.
• P(D|θ_k, m_k) is the likelihood term, depending only on a single setting of the parameters of the model m_k.
• the integral extends over all possible values of the parameters of m_k.
The interpretation of the marginal likelihood, sometimes called the evidence for the model m_k, is very interesting: it is the probability of generating the data set D from parameters that are randomly sampled from the prior P(θ_k|m_k).
Usually, in order to evaluate the marginal likelihood term (2.6), a point estimate is chosen that selects only one setting θ̂_k of the parameters of the model m_k. There are two natural choices:
θ̂_k^MAP = argmax_{θ_k} { P(D|θ_k, m_k) P(θ_k|m_k) }
θ̂_k^ML = argmax_{θ_k} { P(D|θ_k, m_k) }
The estimate θ̂_k^MAP is known as the Maximum A Posteriori estimate for the marginal likelihood term, while θ̂_k^ML is the frequentist Maximum Likelihood estimate.
There is a deep difference between these two approaches: the ML estimation often results in overfitting, namely the preference for models more complex than necessary, due to the fact that a model m_k with more parameters will have a higher maximum of the likelihood P(D|θ̂_k^ML, m_k).
Instead, the complete marginal likelihood term P(D|m_k) (2.6) may decrease as the model becomes more complicated: in a more complicated model, i.e. one with more parameters, sampling random parameter values θ_k may generate a wider range of possible N-dimensional datasets D_N, but since the probability over data sets has to integrate to 1,
∫ P(D_N|m_k) dD_N = 1,
spreading the density to allow for more complicated data sets necessarily results in some simpler data sets having lower density under the model m_k [34].
The decrease of the marginal likelihood as additional parameters are added has been called the automatic Occam's Razor [35] (figure (2.4)).
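The effect can be made concrete with a small numerical sketch (my own, under simplifying assumptions): two Gaussian models for one-dimensional data, a "simple" one with a narrow prior on the mean and a "complex" one with a very broad prior; the marginal likelihood (2.6) is approximated by averaging the likelihood over parameter values sampled from the prior.

```python
import numpy as np
from scipy.stats import norm

def log_marginal_likelihood(data, prior_width, n_samples=20000, seed=0):
    """Monte Carlo estimate of (2.6): the likelihood P(D | mu, m) averaged over
    mu drawn from the prior N(0, prior_width^2); unit observation noise is assumed."""
    rng = np.random.default_rng(seed)
    mus = rng.normal(0.0, prior_width, size=n_samples)
    log_lik = np.array([norm.logpdf(data, loc=mu, scale=1.0).sum() for mu in mus])
    m = log_lik.max()
    return m + np.log(np.mean(np.exp(log_lik - m)))   # numerically stable log-average

data = np.random.default_rng(1).normal(0.5, 1.0, size=50)
print("simple model  (prior width  1):", log_marginal_likelihood(data, prior_width=1.0))
print("complex model (prior width 50):", log_marginal_likelihood(data, prior_width=50.0))
# the broad prior spreads its probability over many possible data sets,
# so its evidence for this simple data set is lower: the automatic Occam's razor
```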
2.4 Clustering
Coping with the problem of finding patterns in a time series data set, I have dealt with a Clustering problem. In this section the main features of this theory are outlined, together with the most relevant methods and techniques.
Figure 2.4. Pictorial one-dimensional representation of the marginal likelihood, or evidence, (2.6) as a distribution over data sets D of a given size N. The normalization to 1 acts as a penalization, in the sense that very complex models, which can account for many datasets, only achieve modest evidence, whereas simple models can reach high evidence, but only for a limited set of data. Source: Zoubin Ghahramani, Unsupervised Learning.
2.4.1 Definition and Distinctions
Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships [36]. Namely, the goal of clustering is to identify structures in an unlabeled data set by objectively organizing it, so that objects belonging to the same cluster are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Clustering can be regarded as a form of classification in that it creates a labeling of objects with class (cluster) labels; however, it derives these labels only from the data.
Clustering methods had been classified in five major categories [37]:
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods
Let us review the main features and distinctions among these methods: the first
distinction is whether the set of clusters is nested or unnested, or in other words
hierarchical or partitional.
A partitional clustering is a division of the objects into subsets such that each data
object is in at least one subset. The partition is crisp, or hard, if each object belongs
to exactly one cluster, or fuzzy if an object is allowed to belong to more than one cluster
to a different degree.
Instead, a hierarchical clustering method works by grouping data objects into a
tree of clusters. A hierarchical clustering method can be agglomerative, if it starts
by placing each object in its own cluster and then merges clusters into larger and larger
clusters until a termination condition is satisfied, or divisive, if it acts by
splitting clusters. A pure hierarchical clustering suffers from its inability to perform
adjustments once a merge or split decision has been executed.
The general idea of density-based methods is to continue growing a cluster as long as
the density (number of objects or data points) in the "neighborhood" exceeds some
threshold.
Grid-based methods quantize the object space into a finite number of cells that form
a grid structure on which all of the operations for clustering are performed. The
procedure is iterative and acts varying the size of the cells, corresponding to some
kind of resolution of the objects.
Finally, model-based methods assume a model for each cluster and attempt to best
fit the data to the assumed model.
2.4.2 Time Series Clustering
Time series are dynamic data, namely they can be thought of as objects whose features
comprise values changing with time.
In this thesis we will deal with financial stock-price time series and in this respect
clustering can be an econometric tool for studying dependencies between variables
[38]. It finds applications for example in:
• identifying areas or sectors for policy-making purposes,
• identifying structural similarities in economic processes for economic forecast-
ing,
• identifying stable dependencies for risk management and investment manage-
ment.
Coming back to the clustering methods, time series clustering algorithms usually either
modify the existing algorithms for clustering static data so that time
series can be handled, or convert the time series data into the form of static data. We
can broadly distinguish three main approaches:
• Raw-data-based approaches: these include methods working either in the
time or the frequency domain; the major modification with respect to static-data
algorithms lies in replacing the distance/similarity measure for static data with
an appropriate one for time series.
• Feature-based approaches: this choice is usually adopted when dealing
with noisy time series or data sampled at fast sampling rates.
• Model-based approaches: each time series is considered to be generated
by some kind of model or by a mixture of underlying statistical distributions.
The similarity here relies on the structure of the models or on the remaining
residuals after fitting the model.
2.5 Distance and Similarity
Given two time series, the correlation coefficient ρ_ij can be interpreted as a measure
of the existence of some causal dependency between the variables, or of their dependence
on some common exogenous factor, so it can be exploited for clustering purposes. However,
correlation can be invoked only if the variables follow similar and concurrent time patterns.
In order to consider not only common causation of random variables and co-
movements of time series, but also similarities in their structure (similar patterns
may evolve at different speeds and at different time scales), it is necessary to adopt the
broader concept of similarity [36].
The function used in cluster analysis to measure the similarity between the two objects
being compared can take various forms, for example (a minimal code sketch follows the list):
• Euclidean distance: let x_i and x_j be two τ-dimensional time series; d_E is
defined by
d_E = \sqrt{\sum_{k=1}^{\tau} (x_{ik} - x_{jk})^2}
• Pearson's correlation coefficient and related distances:
\rho_{ij} = \frac{\sum_{k=1}^{\tau} (x_{ik} - \mu_i)(x_{jk} - \mu_j)}{S_i S_j}
where
\mu_i = \frac{1}{\tau} \sum_{k=1}^{\tau} x_{ik}
is the mean and
S_i = \sqrt{\sum_{k=1}^{\tau} (x_{ik} - \mu_i)^2}
is the scatter.
Two correlation-based distances can be derived [39]:
d^1_{\rho} = \left(\frac{1 - \rho}{1 + \rho}\right)^{\beta}, \quad \beta > 0
d^2_{\rho} = \sqrt{2(1 - \rho)}
The last one is the one adopted by Mantegna [21] in the work cited in section
(2.1).
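As an illustration only (not part of the original analysis; the function names are mine), a minimal Python sketch of the Euclidean distance and of the correlation distance d²_ρ = √(2(1 − ρ)) could read:

import numpy as np

def euclidean_distance(x_i, x_j):
    """d_E between two tau-dimensional time series."""
    x_i, x_j = np.asarray(x_i, float), np.asarray(x_j, float)
    return np.sqrt(np.sum((x_i - x_j) ** 2))

def correlation_distance(x_i, x_j):
    """d2_rho = sqrt(2 (1 - rho)), with rho the Pearson coefficient."""
    rho = np.corrcoef(x_i, x_j)[0, 1]
    return np.sqrt(2.0 * (1.0 - rho))

# Example on two short illustrative series
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.5, 2.1, 2.9, 4.2]
print(euclidean_distance(x1, x2), correlation_distance(x1, x2))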
These distances are computed directly in the space of time series, possibly after some
preprocessing.
Often, projecting time series into a space of distance-preserving transforms avoids a
part of the computational cost. Many projection techniques have been proposed,
such as the Fast Fourier Transform (FFT) [40], which is distance-preserving due
to the Parseval theorem, or the piece-wise linearization [41] which provides a linear
smoothing of the time series.
2.5.1 Information Theoretic Interpretation
When time series are modeled within a probabilistic framework, their values are thought
of as being sampled from an underlying probability distribution. Another approach to
quantify their similarity is based on the projection of the time series into the space of
their probability distributions.
This is a more abstract notion of similarity, as it allows one to compare series of different
lengths and with shapes that, though similar in their distributions, cannot be
directly matched.
Let x_1(t) and x_2(t) be two time series and assume at least the weak stationarity
condition, namely that the first two moments of their distributions do not depend on
the temporal index t and that their auto-covariance R(t, t + k) depends only on the time lag
k:
E[x(t)] = \mu_x
Var[x(t)] = E[(x(t) - E[x(t)])^2] = E[x^2(t)] - E[x(t)]^2 = \sigma_x^2
R(t, t + k) = E[x(t) x(t + k)] = R(k)
empirically estimated by [42]:
\hat{E}[x(t)] = \frac{1}{\tau} \sum_{t=1}^{\tau} x(t)
\hat{Var}[x(t)] = \hat{E}[x^2(t)] - \hat{E}[x(t)]^2 = \frac{1}{\tau} \sum_{t=1}^{\tau} x(t)^2 - \frac{1}{\tau^2} \left(\sum_{t=1}^{\tau} x(t)\right)^2
\hat{R}(t, t + k) = \hat{E}[x(t) x(t + k)] = \frac{1}{\tau} \sum_{t=1}^{\tau - k} (x(t) - \hat{E}[x(t)])(x(t + k) - \hat{E}[x(t + k)])
Let p(x) and q(x) be, respectively, the distributions of the values of x_1(t) and x_2(t); a
natural distance function is then the Kullback-Leibler divergence, defined as:
KL(p||q) = E\left[\log_2 \frac{p(x)}{q(x)}\right] = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}
which has a symmetric version:
d_{pq} = \frac{1}{2}\left[KL(p||q) + KL(q||p)\right]
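A minimal Python sketch of the symmetrized KL distance, assuming the two series have already been binned on a common support (the smoothing constant eps is my own addition, to avoid zero probabilities):

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p||q) in bits, for two histograms defined on the same bins."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return np.sum(p * np.log2(p / q))

def symmetric_kl(p, q):
    """d_pq = (KL(p||q) + KL(q||p)) / 2"""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Histograms of two synthetic series on common bins
x1, x2 = np.random.randn(1000), np.random.randn(1000) + 0.5
bins = np.linspace(-5, 5, 31)
p, _ = np.histogram(x1, bins=bins)
q, _ = np.histogram(x2, bins=bins)
print(symmetric_kl(p, q))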
The information theoretical interpretation of the KL-divergence is very interesting.
Let x(t) be a time series whose values are distributed according to an unknown
distribution p(x).
In order to transmit x(t) to a receiver we should encode it in some way, with the
intuitive criterion of encoding with fewer bits (quanta of information) those values
which occur more frequently in x(t).
Shannon's source coding theorem quantifies the optimal number of bits to be used
to encode a symbol occurring with probability p(x) as −log_2 p(x).
Using this number of bits per value, the expected coding cost is the entropy of the
distribution p(x), and it is the minimum achievable cost [43]:
H(p) = -\sum_x p(x) \log_2 p(x)
From a machine learning perspective, we could postulate a model for the values of x(t): let
it be denoted by the distribution q(x). In this case the expected coding cost, i.e. the
cross-entropy, is no longer optimal:
H(p, q) = -\sum_x p(x) \log_2 q(x)
with the coding inefficiency quantified precisely by the KL(p||q) divergence.
This example allows us to appreciate the prominent role of machine learning in
achieving efficient communication and data compression:
the better our model of the data, the more efficiently we can compress
and communicate new data [44]
I stress that there is no "natural" distance function: each distance implements a
specific concept of similarity, and the choice is strongly problem dependent.
Chapter 3
Monte Carlo Framework
Monte Carlo methods are a standard and often extremely efficient way of computing
complicated integrals over high dimensional or poorly-behaved functions [45]. The
idea of Monte Carlo calculation, however, is a lot older than the computer.
The name "Monte Carlo" is relatively recent: it was coined by Nicholas Metropolis in
1949, although the method was already known under the older name of "statistical sampling".
The history of the method dates back to 1946:
While convalescing from an illness in 1946, Stan Ulam was playing a
solitaire. It, then, occurred to him to try to compute the chances that a
particular solitaire laid out with 52 cards would come out successfully.
After attempting exhaustive combinatorial calculations, he decided to go
for the more practical approach of laying out several solitaires at random
and then observing and counting the number of successful plays [46].
This idea of selecting a statistical sample to approximate a hard combinatorial
problem by a much simpler problem is at the heart of modern Monte Carlo simulation.
In 1949, the young physicist Nick Metropolis published the first document on Monte
Carlo simulation with Stan Ulam [47], and a few years later he proposed the famous
Metropolis algorithm [48].
3.1 Motivations
Monte Carlo techniques are often applied to solve integration and optimization
problems; here are some examples:
• Bayesian inference and learning: intractable integration or summation
problems occur typically in Bayesian statistics.
– Normalization problems, i.e. computing the normalizing factor in Bayes'
theorem
P(m|D) = \frac{P(D|m) P(m)}{P(D)}
where
P(D) = \sum_{m \in M} P(D|m) P(m) \qquad (3.1)
– Computing the Marginal Likelihood, which is the integral of the likelihood
under the prior:
P(D|m_k) = \int P(D|\theta_k, m_k) P(\theta_k)\, d\theta_k
• Statistical Mechanics: here one usually needs to compute the partition
function Z of a system with states s and Hamiltonian H[s]:
Z = \sum_{\{s\}} \exp\left(-\frac{H[s]}{k_B T}\right)
where k_B is Boltzmann's constant and T denotes the temperature of the
system. Summing over all the configurations {s} is often prohibitive, so a
Monte Carlo sampling is necessary.
• Optimization: the goal is to extract the solution that minimizes some objective
function from a large set of feasible solutions.
In the clustering algorithm that will be introduced in the next chapters, I faced this
kind of problem when looking for the optimal splitting or merging of
clusters of time series, basing those operations on the maximization of a
likelihood function over the data.
3.2 Principles
All the Monte Carlo methods have the same general structure: given some probability
measure p(x) on some configuration space X, the goal is to generate many random
samples {x^{(i)}}_{i=1}^{N} from p [49].
These N samples can be used to approximate the target density with the following
empirical point-mass function:
p_N(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x^{(i)}) \qquad (3.2)
Aiming to evaluate the mean I(f) of some function f(x) of the configurations, one
can build the Monte Carlo estimate I_N(f) simply as the sample mean of f, provided
the configurations are sampled from p(x):
I(f) = \langle f \rangle_X = \int_X f(x) p(x)\, dx
I_N(f) = \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})
The MC estimate converges almost surely¹ (a.s.-lim_{N→∞}) [7] to I(f):
\text{a.s.-}\lim_{N \to \infty} I_N(f) = I(f)
¹ A sequence X_n of random variables defined on elementary events ω ∈ Ω converges almost surely to X if:
\lim_{n \to +\infty} X_n(\omega) = X(\omega) \quad \forall \omega \in \Omega \setminus \{\text{set of zero measure}\}
If the variance σ²_f of f (in the univariate case, for simplicity) is finite, then the Central
Limit Theorem (CLT) states the asymptotic behavior of the values of I_N(f) when
the sample size N approaches infinity:
I_N(f) \sim N(I(f), \sigma_I^2)
where the dispersion σ_I of the estimate I_N(f) behaves as the ordinary error of an
averaged variable:
\sigma_I = \frac{\sigma_f}{\sqrt{N}}
This strong form of convergence guarantees the effectiveness of Monte Carlo integration.
3.3 Static Methods
Static methods are those that generate a sequence of statistically independent
samples from the desired probability distribution π. Coming back to equation (3.2),
we see that the central problem is a sampling problem.
3.3.1 Rejection Sampling
If we know the target distribution p(x) up to a constant, we can sample from it by
sampling from another, easy-to-sample proposal distribution q(x) that satisfies:
p(x) \le M q(x), \quad M < \infty
using the accept/reject procedure described in the pseudo-code below:
Rejection Sampling:
i = 1
repeat:
    sample x(i) ~ q(x)
    sample u ~ U(0,1)
    if u < p(x(i)) / (M q(x(i))) then: accept x(i) and i ← i + 1
    else: reject
    end if
until i = N
Figure 3.1. Rejection sampling: Sample a candidate x(i)
and a uniform variable u. Accept
the candidate sample if uMq(x(i)
) < p(x(i)
), otherwise reject it. Source: Andrieu et al.
(2003) An Introduction to MCMC for Machine Learning.
A candidate x(i) is accepted only if the ratio between p(x(i)) and Mq(x(i)) exceeds the
u-threshold (figure (3.1)). This procedure reshapes q(x) into the target distribution
p(x) and does not depend on the absolute normalization of p(x) (any normalization
constant can be absorbed in M). The accepted x(i) turn out to be distributed
according to p(x) [50].
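As a hedged illustration of the procedure above, the following minimal Python sketch samples from a standard normal target known only up to a constant, with a uniform proposal on [−5, 5]; the bound M and the proposal are my own illustrative choices, not those adopted in the thesis:

import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):
    """Target known up to a constant: exp(-x^2 / 2)."""
    return np.exp(-0.5 * x ** 2)

q_density = 1.0 / 10.0   # uniform proposal on [-5, 5]
M = 10.0                 # chosen so that p_unnorm(x) <= M * q_density on [-5, 5]

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.uniform(-5.0, 5.0)      # candidate from the proposal
        u = rng.uniform(0.0, 1.0)
        if u < p_unnorm(x) / (M * q_density):   # accept/reject test
            samples.append(x)
    return np.array(samples)

print(rejection_sample(5))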
3.3.2 Hit-or-Miss Sampling: a numerical experiment
Now I will focus on a particular form of rejection sampling, with reference
to a specific situation I encountered during the analysis of bounces on Support
and Resistance levels.
Consider a target p(x) whose support ranges over the interval [0, δ].
Suppose you do not know the analytic form of p(x) but, for accepting/rejecting
purposes, you can rely on a sufficiently dense histogram depicting its profile.
Now suppose you are also able to find a bound for its values:
you know M such that: p(x) < M, ∀x ∈ [0, δ]
for example from a visual inspection of the histogram.
Figure (3.2) presents a computational experiment of Hit-or-Miss sampling. I defined
a data set of 10^5 random walk time series (of length 10^3):
p_{t+1} = p_t + N(0, 1), \qquad p_0 = 0 \qquad (3.3)
The expected mean absolute return of each random walk series can be computed as:
\delta = \int_{-\infty}^{\infty} |x|\, N_x(0, 1)\, dx = \sqrt{\frac{2}{\pi}} \simeq 0.8
where with N_x(0, 1) I mean the distribution of the increments of these random walks,
according to (3.3). This value has been adopted as the width of a strip [0, δ].
I let the random walks evolve until they got out of the strip, and then considered the
event of reaching the strip again. I measured the position of the first point re-entering
the strip, taking into account all the events of this kind occurring along the path of
each random walk, and computed the histogram over the whole data set.
The resulting histogram is an approximation of the target distribution p(x), so the
interval [0, δ] has been considered as the domain of definition for p(x). With the aim
Figure 3.2. Example of Hit-or-Miss sampling. Histogram bars refer to the distribution
of the position of the first points re-entering the [0, δ] strip from below. The simulation has been
carried out on a sample of N = 10^5 random walk histories, each of length L = 10^3. The
pdf was reconstructed via the procedure described in the text, through the extraction of
10^6 (x, y) pairs of points.
of sampling points x(i) from this unknown p(x), I adopted the so-called
Hit-or-Miss Sampling:
i = 1
repeat:
    sample x(i) ~ U(0, δ)
    sample y ~ U(0, M)
    if y < p̂(x(i)) then: accept x(i) and i ← i + 1
    else: reject
    end if
until i = N
where p̂(x(i)) denotes the histogram estimate of p at the point x(i).
The red points in figure (3.2) represent the profile of the distribution reconstructed by
this method, through the extraction of a sample of 10^6 (x, y) pairs of points.
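The following is a minimal, self-contained Python sketch of the same hit-or-miss scheme on a simplified version of the strip experiment (far fewer walks than in the text, and my own choice of binning); it is meant only to illustrate the accept/reject step against a histogram estimate p̂:

import numpy as np

rng = np.random.default_rng(1)
delta = np.sqrt(2.0 / np.pi)           # strip width, approximately 0.8

# Build a histogram estimate p_hat of the re-entry position in [0, delta]
# (a much smaller sample than in the text, for illustration only).
entries = []
for _ in range(2000):
    p, outside = 0.0, False
    for _ in range(1000):
        p += rng.normal(0.0, 1.0)
        if not (0.0 <= p <= delta):
            outside = True
        elif outside:                   # first point back inside the strip
            entries.append(p)
            outside = False
counts, edges = np.histogram(entries, bins=20, range=(0.0, delta), density=True)
M = counts.max() * 1.1                  # visual bound on the histogram profile

def p_hat(x):
    """Histogram estimate of the target density at x."""
    k = min(int((x / delta) * len(counts)), len(counts) - 1)
    return counts[k]

# Hit-or-miss: sample (x, y) uniformly in [0, delta] x [0, M], keep x if y < p_hat(x)
accepted = []
while len(accepted) < 1000:
    x = rng.uniform(0.0, delta)
    y = rng.uniform(0.0, M)
    if y < p_hat(x):
        accepted.append(x)
accepted = np.array(accepted)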
3.4 Dynamic Methods
The idea of dynamic Monte Carlo methods is to invent some kind of stochastic
process with state space X and p(x) as unique equilibrium distribution.
3.4.1 Markov Chains
A Markov Chain, after the Russian mathematician A.A. Markov, is one of the
simplest examples of a nontrivial, discrete-time and discrete-state stochastic process
[43].
Consider a random variable x^{(i)} which, at any discrete time/step i, may assume S
possible values x^{(i)} ∈ X = {x_1, x_2, ..., x_S}.
Indicate with j the state x_j and suppose that the generating process satisfies the
Markov property:
Prob(x^{(i)} = j_i \mid x^{(i-1)} = j_{i-1}, x^{(i-2)} = j_{i-2}, ..., x^{(i-k)} = j_{i-k}, ...) = Prob(x^{(i)} = j_i \mid x^{(i-1)} = j_{i-1}) \qquad (3.4)
namely every future state is conditionally independent of every prior state but the
present one; in other words, the process has memory "1".
Now let us restrict ourselves to time-homogeneous Markov Chains:
Prob(x^{(i)} = j \mid x^{(i-1)} = k) = p(j|k) \overset{def}{=} W_{jk}
which are characterized exclusively by their Transition Matrix W_{jk}, with properties:
• non-negativity: W_{jk} ≥ 0
• normalization: \sum_{j=1}^{S} W_{jk} = 1
In order to introduce the concept of invariant distribution, consider an ensemble of
random variables all evolving with the same transition matrix. The probability
Prob(x^{(i)} = j) \overset{def}{=} P_j(i)
of finding the random variable in state j at time i is given, due to the Markov
property (3.4), by:
P_j(i) = \sum_{k=1}^{S} W_{jk} P_k(i-1) \qquad (3.5)
i.e. the probability of being in j at time i is equal to the probability of having been in k
at i − 1, times the probability of jumping from k to j, summed over all the possible
previous states k. Using the matrix notation
P(i) = (P_1(i), P_2(i), ..., P_S(i))
we can write (3.5) as:
P(i) = W P(i-1) \implies P(i) = W^i P(0)
The relevant question, at this point, concerns the convergence of P(i) to some
limit and whether such a limit is unique.
Clearly, if lim_{i→∞} P(i) exists, it will be the invariant or equilibrium probability P^{inv}
that satisfies the eigenvalue equation:
P^{inv} = W P^{inv}
Two more definitions complete the theory (figure (3.3)):
Figure 3.3. Three examples of Markov Chains with 4 states. (a) A reducible MC where
state 1 is transient (never reached again, if left) whereas states 2, 3, 4 are recurrent (there
exists a finite probability to come back) and periodic with period 2: the probability to
come back in k steps is null unless k is a multiple of 2. (b) Period-3 irreducible MC. (c) Ergodic
and irreducible MC. In all examples p and q are supposed to be different from 0 and 1.
Source: Vulpiani et al., Chaos: From Simple Models To Complex Systems.
• Irreducible chain: a chain in which every state is accessible from any other. Formally
this means that there exists a k > 0 such that W^k_{ij} > 0 ∀i, j. The chain
is called reducible if this does not happen.
• Ergodic chain: a chain that is irreducible and whose states are ergodic.
Namely each of them, once visited, will be visited again by the chain, with a
finite mean recurrence time.
For this special class of Markov Chains, a Fundamental Theorem asserts the existence
and uniqueness of P^{inv}:
Fundamental Theorem of the Markov Chains 1. For an irreducible ergodic
Markov Chain, the limit
P(i) = W^i P(0) \to P(\infty) \quad \text{for } i \to \infty
exists, is unique and is independent of the initial distribution P(0). Moreover:
P(\infty) = P^{inv}, \qquad P^{inv} = W P^{inv} \qquad (3.6)
meaning that the limit distribution is invariant.
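As an illustration of the theorem (with a transition matrix of my own choosing), a minimal Python sketch can verify that iterating P(i) = W^i P(0) converges to the eigenvector of W with eigenvalue 1:

import numpy as np

# Column-stochastic transition matrix W_jk = Prob(j | k); illustrative values.
W = np.array([[0.9, 0.2],
              [0.1, 0.8]])

P = np.array([1.0, 0.0])          # initial distribution P(0)
for _ in range(200):              # P(i) = W^i P(0)
    P = W @ P
print("iterated:", P)

# Invariant distribution as the eigenvector of W with eigenvalue 1
vals, vecs = np.linalg.eig(W)
p_inv = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
p_inv = p_inv / p_inv.sum()
print("eigenvector:", p_inv)      # both prints should agree (2/3, 1/3)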
3.4.2 MCMC and Metropolis-Hastings Algorithm
Dealing with sampling problems, we are interested in constructing Markov Chains
for which the distribution we wish to sample from, given by p(x), x œ X, is invariant,
i.e. once reached, it never changes [51].
We restrict to homogeneous Markov Chains. A su cient but not necessary condition
to ensure that a particular p(x) is the desired invariant distribution is the following
detailed balance condition, which is a reversibility condition:
p(x(i)
)W(x(i≠1)
|x(i)
) = p(x(i≠1)
)W(x(i)
|x(i≠1)
) (3.7)
where x^{(i)} is the state of the chain at time i and W(x^{(i-1)}|x^{(i)}) is the jump probability.
This condition implies that p(x) is the invariant distribution of the chain;
indeed, summing both sides over the possible states at time i − 1:
\sum_{\{x^{(i-1)}\}} p(x^{(i-1)})\, W(x^{(i)} \mid x^{(i-1)}) = \sum_{\{x^{(i-1)}\}} p(x^{(i)})\, W(x^{(i-1)} \mid x^{(i)}) = p(x^{(i)}) \sum_{\{x^{(i-1)}\}} W(x^{(i-1)} \mid x^{(i)}) = p(x^{(i)})
where the last equality is based on the normalization condition.
These kinds of Markov Chains are called Monte Carlo Markov Chain (MCMC)
samplers: irreducible and aperiodic Markov chains that have the target distribution
as their invariant distribution [52]. The Metropolis-Hastings algorithm, [53] and [48],
is the most popular MCMC method.
An MH step with invariant distribution p(x) and proposal distribution q(x*|x) involves
sampling a candidate value x* given the current value x according to q(x*|x).
The Markov chain then moves to x* with acceptance probability
A(x, x^*) = \min\left\{1, \frac{p(x^*)\, q(x \mid x^*)}{p(x)\, q(x^* \mid x)}\right\} \qquad (3.8)
otherwise it remains at x. The pseudo-code below illustrates the main features of
the algorithm:
Metropolis-Hastings Algorithm:
Initialize x(0)
for i = 0 to N − 1 do:
    sample u ~ U(0,1)
    sample x* ~ q(x*|x(i))
    if u < A(x(i), x*) then: x(i+1) ← x*
    else: x(i+1) ← x(i)
    end if
end for
To show that the MH Markov chain converges (figure (3.4)), we need to ensure
irreducibility and aperiodicity:
• Irreducibility: it is sufficient that the support of q(·) includes the support of p(·).
In this way every state of the chain has a finite probability of being reached in a
finite number of steps.
• Aperiodicity: it follows from the fact that the chain always allows for rejection.
We note that the Metropolis-Hastings Algorithm reduces to the Metropolis
Algorithm in a straightforward way: the latter corresponds to the choice of a symmetric
proposal,
q(x^* \mid x) = q(x \mid x^*)
and consequently the acceptance ratio A(x, x*) reduces to:
A(x, x^*) = \min\left\{1, \frac{p(x^*)}{p(x)}\right\} = \min\{1, e^{-\beta \Delta E}\}
where ΔE = E(x*) − E(x) and the last equality holds in the statistical mechanics framework,
in which the target distribution is the Canonical Distribution
e^{-\beta E}, \qquad \beta = \frac{1}{T}
and the variables x and x* represent different configurations of a system immersed
in a thermal bath at temperature T.
To conclude, I stress two practically important properties of the Metropolis-Hastings
Algorithm:
• the target distribution needs to be known only up to a constant of proportionality;
• it is easy to simulate several independent chains in parallel.
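A minimal Python sketch of the algorithm, targeting the bimodal density of figure (3.4) with the same Gaussian proposal N(x^(i), 100) (i.e. standard deviation 10); since the proposal is symmetric, the q-ratio cancels and this is in fact a Metropolis update. The code below is an illustration of mine, not the implementation used later in the thesis:

import numpy as np

rng = np.random.default_rng(2)

def p_unnorm(x):
    """Bimodal target of figure (3.4), up to normalization."""
    return 0.3 * np.exp(-0.2 * x ** 2) + 0.7 * np.exp(-0.2 * (x - 10.0) ** 2)

def metropolis_hastings(n_steps, x0=0.0, proposal_std=10.0):
    x = x0
    chain = [x]
    for _ in range(n_steps):
        x_star = rng.normal(x, proposal_std)           # symmetric proposal
        a = min(1.0, p_unnorm(x_star) / p_unnorm(x))   # acceptance probability
        if rng.uniform() < a:
            x = x_star                                 # accept the candidate
        chain.append(x)                                # otherwise stay at x
    return np.array(chain)

samples = metropolis_hastings(5000)
print(samples.mean(), samples.std())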
Figure 3.4. Example of the Metropolis-Hastings algorithm. Bimodal target distribution
p(x) ∝ 0.3 e^{−0.2x²} + 0.7 e^{−0.2(x−10)²} and histogram of the MCMC samples at different
iteration points. The proposal distribution is Gaussian: q(x*|x^{(i)}) = N(x^{(i)}, 100). Plots
show progressive convergence after i = 100, 500, 1000, 5000 iterations. Source: Andrieu
et al. (2003) An Introduction to MCMC for Machine Learning.
Chapter 4
Memory Effects: Bounce Analysis
In a recent work by Garzarelli, Cristelli, Zaccaria, and Pietronero [57, 2012],
evidence of technical trading strategies has been presented. These strategies produce
detectable memory effects in the stock-price dynamics at various time scales.
My analysis began with the critical reproduction of their results concerning the
analysis of bounces on Support and Resistance levels, in order to verify the feedback
impact of such strategies on the significance of these indicators themselves.
4.1 The Data
The analysis in this thesis has been carried out on the high-frequency time series
of the price of 9 stocks traded at the London Stock Exchange (LSE) in 2002, made
up of 251 trading days¹.
In financial high-frequency data the price is recorded on a time scale shorter than a day;
in particular, the value of the time series considered here was updated second-by-second.
Another possible choice to measure time could have been to record the price
transaction-by-transaction or, as they say, tick-by-tick, but it was decided to
consider physical time, since this is the time perceived by investors and on which
they base their investments.
Another reason that led us to choose the analysis in the seconds domain was the fact
that, while the physical time of trading does not change across stocks, the very
different number of operations per day would make it difficult to compare the results
for different stocks if tick-by-tick time series had been considered.
1
Actually the 248th
day has not been considered due to lack of data caused by the interruption
and subsequent resumption of trading during the day
Among the stocks traded at the LSE in 2002, we have decided to consider the
following 9 stocks:
• AstraZeNeca (whose abbreviation is AZN)
• British Petroleum (BP)
• GlaxoSmithKline (GSK)
• Halifax Bank of Scotland (HBOS)
• Royal Bank of Scotland (RBS)
• Rio Tinto (RIO)
• Royal Dutch Shell (SHELL)
• Unilever (ULVR)
• Vodafone Group (VOD)
The prices examined were measured in ticks, the tick being the minimum change in the
price. The tick is assigned according to price ranges; at the LSE the following
convention was adopted:
Price (pence)    tick (pence)
0 – 10           0.01
10 – 50          0.25
500 – 1000       0.5
≥ 1000           1
Traders look at price time-series graphs at different time scales, and their decisions
are mainly based on bare-eye observation (figures (4.1) and (4.2)).
4.1.1 T Seconds Rescaling
It was decided to carry out a coarse-grained analysis of each time series
P(t)_{STOCK,day}
considered, picking out a point (price) every T = 45, 60, 90, 180 seconds:
P(t) \implies P_T(t)
In this way we are removing information about price fluctuations that develop on time
scales shorter than T (red circle in figure (4.1)).
If the original time series was made up of L terms, the new one has only ⌊L/T⌋ terms,
where ⌊·⌋ denotes the greatest integer less than or equal to L/T.
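A minimal Python sketch of this coarse-graining (the choice of sampling at indices 0, T, 2T, ... is my own assumption about the exact alignment):

import numpy as np

def rescale(price, T):
    """Coarse-grain a second-by-second price series: keep one point every T seconds.
    The output has floor(L / T) points."""
    price = np.asarray(price, float)
    L = len(price)
    idx = np.arange(0, (L // T) * T, T)   # indices 0, T, 2T, ...
    return price[idx]

# Example: a synthetic second-by-second series of one trading day (~8.5 hours)
price = 100 + np.cumsum(np.random.normal(0, 0.05, 30600))
for T in (45, 60, 90, 180):
    print(T, len(rescale(price, T)))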
SHELL, 55th
trading day of 2002
Figure 4.1. Effect of Rescaling - Rescaling of the price time series of SHELL stock in
the 55th trading day of the year 2002. At the top there is the second-by-second time
series. Rescaling has been performed picking one point every T = 5, 10, 15 minutes. The
red circle in the T = 15 minutes time series shows that price fluctuations developing on
smaller scales have been ignored.
4.2 Bounce: Critical Discussion About Definition
In technical jargon, Supports (Sup) and Resistances (Res) are defined rather
qualitatively:
Support and Resistance are important technical levels for a stock price.
Support describes a price level that the stock tried to cross below, but
ultimately stayed above. Resistance describes a price level that the stock
tried to cross above, but could not. The bare minimum requirement to
draw a support line or a resistance line is that the stock must spend a
significant amount of time or volume at the price level. [67]
In order to quantify the effect of these figures on the price time series, it was necessary
to characterize them quantitatively.
BP, 178th
trading day of 2002
Figure 4.2. Effect of Adopted Rescaling - Rescaling of the price time series of BP
stock in the 178th trading day of the year 2002. At the top there is the second-by-second
time series. Rescaling has been performed picking one point every T = 45, 60, 90, 180 seconds.
This is the rescaling adopted in the rest of the analysis.
The definition above introduces the concept of bounce in a rather pictorial way; for it,
the following definition has been adopted:
A Bounce is the event of a future price entering a strip centered
around a Support / Resistance level and exiting from the strip without
crossing it.
This definition is a good compromise between a quantitative and a bare-eye approach,
but it still lacks precision; indeed, the following questions arise:
1. Generating Max (Min): which kind of point may generate a Sup or Res level?
2. Time in strip: how much time (in units of T) should the price spend within the
strip of a level?
3. Strip-width: how large should the strip δ be?
Clearly points 2 and 3 are closely linked: the strip-width δ should be related to some
kind of average price fluctuation in order for the time chosen at point 2 to be realistic.
Point 1, instead, is more subtle and is sensitive to the scale T adopted, due to the presence
of a minimum tick, which makes the price changes discrete.
It is observed that at the smaller time scales the profile of the price varies little
(graph at the top of figure (4.3)). This means that, at equal length (i.e. number of
points), a series obtained with a smaller rescaling T is less dispersed than one
obtained with a larger T.
This phenomenon is evident in figure (4.3), which shows 5 slices of time series
from our dataset, all made of 25 points.
The first one, at the top, is in real time, so the time window corresponds to 25
seconds, whereas the others refer to the T = 45, 60, 90, 180 seconds rescalings, so the
corresponding window size is 25 · T seconds.
This observation is an obvious consequence of the rescaling, for which, at equal
length, higher-T series cover a greater portion of the trading day.
GSK, 150th
trading day of 2002
Figure 4.3. Focus on the dispersion of equal-length series. These are 25-point
series, relative to rescalings of T = 45, 60, 90, 180 seconds. The corresponding time windows
are 25 seconds or 25 · T seconds. The presence of constant price levels in the
non-rescaled series is evident, as is the lower dispersion of the T = 45 seconds series
compared to the T = 180 series.
Therefore this poses the question of whether to consider as strip-generating maxima
(minima) only those that are "tight":
Tight Max: P_T(t_{i-1}) < P_T(t_i) and P_T(t_i) > P_T(t_{i+1})
Tight Min: P_T(t_{i-1}) > P_T(t_i) and P_T(t_i) < P_T(t_{i+1})
where
t_{i+1} \overset{def}{=} t_i + T
or even only those that are "isolated":
Isolated Max: Tight Max and |P_T(t_i) - P_T(t_{i\pm 1})| > δ/2
Isolated Min: Tight Min and |P_T(t_i) - P_T(t_{i\pm 1})| > δ/2
or whether a relaxed definition:
Relaxed Max: P_T(t_{i-1}) < P_T(t_i) and P_T(t_i) ≥ P_T(t_{i+1})
Relaxed Min: P_T(t_{i-1}) > P_T(t_i) and P_T(t_i) ≤ P_T(t_{i+1})
would reflect more closely the psychology of investors, who have in mind the concept
of a Support or Resistance level, rather than a single peak.
In the light of these considerations, it was decided to adopt rather natural definitions
(figure (4.4)):
• Generating Max (Min): a Relaxed Max (Min) P_T(t_i) which does not belong
to the strip of a previous Generating Max (Min).
• Bounce: a Relaxed Max (Min) P_T(t_i) which does belong to the strip of a
previous Generating Max (Min).
• Time in strip: no time condition is imposed. If P_T(t_i) is a Generating Max (Min)
/ Bounce, then once the price has dropped below the strip of P_T(t_i), starting
from P_T(t_{i+1}), P_T(t_i) is a legitimate Generating Max (Min) or
Bounce.
• Strip-width: δ was defined as the average of the absolute value of the price
increment at time scale T, i.e. the average of the absolute linear returns time series
(1.1) (a code sketch is given after this list):
\delta(T) = \frac{1}{\lfloor L/T \rfloor - 1} \sum_{i=1}^{\lfloor L/T \rfloor - 1} |P_T(t_{i+1}) - P_T(t_i)| \qquad (4.1)
where the T-time is defined as:
t_{i+1} \overset{def}{=} t_i + T
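As referenced above, a minimal Python sketch of the strip-width (4.1) and of the detection of relaxed maxima (the function names and the toy series are mine; generating maxima and bounces would then be told apart by checking membership in the strips of previous generating maxima):

import numpy as np

def strip_width(p_T):
    """delta(T): mean absolute increment of the rescaled series (eq. 4.1)."""
    p_T = np.asarray(p_T, float)
    return np.mean(np.abs(np.diff(p_T)))

def relaxed_maxima(p_T):
    """Indices i with P_T(t_{i-1}) < P_T(t_i) and P_T(t_i) >= P_T(t_{i+1})."""
    p_T = np.asarray(p_T, float)
    idx = []
    for i in range(1, len(p_T) - 1):
        if p_T[i - 1] < p_T[i] and p_T[i] >= p_T[i + 1]:
            idx.append(i)
    return idx

# Illustrative rescaled price series
p_T = np.array([10.0, 10.5, 10.25, 10.75, 10.5, 11.5, 11.0])
delta = strip_width(p_T)
print(delta, relaxed_maxima(p_T))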
Summarizing:
1. We will deal with time series PT (t) discretized up to scale T = 45, 60, 90, 180
seconds.
2. We have defined the Support / Resistance levels in the most natural way as
the levels starting with points not crossed by the price, with a straightforward
characterization of bounces and with a consistent definition of the strip-width
δ.
3. We will tackle separately and in parallel the analysis of Supports and Resis-
tances.
HBOS, 103rd
trading day of 2002
RBS, 54th
trading day of 2002
Figure 4.4. Examples of Resistance and Support - (top) Price time series of the
103rd trading day of the year 2002 of the stock HBOS (δ ≈ 0.6): two Resistance levels
are visible. (bottom) Price time series of the 54th trading day of 2002 of RBS stock
(δ ≈ 1.0): evidence of two bounces on a Support level. Scale adopted: T = 45 seconds.
4.3 Consistent Random Walks
To provide a basis of comparison, bounce analysis was conducted in parallel both on
the real time series and on consistent random walks.
Let
P_T(t)_{VOD,110}
represent the price of the 110th trading day of the stock Vodafone at the time scale
T; then
\langle P_T(t_{i+1}) - P_T(t_i) \rangle_{day} = \mu_{T,VOD,110}
\langle (P_T(t_{i+1}) - P_T(t_i))^2 \rangle_{day} - (\mu_{T,VOD,110})^2 = \sigma^2_{T,VOD,110} \qquad (4.2)
where ⟨·⟩_{day} is the mean over the trading day, are the daily mean and dispersion of
the price increments of P_T(t)_{VOD,110}. The random walk consistent with P_T(t)_{VOD,110}
is then defined as:
Rw_T(t)_{VOD,110}: \quad Rw(t_{i+1}) = Rw(t_i) + N(\mu, \sigma)
where clearly:
t_{i+1} = t_i + T, \qquad \mu = \mu_{T,VOD,110}, \qquad \sigma = \sigma_{T,VOD,110} \qquad (4.3)
So this process is a random walk whose increments are normally distributed around
the mean price increment of the P_T(t)_{STOCK,day} time series considered, with
dispersion given by the fluctuation of the increments within the day (figures (4.6) and (4.7)).
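A minimal Python sketch of the construction of a consistent random walk from a rescaled daily series (starting the walk at the same initial price is my own assumption):

import numpy as np

def consistent_random_walk(p_T, rng=None):
    """Random walk with normal increments N(mu, sigma), where mu and sigma are
    the daily mean and dispersion of the increments of the rescaled series p_T."""
    if rng is None:
        rng = np.random.default_rng()
    increments = np.diff(np.asarray(p_T, float))
    mu, sigma = increments.mean(), increments.std()
    rw = np.empty(len(p_T))
    rw[0] = p_T[0]                              # same starting price (an assumption)
    rw[1:] = rw[0] + np.cumsum(rng.normal(mu, sigma, len(p_T) - 1))
    return rw

p_T = 100 + np.cumsum(np.random.normal(0, 0.1, 500))   # stand-in for P_T(t)
print(consistent_random_walk(p_T)[:5])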
VOD, 110th
trading day of 2002 Consistent Random Walk
Figure 4.5. Real Series compared with Consistent Random Walk - On the left:
price time series of the Vodafone (VOD) stock in the 110th trading day of the year 2002.
On the right: comparison with the consistent random walk p_{t+1} = p_t + N(μ, σ), where
μ = −1.2 · 10^{−5} is the mean linear return (1.1) of Vodafone in the case considered and
σ = 0.02 is the corresponding dispersion.
Left: VOD, 110th trading day of 2002, T = 45 sec - Right: Consistent Random Walk (μ = −5.5 · 10^{−4}, σ = 0.12)
Left: VOD, 110th trading day of 2002, T = 60 sec - Right: Consistent Random Walk (μ = −7.3 · 10^{−4}, σ = 0.14)
Figure 4.6. T = 45, 60 Rescaled Series compared with Consistent Random Walks
- On the left: price time series of the Vodafone (VOD) stock in the 110th trading day of
the year 2002 on scale T = 45 (top) and 60 seconds (bottom). On the right: comparison
with the consistent random walks p_{t+1} = p_t + N(μ, σ). The same graph referring to
the non-rescaled time series can be found in figure (4.5).
Left: VOD, 110th trading day of 2002, T = 90 sec - Right: Consistent Random Walk (μ = −1.1 · 10^{−3}, σ = 0.15)
Left: VOD, 110th trading day of 2002, T = 180 sec - Right: Consistent Random Walk (μ = −2.2 · 10^{−3}, σ = 0.21)
Figure 4.7. T = 90, 180 Rescaled Series compared with Consistent Random Walks
- On the left: price time series of the Vodafone (VOD) stock in the 110th trading day of
the year 2002 on scale T = 90 (top) and 180 seconds (bottom). On the right: comparison
with the consistent random walks p_{t+1} = p_t + N(μ, σ). The same graph referring to
the non-rescaled time series can be found in figure (4.5).
4.4 Memory E ects in Bounce Probability
As I have mentioned earlier, for chartists Supports and Resistances are reference
levels they expect not to be crossed. What is relevant here is they operate as if it
were true, making it more tangible presence of these price levels.
The coincidence of expectations and the coordinate reactions to the same indicator
generate a feed-back e ect known as self-fulfilling prophecy.
Feed-back impact of such strategies was estimated measuring the conditional prob-
ability of bouncing again on such levels conditionally on the number of previous
bounces:
p(bounce|bprevious bounces)
We counted the number N_i of times the price re-entered the strip after the i-th bounce.
At this point the structure of the process is Bernoullian; indeed the price has only
two alternatives:
• crossing the level,
• bouncing on it with elementary probability p = p(b|i), where b denotes the
event of the next bounce after i bounces already realized,
P_{N_i}(n_i, p) = \binom{N_i}{n_i} p^{n_i} (1-p)^{N_i - n_i} \implies E_p(n_i) = N_i\, p(b|i)
where n_i denotes the number of positive realizations (bounce events) among the N_i
Bernoulli trials. So we are interested in inferring p(bounce | b_{previous bounces} = i) from
n_i and N_i, to understand whether or not it is comparable with the coin-toss level
p_elementary = 1/2.
By coin-toss limit we mean the level of a process that has no memory of previous
bounces and is therefore indifferent (p_elementary = 1/2) between crossing the strip
and re-bouncing on it.
Using Bayes theorem [29], we obtain the expected value
E[p(bounce \mid b_{previous\ bounces} = i)]
denoted as E[p(b|i)] hereafter, which is actually a refinement of the
frequency f(b|i) of re-bouncing:
f(b|i) = \frac{n_i}{N_i}
E[p(b|i)] = \frac{n_i + 1}{N_i + 2} \qquad (4.4)
Var[p(b|i)] = \frac{(n_i + 1)(N_i - n_i + 1)}{(N_i + 3)(N_i + 2)^2} \qquad (4.5)
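A minimal sketch of the estimate (4.4)–(4.5), which coincides with the mean and variance of a Beta(n_i + 1, N_i − n_i + 1) posterior obtained from a uniform prior on p; the numbers in the example are purely illustrative, not data from the thesis:

def bounce_probability(n_i, N_i):
    """Posterior mean and variance of p(b|i), given n_i re-bounces in N_i trials
    under a uniform prior on p (eqs. (4.4) and (4.5))."""
    mean = (n_i + 1) / (N_i + 2)
    var = (n_i + 1) * (N_i - n_i + 1) / ((N_i + 3) * (N_i + 2) ** 2)
    return mean, var

# Example: 63 re-bounces observed out of 100 re-entries after the 1st bounce
mean, var = bounce_probability(63, 100)
print(mean, var ** 0.5)   # compare with the coin-toss level 0.5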
The analysis has been repeated for all the time series (all stocks, all days), for the
various scales T, and for Support and Resistance levels separately. Results have
been compared with those obtained from the consistent random walk time series (figures
(4.8), (4.9), (4.10) and (4.11)).
Resistances T = 45 sec Supports T = 45 sec
Figure 4.8. Inferred value of probability of re-bouncing E[p(b|i)] (4.4) conditioned on the
number of previous bounces b = 1, 2, 3, 4. Errorbars are computed as inferred dispersion
V ar[p(b|i)] (4.5). Graphs refer to scale T = 45, on the left: Resistances, on the right:
Supports. Results are compared with the same procedure carried on Compatible Random
Walks time series. Statistics is based on 10 random walks for each stock and for each
trading day.
Resistances T = 60 sec Supports T = 60 sec
Figure 4.9. Inferred value of probability of re-bouncing E[p(b|i)] (4.4) conditioned on the
number of previous bounces b = 1, 2, 3, 4. Errorbars are computed as inferred dispersion
V ar[p(b|i)] (4.5). Graphs refer to scale T = 60, on the left: Resistances, on the right:
Supports. Results are compared with the same procedure carried on Compatible Random
Walks time series. Statistics is based on 10 random walks for each stock and for each
trading day.
Resistances T = 90 sec Supports T = 90 sec
Figure 4.10. Inferred value of probability of re-bouncing E[p(b|i)] (4.4) conditioned on the
number of previous bounces b = 1, 2, 3, 4. Errorbars are computed as inferred dispersion
V ar[p(b|i)] (4.5) Graphs refer to scale T = 90, on the left: Resistances, on the right:
Supports. Results are compared with the same procedure carried on Compatible Random
Walks time series. Statistics is based on 10 random walks for each stock and for each
trading day.
Resistances T = 180 sec Supports T = 180 sec
Figure 4.11. Inferred value of probability of re-bouncing E[p(b|i)] (4.4) conditioned on the
number of previous bounces b = 1, 2, 3, 4. Errorbars are computed as inferred dispersion
V ar[p(b|i)] (4.5) Graphs refer to scale T = 180, on the left: Resistances, on the right:
Supports. Results are compared with the same procedure carried on Compatible Random
Walks time series. Statistics is based on 10 random walks for each stock and for each
trading day.
Some considerations on the graphs:
• Bounce probabilities are almost always greater than the coin-toss limit 0.5.
• Bounce probabilities rise as b_{previous bounces} increases. This can be interpreted
as a reinforcement of investors' beliefs.
• Random walk probabilities are comparable with the coin-toss level and do not
show any dependence on previous bounces.
• Increasing the scale T affects the evidence of memory (figure (4.11)).
This could suggest a finite memory of investors, but the scarcity of data
does not allow for precise conclusions.
4.5 Window Analysis
In order to study the characteristics of price trajectories around Support and
Resistance levels, we have analyzed some features of the bounces:
• Recurrence time.
• Window size.
• Fluctuations within windows.
This study led us to select the appropriate scale T to consider in order to obtain
evidence of memory effects directly detectable in the trajectory of the price time series
around Support or Resistance figures.
4.5.1 Recurrence Time
We have studied the distribution of the time elapsing between the exit of the
price from the strip of a previous bounce and the subsequent entry into the strip of the
next bounce:
T_T = t_j - t_k
where
P_T(t_{k-1}) belongs to the strip of bounce i,
P_T(t_{j+1}) belongs to the strip of bounce i + 1.
By definition, T_T is measured in units of the scale T, allowing the comparison of the
histograms for the 4 scales chosen ((4.12) and (4.13)).
In order to take into account the rare events at large T_T values, histograms have been
computed through logarithmic binning.
A bin of constant logarithmic width b means that the logarithm of the upper edge
of a bin, (T_T)_{i+1}, is equal to the logarithm of the lower edge of that bin, (T_T)_i, plus
the bin width b. That is,
\log((T_T)_{i+1}) = \log((T_T)_i) + b \implies (T_T)_{i+1} = (T_T)_i\, e^b
Since the linear bin width w_i of bin i is defined as:
w_i = (T_T)_{i+1} - (T_T)_i
it is directly proportional to (T_T)_i, because
w_i = (T_T)_{i+1} - (T_T)_i = (T_T)_i e^b - (T_T)_i = (T_T)_i (e^b - 1)
The number of observations n_i in the i-th bin is equal to the density of observations
in that bin times the width w_i of that bin.
Therefore, if the probability density function f(T_T) is a power law with exponent α
and the bin width w_i is proportional to the bin value (T_T)_i,
f(T_T) \propto (T_T)^{\alpha}, \qquad w_i \propto (T_T)_i \qquad (4.6)
then a simple regression of log(n_i) against log(T_T) yields a slope equal to α + 1:
n_i = f((T_T)_i)\, w_i \propto ((T_T)_i)^{\alpha} (T_T)_i = ((T_T)_i)^{\alpha + 1}
So, in order to estimate the exponent α, the regression must be conducted on the
logarithm of the bin counts log(n_i) normalized to the bin width w_i:
\frac{n_i}{w_i} \propto \frac{((T_T)_i)^{\alpha} (T_T)_i}{(T_T)_i} = ((T_T)_i)^{\alpha}
In this respect, the histograms in figures (4.12) and (4.13) are computed as the bin
density
PDF_i = \frac{n_i}{n_{total}}
normalized to the bin width w_i.
The linear fit then gives insight into the real exponent of the power-law regression of
the recurrence time:
f(T_T) \sim (T_T)^{\alpha_T}
where the exponent α_T shows little or no dependence on the scale T.
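A minimal Python sketch of the logarithmic binning with width-normalized counts (the bin width and the synthetic power-law sample are illustrative choices of mine, not the b = 0.001 of the text):

import numpy as np

def log_binned_density(values, b, x_min=1.0):
    """Histogram with bins of constant logarithmic width b, returning bin centers
    and counts normalized to the bin width (PDF_i = n_i / n_total / w_i)."""
    values = np.asarray(values, float)
    n_bins = int(np.ceil(np.log(values.max() / x_min) / b))
    edges = x_min * np.exp(b * np.arange(n_bins + 1))   # edge_{i+1} = edge_i * e^b
    counts, _ = np.histogram(values, bins=edges)
    widths = np.diff(edges)
    centers = np.sqrt(edges[:-1] * edges[1:])
    density = counts / (widths * counts.sum())
    return centers, density

# Example with synthetic power-law distributed recurrence times (pdf ~ x^-2.5)
u = np.random.uniform(size=10000)
t_rec = (1.0 - u) ** (-1.0 / 1.5)
centers, density = log_binned_density(t_rec, b=0.05)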
4.5.2 Window Size
The study was similar to the one carried out for the recurrence time, but here we
were interested in quantifying the average window size τ.
This will be of primary importance for the clustering of trajectories around the
Support and Resistance levels. Indeed, the measure of τ will be exploited in order to
define a common length for the trajectories around the bounces.
The window size is defined as the time the price spends in a bounce strip. Precisely,
it is the temporal distance between the last not-yet-in-strip point and the first point
out of the strip. Let P_T(t_i) be a bounce or a generating max (min); then:
\tau = t_{k_{out}} - t_{k_{in}-1}, \qquad t_{k_{in}-1} < t_i < t_{k_{out}}
Figure 4.12. Histogram of Recurrence time TT for Supports and Resistances windows
together. Scales T = 45, 60. The binning is logarithmic of constant width b = 0.001.
Figure 4.13. Histogram of Recurrence time TT for Supports and Resistances windows
together. Scale T = 90, 180. The binning is logarithmic of constant width b = 0.001.
where
|P_T(t_i) - P_T(t_{k_{in}-1})| > \delta(T) \quad \text{and} \quad |P_T(t_i) - P_T(t_{k_{in}})| < \delta(T)
|P_T(t_i) - P_T(t_{k_{out}-1})| < \delta(T) \quad \text{and} \quad |P_T(t_i) - P_T(t_{k_{out}})| > \delta(T)
δ(T) being the strip-width of the time series at the scale T considered.
As for the recurrence time, the histograms of τ in figures (4.14) and
(4.15) have been reported with the same choice of logarithmic width b = 0.001.
Especially at the larger scales T, the linear fit here is less meaningful
due to the substantial rarity of a bounce event.
Figure 4.14. Histogram of Window size τ for Supports and Resistances windows together.
Scales T = 45, 60. The binning is logarithmic with constant width b = 0.001.
Figure 4.15. Histogram of Window size τ for Supports and Resistances windows together.
Scales T = 90, 180. The binning is logarithmic with constant width b = 0.001.
Figure 4.16. Histogram of Fluctuation in windows on the tick minimum, centered around
Resistance (left) and Support (right) levels. Window size considered: τ = 150. Scale
T = 45. The distribution refers to the whole data set of window time series, namely all
stocks and all days.
4.5.3 Fluctuations within Window
Henceforth we will deal with price time series referring to τ-sized windows opened
around the bounces identified in each original time series.
These window time series will be our data set for the clustering purposes we will
introduce. We have separated the analysis with respect to:
• The whole data set of window time series.
• Window time series referring to a specific stock.
• Window time series referring to a specific bounce (up to the 4th).
Here we present the analysis of the maximum dispersion of the price in a window
centered around a Support or Resistance bounce, relative to the minimum tick.
Let [−τ/2, τ/2] (expressed in units of the scale T) be the considered window size.
The selection of the range of τ was made considering the distribution of its values
presented in the previous section.
Recalling the table of the conventional assignment of the tick minimum at the LSE,
and letting tick_STOCK be the tick minimum of the considered stock, we define the maximum
dispersion in the window relative to tick_STOCK as:
\frac{\max P_T(t)_{STOCK,day} - \min P_T(t)_{STOCK,day}}{tick_{STOCK}}, \qquad t \in \left[-\frac{\tau}{2}, \frac{\tau}{2}\right]
We present the histogram of the distribution of these values for the whole
dataset, for τ = 150, at the various scales T, for Support and Resistance levels separately
(figures (4.16) and (4.17)).
Figure 4.17. Histogram of Fluctuation in windows on the tick minimum, centered around
Resistance (left) and Support (right) levels. Window size considered: τ = 150. Scales
T = 60, 90 and 180. The distribution refers to the whole data set of window time series,
namely all stocks and all days.
We have labeled the values of the within-window dispersion as dictionary size
in order to stress that this may also be thought of as the distribution of the effective
price levels for quantization purposes. Namely, it represents the distribution of the optimal
number of letters that would be present in a dictionary encoding the discretized price
time series through an alphabet.
To conclude, we note a peculiar feature: local maxima of this distribution are always
reached in correspondence of even values, i.e. prices that are even multiples of the
tick minimum.
We suppose this effect is related to the conventional assignment of tick_STOCK, which
for the stocks analyzed in this thesis is:
STOCKS tickmin (pence)
AZN 0.5
BP 0.25
GSK 0.5
HBOS 0.25
RBS 0.5
RIO 0.5
SHELL 0.125
ULVR 0.25
VOD 0.125
indeed the effect is also evident when examining the same kind of distribution restricted
to the time series of a particular stock.
Figure (4.18) reports the results for the time series of the AZN stock at the
scale T = 45, for the choice τ = 100, around all Resistance levels found in the
trading year analyzed:
Figure 4.18. Histogram of fluctuation on the tick minimum (0.5 pence) in windows centered
around Resistance levels for the AZN stock. Window size chosen is τ = 150 and the
scale considered is T = 45.
Chapter 5
The Clustering Model
The main part of the thesis work was concerned with the creation and refinement of
an algorithm to perform the clustering of the time series dataset considered. In this
chapter the algorithm is presented and the toy model created to test it is introduced,
together with the financial dataset adopted for the simulations.
5.1 Structure of the Algorithm
In order to find patterns in the time series analyzed, an algorithm was designed whose
output is a particular partition m_h of the time series dataset.
Namely, given a dataset D_N of N time series of length τ,
x_i = (x_{i1}, x_{i2}, ..., x_{i\tau})
the algorithm is designed to find similarities (if present) among these time series
and to create clusters C_{k_h} of them:
D_N = \{x_1, x_2, ..., x_N\} \overset{clustering}{\implies} m_h = \{(x_4, x_{17}, x_{25}), (x_1, x_{27}), ..., (x_N), ...\}
that is
m_h = \{C_1, C_2, ..., C_{n_h}\} = \{C_{k_h}\}_{k=1,...,n_h}
n_h being the number of clusters provided by the partition m_h.
The algorithm is structured in 3 steps (figure (5.1)) as follows:
• Random initialization: in every instance of the procedure a random initial
partition is generated in order to provide a seed for the whole clustering
procedure.
1. MCMC step: starting from the initial partition, a partition of the dataset
is created according to Bayesian Model Selection (section (2.3.2)) through an
MCMC based on an adaptation of the Metropolis-Hastings algorithm (section (3.4.2))
to the problem.
2. Splitting step: the clusters provided by the partition found via MCMC were
split in order to separate noisy ones. The acceptance process is controlled
by a threshold, RANDOM SPLITTING, in order to avoid the breaking of well-formed clusters.
3. Merging step: the clusters of the resulting partition were iteratively merged
together in order to reduce their number. Also at this step the acceptance
process is controlled by a threshold, RANDOM MERGING, to minimize the unavoidable
increase in noise due to the merging operation.
• (optional) Iteration: in some cases the entire process was repeated in order
to refine the results; the final partition provided by the merging step was
adopted as the initial partition for the MCMC step.
Figure 5.1. Schematic representation of the clustering procedure: the initialization provides
the initial seed for the entire procedure. This initial partition is processed via the 3-step
algorithm and, possibly, the resulting partition is adopted as a seed for a new instance
of the procedure.
5.2 Toy Model
In order to test the algorithm, a toy model was defined, consisting of an artificial
dataset of time series created in a controlled way, so that their correct partition
would be known.
We considered 5 mother-series of length τ, randomly generated in [−1:1] and
color-coded (figure (5.2)):
(x_{mother})_k, \qquad k = 1, 2, 3, 4, 5
Then, starting from the mother series, the effective dataset of
daughter-series adopted for test purposes was generated:
(x_{daughter})_k = N((x_{mother})_k, diag(\sigma = 0.1))
so that the daughter series are σ = 0.1 Gaussian fluctuations around the mothers (figure
(5.3)).
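A minimal Python sketch of the construction of the toy dataset (the number of daughters per mother is an illustrative choice of mine, as it is not fixed in this section):

import numpy as np

rng = np.random.default_rng(3)
tau, n_mothers, n_daughters = 25, 5, 20    # daughters per mother: illustrative

# Mother series: uniform in [-1, 1]
mothers = rng.uniform(-1.0, 1.0, size=(n_mothers, tau))

# Each daughter is a sigma = 0.1 Gaussian fluctuation around its mother
sigma = 0.1
daughters = np.concatenate(
    [m + rng.normal(0.0, sigma, size=(n_daughters, tau)) for m in mothers]
)
labels = np.repeat(np.arange(n_mothers), n_daughters)   # known correct partition
print(daughters.shape, labels.shape)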
1st mother · 2nd mother · 3rd mother · 4th mother · 5th mother
Figure 5.2. Mother Series - Series of length τ = 25 randomly sampled in [−1:1]. Different
mother series are color-coded: blue, red, green, magenta and cyan.
1st mother daughter series · 2nd mother daughter series · 3rd mother daughter series · 4th mother daughter series · 5th mother daughter series
Figure 5.3. Daughter Series - Series of length τ = 25 sampled as σ = 0.1 Gaussian
fluctuations around the respective mothers. The daughter series, as well as the mother series,
are color-coded: blue, red, green, magenta and cyan.
5.3 Real Series
Coping with the real time series of the data set introduced in section (4.1), the length τ
was taken to be the size of the windows opened around Support and Resistance
levels (figure (5.4)), so that clustering this kind of time series would provide
insights into memory effects detectable directly in the shape of the trajectory of the
price around those levels:
x_i = \{P_T(t)\}, \qquad t \in \left[t_b - \frac{\tau}{2},\; t_b + \frac{\tau}{2}\right]
where (t_b, P_T(t_b)) is a bounce point or a generating max (min).
In order to homogenize the dataset, a standardization was performed of the
Figure 5.4. Trajectory around bounce - Schematic representation of a trajectory
around a Resistance bounce. The bounce event is identified as the entering and the
subsequent getting off the strip [P(t_b) − δ/2, P(t_b) + δ/2]. In the algorithm, the strip-width δ
is regarded as the intrinsic indetermination of the time series. The trajectory is plotted
as a continuous line and is defined over the symmetric interval [t_b − τ/2, t_b + τ/2] centered
at the bounce point t_b, chosen as origin.
window-series values in the [−1:1] interval (figure 5.5):
1. Translation: each time series was translated by the level P_T(t_b) of the
Support/Resistance it belongs to.
2. Rescaling: the translated time series was divided by the maximum excursion
in order to keep its values in [−1:1] (a code sketch is given after this passage).
x_i \implies \frac{x_i - P_T(t_b)}{\max_{t \in [t_b - \tau/2,\, t_b + \tau/2]} |P_T(t) - P_T(t_b)|} \qquad (5.1)
and so the resulting dataset D_N consists of time series defined over the interval
[t_b − τ/2, t_b + τ/2], whose values range over the symmetric interval [−1:1]:
D_N = \{x_i\}_{i=1,...,N}, \qquad x_i \in \left[t_b - \frac{\tau}{2},\; t_b + \frac{\tau}{2}\right] \times [-1:1]
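As referenced above, a minimal Python sketch of the standardization (5.1) and of the strip rescaling (5.2); the window values and the level in the example are illustrative:

import numpy as np

def standardize_window(window, p_level, delta):
    """Eqs. (5.1)-(5.2): translate by the Support/Resistance level and rescale
    by the maximum excursion, so that values fall in [-1, 1]."""
    window = np.asarray(window, float)
    excursion = np.max(np.abs(window - p_level))
    x = (window - p_level) / excursion
    delta_hat = delta / excursion          # consistently rescaled strip-width
    return x, delta_hat

# Example: a window of tau = 7 prices around a Resistance level at 101.0
window = np.array([100.2, 100.7, 101.0, 100.9, 101.0, 100.4, 99.8])
x, delta_hat = standardize_window(window, p_level=101.0, delta=0.3)
print(x.min(), x.max(), delta_hat)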
Observe that the bounce point (t_b, P_T(t_b)) is mapped into the center (t_b, 0) of the
[domain] × [range] rectangle, and the strip-width is consistently rescaled as:
\delta \implies \hat{\delta} = \frac{\delta}{\max_{t \in [t_b - \tau/2,\, t_b + \tau/2]} |P_T(t) - P_T(t_b)|} \qquad (5.2)
Figure 5.5. Trajectory around bounce, rescaled - Schematic representation of the
effect of the rescaling (5.1) on a trajectory around a Resistance bounce. Now the values range
over the symmetric interval [−1:1] and the strip width δ is consistently rescaled to δ̂ (5.2).
The bounce event (t_b, P_T(t_b)) becomes the center of the [domain] × [range] rectangle.
As described in section (4.5.2), the window size τ, here adopted as the length of
the time series to be clustered, is statistically distributed according to a power law
(figures (4.14) and (4.15)).
It was decided to treat it as a free parameter and to study the performance of the
algorithm and the results of the clustering for a wide range of τ values:
τ = 10, 20, 25, 35, 50, 100, 150, 200
where I stress that τ is measured in units of the scale T, so that the effective period
of time covered by a time series of length τ is τ × T seconds.
It is therefore not surprising that, when dealing with large values of τ, a time series defined
around, say, the i-th bounce point turns out to also cover neighboring bounces, or points
subsequent to the breaking (crossing) of the Support/Resistance level considered.
Figure (5.6) provides a schematic representation of this effect, whereas figure
(5.7) presents evidence of it in a real series.
Figure 5.6. Trajectory around bounce, rescaled - large τ effect - Schematic representation of the effect of considering trajectories whose length τ is considerably greater than the effective window size. The plot represents a trajectory around the i-th bounce, but the trajectory also covers the (i−1)-th bounce and the breakpoint where the strip gets broken and the Resistance level ceases to be valid.
5.4 Best Partition: Bayesian Characterization
In order to quantify the clustering procedure, it was decided to follow a Bayesian approach to the problem. The best partition of the dataset D_N was defined in terms of:
• Prior P(m_h) over the partitions m_h ∈ M_N, which encodes the beliefs, if any, over the set M_N of all the possible partitions of N time series.
• Likelihood P(D_N|m_h), which represents how likely the dataset D_N would be if the right partition were m_h.
In these terms the problem of finding clusters in the dataset is mapped into the optimization problem of finding the partition m_h that maximizes the Posterior over all the possible partitions:
\[
P(m_h|D_N) \propto P(D_N|m_h)\, P(m_h)
\qquad (5.3)
\]
[Figure 5.7 panel title: RIO, 202nd trading day of 2002 - scale T = 45 seconds - 4th bounce on the Resistance level - τ = 100 - series standardized in [t_b − τ/2, t_b + τ/2] × [−1:1].]
Figure 5.7. Trajectory around bounce, rescaled - large τ effect - Real Series - Trajectory of the price time series of the RIO stock in the 202nd trading day of the year 2002. The trajectory is centered around the 4th bounce on the Resistance considered, but the length τ = 100 is so big that the trajectory ends up covering all previous bounces on the same Resistance (red points) and also the breakpoint (blue point). To appreciate this fact, consider that the effective period of trading covered by this time series is τ × T = 100 × 45 = 4500 seconds, namely 1 hour and 15 minutes.
Note that bounce points are consistent with the definition of relaxed max, which was used in order to identify them in a way closer to the effective bare-eye recognition adopted by technical traders (section (4.2)).
Therefore the best partition
\[
m_{best} = \{C_1, C_2, \dots, C_{n_{best}}\}
\]
is characterized as the solution of a model/partition selection problem of the kind previously introduced (section (2.3.2)).
Note that the denominator P(DN ) (3.1), present in the general expression (2.5) for
the posterior, is not necessary for the purposes of determining the optimal partition
and was therefore omitted.
5.4.1 Gaussian Cost Prior
The prior was designed in terms of the number of clusters n_h > 0 provided by the partition m_h. Clearly it was not possible to write down directly an analytic form of the Prior over the possible partitions. We thought that a Gaussian cost function (figure (5.8))
\[
P(m_h) = \mathcal{N}_{n_h}(0, \sigma_p)
\qquad (5.4)
\]
would be a sufficient binding against an unbounded increase of the number of clusters and, at the same time, being centered at 0, it would not favor any particular number of clusters.
With N_{n_h}(0, σ_p) we denote the normal distribution N(0, σ_p) evaluated at the number of clusters n_h provided by the partition m_h.
This kind of prior provides a simple example of the Occam's Razor principle: P(n_h) acts against the uncontrolled growth of the number of clusters, which is always favored by the likelihood adopted (to be introduced in the next section (5.4.2)).
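A minimal sketch of this prior term (in Python, with assumed names); only the value of the Gaussian density at n_h matters, and the constant normalization cancels in posterior ratios:

import numpy as np

def log_prior(n_clusters, sigma_p):
    """Gaussian cost prior (5.4): log of N(0, sigma_p) evaluated at n_h.

    Larger n_h is penalized quadratically; no particular n_h > 0 is preferred
    other than through this smooth cost.
    """
    return -0.5 * (n_clusters / sigma_p) ** 2 - np.log(sigma_p * np.sqrt(2.0 * np.pi))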
Figure 5.8. Gaussian cost function, taken as the right-hand side of the normal distribution N(0, σ_p), for σ_p = 0.1, 0.3, 0.5, 0.7, 1.
5.4.2 Gaussian Likelihood
In order to characterize the Likelihood term in (5.11), it was decided to keep things as simple as possible, i.e. to introduce as few parameters as possible.
It was decided to separate the contribution of each cluster C_{k_h} belonging to the partition m_h of the dataset D_N and to adopt the following assumptions:
1. Time series are temporally uncorrelated:
\[
\langle P_T(t_k) P_T(t_j) \rangle = \langle x_{ik} x_{ij} \rangle = 0 \qquad \forall k \neq j
\]
2. Their values are normally distributed around the mean series of the cluster C_{k_h} they belong to.
3. There is no correlation at all between two different time series:
\[
\langle x_i x_j \rangle = 0 \qquad \forall i \neq j
\]
The case i = j, i.e. the value of ⟨x_i²⟩, was regarded as the intrinsic indetermination of the time series. In order not to make any assumption on the process underlying the time series, the same indetermination δ_i was adopted for all the components 1, 2, ..., τ of x_i:
\[
\langle x_i^2 \rangle = \delta^2
\]
and it was quantified depending on the time series considered:
• Toy Model: δ = σ, i.e. the size of the Gaussian fluctuations around the mother series defining x_i = x_{daughter_i}.
• Real Series: δ = δ_{strip-width}, namely the width of the strip (4.1) of the window to which x_i belongs.
Point 2 needs further clarification. The intention was not to make any assumption on the (stochastic) process generating the values x_ij; it was only assessed how likely it would be that x_i belongs to C_{k_h} if m_h were the right partition of D_N.
This probability was modeled taking into account several factors:
• x_i's values: x_ij;
• the mean series of the cluster C_k with respect to which the Gaussian likelihood of the time series values is calculated:
\[
\mu_k = (\mu_{k1}, \mu_{k2}, \dots, \mu_{k\tau}) = \frac{1}{\dim(C_k)} \sum_{x_i \in C_k} x_i = \langle x_i \rangle_{C_k}
\qquad (5.5)
\]
where dim(C_k) denotes the number of series belonging to the cluster C_k.
• the indetermination within the cluster, to which contribute the intrinsic indetermination δ of x_i's values and the intra-cluster variance, determined by the difference of values among the time series clustered in C_k:
\[
\sigma_k^2 = \langle x_i^2 \rangle_{C_k} - \mu_k^2
\qquad (5.6)
\]
whose components will in general be denoted by:
\[
\sigma_{k_h}^2 = (\sigma_{1k_h}^2, \sigma_{2k_h}^2, \dots, \sigma_{\tau k_h}^2)
\]
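In code, both quantities can be obtained componentwise; a small sketch, assuming the cluster is stored as a 2-D array with one series per row:

import numpy as np

def cluster_moments(cluster):
    """Mean series (5.5) and intra-cluster variance (5.6) of a cluster.

    cluster : array of shape (dim(C_k), tau), one time series per row.
    Returns (mu_k, sigma2_k), both of length tau.
    """
    mu_k = cluster.mean(axis=0)                           # <x_i>_{C_k}, componentwise
    sigma2_k = (cluster ** 2).mean(axis=0) - mu_k ** 2    # <x_i^2>_{C_k} - mu_k^2
    return mu_k, sigma2_k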
So we can say: if m_h were the chosen partition for the dataset D_N, then the probability of the series x_i, whose values are x_ij for j = 1, ..., τ, belonging to the cluster C_{k_h} would be:
\[
P(x_i|m_h) = \mathcal{N}_{x_i}(\mu_{k_h}, \Sigma_{i,k_h}) \qquad \text{where } x_i \in C_{k_h}
\qquad (5.7)
\]
where
\[
\Sigma_{i,k_h} =
\begin{pmatrix}
\sqrt{\delta_i^2 + \sigma_{1k_h}^2} & 0 & \cdots & 0 \\
0 & \sqrt{\delta_i^2 + \sigma_{2k_h}^2} & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \sqrt{\delta_i^2 + \sigma_{\tau k_h}^2}
\end{pmatrix}
\qquad (5.8)
\]
The subscript x_i means that the multivariate normal distribution must be computed at (x_{i1}, x_{i2}, ..., x_{iτ}).
The cluster likelihood contribution to the whole likelihood coming from C_{k_h}, due to the uncorrelation of the time series, factorizes over the time series belonging to it:
\[
P(C_{k_h}|m_h) = \prod_{x_i \in C_{k_h}} P(x_i|m_h)
\qquad (5.9)
\]
The complete likelihood takes into account this kind of term coming from each cluster, the clusters being regarded as completely separate entities:
\[
P(D_N|m_h) = \prod_{C_{k_h} \in m_h} P(C_{k_h}|m_h)
= \prod_{C_{k_h} \in m_h} \prod_{x_i \in C_{k_h}} P(x_i|m_h)
= \prod_{C_{k_h} \in m_h} \prod_{x_i \in C_{k_h}} \mathcal{N}_{x_i}(\mu_{k_h}, \Sigma_{i,k_h})
= \prod_{C_{k_h} \in m_h} \prod_{x_i \in C_{k_h}} \prod_{j=1}^{\tau} N_{x_{ij}}\!\left(\mu_{jk_h}, \sqrt{\delta_i^2 + \sigma_{jk_h}^2}\right)
\qquad (5.10)
\]
where the multivariate (τ-variate) normal distribution is denoted by the calligraphic N, whereas the univariate Gaussian pdf is denoted by N.
The last equality stems from the uncorrelation among the time series components mentioned above.
So the Posterior takes the following form:
\[
P(m_h|D_N) \propto P(D_N|m_h)\,P(m_h) = \prod_{C_{k_h} \in m_h} \prod_{x_i \in C_{k_h}} \prod_{j=1}^{\tau} N_{x_{ij}}\!\left(\mu_{jk_h}, \sqrt{\delta_i^2 + \sigma_{jk_h}^2}\right) \, \mathcal{N}_{n_h}(0, \sigma_p)
\qquad (5.11)
\]
where it should be recalled that n_h is the number of clusters in the partition m_h considered.
Observe that the likelihood P(D_N|m_h) chosen here can be thought of as a particular case of the marginal likelihood (2.6) with respect to the possible parameters of the model/partition m_h.
Indeed the form (5.10) can be obtained from the marginal likelihood
\[
P(D|m_h) = \int P(D|\theta_h, m_h)\, P(\theta_h)\, d\theta_h
\]
by choosing a prior P(θ_h) over the parameters sharply peaked around (μ_{k_h}, Σ_{i,k_h}), k = 1, ..., n_h, denoted collectively as (μ, Σ).
For example, the calculation is straightforward if one selects:
\[
P(\theta_h) = \delta\big(\theta_h - (\mu, \Sigma)\big)
\]
This choice means that, whatever the partition m_h considered, its parameters (μ, Σ) are required to be precisely those defined in (5.5) and (5.6).
The goal of the model is to find the correct partition of the dataset D_N of N series, maximizing the posterior function P(m_h|D_N) in (5.11) over all the models/possible partitions m_h ∈ M_N.
Note finally that the Likelihood term P(D_N|m_h) (5.10) is linked to the dispersion within each cluster and therefore favors the formation of many clusters.
As discussed before, only the presence of the Prior P(m_h) (5.4), which acts against the increase of the number of clusters, can lead the model to stabilize around the correct number of clusters, and eventually to find its way to the maximum of the Posterior P(m_h|D_N) (5.11).
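The objective can be condensed in a short function. The sketch below is only an illustration of (5.11): it assumes a partition stored as a list of 2-D arrays (one per cluster, one series per row) and a common intrinsic indetermination delta for all series; it is not the thesis implementation.

import numpy as np

def log_posterior(partition, delta, sigma_p):
    """Unnormalized log-posterior (5.11) of a partition.

    partition : list of clusters, each an array of shape (dim(C_k), tau)
    delta     : intrinsic indetermination of each series (strip width / toy sigma)
    sigma_p   : scale of the Gaussian cost prior (5.4)
    """
    log_lik = 0.0
    for cluster in partition:
        mu = cluster.mean(axis=0)                          # mean series (5.5)
        var = (cluster ** 2).mean(axis=0) - mu ** 2        # intra-cluster variance (5.6)
        total_var = delta ** 2 + var                       # squared diagonal of (5.8)
        # componentwise Gaussian log-likelihood of every series in the cluster (5.10)
        log_lik += np.sum(-0.5 * (cluster - mu) ** 2 / total_var
                          - 0.5 * np.log(2.0 * np.pi * total_var))
    n_h = len(partition)                                   # number of clusters
    log_prior = -0.5 * (n_h / sigma_p) ** 2 - np.log(sigma_p * np.sqrt(2.0 * np.pi))
    return log_lik + log_prior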
5.5 MCMC Step
A combinatorial result states that the number of possible partitions of the set D_N is the
Bell number B_N: the number of ways in which a set of N objects¹ can be obtained as a disjoint union of its non-empty subsets.
This number may be defined recursively as:
\[
B_{n+1} = \sum_{k=0}^{n} \binom{n}{k} B_k
\]
where the binomial coefficient gives the multiplicity of each partition of D_N into subsets of k < N objects.
In principle, once some kind of similarity among the objects is defined, one could try every partition until the best one is found.
The problem with B_N is that it grows with N faster than 2^N:

N    | 1 | 2 | 3 | 5  | 10     | 20             | 50
2^N  | 2 | 4 | 8 | 32 | 1024   | ∼ 1.05 · 10^6  | ∼ 1.1 · 10^15
B_N  | 1 | 2 | 5 | 52 | 115975 | ∼ 5.17 · 10^13 | ∼ 1.85 · 10^47

so the computation soon becomes prohibitive, and some kind of approximation is therefore needed.
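The entries of the table above can be reproduced from the recursion with a few lines of Python:

from math import comb

def bell_numbers(n_max):
    """Bell numbers B_0, ..., B_n_max via the recursion B_{n+1} = sum_k C(n, k) B_k."""
    bell = [1]                                   # B_0 = 1
    for n in range(n_max):
        bell.append(sum(comb(n, k) * bell[k] for k in range(n + 1)))
    return bell

# e.g. bell_numbers(10)[5] == 52 and bell_numbers(10)[10] == 115975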
For this reason it was decided to define a Markov Chain which visits the Posterior landscape through the logic of the Metropolis-Hastings Algorithm (introduced in section (3.4.2)).
Therefore a Markov Chain was considered whose target distribution is the Posterior P(m_h|D_N) (5.11).
In order to adapt the acceptance/rejection procedure of the Metropolis-Hastings algorithm to this problem, the following jump proposal q(m_p|m_h) was chosen, m_h being the current partition and m_p the proposed one:
Jump Proposal:
    sample series label i ∼ U[1, N]
    sample cluster label k ∼ U[1, n_h]
    if x_i ∈ C_{k_h} then: make the singleton {x_i}
    else if x_i ∉ C_{k_h} then: reallocate C_{k_h} ← x_i
    end if
Note that, according to this proposal, in passing from the current partition to the proposed one the number of clusters may vary only by one unit at each step:
\[
n_p = n_h \pm 1
\]
¹ as indeed is D_N
and, due to the uniform sampling, an analytic form of this proposal is:
\[
q(m_p|m_h) = \frac{1}{N n_h}
\]
Recalling the form of the acceptance probability of the MH algorithm (3.8), in this case it reads:
\[
A(m_h, m_p) = \min\left\{1,\; \frac{P(D_N|m_p)\,P(m_p)\, n_h}{P(D_N|m_h)\,P(m_h)\, n_p}\right\}
\]
Actually, the calculation was carried out via logarithms². In terms of logarithms the acceptance probability becomes:
\[
A(m_h, m_p) = \min\big\{0,\; [\log P(D_N|m_p) - \log P(D_N|m_h)] + [\log P(m_p) - \log P(m_h)] + [\log n_h - \log n_p]\big\}
\]
and, as a consequence, the acceptance threshold of the M-H algorithm was taken to be:
\[
\ln(u) \qquad \text{where } u \sim U[0,1]
\qquad (5.12)
\]
As previously mentioned, the behavior of the acceptance A(m_h, m_p) consists in a balance between the difference of the log-likelihoods (5.10):
\[
\Delta_{likelihood} = \log P(D_N|m_p) - \log P(D_N|m_h) = \log P(D_N|n_h \pm 1) - \log P(D_N|n_h)
\qquad (5.13)
\]
which always favors the increase of the number of clusters,
\[
\Delta_{likelihood} > 0 \iff n_p = n_h + 1,
\]
and the difference of the log-priors (5.4):
\[
\Delta_{prior} = \log P(m_p) - \log P(m_h) = \log P(n_h \pm 1) - \log P(n_h)
\qquad (5.14)
\]
which acts to limit this degeneracy, but in a non-linear fashion (figure (5.9)).
² It should be noted that log(·) is a monotone function, so that it does not change the ordering: x > y ⟹ log(x) > log(y).
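One step of this chain can be sketched as follows. Partitions are represented here as an integer label per series, the log-posterior is passed in as a callable, and all names are illustrative; this is a sketch of the scheme above, not the thesis code.

import numpy as np

def mcmc_step(labels, log_post, log_posterior_fn, rng):
    """One Metropolis-Hastings step on the space of partitions.

    labels           : integer array of length N, labels[i] = cluster of series x_i
    log_post         : log-posterior of the current partition
    log_posterior_fn : callable mapping a label array to its log-posterior
    rng              : numpy random Generator
    """
    clusters = np.unique(labels)
    n_h = clusters.size
    i = rng.integers(labels.size)                 # sample a series label
    k = rng.choice(clusters)                      # sample a cluster label
    proposal = labels.copy()
    if labels[i] == k:
        proposal[i] = labels.max() + 1            # make the singleton {x_i}
    else:
        proposal[i] = k                           # reallocate x_i into C_k
    n_p = np.unique(proposal).size
    log_post_prop = log_posterior_fn(proposal)
    # log acceptance: posterior ratio corrected by the proposal asymmetry n_h / n_p
    log_accept = min(0.0, log_post_prop - log_post + np.log(n_h) - np.log(n_p))
    if np.log(rng.uniform()) < log_accept:        # threshold ln(u), u ~ U[0, 1]
        return proposal, log_post_prop
    return labels, log_post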
Figure 5.9. Log-Prior gain - Behavior of the difference of log-priors Δ_prior (5.14) in the case of a proposal decreasing the number of clusters (5.15): n_h → n_p = n_h − 1. The order of magnitude depends on the dispersion parameter σ_p, whereas the behavior is the same for each value of σ_p. The prior (5.4) acts as a strong binding only for small values of n_h.
Figure (5.9) presents the behavior of
\[
\Delta_{prior}(n_h \to n_p = n_h - 1) = \log \mathcal{N}(n_h - 1; 0, \sigma_p) - \log \mathcal{N}(n_h; 0, \sigma_p)
\qquad (5.15)
\]
namely the prior gain due to the decrease of the number of clusters.
The order of magnitude of this gain strongly depends on the scale parameter σ_prior, whereas the behavior is the same: the prior (5.4) acts as a strong binding only for small values of n_h.
To conclude, it should be noted that the particular scale σ*_p adopted was chosen case by case, so that the prior would have an order of magnitude comparable with that of the difference in log-likelihood Δ_likelihood:
\[
\Delta_{likelihood} \sim \Delta_{prior}(\sigma^*_p)
\]
This consideration, together with the lack of an argument to fix a preferred value of σ_p, led to analyzing each time several orders of magnitude of the σ_p spectrum:
\[
\sigma_p = 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1
\]
5.6 Splitting Step
The Markov Chain of the MCMC step visits the posterior landscape step by step, and it is therefore possible that the chain gets trapped in a local maximum of P(m_h|D_N).
The splitting step was therefore defined in order to provide a macroscopic displacement in the posterior domain, with the aim of finding a higher maximum, i.e. a better partition.
Given a partition m_h = {C_1, C_2, ..., C_{n_h}}, the splitting was defined to act on each cluster of the partition as the best splitting into 2 clusters (figure (5.10)),
\[
C_k \;\overset{\text{splitting}}{\Longrightarrow}\; \{C_{k_1}, C_{k_2}\},
\]
and the process was driven by the maximization of the likelihood gain:
\[
\Delta_{splitting} = \big(L(C_{k_1}) + L(C_{k_2})\big) - L(C_k)
\qquad (5.16)
\]
where
\[
L(C) = \log P(C|m_h) = \sum_{x_i \in C} \log P(x_i|m_h)
\]
adopting the definition of cluster likelihood (5.9). Note that, consistently with the considerations made about the form of the likelihood (section (5.4.2)), the splitting is actually always a gain,
\[
\Delta_{splitting} > 0,
\]
since the splitting operation reduces the within-cluster dispersion σ_k (5.6), down to zero in the case of singleton clusters C_k = {x_i}, in which case the whole covariance matrix (5.8) of the multivariate Gaussian likelihood of the single series³ (5.7) degenerates into:
\[
\Sigma_{i,k_h} \;\overset{\text{singleton}}{\Longrightarrow}\; \delta_i \cdot \mathrm{Id}_{\tau \times \tau}
\]
Due to this fact, the dispersion of the 2-cluster system turns out to be smaller than that of the original C_k, so that the resulting log-likelihood L(C_{k_1}) + L(C_{k_2}) is greater.
The number of ways of splitting a cluster of N objects into k disjoint sub-clusters is known in the combinatorial literature as the
Stirling number of the 2nd kind: the number of ways to partition a set of N labeled objects into k nonempty unlabeled subsets,
denoted as:
\[
\left\{ {N \atop k} \right\} = \frac{1}{k!} \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} j^N
\]
It should be noted, a latere, that this number can be viewed as the k-th addend defining the Bell number:
\[
B_N = \sum_{k=0}^{N} \left\{ {N \atop k} \right\}
\]
which indeed takes into account all the possible ways of partitioning.
³ which in the case of a singleton cluster coincides with the cluster likelihood (5.9)
Figure 5.10. Splitting example - Schematic example of splitting τ = 25 time series: m_h = {C_purple} ⟹ {C_blue, C_red}, possibly wrongly clustered by the MCMC chain. It should be observed that such a drastic change in the structure of the partition would be reached by the Markov Chain alone only after many iterations.
The combinatorial problem of the splitting operation presented here corresponds to:
1. finding all the ways of splitting N objects into k = 2 sub-clusters:
\[
\left\{ {N \atop 2} \right\} = \frac{1}{2} \sum_{j=0}^{2} (-1)^{2-j} \binom{2}{j} j^N = 2^{N-1} - 1
\]
2. evaluating Δ_splitting on each of them in order to find the greatest gain.
The feasibility of this exact computation depends strongly on the time needed to calculate the log-likelihood of the system of 2 clusters provided by the splitting operation⁴.
This computation is really time-consuming: according to the likelihood expression (5.10), it depends on:
• the number N of series present in the cluster considered;
• the length τ of each series, which determines the number of single-coordinate contributions, according to the last equality in (5.10).
It was therefore decided to adopt this exact computation only for clusters up to dim(C_k) = N = 16 and to define an appropriate Markov Chain for bigger ones.
This MC was also designed according to the Metropolis-Hastings algorithm (section (3.4.2)), but now the goal was to maximize the gain Δ_splitting of each cluster.
In order to cope with this issue, the following was considered:
• a proposal which acts directly on the space of 2-cluster systems obtained from the original one, C, of dimension dim(C) = N:
2-Clusters Subspace Proposal:
    randomly split: C → {C_1, C_2}
    flag = 0
    while flag == 0 do:
        sample series label i ∼ U[1, N]
        sample cluster label k ∼ U[1, 2]
        if x_i ∈ C_k and dim(C_k) > 1 then:
            reallocate C_¬k ← x_i
            flag = 1
        end if
    end while
⁴ which has to be repeated \left\{ {N \atop 2} \right\} times.
The dichotomous logic of k or ¬k is allowed by the fact that the split partition, which provides the initial seed for every instance of the proposal, is made up of 2 clusters only. The while loop was chosen in order to avoid reallocating a series from a singleton cluster C_k (thereby eliminating it) into the (N−1)-sized one C_¬k, i.e. in order to keep the MC inside the subspace of 2-cluster partitions of the N series.
• the acceptance probability of the MH algorithm, which reduces simply to:
\[
A(m_h, m_p) = \min\{0, \Delta_{splitting}\}
\]
where the likelihood gain is evaluated, at every iteration, between the current partition m_h and the proposed one m_p;
• a logarithmic threshold ln(u), as in (5.12).
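A minimal sketch of this restricted chain, with the sub-cluster log-likelihood passed in as a callable and a boolean mask representing the 2-cluster split (illustrative names, not the thesis implementation):

import numpy as np

def split_cluster_mcmc(cluster, log_lik_fn, n_iter, rng):
    """Search the best split of `cluster` into two sub-clusters via Metropolis-Hastings.

    cluster    : array of shape (N, tau), one series per row, N > 16
    log_lik_fn : callable returning the log-likelihood L(C) of a sub-cluster array
    Returns the best boolean mask found (True = first sub-cluster).
    """
    n = cluster.shape[0]
    mask = rng.integers(0, 2, size=n).astype(bool)            # random initial split
    if mask.all() or (~mask).all():                            # keep both parts non-empty
        mask[0] = ~mask[0]
    current = log_lik_fn(cluster[mask]) + log_lik_fn(cluster[~mask])
    best_mask, best = mask.copy(), current
    for _ in range(n_iter):
        proposal = mask.copy()
        while True:                                            # 2-clusters subspace proposal
            i = rng.integers(n)
            k = bool(rng.integers(2))
            part = proposal == k
            if part[i] and part.sum() > 1:                     # never empty a sub-cluster
                proposal[i] = not k
                break
        prop = log_lik_fn(cluster[proposal]) + log_lik_fn(cluster[~proposal])
        if np.log(rng.uniform()) < min(0.0, prop - current):   # accept with log threshold
            mask, current = proposal, prop
            if current > best:
                best_mask, best = mask.copy(), current
    return best_mask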
Figure 5.11. Δ^RANDOM_SPLITTING(N) for different time series lengths τ - For every instance run = 1, ..., 100, a unique mother series in [−1:1] was generated and N daughter series were sampled as σ = 0.1 Gaussian fluctuations around it. In order not to introduce spurious correlations among values belonging to different sizes N, the sample of daughter series was generated ex novo for each N value. Red points refer to N ≤ 16 and are evaluated exactly, whereas the calculation of Δ^RANDOM_SPLITTING(N > 16) was carried out through the MCMC described in section (5.6) (blue points in the figure). The sizes N effectively calculated were: N = 2, 3, ..., 16, 17, 18, ..., 25, 30, 35, 40, 45, 50, 75, 100, 250 and 500. Error bars are the dispersions of the values in the 100-run sample. Bottom-up view of the results obtained for lengths τ = 10, 20, 35, 50, 100, 150, 200.
5.6.1 The Δ^RANDOM_SPLITTING Threshold
The splitting step was designed in order to separate noisy clusters from well-formed ones.
However, in the light of the following considerations:
• the unavoidable increase in likelihood, i.e. Δ_splitting > 0, whatever⁵ splitting operation is performed;
• the exact/MC splitting procedure acts to maximize Δ_splitting directly;
a threshold appeared to be necessary in order to control the process of acceptance/rejection of the best partition provided by the splitting procedure.
The threshold chosen, called Δ^RANDOM_SPLITTING, is characterized as the likelihood gain provided by splitting N-sized clusters that are well-formed⁶:
\[
\Delta^{RANDOM}_{SPLITTING}(N) \overset{def}{=} \big\langle \big(L(C_{k_1}) + L(C_{k_2})\big) - L(C_k) \big\rangle_{100\ run}
\qquad (5.17)
\]
where the statistical significance of its value is provided by the average over 100 instances.
This value provides a benchmark for each splitting operation: given a cluster C whose dimension⁷ is N, the splitting operation is accepted if and only if the gain it provides is greater than this threshold:
\[
\text{SPLITTING ACCEPTED} \iff \Delta_{splitting} > \Delta^{RANDOM}_{SPLITTING}(N)
\]
The values of Δ^RANDOM_SPLITTING(N) were computed directly:
• via exact computation, through the definition (5.17), if N ≤ 16;
• via MCMC, with the 2-clusters subspace proposal previously introduced, otherwise;
generating a cluster of N series, all fluctuations around the same mother, in order to characterize what a not-to-be-separated cluster looks like.
The computation of each value was repeated for every instance run = 1, ..., 100, starting from a different mother series, and for all values of the series length τ considered.
Computing values for all sizes N would be a very long task, but fortunately Δ^RANDOM_SPLITTING(N) presents a regular behavior (figure (5.11)), and it was therefore decided to compute only some values and then rely on a fit in order to extrapolate the intermediate ones (figure (5.12)).
⁵ namely "right" or "wrong" splitting, rather evident in the toy model, less so in real data
⁶ and therefore not to be separated
⁷ I recall that the dimension denotes the number of series contained in the cluster
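As an illustration of how this threshold can be estimated, the sketch below follows the toy-model construction (one mother uniform in [−1:1], N daughters as σ = 0.1 Gaussian fluctuations) and enumerates exactly the 2^(N−1) − 1 bipartitions for small N; the Gaussian cluster log-likelihood helper is a simplified stand-in for (5.10), not the thesis code.

import itertools
import numpy as np

def gauss_cluster_loglik(cluster, delta):
    """Componentwise Gaussian log-likelihood of a cluster around its mean series."""
    mu = cluster.mean(axis=0)
    var = delta ** 2 + (cluster ** 2).mean(axis=0) - mu ** 2
    return np.sum(-0.5 * (cluster - mu) ** 2 / var - 0.5 * np.log(2.0 * np.pi * var))

def best_split_gain(cluster, delta):
    """Exact best 2-way split gain of a small cluster (N <= 16)."""
    n = cluster.shape[0]
    base = gauss_cluster_loglik(cluster, delta)
    best = -np.inf
    for bits in itertools.product([False, True], repeat=n - 1):
        mask = np.array((True,) + bits)              # series 0 fixed to avoid double counting
        if mask.all():
            continue                                  # both parts must be non-empty
        gain = (gauss_cluster_loglik(cluster[mask], delta)
                + gauss_cluster_loglik(cluster[~mask], delta) - base)
        best = max(best, gain)
    return best

def random_splitting_threshold(n_series, tau, sigma=0.1, n_runs=100, seed=0):
    """Estimate the threshold (5.17): average best-split gain of a well-formed cluster."""
    rng = np.random.default_rng(seed)
    gains = []
    for _ in range(n_runs):
        mother = rng.uniform(-1.0, 1.0, size=tau)     # a fresh mother series each run
        daughters = mother + sigma * rng.standard_normal((n_series, tau))
        gains.append(best_split_gain(daughters, sigma))
    return float(np.mean(gains)), float(np.std(gains))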
Figure 5.12. Log and linear fit of Δ^RANDOM_SPLITTING(N) - τ = 100 - For every instance run = 1, ..., 100, a unique mother series in [−1:1] was generated and N daughter series were sampled as σ = 0.1 Gaussian fluctuations around it. In order not to introduce spurious correlations among values belonging to different sizes N, the sample of daughter series was generated ex novo for each N value. Red points refer to N ≤ 16 and are evaluated exactly, whereas the calculation of Δ^RANDOM_SPLITTING(N > 16) was carried out through the MCMC described in section (5.6) (blue points in the figure). The sizes N effectively calculated were: N = 2, 3, ..., 16, 17, 20, 25, 30, 35, 40, 45, 50, 75, 100, 250 and 500. Results obtained for length τ = 100. Log (top graph) and linear (bottom graph) fits are provided. Error bars are the dispersions of the values in the 100-run sample, consistently scaled in the logarithmic plot.
5.7 Merging Step
The last step of the procedure is the merging step, designed to merge together some of the clusters found through the splitting procedure, thereby reducing their number (figure (5.13)).
In order to define this procedure, let m_h be the current partition:
\[
m_h = \{C_1, C_2, \dots, C_{n_h}\} \;\overset{\text{merging}}{\Longrightarrow}\; \{(C_1 + C_3),\, (C_2 + C_{11} + C_8 + C_{10}),\, \dots,\, (C_{n_h} + C_6 + C_{27})\} = \{C_\alpha, C_\beta, \dots, C_{\hat{n}_h}\}
\]
where n̂_h < n_h and the last equality means that, once merged, the super-clusters created behave as normal clusters. This step is driven by the minimization of the likelihood loss:
\[
\Delta_{merging} = L\{C_\alpha, C_\beta, \dots, C_{\hat{n}_h}\} - L(m_h)
\qquad (5.18)
\]
where
\[
L(m) = \log P(D_N|m) = \sum_{C \in m} L(C|m) = \sum_{C \in m} \sum_{x_i \in C} \log P(x_i|m)
\qquad (5.19)
\]
adopting the definitions of cluster likelihood (5.9) and single-series likelihood (5.10). Δ_merging is therefore the decrease in likelihood incurred if the current partition m_h is merged until it becomes the partition {C_α, C_β, ..., C_{n̂_h}}.
Considerations specular to those already discussed for the splitting operation show that the merging is actually always a loss,
\[
\Delta_{merging} < 0.
\]
The merging operation indeed increases the within-cluster dispersion σ_k (5.6) and, correspondingly, decreases the likelihood with respect to that of the original clusters.
As for the Splitting and MCMC steps, also in this case the procedure of finding the best merged partition was carried out via a Markov Chain Monte Carlo, structured within the logic of the Metropolis-Hastings algorithm, but with the minimization of Δ_merging as the goal.
The Merging proposal chosen is identical to that adopted for the MCMC step, but acting on clusters instead of single series.
Let m_h be the current partition, providing n_h clusters, and let {C} represent the set of those not already merged⁸:
Merging Proposal:
    sample cluster label i ∼ U{C}
    sample cluster label k ∼ U[1, n_h]
    if C_i ∉ C_k then: merge C_k ← C_i
    end if
⁸ namely, to be distinguished from super-clusters, i.e. clusters resulting from previous merging operations
Figure 5.13. Merging example - Schematic example of merging τ = 25 time series: m_h = {C_1, C_2} ⟹ {C_{1+2}}, possibly wrongly split by the splitting procedure.
It was decided to operate in an irreversible way, namely a super-cluster already formed cannot be split again.
Note that, consistently, the index i of the proposed reallocating cluster C_i is selected only from the set {C} of clusters not already merged, i.e. not super-clusters. A sketch of one such merging step is given below.
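The sketch assumes the per-merge likelihood loss and the acceptance threshold (introduced in the next section) are passed in as callables; names and data layout are illustrative assumptions, not the thesis code.

import numpy as np

def merging_step(clusters, is_super, loss_fn, threshold_fn, rng):
    """One merging proposal on the current list of clusters.

    clusters     : list of arrays, one per cluster (shape (N_k, tau))
    is_super     : list of booleans, True for clusters already produced by a merge
    loss_fn      : callable giving Delta_merging = L(C_i + C_k) - (L(C_i) + L(C_k))
    threshold_fn : callable giving the acceptance threshold for sizes (N_1, N_2)
    """
    plain = [idx for idx, s in enumerate(is_super) if not s]   # not-yet-merged clusters
    if not plain:
        return clusters, is_super
    i = plain[rng.integers(len(plain))]                        # proponent cluster C_i
    k = rng.integers(len(clusters))                            # target cluster C_k
    if i == k:
        return clusters, is_super
    loss = loss_fn(clusters[i], clusters[k])
    if loss > threshold_fn(len(clusters[i]), len(clusters[k])):   # accept the merge
        merged = np.vstack([clusters[k], clusters[i]])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, k)] + [merged]
        is_super = [s for idx, s in enumerate(is_super) if idx not in (i, k)] + [True]
    return clusters, is_super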
5.7.1 The Δ^RANDOM_MERGING Threshold
This procedure, as described so far, would very quickly degenerate the partition into a unique big super-cluster, n_h-sized in terms of clusters, or N-sized in terms of single series.
Therefore, in order to establish a criterion of acceptance of the merging proposal, it was decided to compare the loss Δ_merging between the current and the proposed partition with a threshold:
\[
\text{MERGING ACCEPTED} \iff \Delta_{merging} > -|\text{threshold}|
\]
Following the logic adopted in defining Δ^RANDOM_SPLITTING, this merging threshold was defined as the likelihood loss provided by merging 2 clusters composed of series intrinsically belonging to different clusters.
This intrinsic dissimilarity was obtained by forming 2 clusters from time series belonging to different mother series.
The threshold Δ^RANDOM_MERGING was therefore introduced:
\[
\Delta^{RANDOM}_{MERGING}(N_1, N_2) \sim L\{C_1, C_2\} - \big(L(C_1) + L(C_2)\big)
\]
This value provides a benchmark for each merging instance: given two clusters C_1 and C_2, whose respective dimensions are N_1 and N_2, the merging proposal is accepted if and only if the loss provided by the merging operation is smaller in magnitude than this threshold:
\[
\text{MERGING ACCEPTED} \iff \Delta_{merging} > \Delta^{RANDOM}_{MERGING}(N_1, N_2)
\]
The values of Δ^RANDOM_MERGING(N_1, N_2) were computed by generating two samples, of sizes N_1 and N_2 respectively, of series intrinsically belonging to two different clusters.
This was achieved by considering two different mother series in [−1:1] and then sampling the daughter series around them.
The resulting matrix is symmetric,
\[
\Delta^{RANDOM}_{MERGING}(N_1, N_2) = \Delta^{RANDOM}_{MERGING}(N_2, N_1),
\]
so that, having considered the case max(N_1) = max(N_2) = N, it was necessary to compute only N(N + 1)/2 terms.
In order to avoid spurious correlations among values belonging to different pairs (N_1, N_2), particular attention was paid to the generation of the daughter samples: ∀ (N_1, N_2), two different mother series in [−1:1] were generated each time.
Thus the samples of daughters belonging to different values of the pair (N_1, N_2) were completely different.
The calculation was repeated up to the values (N_1 = 100, N_2 = 100) (figures (5.14) and (5.15)), and the statistical significance of the Δ^RANDOM_MERGING(N_1, N_2) threshold was provided by averaging its value over 100 runs of the computation:
\[
\Delta^{RANDOM}_{MERGING}(N_1, N_2) \overset{def}{=} \big\langle L\{C_1, C_2\} - \big(L(C_1) + L(C_2)\big) \big\rangle_{100\ run}
\qquad (5.20)
\]
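A sketch of this estimate under the toy-model assumptions (two mothers uniform in [−1:1], daughters as Gaussian fluctuations, and a simplified cluster log-likelihood standing in for (5.10)); illustrative code, not the thesis implementation:

import numpy as np

def gauss_cluster_loglik(cluster, delta):
    """Componentwise Gaussian log-likelihood of a cluster around its mean series."""
    mu = cluster.mean(axis=0)
    var = delta ** 2 + (cluster ** 2).mean(axis=0) - mu ** 2
    return np.sum(-0.5 * (cluster - mu) ** 2 / var - 0.5 * np.log(2.0 * np.pi * var))

def random_merging_threshold(n1, n2, tau, sigma=0.1, n_runs=100, seed=0):
    """Estimate the merging threshold (5.20): average loss from merging two clusters of
    daughters generated around two different mother series."""
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(n_runs):
        mothers = rng.uniform(-1.0, 1.0, size=(2, tau))          # two fresh mothers per run
        c1 = mothers[0] + sigma * rng.standard_normal((n1, tau))
        c2 = mothers[1] + sigma * rng.standard_normal((n2, tau))
        merged = np.vstack([c1, c2])
        losses.append(gauss_cluster_loglik(merged, sigma)
                      - gauss_cluster_loglik(c1, sigma)
                      - gauss_cluster_loglik(c2, sigma))
    return float(np.mean(losses)), float(np.std(losses))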
[Figure 5.14 panels: τ = 10, 25, 50 (mean on the left, dispersion on the right).]
Figure 5.14. Δ^RANDOM_MERGING(N_1, N_2) - τ = 10, 25, 50 - Mean (on the left) and dispersion (on the right) over 100 runs of the threshold of the merging procedure. The behavior is rather similar among the graphs, whereas the order of magnitude strongly depends on the series length τ.
[Figure 5.15 panels: τ = 100, 150, 200 (mean on the left, dispersion on the right).]
Figure 5.15. Δ^RANDOM_MERGING(N_1, N_2) - τ = 100, 150, 200 - Mean (on the left) and dispersion (on the right) over 100 runs of the threshold of the merging procedure. The behavior is rather similar among the graphs, whereas the order of magnitude strongly depends on the series length τ.
Chapter 6
The Clustering Results
After a brief review of the role of the parameters introduced, the test results obtained on the toy model will be presented, followed by those obtained on the financial time series of the data set considered.
6.1 Role of Parameters
When dealing with objects to be clustered, and in particular with time series, two parameters play a crucial role:
• N: the number of series to be clustered;
• τ: the length of each time series.
The first acts by increasing the overall noise, whereas the length of the series affects the clustering in the following way: if τ is too short, the reliability of the similarity among time series is low, whereas if it is too long, the time series ends up covering a period of time much bigger than the window around the Resistance or Support level considered (see the discussion in section (5.3)). Not negligible is also the increase of the computational time needed to process more and/or longer time series.
Introducing parameters in a model is a delicate issue, so their number was kept as small as possible, but some of them were unavoidable:
• σ_prior: it appears in the definition of the Prior (5.4) and actually represents its strength parameter. As discussed at the end of section (5.5), and as will become clear when inspecting the results, an appropriate value σ*_p would be the one at which prior and likelihood balance. The difficulty in determining this value a priori led to considering several orders of magnitude of the σ_p spectrum,
\[
\sigma_p = 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1,
\]
together with more values when needed.
• σ: it has a straightforward interpretation only in the toy model, representing the Gaussian dispersion of the daughters around the mother series values. When dealing with real series it was assumed σ = δ_i, thereby identifying the Gaussian dispersion (5.8) (of course also this is a strong assumption) around the mean series of the cluster (5.5) as an intrinsic dispersion, given by the strip width δ_i, plus an intra-cluster dispersion (5.6).
This is a sensible, maybe critical, point of the model, but it has to be noted also that, in computing the covariance matrix (5.8), it seemed the most natural assumption in order to take care of the intrinsic degree of dispersion δ_i of each time series.
[Figure 6.1 panels: σ = 0.25, 10^-1, 10^-2, 10^-3.]
Figure 6.1. Δ^RANDOM_MERGING - Dependence on σ - length τ = 100 - Plot of the merging threshold evaluated for samples of daughter series generated as σ = 0.25, 0.1, 10^-2, 10^-3 fluctuations around the mother series values. The main difference among the plots is the increase of the range of Δ^RANDOM_MERGING(σ) as the noise level σ decreases.
6.1.1 Noise Dependency of the RANDOM Thresholds
In order to consider the possibility that the real time series would be better represented by different levels of noise, the two thresholds of the splitting and merging steps were studied as functions of σ.
The study was carried out by evaluating Δ^RANDOM_SPLITTING and Δ^RANDOM_MERGING on daughter time series generated with different levels of dispersion σ around their mothers.
The results are less trivial than expected. It was found that:
• the threshold of the splitting step is substantially independent of the magnitude of the noise (figures (6.3) and (6.4));
• the threshold of the merging procedure depends strongly on σ. As summarized in figure (6.2), the main dependence is the increase of the range of values as σ decreases. Some plots of Δ^RANDOM_MERGING(σ) are reported in figure (6.1), whereas the rest are listed in appendix (A).
While in the toy model the true σ is, by construction, specified in the generating process of the daughter series, when dealing with real time series these observations led to running several merging steps in parallel, corresponding to different Δ^RANDOM_MERGING(σ) thresholds, and to determining the best choice by visually inspecting the clustering results.
Figure 6.2. Δ^RANDOM_MERGING - Dependence of the range of values on σ - length τ = 100 - Plot of the range of values of the merging threshold for samples of daughter series generated as σ = 0.25, 10^-1, 5·10^-2, 10^-2, 5·10^-3, 10^-3, 5·10^-4, 10^-4, 5·10^-5, 10^-5 Gaussian fluctuations around the mother series values. Some of the corresponding Δ^RANDOM_MERGING(σ) plots were presented above in figure (6.1); the others are omitted here for the sake of brevity and listed in appendix (A).
Figure 6.3. Δ^RANDOM_SPLITTING - Independence of σ - lengths τ = 100, 150, 200 - Overlapped plots of Δ^RANDOM_SPLITTING(N) evaluated for σ = 0.25, 0.1, 10^-2, 10^-3, 10^-4. Values are reported in semi-log scale up to N = 50 in order to magnify possible discrepancies among the five curves; there is no evidence of any. As previously discussed, red values were calculated exactly, whereas blue ones were calculated via the MCMC chain with the 2-clusters subspace proposal presented in section (5.6). Note that, unless otherwise specified, the entire analysis was carried out generating σ = 0.1 daughters and relying on the corresponding values of the splitting threshold.
Figure 6.4. Δ^RANDOM_SPLITTING - Independence of σ - lengths τ = 100, 150, 200 - Absence of trend - For each value of N, the dispersion of Δ^RANDOM_SPLITTING(N, σ) among the values belonging to the different σ was plotted. Values were rescaled by the mean value ⟨Δ^RANDOM_SPLITTING(N, σ)⟩_σ. As in figure (6.3), the values σ = 0.25, 0.1, 10^-2, 10^-3, 10^-4 were adopted and points are reported up to N = 50. The resulting plots show the absence of any trend in the dependence of the splitting threshold on σ.
6.2 Toy Model Clustering
In order to test the algorithm, several samples of daughter series were generated, belonging to 5 mother series (procedure described in section (5.2)). The sizes
N = 10, 25, 50, 100, 500, 1000
were considered, along with the lengths
τ = 10, 25, 50, 100, 150, 200.
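A sketch of the generation of such a toy dataset, assuming the mothers are drawn uniformly in [−1:1] as in the threshold computations of chapter 5 (illustrative code, not the thesis implementation):

import numpy as np

def toy_dataset(n_series, tau, n_mothers=5, sigma=0.1, seed=0):
    """Generate a toy dataset: n_series daughters evenly assigned to n_mothers mothers.

    Each mother is assumed to be sampled uniformly in [-1, 1]; daughters are
    sigma-Gaussian fluctuations around their mother (see section 5.2). Returns the
    daughters and the ground-truth labels used to check the clustering output.
    """
    rng = np.random.default_rng(seed)
    mothers = rng.uniform(-1.0, 1.0, size=(n_mothers, tau))
    labels = np.arange(n_series) % n_mothers                 # balanced assignment
    daughters = mothers[labels] + sigma * rng.standard_normal((n_series, tau))
    return daughters, labels

# e.g. X, truth = toy_dataset(n_series=50, tau=100)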
6.2.1 Insights of Convergence
When dealing with Markov Chains defined in the space of partitions, a natural way to monitor the behavior of the chain is to inspect:
• the trace of the number of clusters, which is expected to settle on the correct number of clusters (5 in the toy model);
• the trace of the Posterior, which is supposed to be maximized by the MC.
As is clear from figure (6.5), when convergence is reached, it is reached very "quickly".
However, the analysis did not always proceed linearly. It was expected that evidence of the presence of a right value σ*_p of the σ_prior parameter would be suggested by the behavior of the confidence:
\[
\text{confidence} = \frac{\text{Number of Occurrences}}{\text{Iterations} \times \text{Number of Parallel Chains}}
\qquad (6.1)
\]
which seemed to present a peak in correspondence with the value of σ_p at which the chain converges to the correct partition (1st plot in figure (6.6)). Unfortunately, the sharp peak tends to disappear as the number N of time series to be clustered increases. This can be interpreted as a flattening of the posterior landscape, for which reason it was decided to carry out the analysis over a wide range of the σ_p spectrum.
Figure 6.5. Traces of the Markov Chain - N = 25 - τ = 25 - σ_prior = 0.5 - The plots at the top represent the trace of the number of clusters of the partitions visited by the chain during the whole path of 150000 iterations (1st plot) and during the last 500 iterations. The bottom plots provide a synoptic view of the behavior of, from top to bottom: log-Likelihood, Posterior and log-Prior of the same chain. Results are provided for a sample of 5 parallel chains starting from different initial partitions, whose numbers of clusters were: T(0) = 16, 21, 15, 10, 9.
Figure 6.6. Confidence of the Chain - N = 10, 15, 20, 25 - τ = 10 - Plots of the confidence (6.1) of m = 5 parallel chains, whose target was the Posterior over the partitions of N = 10, 15, 20 or 25 toy series. Different markers correspond to different distances from the correct number of clusters (which is 5 in this toy model). What ought to be noted here is the progressive smoothing of the sharp peak of the confidence at the σ_p at which the chain finds the solution. It was therefore decided to consider each time the entire spectrum of σ_prior.
6.2.2 σ_prior Analysis and Sub-Optimal Partitions
In this section partial results are presented, obtained after the MCMC step alone. The analysis was carried out for several values of σ_prior, and the results were analyzed keeping the number N of time series fixed while letting their length τ vary and, vice versa, studying the clustering as N varies with τ fixed.
As is evident from figures (6.7) and (6.8), the two time series parameters act against the clustering: while at low values there exists a rather wide portion of the σ_prior spectrum at which the chain finds its way to the correct solution (denoted as red points at the 5-clusters level in the figures), for high N and/or τ values it is hardly, or not at all, reached.
Nevertheless this situation is not completely unsatisfactory; in fact the solution provided around the correct number-of-clusters level is actually a sub-optimal partition.
As one could imagine, high values of σ_prior always result in a relaxation of the partition, because the strength of the prior is not enough to keep the series together within a cluster.
On the other side, decreasing σ_p, the binding can become so strong as to keep merged together in a unique cluster even very different time series.
Therefore it should be convincing that, given the noise level σ with which the daughter series are generated, and for fixed N and τ values, each value of σ_prior results in a characteristic number of clusters.
What is less intuitive, and came as a surprise, is that the Markov Chain of the MCMC step, once trapped in a particular region of the Posterior space, corresponding to the characteristic number of clusters determined by the setting of the parameters, naturally finds the best partition given that number of clusters: some kind of sub-optimal partition (figures (6.9) and (6.10)).
This should be considered a noticeable result, as the aim of the analysis of real time series was only to find evidence of recurrent structures among them, not the perfect way of partitioning them.
[Figure 6.7 panels: N = 10, 25, 50, 100, 500, 1000.]
Figure 6.7. Clustering after the MCMC Step - length τ = 100 fixed - N = 10, 25, 50, 100, 500, 1000 series - Plot of the number of clusters of the partitions resulting from the MCMC step, obtained for several values of the σ_p spectrum, considering samples of N = 10, 25, 50, 100, 500, 1000 series of length τ = 100. Red points represent convergence to the exact 5-clusters partition, whereas blue points denote wrong clustering. The red dashed line marks the correct number-of-clusters level: 5. Results are obtained using m = 5 parallel chains, for 150000 iterations.
[Figure 6.8 panels: τ = 10, 25, 50, 100, 150, 200.]
Figure 6.8. Clustering after the MCMC Step - N = 50 series fixed - τ = 10, 25, 50, 100, 150, 200 - Plot of the number of clusters of the partitions resulting from the MCMC step, obtained for several values of the σ_p spectrum, considering a sample of N = 100 series of length τ = 10, 25, 50, 100, 150, 200. Red points represent convergence to the exact 5-clusters partition, whereas blue points denote wrong clustering. The red dashed line marks the correct number-of-clusters level: 5. Results are obtained using m = 5 parallel chains, for 150000 iterations.
Figure 6.9. Sub-optimal partition - N = 50 - τ = 50 - σ_prior = 0.15 - Clustering provided by the MCMC Step for the toy model at N = τ = 50 and σ_p = 0.15 (compare with the third plot in figure (6.8)). 3 clusters were found and, as is evident, this partition is the best partition into 3 clusters of the dataset considered: the blue and red clusters are "pure", whereas all the rest is concentrated in the remaining cluster.
Figure 6.10. Sub-optimal partition - N = 50 - τ = 25 - σ_prior = 0.45 - Clustering provided by the MCMC Step for the toy model at N = 50, τ = 25 and σ_p = 0.45 (compare with the second plot in figure (6.8)). 6 clusters were found and, as is evident, this partition is the best partition into 6 clusters of the dataset considered: the green cluster is wrongly split but the others are perfect.
6.2.3 Results of the Entire 3-Steps Procedure
As described in sections (5.6) and (5.7), the partitions provided as output of the MCMC step were processed via the Splitting and Merging operations; figure (6.11) shows the widening of the range of the σ_prior spectrum over which the correct partition is reached.
Figure 6.11. Example of the 3-steps procedure results - The number of clusters against σ_prior obtained after, respectively, the MCMC (top), Splitting (middle) and Merging (bottom) steps, for N = 50 and 100 toy time series of length τ = 100. Results are obtained using m = 5 parallel chains and 150000 iterations in the MCMC step (the same reported in figures (6.7, 4th graph) for N = 50 and (6.8, 4th graph) for N = 100), one chain for 25000 iterations in the Splitting step and m = 20 chains for 10000 iterations in the Merging step. The improvement of the convergence to the correct 5-clusters partition (red points) is evident.
6.3 Real Series Clustering
In order to cope with the clustering of real financial time series, the definitions of section (5.3) were adopted: the dataset was composed of time series defined on [−τ/2 : τ/2], centered around bounces on Resistance or Support levels and properly rescaled so that their values lie in the [−1 : 1] interval.
As discussed in section (4.1.1), windows of series previously rescaled every T = 45, 60, 90 and 180 seconds were considered.
In figure (6.12) two examples of the datasets of time series considered are reported.
Forgive the emotional expression, but the aim of the thesis was in fact to spot regularities in such a chaos!
Figure 6.12. Dataset of 4th bounce T = 45 time series, τ = 100 - Two examples of the datasets considered. The plot at the top represents a sample of 10³ series from the dataset of N = 1172 time series of length τ = 100, rescaled every T = 45 seconds, referring to the 4th bounce on Resistance levels. At the bottom is reported the plot of 10³ of the N = 1158 time series forming the dataset of 4th bounces on Support levels.
6.3.1 Missteps: Granularity and Short Series Effects
Recalling the results of section (4.4), more memory effects were expected in time series belonging to low T scales, and mainly in those centered around the 3rd or 4th bounce levels.
With this in mind, it was decided to begin by considering time series rescaled every T = 45 seconds and belonging to the 4th bounce levels.
As shown in the examples reported in figure (6.13), this choice was inappropriate due to granularity effects present at small scales: the trajectories consist of long periods of constant price interspersed with abrupt changes. Note that what is found here is consistent with what was previously noted in section (4.2) about the dispersion of low-scale time series, which suffer much more from the finite size of the tick minimum.
Recalling the considerations of section (5.3) and figure (5.7), the second kind of misstep
Figure 6.13. Granularity effects - Example of clusters found performing the cluster analysis on T = 45 time series of length τ = 100. On the left, the 4th/190 cluster found at σ_p = 10^-4; on the right, the 14th/190 of the same partition.
Figure 6.14. Short series effects - Example of clusters found performing the cluster analysis on T = 45 time series of length τ = 10. On the left, the 5th/194 cluster found at σ_p = 10^-1; on the right, the 7th/150 cluster found dealing with T = 45 time series of length τ = 20 at σ_p = 10^-3.
came from the analysis of small-τ time series, motivated by the wish to deal with series covering only the true neighborhood of a bounce event.
Also this kind of analysis proved inappropriate, because short time series look similar even when they are structurally different (figure (6.14)).
6.3.2 Correct Clustering Results
In order to avoid the kinds of problems discussed above, it was decided to consider series of length τ = 100, rescaled every T = 180 seconds, developing around 4th bounce events.
This leads to series that may cover more than a single window, but this (T, τ) choice was necessary in order to get a good compromise between length and persistence of the investors' memory. In figure (6.15) the whole datasets considered here, for Resistance and Support time series respectively, are presented.
Figure 6.15. Dataset of 4th bounce T = 180 time series, τ = 100 - The two datasets considered. The plot at the top represents the dataset of N = 91 time series of length τ = 100, rescaled every T = 180 seconds, referring to the 4th bounce on Resistance levels. At the bottom is reported the plot of the N = 72 time series forming the dataset of 4th bounces on Support levels.
The results that will be reported in this section refer to the entire 3-steps procedure, but it may be interesting to provide some graphical demonstration of the convergence of the Markov Chain defined for the MCMC step.
In this respect, figure (6.16) reports the trace of the number of clusters visited by the set of m = 20 chains adopted for the clustering, whereas figure (6.17) provides evidence of the maximization of the corresponding Posterior.
Figure 6.16. Convergence of the number of clusters - Trace of the number of clusters visited during the 250000 iterations of each of the m = 20 parallel chains adopted in the MCMC step (σ_prior = 10^-3) of the clustering analysis of the dataset presented at the top of figure (6.15), consisting of N = 91 time series of length τ = 100, rescaled every T = 180 seconds and belonging to 4th bounce Resistance levels. On the right, a focus on the initial transient, after which most of the chains stabilize around a common number of clusters. The best partition was chosen as the most visited overall.
Figure 6.17. Maximization of the Posterior - Trace of the log-Likelihood, on the left, and of the log-Posterior, on the right, corresponding to the set of 20 chains considered in figure (6.16). As expected, the Posterior is maximized by the MCs to levels of the same order of magnitude. As stated above, the best partition was chosen as the most visited overall.
As expected, the set of parallel chains converges to a rather common level of the number of clusters: even if there is no perfect agreement among the chains on which would be the correct partition, they all converge around the same order of magnitude of the number of clusters. This alone could already be regarded as a satisfactory result.
The Splitting and Merging steps refine the results, providing the clustering presented in figure (6.18), where an example of a cluster found for the same time series referred to in the previous figures (6.16) and (6.17) is reported.
At the top the first cluster found, C_1, is presented, whereas at the bottom the mean series μ_1, as introduced in section (5.4.2), is reported together with the 1 and 2 confidence levels expressed in units of the intra-cluster dispersion σ_1, defined in (5.6) and reported here for clarity:
\[
\sigma_1 = \sqrt{\langle x_i^2 \rangle_{C_1} - \mu_1^2}
\]
which, I recall, are vectors of τ components:
\[
\mu_1 = (\mu_{11}, \mu_{21}, \dots, \mu_{\tau 1}), \qquad
\sigma_1 = (\sigma_{11}, \sigma_{21}, \dots, \sigma_{\tau 1})
\]
This kind of representation provides visual insight into the goodness of the clustering. In this respect, it should be noted that the intra-cluster variance (σ_1)² appears as a term in the overall covariance matrix (5.8) of the single-series likelihood P(x_i|m) (5.7), which I recall here for clarity for a series, say x_i, belonging to the cluster C_1 considered (the index h is omitted):
\[
P(x_i|m) = \mathcal{N}_{x_i}(\mu_1, \Sigma_{i,1}) \qquad \text{where } x_i \in C_1
\]
\[
\Sigma_{i,1} =
\begin{pmatrix}
\sqrt{(\hat{\delta}_i)^2 + (\sigma_{11})^2} & 0 & \cdots & 0 \\
0 & \sqrt{(\hat{\delta}_i)^2 + (\sigma_{21})^2} & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \sqrt{(\hat{\delta}_i)^2 + (\sigma_{\tau 1})^2}
\end{pmatrix}
\]
where I recall that:
• the subscript x_i means that the multivariate normal distribution must be computed at (x_{i1}, x_{i2}, ..., x_{iτ});
• δ̂ was defined in (5.2) as the proper rescaling of the strip width δ according to the rescaling of the time series into the [−1 : 1] interval, and was adopted as the intrinsic degree of dispersion characterizing each series.
Therefore, according to this model, the true confidence level of the series x_i in the cluster C_1 would be:
\[
\mu_{t1} \pm \sqrt{(\hat{\delta}_i)^2 + (\sigma_{t1})^2} \qquad \text{where } t \in \left[-\tfrac{\tau}{2} : \tfrac{\tau}{2}\right]
\]
Figure 6.18. Example of overall clustering result - The 1st/20 cluster found with the 3-steps procedure acting on series of length τ = 100, rescaled every T = 180 seconds and belonging to 4th bounce Resistance levels. As previously stated, the series live in the rectangle [−τ/2 : τ/2] × [−1 : 1]. At the top the cluster is presented with the central event (the red dot). At the bottom is reported the mean series together with the 1 (in green) and 2 (in red) confidence levels around it, expressed in units of the intra-cluster dispersion (refer to the text for a detailed explanation). Results are obtained using m = 20 parallel chains and 250000 iterations in the MCMC step (at σ_prior = 10^-3), m = 5 chains for 25000 iterations in the Splitting step and m = 20 for 10000 iterations in the Merging step. The rest of the clustering results are reported in appendix (B).
and so the confidence level reported in the bottom plot of figure (6.18) represents a lower bound, common to all time series, for the confidence band that each time series in the cluster actually feels:
\[
\sigma_{t1} \;\leq\; \sqrt{(\hat{\delta}_i)^2 + (\sigma_{t1})^2}_{\,effective} \qquad \forall t \in \left[-\tfrac{\tau}{2} : \tfrac{\tau}{2}\right]
\]
Nevertheless, the (δ̂_i)² term provides a correction at least one order of magnitude smaller than the intra-cluster variance (σ_t1)², so the choice adopted provides an informative representation of the goodness of the cluster considered.
The analysis is completed by a characterization of the cluster time series in terms of the stock and trading day they belong to. In figure (6.19) the two histograms of stock and trading-day occurrences of the cluster C_1 considered above (figure (6.18)) are reported; other results are listed in appendix (B). This analysis suggests
Figure 6.19. Stock and trading-day occurrences - Histograms of the occurrences of the stocks and trading days to which the 4 series of the cluster C_1 presented in figure (6.18) belong. The convention adopted for the stock names is explained in section (4.1), and I recall that the trading year 2002 was composed of 250 trading days, all considered in the analysis, omitting the 248th for lack of data.
that while there is no evidence of memory across different trading days, there is similarity among trajectories referring to the same stock.
This could be explained by a trading strategy specialized for each stock. The algorithm is able to reconstruct this kind of memory, even though it was unexpected.
6.3.3 Attempt at a Cause-Effect Analysis
Visually inspecting the clusters found, it was recognized that the clustering algorithm frequently merges together series that look similar after the bounce event but develop different dynamics before it. In figure (6.20) three examples supporting this observation are presented.
Figure 6.20. Examples of different dynamics - Examples of clustered trajectories developing different dynamics in the first part of the plot and then showing similar behaviors after the bounce event. All plots refer to time series of length τ = 100, rescaled every T = 180 seconds. At the top, on the left, the 7th/14 cluster of trajectories bouncing on 4th Support bounce levels; on the right, the 21st/44 cluster of trajectories bouncing on 3rd Resistance bounce levels; at the bottom, the 100th/160 cluster of trajectories bouncing on 2nd Support bounce levels.
It was therefore decided to cluster separately the left [−τ/2 : 0] and right [0 : τ/2] halves of each series and then to investigate the relations among the clusters found.
In figure (6.21) two examples of clusters found for the left and right halves of the time series of length τ = 100, rescaled every T = 180 seconds and belonging to 4th Resistance bounce events, already considered in the clustering procedure of the previous section, are presented.
The other clusters found are listed in appendix (C.1).
Figure 6.21. Examples of half-series clustering - Two examples of series of length τ = 100, rescaled every T = 180 seconds, belonging to 4th Resistance bounce events and clustered half by half. At the top is the 1st/18 cluster found for the left half: the cause-cluster C_1, together with the confidence levels of 1 and 2 intra-cluster dispersion units, according to the discussion of the previous section. At the bottom an example of effect-cluster is presented: the 1st/18 cluster found for the right half of the same kind of time series considered above. The series are those already considered in the clustering procedure of the previous section.
Namely, the right halves of the series were thought of as effect-clusters, and for each of these clusters the provenance of each time series belonging to it was recorded. Provenance means the cause-cluster of origin, the cause-clusters representing the clustering of the left halves of the time series.
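The bookkeeping behind the histograms of figure (6.22) amounts to a cross-tabulation of the two label sets; a minimal sketch, assuming the half-series cluster labels are stored as integer arrays (illustrative names, not the thesis code):

import numpy as np

def provenance_counts(cause_labels, effect_labels):
    """Cross-tabulate cause-clusters vs effect-clusters.

    cause_labels[i]  : cluster of the left half  [-tau/2 : 0] of series x_i
    effect_labels[i] : cluster of the right half [0 : tau/2]  of series x_i
    Returns a matrix counts[c, e] = number of series from cause-cluster c
    that end up in effect-cluster e.
    """
    n_cause = int(np.max(cause_labels)) + 1
    n_effect = int(np.max(effect_labels)) + 1
    counts = np.zeros((n_cause, n_effect), dtype=int)
    for c, e in zip(cause_labels, effect_labels):
        counts[c, e] += 1
    return counts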
18 clusters were found for each half, consistent with the 20 found for the whole-series clustering (discussed in the previous section and listed in appendix (B)).
Examples of cause-effect relations are presented in the histograms of figure (6.22) for the 1st/18 effect-cluster C_1, also presented at the bottom of figure (6.21), and for the 7th one, both listed in appendix (C.1) together with the corresponding cause-clusters.
Histograms referring to the other effect-clusters are listed in appendix (C.2).
Figure 6.22. Example of cause-effect relation - Histograms assessing the "provenance" of the series belonging to the 1st (top) and 7th (bottom) of the 18 clusters found by clustering the right halves of the time series already considered in the whole-series clustering procedure presented in the last section.
Inspecting the histograms listed in appendix (C.2), evidence of cause-effect relations can be found, in the sense that time series developing similar dynamics in the second half of the window, and thus clustered together in the same effect-cluster, tend to originate from different cause-clusters; at the same time, some series originating from the same cause-cluster tend to keep their similarity also after the bounce event, consequently falling in the same effect-cluster.
However, the dataset seems to be too small to draw conclusive statements.
6.4 Conclusions and Further Analysis
The aim of this thesis work was to find regularities in price time series, focusing on peculiar points: Support and Resistance levels.
In coping with this issue, previous results about the effectiveness of those technical indicators were extended.
Feedback effects of the investors' strategies were expected in correspondence with these points, and indeed the persistence of memory was quantified by measuring the probability of a bounce of the price on those levels conditional on the number of previous bounces observed.
It was found that, by relaxing the definition of bounce (section (4.2)) in order to make it more familiar to the bare-eye one adopted by technical traders, memory effects appear strongly as a self-reinforcement of the investors' confidence in these indicators (section (4.4)).
Changing approach, a Bayesian algorithm was designed in order to spot those regularities directly detectable in the price dynamics around Support and Resistance levels (section (5.4)).
The algorithm acts in a 3-steps procedure with the aim of finding the best partition of the dataset of time series considered (section (5.1)).
It was tested on artificial time series from a toy model (sections (5.2) and (6.2)) with satisfactory results: it provides a good clustering even when it is not able to reach the perfect partition (section (6.2.2)).
Dealing with real financial time series (sections (4.1) and (5.3)), learning how to read the outputs of the algorithm, as well as the correct procedure needed to obtain significant results, took a long and systematic study (sections (6.1.1) and (6.3.1)).
It was a necessary study that provided non-trivial results about regularities in price time series (section (6.3.2)).
The analysis was concluded by attempting a study of the cause-effect relations suggested by directly observing the dynamics of the clustered time series, which developed different features before and after the bounce event they were centered at (section (6.3.3)).
Of course a great deal of time was spent on the definition and refinement of the clustering algorithm, and many interesting problems have not been tackled yet:
• considering time series standardized to mean increment equal to one, in order to take care only of the percentage changes of the price, not of its value, thereby bypassing the problems related to the tick minimum;
• considering time series in tick-time, in order to consider only the net effect of the investors' operations;
• extending the clustering to financial series covering more than one trading day, in order to detect the presence and effects of seasonal dynamics.
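As a pointer for the first item, here is a minimal sketch of one plausible standardization, assuming prices are stored as a NumPy array; the normalization chosen here (dividing log-increments by their mean absolute value) is an assumption of this sketch, not the procedure actually adopted in the thesis.

```python
import numpy as np

def standardize_to_unit_mean_increment(prices):
    """Rescale a price series so that the mean absolute increment equals one.

    Works on log-increments, so only percentage changes of the price matter
    and the minimum-tick size no longer sets the scale. The exact
    normalization used here is an assumption, not the thesis' own.
    """
    prices = np.asarray(prices, dtype=float)
    increments = np.diff(np.log(prices))      # percentage-like changes
    scale = np.mean(np.abs(increments))
    if scale == 0.0:
        return np.zeros_like(increments)      # flat series: nothing to rescale
    return increments / scale                 # mean |increment| == 1

# Toy usage on a short synthetic price path.
prices = 100.0 * np.exp(np.cumsum(0.001 * np.random.randn(50)))
x = standardize_to_unit_mean_increment(prices)
print(np.mean(np.abs(x)))                     # ~1 by construction
```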
To conclude, it should be observed that this analysis not only responds to academic interests, but is also of fundamental importance for understanding the degree of unpredictability of the market, a question of primary importance for policy-making purposes, considering that regularities in a financial market represent a real weakness against speculation.
Appendix A
Noise Dependency of Merging Threshold - List of Plots
Here are listed the plots of the dependency of the RANDOM MERGING threshold on the noise level σ with which the daughter series, adopted for the computation, were generated. On the left side are reported the mean values over the 100 runs, whereas on the right the corresponding dispersion. The range of values presented in figure (6.2), reported again here for clarity, corresponds to the color bar shown here.
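The following is a minimal sketch of this kind of numerical experiment, assuming daughter series are generated as Gaussian perturbations of a mother series; the merging-threshold statistic used here is only a hypothetical stand-in for the RANDOM MERGING quantity defined in chapter 5.

```python
import numpy as np

def daughter_series(mother, sigma, n_daughters, rng):
    """Generate daughter series as Gaussian perturbations (std `sigma`)
    around the values of a mother series."""
    return mother + sigma * rng.standard_normal((n_daughters, mother.shape[0]))

def merging_threshold(daughters):
    """Placeholder for the RANDOM MERGING statistic (hypothetical proxy:
    the largest pairwise mean-squared distance among the daughter series)."""
    d = daughters[:, None, :] - daughters[None, :, :]
    return np.max(np.mean(d ** 2, axis=-1))

# Mean and dispersion of the threshold over 100 independent runs,
# for each of the noise levels quoted in figure A.1.
rng = np.random.default_rng(0)
mother = np.cumsum(rng.standard_normal(100))          # length tau = 100
sigmas = [0.25, 1e-1, 5e-2, 1e-2, 5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5]
for sigma in sigmas:
    runs = [merging_threshold(daughter_series(mother, sigma, 10, rng))
            for _ in range(100)]
    print(f"sigma={sigma:g}: mean={np.mean(runs):.3e}, std={np.std(runs):.3e}")
```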
Figure A.1. RANDOM MERGING - Dependence of the range of values on σ - length τ = 100 - Plot of the range of values of the merging threshold for samples of daughter series generated with σ = 0.25, 10^{-1}, 5·10^{-2}, 10^{-2}, 5·10^{-3}, 10^{-3}, 5·10^{-4}, 10^{-4}, 5·10^{-5}, 10^{-5} around the mother series values. The corresponding RANDOM MERGING(σ) plots are listed on the following pages.
Figure A.2. σ = 0.25
Figure A.3. σ = 10^{-1}
Figure A.4. σ = 5·10^{-2}
Figure A.5. σ = 10^{-2}
Figure A.6. σ = 5·10^{-3}
Figure A.7. σ = 10^{-3}
Figure A.8. σ = 5·10^{-4}
Figure A.9. σ = 10^{-4}
Figure A.10. σ = 5·10^{-5}
Figure A.11. σ = 10^{-5}
Appendix B
Clustering Results - List of Plots
In this appendix are reported all the clusters found in the dataset of N = 91 time series of length τ = 100, rescaled every T = 180 seconds and belonging to the 4th bounce on Resistance levels. Results of the first cluster C1 were already presented in section (6.3.2). Graphs refer to the entire 3-step procedure (MCMC, Splitting and Merging) and are composed as follows:
• the plot of the cluster as it was found (top-left);
• the plot of the mean series and confidence levels, as discussed in section (6.3.2) (top-right);
• the histogram of stock occurrences in the cluster (bottom-left);
• the histogram of trading-day occurrences in the cluster (bottom-right).
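For illustration, the following is a minimal matplotlib sketch of this four-panel layout, assuming a cluster is available as a 2-D array of its member series together with the lists of stock tickers and trading days of its members; function and variable names are hypothetical, and this is not the code that produced the plots in this appendix.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_cluster_panel(series, stocks, days):
    """Four-panel summary of one cluster: member series, mean series with a
    one-standard-deviation band, and histograms of stock and trading-day
    occurrences (hypothetical layout mirroring the description above)."""
    series = np.asarray(series, dtype=float)        # shape (n_members, tau)
    mean, std = series.mean(axis=0), series.std(axis=0)

    fig, ax = plt.subplots(2, 2, figsize=(10, 7))
    ax[0, 0].plot(series.T, lw=0.7)                 # the cluster as it was found
    ax[0, 0].set_title("cluster members")
    ax[0, 1].plot(mean, color="k")                  # mean series and confidence band
    ax[0, 1].fill_between(np.arange(mean.size), mean - std, mean + std, alpha=0.3)
    ax[0, 1].set_title("mean series and confidence levels")
    for a, labels, title in ((ax[1, 0], stocks, "stock occurrences"),
                             (ax[1, 1], days, "trading-day occurrences")):
        uniq, counts = np.unique(labels, return_counts=True)
        a.bar(range(len(uniq)), counts)
        a.set_xticks(range(len(uniq)))
        a.set_xticklabels([str(u) for u in uniq], rotation=90)
        a.set_title(title)
    fig.tight_layout()
    return fig

# Toy usage with random walks standing in for the members of one cluster.
rng = np.random.default_rng(1)
fig = plot_cluster_panel(rng.standard_normal((6, 100)).cumsum(axis=1),
                         ["VOD", "VOD", "ULVR", "LLY", "VOD", "LLY"],
                         [12, 12, 66, 110, 110, 66])
fig.savefig("cluster_panel.png")
```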
cluster C1
cluster C2
cluster C3
cluster C4
cluster C5
cluster C6
cluster C7
cluster C8
cluster C9
cluster C10
cluster C11
cluster C12
cluster C13
cluster C14
cluster C15
cluster C16
cluster C17
cluster C18
cluster C19
cluster C20
Appendix C
Cause-Effect Clustering - List of Plots
In this appendix are reported all the clusters found by clustering the two halves of the same time series considered in appendix (B), together with the corresponding analysis of cause-effect relations.
C.1 Half-Series Clusters
Here are presented first the cause-clusters, namely those describing the left halves of the time series, and then the effect-clusters, describing the right halves. 18 clusters were found for each half, consistent with the 20 found for the whole-series clustering (listed in appendix (B)).
cause-clusters C1 and C2
cause-clusters C3 and C4
cause-clusters C5 and C6
cause-clusters C7 and C8
cause-clusters C9 and C10
cause-clusters C11 and C12
cause-clusters C13 and C14
cause-clusters C15 and C16
cause-clusters C17 and C18
effect-clusters C1 and C2
effect-clusters C3 and C4
effect-clusters C5 and C6
effect-clusters C7 and C8
effect-clusters C9 and C10
effect-clusters C11 and C12
effect-clusters C13 and C14
effect-clusters C15 and C16
effect-clusters C17 and C18
C.2 Cause-Effect Relations
Here are listed the histograms showing, for each effect-cluster, the occurrences of the cause-clusters of origin of the time series it contains. Both kinds of clusters are listed in appendix (C.1).
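A minimal sketch of how such occurrence counts can be computed, assuming that for each time series the labels of its cause-cluster and of its effect-cluster are already available; the label encoding and names are hypothetical.

```python
from collections import Counter, defaultdict

def cause_occurrences_by_effect(cause_labels, effect_labels):
    """For each effect-cluster, count how many of its members come from
    each cause-cluster (the content of the histograms listed below)."""
    table = defaultdict(Counter)
    for cause, effect in zip(cause_labels, effect_labels):
        table[effect][cause] += 1
    return {effect: dict(counts) for effect, counts in sorted(table.items())}

# Toy usage: six series with their cause- and effect-cluster labels.
cause_labels = ["C1", "C1", "C3", "C2", "C3", "C1"]
effect_labels = ["C1", "C2", "C1", "C1", "C2", "C2"]
print(cause_occurrences_by_effect(cause_labels, effect_labels))
# {'C1': {'C1': 1, 'C3': 1, 'C2': 1}, 'C2': {'C1': 2, 'C3': 1}}
```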
cause-cluster occurrences for effect-clusters C1 and C2
cause-cluster occurrences for effect-clusters C3, C4, C5, C6
cause-cluster occurrences for effect-clusters C7, C8, C9, C10
cause-cluster occurrences for effect-clusters C11, C12, C13, C14
cause-cluster occurrences for effect-clusters C15, C16, C17, C18
Bibliography
[1] Fama, Eugene (1970). "Efficient Capital Markets: A Review of Theory and
Empirical Work". Journal of Finance 25 (2): 383–417.
[2] Voit J., (2001) The Statistical Mechanics of Financial Markets. Springer.
[3] Jensen, M. C. (1978) Some anomalous evidence regarding market efficiency.
Journal of Financial Economics 6: 95-101.
[4] Bachelier, L. (1900) Theorie de la Speculation. Ann. Sci. Ecole Norm. Super., Ser. III-17, 21-86.
[5] Samuelson, P. A. (1973) Mathematics of Speculative Price. SIAM Review, Vol.
15 No. 1, pp. 1-42.
[6] Mandelbrot B. B. (1963) The variation of certain speculative prices. J. Business
38, 34
[7] Gardiner C., (2009) Stochastics Methods. Springer.
[8] Wang, B.H. and Hui, P.M. (2001) The distribution and scaling of fluctuations
for Hang Seng index in Hong Kong stock market. Eur. Phys. J. B 20, 573-579
[9] Dresdel S. Die Modellierung von Aktienmarkten durch stochastische Prozesse.
Diplomarbeit, Universitat Bayreuth, (2001, unpublished)
[10] Campbell J. Y., Lo A. W., and MacKinlay A. C., (2001) The econometrics of
financial markets. Princeton University Press.
[11] Kreps D. M., (1990) A course in microeconomic theory Princeton University
Press.
[12] Cristelli M., Pietronero L. and Zaccaria A. (2001) Critical Overview of Agent
Based Models for Economics
[13] Chakraborti A., Toke I. M., Patriarca M., Abergel, F. (2010) Econophysics:
Empirical Facts and Agent-Based Models
[14] Pring, M.J. (2002) Technical Analysis Explained. New York: McGraw-Hill.
[15] Campbell, J., Lo, A. W. and MacKinlay A. C. (1997) The Econometrics of
Financial Markets. Princeton University Press, Princeton, N.J.
[16] Lo A. W., Mamaysky H. and Wang J. Foundations of Technical Analysis:
Computational Algorithms, Statistical Inference, and Empirical implementation.
The Journal of Finance, Vol. LV , No. 4, August 2000
[17] Gehrig, T. and Menkhoff, L. (2003) Technical analysis in foreign exchange –
the workhorse gains further ground. Discussion paper, University of Hannover.
[18] Murphy, J. J. (1999) Technical Analysis of the Financial Markets. New York
Institute of Finance.
[19] Duda, R. O. , Hart, P. E. and Stork. D. G. Pattern Classification. John Wiley
& Sons, Inc., New York, second edition, 2001.
[20] R. A. Fisher (1936). The use of multiple measurements in taxonomic problems.
Annals of Eugenics 7 (2): 179–188.
[21] Mantegna R., Hierarchical structure in financial markets. Eur. Phys. J. B, 1999,
11, 193-197
[22] West, D.B. Introduction to Graph Theory. Prentice-Hall, Englewood Cliffs, NJ,
1996
[23] Papadimitriou, C.H., Steiglitz, K. Combinatorial Optimization. Prentice-Hall,
Englewood Cliffs, 1982.
[24] Standard & Poor’s 500 index at http://www.standardandpoors.com/, referenced in June, 2002.
[25] Day, A.C.L. (1955) The taxonomic approach to the study of economic policies.
The American Economic Review 45: 64-78
[26] Onnela, J.-P., Chakraborti, A. and Kaski, K. Dynamics of market correlations:
Taxonomy and portfolio analysis. Phys. Rev. E 68, 056110 (2003)
[27] Bayes, T. (1763), A letter to John Canton. Phil. Trans. Royal Society London
53: 269–71.
[28] Bayes, T. An essay towards solving a Problem in the Doctrine of Chances. Bayes’s essay in the original notation, available at http://www.stat.ucla.edu/history/essay.pdf
[29] D’Agostini, G. Bayesian reasoning in data analysis - A critical introduction.
World Scientific Publishing 2003
[30] Cox, R. Probability, Frequency, and reasonable expectation. Am. Jour. Phys.,
14:1-13 (1946)
[31] Jaynes, E. Probability Theory: The Logic of Science. Cambridge University Press
(2003)
[32] Kolmogorov, A.N. (1933) Foundations of the Theory of Probability. Chelsey
Publishing Company, New York, 1956
[33] Lad, F. Operational Subjective Statistical Methods: A Mathematical, Philosophical, and Historical Introduction. Wiley, 1996.
[34] Ghahramani, Z. Unsupervised Learning. Appeared in Bousquet, O., Raetsch, G.
and von Luxburg, U. (eds) Advanced Lectures on Machine Learning LNAI 3176.
Springer-Verlag.
[35] Jefferys, W.H. and Berger, J.O. Ockham’s razor and Bayesian analysis. Ameri-
can Scientist, 80:64–72, 1992.
[36] Liao, T.W. Clustering of time series data – a survey. Pattern Recognition 38
(2005) 1857-1874
[37] Han, J. Kamber, M. Data mining: concepts and techniques. Morgan Kaufmann,
San Francisco, 2001, pp. 346-389
[38] Focardi, S. M. Clustering Economic and Financial Time Series: Exploring the
existence of stable correlation conditions. The intertek Group. Discussion Paper
2001-04
[39] Golay, X., Kollias, S., Stoll, G., Meier, D., Valavanis, A., Boesiger, P. A new
correlation-based fuzzy logic clustering algorithm for fMRI. Mag. Resonance Med.
40 (1998) 249–260.
[40] Agrawal, R., C. Faloutsos and A. N. Swami, Efficient Similarity Search in
Sequence Databases. FODO, 1993.
[41] Keogh, E., K. Chakrabarti, M. Pazzani and S. Mehrotra, Dimensionality
Reduction for Fast Similarity Search in Large Time Series Databases, Journal of
Knowledge and Information Systems, 2000.
[42] Feller W. An Introduction to Probability Theory and Its Applications Wiley 3rd
edition (1968)
[43] Cencini, M. Cecconi, F. and Vulpiani, A. Chaos – From Simple Models to
Complex Systems World Scientific Publishing Co. Pte Ltd. 2010
[44] MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms.
Cambridge University Press, 2003.
[45] Newman, M.E.J. and Barkema, G.T. Monte Carlo Methods in Statistical Physics
Oxford University Press, 2001
[46] Eckhard, R. (1987). Stan Ulam, John Von Neumann and the Monte Carlo
method. Los Alamos Science, 15, 131–136.
[47] Metropolis, N., & Ulam, S. (1949). The Monte Carlo method. Journal of the
American Statistical Association, 44:247, 335–341.
[48] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller,
E. (1953). Equations of state calculations by fast computing machines. Journal
of Chemical Physics, 21, 1087–1091.
[49] Sokal, A. D. Monte Carlo Methods in Statistical Mechanics: Foundations and New Algorithms. Lectures at the Cargèse Summer School on "Functional Integration: Basics and Applications", September 1996
[50] Robert, C. P., & Casella, G. (1999). Monte Carlo statistical methods. New York:
Springer-Verlag.
[51] Neal, R. M. (1993). Probabilistic inference using markov chain monte carlo
methods. Technical Report CRG-TR- 93-1, Dept. of Computer Science, University
of Toronto.
[52] Andrieu, C., De Freitas, N., Doucet, A., Jordan, M.I. An Introduction to MCMC
for Machine Learning. Machine Learning 50, 5-43, 2003
[53] Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains
and their Applications. Biometrika 57, 97–109.
[54] Arthur W., Holland J. H., LeBaron B., Palmer R. and Tyler P., (1996) Asset
Pricing Under Endogenous Expectations in an Artificial Stock Market
[55] Lux T. and Marchesi M., Scaling and Criticality in a Stochastic multi-agent
model of a financial market, Nature, 397 (1999) 498
[56] Alfi V., Cristelli M., Pietronero L. and Zaccaria A., Minimal Agent Based
Model for Financial Markets I: Origin and Self-Organization of Stylized Facts.,
Eur. Phys. J. B, 67 (2009) 385
[57] Garzarelli, F., Cristelli M., Zaccaria A., and Pietronero L. (2012) Memory
effects in stock price dynamics: evidences of technical trading, in preparation
[58] Sornette D., Woodard R. and Zhou W.X. The 2006-2008 oil bubble: Evidence
of speculation, and prediction, Physica A 388 (2009) 1571-1576
[59] Caldarelli G. Scale-Free Networks: Complex Webs in Nature and Technology,
Oxford University Press.
[60] http://www.nyse.com/content/faqs/1042235995602.html?cat=Listed_Company___General
[61] http://stockcharts.com
[62] http://www.investopedia.com/terms/d/djia.asp
[63] http://www.prnewswire.com/news-releases/
[64] http://en.wikipedia.org/wiki/Economic_taxonomy
[65] www.forbes.com
[66] http://stats.oecd.org
[67] http://www.trade-ideas.com/Glossary/Support_and_Resistance.html
More Related Content

PDF
thesis
PDF
Machine learning-cheat-sheet
PDF
PDF
Business Mathematics Code 1429
PDF
TSAOFMv1.TableOfContents
PDF
phd_unimi_R08725
PDF
PDF
Lecturenotesstatistics
thesis
Machine learning-cheat-sheet
Business Mathematics Code 1429
TSAOFMv1.TableOfContents
phd_unimi_R08725
Lecturenotesstatistics

What's hot (18)

PDF
Thesis lebanon
PDF
Memoire antoine pissoort_aout2017
PDF
From sound to grammar: theory, representations and a computational model
PDF
Manual de tecnicas de bioestadística basica
PDF
Bachelorarbeit
PDF
PhD thesis "On the intelligent Management of Sepsis"
PDF
Discrete Mathematics - Mathematics For Computer Science
PDF
A Bilevel Optimization Approach to Machine Learning
PDF
Thesis
PDF
final_version
PDF
SMA206_NOTES
PDF
2012-02-17_Vojtech-Seman_Rigorous_Thesis
PDF
Macro lecture notes
PDF
The Cellular Automaton Interpretation of Quantum Mechanics
PDF
Thesis lebanon
Memoire antoine pissoort_aout2017
From sound to grammar: theory, representations and a computational model
Manual de tecnicas de bioestadística basica
Bachelorarbeit
PhD thesis "On the intelligent Management of Sepsis"
Discrete Mathematics - Mathematics For Computer Science
A Bilevel Optimization Approach to Machine Learning
Thesis
final_version
SMA206_NOTES
2012-02-17_Vojtech-Seman_Rigorous_Thesis
Macro lecture notes
The Cellular Automaton Interpretation of Quantum Mechanics
Ad

Viewers also liked (16)

PPTX
Controle as vendas com cartão via web
PPTX
Venda outsourcing de impressão neutralizado
PDF
Lavavajillas Siemens SR25M834EU
PDF
Green Tips from the Sunnah
PDF
Encimera BOSCH PXX375FB1E
PDF
Bitterness in Fish Protein Hydrolysates and Methods for Removal
PDF
PDF
Tears of my beloved
PPTX
Women in paintings by A. Bouvattier
PPTX
Rules of the work
PDF
Smiling Twice: The Heston++ Model
PPTX
Indo além de imprimir em dispositivos móveis: Conheça as novidades em mobilid...
PDF
Encimera Bosch PCQ875B21E
PPTX
Automatize a Entrada de NFS-e no SAP
PPTX
Anarquismo
Controle as vendas com cartão via web
Venda outsourcing de impressão neutralizado
Lavavajillas Siemens SR25M834EU
Green Tips from the Sunnah
Encimera BOSCH PXX375FB1E
Bitterness in Fish Protein Hydrolysates and Methods for Removal
Tears of my beloved
Women in paintings by A. Bouvattier
Rules of the work
Smiling Twice: The Heston++ Model
Indo além de imprimir em dispositivos móveis: Conheça as novidades em mobilid...
Encimera Bosch PCQ875B21E
Automatize a Entrada de NFS-e no SAP
Anarquismo
Ad

Similar to Clustering Financial Time Series and Evidences of Memory E (20)

PDF
MBIP-book.pdf
PDF
Notes econometricswithr
PDF
Methods for Applied Macroeconomic Research.pdf
PDF
Stochastic Processes and Simulations – A Machine Learning Perspective
PDF
The International Journal of Engineering and Science (IJES)
PDF
HHT Report
PDF
Mt 75 maelzer revise
PDF
solomonaddai
PDF
TSAOFMv2.TableOfContents
PDF
Dk24717723
PDF
book.pdf
PDF
An Adaptive Network-Based Approach for Advanced Forecasting of Cryptocurrency...
PDF
Data Clustering Theory Algorithms And Applications Guojun Gan
PDF
PDF
Real time clustering of time series
PDF
time_series.pdf
PDF
Case sas 2
PDF
Affine Term-Structure Models Theory And Implementation
MBIP-book.pdf
Notes econometricswithr
Methods for Applied Macroeconomic Research.pdf
Stochastic Processes and Simulations – A Machine Learning Perspective
The International Journal of Engineering and Science (IJES)
HHT Report
Mt 75 maelzer revise
solomonaddai
TSAOFMv2.TableOfContents
Dk24717723
book.pdf
An Adaptive Network-Based Approach for Advanced Forecasting of Cryptocurrency...
Data Clustering Theory Algorithms And Applications Guojun Gan
Real time clustering of time series
time_series.pdf
Case sas 2
Affine Term-Structure Models Theory And Implementation

Recently uploaded (20)

PPTX
ANEMIA WITH LEUKOPENIA MDS 07_25.pptx htggtftgt fredrctvg
PDF
Sciences of Europe No 170 (2025)
PPTX
Cell Membrane: Structure, Composition & Functions
PPTX
ognitive-behavioral therapy, mindfulness-based approaches, coping skills trai...
PPTX
Introduction to Fisheries Biotechnology_Lesson 1.pptx
PPTX
Introduction to Cardiovascular system_structure and functions-1
PDF
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
PPTX
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...
PDF
Cosmic Outliers: Low-spin Halos Explain the Abundance, Compactness, and Redsh...
PDF
Mastering Bioreactors and Media Sterilization: A Complete Guide to Sterile Fe...
PPT
protein biochemistry.ppt for university classes
PPTX
Vitamins & Minerals: Complete Guide to Functions, Food Sources, Deficiency Si...
PPTX
DRUG THERAPY FOR SHOCK gjjjgfhhhhh.pptx.
PPTX
neck nodes and dissection types and lymph nodes levels
PPTX
7. General Toxicologyfor clinical phrmacy.pptx
PDF
CAPERS-LRD-z9:AGas-enshroudedLittleRedDotHostingaBroad-lineActive GalacticNuc...
PDF
Unveiling a 36 billion solar mass black hole at the centre of the Cosmic Hors...
PPTX
Microbiology with diagram medical studies .pptx
PPTX
BIOMOLECULES PPT........................
PPTX
Classification Systems_TAXONOMY_SCIENCE8.pptx
ANEMIA WITH LEUKOPENIA MDS 07_25.pptx htggtftgt fredrctvg
Sciences of Europe No 170 (2025)
Cell Membrane: Structure, Composition & Functions
ognitive-behavioral therapy, mindfulness-based approaches, coping skills trai...
Introduction to Fisheries Biotechnology_Lesson 1.pptx
Introduction to Cardiovascular system_structure and functions-1
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...
Cosmic Outliers: Low-spin Halos Explain the Abundance, Compactness, and Redsh...
Mastering Bioreactors and Media Sterilization: A Complete Guide to Sterile Fe...
protein biochemistry.ppt for university classes
Vitamins & Minerals: Complete Guide to Functions, Food Sources, Deficiency Si...
DRUG THERAPY FOR SHOCK gjjjgfhhhhh.pptx.
neck nodes and dissection types and lymph nodes levels
7. General Toxicologyfor clinical phrmacy.pptx
CAPERS-LRD-z9:AGas-enshroudedLittleRedDotHostingaBroad-lineActive GalacticNuc...
Unveiling a 36 billion solar mass black hole at the centre of the Cosmic Hors...
Microbiology with diagram medical studies .pptx
BIOMOLECULES PPT........................
Classification Systems_TAXONOMY_SCIENCE8.pptx

Clustering Financial Time Series and Evidences of Memory E

  • 1. Clustering Financial Time Series and Evidences of Memory E ects Facoltà di Scienze Matematiche, Fisiche e Naturali Corso di Laurea Magistrale in Fisica Candidate Gabriele Pompa ID number 1146901 Thesis Advisor Prof. Luciano Pietronero Academic Year 2011/2012
  • 2. Clustering Financial Time Series and Evidences of Memory E ects Master thesis. Sapienza – University of Rome © 2012 Gabriele Pompa. All rights reserved This thesis has been typeset by LATEX and the Sapthesis class. Author’s email: gabriele.pompa@gmail.com
  • 3. Non scholae, sed vitae discimus. dedicata a mia madre, per avermi insegnato l’impegno, a mio padre, per avermelo fatto amare, e a Lilla, per sopportare tutto questo con amore.
  • 5. v Contents Introduction vii 1 Financial Markets 1 1.1 E cient Market Hypothesis . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Random-Walk Models . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Stylized Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.4 Technical trading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.4.1 Main assumptions and Skepticism . . . . . . . . . . . . . . . 8 1.4.2 Feed-back E ect and Common Figures . . . . . . . . . . . . . 9 2 Pattern Recognition 11 2.1 From the Iris Dataset to Economic Taxonomy . . . . . . . . . . . . . 11 2.2 Supervised and Unsupervised Learning and Classification . . . . . . 16 2.3 Bayesian learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3.1 Bayes Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3.2 Bayesian Model Selection . . . . . . . . . . . . . . . . . . . . 18 2.4 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4.1 Definition and Distinctions . . . . . . . . . . . . . . . . . . . 20 2.4.2 Time Series Clustering . . . . . . . . . . . . . . . . . . . . . . 21 2.5 Distance and Similarity . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5.1 Information Theoretic Interpretation . . . . . . . . . . . . . . 23 3 Monte Carlo Framework 25 3.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2 Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.3 Static Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.3.1 Rejection Sampling . . . . . . . . . . . . . . . . . . . . . . . . 27 3.3.2 Hit-or-Miss Sampling: a numerical experiment . . . . . . . . 28 3.4 Dynamic Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.4.1 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.4.2 MCMC and Metropolis-Hastings Algorithm . . . . . . . . . . 32 4 Memory E ects: Bounce Analysis 35 4.1 The Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.1.1 T Seconds Rescaling . . . . . . . . . . . . . . . . . . . . . . . 36 4.2 Bounce: Critical Discussion About Definition . . . . . . . . . . . . . 37 4.3 Consistent Random Walks . . . . . . . . . . . . . . . . . . . . . . . . 42
  • 6. vi Contents 4.4 Memory E ects in Bounce Probability . . . . . . . . . . . . . . . . . 45 4.5 Window Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.5.1 Recurrence Time . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.5.2 Window Size . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.5.3 Fluctuations within Window . . . . . . . . . . . . . . . . . . 55 5 The Clustering Model 59 5.1 Structure of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . 59 5.2 Toy Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.3 Real Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.4 Best Partition: Bayesian Characterization . . . . . . . . . . . . . . . 65 5.4.1 Gaussian Cost Prior . . . . . . . . . . . . . . . . . . . . . . . 67 5.4.2 Gaussian Likelihood . . . . . . . . . . . . . . . . . . . . . . . 68 5.5 MCMC Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.6 Splitting Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.6.1 RANDOM of SPLITTING . . . . . . . . . . . . . . . . . . . . 78 5.7 Merging Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.7.1 RANDOM of MERGING . . . . . . . . . . . . . . . . . . . . . 82 6 The Clustering Results 87 6.1 Role of Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 6.1.1 Noise Dependency of RANDOM Thresholds . . . . . . . . . 89 6.2 Toy Model Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.2.1 Insights of Convergence . . . . . . . . . . . . . . . . . . . . . 92 6.2.2 ‡prior Analysis and Sub-Optimal Partitions . . . . . . . . . . 95 6.2.3 Results of the Entire 3-Steps Procedure . . . . . . . . . . . . 100 6.3 Real Series Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6.3.1 Missteps: Granularity and Short Series E ects . . . . . . . . 102 6.3.2 Correct Clustering Results . . . . . . . . . . . . . . . . . . . . 103 6.3.3 Attempt of Cause-E ect Analysis . . . . . . . . . . . . . . . . 108 6.4 Conclusions and Further Analysis . . . . . . . . . . . . . . . . . . . . 111 A Noise Dependency of Merging Threshold - List of Plots 113 B Clustering Results - List of Plots 119 C Cause-E ect Clustering - List of Plots 141 C.1 Half-Series Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 C.2 Cause-E ect Relations . . . . . . . . . . . . . . . . . . . . . . . . . . 160 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
  • 7. vii Introduction Among the community of investors, that of technical trading is a growing school of thought. Every day more and more investors rely on technical indicators. Standard economic theory considers the concept of "e ciency" of the market (Fama, 1970) as the cornerstone of the whole theory. This assumption postulates the impossibility of the existence of investment strategies without risk. This hypothesis would endorse a simplistic stochastic modeling of the market in terms of a random walk of the price. But there are various empirical evidences, known as "stylized facts", which seriously restrict the validity of totally random models to explain the performance of the market. Recently the problem was approached from a di erent point of view, that is, if in the performance of market there were evidences of investment strategies familiar and simultaneously adopted by a large community of investors [57]. Focusing on "technical" indicators of investment, such as Supports and Resistances, was studied the probability of correct prediction of such indicators conditional on the number of times they had been previously exploited. It were shown evidences of memory e ects. In this thesis work was developed a new method to investigate regularity in the market. The problem was approached from the point of view of the presence of "similarity" in the performance of the price around the points where a level of Support or Resistance had been already identified. The procedure that was defined and refined over months led to design an original algorithm for the clustering of time series. The thesis consists of 6 chapters and 3 appendices, namely the core of the results are presented in chapters from 4 to 6, whereas the first three chapters provide the necessary background and in the appendices are listed all plots necessary for completeness and omitted in the text for clarity. The structure of the chapters is the following: • Chapter 1: is a rapid review of the standard economic theory mainly focused around the E cient Market Hypothesis, the main random walk models designed to explain market behavior and the corresponding drawbacks. The chapter ends with an introduction on the philosophy of technical trading and provides examples of common technical indicators, such as Supports and Resistances. • Chapter 2: provides the essential background of the theory of Pattern Recogni- tion, together with examples of applications from ancient and modern literature, then specializes around the statistical framework developed starting from the
  • 8. viii Introduction Bayes rule and finally introduces the central concept of Clustering, stressing the aspects concerning time series clustering. • Chapter 3: introduces the numerical instruments adopted, namely those of the Monte Carlo sampling theory. The MC theory is briefly revised considering separately static and dynamic methods. The chapter ends the essential features of the theory of Markov Chains, necessary in order to contextualize the Monte Carlo Markov Chains methods widely adopted in the numerical simulations performed. • Chapter 4: intends to critically review the results previously obtained on the analysis of the rebounds on the Support and Resistance levels. Then, the bounce analysis is extended to the statistical properties characteristics times describing bounces and typical fluctuations of price around those events. • Chapter 5: this completely original chapter introduces the bayesian algorithm adopted in the subsequent clustering analysis. Stated the 3-steps structure of the procedure, each step is analyzed in detail in order to provide a mathematical basis and in order to make possible more reproducible results. • Chapter 6: here are reported all the results obtained via the clustering procedure adopted. Are reported both the results obtained with the toy-model, used to test the algorithm, both those with the real financial time series, with the hope of being able to report as objectively as possible weaknesses as well as positive aspects of the algorithm designed for clustering purposes. Although it should be considered that this thesis work represents, in my opinion, only the beginning of this original and fascinating analysis, evidences of structural regularities among time series analyzed are e ectively found even at this early stage.
  • 9. 1 Chapter 1 Financial Markets 1.1 E cient Market Hypothesis Since 1970, with Fama’s work [1], the dominant assumption on capital markets has been the E cient Market Hypothesis (EMH). Under this hypothesis the market is viewed as an open system instantly processing all the informations available. The concept of information available, or information set ◊t, describes the corpus of knowledge on which traders base their investment decisions. These could be based on public or private informations, such as expected profit of a company, interest rate and expectation of dividends [2, Chap. 8]. Jensen (1978): "A market is e cient with respect to information set ◊t if it is impossible to make economic profits by trading on the basis of information set ◊t"[3]. The e ciency of the market is expressed by the absence of arbitrage, namely the impossibility to realize riskless strategies relying only on the time needed by the price to reach again to its fundamental value after an operation, the fundamental value being that expected on the base of ◊t. This automatic self-organization of the market yields to prices always fully reflecting the information available, namely price increments are substantially random. In finance, the variable related to the price increment in a lapse · of time, from price pt to price pt+· , is called the return r· (t) and is defined in various ways.
  • 10. 2 1. Financial Markets Let pt be the price of a financial asset at time t. Then possible definitions of returns adopted are: • Linear Returns r· (t) = pt+· ≠ pt (1.1) which has the advantage of being linear, but directly depends on the currency. • Relative Returns r· (t) = pt+· ≠ pt pt (1.2) which takes account only of the percentage changes. However, with this definition two consecutive and opposite variations are not equivalent to null variation1. • Logarithmic Returns r· (t) = log(pt+· ) ≠ log(pt) ≥ pt+· ≠ pt pt (1.3) where the approximation is valid in the case of high frequency data (more details in section 4.1) in which the absolute variation |pt+· ≠ pt| of the price is much smaller than the value pt. Market e ciency can be mathematically formalized by the martingale property: E[pt+1|p0, p1, ..., pt] = pt (1.4) to be satisfied by the price time series, which states the statistical independence of the price from its history. This condition correspond exactly to what is defined as a perfect market [2]. Really, an e cient market needs a finite time to self-organize itself, time quantified practically via the two-point autocorrelation function: fl· (t, tÕ ) = E[r· (t)r· (tÕ)] ≠ E[r· (t)]E[r· (tÕ)] E[r2 · (t)] ≠ E[r· (t)]2 (1.5) which is indeed always zero, except for very short scales, up to few minutes (figures (1.1) and (1.2)) 1.2 Random-Walk Models The e cient market hypothesis led to a random-walk modelization of price time series. The first attempt to formalize the dependence between price variation x and the time t, was made by Bachelier in his doctoral thesis [4]. He proposed to use a gaussian form for the distribution of the price change x at the time t x ≥ N(0, ‡2 ) , ‡ ≥ Ô t x = pt+· ≠ p· 1 Two consecutive increments such as a gain of +1% and a loss of ≠1% of a pt = 100 $ will not result on the same value for the price after them.
  • 11. 1.2 Random-Walk Models 3 -0.4 -0.2 0 0.2 0.4 0 10 20 30 40 50 60 70 80 90 Autocorrelation Lag BNPP.PA 1-minute return BNPP.PA 5-minute return FIG. 3. Autocorrelation function of BNPP.PA returns. 0 0.2 0.4 0.6 0.8 1 0 10 20 30 40 50 60 70 80 90 Autocorrelation Lag BNPP.PA 1-minute return BNPP.PA 5-minute return FIG. 4. Autocorrelation function of BNPP.PA absolute re- turns. 3. Volatility clustering The third “stylized-fact” that we present here is of pri- mary importance. Absence of correlation between re- turns must no be mistaken for a property of indepen- dence and identical distribution: price fluctuations are not identically distributed and the properties of the dis- tribution change with time. In particular, absolute returns or squared returns ex- hibit a long-range slowly decaying auto correlation func- tion. This phenomena is widely known as “volatility clustering”, and was formulated by Mandelbrot (1963) as “large changes tend to be followed by large changes – of either sign – and small changes tend to be followed by small changes”. On figure 4, the autocorrelation function of absolute 10 -3 10 -2 10 -1 10 0 0 1 2 3 Empiricalcumulativedistribution Normalize FIG. 5. Distribution of log-returns and monthly returns. Same data s mum (more than 70% at the firs pled every five minutes. However frequency, autocorrelation is still hours of trading. On this data, w law decay with exponent 0.4. O port exponents between 0.1 and Liu et al. (1997); Cizeau et al. ( 4. Aggregational normality It has been observed that as scale over which the returns are property becomes less pronoun tion approaches the Gaussian fo “stylized-fact”. This cross-ove mented in Kullmann et al. (199 of the Pareto exponent of the di scale is studied. On figure 5, we distributions for S&P 500 inde 1950 and June 15th, 2009. It is time scale increases, the more G is. The fact that the shape of t with makes it clear that the ran prices must have non-trivial tem B. Getting the right “time” 1. Four ways to measure “time” In the previous section, all “s presented in physical time, or c series were indexed, as we expec Figure 1.1. Autocorrelation function (1.5) of BNP Paribas (BNPP.PA) logarithmic returns (1.3), over periods · = 1 and 5 minutes, as a function of the lag t≠tÕ . Source: Chakraborti et al. Econophysics: Empirical Facts and Agent-Based Models [13]. 8 M. Cristelli, L. Pietronero, and A. Zaccaria 0 100 200 300 t (day) 0 0,2 0,4 0,6 0,8 1 Autocorrelationr 0 50 100 150 t (tick) -0,2 0 0,2 0,4 0,6 0,8 1 Autocorrelationr Fig. 3. – We report the autocorrelation function of returns for two time series. The series of the main plot is the return series of a stock of New York Stock Exchange (NYSE) from 1966 to 1998 while the series of the inset is the return series of a day of trading of a stock of London Stock Exchange (LSE). As we can see the sign of prices are unpredictable that is the correlation of returns is zero everywhere. The time unit of the inset is the tick, this means that we are studying the time series in event time and not in physical time. which describes the tail behavior of the distribution P(x) of returns. The complementary cumulative distribution function F(x) of real returns is found to be approximately a power law F(x) x with exponent in the range 2 4 [15], i.e. the tails of the probability density function (pdf) decay with an exponent + 1. Since the decay is much slower than a gaussian this evidence is called Fat or Heavy Tails. 
Sometimes a distribution with power law tails is called a Pareto distribution. The right tail (positive returns) is usually characterized by a di erent ex- ponent with respect to the left tail (negative returns). This implies that the distribution is asymmetric in respect of the mean that is the left tail is heavier than the right one ( + > ). Moreover the return pdf is a function characterized by positive excess kurtosis, a Gaus- sian being characterized by zero excess kurtosis. In fig. 4 we report the complementary cumulative distribution function F(x) of real returns compared with a pure power law Figure 1.2. Autocorrelation function (1.5) of the return time series of a stock of New York Stock Exchange (NYSE) from 1966 to 1998 (main plot) and of a stock of London Stock Exchange (LSE) (in the inset). The lag-time unit of the inset is the event time, or tick, i.e. the number of transactions (more details on the meaning of this choice can be found in section (4.1)). Note that the exact definition for the returns here is note relevant because the graphs refer to high-frequency data. Source: Cristelli M., Pietronero L. and Zaccaria A. (2001): Critical Overview of Agent Based Models for Economics [12]. The expected value of a common stock’s price change is always zero E[x] = 0 thus reflecting the martingale property (1.4), but the Bachelier’s model assigns finite probability to negative value of stock prices, increasingly with time: Èx2Í ≥ ·.
  • 12. 4 1. Financial Markets Citing Samuelson (1973) [5]: "Seminal as the Bachelier model is, it leads to ridiculous results. [. . . ] An ordinary random walk of price, even if it is unbiased, will result in price becoming negative with a probability that goes to 1/2 as t æ Œ. This contradicts the limited liability feature of modern stocks and bonds. The General Motors stock I buy for 100 $ today can at most drop in value to zero, at which point I tear up my certificate and never look back. [. . . ] The absolute-Brownian motion or absolute random-walk model must be abandoned as absurd." The random-walk paradigm was actually introduced among the economic community by Samuelson’s work as the geometric Brownian motion model providing a geo- metric random-walk dynamics of the price, log-normally distributed, and a normal distribution of returns: r· (t) ≥ N(µ·, ‡ Ô ·) r· (t) = log(pt+· ) ≠ log(pt) (1.6) In order to provide an evidence that the price behavior is substantially not predictable, avvalorating the random walk hypothesis, we present on figure (4.5) a comparison between the performance of a real stock and the simulation of a suitable random walk. VOD, 110th trading day of 2002 Consistent Random Walk Figure 1.3. On the left: price time series of the Vodafone (VOD) stock in the 110th trading day of the year 2002. On the right: comparison with the consistent random walk: pt+1 = pt + N(µ, ‡), where µ = ≠1.2 · 10≠5 is the mean linear return (1.1) of Vodafone in the case considered and ‡ = 0.02 is the corresponding dispersion. The detailed definition of consistent random walk and its meaning can be found in section (4.3).
  • 13. 1.3 Stylized Facts 5 1.3 Stylized Facts The geometric brownian motion model circumvents the di culties of the abso- lute-random walk model but still has several drawbacks, summarized in empirical evidences known as Stylized Facts (SF) [2, Chap. 5] [12]: • Fat-tailed empirical distribution of returns: very large price fluctuation are more likely than in a gaussian distribution (figure (1.4)). 10-4 10-3 10-2 10-1 -1.5 -1 -0.5 0 0.5 1 1.5 Probabilitydensityfunction Log-returns BNPP.PA Gaussian Student FIG. 1. (Top) Empirical probability density function of the normalized 1-minute S&P500 returns between 1984 and 1996. Reproduced from Gopikrishnan et al. (1999). (Bottom) Em- pirical probability density function of BNP Paribas unnor- malized log-returns over a period of time = 5 minutes. trading. Except where mentioned otherwise in captions, this data set will be used for all empirical graphs in this section. On figure 2, cumulative distribution in log-log scale from Gopikrishnan et al. (1999) is reproduced. We also show the same distribution in linear-log scale com- puted on our data for a larger time scale = 1 day, showing similar behaviour. Many studies obtain similar observations on di erent sets of data. For example, using two years of data on more than a thousand US stocks, Gopikrishnan et al. (1998) finds that the cumulative distribution of returns asymptotically follow a power law F(r ) |r| with 10-3 10-2 10-1 100 Cumulativedistribution FIG. 2. Empi returns. (Top in log-log scal price between 14956 values, call that e plaining fat that models popular in ec tions ( < 1) tical evidenc Figure 1.4. Empirical probability density function of BNP Paribas (BNPP.PA) unnor- malized logarithmic returns (1.3) over a period of time · = 5 minutes. The graph is computed by sampling a set of tick-by-tick data from 9:05 am till 5:20 pm between Jan- uary 1st , 2007 and May 30th , 2008, i.e. 356 days of trading. Continuous and dashed lines are respectively gaussian and Student-t fits. Source: Chakraborti et al. Econophysics: Empirical Facts and Agent-Based Models [13]. • Absence of simple arbitrage: the sign of next price time variation is unpredictable on average, namely Èr· (t)r· (t + T)Í is substantially zero. In figure (1.5) is reported the autocorrelation function of the returns of the DAX index 2 on the scale · = 15 minutes. It is noteworthy that up to a lag-time of 53ÕÕ the correlation is positive, whereas until 9.4Õ there is anti-correlation, nevertheless it is really weak. 2 The DAX index is a blue chip stock market index consisting of the 30 major German companies trading on the Frankfurt Stock Exchange. According to the New York Stock Exchange (NYSE), a blue chip is stock in a corporation with a national reputation for quality, reliability and the ability to operate profitably in good times and bad [60].
  • 14. 6 1. Financial Markets Figura 4.2: Grafico della autocorrelazione alla scala di = 15 dell’indice DAX. I punti corrispondono al valore della funzione di autocorrelazione R15 (t t ) al variare dell’intervallo temporale t t . Le barre d’errore corrispondono all’intervallo di confidenza di 3 , la linea continua è il fit. Tratta da: S.Dresdel, Modellierung von Aktienmarkten durch stochastische Prozesse, Diplomarbeit, Universitat Bayreuth, 2001. Figure 1.5. Autocorrelation of return time series of the DAX index. Returns are evaluated on a period of · = 15 minutes and the autocorrelation function (1.5) is plotted against the lag time T = t ≠ tÕ . Error bars correspond to confidence interval of 3‡ and the continuous line is the fit. Source: S. Dresdel (2001) "Modellierung von Aktienmarkten dutch stochastische Prozesse", Diplomarbeit, Universitat Bayreuth [9]. • Volatility Clustering: intermittent behavior of price fluctuation, regardless the sign. 10 M. Cristelli, L. Pietronero, and A. Zaccaria 0 2000 4000 6000 8000 t (days) -40 -20 0 20 40 returns(Δp) 0 2000 4000 6000 8000 t (days) -0,2 -0,1 0 0,1 logreturns Fig. 5. – Return time series of a stock of NYSE from 1966 to 1998. The two figures represent the same price pattern but returns are di erently computed. In the top figure returns are calculated as simple di erence, i.e. rt = pt pt t while in the bottom one returns are log returns that is rt = log pt log pt t. From the lower plot we can see that volatility appears to be clustered and therefore large fluctuations tend to be followed by large ones and vice versa. The visual impression that the return time series appears to be stationary for log returns suggests the idea that real prices follow a multiplicative stochastic process rather than a linear process. behavior happens for small ones. In Economics the magnitude of price fluctuations is usually called volatility. It is worth noticing that a clustered volatility does not deny the fact that returns are uncorrelated (i.e. arbitrage e ciency). In fact correlation does not imply probabilistic independence, while the contrary is true. Therefore the magnitude of the next price fluctuations is cor- related with the present one while the sign is still unpredictable. In other words stock prices define a stochastic process where the increments are uncorrelated but not independent. Di erent proxies for the volatility can be adopted: widespread measures are the absolute Figure 1.6. Return time series of a stock of New York Stock Exchange (NYSE) from 1966 to 1998. At the top are reported linear returns (1.1) r· (t) = pt+· ≠ pt, whereas at the bottom are plotted logarithmic returns (1.3). Mainly from the lower plot is evident that price changes tend to be clustered, so to move coherently despite the sign. Source: Cristelli M., Pietronero L. and Zaccaria A. (2001): Critical Overview of Agent Based Models for Economics [12].
  • 15. 1.3 Stylized Facts 7 Even if the e ciency condition – Èr· (t)r· (t + T)Í negligible – is substantially satisfied, non-linear correlations of absolute È|r· (t)||r· (t + T)|Í and squared Èr2 · (t)r2 · (t + T)Í returns are still present due to volatility clustering [13]: Mandelbrot (1963) [6]: "Large changes tend to be followed by large changes of either sign and small changes tend to be followed by small changes" (figure (1.6)). -0.4 -0.2 0 0.2 0.4 0 10 20 30 40 50 60 70 80 90 Autocorrelation Lag BNPP.PA 5-minute return FIG. 3. Autocorrelation function of BNPP.PA returns. 0 0.2 0.4 0.6 0.8 1 0 10 20 30 40 50 60 70 80 90 Autocorrelation Lag BNPP.PA 1-minute return BNPP.PA 5-minute return FIG. 4. Autocorrelation function of BNPP.PA absolute re- turns. 3. Volatility clustering The third “stylized-fact” that we present here is of pri- mary importance. Absence of correlation between re- turns must no be mistaken for a property of indepen- dence and identical distribution: price fluctuations are not identically distributed and the properties of the dis- tribution change with time. In particular, absolute returns or squared returns ex- hibit a long-range slowly decaying auto correlation func- tion. This phenomena is widely known as “volatility clustering”, and was formulated by Mandelbrot (1963) as “large changes tend to be followed by large changes – of either sign – and small changes tend to be followed by small changes”. On figure 4, the autocorrelation function of absolute returns is plotted for = 1 minute and 5 minutes. The levels of autocorrelations at the first lags vary wildly with the parameter . On our data, it is found to be maxi- 10 -3 10 -2 10 -1 0 1 2 3 4 Empiricalcumulativedistribu Normalized return τ = 1 τ = 1 m Gau FIG. 5. Distribution of log-returns of S&P and monthly returns. Same data set as figu mum (more than 70% at the first lag) fo pled every five minutes. However, whatev frequency, autocorrelation is still above 1 hours of trading. On this data, we can gr law decay with exponent 0.4. Other em port exponents between 0.1 and 0.3 (Co Liu et al. (1997); Cizeau et al. (1997)). 4. Aggregational normality It has been observed that as one inc scale over which the returns are calcula property becomes less pronounced, and tion approaches the Gaussian form, whi “stylized-fact”. This cross-over phenom mented in Kullmann et al. (1999) wher of the Pareto exponent of the distributio scale is studied. On figure 5, we plot the distributions for S&P 500 index betwe 1950 and June 15th, 2009. It is clear tha time scale increases, the more Gaussian is. The fact that the shape of the distr with makes it clear that the random pro prices must have non-trivial temporal st B. Getting the right “time” 1. Four ways to measure “time” In the previous section, all “stylized f presented in physical time, or calendar series were indexed, as we expect them minutes, seconds, milliseconds. Let us tick-by-tick data available on financial m the world is time-stamped up to the mill Figure 1.7. Autocorrelation function (1.5) of BNP Paribas (BNPP.PA) absolute logarithmic returns (1.3), over periods · = 1 and 5 minutes, as a function of the lag-time. Source: Chakraborti et al. Econophysics: Empirical Facts and Agent-Based Models [13]. Absolute or squared returns exhibit long-range slow decaying autocorrelation compatible with a stochastic process with increments uncorrelated but not independents (figures (1.7) and (1.8)). Figure 1.8. 
Autocorrelation function of · = 1 minute returns, squared returns and absolute returns of the Vodafone (VOD) stock in the 110th trading day of the year 2002 (price time series is shown on the left in figure (4.5)). Lag-time T is in 1 minute units.
  • 16. 8 1. Financial Markets 1.4 Technical trading Technical analysis is a method of forecasting price movements using past prices. A leading technical trader defines his field: Pring (2002) [14]: "The technical approach to investment is essentially a reflection of the idea that prices move in trends that are determined by the changing attitudes of investors toward a variety of real and psychological forces" . Here real and psychological forces should be considered as exogenous and endogenous informations, namely, economical and political news on one side, and past price series, interpreted as technical patterns, on the other one. 1.4.1 Main assumptions and Skepticism Among traders the knowledge of random-walk theory is rather widespread so the motivations underpinning technical analysis seems to be even more peculiar: 1. the market discounts everything: technical traders believe that price is itself the only ◊t needed to make decisions. Price at present time reflects all the possible causes of its future movements. 2. price moves in trends: trend is the "behaviour" of the price time series, it could be bullish, bearish or sideway and it is more likely to persist than to be ended. 3. history repeats itself: if some particular kind of figures or patterns have anticipated the same bullish or bearish behavior, technical analysis argues it will happen again. Investors react in the same way to similar conditions. This assumptions sound in open contrast with the EMH as they rely directly on the price time series, as they appear, instead on the unknown stochastic process generating price movements. Despite the widespread use of technical instruments among traders, the academic community tend to be skeptical about technical analysis, mainly for these reasons: • Acceptance of EMH • Linguistic and methodological barriers Assuming EMH and the full rationality of agents, no speculative opportunities should be present in the market. Operators should base their investment decisions only on market information set ◊t, namely would be present only "fundamentalists". However, especially in case of bubble crash [58], the influence of fundamental data is not so strong as to rule out the possibility of other, possible endogenous, influences, i.e. the presence of investors who rely in past price histories in order to make their investment decisions: "Technical traders" (or simply: "Chartists") [2, Chap. 8].
  • 17. 1.4 Technical trading 9 On the other side, linguistic barriers could be illustrated contrasting this technical jargon statement [15]: The presence of clearly identified Supports and Resistance levels, coupled with a one-third retracement parameter when prices lie between them, suggest the presence of strong buying and selling opportunities in the near term. with this one: The magnitude and decay patterns of the first twelve autocorrelations and the statistical significance of the Box-Pierce Q-Statistic suggest the presence of a high frequency predictable component in stock returns. The last barrier I quote is of methodological nature: technical analysis is primarily visual, employing the tools of geometry and pattern recognition, whereas quantitative finance is mainly algebraic and numerical [16]. 1.4.2 Feed-back E ect and Common Figures One of the main reasons for which technical analysis should work is that a huge number of investors rely on it: Gehrig and Menkho (2003) [17]: " [...] technical analysis dominates foreign exchange and most foreign exchange traders seem to be chartist now". This mass behavior reflects in a feedback e ect of investors own decisions: if a common figure is known in literature to anticipate a specific trend, even if the future price movement would have been in the opposite direction, the huge amount of capital moved a ects the price history making it fulfilling investors expectations. In this perspective, I briefly review two of the main technical figures 3. • Moving Average: PM (t) is the average of the past n price values, defined as PM (t) = nÿ k=0 wkP(t ≠ k) where wk are the weights. The moving average is an example of technical indicator as it signals the inversion of the trend crossing the price graph. In technical words: PM (t) acts as a dynamical supports in a bullish trend and as a resistance during a bearish one (figure (1.9)). 3 There are more technical figures to be mention, such as: Head-and-Shoulders, Inverse-Head- and-Shoulders, Broadening tops and bottoms and so on [18]. Their definitions are slightly more involved than the two mentioned in the text and will not be discussed in this thesis.
  • 18. 10 1. Financial Markets ULVR, 66th trading day of 2002 Figure 1.9. Moving average of the price time series of the Unilever (ULVR) stock in the 66th trading day of the year 2002. The average is computed with constant weights wk over the last 5 minutes of trading. The two squared points represent (author’s opinion) respectively selling and buying signals. • Supports and Resistances: these levels represent local minima and maxima and the price here is more likely expected to bounce than to cross them. In investors psychology, at a Support level an investor relying on this indicator believes that the demand is so strong as to overcome the supply, preventing further decrease of price, vice versa for a Resistance level (figure (1.10)). CAPITOLO 5. ANALISI TECNICA 70 Figura 5.2: Esempio di supporto (linea verde) e di supporto (linea rossa). Fonte: stockcharts.com gerisce che quando il prezzo scende avvicinandosi al supporto i compratori sono più inclini a comprare e i venditori meno inclini a vendere. Quando il prezzo raggiunge il livello del supporto si crede che la domanda superi l’offerta e impedisce che il prezzo cada sotto il supporto. Una resistenza è quel livello del prezzo al quale si pensa che l’offerta sia forte abbastanza da impedire che il prezzo salga ancora. Quando il prezzo si avvicina alla resi- stenza, i venditori sono più inclini a vendere e i compratori diventano meno inclini a comprare. Quando il prezzo raggiunge il livello della resistenza, si crede che l’offerta superi la domanda in modo da impedire che il prezzo salga ancora (Cf. [29]). In figura 5.2 è mostrato un esempio di supporto e resistenza. I supporti e le resistenze verranno ampiamente trattati nel capitolo 6 in cui si discuterà la possibilità di introdurre una definizione quantitativa che sarà testata per mezzo dell’analisi statistica delle serie temporali finanziarie. Figure 1.10. Example of support and resistance from the Lilly Eli & Co. (LLY) stock, traded at the New York Stock Exchange (NYSE) during Febraury the 2nd of the year 2000. Source: http://stockcharts.com [61].
  • 19. 11 Chapter 2 Pattern Recognition In this chapter the main features of the theory of Pattern Recognition will be reviewed in order to provide a wide context for the algorithm that will be adopted coping with the clustering of financial time series. 2.1 From the Iris Dataset to Economic Taxonomy The term Pattern Recognition (PR) refers to the task of placing some object to a correct class based on the measurements about the object [19]. Objects to be recognized, measurements and possible classes can be almost everything so there are very di erent PR tasks. A spam (junk-mail) filter, a recycling machine, a speech recognizer or the Optical character recognition protocol (OCR) are both Patter Recognition Systems, they play a central role in everyday life. The beginning of the modern discipline probably dates back to the paper of R.A. Fisher: "The use of multiple measurements in taxonomic problems" [20, 1936]. He considered as an example the Iris dataset, collected by E. Anderson in 1935, which contains sepal width, sepal lent, petal width and petal length measurement from 150 irises belonging to three di erent sub-species (figure (2.1)).
• 20. 12 2. Pattern Recognition
Figure 2.1. Correct classification of the Iris dataset: 150 observations of sepal and petal width and length. Irises belong to three different subspecies: Setosa, Versicolor and Virginica. Classification is performed through a 5-Nearest-Neighbor algorithm with 30 training observations (see section (2.2)), assigning each new Iris observation to the subspecies most common among its 5 nearest neighbors.
In order to provide a basis for comparison with current developments and applications of the theory, we can consider the Economic Taxonomy, which is a
• 21. 2.1 From the Iris Dataset to Economic Taxonomy 13
system of classification of economic activity, including products, companies and industries [25] [64]. In 1999 Mantegna [21] considered the portfolio of stocks used to compute the Standard and Poor's 500 (S&P 500) index and the 30 stocks entering the Dow Jones Industrial Average (DJIA), over the time period from July 1989 to October 1995, and introduced a method for finding a hierarchical arrangement of stocks traded in a financial market. He compared the synchronous time evolution of pairs (i, j) of daily stock prices through the correlation coefficient ρ_ij of the logarithmic returns, defined in (1.3), evaluated over a period of τ = 1 trading day, r_i ≝ r^{stock=i}_{τ=1 day}(day = t), and then, defining an appropriate metric 1 d_ij = d_ij(ρ_ij), he quantified the degree of similarity between stocks as:
ρ_ij = (⟨r_i r_j⟩ − ⟨r_i⟩⟨r_j⟩) / √((⟨r_i²⟩ − ⟨r_i⟩²)(⟨r_j²⟩ − ⟨r_j⟩²))   (2.1)
d_ij = √(2(1 − ρ_ij))   (2.2)
Having defined the distance matrix 2 d_ij, he used it to determine the minimal spanning tree (MST) [22] of the stocks in the portfolio. To provide a simple description of the constructive procedure defining the MST, I will refer to figure (2.2), from Mantegna's work, which shows the MST of the portfolio of stocks used in computing the DJIA index [62]. Stocks are labelled by their official ticker symbol, whose coding can be found at www.forbes.com; recent updates of the labels can be found at http://en.wikipedia.org/wiki/List_of_S%26P_500_companies. The MST of a set of n elements is a graph with n − 1 links [23]. Here the nodes are the stocks and the links between them are weighted by the d_ij's. In building the MST, one first fills a list, sorted in ascending order, with the non-diagonal elements of the distance matrix d_ij:
{d_1st, d_2nd, d_3rd, ..., d_[n(n−1)/2]th}  where  d_1st < d_2nd < d_3rd < ... < d_[n(n−1)/2]th
Then the two nearest stocks, here Chevron (CHV) and Texaco (TX), are added as a start:
d_1st = d_CHV−TX = 0.949   (2.3)
The growth of the tree follows the aforementioned list:
d_2nd = d_TX−XON = 0.962
so the Exxon company (XON) is added to the MST and linked to the TX stock.
1 In section (2.5) I will stress further the concept of metric/distance and its role in the theory.
2 By definition d_ij is a symmetric matrix with d_11 = ... = d_nn = 0, so that only n(n−1)/2 entries are relevant and need to be computed.
• 22. 14 2. Pattern Recognition
Figure 2.2. Minimal spanning tree connecting the 30 stocks used to compute the Dow Jones Industrial Average (DJIA) in the period July 1989 to October 1995. The 30 stocks are labeled by their tick symbols. See text for details. The red oval (author's graphical modification) encloses the Chevron and Texaco stocks, the nearest stocks. Texaco was acquired by Chevron on October 9, 2001. Original source: R.N. Mantegna (1999), Hierarchical structure in financial markets.
At the next step:
d_3rd = d_KO−PG = 1.040
and these two stocks (Coca-Cola and Procter & Gamble) are both added to the tree, because neither of them has been counted yet. The MST keeps growing in this way, discarding d_kth = d_uv if and only if the stocks u and v are already connected through the growing tree, i.e. if adding the link would close a loop.
Looking back at (2.3) and at the red oval in figure (2.2), the following news sounds less surprising:
NEW YORK, October 3, 2001 /PRNewswire/ – Standard & Poor's [...] will make the following changes in the S&P 500, [...] after the close of trading on Tuesday, October 9, 2001 [...] Equity Office Properties (NYSE: EOP) will replace Texaco Inc. (NYSE: TX) in the S&P 500 Index. Texaco is being acquired by S&P 500 Index component Chevron (NYSE: CHV) in a transaction expected to close on that date. [63]
The logic behind the minimal spanning tree is that of an arrangement of elements which selects the most relevant connections of each element of the set, based on the distance matrix d_ij. With this MST Mantegna provided a taxonomy of the well defined group of stocks considered.
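To make the constructive procedure concrete, the following minimal sketch (my own illustration, not Mantegna's original code) computes the correlation-based distance of equations (2.1)-(2.2) from a matrix of log-returns and grows the tree with the loop-rejection rule described above (Kruskal's algorithm); the variable names and the union-find bookkeeping are assumptions of this sketch.

# Minimal sketch of Mantegna's procedure: correlation distance (2.2) + MST growth.
# `returns` is assumed to be an (n_days x n_stocks) array of daily log-returns.
import numpy as np

def mantegna_mst(returns, tickers):
    rho = np.corrcoef(returns, rowvar=False)             # rho_ij, eq. (2.1)
    d = np.sqrt(2.0 * (1.0 - rho))                       # d_ij, eq. (2.2)
    n = len(tickers)
    # the n(n-1)/2 off-diagonal distances, sorted in ascending order
    pairs = sorted((d[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))                              # union-find: detects loop-closing links
    def root(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    edges = []
    for dij, i, j in pairs:
        ri, rj = root(i), root(j)
        if ri != rj:                                     # add the link only if it closes no loop
            parent[ri] = rj
            edges.append((tickers[i], tickers[j], dij))
        if len(edges) == n - 1:                          # an MST over n nodes has n-1 links
            break
    return edges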
• 23. 2.1 From the Iris Dataset to Economic Taxonomy 15
The link between Mantegna's work and the pioneering work of Fisher is based on the study of the meaningfulness of this taxonomy. Mantegna compared his results with the reference grouping of stocks provided by Forbes [65], which assigns each stock to one of 12 business sectors and 51 industries (figure (2.3)).
[Figure: minimal spanning tree of 116 S&P 500 stocks, with business sectors marked according to Forbes.]
Figure 2.3. Minimal spanning tree of the returns of 116 S&P 500 stocks. Data extend from the beginning of 1982 to the end of 2000. Links are weighted according to Mantegna's correlation-based distance (2.2). Business sectors are indicated according to Forbes [65]. Source: Onnela et al. (2003), Dynamics of market correlations: Taxonomy and portfolio analysis [26].
In assessing the meaningfulness of the taxonomy provided by Mantegna's method,
the classification of Forbes represents a reference in the same way in which the three subspecies, Setosa, Virginica and Versicolor, were the benchmark for the classification of the irises. Of course, while there may be common agreement on how many and which the iris sub-species are, in coping with economic taxonomy one could alternatively have used different kinds of classification, e.g. the Global Industry Classification Standard (GICS) [24]: when dealing with the classification of real data there is often neither a unique nor a proper way to classify. The classification system makes the difference.
• 24. 16 2. Pattern Recognition
2.2 Supervised and Unsupervised Learning and Classification
Usually most of the data describing objects is useless for classification. So the classification task is preceded by a feature extraction, which selects only those features that best represent the data for classification. The classifier takes as input the feature vector x of the object to be classified and assigns the feature vector (i.e. the object) to the most appropriate class.
x_iris = (sepal length = 5.9, sepal width = 3.0, petal length = 4.2, petal width = 1.5)
x_iris → Versicolor subspecies
In this thesis we shall deal with statistical pattern recognition, in which the classes and the objects within the classes are modeled statistically. Formally, feature vectors such as x_iris belong to a feature space F, and classes are denoted by {ω_1, ω_2, ..., ω_c}. The classifier α can be thought of as a mapping from the feature space to the set of possible classes:
α : F → {ω_1, ω_2, ..., ω_c} ,  α(x) = ω_k
The classifier is usually a machine or a computer; for this reason the theory is also known as machine learning. Depending on the task and on the data available, we can broadly distinguish two kinds of learning: supervised and unsupervised, the latter also called clustering.
• In supervised classification, examples of correct classification are presented to the classifier as a training set: D_i = {x_1, ..., x_{n_i}} of class ω_i, where n_i is the number of training samples from the class ω_i. Based on these prototypes, the goal of the classifier is to deduce the class of a never-seen object.
• In clustering, there is no explicit teacher nor training samples. The classification of the feature vectors must be based only on the similarity between them. Whether any two feature vectors are similar depends on the application.
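As an illustrative sketch of the supervised case (my own example, not part of the original analysis), the Iris feature vectors of section 2.1 can be classified with the 5-nearest-neighbor rule used in figure (2.1); scikit-learn and the 30-observation training set are assumptions taken from that caption.

# Minimal sketch (not the thesis code): 5-NN classification of the Iris dataset.
# The 30-observation training split mirrors the caption of figure (2.1).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=30, stratify=iris.target, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # classifier alpha: F -> {Setosa, Versicolor, Virginica}
knn.fit(X_train, y_train)                   # training set of labeled prototypes
print("test accuracy:", knn.score(X_test, y_test))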
• 25. 2.3 Bayesian learning 17
2.3 Bayesian learning
2.3.1 Bayes Rule
Let's begin with a general description of Bayesian reasoning [29]: consider the universe of events Ω, the measured event E ∈ Ω and a complete class of hypotheses H_i that "explain" E. By definition, these hypotheses must be exhaustive and mutually exclusive:
H_i ∩ H_j = ∅ (i ≠ j) ,  ∪_{i=1}^{n} H_i = Ω
The conditional probability of the hypothesis H_i given the measurement of E is:
P(H_i|E) = P(E|H_i) P(H_i) / P(E)
and, by the complete-class property satisfied by the H_i's, the probability of the event E can be decomposed over the entire set {H_i}_{i=1}^{n} of classes:
P(H_i|E) = P(E|H_i) P(H_i) / Σ_j P(E|H_j) P(H_j)   (2.4)
This Bayes Rule, introduced by Thomas Bayes (1702–1761) to provide a solution to a problem of inverse probability 3, was presented in "An Essay towards solving a Problem in the Doctrine of Chances" and was read at the Royal Society in 1763, after Bayes's death [27]. His definition of probability was stated as follows:
T. Bayes - "The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon its happening" [28].
Let's analyze the terms in Bayes Rule (2.4) [29]:
• P(H_i) is the initial or prior probability, namely the probability of the hypothesis H_i conditioned on all preliminary hypotheses, with the exclusion of the occurrence or non-occurrence of E.
• P(E|H_i) is called the likelihood and is a measure of how likely E is in the light of H_i. In terms of cause and effect, it measures how easily the cause H_i may produce the effect E.
• P(H_i|E) is simply the final or posterior probability, namely the probability of H_i re-evaluated in the light of the hypothesis that E is true.
3 "Inverse" probability is the approach which endeavours to reason from observed events to the probabilities of the hypotheses which may explain them, as distinct from "direct" probability, which reasons deductively from given probabilities to the probabilities of contingent events [66].
• 26. 18 2. Pattern Recognition
Bayesian probability theory can be used to represent degrees of belief in uncertain propositions. Cox (1946) [30] and Jaynes (2003) [31] show that if one has to represent numerically the strength of one's beliefs about the world, then the only reasonable and coherent way of manipulating these beliefs is to have them satisfy the rules of probability, such as the Kolmogorov axioms [32], together with Bayes Rule. In order to motivate the use of the rules of probability to encode degrees of belief, there is also a game-theoretic result in the form of the
Dutch Book Theorem [33]: if you are willing to accept bets with odds based on your degrees of confidence, then unless your beliefs are coherent, in the sense that they satisfy the rules of probability, there exists a set of simultaneous bets (called a "Dutch Book") which you will accept and which is guaranteed to lose you money, no matter what the outcome.
The only way to ensure that Dutch Books cannot be made against you is to have degrees of belief that satisfy Bayes Rule and the other rules of probability theory.
2.3.2 Bayesian Model Selection
From Bayes Rule a framework for machine learning may be derived. Here we adopt the term model as a synonym of class, because in this way the statistical interpretation of the classification problem can be stressed clearly: to find the statistical model m that best describes the data. Consider a set of N data points, D = {x_1, x_2, ..., x_N}, belonging to some model m ∈ M, the universe of models. The machine (i.e. the classifier) starts with some prior beliefs over models, such that:
Σ_m P(m) = 1
In our respect a model m is represented by a probability distribution over data points, i.e. P(D|m). The classification/model-selection goal is achieved by computing the posterior distribution over "all" m ∈ M:
P(m|D) = P(D|m) P(m) / P(D)   (2.5)
In almost all cases the space M is a huge infinite-dimensional space, so some kind of approximation is required; this point will be considered later when discussing Monte Carlo methods. However, the principle is simple: the model m which results in the highest posterior P(m|D) over our dataset D will be selected as the best model for our data. Often models are defined by a parametric distribution, so that P(D|m_k) can actually be written in terms of P(D|θ_k, m_k), θ_k being the set of parameters defining the model m_k. In the case of Gaussian models, for example:
P(D|θ_k, m_k) = N(D|μ_k, Σ_k)
Given the prior preferences P(m_k) over the models, the only term necessary to compare models is the marginal likelihood term:
P(D|m_k) = ∫ P(D|θ_k, m_k) P(θ_k|m_k) dθ_k   (2.6)
• 27. 2.4 Clustering 19
where:
• P(θ_k|m_k) is the prior over the parameters of m_k, namely the probability that its parameters take the values θ_k.
• P(D|θ_k, m_k) is the likelihood term, depending only on a single setting of the parameters of the model m_k.
• the integral extends over all possible values of the parameters of m_k.
The interpretation of the marginal likelihood, sometimes called the evidence for the model m_k, is very interesting: it is the probability of generating the data set D from parameters that are randomly sampled from the prior P(θ_k|m_k). Usually, in order to evaluate the marginal likelihood term (2.6), a point estimate is chosen, selecting only one setting θ̂_k of the parameters of the model m_k. There are two natural choices:
θ̂_k^MAP = argmax_{θ_k} {P(D|θ_k, m_k) P(θ_k|m_k)}
θ̂_k^ML = argmax_{θ_k} {P(D|θ_k, m_k)}
θ̂_k^MAP is known as the Maximum A Posteriori estimate for the marginal likelihood term, while θ̂_k^ML is the frequentist Maximum Likelihood estimate. There is a deep difference between these two approaches: ML estimation often results in overfitting, namely the preference for models more complex than necessary, due to the fact that a model m_k with more parameters will have a higher maximum of the likelihood P(D|θ̂_k^ML, m_k). Instead, the complete marginal likelihood term P(D|m_k) (2.6) may decrease as the model becomes more complicated: in a more complicated model, i.e. with more parameters, sampling random parameter values θ_k may generate a wider range of possible N-dimensional datasets D_N, but since the probability over data sets has to integrate to 1:
∫_{D_N} P(D_N|m_k) dD_N = 1
spreading the density to allow for more complicated data sets necessarily results in some simpler data sets having lower density under the model m_k [34]. The decrease of the marginal likelihood as additional parameters are added has been called the automatic Occam's Razor [35] (figure (2.4)).
2.4 Clustering
Coping with the problem of finding patterns in a time-series data set, I have dealt with a Clustering problem. In this section the main features of this theory will be outlined, together with the most relevant methods and techniques.
• 28. 20 2. Pattern Recognition
[Figure: evidence P(D|m_i) as a function of all possible data sets D, for a model that is too simple, one that is too complex, and one that is "just right".]
Figure 2.4. Pictorial one-dimensional representation of the marginal likelihood, or evidence, as a distribution over data sets D of a given size N (2.6). The normalization to 1 acts as a penalization, in the sense that very complex models, which can account for many datasets, only achieve modest evidence, whereas simple models can reach high evidences, but only for a limited set of data. Source: Zoubin Ghahramani, Unsupervised Learning.
2.4.1 Definition and Distinctions
Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships [36]. Namely, the goal of clustering is to identify structures in an unlabeled data set by objectively organizing it, so that objects belonging to the same cluster are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Clustering can be regarded as a form of classification in that it creates a labeling of objects with class (cluster) labels. However, it derives these labels only from the data. Clustering methods have been classified into five major categories [37]:
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods
Let's review the main features of and distinctions among these methods: the first distinction is whether the set of clusters is nested or unnested, in other words hierarchical or partitional. A partitional clustering is a division of objects into subsets such that each data
• 29. 2.4 Clustering 21
object is in at least one subset. The partition is crisp, or hard, if each object belongs to exactly one cluster, or fuzzy if one object is allowed to be in more than one cluster to a different degree. Instead, a hierarchical clustering method works by grouping data objects into a tree of clusters. A hierarchical clustering method can be agglomerative, if it starts by placing each object in its own cluster and then merges clusters into larger and larger clusters until a termination condition is satisfied, or it can be divisive, if it acts by splitting clusters. A pure hierarchical clustering suffers from its inability to perform adjustments once a merge or split decision has been executed. The general idea of density-based methods is to continue growing a cluster as long as the density (number of objects or data points) in its "neighborhood" exceeds some threshold. Grid-based methods quantize the object space into a finite number of cells that form a grid structure on which all of the clustering operations are performed. The procedure is iterative and acts by varying the size of the cells, corresponding to some kind of resolution of the objects. Finally, model-based methods assume a model for each cluster and attempt to best fit the data to the assumed model.
2.4.2 Time Series Clustering
Time series are dynamic data, namely they can be thought of as objects whose features comprise values changing with time. In this thesis we will deal with financial stock-price time series, and in this respect clustering can be an econometric tool for studying dependencies between variables [38]. It finds applications, for example, in:
• identifying areas or sectors for policy-making purposes,
• identifying structural similarities in economic processes for economic forecasting,
• identifying stable dependencies for risk management and investment management.
Coming back to the clustering methods, time-series clustering algorithms usually try either to modify the existing algorithms for clustering static data in such a way that time series can be handled, or to convert time-series data into the form of static data. We can broadly distinguish three main approaches:
• Raw-data-based approaches: here one places methods working either in the time or in the frequency domain; the major modification with respect to static-data algorithms lies in replacing the distance/similarity measure for static data with an appropriate one for time series.
• Feature-based approaches: this choice is usually adopted when dealing with noisy time series or data sampled at fast sampling rates.
• Model-based approaches: each time series is considered to be generated by some kind of model or by a mixture of underlying statistical distributions.
• 30. 22 2. Pattern Recognition
The similarity here relies on the structure of the models or on the residuals remaining after fitting the model.
2.5 Distance and Similarity
Given two time series, the correlation coefficient ρ_ij can be interpreted as a measure of the existence of some causal dependency between the variables, or of their dependence on some common exogenous factor, so it can be exploited for clustering purposes. But correlation can be invoked only if the variables follow similar and concurrent time patterns. In order to consider not only common causation of random variables and co-movements of time series, but also similarities in their structure (similar patterns may evolve at different speeds and on different time scales), the broader concept of similarity needs to be encompassed [36]. The function used in cluster analysis to measure the similarity between the two data objects being compared can take various forms, for example:
• Euclidean distance: let x_i and x_j be two τ-dimensional time series; d_E is defined by
d_E = √( Σ_{k=1}^{τ} (x_ik − x_jk)² )
• Pearson's correlation coefficient and related distances:
ρ_ij = Σ_{k=1}^{τ} (x_ik − μ_i)(x_jk − μ_j) / (S_i S_j)
where
μ_i = (1/τ) Σ_{k=1}^{τ} x_ik
is the mean and
S_i = √( Σ_{k=1}^{τ} (x_ik − μ_i)² )
is the scatter. Two correlation-based distances can be derived [39]:
d¹_ρ = ((1 − ρ)/(1 + ρ))^β ,  β > 0
d²_ρ = √(2(1 − ρ))
The last one is the distance adopted by Mantegna [21] in the work cited in section (2.1). These distances are computed directly in the space of the time series, possibly after preprocessing. Often, projecting the time series into a space of distance-preserving transforms avoids part of the computational cost. Many projection techniques have been proposed, such as the Fast Fourier Transform (FFT) [40], which is distance-preserving due to Parseval's theorem, or piece-wise linearization [41], which provides a linear smoothing of the time series.
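A small illustrative sketch of the distance functions listed above, assuming two equal-length series stored as NumPy arrays (the function names are my own):

# Illustrative implementations of the distances listed above (equal-length numpy arrays).
import numpy as np

def euclidean_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def pearson_rho(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

def d_rho_1(x, y, beta=1.0):
    rho = pearson_rho(x, y)
    return ((1 - rho) / (1 + rho)) ** beta

def d_rho_2(x, y):
    return np.sqrt(2 * (1 - pearson_rho(x, y)))   # Mantegna's distance, eq. (2.2)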
• 31. 2.5 Distance and Similarity 23
2.5.1 Information Theoretic Interpretation
When time series are modeled within a probabilistic framework, their values are thought of as sampled from an underlying probability distribution. Another approach to quantifying their similarity is based on projecting the time series into the space of their probability distributions. This is a more abstract notion of similarity, as it allows one to compare series of different lengths and with shapes that, though similar in their distributions, cannot be directly matched. Let x_1(t) and x_2(t) be two time series and assume at least the weak stationarity condition: the first two moments of their distributions must not depend on the temporal index t, and their auto-covariance R(t, t + k) depends only on the time lag k:
E[x(t)] = μ_x
Var[x(t)] = E[(x(t) − E[x(t)])²] = E[x²(t)] − E[x(t)]² = σ²_x
R(t, t + k) = E[x(t) x(t + k)] = R(k)
empirically estimated by [42]:
Ê[x(t)] = (1/τ) Σ_{t=1}^{τ} x(t)
V̂ar[x(t)] = Ê[x²(t)] − Ê[x(t)]² = (1/τ) Σ_{t=1}^{τ} x(t)² − (1/τ²)(Σ_{t=1}^{τ} x(t))²
R̂(t, t + k) = Ê[x(t) x(t + k)] = (1/τ) Σ_{t=1}^{τ−k} (x(t) − Ê[x(t)])(x(t + k) − Ê[x(t + k)])
Being p(x) and q(x) respectively the distributions of the values of x_1(t) and x_2(t), a natural distance function is the Kullback-Leibler divergence, defined as:
KL(p||q) = E_p[log₂(p(x)/q(x))] = Σ_x p(x) log₂(p(x)/q(x))
which has a symmetric version:
d_pq = (1/2)[KL(p||q) + KL(q||p)]
The information-theoretic interpretation of the KL divergence is very interesting. Let x(t) be a time series whose values are distributed according to an unknown distribution p(x). In order to transmit x(t) to a receiver we should encode it in some way, with the intuitive criterion of encoding with fewer bits (quanta of information) those values which occur more frequently in x(t). Shannon's source coding theorem quantifies the optimal number of bits to be used to encode a symbol which occurs with probability p(x) as −log₂(p(x)). Using this number of bits per value, the expected coding cost is the entropy of the distribution p(x), and it is the minimum cost [43]:
H(p) = − Σ_x p(x) log₂(p(x))
• 32. 24 2. Pattern Recognition
In a machine learning perspective we could theorize a model for the values of x(t): let it be denoted by the distribution q(x). In this case the expected coding cost is no longer optimal; it becomes the cross-entropy
H(p, q) = − Σ_x p(x) log₂(q(x))
with the coding inefficiency quantified precisely by the divergence KL(p||q) = H(p, q) − H(p). This example allows us to appreciate the prominent role of machine learning in achieving efficient communication and data compression: the better our model of the data, the more efficiently we can compress and communicate new data [44]. I stress that there is no "natural" distance function: each distance implements a specific concept of similarity and the choice is very problem dependent.
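The coding-cost picture above can be checked numerically with a short sketch (my own illustration), assuming the two distributions are given as normalized histograms over the same bins:

# Entropy, cross-entropy and (symmetrized) KL divergence for two discrete
# distributions p, q given as normalized histograms on the same bins.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))                  # optimal expected coding cost H(p)

def cross_entropy(p, q):
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))      # expected cost when coding with model q

def kl(p, q):
    return cross_entropy(p, q) - entropy(p)         # coding inefficiency KL(p||q) >= 0

def symmetric_kl(p, q):
    return 0.5 * (kl(p, q) + kl(q, p))              # the distance d_pq used in the text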
• 33. 25 Chapter 3 Monte Carlo Framework
Monte Carlo methods are a standard and often extremely efficient way of computing complicated integrals over high-dimensional or poorly-behaved functions [45]. The idea of Monte Carlo calculation, however, is a lot older than the computer. The name "Monte Carlo" is relatively recent — it was coined by Nicolas Metropolis in 1949 — but the method was already known under the older name of "statistical sampling". The history of the method dates back to 1946: while convalescing from an illness, Stan Ulam was playing solitaire. It then occurred to him to try to compute the chances that a particular solitaire laid out with 52 cards would come out successfully. After attempting exhaustive combinatorial calculations, he decided to go for the more practical approach of laying out several solitaires at random and then observing and counting the number of successful plays [46]. This idea of selecting a statistical sample to approximate a hard combinatorial problem by a much simpler problem is at the heart of modern Monte Carlo simulation. In 1949 the young physicist Nick Metropolis published the first document on Monte Carlo simulation with Stan Ulam [47], and a few years later he proposed the famous Metropolis algorithm [48].
3.1 Motivations
Monte Carlo techniques are often applied to solve integration and optimization problems; here are some examples:
• Bayesian inference and learning: intractable integration or summation problems typically occur in Bayesian statistics.
– Normalization problems, i.e. computing the normalizing factor in Bayes' theorem
P(m|D) = P(D|m) P(m) / P(D)  where  P(D) = Σ_{m∈M} P(D|m) P(m)   (3.1)
• 34. 26 3. Monte Carlo Framework
– Computing the marginal likelihood, which is the integral of the likelihood under the prior:
P(D|m_k) = ∫ P(D|θ_k, m_k) P(θ_k) dθ_k
• Statistical Mechanics: here one usually needs to compute the partition function Z of a system with states s and Hamiltonian H[s]:
Z = Σ_{s} exp(−H[s] / (k_B T))
where k_B is Boltzmann's constant and T denotes the temperature of the system. Summing over all the configurations {s} is often prohibitive, so a Monte Carlo sampling is necessary.
• Optimization: the goal is to extract, from a large set of feasible solutions, the solution that minimizes some objective function. In the clustering algorithm that will be introduced in the next chapters, I faced this kind of problem when dealing with finding the optimal splitting or merging of clusters made of time series, basing those operations on the maximization of a likelihood function over the data.
3.2 Principles
All Monte Carlo methods have the same general structure: given some probability measure p(x) on some configuration space X, one wishes to generate many random samples {x^(i)}_{i=1}^{N} from p [49]. These N samples can be used to approximate the target density with the following empirical point-mass function:
p_N(x) = (1/N) Σ_{i=1}^{N} δ(x − x^(i))   (3.2)
Aiming to evaluate the mean I(f) of some function f(x) of the configurations, one can build the Monte Carlo estimate I_N(f) simply as the sample mean of f over configurations sampled from p(x):
I(f) = ⟨f⟩_X = ∫_X f(x) p(x) dx
I_N(f) = (1/N) Σ_{i=1}^{N} f(x^(i))
The MC estimate converges almost certainly 1 (ac-lim_{N→∞}) [7] to I(f):
ac-lim_{N→∞} I_N(f) = I(f)
1 A sequence X_n of random variables defined on elementary events ω ∈ Ω converges almost certainly to X if lim_{n→∞} X_n(ω) = X(ω) ∀ω ∈ Ω ∖ {set of zero measure}.
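Before turning to the error analysis, a minimal numerical illustration of the estimate I_N(f) (my own example, not from the thesis): the observable f(x) = x² under a standard normal target, for which the exact value is I(f) = 1.

# Minimal illustration of the Monte Carlo estimate I_N(f) and of its statistical error.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                       # test observable; exact E[f] = 1 under N(0,1)

for N in (10**2, 10**4, 10**6):
    x = rng.standard_normal(N)             # i.i.d. samples from p(x) = N(0,1)
    I_N = f(x).mean()                      # sample-mean estimate I_N(f)
    err = f(x).std(ddof=1) / np.sqrt(N)    # sigma_I = sigma_f / sqrt(N), see the next section
    print(f"N={N:>8}:  I_N = {I_N:.4f} +/- {err:.4f}")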
• 35. 3.3 Static Methods 27
If the variance σ²_f of f (in the univariate case, for simplicity) is finite, then the Central Limit Theorem (CLT) states the asymptotic behavior of the values of I_N(f) as the sample size N approaches infinity:
I_N(f) ∼ N(I(f), σ²_I)
where the dispersion σ_I of the estimate I_N(f) behaves as the ordinary error of an averaged variable:
σ_I = σ_f / √N
This strong form of convergence guarantees the effectiveness of Monte Carlo integration.
3.3 Static Methods
Static methods are those that generate a sequence of statistically independent samples from the desired probability distribution π. Coming back to equation (3.2), we see that the central problem is a sampling problem.
3.3.1 Rejection Sampling
If we know the target distribution p(x) up to a constant, we can sample from it by sampling from another, easy-to-sample proposal distribution q(x) that satisfies:
p(x) ≤ M q(x) ,  M < ∞
using the accept/reject procedure described in the pseudo-code below:
Rejection Sampling:
i = 1
repeat:
  sample x^(i) ∼ q(x)
  sample u ∼ U(0,1)
  if u < p(x^(i)) / (M q(x^(i))) then: accept x^(i) and i ← i + 1
  else: reject
until i = N
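The pseudo-code above translates almost line by line into a runnable sketch; the target, the Gaussian proposal and the bound M below are my own illustrative choices, not the thesis's.

# Runnable sketch of rejection sampling (illustrative target, proposal and bound M).
import numpy as np
rng = np.random.default_rng(1)

def p(x):                                    # un-normalized target density
    return np.exp(-0.5 * x**2) * (1 + np.sin(3 * x)**2)

def q_pdf(x, scale=2.0):                     # Gaussian proposal density q(x)
    return np.exp(-0.5 * (x / scale)**2) / (scale * np.sqrt(2 * np.pi))

M = 11.0                                     # chosen so that p(x) <= M q(x) for every x

def rejection_sample(N):
    out = []
    while len(out) < N:
        x = rng.normal(0.0, 2.0)             # candidate x ~ q
        u = rng.uniform()
        if u < p(x) / (M * q_pdf(x)):        # accept with probability p(x) / (M q(x))
            out.append(x)
    return np.array(out)

samples = rejection_sample(10_000)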
• 36. 28 3. Monte Carlo Framework
[Figure: rejection sampling, with the target p(x) lying below the envelope M q(x).]
Figure 3.1. Rejection sampling: sample a candidate x^(i) and a uniform variable u; accept the candidate sample if u M q(x^(i)) < p(x^(i)), otherwise reject it. Source: Andrieu et al. (2003), An Introduction to MCMC for Machine Learning.
A candidate x^(i) is accepted only if the ratio between p(x) and M q(x) is above the u-threshold (figure (3.1)). This procedure reshapes q(x) into the target distribution p(x) and does not depend on the absolute normalization of p(x) (any normalization constant can be absorbed into M). Those x^(i) which are accepted turn out to be sampled with probability p(x) [50].
3.3.2 Hit-or-Miss Sampling: a numerical experiment
I will now focus on a particular form of rejection sampling, with reference to a specific situation I encountered during the analysis of bounces on Support and Resistance levels. Consider a target p(x) whose support ranges over the interval [0, δ]. Suppose you do not know the analytic form of p(x) but, for accepting/rejecting purposes, you can rely on a sufficiently dense histogram depicting its profile. Suppose also that you are able to find a bound for its values: you know M such that
p(x) < M ,  ∀x ∈ [0, δ]
for example from a visual inspection of the histogram. In figure (3.2) a computational experiment of Hit-or-Miss sampling is presented. Define a data set of 10^5 random walk time series (of length 10^3):
p_{t+1} = p_t + N(0, 1)   (3.3)
p_0 = 0
The expected mean absolute return of each random walk series can be computed as:
δ = ∫_{−∞}^{+∞} |x| N_x(0, 1) dx = √(2/π) ≈ 0.8
where by N_x(0, 1) I mean the distribution of the increments of these random walks, according to (3.3). This value has been adopted as the width of a strip [0, δ].
• 37. 3.3 Static Methods 29
I let the random walks evolve, getting out from under the strip, and then considered the event of reaching the strip again. I measured the position of the first point re-entering the strip, computed the histogram over the data set, and considered all the events of this kind occurring along the path of each random walk. The resulting histogram is an approximation of the target distribution p(x), so the interval [0, δ] has been considered as its domain of definition.
Figure 3.2. Example of Hit-or-Miss sampling. Histogram bars refer to the distribution of the position of the first points reaching the [0, δ] strip from below. The simulation has been carried out on a sample of N = 10^5 random walk histories, each of length L = 10^3. The pdf was reconstructed via the procedure exposed in the text through the extraction of 10^6 (x, y) pairs of points.
With the aim of sampling points x^(i) from this unknown p(x), I adopted the so-called Hit-or-Miss Sampling:
i = 1
repeat:
  sample x^(i) ∼ U(0, δ)
  sample y ∼ U(0, M)
  if y < p̂(x^(i)) then: accept x^(i) and i ← i + 1
  else: reject
until i = N
where p̂(x^(i)) denotes the histogram estimate of p at the point x^(i). The red points in figure (3.2) represent the profile of the distribution reconstructed by this method through the extraction of a sample of 10^6 (x, y) pairs of points.
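A compact sketch of the hit-or-miss step above, assuming the target profile is available only as a histogram estimate p̂ on [0, δ] (the variable names are illustrative):

# Hit-or-miss sampling from a histogram-estimated density on [0, delta] (illustrative sketch).
import numpy as np
rng = np.random.default_rng(2)

def hit_or_miss(hist, edges, M, N):
    # hist: histogram heights approximating p(x); edges: bin edges spanning [0, delta];
    # M: upper bound on p(x); N: number of accepted samples required.
    lo, hi = edges[0], edges[-1]
    out = []
    while len(out) < N:
        x = rng.uniform(lo, hi)                        # x ~ U(0, delta)
        y = rng.uniform(0.0, M)                        # y ~ U(0, M)
        p_hat = hist[np.searchsorted(edges, x) - 1]    # histogram estimate of p at x
        if y < p_hat:                                  # "hit": keep the candidate
            out.append(x)
    return np.array(out)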
• 38. 30 3. Monte Carlo Framework
3.4 Dynamic Methods
The idea of dynamic Monte Carlo methods is to invent some kind of stochastic process with state space X that has p(x) as its unique equilibrium distribution.
3.4.1 Markov Chains
A Markov Chain, named after the Russian mathematician A.A. Markov, is one of the simplest examples of a nontrivial, discrete-time and discrete-state stochastic process [43]. Consider a random variable x^(i) which, at any discrete time/step i, may assume S possible values x^(i) ∈ X = {x_1, x_2, ..., x_S}. Indicate by j the state x_j and suppose that the generating process verifies the Markov property:
Prob(x^(i) = j_i | x^(i−1) = j_{i−1}, x^(i−2) = j_{i−2}, ..., x^(i−k) = j_{i−k}, ...) = Prob(x^(i) = j_i | x^(i−1) = j_{i−1})   (3.4)
namely, every future state is conditionally independent of every prior state but the present one; in other words, the process has memory "1". Now we restrict ourselves to time-homogeneous Markov Chains:
Prob(x^(i) = j | x^(i−1) = k) = p(j|k) ≝ W_jk
which are characterized exclusively by their Transition Matrix W_jk, with the properties:
• non-negativity: W_jk ≥ 0
• normalization: Σ_{j=1}^{S} W_jk = 1
In order to introduce the concept of invariant distribution, consider an ensemble of random variables all evolving with the same transition matrix. The probability Prob(x^(i) = j) ≝ P_j(i) of finding the random variable in state j at time i is given, due to the Markov property (3.4), by:
P_j(i) = Σ_{k=1}^{S} W_jk P_k(i − 1)   (3.5)
i.e. the probability of being in j at time i is equal to the probability of having been in k at time i − 1, times the probability of jumping from k to j, summed over all the possible previous states k. Using the matrix notation P(i) = (P_1(i), P_2(i), ..., P_S(i)) we can write (3.5) as:
P(i) = W P(i − 1)  ⟹  P(i) = W^i P(0)
• 39. 3.4 Dynamic Methods 31
The relevant question, at this point, concerns the convergence of P(i) to some limit and whether such a limit is unique. Clearly, if lim_{i→∞} P(i) exists, it will be the invariant or equilibrium probability P^inv that satisfies the eigenvalue equation:
P^inv = W P^inv
Two more definitions complete the theory (figure (3.3)):
[Figure: three four-state Markov chains, (a) reducible, (b) period-3 irreducible, (c) ergodic irreducible.]
Figure 3.3. Three examples of Markov Chains with 4 states. (a) A reducible MC, where state 1 is transient (never reached again, once left) whereas states 2, 3, 4 are recurrent (there exists a finite probability of coming back) and periodic with period 2: the probability of coming back in k steps is null unless k is a multiple of 2. (b) A period-3 irreducible MC. (c) An ergodic and irreducible MC. In all examples p and q are supposed to be different from 0 and 1. Source: Vulpiani et al., Chaos: From Simple Models To Complex Systems.
• Irreducible chain: a chain whose states are all accessible from any other. Formally, this means that there exists a k > 0 such that W^k_ij > 0 ∀ i, j. The chain is called reducible if this does not happen.
• Ergodic chain: a chain that is irreducible and whose states are ergodic, namely each of them, once visited, will be visited again by the chain, with a finite mean recurrence time.
For this special class of Markov Chains, a Fundamental Theorem asserts the existence and uniqueness of P^inv:
Fundamental Theorem of Markov Chains: for an irreducible ergodic Markov Chain, the limit P(i) = W^i P(0) → P(∞) for i → ∞ exists, is unique and is independent of the initial distribution P(0). Moreover:
P(∞) = P^inv ,  P^inv = W P^inv   (3.6)
meaning that the limit distribution is invariant.
• 40. 32 3. Monte Carlo Framework
3.4.2 MCMC and Metropolis-Hastings Algorithm
When dealing with sampling problems, we are interested in constructing Markov Chains for which the distribution we wish to sample from, given by p(x), x ∈ X, is invariant, i.e. once reached, it never changes [51]. We restrict ourselves to homogeneous Markov Chains. A sufficient, but not necessary, condition ensuring that a particular p(x) is the desired invariant distribution is the following detailed balance condition, which is a reversibility condition:
p(x^(i)) W(x^(i−1)|x^(i)) = p(x^(i−1)) W(x^(i)|x^(i−1))   (3.7)
where x^(i) is the state of the chain at time i and W(x^(i−1)|x^(i)) is the jump probability. This condition implies that p(x) is the invariant distribution of the chain; indeed, summing both sides over the possible states at time i − 1:
Σ_{x^(i−1)} p(x^(i−1)) W(x^(i)|x^(i−1)) = Σ_{x^(i−1)} p(x^(i)) W(x^(i−1)|x^(i)) = p(x^(i)) Σ_{x^(i−1)} W(x^(i−1)|x^(i)) = p(x^(i))
where the last equality is based on the normalization condition. Markov chain Monte Carlo (MCMC) samplers are irreducible and aperiodic Markov chains of this kind, having the target distribution as their invariant distribution [52]. The Metropolis-Hastings algorithm, [53] and [48], is the most popular MCMC method. An MH step with invariant distribution p(x) and proposal distribution q(x*|x) involves sampling a candidate value x*, given the current value x, according to q(x*|x). The Markov chain then moves to x* with acceptance probability
A(x, x*) = min{1, p(x*) q(x|x*) / (p(x) q(x*|x))}   (3.8)
otherwise it remains at x. The pseudo-code below illustrates the main features of the algorithm:
Metropolis-Hastings Algorithm:
Initialize x^(0)
for i = 0 to N − 1 do:
  sample u ∼ U(0,1)
  sample x* ∼ q(x*|x^(i))
  if u < A(x^(i), x*) then: x^(i+1) ← x*
  else: x^(i+1) ← x^(i)
end for
To show that the MH Markov chain converges (figure (3.4)), we need to ensure irreducibility and aperiodicity:
• 41. 3.4 Dynamic Methods 33
• Irreducibility: it is sufficient that the support of q(·) includes the support of p(·). In this way every state of the chain has a finite probability of being reached in a finite number of steps.
• Aperiodicity: it follows from the fact that the chain always allows for rejection.
We note that the Metropolis-Hastings algorithm reduces to the Metropolis algorithm in a straightforward way: the latter corresponds to the choice of a symmetric proposal,
q(x*|x) = q(x|x*)
and consequently the acceptance ratio A(x, x*) reduces to:
A(x, x*) = min{1, p(x*)/p(x)} = min{1, e^{−βΔE}}
where the last equality holds in the statistical mechanics framework, in which the target distribution is the canonical distribution e^{−βE}, β = 1/T, and the variables x and x* represent different configurations of a system immersed in a thermal bath at temperature T (ΔE being the energy change of the proposed move). To conclude, I stress two practically important properties of the Metropolis-Hastings algorithm:
• the target distribution needs to be known only up to a constant of proportionality;
• it is easy to simulate several independent chains in parallel.
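A runnable sketch of the MH step for the bimodal example shown in figure (3.4) below (the target and the Gaussian random-walk proposal N(x, 10²) are taken from that caption; everything else is my own illustration):

# Metropolis-Hastings sketch for the bimodal target of figure (3.4),
# with a symmetric Gaussian random-walk proposal q(x*|x) = N(x, 10^2).
import numpy as np
rng = np.random.default_rng(3)

def p(x):                                       # un-normalized target density
    return 0.3 * np.exp(-0.2 * x**2) + 0.7 * np.exp(-0.2 * (x - 10)**2)

def metropolis_hastings(n_steps, x0=0.0, prop_std=10.0):
    chain = np.empty(n_steps)
    x = x0
    for i in range(n_steps):
        x_star = rng.normal(x, prop_std)        # propose x* ~ q(.|x); q is symmetric
        A = min(1.0, p(x_star) / p(x))          # acceptance probability (3.8)
        if rng.uniform() < A:
            x = x_star                          # accept the move, otherwise stay at x
        chain[i] = x
    return chain

chain = metropolis_hastings(5000)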
• 42. 34 3. Monte Carlo Framework
[Figure: target density and histograms of the MCMC samples after i = 100, 500, 1000 and 5000 iterations.]
Figure 3.4. Example of the Metropolis-Hastings algorithm. Bimodal target distribution p(x) ∝ 0.3 e^{−0.2x²} + 0.7 e^{−0.2(x−10)²} and histogram of the MCMC samples at different iteration points. The proposal distribution is Gaussian: q(x*|x^(i)) = N(x^(i), 100). The plots show progressive convergence after i = 100, 500, 1000, 5000 iterations. Source: Andrieu et al. (2003), An Introduction to MCMC for Machine Learning.
• 43. 35 Chapter 4 Memory Effects: Bounce Analysis
In a recent work by Garzarelli, Cristelli, Zaccaria and Pietronero [57, 2012], evidence of technical trading strategies has been shown. These strategies produce detectable memory effects in the stock-price dynamics at various time scales. My analysis began with the critical reproduction of their results concerning the analysis of bounces on Support and Resistance levels, in order to verify the feed-back impact of such strategies on the significance of these indicators themselves.
4.1 The Data
The analysis in this thesis has been carried out on the high-frequency time series of the price of 9 stocks traded at the London Stock Exchange (LSE) in 2002, made up of 251 trading days 1. In financial high-frequency data the price is recorded on a time scale of less than a day; in particular, the value of the time series considered here was updated second-by-second. Another possible choice for measuring time would have been to record the price transaction-by-transaction or, as they say, tick-by-tick, but it was chosen to consider physical time, since this is the time perceived by investors and the one on which they base their investments. Another reason that led us to choose the analysis in the domain of seconds is that, while the physical trading time does not change across stocks, the very different number of operations per day would make it difficult to compare the results for different stocks, had the tick-by-tick time series been considered.
1 Actually the 248th day has not been considered, due to a lack of data caused by the interruption and subsequent resumption of trading during that day.
• 44. 36 4. Memory Effects: Bounce Analysis
Among the stocks traded at the LSE in 2002, we have decided to consider the following 9 stocks:
• AstraZeneca (whose abbreviation is AZN)
• British Petroleum (BP)
• GlaxoSmithKline (GSK)
• Halifax Bank of Scotland (HBOS)
• Royal Bank of Scotland (RBS)
• Rio Tinto (RIO)
• Royal Dutch Shell (SHELL)
• Unilever (ULVR)
• Vodafone Group (VOD)
The prices examined were measured in ticks, the tick being the minimum change in the price. The tick is assigned to price ranges, and at the LSE the following convention was adopted:
Price (pence)   tick (pence)
0 − 10          0.01
10 − 500        0.25
500 − 1000      0.5
≥ 1000          1
Traders look at price time-series graphs at different time scales, and their decisions are mainly based on bare-eye observation (figures (4.1) and (4.2)).
4.1.1 T-Seconds Rescaling
It was decided to carry out a coarse-grained analysis of each time series P(t)_{STOCK,day} considered, picking out a point (price) every T = 45, 60, 90, 180 seconds:
P(t) ⟹ P_T(t)
In this way we remove the information on price fluctuations that develop on time scales shorter than T (red circle in figure (4.1)). If the original time series was made up of L terms, the new one has only ⌊L/T⌋ terms, where ⌊·⌋ denotes the greatest integer less than or equal to L/T.
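A sketch of this coarse-graining step (illustrative; `price` is assumed to hold the second-by-second series of a single trading day):

# Coarse-graining of a second-by-second price series: keep one point every T seconds,
# so a series of length L is reduced to floor(L/T) points (illustrative sketch).
import numpy as np

def rescale(price, T):
    price = np.asarray(price)
    n = len(price) // T                 # floor(L / T) retained points
    return price[:n * T][::T]           # pick one price every T seconds

# usage (hypothetical variable name): P_45 = rescale(price_second_by_second, 45)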
• 45. 4.2 Bounce: Critical Discussion About Definition 37
SHELL, 55th trading day of 2002
Figure 4.1. Effect of Rescaling - Rescaling of the price time series of the SHELL stock in the 55th trading day of the year 2002. At the top is the second-by-second time series. Rescaling has been performed picking one point every T = 5, 10, 15 minutes. The red circle in the T = 15 minutes time series shows that price fluctuations developing on smaller scales have been ignored.
4.2 Bounce: Critical Discussion About Definition
In technical jargon, Supports (Sup) and Resistances (Res) are referred to rather qualitatively:
"Support and Resistance are important technical levels for a stock price. Support describes a price level that the stock tried to cross below, but ultimately stayed above. Resistance describes a price level that the stock tried to cross above, but could not. The bare minimum requirement to draw a support line or a resistance line is that the stock must spend a significant amount of time or volume at the price level." [67]
In order to quantify the effect of these figures on the price time series, it was necessary to characterize them quantitatively.
• 46. 38 4. Memory Effects: Bounce Analysis
BP, 178th trading day of 2002
Figure 4.2. Effect of the Adopted Rescaling - Rescaling of the price time series of the BP stock in the 178th trading day of the year 2002. At the top is the second-by-second time series. Rescaling has been performed picking one point every T = 45, 60, 90, 180 seconds. This is the rescaling adopted in the rest of the analysis.
The definition above introduces in a rather pictorial way the concept of bounce, for which the following definition has been adopted:
A Bounce is the event of a future price entering a strip centered around a Support / Resistance level and exiting from the strip without crossing it.
This definition is a good compromise between a quantitative and a bare-eye approach, but it still lacks precision; indeed, the following questions arise:
1. Generating Max (Min): which kind of point may generate a Sup or Res level?
2. Time in strip: how much time (in units of T) should the price spend within the strip of a level?
3. Strip width: how large should the strip δ be?
• 47. 4.2 Bounce: Critical Discussion About Definition 39
Clearly, points 2 and 3 are closely linked: the strip width δ should be related to some kind of average price fluctuation, in order for the time chosen at point 2 to be realistic. Point 1, instead, is more subtle and is sensitive to the scale T adopted, because of the presence of a minimum tick, which makes the price changes discrete. It is observed that at smaller time scales the profile of the price varies little (graph at the top of figure (4.3)). This means that, at equal length (i.e. number of points), a series belonging to a smaller-T rescaling is less dispersed than one coming from a larger-T rescaling. This phenomenon is evident in figure (4.3), where 5 slices of time series from our dataset are shown, all made of 25 points. The first one, at the top, is in real time, so the time window corresponds to 25 seconds, whereas the others refer to T = 45, 60, 90, 180 seconds rescalings, so the corresponding window size is 25·T seconds. This observation is an obvious consequence of the rescaling, under which, at equal length, higher-T series cover a greater portion of the trading day.
GSK, 150th trading day of 2002
Figure 4.3. Focus on the dispersion of series of equal length. These are 25-point series, relative to rescalings of T = 45, 60, 90, 180 seconds. The corresponding time windows are 25 seconds or 25·T seconds. The presence of constant price levels in the non-rescaled series and the lower dispersion of the T = 45 seconds series compared to the T = 180 one are evident.
• 48. 40 4. Memory Effects: Bounce Analysis
Therefore this poses the question of whether to consider as strip-generating maxima (minima) only the "tight" ones:
Tight Max: P_T(t_{i−1}) < P_T(t_i) and P_T(t_i) > P_T(t_{i+1})
Tight Min: P_T(t_{i−1}) > P_T(t_i) and P_T(t_i) < P_T(t_{i+1})
where t_{i+1} ≝ t_i + T, or even only the "isolated" ones:
Isolated Max: Tight Max and |P_T(t_i) − P_T(t_{i±1})| > δ/2
Isolated Min: Tight Min and |P_T(t_i) − P_T(t_{i±1})| > δ/2
or whether a relaxed definition:
Relaxed Max: P_T(t_{i−1}) < P_T(t_i) and P_T(t_i) ≥ P_T(t_{i+1})
Relaxed Min: P_T(t_{i−1}) > P_T(t_i) and P_T(t_i) ≤ P_T(t_{i+1})
would reflect more closely the psychology of investors, who have in mind the concept of a Support or Resistance level rather than a single peak. In the light of these considerations, it was decided to adopt rather natural definitions (figure (4.4)):
• Generating Max (Min): a Relaxed Max (Min) P_T(t_i) which does not belong to the strip of a previous Generating Max (Min).
• Bounce: a Relaxed Max (Min) P_T(t_i) which does belong to the strip of a previous Generating Max (Min).
• Time in strip: no time condition is imposed. If P_T(t_i) is a candidate Generating Max (Min) or Bounce, it is enough that the price drops below (rises above) the strip of P_T(t_i) at some point from P_T(t_{i+1}) onwards for P_T(t_i) to be a legitimate Generating Max (Min) or Bounce.
• Strip width: δ is defined as the average of the absolute value of the price increment at time scale T, i.e. the average of the absolute linear-returns time series (1.1):
δ(T) = (1/(⌊L/T⌋ − 1)) Σ_{i=1}^{⌊L/T⌋−1} |P_T(t_{i+1}) − P_T(t_i)|   (4.1)
where the T-time is defined as t_{i+1} ≝ t_i + T. (A short code sketch of these definitions is given after figure (4.4) below.)
Summarizing:
1. We will deal with time series P_T(t) discretized up to scale T = 45, 60, 90, 180 seconds.
• 49. 4.2 Bounce: Critical Discussion About Definition 41
2. We have defined the Support / Resistance levels in the most natural way, as the levels started by points not crossed by the price, with a straightforward characterization of bounces and with a consistent definition of the strip width δ.
3. We will tackle the analysis of Supports and Resistances separately and in parallel.
HBOS, 103rd trading day of 2002
RBS, 54th trading day of 2002
Figure 4.4. Examples of Resistance and Support - (top) Price time series of the 103rd trading day of the year 2002 of the HBOS stock (δ ≈ 0.6): two Resistance levels are visible. (bottom) Price time series of the 54th trading day of 2002 of the RBS stock (δ ≈ 1.0): evidence of two bounces on a Support level. Scale adopted: T = 45 seconds.
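The definitions above can be made operational with a compact sketch (my own hedged reconstruction, not the original analysis code); it computes the strip width δ(T) of (4.1), detects relaxed maxima and classifies them as generating maxima or bounces on Resistance strips. Supports work symmetrically on relaxed minima, and the condition that the price must leave the strip before a new event is counted is omitted here for brevity.

# Hedged sketch of the Resistance-side bookkeeping: strip width (4.1), relaxed maxima,
# generating maxima and bounces. Supports are handled symmetrically on relaxed minima.
import numpy as np

def strip_width(P):
    return np.mean(np.abs(np.diff(P)))                # delta(T), eq. (4.1)

def resistance_events(P):
    delta = strip_width(P)
    generating = []                                   # generating maxima (new Resistance levels)
    bounces = []                                      # relaxed maxima falling inside an existing strip
    for i in range(1, len(P) - 1):
        if P[i - 1] < P[i] >= P[i + 1]:               # relaxed maximum
            hit = [g for g in generating if abs(P[i] - g) <= delta / 2]
            if hit:
                bounces.append((i, hit[0]))           # bounce on the strip of a previous level
            else:
                generating.append(P[i])               # new generating maximum
    return delta, generating, bounces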
• 50. 42 4. Memory Effects: Bounce Analysis
4.3 Consistent Random Walks
To provide a basis for comparison, the bounce analysis was conducted in parallel both on the real time series and on consistent random walks. Let P_T(t)_{VOD,110} represent the price of the 110th trading day of the Vodafone stock at the time scale T; then
⟨P_T(t_{i+1}) − P_T(t_i)⟩_day = μ_{T,VOD,110}
⟨(P_T(t_{i+1}) − P_T(t_i))²⟩_day − (μ_{T,VOD,110})² = σ²_{T,VOD,110}   (4.2)
where ⟨·⟩_day is the mean over the trading day, are the daily mean and dispersion of the price increments of P_T(t)_{VOD,110}. The random walk consistent with P_T(t)_{VOD,110} is then defined as:
Rw_T(t)_{VOD,110} :  Rw(t_{i+1}) = Rw(t_i) + N(μ, σ)
where clearly
t_{i+1} = t_i + T ,  μ = μ_{T,VOD,110} ,  σ = σ_{T,VOD,110}   (4.3)
So this process is a random walk whose increments are normally distributed around the mean price increment of the P_T(t)_{STOCK,day} time series considered, with dispersion given by the fluctuation of the increments within the day (figures (4.6) and (4.7)).
VOD, 110th trading day of 2002 - Consistent Random Walk
Figure 4.5. Real Series compared with Consistent Random Walk - On the left: price time series of the Vodafone (VOD) stock in the 110th trading day of the year 2002. On the right: comparison with the consistent random walk p_{t+1} = p_t + N(μ, σ), where μ = −1.2 · 10⁻⁵ is the mean linear return (1.1) of Vodafone in the case considered and σ = 0.02 is the corresponding dispersion.
• 51. 4.3 Consistent Random Walks 43
Left: VOD, 110th trading day of 2002, T = 45 sec - Right: Consistent Random Walk (μ = −5.5 · 10⁻⁴, σ = 0.12)
Left: VOD, 110th trading day of 2002, T = 60 sec - Right: Consistent Random Walk (μ = −7.3 · 10⁻⁴, σ = 0.14)
Figure 4.6. T = 45, 60 Rescaled Series compared with Consistent Random Walks - On the left: price time series of the Vodafone (VOD) stock in the 110th trading day of the year 2002 on scales T = 45 (top) and 60 seconds (bottom). On the right: comparison with the consistent random walks p_{t+1} = p_t + N(μ, σ). The same graph for the non-rescaled time series can be found in figure (4.5).
• 52. 44 4. Memory Effects: Bounce Analysis
Left: VOD, 110th trading day of 2002, T = 90 sec - Right: Consistent Random Walk (μ = −1.1 · 10⁻³, σ = 0.15)
Left: VOD, 110th trading day of 2002, T = 180 sec - Right: Consistent Random Walk (μ = −2.2 · 10⁻³, σ = 0.21)
Figure 4.7. T = 90, 180 Rescaled Series compared with Consistent Random Walks - On the left: price time series of the Vodafone (VOD) stock in the 110th trading day of the year 2002 on scales T = 90 (top) and 180 seconds (bottom). On the right: comparison with the consistent random walks p_{t+1} = p_t + N(μ, σ). The same graph for the non-rescaled time series can be found in figure (4.5).
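A short sketch of the construction of a consistent random walk from a rescaled price series (illustrative; the estimator and function names are my own):

# Consistent random walk: Gaussian increments with the daily mean and dispersion (4.2)
# of the increments of the rescaled price series (illustrative sketch).
import numpy as np
rng = np.random.default_rng(4)

def consistent_random_walk(P_T):
    increments = np.diff(P_T)
    mu, sigma = increments.mean(), increments.std()     # daily mean and dispersion, eq. (4.2)
    rw = np.empty(len(P_T))
    rw[0] = P_T[0]
    rw[1:] = P_T[0] + np.cumsum(rng.normal(mu, sigma, size=len(P_T) - 1))
    return rw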
4.4 Memory Effects in Bounce Probability

As mentioned earlier, for chartists Supports and Resistances are reference levels they expect not to be crossed. What is relevant here is that they operate as if this were true, making the presence of these price levels more tangible. The coincidence of expectations and the coordinated reactions to the same indicator generate a feed-back effect known as a self-fulfilling prophecy.

The feed-back impact of such strategies was estimated by measuring the conditional probability of bouncing again on such levels, conditioned on the number of previous bounces:

p(bounce | #previous bounces)

We counted the number N_i of times the price re-entered the strip after the ith bounce. At this point the structure of the process is Bernoullian; indeed the price has only two alternatives:

• crossing the level
• bouncing on it

with elementary probability p = p(b|i), where b denotes the event of a further bounce after i bounces have already been realized:

P_{N_i}(n_i, p) = \binom{N_i}{n_i} p^{n_i} (1-p)^{N_i - n_i} \;\Longrightarrow\; E_p(n_i) = N_i \, p(b|i)

where n_i denotes the number of positive realizations (bounce events) among the N_i Bernoulli trials. We are therefore interested in inferring p(bounce | #previous bounces = i) from n_i and N_i, to understand whether or not it is comparable with the coin-toss level p_elementary = 1/2. By the coin-toss limit we mean the level of a process that has no memory of previous bounces and is therefore indifferent (p_elementary = 1/2) between crossing the strip and re-bouncing on it. Using Bayes' theorem [29], we obtain the expected value E[p(bounce | #previous bounces = i)], denoted E[p(b|i)] hereafter, which is actually a refinement of the frequency f(b|i) of re-bouncing:

f(b|i) = \frac{n_i}{N_i} \qquad E[p(b|i)] = \frac{n_i + 1}{N_i + 2}    (4.4)

Var[p(b|i)] = \frac{(n_i + 1)(N_i - n_i + 1)}{(N_i + 3)(N_i + 2)^2}    (4.5)

The analysis has been repeated for all the time series (all stocks, all days), for the various scales T, and for Support and Resistance levels separately. Results have been compared with those obtained from consistent random walk time series (figures (4.8), (4.9), (4.10) and (4.11)).
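A small hedged sketch of the inference behind eqs. (4.4)-(4.5): with a uniform prior on p, the posterior after observing n_i bounces out of N_i re-entries is a Beta(n_i+1, N_i-n_i+1) distribution, whose mean and variance reproduce the formulas above.

def bounce_posterior(n_i, N_i):
    # Posterior mean and variance of p(b|i) under a uniform prior, eqs. (4.4)-(4.5)
    mean = (n_i + 1) / (N_i + 2)
    var = (n_i + 1) * (N_i - n_i + 1) / ((N_i + 3) * (N_i + 2) ** 2)
    return mean, var

print(bounce_posterior(n_i=60, N_i=100))  # approximately (0.598, 0.0023)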
Figure 4.8. Inferred probability of re-bouncing E[p(b|i)] (4.4) conditioned on the number of previous bounces b = 1, 2, 3, 4. Error bars are computed as the inferred dispersion Var[p(b|i)] (4.5). Graphs refer to scale T = 45; left: Resistances, right: Supports. Results are compared with the same procedure carried out on Consistent Random Walk time series. Statistics are based on 10 random walks for each stock and each trading day.

Figure 4.9. As in figure (4.8), for scale T = 60; left: Resistances, right: Supports.

Figure 4.10. As in figure (4.8), for scale T = 90; left: Resistances, right: Supports.

Figure 4.11. As in figure (4.8), for scale T = 180; left: Resistances, right: Supports.
Some considerations on the graphs:

• Bounce probabilities are almost always greater than the coin-toss limit 0.5.
• Bounce probabilities rise as the number of previous bounces increases. This can be interpreted as a reinforcement of investors' beliefs.
• Random-walk probabilities are comparable with the coin-toss level and show no dependence on previous bounces.
• Increasing the scale T affects the evidence of memory (figure (4.11)). This could suggest a finite memory of investors, but the lack of sufficient data does not allow precise conclusions.

4.5 Window Analysis

In order to study the characteristics of price trajectories around Support and Resistance levels, we have analyzed some features of the bounces:

• Recurrence time.
• Window size.
• Fluctuations within windows.

This study led us to select the appropriate scale T to consider in order to get evidence of memory effects directly detectable in the price trajectory around Support or Resistance figures.

4.5.1 Recurrence Time

We have studied the distribution of the time elapsing between the exit of the price from the strip of a previous bounce and the subsequent entry into the strip of the next bounce:

T_T = t_j - t_k \qquad \text{where } P_T(t_{k-1}) \text{ belongs to the strip of bounce } i \text{ and } P_T(t_{j+1}) \text{ belongs to the strip of bounce } i+1

By definition, T_T is measured in units of the scale T, allowing the comparison of the histograms of the 4 chosen scales (4.12) and (4.13). In order to take into account the rare events at large T_T values, histograms have been computed through logarithmic binning. A bin of constant logarithmic width b means that the logarithm of the upper edge of a bin, (T_T)_{i+1}, is equal to the logarithm of the lower edge of that bin, (T_T)_i, plus the bin width b. That is,

\log((T_T)_{i+1}) = \log((T_T)_i) + b \;\Longrightarrow\; (T_T)_{i+1} = (T_T)_i \, e^b
Since the linear bin width w_i of bin i is defined as w_i = (T_T)_{i+1} - (T_T)_i, it is directly proportional to (T_T)_i because

w_i = (T_T)_{i+1} - (T_T)_i = (T_T)_i e^b - (T_T)_i = (T_T)_i (e^b - 1)

The number of observations n_i in the ith bin is equal to the density of observations in that bin times the width w_i of that bin. Therefore, if the probability density function f(T_T) is a power law with exponent α and the bin width w_i is proportional to the bin value (T_T)_i,

f(T_T) \propto (T_T)^{\alpha}, \qquad w_i \propto (T_T)_i    (4.6)

then a simple regression of log(n) against log(T_T) yields a slope equal to α + 1:

n_i = f((T_T)_i) \, w_i \propto ((T_T)_i)^{\alpha} (T_T)_i = ((T_T)_i)^{\alpha + 1}

So, in order to estimate the exponent α, the regression must be conducted on the logarithm of the bin counts log(n_i) normalized to the bin width w_i:

\frac{n_i}{w_i} \propto \frac{((T_T)_i)^{\alpha} (T_T)_i}{w_i} \propto \frac{((T_T)_i)^{\alpha} (T_T)_i}{(T_T)_i} = ((T_T)_i)^{\alpha}

In this respect, the histograms in figures (4.12) and (4.13) are computed as the bin density PDF_i = n_i / n_total normalized to the bin width w_i. The linear fit then gives insight into the true exponent of the power-law regression of the recurrence time,

f(T_T) \sim (T_T)^{\alpha_T}

where the exponent α_T shows little or no dependence on the scale T.

4.5.2 Window Size

The study was similar to the one carried out for the recurrence time, but here we were interested in quantifying the average window size τ. This will be of primary importance for the clustering of trajectories around Support and Resistance levels; indeed the measure of τ will be exploited to define a common length for trajectories around the bounces. The window size is defined as the time the price spends in a bounce strip. Precisely, it is the temporal distance between the last point before the price enters the strip and the first point after it leaves. Let P_T(t_i) be a bounce or a generating max (min); then

\tau = t_{k_{out}} - t_{k_{in}-1} \qquad \text{with} \qquad t_{k_{in}-1} < t_i < t_{k_{out}}
Figure 4.12. Histogram of recurrence time T_T for Support and Resistance windows together. Scales T = 45, 60. The binning is logarithmic with constant width b = 0.001.

Figure 4.13. Histogram of recurrence time T_T for Support and Resistance windows together. Scales T = 90, 180. The binning is logarithmic with constant width b = 0.001.
where

|P_T(t_i) - P_T(t_{k_{in}-1})| > \delta(T) \quad \text{and} \quad |P_T(t_i) - P_T(t_{k_{in}})| < \delta(T)
|P_T(t_i) - P_T(t_{k_{out}-1})| < \delta(T) \quad \text{and} \quad |P_T(t_i) - P_T(t_{k_{out}})| > \delta(T)

δ(T) being the strip width of the time series considered at the scale T considered. As for the recurrence time, the histograms of τ in figures (4.14) and (4.15) are reported with the same choice of logarithmic width b = 0.001. Especially at the larger scales T, the linear fit here is less meaningful due to the substantial rarity of bounce events.
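The logarithmic binning used for both the T_T and τ histograms can be sketched as follows (illustrative code under the stated assumptions; the bin width b = 0.1 and the toy power-law data are mine, not the b = 0.001 of the figures). Counts are normalized to the bin widths before fitting, so that the slope of the log-log regression estimates the exponent α directly.

import numpy as np

def log_binned_density(samples, b=0.1):
    samples = np.asarray(samples, dtype=float)
    edges = [samples.min()]
    while edges[-1] < samples.max():
        edges.append(edges[-1] * np.exp(b))     # constant logarithmic width b
    edges = np.array(edges)
    counts, _ = np.histogram(samples, bins=edges)
    widths = np.diff(edges)
    centers = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centers
    density = counts / (counts.sum() * widths)  # PDF_i normalized to the bin width w_i
    return centers, density

# Toy power-law data with pdf ~ x^(-2.5); the fitted slope recovers the exponent
rng = np.random.default_rng(0)
x = (1.0 - rng.random(10000)) ** (-1.0 / 1.5)
c, d = log_binned_density(x)
mask = d > 0
slope = np.polyfit(np.log(c[mask]), np.log(d[mask]), 1)[0]
print(slope)  # close to -2.5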
Figure 4.14. Histogram of window size τ for Support and Resistance windows together. Scales T = 45, 60. The binning is logarithmic with constant width b = 0.001.

Figure 4.15. Histogram of window size τ for Support and Resistance windows together. Scales T = 90, 180. The binning is logarithmic with constant width b = 0.001.
Figure 4.16. Histogram of the fluctuation within windows, in units of the tick minimum, centered around Resistance (left) and Support (right) levels. Window size considered: τ = 150. Scale T = 45. The distribution refers to the whole data set of window time series, namely all stocks and all days.

4.5.3 Fluctuations within Window

Henceforth we will deal with price time series referring to τ-sized windows opened around the bounces identified in each original time series. These window time series will be our data set for the clustering purposes we will introduce. We have separated the analysis with respect to:

• The whole data set of window time series.
• Window time series referring to a specific stock.
• Window time series referring to a specific bounce (up to the 4th).

Here we present the analysis of the maximum dispersion of the price in a window centered around a Support or Resistance bounce, relative to the minimum tick. Let [-τ/2, τ/2] (expressed in units of the scale T) be the considered window. The selection of the range of τ was made considering the distribution of its values presented in the previous section. Recalling the table of the conventional assignment of the tick minimum at the LSE, and letting tick_STOCK be the tick minimum of the considered stock, we define the maximum dispersion in the window relative to tick_STOCK as:

\frac{\max P_T(t)_{STOCK,day} - \min P_T(t)_{STOCK,day}}{\mathrm{tick}_{STOCK}} \qquad t \in \left[-\frac{\tau}{2}, \frac{\tau}{2}\right]

We present the histogram of the distribution of these values for the whole dataset, for τ = 150, at various scales T, for Support and Resistance levels separately (figures (4.16) and (4.17)).
Figure 4.17. Histogram of the fluctuation within windows, in units of the tick minimum, centered around Resistance (left) and Support (right) levels. Window size considered: τ = 150. Scales T = 60, 90 and 180. The distribution refers to the whole data set of window time series, namely all stocks and all days.

We have labeled the values of the within-window dispersion as dictionary size in order to stress that this may also be thought of as the distribution of the effective price levels for quantization purposes. Namely, it represents the distribution of the optimal number of letters that would be present in a dictionary encoding the discretized price time series through an alphabet. To conclude, we note a peculiar feature: the local maxima of this distribution are always reached at even values, i.e. at prices that are even multiples of the tick minimum. We suppose this effect is related to the conventional assignment of tick_STOCK, which for the stocks analyzed in this thesis is:

STOCK  | tick_min (pence)
AZN    | 0.5
BP     | 0.25
GSK    | 0.5
HBOS   | 0.25
RBS    | 0.5
RIO    | 0.5
SHELL  | 0.125
ULVR   | 0.25
VOD    | 0.125

Indeed the effect is also evident when examining the same kind of distribution restricted to the time series of a particular stock. Figure (4.18) reports the results for the time series of the AZN stock at the scale T = 45, for the choice of τ = 100, around all Resistance levels found in the trading year analyzed.

Figure 4.18. Histogram of the fluctuation in units of the tick minimum (0.5 pence) in windows centered around Resistance levels for the AZN stock. Window size chosen: τ = 150; scale considered: T = 45.
Chapter 5

The Clustering Model

The main part of the thesis work concerned the creation and refinement of an algorithm to perform the clustering of the time series dataset considered. This chapter presents the algorithm and introduces the toy model created to test it, together with the financial dataset adopted for the simulations.

5.1 Structure of the Algorithm

In order to find patterns in the time series analyzed, an algorithm was designed whose output is a particular partition m_h of the time series dataset. Namely, given a dataset D_N of N time series of length τ, x_i = (x_{i1}, x_{i2}, ..., x_{iτ}), the algorithm is designed to find similarities (if present) among these time series and create clusters C_{k_h} of them:

D_N = \{x_1, x_2, ..., x_N\} \;\stackrel{\text{clustering}}{\Longrightarrow}\; m_h = \{(x_4, x_{17}, x_{25}), (x_1, x_{27}), ..., (x_N), ...\}

that is,

m_h = \{C_1, C_2, ..., C_{n_h}\} = \{C_{k_h}\}_{k=1,...,n_h}

n_h being the number of clusters provided by the partition m_h. The algorithm is structured in 3 steps (figure (5.1)) as follows:

• Random initialization: in every instance of the procedure a random initial partition is generated in order to provide a seed for the whole clustering procedure.

1. MCMC step: starting from the initial partition, a partition of the dataset is created according to Bayesian Model Selection (section (2.3.2)) through an MCMC based on an adaptation of the Metropolis-Hastings algorithm (section (3.4.2)) to the problem.

2. Splitting step: clusters provided by the partition found via MCMC are split in order to separate noisy ones. The acceptance process is controlled by a threshold, RANDOM SPLITTING, in order to avoid breaking well-formed clusters.
3. Merging step: clusters of the resulting partition are iteratively merged together in order to reduce their number. At this step too the acceptance process is controlled by a threshold, RANDOM MERGING, to minimize the unavoidable increase in noise due to the merging operation.

• (optional) Iteration: in some cases the entire process was repeated in order to refine the results; the final partition provided by the merging step was adopted as the initial partition for the MCMC step.

Figure 5.1. Schematic representation of the clustering procedure: the initialization provides the initial seed for the entire procedure. This initial partition is processed via the 3-step algorithm and, possibly, the resulting partition is adopted as a seed for a new instance of the procedure.

5.2 Toy Model

In order to test the algorithm, a toy model was defined, consisting of an artificial dataset of time series created in a controlled way so that their correct partition is known. Five mother series of length τ were considered, randomly generated in [-1:1] and color-coded (figure (5.2)):

(x_{mother})_k, \qquad k = 1, 2, 3, 4, 5

Then, starting from the mother series, the effective dataset of daughter series adopted for test purposes was generated:

(x_{daughter})_k = N((x_{mother})_k, \mathrm{diag}(\sigma = 0.1))

so that daughter series are σ = 0.1 gaussian fluctuations around the mothers (figure (5.3)).
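A minimal sketch of the toy-model construction (illustrative; names and sizes are my own choices): five mother series uniform in [-1:1], each surrounded by daughter series obtained as σ = 0.1 gaussian fluctuations.

import numpy as np

def make_toy_dataset(n_mothers=5, daughters_per_mother=10, tau=25, sigma=0.1,
                     rng=np.random.default_rng(0)):
    mothers = rng.uniform(-1.0, 1.0, size=(n_mothers, tau))
    daughters, labels = [], []
    for k, mother in enumerate(mothers):
        for _ in range(daughters_per_mother):
            daughters.append(mother + rng.normal(0.0, sigma, size=tau))
            labels.append(k)                  # true cluster label, known by construction
    return mothers, np.array(daughters), np.array(labels)

mothers, X, y = make_toy_dataset()
print(X.shape)  # (50, 25): 50 daughter series of length tau = 25

Since the generating process is controlled, the correct partition is known exactly, which is what makes the toy model usable as a test bench for the algorithm.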
Figure 5.2. Mother Series - Series of length τ = 25 randomly sampled in [-1:1]. Different mother series are color-coded: blue, red, green, magenta and cyan.

Figure 5.3. Daughter Series - Series of length τ = 25 sampled as σ = 0.1 gaussian fluctuations around the respective mothers. Daughter series, like the mother series, are color-coded: blue, red, green, magenta and cyan.
5.3 Real Series

Coping with the real time series of the data set introduced in section (4.1), the length τ was taken as the size of windows opened around Support and Resistance levels (figure (5.4)), so that the clustering of this kind of time series would provide insight into memory effects detectable directly in the shape of the price trajectory around those levels:

x_i = \{P_T(t)\} \qquad t \in \left[t_b - \frac{\tau}{2}, t_b + \frac{\tau}{2}\right] \qquad (t_b, P_T(t_b)) \text{ is a bounce point or a generating max (min)}

Figure 5.4. Trajectory around bounce - Schematic representation of a trajectory around a Resistance bounce. The bounce event is identified as the entering and the subsequent exit from the strip [P(t_b) - δ/2, P(t_b) + δ/2]. In the algorithm, the strip-width δ is regarded as the intrinsic indetermination of the time series. The trajectory is plotted as a continuous line and is defined over the symmetric interval [t_b - τ/2, t_b + τ/2] centered at the bounce point t_b, chosen as origin.

In order to homogenize the dataset, a standardization of the window-series values in the [-1:1] interval was carried out (figure 5.5):

1. Translation: each time series was translated by the level P_T(t_b) of the Support/Resistance it belongs to.
2. Rescaling: the translated time series was divided by its maximum excursion in order to keep its values in [-1:1]:

x_i \;\Longrightarrow\; \frac{x_i - P_T(t_b)}{\max_{t \in [t_b - \tau/2,\, t_b + \tau/2]} |P_T(t) - P_T(t_b)|}    (5.1)

The resulting dataset D_N thus consists of time series defined on the interval [t_b - τ/2, t_b + τ/2], whose values range over the symmetric interval [-1:1]:

D_N = \{x_i\}_{i=1,...,N} \qquad x_i \in \left[t_b - \frac{\tau}{2}, t_b + \frac{\tau}{2}\right] \times [-1:1]
Observe that the bounce point (t_b, P_T(t_b)) is mapped into the center (t_b, 0) of the [domain] × [range] rectangle and that the strip-width is consistently rescaled as:

\delta \;\Longrightarrow\; \hat{\delta} = \frac{\delta}{\max_{t \in [t_b - \tau/2,\, t_b + \tau/2]} |P_T(t) - P_T(t_b)|}    (5.2)

Figure 5.5. Trajectory around bounce, rescaled - Schematic representation of the effect of the rescaling (5.1) of a trajectory around a Resistance bounce. Values now range over the symmetric interval [-1:1] and the strip width δ is consistently rescaled to δ̂ (5.2). The bounce event (t_b, P_T(t_b)) becomes the center of the [domain] × [range] rectangle.

As described in section (4.5.2), the window size τ, here adopted as the length of the time series to be clustered, is statistically distributed according to a power law (figures (4.14) and (4.15)). It was decided to treat it as a free parameter and to study the performance of the algorithm and the results of the clustering over a wide range of τ values:

τ = 10, 20, 25, 35, 50, 100, 150, 200

where I stress that τ is measured in units of the scale T, so that the effective period of time covered by a time series of length τ is τ × T seconds. It is therefore not surprising that, for large values of τ, a time series defined around, say, the ith bounce point also covers neighboring bounces or points subsequent to the breaking (crossing) of the Support/Resistance level considered. Figure (5.6) provides a schematic representation of this effect, whereas figure (5.7) presents evidence of it in a real series.
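A minimal sketch of the standardization (5.1)-(5.2) of a single window series (assumed code, not the thesis implementation):

import numpy as np

def standardize_window(window, level, delta):
    # window: P_T(t) on [t_b - tau/2, t_b + tau/2]; level: P_T(t_b); delta: strip width
    excursion = np.max(np.abs(window - level))
    x = (window - level) / excursion        # translation + rescaling, eq. (5.1)
    delta_hat = delta / excursion           # consistently rescaled strip width, eq. (5.2)
    return x, delta_hat

w = np.array([10.0, 10.2, 10.5, 10.4, 10.1])
x, delta_hat = standardize_window(w, level=10.4, delta=0.2)  # values now lie in [-1, 1]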
Figure 5.6. Trajectory around bounce, rescaled - large τ effect - Schematic representation of the effect of considering trajectories whose length τ is considerably greater than the effective window size. The plot represents a trajectory around the ith bounce, but the trajectory also covers the (i-1)th bounce and the breakpoint where the strip gets broken and the Resistance level ceases to be valid.

5.4 Best Partition: Bayesian Characterization

In order to quantify the clustering procedure, it was decided to follow a Bayesian approach to the problem. The best partition of the dataset D_N was defined in terms of:

• a Prior P(m_h) over the partitions m_h ∈ M_N, which encodes the beliefs, if any, over the set M_N of all the possible partitions of N time series;
• a Likelihood P(D_N|m_h), which represents how likely the dataset D_N would be if the right partition were m_h.

In these terms the problem of finding clusters in the dataset is mapped into the optimization problem of finding the partition m_h that maximizes the Posterior over all the possible partitions:

P(m_h|D_N) \propto P(D_N|m_h)\, P(m_h)    (5.3)
Figure 5.7. Trajectory around bounce, rescaled - large τ effect - Real Series - Trajectory of the price time series of the RIO stock in the 202nd trading day of the year 2002 (scale T = 45 seconds; 4th bounce on a Resistance level; τ = 100; series standardized in [t_{4th} - τ/2, t_{4th} + τ/2] × [-1:1]). The trajectory is centered around the 4th bounce on the Resistance considered, but the length τ = 100 is so large that it ends up covering all previous bounces on the same Resistance (red points) and also the breakpoint (blue point). To appreciate this fact, consider that the effective period of trading covered by this time series is τ × T = 100 × 45 = 4500 seconds, namely 1 hour and 15 minutes. Note that the bounce points are consistent with the definition of relaxed max, which was used in order to identify them in a way closer to the effective bare-eye recognition adopted by technical traders (section (4.2)).

Therefore the best partition m_best = {C_1, C_2, ..., C_{n_best}} is characterized as the solution of a model/partition selection problem of the kind previously introduced (section (2.3.2)).
Note that the denominator P(D_N) (3.1), present in the general expression (2.5) for the posterior, is not necessary for the purpose of determining the optimal partition and was therefore omitted.

5.4.1 Gaussian Cost Prior

The prior was designed in terms of the number of clusters n_h > 0 provided by the partition m_h. Clearly it was not possible to write down directly an analytic form of the Prior over the possible partitions. We thought that a Gaussian Cost Function (figure (5.8))

P(m_h) = N_{n_h}(0, \sigma_p)    (5.4)

would be a sufficient constraint against an unbounded increase in the number of clusters and, at the same time, being centered at 0, it would not favor any particular number of clusters. N_{n_h}(0, σ_p) denotes the normal distribution N(0, σ_p) evaluated at the number of clusters n_h provided by the partition m_h. This kind of prior provides a simple example of the Occam's Razor principle: P(n_h) acts against the uncontrolled growth of the number of clusters, which is always favored by the likelihood adopted (introduced in the next section (5.4.2)).

Figure 5.8. Gaussian cost function as the right-hand side of the normal distribution N(0, σ_p) for σ_p = 0.1, 0.3, 0.5, 0.7, 1.
5.4.2 Gaussian Likelihood

In order to characterize the Likelihood term in (5.11), it was decided to keep things as simple as possible, i.e. to introduce as few parameters as possible. We opted to separate the contribution of each cluster C_{k_h} belonging to the partition m_h of the dataset D_N and to adopt the following assumptions:

1. Time series are temporally uncorrelated: ⟨P_T(t_k) P_T(t_j)⟩ = ⟨x_{ik} x_{ij}⟩ = 0 for all k ≠ j.

2. Their values are normally distributed around the mean series of the cluster C_{k_h} they belong to.

3. There is no correlation at all between two different time series: ⟨x_i x_j⟩ = 0 for all i ≠ j.

The case i = j, i.e. ⟨x_i^2⟩, was regarded as the intrinsic indetermination of the time series. In order not to make any assumption on the process underlying the time series, the same indetermination δ_i was adopted for all the components 1, 2, ..., τ of x_i,

⟨x_i^2⟩ = δ^2

and quantified depending on the time series considered:

• Toy Model: δ = σ, i.e. the size of the gaussian fluctuations around the mother series defining x_i = x_{daughter_i}.
• Real Series: δ = δ_strip-width, namely the width of the strip (4.1) of the window to which x_i belongs.

Point 2 needs further clarification. We did not intend to make any assumption on the (stochastic) process generating the values x_{ij}; we only assessed how likely it is that x_i belongs to C_{k_h} if m_h is the right partition of D_N. This probability was modeled taking into account several factors:

• the values x_{ij} of x_i;
• the mean series of the cluster C_k with respect to which the gaussian likelihood of the time series values is calculated:

\mu_k = (\mu_{k1}, \mu_{k2}, ..., \mu_{k\tau}) = \frac{1}{\dim(C_k)} \sum_{x_i \in C_k} x_i = \langle x_i \rangle_{C_k}    (5.5)

where dim(C_k) denotes the number of series belonging to the cluster C_k.
• To the indetermination within a cluster contribute both the intrinsic indetermination δ of the values of x_i and the intra-cluster variance, determined by the differences among the time series clustered in C_k:

\sigma^2_k = \langle x_i^2 \rangle_{C_k} - \mu^2_k    (5.6)

whose components will in general be denoted by σ²_{k_h} = (σ²_{1k_h}, σ²_{2k_h}, ..., σ²_{τk_h}).

So we can say: if m_h is the chosen partition for the dataset D_N, then the probability of the series x_i, with values x_{ij} for j = 1, ..., τ, belonging to the cluster C_{k_h}, is

P(x_i|m_h) = \mathcal{N}_{x_i}(\mu_{k_h}, \Sigma_{i,k_h}) \qquad \text{where } x_i \in C_{k_h}    (5.7)

where

\Sigma_{i,k_h} = \mathrm{diag}\left(\sqrt{\delta_i^2 + \sigma^2_{1k_h}},\; \sqrt{\delta_i^2 + \sigma^2_{2k_h}},\; ...,\; \sqrt{\delta_i^2 + \sigma^2_{\tau k_h}}\right)    (5.8)

The subscript x_i means that the multivariate normal distribution must be evaluated at (x_{i1}, x_{i2}, ..., x_{iτ}). The cluster likelihood contribution to the whole likelihood coming from C_{k_h}, due to the uncorrelation of the time series, factorizes over the time series belonging to it:

P(C_{k_h}|m_h) = \prod_{x_i \in C_{k_h}} P(x_i|m_h)    (5.9)

The complete likelihood takes into account one such term from each cluster, the clusters being regarded as completely separate entities:

P(D_N|m_h) = \prod_{C_{k_h} \in m_h} P(C_{k_h}|m_h) = \prod_{C_{k_h} \in m_h} \prod_{x_i \in C_{k_h}} P(x_i|m_h) = \prod_{C_{k_h} \in m_h} \prod_{x_i \in C_{k_h}} \mathcal{N}_{x_i}(\mu_{k_h}, \Sigma_{i,k_h}) = \prod_{C_{k_h} \in m_h} \prod_{x_i \in C_{k_h}} \prod_{j=1}^{\tau} N_{x_{ij}}\left(\mu_{jk_h}, \sqrt{\delta_i^2 + \sigma^2_{jk_h}}\right)    (5.10)

where \mathcal{N} denotes the multivariate (τ-variate) normal distribution, whereas the univariate gaussian pdf is denoted by N. The last equality stems from the uncorrelation among the time series components mentioned above. The Posterior therefore takes the form:

P(m_h|D_N) \propto P(D_N|m_h)\, P(m_h) = \prod_{C_{k_h} \in m_h} \prod_{x_i \in C_{k_h}} \prod_{j=1}^{\tau} N_{x_{ij}}\left(\mu_{jk_h}, \sqrt{\delta_i^2 + \sigma^2_{jk_h}}\right)\; N_{n_h}(0, \sigma_p)    (5.11)
where it should be recalled that n_h is the number of clusters in the partition m_h considered.

Observe that the likelihood P(D_N|m_h) chosen here can be thought of as a particular case of the marginal likelihood (2.6) with respect to the possible parameters of the model/partition m_h. Indeed the form in (5.10) can be obtained from the marginal likelihood

P(D|m_h) = \int P(D|\theta_h, m_h)\, P(\theta_h)\, d\theta_h

if one chooses a prior P(θ_h) over the parameters sharply peaked around (μ_{k_h}, Σ_{i,k_h}), k = 1, ..., n_h, denoted as (μ, Σ). For example, the calculation is straightforward if one selects

P(\theta_h) = \delta(\theta_h - (\mu, \Sigma))

This choice means that, whatever the partition m_h considered, its parameters (μ, Σ) are required to be precisely those defined in (5.5) and (5.6).

The goal of the model is to find the correct partition of the dataset D_N of N series by maximizing the posterior function P(m_h|D_N) in (5.11) over all the models/possible partitions m_h ∈ M_N. Note finally that the Likelihood term P(D_N|m_h) (5.10) is linked to the dispersion within each cluster and therefore favors the formation of many clusters. As noted before, only the presence of the Prior P(m_h) (5.4), which acts against the increase in the number of clusters, can lead the model to stabilize around the correct number of clusters and thus eventually find its way to the maximum of the Posterior P(m_h|D_N) (5.11).
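For concreteness, a hedged sketch of the log-posterior (5.11) as it could be evaluated numerically: the Gaussian cost prior (5.4) on the number of clusters plus, for each cluster, a diagonal-Gaussian likelihood with per-component variance δ_i² + σ²_{jk}. A partition is represented here as a list of index arrays; this is an illustration under the stated assumptions, not the thesis code.

import numpy as np

def log_posterior(X, partition, delta, sigma_p):
    # Gaussian cost prior N_{n_h}(0, sigma_p), eq. (5.4), up to an additive constant
    logp = -0.5 * (len(partition) / sigma_p) ** 2
    for idx in partition:
        cluster = X[idx]                          # (n_series, tau) block of the dataset
        mu = cluster.mean(axis=0)                 # mean series of the cluster, eq. (5.5)
        var = cluster.var(axis=0) + delta ** 2    # intra-cluster + intrinsic variance, eqs. (5.6), (5.8)
        logp += -0.5 * np.sum((cluster - mu) ** 2 / var + np.log(2 * np.pi * var))
    return logp

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 25))
print(log_posterior(X, [np.array([0, 1, 2]), np.array([3, 4, 5])], delta=0.1, sigma_p=0.5))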
5.5 MCMC Step

A combinatorial result states that the number of possible partitions of the set D_N is the Bell number B_N: the number of ways in which a set of N objects(1) can be written as a disjoint union of non-empty subsets. This number may be defined recursively as:

B_{n+1} = \sum_{k=0}^{n} \binom{n}{k} B_k

where the binomial coefficient gives the multiplicity of each partition of D_N into subsets of k < N objects. In principle, once some kind of similarity among the objects is defined, one could try each partition until the best one is found. The problem with B_N is that it grows with N faster than 2^N:

N    | 1 | 2 | 3 | 5  | 10     | 20             | 50
2^N  | 2 | 4 | 8 | 32 | 1024   | ~1.05 · 10^6   | ~1.1 · 10^15
B_N  | 1 | 2 | 5 | 52 | 115975 | ~5.17 · 10^13  | ~1.85 · 10^47

so the computation soon becomes prohibitive and some kind of approximation is needed. For this reason it was decided to define a Markov Chain which visits the Posterior landscape following the logic of the Metropolis-Hastings Algorithm (introduced in section (3.4.2)). We therefore considered a Markov Chain having as target distribution the Posterior P(m_h|D_N) (5.11). In order to adapt the acceptance/rejection procedure of the Metropolis-Hastings algorithm to this problem, the following jump Proposal q(m_p|m_h) was chosen, m_h being the current partition and m_p the proposed one:

Jump Proposal:
  sample series label i ~ U[1,N]
  sample cluster label k ~ U[1,n_h]
  if x_i ∈ C_{k_h} then: make the singleton (x_i)
  else if x_i ∉ C_{k_h} then: reallocate C_{k_h} ← x_i
  end if

Note that, according to this proposal, from the current partition to the proposed one the number of clusters may vary by only one unit at a time: n_p = n_h ± 1.

(1) as indeed is D_N
and, due to the uniform sampling, an analytic form of this proposal is:

q(m_p|m_h) = \frac{1}{N n_h}

Recalling the form of the acceptance probability of the MH algorithm (3.8), in this case it reads:

A(m_h, m_p) = \min\left\{1, \frac{P(D_N|m_p)\, P(m_p)\, n_h}{P(D_N|m_h)\, P(m_h)\, n_p}\right\}

Actually, the calculation was carried out via logarithms(2). In terms of logarithms the acceptance probability becomes:

A(m_h, m_p) = \min\{0, [\log(P(D_N|m_p)) - \log(P(D_N|m_h))] + [\log(P(m_p)) - \log(P(m_h))] + [\log(n_h) - \log(n_p)]\}

and, as a consequence, the acceptance threshold of the M-H algorithm was taken to be:

\ln(u) \qquad \text{where } u \sim U[0,1]    (5.12)

As previously mentioned, the behavior of the acceptance A(m_h, m_p) consists in a balance between the difference of the log-likelihoods (5.10),

\Delta_{likelihood} = \log(P(D_N|m_p)) - \log(P(D_N|m_h)) = \log(P(D_N|n_h \pm 1)) - \log(P(D_N|n_h))    (5.13)

which always favors the increase of the number of clusters,

\Delta_{likelihood} > 0 \;\Longleftrightarrow\; n_p = n_h + 1

and the difference in log-priors (5.4),

\Delta_{prior} = \log(P(m_p)) - \log(P(m_h)) = \log(P(n_h \pm 1)) - \log(P(n_h))    (5.14)

which acts to limit this degeneracy, but in a non-linear fashion (figure (5.9)).

(2) It should be noted that log(·) is a monotone function, so it does not change the ordering: if x > y then log(x) > log(y).
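The following sketch illustrates one accept/reject iteration of the MCMC step with the jump proposal described above (it reuses the log_posterior sketch of section (5.4); again an illustration, not the thesis code). The term log(n_h) - log(n_p) accounts for the asymmetry of the proposal.

import numpy as np

def mh_step(X, partition, delta, sigma_p, rng):
    i = rng.integers(len(X))                              # series label ~ U[1, N]
    k = rng.integers(len(partition))                      # cluster label ~ U[1, n_h]
    proposal = [idx[idx != i] for idx in partition]       # remove series i from its cluster
    if i in partition[k]:
        proposal.append(np.array([i]))                    # make the singleton (x_i)
    else:
        proposal[k] = np.append(proposal[k], i)           # reallocate x_i into C_k
    proposal = [idx for idx in proposal if len(idx) > 0]  # drop clusters left empty
    log_a = (log_posterior(X, proposal, delta, sigma_p)
             - log_posterior(X, partition, delta, sigma_p)
             + np.log(len(partition)) - np.log(len(proposal)))
    # accept if ln(u) < min{0, log_a}, threshold (5.12)
    return proposal if np.log(rng.random()) < min(0.0, log_a) else partition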
Figure 5.9. Log-Prior gain - Behavior of the difference of log-priors Δ_prior (5.14) in the case of a proposal decreasing the number of clusters (5.15): n_h → n_p = n_h - 1. The order of magnitude depends on the dispersion parameter σ_p, whereas the behavior is the same for each value of σ_p. The prior (5.4) acts as a strong constraint only for small values of n_h.

Figure (5.9) presents the behavior of

\Delta_{prior}(n_h \to n_p = n_h - 1) = \log(N(n_h - 1; 0, \sigma_p)) - \log(N(n_h; 0, \sigma_p))    (5.15)

namely the prior gain due to the decrease of the number of clusters. The order of magnitude of this gain depends strongly on the scale parameter σ_prior, whereas the behavior is the same: the prior (5.4) acts as a strong constraint only for small values of n_h.

To conclude, it should be noted that the particular scale σ*_p adopted was chosen case by case, so that the prior had an order of magnitude comparable with that of the difference in log-likelihoods Δ_likelihood:

\Delta_{likelihood} \sim \Delta_{prior}(\sigma^*_p)

This consideration, together with the lack of an argument to single out a preferred value of σ_p, led us to analyze each time several orders of magnitude of the σ_p spectrum:

σ_p = 10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1
5.6 Splitting Step

The Markov Chain of the MCMC step visits the posterior landscape step by step, so the chain may get trapped in a local maximum of P(m_h|D_N). The splitting step was therefore defined in order to provide a macroscopic displacement in the posterior domain, with the aim of finding a higher maximum, i.e. a better partition. Given a partition m_h = {C_1, C_2, ..., C_{n_h}}, the splitting was defined to act on each cluster of the partition as the best splitting into 2 clusters (figure (5.10)),

C_k \;\stackrel{\text{splitting}}{\Longrightarrow}\; \{C_{k_1}, C_{k_2}\}

and the process was driven by the maximization of the likelihood gain:

\Delta_{splitting} = (L(C_{k_1}) + L(C_{k_2})) - L(C_k)    (5.16)

where

L(C) = \log(P(C|m_h)) = \sum_{x_i \in C} \log(P(x_i|m_h))

adopting the definition of cluster likelihood (5.9). Note that, consistently with the considerations made about the form of the likelihood (section (5.4.2)), the splitting is actually always a gain,

\Delta_{splitting} > 0

indeed the splitting operation reduces the within-cluster dispersion σ_k (5.6), down to zero in the case of singleton clusters C_k = {x_i}, in which case the whole covariance matrix (5.8) of the multivariate gaussian likelihood of the single series(3) (5.7) degenerates into:

\Sigma_{i,k_h} \;\stackrel{\text{singleton}}{\Longrightarrow}\; \delta_i \cdot \mathrm{Id}_{\tau \times \tau}

Due to this fact, the dispersion of the 2-cluster system turns out to be smaller than that of the original C_k, so that the resulting log-likelihood L(C_{k_1}) + L(C_{k_2}) is greater.

The number of ways of splitting a cluster of N objects into k disjoint sub-clusters is known in the combinatorial literature as the Stirling number of the 2nd kind, which counts the number of ways to partition a set of N labeled objects into k non-empty unlabeled subsets, and is denoted as:

\left\{ {N \atop k} \right\} = \frac{1}{k!} \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} j^N

It should be noted, in passing, that this number can be viewed as the kth addend defining the Bell number,

B_N = \sum_{k=0}^{N} \left\{ {N \atop k} \right\}

which indeed takes into account all the possible ways of partitioning.

(3) which in the case of a singleton cluster coincides with the cluster likelihood (5.9)
Figure 5.10. Splitting example - Schematic example of the splitting of τ = 25 time series, m_h = {C_purple} ⟹ {C_blue, C_red}, possibly wrongly clustered by the MCMC chain. It should be observed that such a drastic change in the structure of the partition would be reached by the Markov Chain alone only after several iterations.
The combinatorial problem of the splitting operation presented here corresponds to:

1. finding all the ways of splitting N objects into k = 2 sub-clusters,

\left\{ {N \atop 2} \right\} = \frac{1}{2} \sum_{j=0}^{2} (-1)^{2-j} \binom{2}{j} j^N = 2^{N-1} - 1

2. evaluating Δ_splitting on each of them in order to find the greatest gain.

The feasibility of this exact computation depends strongly on the time needed to calculate the log-likelihood of the 2-cluster system provided by the splitting operation(4). This computation is really time consuming; indeed, according to the likelihood expression (5.10), it depends on:

• the number N of series present in the cluster considered;
• the length τ of each series, which determines the number of single-coordinate contributions, according to the last equality in (5.10).

It was therefore decided to adopt this exact computation only for clusters up to dim(C_k) = N = 16 and to define an appropriate Markov Chain for bigger ones. This MC too was designed according to the Metropolis-Hastings algorithm (section (3.4.2)), but now the goal was to maximize the gain Δ_splitting of each cluster. To cope with this issue, we considered:

• a proposal which acts directly on the space of 2-cluster systems, given the original one, C, with dimension dim(C) = N:

2-Clusters Subspace Proposal:
  randomly split: C → {C_1, C_2}
  flag = 0
  while flag == 0 do:
    sample series label i ~ U[1,N]
    sample cluster label k ~ U[1,2]
    if x_i ∈ C_k and dim(C_k) > 1 then:
      reallocate C_{¬k} ← x_i
      flag = 1
    end if
  end while

(4) which has to be repeated {N atop 2} times.
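For clusters with up to 16 series, the exact search can be sketched as an enumeration of all 2^(N-1) - 1 bipartitions, keeping the one with the largest gain Δ_splitting (5.16). The cluster log-likelihood below follows the same diagonal-Gaussian assumptions as eq. (5.10); this is an assumed illustration, not the thesis code.

import itertools
import numpy as np

def cluster_loglik(cluster, delta):
    mu = cluster.mean(axis=0)
    var = cluster.var(axis=0) + delta ** 2
    return -0.5 * np.sum((cluster - mu) ** 2 / var + np.log(2 * np.pi * var))

def best_split(cluster, delta):
    # Exhaustive best bipartition of a small cluster (exact computation, N <= 16)
    n = len(cluster)
    base = cluster_loglik(cluster, delta)
    best_gain, best_pair = -np.inf, None
    for r in range(1, n // 2 + 1):
        for subset in itertools.combinations(range(n), r):
            rest = [i for i in range(n) if i not in subset]
            gain = (cluster_loglik(cluster[list(subset)], delta)
                    + cluster_loglik(cluster[rest], delta) - base)   # Delta_splitting, eq. (5.16)
            if gain > best_gain:
                best_gain, best_pair = gain, (list(subset), rest)
    return best_gain, best_pair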
The dichotomous logic of k or ¬k is allowed by the fact that the split partition, which provides the initial seed for every instance of the proposal, is made up of 2 clusters only. The while loop was chosen in order to avoid reallocating a series from a singleton cluster C_k (and, consequently, eliminating it) to the (N-1)-sized one C_{¬k}, i.e. in order to keep the MC within the subspace of 2-cluster partitions of N series.

• the acceptance probability of the MH algorithm, which results simply in the expression

A(m_h, m_p) = \min\{0, \Delta_{splitting}\}

where the likelihood gain is evaluated, at every iteration, between the current partition m_h and the proposed one m_p;

• a logarithmic threshold ln(u) as in (5.12).

Figure 5.11. RANDOM SPLITTING(N) for different time series lengths τ - For every instance run = 1, ..., 100, a unique mother series in [-1:1] was generated and N daughter series were sampled as σ = 0.1 gaussian fluctuations around it. In order not to introduce spurious correlations among values belonging to different sizes N, the sample of daughter series was generated from scratch for each N value. Red points refer to N ≤ 16 and are evaluated exactly; otherwise the calculation of RANDOM SPLITTING(N > 16) was carried out through the MCMC described in section (5.6) (blue points in the figure). The sizes N effectively calculated were N = 2, 3, ..., 16, 17, 18, ..., 25, 30, 35, 40, 45, 50, 75, 100, 250 and 500. Error bars are the dispersions of the values in the 100-run sample. Bottom-up view of results obtained for lengths τ = 10, 20, 35, 50, 100, 150, 200.
5.6.1 The RANDOM SPLITTING Threshold

The splitting step was designed in order to separate noisy clusters from well-formed ones. However, in the light of the following considerations:

• the increase in likelihood is unavoidable, i.e. Δ_splitting > 0 whatever(5) splitting is performed;
• the exact/MC splitting procedure acts to maximize Δ_splitting directly;

a threshold seemed necessary in order to control the acceptance/rejection of the best partition provided by the splitting procedure. The chosen threshold, called RANDOM SPLITTING, is characterized as the likelihood gain provided by splitting N-sized clusters that are well-formed(6):

\text{RANDOM SPLITTING}(N) \stackrel{\text{def}}{=} \langle (L(C_{k_1}) + L(C_{k_2})) - L(C_k) \rangle_{100\ run}    (5.17)

where the statistical significance of its value is provided by the average over 100 instances. This value provides a benchmark for each splitting operation: given a cluster C whose dimension(7) is N, the splitting operation is accepted if and only if the gain provided is greater than this threshold:

SPLITTING ACCEPTED ⟺ Δ_splitting > RANDOM SPLITTING(N)

The values of RANDOM SPLITTING(N) were computed straightforwardly

• via exact computation, through the definition (5.17), if N ≤ 16;
• via MCMC, with the 2-clusters subspace proposal previously introduced, otherwise;

generating in each case a cluster of N series, all fluctuations around the same mother, so as to characterize a cluster that should not be separated. The computation of each value was repeated for every instance run = 1, ..., 100, starting from a different mother series, and for all values of the series length τ considered. Computing values for all sizes N would be a very long task, but fortunately RANDOM SPLITTING(N) presents a regular behavior (figure (5.11)), so it was decided to compute only some values and then rely on a fit to interpolate the intermediate ones (figure (5.12)).

(5) namely "right" or "wrong" splitting, rather evident in the toy model, less so in real data
(6) and therefore not to be separated
(7) I recall that the dimension denotes the number of series contained in the cluster
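A hedged sketch of how the RANDOM SPLITTING(N) benchmark of eq. (5.17) can be estimated (assumed code): for each run, a cluster of N daughters of a single mother, i.e. a cluster that should not be separated, is generated and its best-split gain is computed; the threshold is the average over the runs. It reuses best_split from the sketch above and, for simplicity, always uses the exhaustive search (the thesis switches to a Markov chain for N > 16).

import numpy as np

def random_splitting(N, tau=25, sigma=0.1, runs=100, rng=np.random.default_rng(0)):
    gains = []
    for _ in range(runs):
        mother = rng.uniform(-1.0, 1.0, size=tau)
        cluster = mother + rng.normal(0.0, sigma, size=(N, tau))
        gains.append(best_split(cluster, delta=sigma)[0])
    return np.mean(gains), np.std(gains)   # threshold and its dispersion over the runs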
Figure 5.12. Log and linear fit of RANDOM SPLITTING(N) - τ = 100 - For every instance run = 1, ..., 100, a unique mother series in [-1:1] was generated and N daughter series were sampled as σ = 0.1 gaussian fluctuations around it. In order not to introduce spurious correlations among values belonging to different sizes N, the sample of daughter series was generated from scratch for each N value. Red points refer to N ≤ 16 and are evaluated exactly; otherwise the calculation of RANDOM SPLITTING(N > 16) was carried out through the MCMC described in section (5.6) (blue points in the figure). The sizes N effectively calculated were N = 2, 3, ..., 16, 17, 20, 25, 30, 35, 40, 45, 50, 75, 100, 250 and 500. Results obtained for length τ = 100. Log (top graph) and linear (bottom graph) fits are provided. Error bars are the dispersions of the values in the 100-run sample, consistently scaled in the logarithmic plot.
5.7 Merging Step

The last step of the procedure is the merging step, designed to merge together some of the clusters found through the splitting procedure, so as to reduce their number (figure (5.13)). In order to define this procedure, let m_h be the current partition:

m_h = \{C_1, C_2, ..., C_{n_h}\} \;\stackrel{\text{merging}}{\Longrightarrow}\; \{(C_1 + C_3), (C_2 + C_{11} + C_8 + C_{10}), ..., (C_{n_h} + C_6 + C_{27})\} = \{C_\alpha, C_\beta, ..., C_{\hat{n}_h}\}

where n̂_h < n_h and the last equality means that, once merged, the super-clusters created behave as normal clusters. This step is driven by the minimization of the likelihood loss:

\Delta_{merging} = L\{C_\alpha, C_\beta, ..., C_{\hat{n}_h}\} - L(m_h)    (5.18)

where

L(m) = \log(P(D_N|m)) = \sum_{C \in m} L(C|m) = \sum_{C \in m} \sum_{x_i \in C} \log(P(x_i|m))    (5.19)

adopting the definitions of cluster likelihood (5.9) and single-series likelihood (5.10). Δ_merging was therefore defined as the decrease in likelihood when the current partition m_h is merged until it becomes the partition {C_α, C_β, ..., C_{n̂_h}}. Considerations specular to those already made about the splitting operation show that the merging is actually always a loss,

\Delta_{merging} < 0

Indeed, the merging operation increases the within-cluster dispersion σ_k (5.6) and correspondingly decreases the likelihood, compared to that of the original clusters.

As for the Splitting and MCMC steps, also in this case the procedure of finding the best merged partition was carried out via a Markov Chain Monte Carlo structured within the logic of the Metropolis-Hastings algorithm, but with the minimization of Δ_merging as the goal. To cope with this issue, the Merging proposal chosen is identical to that adopted for the MCMC step, but acts on clusters instead of single series. Let m_h be the current partition, providing n_h clusters, and let {C} represent the set of those not already merged(8):

Merging Proposal:
  sample cluster label i ~ U_{C}
  sample cluster label k ~ U[1,n_h]
  if C_i ∉ C_k then: merge C_k ← C_i
  end if

(8) namely, to be distinguished from super-clusters, i.e. clusters resulting from previous merging operations
Figure 5.13. Merging example - Schematic example of the merging of τ = 25 time series, m_h = {C_1, C_2} ⟹ {C_{1+2}}, possibly wrongly split by the splitting procedure.
It was decided to operate in an irreversible way, namely a super-cluster already formed cannot be split again. Note that, consistently, the range of selection of the index i of the proposed reallocating cluster C_i is only the set of clusters {C} not already merged, i.e. not super-clusters.

5.7.1 The RANDOM MERGING Threshold

This procedure, as it stands, would very quickly result in the degeneration of the partition into a unique big super-cluster, n_h-sized in terms of clusters, or N-sized in terms of single series. Therefore, in order to establish an acceptance criterion for the merging proposal, it was decided to compare the Δ_merging between the current and proposed partitions with a threshold:

MERGING ACCEPTED ⟺ Δ_merging > -|threshold|

Following the logic adopted in defining the RANDOM SPLITTING, this merging threshold was defined to be the likelihood loss provided by merging 2 clusters composed of series intrinsically belonging to different clusters. This intrinsic dissimilarity was obtained by forming 2 clusters from time series belonging to different mother series. The RANDOM MERGING was therefore introduced, defined as:

\text{RANDOM MERGING}(N_1, N_2) \sim L\{C_1, C_2\} - (L(C_1) + L(C_2))

This value provides a benchmark for each merging instance: given two clusters C_1 and C_2, of respective dimensions N_1 and N_2, the merging proposal is accepted if and only if the loss provided by the merging operation is smaller in magnitude than this threshold:

MERGING ACCEPTED ⟺ Δ_merging > RANDOM MERGING(N_1, N_2)

The values of RANDOM MERGING(N_1, N_2) were computed by generating two samples, of sizes N_1 and N_2 respectively, of series intrinsically belonging to two different clusters. This was achieved by considering two different mother series in [-1:1] and then drawing daughter series from them. The resulting matrix is symmetric,

\text{RANDOM MERGING}(N_1, N_2) = \text{RANDOM MERGING}(N_2, N_1)

so that, having considered the case max(N_1) = max(N_2) = N, it was necessary to compute only N(N+1)/2 terms. In order to avoid spurious correlations among values belonging to different pairs (N_1, N_2), particular attention was paid to the process of generation of the daughter samples: for every pair (N_1, N_2), two different mother series in [-1:1] were generated each time.
Thus samples of daughters belonging to different values of the pair (N_1, N_2) were completely different. The calculation was repeated up to the values (N_1 = 100, N_2 = 100) (figures (5.14) and (5.15)) and the statistical significance of the RANDOM MERGING(N_1, N_2) threshold was provided by averaging its value over 100 runs of computation:

\text{RANDOM MERGING}(N_1, N_2) \stackrel{\text{def}}{=} \langle L\{C_1, C_2\} - (L(C_1) + L(C_2)) \rangle_{100\ run}    (5.20)
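Analogously, a hedged sketch of the estimation of RANDOM MERGING(N_1, N_2) (5.20): at each run, two clusters of daughters of two different mother series are generated and the likelihood loss incurred by merging them is recorded; the threshold is the average over 100 runs (assumed code, reusing cluster_loglik from the splitting sketch).

import numpy as np

def random_merging(N1, N2, tau=25, sigma=0.1, runs=100, rng=np.random.default_rng(0)):
    losses = []
    for _ in range(runs):
        m1, m2 = rng.uniform(-1.0, 1.0, size=(2, tau))        # two different mothers
        C1 = m1 + rng.normal(0.0, sigma, size=(N1, tau))
        C2 = m2 + rng.normal(0.0, sigma, size=(N2, tau))
        merged = np.vstack([C1, C2])
        losses.append(cluster_loglik(merged, sigma)
                      - (cluster_loglik(C1, sigma) + cluster_loglik(C2, sigma)))
    return np.mean(losses), np.std(losses)                    # the mean is negative: merging is a loss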
Figure 5.14. RANDOM MERGING(N_1, N_2) - τ = 10, 25, 50 - Mean (left) and dispersion (right) over 100 runs of the threshold of the merging procedure, one row per value of τ. The behavior is rather similar among the graphs, whereas the order of magnitude depends strongly on the series length τ.

Figure 5.15. RANDOM MERGING(N_1, N_2) - τ = 100, 150, 200 - Mean (left) and dispersion (right) over 100 runs of the threshold of the merging procedure, one row per value of τ. The behavior is rather similar among the graphs, whereas the order of magnitude depends strongly on the series length τ.
Chapter 6

The Clustering Results

After a brief review of the role of the parameters introduced, we present the test results obtained on the toy model and those obtained on the financial time series of the data set considered.

6.1 Role of Parameters

When dealing with objects to be clustered, and in particular with time series, two parameters play a crucial role:

• N: the number of series to be clustered.
• τ: the length of each time series.

The first increases the overall noise, whereas the length of the series affects the clustering in the following way: if τ is too short, the reliability of the similarity among time series is low, whereas if it is too long, the time series ends up covering a period of time much larger than the window around the Resistance or Support level considered (see the discussion in section (5.3)). Also not negligible is the increase in computational time needed to process more and/or longer time series.

Introducing parameters in a model is a delicate issue, so their number was kept as small as possible, but some of them were unavoidable:

• σ_prior: it appears in the definition of the Prior (5.4) and actually represents its strength parameter. As discussed at the end of section (5.5), and as will become clear on inspecting the results, an appropriate value σ*_p would be one at which prior and likelihood balance. The difficulty of determining this value a priori led us to consider several orders of magnitude of the σ_p spectrum, σ_p = 10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1, together with further values when needed.

• σ: it has a straightforward interpretation only in the toy model, representing the gaussian dispersion of the daughters around the mother series values. When dealing with
real series, it was assumed that σ = δ_i, thereby identifying the gaussian dispersion (5.8) around the mean series of the cluster (5.5) (of course this too is a strong assumption) with an intrinsic dispersion, given by the strip-width δ_i, plus an intra-cluster dispersion (5.6). It is a delicate, perhaps critical, point for the model, but it should also be noted that in computing the covariance matrix (5.8) this seemed the most natural assumption to account for the intrinsic degree of dispersion δ_i of each time series.

Figure 6.1. RANDOM MERGING - dependence on σ - length τ = 100 - Plot of the merging threshold evaluated for samples of daughter series generated with σ = 0.25, 0.1, 10^-2, 10^-3 around the mother series values (one panel per σ). The main difference among the plots is the increase of the range of RANDOM MERGING(σ) as the noise level σ decreases.
6.1.1 Noise Dependency of the RANDOM Thresholds

In order to consider the possibility that the real time series would be better represented by different levels of noise, the two thresholds of the splitting and merging steps were studied as functions of σ. The study was carried out by evaluating RANDOM SPLITTING and RANDOM MERGING on daughter time series generated with different levels of dispersion σ around their mothers. The results are less trivial than expected. It was found that:

• the threshold of the splitting step is substantially independent of the magnitude of the noise (figures (6.3) and (6.4));
• the threshold of the merging procedure depends strongly on σ. As summarized in figure (6.2), the main dependence is the increase in the range of values as σ decreases. Some plots of RANDOM MERGING(σ) are reported in figure (6.1), while the rest are listed in appendix (A).

While in the toy model the true σ is, by construction, specified in the process generating the daughter series, when dealing with real time series these observations led us to run in parallel several merging steps corresponding to different RANDOM MERGING(σ) thresholds and to determine the best choice by visually inspecting the clustering results.

Figure 6.2. RANDOM MERGING - dependence of the range of values on σ - length τ = 100 - Plot of the range of values of the merging threshold for samples of daughter series generated as σ = 0.25, 10^-1, 5 · 10^-2, 10^-2, 5 · 10^-3, 10^-3, 5 · 10^-4, 10^-4, 5 · 10^-5, 10^-5 gaussian fluctuations around the mother series values. Some of the corresponding RANDOM MERGING(σ) plots were presented above in figure (6.1); the others were omitted here for the sake of brevity and are listed in appendix (A).
Figure 6.3. RANDOM SPLITTING - independence from σ - lengths τ = 100, 150, 200 - Overlapped plots of RANDOM SPLITTING(N) evaluated for σ = 0.25, 0.1, 10^-2, 10^-3, 10^-4. Values are reported in semi-log scale up to N = 50 in order to magnify possible discrepancies among the five curves; there is no evidence of any. As previously discussed, red values were calculated exactly, whereas blue ones via the MCMC chain with the 2-clusters subspace proposal presented in section (5.6). Note that, unless otherwise specified, the entire analysis was carried out generating σ = 0.1 daughters and relying on the corresponding values of the splitting threshold.

Figure 6.4. RANDOM SPLITTING - independence from σ - lengths τ = 100, 150, 200 - Absence of trend - For each value of N the dispersion of RANDOM SPLITTING(N, σ) among the values belonging to the different σ is plotted, rescaled by the mean value ⟨RANDOM SPLITTING(N, σ)⟩_σ. As in figure (6.3), the values σ = 0.25, 0.1, 10^-2, 10^-3, 10^-4 were adopted and points are reported up to N = 50. The resulting plots show no trend in the dependence of the splitting threshold on σ.
6.2 Toy Model Clustering

In order to test the algorithm, several samples of daughter series belonging to 5 mother series were generated (the procedure is described in section (5.2)). Sizes N = 10, 25, 50, 100, 500, 1000 were considered, along with lengths τ = 10, 25, 50, 100, 150, 200.

6.2.1 Insights on Convergence

Dealing with Markov Chains defined on the space of partitions, it was thought that one way to assess the behavior of the chain would be to inspect:

• the trace of the number of clusters, which is expected to settle on the correct number of clusters (5 in the toy model);
• the trace of the Posterior, which is supposed to be maximized by the MC.

As is clear from figure (6.5), when convergence is reached, it is reached very "quickly". But the analysis did not always proceed so smoothly; in fact it was thought that evidence of the presence of a right value σ*_p of the σ_prior parameter would be suggested by the behavior of the confidence,

\text{confidence} = \frac{\text{Number of Occurrences}}{\text{Iterations} \times \text{Number of Parallel Chains}}    (6.1)

which seemed to present a peak at the value of σ_p at which the chain converges to the correct partition (1st plot in figure (6.6)). Unfortunately, the sharp peak tends to disappear as the number N of time series to be clustered increases. This could be interpreted as a kind of flattening of the posterior landscape, for which reason it was decided to carry out the analysis over a wide range of the σ_p spectrum.
Figure 6.5. Traces of the Markov Chain - N = 25 - τ = 25 - σ_prior = 0.5 - The plots at the top represent the trace of the number of clusters of the partitions visited by the chain over the whole path of 150000 iterations (1st plot) and over the last 500 iterations. The bottom plots provide a synoptic view of the behavior of, from top to bottom: log-Likelihood, Posterior, and log-Prior of the same chain. Results are provided for a sample of 5 parallel chains starting from different initial partitions, whose numbers of clusters were T(0) = 16, 21, 15, 10, 9.

Figure 6.6. Confidence of the Chain - N = 10, 15, 20, 25 - τ = 10 - Plots of the confidence (6.1) of m = 5 parallel chains whose target was the Posterior over partitions of N = 10, 15, 20 or 25 toy series. Different markers correspond to different distances from the correct number of clusters (which is 5 in this toy model). What ought to be noted here is the progressive smoothing of the sharp peak of the confidence at the σ_p at which the chain finds the solution. It was therefore decided to consider the entire spectrum of σ_prior each time.
6.2.2 σ_prior Analysis and Sub-Optimal Partitions

This section presents partial results, obtained after the MCMC step alone. The analysis was carried out for several values of σ_prior, and the results were analyzed both keeping the number N of time series fixed while letting their length τ vary and, vice versa, studying the clustering as N varies with τ fixed. As is evident from figures (6.7) and (6.8), the two time-series parameters act against the clustering: while at low values there exists a rather wide portion of the σ_prior spectrum at which the chain finds its way to the correct solution (denoted as red points at the 5-cluster level in the figures), at high N and/or τ values it is hardly, or not at all, reached.

Nevertheless, this situation is not completely unsatisfactory: in fact the solution provided around the correct number-of-clusters level is actually a sub-optimal partition. As one might imagine, high values of σ_prior always result in a relaxation of the partition, because the strength of the prior is not enough to keep series together within a cluster. On the other side, decreasing σ_p, the binding can become so strong as to keep merged in a unique cluster even very different time series. It should therefore be plausible that, given the noise level σ with which the daughter series are generated, and for fixed N and τ values, each value of σ_prior results in a characteristic number of clusters. What is less intuitive, and came as a surprise, is that the Markov Chain of the MCMC step, once trapped in a particular region of the Posterior space corresponding to the characteristic number of clusters determined by the parameter setting, naturally finds the best partition given that number of clusters: a kind of sub-optimal partition (figures (6.9) and (6.10)). This should be considered a noticeable result, as the aim of the analysis of real time series was only to find evidence of recurrent structures among them, not the perfect way of partitioning.
Figure 6.7. Clustering after MCMC Step - length τ = 100 fixed - N = 10, 25, 50, 100, 500, 1000 series - Number of clusters of the partitions resulting from the MCMC step, obtained for several values of the σ_p spectrum on samples of N = 10, 25, 50, 100, 500, 1000 series of length τ = 100. Red points mark convergence to the exact 5-cluster partition, whereas blue points denote wrong clustering. The red dashed line marks the correct number of clusters: 5. Results are obtained using m = 5 parallel chains, for 150000 iterations.
Figure 6.8. Clustering after MCMC Step - N = 50 series fixed - τ = 10, 25, 50, 100, 150, 200 - Number of clusters of the partitions resulting from the MCMC step, obtained for several values of the σ_p spectrum on a sample of N = 50 series of length τ = 10, 25, 50, 100, 150, 200. Red points mark convergence to the exact 5-cluster partition, whereas blue points denote wrong clustering. The red dashed line marks the correct number of clusters: 5. Results are obtained using m = 5 parallel chains, for 150000 iterations.
Figure 6.9. Sub-Optimal partition - N = 50, τ = 50, σ_prior = 0.15 - Clustering provided by the MCMC step for the toy model at N = τ = 50 and σ_p = 0.15 (compare with the third plot in figure (6.8)). Three clusters were found and, as is evident, this partition is the best 3-cluster partition of the dataset considered: the blue and red clusters are "pure", whereas all the rest is concentrated in the remaining cluster.
Figure 6.10. Sub-Optimal partition - N = 50, τ = 25, σ_prior = 0.45 - Clustering provided by the MCMC step for the toy model at N = 50, τ = 25 and σ_p = 0.45 (compare with the second plot in figure (6.8)). Six clusters were found and, as is evident, this partition is the best 6-cluster partition of the dataset considered: the green cluster is wrongly split, but the others are perfect.
6.2.3 Results of the Entire 3-Steps Procedure

As described in sections (5.6) and (5.7), the partitions output by the MCMC step were processed via the Splitting and Merging operations; figure (6.11) shows the widening of the range of the σ_prior spectrum over which the correct partition is reached.

Figure 6.11. Example of 3-steps procedure results - Number of clusters against σ_prior obtained after, respectively, the MCMC (top), Splitting (middle) and Merging (bottom) steps, for N = 50 and N = 100 toy time series of length τ = 100. Results are obtained using m = 5 parallel chains and 150000 iterations in the MCMC step (the same reported in figure (6.7, 4th graph) for N = 50 and figure (6.8, 4th graph) for N = 100), one chain for 25000 iterations in the Splitting step and m = 20 chains for 10000 iterations in the Merging step. The improvement in convergence to the correct 5-cluster partition (red points) is evident.
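A schematic driver for the 3-steps procedure, with the chain counts and iteration budgets quoted above, could look as follows; the three step functions are hypothetical placeholders for the MCMC, Splitting and Merging routines of chapter 5, so this is a sketch of the control flow, not the actual implementation.

def three_step_clustering(series, sigma_prior,
                          mcmc_step, splitting_step, merging_step):
    # Step 1: MCMC over partitions, m = 5 parallel chains, 150000 iterations.
    partition = mcmc_step(series, sigma_prior, n_chains=5, n_iter=150_000)
    # Step 2: Splitting, one chain for 25000 iterations.
    partition = splitting_step(series, partition, n_chains=1, n_iter=25_000)
    # Step 3: Merging, m = 20 chains for 10000 iterations.
    partition = merging_step(series, partition, n_chains=20, n_iter=10_000)
    return partition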
6.3 Real Series Clustering

To cope with the clustering of real financial time series, the definitions of section (5.3) were adopted: the dataset was composed of time series [−τ/2 : τ/2] centered around bounces on Resistance or Support levels and rescaled so that their values lie in the [−1 : 1] interval. As discussed in section (4.1.1), windows of series previously rescaled every T = 45, 60, 90 and 180 seconds were considered. Figure (6.12) reports two examples of the datasets of time series considered. Forgive the emotional expression, but the aim of the thesis was indeed to spot regularities in such a chaos!

Figure 6.12. Dataset of 4th bounce T = 45 time series, τ = 100 - Two examples of the datasets considered. The top plot shows a sample of 10³ series from the dataset of N = 1172 time series of length τ = 100, rescaled every T = 45 seconds and referring to the 4th bounce on Resistance levels. The bottom plot shows 10³ of the N = 1158 time series forming the dataset of 4th bounces on Support levels.
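As an illustration of the preprocessing just described, the following sketch extracts a window of τ points centered on a bounce event and maps it to the [−1 : 1] interval; the min-max normalization used here is an assumption of the sketch, the exact rescaling being the one defined in section (5.3).

import numpy as np

def bounce_window(price, bounce_index, tau=100):
    # Extract tau points centered on the bounce event and map them to the
    # [-1, 1] interval (min-max rescaling: an assumption of this sketch).
    half = tau // 2
    assert half <= bounce_index <= len(price) - half, "window out of range"
    window = np.asarray(price[bounce_index - half:bounce_index + half], dtype=float)
    lo, hi = window.min(), window.max()
    if hi == lo:                 # flat window: map to zero
        return np.zeros_like(window)
    return 2.0 * (window - lo) / (hi - lo) - 1.0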
6.3.1 Missteps: Granularity and Short Series Effects

Recalling the results of section (4.4), more memory effects were expected in time series belonging to low T scales, and mainly in those centered around the 3rd or 4th bounce levels. With this in mind, it was decided to begin with time series rescaled every T = 45 seconds and belonging to the 4th bounce levels. As the examples in figure (6.13) show, this choice was inappropriate because of granularity effects present at small scales: the trajectories consist of long periods of constant price interspersed with abrupt changes. Note that what is found here is consistent with what was previously noted in section (4.2) about the dispersion of low-scale time series, which suffer much more from the finite size of the tick minimum.

Figure 6.13. Granularity effects - Example of clusters found performing the cluster analysis on T = 45 time series of length τ = 100. On the left, the 4th/190 cluster found at σ_p = 10^-4; on the right, the 14th/190 of the same partition.

Figure 6.14. Short series effects - Example of clusters found performing the cluster analysis on T = 45 time series of length τ = 10. On the left, the 5th/194 cluster found at σ_p = 10^-1; on the right, the 7th/150 cluster found dealing with T = 45 time series of length τ = 20 at σ_p = 10^-3.

Recalling the considerations of section (5.3) and figure (5.7), the second kind of misstep
came from the analysis of small-τ time series, motivated by the wish to deal with series covering only the true neighborhood of a bounce event. This kind of analysis was also inappropriate, because short time series look similar even when they are structurally different (figure (6.14)).

6.3.2 Correct Clustering Results

To avoid the kinds of problems discussed above, it was decided to consider series of length τ = 100, rescaled every T = 180 seconds and developing around 4th bounce events. This led to series possibly covering more than a single window, but this (T, τ) choice was necessary to reach a good compromise between length and persistence of investors' memory. Figure (6.15) presents the whole datasets for the Resistance and Support time series considered here.

Figure 6.15. Dataset of 4th bounce T = 180 time series, τ = 100 - The two datasets considered. The top plot shows the dataset of N = 91 time series of length τ = 100, rescaled every T = 180 seconds and referring to the 4th bounce on Resistance levels. The bottom plot shows the N = 72 time series forming the dataset of 4th bounces on Support levels.
The results reported in this section refer to the entire 3-steps procedure, but it is interesting to provide some graphical evidence of the convergence of the Markov Chain defined for the MCMC step. In this respect, figure (6.16) reports the trace of the number of clusters visited by the set of m = 20 chains adopted for the clustering, whereas figure (6.17) provides evidence of the maximization of the corresponding Posterior.

Figure 6.16. Convergence of the number of clusters - Trace of the number of clusters visited during the 250000 iterations of each of the m = 20 parallel chains adopted in the MCMC step (σ_prior = 10^-3) of the clustering analysis of the dataset shown at the top of figure (6.15), consisting of N = 91 time series of length τ = 100, rescaled every T = 180 seconds and belonging to 4th bounce Resistance levels. On the right is a focus on the initial transient, after which most of the chains stabilize around a common number of clusters. The best partition was chosen as the most visited overall.

Figure 6.17. Maximization of Posterior - Trace of the log-Likelihood (left) and log-Posterior (right) corresponding to the set of 20 chains considered in figure (6.16). As expected, the Posterior is maximized by the MCs to levels of the same order of magnitude. As stated above, the best partition was chosen as the most visited overall.
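Since the best partition is chosen as the most visited overall, the selection step can be sketched as below; the hashable encoding of a partition (a frozenset of frozensets of series indices) is an assumption of this illustration, not the representation used in the thesis code.

from collections import Counter

def most_visited_partition(chain_traces):
    # chain_traces: one list of visited partitions per chain, each partition
    # encoded hashably, e.g. as a frozenset of frozensets of series indices.
    counts = Counter()
    for trace in chain_traces:
        counts.update(trace)
    partition, _ = counts.most_common(1)[0]
    return partition

# Tiny usage example with three series indexed 0, 1, 2:
p_a = frozenset({frozenset({0, 1}), frozenset({2})})
p_b = frozenset({frozenset({0}), frozenset({1, 2})})
print(most_visited_partition([[p_a, p_a, p_b], [p_a, p_b, p_a]]))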
As expected, the set of parallel chains converges to a rather common level of the number of clusters: even if there is not perfect agreement among the chains regarding which is the correct partition, all converge around the same order of magnitude of the number of clusters. This alone could already be regarded as a satisfactory result. The Splitting and Merging steps refine the results, providing the clustering presented in figure (6.18), which reports an example of a cluster found on the same time series referred to in the previous figures (6.16) and (6.17). At the top is the first cluster found, C1, whereas at the bottom is the mean series µ_1 introduced in section (5.4.2), together with the 1 and 2 confidence levels expressed in units of the intra-cluster dispersion σ_1, defined in (5.6) and reported here for clarity:

\sigma_1 = \sqrt{\langle x_i^2 \rangle_{C_1} - \mu_1^2}

which, I recall, are vectors of τ components:

\mu_1 = (\mu_{1_1}, \mu_{2_1}, \dots, \mu_{\tau_1}), \qquad \sigma_1 = (\sigma_{1_1}, \sigma_{2_1}, \dots, \sigma_{\tau_1})

This kind of representation provides visual insight into the goodness of the clustering. In this respect, it should be noted that the intra-cluster variance (σ_1)² appears as a term in the overall covariance matrix (5.8) of the single-series likelihood P(x_i|m) (5.7), which I recall here for clarity for a series, say x_i, belonging to the cluster C1 considered (the index h is omitted):

P(x_i \mid m) = \mathcal{N}_{x_i}(\mu_1, \Sigma_{i,1}) \qquad \text{where } x_i \in C_1

\Sigma_{i,1} =
\begin{pmatrix}
\sqrt{\hat{\delta}_i^2 + \sigma_{1_1}^2} & 0 & \cdots & 0 \\
0 & \sqrt{\hat{\delta}_i^2 + \sigma_{2_1}^2} & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \sqrt{\hat{\delta}_i^2 + \sigma_{\tau_1}^2}
\end{pmatrix}

where I recall that:
• the subscript x_i means that the multivariate normal distribution must be evaluated at (x_{i1}, x_{i2}, ..., x_{iτ});
• \hat{\delta} was defined in (5.2) as the proper rescaling of the strip-width δ according to the rescaling of the time series to the [−1 : 1] interval, and is adopted as the intrinsic degree of dispersion characterizing each series.

Therefore, according to this model, the true confidence level of the series x_i in cluster C1 would be

\mu_{t_1} \pm \sqrt{\hat{\delta}_i^2 + \sigma_{t_1}^2} \qquad \text{where } t \in [-\tau/2 : \tau/2]
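A numerical sketch of the single-series term, consistent with the diagonal structure written above (each time step contributing an independent Gaussian with standard deviation sqrt(\hat{\delta}_i^2 + \sigma_{t_1}^2)), is the following; it illustrates the formula only and is not the code used in the thesis.

import numpy as np

def single_series_loglike(x, mu, sigma_cluster, delta_hat):
    # Log-likelihood of the series x under its cluster model: independent
    # Gaussians, one per time step, centered on the mean series mu with
    # standard deviation sqrt(delta_hat**2 + sigma_cluster[t]**2).
    x, mu, sigma_cluster = map(np.asarray, (x, mu, sigma_cluster))
    std = np.sqrt(delta_hat**2 + sigma_cluster**2)
    z = (x - mu) / std
    return float(-0.5 * np.sum(z**2 + np.log(2.0 * np.pi * std**2)))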
Figure 6.18. Example of overall clustering result - The 1st/20 cluster found with the 3-steps procedure acting on series of length τ = 100, rescaled every T = 180 seconds and belonging to 4th bounce Resistance levels. As previously stated, series live in the rectangle [−τ/2 : τ/2] × [−1 : 1]. At the top is the cluster found, with the central event marked by the red dot. At the bottom is the mean series together with the 1 (green) and 2 (red) confidence levels around it, expressed in units of intra-cluster dispersion (refer to the text for a detailed explanation). Results are obtained using m = 20 parallel chains and 250000 iterations in the MCMC step (at σ_prior = 10^-3), m = 5 chains for 25000 iterations in the Splitting step and m = 20 chains for 10000 iterations in the Merging step. The rest of the clustering results are reported in appendix (B).
and so the confidence level reported in the bottom plot of figure (6.18) represents a lower bound, common to all time series, for the confidence that each time series in the cluster feels:

\sigma_{t_1} \;\le\; \Big(\sqrt{\hat{\delta}_i^2 + \sigma_{t_1}^2}\Big)_{\mathrm{effective}} \qquad \forall\, t \in [-\tau/2 : \tau/2]

Nevertheless, the (\hat{\delta}_i)² term provides a correction at least one order of magnitude smaller than the intra-cluster variance (σ_{t_1})², so the choice adopted provides an informative representation of the goodness of the cluster considered. The analysis is completed by a characterization of the cluster time series in terms of the stock and trading day they belong to. Figure (6.19) reports the two histograms of stock and trading-day occurrences for the cluster C1 considered above (figure (6.18)); other results are listed in appendix (B).

Figure 6.19. Stocks and trading days occurrences - Histograms of the occurrences of the stocks and trading days to which the 4 series of cluster C1, presented in figure (6.18), belong. Section (4.1) explains the convention adopted for stock names; I recall that the trading year 2002 consisted of 250 trading days, all considered in the analysis except the 248th, omitted for lack of data.
This analysis suggests that, while there is no evidence of memory across different trading days, there is similarity among trajectories referring to the same stock. This could be explained as a trading strategy specialized for each stock. The algorithm is able to reconstruct this kind of memory, even though it was unexpected.

6.3.3 Attempt of Cause-Effect Analysis

Visually inspecting the clusters found, it was recognized that the clustering algorithm frequently merges together series that look similar after the bounce event but develop different dynamics before it. Figure (6.20) presents three examples supporting this observation.

Figure 6.20. Different dynamics examples - Examples of clustered trajectories developing different dynamics in the first part of the plot and then showing similar behaviors after the bounce event. All plots refer to time series of length τ = 100, rescaled every T = 180 seconds. At the top left, the 7th/14 cluster of trajectories bouncing on 4th Support bounce levels; at the top right, the 21st/44 cluster of trajectories bouncing on 3rd Resistance bounce levels; at the bottom, the 100th/160 cluster of trajectories bouncing on 2nd Support bounce levels.
It was therefore decided to cluster the left [−τ/2 : 0] and right [0 : τ/2] halves of each series separately, and then to investigate the relations among the clusters found. Figure (6.21) presents two examples of clusters found for the left and right halves of time series of length τ = 100, rescaled every T = 180 seconds and belonging to the 4th Resistance bounce events already considered in the clustering procedure of the previous section. The other clusters found are listed in appendix (C.1).

Figure 6.21. Examples of half-series clustering - Two examples of series of length τ = 100, rescaled every T = 180 seconds, belonging to 4th Resistance bounce events and clustered half by half. At the top is the 1st/18 cluster found for the left half: cause-cluster C1, together with the confidence levels at 1 and 2 intra-cluster dispersion units, according to the discussion of the previous section. At the bottom is an example of an effect-cluster: the 1st/18 cluster found for the right half of the same kind of time series considered above.

The series were those already considered in the clustering procedure of the previous section. Namely, the right halves of the series were regarded as effect-clusters, and for each of these clusters the provenience of each time series belonging to it was recorded. Provenience is meant from the cause-clusters, i.e. the clustering of the left halves of the time series.
For each half, 18 clusters were found, consistent with the 20 found for the whole-series clustering (discussed in the previous section and listed in appendix (B)). Examples of cause-effect relations are presented in the histograms of figure (6.22) for the 1st/18 effect-cluster C1, also shown at the bottom of figure (6.21), and for the 7th one; both are listed in appendix (C.1) together with the corresponding cause-clusters. Histograms referring to the other effect-clusters are listed in appendix (C.2).

Figure 6.22. Example of cause-effect relation - Histograms assessing the "provenience" of the series belonging to the 1st (top) and 7th (bottom) of the 18 clusters found by clustering the right halves of the time series already considered in the whole-series clustering procedure presented in the last section.
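The provenience histograms of figure (6.22) can be sketched as a simple cross-count between the two half-series labelings; the dictionary encoding of the cluster assignments used below is an assumption made for illustration.

from collections import Counter, defaultdict

def provenience_histograms(cause_labels, effect_labels):
    # cause_labels / effect_labels: dicts mapping a series index to the id
    # of the cluster its left / right half falls in.
    hists = defaultdict(Counter)
    for idx, effect_cluster in effect_labels.items():
        hists[effect_cluster][cause_labels[idx]] += 1
    return dict(hists)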
Inspecting the histograms listed in appendix (C.2), evidence of cause-effect relations can be found, in the sense that time series developing similar dynamics in the second half of the window, and thus clustered together in the same effect-cluster, tend to originate from different cause-clusters; at the same time, some series originating from the same cause-cluster tend to keep their similarity also after the bounce event, consequently falling in the same effect-cluster. However, the dataset seems too small to support conclusive statements.

6.4 Conclusions and Further Analysis

The aim of this thesis work was to find regularities in price time series, focusing around peculiar points: Support and Resistance levels. In coping with this issue, previous results about the effectiveness of these technical indicators were extended. Feedback effects of investors' strategies were expected at these points, and the persistence of memory was indeed quantified by measuring the probability of a bounce of the price on those levels conditional on the number of previous bounces observed. It was found that, relaxing the definition of bounce (section (4.2)) to make it closer to the bare-eye one adopted by technical traders, memory effects appear strongly as a self-reinforcement of investors' confidence in these indicators (section (4.4)). Changing approach, a Bayesian algorithm was designed to spot the regularities directly detectable in the price dynamics around Support and Resistance levels (section (5.4)). The algorithm acts in a 3-steps procedure aimed at finding the best partition of the dataset of time series considered (section (5.1)). It was tested on artificial time series from a toy model (sections (5.2) and (6.2)) with satisfactory results: it provides a good clustering even when it is not able to reach the perfect partition (section (6.2.2)). Dealing with real financial time series (sections (4.1) and (5.3)), learning to read the outputs of the algorithm, as well as the correct procedure to obtain significant results, took a long and systematic study (sections (6.1.1) and (6.3.1)). It was a necessary study that provided non-trivial results about regularities in price time series (section (6.3.2)). The analysis was concluded by attempting a study of cause-effect relations, suggested by direct observation of the dynamics of clustered time series that developed different features before and after the bounce event they were centered at (section (6.3.3)). Of course, a great deal of time was spent on the definition and refinement of the clustering algorithm, and many interesting problems have not been tackled yet:
• considering time series standardized to mean increment equal to one, so as to take into account only the percentage changes of the price, not its value, thereby bypassing problems related to the tick minimum;
• considering time series in tick-time, in order to consider only the net effect of investors' operations;
• extending the clustering to financial series covering more than one trading day, in order to detect the presence and effects of seasonal dynamics.
To conclude, it should be observed that this analysis not only responds to academic interests, but is also of fundamental importance for understanding the degree of unpredictability of the market, a question of primary importance for policy-making purposes, considering that regularities in a financial market really represent a weakness against speculation.
Appendix A
Noise Dependency of Merging Threshold - List of Plots

Here are listed the plots of the dependence of RANDOM MERGING on the σ with which the daughter series adopted for the computation were generated. On the left side are reported the mean values over the 100 runs, on the right the corresponding dispersion. The range of values presented in figure (6.2), reported again here for clarity, corresponds to the color bar presented here.

Figure A.1. RANDOM MERGING - Dependence of the range of values on σ - length τ = 100 - Plot of the range of values of the merging threshold for samples of daughter series generated with σ = 0.25, 10^-1, 5·10^-2, 10^-2, 5·10^-3, 10^-3, 5·10^-4, 10^-4, 5·10^-5, 10^-5 around the mother series values. The corresponding RANDOM MERGING(σ) plots are listed on the following pages.
Figure A.2. σ = 0.25
Figure A.3. σ = 10^-1
Figure A.4. σ = 5·10^-2
Figure A.5. σ = 10^-2
Figure A.6. σ = 5·10^-3
Figure A.7. σ = 10^-3
Figure A.8. σ = 5·10^-4
Figure A.9. σ = 10^-4
Figure A.10. σ = 5·10^-5
Figure A.11. σ = 10^-5
Appendix B
Clustering Results - List of Plots

This appendix reports all the clusters found in the dataset of N = 91 time series of length τ = 100, rescaled every T = 180 seconds and belonging to the 4th bounce on Resistance levels. Results for the first cluster C1 were already presented in section (6.3.2). The graphs refer to the entire 3-steps procedure (MCMC, Splitting and Merging) and are composed as follows:
• the plot of the cluster as it was found (top-left);
• the plot of the mean series and confidence levels as discussed in section (6.3.2) (top-right);
• the histogram of stock occurrences in the cluster (bottom-left);
• the histogram of trading-day occurrences in the cluster (bottom-right).
cluster C1
cluster C3
cluster C5
cluster C7
cluster C9
cluster C11
cluster C13
cluster C15
cluster C17
cluster C19
Appendix C
Cause-Effect Clustering - List of Plots

This appendix reports all the clusters found by clustering the two halves of the same time series considered in appendix (B), together with the corresponding analysis of cause-effect relations.

C.1 Half-Series Clusters

Here are presented first the cause-clusters, i.e. those describing the left halves of the time series, and then the effect-clusters, describing the right halves. For each half, 18 clusters were found, consistent with the 20 found for the whole-series clustering (listed in appendix (B)).
cause-clusters C1 and C2
cause-clusters C3 and C4
cause-clusters C5 and C6
cause-clusters C7 and C8
cause-clusters C9 and C10
cause-clusters C11 and C12
cause-clusters C13 and C14
cause-clusters C15 and C16
cause-clusters C17 and C18
effect-clusters C1 and C2
effect-clusters C3 and C4
effect-clusters C5 and C6
effect-clusters C7 and C8
effect-clusters C9 and C10
effect-clusters C11 and C12
effect-clusters C13 and C14
effect-clusters C15 and C16
effect-clusters C17 and C18
C.2 Cause-Effect Relations

Here are listed the histograms showing, for each effect-cluster, the occurrences of the cause-cluster of origin of each time series clustered in it. Both kinds of clusters are listed in appendix (C.1).

cause-cluster occurrences for effect-clusters C1 and C2
cause-cluster occurrences for effect-clusters C3, C4, C5, C6
cause-cluster occurrences for effect-clusters C7, C8, C9, C10
cause-cluster occurrences for effect-clusters C11, C12, C13, C14
cause-cluster occurrences for effect-clusters C15, C16, C17, C18