Transcending our views to sequential data

Markus Luczak-Roesch | @mluczak
University of Southampton, Web and Internet Science
http://markus-luczak.de

Content streams as automata [1]
[Figure: number of observed documents over time t, with states labelled HF and LF]
“The key notion of TTM is burstiness – sudden increases in frequency of text fragments, and all TTM methods aim to model burstiness.” [2]
[1] Kleinberg, J. (2003). Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery, 7(4), 373-397.
[2] Subašić, I., & Berendt, B. (2013). Story graphs: Tracking document set evolution using dynamic graphs. Intelligent Data Analysis, 17(1), 125-147.
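The automaton view of a content stream can be illustrated with a small sketch. It assumes a two-state simplification in the spirit of [1]: a low-frequency and a high-frequency state (reading the figure labels LF and HF this way is an assumption) emit inter-arrival gaps with exponential densities, moving up to the high-frequency state costs gamma * ln(n), and the cheapest state sequence is found with the Viterbi algorithm. The function name and the defaults for s and gamma are illustrative choices, not taken from the slides.

```python
import math


def two_state_bursts(gaps, s=2.0, gamma=1.0):
    """Label each inter-arrival gap with 0 (low frequency) or 1 (high frequency).

    Assumes a non-empty list of strictly positive gaps.
    """
    n = len(gaps)
    base_rate = n / sum(gaps)
    rates = (base_rate, s * base_rate)      # event rate of each state
    up_cost = gamma * math.log(n)           # price of entering the burst state

    def emit(state, gap):
        # Negative log-likelihood of the gap under an exponential density.
        return rates[state] * gap - math.log(rates[state])

    cost = [0.0, math.inf]                  # the automaton starts in state 0
    back = []
    for gap in gaps:
        new_cost, pointers = [], []
        for state in (0, 1):
            candidates = [
                cost[prev] + (up_cost if state > prev else 0.0)
                for prev in (0, 1)
            ]
            prev = 0 if candidates[0] <= candidates[1] else 1
            new_cost.append(candidates[prev] + emit(state, gap))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest path backwards.
    state = 0 if cost[0] <= cost[1] else 1
    states = []
    for pointers in reversed(back):
        states.append(state)
        state = pointers[state]
    states.reverse()
    return states


# Hypothetical stream: sparse arrivals, a dense burst, sparse arrivals again.
gaps = [10] * 5 + [1] * 10 + [10] * 5
states = two_state_bursts(gaps)
print(states)
```

Larger values of gamma make the automaton more reluctant to enter the burst state, so only longer or denser runs are labelled as bursts.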

System A
System B
System C
Related activity?
t

Building transcendental information cascades
[…] we propose a new model that we call transcendental information cascades. Informed by Kleinberg's work on […] document streams [2], it regards time as […] condition for relationships between any […], meaning that we focus on coincidence of […] activities rather than socially-determined conditionality.

In [20] we presented the initial definition of a transcendental information cascade as a 4-tuple TC = (V, E, R, F). This 4-tuple represents a directed network consisting of a set of nodes V and edges E, derived by applying a set of matching functions F to a set of resources R = {r_1, r_2, ..., r_m}, r_i = (u_i, t_i, c_i), where every u_i is a unique identifier of a resource r_i that was shared at the time t_i with the content c_i. Nodes in the network are those resources from R that contain a set I_i of one or multiple cascade identifiers. A cascade identifier is any unique informational pattern that is recognized by applying a matching function to the content or any other inherent properties of a resource (e.g. simple string matching algorithms to identify keywords in content). Formally, a matching function f_k \in F, k \in \mathbb{N}, k \leq n is defined as:

\[
f_k(c_i) =
\begin{cases}
\{i_1, i_2, \ldots, i_x\} & \text{if } f_k \text{ matches patterns } \{i_1, i_2, \ldots, i_x\} \text{ in } c_i,\ x \in \mathbb{N} \\
\emptyset & \text{otherwise}
\end{cases}
\]

Nodes V and edges E are then given as follows:

\[
\begin{aligned}
V &= \{v_1, v_2, \ldots, v_p\}, & v_y &= (u_y, t_y, I_y), \\
E &= \{e_1, e_2, \ldots, e_q\}, & e_z &= (u_a, u_b, \Lambda_z)
\end{aligned}
\]

with I_i = \{i_1, i_2, \ldots, i_o\} = f_1(c_i) \cup f_2(c_i) \cup \ldots \cup f_n(c_i) being the result of the concatenation of all identifiers found by all matching functions.² An edge exists between any two nodes that share a unique subset of all the cascade identifiers that were found for them. Neither this subset nor any of its subsets is part of the identifiers found for any node that was created in the time period between the creation of the two linked nodes.

\[
\Lambda_z = \{\, i_r \mid i_r \in I_a \wedge i_r \in I_b,\ \forall i_r \rightarrow V' = \{ v_c \mid v_c = (u_c, t_c, I_c),\ i_r \in I_c \wedge t_a \leq t_c \leq t_b \} = \emptyset,\ v_c \in V,\ r \in \mathbb{N},\ r \leq |I_b| \,\}
\]

A node that contains a cascade identifier that was not detected for any other node before is called the identifier root. In addition, we call a node without any incoming edges a network root and a node that has no outgoing edges a stub.

Our cascade model clearly yields different outputs depending on the data to hand (e.g. determined by the extent of the Web crawl), and the matching algorithms determining which cascade identifiers will be spotted (e.g. reuse of hashtags, URIs, quotes, images, or maybe exploiting wider semantics or sentiment) as depicted in Figure 1.

² Please note that [20] contains an unintentionally malformed equation for this, as the wrong symbol was used to refer to the concatenation of the matching functions.

Fig. 1. Depending on the applied matching functions, different transcendental information cascade representations can be generated for the same input data.

A fictitious example of a transcendental cascade based on our model is shown in Figure 2. Consider a system that features hashtags as an established form of identifying content patterns. The visualisation represents distinct identifiers and time as follows: nodes are ordered chronologically along the horizontal dimension from left (the oldest node) to right (the most recent node); additionally, nodes are ordered along the vertical dimension depending on the set of identifiers present in a node (each unique set is assigned to a distinct level). Consequently, the visualisation represents the content creation sequence (“#A”) - (“#A#B”) - (“#A”) - (“#A”) - (“#A#B#C”) - (“#C”) - (“#A”) - (“#B#D”) - (“#A”).

Fig. 2. Example of a cascade that emerges along five different identifiers. #A, #A#B, #A#B#C, #B#D and #C are fictitious hashtags (or hashtag combinations respectively) treated as the identifying content patterns.

In order to understand how edges are labelled we highlight the sub-graph involving the nodes 2, 3, 4, and 5. Conforming to our cascade model, an edge exists between nodes 2 and 3 […]
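The cascade definition can be made concrete with a minimal sketch, under stated assumptions. It treats each single hashtag as a cascade identifier (the figure instead draws one level per hashtag combination), assumes distinct timestamps, and reads "between" as excluding the two linked nodes themselves, so that a node is linked to the most recent earlier node sharing an identifier. The names Resource, hashtag_matcher, build_cascade and roots_and_stubs are hypothetical and do not come from the cited papers.

```python
import re
from collections import namedtuple

# A resource r_i = (u_i, t_i, c_i): unique id, sharing time, content.
Resource = namedtuple("Resource", "uid time content")


def hashtag_matcher(content):
    """One possible matching function f_k: the set of hashtags in the content."""
    return set(re.findall(r"#\w+", content))


def build_cascade(resources, matchers):
    """Return (nodes, edges); edges maps (u_a, u_b) to the shared identifier set."""
    nodes = []
    for res in sorted(resources, key=lambda r: r.time):
        # I_i is the union of the results of all matching functions.
        identifiers = set().union(*(match(res.content) for match in matchers))
        if identifiers:  # resources without identifiers are not nodes
            nodes.append((res.uid, res.time, identifiers))

    edges = {}
    last_seen = {}  # identifier -> uid of the latest node containing it
    for uid, _, identifiers in nodes:
        for ident in sorted(identifiers):
            if ident in last_seen:
                # No node in between carries this identifier, so link the two.
                edges.setdefault((last_seen[ident], uid), set()).add(ident)
            last_seen[ident] = uid
    return nodes, edges


def roots_and_stubs(nodes, edges):
    """Network roots have no incoming edges; stubs have no outgoing edges."""
    sources = {a for a, _ in edges}
    targets = {b for _, b in edges}
    uids = [uid for uid, _, _ in nodes]
    return ([u for u in uids if u not in targets],
            [u for u in uids if u not in sources])


# The content creation sequence from the example, one post per time step.
posts = ["#A", "#A#B", "#A", "#A", "#A#B#C", "#C", "#A", "#B#D", "#A"]
resources = [Resource(uid=i, time=i, content=c)
             for i, c in enumerate(posts, start=1)]
nodes, edges = build_cascade(resources, [hashtag_matcher])
network_roots, stubs = roots_and_stubs(nodes, edges)
for (a, b), label in sorted(edges.items()):
    print(a, "->", b, sorted(label))
```

Passing a different list of matching functions (for example, one that extracts URIs) produces a different cascade from the same resources, which is the point made by Fig. 1.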
Markus Luczak-Roesch, Ramine Tinati, and Nigel Shadbolt. 2015. When Resources Collide: Towards a
Theory of Coincidence in Information Spaces. To appear in WWW’15 Companion, May 18–22, 2015,
Florence, Italy. http://dx.doi.org/10.1145/2740908.2743973

Transcendental information cascades
[Figure: example cascade over time t, with one level per identifier set: #A, #A#B, #A#B#C, #B#D, #C]

Cascade motifs as an indicator of state?
Markus Luczak-Roesch, Ramine Tinati, Max van Kleek, and Nigel Shadbolt. 2015. From
coincidence to purposeful flow? Properties of transcendental information cascades. In
IEEE/ACM International Conference on Advances in Social Networks Analysis and
Mining (ASONAM), Paris, FR.

Analyzing low-level properties of the multiple
states of a system that exist at the same time
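[Figure: cascade motifs for cascades based on Tags, URIs, and KID & APH; values shown in the figure: 4, 1, 15, 10]
Motif types: single node motifs, long uniform paths, short uniform paths, long non-uniform paths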

Analyzing low-level properties of the multiple
states of a system that exist at the same time
[Figure: identifier entropy for cascades based on Tags, URIs, and KID & APH]
Fig. 4. Overview of the results of the cascade comparison. Cascade size distribution and Wiener index are plotted on a log-log scale; identifier entropy is plotted with a log scale on the y-axis.
Varying profiles of increasing randomness with growing cascade size
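The two measures named in the caption can be sketched as follows. This is one plausible reading, not necessarily the computation used in the cited paper: identifier entropy is taken here as the Shannon entropy of how often each identifier occurs in a cascade, and the Wiener index as the sum of shortest-path lengths over all node pairs of the cascade viewed as an undirected, connected graph. Both function names are illustrative.

```python
import math
from collections import Counter, deque


def identifier_entropy(identifiers):
    """Shannon entropy (in bits) of the identifier frequency distribution."""
    counts = Counter(identifiers)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def wiener_index(edges):
    """Sum of shortest-path lengths over all unordered node pairs.

    Edges are (a, b) pairs and are treated as undirected; the graph is
    assumed to be connected.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    total = 0
    for start in neighbours:
        # Breadth-first search gives hop distances from the start node.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
    return total // 2  # every pair was counted from both ends


print(identifier_entropy(["#A", "#A", "#B", "#B"]))
print(wiener_index([(1, 2), (2, 3)]))
```

Under this reading, a cascade dominated by a single identifier has entropy near zero, while one whose identifiers are evenly spread has high entropy.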

From information co-occurrence to the discovery
of hidden structure in Wikipedia
Figure 1: Wikipedia edits in a three-dimensional space. The dimensions are (1) time; (2) information diversity as the chronologi[…]
Tinati, Ramine, Luczak-Roesch, Markus, Hall, Wendy and Shadbolt, Nigel (2016) More than an
edit: using transcendental information cascades to capture hidden structure in Wikipedia. At
25th International World Wide Web Conference, Montreal, Canada, 11 - 15 Apr 2016. ACM (doi:
10.1145/2872518.2889401).
Tinati, R., Luczak-Rösch, M., & Hall, W. Finding Structure in Wikipedia Edit Activity: An
Information Cascade Approach. In WikiWorkshop 2016, co-located with WWW 2016.
Events detected:
•  Edward Snowden speech at SXSW conference
•  US Supreme Court case on same sex marriage
(a) Cascade Article Network (CAN): Nodes represent unique Wikipedia articles, edges are shared edits based on a shared identifier matched. A force directed layout has been applied, with edge path lengths determined by edge weight. The strongly connected component (A) contains articles associated with South Korean media, (B) and (C) contain articles related to the USA.
(b) Cascade-to-Cascade path network graph: Nodes are cascades, edges are the shared articles between cascades. The central strongly connected component is established by the identifiers shown in Table 3. A force directed layout has been applied, with edge path lengths determined by edge weight.

Discrete vs. continuous data
Image source: https://en.wikipedia.org/wiki/Electroencephalography#/media/File:Spike-waves.png, CC BY-SA 2.0

EEG brain wave recordings

Linking based on similarity of spectral density
(Euclidean distance)
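A minimal sketch of such linking for a continuous signal is given below. The slide does not state how the recording is segmented or which threshold is used, so the fixed-size non-overlapping windows, the naive discrete Fourier transform, the all-pairs comparison and the threshold value are all assumptions, and the function names are hypothetical.

```python
import cmath
import math


def power_spectrum(window):
    """Power per frequency bin of a real-valued window (naive DFT)."""
    n = len(window)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(window))) ** 2 / n
        for k in range(n // 2 + 1)
    ]


def link_similar_windows(signal, size, threshold):
    """Link windows (a, b), a < b, whose spectra are within the threshold."""
    windows = [signal[i:i + size]
               for i in range(0, len(signal) - size + 1, size)]
    spectra = [power_spectrum(w) for w in windows]
    links = []
    for a in range(len(spectra)):
        for b in range(a + 1, len(spectra)):
            # Euclidean distance between the two spectral density vectors.
            if math.dist(spectra[a], spectra[b]) <= threshold:
                links.append((a, b))
    return links


# Synthetic signal: three windows with a slow rhythm and one with a fast one.
size = 32
slow = [math.sin(2 * math.pi * 2 * t / size) for t in range(size)]
fast = [math.sin(2 * math.pi * 5 * t / size) for t in range(size)]
signal = slow + slow + fast + slow
links = link_similar_windows(signal, size, threshold=1.0)
print(links)
```

Here the windows play the role of resources and a small spectral distance plays the role of a shared identifier, which is how a continuous recording can be brought into the same time-ordered network form as discrete content.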

[Figure: timeline t with labels F1 … Fn and C11, C21, C22, C23]
Formalising the multiple possible representations of a system at any time and their relationships.

Not all representing purposeful action but reflecting useful informational properties.

•  Applying Transcendental Information Cascades to
– data from the complex engineering industries (e.g. shipping)
– urban traffic data
– disaster response data
Reducing risk and enhancing security by understanding coincidence in information spaces (RECOIN)
PI: Markus Luczak-Roesch

Transcendental Information Cascades
Generic time-ordered networks of information co-occurrence
[Figure: timeline t with labels F1 … Fn and C11, C21, C22, C23; edges are labelled with time differences such as t1 - t0, t2 - t1 and t6 - t0]
