Data Streaming 
Raja Chiky – raja.chiky@isep.fr
About me 
¡ Associate professor in Computer Science – LISITE-RDI 
¡ Research interests: data stream mining, scalability and resource optimization in distributed architectures (e.g. cloud architectures), recommender systems 
¡ Research field: Large scale data management 
1. Real-time and distributed processing of various data sources 
2. Use semantic technologies to add a semantic layer 
3. Recommender systems and collaborative data mining 
4. Optimizing resources in large scale systems 
[Figure: heterogeneous and static data; heterogeneous and dynamic data streams (sensors)] 
2
OUTLINE 
¡ Context: Big Data 
¡ What is a data stream ? 
¡ Data stream management systems 
¡ Basic approximate algorithms 
¡ Big Data: Distributed Systems 
¡ Semantic Data Streaming 
¡ Conclusion 
3
New era 
4
5 Big Data: Buzzword!
6 
08/12/2014 
Where is all this data coming 
from?
7 
More and More connected 
Things
8 So, what is Big Data? 
Big Data Elements: Volume, Variety, Velocity, + Veracity (IBM): information uncertainty 
Volume of data created worldwide 
§ Dawn of time to 2003: 5 EB 
§ 2012: 2.7 ZB 
§ 2015: 10 ZB (E) 
§ 1 YB = 10^24 Bytes 
§ 1 ZB = 10^21 Bytes 
§ 1 EB = 10^18 Bytes 
§ 1 PB = 10^15 Bytes 
§ 1 TB = 10^12 Bytes 
§ 1 GB = 10^9 Bytes 
Variety of data 
§ Radio 
§ TV 
§ News 
§ E-Mails 
§ Facebook Posts 
§ Tweets 
§ Blogs 
§ Photos 
§ Videos (user and paid) 
§ RSS feeds 
§ Wikipedia 
§ GPS data 
§ RFID 
§ POS Scanners 
§ … 
Velocity of data 
§ Walmart handles 1M transactions per hour 
§ Google processes 24 PB of data per day 
§ AT&T transfers 30 PB of data per day 
§ 90 trillion emails are sent per year 
§ World of Warcraft uses 1.3 PB of storage 
§ Facebook, when it had a user base of 900 M users, had 25 PB of compressed data 
§ 400M tweets per day in June ’12 
§ 72 hours of video are uploaded to Youtube every minute 
Source: Big Data & Analytics - Why Should We Care?, Vishwa Kolla
9 [Figure: processing pipeline with the stages Data sources, Gathering Information, Store, User Interaction and Output. Labels: sensors, databases, static data, stream (big) data, data stream, ETL, batch processing, semantic ETL, stream processing, continuous queries/business rules, load shedding, data warehouse, databases/triplestores (synopsis), ad-hoc queries, analytics, knowledge enrichment, real-time visual analytics, retro-action]
10 Big Data : Velocity 
Website logs, network monitoring, financial services, eCommerce, traffic control, weather forecasting, power consumption
What is a data stream? 
11 
¡ Golab & Özsu (2003): “A data stream is a real-time, continuous, ordered 
(implicitly by arrival time or explicitly by timestamp) sequence of items. It is 
impossible to control the order in which items arrive, nor is it feasible to locally 
store a stream in its entirety.” 
¡ Massive volumes of data, items arrive at a high rate.
12 
Applications of data stream 
processing 
¡ Data stream processing 
¡ Process queries (compute statistics, activate alarms) 
¡ Apply data mining algorithms 
¡ Requirements 
¡ Real-time processing 
¡ One-pass processing 
¡ Bounded storage (no complete storage of streams) 
¡ Possibly consider several streams
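The one-pass and bounded-storage requirements above can be illustrated with a running mean and variance (Welford's method) that also raises alarms. This is a minimal sketch: the names `RunningStats` and `alarms` and the threshold rule are illustrative assumptions, not taken from the slides.

```python
import math


class RunningStats:
    """One-pass mean/variance (Welford's method) using O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 until at least one item has been seen.
        return self._m2 / self.n if self.n else 0.0


def alarms(stream, threshold):
    """Yield items exceeding mean + threshold * std of the items seen before them."""
    stats = RunningStats()
    for x in stream:
        std = math.sqrt(stats.variance)
        if stats.n >= 2 and x > stats.mean + threshold * std:
            yield x
        stats.update(x)  # each item is read once and never stored
```

Each item is processed once and then dropped, so memory stays constant however long the stream runs.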
13 
Applications of data stream 
processing 
¡ Let’s go deeper into some examples 
¡ Network management 
¡ Stock monitoring
14 Network management 
¡ Supervision of a computer network 
¡ Improvement of network configuration (hardware, software, 
architecture) 
¡ Detection of attacks 
¡ Measurements made on routers 
Network supervision center 
Huge volume of data 
High rate of arrivals
15 Network management 
Network supervision center 
Timestamp  Source    Destination  Duration  Bytes  Protocol 
…          …         …            …         …      … 
12342      10.1.0.2  16.2.3.7     12        20K    http 
12343      18.6.7.1  12.4.0.3     16        24K    http 
12344      12.4.3.8  14.8.7.4     26        58K    http 
12345      19.7.1.2  16.5.5.8     18        80K    ftp 
…          …         …            …         …      …
16 Network management 
Network supervision center 
Typical queries: 
- 100 most frequent (@S, @D) on router R1 … 
- How many different (@S, @D) seen on R1 but not R2 … 
- … during last month, last week, last day, last hour ?
17 Stock monitoring 
¡ Stream of price and sales volume of stocks over time 
¡ Technical analysis/charting for stock investors 
¡ Support trading decisions 
l Notify me when the price of IBM is above $83, and 
the first MSFT price afterwards is below $27. 
l Notify me when some stock goes up by at least 5% 
from one transaction to the next. 
l Notify me when the price of any stock increases 
monotonically for ≥30 min. 
l Notify me when the difference between the 
current price of a stock and its 10 day moving 
average is greater than some threshold value 
Source: Gehrke 07 and Cayuga application scenarios (Cornell University)
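As an illustration of one of the rules above ("some stock goes up by at least 5% from one transaction to the next"), here is a minimal sketch of a continuous query in plain Python. The function name `price_jumps` and the `(symbol, price)` tick format are assumptions made for the example; this is not Cayuga syntax.

```python
def price_jumps(ticks, min_increase=0.05):
    """Yield (symbol, previous_price, price) whenever a stock rises by at
    least `min_increase` (a fraction) between two consecutive transactions.

    `ticks` is an iterable of (symbol, price) pairs in arrival order.
    """
    last_price = {}  # one entry per symbol: memory bounded by the number of stocks
    for symbol, price in ticks:
        prev = last_price.get(symbol)
        if prev is not None and price >= prev * (1 + min_increase):
            yield symbol, prev, price
        last_price[symbol] = price
```

Only the last price per symbol is kept, so the rule is evaluated incrementally without storing the stream.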
18 Where is the problem? (1/2) 
¡ Example: 
[Figure: two bank withdrawals dated 12/05/2014 (50 € and 1000$), labelled "Bank fraud"; date stamp 05/12/2014] 
¡ Join between several streams 
¡ Join between stream data and customer database 
¡ Generic tools for processing streams 
¡ Avoid the ‘Store’, ‘Compute’, ‘Delete’ approach 
¡ Solution: incremental computation and definition of temporal windows for joins
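The combination of incremental computation and temporal windows for joins can be sketched as a symmetric window join. This is a simplified illustration: the event layout `(timestamp, side, key, value)` and the name `window_join` are assumptions, and events are assumed to arrive in timestamp order.

```python
from collections import deque


def window_join(events, window):
    """Join two streams on `key` for events at most `window` time units apart.

    `events` yields (timestamp, side, key, value) with side "left" or "right".
    Yields (key, left_value, right_value).
    """
    buffers = {"left": deque(), "right": deque()}
    for ts, side, key, value in events:
        other = "right" if side == "left" else "left"
        # Expire tuples that can no longer match current or future events.
        for buf in buffers.values():
            while buf and buf[0][0] < ts - window:
                buf.popleft()
        # Incremental step: probe only the window of the other stream.
        for _, other_key, other_value in buffers[other]:
            if other_key == key:
                pair = (value, other_value) if side == "left" else (other_value, value)
                yield (key,) + pair
        buffers[side].append((ts, key, value))
```

For the bank example, joining withdrawals from two locations on the card number within a short window surfaces suspicious pairs without a store, compute, delete cycle.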
19 Where is the problem? (2/2) 
¡ Example: 
¡ 100 most frequent @S IP addresses on a router 
¡ Maintain a table of IP addresses with 
frequencies ? 
¡ Sampling the stream ? 
¡ Face high (and varying) rate of arrivals 
¡ Exact versus approximate answers
20 Examples of queries 
[Figure: frequency histogram over elements 1 to 20] 
¡ Find elements with frequency > 0.1% 
¡ top-k 
¡ Frequency of the element 3 
¡ Total frequency of elements between 8 and 14 
¡ Number of elements having a non-zero frequency
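For reference, the five query types above can be answered exactly with a full frequency table. The sketch below is a baseline under the assumption that the table fits in memory, which is exactly what fails on large streams; the function name and dictionary keys are illustrative.

```python
from collections import Counter


def frequency_queries(stream, k=3, threshold=0.001):
    """Exact answers to the example queries, using one counter per distinct element."""
    counts = Counter(stream)
    total = sum(counts.values())
    return {
        # Elements whose relative frequency exceeds `threshold` (0.1% by default).
        "heavy_hitters": sorted(x for x, c in counts.items() if c / total > threshold),
        "top_k": [x for x, _ in counts.most_common(k)],
        "point": counts[3],  # frequency of the element 3
        "range": sum(c for x, c in counts.items() if 8 <= x <= 14),
        "distinct": len(counts),  # elements with a non-zero frequency
    }
```

Memory grows with the number of distinct elements, which motivates the approximate structures (samples, filters, sketches) presented later.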
21 Data Stream Management Systems 
DBMS vs. DSMS 
Data model 
  DBMS: Permanent updatable relations 
  DSMS: Streams and permanent updatable relations 
Storage 
  DBMS: Data is stored on disk 
  DSMS: Permanent relations are stored on disk; streams are processed on the fly 
Query 
  DBMS: SQL language (creating structures; inserting/updating/deleting data; retrieving data with one-time queries) 
  DSMS: SQL-like query language (standard SQL on permanent relations; extended SQL on streams with windowing; continuous queries) 
Performance 
  DBMS: Large volumes of data 
  DSMS: Optimization of computer resources to deal with several streams and several queries; ability to face variations in arrival rates without crash
22 Generic DSMS architecture 
[Figure: generic DSMS architecture, after Golab & Özsu (2003). Components: Input Monitor, Working Storage, Summary Storage, Static Storage, Query Repository, Query Processor, Output Buffer. Flows: Streaming Inputs, Streaming Outputs, Updates to Static Data, User Queries]
23 Existing DSMS
24 
How to deal with Big Data 
Streams 
Distribution 
DSMS 
Sampling/Load Shedding/Sketch/…
25 
Approximate answers to 
queries 
¡ When ? 
¡ Queries needing unbounded memory 
¡ Too many queries, streams that are too fast, or response-time requirements that are too strict 
¡ CPU limit 
¡ Memory limit 
¡ Solution : approximate answers to queries 
¡ Sliding windows 
¡ Sampling and load shedding 
¡ Definition of synopsis
26 Approaches 
¡ Two approaches for handling such streams 
¡ Use a time window, and query the window as a static table 
¡ When you can’t store collected data, or to keep track of historical 
data 
¡ Sampling 
¡ Filtering 
¡ Counting
Windowing 
¡ Applying queries/mining tasks to the whole stream (from 
beginning to current time) 
¡ Applying queries/mining to a portion of the stream 
[Figure: timeline t from the beginning of the stream to the current date, with a window on the stream]
29 Windowing 
¡ Definition of windows of interest on streams 
¡ Fixed windows: September 2014 
¡ Sliding windows: last 3 hours 
¡ Landmark windows: from September 1st, 2014 
¡ Window specification 
¡ Physical : last 3 hours 
¡ Logical : last 1000 items 
¡ Refreshing rate 
¡ Rate of producing results (every item, every 10 items, every minute, 
…)
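The difference between a logical window (last n items) and a physical window (last d time units) can be sketched with two small buffers. The class names `CountWindow` and `TimeWindow` are illustrative assumptions.

```python
from collections import deque


class CountWindow:
    """Logical window: keeps the last `size` items."""

    def __init__(self, size):
        self.items = deque(maxlen=size)  # the oldest item is dropped automatically

    def add(self, item):
        self.items.append(item)


class TimeWindow:
    """Physical window: keeps items received during the last `duration` time units."""

    def __init__(self, duration):
        self.duration = duration
        self.items = deque()  # (timestamp, item) pairs in arrival order

    def add(self, timestamp, item):
        self.items.append((timestamp, item))
        # Expire everything older than the window; the new item always stays.
        while self.items[0][0] <= timestamp - self.duration:
            self.items.popleft()
```

A logical window has a fixed memory footprint, while the size of a physical window depends on the arrival rate.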
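30 Sliding window 
[Figure: timeline t from the beginning of the stream; the window ends at the current time tc, then at t'c after one refreshment time, with results produced at each refresh] 
Give me the last room where Axel has been in the last 10 minutes, updating results every minute 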
31 Sliding window vs. Tumbling window 
[Figure: timeline t from the beginning of the stream; consecutive windows ending at tc and t'c do not overlap, and results are produced at each refreshment time] 
Give me the last room where Axel has been in the last 10 minutes, updating results every 10 minutes 
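The two "Axel" queries differ only in their refresh rate, which the following sketch makes explicit. It is written for clarity rather than efficiency (it rescans the observations at each report), and the function name and the `(timestamp, room)` format are assumptions.

```python
def last_room_reports(observations, start, end, range_, slide):
    """Evaluate "last room seen within the last `range_` time units"
    every `slide` time units between `start` and `end`.

    `observations` is a list of (timestamp, room) pairs sorted by timestamp.
    slide < range_ gives a sliding window, slide == range_ a tumbling window.
    Returns a list of (report_time, room_or_None).
    """
    reports = []
    t = start + slide
    while t <= end:
        in_window = [room for ts, room in observations if t - range_ < ts <= t]
        reports.append((t, in_window[-1] if in_window else None))
        t += slide
    return reports
```

With a range of 10 minutes, `slide=1` reproduces the sliding-window query and `slide=10` the tumbling-window one.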
32 Sampling from data stream 
¡ Inputs: 
¡ Sample size k 
¡ Window size n >> k (alternatively, time duration m) (optional) 
¡ Stream of data elements that arrive online 
¡ Output: 
¡ k elements chosen uniformly at random from the last n elements 
(alternatively, from all elements that have arrived in the last m time 
units) 
¡ Goal: 
¡ maintain a data structure that can produce the desired output at 
any time upon request 
¡ Challenge: 
¡ don’t know how long stream is 
¡ So when/how often to sample?
A simple, Unsatisfying 
Approach 
¡ Choose a random subset X={x1, …,xk}, X⊂{0,1,…,n-1} 
¡ The sample always consists of the non-expired elements whose 
indexes are equal to x1, …,xk (modulo n) 
¡ Only uses O(k) memory 
¡ Technically produces a uniform random sample of each 
window, but unsatisfying because the sample is highly periodic 
¡ Unsuitable for many real applications, particularly those with 
periodicity in the data 
34
35 Reservoir sampling 
¡ Classic online algorithm due to Vitter (1985) 
¡ Maintains a fixed-size uniform random sample 
¡ Size of the data stream need not be known in advance 
¡ Data structure: “reservoir” of k data elements 
¡ As the ith data element arrives: 
¡ Add it to the reservoir with probability p = k/i, discarding a randomly 
chosen data element from the reservoir to make room
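A minimal sketch of the procedure described above, assuming the stream is any Python iterable; the function name `reservoir_sample` and the optional `rng` parameter are illustrative.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Maintain a uniform random sample of k elements over a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)  # p = k/i >= 1: the first k items are always kept
        elif rng.random() < k / i:
            reservoir[rng.randrange(k)] = item  # evict a randomly chosen element
    return reservoir
```

Memory stays at k elements, and the stream length never needs to be known in advance.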
36 Chain Sampling 
v Chain Sampling method is for sequence based windows. 
v In this type of sampling when the ith element arrives it is chosen 
to become the sample with probability Min(i,n)/n. 
v If the ith element is chosen as the sample, the algorithm also 
selects the index of the element that will replace it when expires 
(assuming that it is still present in the sample when it expires). 
This index is picked uniformly at random from the range i+1…i 
+n, representing the range of indexes of the elements that will 
be active when the ith element expires. 
v When the element with the selected index arrives, the algorithm 
stores it in the memory and choses the index of the element 
that will replace it when it expires etc., building a chain of 
elements to use in case of the expiration of the current element 
in the sample.
37 Chain Sampling 
[Figure: the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3 shown at four successive steps of chain sampling]
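Chain sampling for a sample of size 1 over the last n elements can be sketched as follows. It assumes the standard formulation in which element i becomes the sample with probability 1/min(i, n); the class name `ChainSample` and its attributes are illustrative.

```python
import random


class ChainSample:
    """Uniform sample of size 1 over the last n elements of a stream."""

    def __init__(self, n, rng=None):
        self.n = n
        self.rng = rng or random.Random()
        self.i = 0              # index of the latest element (1-based)
        self.chain = []         # [(index, value)]; the head is the current sample
        self.next_index = None  # index of the element that will extend the chain

    def add(self, value):
        self.i += 1
        i, n = self.i, self.n
        if self.rng.random() < 1.0 / min(i, n):
            # The new element becomes the sample; the old chain is discarded.
            self.chain = [(i, value)]
            self.next_index = self.rng.randint(i + 1, i + n)
        elif i == self.next_index:
            # The awaited replacement arrived: store it and pick its own replacement.
            self.chain.append((i, value))
            self.next_index = self.rng.randint(i + 1, i + n)
        # Drop the head once it has left the window of the last n elements.
        if self.chain[0][0] <= i - n:
            self.chain.pop(0)

    @property
    def sample(self):
        return self.chain[0][1]
```

Because a replacement index is always drawn from i+1…i+n, the replacement is already in the chain when the current sample expires.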
Another Simple Approach: 
oversample 
¡ As each element arrives, remember it with probability p = (ck/n) log n; otherwise discard it 
¡ Discard elements when they expire 
¡ When asked to produce a sample, choose k elements at random from the set in memory 
¡ Expected memory usage of O(k log n) 
¡ The algorithm can fail if fewer than k elements from a window are remembered; however, with high probability this will not happen 
38
39 Min-wise Sampling 
¡ For each item, pick a random fraction between 0 and 1 
¡ Store item(s) with the smallest random tag (Nath et al. 2004) 
0.391 0.908 0.291 0.555 0.619 0.273 
Each item has same chance of least tag, so uniform 
Can run on multiple streams separately, then merge
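A minimal sketch of min-wise sampling for a single sampled item, including the merge step across streams. The function names and the `(tag, item)` representation are assumptions made for the example.

```python
import random


def minwise_sample(stream, rng=None):
    """Return (tag, item) for the item that drew the smallest random tag."""
    rng = rng or random.Random()
    best = None
    for item in stream:
        tag = rng.random()  # random fraction in [0, 1)
        if best is None or tag < best[0]:
            best = (tag, item)
    return best


def merge(*partials):
    """Merge samples taken on separate streams: the smallest tag wins."""
    return min((p for p in partials if p is not None), key=lambda p: p[0])
```

Keeping the tag alongside the item is what makes the merge possible: each node only ships one pair.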
40 Hash function - Revision 
¡ Hash function: Any algorithm that maps data of arbitrary length 
to data of a fixed length. The values returned by a hash function 
are called hash values, hash codes, hash sums, checksums or 
simply hashes. 
41 Stream filtering 
¡ Bloom filter: a compact data structure, by B.H. Bloom, 1970: 
¡ A bit array of size m, 
¡ A family H of k hash functions (e.g. MD5, SHA256, Murmur), 
¡ A set S of n items 
¡ False positive probability $f = (1 - e^{-kn/m})^k$ 
¡ Example: m=10, hash=MD5, k=3
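The structure above can be sketched in a few lines. As an assumption for this example, the k hash functions are simulated by salting SHA-256 with the function index, rather than using k distinct algorithms; the class and function names are illustrative.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m  # bit array of size m

    def _positions(self, item):
        # k hash functions simulated by salting SHA-256 with the index j.
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(item))


def false_positive_rate(m, k, n):
    """Approximate false-positive probability f = (1 - e^{-kn/m})^k."""
    return (1 - math.exp(-k * n / m)) ** k
```

With m=1000, k=3 and n=100 items the formula gives a false-positive probability of about 1.7%.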
42 Sketches 
¡ Sketch 
¡ Synopsis structure taking advantage of high volumes of data 
¡ Provides an approximate result with probabilistic bounds 
¡ Random projections on smaller spaces (hash functions) 
¡ Many sketch structures: usually dedicated to a specialized task 
¡ Examples of sketch structures 
¡ COUNT (Flajolet 85) 
¡ COUNT SKETCH (Charikar et al. 04)
Difference between Sampling 
and Sketching 
43 
¡ Sample sees only those items which were selected to be in the 
sample; whereas the sketch sees the entire input, but is restricted 
to retain only a small summary of it 
¡ Not every problem can be solved with sampling 
¡ Example: counting how many distinct items in the stream 
¡ If a large fraction of items aren’t sampled, don’t know if they are all 
same or all different
44 
Sketches: COUNT (Flajolet 
Martin algorithm) 
¡ Goal 
¡ Number N of distinct values in a stream (for large N) 
¡ Example: number of distinct IP addresses going through a router 
¡ Sketch structure 
¡ SK: L bits initialized to 0 
0 0 0 0 0 0 0 0 
¡ H: hashing function transforming an element of the stream into L bits 
18.6.7.1 0 0 1 0 1 0 1 0 
¡ H distributes uniformly elements of the stream on the 2^L possibilities
45 Sketches 
¡ Method 
¡ Maintenance and update of SK 
¡ For each new element e 
¡ Compute H(e) 
¡ Select the position of the leftmost 1 in H(e) 
¡ Force to 1 this position in SK 
SK 0 1 0 0 1 0 0 1 
H(18.6.7.1) 0 0 1 0 1 0 1 0 
New SK 0 1 1 0 1 0 0 1
46 Sketches 
¡ Result 
¡ Select the position R (0…L-1) of the leftmost 0 in SK 
¡ $E(R) = \log_2(\varphi N)$ with $\varphi = 0.77351\ldots$ 
⇒ Estimate the count by $2^R/\varphi$ 
¡ $\sigma(R) = 1.12$ 
Example: SK = 1 1 1 0 1 0 0 0 gives R = 3 
To improve the accuracy of this approximation algorithm, use multiple hash functions and use the average R instead.
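The maintenance and estimation steps of the COUNT sketch can be put together as follows. This is a sketch under assumptions: the hash H is simulated with salted SHA-256 truncated to L bits (so L must not exceed 64), and the names `FMSketch` and `estimate_distinct` are illustrative.

```python
import hashlib

PHI = 0.77351


class FMSketch:
    def __init__(self, L=32, seed=0):
        self.L = L
        self.seed = seed
        self.sk = [0] * L  # SK: L bits initialized to 0

    def _hash_bits(self, item):
        digest = hashlib.sha256(f"{self.seed}:{item}".encode()).digest()
        value = int.from_bytes(digest[:8], "big") % (1 << self.L)
        # L bits, leftmost (most significant) bit first.
        return [(value >> (self.L - 1 - p)) & 1 for p in range(self.L)]

    def add(self, item):
        bits = self._hash_bits(item)
        if 1 in bits:
            self.sk[bits.index(1)] = 1  # position of the leftmost 1 in H(e)

    def rank(self):
        # R: position of the leftmost 0 in SK (L if SK contains only ones).
        return self.sk.index(0) if 0 in self.sk else self.L

    def estimate(self):
        return (2 ** self.rank()) / PHI


def estimate_distinct(items, num_sketches=16, L=32):
    """Average R over several hash functions, then apply 2^R / phi."""
    sketches = [FMSketch(L, seed=j) for j in range(num_sketches)]
    for item in items:
        for s in sketches:
            s.add(item)
    mean_r = sum(s.rank() for s in sketches) / num_sketches
    return (2 ** mean_r) / PHI
```

Duplicates hash to the same bits, so repeated elements never change SK; this is why the sketch counts distinct values.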
47 Sketches 
¡ COUNT SKETCH ALGORITHM (Charikar et al. 2004) 
¡ Goal 
¡ k most frequent elements in a stream (for large number N of distinct 
values) 
¡ Ex. 100 most frequent IP addresses going through a router 
Input stream 2, 0, 1, 3, 1, 2, 4, . . . 
Frequencies so far: f(0)=1, f(1)=2, f(2)=2, f(3)=1, f(4)=1
48 Sketches 
[Figure: array of counters indexed 1…B and 1…t (example values +12, +7, +23, +15, …, +78, +56, +66, +65); an arriving element e adds +1 or -1 to one counter per hash function]
49 Sketches 
¡ Sketch structure 
¡ h : hash function from [0, … , N-1] to [1, … , B] 
¡ s : hash function from [0, … , N-1] to {+1, -1} 
¡ Array of B counters: $C_1, \ldots, C_B$ (with B << N) 
¡ Sketch maintenance 
when e arrives: $C_{h(e)} \leftarrow C_{h(e)} + s(e)$ 
¡ Use of sketch 
¡ Estimation of the frequency of object e: $n_e \approx C_{h(e)} \cdot s(e)$ 
¡ In practice, t hash functions $h_j$ and t hash functions $s_j$ are used, each pair with its own array of counters $C_j$: 
$n_e \approx \operatorname{median}_{j \in [1 \ldots t]} \left( C_{j,h_j(e)} \cdot s_j(e) \right)$ 
¡ Theoretical results on error depending on N, t and B.
50 Sketches 
¡ Algorithm 
¡ Maintenance of a list (e1, e2, …, ek) of the current k most frequent 
elements 
¡ For a new arriving element e 
¡ Add e to the sketch structure 
¡ Estimate frequency of e from the sketch structure 
¡ If f(e) > f(ek), remove ek and insert e into the list
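The sketch structure and the top-k algorithm above can be combined as follows. As assumptions for this example, the t pairs of hash functions (h_j, s_j) are derived from salted SHA-256, buckets are numbered from 0, and the default sizes t=5 and B=256 are arbitrary.

```python
import hashlib
import statistics


class CountSketch:
    def __init__(self, t=5, B=256):
        self.t, self.B = t, B
        self.C = [[0] * B for _ in range(t)]  # one row of B counters per hash pair

    def _h_s(self, j, e):
        digest = hashlib.sha256(f"{j}:{e}".encode()).digest()
        h = int.from_bytes(digest[:8], "big") % self.B  # bucket h_j(e)
        s = 1 if digest[8] & 1 else -1                  # sign s_j(e)
        return h, s

    def add(self, e):
        for j in range(self.t):
            h, s = self._h_s(j, e)
            self.C[j][h] += s

    def estimate(self, e):
        values = []
        for j in range(self.t):
            h, s = self._h_s(j, e)
            values.append(self.C[j][h] * s)
        return statistics.median(values)


def top_k(stream, k, sketch=None):
    """Track the k elements with the highest estimated frequency."""
    sketch = sketch or CountSketch()
    top = {}  # element -> estimated frequency, at most k entries
    for e in stream:
        sketch.add(e)
        f = sketch.estimate(e)
        if e in top or len(top) < k:
            top[e] = f
        else:
            weakest = min(top, key=top.get)
            if f > top[weakest]:
                del top[weakest]
                top[e] = f
    return sorted(top, key=top.get, reverse=True)
```

The random signs make collisions cancel out on average, and the median over the t rows limits the effect of the rows where a heavy element collides.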
Distributed 
Systems 
Answering today's Big Data needs 
S4, Storm, SAMOA, … 
51
Yahoo S4 
¡ Simple Scalable Streaming System 
¡ Use a decentralized and symmetric architecture 
¡ All nodes are similar, no master node 
¡ Based on Processing Elements 
¡ Each PE has 4 components: 
1. Functionality defined by a PE class and associated configuration 
¡ input event handler processEvent() 
¡ output mechanism output() 
2. the types of events that it consumes 
3. the keyed attribute in those events 
4. the value of the keyed attribute in events which it consumes 
¡ Several PEs are available for standard tasks such as count, aggregate, 
join … 
¡ Custom PEs can easily be programmed 
¡ The PEs are automatically deployed on the Processing Nodes (machines): 
¡ Ensures load balancing of events, 
¡ Notifies appropriate PEs when an event comes in.
S4 Example: Word Count
Twitter Storm 
¡ Same principle as S4, with a simplified programming model 
¡ Storm provides realtime computation 
¡ Scalable 
¡ Guarantees no data loss 
¡ Extremely robust and fault-tolerant 
¡ Programming language agnostic 
¡ Concepts 
¡ Streams 
¡ Spouts 
¡ Bolts 
¡ Topologies
56 
Lambda architecture (by Nathan Marz) 
Generic, scalable and fault-tolerant data processing architecture 
[Figure: the data flow feeds a batch layer (batch processing, precomputed views) and a speed layer (real-time stream processing); a serving layer answers queries. Source: Mathieu DESPRIEE (USI)]
Lambda architecture 
1. All data entering the system is dispatched to both the batch 
layer and the speed layer for processing. 
2. The batch layer has two functions: (i) managing the master 
dataset (an immutable, append-only set of raw data), and (ii) 
to pre-compute the batch views. 
3. The serving layer indexes the batch views so that they can be 
queried in an ad-hoc way. 
4. The speed layer compensates for the high latency of updates 
to the serving layer and deals with recent data only. 
5. Any incoming query can be answered by merging results from 
batch views and real-time views. 
http://lambda-architecture.net/
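The five points above can be illustrated with a toy in-memory counter. It is only a sketch of the data flow: the class `LambdaCounts` and its methods are hypothetical, and a real deployment would use separate batch and stream processing systems.

```python
from collections import Counter


class LambdaCounts:
    def __init__(self):
        self.master = []                 # master dataset: append-only raw data
        self.batch_view = Counter()      # precomputed by the batch layer
        self.realtime_view = Counter()   # speed layer: recent data only

    def ingest(self, key):
        # 1. Incoming data is dispatched to both layers.
        self.master.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        # 2-3. Recompute the batch view from the whole master dataset.
        self.batch_view = Counter(self.master)
        # 4. Data now covered by the batch view leaves the speed layer.
        self.realtime_view.clear()

    def query(self, key):
        # 5. Merge results from the batch view and the real-time view.
        return self.batch_view[key] + self.realtime_view[key]
```

Queries stay correct between two batch runs because the real-time view covers exactly the data the batch view has not absorbed yet.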
58 Big Data Stream Mining 
Machine Learning tools by setting: 
¡ Distributed, batch: Hadoop, Mahout 
¡ Distributed, stream: S4, Storm, SAMOA 
¡ Non distributed, batch: R, WEKA, … 
¡ Non distributed, stream: MOA
59 
What is SAMOA? 
• NEW Software framework for mining distributed data streams 
• Big Data mining for evolving streams in REAL-TIME
SAMOA architecture 
[Figure: SAMOA provides clustering methods, classifier methods and frequent pattern mining on top of S4, Storm, …] 
¡ Use S4, Storm, or other distributed stream processing platform 
¡ Use MOA, or other streaming machine learning library 
¡ Easy to extend through PACKAGES
Semantic data 
streaming 
61
62 Too many data streams
63 
Type of data used in Big Data 
initiatives 
Internal data 
Traditional sources 
« New data » 
Source: Big Data opportunities survey, Unisphere / SAP, May 2013.
64 BI: New Generation 
[Figure: layered view. Labels, from the top: Decision; Monitoring, alerts, statistics, fault detection, etc.; Semantic filtering and continuous queries; Semantic data streams; Interconnection; Ontologies; Heterogeneous and dynamic data streams (sensors); Heterogeneous and static data; Retro-action]
65 
Semantic Web technologies for 
data stream 
¡ Annotate stream data with semantic metadata 
¡ Apply Linked Data principles to publish streaming data 
¡ Interlink streaming data with existing datasets 
¡ Integrate data stream processing + reasoning 
¡ Objectives : interoperability, automation, enrichment
66 Existing prototypes 
CQELS 
SPARQL-Stream 
EP-SPARQL 
C-SPARQL
67 Approaches 
In RDF Stream 
models 
(timestamps, 
events, time 
intervals, triple-based, 
graph-based 
…)
68 Example: SRBench 
Q: Detect if a hurricane has been observed 
“A hurricane has a sustained wind (for more than 3 hours) of at 
least 33 metres per second or 74 miles per hour (119 km/h).” 
ASK 
WHERE { 
STREAM <http://www.cwi.nl/SRBench/observations> [RANGE 10800s SLIDE 600s] 
{?observation om-owl:procedure ?sensor ; 
om-owl:observedProperty weather:WindSpeed ; 
om-owl:result [ om-owl:floatValue ?value ] .} 
} 
GROUP BY ?sensor 
HAVING ( AVG(?value) >= "74"^^xsd:float ) 
ASK 
FROM STREAM <http://www.cwi.nl/SRBench/observations> [RANGE 3h STEP 10m] 
WHERE { 
?observation om-owl:procedure ?sensor ; 
om-owl:observedProperty weather:WindSpeed ; 
om-owl:result [ om-owl:floatValue ?value ] . } 
GROUP BY ?sensor 
HAVING ( AVG(?value) >= "74"^^xsd:float )
69 
How to deal with Big Data 
Streams 
¡ Cloud Computing: [1] and [2] 
¡ DSMS 
¡ Sampling/Load Shedding: [3] 
[1] J. Hoeksema & S. Kotoulas: High-performance Distributed Stream Reasoning using S4 (ISWC 2011) 
[2] D. L. Phuoc et al.: Elastic and Scalable Processing of Linked Stream Data in the Cloud (ISWC 2013) 
[3] N. Jain et al.: Sampling Semantic Data Stream: Resolving Overload and Limited Storage Issues (DaEng 2013)
70 
Sampling Extensions for 
continuous SPARQL 
PREFIX vocab: <http://data-gov.tw.rpi.edu/vocab/p/8/> 
SELECT ?val 
WHERE { 
STREAM <C:/CQELS/streams/data-8.stream>[NOW][UNISAMPLING %80] 
{?rawData vocab:ozone_8hr_daily_max ?val } 
GRAPH <C:/CQELS/test.rdf> {?user sioc:account_of ?person} 
} 
Operators: [UNISAMPLING %{Sampling Percentage}] 
[RESSAMPLING %{Reservoir Size}] 
[CHNSAMPLING %{Window Size}] 
N. Jain et al.: Sampling Semantic Data Stream: Resolving Overload and Limited Storage Issues (DaEng 2013)
71 
Semantic Data Stream Load 
Shedding (1/2) 
[Figure: RDF graph of two sensor observations, an air temperature observation and a relative humidity observation, each with its type, observedProperty, procedure (System_ID), samplingTime (Instant_ID, inXSDDateTime 2004-08-08T06:25:00) and result (MeasureData with floatValue and uom: fahrenheit, percentage). Nodes (a), (b), (c), (d) and the two dotted triples (1) and (2) are marked] 
RDF Triple approach: 
Effect of deleting the two triples (1) and (2) (dotted lines in the graph): 
- Deleting the first RDF triple destroys the link connecting node (a) to node (b) 
- Deleting the second RDF triple destroys the link connecting node (c) to node (d) 
=> Nodes (b), (d) and all nodes connected to them become unreachable, even though their data is still in memory. This represents 6 RDF triples out of 18, i.e. 33% of unusable data
72 
Semantic Data Stream Load 
Shedding (2/2) 
RDF Graph approach : 
Effect of deleting sub-graphs such as those formed by nodes (b) or (d) and all nodes 
to which they are directly connected. 
=> Preserving the semantic level of the information 
=> Protecting the data consistency of the whole graph 
=> Enhancing the Semantic Data Stream systems 
73 
Conclusion: Big Data Stream 
challenges 
¡ Semantic Information aggregation 
¡ Information aggregation: “too much data to assimilate but not 
enough knowledge to act” 
¡ Distributed and real-time processing 
¡ Design of real-time and distributed algorithms for stream processing 
and information aggregation 
¡ Distribution and parallelization of data mining algorithms 
¡ Visual analytics and user modeling 
¡ Dynamic user model 
¡ Novel visualizations for very large datasets
74 
Thanks to 
Zakia Kazi Aoul, ISEP 
Marie-Aude Aufaure, ECP 
Fethi Belghaouti, ISEP-INT 
Georges Hébrail, EDF R&D 
Sylvain Lefebvre, ISEP 
Yousra Chabchoub, ISEP
Big Data: Volume, Variety, Velocity, Veracity, … Value 
Linked Data: Web of data, Semantic Web 
- A set of principles and good practices allowing to link, publish and search for web data 
- Structure and semantically enrich RDF data, with a very high scalability -> Big Linked Data 
Integrate, aggregate, analyze, visualize large data sets, whatever their type, provenance or speed of flow … 
Big Linked Data, Linked Big Data 
Semantic Technologies Living Lab 
Linked & Big Data Academic Chair 
Our value proposition 
– Semantic aggregation from textual and non-textual streams 
– Manage semantic heterogeneity, real-time and distributed processing 
– Ensure data quality and veracity 
– Visual analytics

Introduction to Data streaming - 05/12/2014

  • 1. Data Streaming Raja Chiky – raja.chiky@isep.fr
  • 2. About me ¡ Associate professor in Computer Science – LISITE-RDI ¡ Research interest: Data stream mining, scalability and resource optimization in distributed architectures (e.g cloud architectures), recommender systems ¡ Research field: Large scale data management 1. Real-time and distributed processing of various data sources 2. Use semantic technologies to add a semantic layer 3. Recommender systems and collaborative data mining 4. Optimizing resources in large scale systems Heterogeneous and sta1c data Heterogeneous and dynamic data streams sensors 2
  • 3. OUTLINE ¡ Context: Big Data ¡ What is a data stream ? ¡ Data stream management systems ¡ Basic approximate algorithms ¡ Big Data: Distributed Systems ¡ Semantic Data Streaming ¡ Conclusion 3
  • 5. 5 Big Data: Buzzword!
  • 6. 6 08/12/2014 Where is all this data coming from?
  • 7. 7 More and More connected Things
  • 8. 8 So, what is Big Data? Dawn of (me Volume of data created Worldwide 2003 2012 5 EB … 2.7 ZB 2015 10 ZB (E) § 1 YB = 10^24 Bytes § 1 ZB = 10^21 Bytes § 1 EB = 10^18 Bytes § 1 PB = 10^15 Bytes § 1TB = 10^12 Bytes § 1 GB = 10^9 Bytes Variety of data § Radio § TV § News § E-­‐Mails § Facebook Posts Velocity of data § Walmart handles 1M transac(ons per hour § Google processes 24PB of data per day § AT&T transfers 30 PB of data per day § 90 trillion emails are sent per year § World of WarcraQ uses 1.3 PB of storage § Tweets § Blogs § Photos § Videos (user and paid) § RSS feeds § Wikipedia § GPS data § RFID § POS Scanners § … § Facebook when had a user base of 900 M users, had 25 PB of compressed data § 400M tweets per day in June ’12 § 72 hours of video is uploaded to Youtube every minute Big Data Elements Volume Variety Velocity + Veracity (IBM) - information uncertainty Source: Big Data & Analytics - Why Should We Care?, Vishwa Kolla
  • 9. Architecture overview (figure labels)
      - Data sources: sensors, databases; static data and stream (big) data
      - Gathering information: ETL and batch processing for static data; semantic ETL and stream processing for data streams; load shedding
      - Store: data warehouse; databases/triplestores (synopsis)
      - Queries: ad-hoc queries; continuous queries/business rules
      - Output and user interaction: analytics, knowledge enrichment, real-time visual analytics, retro-action
  • 10. Big Data: Velocity
      - Website logs, network monitoring, financial services, eCommerce, traffic control, weather forecasting, power consumption
  • 11. What is a data stream?
      - Golab & Özsu (2003): “A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.”
      - Massive volumes of data; items arrive at a high rate.
  • 12. Applications of data stream processing
      - Data stream processing
          - Process queries (compute statistics, activate alarms)
          - Apply data mining algorithms
      - Requirements
          - Real-time processing
          - One-pass processing
          - Bounded storage (no complete storage of streams)
          - Possibly consider several streams
  • 13. Applications of data stream processing
      - Let’s go deeper into some examples
          - Network management
          - Stock monitoring
  • 14. Network management
      - Supervision of a computer network
          - Improvement of network configuration (hardware, software, architecture)
          - Detection of attacks
      - Measurements made on routers and sent to the network supervision center
          - Huge volume of data
          - High rate of arrivals
  • 15. Network management: records received by the network supervision center
      Timestamp  Source    Destination  Duration  Bytes  Protocol
      …          …         …            …         …      …
      12342      10.1.0.2  16.2.3.7     12        20K    http
      12343      18.6.7.1  12.4.0.3     16        24K    http
      12344      12.4.3.8  14.8.7.4     26        58K    http
      12345      19.7.1.2  16.5.5.8     18        80K    ftp
      …          …         …            …         …      …
  • 16. Network management: typical queries at the network supervision center
      - 100 most frequent (@S, @D) on router R1 …
      - How many different (@S, @D) seen on R1 but not R2 …
      - … during last month, last week, last day, last hour?
  • 17. Stock monitoring
      - Stream of price and sales volume of stocks over time
      - Technical analysis/charting for stock investors
      - Support trading decisions
          - Notify me when the price of IBM is above $83, and the first MSFT price afterwards is below $27.
          - Notify me when some stock goes up by at least 5% from one transaction to the next.
          - Notify me when the price of any stock increases monotonically for ≥30 min.
          - Notify me when the difference between the current price of a stock and its 10 day moving average is greater than some threshold value.
      - Source: Gehrke 07 and Cayuga application scenarios (Cornell University)
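The second notification rule on the stock-monitoring slide ("goes up by at least 5% from one transaction to the next") can be evaluated in one pass with bounded state. The following is a minimal sketch, not the Cayuga implementation: the function name `price_jump_alerts`, the `(symbol, price)` tuple format, and the sample ticks are assumptions made for illustration.

```python
def price_jump_alerts(ticks, threshold=0.05):
    """Yield (symbol, previous_price, price) whenever a stock rises by at
    least `threshold` (as a fraction) between two consecutive transactions.

    One pass over the stream; the only state kept is the last price per symbol.
    """
    last_price = {}
    for symbol, price in ticks:
        previous = last_price.get(symbol)
        if previous is not None and price >= previous * (1 + threshold):
            yield symbol, previous, price
        last_price[symbol] = price


# Hypothetical stream of (symbol, price) transactions.
ticks = [("IBM", 80.0), ("MSFT", 27.0), ("IBM", 85.0), ("MSFT", 27.5), ("IBM", 86.0)]
alerts = list(price_jump_alerts(ticks))
```

Only the IBM move from 80.0 to 85.0 exceeds the 5% threshold in this sample, so a single alert is produced.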
  • 18. Where is the problem? (1/2)
      - Example (figure labels): 05/12/2014; bank withdrawal 50 €, 12/05/2014; bank fraud; bank withdrawal 1000$, 12/05/2014
          - Join between several streams
          - Join between stream data and customer database
      - Generic tools for processing streams
          - Avoid the ‘Store’, ‘Compute’, ‘Delete’ approach
          - Solution: incremental computation and definition of temporal windows for joins
  • 19. Where is the problem? (2/2)
      - Example: 100 most frequent @S IP addresses on a router
          - Maintain a table of IP addresses with frequencies?
          - Sample the stream?
      - Face a high (and varying) rate of arrivals
      - Exact versus approximate answers
  • 20. Examples of queries (over a frequency histogram of elements 1 to 20)
      - Find elements with frequency > 0.1%
      - Top-k
      - Frequency of the element 3
      - Total frequency of elements between 8 and 14
      - Number of elements having a non-zero frequency
  • 21. Data Stream Management Systems: DBMS vs. DSMS
      - Data model
          - DBMS: permanent updatable relations
          - DSMS: streams and permanent updatable relations
      - Storage
          - DBMS: data is stored on disk
          - DSMS: permanent relations are stored on disk; streams are processed on the fly
      - Query
          - DBMS: SQL language (creating structures; inserting/updating/deleting data; retrieving data with one-time queries)
          - DSMS: SQL-like query language (standard SQL on permanent relations; extended SQL on streams with windowing; continuous queries)
      - Performance
          - DBMS: large volumes of data
          - DSMS: optimization of computer resources to deal with several streams and several queries; ability to face variations in arrival rates without crashing
  • 22. Generic DSMS architecture (Golab & Özsu, 2003)
      - Components: input monitor, working storage, summary storage, static storage, query repository, query processor, output buffer
      - Flows: streaming inputs, streaming outputs, updates to static data, user queries
  • 24. How to deal with Big Data Streams
      - Distribution
      - DSMS
      - Sampling/load shedding/sketch/…
  • 25. Approximate answers to queries
      - When?
          - Queries needing unbounded memory
          - Too many queries, streams that are too fast, or response time requirements that are too strict
          - CPU limit
          - Memory limit
      - Solution: approximate answers to queries
          - Sliding windows
          - Sampling and load shedding
          - Definition of synopsis
  • 26. Approaches
      - Two approaches for handling such streams
          - Use a time window, and query the window as a static table
          - When you can’t store collected data, or to keep track of historical data:
              - Sampling
              - Filtering
              - Counting
  • 27. Windowing
      - Applying queries/mining tasks to the whole stream (from beginning to current time)
      - Applying queries/mining to a portion of the stream
      - Figure: timeline t from the beginning of the stream to the current date, with a window on the stream
  • 28. Windowing
      - Definition of windows of interest on streams
          - Fixed windows: September 2014
          - Sliding windows: last 3 hours
          - Landmark windows: from September 1st, 2014
      - Window specification
          - Physical: last 3 hours
          - Logical: last 1000 items
      - Refreshing rate
          - Rate of producing results (every item, every 10 items, every minute, …)
  • 29. Sliding window
      - Example: "Give me the last room where Axel has been in the last 10 minutes, updating results every minute"
      - Figure: timeline t from the beginning of the stream; results are produced at successive refreshment times tc and t’c
  • 30. Sliding window vs. Tumbling window
      - Example: "Give me the last room where Axel has been in the last 10 minutes, updating results every 10 minutes"
      - Figure: timeline t from the beginning of the stream; results are produced at successive refreshment times tc and t’c
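The difference between the two window types comes down to the refresh period relative to the window width. The sketch below is illustrative only: the function name `last_value_per_refresh`, the minute-based timestamps, and the room readings are assumptions, and the window is rescanned at every refresh for readability rather than maintained incrementally as a real DSMS would.

```python
def last_value_per_refresh(events, width, slide, end_time):
    """Return [(refresh_time, last value seen in (refresh_time - width, refresh_time])].

    events: list of (timestamp, value) pairs sorted by timestamp.
    width:  window length; slide: refresh period.
    slide < width gives a sliding window, slide == width a tumbling window.
    """
    results = []
    t = slide
    while t <= end_time:
        in_window = [value for ts, value in events if t - width < ts <= t]
        results.append((t, in_window[-1] if in_window else None))
        t += slide
    return results


# Hypothetical readings: (minute, room where Axel was detected).
events = [(1, "kitchen"), (12, "office"), (25, "lab")]
sliding = last_value_per_refresh(events, width=10, slide=5, end_time=30)
tumbling = last_value_per_refresh(events, width=10, slide=10, end_time=30)
```

With the same 10-minute width, the sliding variant reports a result every 5 minutes over overlapping windows, while the tumbling variant reports once per disjoint 10-minute block.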
  • 31. Sampling from data stream
      - Inputs:
          - Sample size k
          - Optionally, window size n >> k (alternatively, time duration m)
          - Stream of data elements that arrive online
      - Output:
          - k elements chosen uniformly at random from the last n elements (alternatively, from all elements that have arrived in the last m time units)
      - Goal:
          - Maintain a data structure that can produce the desired output at any time upon request
      - Challenge:
          - The length of the stream is not known
          - So when, and how often, to sample?
  • 32. A simple, unsatisfying approach
      - Choose a random subset X = {x1, …, xk}, X ⊂ {0, 1, …, n-1}
      - The sample always consists of the non-expired elements whose indexes are equal to x1, …, xk (modulo n)
      - Only uses O(k) memory
      - Technically produces a uniform random sample of each window, but unsatisfying because the sample is highly periodic
      - Unsuitable for many real applications, particularly those with periodicity in the data
  • 33. Reservoir sampling
      - Classic online algorithm due to Vitter (1985)
      - Maintains a fixed-size uniform random sample
          - Size of the data stream need not be known in advance
      - Data structure: “reservoir” of k data elements
      - As the ith data element arrives:
          - Add it to the reservoir with probability p = k/i, discarding a randomly chosen data element from the reservoir to make room
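The reservoir update rule can be sketched in a few lines. This is a minimal illustration of the basic scheme described on the slide (not Vitter's optimized skip-based variants); the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for the example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # p = k/i >= 1 for the first k items: always kept, nothing to evict.
            reservoir.append(item)
        elif rng.random() < k / i:
            # Keep the ith item with probability k/i, evicting a random resident.
            reservoir[rng.randrange(k)] = item
    return reservoir


sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
```

The stream is read once and memory stays at k items, which matches the one-pass and bounded-storage requirements listed earlier in the deck.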
  • 34. Chain Sampling
      - The chain sampling method is for sequence-based windows (the last n elements).
      - When the ith element arrives, it is chosen to become the sample with probability 1/min(i, n).
      - If the ith element is chosen as the sample, the algorithm also selects the index of the element that will replace it when it expires (assuming that it is still present in the sample when it expires). This index is picked uniformly at random from the range i+1 … i+n, representing the range of indexes of the elements that will be active when the ith element expires.
      - When the element with the selected index arrives, the algorithm stores it in memory and chooses the index of the element that will replace it when it expires, and so on, building a chain of elements to use when the current element in the sample expires.
  • 35. Chain Sampling
      - Figure: the example stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3, shown at four successive steps of the algorithm
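Chain sampling for a sample of size one can be sketched as follows. This is an illustrative reading of the method, using the selection probability 1/min(i, n) from the original chain-sampling paper by Babcock, Datar and Motwani; the class name `ChainSampler` and its methods are assumptions made for the example.

```python
import random


class ChainSampler:
    """Maintain one uniform sample over the last n elements of a stream."""

    def __init__(self, n, rng=None):
        self.n = n
        self.rng = rng or random.Random()
        self.i = 0               # index of the latest element (1-based)
        self.chain = []          # [(index, value)]; chain[0] is the current sample
        self.next_index = None   # index chosen to replace the last chain element

    def add(self, value):
        self.i += 1
        i, n = self.i, self.n
        # Drop the current sample if it has left the window of the last n elements.
        if self.chain and self.chain[0][0] <= i - n:
            self.chain.pop(0)
        if self.rng.random() < 1.0 / min(i, n):
            # The new element becomes the sample and starts a fresh chain.
            self.chain = [(i, value)]
            self.next_index = self.rng.randint(i + 1, i + n)
        elif i == self.next_index:
            # This element was pre-selected as a replacement: extend the chain.
            self.chain.append((i, value))
            self.next_index = self.rng.randint(i + 1, i + n)

    def sample(self):
        return self.chain[0][1]


sampler = ChainSampler(n=5, rng=random.Random(1))
for x in [3, 5, 1, 4, 6, 2, 8, 5, 2, 3]:
    sampler.add(x)
current = sampler.sample()
```

The replacement index is always drawn from the n positions following the selected element, so a successor has arrived by the time the current sample expires and the chain is never empty after the first element.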
  • 36. Another simple approach: oversample
      - As each element arrives, remember it with probability p = (ck/n) log n; otherwise discard it
      - Discard elements when they expire
      - When asked to produce a sample, choose k elements at random from the set in memory
      - Expected memory usage of O(k log n)
      - The algorithm can fail if fewer than k elements from a window are remembered; however, with high probability this will not happen
  • 37. Min-wise Sampling
      - For each item, pick a random fraction between 0 and 1
      - Store the item(s) with the smallest random tag (Nath et al. 2004)
      - Example tags: 0.391, 0.908, 0.291, 0.555, 0.619, 0.273
      - Each item has the same chance of getting the least tag, so the sample is uniform
      - Can run on multiple streams separately, then merge
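Min-wise sampling and the merge step can be sketched as below. The function names `min_wise_sample` and `merge`, and the item labels paired with the slide's example tags, are assumptions made for illustration.

```python
import random


def min_wise_sample(stream, rng=None):
    """Tag every item with a random fraction in [0, 1) and keep the smallest tag.

    Returns a (tag, item) pair, or None for an empty stream.
    """
    rng = rng or random.Random()
    best = None
    for item in stream:
        tag = rng.random()
        if best is None or tag < best[0]:
            best = (tag, item)
    return best


def merge(*tagged_samples):
    """Combine per-stream samples: the overall sample is the one with the least tag."""
    return min(tagged_samples, key=lambda pair: pair[0])


# The six tags shown on the slide, attached to hypothetical items a..f.
tagged = list(zip([0.391, 0.908, 0.291, 0.555, 0.619, 0.273], "abcdef"))
winner = merge(*tagged)
```

Because merging only compares tags, each stream can be sampled on a separate node and the results combined later without revisiting the data.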
  • 38. Hash function - Revision
      - Hash function: any algorithm that maps data of arbitrary length to data of a fixed length. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.
  • 39. Stream filtering
      - Compact data structure, by B. H. Bloom, 1970 (Bloom filter):
          - A bit array of size m
          - A family H of k hash functions (MD5, SHA256, Murmur)
          - A set S of n items
      - False positive probability: f = (1 - e^(-kn/m))^k
      - Example: m = 10, hash = MD5, k = 3
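A Bloom filter with the parameters m and k from the slide can be sketched as follows. Deriving the k hash functions by salting SHA-256 with the function index is an assumption made for this example (the slide's own example uses MD5), as are the names `BloomFilter` and `false_positive_rate`.

```python
import hashlib
import math


class BloomFilter:
    """Bit array of size m probed by k hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Assumption: the k hash functions are SHA-256 salted with the index j.
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


def false_positive_rate(m, k, n):
    """f = (1 - e^(-kn/m))^k, the approximation given on the slide."""
    return (1.0 - math.exp(-k * n / m)) ** k


bf = BloomFilter(m=1000, k=3)
for ip in ["10.1.0.2", "18.6.7.1", "12.4.3.8"]:
    bf.add(ip)
```

Membership tests never produce false negatives, so the filter can safely drop stream items that are definitely not in S.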
  • 40. Sketches
      - Sketch
          - Synopsis structure taking advantage of high volumes of data
          - Provides an approximate result with probabilistic bounds
          - Random projections on smaller spaces (hash functions)
      - Many sketch structures: usually dedicated to a specialized task
      - Examples of sketch structures
          - COUNT (Flajolet 85)
          - COUNT SKETCH (Charikar et al. 04)
  • 41. Difference between Sampling and Sketching
      - A sample sees only those items which were selected to be in the sample, whereas a sketch sees the entire input but is restricted to retain only a small summary of it
      - Not every problem can be solved with sampling
          - Example: counting how many distinct items are in the stream
          - If a large fraction of items aren’t sampled, we don’t know whether they are all the same or all different
  • 42. Sketches: COUNT (Flajolet-Martin algorithm)
      - Goal
          - Number N of distinct values in a stream (for large N)
          - Example: number of distinct IP addresses going through a router
      - Sketch structure
          - SK: L bits initialized to 0, e.g. 0 0 0 0 0 0 0 0
          - H: hashing function transforming an element of the stream into L bits, e.g. H(18.6.7.1) = 0 0 1 0 1 0 1 0
          - H distributes the elements of the stream uniformly over the 2^L possibilities
  • 43. Sketches
      - Method
          - Maintenance and update of SK
          - For each new element e
              - Compute H(e)
              - Select the position of the leftmost 1 in H(e)
              - Force this position to 1 in SK
      - Example: SK = 0 1 0 0 1 0 0 1; H(18.6.7.1) = 0 0 1 0 1 0 1 0; new SK = 0 1 1 0 1 0 0 1
  • 44. Sketches
      - Result
          - Select the position R (0 … L-1) of the leftmost 0 in SK
          - E(R) = log2(φN) with φ = 0.77351…, so estimate the count by 2^R/φ
          - σ(R) = 1.12
      - Example: SK = 1 1 1 0 1 0 0 0 gives R = 3
      - To improve the accuracy of this approximation algorithm, use multiple hash functions and use the average R instead.
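The update and estimation steps of the COUNT sketch can be sketched as follows, with positions counted from the left starting at 0 as on the slides. Using truncated SHA-256 as the hash H, and the helper names `hash_bits`, `update`, `estimate` and `count_distinct`, are assumptions made for this example; a single sketch gives only a coarse estimate, as the σ(R) figure indicates.

```python
import hashlib

PHI = 0.77351  # correction factor from the slide


def hash_bits(element, L=32):
    """Hash an element to a list of L bits, position 0 being the leftmost bit (L <= 64)."""
    digest = hashlib.sha256(str(element).encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big") >> (64 - L)
    return [(value >> (L - 1 - p)) & 1 for p in range(L)]


def update(sk, bits):
    """Set to 1 in SK the position of the leftmost 1 of the hashed element."""
    for pos, bit in enumerate(bits):
        if bit:
            sk[pos] = 1
            break


def estimate(sk):
    """Estimate the number of distinct values as 2^R / PHI, R = leftmost 0 in SK."""
    R = sk.index(0) if 0 in sk else len(sk)
    return 2 ** R / PHI


def count_distinct(stream, L=32):
    sk = [0] * L
    for element in stream:
        update(sk, hash_bits(element, L))
    return estimate(sk)


# The worked example from the slides.
sk = [0, 1, 0, 0, 1, 0, 0, 1]
update(sk, [0, 0, 1, 0, 1, 0, 1, 0])
example_estimate = estimate([1, 1, 1, 0, 1, 0, 0, 0])  # R = 3
```

Repeated elements hash to the same bits, so duplicates never change SK; this is what lets the sketch count distinct values without storing them.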
  • 45. Sketches
      - COUNT SKETCH algorithm (Charikar et al. 2004)
      - Goal
          - k most frequent elements in a stream (for a large number N of distinct values)
          - Example: 100 most frequent IP addresses going through a router
      - Input stream: 2, 0, 1, 3, 1, 2, 4, …; frequencies so far: f(0) = 1, f(1) = 2, f(2) = 2, f(3) = 1, f(4) = 1
  • 46. Sketches
      - Figure: t rows of B signed counters (e.g. +12, +7, +23, +15, -5, -12, -23, +1, …, +78, +56, +66, +65); an element e is mapped to one counter in each row, to which it adds -1 or +1
  • 47. Sketches
      - Sketch structure
          - h: hash function from [0, …, N-1] to [1, …, B]
          - s: hash function from [0, …, N-1] to {+1, -1}
          - Array of B counters: C1, …, CB (with B << N)
      - Sketch maintenance when e arrives: C[h(e)] += s(e)
      - Use of the sketch
          - Estimation of the frequency of object e: n_e ≈ C[h(e)] · s(e)
          - In practice, t hash functions h_j and t hash functions s_j are used: n_e ≈ median over j ∈ [1 … t] of ( C_j[h_j(e)] · s_j(e) )
      - Theoretical results on the error depend on N, t and B.
  • 48. Sketches
      - Algorithm
          - Maintenance of a list (e1, e2, …, ek) of the current k most frequent elements
          - For a new arriving element e
              - Add e to the sketch structure
              - Estimate the frequency of e from the sketch structure
              - If f(e) > f(ek), remove ek and insert e into the list
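The sketch structure and the top-k maintenance loop can be put together as follows. This is a minimal sketch under stated assumptions: the t pairs of hash functions (h_j, s_j) are derived from SHA-256 salted with the row index, buckets are numbered from 0, and the names `CountSketch` and `top_k` are chosen for the example.

```python
import hashlib
from statistics import median


class CountSketch:
    """t rows of B signed counters."""

    def __init__(self, t, B):
        self.t, self.B = t, B
        self.C = [[0] * B for _ in range(t)]

    def _hashes(self, e, j):
        # Assumption: h_j and s_j both come from one salted SHA-256 digest.
        digest = hashlib.sha256(f"{j}:{e}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % self.B
        sign = 1 if digest[8] & 1 else -1
        return bucket, sign

    def add(self, e):
        for j in range(self.t):
            bucket, sign = self._hashes(e, j)
            self.C[j][bucket] += sign

    def estimate(self, e):
        values = []
        for j in range(self.t):
            bucket, sign = self._hashes(e, j)
            values.append(self.C[j][bucket] * sign)
        return median(values)


def top_k(stream, k, t=5, B=4096):
    """Track the k elements with the highest estimated frequency."""
    sketch = CountSketch(t, B)
    top = {}  # element -> estimated frequency, at most k entries
    for e in stream:
        sketch.add(e)
        f = sketch.estimate(e)
        if e in top or len(top) < k:
            top[e] = f
        else:
            weakest = min(top, key=top.get)
            if f > top[weakest]:
                del top[weakest]
                top[e] = f
    return sorted(top, key=top.get, reverse=True)


heavy = top_k(["a"] * 5 + ["b"] * 3 + ["c"] + ["a"] * 2, k=2)
```

Taking the median over the t rows limits the effect of collisions: an estimate is only wrong when more than half of the rows collide for that element.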
  • 49. Distributed Systems: answering today's Big Data needs
      - S4, Storm, SAMOA, …
  • 50. Yahoo S4
      - Simple Scalable Streaming System
      - Uses a decentralized and symmetric architecture
          - All nodes are similar, no master node
      - Based on Processing Elements (PEs)
      - Each PE has 4 components:
          1. Functionality defined by a PE class and associated configuration
              - Input event handler processEvent()
              - Output mechanism output()
          2. The types of events that it consumes
          3. The keyed attribute in those events
          4. The value of the keyed attribute in events which it consumes
      - Several PEs are available for standard tasks such as count, aggregate, join, …
      - Custom PEs can easily be programmed
      - The PEs are automatically deployed on the Processing Nodes (machines):
          - Ensures load balancing of events
          - Notifies the appropriate PEs when an event comes in
  • 52. Twitter Storm
      - Same principle as S4, with a simplified programming model
      - Storm provides real-time computation
          - Scalable
          - Guarantees no data loss
          - Extremely robust and fault-tolerant
          - Programming language agnostic
      - Concepts
          - Streams
          - Spouts
          - Bolts
          - Topologies
  • 53. Lambda architecture (by Nathan Marz)
      - Generic, scalable and fault-tolerant data processing architecture
      - Figure labels: data flow; batch layer (batch processing, precomputed views); speed layer (real-time stream processing); serving layer; queries
      - Source: Mathieu DESPRIEE (USI)
  • 54. Lambda architecture
      1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
      2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
      3. The serving layer indexes the batch views so that they can be queried in an ad-hoc way.
      4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
      5. Any incoming query can be answered by merging results from batch views and real-time views.
      - http://lambda-architecture.net/
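Step 5 (answering a query by merging batch and real-time views) can be illustrated for the simple case of additive views such as counters. The sketch below is an assumption-laden toy: the view contents, the key names, and the `query` function are hypothetical, and summation is only a valid merge for additive aggregates.

```python
from collections import Counter


def query(batch_view, realtime_view, key):
    """Merge the precomputed batch result with the recent real-time increment."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


# Hypothetical page-view counts.
batch_view = Counter({"page_a": 1000, "page_b": 250})  # computed by the last batch run
realtime_view = Counter({"page_a": 12, "page_c": 3})   # events seen since that run

total_a = query(batch_view, realtime_view, "page_a")
```

When the next batch run completes, its view absorbs the recent events and the real-time view can be reset, which is how the speed layer stays limited to recent data.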
  • 55. Big Data Stream Mining: machine learning tools
      - Distributed
          - Batch: Hadoop, Mahout
          - Stream: S4, Storm, SAMOA
      - Non-distributed
          - Batch: R, WEKA, …
          - Stream: MOA
  • 56. What is SAMOA?
      - NEW software framework for mining distributed data streams
      - Big Data mining for evolving streams in REAL-TIME
  • 57. SAMOA architecture
      - Figure labels: SAMOA; classifier methods, clustering methods, frequent pattern mining; S4, Storm, …
      - Use S4, Storm, or another distributed stream processing platform
      - Use MOA, or another streaming machine learning library
      - Easy to extend through PACKAGES
  • 59. Too many data streams
  • 60. Type of data used in Big Data initiatives
      - Figure labels: internal data; traditional sources; « New data »
      - Source: Big Data opportunities survey, Unisphere / SAP, May 2013.
  • 61. BI: New Generation
      - Figure labels: sensors; heterogeneous and dynamic data streams; heterogeneous and static data; semantic data streams; interconnection; ontologies; semantic filtering and continuous queries; decision (monitoring, alerts, statistics, fault detection, etc.); retro-action
  • 62. Semantic Web technologies for data stream
      - Annotate stream data with semantic metadata
      - Apply Linked Data principles to publish streaming data
      - Interlink streaming data with existing datasets
      - Integrate data stream processing + reasoning
      - Objectives: interoperability, automation, enrichment
  • 63. Existing prototypes
      - CQELS, SPARQL-Stream, EP-SPARQL, C-SPARQL
  • 64. Approaches in RDF stream models (timestamps, events, time intervals, triple-based, graph-based, …)
  • 65. Example: SRBench
      - Q: Detect if a hurricane has been observed. “A hurricane has a sustained wind (for more than 3 hours) of at least 33 metres per second or 74 miles per hour (119 km/h).”
      - First formulation:
          ASK
          WHERE {
            STREAM <http://www.cwi.nl/SRBench/observations> [RANGE 10800s SLIDE 600s]
            { ?observation om-owl:procedure ?sensor ;
                           om-owl:observedProperty weather:WindSpeed ;
                           om-owl:result [ om-owl:floatValue ?value ] . }
          }
          GROUP BY ?sensor
          HAVING ( AVG(?value) >= "74"^^xsd:float )
      - Second formulation:
          ASK
          FROM STREAM <http://www.cwi.nl/SRBench/observations> [RANGE 3h STEP 10m]
          WHERE {
            ?observation om-owl:procedure ?sensor ;
                         om-owl:observedProperty weather:WindSpeed ;
                         om-owl:result [ om-owl:floatValue ?value ] .
          }
          GROUP BY ?sensor
          HAVING ( AVG(?value) >= "74"^^xsd:float )
  • 66. How to deal with Big Data Streams
      - Cloud computing ([1] and [2]); DSMS; sampling/load shedding ([3])
      - [1] J. Hoeksema & S. Kotoulas: High-performance Distributed Stream Reasoning using S4 (ISWC 2011)
      - [2] D. L. Phuoc et al.: Elastic and Scalable Processing of Linked Stream Data in the Cloud (ISWC 2013)
      - [3] N. Jain et al.: Sampling Semantic Data Stream: Resolving Overload and Limited Storage Issues (DaEng 2013)
  • 67. Sampling extensions for continuous SPARQL
      - Example query:
          PREFIX vocab: http://data-gov.tw.rpi.edu/vocab/p/8/
          SELECT ?val
          WHERE {
            STREAM <C:/CQELS/streams/data-8.stream>[NOW][UNISAMPLING %80]
            { ?rawData vocab:ozone_8hr_daily_max ?val }
            GRAPH <C:/CQELS/test.rdf>
            { ?user sioc:account_of ?person }
          }
      - Operators:
          [UNISAMPLING %{Sampling Percentage}]
          [RESSAMPLING %{Reservoir Size}]
          [CHNSAMPLING %{Window Size}]
      - N. Jain et al.: Sampling Semantic Data Stream: Resolving Overload and Limited Storage Issues (DaEng 2013)
  • 68. Semantic Data Stream Load Shedding (1/2)
      - Figure: RDF graph of two observations (an AirTemperature observation and a RelativeHumidity observation), each with type, observedProperty, procedure (System_ID), result (MeasureData with floatValue and uom, e.g. fahrenheit or percentage) and samplingTime (Instant, inXSDDateTime 2004-08-08T06:25:00); nodes are labelled (a) to (d) and two triples, (1) and (2), are drawn as dotted lines
      - RDF triple approach: effect of deleting the two triples (1) and (2) (dotted lines) in the graph:
          - The deletion of the first RDF triple destroys the link connecting node (a) to node (b)
          - The deletion of the second RDF triple destroys the link connecting node (c) to node (d)
          - => This makes nodes (b), (d) and all the nodes connected to them unreachable, even though their data is still in memory. This represents 6 RDF triples out of 18, i.e. 33% of unusable data
  • 69. Semantic Data Stream Load Shedding (2/2)
      - RDF graph approach: effect of deleting sub-graphs such as those formed by nodes (b) or (d) and all nodes to which they are directly connected
          - => Preserving the semantic level of the information
          - => Protecting the data consistency of the whole graph
          - => Enhancing the Semantic Data Stream systems
  • 70. Conclusion: Big Data Stream challenges
      - Semantic information aggregation
          - Information aggregation: “too much data to assimilate but not enough knowledge to act”
      - Distributed and real-time processing
          - Design of real-time and distributed algorithms for stream processing and information aggregation
          - Distribution and parallelization of data mining algorithms
      - Visual analytics and user modeling
          - Dynamic user model
          - Novel visualizations for very large datasets
  • 71. Thanks to
      - Zakia Kazi Aoul, ISEP
      - Marie-Aude Aufaure, ECP
      - Fethi Belghaouti, ISEP-INT
      - Georges Hébrail, EDF R&D
      - Sylvain Lefebvre, ISEP
      - Yousra Chabchoub, ISEP
  • 72. Big Data and Linked Data
      - Big Data: Volume, Variety, Velocity, Veracity, … Value
      - Linked Data: Web of data, Semantic Web
          - A set of principles and good practices allowing to link, publish and search for web data
          - Structure and semantically enrich RDF data, with a very high scalability -> Big Linked Data
      - Integrate, aggregate, analyze, visualize large data sets, whatever their type, provenance, speed of flow, …
      - Big Linked Data; Linked Big Data
      - Semantic Technologies Living Lab; Linked & Big Data Academic Chair
      - Our value proposition
          - Semantic aggregation from textual and non-textual streams
          - Manage semantic heterogeneity, real-time and distributed processing
          - Ensure data quality and veracity
          - Visual analytics