Marko Grobelnik
marko.grobelnik@ijs.si
Jozef Stefan Institute
Ljubljana, Slovenia
Stavanger, May 8th 2012
 Introduction
◦ What is Big data?
◦ Why Big-Data?
◦ When is Big-Data really a problem?
 Techniques
 Tools
 Applications
 Literature
Big data tutorial_part4
 ‘Big-data’ is similar to ‘Small-data’, but
bigger
 …but bigger data consequently requires different approaches:
◦ techniques, tools & architectures
 …to solve:
◦ New problems…
◦ …and old problems in a better way.
From “Understanding Big Data” by IBM
Big-Data
 Key enablers for the growth of “Big Data” are:
◦ Increase of storage capacities
◦ Increase of processing power
◦ Availability of data
 NoSQL
◦ Databases: MongoDB, CouchDB, Cassandra, Redis, BigTable,
HBase, Hypertable, Voldemort, Riak, ZooKeeper
 MapReduce
◦ Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine,
S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie,
Greenplum
 Storage
◦ S3, Hadoop Distributed File System
 Servers
◦ EC2, Google App Engine, Elastic Beanstalk, Heroku
 Processing
◦ R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene,
ElasticSearch, Datameer, BigSheets, Tinkerpop
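Several of the tools listed above are built around the MapReduce programming model. The sketch below imitates that model in a single Python process to show its three steps (map, shuffle, reduce) on a word count. It is an illustration only: the function names are made up for this example, and no Hadoop or other framework API is used.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce over a list of documents."""
    pairs = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(pairs))


counts = word_count(["big data is big", "small data is small"])
print(counts)
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network; here everything runs in memory.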
 …when the operations on data are complex:
◦ …e.g. simple counting is not a complex problem
◦ Modeling and reasoning with data of different kinds
can get extremely complex
 Good news about big-data:
◦ Often, because of the vast amount of data, modeling
techniques can get simpler (e.g. smart counting can
replace complex model-based analytics)…
◦ …as long as we deal with the scale
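As one possible illustration of counting replacing a heavier model, the sketch below predicts the next word purely from bigram counts. The task, the toy corpus and the function names are assumptions chosen for this example; the slides do not name a specific counting technique.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count, for every word, which words follow it and how often."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers


def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Toy corpus (hypothetical); with real data the counts would come from billions of tokens.
corpus = [
    "big data needs new tools",
    "big data needs scale",
    "big models need data",
]
model = train_bigram_counts(corpus)
print(predict_next(model, "big"))
```

The only "model" here is a table of counts, so quality depends on how much data is counted, which is where the scale requirement comes in.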
 Research areas (such as IR, KDD, ML, NLP, SemWeb, …) are sub-cubes within the data cube
Scalability
Dynamicity
Context
Quality
Usage
 Good recommendations can make a big difference when keeping a user on a web site
◦ …the key is how rich a context model the system uses to select information for a user
◦ With bad recommendations <1% of users click; with good ones >5% of users click
Contextual personalized recommendations generated in ~20ms
 Domain
 Sub-domain
 Page URL
 URL sub-directories
 Page Meta Tags
 Page Title
 Page Content
 Named Entities
 Has Query
 Referrer Query
 Referring Domain
 Referring URL
 Outgoing URL
 GeoIP Country
 GeoIP State
 GeoIP City
 Absolute Date
 Day of the Week
 Day period
 Hour of the day
 User Agent
 Zip Code
 State
 Income
 Age
 Gender
 Country
 Job Title
 Job Industry
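A few of the context features listed above can be derived directly from a single log record. The sketch below is a minimal illustration using only the Python standard library; the function name, the record layout (URL, referrer, Unix timestamp) and the example URLs are assumptions, and taking the last two host labels as the domain is a simplification that fails for suffixes such as co.uk.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def extract_context(url, referrer, timestamp):
    """Derive URL, referrer and time features from one page-click record."""
    page = urlparse(url)
    ref = urlparse(referrer) if referrer else None
    host_parts = (page.hostname or "").split(".")
    when = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return {
        # Simplification: treat the last two host labels as the domain.
        "domain": ".".join(host_parts[-2:]),
        "sub_domain": ".".join(host_parts[:-2]),
        "page_url": url,
        # Directory components of the path, without the final page name.
        "url_sub_directories": [p for p in page.path.split("/")[:-1] if p],
        "has_query": bool(page.query),
        "referring_domain": ref.hostname if ref else None,
        "referrer_query": ref.query if ref else None,
        "day_of_week": DAY_NAMES[when.weekday()],
        "hour_of_day": when.hour,
    }


features = extract_context(
    "http://www.example.com/business/markets/story.html?id=1",  # hypothetical URL
    "http://search.example.org/find?q=stocks",                  # hypothetical referrer
    0,  # Unix timestamp 0 is 1970-01-01 00:00 UTC
)
print(features)
```

Features such as GeoIP location, named entities or user demographics need external data sources and are left out of this sketch.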
Diagram labels: Log Files (~100M page clicks per day); User profiles; NYT articles; Stream of profiles; Advertisers; Segments; Trend Detection System; Stream of clicks; Trends and updated segments; Campaign to sell segments; Sales ($)
Segment: Keywords
Stock Market: Stock Market, mortgage, banking, investors, Wall Street, turmoil, New York Stock Exchange
Health: diabetes, heart disease, disease, heart, illness
Green Energy: Hybrid cars, energy, power, model, carbonated, fuel, bulbs
Hybrid cars: Hybrid cars, vehicles, model, engines, diesel
Travel: travel, wine, opening, tickets, hotel, sites, cars, search, restaurant
…: …
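One simple way to use segment keyword lists like those above is to score a piece of text by how many segment keywords it contains. The sketch below is an assumed, simplified matcher for illustration, not the system shown on the slide; the function names and the sample sentence are hypothetical, and only three of the segments are included.

```python
import re

# Keyword lists copied from the slide (lower-cased); three segments only.
SEGMENTS = {
    "Stock Market": ["stock market", "mortgage", "banking", "investors",
                     "wall street", "turmoil", "new york stock exchange"],
    "Health": ["diabetes", "heart disease", "disease", "heart", "illness"],
    "Travel": ["travel", "wine", "opening", "tickets", "hotel",
               "sites", "cars", "search", "restaurant"],
}


def score_segments(text, segments=SEGMENTS):
    """Count whole-word keyword hits per segment; overlapping keywords all count."""
    lowered = text.lower()
    scores = {}
    for name, keywords in segments.items():
        hits = sum(
            len(re.findall(r"\b" + re.escape(keyword) + r"\b", lowered))
            for keyword in keywords
        )
        if hits:
            scores[name] = hits
    return scores


def best_segment(text, segments=SEGMENTS):
    """Return the segment with the most keyword hits, or None if nothing matches."""
    scores = score_segments(text, segments)
    return max(scores, key=scores.get) if scores else None


print(best_segment("Investors on Wall Street worry about mortgage turmoil"))
```

Because "heart disease", "disease" and "heart" are separate keywords, a single phrase can score several hits; a production system would likely weight or deduplicate such matches.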
 50GB of uncompressed log files
 10GB of compressed log files
 0.5GB of processed log files
 50-100M clicks
 4-6M unique users
 7000 unique pages with more than 100 hits
 Index size 2GB
 Pre-processing & indexing time
◦ ~10min on workstation (4 cores & 32GB)
◦ ~1hour on EC2 (2 cores & 16GB)
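The figures above imply a few useful ratios, worked out in the short sketch below. It assumes that the sizes are gigabytes, that all numbers describe the same batch of logs, and that the click and user ranges can be combined into bounds; the variable names are made up for this example.

```python
# Figures from the slide; sizes are read as gigabytes (an assumption).
raw_gb = 50.0
compressed_gb = 10.0
processed_gb = 0.5
clicks_low, clicks_high = 50e6, 100e6
users_low, users_high = 4e6, 6e6

# How much smaller the data gets at each stage.
compression_ratio = raw_gb / compressed_gb
processing_reduction = raw_gb / processed_gb

# Bounds on clicks per user: fewest clicks over most users, and the reverse.
clicks_per_user_low = clicks_low / users_high
clicks_per_user_high = clicks_high / users_low

# Rough size of one raw log record, in bytes (1 GB taken as 1e9 bytes).
bytes_per_click_low = raw_gb * 1e9 / clicks_high
bytes_per_click_high = raw_gb * 1e9 / clicks_low

print(f"compression: {compression_ratio:.0f}x, processing: {processing_reduction:.0f}x")
print(f"clicks per user: {clicks_per_user_low:.1f} to {clicks_per_user_high:.1f}")
print(f"bytes per raw click: {bytes_per_click_low:.0f} to {bytes_per_click_high:.0f}")
```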
 Alarms Explorer Server implements three real-time scenarios on the alarms stream:
1. Root-Cause-Analysis – finding which device is responsible for an occasional “flood” of alarms
2. Short-Term Fault Prediction – predicting which device will fail in the next 15 minutes
3. Long-Term Anomaly Detection – detecting unusual trends in the network
 …the system is used at British Telecom
Diagram labels: Telecom Network (~25 000 devices); Alarms (~10-100/sec); Alarms Server; Live feed of data; Alarms Explorer Server; Operator; Big board display
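To make the first scenario more concrete, the sketch below keeps a sliding time window over the alarm stream and, when the window holds unusually many alarms, reports the device that raised most of them. This is a deliberately naive stand-in for root-cause analysis, not the method used by the Alarms Explorer Server; the class name, window length and threshold are assumptions.

```python
from collections import Counter, deque


class FloodDetector:
    """Track alarms in a sliding time window and name the noisiest device."""

    def __init__(self, window_seconds=60.0, flood_threshold=100):
        self.window_seconds = window_seconds    # hypothetical window length
        self.flood_threshold = flood_threshold  # hypothetical flood size
        self.events = deque()                   # (timestamp, device_id), oldest first
        self.per_device = Counter()             # alarms per device inside the window

    def add(self, timestamp, device_id):
        """Record one alarm and drop alarms that fell out of the window."""
        self.events.append((timestamp, device_id))
        self.per_device[device_id] += 1
        self._expire(timestamp)

    def _expire(self, now):
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_device = self.events.popleft()
            self.per_device[old_device] -= 1
            if self.per_device[old_device] == 0:
                del self.per_device[old_device]

    def suspect(self):
        """Return (device_id, alarm_count) during a flood, otherwise None."""
        if len(self.events) < self.flood_threshold:
            return None
        return self.per_device.most_common(1)[0]


detector = FloodDetector(window_seconds=10, flood_threshold=5)
for ts, device in [(0, "a"), (1, "b"), (2, "b"), (3, "b"), (4, "c")]:
    detector.add(ts, device)
print(detector.suspect())
```

Each alarm is added and removed once, so the cost per alarm stays constant on average, which matters at rates of ~10-100 alarms per second. A real system would also need the network topology, since the noisiest device is not necessarily the faulty one.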
 Presented in “Planetary-Scale Views on a Large Instant-Messaging Network” by Jure Leskovec and Eric Horvitz (WWW 2008)
 Observe social and communication
phenomena at a planetary scale
 Largest social network analyzed to date
Research questions:
 How does communication change with user
demographics (age, sex, language, country)?
 How does geography affect communication?
 What is the structure of the communication
network?
 We collected the data for June 2006
 Log size:
150GB/day (compressed)
 Total: 1 month of communication data:
4.5TB of compressed data
 Activity over June 2006 (30 days)
◦ 245 million users logged in
◦ 180 million users engaged in conversations
◦ 17.5 million new accounts activated
◦ More than 30 billion conversations
◦ More than 255 billion exchanged messages
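The totals above can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 1000 GB) and treats the "more than" figures as if they were exact, so the ratios are only approximate; the variable names are made up for this example.

```python
DAYS = 30
daily_log_gb = 150           # compressed log size per day
users_logged_in = 245e6
users_conversing = 180e6
conversations = 30e9         # slide says "more than"
messages = 255e9             # slide says "more than"

# 150 GB/day over 30 days, expressed in TB (1 TB = 1000 GB assumed).
total_tb = daily_log_gb * DAYS / 1000

# Share of logged-in users who took part in at least one conversation.
conversing_share = users_conversing / users_logged_in

# Rough averages derived from the lower-bound figures.
messages_per_conversation = messages / conversations
conversations_per_active_user = conversations / users_conversing

print(f"total: {total_tb} TB")
print(f"conversing share: {conversing_share:.1%}")
print(f"messages per conversation: {messages_per_conversation:.1f}")
print(f"conversations per active user: {conversations_per_active_user:.0f}")
```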
 Count the number of users logging in from a particular location on Earth
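Counting logins per location can be sketched by snapping each login's coordinates to a grid cell and counting per cell. The grid size, function names and sample coordinates below are assumptions for illustration; the slide does not say how locations were binned.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_degrees=1.0):
    """Map a latitude/longitude pair to the index of its grid cell."""
    return (math.floor(lat / cell_degrees), math.floor(lon / cell_degrees))


def count_logins(logins, cell_degrees=1.0):
    """Count logins per grid cell; `logins` is an iterable of (lat, lon) pairs."""
    return Counter(grid_cell(lat, lon, cell_degrees) for lat, lon in logins)


# Hypothetical login coordinates: two near Ljubljana, one near Stavanger.
sample_logins = [(46.05, 14.51), (46.90, 14.10), (58.97, 5.73)]
cell_counts = count_logins(sample_logins)
print(cell_counts.most_common())
```

The per-cell counts are what a map rendering would then turn into colour or brightness.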
 Logins from Europe
 6 degrees of separation [Milgram ’60s]
 Average distance between two random users is 6.6
 90% of nodes can be reached in < 8 hops
Hops Nodes
1 10
2 78
3 396
4 8648
5 3299252
6 28395849
7 79059497
8 52995778
9 10321008
10 1955007
11 518410
12 149945
13 44616
14 13740
15 4476
16 1542
17 536
18 167
19 71
20 29
21 16
22 10
23 3
24 2
25 3
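The table can be summarised with a few lines of code. The sketch below assumes each row gives the number of nodes first reached at that hop from a single starting node (the slide does not say so explicitly). Under that reading the mean of this one table need not equal the 6.6 average, which is reported over random pairs of users.

```python
# Hop distribution copied from the table above: hops -> number of nodes.
HOP_COUNTS = {
    1: 10, 2: 78, 3: 396, 4: 8648, 5: 3299252,
    6: 28395849, 7: 79059497, 8: 52995778, 9: 10321008, 10: 1955007,
    11: 518410, 12: 149945, 13: 44616, 14: 13740, 15: 4476,
    16: 1542, 17: 536, 18: 167, 19: 71, 20: 29,
    21: 16, 22: 10, 23: 3, 24: 2, 25: 3,
}

total_nodes = sum(HOP_COUNTS.values())


def fraction_within(max_hops):
    """Fraction of reached nodes that lie at most `max_hops` away."""
    reached = sum(n for hops, n in HOP_COUNTS.items() if hops <= max_hops)
    return reached / total_nodes


# Node-weighted mean distance for this one table.
mean_hops = sum(hops * n for hops, n in HOP_COUNTS.items()) / total_nodes

print(f"nodes reached: {total_nodes}")
print(f"within 7 hops: {fraction_within(7):.1%}")
print(f"within 8 hops: {fraction_within(8):.1%}")
print(f"mean hops: {mean_hops:.2f}")
```

The cumulative fraction crosses 90% between 7 and 8 hops, which matches the claim on the slide.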