Big Data Challenge
COMP 41700
Seminars in Data Science
Summary of the presentation:

Short Introduction of Telecom Italia Big Data Challenge – Donagh

Summary of Paper 1 and Paper 2 – Rajesh

Other interesting insights we can draw from this dataset – Malika
A contest designed to stimulate the creation and development of innovative technological ideas in the Big Data field
history
• Early 2014: Telecom Italia released the first edition, which was closed
• Its success meant that the next iteration was open
• Freely available for anyone to use
• https://dandelion.eu/datamine/open-big-data/
data sets
• Geo-referenced (Milan and the Autonomous Province of Trento)
• Anonymised
• Millions of records
• November to December 2013
• Extracted from telecom records, energy, weather, public and private transport, and social networks
Milano / Trentino
• Grid
Telecom Italia Big Data Challenge
Milano datasets
Domain: Datasets
Telecommunications: SMS, Call, Internet; MI to Provinces; MI to MI
Weather: Weather Station Data; Precipitation
Environment: Air Quality
News: Milano Today
Social: Tweets
tweets
• username (anonymised)
• entities
• language
• municipality
• tweet time
• geometry
Paper 1
(Anatomy and efficiency of urban multimodal mobility)
Main goal: to find the optimal time-respecting path between two geographic locations in the multimodal network
Here l(a,b) is the length of the quickest (time-respecting and minimal) trip on the network from 'a' to 'b'
d(a,b) is the Euclidean distance from the origin 'a' to the destination 'b'
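
A time-respecting path only boards a connection that departs after the traveller has reached its departure stop. The sketch below illustrates that idea with a single pass over a timetable sorted by departure time (a connection-scan style approach). The timetable, the stop names and the absence of transfer times are illustrative assumptions, not data or the method from Paper 1.

```python
from math import inf


def earliest_arrival(connections, source, target, depart_at):
    """Return the earliest arrival time at `target`, or inf if unreachable.

    connections: iterable of (dep_stop, arr_stop, dep_time, arr_time),
    with times in minutes and arr_time >= dep_time (assumed format).
    """
    best = {source: depart_at}  # earliest known arrival time per stop
    # Scanning in order of departure time means a stop's arrival time is
    # already known before any usable connection leaving it is examined.
    for dep_stop, arr_stop, dep_time, arr_time in sorted(connections, key=lambda c: c[2]):
        reachable = best.get(dep_stop, inf) <= dep_time  # time-respecting check
        if reachable and arr_time < best.get(arr_stop, inf):
            best[arr_stop] = arr_time
    return best.get(target, inf)


# Hypothetical timetable, times in minutes after midnight.
timetable = [
    ("A", "B", 480, 490),  # bus
    ("B", "C", 495, 520),  # metro, boardable after arriving at B at 490
    ("A", "C", 500, 540),  # direct but slower bus
    ("B", "C", 485, 505),  # leaves B before the traveller can get there
]

arrival = earliest_arrival(timetable, "A", "C", depart_at=475)
print(arrival - 475)  # duration of the quickest time-respecting trip
```

The connection leaving B at 485 is never used, even though it would arrive earlier, because the traveller only reaches B at 490.
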
Rail then becomes dominant at about 40 km, and air travel is dominant
for trips of order 700 km. Other transportation modes
play a secondary role, with peaks at 22 km for the Metro, 40 km
for ferries and 70 km for coaches
The bus system covers most of the
short trips, whereas the advantage of
using the Metro and Rail systems emerges
progressively for longer distances
The total number of stop events, Ω, grows proportionally with the
urban area population P.
Here C(α) is the number of stop events in layer α
and Δt is the duration of the time interval
Paper 2
(High resolution population estimates from telecommunications data)
Data sources: telecommunications data (provided by Telecom Italia)
Census data
Satellite images (provided by Landsat)
Main goal: create high-resolution (235 m × 235 m) population estimates in time and space
Difficulties: population counts can change rapidly, which makes it hard to acquire local census estimates
in a timely and accurate manner. The correlation coefficient between call volume and the
underlying population distribution varies with time.
Building map:
41% of the map area is generated directly.
To classify the remaining 59%, they train a
random forest classifier using OpenStreetMap
data as labelled training examples.
Population distribution:
Population is distributed exponentially at the low end:
29% of grid squares have zero population
5% of grid squares have a population of 1
3% of grid squares have a population of 2, and so on.
39% of grid squares have a population over 100
The distribution then follows a normal distribution with a mean of 400 persons
Telecommunications activity:
Recorded in 10-minute intervals for each of the 235 m × 235 m grid cells.
Communication activity is approximately log-normal.
There are 5 types of communications activity: SMSIN,
SMSOUT, CALLIN, CALLOUT, and INTERNET.
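
Since activity is reported per grid cell in 10-minute intervals, a typical first step is to roll the intervals up to hourly totals per cell and activity type. A minimal sketch of that aggregation follows; the record layout (cell id, minute of day, activity type, value) is a hypothetical simplification, not the dataset's actual schema.

```python
from collections import defaultdict

ACTIVITY_TYPES = ("SMSIN", "SMSOUT", "CALLIN", "CALLOUT", "INTERNET")


def hourly_totals(records):
    """Sum 10-minute activity values into (cell_id, hour, activity) totals.

    records: iterable of (cell_id, minute_of_day, activity, value);
    this tuple layout is an assumption made for illustration.
    """
    totals = defaultdict(float)
    for cell_id, minute_of_day, activity, value in records:
        if activity not in ACTIVITY_TYPES:
            raise ValueError(f"unknown activity type: {activity}")
        hour = minute_of_day // 60  # e.g. minutes 600..659 fall in hour 10
        totals[(cell_id, hour, activity)] += value
    return dict(totals)


# Made-up sample records for two grid cells.
sample = [
    (1, 600, "CALLOUT", 2.0),
    (1, 610, "CALLOUT", 3.0),
    (1, 660, "CALLOUT", 1.0),
    (2, 600, "SMSIN", 4.0),
]
totals = hourly_totals(sample)
print(totals[(1, 10, "CALLOUT")])
```
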
Elementary model:
Previous research suggests a relation between population and telecommunications activity at each location i
(w stands for call volume, p stands for population)
Not perfect:
The relationship between call volume and population
in this region is much weaker below a threshold of
351 persons.
The main reason is that densely populated areas tend to
have more cell towers, which makes the relationship easier to observe.
Model(1):
Model(2):
Finding the best hours of call volume data:
Each type correlates most strongly during the hour
from 10 am to 11 am, and as with the total call
volumes, CALLOUT has the greatest correlation,
approximately 0.68. Thus we use CALLOUT from
10 am to 11 am for the w_i in
Model(2).
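
The hour selection described above amounts to computing, for each hour, the correlation between per-cell call volume and per-cell population, then keeping the hour with the highest value. A minimal sketch under that reading; the numbers are made up, and Pearson correlation on raw values is an assumption, since the slides do not say which coefficient or transformation the paper uses.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def best_hour(volume_by_hour, population):
    """Return (hour, correlation) for the hour whose volumes track population best."""
    scores = {hour: pearson(volumes, population) for hour, volumes in volume_by_hour.items()}
    hour = max(scores, key=scores.get)
    return hour, scores[hour]


# Made-up CALLOUT volumes for four grid cells at three hours of the day.
population = [10, 200, 400, 800]
callout_by_hour = {
    3: [5, 1, 7, 2],
    10: [1, 20, 41, 79],
    20: [9, 30, 20, 60],
}
hour, score = best_hour(callout_by_hour, population)
print(hour, round(score, 3))
```

Running the same function once per activity type and comparing the winning scores would reproduce the comparison between SMSIN, SMSOUT, CALLIN, CALLOUT and INTERNET.
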
Where else can we use the Telecom Italia dataset?
Analyzing cities using the space-time structure of the mobile phone network
• Attempts to connect telecom usage data from Telecom Italia mobile to the geography of human activity
• Use of telecom data to enhance the understanding of cities as spaces of flows
Using the Telecom dataset for social network analysis
• Investigating social structures through the use of network and graph theories
• Anthropology, Biology, Communication Studies, etc.
Traffic monitoring in urban areas
• Use of telecom data to track dense regions
• Rerouting strategies
• Increase public transport in dense areas
• Provide more taxis in dense areas
Other Usages
• User localization
• Security
• Health care: tracking users' exercise
Thank you...
Special Thanks to my team members:
Hao Wu and He Ping


Editor's Notes

  • #4: What is the GOAL ? Why are they doing this ?
  • #5: At the beginning of 2014, Telecom Italia, in collaboration with several international partners, launched the Telecom Italia Big Data Challenge. The contest made a large dataset of 30+ kinds of data (mobile, weather, energy, etc.) available to developers, designers and scientists. The datasets were initially released for use by the participants only; after the end of the contest, demand for the datasets rose. They want people to reuse the data.
  • #6: The data provided in the dataset of the Big Data Challenge is geo-referenced (areas: Milan and the Autonomous Province of Trento – Italy) and anonymized. The dataset contains millions of records of data covering the period from November to December 2013 extracted from telecommunications records, energy, weather, public and private transport, social networks and events.
  • #7: Some of the datasets referring to the Milano urban area are spatially aggregated using a grid. We refer to this grid as the Milano Grid. The schema of the grid is: cellId, and geometry expressed as GeoJSON.
  • #8: The grid has the following spatial description: the square id numbering starts from the bottom-left corner of the grid and grows until its top-right corner.
  • #10: Datasets are divided up into domains. Telecommunications has 3 datasets: SMS, Call and Internet activity; call data from Milan to the provinces; and call data within Milan.
  • #11: Each row corresponds to a tweet. For privacy reasons the user id has been obfuscated and the text has been replaced with a list of entities extracted by the Entity Extraction API tool. Entities are provided as links to DBpedia. Fields: user, entities, language, municipality, date created, timestamp, geometry.
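
The grid numbering in note #8 (ids start at the bottom-left corner and grow towards the top-right) can be turned into row and column positions with integer division. This is a rough sketch only: it assumes ids start at 1 and increase left to right within a row before moving up to the next row, and the grid width of 100 used in the example is illustrative rather than taken from the slides.

```python
def cell_to_row_col(cell_id, n_cols, first_id=1):
    """Map a grid square id to (row, col), both 0-based.

    Row 0 is the bottom row and column 0 is the leftmost column.
    `n_cols` (squares per row) and `first_id` are assumptions supplied by the caller.
    """
    index = cell_id - first_id
    if index < 0:
        raise ValueError("cell_id is smaller than the first id of the grid")
    return divmod(index, n_cols)  # (rows above the bottom, columns from the left)


# Illustrative grid with 100 squares per row.
print(cell_to_row_col(1, n_cols=100))    # bottom-left square
print(cell_to_row_col(101, n_cols=100))  # first square of the second row
```
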