A Real-time Processing System
based on Spark Streaming
in the field of
Telecommunications
For Hadoop Summit, 2017
Geng Wang
Dong Wang
CONTENT
01
What we faced in Telecommunications
More data is produced
• 0.166 million new users / day
• 33G data / sec by mobile
• 10T voice data / day
• 100T signal data / day
What we faced in Telecommunications
More real-time requirements
• Smart Tourism
  • Tourist count and analysis
  • Best choice of tourist resort
  • Recommendation of travel routes
  • …
• Intelligent marketing
  • Product recommendations for specific customers
  • Based on multiple dimensions (location, age, salary …)
Evolution of Requirement
2014
• Real-time marketing
2015
• Operation based on location
2016
• More data input: 2/3/4G signal of location, content of business
2017
• Hard real time: CEP, Esper, …
CONTENT
02
Framework – High level
Data output
Tagging
Data Input
Detailed Framework
• Hadoop layer
  • Basic components
• OCSP core
  • Data pre-processing
  • Tagging
  • Event output (select and filter)
  • Multiple engines (Spark Streaming and Storm)
  • Multi-tenant
  • Checkpoint
• Data transformation
  • Parse data to Kafka
  • NiFi and Flume
  • Customized processor & sink
• Data source
  • Socket
  • Local files
  • HDFS
Framework - Data Input
• 2G/3G signal of location: Flume agent 1
• 4G signal of location: Flume agents 2, 3, 4
• Content data of business: NiFi
• Target: Kafka partitions
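A minimal sketch of how keyed records could be spread across Kafka partitions so that all events for one subscriber stay in order within a single partition. Keying by imsi is an assumption here, and the CRC32 hash is a deterministic stand-in, not Kafka's own partitioner. The names `partition_for` and `assign` are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index in a stable, repeatable way."""
    # CRC32 is only a stand-in hash for illustration; it is not Kafka's partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(records, num_partitions):
    """Group (imsi, location, timestamp) records by the partition of their imsi."""
    partitions = {p: [] for p in range(num_partitions)}
    for record in records:
        imsi = record[0]
        partitions[partition_for(imsi, num_partitions)].append(record)
    return partitions


# Made-up sample records in arrival order.
records = [
    ("460000000000001", "cell-17", 1500000000),
    ("460000000000002", "cell-03", 1500000005),
    ("460000000000001", "cell-18", 1500000030),
]
partitions = assign(records, 4)
```

Because the partition depends only on the key, the two events for the first imsi land in the same partition and keep their arrival order.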
Data Preprocessing
• Flow: Kafka, transform, select / filter, uniform schema
• Schema 1: Select Expr 1 (imsi), Filter Expr 1 (imsi!=0)
• Schema 2: Select Expr 2, Filter Expr 2
• Schema 3: Select Expr 3, Filter Expr 3
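The select and filter step can be sketched in plain Python as follows. This is an illustration of the idea only, not OCSP's implementation: the `preprocess` helper, the field names other than imsi, and the sample records are assumptions. The filter mirrors the slide's example expression imsi!=0.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def preprocess(records: List[Record],
               select_fields: List[str],
               keep: Callable[[Record], bool]) -> List[Record]:
    """Drop records that fail the filter, then project the selected fields."""
    result = []
    for record in records:
        if keep(record):
            result.append({field: record[field] for field in select_fields})
    return result


# Made-up input records carrying one extra field that the uniform schema drops.
raw = [
    {"imsi": "460000000000001", "location": "cell-17", "timestamp": "1500000000", "raw_flag": "x"},
    {"imsi": "0", "location": "cell-03", "timestamp": "1500000005", "raw_flag": "y"},
]

uniform = preprocess(
    raw,
    select_fields=["imsi", "location", "timestamp"],  # select expression (assumed fields)
    keep=lambda r: r["imsi"] != "0",                  # filter expression: imsi != 0
)
```

Keeping the select list and the filter as parameters reflects the slide's point that each input schema gets its own expressions before the data reaches a uniform schema.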
Tagging and Label
Tagging process
• Two ways to attach labels: get by key from Codis, and customized operation
• Labels shown: user info, stay duration, cycle of marketing, user name, imsi, phone number, base station
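A hedged sketch of the two tagging styles named on this slide: a get-by-key lookup and a customized operation. A plain dictionary stands in for Codis, and the stay-duration rule (time since the user arrived at the current base station) is an assumed definition, not one stated in the deck. All function and field names are hypothetical.

```python
def tag_user_info(record, cache):
    """Get-by-key tagging: merge cached user attributes into the record."""
    info = cache.get(record["imsi"], {})
    return {**record, **info}


def stay_duration(events):
    """Customized operation: seconds spent at the most recent base station.

    `events` is a time-ordered list of (station, timestamp) pairs for one user.
    """
    if not events:
        return 0
    current_station, last_ts = events[-1]
    first_ts = last_ts
    # Walk backwards while the user is still at the same station.
    for station, ts in reversed(events):
        if station != current_station:
            break
        first_ts = ts
    return last_ts - first_ts


# A dictionary standing in for the Codis cache (made-up data).
user_cache = {"460000000000001": {"user_name": "user-a", "phone_number": "13800000000"}}

tagged = tag_user_info({"imsi": "460000000000001", "station": "B"}, user_cache)
duration = stay_duration([("A", 0), ("B", 10), ("B", 25), ("B", 40)])
```

A record whose imsi is missing from the cache passes through unchanged in this sketch; a real deployment would have to decide whether to drop or keep such records.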
Select, filter & Output
• Input: data with labels
• Output 1: user path in a duration
• Output 2: new user marketing
• Output 3: users with a specific location & specific business
• Others: current location update for each user
• Targets: Codis, Kafka
Configurable process
• Steps shown: data with labels, check interval, filter, select, output, end
• Store: Codis
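One way to picture a configurable select, filter, and output step is a list of rules applied to each labelled record. This is a sketch under assumptions: the rule layout, the rule names, and the fields `is_new_user`, `location`, and `business` are hypothetical and are not taken from OCSP's configuration format.

```python
def route(record, rules):
    """Return {output name: projected record} for every rule the record matches."""
    results = {}
    for rule in rules:
        if rule["filter"](record):
            results[rule["name"]] = {field: record[field] for field in rule["select"]}
    return results


# Hypothetical rules loosely modelled on the outputs listed on the slides.
rules = [
    {
        "name": "new_user_marketing",
        "filter": lambda r: r["is_new_user"],
        "select": ["imsi", "phone_number"],
    },
    {
        "name": "location_and_business",
        "filter": lambda r: r["location"] == "scenic-area-1" and r["business"] == "video",
        "select": ["imsi", "location", "business"],
    },
]

labelled = {
    "imsi": "460000000000001",
    "phone_number": "13800000000",
    "location": "scenic-area-1",
    "business": "video",
    "is_new_user": False,
}
outputs = route(labelled, rules)
```

Adding an output in this model means appending a rule rather than changing processing code, which is the property a configurable process is meant to provide.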
Framework - Deployment & Configuration
• External systems: SDTP socket source, HDFS I/O, Codis I/O
• Web: deployment / configuration
• OCSP: deployment on a single host
CONTENT
03
Performance - scale out
(Diagram: multiple Flume and NiFi instances feed the data input; the data input, tagging, and data output stages run on Kafka and Spark, with multiple Codis instances.)
How OCSP works in Smart Tourism?
• Data source
  • 4G signal data
  • imsi + location + timestamp
• Data transformation
  • Flume source: socket
  • Sink: Kafka (keyed message)
• Streaming processing
  • Filter invalid data
  • Tagging: get the user's information from Codis by imsi
  • Tagging: compute the user path in a duration
• Output
  • Write the latest location to Kafka
  • Use Flume to update the latest location in HBase
Data flow:
• 4G data, socket, Flume, Kafka: imsi | location | timestamp
• Tagging, filter, select, output (Codis)
• Kafka, Flume, HBase: imsi|location|timestamp|name|age|longitude|latitude
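The Smart Tourism flow above can be sketched end to end in plain Python: parse an input line, filter invalid data, enrich it, and compute a user path within a duration. Dictionaries stand in for Codis. Treating imsi 0 as invalid, taking name and age from a user cache, and taking longitude and latitude from a base-station cache are all assumptions; the deck only shows the input and output record layouts.

```python
def parse(line):
    """Parse 'imsi | location | timestamp'; return None for invalid input."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 3:
        return None
    imsi, location, ts = parts
    if not imsi or imsi == "0" or not ts.isdigit():
        return None
    return {"imsi": imsi, "location": location, "timestamp": int(ts)}


def enrich(record, users, stations):
    """Build 'imsi|location|timestamp|name|age|longitude|latitude' or None."""
    user = users.get(record["imsi"])
    station = stations.get(record["location"])
    if user is None or station is None:
        return None
    fields = [record["imsi"], record["location"], record["timestamp"],
              user["name"], user["age"], station["longitude"], station["latitude"]]
    return "|".join(str(f) for f in fields)


def user_path(records, imsi, start, end):
    """Ordered list of locations one user visited within [start, end]."""
    path = []
    for r in sorted(records, key=lambda r: r["timestamp"]):
        if r["imsi"] == imsi and start <= r["timestamp"] <= end:
            if not path or path[-1] != r["location"]:
                path.append(r["location"])
    return path


# Made-up caches and input lines.
users = {"460000000000001": {"name": "user-a", "age": 30}}
stations = {"cell-17": {"longitude": 116.4, "latitude": 39.9},
            "cell-18": {"longitude": 116.5, "latitude": 39.8}}
lines = ["460000000000001 | cell-17 | 1500000000",
         "0 | cell-17 | 1500000010",
         "460000000000001 | cell-17 | 1500000020",
         "460000000000001 | cell-18 | 1500000040"]

valid = [r for r in (parse(line) for line in lines) if r is not None]
enriched = [enrich(r, users, stations) for r in valid]
path = user_path(valid, "460000000000001", 1500000000, 1500000060)
```

Consecutive repeats of the same location collapse into one path entry, so a user who stays under one cell for several signalling events contributes a single step.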
Performance - time cost
Scenario  Data per 30 s  Spark            Codis    Kafka partitions  Output number
Case 1    0.6 million    20/128G/32 core  10/128G  200               3
Case 2    10 million     28/512G/64 core  10/512G  1200              11

Stages measured: Data Transformation, Tagging (Get cache), Tagging (Operation), Output
Case 1: 5 seconds in total (per-stage values shown: 0.5 s, 3 s, 1 s, 1.5 s)
Case 2: 17 seconds in total (per-stage values shown: 1 s, 11 s, 3 s, 2 s)
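The reported figures can be turned into an input rate and a share of the interval spent processing. This is arithmetic on the slide's numbers only; reading "Data per 30 s" as a 30-second batch interval is an assumption, and the helper names are hypothetical.

```python
BATCH_SECONDS = 30  # assumed batch interval, taken from "Data per 30 s"

# Record counts and total processing times as reported on the slide.
cases = {
    "Case 1": {"records": 600_000, "processing_s": 5},
    "Case 2": {"records": 10_000_000, "processing_s": 17},
}


def input_rate(records, batch_seconds):
    """Average records arriving per second."""
    return records / batch_seconds


def busy_fraction(processing_s, batch_seconds):
    """Share of the interval spent processing; below 1.0 means the job keeps up."""
    return processing_s / batch_seconds


summary = {
    name: {
        "records_per_s": input_rate(c["records"], BATCH_SECONDS),
        "busy_fraction": busy_fraction(c["processing_s"], BATCH_SECONDS),
    }
    for name, c in cases.items()
}
```

Under that reading, Case 1 averages 20,000 records per second and Case 2 about 333,000 records per second, and both finish well inside the 30-second interval.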
CONTENT
04
Next Work
• Support more scenarios
  • Join of multiple streams in a time window
  • More streaming frameworks: Flink, Beam, etc.
• Faster
  • Spark upgrade, Structured Streaming
  • Faster cache
• HA
  • No single point of failure
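To illustrate the planned join of multiple streams in a time window, here is a small plain-Python sketch of a tumbling-window join on a shared key. It describes the general technique, not how OCSP will implement it; the stream contents, the 60-second window, and the `window_join` name are assumptions.

```python
from collections import defaultdict


def window_join(left, right, window_s):
    """Join (key, timestamp, value) events that share a key and a tumbling window."""
    buckets = defaultdict(list)
    for key, ts, value in right:
        buckets[(key, ts // window_s)].append(value)

    joined = []
    for key, ts, value in left:
        window_index = ts // window_s
        for other in buckets.get((key, window_index), []):
            joined.append((key, window_index * window_s, value, other))
    return joined


# Made-up streams: location signals and business content events keyed by imsi.
signals = [("460000000000001", 5, "cell-17"), ("460000000000002", 70, "cell-03")]
business = [("460000000000001", 20, "video"), ("460000000000002", 10, "music")]

joined = window_join(signals, business, window_s=60)
```

Only the first imsi produces a joined row, because its two events fall in the same 60-second window while the second imsi's events fall in different windows.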
Open Source
https://github.com/OCSP
Thanks
