“Transform Real Time Data 
into Real Time Decisions” 
Mayur Rustagi (@mayur_rustagi)
OPEN SOURCE 
CUSTOMERS PARTNERSHIPS 
What steered us into Pig 
• DSL for ETL 
• Rich Operator Library 
• Extendable 
• Pluggable 
• Powerful ETL 
What steered us into Spark 
• Powerful in-memory Processing 
• Simple operators on data
• Debuggable API 
• Efficient Execution 
• Universally distributed 
Operator Mapping 
Pig                                                     Spark
Load                                                    HadoopRDD
Store                                                   saveAsObjectFile
Filter                                                  MappedRDD + filter func
GroupBY (local rearrange, global rearrange & package)   Sort + Group by
….                                                      …
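The GroupBY row can be read as a three-stage pipeline. The sketch below is plain Python and involves neither Pig nor Spark; the function names are hypothetical, borrowed from the stage names on the slide, and it assumes the stages tag each record with its key, bring equal keys together by sorting, and then collect the records of each key into a bag.

```python
from itertools import groupby
from operator import itemgetter


def local_rearrange(records, key_fn):
    """Tag each record with its grouping key, giving (key, record) pairs."""
    return [(key_fn(record), record) for record in records]


def global_rearrange(keyed):
    """Bring equal keys next to each other.

    A stable sort stands in for the shuffle here (an assumption of this sketch).
    """
    return sorted(keyed, key=itemgetter(0))


def package(sorted_keyed):
    """Collect the records that share a key into one (key, bag) pair."""
    return [
        (key, [record for _, record in group])
        for key, group in groupby(sorted_keyed, key=itemgetter(0))
    ]


def group_by(records, key_fn):
    """Chain the three stages: local rearrange, global rearrange, package."""
    return package(global_rearrange(local_rearrange(records, key_fn)))


rows = [("alice", 3), ("bob", 5), ("alice", 7)]
grouped = group_by(rows, key_fn=itemgetter(0))
```

Because `itertools.groupby` only merges adjacent items, the sort step is what makes the final packaging correct, which mirrors the "Sort + Group by" entry in the Spark column.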
Current Flow 
Filter Code implementation 
• https://bitbucket.org/SigmoidDev/spork/src/80a3e4626e4504c1829568942e0690abc79d239a/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/FilterConverter.java?at=spork-1.0
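As a rough illustration of the idea behind a filter converter (this is not the code in the linked FilterConverter.java), the sketch below wraps a Pig-style predicate in a strict boolean function and applies it to a sequence of rows. Everything here is an assumption made for illustration: rows are modelled as Python tuples, `make_filter_func` and `convert_filter` are hypothetical names, and a null (`None`) predicate result is treated as "drop the row".

```python
from typing import Callable, Iterable, List, Optional, Tuple

Row = Tuple  # a Pig tuple modelled as a plain Python tuple (assumption)


def make_filter_func(predicate: Callable[[Row], Optional[bool]]) -> Callable[[Row], bool]:
    """Wrap a predicate so it always returns a strict bool.

    A None result (standing in for a Pig null) drops the row.
    """
    def filter_func(row: Row) -> bool:
        return predicate(row) is True
    return filter_func


def convert_filter(rows: Iterable[Row], predicate: Callable[[Row], Optional[bool]]) -> List[Row]:
    """Apply the wrapped predicate to every row and keep the matches."""
    return list(filter(make_filter_func(predicate), rows))


def value_over_18(row: Row) -> Optional[bool]:
    """Example predicate: null in, null out; otherwise compare to 18."""
    return None if row[1] is None else row[1] > 18


rows = [("a", 10), ("b", None), ("c", 42)]
kept = convert_filter(rows, value_over_18)
```

With an RDD in place of the list, the same wrapped function would be what gets handed to the filter operation, which is the "filter func" half of the Filter row in the operator mapping.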
Issues 
• Scaling 
• Performance 
• Spark Specific Operators (Cache) 
• Pig on Spark unit tests
• Some specific joins and the rank operation
Contribute 
• Pig on Spark Umbrella Jira 
• https://issues.apache.org/jira/browse/PIG-4059
• https://github.com/sigmoidanalytics/spork
• Issues 
What else is cool 
CloudFlux                 SigmaStream
Cloud Deployment          PIG/SQL Like DSL
Fault Tolerance           Rich Stream operators
AutoScaling               Multiple Data source/Sink
Programmatic interface    Add custom Operators
Cloud Agnostic            Apache Spark Based
Apache License            Apache License
Visit us @ P-15, Strata + Hadoop World 
Thank You 
US Office
1343 Kingfisher Way
Sunnyvale, CA, 94087
India Office
Gulmohar Enclave Road,
Silver Spring Layout, Munnekollal
Bengaluru, Karnataka 560037
+1 (760) 203 3257
contact@sigmoidanalytics.com
Pig on Spark
