RUNNING SPARK AND MAPREDUCE
TOGETHER IN PRODUCTION
David Chaiken, CTO of Altiscale
chaiken@altiscale.com
#HadoopSherpa
2
AGENDA
• Why run MapReduce and Spark together in production?
• What about H2O, Impala, and other memory-intensive
frameworks?
• Batch + Interactive = Challenges
• Specific issues and solutions
• Ongoing Challenges: Keeping Things Running
• Perspective: Hadoop as a Service versus DIY*
* do it yourself
ALTISCALE PERSPECTIVE:
INFRASTRUCTURE NERDS
• Experienced Technical Yahoos
• Raymie Stata, CEO. Former Yahoo! CTO,
advocate of Apache Software Foundation
• David Chaiken, CTO.
Former Yahoo! Chief Architect
• Charles Wimmer, Head of Operations.
Former Yahoo! SRE
• Hadoop as a Service, built and managed by Big Data,
SaaS, and enterprise software veterans
• Yahoo!, Google, LinkedIn, VMWare, Oracle, ...
3
4
SOLVED:
COST-EFFECTIVE DATA SCIENCE AT SCALE
But how do you make it
easier for data scientists?
Two bad options:
1. Use Hadoop directly, using unfamiliar and unproductive command-line tools and APIs
2. Use Hadoop indirectly, via a back-and-forth with data engineers who translate needs into Hadoop programs
[Diagram: Data Scientist’s Workflow: Source Data (CSV) → Cleansing → Flattening → Hive → Exploration → Modeling → Production → Serving]
5
COMMON HADOOP WORKFLOW
[Diagram: the Explore → Flatten → Model loop within the Exploration stage]
6
ENTER SPARK. . . AND IMPALA AND H2O
• Interactive, iterative
analysis
• Quick turns
• Memory heavy
7
DOES THIS MEAN THAT MAPREDUCE
DOESN’T MATTER ANYMORE?
HA!
(Don’t believe the hype.)
BIG DATA MODELING WORKFLOW
IT MATTERS SO MUCH THAT YOU WANT BOTH ON ONE CLUSTER.
[Diagram: Source Data (CSV) → Cleansing → Flattening → Hive → Exploration → Modeling → Serving → Production]
8
THE CHALLENGE. . .
9
“Why is my Spark job not starting?”
“Why is my Spark job consuming so many resources?”
Resource conflicts!
SPECIFIC ISSUES
AND SOLUTIONS
10
INTERACTIVE:
INCREASE CONTAINER SIZE
Challenge: Memory-intensive systems take as much
local DRAM as is available.
Solutions:
• Spark and H2O: Increase YARN container memory size (see the sketch below)
• Impala: Box it in using operating system containers
11
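A minimal sketch of what increasing container memory can look like for Spark on YARN in this era (Hadoop 2.x, Spark 1.x). The values and the job script name are illustrative assumptions, not recommendations:

  # yarn-site.xml (illustrative values): raise the per-container ceiling
  yarn.nodemanager.resource.memory-mb  = 49152
  yarn.scheduler.maximum-allocation-mb = 24576

  # Spark 1.x on YARN: request executors that fit under that ceiling
  spark-submit --master yarn-client \
    --num-executors 8 \
    --executor-memory 20g \
    --conf spark.yarn.executor.memoryOverhead=2048 \
    my_analysis.py    # hypothetical job script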
HIVE + INTERACTIVE:
WATCH OUT FOR LARGE CONTAINER SIZE
• Caution: Larger YARN container settings for interactive
jobs may not be right for batch systems like Hive
• Container size needs to combine vcores and memory
(illustrative settings follow below):
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.cpu-vcores ...
12
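One hedged sketch of how the memory and vcore settings fit together on a worker node. Values are illustrative (roughly a 48 GB, 16-core node); settings sized for long-lived interactive sessions may be too coarse for Hive's many small tasks:

  # yarn-site.xml (illustrative values only)
  yarn.nodemanager.resource.memory-mb      = 49152
  yarn.nodemanager.resource.cpu-vcores     = 16
  yarn.scheduler.minimum-allocation-mb     = 1024
  yarn.scheduler.maximum-allocation-mb     = 24576
  yarn.scheduler.maximum-allocation-vcores = 8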
HIVE + INTERACTIVE:
WATCH OUT FOR FRAGMENTATION
• Caution: Scheduling interactive systems alongside
batch systems like Hive may result in fragmentation
• Interactive systems may require all-or-nothing scheduling
• Batch jobs with lots of small tasks may starve interactive jobs
13
HIVE + INTERACTIVE:
WATCH OUT FOR FRAGMENTATION
Solutions:
• Reserve interactive nodes before starting batch jobs
• Reduce interactive container size (if the algorithm permits)
• Node labels (YARN-2492) and gang scheduling (YARN-624); a node-label sketch follows below
14
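Node labels were still maturing when this talk was given (YARN-2492); in Hadoop releases that ship them, reserving nodes for interactive frameworks looks roughly like the sketch below. The label, queue, and host names are assumptions for illustration:

  # yarn-site.xml
  yarn.node-labels.enabled = true

  # Tag the nodes to reserve, then restrict a queue to that label
  yarn rmadmin -addToClusterNodeLabels "interactive"
  yarn rmadmin -replaceLabelsOnNode "worker-07.example.com=interactive"

  # capacity-scheduler.xml (queue "spark" is hypothetical)
  yarn.scheduler.capacity.root.spark.accessible-node-labels = interactive
  yarn.scheduler.capacity.root.spark.accessible-node-labels.interactive.capacity = 100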
ONGOING
CHALLENGES
Keeping things running. . .
15
16
CHALLENGE: SECURITY
• Challenge: User management is not uniform
• MapReduce: collaboration requires getting groups right
• Hive: proxyuser settings have to be right for hiveserver2 (sketch below)
• Spark: application owner versus connected users
• Impala: “I just gotta be me!”
• As usual, watch out for cluster administrator accounts!
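For the hiveserver2 case, a minimal sketch of the proxyuser wiring; the host and group names are illustrative assumptions:

  # core-site.xml: let the hive service user impersonate the people who connect
  hadoop.proxyuser.hive.hosts  = hiveserver2.example.com
  hadoop.proxyuser.hive.groups = analysts,data-eng

  # hive-site.xml: run queries as the connected user rather than as "hive"
  hive.server2.enable.doAs = true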
• Challenge: Port and Protocol Management
• Best security practice: open specific ports for specific protocols
• Spark: “I just gotta be free!”
• Spark improved between versions 1.0.2 and 1.1.0,
but port configuration is still confusing (sketch below)
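As of Spark 1.1, most of the service ports can be pinned so that firewall rules only have to cover a known range. A hedged sketch in spark-defaults.conf style; the port numbers are arbitrary illustrations:

  spark.ui.port            4040
  spark.driver.port        40000
  spark.blockManager.port  40010
  spark.fileserver.port    40020
  spark.broadcast.port     40030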
17
CHALLENGE: WEB SERVING
• How to provide interactive services to business users?
• Concerns: security, variable resources, latency, availability
• Keep serving infrastructure separate from Hadoop
18
CHALLENGE:
RESOURCE ATTRIBUTION (BILLING)
• Accounting for long-running Spark, H2O, Impala clusters?
• Is reserving resources the same as using the resources?
• Trade-off: availability/response time vs. oversubscription (a rough accounting sketch follows).
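One rough way to attribute actual use rather than reservation, assuming a Hadoop release whose ResourceManager REST API reports per-application memorySeconds and vcoreSeconds (the host and user names are illustrative):

  curl -s 'http://resourcemanager.example.com:8088/ws/v1/cluster/apps?user=datascience'

Each "app" entry in the response carries memorySeconds and vcoreSeconds, which can be compared against what the long-running Spark, H2O, or Impala session had reserved.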
19
CHALLENGE:
STABILITY VERSUS AGILITY
• Never-ending story: latest hotness versus SLAs*
• New system stability curve. Example…
• SPARK-1476: 2GB limit in Spark for blocks
• Interoperation issues. Example…
• IMPALA-1416: Queries fail with metastore exception after
upgrade and compute stat
• HIVE-8627: Compute stats on a table from Impala caused the
table to be corrupted
• Many issues come down to YARN container size and
JVM heap size configuration (see the sketch below)
* service level agreements
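As noted above, many of these issues come down to fitting the JVM heap inside the YARN container with headroom. An illustrative sketch (values are assumptions, not recommendations):

  # MapReduce: keep -Xmx well under the container size
  mapreduce.map.memory.mb = 4096
  mapreduce.map.java.opts = -Xmx3276m

  # Spark on YARN (1.x): the container must hold the heap plus an overhead
  spark.executor.memory              4g
  spark.yarn.executor.memoryOverhead 512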
20
PERSPECTIVE: HADOOP AS A SERVICE
VERSUS DIY (DO IT YOURSELF)
• Data Scientists and Data Engineers:
use the right tools for the right job
• Data Scientists and Data Engineers:
don’t spend your time on cluster maintenance
• Hadoop As A Service: have your cake and eat it, too
• Benefit from the experiences of other customers
• One size does not fit all, but one configuration schema does
• Leave the maintenance to us infrastructure nerds
QUESTIONS? COMMENTS?
21


Editor's Notes

  • #2: http://2015.hadoopsummit.org/san-jose/agenda/ Abstract: Clusters must be tuned properly to run memory-intensive systems like Spark, H2O, and Impala alongside traditional MapReduce jobs. This talk describes Altiscale's experience running the new memory-intensive systems in production for our customers. We focus on the cluster tuning that we needed to do to create environments that run a mix of processing frameworks reliably and efficiently. Our results show that there's no need to rip and replace MapReduce clusters in favor of Spark, or any other memory-intensive system.
  • #5: Back-and-forth with data engineers: latency, misunderstanding. Good news: newer tools (developed over the last 5 years) eliminate the need for data scientists to use the raw interfaces.
  • #6: Think about nested loops of activity. Inner loop: modeling. Pop out to the outer loop (flattening), e.g. to use a different classifier to get better signal. Outer loop: exploration, looks at source-form data directly. Note that data is often stored twice: source-form data (can always go back to it if you need it) and structured/cleaned data, typically in Hive (more convenient to use this data set in general). From time to time, need to go back to the source-form data to look for signal that may not be in the structured source. Kind of like flattening, but the data is dirtier, harder to understand. Common but not universal workflow; mostly an example for the tips in the rest of the presentation.
  • #7: Back-and-forth with data engineers: latency, misunderstanding. Good news: newer tools (developed over the last 5 years) eliminate the need for data scientists to use the raw interfaces.
  • #8: Meme: Google stopped using Map/Reduce years ago. Reality: there are still lots of M/R jobs running in Google’s infrastructure.
  • #9: The best-of-breed suite also applies to modeling, directly on big data, directly on Hadoop. Over the last few years, huge evolution of tools: the ability to do scale-out modeling directly on top of Hadoop. In the past this used to be relatively rare, e.g. Mahout was fairly difficult, only worthwhile when there’s a lot of benefit. More recently: move modeling off of workstations and directly onto the Hadoop cluster. Tools on top: Hadoop-native, built for scale-out in a big data manner, where Hadoop is strong. Lower: legacy tools that are embracing scale-out computation directly on Hadoop, directly on big data.
  • #10: You want them on one cluster because big data is big. When you have the data in multiple environments, like EMR, you pay a penalty. Your jobs run two times slower because you have to keep moving the data around. Data scientists need a mixed environment. It’s not effective for them to have Spark off on its own cluster. It’s just not how they work. However, the community has not come to grips with mixed workloads yet. It’s a bit unstable and you can see this when you start asking yourself questions like “Why is my Spark job not starting?” or “Why is my Spark job consuming so many resources?” Analogy: OLTP + OLAP = Challenges Map/Reduce (especially hidden under SQL) is still awesome for data cleaning and other tasks that are a high bandwidth game. Spark, H2O, Impala are great for interactive, iterative, “inner-loop” data analysis that is a low latency game. Map/Reduce tends to generate lots of little tasks; the newer frameworks self-schedule and need lots of DRAM (soon: lots of DRAM and/or Flash) Running both types together causes resource conflicts!
  • #12: Solutions: Increase YARN container memory size for DRAM-intensive systems like Spark and H2O. Use operating system containers to box Impala on datanodes. Note: alignment of diagrams on the next few slides is critical. The diagram could benefit from a legend! Circles instead of squares to avoid the Hermann grid illusion. Stinger (Hive 0.13 + Tez) is intended to be more balanced.
  • #18: At Altiscale, we think that AWS is awesome for web serving – even though we know that AWS is not great for Hadoop.
  • #20: Operating system containers (namespaces + cgroups) can help with container/heap size issues. Wouldn’t it be great if the JVM could ask for more resources instead of putting itself into a GC loop? Interoperation issues aren’t just technical (or even mostly technical): IMPALA and HIVE have to interoperate, but are championed by competitors (Cloudera and Hortonworks). SPARK-1476 details are in https://altiscale.zendesk.com/agent/tickets/1589; IMPALA-1416/HIVE-8627 are in https://altiscale.zendesk.com/agent/tickets/1510.
  • #21: Data scientists should never be spending time getting all of these frameworks to work well together. That’s the job that infrastructure nerds should be doing. Hadoop As A Service: modeled after the internal services of Internet companies