Big Data for Quality Engineers
Ahmed Misbah
Agenda
• Introduction to Big Data
– Problems with traditional Large Scale Systems
– Requirements for the new approach
– Hadoop’s Approach
– Batch Processing and Stream Processing
• Big Data Technologies
– Batch Processing Technologies
– Stream Processing Technologies
• Testing Big Data Solutions
Rules
• Phones silent
• No laptops
• Questions/discussions welcome at any time
• 10-minute break every hour
INTRODUCTION TO BIG DATA
PROBLEMS WITH TRADITIONAL
LARGE SCALE SYSTEMS
Traditional Large Scale Computing
• Traditionally, computation has been
processor-bound:
– Small amounts of data
– Lots of complex processing
• Early solution: Bigger computers!!
– Faster processor(s)
– More memory
Big Data for QAs
Distributed Systems (1/3)
• More computers instead of bigger computers
• Distributed systems evolved
• Use multiple machines for a single job
Distributed Systems (2/3)
“In pioneer days they used oxen for heavy
pulling, and when one ox couldn’t budge a log,
they didn’t try to grow a larger ox. We shouldn’t
be trying for bigger computers, but for more
systems of computers”
Grace Hopper
Distributed Systems (3/3)
Problems with Distributed Systems
(1/2)
• Programming for traditional distributed
systems is complex:
– Keeping data and processes in sync
– Finite bandwidth
– Partial failures
Problems with Distributed Systems
(2/2)
“Failure is the defining difference between
distributed and local programming, so you
have to design distributed systems with the
expectation of failure”
Ken Arnold, CORBA Designer
The Data Bottleneck (1/4)
• Moore’s Law has held firm for over 40 years:
– Processing power doubles roughly every two years
– Processing speed is no longer the problem
• Getting the data to the processor becomes the
bottleneck
The Data Bottleneck (2/4)
• Example:
– Typical disk data transfer rate: 75MB/sec
– Time taken to transfer 100GB of data to the
processor ≈ 22 minutes
– Actual time will be worse since most servers have
less than 100GB of RAM
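The arithmetic behind this example can be checked with a short sketch. It assumes decimal units (1 GB = 1000 MB) and, for the second figure, a hypothetical layout in which the same data is spread evenly over 10 disks read in parallel with no coordination overhead; neither assumption comes from the slides.

```python
def transfer_minutes(data_gb: float, rate_mb_per_sec: float, mb_per_gb: int = 1000) -> float:
    """Minutes needed to read data_gb sequentially at rate_mb_per_sec."""
    seconds = data_gb * mb_per_gb / rate_mb_per_sec
    return seconds / 60


# One disk feeding one processor, as in the slide
single_disk = transfer_minutes(100, 75)

# Assumption: the same 100 GB split evenly across 10 disks read in parallel
ten_disks = transfer_minutes(100 / 10, 75)

print(f"{single_disk:.1f} min on one disk, {ten_disks:.1f} min across 10 disks")
# prints "22.2 min on one disk, 2.2 min across 10 disks"
```

The second number is the motivation for the rest of the deck: reading in parallel from many disks shrinks the transfer time, provided the computation runs next to each disk.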
The Data Bottleneck (3/4)
• Typically, data is stored in a central location
• Data is copied to the processors at runtime
• Acceptable for limited amounts of data
The Data Bottleneck (4/4)
• Modern systems have much more data
– Terabytes/day
– Petabytes/year
• A new approach is required
REQUIREMENTS FOR THE NEW
APPROACH
Requirements for the new approach
(1/2)
• Partial failure support:
– Failure of a component should result in graceful
degradation of application performance
– It should not lead to a complete failure of the entire
system
• Data recoverability:
– If a component of the system fails, its workload should
be assumed by still-functioning units in the system
• Component recovery:
– If a component fails and then recovers, it should be able to
rejoin the system without requiring a full system restart
Requirements for the new approach
(2/2)
• Consistency:
– Component failures during execution of a job
should not affect the outcome of the job
• Scalability:
– Adding load to the system should result in graceful
degradation in performance and not the failure of
the entire system
– Increasing resources should support a proportional
increase in load capacity
HADOOP’S APPROACH
A new approach to distributed
computing!
• Distribute data when the data is being stored
• Run computation where the data is stored
Core Concept (1/3)
• Distribute the data as it is initially stored in the
system
• Individual nodes can work on the data local to
those nodes
• No data transfer over the network is required
for initial processing
Core Concept (2/3)
• Applications are written in high-level code
• Developers need not worry about network
programming or low-level infrastructure
• Nodes talk to each other as little as possible
Core Concept (3/3)
• Data is spread among machines in advance
• Computation happens where the data is
stored
• Data is replicated multiple times on the
system for increased availability and reliability
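A minimal sketch of the two ideas above: replicate blocks when they are stored, then run each task on a node that already holds the data. The function names, the round-robin placement and the least-loaded tie-break are illustrative assumptions, not Hadoop's actual placement or scheduling policy (HDFS placement is rack-aware, for example). The replication factor of 3 matches the HDFS default.

```python
from collections import defaultdict


def place_blocks(block_ids, nodes, replication=3):
    """Give each block `replication` distinct nodes, rotating through the node list."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i, block in enumerate(block_ids)
    }


def schedule_local(placement):
    """Assign each block's task to the least-loaded node that already holds a replica."""
    load = defaultdict(int)
    assignment = {}
    for block, replicas in placement.items():
        chosen = min(replicas, key=lambda node: load[node])
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical cluster of four nodes storing four blocks
placement = place_blocks(["b0", "b1", "b2", "b3"], ["n1", "n2", "n3", "n4"])
assignment = schedule_local(placement)
for block, node in assignment.items():
    print(block, "->", node, "(replicas on", ", ".join(placement[block]) + ")")
```

Every task lands on a node that stores its block, so the initial processing step needs no network transfer.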
Fault Tolerance
• If a node fails, the master will detect the failure
and re-assign the work to a different node on the
system
• Restarting a task does not require
communication with nodes working on other
portions of the data
• If a failed node restarts, it is automatically added
back to the system and assigned a new task
• If a node appears to be running slowly, the
master can redundantly execute another instance
of the same task
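The master's two reactions described above, re-assigning work from a failed node and spotting slow tasks, can be sketched as follows. The function names and the "half the median progress" rule are assumptions chosen for illustration, not the heuristics Hadoop itself uses.

```python
from statistics import median


def reassign_failed(assignment, failed_node, live_nodes):
    """Return a new task-to-node map with the failed node's tasks moved to live nodes."""
    load = {node: 0 for node in live_nodes}
    for node in assignment.values():
        if node in load:
            load[node] += 1
    new_assignment = dict(assignment)
    for task, node in assignment.items():
        if node == failed_node:
            target = min(live_nodes, key=lambda n: load[n])  # least-loaded survivor
            new_assignment[task] = target
            load[target] += 1
    return new_assignment


def find_stragglers(progress, slack=0.5):
    """Tasks whose progress is below `slack` times the median progress."""
    mid = median(progress.values())
    return sorted(task for task, done in progress.items() if done < slack * mid)


# Hypothetical job: node "a" fails while running t1 and t3
before = {"t1": "a", "t2": "b", "t3": "a", "t4": "c"}
after = reassign_failed(before, failed_node="a", live_nodes=["b", "c"])
print(after)

# Candidates for a redundant (speculative) second instance
print(find_stragglers({"t1": 0.9, "t2": 0.8, "t3": 0.2}))
```

Tasks on healthy nodes are left untouched, which reflects the point that restarting a task requires no coordination with the rest of the job.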
BATCH PROCESSING VS STREAM
PROCESSING
Batch Processing
• Also known as History-based processing
• Processing is executed against large data
already stored in some storage medium (e.g.
HDFS or S3)
Stream Processing
• Processing executed against batches of data
coming continuously from a stream
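A small sketch contrasting the two models on the same computation, a word count. The batch version sees all stored records at once; the stream version consumes small batches as they arrive and keeps a running result. The names and the batch size are illustrative assumptions; real engines add windowing, state management and fault tolerance on top of this idea.

```python
from collections import Counter
from itertools import islice


def batch_count(stored_records):
    """Batch: one pass over data that is already fully stored."""
    return Counter(stored_records)


def micro_batches(stream, batch_size):
    """Cut a (possibly unbounded) stream into consecutive small batches."""
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def stream_count(stream, batch_size):
    """Stream: update a running count per micro-batch and emit a snapshot each time."""
    running = Counter()
    for chunk in micro_batches(stream, batch_size):
        running.update(chunk)
        yield dict(running)


events = ["a", "b", "a", "c", "a"]
for snapshot in stream_count(events, batch_size=2):
    print(snapshot)
print(dict(batch_count(events)))
```

For a finite input, the last streaming snapshot equals the batch result. That equality is a handy oracle when testing a streaming job against a batch job that implements the same logic.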
BIG DATA TECHNOLOGIES
Batch Processing Technologies (1/2)
• Hadoop
Batch Processing Technologies (2/2)
• Spark
Stream Processing Technologies (1/2)
• Spark Streaming
Stream Processing Technologies (2/2)
• Apache Storm
• Apache Flink
Supporting Technologies
• Apache Kafka
• Akka
TESTING BIG DATA TECHNOLOGIES
Hadoop MapReduce (1/3)
• LocalJobRunner:
– Does not require any Hadoop daemons to be
running
– Uses the local file system instead of HDFS
• MRUnit:
– Built on top of JUnit
– Works with Mockito Framework to provide
required mock objects
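Both tools rest on the same idea: exercise the map and reduce logic in a single process, without a cluster. The sketch below illustrates that idea in plain Python. It is an analogue only: MRUnit and LocalJobRunner are Java tools, and none of the names here (`word_mapper`, `sum_reducer`, `run_local`) belong to their APIs.

```python
from itertools import groupby
from operator import itemgetter


def word_mapper(_key, line):
    """Map: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(word, counts):
    """Reduce: sum all counts seen for one word."""
    yield word, sum(counts)


def run_local(mapper, reducer, records):
    """In-process stand-in for a job run: map, sort/group by key, then reduce."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    mapped.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


# Unit-level check of the mapper alone
assert list(word_mapper(0, "Big data big")) == [("big", 1), ("data", 1), ("big", 1)]

# End-to-end check of mapper + shuffle + reducer on a tiny input
result = run_local(word_mapper, sum_reducer, [(0, "big data"), (1, "Big tests")])
print(result)  # prints [('big', 2), ('data', 1), ('tests', 1)]
```

Testing the mapper and reducer separately first, then the whole pipeline on a tiny input, keeps failures easy to localize before anything runs on a cluster.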
Hadoop MapReduce (2/3)
• Apache Hue:
– Is an open source Web interface for analyzing data
with Apache Hadoop
Hadoop MapReduce (3/3)
• MapReduce Job Tracker Web Interface
Apache Spark (1/3)
• Run locally using Eclipse or IntelliJ
• Run using Spark Standalone
• Spark Testing Base:
– For implementing unit tests for Spark code
• Spark Validator:
– A library you can include in your Spark job to validate
the counters and perform operations on success
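The counter-validation idea can be illustrated with a plain-Python sketch: after a job finishes, compare its counters against expected ranges and only proceed when every rule passes. This is not the Spark Validator API; the function name, the rule format and the counter names below are all hypothetical.

```python
def validate_counters(counters, rules):
    """Return a list of human-readable failures; an empty list means the job passed.

    `rules` maps a counter name to an inclusive (low, high) range.
    """
    failures = []
    for name, (low, high) in rules.items():
        value = counters.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif not (low <= value <= high):
            failures.append(f"{name}: {value} outside [{low}, {high}]")
    return failures


# Hypothetical counters collected from a finished job
counters = {"records_read": 1000, "records_written": 990, "parse_errors": 10}
rules = {
    "records_read": (1, float("inf")),   # job must have read something
    "parse_errors": (0, 20),             # tolerate a small number of bad rows
    "records_written": (950, 1000),      # output should be close to input
}

failures = validate_counters(counters, rules)
if not failures:
    print("validation passed, safe to publish output")
else:
    print("validation failed:", failures)
```

Expressing expectations as ranges rather than exact values keeps the check stable when input volume varies from run to run.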
Apache Spark (2/3)
• Spark UI and History Server:
Apache Spark (3/3)
• Apache Zeppelin (using Sparklet on
Windows)
Performance Testing Tools
• Gatling
• Yahoo Cloud Serving Benchmark (YCSB)
• Jumbune
• Netflix Inviso
• TestDFSIO
• TeraSort
• NNBench
• MRbench
• BigBench
More tools
• https://github.com/Intel-bigdata/HiBench
• https://github.com/yahoo/streaming-benchmarks
• https://github.com/tdas/spark-streaming-benchmark
• https://github.com/BBVA/spark-benchmarks
• https://github.com/databricks/spark-perf
Monitoring Tools
• https://ambari.apache.org/
• https://github.com/groupon/sparklint
• https://github.com/linkedin/dr-elephant
• https://github.com/ibm-research-ireland/sparkoscope
• https://supergloo.com/spark-monitoring/spark-performance-monitoring-tools/
Important Considerations
• Number of clusters/nodes
• Hardware Specifications (HDD or SSD)
• Application/Environment Configurations (no.
of cores, no. of partitions, no. of threads,
disk/memory persistence, etc.)
• Data format (Text, Sequence, Avro, etc.)
• Data size
• Compression (Snappy, Gzip, etc.)
• Number of Reducers (MapReduce)
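One way to turn these considerations into a test plan is to enumerate every combination of the dimensions you intend to vary. The sketch below does that with a Cartesian product; the dimension names and values are made-up examples, not recommendations from the slides.

```python
from itertools import product


def build_matrix(dimensions):
    """Expand {dimension: [values]} into one dict per combination of values."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*dimensions.values())]


# Hypothetical dimensions for a performance test campaign
dimensions = {
    "nodes": [3, 6],
    "data_format": ["text", "sequence", "avro"],
    "compression": ["none", "snappy", "gzip"],
    "data_size_gb": [10, 100],
}

runs = build_matrix(dimensions)
print(len(runs))  # prints 36 (2 * 3 * 3 * 2)
print(runs[0])
```

The count grows multiplicatively with each added dimension, so in practice it is common to fix most dimensions and vary one or two at a time.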
Spark Best Practices
• https://medium.com/teads-engineering/spark-performance-tuning-from-the-trenches-7cbde521cf60
• https://dzone.com/articles/apache-spark-performance-tuning-degree-of-parallel
• https://databricks.com/glossary/spark-tuning
• https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-1/
• https://www.bi4all.pt/en/news/en-blog/apache-spark-best-practices/
Sampling
• Sampling is defined as: “the act, process, or
technique of selecting a representative part of
a population for the purpose of determining
parameters or characteristics of the whole
population” - Merriam-Webster dictionary
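For large or streaming datasets, one common way to draw a uniform random sample in a single pass is reservoir sampling (Algorithm R). The slide only defines sampling; the choice of this particular technique, the function name and the `seed` parameter are assumptions made for illustration.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return k records chosen uniformly at random from an iterable of unknown length.

    If the iterable has fewer than k records, all of them are returned.
    """
    rng = random.Random(seed)
    sample = []
    for index, record in enumerate(records):
        if index < k:
            sample.append(record)        # fill the reservoir first
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                sample[slot] = record     # replace with probability k / (index + 1)
    return sample


# Hypothetical usage: pick 5 of 1,000,000 record ids without loading them all
picked = reservoir_sample(range(1_000_000), k=5, seed=42)
print(picked)
```

Fixing the seed makes the sample reproducible, which helps when a test failure needs to be re-run against exactly the same subset of data.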
Useful Resources (1/3)
• Benchmarking Hadoop and HBase on Violin
• Benchmarking Cassandra on Violin
• http://blog.cloudera.com/blog/2014/11/bigbench-toward-an-industry-standard-benchmark-for-big-data-analytics/
Useful Resources (2/3)
• http://blog.cloudera.com/blog/2015/08/ycsb-the-open-standard-for-nosql-benchmarking-joins-cloudera-labs/
• https://discuss.zendesk.com/hc/en-us/articles/200864057-Running-DFSIO-MapReduce-benchmark-test
• http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
Useful Resources (3/3)
• http://bdaafall2015.readthedocs.io/en/latest/nnbench.html
• http://bdaafall2015.readthedocs.io/en/latest/mrbench.html
Thank You!

