Big Data Testing Approach - Rohit Kharabe
Overview (pipeline diagram): Data Source (RDBMS, web logs, social media, etc.) -> Data Lake (HDFS cluster) -> Refined Data (HDFS cluster) -> Enterprise Data Warehouse (data factory for query and analysis) -> BI.
Processing along the pipeline: data integration and refinement of structured data (Python, ETL), data synthesis, and data preparation.
Validation checkpoints:
1) Data staging validation
2) "MapReduce" process validation (clustering, data aggregation or segregation rules, key-value pair validation) and ETL process validation
3) Algorithm / output validation
4) Report / business requirements validation
Data Staging Validation
The first stage involves process validation:
1) Data from various sources such as RDBMS, web logs, social media, etc. should be validated to make sure the correct data is pulled into the system
2) Compare the source data with the data pushed into the Hadoop system to make sure they match
3) Verify that the right data is extracted and loaded into the correct HDFS location
4) Tools such as Talend and Datameer can be used for data staging validation
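The comparison in points 1-3 can be scripted. Below is a minimal Python sketch of that reconciliation, assuming a hypothetical MySQL `customers` table and an HDFS landing directory; table names, paths, and credentials are illustrative only.

```python
# Minimal staging-validation sketch: reconcile row counts between a
# hypothetical MySQL source table and the files landed in HDFS.
import subprocess
import mysql.connector

def source_row_count() -> int:
    conn = mysql.connector.connect(host="dbhost", user="qa", password="***",
                                    database="sales")
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM customers")  # hypothetical table
    (count,) = cur.fetchone()
    conn.close()
    return count

def hdfs_row_count(path: str) -> int:
    # Concatenate the landed files and count records (one record per line).
    out = subprocess.run(["hdfs", "dfs", "-cat", path],
                         capture_output=True, text=True, check=True)
    return len(out.stdout.splitlines())

if __name__ == "__main__":
    src = source_row_count()
    tgt = hdfs_row_count("/data/staging/customers/*")  # hypothetical path
    assert src == tgt, f"Row count mismatch: source={src}, HDFS={tgt}"
    print(f"Staging validation passed: {src} rows")
```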
"MapReduce" Validation
The second stage is validation of the "MapReduce" process. Here the tester verifies the business logic on a single node and then across multiple nodes, ensuring that:
1) The MapReduce process works correctly
2) Data aggregation or segregation rules are implemented on the data
3) Key-value pairs are generated correctly
4) The data is validated after the MapReduce process completes
Big data tools used around MapReduce include Hadoop, Spark, Hive, Pig, Cascading, Oozie, Kafka, S4, MapR, and Flume.
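One way to check points 2-4 is to recompute the aggregation rule directly from the source data and compare it with what the job emitted. A minimal Python sketch, assuming hypothetical HDFS paths and a tab-separated `key<TAB>total` output format:

```python
# Post-MapReduce validation sketch: recompute the aggregation rule from the
# raw input and compare it with the job's output part files. Paths and the
# "key,amount" input format are illustrative assumptions.
import subprocess
from collections import defaultdict

def read_hdfs_lines(path: str):
    out = subprocess.run(["hdfs", "dfs", "-cat", path],
                         capture_output=True, text=True, check=True)
    return out.stdout.splitlines()

# Recompute the expected per-key totals directly from the source data.
expected = defaultdict(float)
for line in read_hdfs_lines("/data/lake/transactions/*"):
    key, amount = line.split(",")
    expected[key] += float(amount)

# Read what the MapReduce job actually emitted.
actual = {}
for line in read_hdfs_lines("/data/refined/tx_totals/part-*"):
    key, total = line.split("\t")
    actual[key] = float(total)

assert set(actual) == set(expected), "Key sets differ between input and output"
for key, total in actual.items():
    assert abs(total - expected[key]) < 1e-6, f"Aggregation mismatch for {key}"
print(f"MapReduce output validated for {len(actual)} keys")
```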
Testing Methods for Hadoop MapReduce Processes
1) MRUnit - Unit testing for MR jobs:
MRUnit lets users define key-value pairs to be given to map and reduce functions, and it tests that the correct key-value pairs are emitted from each of these functions.
2) Local job runner testing - Running MR jobs on a single machine in a single JVM:
The local job runner lets you run Hadoop on a local machine, in one JVM, making MR jobs a little easier to debug when a job fails.
3) Pseudo-distributed testing - Running MR jobs on a single machine using Hadoop:
A pseudo-distributed cluster is composed of a single machine running all Hadoop daemons. This cluster is still relatively easy to manage (though harder than the local job runner) and tests integration with Hadoop better than the local job runner does.
4) Full integration testing - Running MR jobs on a QA cluster:
Used to test MR jobs by running them on a QA cluster composed of at least a few machines. By running MR jobs on a QA cluster, you test all aspects of both the job and its integration with Hadoop.
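MRUnit itself is a Java library; the same idea (feed known key-value pairs to the map and reduce functions and assert on the emitted pairs) can be sketched with plain Python unittest. The word-count mapper and reducer below are hypothetical stand-ins used only for illustration.

```python
# MRUnit-style unit test of mapper/reducer logic, written as a plain Python
# sketch. The map/reduce functions are hypothetical word-count implementations.
import unittest

def map_fn(_key, line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Sum the counts emitted for a single word.
    yield word, sum(counts)

class WordCountTest(unittest.TestCase):
    def test_mapper_emits_expected_pairs(self):
        emitted = list(map_fn(None, "Big Data big data"))
        self.assertEqual(emitted,
                         [("big", 1), ("data", 1), ("big", 1), ("data", 1)])

    def test_reducer_aggregates_counts(self):
        emitted = list(reduce_fn("big", [1, 1, 1]))
        self.assertEqual(emitted, [("big", 3)])

if __name__ == "__main__":
    unittest.main()
```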
Output Validation Phase
The third stage of Big Data testing is the output validation process. The output data files are generated and ready to be moved to an EDW (Enterprise Data Warehouse) or any other system, based on the requirement.
Activities in the third stage include:
1) Checking that the transformation rules are correctly applied
2) Checking data integrity and the successful data load into the target system
3) Checking that there is no data corruption by comparing the target data with the HDFS file system data
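A minimal sketch of activity 1, assuming hypothetical refined CSV records of the form `order_id,quantity,unit_price,total_price`, where the transformation rule under test is `total_price = quantity * unit_price`:

```python
# Transformation-rule check on the refined output. The record layout and the
# HDFS path are illustrative assumptions, not a prescribed format.
import csv
import io
import subprocess

out = subprocess.run(["hdfs", "dfs", "-cat", "/data/refined/orders/part-*"],
                     capture_output=True, text=True, check=True)

bad = []
for row in csv.reader(io.StringIO(out.stdout)):
    order_id, quantity, unit_price, total_price = row
    expected = round(int(quantity) * float(unit_price), 2)
    if abs(float(total_price) - expected) > 0.005:
        bad.append(order_id)

assert not bad, f"Transformation rule violated for orders: {bad[:10]}"
print("Transformation rule validated for all refined records")
```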
Report Validation
In this final phase of testing, reports are checked against the target data warehouse; the report data should match the corresponding warehouse data.
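A minimal sketch of such a reconciliation, assuming a hypothetical report export in CSV and an EDW reachable through pyodbc; the DSN, table, and column names are illustrative only:

```python
# Report-validation sketch: compare a total from a BI report export with the
# same figure computed directly from the warehouse. All names are hypothetical.
import csv
import pyodbc

# Total revenue as shown in the exported BI report.
with open("monthly_revenue_report.csv", newline="") as f:
    report_total = sum(float(row["revenue"]) for row in csv.DictReader(f))

# The same figure computed directly from the warehouse fact table.
conn = pyodbc.connect("DSN=edw;UID=qa;PWD=***")
cur = conn.cursor()
cur.execute("SELECT SUM(revenue) FROM fact_sales WHERE fiscal_month = '2016-06'")
(warehouse_total,) = cur.fetchone()

assert abs(report_total - float(warehouse_total)) < 0.01, (
    f"Report total {report_total} does not match warehouse {warehouse_total}")
print("Report figures reconcile with the data warehouse")
```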
Pipeline stages and example tools (tooling diagram):
- Data sources: Oracle, NoSQL stores, log files, social media, etc.
- Data lake: Apache Hadoop HDFS, MapR-FS, Cloudera
- Refined data: Apache Hadoop HDFS, MapR-FS, HBase, VoltDB
- Enterprise Data Warehouse and BI
- Data integration and refinement (ingestion): Python, Flume, Sqoop, LogStash, etc.
- Data synthesis: MapReduce, Spark, Pig Latin, etc.
- Data preparation: Pig, Hive, R, etc.
- BI / visualization: Tableau, Datameer, d3.js, etc.
Tools for validating pre-Hadoop processing
1) Apache Flume -
Used by enterprises to ingest log files from application servers or other systems
2) Apache Sqoop -
Used to import data from a MySQL or Oracle database into HDFS
3) Apache Hive -
Hive structures data in Hadoop into the form of relational-like tables and allows queries using a subset of SQL. It provides an infrastructure with various tools for easy extraction, transformation, and loading of data, and lets users plug in customized mappers and reducers.
4) Apache Pig -
Pig provides an alternative language to SQL, called Pig Latin, for querying data stored in HDFS
5) NoSQL -
NoSQL databases let the pipeline store and retrieve data with all the features those databases provide. Some available NoSQL databases are CouchDB, MongoDB, Cassandra, Redis, and HBase.
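As an illustration of the Hive item above, a small validation query can be issued through PyHive against HiveServer2. The `weblogs` table, host, and columns below are hypothetical.

```python
# Post-ingestion check of a Hive table: row count plus a spot-check that a
# mandatory field is populated. Connection details and names are assumptions.
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="qa")
cur = conn.cursor()

# Row count of the ingested table.
cur.execute("SELECT COUNT(*) FROM weblogs")
(row_count,) = cur.fetchone()

# Mandatory fields should not be null after ingestion.
cur.execute("SELECT COUNT(*) FROM weblogs WHERE request_ts IS NULL")
(null_ts,) = cur.fetchone()

assert row_count > 0, "weblogs table is empty after ingestion"
assert null_ts == 0, f"{null_ts} rows are missing a timestamp"
print(f"Hive validation passed: {row_count} rows, no null timestamps")
```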
6) MapR-FS -
MapR-FS is a POSIX file system that provides distributed, reliable, high-performance, scalable, full read/write data storage for the MapR Converged Data Platform. MapR-FS supports the HDFS API, fast NFS access, access controls (MapR ACEs), and transparent data compression.
7) Apache Spark -
Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It gives us a comprehensive, unified framework to manage big data processing requirements with data sets that are diverse in nature (text data, graph data, etc.) as well as in source (batch vs. real-time streaming data).
8) HBase -
HBase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases.
9) Lucene/Solr -
Lucene is the most popular open source tool for indexing large blocks of unstructured text; Solr is a search platform built on top of Lucene.
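As an illustration of Spark in this role, here is a short PySpark sketch that reads refined data from HDFS and produces the kind of aggregate that would later feed the EDW; the dataset and column names are assumed, not prescribed.

```python
# PySpark sketch: read a hypothetical refined CSV dataset from HDFS and
# compute a simple per-country aggregate for later validation against the EDW.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("refined-data-validation")
         .getOrCreate())

df = spark.read.csv("hdfs:///data/refined/transactions",
                    header=True, inferSchema=True)

# Example checks: overall record count and per-country totals.
print("record count:", df.count())
df.groupBy("country").agg(F.sum("amount").alias("total_amount")).show()

spark.stop()
```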
Performance Testing
Performance testing for Big Data includes:
Data ingestion and throughput: In this stage, the tester verifies how fast the system can consume data from the various data sources. Testing involves identifying how many messages the queue can process in a given time frame. It also covers how quickly data can be inserted into the underlying data store, for example the insertion rate into a MongoDB or Cassandra database.
Data processing: This involves verifying the speed with which queries or MapReduce jobs are executed. It also includes testing the data processing in isolation when the underlying data store is populated with the data sets, for example running MapReduce jobs on the underlying HDFS.
Sub-component performance: These systems are made up of multiple components, and it is essential to test each of these components in isolation, for example how quickly messages are indexed and consumed, MapReduce job performance, query performance, search, etc.
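A minimal ingestion-throughput sketch in Python, assuming a local MongoDB instance and a hypothetical `events` collection; the document shape and batch size are illustrative only.

```python
# Measure how quickly documents can be inserted into MongoDB (docs/sec).
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client.perf_test.events
events.drop()  # start from an empty collection

docs = [{"seq": i, "payload": "x" * 256} for i in range(100_000)]

start = time.perf_counter()
events.insert_many(docs, ordered=False)
elapsed = time.perf_counter() - start

print(f"Inserted {len(docs)} docs in {elapsed:.2f}s "
      f"({len(docs) / elapsed:,.0f} docs/sec)")
```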
Parameters for Performance Testing
Various parameters to be verified for performance testing are:
Data storage: How data is stored across the different nodes
Commit logs: How large the commit log is allowed to grow
Concurrency: How many threads can perform write and read operations
Caching: Tuning of the cache settings "row cache" and "key cache"
Timeouts: Values for connection timeout, query timeout, etc.
JVM parameters: Heap size, garbage collection algorithms, etc.
MapReduce performance: Sort, merge, etc.
Message queue: Message rate, size, etc.
• Installation Testing - Installation testing is a kind of quality assurance work that focuses on what customers will need to do to install and set up the new big data application successfully. The testing process may involve full, partial, or upgrade install/uninstall processes.
• End-to-End Test Environment Operational Testing - Provides complete end-to-end testing of the application, verifying the process from the first phase, i.e., when data is fetched into the data lake, to the last phase, i.e., output validation of the machine learning algorithms.
• Backup and Restore Testing - Ensures that backup nodes are working correctly, that the cluster is properly configured for backup nodes, and that all nodes in the cluster interact with each other as expected.
• Failover Testing - Ensures that, in the case of a failure, backup nodes come into action and the system resumes its work as before with the help of the backup node, without significant degradation in performance.
• Recovery Testing - Ensures that information can be recovered easily from a backup node in case of failure of a data node.
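A rough sketch of how the failover and recovery checks above could be automated, assuming `hdfs dfsadmin` and `hdfs fsck` are available on the test host; the output parsing is approximate and depends on the Hadoop version in use.

```python
# Failover/recovery check: count live DataNodes and look for under-replicated
# blocks before and after a simulated node failure.
import re
import subprocess

def live_datanodes() -> int:
    report = subprocess.run(["hdfs", "dfsadmin", "-report"],
                            capture_output=True, text=True, check=True).stdout
    match = re.search(r"Live datanodes\s*\((\d+)\)", report)
    return int(match.group(1)) if match else -1

def under_replicated_blocks() -> int:
    fsck = subprocess.run(["hdfs", "fsck", "/"],
                          capture_output=True, text=True, check=True).stdout
    match = re.search(r"Under-replicated blocks:\s*(\d+)", fsck)
    return int(match.group(1)) if match else -1

print("Live DataNodes:", live_datanodes())
print("Under-replicated blocks:", under_replicated_blocks())
# A failover test would stop one DataNode here, wait for re-replication,
# and assert that the under-replicated block count returns to zero in time.
```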
Editor's Notes
Slide 3: A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
Slide 10: Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS).