FAST DATA PROCESSING
WITH APACHE SPARK
A case study of faster data processing using Apache Spark on an AWS cluster.
Case Study by : Aptus Data Labs | http://www.aptusdatalabs.com/
K E Y O B J E C T I V E S A N D S O L U T I O N A P P R O A C H
The client is an Australia-based organisation specialising in data insights.
The client's organisation is responsible for extracting meaningful patterns
from pharmaceutical data. The data is collected regularly from various drug
stores across Australia and contains the drug details prescribed to each
patient. The task was to process multiple batches of this pharmaceutical data.
Reduced Processing Time
A 62% performance boost was achieved: the new solution processed
1.2 billion records in 1 hour, cutting per-record processing time
by up to 62%.
B I G D A T A & A N A L Y T I C S
Each batch could contain up to a billion records.
Processing the data involved multiple order-by and group-by operations.
The result of each record also depended on the results of the preceding
and succeeding records, so every record had to be processed, which was
a bottleneck. The existing solution ran on a 5-node Vertica cluster
that took 2.2 hours to process a billion records.
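The 62% figure quoted above can be checked from these two throughput numbers, assuming "billion records" means 1.0 billion:

```python
old_hours_per_record = 2.2 / 1.0e9   # Vertica: 2.2 h for ~1.0 B records
new_hours_per_record = 1.0 / 1.2e9   # Spark:   1.0 h for 1.2 B records
reduction = 1 - new_hours_per_record / old_hours_per_record
print(f"{reduction:.0%}")  # ≈ 62%
```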
The key objectives were to migrate the existing platform to an Apache Spark
cluster to improve processing time, reduce IT costs, and make it easier to
adopt new features in the future.
P E R F O R M A N C E A N D B E N E F I T S
Reduced IT Costs
The use of open-source technologies effectively reduced IT costs.
Fault Tolerant and HA
The solution handles massive data volumes and is highly scalable and
fault tolerant. The use of a YARN cluster ensures high availability
of the environment.
In order to migrate the environment, several steps were carried out to get
the best out of Apache Spark. The following methodologies were used in the
solution.
- The data is ingested from both database and HDFS sources using the Spark
  Data Source API.
- As the data is in a structured, tabular format, DataFrames are used to hold
  it instead of traditional RDDs. DataFrames work efficiently for structured
  relational data, which helped reduce the processing time.
- Procedures that previously did the processing in Vertica were replaced by
  UDFs (User Defined Functions) in Spark.
- Spark SQL is used to pass the DataFrames to the UDFs for processing. It is
  also used to perform the various join, order-by and group-by operations
  faster.
- The DataFrames were partitioned so that the processing ran across all the
  nodes in parallel.
The current environment is deployed on a 3-node HDP cluster with Apache Spark
1.6 on AWS. Each node has 4 cores, 30 GB of memory and 80 GB of SSD storage.
The YARN resource manager is used instead of Spark's standalone resource
manager to ensure high availability of the cluster. Shell scripts are used
for deploying and automating the Spark jobs.
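A deployment wrapper of the kind described might look like the following; the executor sizing, paths and job name are illustrative, not taken from the case study:

```shell
#!/usr/bin/env bash
# Sketch of a spark-submit wrapper for running the job on YARN.
set -euo pipefail

BATCH_ID="$1"   # batch identifier passed in by the scheduler

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 3 \
  --executor-cores 4 \
  --executor-memory 24g \
  /opt/jobs/process_batch.py "$BATCH_ID"
```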