FAST DATA PROCESSING
WITH APACHE SPARK
A case study of faster data processing using Apache Spark on an AWS cluster.
Case Study by : Aptus Data Labs | http://www.aptusdatalabs.com/
K E Y O B J E C T I V E S A N D S O L U T I O N A P P R O A C H
The client is an Australia-based organisation specialising in data insights.
The client's organisation is responsible for extracting meaningful patterns
from pharmaceutical data. The data is collected regularly from various drug
stores across Australia and contains the drug details prescribed to each
patient. The task was to process multiple batches of this pharmaceutical data.
Reduced Processing Time
A 62% performance boost was achieved: the new solution processed
1.2 billion records in 1 hour, cutting per-record processing time
by up to 62%.
B I G D A T A & A N A L Y T I C S
Each batch could contain up to a billion records.
Processing the data involved multiple order-by and group-by operations.
The result of each record also depended on the results of the preceding
and succeeding records, so every record had to be processed, which was
a bottleneck. The existing solution ran on a 5-node Vertica cluster
that took 2.2 hours to process a billion records.
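The 62% figure quoted above can be checked from these two throughput numbers, assuming "billion records" means 1.0 billion:

```python
old_hours_per_record = 2.2 / 1.0e9   # Vertica: 2.2 h for ~1.0 B records
new_hours_per_record = 1.0 / 1.2e9   # Spark:   1.0 h for 1.2 B records
reduction = 1 - new_hours_per_record / old_hours_per_record
print(f"{reduction:.0%}")  # ≈ 62%
```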
The key objectives were to migrate the existing platform to an Apache Spark
cluster to improve processing time, reduce IT costs, and make it easier to
adopt new features in the future.
P E R F O R M A N C E A N D B E N E F I T S
Reduced IT Costs
The use of open-source technologies effectively reduced IT costs.
Fault Tolerant and HA
The solution handles massive data volumes and is highly scalable and
fault tolerant. The use of a YARN cluster ensures high availability
of the environment.
In order to migrate the environment, several steps were carried out to get
the best out of Apache Spark. The following methodologies were used in the
solution.
- The data is ingested from both database and HDFS sources using the Spark
  Data Source API.
- As the data is in a structured, tabular format, DataFrames are used to hold
  it instead of traditional RDDs. DataFrames work efficiently for structured
  relational data, which helped reduce the processing time.
- Procedures that previously did the processing in Vertica were replaced by
  UDFs (User Defined Functions) in Spark.
- Spark SQL is used to pass the DataFrames to the UDFs for processing. It is
  also used to perform the various join, order-by and group-by operations
  faster.
- The DataFrames were partitioned so that the processing ran across all the
  nodes in parallel.
The current environment is deployed on a 3-node HDP cluster with Apache Spark
1.6 on AWS. Each node has 4 cores, 30 GB of memory and 80 GB of SSD storage.
The YARN resource manager is used instead of Spark's standalone resource
manager to ensure high availability of the cluster. Shell scripts are used
for deploying and automating the Spark jobs.
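A deployment wrapper of the kind described might look like the following; the executor sizing, paths and job name are illustrative, not taken from the case study:

```shell
#!/usr/bin/env bash
# Sketch of a spark-submit wrapper for running the job on YARN.
set -euo pipefail

BATCH_ID="$1"   # batch identifier passed in by the scheduler

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 3 \
  --executor-cores 4 \
  --executor-memory 24g \
  /opt/jobs/process_batch.py "$BATCH_ID"
```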