ETL, pivoting and Handling Small File Problems in Spark
Extracting data, applying several transformations, and finally loading the summarized data into Hive is the most important part of data warehousing. In Spark we face various problems when building even the basic data quality checks, so it is always advisable to pass the data through custom data quality checking steps such as the following (a sketch of the null-check helpers appears after the list):
1. Null checking in string fields
2. Null checking in numeric fields
3. Alphanumeric characters in numeric fields
4. Data type selection on the basis of future requirements
5. Data format conversion (most important)
6. Data filtering
7. Address, SSN, telephone, email ID validation, etc.
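The null checks in points 1 and 2 correspond to the checkStrNull and checkNumericNull helpers used later in the RDD code. Their exact bodies are not shown in the original, so the following is only a minimal sketch; the substitution values "NA" and "0" are assumptions.

// Sketch of the null-check helpers referenced later; defaults are assumptions.
def checkStrNull(value: String): String =
  if (value == null || value.trim.isEmpty || value.trim.equalsIgnoreCase("null")) "NA"
  else value

def checkNumericNull(value: String): String =
  if (value == null || value.trim.isEmpty || !value.trim.matches("-?\\d+(\\.\\d+)?")) "0"
  else value

Both helpers return a String so that the calling code can still apply .trim followed by .toInt or .toDouble.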
In the transformation phase Spark demands many user-defined functions as the requirements grow more complex. Typical transformations include:
1. Aggregation
2. Routing
3. Normalization
4. De-Normalization
5. Intelligent Counter
6. Lookup
The load phase is putting your temporary table into Hive, HBase or Cassandra and using any visualization tool to show the outcome.
This article also looks at another really important aspect: handling small files in Spark. The rule of thumb to keep in mind is: “Don’t let your partition volume get too high (greater than 2 GB), and don’t make it too small either, which causes overhead problems.” My data source consists of many small files, so take a look at this step:
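One common way to attack the small-file problem, and the one hinted at by the broadcast variable mentioned below, is to read the small files in bulk and broadcast the resulting small data set. The paths and lookup structure here are assumptions, a sketch rather than the author's exact code:

// Sketch only: wholeTextFiles reads each small file as one (path, content)
// record, so we control partitioning instead of getting one partition per tiny file.
val smallFiles = sc.wholeTextFiles("/data/small_files/*", minPartitions = 4)

// Collect the (small) parsed content on the driver and broadcast it, so every
// executor gets one read-only copy instead of re-reading or shuffling the files.
val lookup = smallFiles
  .flatMap { case (_, content) => content.split("\n") }
  .map(_.split(","))
  .map(cols => (cols(0), cols(1)))
  .collectAsMap()
val lookupBroadcast = sc.broadcast(lookup)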
The execution plan itself shows the beauty of this hack and the efficient use of a broadcast variable in Spark. It definitely reduces the I/O overhead caused by many small files and provides a better result in terms of performance.
The data source is a comma-separated file of cricket scores, and its schema goes like this:
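Only a few column names (name, date_of_match, run_scored) can be inferred from the code below, so the ScoreRecord case class here is a sketch; the remaining field names are assumptions, while the types follow the toInt/toDouble conversions applied during parsing.

// Sketch of the record schema: six string fields followed by
// Int, Double, Int, Double, Int, matching the parsing code below.
case class ScoreRecord(name: String, team: String, opposition: String,
                       ground: String, date_of_match: String, match_type: String,
                       innings: Int, strike_rate: Double, run_scored: Int,
                       batting_average: Double, balls_faced: Int)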
This data has various null problems, so we need custom functions at the RDD level to clean and format it. Another problem is that the date format is not consistent throughout the file: in some places it is dd/mm/yyyy and in others dd-mm-yyyy, so a serious amount of data quality and conversion checking was required.
val dataRDD = data.map(line => line.split(","))
  .map(line => ScoreRecord(
    checkStrNull(line(0)).trim, checkStrNull(line(1)).trim, checkStrNull(line(2)).trim,
    checkStrNull(line(3)).trim, checkStrNull(line(4)).trim, checkStrNull(line(5)).trim,
    checkNumericNull(line(6)).trim.toInt, checkNumericNull(line(7)).trim.toDouble,
    checkNumericNull(line(8)).trim.toInt, checkNumericNull(line(9)).trim.toDouble,
    checkNumericNull(line(10)).trim.toInt))
This applies the required conversions and checks. Next I developed Spark SQL UDFs to handle the date conversion problem, so the code goes like this:
df.registerTempTable("cricket_data")

val result = sqlContext.sql("""
  select name, year,
         case when month in (10,11,12) then 'Q4'
              when month in (7,8,9)    then 'Q3'
              when month in (4,5,6)    then 'Q2'
              when month in (1,2,3)    then 'Q1' end Quarter,
         run_scored
  from (select name,
               year(convert(REPLACE(date_of_match,'/','-')))  as year,
               month(convert(REPLACE(date_of_match,'/','-'))) as month,
               run_scored
        from cricket_data) C""")
convert and REPLACE are custom UDFs registered for this job.
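The bodies of these UDFs are not shown in the original. A minimal sketch, assuming REPLACE is a plain string replacement and convert parses the normalized dd-MM-yyyy string into a java.sql.Date so that the built-in year() and month() functions can be applied, could be:

// Sketch only: the author's actual UDF bodies are not shown.
import java.text.SimpleDateFormat
import java.sql.Date

sqlContext.udf.register("REPLACE",
  (value: String, from: String, to: String) => value.replace(from, to))

sqlContext.udf.register("convert", (value: String) => {
  val parser = new SimpleDateFormat("dd-MM-yyyy")
  new Date(parser.parse(value.trim).getTime)
})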
This query gives a result like this:
In data warehouse terms this layout is still inefficient, because the business user demands summarized data with full visibility across the whole time period.
In ETL tools this is handled by a component called a “De-Normalizer” (in Informatica), so it would require transformations like an aggregator, which contains a sorter that sorts the data first and then applies the aggregation. These are costly transformations in ETL terms: at a volume of around one billion rows they suffer badly because of less efficient caching and data mapping.
Spark gives a brilliant solution to pivot the data in a single line:
val result_pivot = result.groupBy("name","year").pivot("Quarter").agg(sum("run_scored"))
This single statement pivots and transposes a huge volume of data within a few minutes. The pivoted data goes like this:
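From the groupBy/pivot/agg call, the pivoted DataFrame carries one row per (name, year) pair and one column per quarter; a quick way to inspect it:

// Column layout implied by the pivot above:
//   name | year | Q1 | Q2 | Q3 | Q4
// where Q1..Q4 hold sum(run_scored) for that player and year.
result_pivot.printSchema()
result_pivot.show(10)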
Explain Plan for the Query
Explain Plan for the Pivot
We load this summarized data into Hive and present it to the end user; this is how my table got stored in Hive.
Data in Hive
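The original does not spell out the write statement. A minimal sketch of loading the pivoted result into Hive, assuming sqlContext is a HiveContext and using a hypothetical table name cricket_quarterly_runs, is:

// Sketch only: the table name is an assumption.
// Requires a HiveContext so that saveAsTable creates a Hive-managed table.
result_pivot.write
  .mode("overwrite")
  .saveAsTable("cricket_quarterly_runs")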
A very simple way to handle ETL in Spark!