Your natural partner to develop innovative solutions

Nokia Institute of Technology

Nokia Internal Use Only

Agenda
• MapReduce Summarization Patterns
• MapReduce Coding Best Practices
• Ctrending MR Performance Evaluation
  • Ctrending MR Execution Summary
  • Code Profiling
  • Profiling Results
  • Code Tuning
  • HBase Configuration Tuning
  • Tuning Results
  • Refactoring Proposal

MapReduce Summarization Patterns
• Numerical Summarizations
  • General counting of data set records
  • Groups records by a custom key and calculates numerical values per group
  • Known Uses
    • Word count, record count, min/max/count, average/median/standard deviation
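
The numerical summarization pattern can be sketched in plain Python, with no Hadoop involved. This is only an illustration: the record layout (user, tweet length) and the function names map_record, reduce_group and run_job are assumptions made for the example, and run_job merely stands in for the framework's shuffle and sort.

```python
from collections import defaultdict

def map_record(record):
    """Map step: emit (group key, numeric value) for one input record."""
    user, length = record  # hypothetical record layout
    yield user, length

def reduce_group(key, values):
    """Reduce step: compute count, min, max and average for one group."""
    values = list(values)
    return key, {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
    }

def run_job(records):
    """Stand-in for the framework: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return dict(reduce_group(key, values) for key, values in groups.items())

summary = run_job([("alice", 40), ("bob", 100), ("alice", 120)])
```

Here summary holds one entry per key, each with the statistics of that group only.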

MapReduce Summarization Patterns
• Inverted Index
  • Builds an index from keywords to the records of a large data set
  • The Mapper emits (keyword, id) pairs and the framework handles most of the work
  • May use IdentityReducer
  • Can benefit from a custom Partitioner for load balancing
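
A minimal plain-Python sketch of the inverted index pattern follows. The document ids, the whitespace tokenizer and the names map_document and build_index are assumptions for illustration; build_index plays the role of shuffle/sort followed by an identity-style reducer.

```python
from collections import defaultdict

def map_document(doc_id, text):
    """Map step: emit one (keyword, doc_id) pair per distinct word."""
    for word in set(text.lower().split()):
        yield word, doc_id

def build_index(documents):
    """Stand-in for shuffle/sort plus an identity-style reducer:
    the ids grouped under each keyword are the final output."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for keyword, emitted_id in map_document(doc_id, text):
            index[keyword].append(emitted_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}

index = build_index({1: "Hadoop tuning", 2: "HBase tuning tips"})
```

Because the mapper already emits the final (keyword, id) pairs, the reduce side has nothing to compute, which is why an identity reducer can be enough.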

MapReduce Summarization Patterns
• Counting with Counters
  • Leverages the MapReduce framework's counters.
  • Counters are all kept in memory locally on each Mapper, then aggregated by the framework.
  • Performs better, but should not go beyond a few tens of counter definitions.
• Known Uses
  • Counting records, counting a small number of groups, summations
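
The idea can be simulated in plain Python, as a rough sketch only. The counter names and the record field music_related are hypothetical; run_map_task models one map task and aggregate models the framework summing the per-task counters.

```python
from collections import Counter

def run_map_task(records):
    """One map task: keeps its counters in local memory and emits no output."""
    counters = Counter()
    for record in records:
        counters["TOTAL_RECORDS"] += 1
        if record.get("music_related"):
            counters["MUSIC_RELATED"] += 1
    return counters

def aggregate(task_counters):
    """Stand-in for the framework summing the counters of all tasks."""
    total = Counter()
    for counters in task_counters:
        total.update(counters)  # Counter.update adds counts together
    return total

splits = [
    [{"music_related": True}, {"music_related": False}],
    [{"music_related": True}],
]
totals = aggregate(run_map_task(split) for split in splits)
```

No key/value pairs reach a reducer in this style, which is where the performance gain comes from.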

MapReduce Coding Best Practices
• Define Output Values
  • Create custom Writable classes to be used as Mapper output;
  • This gives cleaner Mapper code and avoids String parsing on the Reducer side.
• Avoid Local Object Creation
  • The map and reduce methods are invoked in very large loops;
  • Creating local objects inside map or reduce leads to a huge number of objects being allocated in the Eden space of the JVM heap's Young Generation;
  • Reuse instance-level objects to decrease Young GC activity.
• Use Combiners on Counting Summarizations
  • Combiners reduce bandwidth consumption by applying aggregations locally on the mapper's node, before the mapper output is sent to the shuffle and sort phase and made available to the reducers.
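
The effect of a combiner on a counting job can be shown with a small plain-Python simulation. The functions map_split, combine and reduce_all are illustrative names, not Hadoop APIs; the point is that fewer pairs cross the network while the final counts stay the same.

```python
from collections import Counter

def map_split(lines):
    """Map step: emit a (word, 1) pair for every word in the split."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Combiner: sum the counts per word locally, on the mapper's node."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def reduce_all(pairs):
    """Reduce step: sum the (possibly pre-aggregated) counts per word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

splits = [["a b a a"], ["b a"]]
raw = [pair for split in splits for pair in map_split(split)]
combined = [pair for split in splits for pair in combine(map_split(split))]
```

In this toy input, raw holds 6 pairs and combined holds 4, yet reduce_all returns the same totals for both. This only works because summing is associative and commutative, which is why the practice targets counting summarizations.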

Ctrending MR Performance Evaluation
• Ctrending MR Execution Summary
  • Total MR jobs running: 8
  • Average number of tweets processed: 2.2 million
  • Tweets identified as music related: 10.5%
  • Total execution time: 2 hours and 20 minutes
  • Slowest MapReduce jobs:
    • Tweets Counter: 46 minutes
    • Nokia Entity Id Join: 1 hour and 10 minutes

Ctrending MR Code Profiling
• Mainly applied to the Nokia Id Join Mapper
• Added MapReduce framework Counters to collect execution time metrics
• Also used Counters to sum the total number of entity ids found by the Nokia Id Join mapper
• Needed to create static fields in the search strategy implementations to collect the execution time metrics
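
The timing technique can be sketched as follows, in plain Python rather than with Hadoop Counters. The helper timed, the sample step normalize and the choice of nanoseconds as the unit are all assumptions for illustration; the slides do not state which unit the real counters used.

```python
import time
from collections import Counter

counters = Counter()  # stands in for the job's counters

def timed(counter_name, func, *args):
    """Run func and add its elapsed time (in nanoseconds) to a counter."""
    start = time.perf_counter_ns()
    try:
        return func(*args)
    finally:
        counters[counter_name] += time.perf_counter_ns() - start

def normalize(text):
    """Hypothetical step being profiled."""
    return text.strip().lower()

result = timed("TOTAL_NORMALIZATION_TIME", normalize, "  Some Artist ")
```

Accumulating into a shared counter per step gives totals that can be compared across steps once the job finishes.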

Ctrending MR Profiling Results

Counter                           Value
TOTAL_ARTISTS_NMS_FOUND           77
TOTAL_ARTISTS_NOT_IN_CACHE        1,904
TOTAL_CANDIDATES_FORMATTING_TIME  67,873
TOTAL_HBASE_GET_TIME              262,647
TOTAL_NORMALIZATION_TIME          22,452
TOTAL_SEARCH_ARTIST_TIME          611,066
TOTAL_SEARCH_CALCULATION_TIME     5,605
TOTAL_SEARCH_NMS_TIME             3,740,552
TOTAL_SEARCH_TIME                 4,098,270
TOTAL_SEARCH_TRACK_TIME           3,486,978
TOTAL_TRACKS_NMS_FOUND            145
TOTAL_TRACKS_NOT_IN_CACHE         4,635

Ctrending MR Code Tuning
• Tuning Tweets Count MapReduce
  • Applied IntSumReducer as the combiner.
  • Adjusted the HBase Scan to fetch and copy records in blocks of thousands, in order to optimize network usage between nodes.
  • Also set blockCache to false, as this table is always read sequentially in a single pass.
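
A back-of-the-envelope calculation shows why fetching rows in blocks matters. In the HBase client API the two settings above correspond to Scan.setCaching and Scan.setCacheBlocks. The sketch below only counts round trips; it assumes the scanned table holds roughly the 2.2 million tweets from the execution summary, and the fetch sizes of 1 and 1000 rows are hypothetical.

```python
import math

def scanner_round_trips(total_rows, rows_per_fetch):
    """Approximate number of client/server round trips for a full scan."""
    return math.ceil(total_rows / rows_per_fetch)

TWEETS = 2_200_000  # average number of tweets processed (assumed table size)

one_row_per_trip = scanner_round_trips(TWEETS, 1)
thousand_rows_per_trip = scanner_round_trips(TWEETS, 1000)
```

Under these assumptions the scan drops from 2,200,000 round trips to 2,200.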

Ctrending MR Code Tuning
• Tuning Entity Id Search MapReduce
  • Removed unnecessary split/indexOf calls
  • Removed redundant object creation from the map method

Ctrending MR Code Tuning
• Tuning Entity Id Search MapReduce
  • Profiling results show that NMS Search is the bottleneck
    • It accounts for more than 90% of the total MapReduce execution time
  • They also show that NMS Search does not add enough value
    • It finds only 4% of the artist ids that are not in the cache
    • It finds only 3% of the track ids that are not in the cache
  • This drove the decision to remove the NMS search, by simply referencing the CustomCache ISearchStrategy implementation in the Mapper setup method
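
These percentages can be rechecked from the counter values on the profiling results slide. The sketch assumes the counter names and values pair up in the order they are listed there, and it measures the NMS share against TOTAL_SEARCH_TIME, which is the closest available proxy for total execution time.

```python
counters = {
    "TOTAL_ARTISTS_NMS_FOUND": 77,
    "TOTAL_ARTISTS_NOT_IN_CACHE": 1_904,
    "TOTAL_TRACKS_NMS_FOUND": 145,
    "TOTAL_TRACKS_NOT_IN_CACHE": 4_635,
    "TOTAL_SEARCH_NMS_TIME": 3_740_552,
    "TOTAL_SEARCH_TIME": 4_098_270,
}

def percent(part, whole):
    """Share of part in whole, as a percentage rounded to one decimal."""
    return round(100.0 * part / whole, 1)

nms_time_share = percent(counters["TOTAL_SEARCH_NMS_TIME"],
                         counters["TOTAL_SEARCH_TIME"])
artists_found = percent(counters["TOTAL_ARTISTS_NMS_FOUND"],
                        counters["TOTAL_ARTISTS_NOT_IN_CACHE"])
tracks_found = percent(counters["TOTAL_TRACKS_NMS_FOUND"],
                       counters["TOTAL_TRACKS_NOT_IN_CACHE"])
```

The results are about 91% of search time spent in NMS, against roughly 4% of missing artists and 3% of missing tracks actually found, which matches the figures quoted above.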

HBase Configuration Tuning
• The Artists and Tracks Cache is an inverted index structure stored in HBase tables.
• These tables see a high level of random access to their records (Get operations) while the Entity Id Search MapReduce performs searches on the cache.
• Performance could be improved if the cache table blocks were kept in the RegionServer's memory.
• HBase provides a table schema property (IN_MEMORY, set per column family) that raises the priority of a table's blocks for being kept in the RegionServer's memory.

HBase Configuration Tuning
• Additional configuration is required on the HBase RegionServer so that blocks can stay in the block cache most of the time.
  • hbase.regionserver.global.memstore.upperLimit -> defines the maximum % of heap available for writes held in memstores, before put operations are actually written to disk files.
  • hbase.regionserver.global.memstore.lowerLimit -> defines the minimum % of heap available for writes held in memstores. Flush operations free memstore space until this limit is reached.
  • hfile.block.cache.size -> % of heap used to store blocks in memory

HBase Configuration Tuning
• Most Ctrending HBase put operations are done in batch jobs (Twitter Crawler).
• The music entities cache requires many Get operations while EntityIdSearchMR is executing.
• Simply marking the cache tables as in-memory does not work if there is not enough memory available.
• More memory can be made available to the cache table blocks on the RegionServers by decreasing the % of heap reserved for memstores and increasing the % reserved for the block cache.
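
The trade-off can be made concrete with a small heap budget calculation. Everything numeric here is hypothetical: the 4096 MB heap and both pairs of fractions are illustration values, not the settings used in Ctrending. The 80% ceiling on the combined share is an assumption modeled on the sanity check HBase applies to these two settings; verify it against the HBase version in use.

```python
def heap_budget(heap_mb, memstore_upper_limit, block_cache_size):
    """Split a RegionServer heap between memstores and the block cache."""
    if memstore_upper_limit + block_cache_size > 0.8:
        # assumed ceiling: leave at least 20% of the heap for everything else
        raise ValueError("memstore and block cache together exceed 80% of heap")
    return {
        "memstore_mb": round(heap_mb * memstore_upper_limit),
        "block_cache_mb": round(heap_mb * block_cache_size),
    }

HEAP_MB = 4096  # hypothetical RegionServer heap size

write_heavy = heap_budget(HEAP_MB, memstore_upper_limit=0.40, block_cache_size=0.25)
read_heavy = heap_budget(HEAP_MB, memstore_upper_limit=0.25, block_cache_size=0.50)
```

With these example values, shifting heap from memstores to the block cache doubles the space available for cached blocks, from 1024 MB to 2048 MB.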

Ctrending MR Tuning Results
• TweetsCountMR
  • Total execution time before tuning: 46 minutes (average)
  • Total execution time after tuning: 20 minutes (average)
• EntityIdSearchMR
  • Total execution time before tuning: 1 hour and 10 minutes (average)
  • Total execution time after tuning: 6 minutes (average)
• CONCLUSION: Do not ever perform HTTP requests in MapReduce jobs again!!!
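
The gains can be expressed as speedup factors using only the averages reported above. The sum of minutes saved is taken over these two jobs alone and says nothing about the other jobs in the pipeline.

```python
def speedup(before_minutes, after_minutes):
    """How many times faster the job runs after tuning."""
    return before_minutes / after_minutes

tweets_count = speedup(46, 20)       # TweetsCountMR
entity_id_search = speedup(70, 6)    # EntityIdSearchMR, 1 h 10 min = 70 min
minutes_saved = (46 - 20) + (70 - 6)
```

That is about 2.3x for TweetsCountMR and about 11.7x for EntityIdSearchMR, with 90 minutes saved across the two jobs.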

Refactoring
• Write a batch process that reads the generated rankings and sends requests to NMS for the music entities whose id was not found.
  • This is better implemented as a standalone multi-threaded Java process than as a MapReduce job.
  • As the input file is small (the filtered rank), Hadoop's default InputFormat implementations will not split it into many Map tasks.
  • Unless a custom InputFormat is implemented, a MapReduce job for this will probably take a long time to execute, as it will end up with a single Map task requesting NMS for all unknown ids.
• Optimize heap usage in the other MRs by avoiding object creation in map methods.
• Improve code quality (and even performance) by defining output values for the Trending MRs.
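
The standalone process could look roughly like the sketch below. The proposal above calls for Java; Python is used here only to keep the sketch short. lookup_entity is a hypothetical stand-in for one HTTP request to NMS, backed by a fake in-memory catalog, and the pool size of 8 is an arbitrary example value.

```python
from concurrent.futures import ThreadPoolExecutor

def lookup_entity(name):
    """Hypothetical stand-in for one HTTP request to NMS.
    Returns an id for known names and None otherwise."""
    fake_catalog = {"artist a": 101, "track b": 202}
    return fake_catalog.get(name.lower())

def resolve_unknown(names, workers=8):
    """Look up every unresolved name concurrently and keep the hits."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        ids = list(pool.map(lookup_entity, names))  # preserves input order
    return {name: entity_id
            for name, entity_id in zip(names, ids)
            if entity_id is not None}

resolved = resolve_unknown(["Artist A", "Track B", "Unknown C"])
```

A thread pool gives the request-level parallelism that a single Map task over a small input file would not, without needing a custom InputFormat.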

References
• HBase: The Definitive Guide, Lars George, O'Reilly
• MapReduce Design Patterns, Donald Miner, Adam Shook
• Hadoop official website: http://hadoop.apache.org/

Hadoop tuning
