SCALING IMPALA
Manish Maheshwari | Strata London 2019 | #StrataData
manish@cloudera.com
2 © Cloudera, Inc. All rights reserved.
AGENDA
• Impala overview
• KRPC Improvements
• Scaling issues and solutions
• Understanding query profiles
• Key Takeaways
3 © Cloudera, Inc. All rights reserved.
• Open source
• Fast
• Massively parallel processing (MPP)
• C++, run time code generation, streaming
• Flexible
• Multiple storage engines (HDFS, S3, ADLS, Apache Kudu, …)
• Multiple file formats (Parquet, Text, Sequence, Avro, ORC, …)
• Enterprise-grade
• Authorization, authentication, lineage tracing, auditing, encryption
• >1400 customers, >97000 machines
• Scalable (Now even more!)
• Large clusters with 400+ nodes
Apache Impala
4 © Cloudera, Inc. All rights reserved.
Impala Architecture
[Architecture diagram: each Impala daemon pairs a Java frontend (Query Compiler, Query Coordinator, Metadata cache) with a C++ backend (Query Executor); shared services are the StateStore, Catalog, Hive MetaStore, Sentry, and the HDFS NameNode; storage engines include HDFS, Kudu, S3, ADLS, and HBase.]
Select Query flow in Impala
[Diagram: a SQL app connects over ODBC/JDBC to one Impala daemon acting as coordinator; numbered steps (0-7) show the plan fanning out to the other daemons and results streaming back, with each daemon reading from HDFS, Kudu, S3/ADLS, or HBase.]
• Request arrives via ODBC/JDBC
• Planner turns request into collection of plan fragments
• Coordinator initiates execution on remote Impala daemons
• Intermediate results are streamed between Impala daemons
• Query results are streamed back to the client
6 © Cloudera, Inc. All rights reserved.
What’s KRPC?
• Replaces Thrift RPC for inter-daemon communication for certain RPCs, from CDH 5.15 onwards
• Reduces # of connections in the cluster
• Reduces stress on MIT KDC / AD
• KRPC supports both synchronous and asynchronous RPCs
• KRPC supports connection multiplexing
• One connection per direction between every pair of hosts
• Fixed size thread pool maintained by KRPC internally
• Query performance during concurrent execution is up by 2x-3x on average
7 © Cloudera, Inc. All rights reserved.
KRPC Performance
8 © Cloudera, Inc. All rights reserved.
KRPC Stability
9 © Cloudera, Inc. All rights reserved.
KRPC Throughput
10 © Cloudera, Inc. All rights reserved.
Scaling Issues
• I've got 20 more use cases to onboard
• And our data volumes just went up 10x
• And we have 10x more tables and 100x more partitions
• And we have 10x more concurrent queries
• And our ETL is now real time
• Sure, just add more nodes, right?
• OK, did that too… my queries are still 3x slower ☹
11 © Cloudera, Inc. All rights reserved.
Metadata/Catalog Cache Woes
• Impala Catalog and Daemons cache the HMS Metadata + HDFS Block locations
• Memory = num of tables * 5 KB + num of partitions * 2 KB + num of files * 750 B + num of file blocks * 300 B +
sum(incremental col stats per table) (see the worked example after this list)
• Incremental stats
• For each table, num columns * num partitions * 400B
• e.g. A large telco has over 56K tables
• and partitions and files and blocks and replicas…
• Catalog memory approx. 80 GB, GC times go up significantly
• Reduced memory for query execution.
• OOM issues on the catalogd
• Long metadata loading time; long time for the StateStore (SS) to send the catalog to all the daemons
• Slow DDL and DML statements; even `describe table` requires the catalog cache for the whole table
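A rough worked illustration of the sizing formula above, with hypothetical numbers (not from the deck):
  10,000 tables * 5 KB ≈ 0.05 GB
  2M partitions * 2 KB = 4 GB
  20M files * 750 B = 15 GB
  40M blocks * 300 B = 12 GB
  Total ≈ 31 GB of catalog cache, before incremental column stats are added on top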
12 © Cloudera, Inc. All rights reserved.
Metadata/Catalog Cache Woes
[Diagram: CatalogD loads tables A, B, C, D from HMS and the Namenode, and the StateStore daemon replicates the full metadata for all tables to every Impala daemon.]
13 © Cloudera, Inc. All rights reserved.
Metadata/Catalog Cache - Solutions
• Regularly merge small files (see the sketch after this list)
• And run a Refresh Table afterwards
• Avoid data ingestion processes that produce many small files
• Use a larger block size (can be over 2GB)
• Optimal partitioning strategy
• Don’t over or under partition
• HDFS file handle cache
• Consider using HBase, Kudu, …
• If nothing works, just delete the data 😜
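One common way to implement the first bullet, sketched with hypothetical names (web_logs, web_logs_staging and event_date are illustrative only, not from the deck): rewrite the affected partition with one INSERT OVERWRITE so it ends up with a few large files, then refresh just that partition.
  -- Rebuild one day's partition from its small-file staging table
  INSERT OVERWRITE web_logs PARTITION (event_date='2019-04-30')
  SELECT col1, col2, col3
  FROM web_logs_staging
  WHERE event_date = '2019-04-30';
  -- Make the new file layout visible to the coordinators
  REFRESH web_logs PARTITION (event_date='2019-04-30');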
14 © Cloudera, Inc. All rights reserved.
Dedicated Coordinators
• Coordinators - compile the queries and create the execution plans
• Need table metadata!!
• Executors - execute the plan fragments and send results to other executors / the coordinator
• Do not need table metadata!!
• Benefits
• Executors need less memory
• Statestore does not need to send metadata to all nodes
• Faster metadata updates and propagation
• Coordinator nodes don’t need to be datanodes
15 © Cloudera, Inc. All rights reserved.
Dedicated Coordinators
[Diagram: CatalogD still loads the tables from HMS and the Namenode, but the StateStore daemon ships the metadata only to the coordinator nodes, while the executor-only nodes keep no metadata cache.]
16 © Cloudera, Inc. All rights reserved.
Dedicated Coordinators – Best Practices
• Rule of thumb: 1 coordinator per 50 executors
• Start with just one coordinator. Run it on an edge node (it can run on datanodes too)
• It needs at least a few disks to write any spill data
• Add another coordinator when CPU / network utilization > 80%
• When using a load balancer, double the coordinator count, but set only half as active and the rest as backup
• Use sticky connections on the load balancer
• Increase the fe_service_threads on the coordinators to handle client connections
• Increase the Java heap of the coordinators according to the catalog size
• Increase num_metadata_loading_threads
• Default is 16. Increase if you have lots of tables and partitions.
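A minimal sketch of how the dedicated roles are typically assigned through impalad startup flags; the flag names are documented Impala options, but which hosts get which role and the thread count below are illustrative only:
  # on the 1-2 coordinator (edge) nodes
  impalad --is_coordinator=true --is_executor=false --fe_service_threads=256
  # on the worker/datanode hosts
  impalad --is_coordinator=false --is_executor=true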
17 © Cloudera, Inc. All rights reserved.
Further Catalog Cache Improvements (In Beta)
[Diagram: three coordinators each ask CatalogD only for the table they need (impala-shell> describe A / describe B / describe C); CatalogD loads tables on demand from HMS and the Namenode and receives HMS notifications.]
• On-demand metadata for coordinators from the CatalogD
• Metadata LRU cache (forget old tables)
• Metadata release on memory pressure
• Smart cache invalidation
• Compressed incremental stats
• See IMPALA-7127 for the full list of improvements
18 © Cloudera, Inc. All rights reserved.
HMS Notifications
[Diagram: the Hive Metastore sends lightweight notifications to the Catalog service, which pulls metadata from HMS and the HDFS NameNode and propagates it through the StateStore to the query coordinators.]
19 © Cloudera, Inc. All rights reserved.
Metadata on-demand (In Beta)
• Use Local catalog cache
• Set --catalog_topic_mode = minimal on the catalog daemons
• Set --use_local_catalog = true on the coordinators
• Time-based catalog cache eviction -
• Set invalidate_tables_timeout_s on both catalogd and coordinators
• E.g. invalidate_tables_timeout_s = 3600 will invalidate tables that have been unused for more than 1 hour.
• Memory-based catalog cache eviction -
• Set invalidate_tables_on_memory_pressure = true on both catalogd and impalad.
• When the memory pressure reaches 60% of JVM heap size after a Java garbage collection in catalogd,
Impala invalidates 10% of the least recently used tables.
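Put together, the on-demand metadata setup sketched in this slide would look roughly like the following (a sketch only; the timeout value is an example):
  # catalogd startup flags
  --catalog_topic_mode=minimal
  --invalidate_tables_timeout_s=3600
  --invalidate_tables_on_memory_pressure=true
  # impalad (coordinator) startup flags
  --use_local_catalog=true
  --invalidate_tables_timeout_s=3600
  --invalidate_tables_on_memory_pressure=true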
20 © Cloudera, Inc. All rights reserved.
Impala Architecture – Coordinators and Executors
[Diagram: the same architecture as slide 4, but only the Impala coordinator(s) run the FE (Java) query compiler/coordinator with the metadata cache, while the executor nodes run only the BE (C++) query executors against HDFS, Kudu, S3, ADLS, and HBase.]
21 © Cloudera, Inc. All rights reserved.
Admission Control Woes
• Impala Admission Control not enabled / Default Memory limit not set for each pool
• Heuristics-based memory estimation is not 100% accurate, and is worse if table stats are unavailable
• The GROUP BY estimate can be particularly off when there is a large number of group-by columns
• Mem estimate = NDV of group by column 1 * NDV of group by column 2 * … NDV of group by column n
• Under-admission due to higher-than-required memory reserved on each daemon
• Queries will OOM out unnecessarily
• Query Status: Admission for query exceeded timeout 60000ms in pool root.nprd_tst_hadoop_data_appl_readonly. Queued
reason: Not enough aggregate memory available in pool root.nprd_tst_hadoop_data_appl_readonly with max mem resources
150.00 GB. Needed 40.00 GB but only 30.00 GB was available.
• Always enable IAC
• Limit the amount of memory used by an individual query using per query mem-limit
• Set it from Impala shell / Hue - set mem_limit=<per query limit>
• Set default memory limit per pool
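For instance, the per-query limit can be set from impala-shell or Hue before running the statement; the 2 GB value and the web_logs table below are illustrative only:
  set mem_limit=2g;   -- per-host cap for all following queries in this session
  SELECT count(*) FROM web_logs WHERE event_date = '2019-04-30';
  set mem_limit=0;    -- remove the session-level override again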
22 © Cloudera, Inc. All rights reserved.
Query Concurrency Woes
• Impala Admission Control is decentralized; each coordinator makes an independent decision based on
the last known running queries on the cluster, as communicated by the StateStore (SS)
• This makes IAC decisions fast, but they might be a little imprecise during times of heavy load
across many daemons. This is called over-admission.
• Using dedicated coordinators limits this over-admission of queries!!
• You get controlled query concurrency, so each query runs faster and overall query throughput is
higher.
• Ideal total query concurrency = # Cores on the executors / datanodes
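As a hypothetical sizing example of that rule (the numbers are illustrative, not from the deck): 20 executor nodes with 16 cores each give an ideal total concurrency of about 20 * 16 = 320 running queries across all pools; admission control should queue anything beyond that rather than letting queries contend for CPU.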
23 © Cloudera, Inc. All rights reserved.
Impala Resource Pools
Resource Pools Design -
• 10 Impala daemons, 200 GB per daemon = 2 TB total, 8
tenants
• OK, let's divide the memory across the tenants. Everyone gets
what they pay for. Good design, right?
Issues
• Unused memory cannot be used by other tenants
• Busy tenants queue up queries in admission control,
causing overall "slowness" in query execution
• Small tenants running large queries will spill to disk
until the spill-to-disk limit and eventually OOM out
• How bad does it get? 25+ tenants, only 2/3 active at
any given time
24 © Cloudera, Inc. All rights reserved.
Scaling Impala Resource Pools
• Design resource pools according to peak memory needed
• Use simple grouping to create small, medium and large pools
• Use Cloudera Manager APIs for user chargeback if needed
25 © Cloudera, Inc. All rights reserved.
Metadata Operations
• Invalidate Metadata
• Runs asynchronously to discard the loaded metadata from the catalog cache; a metadata load will be triggered by any subsequent query.
• Should be run when
• New tables created / Tables dropped by Hive/Spark
• Block locations changed by HDFS load balancer
• Recover Partitions
• Scans HDFS to check if any new partition directories were added, and caches block metadata for those files
• Refresh Table / Refresh Table Partition
• Adding/Removing/Overwriting files into partitions via Hive/Spark
• Running operations like ALTER TABLE
• Reloads metadata for the table from HMS and does an incremental reload of the file and block metadata.
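For reference, the three operations above map to statements like these (the table and partition names are illustrative only):
  -- Discard metadata for one table only (avoid a cluster-wide INVALIDATE METADATA)
  INVALIDATE METADATA web_logs;
  -- Pick up partition directories added directly on HDFS
  ALTER TABLE web_logs RECOVER PARTITIONS;
  -- Incremental reload after Hive/Spark added files to an existing partition
  REFRESH web_logs PARTITION (event_date='2019-04-30');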
26 © Cloudera, Inc. All rights reserved.
Scaling Metadata Operations
• How bad does it get?
• 18K invalidate table commands per day
• What not to do: invalidate metadata on every change. Instead:
• No commands are needed if
operations/ETL run in Impala
• Always run refresh <table>
<partition> when adding data
• Recover partitions when partitions
are added
• Refresh Table for other changes
• Limit Invalidate Metadata (IM) to a specific <table> only
27 © Cloudera, Inc. All rights reserved.
Automatic Metadata Sync (In Beta)
• CatalogD polls Hive Metastore (HMS) notification events
• Invalidates tables when it receives ALTER TABLE events, or ALTER, ADD,
or DROP events on their partitions.
• Adds the tables or databases when it receives the CREATE TABLE or CREATE DATABASE events.
• Removes the tables from catalogd when it receives the DROP TABLE or DROP
DATABASE events.
• Operations that do not generate events in HMS, such as adding new data to existing
tables/partitions from Spark, are not supported.
• I.e., Load / Insert from such engines still needs a refresh <table> <partition>
28 © Cloudera, Inc. All rights reserved.
Automatic Metadata Sync (In Beta)
• To disable the event-based HMS sync for a new database, set the impala.disableHmsSync database property in Hive
• CREATE DATABASE <name> WITH DBPROPERTIES ('impala.disableHmsSync'='true');
• To enable or disable the event-based HMS sync for a table
• CREATE TABLE <name> TBLPROPERTIES ('impala.disableHmsSync'='true' | 'false');
• To change the event-based HMS sync at the table level
• ALTER TABLE <name> SET TBLPROPERTIES ('impala.disableHmsSync'='true' | 'false');
• When both table and database level properties are set, the table level property takes precedence.
• If the property is changed from true (meaning events are skipped) to false (meaning events are not skipped), issue a
manual INVALIDATE METADATA command to reset.
29 © Cloudera, Inc. All rights reserved.
Scaling Compute Stats
• Compute Stats is very CPU-intensive; its cost depends on the number of rows, number of data files, the total size
of the data files, and the file format
• For partitioned tables, the numbers are calculated per partition, and as totals for the whole table.
• Limit the number of columns: only compute stats on columns involved in filters, join conditions,
group by or partition by clauses (see the sketch after this list)
• Re-compute stats only when there is > 30% data change
• Run compute stats on weekends/nights. Not needed after every data load.
• If you reload a complete new set of data for a table, but the number of rows and number of distinct
values for each column is relatively unchanged from before, you do not need to recompute stats for
the table
• Use enable_stats_extrapolation (experimental)
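A sketch of the column-limited and partition-limited variants mentioned above; recent Impala releases accept a column list on COMPUTE STATS, and the table, columns, and partition here are illustrative only:
  -- Gather stats only for the columns used in joins, filters and GROUP BY
  COMPUTE STATS web_logs (customer_id, event_date, status_code);
  -- Restrict incremental stats work to the partition that just changed
  COMPUTE INCREMENTAL STATS web_logs PARTITION (event_date='2019-04-30');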
30 © Cloudera, Inc. All rights reserved.
Set Statistics Manually
• Quick fix as part of data load, while compute stats can be scheduled on weekends
• Set total number of rows. Applies to both unpartitioned and partitioned tables.
• alter table <table_name> set tblproperties('numRows'='new_value',
'STATS_GENERATED_VIA_STATS_TASK'='true');
• Set total number of rows for a specific partition. Applies to partitioned tables only. You must specify all
the partition key columns in the PARTITION clause.
• alter table <table_name> partition (keycol1=val1, keycol2=val2, ...) set
tblproperties('numRows'='new_value',
'STATS_GENERATED_VIA_STATS_TASK'='true');
• Column stats:
• ALTER TABLE <table_name> SET COLUMN STATS <col_name>
('numDVs'='100')
• Compute numDVs with “SELECT NDV(col)”
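Filled in with hypothetical values (web_logs, customer_id and the counts below are examples, not from the deck), the end of a nightly load script might look like:
  ALTER TABLE web_logs PARTITION (event_date='2019-04-30')
    SET TBLPROPERTIES ('numRows'='2500000', 'STATS_GENERATED_VIA_STATS_TASK'='true');
  ALTER TABLE web_logs SET COLUMN STATS customer_id ('numDVs'='185000');
  -- 185000 obtained once via: SELECT NDV(customer_id) FROM web_logs;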
31 © Cloudera, Inc. All rights reserved.
Other Scalability Considerations
• Use star schemas, integer join keys
• Check for hot spotting - increase the replication factor for master data / frequently queried data
• Avoid casts – implicit or explicit (easily over 10% improvements for larger volumes)
• Increase RUNTIME_FILTER_WAIT_TIME_MS for complicated queries (example after this list), but coordinators need to do
more work
• Use HDFS file handle cache
• Give the OS enough free memory to cache data blocks
• Set default compression codec - improves disk read performance
• Use high CPU nodes, fast processors
• Multiple Impala clusters - DistCp the data over from the remote cluster rather than reading it remotely
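A hedged example of the runtime-filter option mentioned above; it is a per-session query option, and the 10-second value plus the fact_sales/dim_store tables are illustrative only:
  -- Let scans wait longer for runtime filters from the build side of the join
  SET RUNTIME_FILTER_WAIT_TIME_MS=10000;
  SELECT f.* FROM fact_sales f JOIN dim_store s ON f.store_id = s.store_id
  WHERE s.country = 'UK';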
32 © Cloudera, Inc. All rights reserved.
BI Tools
• Always always close queries
• idle_query_timeout = 60
• idle_session_timeout = 1800
• Use handcrafted SQL
• Use different pools for different queries
and encourage use of set mem_limit;
• Use JDBC over Kerberos authentication
mechanism
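The two timeouts at the top of this slide are impalad startup flags (in seconds); a sketch using the values from the slide:
  # on the coordinator impalads
  --idle_query_timeout=60      # cancel queries whose client has been idle for 60 s
  --idle_session_timeout=1800  # close sessions that have been idle for 30 min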
33 © Cloudera, Inc. All rights reserved.
Understanding Query Profiles
• Impala query profiles can be retrieved from Cloudera Manager, the Impala coordinator web UI, or from
the command line by executing `profile` (see the example after this list)
• Includes nanosecond timers for all operations on all nodes
• Quite detailed and exhaustive, but the basics are easy
• We can easily answer -
• What’s the bottleneck for this query?
• Why is this run fast but that run slow?
• How can I tune to improve this query's performance?
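From impala-shell the flow is simply the following (the query and table are hypothetical placeholders):
  [impalad-host:21000] > SELECT count(*) FROM web_logs WHERE event_date = '2019-04-30';
  [impalad-host:21000] > profile;   -- prints the full runtime profile of the last query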
34 © Cloudera, Inc. All rights reserved.
Understanding Query Profiles
• Always check the Impala version
and the default query options set
• Check warnings
• Query state – running,
cancelled
• Check query type – Query,
DDL, etc
35 © Cloudera, Inc. All rights reserved.
Understanding Query Profiles
• Check per node peak memory usage
• Tells you how high the memory limit should be for these queries
• Shows skews in memory usage among nodes
36 © Cloudera, Inc. All rights reserved.
Understanding Query Profiles
• For completed queries, read the summary in detail
• Check what’s taking the max time, max memory, check for skews in data
• Check row estimates; depending on whether stats are available, these can be skewed
• Check the join order; it is determined entirely by total size (#rows * column width)
• Try to ensure that, after partition pruning, the RHS is smaller than the LHS
• Broadcast joins are the default, partitioned joins for large tables of roughly equal size.
37 © Cloudera, Inc. All rights reserved.
Understanding Query Profiles
• Read the query timeline in detail
• Check which step is taking the most time, and
why
• Usual culprits
• Metadata load
• Completed admission
• ClientFetchWaitTimer
• First dynamic filter received
• Last row fetched
38 © Cloudera, Inc. All rights reserved.
Understanding Query Profiles
• Check each plan fragment
• Tells us what it did and how many
hosts it ran on
• How much data it processed
• Partition pruning stats for
HDFS scans
• Parquet push down predicates
39 © Cloudera, Inc. All rights reserved.
Key Takeaways
• Always use dedicated coordinators/executors
• IAC should be enabled and memory limit set
• Metadata management is significantly improved
• Zero touch metadata coming soon
• Follow best practices for Impala queries and performance tuning - refer to the Impala Cookbook
40 © Cloudera, Inc. All rights reserved.
Rate today's session