SCALING IMPALA
Manish Maheshwari | Strata London 2019 | #StrataData
manish@cloudera.com
2 © Cloudera, Inc. All rights reserved.
AGENDA
• Impala overview
• KRPC Improvements
• Scaling issues and solutions
• Understanding query profiles
• Key Takeaways
3 © Cloudera, Inc. All rights reserved.
• Open source
• Fast
• Massively parallel processing (MPP)
• C++, run time code generation, streaming
• Flexible
• Multiple storage engines (HDFS, S3, ADLS, Apache Kudu, …)
• Multiple file formats (Parquet, Text, Sequence, Avro, ORC, …)
• Enterprise-grade
• Authorization, authentication, lineage tracing, auditing, encryption
• >1400 customers, >97000 machines
• Scalable (Now even more!)
• Large clusters with 400+ nodes
Apache Impala
4 © Cloudera, Inc. All rights reserved.
Impala Architecture
[Architecture diagram: each Impala daemon pairs a Java frontend (Query Compiler, Query Coordinator, Metadata cache) with a C++ backend (Query Executor); shared services are the StateStore, Catalog, Hive MetaStore, Sentry, and the HDFS NameNode; storage engines include HDFS, Kudu, S3, ADLS, and HBase.]
Select Query flow in Impala
[Diagram: a SQL app connects over ODBC/JDBC to one Impala daemon acting as coordinator; numbered steps (0-7) show the plan fanning out to the other daemons and results streaming back, with each daemon reading from HDFS, Kudu, S3/ADLS, or HBase.]
• Request arrives via ODBC/JDBC
• Planner turns request into collection of plan fragments
• Coordinator initiates execution on remote Impala daemons
• Intermediate results are streamed between Impala daemons
• Query results are streamed back to the client
6 © Cloudera, Inc. All rights reserved.
What’s KRPC?
• Replaces Thrift RPC for inter-daemon communication for certain RPCs, from CDH 5.15 onwards
• Reduces # of connections in the cluster
• Reduces stress on MIT KDC / AD
• KRPC supports both synchronous and asynchronous RPCs
• KRPC supports connection multiplexing
• One connection per direction between every pair of hosts
• Fixed size thread pool maintained by KRPC internally
• Query performance during concurrent execution is up by 2x-3x on average
7 © Cloudera, Inc. All rights reserved.
KRPC Performance
8 © Cloudera, Inc. All rights reserved.
KRPC Stability
9 © Cloudera, Inc. All rights reserved.
KRPC Throughput
10 © Cloudera, Inc. All rights reserved.
Scaling Issues
• I've got 20 more use cases to onboard
• And our data volumes just went up 10x
• And we have 10x more tables and 100x more partitions
• And we have 10x more concurrent queries
• And our ETL is now real time
• Sure, just add more nodes, right?
• OK, did that too… my queries are still 3x slower ☹
11 © Cloudera, Inc. All rights reserved.
Metadata/Catalog Cache Woes
• Impala Catalog and Daemons cache the HMS Metadata + HDFS Block locations
• Memory = num of tables * 5 KB + num of partitions * 2 KB + num of files * 750 B + num of file blocks * 300 B +
sum(incremental col stats per table) (see the worked example after this list)
• Incremental stats
• For each table, num columns * num partitions * 400B
• e.g. A large telco has over 56K tables
• and partitions and files and blocks and replicas…
• Catalog memory approx. 80 GB, GC times go up significantly
• Reduced memory for query execution.
• OOM issues on the catalogd
• Long metadata loading time; long time for the StateStore (SS) to send the catalog to all the daemons
• Slow DDL and DML statements; even `describe table` requires the catalog cache for the whole table
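A rough worked illustration of the sizing formula above, with hypothetical numbers (not from the deck):
  10,000 tables * 5 KB ≈ 0.05 GB
  2M partitions * 2 KB = 4 GB
  20M files * 750 B = 15 GB
  40M blocks * 300 B = 12 GB
  Total ≈ 31 GB of catalog cache, before incremental column stats are added on top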
12 © Cloudera, Inc. All rights reserved.
Metadata/Catalog Cache Woes
[Diagram: CatalogD loads tables A, B, C, D from HMS and the Namenode, and the StateStore daemon replicates the full metadata for all tables to every Impala daemon.]
13 © Cloudera, Inc. All rights reserved.
Metadata/Catalog Cache - Solutions
• Regularly merge small files (see the sketch after this list)
• And run a Refresh Table afterwards
• Avoid data ingestion processes that produce many small files
• Use a larger block size (can be over 2GB)
• Optimal partitioning strategy
• Don’t over or under partition
• HDFS file handle cache
• Consider using HBase, Kudu, …
• If nothing works, just delete the data 😜
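One common way to implement the first bullet, sketched with hypothetical names (web_logs, web_logs_staging and event_date are illustrative only, not from the deck): rewrite the affected partition with one INSERT OVERWRITE so it ends up with a few large files, then refresh just that partition.
  -- Rebuild one day's partition from its small-file staging table
  INSERT OVERWRITE web_logs PARTITION (event_date='2019-04-30')
  SELECT col1, col2, col3
  FROM web_logs_staging
  WHERE event_date = '2019-04-30';
  -- Make the new file layout visible to the coordinators
  REFRESH web_logs PARTITION (event_date='2019-04-30');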
14 © Cloudera, Inc. All rights reserved.
Dedicated Coordinators
• Coordinators - compile the queries and create the execution plans
• Need table metadata!!
• Executors - execute the plan fragments and send results to other executors / the coordinator
• Do not need table metadata!!
• Benefits
• Executors need less memory
• Statestore does not need to send metadata to all nodes
• Faster metadata updates and propagation
• Coordinator nodes don’t need to be datanodes
15 © Cloudera, Inc. All rights reserved.
Dedicated Coordinators
[Diagram: CatalogD still loads the tables from HMS and the Namenode, but the StateStore daemon ships the metadata only to the coordinator nodes, while the executor-only nodes keep no metadata cache.]
16 © Cloudera, Inc. All rights reserved.
Dedicated Coordinators – Best Practices
• Rule of thumb: 1 coordinator per 50 executors
• Start with just one coordinator. Run it on an edge node (it can run on datanodes too)
• It needs at least a few disks to write any spill data
• Add another coordinator when CPU / network utilization > 80%
• When using a load balancer, double the coordinator count, but set only half as active and the rest as backup
• Use sticky connections on the load balancer
• Increase the fe_service_threads on the coordinators to handle client connections
• Increase the Java heap of the coordinators according to the catalog size
• Increase num_metadata_loading_threads
• Default is 16. Increase if you have lots of tables and partitions.
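A minimal sketch of how the dedicated roles are typically assigned through impalad startup flags; the flag names are documented Impala options, but which hosts get which role and the thread count below are illustrative only:
  # on the 1-2 coordinator (edge) nodes
  impalad --is_coordinator=true --is_executor=false --fe_service_threads=256
  # on the worker/datanode hosts
  impalad --is_coordinator=false --is_executor=true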
17 © Cloudera, Inc. All rights reserved.
Further Catalog Cache Improvements (In Beta)
[Diagram: three coordinators each ask CatalogD only for the table they need (impala-shell> describe A / describe B / describe C); CatalogD loads tables on demand from HMS and the Namenode and receives HMS notifications.]
• On-demand metadata for coordinators from the CatalogD
• Metadata LRU cache (forget old tables)
• Metadata release on memory pressure
• Smart cache invalidation
• Compressed incremental stats
• See IMPALA-7127 for the full list of improvements
18 © Cloudera, Inc. All rights reserved.
HMS Notifications
[Diagram: the Hive Metastore sends lightweight notifications to the Catalog service, which pulls metadata from HMS and the HDFS NameNode and propagates it through the StateStore to the query coordinators.]
19 © Cloudera, Inc. All rights reserved.
Metadata on-demand (In Beta)
• Use Local catalog cache
• Set --catalog_topic_mode = minimal on the catalog daemons
• Set --use_local_catalog = true on the coordinators
• Time-based catalog cache eviction -
• Set invalidate_tables_timeout_s on both catalogd and coordinators
• E.g. invalidate_tables_timeout_s = 3600 will invalidate tables that have been unused for more than 1 hour.
• Memory-based catalog cache eviction -
• Set invalidate_tables_on_memory_pressure = true on both catalogd and impalad.
• When the memory pressure reaches 60% of JVM heap size after a Java garbage collection in catalogd,
Impala invalidates 10% of the least recently used tables.
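Put together, the on-demand metadata setup sketched in this slide would look roughly like the following (a sketch only; the timeout value is an example):
  # catalogd startup flags
  --catalog_topic_mode=minimal
  --invalidate_tables_timeout_s=3600
  --invalidate_tables_on_memory_pressure=true
  # impalad (coordinator) startup flags
  --use_local_catalog=true
  --invalidate_tables_timeout_s=3600
  --invalidate_tables_on_memory_pressure=true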
20 © Cloudera, Inc. All rights reserved.
Impala Architecture – Coordinators and Executors
[Diagram: the same architecture as slide 4, but only the Impala coordinator(s) run the FE (Java) query compiler/coordinator with the metadata cache, while the executor nodes run only the BE (C++) query executors against HDFS, Kudu, S3, ADLS, and HBase.]
21 © Cloudera, Inc. All rights reserved.
Admission Control Woes
• Impala Admission Control not enabled / Default Memory limit not set for each pool
• Heuristics-based memory estimation is not 100% accurate, and is worse if table stats are unavailable
• The GROUP BY estimate can be particularly off when there is a large number of group-by columns
• Mem estimate = NDV of group by column 1 * NDV of group by column 2 * … NDV of group by column n
• Under-admission due to higher-than-required memory reserved on each daemon
• Queries will OOM out unnecessarily
• Query Status: Admission for query exceeded timeout 60000ms in pool root.nprd_tst_hadoop_data_appl_readonly. Queued
reason: Not enough aggregate memory available in pool root.nprd_tst_hadoop_data_appl_readonly with max mem resources
150.00 GB. Needed 40.00 GB but only 30.00 GB was available.
• Always enable IAC
• Limit the amount of memory used by an individual query using per query mem-limit
• Set it from Impala shell / Hue - set mem_limit=<per query limit>
• Set default memory limit per pool
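For instance, the per-query limit can be set from impala-shell or Hue before running the statement; the 2 GB value and the web_logs table below are illustrative only:
  set mem_limit=2g;   -- per-host cap for all following queries in this session
  SELECT count(*) FROM web_logs WHERE event_date = '2019-04-30';
  set mem_limit=0;    -- remove the session-level override again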
22 © Cloudera, Inc. All rights reserved.
Query Concurrency Woes
• Impala Admission Control is decentralized; each coordinator makes an independent decision based on
the last known running queries on the cluster, as communicated by the StateStore (SS)
• This makes IAC decisions fast, but they might be a little imprecise during times of heavy load
across many daemons. This is called over-admission.
• Using dedicated coordinators limits this over-admission of queries!!
• You get controlled query concurrency, so each query runs faster and overall query throughput is
higher.
• Ideal total query concurrency = # Cores on the executors / datanodes
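As a hypothetical sizing example of that rule (the numbers are illustrative, not from the deck): 20 executor nodes with 16 cores each give an ideal total concurrency of about 20 * 16 = 320 running queries across all pools; admission control should queue anything beyond that rather than letting queries contend for CPU.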
23 © Cloudera, Inc. All rights reserved.
Impala Resource Pools
Resource Pools Design -
• 10 Impala daemons, 200 GB per daemon = 2 TB total, 8
tenants
• OK, let's divide the memory across the tenants. Everyone gets
what they pay for. Good design, right?
Issues
• Unused memory cannot be used by other tenants
• Busy tenants queue up queries in admission control,
causing overall "slowness" in query execution
• Small tenants running large queries will spill to disk
until the spill-to-disk limit and eventually OOM out
• How bad does it get? 25+ tenants, only 2/3 active at
any given time
24 © Cloudera, Inc. All rights reserved.
Scaling Impala Resource Pools
• Design resource pools according to peak memory needed
• Use simple grouping to create small, medium and large pools
• Use Cloudera Manager APIs for user chargeback if needed
25 © Cloudera, Inc. All rights reserved.
Metadata Operations
• Invalidate Metadata
• Runs asynchronously to discard the loaded metadata from the catalog cache; a metadata load will be triggered by any subsequent query.
• Should be run when
• New tables created / Tables dropped by Hive/Spark
• Block locations changed by HDFS load balancer
• Recover Partitions
• Scans HDFS to check if any new partition directories were added, and caches block metadata for those files
• Refresh Table / Refresh Table Partition
• Adding/Removing/Overwriting files into partitions via Hive/Spark
• Running operations like ALTER TABLE
• Reloads metadata for the table from HMS and does an incremental reload of the file and block metadata.
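For reference, the three operations above map to statements like these (the table and partition names are illustrative only):
  -- Discard metadata for one table only (avoid a cluster-wide INVALIDATE METADATA)
  INVALIDATE METADATA web_logs;
  -- Pick up partition directories added directly on HDFS
  ALTER TABLE web_logs RECOVER PARTITIONS;
  -- Incremental reload after Hive/Spark added files to an existing partition
  REFRESH web_logs PARTITION (event_date='2019-04-30');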
26 © Cloudera, Inc. All rights reserved.
Scaling Metadata Operations
• How bad does it get?
• 18K invalidate table commands per day
• What not to do: invalidate metadata on every change. Instead:
• No commands are needed if
operations/ETL run in Impala
• Always run refresh <table>
<partition> when adding data
• Recover partitions when partitions
are added
• Refresh Table for other changes
• Limit Invalidate Metadata (IM) to a specific <table> only
27 © Cloudera, Inc. All rights reserved.
Automatic Metadata Sync (In Beta)
• CatalogD polls Hive Metastore (HMS) notification events
• Invalidates tables when it receives ALTER TABLE events, or ALTER, ADD,
or DROP events on their partitions.
• Adds the tables or databases when it receives the CREATE TABLE or CREATE DATABASE events.
• Removes the tables from catalogd when it receives the DROP TABLE or DROP
DATABASE events.
• Operations that do not generate events in HMS, such as adding new data to existing
tables/partitions from Spark, are not supported.
• I.e., Load / Insert from such engines still needs a refresh <table> <partition>
28 © Cloudera, Inc. All rights reserved.
Automatic Metadata Sync (In Beta)
• To disable the event-based HMS sync for a new database, set the impala.disableHmsSync database property in Hive
• CREATE DATABASE <name> WITH DBPROPERTIES ('impala.disableHmsSync'='true');
• To enable or disable the event-based HMS sync for a table
• CREATE TABLE <name> TBLPROPERTIES ('impala.disableHmsSync'='true' | 'false');
• To change the event-based HMS sync at the table level
• ALTER TABLE <name> SET TBLPROPERTIES ('impala.disableHmsSync'='true' | 'false');
• When both table and database level properties are set, the table level property takes precedence.
• If the property is changed from true (meaning events are skipped) to false (meaning events are not skipped), issue a
manual INVALIDATE METADATA command to reset.
29 © Cloudera, Inc. All rights reserved.
Scaling Compute Stats
• Compute Stats is very CPU-intensive; its cost depends on the number of rows, number of data files, the total size
of the data files, and the file format
• For partitioned tables, the numbers are calculated per partition, and as totals for the whole table.
• Limit the number of columns: only compute stats on columns involved in filters, join conditions,
group by or partition by clauses (see the sketch after this list)
• Re-compute stats only when there is > 30% data change
• Run compute stats on weekends/nights. Not needed after every data load.
• If you reload a complete new set of data for a table, but the number of rows and number of distinct
values for each column is relatively unchanged from before, you do not need to recompute stats for
the table
• Use enable_stats_extrapolation (experimental)
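A sketch of the column-limited and partition-limited variants mentioned above; recent Impala releases accept a column list on COMPUTE STATS, and the table, columns, and partition here are illustrative only:
  -- Gather stats only for the columns used in joins, filters and GROUP BY
  COMPUTE STATS web_logs (customer_id, event_date, status_code);
  -- Restrict incremental stats work to the partition that just changed
  COMPUTE INCREMENTAL STATS web_logs PARTITION (event_date='2019-04-30');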
30 © Cloudera, Inc. All rights reserved.
Set Statistics Manually
• Quick fix as part of data load, while compute stats can be scheduled on weekends
• Set total number of rows. Applies to both unpartitioned and partitioned tables.
• alter table <table_name> set tblproperties('numRows'='new_value',
'STATS_GENERATED_VIA_STATS_TASK'='true');
• Set total number of rows for a specific partition. Applies to partitioned tables only. You must specify all
the partition key columns in the PARTITION clause.
• alter table <table_name> partition (keycol1=val1, keycol2=val2, ...) set
tblproperties('numRows'='new_value',
'STATS_GENERATED_VIA_STATS_TASK'='true');
• Column stats:
• ALTER TABLE <table_name> SET COLUMN STATS <col_name>
('numDVs'='100')
• Compute numDVs with “SELECT NDV(col)”
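Filled in with hypothetical values (web_logs, customer_id and the counts below are examples, not from the deck), the end of a nightly load script might look like:
  ALTER TABLE web_logs PARTITION (event_date='2019-04-30')
    SET TBLPROPERTIES ('numRows'='2500000', 'STATS_GENERATED_VIA_STATS_TASK'='true');
  ALTER TABLE web_logs SET COLUMN STATS customer_id ('numDVs'='185000');
  -- 185000 obtained once via: SELECT NDV(customer_id) FROM web_logs;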
31 © Cloudera, Inc. All rights reserved.
Other Scalability Considerations
• Use star schemas, integer join keys
• Check for hot spotting - increase the replication factor for master data / frequently queried data
• Avoid casts – implicit or explicit (easily over 10% improvements for larger volumes)
• Increase RUNTIME_FILTER_WAIT_TIME_MS for complicated queries (example after this list), but coordinators need to do
more work
• Use HDFS file handle cache
• Give the OS enough free memory to cache data blocks
• Set default compression codec - improves disk read performance
• Use high CPU nodes, fast processors
• Multiple Impala clusters - DistCp the data over from the remote cluster rather than reading it remotely
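A hedged example of the runtime-filter option mentioned above; it is a per-session query option, and the 10-second value plus the fact_sales/dim_store tables are illustrative only:
  -- Let scans wait longer for runtime filters from the build side of the join
  SET RUNTIME_FILTER_WAIT_TIME_MS=10000;
  SELECT f.* FROM fact_sales f JOIN dim_store s ON f.store_id = s.store_id
  WHERE s.country = 'UK';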
32 © Cloudera, Inc. All rights reserved.
BI Tools
• Always always close queries
• idle_query_timeout = 60
• idle_session_timeout = 1800
• Use handcrafted SQL
• Use different pools for different queries
and encourage use of set mem_limit;
• Use JDBC over Kerberos authentication
mechanism
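The two timeouts at the top of this slide are impalad startup flags (in seconds); a sketch using the values from the slide:
  # on the coordinator impalads
  --idle_query_timeout=60      # cancel queries whose client has been idle for 60 s
  --idle_session_timeout=1800  # close sessions that have been idle for 30 min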
33 © Cloudera, Inc. All rights reserved.
Understanding Query Profiles
• Impala query profiles can be retrieved from Cloudera Manager, the Impala coordinator web UI, or from
the command line by executing `profile` (see the example after this list)
• Includes nanosecond timers for all operations on all nodes
• Quite detailed and exhaustive, but the basics are easy
• We can easily answer -
• What’s the bottleneck for this query?
• Why is this run fast but that run slow?
• How can I tune to improve this query's performance?
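From impala-shell the flow is simply the following (the query and table are hypothetical placeholders):
  [impalad-host:21000] > SELECT count(*) FROM web_logs WHERE event_date = '2019-04-30';
  [impalad-host:21000] > profile;   -- prints the full runtime profile of the last query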
34 © Cloudera, Inc. All rights reserved.
Understanding Query Profiles
• Always check the Impala version
and the default query options set
• Check warnings
• Query state – running,
cancelled
• Check query type – Query,
DDL, etc
35 © Cloudera, Inc. All rights reserved.
Understanding Query Profiles
• Check per node peak memory usage
• Tells you how high the memory limit should be for these queries
• Shows skews in memory usage among nodes
36 © Cloudera, Inc. All rights reserved.
Understanding Query Profiles
• For completed queries, read the summary in detail
• Check what’s taking the max time, max memory, check for skews in data
• Check row estimates; depending on whether stats are available, these can be skewed
• Check the join order; it is determined entirely by total size (#rows * column width)
• Try to ensure that, after partition pruning, the RHS is smaller than the LHS
• Broadcast joins are the default, partitioned joins for large tables of roughly equal size.
37 © Cloudera, Inc. All rights reserved.
Understanding Query Profiles
• Read the query timeline in detail
• Check which step is taking the most time, and
why
• Usual culprits
• Metadata load
• Completed admission
• ClientFetchWaitTimer
• First dynamic filter received
• Last row fetched
38 © Cloudera, Inc. All rights reserved.
Understanding Query Profiles
• Check each plan fragment
• Tells us what it did and how many
hosts it ran on
• How much data it processed
• Partition pruning stats for
HDFS scans
• Parquet push down predicates
39 © Cloudera, Inc. All rights reserved.
Key Takeaways
• Always use dedicated coordinators/executors
• IAC should be enabled and memory limit set
• Metadata management is significantly improved
• Zero touch metadata coming soon
• Follow best practices for Impala queries and performance tuning - refer to the Impala Cookbook
40 © Cloudera, Inc. All rights reserved.
Rate today's session