Hash vs Join
A case study evaluating the use of the data
step hash object to replace a SQL join
Geoff Ness
Sep 2014
The Hash Object
• Effectively a lookup table which resides in memory, holding key/value pairs
• Similar to associative arrays or dictionaries in other programming languages
• Fast lookup (roughly O(1)), no sorting required
• Can offer a faster alternative to a traditional data step merge or SQL join, at a price:
– The syntax is unfamiliar to a lot of SAS programmers
– There’s more code to write
– Requires more memory than a join (sometimes much more)
Using Hash to replace a SQL Join
[Diagram: a central fact table surrounded by Dimension 1, Dimension 2, Dimension 3 and Dimension 4]
SQL Join
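For reference, the kind of join being replaced might look like the step below. This is a sketch only: the dataset names (fact, dim1, dim2) and column names (dim1_key, dim1_desc and so on) are hypothetical, and only two dimensions are shown for brevity.

```sas
/* Hypothetical names: fact, dim1, dim2 and their key/description columns */
proc sql;
  create table joined_sql as
  select f.*,
         d1.dim1_desc,
         d2.dim2_desc
  from fact as f
       inner join dim1 as d1 on f.dim1_key = d1.dim1_key
       inner join dim2 as d2 on f.dim2_key = d2.dim2_key;
quit;
```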
Alternative using the Hash Object
• Replacing the join typically requires 3 steps to be coded:
1 - Create variables by ‘faking’ a set statement, so that the dimension columns exist in the PDV
2 - Then declare a hash object for each dimension
3 - Finally, join rows from the fact to rows in the dimensions by calling the hash .find() method
• The .find() method returns 0 when the hash object holds an entry matching the current value of the key column(s) named in .definekey(); the columns named in .definedata() are then populated from that entry
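The three steps above can be sketched as a single data step. This is an illustration under assumptions, not the code from the original slides: the datasets fact, dim1 and dim2 and the columns dim1_key, dim1_desc, dim2_key and dim2_desc are hypothetical, and only two dimensions are shown.

```sas
data joined_hash;
  /* Step 1: 'fake' set statement. It never executes, but at compile time */
  /* it adds the dimension columns to the PDV with the right type/length. */
  if 0 then set dim1(keep=dim1_key dim1_desc)
                dim2(keep=dim2_key dim2_desc);

  /* Step 2: declare and load one hash object per dimension, once only. */
  if _n_ = 1 then do;
    declare hash h1(dataset: 'dim1');
    h1.definekey('dim1_key');    /* column used for the lookup   */
    h1.definedata('dim1_desc');  /* column returned by the lookup */
    h1.definedone();             /* loads dim1 into memory        */

    declare hash h2(dataset: 'dim2');
    h2.definekey('dim2_key');
    h2.definedata('dim2_desc');
    h2.definedone();
  end;

  /* Step 3: read each fact row and look it up in every dimension. */
  set fact;
  rc1 = h1.find();
  rc2 = h2.find();

  /* Inner join: keep the row only when every lookup found a match. */
  if rc1 = 0 and rc2 = 0 then output;
  drop rc1 rc2;
run;
```

Neither the fact table nor the dimensions need to be sorted or indexed for this to work; the cost is that every dimension is held in memory for the life of the step.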
Performance Comparison
• Joining 2 dimensions, small fact (100K rows)
• Joining 2 dimensions, large fact (~10M rows)
• Joining 9 dimensions, small fact (100K rows)
• Joining 9 dimensions, large fact (~10M rows)
Stuff we haven’t considered
• Outer joins (yes, these are possible)
• When proc sql will use the hash object ‘under the covers’
• Performance against RDBMS tables (as opposed to SAS datasets)
• Hash iterators
• Other things that can be done with the hash object (sorting, summarisation, de-duplication)
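As a rough illustration of the outer join case, a left join can be sketched by keeping every fact row and blanking the dimension columns when the lookup fails. The names used here (fact, dim1, dim1_key, dim1_desc) are hypothetical, and this is one possible approach rather than the only one.

```sas
data left_joined;
  /* Compile-time only: adds the dimension columns to the PDV. */
  if 0 then set dim1(keep=dim1_key dim1_desc);

  if _n_ = 1 then do;
    declare hash h1(dataset: 'dim1');
    h1.definekey('dim1_key');
    h1.definedata('dim1_desc');
    h1.definedone();
  end;

  set fact;

  /* No match: clear the dimension column so that a value retained */
  /* from an earlier fact row is not carried onto this one.        */
  if h1.find() ne 0 then call missing(dim1_desc);

  /* Left join: every fact row is written, matched or not. */
  output;
run;
```

The call missing() step matters because variables read with a set statement are retained across iterations, so a failed lookup would otherwise leave the previous row's value in place.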
Summary
• Implementing a join using the hash object can provide a considerable saving in terms of time, usually at the expense of memory
• The code is a little more involved but breaks down to a reasonably simple process to implement
• Things to consider:
– The number and size of tables involved
– The memory required to hold all the hash objects
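One way to get a feel for the memory involved is to load a dimension into a hash object and multiply the number of items by the size of each item. The sketch below assumes a hypothetical dataset dim1 with columns dim1_key and dim1_desc, and assumes a SAS release in which the hash object's NUM_ITEMS and ITEM_SIZE attributes are available. The result is an approximation only, since it ignores the hash object's own overhead.

```sas
data _null_;
  /* Compile-time only: adds the dimension columns to the PDV. */
  if 0 then set dim1(keep=dim1_key dim1_desc);

  declare hash h1(dataset: 'dim1');
  h1.definekey('dim1_key');
  h1.definedata('dim1_desc');
  h1.definedone();

  items          = h1.num_items;   /* number of entries loaded       */
  bytes_per_item = h1.item_size;   /* bytes per entry (key plus data) */
  approx_mb      = items * bytes_per_item / 1024**2;

  put 'dim1 hash: ' items= bytes_per_item= approx_mb= 8.1;
  stop;  /* no input is read, so end the step explicitly */
run;
```

Summing this figure across all the dimensions gives a lower bound to compare against the memory available to the SAS session.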
References
The SAS® Hash Object in Action
http://support.sas.com/resources/papers/proceedings09/153-2009.pdf
Introduction to SAS® Hash Objects
http://www.scsug.org/wp-content/uploads/2013/11/Introduction-to-SAS%C2%AE-Hash-Objects-Chris-Schacherer.pdf
A Hash Alternative to the PROC SQL Left Join
http://www.nesug.org/proceedings/nesug06/dm/da07.pdf
Using the Hash Object – SAS® Language Reference: Concepts
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a002585310.htm
Questions?

Editor's Notes

  • #4: Typical scenario which is handled currently by a SQL join: one large, central fact table containing data to be aggregated against levels from the surrounding dimensions. Note: all the tables involved in this case are SAS datasets resident on Windows servers, created and accessed via SAS 9.2
  • #5: Note that proc sql might actually make use of the hash join ‘under the covers’, depending on how much of the smaller table can fit into a single memory buffer
  • #6: The data step compiler does not know about the hash object when it is created, so we need to supply it with metadata in advance to assist with the formation of the PDV
  • #7: The definekey() method names the column(s) forming the key used to look up rows in the hash object. The definedata() method names the columns to be returned from the lookup. Once the definedone() method is called, SAS reads the rows of the dataset named in the dataset parameter and populates the hash object.
  • #8: The join is restricted to matching rows by only outputting when the .find() method has returned 0 for all hash objects
  • #9: The dimensions in this case were reasonably large, 2-4 million rows in each. Some difference in terms of time, but the most noticeable difference is how much more memory is required by the hash method.
  • #10: The memory required for the hash objects hasn’t changed, but the time shows a much more significant difference between the two methods.
  • #11: The memory requirement has increased significantly with the addition of new dimensions, but check out how much less system cpu time is required for the hash object method! This indicates that significantly less data transfer is being handed off to the operating system, and less disk access is required.
  • #12: This is where the real payoff can be seen. Note that memory continues to be a consideration – if you don’t have a lot of RAM available this might rule out the use of the hash object.
  • #13: Outer joins are implemented simply by changing how the data step responds to the return code from the hash .find() method. In general, Proc SQL won’t use the hash object when an outer join is requested; it uses a merge join with a sort or index instead. Memory also plays a part in the path the optimizer chooses: which tables fit into available memory? Iterators allow a hash object to be treated as an iterable sequence rather than a lookup table.