Star Schemas using Scalable Performance Data Engine (SAS® Software)
Patrick Cuba – Consultant
• Case Study – Need for SPDE
• SPDE Library
• Case Study – Need for SPDS
• SPDS Server
  • Clusters
  • Star Schema
  • StarJoin
• Questions
• References
2
• Table build is 6 hours
• Query time is 20 minutes
• Latest table is 360GB
• Generation tables hold 24 months
• Generation tables have grown to 1TB each
• 300+ columns
• Four balances per credit card (Max 255)
• 20 million customers
• Growing customer base
• Keeps defaults customer balance
3
• At month end, a snapshot of the Latest table and the staged cycle-end data for the month are added to SAS Generation Tables
• Accounts cycle on different days in the month
[Figure: cycle-end data and the Latest table feeding the Cycle-end and Month-end generation tables]
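The month-end roll can be sketched with Base SAS generation data sets. This is an illustration only: it assumes the "SAS Generation Tables" on the slide are generation data sets created with the GENMAX= option, and the librefs (mart, hist), paths, and table names are hypothetical.

```sas
/* Hypothetical librefs, paths, and table names, for illustration only */
libname mart '/disk1/mart';
libname hist '/disk1/history';

/* Create the month-end table as a generation group of up to 24 versions */
data hist.month_end(genmax=24);
  set mart.latest;
run;

/* Each later replacement adds a generation; once 24 versions exist,
   the oldest one is deleted */
data hist.month_end;
  set mart.latest;
run;

/* Read the previous snapshot with a relative generation number */
proc print data=hist.month_end(gennum=-1 obs=10);
run;
```

Every month kept this way is a full copy of the table, which is consistent with the generation tables growing to around 1TB each.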
4
SAS Dataset
• SAS Datasets are flat files
Page 5
libname all_users '/disk1/metadata';
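A minimal sketch of working with a Base SAS libref follows. The libref and path are the ones on the slide; the table name, columns, and rows are invented for illustration.

```sas
/* Assign a Base SAS libref (path taken from the slide) */
libname all_users '/disk1/metadata';

/* Hypothetical table and columns: create a small data set in the library */
data all_users.customers;
  length name $20;
  input cust_id name $ balance;
  datalines;
1 Smith 120.50
2 Jones 75.00
;
run;

/* Read it back by its two-level name: libref.table */
proc print data=all_users.customers;
run;
```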
• Under BASE SAS License
• Scalable Performance Data Engine (SPDE)
• On an SMP server (at least 2 CPUs)
• RAID
[Figure: SAS SPD Dataset, made up of a Meta file, five Data Parts, an HBX Index, and an IDX Index]
libname all_users spde '/disk1/metadata'
  datapath=('/disk2/userdata' '/disk3/userdata')
  indexpath=('/disk4/userindexes' '/disk5/userindexes')
  partsize=128M;
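Once the SPDE libref is assigned, tables in it are used like any other SAS data set. The sketch below assumes a hypothetical source table work.latest_src with columns cust_id and balance; only the LIBNAME statement comes from the slide.

```sas
/* SPDE libref as on the slide; the paths must already exist */
libname all_users spde '/disk1/metadata'
  datapath=('/disk2/userdata' '/disk3/userdata')
  indexpath=('/disk4/userindexes' '/disk5/userindexes')
  partsize=128M;

/* Load a table; data partitions are spread across the data paths.
   work.latest_src and its columns are hypothetical. */
data all_users.latest(index=(cust_id));
  set work.latest_src;
run;

/* A WHERE selection that can use the index on cust_id */
proc means data=all_users.latest sum;
  where cust_id between 1000 and 2000;
  var balance;
run;

/* BY-group processing with no prior PROC SORT: SPDE sorts implicitly */
proc means data=all_users.latest sum;
  by cust_id;
  var balance;
run;
```

The last step relies on the implicit BY sorting mentioned in the notes; with a Base SAS libref the same step would need the data sorted or indexed by cust_id first.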
6
• Star Schema using StarJoin
• Clustered Cycle & Month end totalling 1TB
• Table build is 30-40 minutes
• Query time is seconds to 5 minutes
[Figure: star schema with a central Fact table joined to four Dimension tables]
7
• Scalable Performance Data Server
• Client/Server
• SQL Pass-thru
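A hedged sketch of connecting to SPD Server and running a pass-through query follows. Every connection value (domain, host, port, user, password) and the table latest are placeholders, not values from the source; check the option names against your SPD Server release.

```sas
/* All connection values below are placeholders (assumptions) */
libname spds sasspds 'mydomain'
  server=spdshost.5400
  user='myuser'
  password='mypass';

proc sql;
  /* Explicit pass-through: the inner query is planned and run by SPD Server */
  connect to sasspds (dbq='mydomain' host='spdshost' serv='5400'
                      user='myuser' passwd='mypass');

  select * from connection to sasspds
    (select cust_id, sum(balance) as total_balance
       from latest
      group by cust_id);

  disconnect from sasspds;
quit;
```

The LIBNAME form suits DATA steps and procedures; the pass-through form hands the whole query to the server's own SQL planner.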
8
• Clusters
[Figure: member tables M1 to M8 combined into one Cluster]
PROC SPDO LIBRARY=domain-name;
  SET ACLUSER user-name;
  CLUSTER CREATE cluster-table-name
    MEM = SPD-Server-table1
    MEM = SPD-Server-table2
    MAXSLOT=24;
QUIT;
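Because a cluster cannot be appended to directly, an update follows an uncluster, update, recluster cycle. The sketch below assumes a SASSPDS libref named spds is already assigned; the user, cluster, and member names are placeholders, and PROC APPEND is just one possible way to update a member.

```sas
/* Placeholder names throughout; spds is assumed to be a SASSPDS libref */
proc spdo library=spds;
  set acluser myuser;
  /* 1. Dissolve the cluster back into its member tables */
  cluster undo month_end_cluster;
quit;

/* 2. Update the one member table that changed */
proc append base=spds.month_end_m2 data=work.new_rows;
run;

/* 3. Re-create the cluster from its members */
proc spdo library=spds;
  set acluser myuser;
  cluster create month_end_cluster
    mem=month_end_m1
    mem=month_end_m2
    maxslot=24;
quit;
```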
9
• Facts and Dimensions
[Figure: Fact table joined to four Dimension tables]
Pairwise: 7 Joins, 1 Select
StarJoin: 3 Steps
10
execute(reset nostarjoin=<1/0>)
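The reset option is issued through SQL pass-through, ahead of the query it should affect. In this sketch all connection values, table names, and column names are hypothetical, and it assumes NOSTARJOIN=0 leaves the Star Join planner enabled while NOSTARJOIN=1 forces pair-wise joins.

```sas
/* Placeholder connection values, tables, and columns (assumptions) */
libname spds sasspds 'mydomain'
  server=spdshost.5400 user='myuser' password='mypass';

/* Simple index on a dimension table's join column */
proc datasets library=spds nolist;
  modify dim_product;
  index create prod_key;
quit;

proc sql;
  connect to sasspds (dbq='mydomain' host='spdshost' serv='5400'
                      user='myuser' passwd='mypass');

  /* Assumed semantics: 0 keeps Star Join enabled, 1 disables it */
  execute (reset nostarjoin=0) by sasspds;

  /* One fact table, equally joined to each dimension exactly once */
  select * from connection to sasspds
    (select d1.region, d2.product_type, sum(f.balance) as total_balance
       from fact_balances f, dim_customer d1, dim_product d2
      where f.cust_key = d1.cust_key
        and f.prod_key = d2.prod_key
        and d1.region = 'NSW'
      group by d1.region, d2.product_type);

  disconnect from sasspds;
quit;
```

Here dim_customer is subset by the region filter, while dim_product has no subsetting and so carries a simple index on its join column.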
Page 11
• 1. Turn it
Page 12
• 2. No
[Figure: a Fact table and six Dim tables]
Page 13
• 3. Single
[Figure: a Fact table and four Dim tables]
• 4. Single
Fact
• 5. Fact & Dimension
14
Email: patrickcuba@live.co.za
Mobile: 0458 91 2634
LinkedIn: http://www.linkedin.com/in/patrickcuba
Page 15
STARJOIN
http://support.sas.com/documentation/cdl/en/spdsug/63088/HTML/default/viewer.htm#n0mlj75x9c4dtzn1ves84e1op3jt.htm
SAS® 9.1 Scalable Performance Data Engine
http://support.sas.com/documentation/onlinedoc/91pdf/sasdoc_91/base_dataeng_6996.pdf
SAS® 9.2 Scalable Performance Data Engine
http://support.sas.com/documentation/cdl/en/engspde/61887/PDF/default/engspde.pdf
When should you use the SPDE engine
http://support.sas.com/rnd/scalability/spde/when.html

Building a Star Schema v1.1

Editor's Notes

  • #2: Star Schemas using Scalable Performance Data Engine by Patrick Cuba @ Cuba BI Consulting
  • #3: Agenda: Case Study – Need for SPDE; SPDE Library; Case Study – Need for SPDS; SPDS Server (Clusters, Star Schema, StarJoin); Questions; References.
  • #4: Case Study. The Latest table is wide (300+ columns) and long (more than 20 million customers, each with at least one credit card carrying a minimum of four balance records). Repricing means that a balance is given a new APR. Each of the four balances can have a promotional balance. Ex-customer balances are not removed from the mart. The Latest table is rebuilt daily, which takes more than 6 hours. Query times were increasing as promotions were offered to customers, because promo records exist whether or not the customer takes up the offer. The growing customer base also slowed queries. SAS Generation tables were in use but took more than an hour to query. To speed up queries against the generation tables and reduce the I/O footprint, the balance columns were stripped out of them.
  • #5: Case Study. The latest balances per credit card are stored in a Latest dataset. As credit cards cycle (statements are produced and capitalised interest is calculated), they are captured in a staging area. At month end a snapshot of the Latest table is added to a SAS Generation table for month ends, and the staged cycle-end data is added to a SAS Generation table for cycle ends.
  • #6: SAS datasets can be thought of as similar to Excel spreadsheets. Like a spreadsheet, a SAS dataset has rows and columns; columns are character or numeric, and dates are stored as numeric values. SAS datasets are accessed through a Base SAS libref.
  • #7: SPDE Library. SAS SPDE is available under the Base SAS license; simply assign an SPDE libref to take advantage of the engine. SPDE lets you partition a dataset across disk areas and directories, or even individual disk spindles, so that data can be accessed concurrently and in parallel. The server should have at least two CPUs and have threading enabled. Parallel processing makes data access faster and allows parallel loads and parallel WHERE selections. Partitioning also overcomes file size limits imposed by some operating systems, e.g. the 2GB file limit on 32-bit platforms. SPDE also sorts implicitly on BY statements. An SPDE dataset is split into four kinds of file: data partitions (.dpf) with their data descriptors, limited in size by PARTSIZE= and created cyclically across the defined paths; two index files, HBX (global) and IDX (segment), which occupy a single path until it is full and then continue on the next path; and metadata (.mdf). Each part can be partitioned further with a RAID strategy to shorten query time again.
  • #8: Case Study. Dimensional modelling lets us model the data around slowly changing dimensions and quickly changing facts. Facts are numeric variables such as balances: anything you can add, subtract, divide, or multiply. Attributes that change slowly, such as birthdate, address, and name, are stored in Dimension tables as Slowly Changing Dimensions. We achieved the following: a 'Latest' subset of 65GB, and total disk usage for Cycle End + Month End + Latest of 1TB in Clusters.
  • #9: SPDS. SAS Scalable Performance Data Server is a client/server environment that extends the SPDE engine. You can run pass-through queries on the server, which has its own SQL query planner and Access Control List (ACL) security.
  • #10: Clusters. Clustering lets us combine multiple cluster members (SPD tables) into a larger virtual table. The cluster table can be queried directly (it appears as a SAS dataset in a library). Each member can be referenced through a MINMAXVARLIST column that acts as an index of the cluster. Each cluster member is further indexed by its primary keys, which are unique to the cluster or to the cluster member. All cluster members must have an identical structure. You cannot append to a cluster: you have to uncluster it, update the individual cluster member, and recluster.
  • #11: StarJoin. The SPD Server Star Join facility validates, optimizes, and executes SQL queries on data that is configured in a star schema. If the Star Join facility is not enabled, or if SPD Server SQL does not detect a star schema, the SQL is processed using pair-wise joins. Properly configured star joins require only three steps to complete, regardless of the number of dimension tables. Pair-wise joins require one step for each table. If a star schema consists of 25 dimension tables and one fact table, the star join completes in three steps, whereas joining the same tables pair-wise requires 26 steps.
  • #12: StarJoin. For SPD Server SQL to take advantage of the STARJOIN planner, the following conditions must be true. First, STARJOIN optimization must be enabled in SPD Server.
  • #13: StarJoin. All dimension tables in the SPD Server star schema must be connected to the fact table.
  • #14: StarJoin. SPD Server dimension tables can appear in only one join condition. SPD Server fact tables are equally joined to dimension tables. SPD Server SQL infers fact tables by topology (common equally joined predicates). Dimension tables that have no subsetting require a simple index on the dimension table's join column. StarJoin options: STARMAGIC=1 forces all dimension tables to be classified as Phase I tables. STARMAGIC=4 requires an exact match on the fact table's composite index in order to meet Phase I conditions for STARJOIN. STARMAGIC=8 disables the IN-SET STARJOIN strategy, which is enabled by default. STARMAGIC=16 disables the COMPOSITE STARJOIN strategy, which is enabled by default. Phase I tables are loaded into memory; a dimension table is treated as Phase I by default if it is very small or is subset. Phase II tables are not loaded into memory. The IN-SET strategy works on simple indexes and the COMPOSITE strategy works on composite keys; almost all query optimization uses the IN-SET strategy.
  • #16: Additional information may be found here. STARJOIN: http://support.sas.com/documentation/cdl/en/spdsug/63088/HTML/default/viewer.htm#n0mlj75x9c4dtzn1ves84e1op3jt.htm. SAS® 9.1 Scalable Performance Data Engine: http://support.sas.com/documentation/onlinedoc/91pdf/sasdoc_91/base_dataeng_6996.pdf. SAS® 9.2 Scalable Performance Data Engine: http://support.sas.com/documentation/cdl/en/engspde/61887/PDF/default/engspde.pdf. When should you use the SPDE engine: http://support.sas.com/rnd/scalability/spde/when.html