Dev Analytics Aggregated DB
Design Analysis
Analytics Problem Context
• Broadly the problem can be broken into three parts
1. Aggregate from one or more sources
2. Bring the aggregated data from the BE to FE DBs
3. Serve up data in a performant way through the portals
• Fundamental Requirements
• Backend Warehouse
• That can comfortably process the incoming data volume in a safe, robust manner
• That is highly available
• Frontend DB to serve the queries from the portal
• Should serve queries in an efficient manner
• Must be highly available
• To mitigate the effects of disk performance for concurrent queries, read scale-out is required
• Scale-out Model
• The above two are components of a stamp unit, which would be replicated as instances to support further scale-out
Part 1 : Process data
• What happens?
• A large number of rows gets inserted into the mining warehouse, so CPU/memory usage can spike
• So, we chunk the data to control the resource usage (a sketch of this chunked loading follows after this list)
• Are we set up for efficient writes to the SQL data files?
• Data and Log file configurations matter
• Exploit contiguity on disk
• Replication to other secondaries/primaries (for HA) can bring the system to its knees
• So, we chunk the data to provide breathing space
• We need to monitor the replication backlogs
• Are we set up to best contain the replication overheads?
• Log files play a major role
• Read Efficiencies
• Write Efficiencies
• Spindle separation
• Data Categories
• Data that needs to be strongly consistent across partitions
• Scalability
• Currently we have 2 partitions?
• Reference to a document on the effort required to add another partition?
• Real-time
• Currently we process data once a day
• Current Indexing / Query Design Mantra
• Data comes in time order, so prioritize clustered indexes for writes
• For reads, rely on scan performance (this needs to be validated)
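A minimal T-SQL sketch of the chunked loading idea referenced above. The staging and fact table names (dbo.StagingUsageFacts, dbo.UsageFacts), the batch size, and the delay are assumptions for illustration, not the actual pipeline.

```sql
-- Load in bounded batches rather than one massive transaction, so CPU/memory
-- and the replication/log backlog stay in check (all names are hypothetical).
DECLARE @BatchSize INT = 50000;
DECLARE @RowsMoved INT = 1;

WHILE @RowsMoved > 0
BEGIN
    BEGIN TRANSACTION;

    -- Move one chunk from the staging table into the warehouse fact table.
    DELETE TOP (@BatchSize) FROM dbo.StagingUsageFacts
    OUTPUT DELETED.AppId, DELETED.EventDate, DELETED.MetricValue
        INTO dbo.UsageFacts (AppId, EventDate, MetricValue);

    -- Capture the row count before COMMIT, which resets @@ROWCOUNT.
    SET @RowsMoved = @@ROWCOUNT;

    COMMIT TRANSACTION;

    -- Optional breathing space so replication to the secondaries can catch up.
    WAITFOR DELAY '00:00:05';
END;
```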
Part 1 : Process Data (Cont)
• Data Categories
• Data that needs to be strongly consistent across partitions
• And highly available?
• Data that needs to be replicated
• Checkpoint/Restart Model?
• Scalability
• Currently we have 2 partitions?
• Reference to a document on the effort required to add another partition?
• Real-time
• Currently we process data once a day
• Current Indexing / Query Design Mantra
• Data comes in time order, so prioritize clustered indexes for writes (illustrated in the sketch after this list)
• For reads, rely on scan performance (this needs to be validated)
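The indexing mantra above can be illustrated with a hypothetical fact table whose clustered index leads with the date key; the table and column names are illustrative only, not the actual schema.

```sql
-- Hypothetical fact table: the clustered index leads with the date key, so
-- time-ordered daily loads append at the end of the B-tree instead of
-- splitting pages in the middle.
CREATE TABLE dbo.UsageFacts
(
    EventDate   DATE     NOT NULL,
    AppId       INT      NOT NULL,
    MetricId    SMALLINT NOT NULL,
    MetricValue BIGINT   NOT NULL,
    CONSTRAINT PK_UsageFacts PRIMARY KEY CLUSTERED (EventDate, AppId, MetricId)
);

-- Reads that scan a date range ride the clustered index; add non-clustered
-- indexes only where a specific query pattern justifies the write cost.
```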
Part 2: Move Data to FE While It Is Serving Data
• What happens
• User queries some application specific or trend data
• While the user query is being processed, new data arrives and is inserted in time order
• These writes are chunked and currently occur over a period of 4 hours
• The writes fragment the non-clustered indexes, and the new data also requires SQL statistics to be updated (a maintenance sketch follows after this list)
• We have no partitioning in the FE, so defrag and update statistics are costly operations, and moreover they cause CPU/memory usage spikes
• In general, large tables make everything inefficient
• Mantra for writing without disrupting query performance
• Partitioning is highly recommended, simply because the fact tables are large
• Table Partitioning
• Note that the partitions are in time order, which helps purge old data as well
• Active/Passive is a poor cousin
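The post-load maintenance referred to above might look roughly like the following; the index and table names are hypothetical. On an unpartitioned FE table these operations touch everything, whereas with partitioning they can be limited to the newly loaded partition.

```sql
-- Defragment the non-clustered index that the daily insert fragments.
-- REORGANIZE is online, but on a large unpartitioned table it still walks
-- the entire index (names are hypothetical).
ALTER INDEX IX_UsageFacts_App
    ON dbo.UsageFacts
    REORGANIZE;

-- Forced statistics refresh after the daily insert, so the optimizer sees
-- the new date range; FULLSCAN on a large table is another costly step.
UPDATE STATISTICS dbo.UsageFacts
    WITH FULLSCAN;
```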
Part 3: Serving Data
• What happens
• User queries some application specific or trend data
• Trend/category-wide data requires querying records across applications
• Some queries require processing at the FE because of the nature of the queries
• The user is asked to select a filter, say over Country (35 choices?), Age (3), and Gender (3), and the numbers for the top 20 in that category/subcategory are displayed (a sketch of such a query follows after this list)
• Pre-aggregation and sorting at this size also causes DB size bloat
• Feature Use
• For this kind of heavy lifting, it would be useful to determine to what extent this feature is being used
• Queries can be and are disrupted by the writes that occur daily
• Index Mantras
• Rely on Clustered Indexes to the extent possible
• Use a few carefully chosen non-clustered indexes as necessary
• Query Mantras
• Intelligent Point Scans
• These remove the need to churn data from time order into application order
• Use of Specific Joins
• Nested Loops Join: http://blogs.msdn.com/craigfr/archive/2006/07/26/679319.aspx
• Hash Join: http://blogs.msdn.com/craigfr/archive/2006/08/10/687630.aspx
• Merge Join: http://blogs.msdn.com/craigfr/archive/2006/08/03/merge-join.aspx
• Be conscious of the size of the tables while designing your sprocs
• Ensure that statistics updates are enabled; auto-update is recommended, and a forced update after data insertions is desirable
• When not to rely on SQL to make the appropriate choice for you? Ashish?
• Index and Query Reference Spec
• Ashish’s spec
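A hedged sketch of the filtered top-20 portal query described above. The table (dbo.DemographicFacts), its columns, and the parameter values are assumptions for illustration; in practice this would be the body of a parameterized stored procedure.

```sql
-- "Top 20 within a filter" over assumed Country / Age / Gender dimensions.
DECLARE @Country   NVARCHAR(64) = N'US',
        @AgeBucket TINYINT      = 1,
        @Gender    CHAR(1)      = 'F',
        @FromDate  DATE         = '2013-09-01';

SELECT TOP (20)
       f.AppId,
       SUM(f.MetricValue) AS Total
FROM   dbo.DemographicFacts AS f
WHERE  f.Country    = @Country
  AND  f.AgeBucket  = @AgeBucket
  AND  f.Gender     = @Gender
  AND  f.EventDate >= @FromDate
GROUP BY f.AppId
ORDER BY Total DESC;

-- Join hints (LOOP / HASH / MERGE) are a last resort; with up-to-date
-- statistics the optimizer should normally choose the strategy itself.
```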
Active Passive Solution
• Passive
• Helps control the data insertion process without disruption
• Cons
• Not the best utilization of resources
• Monitoring Intensive
• Certainly not elegant
• Data that is sent to the passive (Set A) on Day 1 has to be fed to the passive (Set B) on Day 2, while it also receives data for Day 1
• Or be robust to backlogs, should they arise
• At this point we only know that, at the least, the FE will continue to serve, though it may serve stale data in these situations for longer than we would like
Table Partitioning
• General Approach
• Create date-wise partitions, with two empty partitions mapped to filegroups
• Empty partitions are used for data insertion and data removal
• Defrag and update statistics can be performed on staging partitions created on the appropriate filegroups
• Since splitting an empty partition is a metadata-only operation, it is quick and helps maintain the same configuration
• Appropriate filegroup separation mitigates the impact of conflicting I/O operations (see the sliding-window sketch at the end of this section)
• References
• http://download.microsoft.com/download/D/B/D/DBDE7972-1EB9-470A-BA18-58849DB3EB3B/PartTableAndIndexStrat.docx
• http://msdn.microsoft.com/en-us/library/aa964122(SQL.90).aspx
• We believe this is the right long-term path for the Analytics FE DB evolution
• Even if the BE processing is offloaded to COSMOS, an FE that allows efficient insertion and removal of data while servicing requests from the portals is still required to serve the portal queries
• This is an important cog in the design of an optimal and efficient FE unit
• Active/Passive is a stop-gap measure with lower utilization of h/w assets
• Need a POC to validate that the query overheads stay within acceptable limits
• References on parallel query execution overheads
• http://technet.microsoft.com/en-us/library/ms345599(v=sql.105).aspx
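A sliding-window sketch of the general approach above. The boundary dates, filegroups, and table names are hypothetical; a real design would place partitions on separate filegroups and satisfy the usual SWITCH preconditions (matching schema, check constraints, same filegroup).

```sql
-- Empty partitions at both ends keep SPLIT/MERGE metadata-only, and SWITCH
-- moves a day of data in or out without long-running data movement.
CREATE PARTITION FUNCTION pfDaily (DATE)
    AS RANGE RIGHT FOR VALUES ('2013-09-01', '2013-09-02', '2013-09-03');

CREATE PARTITION SCHEME psDaily
    AS PARTITION pfDaily ALL TO ([PRIMARY]);  -- separate filegroups in practice

-- Daily load: stage and index the new day in a side table on the matching
-- filegroup, then switch it into the empty trailing partition.
ALTER TABLE dbo.UsageFacts_Staging
    SWITCH TO dbo.UsageFacts PARTITION 4;

-- Add the next empty boundary (metadata-only while that partition is empty).
ALTER PARTITION SCHEME psDaily NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfDaily() SPLIT RANGE ('2013-09-04');

-- Purge: switch the oldest day out to a purge table, then remove the boundary.
ALTER TABLE dbo.UsageFacts
    SWITCH PARTITION 2 TO dbo.UsageFacts_Purge;
ALTER PARTITION FUNCTION pfDaily() MERGE RANGE ('2013-09-01');
```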
Other General References
• Fast Track recommendations
• http://msdn.microsoft.com/en-us/library/gg567302.aspx
• http://msdn.microsoft.com/en-us/library/gg605238.aspx
• Best practice recommendations
• http://technet.microsoft.com/en-us/library/dd578580(v=sql.100).aspx


Editor's Notes

• #4: SQL is the underlying technology, and it must be well understood. We need appropriate configurations so that the things working underneath work for us rather than against us.