Data Mining &
Column Stores
Aung Thu Rha Hein
Why use Data Mining?
• Explosive growth in the amount of available data
• Major sources:
          • Business: web, e-commerce, transactions
          • Science: remote sensing, bioinformatics, …
          • Society: news, gadgets, social media

• Too much data, but too little information
• Data mining extracts useful information from the data and helps
  interpret it
• It can automate the process of finding relationships and patterns
  in raw data
What is Data Mining?
• Also known as Knowledge Discovery in Databases (KDD)
• The process of extracting hidden predictive information
  from large data sets
• Converts information into knowledge that helps predict
  future trends and supports decisions
• Examples:
             • Consumer buying behavior in retail supermarket sales
             • Google Instant, YouTube Instant
             • Blogs and news: Technorati, News360, and so on
             • Social mining: Livehoods finds patterns and behaviors
               in Foursquare check-in data
Data Mining Process
The Cross-Industry Standard Process for Data Mining (CRISP-DM) has six phases:
      i.   Business understanding
      ii.  Data understanding
      iii. Data preparation
      iv.  Modeling
      v.   Evaluation
      vi.  Deployment
Techniques
I.   Association rules: also known as market basket analysis
           Discover interesting associations between attributes
II.  Classification: a technique based on machine learning
           Uses mathematical techniques such as decision trees, linear
           programming, neural networks, and statistics
III. Clustering: groups objects that have similar characteristics into
           meaningful or useful clusters
IV.  Prediction: discovers relationships among independent variables
           and between dependent and independent variables
V.   Sequential patterns: discover similar patterns in transaction data
           over a business period
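
As a rough illustration of association rules (market basket analysis), the sketch below computes the two standard measures, support and confidence, for the rule "bread -> milk". The transactions and item names are hypothetical and are not taken from the slides.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Rule "bread -> milk": how often both occur, and how often milk
# appears in the baskets that already contain bread.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

A rule is usually reported only when both measures exceed thresholds chosen by the analyst.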
Tools
• There are three categories of tools for data mining:
      i. Traditional Data Mining Tools
      ii. Dashboards
      iii. Text-mining Tools


Some data mining tools:
      •   R: r-project.org
      •   Datameer Analytics Solution: datameer.com
      •   SAS Analytics: sas.com
      •   Google Chart API: code.google.com/apis/chart
Column Stores
• Store each table column by column rather than row by row
   • Column-oriented DBMSs:
       • NoSQL: Bigtable, HBase, Hypertable, Cassandra
       • Relational: Sybase IQ, MonetDB, C-Store, Vertica, VectorWise, Infobright
• Used in systems such as data warehouses and data mining platforms
• Example:

                   Emp_ID     Emp_Name         Emp_Dept      Emp_Salary
                   1          Smith            IT            40000
                   2          Adam             Sales         35000
                   3          Jones            Marketing     45000

• The database must serialize its two-dimensional table into a one-dimensional
  sequence of values for the operating system to store
• A column store writes all the values of one column together:
                1, 2, 3
                Smith, Adam, Jones
                IT, Sales, Marketing
                40000, 35000, 45000
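
The two layouts can be sketched in a few lines of Python, using the employee table from the example above. The function names are made up for illustration, and in-memory lists stand in for what a real DBMS would write to disk pages.

```python
# The employee table from the example above.
COLUMNS = ("Emp_ID", "Emp_Name", "Emp_Dept", "Emp_Salary")
ROWS = [
    (1, "Smith", "IT", 40000),
    (2, "Adam", "Sales", 35000),
    (3, "Jones", "Marketing", 45000),
]


def to_row_layout(rows):
    """Row store: the values of one record sit next to each other."""
    return [value for row in rows for value in row]


def to_column_store(columns, rows):
    """Column store: the values of one attribute sit next to each other."""
    return {name: list(values) for name, values in zip(columns, zip(*rows))}


def reconstruct_tuple(store, columns, index):
    """Rebuild one record by visiting every column (the tuple reconstruction cost)."""
    return tuple(store[name][index] for name in columns)


row_layout = to_row_layout(ROWS)
store = to_column_store(COLUMNS, ROWS)

# An aggregate over one attribute touches a single column only.
total_salary = sum(store["Emp_Salary"])

# Fetching a whole record has to gather one value from each column.
second_employee = reconstruct_tuple(store, COLUMNS, 1)
print(total_salary, second_employee)
```

The aggregate reads one list, while rebuilding a record reads all four, which is the trade-off between analytical and record-at-a-time workloads.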
Advantages and Disadvantages of
Column Stores
Advantages
• Only the relevant data needs to be read (improved bandwidth utilization)
• Improved cache locality
       No need to transfer surrounding attributes
• Compression efficiency: columns compress better than rows
       A row mixes values from different domains, while a column
       holds values from a single domain
       Row store compression ratio: 1:3
       Column store: 1:10
Disadvantages
• Increased disk seek time
• Increased cost of inserts
• Increased tuple reconstruction costs
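
To see why a column tends to compress better than a row, the sketch below applies run-length encoding to the same small table in both layouts. Run-length encoding is just one simple scheme chosen for illustration; the table is hypothetical, and the result says nothing about the ratios any particular product achieves.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each run of equal adjacent values into a (value, run_length) pair."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical (department, salary) table, sorted by department.
rows = [
    ("IT", 40000), ("IT", 40000), ("IT", 40000),
    ("Sales", 35000), ("Sales", 35000), ("Sales", 35000),
]

# Row layout: values from different domains alternate, so runs never form.
row_stream = [value for row in rows for value in row]
row_runs = run_length_encode(row_stream)

# Column layout: each column holds one domain, so equal values are adjacent.
dept_column = [dept for dept, _ in rows]
salary_column = [salary for _, salary in rows]
column_runs = run_length_encode(dept_column) + run_length_encode(salary_column)

print(len(row_runs), len(column_runs))
```

In the row layout no two neighboring values are equal, so encoding saves nothing, whereas each column collapses into two runs.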
Case Study: Bazaarvoice
• Bazaarvoice had difficulty aggregating large amounts of data on the fly,
  in real time, for its analytics product
• Its queries had a common shape: a small number of columns, with most values
  being aggregates such as counts, sums, and averages
• It used Infobright, an open source column-oriented database built on MySQL
• Test setup: a data set with 100 million records in the main fact table
• Result: analytical queries ran on average 20x faster than on MySQL
Case Study: Bazaarvoice (cont.)
• Disk footprint was over 10x smaller than with MySQL, due to data
  compression
• Why?
  • Column storage: less disk I/O
  • The "knowledge grid": aggregate data that Infobright calculates during
    data loading
      • E.g. it pre-calculates the min, max, and avg values for each column
        in a pack
• Limitations of Infobright
      • Does not support DML (INSERT, UPDATE, DELETE)
      • The only way to add data is a bulk load with the "LOAD DATA INFILE …" command
      • No way to update or delete existing data without reloading the table
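
The idea behind the knowledge grid can be sketched as follows: statistics computed per pack at load time let a query skip packs, or answer from the statistics alone, without reading the values. This is a simplified illustration and not Infobright's actual implementation; the pack size, class, and function names are all hypothetical.

```python
from dataclasses import dataclass

PACK_SIZE = 4  # hypothetical pack size, kept tiny for readability


@dataclass
class PackStats:
    minimum: int
    maximum: int
    total: int  # together with count, this gives the average
    count: int


def build_packs(column, pack_size=PACK_SIZE):
    """Split a column into packs and pre-compute statistics for each pack."""
    packs = [column[i:i + pack_size] for i in range(0, len(column), pack_size)]
    stats = [PackStats(min(p), max(p), sum(p), len(p)) for p in packs]
    return packs, stats


def count_greater_than(packs, stats, threshold):
    """Count values above threshold, reading a pack only when its stats cannot decide."""
    matched = 0
    packs_opened = 0
    for pack, s in zip(packs, stats):
        if s.maximum <= threshold:
            continue  # no value can match: skip the pack entirely
        if s.minimum > threshold:
            matched += s.count  # every value matches: answer from the stats
            continue
        packs_opened += 1  # mixed pack: the values have to be read
        matched += sum(1 for value in pack if value > threshold)
    return matched, packs_opened


# Hypothetical column values, loaded once.
values = [10, 12, 11, 13, 30, 45, 20, 50, 60, 70, 65, 80]
packs, stats = build_packs(values)
result = count_greater_than(packs, stats, 40)
print(result)
```

Only the middle pack is read here; the first is ruled out by its maximum and the third is counted from its statistics.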
References
Data Mining
•   http://en.wikipedia.org/wiki/Data_mining
•   http://www.inc.com/magazine/20101001/4-essential-data-mining-tools.html
•   http://www.dataminingtechniques.net/
•   http://www.unc.edu/~xluan/258/datamining.html
•   http://www.data-miners.com/
•   http://www.exforsys.com/tutorials/data-mining/how-data-mining-is-evolving.html
•   http://livehoods.org/

Column Stores
•   http://en.wikipedia.org/wiki/Column_store
•   http://developer.bazaarvoice.com/why-columns-are-cool
•   http://www.calpont.com/doc/Calpont_Whitepaper-Best-Practices-in_the_Use_of_Columnar_Databases.pdf
