Hadoop:
      Do Data Warehousing rules apply?

    Tony Baer

    tony.baer@ovum.com

    June 14, 2012




1                              © Copyright Ovum. All rights reserved. Ovum is a subsidiary of Informa plc.
Agenda



     §  Challenges traditional data stewardship practice

     §  Privacy – is all the world a stage?

     §  Limits to data lifecycle?

     §  Data quality: the big, the bad, the ugly – and it all might be good!




Data stewardship challenges –
    What's old is new

    Remember?

    § Back to undifferentiated gobblobs of data

    § Programmatic access reigns

    § File systems, not (always) tables

    § Batch is back

    Sample log records:

        10.102.8.152 - - [05/Nov/2003:00:19:54 -0500] "GET /inventory/index.jsp HTTP/1.1" 200 4028 "http://www.mycompany.com/index.jsp" "Mozilla/4.08 [en] (Win98; I ;Nav)"

        192.168.114.201, -, 03/20/01, 7:55:20, W3SVC2, SALES1, 172.21.13.45, 4502, 163, 3223, 200, 0, GET,/DeptLogo.gif, -,
        172.16.255.255, anonymous, 03/20/01, 23:58:11, MSFTPSVC, SALES1, 172.16.255.255, 60, 275, 0, 0,

    Sample parsing code:

        if index(tempvalue,'?') then tempvalue=scan(tempvalue,1,'?');
        else if index(tempvalue,'&')>1 then tempvalue=scan(tempvalue,1,'&');

    But…

    § Volume, variety, velocity, and where's the value?

    § Just because you can, should you?
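
    The programmatic access this slide describes can be sketched in Python. This is a minimal illustration, assuming the first sample record follows the common Apache "combined" access log layout. The names COMBINED, strip_query and parse_line are hypothetical, and strip_query only roughly mirrors the two-branch parsing snippet on the slide; it does not reproduce every edge case of the original scan call.

```python
import re
from typing import Optional

# Assumed layout: host ident user [time] "method path protocol" status size "referer" "agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def strip_query(value: str) -> str:
    """Cut a URL at the first '?'; failing that, at the first '&' past position 0."""
    if "?" in value:
        return value.split("?", 1)[0]
    if value.find("&") > 0:
        return value.split("&", 1)[0]
    return value


def parse_line(line: str) -> Optional[dict]:
    """Return the named fields of one log record, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["path"] = strip_query(record["path"])
    return record


SAMPLE = (
    '10.102.8.152 - - [05/Nov/2003:00:19:54 -0500] '
    '"GET /inventory/index.jsp HTTP/1.1" 200 4028 '
    '"http://www.mycompany.com/index.jsp" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

    Non-matching records return None rather than raising an error, so a batch job can count malformed lines instead of failing on the first one.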


Data stewardship questions for Big Data


    §  Can we, should we control this data?

    §  Are there limits to how much we should know?

    §  Can we just keep piling up data forever?

    §  Can we cleanse terabytes of data?

    §  Do we still need good data?




Privacy –
    the more things change…

     "You have zero privacy anyway…. Get over it"
         -- Scott McNealy, 1999

     "Facebook does not actually delete images… but instead merely removes the links – a fix is in sight"
         -- ZDNet, 2/6/12

     "Facebook agrees to 20 years of federal privacy audits"
         -- NY Times, 11/29/11



What privacy?



    "Florida made $63m last year by selling DMV information (name, date of birth, type of vehicle driven) to companies like LexisNexis & Shadow Soft."

    -- Terence Craig & Mary Ludloff, Privacy and Big Data (O’Reilly Media, 2011)




Big Data privacy 101 –
    Don't be creepy

    §  Governance problem first, technology second

    §  Understand the relationship with your customers & business partners

    §  Keep communications in context

    §  Don't catch your customers by surprise

    §  The law is still trying to catch up

    How Companies Learn Your Secrets

    "My daughter got this in the mail!" he said. "She's still in high school, and you're sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?"
                                                           -- NY Times 2/16/12

Data lifecycle –
     How long can this go on?

     §    Google, Yahoo, Facebook, etc. don't deprecate web data

     §    Hadoop designed for economical scale-out

     §    Moore's Law, declining cost of storage

     §    Is Hadoop Archive the answer?

     §    Is Hadoop the new tape?

     Management & skills will be the limit

     [Image: aerial view of Quincy, WA data centers]


Data Quality & Hadoop –
     Big Quality Questions

     §  Can we cleanse terabytes of data?

     §  Do we still need good data?

     §  Are there new approaches to cleansing Big Data?




Framing the issue

     §  Garbage in, garbage out, but DW forced the issue

     §  Traditional approaches
          §  Profiling, cleansing, MDM

     §  DW vs. Hadoop data quality challenges
          §  Known data sets & known criteria vs. vaguely known
          §  Bounded vs. less bounded tasks

     §  Limitations of MapReduce*
          §  Cleansing & transformation within a single Map operation
          §  Profiling & matching of unstructured data
          §  Matching of data in operations without inter-process communications

     *Source: David Loshin, "Hadoop and Data Quality, Data Integration, Data Analysis" at
     http://www.dataroundtable.com/?p=8841


Is data quality necessary for Hadoop?


     §  The App
         §  How mission-critical?
         §  Regulatory compliance impacts?
         §  What degree of business impact?

     §  The Data
         §  The 4 V's (volume, variety, velocity, value) determine what
             approaches to quality are feasible




Examples


     §    Web ad placement optimization

     §    Counter-party risk management
           for capital markets

     §    Customer sentiment analysis

     §    Managing smart utility grids or
           urban infrastructure




Bad data may be good


     §  Sensory data
         §  Outlier or drift?
         §  Time to recalibrate devices?
         §  Time to perform preventive
             maintenance?
         §  Are new/unaccounted environmental
             factors skewing readings?

     §  Human-readable data
         §  Flawed concept of reality?
         §  Flawed assumptions on data meaning?
         §  Changes producing new norm
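
     The "outlier or drift?" question can be illustrated with a small Python sketch. It is a hedged example, not a method from this deck: it assumes a baseline window of trusted readings, uses the median and median absolute deviation (MAD) as robust statistics, and treats the window sizes and the threshold k as arbitrary illustrative values. The name classify_readings is hypothetical.

```python
from statistics import median


def classify_readings(readings, baseline_n=20, recent_n=10, k=3.0):
    """Separate isolated outliers from a sustained shift in sensor readings.

    Assumption: the first `baseline_n` readings represent normal behaviour.
    """
    baseline = readings[:baseline_n]
    center = median(baseline)
    # MAD scaled by 1.4826 approximates a standard deviation for normal data.
    mad = median([abs(x - center) for x in baseline]) or 1e-9
    limit = k * 1.4826 * mad

    # Individual readings far from the baseline center.
    outliers = [i for i, x in enumerate(readings) if abs(x - center) > limit]

    # Drift: the typical recent reading has itself moved away from the baseline.
    recent_center = median(readings[-recent_n:])
    drifting = abs(recent_center - center) > limit

    return {"outliers": outliers, "drift": drifting}


if __name__ == "__main__":
    stable = [10.0, 10.1, 9.9, 10.2, 9.8] * 4
    spike = stable + [10.0] * 5 + [50.0] + [10.0] * 4
    shifted = stable + [11.0] * 10
    print(classify_readings(spike))
    print(classify_readings(shifted))
```

     A lone spike leaves the recent median unchanged and may simply be discarded, while a shifted recent median suggests recalibration, maintenance, or a new norm rather than "bad" data.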


Big Data quality in Hadoop –
     Emergent approaches

     §    Crowdsourcing data
            §  Collect data far & wide from as many diverse sources as possible. Torrents of data
                overcome the noise.
            §  Comparative trend analysis of incoming streams to dynamically ID the norm or
                sweet spot of good data
     §    Apply data science to correct the dots
            §  Don't go record by record. Statistically analyze the data set in aggregate.
            §  Iteratively analyze & re-analyze nature of data, keep analyzing outliers
            §  Apply off-the-wall approaches
     §    Enterprise Architectural approach
            §  Semantic (domain) model-driven
            §  Apply cleansing logic at run time
            §  Critical for sensitive, regulatory-driven apps
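
     The crowdsourcing and aggregate-analysis ideas above can be sketched in Python. This is an assumption-laden illustration rather than an approach prescribed by the deck: it pools readings from several sources, takes the per-key median as the "norm", and scores each source by how often it strays from that norm. The function names, the data layout (source name mapped to key/value readings) and the 10% tolerance are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def aggregate_norm(streams):
    """Pool all sources and take the median per key as the consensus value."""
    by_key = defaultdict(list)
    for readings in streams.values():
        for key, value in readings.items():
            by_key[key].append(value)
    return {key: median(values) for key, values in by_key.items()}


def source_disagreement(streams, norm, tolerance=0.1):
    """Fraction of each source's readings that fall outside the tolerance band."""
    scores = {}
    for name, readings in streams.items():
        off = sum(
            1
            for key, value in readings.items()
            if abs(value - norm[key]) > tolerance * max(abs(norm[key]), 1e-9)
        )
        scores[name] = off / len(readings) if readings else 0.0
    return scores


if __name__ == "__main__":
    streams = {
        "a": {"t1": 100, "t2": 102},
        "b": {"t1": 101, "t2": 103},
        "c": {"t1": 100, "t2": 250},
    }
    norm = aggregate_norm(streams)
    print(norm)
    print(source_disagreement(streams, norm))
```

     No individual record is corrected here; the data set as a whole defines what "good" looks like, and sources that repeatedly disagree with it are flagged for further analysis.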



Summary


     §    Challenges traditional data stewardship practice
            §  Combination of old & new
     §    Privacy – is all the world a stage?
            §  Best practices, legal requirements still in flux
            §  Don't be creepy!
     §    Limits to data lifecycle?
            §  Few enterprises are Google or Facebook
            §  Ability to manage large infrastructure will be major limit

     §    Data quality
            §  Strategy depends on type of app & data set(s)
            §  A spectrum of approaches -- from none to classic ETL to aggregate statistical
            §  No single silver bullet



Disclaimer


     All Rights Reserved.

     No part of this publication may be reproduced, stored in a retrieval system or
     transmitted in any form by any means, electronic, mechanical, photocopying,
     recording or otherwise, without the prior permission of the publisher, Ovum
     (an Informa business).

     The facts of this report are believed to be correct at the time of publication but
     cannot be guaranteed. Please note that the findings, conclusions and
     recommendations that Ovum delivers will be based on information gathered in
     good faith from both primary and secondary sources, whose accuracy we are not
     always in a position to guarantee. As such Ovum can accept no liability whatever
     for actions taken based on any information that may subsequently prove to be
     incorrect.




