MAKING BIG DATA COME ALIVE
The key to unlocking the Value in the Internet of Things?
Managing the Data!
2
For Big Data the Key is Variety!
4/25/2016 © 2015 Think Big, a Teradata Company
Definition: Datasets so complex and large that they are
awkward to work with using standard tools and techniques
Data types: Location, Social, Images, Weblogs, Videos, Text, Audio, Sensor
Size is not what matters most; variety is
3
Example Use Cases
• Predictive Maintenance
• Search and view details of an issue on the fly
• Identify critical alerts
• Root cause analysis
• Understanding usage
• And many more!
4
Changing Technology Landscape
5 © 2015 Teradata
Data Lake Architecture
[Diagram: three stages, Acquisition, Preparation, Access]
• Sources: Sensors, Email, Social, Telemetry, Mobile, Tabular Data, Machine Logs
• Platform components: Msg. queues, Feeds, Cleansing, Governance, Experiments, Access, Streams, Search, Aggregations
• Foundation: Distributed Storage; Security, Metadata/Lineage, Administration
• Analytic tools & apps: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing
• Users: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts
6 © 2015 Teradata
REFERENCE INFORMATION ARCHITECTURE
[Diagram: stages Acquisition, Preparation, Publishing, with Security and Workload Management spanning all of them; legend marks items that are "New with Big Data"]
• Sources: ERP, SCM, CRM, Images, Audio and Video, Machine Logs, Text, Web and Social
• Data layers: Secured Landing; Validation & Key Resolution; Common Keys; Derived Values, Sensitive Data Protection; Common Summaries; Optimized Structures; Shared Views & Obfuscation; User Defined Data Sets
• Supporting functions: Search; Profiling, Masking, Obfuscation
• Analytic tools & apps: Math and Stats, Data Mining, Business Intelligence, Applications, Marketing
• User groups: Data Scientists, Data Modelers, Business Analysts, IT
7
How is Data Management Changing?
• Schema on Read?
– Yes… as step one
– But data still has underlying structure
– It’s more like agile modeling – reflect as much structure as needed
• Loosely coupled schemas lose platform guarantees but gain application flexibility
• Data Modeling isn’t dead!
• Metadata is more important than ever
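A minimal sketch of the "schema on read as step one" idea above, in Python. The record fields, the `READ_SCHEMA` mapping, and the `read_with_schema` helper are all hypothetical and not taken from the talk; the point is only that raw records are stored untouched and structure is applied by the reader, with unmodeled fields kept aside rather than dropped.

```python
import json

# Raw records are stored exactly as they arrived; no schema was enforced on write.
RAW_LINES = [
    '{"device_id": "d1", "temp_c": "21.5", "fw": "1.0"}',
    '{"device_id": "d2", "temp_c": 19, "vibration": 0.02}',
]

# Hypothetical read-time schema: field name -> converter applied by the reader.
READ_SCHEMA = {
    "device_id": str,
    "temp_c": float,
}

def read_with_schema(line, schema):
    """Parse one raw line and project it onto the schema chosen by the reader."""
    raw = json.loads(line)
    row = {}
    for field, convert in schema.items():
        value = raw.get(field)
        row[field] = convert(value) if value is not None else None
    # Fields the schema does not model yet are kept in an extension map.
    row["ext"] = {k: v for k, v in raw.items() if k not in schema}
    return row

rows = [read_with_schema(line, READ_SCHEMA) for line in RAW_LINES]
```

Adding a field to `READ_SCHEMA` later changes what readers see without reloading the stored data, which is the "reflect as much structure as needed" trade-off.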
8
Changes in Logical Modeling
• JSON-like structures
– Complex collections of relations, arrays, maps of items
• Graphs
– Storing complex, dynamically changing (not static) relationships
• Binary/CLOB/specialized data
– Ability to execute specialized programs to interpret and process
9
Patterns
10
Important New Patterns
• Denormalized Fact
• Profile
• Event History
• Timeline
• Network
• Distributed Sources
• Late Data
• Deep Aggregates
• Recovery
• Multiple Active Clusters
11
Event History Pattern
Event id | Actor id | Time | Event col’s | Dim id’s | Dim col’s | Ext. Data
123 | uid1 | 1/1/15 13:16:11 | … | … | … | { “TstA” : 1 …}
456 | uid2 | 1/1/15 13:16:14 | … | … | … | { “TstB” : 1 …}
• Fact table about common events, to allow, e.g., analytics in context
– E.g., wearable device, telematics
• Stored in columnar format (e.g., Parquet, ORCFile)
• Join on the as-was value of slowly changing dimensions
• Often has an “extension” column of unparsed, not-yet-modeled JSON-like data
• Partitioned by event-time buckets, perhaps also by other dimension(s)
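Two ideas from the Event History pattern can be sketched in plain Python: joining each event to the value a slowly changing dimension had at event time, and bucketing events by event time. The dimension history, the segment names, and the hourly bucket size are assumptions made for illustration; a real deployment would do this in the storage engine or a columnar format, not in dictionaries.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime

# Hypothetical slowly changing dimension:
# actor id -> list of (valid_from, segment), sorted by valid_from.
ACTOR_SEGMENT_HISTORY = {
    "uid1": [(datetime(2014, 6, 1), "trial"), (datetime(2015, 1, 1, 12, 0), "paid")],
    "uid2": [(datetime(2014, 9, 1), "trial")],
}

EVENTS = [
    {"event_id": 123, "actor_id": "uid1", "time": datetime(2015, 1, 1, 13, 16, 11), "ext": {"TstA": 1}},
    {"event_id": 456, "actor_id": "uid2", "time": datetime(2015, 1, 1, 13, 16, 14), "ext": {"TstB": 1}},
]

def as_was_value(history, when):
    """Return the dimension value that was in effect at time `when`, or None."""
    starts = [start for start, _ in history]
    idx = bisect_right(starts, when) - 1
    return history[idx][1] if idx >= 0 else None

def denormalize(event):
    """Copy the as-was dimension value onto the event row."""
    segment = as_was_value(ACTOR_SEGMENT_HISTORY[event["actor_id"]], event["time"])
    return {**event, "segment": segment}

def partition_key(event):
    """Assumed bucket size: one partition per event-time hour."""
    return event["time"].strftime("%Y-%m-%d-%H")

partitions = defaultdict(list)
for event in EVENTS:
    partitions[partition_key(event)].append(denormalize(event))
```

Because the dimension value is copied onto the row at write time, later changes to an actor's segment do not rewrite history; the unmodeled `ext` map travels along unchanged.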
12
Timeline Pattern
Actor id | Segments | Ev1:id | Ev1:fact | Ev1:dims | Ev1:ext | Ev2:id | …
uid1 | [1, 3, 7] | 123 | 1/1/15 13:16:11 | … | { “TstA” : 1 …} | 789 | …
uid2 | [2, 3] | 456 | 1/1/15 13:16:14 | … | { “TstB” : 1 …} | 0ab | …
• Pivot on event history: table of actors with events over time
– Device history, usage in consumer journey
– Enable support/analysis on specific items, long-lived analysis
• May have a hierarchy of actors (e.g., cluster, device, component)
• May be an array of events, many columns, or sub-sorted rows (cluster key)
• Also stored in columnar format, may be partitioned
• May be updated in near real-time AND batch
• Often holds cached algorithm values (combined with the Profile pattern)
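The pivot from event history to a per-actor timeline can be sketched as a group-and-sort. This is an illustrative Python sketch, not the talk's implementation: the sample events, the third event's timestamp, and the `build_timelines` and `append_event` helpers are hypothetical, and it uses the "array of events" layout rather than wide columns.

```python
from collections import defaultdict

# Hypothetical event-history rows; times are ISO-8601 strings.
EVENT_HISTORY = [
    {"event_id": "123", "actor_id": "uid1", "time": "2015-01-01T13:16:11", "ext": {"TstA": 1}},
    {"event_id": "456", "actor_id": "uid2", "time": "2015-01-01T13:16:14", "ext": {"TstB": 1}},
    {"event_id": "789", "actor_id": "uid1", "time": "2015-01-02T08:00:00", "ext": {}},
]

def build_timelines(events):
    """Batch path: group events by actor and order each group by event time."""
    timelines = defaultdict(list)
    for event in events:
        timelines[event["actor_id"]].append(event)
    for actor_events in timelines.values():
        # ISO-8601 strings in the same format sort chronologically.
        actor_events.sort(key=lambda e: e["time"])
    return dict(timelines)

def append_event(timelines, event):
    """Near-real-time path: add one event to an existing timeline, keeping order."""
    actor_events = timelines.setdefault(event["actor_id"], [])
    actor_events.append(event)
    actor_events.sort(key=lambda e: e["time"])

timelines = build_timelines(EVENT_HISTORY)
append_event(
    timelines,
    {"event_id": "0ab", "actor_id": "uid2", "time": "2015-01-01T09:00:00", "ext": {}},
)
```

The same structure serves both update paths named on the slide: `build_timelines` for batch rebuilds and `append_event` for incremental arrivals.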
13
Network
• Ongoing status of configuration
– Parts in assembly
– Related items (versions)
– Peer groups
• For physical configuration and/or software components
• Maintain links in graph structure
– May be current or historical
• Use links to pull full context from Event History or Timeline
• Search -> simple query -> complex analytics
– E.g., transitive closure, impact analysis
• Technologies
– BlazeDB, TitanDB, Neo4j
– Spark GraphX & GraphFrames, Giraph
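Transitive closure and impact analysis over configuration links can be illustrated with a breadth-first traversal. The sketch below uses an in-memory adjacency map with made-up part names; it stands in for what a graph database or graph-processing framework would do at scale and does not use any of the listed products' APIs.

```python
from collections import deque

# Hypothetical "contains" links: assembly -> parts.
CONTAINS = {
    "cluster-A": ["device-1", "device-2"],
    "device-1": ["head-7", "controller-3"],
    "device-2": ["controller-3"],
}

def reachable(graph, start):
    """Transitive closure from one node: everything reachable by following links."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def invert(graph):
    """Reverse every link, turning 'contains' into 'is part of'."""
    inverted = {}
    for parent, children in graph.items():
        for child in children:
            inverted.setdefault(child, []).append(parent)
    return inverted

# Impact analysis: which assemblies are affected if controller-3 is faulty?
impacted = reachable(invert(CONTAINS), "controller-3")
# Full bill of materials for one cluster.
parts = reachable(CONTAINS, "cluster-A")
```

The identifiers returned by either query can then be used as keys to pull full context from the Event History or Timeline tables.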
14
Late Data
• Delays from intermittent connectivity, upstream failures
• Lineage tracking is critical
• Watermarks to identify when sufficient data has arrived (based on
statistics, upstream)
• May trigger early, on time & late
• Report on how much data has arrived late
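A small sketch of watermark-based late-data accounting, under a loudly stated assumption: the watermark here is simply "largest event time seen so far minus a fixed lag", whereas the slide suggests deriving it from arrival statistics or upstream signals. The `LateDataTracker` class and its two-minute lag are hypothetical, and the sketch covers only classifying and reporting late arrivals, not early or on-time triggering.

```python
from datetime import datetime, timedelta

# Assumption: a fixed lag; in practice this would be tuned from arrival statistics.
ALLOWED_LAG = timedelta(minutes=2)

class LateDataTracker:
    """Counts arrivals whose event time is already behind the watermark."""

    def __init__(self, allowed_lag=ALLOWED_LAG):
        self.allowed_lag = allowed_lag
        self.max_event_time = None
        self.on_time = 0
        self.late = 0

    @property
    def watermark(self):
        """Heuristic watermark: largest event time seen minus the allowed lag."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lag

    def observe(self, event_time):
        """Record one arrival and return True if it arrived late."""
        is_late = self.watermark is not None and event_time < self.watermark
        if is_late:
            self.late += 1
        else:
            self.on_time += 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return is_late

    def late_fraction(self):
        """Share of all observed arrivals that were late."""
        total = self.on_time + self.late
        return self.late / total if total else 0.0

t0 = datetime(2016, 4, 25, 12, 0, 0)
tracker = LateDataTracker()
arrivals = [t0, t0 + timedelta(minutes=1), t0 + timedelta(minutes=5), t0 + timedelta(minutes=2)]
flags = [tracker.observe(ts) for ts in arrivals]
```

Only the last arrival is flagged, because by then the watermark has advanced past its event time; `late_fraction()` is the kind of figure the last bullet asks to report.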
[Figure: Zipfian Distribution]
Case Study
16
Case Study: Overview
• Global manufacturer of storage devices: hard drives, SSDs, object storage
• Produces hundreds of millions of devices annually
• Each device contains multiple complex components
– Manufacturing sites are geographically dispersed
– Some components are sourced from suppliers
– Each device generates ~100-1000 MB of data during its lifecycle
Confidential
17
Business Challenges
• Need to speed cycle time for new product development
• Customers demanding faster failure analysis
• Engineers wasting time playing “Where’s Waldo” with the data
18
Technical Challenges
• Difficulty storing & exposing binary and other data types
• Current DWs unable to keep pace with the volume
• No platform for large-scale analytics
• Data silos across manufacturing facilities
19
Goal: Expose the entire “DNA” of the device, from development and manufacturing to reliability testing and the “living behavior” of the device, to increase operational efficiency and quality
-- Chief Information Officer
20
Platform Overview
[Diagram: data sources feed a Big Data Platform that serves consumers and applications]
• Data sources: Site 1, Site 2, Site 3, Site 4, Final Assembly, Customer Data, Supplier, Shop Floor Data, Shipment Data, ...
• Big Data Platform: End-to-End Integrated Data
• Consumers: Ad hoc Analysis, Defect Pattern Recognition, Enterprise DW, Batch Analytics
• Flow labels: raw extracts, Enriched data, Parallelized batch analytics, App-Specific Views, New High-Value Parameters
• Applications: End-to-End Traceability, Tester Failure Analytics, Failure Analysis, Customer data lookup, ...
21
Use Case 1: Binary Data…with daily decoding changes
• Large volumes of binary data:
– Must be retained for 5 years for warranty reasons, leading to PBs of binary objects
• Schema on read:
– Development/Process Engineering teams change the manufacturing/test data very frequently; thus, the decoding of the binary data changes very frequently.
– It is very difficult to keep pace with these changes with a traditional RDBMS, often leading to time-consuming data purging and reloading
[Diagram label: Parsing]
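One way to picture schema on read for binary data is a table of decoders keyed by a layout version stored in each record, so that a decoding change means adding a decoder rather than purging and reloading stored blobs. The sketch below uses Python's standard `struct` module; the record layouts, the field names, and the idea that a version number leads each record are assumptions for illustration, since the source does not describe the actual formats.

```python
import struct

# Hypothetical record layouts, keyed by a version number stored in the first
# two bytes of each blob. "<" selects little-endian with no padding.
DECODERS = {
    1: ("<HIf", ("version", "serial", "temp_c")),
    2: ("<HIfH", ("version", "serial", "temp_c", "retries")),
}

def decode(blob):
    """Decode a stored blob at read time using the layout its version selects."""
    (version,) = struct.unpack_from("<H", blob, 0)
    fmt, fields = DECODERS[version]
    values = struct.unpack(fmt, blob[: struct.calcsize(fmt)])
    return dict(zip(fields, values))

# Blobs written under two different layouts can sit side by side in storage.
blob_v1 = struct.pack("<HIf", 1, 1001, 36.5)
blob_v2 = struct.pack("<HIfH", 2, 1002, 40.0, 3)
decoded = [decode(blob_v1), decode(blob_v2)]
```

The raw objects are never rewritten; supporting tomorrow's layout is a new entry in `DECODERS`.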
22
Use Case 2: Wide Structures (Timeline)
Tens of thousands of parameters are collected over the course of 6-8 months for a single device; a wide, denormalized structure reduces the complexity of end-user analysis
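To illustrate why a wide structure simplifies end-user analysis, the sketch below pivots long (device, parameter, value) rows into one wide record per device. The serial numbers, parameter names, and the `to_wide` helper are hypothetical; a production table would hold tens of thousands of columns in a columnar store rather than a Python dictionary.

```python
# Hypothetical long-format measurements: (device id, parameter name, value).
LONG_ROWS = [
    ("SN001", "head_fly_height_nm", 9.8),
    ("SN001", "spindle_rpm", 7200),
    ("SN002", "spindle_rpm", 5400),
]

def to_wide(rows):
    """Pivot long rows into one wide, denormalized record per device."""
    wide = {}
    for device_id, parameter, value in rows:
        record = wide.setdefault(device_id, {"device_id": device_id})
        record[parameter] = value
    return wide

wide = to_wide(LONG_ROWS)
```

With the wide form, an analyst reads every parameter for a device from a single row instead of self-joining the long table once per parameter.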
23
Use Case 3: “Un-Paralleled” Parallel Analysis
Solve new problems by exposing previously “untapped” data sources at a scale that allows identification of the patterns causing the issues, e.g., scanning 380 billion test points for 8 million products. Several irregular distributions were found, which allowed the team to identify a code-level bug that was causing the failures (and therefore scrapped drives).
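The shape of such a scan can be sketched as a partitioned histogram: each worker summarizes one partition of test points, and the partial histograms are merged so that an unexpected cluster of values stands out. This is only a structural illustration with made-up readings and bucket width; it uses Python threads for brevity, which do not give real CPU parallelism, whereas the case study ran on a distributed platform.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical test-point readings (integer units), split into partitions.
PARTITIONS = [
    [980, 1010, 1020, 990],
    [1000, 1520, 1490, 1010],
    [970, 1030, 1510],
]
BUCKET_WIDTH = 250  # assumed histogram bucket size

def histogram(values):
    """Map step: count readings per bucket within one partition."""
    return Counter(value // BUCKET_WIDTH for value in values)

def scan(partitions, workers=4):
    """Run the map step on each partition, then merge the partial counts."""
    merged = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(histogram, partitions):
            merged.update(partial)
    return merged

merged = scan(PARTITIONS)
```

In this toy data most readings fall in buckets 3 and 4, and the small second cluster in buckets 5 and 6 is the kind of irregular distribution an engineer would then investigate.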
24
Conclusions
25
Conclusions
• IoT is about blending data
• Data management patterns & practices are foundational
• They lead to effective analytics
• Reach me at @ronbodkin, ron.bodkin@thinkbiganalytics.com
26
• Services: Big Data Strategy & Roadmap; Data Lake Implementation; Analytics & Data Science; Training & Managed Services
• Questions addressed:
– How Can My Organization Get Value from Big Data?
– How Do We Build a Best Practices Big Data Environment That Will Meet Our Needs?
– How Do We Reap Value from Our Big Data Investment?
– How Do We Keep Our People and Environment Operating at a High Level?
• Hadoop, Spark Solutions since 2010.
• We’re Hiring
27
Customer Satisfaction Index Analytic Solution Announcement
• Incorporate all data from all touch points to understand customers’ true behavior
• Leverage multi-genre advanced analytics techniques to generate behavior-based insights
• Available NOW



Editor's Notes

  • #6: This illustrates the fundamental processing in iconic form.
  • #7: This slide shows how security and governance are managed across the phases of ingest, data preparation, and publishing. It is the heart of “Goldilocks governance.” The slide builds, starting with Data Scientists, who need early access. Data Modelers come next; they expect much of the data science work to have already occurred. It then jumps up to Business Analysts. These are the three major user groups. Business Analysts need access to materialized models and common summaries, but will mostly access data through view-heavy interfaces that enforce security and help with further materialization, or at least representation (views), of the specific models. IT is last, and it needs access to everything.
  • #18: Images courtesy of: http://hybridclaims.com
  • #19: Images courtesy of: http://neaglobal.com http://eweek.com
  • #20: Increase customer satisfaction: commitment to quality; improved customer service and access to data (internally & externally). Increase operational efficiency: improved yield & time-to-market. Achieved by having end-to-end visibility into every test, every diagnostic, and all info from all components of a product. Enable the business to extract new insights (never before possible).
  • #27: This slide represents Think Big’s end-to-end big data services portfolio. We have services that span from big data strategy and roadmap all the way to training and managed services. Today we’re going to talk about data lake implementation best practices.