Innovation and Reinvention Driving Transformation
2018 HPCC Systems® Community Day
October 9, 2018
Dan S. Camper – Sr. Architect, HPCC Solutions Lab
Data Patterns: A Native Open Source Data Profiling Tool for HPCC Systems
What Is Data Profiling?
• A method of examining data to collect statistics and information about that data
• Determines the “shape” of the data
  • Data types
  • Lengths
  • Cardinality
  • Prominent discrete values
  • Patterns
• Also known as “Data Discovery”
Data Patterns: A Data Profiling Tool for HPCC Systems 2
When Would You Profile Data?
• Explore a new dataset
  • Determine the real data types
  • Determine field population
  • Spot garbage data
  • Find highly correlated fields
• Verify data updates
  • Ensure that structure has not changed
  • Check for expected cardinality
  • Check for expected fill rates
  • Check for unexpected garbage
DataPatterns.Profile()
• Written entirely in ECL
  • It is a single FUNCTIONMACRO
  • No library or module dependencies
• Performs all profiling checks by default
• Numerous parameters for controlling analysis and output
  • Analyze all rows in a dataset or just a sample
  • Analyze all fields or only certain fields
  • Enable only specified profiling checks
  • Specify returned pattern counts
• Creates a single dataset as a result
  • One record for each field analyzed
DataPatterns.Profile() – The Usual Analysis
Output | Description
attribute | The name of the field in the input dataset
given_attribute_type | The ECL type of the attribute as it was defined in the RECORD definition
best_attribute_type | An ECL data type that both allows all values in the input dataset and consumes the least amount of memory
rec_count | The number of records analyzed
fill_count | The number of rec_count records containing non-nil values
fill_rate | The percentage of rec_count records containing non-nil values
cardinality | The number of unique, non-nil values
modes | The most common value(s) in the attribute, after coercing all values to STRING, along with the number of records in which the values were found
min_length | The shortest length of a value when expressed as a string
max_length | The longest length of a value when expressed as a string
ave_length | The average length of a value when expressed as a string
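The measures in the table above can be illustrated with a small sketch. This is written in Python purely for illustration and is not the bundle's ECL implementation. The function name `profile_column` is hypothetical, and two choices are assumptions rather than documented behavior: a value counts as "nil" when it is `None` or an empty string, and the length statistics are computed over non-nil values only.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column using the measures named in the output table."""
    # Assumption: a value is "nil" when it is None or an empty string.
    filled = [str(v) for v in values if v is not None and str(v) != ""]
    rec_count = len(values)
    fill_count = len(filled)
    counts = Counter(filled)                 # values are coerced to strings first
    top = max(counts.values(), default=0)    # highest frequency seen
    lengths = [len(s) for s in filled]       # assumption: nil values are skipped
    return {
        "rec_count": rec_count,
        "fill_count": fill_count,
        "fill_rate": 100.0 * fill_count / rec_count if rec_count else 0.0,
        "cardinality": len(counts),
        "modes": sorted((v, n) for v, n in counts.items() if n == top),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "ave_length": sum(lengths) / fill_count if fill_count else 0.0,
    }

result = profile_column(["GA", "FL", "", "GA", None, "Texas"])
print(result)
```

In this sample, four of six records are filled, so the fill rate is about 66.7%, the cardinality is 3, and the single mode is "GA" with a count of 2.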
DataPatterns.Profile() – Analysis For Numeric Fields
Output | Description
is_numeric | Boolean indicating if the original attribute was numeric and therefore whether or not the numeric_xxxx output fields will be populated with actual values
numeric_min | The smallest non-nil value as a DECIMAL
numeric_max | The largest non-nil value as a DECIMAL
numeric_mean | The mean (average) non-nil value as a DECIMAL
numeric_std_dev | The standard deviation of the non-nil values as a DECIMAL
numeric_lower_quartile | The value separating the first (bottom) and second quarters of non-nil values as a DECIMAL
numeric_median | The median non-nil value as a DECIMAL
numeric_upper_quartile | The value separating the third and fourth (top) quarters of non-nil values as a DECIMAL
numeric_correlations | A child dataset containing correlation values comparing the current numeric attribute with all other numeric attributes, listed in descending correlation value order
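As a rough illustration of the numeric measures listed above, the following Python sketch computes them with the standard `statistics` module. It is not the bundle's ECL code. The function name `numeric_profile` is hypothetical, and several details are assumptions: `None` stands in for a nil value, the population standard deviation is used, quartiles use the "inclusive" interpolation method, and plain Python numbers replace DECIMAL. Correlations are left out to keep the sketch short.

```python
import statistics

def numeric_profile(values):
    """Compute the numeric_* style measures for one numeric column."""
    # Assumption: None represents a nil value and is excluded.
    nums = [v for v in values if v is not None]
    # quantiles() with n=4 returns the three cut points between the quarters.
    lower_q, median, upper_q = statistics.quantiles(nums, n=4, method="inclusive")
    return {
        "numeric_min": min(nums),
        "numeric_max": max(nums),
        "numeric_mean": statistics.mean(nums),
        # Assumption: population (not sample) standard deviation.
        "numeric_std_dev": statistics.pstdev(nums),
        "numeric_lower_quartile": lower_q,
        "numeric_median": median,
        "numeric_upper_quartile": upper_q,
    }

stats = numeric_profile([3, 1, None, 5, 2, 4])
print(stats)
```

Different quartile conventions give slightly different cut points on small samples, so results from another tool may not match these exactly.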
DataPatterns.Profile() – Text Patterns
• Text patterns give you an idea of what your data looks like when it is expressed as a human-readable generalized string
• Very useful for spotting data that doesn’t belong
• Converts each character of the string into a fixed character palette to produce a new string pattern
  • Any uppercase letter => A
  • Any lowercase letter => a
  • Any numeric digit => 9
  • Any boolean value => B
  • All other characters remain as-is
• By counting the unique patterns and ranking them, you can easily see what kind of data is very common or very rare
• All string data types are supported
DataPatterns.Profile() – Text Pattern Analysis
Output | Description
popular_patterns | The most common patterns of values; patterns are listed from most- to least-common and an example (pulled from the data) is shown for each
rare_patterns | The least common patterns of values; patterns are listed from least- to most-common and an example (pulled from the data) is shown for each; patterns already shown in popular_patterns are not repeated here

Examples
Original Value | Pattern
45816.01 | 99999.99
Dan Camper | Aaa Aaaaaa
For *only* $10! | Aaa *aaaa* $99!
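The character palette described for text patterns can be sketched in a few lines. The Python below is an illustration only, not the bundle's ECL code. The names `text_pattern` and `rank_patterns` are hypothetical, and mapping a whole boolean value to a single "B" is an assumption about how that rule is applied.

```python
from collections import Counter

def text_pattern(value):
    """Map a value to its generalized pattern string."""
    # Assumption: a boolean value as a whole becomes the single character "B".
    if isinstance(value, bool):
        return "B"
    out = []
    for ch in str(value):
        if ch.isdigit():
            out.append("9")      # any numeric digit
        elif ch.isupper():
            out.append("A")      # any uppercase letter
        elif ch.islower():
            out.append("a")      # any lowercase letter
        else:
            out.append(ch)       # all other characters remain as-is
    return "".join(out)

def rank_patterns(values):
    """Count the distinct patterns and list them from most to least common."""
    return Counter(text_pattern(v) for v in values).most_common()

print(text_pattern("45816.01"))
print(text_pattern("Dan Camper"))
print(text_pattern("For *only* $10!"))
print(rank_patterns(["30301", "30302", "3030A"]))
```

Applied to the three example values from the slide, the sketch reproduces the patterns shown there. Ranking the patterns of a ZIP-code-like column makes the one value containing a letter stand out as a rare pattern.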
Some Data To Profile …
… And How To Profile It
Import the DataPatterns module
Define a record structure
Declare the dataset
Call the profiler
Show result
Profiling Results – The Usual Suspects
Profiling Results – Numeric Fields
Profiling Results – Data Pattern Analysis
Final Thoughts
• DataPatterns is an open-source ECL bundle
  • https://github.com/hpcc-systems/DataPatterns.git
• Currently contains only two functions
  • Profile()
  • BestRecordStructure()
• Future plans
  • Histograms for numeric fields
  • Additional information for low-cardinality fields
  • Expand correlations to non-numeric discrete-value fields
  • Easy comparison of profile results to detect changes
  • Visualization
  • Data Detectors
Questions?
Innovation and Reinvention Driving Transformation
2018 HPCC Systems® Community Day
October 9, 2018
Hicham Elhassani – VP Modeling Vertical Support
Dan S. Camper – Sr. Architect, HPCC Solutions Lab
Making IoT Data Actionable Using Predictive Analytics
If you think connected “things” are everywhere NOW . . .
IoT Units Installed Base by Category (Millions of Units)
Category | 2016 | 2017 | 2018 | 2020
Consumer | 3,963 | 5,244 | 7,036 | 12,863
Business: Cross-Industry | 1,102 | 1,501 | 2,133 | 4,381
Business: Vertical-Specific | 1,317 | 1,635 | 2,028 | 3,171
Grand Total | 6,382 | 8,381 | 11,197 | 20,415
Source: Gartner (January 2017)
Big Questions for Insurance
• Value proposition?
• Cyber risk?
• What does the data say?
• Who is driving?
• Incremental or revolutionary?
• Cost vs. benefit?
Importance for insurers to collect IoT data today
Chart: Importance of collecting IoT data to company’s insurance strategy (n=120)
Response categories: Very/Somewhat important; Neither important nor unimportant; Not at all/Not very important
Shares shown: 8%, 70%, 22%
Collection of Connected Home Data
Chart: Collection and/or Purchase of Connected Home Data (n=120)
• Collect today = 24% (segments of 1%, 4%, and 19%)
  • Collect/purchase, use in decision-making
  • Collect/purchase, plan to use
  • Collect/purchase, but not sure how to use
• Don’t collect today = 76% (two segments of 38% each)
  • Don’t collect/purchase, but plan to
  • Don’t collect/purchase, don’t plan to
Timeline to begin collecting Connected Home data
Chart: Anticipated Timeline for Collecting and/or Using Connected Homes Data (among those not currently using, but planning to use connected homes, n=73)
• In next year: 4%
• In next 2-3 years: 52%
• In next 4-5 years: 34%
• In 6+ years: 7%
• Not sure: 3%
Next 3 years = 56%; 4+ years = 41%
Home Loss Statistics and IoT Opportunities
Chart: share of home losses by peril. Perils shown: Wind, Hail, Fire, Water (non-weather), Water (weather), Theft, Liability, Other. Shares shown: 11%, 25%, 21%, 22%, 21%.
Data sources listed for the perils:
• Internal data: Security, Freeze detection, Leak detection, Smoke/CO, Temp/Humidity, Motion sensor, Appliances, Audio/video
• External data: Weather API, Social media events, Loss history, Property info, Geo information
Today, let’s discuss some examples
Occupancy: Monitoring/Prevention
Water Leak: Monitoring/Alert
Smart Thermostat Data: Primary Residence
Chart: HVAC Mode Observations (y-axis 0 to 350). Labels: Eco; July 4th Weekend
Source: Nest
Smart Thermostat Data: Vacation Home
Chart: HVAC Mode Observations (y-axis 0 to 120). Labels: Eco; July 4th Weekend
Source: Nest
Example: Water Leak Detection
Chart: readings from 3/12/2018 0:00 to 3/26/2018 19:00 (y-axis 0.0 to 70.0)
Annotated events: Shower; Restroom; Laundry x3; Dishwasher x2; Dishwasher; Child’s bath (seven occurrences)
Source: Streamlabs
Example: Water Leak & Assignment of Benefits
File It
Assignment of benefits (AOB) is a legal tool that allows the homeowner to transfer their rights to collect from an insurance claim to a third party.
Fix It
AOB is commonly used when a homeowner employs a contractor or water remediation company to fix water damage from pipe and appliance leaks.
Fake It
This arrangement has permitted some contractors to inflate claims, resulting in a dramatic increase in the frequency and severity of Florida non-weather water claims.
Source: Office of Insurance Consumer Advocate, Florida Office of Insurance Regulation
Assignment of Benefits – Florida vs USA (Excl. Florida)
Chart: Accidental Water Discharge and Appliance Leakage Loss Cost, 2011 to 2016
Y-axis: Loss Cost ($), 0 to 30
Series: USA (Excl. Florida); Florida
Source: LexisNexis Internal Research
Assignment of Benefits – Tri Counties
Figure labels: Broward, Miami-Dade, Palm Beach
Source: LexisNexis Internal Research
Water Leak and Geo-located losses
Chart: Accidental Water Discharge and Appliance Leakage Frequency, 2011 to 2016
Y-axis: Frequency, 0.00% to 0.50%
Series: Broward County; Miami-Dade County; Palm Beach County; Florida (Excl. Tri Counties)
Source: LexisNexis Internal Research
Harvey: Tweets Containing “Flood”
Weather Events Digital Trail
• Elk City tornado by the NOAA: yesterday 17/05/2017
• Flood
• Hail
• Lightning
• Tornado
• Wildfire
Stream Analytics: Push and Pull data sources
Perils: Wind, Hail, Fire, Water (non-weather), Water (weather), Theft, Liability, Other
Data platforms will be key to unlocking the full potential of this opportunity
Diagram labels:
• IoT Platform
• Insurer: Marketing, Contact, Quote, Underwriting, Renewal, Compliance, Claim
• Automation, Mitigation, Utilities, Security
• Connected Home, Connected Car, Connected Self, Connected Business
How to start unlocking these insights now
Technology/Analytics to develop and deploy a pilot program
HPCC Systems Architecture
HPCC Systems – Pull Architecture
• Device users register at a web portal
• Authentication and authorization via device manufacturer’s web site
• Authorization response includes an access token
• All registration information is saved
• Thor queries devices for all registered users in parallel
• Ancillary data, such as weather conditions local to every device, is periodically gathered
• Analytics are also run periodically, as often as needed
• ROXIE is updated with analytics results, which are made available to external services
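The pull flow above can be sketched in miniature. The Python below is only an illustration of the idea of querying every registered user's device in parallel with a saved access token. It does not show how Thor works or any real manufacturer API. The registry contents, the `fetch_readings` stub, and the returned fields are all hypothetical, and a thread pool merely stands in for the platform's parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical registry saved at registration time: user -> access token.
REGISTRATIONS = {"alice": "token-a", "bob": "token-b"}

def fetch_readings(user, token):
    """Stand-in for a call to a device manufacturer's API (hypothetical)."""
    # A real call would send the access token and parse the device's response.
    return {"user": user, "authorized": token.startswith("token-"), "temp_f": 70.0}

def pull_all(registrations):
    """Query every registered user's device in parallel."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(fetch_readings, u, t) for u, t in registrations.items()]
        # Results are collected in submission order, not completion order.
        return [f.result() for f in futures]

readings = pull_all(REGISTRATIONS)
print(readings)
```

In a periodic job, the collected readings would then be joined with ancillary data such as local weather before the analytics step.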
HPCC Systems – Push Architecture
• Authorized devices are whitelisted via master device management
• Remote devices send their data to ROXIE
• After validation and normalization, each message is stored in Kafka and Couchbase
• Thor periodically pulls new messages from Kafka for processing
• Ancillary data, such as weather conditions local to every device, is periodically gathered
• Analytics are also run periodically, as often as needed
• ROXIE is updated with analytics results, which are made available to external services
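The push flow above follows a receive, validate, store, then batch-process shape. The Python sketch below illustrates that shape only. An in-memory `deque` stands in for the Kafka topic, and the device IDs, message fields, and function names are hypothetical. None of this reflects the actual ROXIE, Kafka, or Couchbase interfaces.

```python
from collections import deque

WHITELIST = {"dev-001", "dev-002"}   # hypothetical whitelisted device IDs
message_log = deque()                # in-memory stand-in for the message store

def receive(message):
    """Validate and normalize one device message, then store it."""
    # Normalization (assumed): trim and lowercase the ID, coerce the value to float.
    device_id = str(message.get("device_id", "")).strip().lower()
    if device_id not in WHITELIST:
        return False                 # reject devices that are not whitelisted
    message_log.append({"device_id": device_id, "value": float(message["value"])})
    return True

def pull_new_messages():
    """Drain everything stored since the last pull, as a periodic batch job would."""
    batch = list(message_log)
    message_log.clear()
    return batch

accepted = receive({"device_id": " DEV-001 ", "value": "3.5"})
rejected = receive({"device_id": "dev-999", "value": "1.0"})
batch = pull_new_messages()
print(accepted, rejected, batch)
```

Separating the receive step from the batch pull is what lets devices send data at any time while the analytics run on their own schedule.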
  • 1. Innovation and Reinvention Driving Transformation OCTOBER 9, 2018 2018 HPCC Systems® Community Day Dan S. Camper – Sr. Architect, HPCC Solutions Lab Data Patterns: A Native Open Source Data Profiling Tool for HPCC Systems
  • 2. What Is Data Profiling? • A method of examining data to collect statistics and information about that data • Determines the “shape” of the data • Data types • Lengths • Cardinality • Prominent discrete values • Patterns • Also known as “Data Discovery” Data Patterns: A Data Profiling Tool for HPCC Systems 2
  • 3. When Would You Profile Data? • Explore a new dataset • Determine the real data types • Determine field population • Spot garbage data • Find highly-correlated fields • Verify data updates • Ensure that structure has not changed • Check for expected cardinality • Check for expected fill rates • Check for unexpected garbage Data Patterns: A Data Profiling Tool for HPCC Systems 3
  • 4. DataPatterns.Profile() • Written entirely in ECL • It is a single FUNCTIONMACRO • No library or module dependencies • Performs all profiling checks by default • Numerous parameters for controlling analysis and output • Analyze all rows in a dataset or just a sample • Analyze all fields or only certain fields • Enable only specified profiling checks • Specify returned pattern counts • Creates a single dataset as a result • One record for each field analyzed Data Patterns: A Data Profiling Tool for HPCC Systems 4
  • 5. DataPatterns.Profile() – The Usual Analysis Data Patterns: A Data Profiling Tool for HPCC Systems 5 Output Description attribute The name of the field in the input dataset given_attribute_type The ECL type of the attribute as it was defined in the RECORD definition best_attribute_type An ECL data type that both allows all values in the input dataset and consumes the least amount of memory rec_count The number of records analyzed fill_count The number of rec_count records containing non-nil values fill_rate The percentage of rec_count records containing non-nil values cardinality The number of unique, non-nil values modes The most common value(s) in the attribute, after coercing all values to STRING, along with the number of records in which the values were found min_length The shortest length of a value when expressed as a string max_length The longest length of a value when expressed as a string ave_length The average length of a value when expressed as a string
  • 6. DataPatterns.Profile() – Analysis For Numeric Fields Data Patterns: A Data Profiling Tool for HPCC Systems 6 Output Description is_numeric Boolean indicating if the original attribute was numeric and therefore whether or not the numeric_xxxx output fields will be populated with actual values numeric_min The smallest non-nil value as a DECIMAL numeric_max The largest non-nil value as a DECIMAL numeric_mean The mean (average) non-nil value as a DECIMAL numeric_std_dev The standard deviation of the non-nil values as a DECIMAL numeric_lower_quartile The value separating the first (bottom) and second quarters of non-nil values as a DECIMAL numeric_median The median non-nil value as a DECIMAL numeric_upper_quartile The value separating the third and fourth (top) quarters of non-nil values as a DECIMAL numeric_correlations A child dataset containing correlation values comparing the current numeric attribute with all other numeric attributes, listed in descending correlation value order
  • 7. DataPatterns.Profile() – Text Patterns • Text patterns give you an idea of what your data looks like when it is expressed as a human-readable generalized string • Very useful for spotting data that doesn’t belong • Converts each character of the string into a fixed character palette to produce a new string pattern • Any uppercase letter => A • Any lowercase letter => a • Any numeric digit => 9 • Any boolean value => B • All other characters remain as-is • By counting the unique patterns and ranking them, you can easily see what kind of data is very common or very rare • All string data types are supported Data Patterns: A Data Profiling Tool for HPCC Systems 7
  • 8. DataPatterns.Profile() – Text Pattern Analysis Data Patterns: A Data Profiling Tool for HPCC Systems 8 Output Description popular_patterns The most common patterns of values; patterns are listed from most- to least- common and an example (pulled from the data) is shown for each rare_patterns The least common patterns of values; patterns are listed from least- to most-common and an example (pulled from the data) is shown for each; patterns already shown in popular_patterns are not repeated here Original Value Pattern 45816.01 99999.99 Dan Camper Aaa Aaaaaa For *only* $10! Aaa *aaaa* $99! Examples
  • 9. Some Data To Profile … Data Patterns: A Data Profiling Tool for HPCC Systems 9
  • 10. … And How To Profile It Data Patterns: A Data Profiling Tool for HPCC Systems 10 Import the DataPatterns module Define a record structure Declare the dataset Call the profiler Show result
  • 11. Profiling Results – The Usual Suspects Data Patterns: A Data Profiling Tool for HPCC Systems 11
  • 12. Profiling Results – Numeric Fields Data Patterns: A Data Profiling Tool for HPCC Systems 12
  • 13. Profiling Results – Data Pattern Analysis Data Patterns: A Data Profiling Tool for HPCC Systems 13
  • 14. Final Thoughts • DataPatterns is an open-source ECL bundle • https://github.com/hpcc-systems/DataPatterns.git • Currently contains only two functions • Profile() • BestRecordStructure() • Future plans • Histograms for numeric fields • Additional information for low-cardinality fields • Expand correlations to non-numeric discrete-value fields • Easy comparison of profile results to detect changes • Visualization • Data Detectors Data Patterns: A Data Profiling Tool for HPCC Systems 14
  • 15. Data Patterns: A Data Profiling Tool for HPCC Systems 15 Questions?
  • 16. Innovation and Reinvention Driving Transformation OCTOBER 9, 2018 2018 HPCC Systems® Community Day Hicham Elhassani – VP Modeling Vertical Support Dan S. Camper – Sr. Architect, HPCC Solutions Lab Making IoT Data Actionable Using Predictive Analytics
  • 17. Making IoT Data Actionable Using Predictive Analytics 17
  • 18. If you think connected “things” are everywhere NOW . . . Making IoT Data Actionable Using Predictive Analytics 2016 2017 2018 2020 Consumer 3,963 5,244 7,036 12,863 Business:Cross-Industry 1,102 1,501 2,133 4,381 Business:Vertical-Specific 1,317 1,635 2,028 3,171 Grand Total 6,382 8,381 11,197 20,415 Source: Gartner (January 2017) IoT Units Installed Base by Category (Millions of Units) 18
  • 19. Value proposition? Cyber risk? What does the data say? Who is driving? Incremental or revolutionary? Cost vs. Benefit? Making IoT Data Actionable Using Predictive Analytics BIG QUESTIONS FOR INSURANCE 19
  • 20. Making IoT Data Actionable Using Predictive Analytics Importance of collecting Iot data to company’s insurance strategy (n=120) 8% 70% 22% Very / Somewhat Important Neither important or unimportant Not at all/not very important Importance for insurers to collect IoT data today 20
  • 21. Making IoT Data Actionable Using Predictive Analytics Collection and/or Purchase of Connected Home Data (n=120) 1% 4% 19% 38% 38% Collect/purchase, use in decision-making Collect/purchase, plan to use Collect/purchase, but not sure how to use Don’t collect/purchase, but plan to Don’t collect/purchase, don’t plan to Collect today = 24% Don’t Collect today = 76% Collection of Connected Home Data 21
  • 22. Making IoT Data Actionable Using Predictive Analytics Timeline to begin collecting Connected Home data Anticipated Timeline for Collecting and/or Using Connected Homes Data (among those not currently using, but planning to use connected homes, n=73) In next year In next 2-3 years In next 4-5 years In 6+ years Not sure 4% 52% 34% 7% 3% Next 3Years = 56% 4+Years = 41% 22
  • 23. Home Loss Statistics and IOT opportunities Making IoT Data Actionable Using Predictive Analytics 11 % OTHERTHEFT 25 % 21% 22% 21% WIND HAIL FIRE WATER NON- WEATHERWATER WEATHER LIABILITY Internals data Security Freeze detection Leak detection Smoke/CO Temp/Humidity Motion sensor Appliances Audio/video External data Weather API Social M events Loss history Property info Geo information Internals data Security Freeze detection Leak detection Smoke/CO Temp/Humidity Motion sensor Appliances Audio/Video External data Weather API Social M events Loss history Property info Geo information Internals data Security Freeze detection Leak detection Smoke/CO Temp/Humidity Motion sensor Appliances Audio/video External data Weather API Social M events Loss history Property info Geo information Internals data Security Freeze detection Leak detection Smoke/CO Temp/Humidity Motion sensor Appliances Audio/video External data Weather API Social M events Loss history Property info Geo information Internals data Security Freeze detection Leak detection Smoke/CO Temp/Humidity Motion sensor Appliances Audio/video External data Weather API Social M events Loss history Property info Geo information 23
  • 24. Today, let’s discuss some examples. Occupancy: Monitoring/Prevention. Water Leak: Monitoring/Alert.
  • 25. Smart Thermostat Data: Primary Residence. Chart: HVAC Mode Observations (Eco), with the July 4th Weekend marked. Source: Nest.
  • 26. Smart Thermostat Data: Vacation Home. Chart: HVAC Mode Observations (Eco), with the July 4th Weekend marked. Source: Nest.
  • 27. Example: Water Leak Detection. Chart: readings from 3/12/2018 0:00 to 3/26/2018 19:00 (vertical axis 0.0 to 70.0), annotated with Shower, Restroom, Laundry x3, Dishwasher x2, Dishwasher, and repeated Child’s bath events. Source: Streamlabs.
  • 28. Example: Water Leak & Assignment of Benefits. File it: Assignment of benefits (AOB) is a legal tool that allows homeowners to transfer their rights to collect from an insurance claim to a third party. Fix it: AOB is commonly used when a homeowner employs a contractor or water remediation company to fix water damage from pipe and appliance leaks. Fake it: This arrangement has permitted some contractors to inflate claims, resulting in a dramatic increase in the frequency and severity of Florida non-weather water claims. Source: Office of Insurance Consumer Advocate, Florida Office of Insurance Regulation.
  • 29. Assignment of Benefits – Florida vs USA (Excl. Florida). Chart: Accidental Water Discharge and Appliance Leakage Loss Cost ($), 2011 to 2016; series: USA (Excl. Florida), Florida. Source: LexisNexis Internal Research.
  • 30. Assignment of Benefits – Tri Counties: Broward, Miami-Dade, Palm Beach. Source: LexisNexis Internal Research.
  • 31. Assignment of Benefits – Tri Counties: Broward, Miami-Dade, Palm Beach. Source: LexisNexis Internal Research.
  • 32. Water Leak and Geo-located losses. Chart: Accidental Water Discharge and Appliance Leakage Frequency (0.00% to 0.50%), 2011 to 2016; series: Broward County, Miami-Dade County, Palm Beach County, Florida (Excl. Tri Counties). Source: LexisNexis Internal Research.
  • 33. Harvey: Tweets Containing “Flood”
  • 34. Weather Events Digital Trail • Elk City tornado by the NOAA: yesterday 17/05/2017 • Flood • Hail • Lightning • Tornado • Wildfire
  • 35. Stream Analytics: Push and Pull data sources. Perils: Wind, Fire, Water (non-weather), Water (weather), Theft, Liability, Other, Hail.
  • 36. Data platforms will be key to unlocking the full potential of this opportunity. Diagram labels: Marketing, Contact, Quote, Underwriting, Renewal, Compliance, Claim; IoT Platform; Insurer; Automation, Mitigation, Utilities, Connected Home, Security, Connected Car, Connected Self, Connected Business.
  • 37. How to start unlocking these insights now: Technology/Analytics to develop and deploy a pilot program
  • 38. HPCC Systems Architecture
  • 39. HPCC Systems – Pull Architecture • Device users register at a web portal • Authentication and authorization via device manufacturer’s web site • Authorization response includes an access token • All registration information is saved • Thor queries devices for all registered users in parallel • Ancillary data, such as weather conditions local to every device, is periodically gathered • Analytics are also run periodically, as often as needed • ROXIE is updated with analytics results, which are made available to external services
  • 40. HPCC Systems – Push Architecture • Authorized devices are whitelisted via master device management • Remote devices send their data to ROXIE • After validation and normalization, each message is stored in Kafka and Couchbase • Thor periodically pulls new messages from Kafka for processing • Ancillary data, such as weather conditions local to every device, is periodically gathered • Analytics are also run periodically, as often as needed • ROXIE is updated with analytics results, which are made available to external services
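
The pull flow on slide 39 (saved registrations with access tokens, then a parallel query of every registered device) can be sketched roughly as follows. This is an illustration in plain Python threads, not HPCC Systems or ECL code; `Registration`, `fetch_reading`, and `poll_all` are hypothetical names, and the fetch function is a stand-in for a call to a device manufacturer's API.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Registration:
    """Hypothetical record saved when a device user registers at the portal."""
    user_id: str
    access_token: str


def fetch_reading(reg):
    # Stand-in for a manufacturer API call authorized with reg.access_token.
    # It returns a canned reading so the sketch runs without a network.
    mode = "eco" if reg.user_id.endswith("2") else "cool"
    return {"user_id": reg.user_id, "hvac_mode": mode}


def poll_all(registrations, fetch=fetch_reading, max_workers=8):
    """Query every registered device in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, registrations))


if __name__ == "__main__":
    regs = [Registration("u1", "token-1"), Registration("u2", "token-2")]
    for reading in poll_all(regs):
        print(reading)
```

In the architecture described on the slide, the periodic analytics step would consume readings like these before the results are published for external services.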

Editor's Notes

  • #18: Devices in the Internet of Things communicate with each other without a human directly prompting the interaction. Today we call this “The Internet of Things,” but that’s only because it’s new. In five years we’ll probably just call it “the internet.”
  • #19: Gartner put the number of IoT devices at 8 billion in 2017. For 2020, they estimate TWENTY billion. Cisco estimates 50 billion. We can be sure they’re both wrong, but one of them might be close. The point is, there will be tens of billions of devices generating data. And on the data side, what’s interesting is that humans have generated the majority of the data out there today, from pictures and texts, to movies, to scholarly articles. But soon the data created by “things” will dwarf the data created by humans.
  • #20: There has been a lot of activity over the past year, but these same key questions are still largely unanswered. [Walk through points] And I’ll add one more: consumer engagement. What gets consumers to push through setup challenges, replace batteries, or even engage with the device through an app? There is still a lot of ambivalence and complexity out there, so instead of taking a step back like we did last year, let’s take a step in and look at some specific use cases. Who will be the winners and losers in devices and platforms? There will continue to be consolidation, new entries, and exits. This makes partnerships and data agreements complicated. Who is driving? Is it the consumer, the insurer, or the infrastructure? As I showed on the previous slide, you may want to prevent water losses, but that doesn’t mean your policyholder shares that concern. He or she may be more likely to opt for voice-activated mood lighting. Discounts or carrier device buys may help to remedy this over time. Connected utility meters and built-in capabilities may also have an influence in time. Cyber risk: In 2016 there was a major distributed denial-of-service attack that shut down a number of websites. Wi-Fi-enabled baby monitors have been hacked. Carriers do have to consider this when potentially connecting their brand with a device. Do you want that connected thermostat you encouraged your customer to buy to be susceptible to ransomware that extorts a payment to keep the heat on during the winter? The good news is that there are good companies out there today working on building more sophisticated technology to protect connected devices. Much of the purported benefit of the connected home is speculation. How does this data really play out? Does the connected water sensor really prevent loss payments to a significant degree? Does it reduce frequency? Just severity? How much? We need a lot more data to know for sure.
And multiply that across the dozens of devices that are available. How big is the disruption? If at the end of the day we end up with a lot of new data sources that allow us to offer another 5% discount, or that help us validate the home security system discounts carriers are already giving, then it’s still useful but not revolutionary. On the other hand, if being able to price a risk from the ground up using a multitude of real-time IoT data becomes a reality, then maybe it is. The other question here is loss mitigation versus loss avoidance. Finally, there is cost, particularly the cost of the device. As we discussed above, the consumer may not buy the devices you want them to have, which means the insurer would potentially need to foot the bill (either directly or through discounting and/or rate). That math needs to work, and a $5 device will be a lot more attractive for mitigating flood risk under a given sink than an $80 device.
  • #24: Insurers can explore many ways to avoid and limit losses. So where does LexisNexis fit in the IoT world? We can analyze, normalize, and score this data for our customers (WITH THE CONSUMER’S PERMISSION, OF COURSE). We can solve the many-to-many challenge, not only for insurers, but for IoT companies, too. We can take millions of data points and turn them into something digestible and meaningful to the industry. I hope this all sounds familiar, because it’s what we do every day already. And the normalization can take many forms. It’s not hard to imagine that the Nest, the Ecobee, the Lyric, and the Sensi (all smart thermostats that use occupancy to make decisions) might produce different data. It might come at different intervals, at different levels of granularity, and there may be differences in sensitivity between them. Clearly there’s an opportunity for us to normalize that data on the way in so that we can produce an occupancy score or attribute from thermostats that works for ALL popular models of thermostat. This is not too different from what we’ve done in the UBI space to normalize driver scores across phone types.
  • #26: This is one piece of the data that we can collect from Nest thermostats. In this case I once again got one of my co-workers to agree to let me use his data, but he won’t let me use his real name because he is paranoid that his rates will go up. We are going to call him “Shawn.” Shawn has two Nest thermostats, and they each send data nearly 150 times a day. This data stream has dozens of fields, including everything from the actual temperature in the home, the desired temperature, and the thermostat location the consumer has specified, to whether someone has locked in a temperature other than those in the settings. The Nest thermostat switches to “Eco” mode when it doesn’t detect anyone present in the home, and this data is captured as well.
  • #27: Here is Shawn’s lake house. There is only one thermostat in this house, but it consistently reports “Eco” status until we get to the holiday weekend. Now, this is a very clear example, and not every example will be this clear, but it is evident.
  • #30: Assignment of Benefits mainly impacts non-weather water claims associated with leaking pipes and damaged appliances.
  • #35: Small circles are tweets containing ‘tornado’; large circles are official sightings. So we are starting to harvest based on keywords to (1) build up data to have a baseline (i.e., background noise) and (2) ‘hope’ for an event to see spikes. Right now we are grabbing tweets with words (also partial) containing the keywords Flood, Hail, Lightning, Tornado, and Wildfire.
  • #37: So where does LexisNexis fit in the IoT world? We can analyze, normalize, and score this data for our customers (WITH THE CONSUMER’S PERMISSION, OF COURSE). We can solve the many-to-many challenge, not only for insurers, but for IoT companies, too. We can take millions of data points and turn them into something digestible and meaningful to the industry. I hope this all sounds familiar, because it’s what we do every day already. And the normalization can take many forms. It’s not hard to imagine that the Nest, the Ecobee, the Lyric, and the Sensi (all smart thermostats that use occupancy to make decisions) might produce different data. It might come at different intervals, at different levels of granularity, and there may be differences in sensitivity between them. Clearly there’s an opportunity for us to normalize that data on the way in so that we can produce an occupancy score or attribute from thermostats that works for ALL popular models of thermostat. This is not too different from what we’ve done in the UBI space to normalize driver scores across phone types.
  • #38: For a carrier that wants to get started in IoT, the first objective is to get data, and this can be a challenge on your own. However, LexisNexis offers to be your partner in collecting and interpreting this data. An easy place to start is by leveraging the devices that are already in your customers’ homes. LexisNexis is in the process of rolling out internal pilots with our employees to collect Nest thermostat data via an API connection. As we move into phase II of this program by early next year, we invite you to join us. For your customers that opt in and have a Nest in their home, you will be able to simply supply them with a URL to begin collecting data. LexisNexis will then collect and process data, including pooling with other participants should you choose to participate in data sharing, and share the aggregate results with the broader group. If you are interested in a water device pilot, we are happy to work with you as well and to facilitate conversations with device makers that fit your needs.
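
The occupancy signal described in notes #26 and #27 (a thermostat reporting “Eco” when it detects nobody home) could be turned into a simple per-day attribute along the following lines. This is a minimal sketch under stated assumptions, not the scoring method used in the pilot: the `(day, mode)` input shape, the function names, and the 0.9 threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from datetime import date


def eco_share_by_day(observations):
    """Return, for each day, the fraction of observations reporting Eco mode.

    observations: iterable of (day, hvac_mode) pairs (an assumed input shape).
    """
    totals = defaultdict(int)
    eco = defaultdict(int)
    for day, mode in observations:
        totals[day] += 1
        if mode.lower() == "eco":
            eco[day] += 1
    return {day: eco[day] / totals[day] for day in totals}


def likely_unoccupied_days(observations, threshold=0.9):
    """Days whose Eco share meets the (hypothetical) threshold, in date order."""
    shares = eco_share_by_day(observations)
    return sorted(day for day, share in shares.items() if share >= threshold)


if __name__ == "__main__":
    sample = (
        [(date(2018, 7, 3), "eco")] * 10      # nobody detected all day
        + [(date(2018, 7, 4), "cool")] * 6    # holiday: home in use
        + [(date(2018, 7, 4), "eco")] * 4
    )
    print(eco_share_by_day(sample))
    print(likely_unoccupied_days(sample))
```

A share per day, rather than a single flag per observation, is one way to smooth over thermostats that report at different intervals, which is the normalization concern raised in the notes above.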