© 2014 MapR Technologies
Practical Computing with Chaos
Ted Dunning, Chief Applications Architect MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
e-book available courtesy of MapR
Also at MapR booth
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
Practical Machine Learning series (O’Reilly)
• Machine learning is becoming mainstream
• Need pragmatic approaches that take into account real world
business settings:
– Time to value
– Limited resources
– Availability of data
– Expertise and cost of team to develop and to maintain system
• Look for approaches with big benefits for the effort expended
Agenda
• Monty Hall
• Randomized geo-coding
• Thompson sampling
– Bayesian Bandits
– Targeting
– Bayesian ranking
• Dithering (sound, signals)
• Synthetic data (preview)
Let’s Start with Trouble
• Monty Hall problem (oops, done)
• Three doors, one with a fabulous prize
• You pick one
• Monty shows you that one of the remaining doors is empty
• You can switch at this point to the other door or not
• Should you switch?
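A simulation is often more persuasive than the algebra. The sketch below is not the code from the live demo; it is a minimal Python illustration, and the names play and win_rate are invented for this example. It assumes Monty always opens an empty door that you did not pick.

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant ends up with the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a door that is neither the contestant's pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize


def win_rate(switch, trials=100_000, seed=42):
    """Estimate the probability of winning under a fixed strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

Staying should win about one time in three and switching about two times in three.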
The Real Problem
• Doing the math isn’t too hard
• Convincing somebody you have the right answer is really hard
Live Coding
With REAL Chaos
Geo-coding
• In some databases, disk locality follows key locality
• The primary key is totally ordered
• Embedding a total ordering of the points in a plane is possible
– But loses some distance information
– A line is not a square!
• We want to do proximity searches
– This gets harder in the polar regions for most codings
Space Filling Curve
[Figure, two slides: cells numbered 0 to 3 at successive levels of subdivision, ordered along a space-filling curve]
Z-coding – Interleave Bits
[Figure: a grid whose quadrants and sub-cells each carry a 2-bit label (00, 01, 10 or 11)]
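Bit interleaving can be sketched in a few lines. This is an illustration rather than the talk's implementation: the function name z_key, the dotted output format and the choice to put the y bit before the x bit at each level are all assumptions.

```python
def z_key(x, y, levels):
    """Interleave the low `levels` bits of x and y, coarsest level first.

    Each level contributes two bits (y bit, then x bit), joined by dots,
    for example z_key(1, 2, 2) gives "10.01".
    """
    parts = []
    for level in reversed(range(levels)):
        x_bit = (x >> level) & 1
        y_bit = (y >> level) & 1
        parts.append(f"{y_bit}{x_bit}")
    return ".".join(parts)


if __name__ == "__main__":
    # Print the keys of a 4 x 4 grid, one row per line.
    for y in range(4):
        print("  ".join(z_key(x, y, 2) for x in range(4)))
```

Because the coarsest bits come first, sorting keys as strings orders the cells along the Z-shaped curve.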
Neighbors Often Share Prefix
[Figure: the same labeled grid, annotated with the example keys 00.11.11, 10.01.01 and 00.11.01]
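Prefix sharing is easy to check on the three keys shown on the slide. The helper name shared_levels is hypothetical; it simply counts how many leading levels two dotted keys have in common.

```python
def shared_levels(a, b):
    """Count the leading dot-separated levels that two keys share."""
    count = 0
    for part_a, part_b in zip(a.split("."), b.split(".")):
        if part_a != part_b:
            break
        count += 1
    return count


if __name__ == "__main__":
    print(shared_levels("00.11.11", "00.11.01"))  # 2
    print(shared_levels("00.11.11", "10.01.01"))  # 0
```

Keys with a long shared prefix sort next to each other, so the cells they name can be read with one short scan.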
Often, not always
[Figure labels: Close, Far]
Random Sampling to Derive Keys
[Figure: the labeled grid with points sampled at random]
"00.01.01"
"00.01.10"
"00.01.11"
"00.11.00"
"00.11.01"
"00.11.10"
"00.11.11"
"01.00.10"
"01.10.00"
"01.10.10"
"00.01.10" - "00.01.11"
"00.11.00" - "00.11.11"
"01.00.10"
"01.10.00" - "01.10.10"
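One way to read these slides: sample random points from the search region, compute the key of each cell hit, then merge consecutive keys into ranges so that a proximity search becomes a handful of range scans. The sketch below follows that reading under stated assumptions: a circular search region, a y-bit-first interleaving, and the invented names z_value, format_key, sample_keys and coalesce. It does not try to reproduce the exact keys on the slides.

```python
import math
import random


def z_value(x, y, levels):
    """Interleave the bits of x and y (y bit first) into one integer."""
    z = 0
    for level in reversed(range(levels)):
        z = (z << 2) | (((y >> level) & 1) << 1) | ((x >> level) & 1)
    return z


def format_key(z, levels):
    """Render an interleaved integer as a dotted key such as '00.11.01'."""
    return ".".join(
        format((z >> (2 * i)) & 3, "02b") for i in reversed(range(levels))
    )


def sample_keys(cx, cy, radius, levels, samples=2000, seed=1):
    """Sample points uniformly in a disc; return the sorted keys of cells hit."""
    rng = random.Random(seed)
    size = 1 << levels
    keys = set()
    accepted = 0
    while accepted < samples:
        dx = rng.uniform(-radius, radius)
        dy = rng.uniform(-radius, radius)
        if dx * dx + dy * dy > radius * radius:
            continue  # rejection sampling: keep only points inside the disc
        accepted += 1
        x, y = math.floor(cx + dx), math.floor(cy + dy)
        if 0 <= x < size and 0 <= y < size:
            keys.add(z_value(x, y, levels))
    return sorted(keys)


def coalesce(keys):
    """Merge sorted integer keys into inclusive (start, end) runs."""
    ranges = []
    for key in keys:
        if ranges and key == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], key)
        else:
            ranges.append((key, key))
    return ranges


if __name__ == "__main__":
    levels = 4
    keys = sample_keys(cx=8.0, cy=8.0, radius=3.0, levels=levels)
    for start, end in coalesce(keys):
        print(format_key(start, levels), "-", format_key(end, levels))
```

Sampling can miss cells that the region only grazes, so more samples trade extra work for better coverage.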
Dithering
• 4-bit sine wave (listen for artifacts as volume decreases)
• White dithering (artifacts gone, we hear through the noise)
• Noise shaping (noise is easier to hear through)
[Figure: waveform plotted against Time; x-axis 0 to 6, y-axis −4 to 4]
The Shape of the Noise
[Figure: histogram of Noise (x-axis, −0.4 to 0.4) against Frequency (y-axis, 0 to 3000)]
The Effect After Averaging
[Figure: waveform plotted against Time; x-axis 0 to 6, y-axis −4 to 4]
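The averaging effect can be reproduced with a toy quantizer. This is a sketch under assumptions, not the audio demo from the talk: the quantizer rounds to whole numbers, the dither is uniform white noise one quantization step wide, and the names quantize and average_quantized are invented here. Noise shaping is not covered.

```python
import math
import random


def quantize(x):
    """Round to the nearest whole level, a stand-in for a coarse converter."""
    return math.floor(x + 0.5)


def average_quantized(x, dither, trials=20_000, seed=7):
    """Quantize x many times, optionally adding dither first; return the mean."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        noise = rng.uniform(-0.5, 0.5) if dither else 0.0
        total += quantize(x + noise)
    return total / trials


if __name__ == "__main__":
    signal = 0.3  # smaller than half a quantization step
    print("no dither:  ", average_quantized(signal, dither=False))
    print("with dither:", average_quantized(signal, dither=True))
```

Without dither the value 0.3 always quantizes to 0 and is lost; with dither the average of many quantized samples comes out close to 0.3.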
Thompson Sampling
Learning in the Real World
• In the real world we get to pick our training examples
– Do we try this restaurant or not?
• Learning has real and opportunity costs
• Not learning has real and opportunity costs as well
• Every sub-optimal choice we make incurs regret
– We would like to minimize this
– But we can’t quantify regret without incurring regret!
An Example
• Pick one of five options
– Purple, blue, green, red, yellow
– Each has a random payoff
• If you pick a bad option, regret = mean(best) – mean(yours)
• The best known algorithm uses randomization
– Best = minimal regret + minimal code complexity
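A minimal Thompson sampling sketch for the five-option example follows. It is not the demo code: it assumes each option pays 0 or 1 with a fixed unknown probability, starts every option with a uniform Beta(1, 1) prior, and uses the made-up function name thompson_bandit and made-up payoff rates.

```python
import random


def thompson_bandit(true_rates, pulls=5000, seed=3):
    """Run Thompson sampling; return (pull counts per option, total regret)."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    best = max(true_rates)
    regret = 0.0
    for _ in range(pulls):
        # Draw one plausible payoff rate per option from its Beta posterior.
        draws = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        # Play the option whose sampled rate is highest.
        arm = max(range(len(draws)), key=draws.__getitem__)
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
        # regret = mean(best) - mean(yours), accumulated over every pull
        regret += best - true_rates[arm]
    counts = [w + l for w, l in zip(wins, losses)]
    return counts, regret


if __name__ == "__main__":
    names = ["purple", "blue", "green", "red", "yellow"]
    rates = [0.1, 0.2, 0.3, 0.4, 0.6]  # hypothetical payoff probabilities
    counts, regret = thompson_bandit(rates)
    for name, count in zip(names, counts):
        print(f"{name:7s} pulled {count} times")
    print("total regret:", round(regret, 1))
```

The randomization does the exploring: an option with little data has a wide posterior and is occasionally sampled high, while well-measured poor options are tried less and less.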
Demo – The Algorithm
Synthetic Data
select IR.ENC_KEY ,IR.ENCOUNTER_ ,IR.ETYPE ,IR.bill_type ,IR.CONTR_ ,IR.SOURCE_CD
,IR.sub_source_cd ,IR.HP_CD ,IR.LOB_CD ,IR.FDO ,IR.TDOS ,IR.member_Nbr
,IR.HIC_NBR ,IR.MEMBER_SOURCE_CD ,IR.HDR_ERRCD ,IR.HDR_ERRDESC
,IR.PROVIDER_NBR ,IR.provider_type ,IR.PROVIDER_SOURCE_CD
,IR.cms_provider_type ,IR.SPEC_CD ,IR.SPEC_DESC ,IR.rev_cd ,IR.rev_cd_desc
,IR.proc_cd ,IR.diag_cd ,IR.DIAG_CD_KEY ,IR.DIAGNOSIS_KEY ,IR.rec_state_cd
,IR.rec_status_cd ,IR.DG_ERRCD ,IR.DG_ERRDESC
FROM (SELECT distinct enc.encounter_key as ENC_KEY,
enc.encounter_nbr as ENCOUNTER_, typ.encounter_type_cd as ETYPE,
bt.bill_type, cnt.contract_nbr as CONTR_,
ds.SOURCE_CD, enc.sub_source_cd, enc.HP_CD, lob.LOB_CD,
enc.new_min_dt as FDOS, substr(enc.new_max_dt, 1, 10) as TDOS,
enc.member_Nbr, m.HIC_NBR, m.MEMBER_SOURCE_CD, eerr.error_cd as HDR_ERRCD,
eerr.ERROR_DESC as HDR_ERRDESC, enc.PROVIDER_NBR, prv.provider_type,
prv.PROVIDER_SOURCE_CD, diag.cms_provider_type,
sp.specialty_cd as SPEC_CD, sp.specialty_desc as SPEC_DESC, svc.rev_cd,
rev.rev_cd_desc, svc.proc_cd, dgcd.diag_cd, dgcd.DIAG_CD_KEY, diag.DIAGNOSIS_KEY,
st.rec_state_cd, sts.rec_status_cd, derr.error_cd as DG_ERRCD,
derr.error_desc as DG_ERRDESC
FROM oicpcuhg.ir_encounter enc
Can You See the Problem?
INNER JOIN oicpcuhg.ir_encountertype typ
ON (typ.encounter_type_key = enc.encounter_type_key)
LEFT OUTER JOIN oicpcuhg.ir_billtype bt
ON (bt.bill_type_key = enc.bill_type_key)
LEFT OUTER JOIN oicpcuhg.ir_contract cnt
ON (cnt.contract_key = enc.contract_key)
LEFT OUTER JOIN oicpcuhg.ir_datasource ds
ON (ds.source_key = enc.data_source_key)
LEFT OUTER JOIN oicpcuhg.ir_lineofbusiness lob
ON (lob.lob_key = enc.lob_key)
INNER JOIN oicpcuhg.ir_member m
ON (
m.hp_cd = enc.hp_cd
AND m.member_source_cd = enc.member_source_cd
AND m.member_nbr = enc.member_nbr)
LEFT OUTER JOIN oicpcuhg.ir_encountererror eerror
ON (eerror.encounter_key = enc.encounter_key and
eerror.active_flg = 'Y')
LEFT OUTER JOIN oicpcuhg.ir_error eerr
ON (eerr.error_key = eerror.error_key)
LEFT OUTER JOIN oicpcuhg.ir_provider prv
ON (prv.hp_cd = enc.hp_cd and
prv.provider_source_cd = enc.provider_source_cd and
prv.provider_nbr = enc.provider_nbr)
LEFT OUTER JOIN oicpcuhg.ir_encounterspecialty esp
ON (esp.encounter_key = enc.encounter_key)
LEFT OUTER JOIN oicpcuhg.ir_specialty sp
ON (sp.specialty_key = esp.specialty_key)
LEFT OUTER JOIN oicpcuhg.ir_service svc
ON (svc.encounter_key = enc.encounter_key)
LEFT OUTER JOIN oicpcuhg.ir_revenue rev
ON (rev.rev_cd = svc.rev_cd)
LEFT OUTER JOIN oicpcuhg.ir_diagnosis diag
ON (diag.encounter_key = enc.encounter_key)
INNER JOIN oicpcuhg.ir_diagcd dgcd
ON (dgcd.diag_cd_key = diag.diag_cd_key)
INNER JOIN oicpcuhg.ir_recordstate st
ON (st.rec_state_key = diag.rec_state_key)
INNER JOIN oicpcuhg.ir_recordstatus sts
ON (sts.rec_status_key = diag.rec_status_key)
LEFT OUTER JOIN oicpcuhg.ir_diagnosiserror derror
ON (derror.diagnosis_key = diag.diagnosis_key and
derror.active_flg = 'Y')
LEFT OUTER JOIN oicpcuhg.ir_error derr
ON (derr.error_key = derror.error_key)) IR
INNER JOIN oicpcuhg.umr_req_inbound umr
ON (trim(umr.member_nbr) = IR.member_Nbr AND
trim(umr.hhc_from_ccyymmdd) = IR.TDOS AND
trim(umr.sub_mcare_mbr) = IR.HIC_NBR AND
trim(umr.diag1) = IR.diag_cd)
One Attack
• The customer can’t give you the data
– They can’t trust you, by law
• But they can probably summarize the data
– How many columns
– What types
– Perhaps statistical summaries
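Given such a summary, fake rows can be generated without ever seeing the real data. The sketch below is only an illustration: the SUMMARY layout, the column statistics and the function name fake_rows are all invented, and the column names are borrowed from the query above purely as labels. It uses simple independent distributions, which is the easy starting point rather than a faithful model of real data.

```python
import random

# Hypothetical summary a customer might share: names, types, simple statistics.
SUMMARY = {
    "member_nbr": {"type": "int", "min": 100000, "max": 999999},
    "diag_cd": {"type": "category", "values": {"A": 0.5, "B": 0.3, "C": 0.2}},
    "charge": {"type": "float", "mean": 120.0, "stddev": 35.0},
}


def fake_rows(summary, n, seed=0):
    """Generate n rows whose columns match the summarized types and statistics."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for name, spec in summary.items():
            if spec["type"] == "int":
                row[name] = rng.randint(spec["min"], spec["max"])
            elif spec["type"] == "float":
                row[name] = rng.gauss(spec["mean"], spec["stddev"])
            elif spec["type"] == "category":
                values = list(spec["values"])
                weights = list(spec["values"].values())
                row[name] = rng.choices(values, weights=weights)[0]
            else:
                raise ValueError(f"unknown column type: {spec['type']}")
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in fake_rows(SUMMARY, 3):
        print(row)
```

If the bug reproduces on the fake rows, it can be debugged on your side; if not, the summary needs more detail, such as key skew or correlations between columns.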
Bug Replication Without Security Violation
[Diagram: columns labeled Customer and You; boxes labeled Data and Fake Data; the parameters x, y, α, ξ appear twice]
The Upshot
• So random numbers are useful
• But simple distributions not so much
• How can YOU generate cool data?
e-book available courtesy of MapR
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
Last October: Time Series Databases
by Ted Dunning and Ellen Friedman © Oct 2014 (published by O’Reilly)
Coming in February: Real World Hadoop
by Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly)
Thank you for coming today!

Editor's Notes

  • #4: Talk track: 2nd in series, first was on how to build a simple recommender. This one on anomaly detection is being sold by O’Reilly on Amazon, but for a limited time MapR is giving away the e-book for free. Here’s the link where you can register to get one.
  • #5: Talk track: ELLEN New ways to do it that take into account real world business goals, realistic resources, new types of data and best time to value…