© 2014 MapR Technologies
Practical Computing with Chaos
Ted Dunning, Chief Applications Architect MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
e-book available courtesy of MapR
Also at MapR booth
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
Practical Machine Learning series (O’Reilly)
• Machine learning is becoming mainstream
• Need pragmatic approaches that take into account real-world business settings:
– Time to value
– Limited resources
– Availability of data
– Expertise and cost of team to develop and to maintain system
• Look for approaches with big benefits for the effort expended
Agenda
• Monty Hall
• Randomized geo-coding
• Thompson sampling
– Bayesian Bandits
– Targeting
– Bayesian ranking
• Dithering (sound, signals)
• Synthetic data (preview)
Let’s Start with Trouble
• Monty Hall problem (oops, done)
• Three doors, one with a fabulous prize
• You pick one
• Monty shows you that one of the remaining doors is empty
• You can switch at this point to the other door or not
• Should you switch?
The Real Problem
• Doing the math isn’t too hard
• Convincing somebody you have the right answer is really hard
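One way to convince a skeptic is to skip the argument and simulate the game. The sketch below is an illustration, not the code from the talk; the function names, trial count, and seed are assumptions. It plays many rounds with and without switching and reports the win rate of each strategy.

```python
import random

def play(switch, rng):
    """Play one round; return True if the contestant ends up with the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens an empty door that is neither the pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one door that is neither the original pick nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

def win_rate(switch, trials=100_000, seed=42):
    """Fraction of rounds won under the given strategy."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(trials))
    return wins / trials

if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

Staying wins about one third of the time and switching about two thirds, which matches the math.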
Live Coding
With REAL Chaos
Geo-coding
• Some databases store rows in key order, so disk locality follows key locality
• The primary key is totally ordered
• Embedding a total ordering of the points in a plane is possible
– But loses some distance information
– A line is not a square!
• We want to do proximity searches
– This gets harder in the polar regions for most codings
Space Filling Curve
[Figures: two slides showing a space-filling curve over a grid whose cells are numbered 0 to 3 at each level of subdivision.]
Z-coding – Interleave Bits
[Figure: grid of cells labeled with two-bit codes (00, 01, 10, 11) at each level of subdivision.]
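Bit interleaving can be sketched in a few lines. This is a minimal illustration under assumptions: the names `z_code` and `z_key` are hypothetical, and putting the y bit before the x bit at each level is one possible convention that the slides do not specify. The dotted format mirrors the keys shown on the slides, one two-bit group per level.

```python
def z_code(x, y, bits):
    """Interleave the low `bits` bits of y and x, most significant level first."""
    code = 0
    for i in reversed(range(bits)):
        code = (code << 1) | ((y >> i) & 1)  # assumed order: y bit first
        code = (code << 1) | ((x >> i) & 1)  # then the x bit
    return code

def z_key(x, y, bits):
    """Format the Z-code as dotted two-bit groups, one group per level."""
    code = z_code(x, y, bits)
    groups = [(code >> (2 * i)) & 3 for i in reversed(range(bits))]
    return ".".join(format(g, "02b") for g in groups)

if __name__ == "__main__":
    print(z_key(2, 0, 3))  # 00.01.00
    print(z_key(3, 0, 3))  # 00.01.01, shares the prefix 00.01 with its neighbor
    print(z_key(4, 0, 3))  # 01.00.00, adjacent to (3, 0) but no shared prefix
```

The last two lines show why shared prefixes are common but not guaranteed: cells (3, 0) and (4, 0) touch, yet they sit on opposite sides of a top-level split.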
Neighbors Often Share Prefix
[Figure: the same grid with three example keys: 00.11.11, 10.01.01, and 00.11.01. The first and last share the prefix 00.11.]
Often, not always
[Figure labels: Close, Far]
Random Sampling to Derive Keys
[Figure: grid of cells labeled with two-bit codes at each level of subdivision.]
Sampled keys:
"00.01.01"
"00.01.10"
"00.01.11"
"00.11.00"
"00.11.01"
"00.11.10"
"00.11.11"
"01.00.10"
"01.10.00"
"01.10.10"
Key ranges:
"00.01.10" - "00.01.11"
"00.11.00" - "00.11.11"
"01.00.10"
"01.10.00" - "01.10.10"
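A rough reading of the idea, sketched under assumptions: to search near a point, draw random points inside the search region, convert each to its Z-code, and collapse the distinct codes into a few key ranges to scan. The disc-shaped region, the function names, the bit order, and the `max_gap` parameter are all illustrative choices rather than details taken from the talk.

```python
import math
import random

def z_code(x, y, bits):
    """Interleave the bits of y and x (assumed order: y bit first)."""
    code = 0
    for i in reversed(range(bits)):
        code = (code << 1) | ((y >> i) & 1)
        code = (code << 1) | ((x >> i) & 1)
    return code

def sample_codes(cx, cy, radius, bits, samples=2000, seed=1):
    """Sorted Z-codes of the grid cells hit by random points in a disc."""
    rng = random.Random(seed)
    size = 1 << bits
    codes = set()
    accepted = 0
    while accepted < samples:
        dx = rng.uniform(-radius, radius)
        dy = rng.uniform(-radius, radius)
        if dx * dx + dy * dy > radius * radius:
            continue  # rejection sampling: keep only points inside the disc
        accepted += 1
        x, y = math.floor(cx + dx), math.floor(cy + dy)
        if 0 <= x < size and 0 <= y < size:
            codes.add(z_code(x, y, bits))
    return sorted(codes)

def merge_ranges(codes, max_gap=1):
    """Merge sorted codes into (low, high) ranges, bridging gaps <= max_gap."""
    ranges = []
    for c in codes:
        if ranges and c - ranges[-1][1] <= max_gap:
            ranges[-1] = (ranges[-1][0], c)
        else:
            ranges.append((c, c))
    return ranges

if __name__ == "__main__":
    codes = sample_codes(cx=4.0, cy=4.0, radius=1.5, bits=3)
    print(merge_ranges(codes))
```

Raising `max_gap` scans a few keys that were never sampled in exchange for fewer, longer range scans. With the integer codes 6, 7, 12 to 15, 18, 24, and 26 and `max_gap=2`, the result is four ranges: 6-7, 12-15, 18, and 24-26.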
Dithering
• 4 bit sine wave (listen for artifacts as volume decreases)
• White dithering (artifacts gone, we hear through the noise)
• Noise shaping (noise is easier to hear through)
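The effect of white dithering can be checked numerically. The sketch below is an assumption-laden illustration, not the demo from the talk: it uses a sine wave whose amplitude is smaller than one quantization step, a round-to-nearest-integer quantizer, and uniform noise one step wide. Without dither the quantizer outputs zero everywhere and the signal is lost; with dither, averaging many noisy quantized copies recovers it. Noise shaping is not covered here.

```python
import math
import random

def quantize(v):
    """Round to the nearest integer level (the coarse quantizer)."""
    return math.floor(v + 0.5)

def averaged_error(dither, amplitude=0.3, points=50, repeats=400, seed=7):
    """Mean absolute error between the true signal and the averaged output."""
    rng = random.Random(seed)
    total = 0.0
    for k in range(points):
        true = amplitude * math.sin(2 * math.pi * k / points)
        acc = 0
        for _ in range(repeats):
            # Uniform noise one quantization step wide, added before rounding.
            noise = rng.uniform(-0.5, 0.5) if dither else 0.0
            acc += quantize(true + noise)
        total += abs(acc / repeats - true)
    return total / points

if __name__ == "__main__":
    print("no dither:", averaged_error(dither=False))
    print("dither:   ", averaged_error(dither=True))
```

The error with dither is much smaller because the noise makes the expected quantized value equal to the true value, so the randomness averages out while the signal does not.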
[Figure: signal amplitude (−4 to 4) versus Time (0 to 6).]
The Shape of the Noise
[Figure: histogram of the noise; x-axis Noise (−0.4 to 0.4), y-axis Frequency (0 to 3000).]
The Effect After Averaging
[Figure: averaged signal amplitude (−4 to 4) versus Time (0 to 6).]
Thompson Sampling
Learning in the Real World
• In the real world we get to pick our training examples
– Do we try this restaurant or not?
• Learning has real and opportunity costs
• Not learning has real and opportunity costs as well
• Every sub-optimal choice we make incurs regret
– We would like to minimize this
– But we can’t quantify regret without incurring regret!
An Example
• Pick one of five options
– Purple, blue, green, red, yellow
– Each has a random payoff
• If you pick a bad option, regret = mean(best) – mean(yours)
• The best known algorithm uses randomization
– Best = minimal regret + minimal code complexity
Demo – The Algorithm
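The demo code is not part of the slides, so the following is a minimal sketch of Thompson sampling under stated assumptions: payoffs are win/lose (Bernoulli), each option starts with a uniform Beta(1, 1) prior, and the payoff rates in the usage example are made up. On each round it draws one plausible payoff rate per option from that option's posterior and plays the option with the highest draw.

```python
import random

def thompson(true_rates, rounds=5000, seed=3):
    """Run a Bernoulli bandit; return pulls per option and total regret."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    best = max(true_rates)
    regret = 0.0
    for _ in range(rounds):
        # Sample a plausible payoff rate for each option from Beta(wins+1, losses+1).
        draws = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        choice = max(range(len(draws)), key=draws.__getitem__)
        if rng.random() < true_rates[choice]:
            wins[choice] += 1
        else:
            losses[choice] += 1
        # Regret as defined on the slide: mean(best) - mean(yours).
        regret += best - true_rates[choice]
    pulls = [w + l for w, l in zip(wins, losses)]
    return pulls, regret

if __name__ == "__main__":
    # Hypothetical payoff rates for purple, blue, green, red, yellow.
    pulls, regret = thompson([0.10, 0.12, 0.15, 0.20, 0.50])
    print(pulls, round(regret, 1))
```

Uncertain options get tried because their posterior draws are occasionally high, and that exploration fades on its own as evidence accumulates. No tuning parameter controls it.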
Synthetic Data
select IR.ENC_KEY ,IR.ENCOUNTER_ ,IR.ETYPE ,IR.bill_type ,IR.CONTR_ ,IR.SOURCE_CD
,IR.sub_source_cd ,IR.HP_CD ,IR.LOB_CD ,IR.FDO ,IR.TDOS ,IR.member_Nbr
,IR.HIC_NBR ,IR.MEMBER_SOURCE_CD ,IR.HDR_ERRCD ,IR.HDR_ERRDESC
,IR.PROVIDER_NBR ,IR.provider_type ,IR.PROVIDER_SOURCE_CD
,IR.cms_provider_type ,IR.SPEC_CD ,IR.SPEC_DESC ,IR.rev_cd ,IR.rev_cd_desc
,IR.proc_cd ,IR.diag_cd ,IR.DIAG_CD_KEY ,IR.DIAGNOSIS_KEY ,IR.rec_state_cd
,IR.rec_status_cd ,IR.DG_ERRCD ,IR.DG_ERRDESC
FROM (SELECT distinct enc.encounter_key as ENC_KEY,
enc.encounter_nbr as ENCOUNTER_, typ.encounter_type_cd as ETYPE,
bt.bill_type, cnt.contract_nbr as CONTR_,
ds.SOURCE_CD, enc.sub_source_cd, enc.HP_CD, lob.LOB_CD,
enc.new_min_dt as FDOS, substr(enc.new_max_dt, 1, 10) as TDOS,
enc.member_Nbr, m.HIC_NBR, m.MEMBER_SOURCE_CD, eerr.error_cd as HDR_ERRCD,
eerr.ERROR_DESC as HDR_ERRDESC, enc.PROVIDER_NBR, prv.provider_type,
prv.PROVIDER_SOURCE_CD, diag.cms_provider_type,
sp.specialty_cd as SPEC_CD, sp.specialty_desc as SPEC_DESC, svc.rev_cd,
rev.rev_cd_desc, svc.proc_cd, dgcd.diag_cd, dgcd.DIAG_CD_KEY, diag.DIAGNOSIS_KEY,
st.rec_state_cd, sts.rec_status_cd, derr.error_cd as DG_ERRCD,
derr.error_desc as DG_ERRDESC
FROM oicpcuhg.ir_encounter enc
Can You See the Problem?
INNER JOIN oicpcuhg.ir_encountertype typ
ON (typ.encounter_type_key = enc.encounter_type_key)
LEFT OUTER JOIN oicpcuhg.ir_billtype bt
ON (bt.bill_type_key = enc.bill_type_key)
LEFT OUTER JOIN oicpcuhg.ir_contract cnt
ON (cnt.contract_key = enc.contract_key)
LEFT OUTER JOIN oicpcuhg.ir_datasource ds
ON (ds.source_key = enc.data_source_key)
LEFT OUTER JOIN oicpcuhg.ir_lineofbusiness lob
ON (lob.lob_key = enc.lob_key)
INNER JOIN oicpcuhg.ir_member m
ON (
m.hp_cd = enc.hp_cd
AND m.member_source_cd = enc.member_source_cd
AND m.member_nbr = enc.member_nbr)
LEFT OUTER JOIN oicpcuhg.ir_encountererror eerror
ON (eerror.encounter_key = enc.encounter_key and
eerror.active_flg = 'Y')
LEFT OUTER JOIN oicpcuhg.ir_error eerr
ON (eerr.error_key = eerror.error_key)
LEFT OUTER JOIN oicpcuhg.ir_provider prv
ON (prv.hp_cd = enc.hp_cd and
prv.provider_source_cd = enc.provider_source_cd and
prv.provider_nbr = enc.provider_nbr)
LEFT OUTER JOIN oicpcuhg.ir_encounterspecialty esp
ON (esp.encounter_key = enc.encounter_key)
LEFT OUTER JOIN oicpcuhg.ir_specialty sp
ON (sp.specialty_key = esp.specialty_key)
LEFT OUTER JOIN oicpcuhg.ir_service svc
ON (svc.encounter_key = enc.encounter_key)
LEFT OUTER JOIN oicpcuhg.ir_revenue rev
ON (rev.rev_cd = svc.rev_cd)
LEFT OUTER JOIN oicpcuhg.ir_diagnosis diag
ON (diag.encounter_key = enc.encounter_key)
INNER JOIN oicpcuhg.ir_diagcd dgcd
ON (dgcd.diag_cd_key = diag.diag_cd_key)
INNER JOIN oicpcuhg.ir_recordstate st
ON (st.rec_state_key = diag.rec_state_key)
INNER JOIN oicpcuhg.ir_recordstatus sts
ON (sts.rec_status_key = diag.rec_status_key)
LEFT OUTER JOIN oicpcuhg.ir_diagnosiserror derror
ON (derror.diagnosis_key = diag.diagnosis_key and
derror.active_flg = 'Y')
LEFT OUTER JOIN oicpcuhg.ir_error derr
ON (derr.error_key = derror.error_key)) IR
INNER JOIN oicpcuhg.umr_req_inbound umr
ON (trim(umr.member_nbr) = IR.member_Nbr AND
trim(umr.hhc_from_ccyymmdd) = IR.TDOS AND
trim(umr.sub_mcare_mbr) = IR.HIC_NBR AND
trim(umr.diag1) = IR.diag_cd)
One Attack
• The customer can’t give you the data
– They can’t trust you, by law
• But they can probably summarize the data
– How many columns
– What types
– Perhaps statistical summaries
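The summarize-then-generate loop might look like the sketch below. It is deliberately the simplest version and rests on assumptions: rows are dictionaries, numeric columns are summarized by mean and standard deviation and regenerated from a normal distribution, and other columns are summarized by value counts. The function names and the sample rows are hypothetical. Note that value counts expose the actual category values, which may itself be too revealing for some data.

```python
import random
import statistics

def summarize(rows):
    """Per-column summary: a type plus mean/stdev or value counts."""
    summary = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        if all(isinstance(v, (int, float)) for v in values):
            summary[name] = {"type": "number",
                             "mean": statistics.mean(values),
                             "stdev": statistics.pstdev(values)}
        else:
            counts = {}
            for v in values:
                counts[v] = counts.get(v, 0) + 1
            summary[name] = {"type": "category", "counts": counts}
    return summary

def generate(summary, n, seed=0):
    """Generate n fake rows using only the summary, never the real rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for name, s in summary.items():
            if s["type"] == "number":
                row[name] = rng.gauss(s["mean"], s["stdev"])
            else:
                values = list(s["counts"])
                weights = list(s["counts"].values())
                row[name] = rng.choices(values, weights=weights)[0]
        rows.append(row)
    return rows

if __name__ == "__main__":
    real = [{"age": 30, "state": "CA"}, {"age": 50, "state": "NY"},
            {"age": 40, "state": "CA"}]
    print(generate(summarize(real), 3))
```

Only the summary crosses the trust boundary. Independent per-column distributions like these ignore correlations between columns, so reproducing a real bug may need richer generators.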
Bug Replication Without Security Violation
[Figure: two columns, Customer and You. The customer holds the real data; both sides hold fake data, linked by a shared set of parameters (x, y, α, ξ).]
The Upshot
• So random numbers are useful
• But simple distributions not so much
• How can YOU generate cool data?
Last October: Time Series Databases
by Ted Dunning and Ellen Friedman © Oct 2014 (published by O’Reilly)
Coming in February: Real World Hadoop
by Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly)
Thank you for coming today!

Editor's Notes

  • #4: Talk track: 2nd in series, first was on how to build a simple recommender. This one on anomaly detection is being sold by O’Reilly on Amazon, but for a limited time MapR is giving away the e-book for free. Here’s the link where you can register to get one.
  • #5: Talk track: ELLEN New ways to do it that take into account real world business goals, realistic resources, new types of data and best time to value…