Using the Hadoop Ecosystem to
Drive Healthcare Innovation
Aly Sivji
April 25, 2017
About Me
• Aly Sivji
– Twitter: @CaiusSivjus
– Blog: http://alysivji.github.io
• Senior Analyst @ IBM Watson Health
– Value-Based Care: Planning Solutions
• Grad Student @ Northwestern University
– Medical Informatics
• Interests:
– Technology 🐍
– Data 📈
– Star Trek 🖖🖖
Overview
• Big Data drives most industries
Overview
• What about Healthcare?
– Machine Learning
• Fraud detection ($65+ billion lost every year)
– Wired Article
– dataiku - Detecting Medicare Fraud
• Preventing unnecessary procedures
– Data Mining
• Identifying medication prescribed together
– Recommender Systems
• Finding similar patients
Overview
Healthcare is Different.
People who work in healthcare
Additional Reading
• John Halamka (The Health Care Blog)
• Health Catalyst
Overview
• Data Analytics / Data Science
– Retrospective versus Predictive
• Machine Learning
– Types of Algorithms
• Healthcare Analytics
Overview
• Apache Hadoop Ecosystem
– Big Data framework
– Distributed computation on commodity hardware
– Demo!
Road to Electronic Health Records
• 1920s – Modern record keeping begins
• 1960s – Dr. Larry Weed introduces problem-oriented medical records
• 1972 – Regenstrief Institute develops first EMR System
• 1980s-90s – Siloed adoption by departments & admin
• 1996 – HIPAA establishes national standards for electronic health records
• 2004 – President Bush calls for Computerized Health Records
2009: EHRs Go Mainstream
• HITECH Act signed into law by President Obama
– $25.9 billion to expand Health IT (HIT) adoption
• Meaningful Use (MU) program
– Incentive payments for using HIT to
• Improve quality, safety, efficiency of care
• Engage patients
• Increase care co-ordination
– Goal: MU compliance => better outcomes
EHR Adoption: Doubled Since 2008
Office-based Physician Electronic Health Record Adoption (2005-2015)
Source: Office of the National Coordinator for Health Information Technology. 'Office-based Physician Electronic Health Record
Adoption,' Health IT Quick-Stat #50. dashboard.healthit.gov/quickstats/pages/physician-ehr-adoption-trends.php. Dec 2016.
Health Data Today
• Electronic Health Records
• Genomic Data ($1000 genome)
• Medical Internet of Things (mIoT)
• Wearable devices
• Bottom Line: Data is growing
Big Data = 'Bigger Data' in Healthcare (article)
Data Analytics
• Businesses collect lots of data
– IBM: 90% of world’s data created in last 2 years
• How can we find hidden patterns in the data
and make information actionable?
Data Science!
Types of Analytics
• Retrospective Analytics
– Summarizing historical activity / performance
– Limited scope for making future plans
• Better than nothing
Types of Analytics
• Predictive Analytics
– Finding patterns (correlations) between historical
environment and results
– Apply to current environment to make predictions
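A minimal sketch of the difference between the two types of analytics, using made-up monthly admission counts. The data and the names `admissions` and `fit_line` are illustrative assumptions, not material from the talk. The retrospective step only summarizes history; the predictive step fits a straight-line trend to that history and applies it to the next month.

```python
from statistics import mean

# Hypothetical monthly admission counts (illustrative only)
admissions = [100, 110, 120, 130, 140, 150]

# Retrospective: summarize historical activity
average = mean(admissions)


def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept for x = 0, 1, 2, ..."""
    xs = range(len(ys))
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


# Predictive: learn the historical trend, then apply it to the next month
slope, intercept = fit_line(admissions)
next_month = slope * len(admissions) + intercept

print(f"Average so far: {average}")
print(f"Forecast for next month: {next_month}")
```

A straight line is the simplest possible model; it is used here only to show the "learn from history, apply to now" pattern.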
Predictive Analytics
“Once you have enough data, you start to see
patterns. You can then build a model of how
these data work. Once you build a model, you
can predict.”
Michael Wu
Chief Scientist, Lithium Technologies
Predictive Analytics
Machine Learning (ML)
“Field of study that gives computers the ability
to learn without being explicitly programmed”
Arthur Samuel
Artificial Intelligence Pioneer
Machine Learning Algorithms
• A probabilistic framework to create models
used for predictions
• Predictive models are developed iteratively
• Models are refined until they converge
– i.e. output gets close to a specific value
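As a rough illustration of "refined until they converge", the sketch below uses gradient descent to estimate a single number. The talk does not name a specific algorithm, so gradient descent, the function name, and the default parameters are all assumptions chosen for illustration. Each pass nudges the estimate, and the loop stops once successive estimates barely change.

```python
def fit_by_gradient_descent(values, learning_rate=0.1, tolerance=1e-9,
                            max_iterations=10_000):
    """Refine an estimate until successive updates stop changing it."""
    estimate = 0.0
    for iteration in range(1, max_iterations + 1):
        # Gradient of the mean squared error with respect to the estimate
        gradient = sum(2 * (estimate - v) for v in values) / len(values)
        new_estimate = estimate - learning_rate * gradient
        if abs(new_estimate - estimate) < tolerance:
            return new_estimate, iteration  # converged
        estimate = new_estimate
    return estimate, max_iterations


estimate, iterations = fit_by_gradient_descent([4.0, 6.0, 8.0])
print(f"Converged to {estimate:.6f} after {iterations} iterations")
```

The value that minimizes squared error for a list of numbers is their mean, so the estimate settles near 6.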
Types of ML Algorithms
• Unsupervised Learning
– Group objects by similar characteristics
– Given inputs (X), find label for each observation
• Supervised Learning
– Given inputs (X) and output (Y)
– Find function f that maps X to Y
– Given new inputs (Xnew), predict value/label (Ynew)
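The two families can be contrasted with a small sketch in plain Python. The blood pressure readings, the labels, and both function names are hypothetical; the algorithms (a two-cluster k-means and a one-nearest-neighbor lookup) are simply convenient examples of each family, not ones the talk prescribes.

```python
def two_means(values, iterations=10):
    """Unsupervised: split 1-D values into two groups without any labels.

    Assumes at least two distinct values.
    """
    low, high = min(values), max(values)  # initial cluster centers
    for _ in range(iterations):
        groups = {0: [], 1: []}
        for v in values:
            groups[0 if abs(v - low) <= abs(v - high) else 1].append(v)
        low = sum(groups[0]) / len(groups[0])
        high = sum(groups[1]) / len(groups[1])
    return [0 if abs(v - low) <= abs(v - high) else 1 for v in values]


def nearest_neighbor(train_x, train_y, x_new):
    """Supervised: predict Y for a new X from the closest labeled example."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x_new))
    return train_y[closest]


# Hypothetical systolic blood pressure readings (illustrative only)
readings = [110, 115, 120, 160, 165, 170]

# Unsupervised: only X is given, groups are discovered
print(two_means(readings))

# Supervised: X and Y are given, then a new X is labeled
labels = ["normal", "normal", "normal", "high", "high", "high"]
print(nearest_neighbor(readings, labels, 158))
```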
Types of Supervised Learning
• Regression
– Try to predict a value (continuous variable)
• Classification
– Try to predict a label (discrete variable)
Analytics in Healthcare
“Advanced analytics can be used to improve
medical outcomes, increase financial
performance, deepen relationships with
customers and patients, and drive new medical
innovations”
Jason Burke
Author of Health Analytics
Healthcare Challenges
• US Healthcare spending = $3.4 trillion / year
Healthcare Challenges
• US system wastes $750 billion annually
Source: Washington Post (Sept 2012). Retrieved from https://www.washingtonpost.com/news/wonk/wp/2012/09/07/we-spend-750-billion-on-unnecessary-health-care-two-charts-explain-why/
Healthcare Challenges
• Low quality
– To Err is Human Report:
• 44,000 to 98,000 deaths per year due to preventable medical errors
– Rates poorly when compared to other countries
• Last in 2014 Commonwealth Fund survey on:
– Quality of care
– Access to doctors
– Equity
Solution: Big Data!
• Use data analytics and machine learning to
improve outcomes & lower costs
Types of Healthcare Analytics
Good News
• Most of the analytical and software
capabilities needed to drive systemic changes
in healthcare are already available as:
– Commercial software
– Open Source solutions 🎉
• Hadoop ecosystem
Big Data
• Characteristics (4 V’s of Big Data)
– Volume
• Scale of data
– Variety
• Diversity of data (many sources)
– Velocity
• Speed of data
– Veracity
• Certainty of data
• 5th V: Value?
Types of Data
• Structured
– Highly organized information that fits neatly into a
relational database (columns and rows)
• Unstructured
– Has internal structure, but does not fit into a
traditional database (or spreadsheet)
– Most data is unstructured (>80%)
– Can use Extract-Transform-Load (ETL) Processing to
turn unstructured data into structured data
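A toy version of the ETL idea, assuming free-text notes that happen to follow one predictable pattern. The notes, the regular expression, and the field names are all hypothetical; real clinical text is far messier and usually needs dedicated natural language processing tools.

```python
import re

# Hypothetical free-text notes (illustrative only)
notes = [
    "Pt is a 64 y/o male. BP 150/95. Prescribed lisinopril.",
    "Pt is a 37 y/o female. BP 118/76. No medication changes.",
]

PATTERN = re.compile(
    r"(?P<age>\d+) y/o (?P<sex>male|female)\. "
    r"BP (?P<systolic>\d+)/(?P<diastolic>\d+)"
)


def extract(note):
    """Extract fields from text, then transform numeric fields to int."""
    match = PATTERN.search(note)
    if match is None:
        return None  # note does not follow the assumed pattern
    row = match.groupdict()
    for key in ("age", "systolic", "diastolic"):
        row[key] = int(row[key])
    return row


# Load: here the "structured store" is just a list of row dictionaries
table = [row for row in map(extract, notes) if row is not None]
for row in table:
    print(row)
```

Each resulting row has the same named columns, so it could be loaded into a relational table.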
Apache Hadoop
• Set of open source software technology components that
form a scalable system we can use to analyze Big Data
• Main features:
– Distributed storage and processing
• Data is too big for a single computer
– Runs on commodity hardware
– Fault tolerant
• Hardware failures are common and are handled automatically
– Runs in Java Virtual Machine (JVM) environment
Sample Hadoop Stack
Source: Soong, K. (Feb 2016). Big Data Specialization. Retrieved from http://ksoong.org/big-data
Core Hadoop Components
• Yet Another Resource Negotiator (YARN)
– “Operating System” for Hadoop
– Controls how resources are allocated to different
applications and execution engines across cluster
Core Hadoop Components
• Hadoop Distributed File System (HDFS)
– Highly scalable storage system
Data File
Core Hadoop Components
• Hadoop Distributed File System (HDFS)
– Too big to fit on single machine => Partition
[Diagram: data file partitioned into blocks A, B, C, and D]
Core Hadoop Components
• Hadoop Distributed File System (HDFS)
– Split across multiple machines
– Data is protected against hardware failure
[Diagram: blocks A, B, C, and D replicated across Server 1, Server 2, Server 3, and Server 4]
Core Hadoop Components
• Hadoop Distributed File System (HDFS)
– Server goes down, we can still reconstruct data
[Diagram: blocks A, B, C, and D replicated across Server 1, Server 2, Server 3, and Server 4; one server is down 🔥]
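The replication idea can be simulated in a few lines. This is a simplified sketch: the round-robin placement and the function names are assumptions made for illustration, and real HDFS uses a rack-aware placement policy that is not modeled here. The replication factor of 3 matches the HDFS default.

```python
from itertools import cycle


def place_blocks(blocks, servers, replication=3):
    """Assign each block to `replication` distinct servers, round-robin.

    Assumes replication <= len(servers).
    """
    placement = {server: set() for server in servers}
    ring = cycle(servers)
    for block in blocks:
        for _ in range(replication):
            placement[next(ring)].add(block)
    return placement


def surviving_blocks(placement, failed_server):
    """Return the blocks still readable after one server is lost."""
    alive = set()
    for server, blocks in placement.items():
        if server != failed_server:
            alive |= blocks
    return alive


servers = ["Server 1", "Server 2", "Server 3", "Server 4"]
placement = place_blocks("ABCD", servers)

for failed in servers:
    recovered = surviving_blocks(placement, failed)
    print(f"{failed} down: blocks still available = {sorted(recovered)}")
```

Because every block lives on three of the four servers, losing any single server still leaves at least two copies of each block.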
Core Hadoop Components
• Execution Engine
– Used when running analytic applications
– Distributed data allows us to perform parallel
computations
– MapReduce execution engine comes bundled with the
Hadoop core distribution
– Can plug in different components
• Tez, Storm, Spark, etc.
MapReduce Overview
[Diagram: MapReduce data flow, reading input from HDFS and writing output back to HDFS]
Source: Eckroth, J. (n.d.). MapReduce. Retrieved from http://cinf401.artifice.cc/notes/mapreduce.html
MapReduce Example
Source: Zhang, X. (Jul 2013). A Simple Example to Demonstrate how does the
MapReduce work. Retrieved from http://xiaochongzhang.me/blog/?p=338
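The map, shuffle, and reduce phases can be imitated on a single machine with a word count, the customary introductory example. This is a plain Python sketch, not the Hadoop API; the input lines and function names are assumptions for illustration. In a real cluster, each phase would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input split into three lines
lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(counts)
```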
MapReduce Limitations
• Lots of reads and writes
– I/O becomes the bottleneck when performing analysis
• Machine Learning algorithms are iterative
– Many read/write cycles before convergence
– Slow runtime
• There must be a better way!
Apache Tez
• Optimizes workflow to limit number of writes
• Less I/O => faster execution
Apache Storm
• Execution engine for real-time streaming
applications
• Data is analyzed as it is generated BEFORE it is
stored
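The "analyze before storing" idea can be sketched with a Python generator that reacts to each reading as it arrives. This is not the Storm API; the sensor stream, window size, and threshold are hypothetical values chosen for illustration.

```python
from collections import deque


def heart_rate_alerts(readings, window=3, threshold=120):
    """Yield an alert as soon as the rolling average exceeds the threshold."""
    recent = deque(maxlen=window)  # keeps only the latest `window` values
    for timestamp, bpm in readings:
        recent.append(bpm)
        average = sum(recent) / len(recent)
        if len(recent) == window and average > threshold:
            yield timestamp, average


def sensor_stream():
    """Hypothetical bedside monitor emitting (second, beats per minute)."""
    yield from enumerate([80, 85, 118, 130, 135, 90, 82])


alerts = list(heart_rate_alerts(sensor_stream()))
print(alerts)
```

Nothing is written to storage here: each reading is inspected once, in arrival order, and only a small rolling window is kept in memory.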
Apache Spark
• In-memory computational engine
• Read in data once, subsequent calculations
are done in-memory
Logistic Regression Runtime
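The "read once, compute in memory" idea can be illustrated by counting storage reads during an iterative calculation. This is a toy model, not Spark or MapReduce code: `CountingStorage` and both functions are hypothetical, and it counts reads only, whereas real MapReduce jobs also write intermediate results between iterations.

```python
class CountingStorage:
    """Stand-in for disk-backed storage that counts how often it is read."""

    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._data)


def iterate_from_storage(storage, iterations):
    """MapReduce style: every iteration goes back to storage for its input."""
    estimate = 0.0
    for _ in range(iterations):
        data = storage.load()
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate


def iterate_in_memory(storage, iterations):
    """Spark style: load once, then reuse the in-memory copy."""
    data = storage.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate


disk = CountingStorage([2.0, 4.0, 6.0])
memory = CountingStorage([2.0, 4.0, 6.0])
result_disk = iterate_from_storage(disk, iterations=20)
result_memory = iterate_in_memory(memory, iterations=20)
print(f"Storage reads: {disk.reads} versus {memory.reads}")
```

Both versions compute the same answer; the only difference is how many times the data is fetched from storage.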
Other Apache Projects
• Apache Hive
– SQL interface to data stored in HDFS
– Analysts with SQL experience can use Hadoop
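To give a feel for the kind of query an analyst might write, the sketch below runs a simple aggregate. The standard-library `sqlite3` module stands in for Hive here, so this shows the SQL style only, not Hive's execution over HDFS. The table name, column names, and rows are hypothetical.

```python
import sqlite3

# sqlite3 is a stand-in for Hive; table and columns are hypothetical
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE discharges (hospital TEXT, total_charges REAL)"
)
connection.executemany(
    "INSERT INTO discharges VALUES (?, ?)",
    [("A", 1000.0), ("A", 3000.0), ("B", 500.0)],
)

rows = connection.execute(
    """
    SELECT hospital, COUNT(*) AS visits, AVG(total_charges) AS avg_charges
    FROM discharges
    GROUP BY hospital
    ORDER BY hospital
    """
).fetchall()
connection.close()

for hospital, visits, avg_charges in rows:
    print(hospital, visits, avg_charges)
```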
Other Apache Projects
• Databases
– Apache HBase
– Apache Cassandra
Other Apache Projects
• Apache Kafka
– Messaging system for streaming data
Optimal Hadoop Workflow
• Depends on what you are trying to do
• Data Lake (HDFS)
– Storage repository that holds data in raw format
– Read into Spark to perform analysis
• Use Data Science and Machine Learning algorithms
• Demo will walk through this workflow
Using The Hadoop Ecosystem to Drive Healthcare Innovation
Dataset
• Texas Department of State Health Services
– Released State Inpatient / Outpatient data (link)
• Inpatient (IP) - 1999 to 2010
• Outpatient (OP) – Q4 2009 to 2010
– Data is de-identified and made available for free
– Tab-delimited text files (for each quarter)
• IP data – 450MB base table, 500MB charges
• OP data – 750MB base table, 700MB charges
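Tab-delimited files like these can be read with Python's standard `csv` module. The column names and values below are hypothetical placeholders, not the actual layout of the Texas files, and an in-memory string stands in for a file on disk.

```python
import csv
import io

# Hypothetical two-row excerpt; column names are placeholders
sample = (
    "RECORD_ID\tDISCHARGE\tTOTAL_CHARGES\n"
    "1\t2010Q1\t1500.00\n"
    "2\t2010Q1\t320.50\n"
)

# delimiter="\t" tells the reader that fields are separated by tabs
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = [dict(row, TOTAL_CHARGES=float(row["TOTAL_CHARGES"])) for row in reader]

total = sum(row["TOTAL_CHARGES"] for row in rows)
print(f"{len(rows)} records, total charges {total}")
```

For a real file, `io.StringIO(sample)` would be replaced by an opened file handle.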
Spark Background
• Java, Scala, Python, and R APIs (docs)
• Built around the concept of Resilient
Distributed Datasets (RDDs)
– Can perform MapReduce on RDD
OR
– Use the Spark DataFrame abstraction
*Recommended*
Spark DataFrame
• Distributed collection of rows and named
columns
– Think relational database or spreadsheet
– Akin to pandas DataFrame or R data.frame
# Displays the content of the DataFrame
df.show()
#
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
Questions?
• Slides and code available at
https://github.com/alysivji/talks



Editor's Notes

  • #3: Before we get to what we’re talking about. I’ll talk about me.
  • #4: Data has been making a huge difference in other industries. Chase uses machine learning algorithms to flag purchases that could be fraudulent. Last time this happened, I booked my flight using my American Airlines card and booked my hotel and conference on my United card. Chase didn’t know about the flight, so it asked for my confirmation. This saves them from having to pay for fraudulent purchases. Amazon uses data mining to find products purchased together and makes suggestions to increase revenue. Spark was created in Scala, and most people who learn Scala do so in order to use Spark in its native language. Amazon doesn’t know this, but it can use data to figure it out. Netflix’s recommendation system finds users who are similar to you and uses their ratings to predict media you will want to watch.
  • #5: Medical fraud detection could be more robust, and similar algorithms could find unnecessary procedures (like purchases that do not match my profile). Data mining could suggest a medication that is always prescribed together with another when an order is missing it. A recommendation system could find similar patients: group them by the treatment prescribed, rate their outcomes, and use that information to suggest an optimal course of action. Why is this not widespread in healthcare?
  • #6: People who work in healthcare know, healthcare is different. We won’t really go into too many details why, but you can find out more at the links provided. I will spend some time discussing how healthcare has changed and made it easier to facilitate a data revolution
  • #7: What do we mean by data revolution? Data is ubiquitous... We’ll explore data science in some depth to understand the basic principles of the field and get a grasp on how we can make our information actionable Bee is Buzzword Bee! I’ll try to include him every time I use a buzzword
  • #8: Next we’ll talk about how we can use the Hadoop ecosystem to analyze healthcare data
  • #9: Is paved with good intentions ;)
    1920s [1]: Healthcare professionals realized that documenting patient care benefited both providers and patients. Patient records established the details, complications and outcomes of patient care. Once healthcare providers realized that they were better able to treat patients with complete and accurate medical history, documentation became wildly popular. Health records were soon recognized as being critical to the safety and quality of the patient experience.
    1960s [2]: Charting as we currently know it. First, a patient database is collected; then that information is used to start the diagnosis process. The database is very thorough and contains family history, prior encounter information, lab results, and current health status.
    1972 [1, 2]: There were quite a few electronic record system pilots (through universities and large healthcare facilities); this was the first major system developed. It did not attract many physicians.
    1980s-90s [1, 2]: Computers made their way into hospitals, as they did in every other professional environment, but systems did not speak to each other.
    1996: HIPAA was passed and national standards for electronic health records were established.
    2004 [1, 3]: In his 2004 State of the Union, President George W. Bush called for computerized health records and established the Office of the National Coordinator for Health Information Technology. It coordinates nationwide efforts to implement health IT and the electronic exchange of health information.
    References
    [1] http://www.rasmussen.edu/degrees/health-sciences/blog/health-information-management-history/
    [2] http://www.nethealth.com/a-history-of-electronic-medical-records-infographic/
    [3] https://en.wikipedia.org/wiki/Office_of_the_National_Coordinator_for_Health_Information_Technology
  • #10: Meaningful Use provided incentive payments to healthcare providers who could demonstrate they used health information technology in a ‘meaningful way’ to improve quality, engage patients, and increase care coordination. The goal is that MU compliance will result in better clinical outcomes, improved population health outcomes, increased transparency and efficiency, and empowered individuals.
    https://en.wikipedia.org/wiki/Health_Information_Technology_for_Economic_and_Clinical_Health_Act
    https://www.healthit.gov/providers-professionals/meaningful-use-definition-objectives
  • #11: Did it work? Well… it did increase EHR adoption
  • #12: * EHR systems have a wealth of data and are collecting more each day * Genomic sequencing costs less than $1,000, and I’ve heard about a race to $100 as well * Medical sensors are collecting information at a dizzying pace. One big application is patient sensors in post-acute care environments, where patients are hooked up to machines collecting real-time data * People are more concerned about their health than ever before and the consumer wearable industry is growing.
  • #13: But we’re getting ahead of ourselves. I need to introduce the topic of data analytics. References [1] https://datascience.berkeley.edu/about/what-is-data-science/
  • #14: References [1] http://blog.datagravity.com/the-transition-to-predictive-analytics/
  • #15: References [1] http://blog.datagravity.com/the-transition-to-predictive-analytics/
  • #16: References [1] http://www.informationweek.com/big-data/big-data-analytics/big-data-analytics-descriptive-vs-predictive-vs-prescriptive/d/d-id/1113279
  • #17: References [1] https://marketoonist.com/2016/12/predictive-analytics.html
  • #18: This leads nicely into the topic of Machine Learning References http://www.ibmbigdatahub.com/blog/how-does-machine-learning-work?cm_mmc=OSocial_Twitter-_-IBM+Analytics_Inbound+Marketing-_-WW_WW-_-B+Yelland+3-20-2017&cm_mmca1=000000VQ&cm_mmca2=10000779&
  • #20: References [1] http://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/ [2] http://www.ibmbigdatahub.com/blog/how-does-machine-learning-work
  • #22: Why is this relevant to us in healthcare?
  • #23: Analytics is suited to the specific challenges in healthcare. References [1] http://www.pbs.org/newshour/rundown/new-peak-us-health-care-spending-10345-per-person/ [2] http://www.pgpf.org/chart-archive/0006_health-care-oecd
  • #25: References [1] https://en.wikipedia.org/wiki/To_Err_is_Human_(report) [2] http://time.com/2888403/u-s-health-care-ranked-worst-in-the-developed-world/
  • #27: Healthcare analytics is broad, as we can see from this diagram. There are lots of areas where a little bit of deliberate data science and machine learning can make a difference.
  • #28: Worth noting that most of the analytical capabilities needed to drive systemic changes in healthcare are already available in commercial software
  • #29: So let’s start talking about Big Data. What is big data? Volume: in healthcare, there is a lot of data; each genome is around 200GB of raw data. Variety: lots of different information, including clinical notes, lab information, demographic data, result data and patient-generated data. Velocity: real-time sensors monitoring patients. Veracity: how sure are we that the data we get is correct? References [1] http://www.ibmbigdatahub.com/infographic/extracting-business-value-4-vs-big-data
  • #30: References [1] https://www.trifacta.com/blog/structured-unstructured-data/ [2] http://sherpasoftware.com/blog/structured-and-unstructured-data-what-is-it/
  • #31: How can we deal with all this data? Hadoop Ecosystem!
  • #33: References [1] http://www.littlebeelibrary.com/pdfs/Apache_Hadoop.pdf
  • #38: Execution engine is used to perform calculations on the underlying data
  • #39: The MapReduce engine runs the map step on all nodes in the cluster to produce a set of intermediate output files. It then sorts these intermediate files and then runs a reduce step to take the sorted intermediate files and aggregate the data to get a final result. This process is scalable but relatively slow because of the need to write lots of intermediate files to disk and then read them again.
  • #44: The key takeaway from this presentation: Use Spark to do all calculations
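The map, sort and reduce steps described in the note for slide #39 can be sketched in plain Python. This is only an illustration under stated assumptions, not how Hadoop itself is implemented: the patient records, the two in-memory "partitions" standing in for cluster nodes, and the function names map_step, shuffle_and_sort and reduce_step are all hypothetical. Everything here stays in memory, whereas the real engine writes the intermediate output to disk between steps.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (patient_id, diagnosis_code) pairs,
# split across two partitions that stand in for cluster nodes.
partitions = [
    [("p1", "E11"), ("p2", "I10"), ("p3", "E11")],
    [("p4", "I10"), ("p5", "E11"), ("p6", "J45")],
]


def map_step(records):
    """Emit an intermediate (key, 1) pair for every record in one partition."""
    return [(code, 1) for _, code in records]


def shuffle_and_sort(intermediate_outputs):
    """Merge the per-partition outputs and sort them by key."""
    merged = [pair for output in intermediate_outputs for pair in output]
    return sorted(merged, key=itemgetter(0))


def reduce_step(sorted_pairs):
    """Aggregate the sorted pairs into one total per key."""
    return {
        key: sum(count for _, count in group)
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    }


# Map runs once per partition; the results are then sorted and reduced.
intermediate = [map_step(partition) for partition in partitions]
counts = reduce_step(shuffle_and_sort(intermediate))
print(counts)
```

The sort matters because groupby only groups adjacent items with the same key; that mirrors why the engine sorts the intermediate files before the reduce step runs.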