Using The Hadoop Ecosystem to Drive Healthcare Innovation

Using the Hadoop Ecosystem to
Drive Healthcare Innovation
Aly Sivji
April 25, 2017

About Me
• Aly Sivji
– Twitter: @CaiusSivjus
– Blog: http://alysivji.github.io
• Senior Analyst @ IBM Watson Health
– Value-Based Care: Planning Solutions
• Grad Student @ Northwestern University
– Medical Informatics
• Interests:
– Technology 🐍
– Data 📈
– Star Trek 🖖🖖

Overview
• Big Data drives most industries

Overview
• What about Healthcare?
– Machine Learning
• Fraud detection ($65+ billion lost every year)
– Wired Article
– dataiku - Detecting Medicare Fraud
• Preventing unnecessary procedures
– Data Mining
• Identifying medication prescribed together
– Recommender Systems
• Finding similar patients

Overview
Healthcare is Different.
People who work in healthcare
Additional Reading
• John Halamka (The Health Care Blog)
• Health Catalyst

Overview
• Data Analytics / Data Science
– Retrospective versus Predictive
• Machine Learning
– Types of Algorithms
• Healthcare Analytics

Overview
• Apache Hadoop Ecosystem
– Big Data framework
– Distributed computation on commodity hardware
– Demo!

Road to Electronic Health Records
1920s –
Modern
record
keeping
begins
1960s – Dr.
Larry Weed
introduces
problem-
oriented
medical
records
1972 –
Regenstrief
Institute
develops
first EMR
System
1980s-90s –
Siloed adoption
by departments
& admin
1996 –
HIPAA
establishes
national
standards
for
electronic
health
records
2004 –
President Bush
calls for
Computerized
Health Records

2009: EHRs Go Mainstream
• HITECH Act passed by President Obama
– $25.9 billion to expand Health IT (HIT) adoption
• Meaningful Use (MU) program
– Incentive payments for using HIT to
• Improve quality, safety, efficiency of care
• Engage patients
• Increase care co-ordination
– Goal: MU compliance => better outcomes

EHR Adoption: Doubled Since 2008
Office-based Physician Electronic Health Record Adoption (2005-2015)
Source: Office of the National Coordinator for Health Information Technology. 'Office-based Physician Electronic Health Record
Adoption,' Health IT Quick-Stat #50. dashboard.healthit.gov/quickstats/pages/physician-ehr-adoption-trends.php. Dec 2016.

Health Data Today
• Electronic Health Records
• Genomic Data ($1000 genome)
• Medical Internet of Things (mIoT)
• Wearable devices
• Bottom Line: Data is growing
Big Data = 'Bigger Data' in Healthcare (article)

Data Analytics
• Businesses collect lots of data
– IBM: 90% of world’s data created in last 2 years
• How can we find hidden patterns in the data
and make information actionable?
Data Science!

Types of Analytics
• Retrospective Analytics
– Summarizing historical activity / performance
– Limited scope for making future plans
• Better than nothing

Types of Analytics
• Predictive Analytics
– Finding patterns (correlations) between historical
environment and results
– Apply to current environment to make predictions

Predictive Analytics
"Once you have enough data, you start to see
patterns. You can then build a model of how
these data work. Once you build a model, you
can predict.”
Michael Wu
Chief Scientist, Lithium Technologies

Machine Learning (ML)
“Field of study that gives computers the ability
to learn without being explicitly programmed”
Arthur Samuel
Artificial Intelligence Pioneer

Machine Learning Algorithms
• A probabilistic framework to create models
used for predictions
• Predictive models are developed iteratively
• Models are refined until they converge
– i.e. output gets close to a specific value

Types of ML Algorithms
• Unsupervised Learning
– Group objects by similar characteristics
– Given inputs (X), find label for each observation
• Supervised Learning
– Given inputs (X) and output (Y)
– Find function f that maps X to Y
– Given new inputs (Xnew), predict value/label (Ynew)

Types of Supervised Learning
• Regression
– Try to predict a value (continuous variable)
• Classification
– Try to predict a label (discrete variable)

Analytics in Healthcare
“Advanced analytics can be used to improve
medical outcomes, increase financial
performance, deepen relationships with
customers and patients, and drive new medical
innovations”
Jason Burke
Author of Health Analytics

Healthcare Challenges
• US Healthcare spending = $3.4 trillion / year

• US system wastes $750 billion annually
Source: Washington Post (Sept 2012). Retrieved from https://www.washingtonpost.com/news/wonk/wp/2012/09/07/we-spend-
750-billion-on-unnecessary-health-care-two-charts-explain-why/

• Low quality
– To Err is Human Report:
• 44,000 - 98,000 deaths to preventable medical errors
– Rates poorly when compared to other countries
• Last in 2014 Commonwealth Fund survey on:
– Quality of care
– Access to doctors
– Equity

Solution: Big Data!
• Use data analytics and machine learning to
improve outcomes & lower costs

Good News
• Most of the analytical and software
capabilities needed to drive systemic changes
in healthcare are already available as:
– Commercial software
– Open Source solutions 🎉
• Hadoop ecosystem

Big Data
• Characteristics (4 V’s of Big Data)
– Volume
• Scale of data
– Variety
• Diversity of data (many sources)
– Velocity
• Speed of data
– Veracity
• Certainty of data
• 5th V: Value?

Types of Data
• Structured
– Highly organized information that fits neatly into a
relational database (columns and rows)
• Unstructured
– Has internal structure, but does not fit into a
traditional database (or spreadsheet)
– Most data is unstructured (>80%)
– Can use Extract-Transform-Load (ETL) Processing to
turn unstructured data into structured data

Apache Hadoop
• Set of open source software technology components that
form a scalable system we can use to analyze Big Data
• Main features:
– Distributed storage and processing
• Data is too big for a single computer
– Runs on commodity hardware
– Fault tolerant
• Hardware failures are common and handle automatically
– Runs in Java Virtual Machine (JVM) environment

Sample Hadoop Stack
Source: Soong, K. (Feb 2016). Big Data Specialization. Retrieved from http://ksoong.org/big-data

Core Hadoop Components
• Yet Another Resource Negotiator (YARN)
– “Operating System” for Hadoop
– Controls how resources are allocated to different
applications and execution engines across cluster

• Hadoop Distributed File System (HDFS)
– Highly scalable storage system
Data File

– Too big to fit on single machine => Partition
A B
C D

– Split across multiple machines
– Data is protected against hardware failure
A B
C
A
D
A
C D
B
C D
Server 1 Server 2 Server 3 Server 4

– Server goes down, we can still reconstruct data
A B
C
A
D
A
C D
B
C D
Server 1 Server 2 Server 3 Server 4
🔥

• Execution Engine
– Used when running analytic applications
– Distributed data allows us to perform parallel
computations
– MapReduce execution engine comes bundled with the
Hadoop core distribution
– Can plug-in different components
• Tez, Storm, Spark, etc

MapReduce Overview
Source: Eckroth, J. (n.d.). MapReduce. Retrieved from http://cinf401.artifice.cc/notes/mapreduce.html
HDFS
HDFS

MapReduce Example
Source: Zhang, X. (Jul 2013). A Simple Example to Demonstrate how does the
MapReduce work. Retrieved from http://xiaochongzhang.me/blog/?p=338

MapReduce Limitations
• Lot of read/writes
– I/O becomes bottleneck when performing analysis
• Machine Learning algorithms are iterative
– Many reads and writes cycles before convergence
– Slow runtime
• There must be a better way!

Apache Tez
• Optimizes workflow to limit number of writes
• Less I/O => faster execution

Apache Storm
• Execution engine for real-time streaming
applications
• Data is analyzed as it is generated BEFORE it is
stored

Apache Spark
• In-memory computational engine
• Read in data once, subsequent calculations
are done in-memory
Logistic Regression Runtime

Other Apache Projects
• Apache Hive
– SQL interface to data stored in HDFS
– Analysts with SQL experience can use Hadoop

• Databases
– Apache HBase
– Apache Cassandra

• Apache Kafka
– Messaging system for streaming data

Optimal Hadoop Workflow
• Depends on what you are trying to do
• Data Lake (HDFS)
– Storage repository that holds data in raw format
– Read into Spark to perform analysis
• Use Data Science and Machine Learning algorithms
• Demo will walkthrough this workflow

Using The Hadoop Ecosystem to Drive Healthcare Innovation

Dataset
• Texas Department of State Health Services
– Released State Inpatient / Outpatient data (link)
• Inpatient (IP) - 1999 to 2010
• Outpatient (OP) – Q42009 to 2010
– Data is de-identified and made available for free
– Tab-delimited text files (for each quarter)
• IP data – 450MB base table, 500MB charges
• OP data – 750MB base table, 700MB charges

Spark Background
• Java, Scala, Python, and R APIs (docs)
• Built around the concept of Resilient
Distributed Datasets (RDDs)
– Can perform MapReduce on RDD
OR
– Use the Spark DataFrame abstraction
*Recommended*

Spark DataFrame
• Distributed collection of rows and named
columns
– Think relational database or spreadsheet
– Akin to pandas DataFrame or R data.frame
# Displays the content of the DataFrame
df.show()
#
# +----+-------+
# | age| name|
# +----+-------+
# |null|Michael|
# | 30| Andy|
# | 19| Justin|
# +----+-------+

Questions?
• Slides and code available at
https://github.com/alysivji/talks

Using The Hadoop Ecosystem to Drive Healthcare Innovation

More Related Content

What's hot (20)

Similar to Using The Hadoop Ecosystem to Drive Healthcare Innovation (20)

More from Dan Wellisch (19)

Recently uploaded (20)

Using The Hadoop Ecosystem to Drive Healthcare Innovation

Editor's Notes