Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with MapR Database

Data Pipeline using: Spark, MapR Event Store
for Kafka, and MapR Database

2 © 2018 MapR Technologies, Inc
•  Overview of Kafka API
•  Use Spark Structured Streaming to
continously:
•  Read from Kafka topic
•  Transfom
•  Write to MapR document
database
•  Analyze using Spark SQL

Agenda
2

Use Case: Open Payment Dataset
•  Payments Drug and Device companies make to
•  Physicians and Teaching Hospitals for
•  Travel, Research, Gifts, Speaking fees, and Meals
•  https://www.cms.gov/openpayments/

•  Medicare payment data for most common inpatient diagnoses
•  Data shows dramatically different charges and payments
•  https://www.cms.gov/research-statistics-data-and-systems/statistics-trends-and-
reports/medicare-provider-charge-data/inpatient.html
InPatient Payment Dataset (IPPS)
Provider ID Provider
State
DRG Total
Discharges
Average
Charges
Average
Total
payments
Avg
Medicare
Payments
1261770 CA
Heart
Transplant
13 1,172,866 251,876 244,457

What is a Stream ?
•  A stream is a continuous sequence of events
•  Events are key-value pairs

What is Streaming Data?
Fraud detection Smart Machinery Smart Meters Home Automation
Networks Manufacturing Security Systems Patient Monitoring

Monitoring devices combined with ML can provide alerts for Sepsis,
which is one of the leading causes for death in hospitals
– http://www.computerweekly.com/news/450422258/Putting-sepsis-
algorithms-into-electronic-patient-records
Examples of Streaming Data

A Stanford team has shown that a machine-learning model can identify arrhythmias
from an EKG better than an expert
•  https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-
to-play-doctor/
Example of Streaming Data combined with Machine Learning

the ECG app on the Apple Watch can check heart rhythms and send a notification if an
irregular heart rhythm is identified.
•  https://www.apple.com/newsroom/2018/12/ecg-app-and-irregular-heart-
rhythm-notification-available-today-on-apple-watch/
Example of Streaming Data combined with Machine Learning

https://mapr.com/blog/ml-iot-connected-medical-devices/
Applying Machine Learning to Live Patient Data

Intro to the Kafka API

Topics:
Logical collection of events
Organize Events into Categories
Organize Data into Topics with the MapR Event Store for Kafka

Topics:
Logical collection of events
Organize Events into Categories
Organize Data into Topics with the MapR Event Store for Kafka
Consumers
MapR Cluster
Topic: Pressure
Topic: Temperature
Topic: Warnings
Consumers
Consumers
Kafka API Kafka API

Topics are partitioned for throughput and
scalability

Scalable Messaging with MapR Event Streams
Server 1
Partition1: Topic - Pressure
Partition1: Topic - Temperature
Partition1: Topic - Warning
Server 2
Server 3

New Messages are
Added to the end
Partition is like an Event Log
New
Message
6 5 4 3 2 1
Old
Message

Messages are delivered in the order they are received
Partition is like a Queue

Events remain on the partition, available to other consumers

Unlike a queue, Events are not deleted when Read

Messages can be persisted forever
Or
Older messages can be deleted automatically based on time to live
Not deleting messages, minimizes disk read/writes

When Are Messages Deleted?
MapR Cluster
6 5 4 3 2 1Partition
1
Older
message

High Performance at Scale:
•  Partitioning for Parallel operations
•  Not deleting messages, minimizes disk read/writes

Processing Same Message for Different Purposes

Georgia Health Connect
ALLOY Health:
Exchange State HIE
Clinical Data Viewer
Reporting and Analytics
Clinical Data
Financial Data
Provider
Organizations
What are the outcomes in the entire
state on diabetes?
Are there doctors that are doing this
better than others?
Georgia Health Connect

Use Case: Streaming System of Record for Healthcare

Spark Distributed Datasets
partitioned
•  Distributed collection of objects
Dataset[T]
•  Partitioned across a cluster
•  Operated on in parallel
•  in memory can be Cached

DataFrame = Dataset[Row] can use Spark SQL

DataFrame is like a table
Dataset[Row]
row
columns

Dataset[Typed Object] can use Spark SQL and Functions
A Dataset is a distributed collection of objects
Dataset[objects]
object
columns
partitioned

Process the Data with Spark Structured Streaming

Datasets Read from Stream
Task
Cache
Process
& Cache
Data
offsets
Stream
partition
Task
Cache
Process
& Cache
Data
Task
Cache
Process
& Cache
Data
Driver
Stream
partition
Stream
partition
Data is cached for
aggregations
And windowed
functions

new data in the
data stream
=
new rows appended
to an unbounded table
Data stream as an unbounded table

Treat Stream as Unbounded Tables

The Stream is continuously processed

Spark automatically streamifies SQL plans
Image reference Databricks

Stream Processing

Use Case: Payment Data
"NEW","Covered Recipient Physician",,,,"132655","GREGG","D","ALZATE",,"8745
AERO DRIVE","STE 200","SAN DIEGO","CA","92123","United States",,,"Medical
Doctor","Allopathic & Osteopathic Physicians|Radiology|Diagnostic
Radiology","CA",,,,,"DFINE, Inc","100000000326","DFINE, Inc","CA","United States",
90.87,"02/12/2016","1","In-kind items and services","Food and Beverage",,,,"No","No
Third Party
Payment",,,,,"No","346039438","No","Yes","Covered","Device","Radiology","StabiliT",
,"Covered","Device","Radiology","STAR Tumor Ablation
System",,,,,,,,,,,,,,,,,"2016","06/30/2017"
{
"_id":"317150_08/26/2016_346122858",
"physician_id":"317150",
"date_payment":"08/26/2016",
"record_id":"346122858",
"payer":"Mission Pharmacal Company",
"amount":9.23,
"Physician_Specialty":"Obstetrics & Gynecology",
"Nature_of_payment":"Food and Beverage"
}

Scenario: Payment Data
Provider ID Date Payer Payer
State
Provider
Specialty
Provider
State
Amount Payment
Nature
1261770 01/11/2016
Southern
Anesthesia &
Surgical, Inc
CO
Oral and
Maxillofacial
Surgery
CA 117.5 Food and Beverage

val df1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "maprdemo:9092")
.option("subscribe", "/apps/stream:payments”)
.option("startingOffsets", "earliest")
.option("failOnDataLoss", false)
.option("maxOffsetsPerTrigger", 1000)
.load()

Streaming pipeline Kafka Data source

case class Payment(_id: String, physician_id: String,
date_payment: String, payer: String, payer_state: String,
amount: Double, physician_type: String,physician_specialty: String,
physician_state: String, nature_of_payment: String)
def parsePayment(str: String): Payment = {
val td = str.split(",(?=([^"]*"[^"]*")*[^"]*$)")
val physician_id =td(5)
val amount = td(30).toDouble
. . .
Payment(id, physician_id, date_payment, payer, payer_state, amount,
physician_type, focus, physician_state, nature_of_payment)
}

Function to Parse CSV data to Payment Object

//register a user-defined function (UDF) to deserialize the message
spark.udf.register("deserialize",
(message: String) => parsePayment(message))
//use the UDF in a select expression
val df2 = df1.selectExpr("""deserialize(CAST(value as STRING)) AS
message""").select($"message".as[Payment])

Parse message txt to Payment Object

Writing to a Memory Sink
Write results to Memory
Start running the query
val query = cdf
.writeStream
.queryName("uber")
.format("memory")
.outputMode("append”)

query.start().awaitTermination()

%sql select * from payments limit 3:

%sql select physician_specialty, count(*) as cnt from payment2
group by physician_specialty order by cnt desc

MapR-DB Connector for Apache Spark
Spark Streaming writing to MapR-DB JSON

Spark MapR-DB Connector

Designed for Partitioning and Scaling

MapR-DB JSON Document Store
Data is automatically partitioned
and sorted by _id row key!
{
"_id":”AK_08/26/2016_346122858",
"physician_id":"317150",
"date_payment":"08/26/2016",
"payer":"Mission Pharmacal Company",
"payer_state":”CO",
"amount":9.23,
”physician_specialty":”Gynecology",
“physician_state":”AK"
”nature_of_payment":"Food and Beverage"
}

MapR Database shell
maprdb mapr:> find /user/mapr/paytable --limit 2
{"_id":"AK_01/03/2016_1309545_391702408", "amount":94.21,
"date_payment":"01/03/2016", "nature_of_payment":"Food and Beverage",
"payer":"Stryker Corporation", "payer_state":"MI","physician_id":"1309545",
"physician_specialty":"Specialist", "physician_state":"AK",
"physician_type":"Medical Doctor"}
{"_id":"AK_01/04/2016_100801_396460394","amount":11.59,
"date_payment":"01/04/2016", "nature_of_payment":"Food and Beverage",
"payer":"AbbVie, Inc.", "payer_state":"IL", "physician_id":"100801",
"physician_specialty":"Family Medicine", "physician_state":"AK",
"physician_type":"Medical Doctor"}
2 document(s) found.

Writing to a MapR-DB Sink
Write Streaming DataFrame
Query Results to MapR-DB

Start running the query
val query = cdf.writeStream
.format(MapRDBSourceConfig.Format)
.option(MapRDBSourceConfig.TablePathOption, tableName)
.option(MapRDBSourceConfig.IdFieldPathOption, "_id")
.option(MapRDBSourceConfig.CreateTableOption, false)
.option("checkpointLocation", "/user/mapr/check")
.option(MapRDBSourceConfig.BulkModeOption, true)
.option(MapRDBSourceConfig.SampleSizeOption, 1000)

query.start().awaitTermination()

Streaming Dataset Read from Topic partitions, Transformed,
Written to MapR Database partitions

Store in MapR Database

Explore the Data With Spark SQL

•  Spark SQL queries and updates to MapR-DB
•  With projection and filter pushdown, custom partitioning, and data locality

Spark SQL Querying MapR-DB JSON

val pdf: Dataset[Payment] = spark
.loadFromMapRDB[Payment](tableName, schema)
.as[Payment]

Spark Distributed Datasets read from MapR-DB Partitions
Worker
Task
Worker
Driver
Cache 1
Cache 2
Cache 3
Process
& Cache
Data
Process
& Cache
Data
Process
& Cache
Data
Task
Task
Driver
tasks
tasks
tasks

Data
Frame
Load data
pdf.createOrReplaceTempView(”payment")
pdf.select("_id", "payer", "amount”).show

Load the data into a Dataframe

•  Which physician specialties, or physician states, are getting the most payments ?
•  Count , total amount $, average amount $, standard deviation
•  Which companies, or company states, are making the most payments ?
•  Count , total amount $, average amount $, standard deviation
•  What are the Top Nature of Payments by count, amount ?
•  What are the payments over $2,000,000 ?
Now We Can Ask Questions Like:
Physician ID Date Payer Payer
State
Physician
Specialty
Physician
State
Amount Payment
Nature
1261770 01/11/2016
Southern
Anesthesia &
Surgical, Inc
CO
Oral and
Maxillofacial
Surgery
CA 117.5 Food and Beverage

Language Integrated Queries
L I Query Description
agg(expr, exprs) Aggregates on entire DataFrame
distinct Returns new DataFrame with unique rows
except(other) Returns new DataFrame with rows from this DataFrame not in other
DataFrame
filter(expr);
where(condition)
Filter based on the SQL expression or condition
groupBy(cols:
Columns)
Groups DataFrame using specified columns
join (DataFrame,
joinExpr)
Joins with another DataFrame using given join expression
sort(sortcol) Returns new DataFrame sorted by specified column
select(col) Selects set of columns

MapR Data Platform

Top 5 Nature of Payment by count
val res = pdf.groupBy(”Nature_of_Payment")
.count()
.orderBy(desc(count))
.show(5)

Top 5 Nature of Payment by amount of payment
%sql select Nature_of_payment,
sum(amount) as total from payments
group by Nature_of_payment order by total desc limit 5

Top 5 Physician Specialties by total Amount
%sql select physician_specialty, sum(amount) as total
from payments where physician_specialty IS NOT NULL
group by physician_specialty order by total desc limit 5

Average Payment by Specialty

What are the Nature of Payments with payments > $1000 ?
pdf.filter($"amount" > 1000)
.groupBy("Nature_of_payment")
.count().orderBy(desc("count")).show()

Top Payers by Total Amount with count
%sql select payer, payer_state, count(*) as cnt,
sum(amount) as total from payments
group by payer, payer_state order by total desc limit 10

Scenario: InPatient Payment Data
State
DRG Total
Discharges
Average
Charges
Average
Total
payments
Avg
Medicare
Payments
1261770 CA
Heart
Transplant
13 1,172,866 251,876 244,457

MapR Database shell
maprdb mapr:> find /user/mapr/ippstable --limit 2
{"_id":"001_2014_010033", "avg_covered_charges":1172866.385, "avg_medicare_payments":
244457.9231, "avg_total_payments":251876.3077, "drg_code":"001", "drg_definition":"HEART
TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC", "provider_address":"619
SOUTH 19TH STREET", "provider_city":"BIRMINGHAM", "provider_id":"010033",
"provider_name":"UNIVERSITY OF ALABAMA HOSPITAL", "provider_region":"AL -
Birmingham", "provider_state":"AL", "provider_zip":"35233", "total_discharges":13,
"year":"2014"}
{"_id":"001_2014_030103", "avg_covered_charges":437531.3, "avg_medicare_payments":
133509.55, "avg_total_payments":240422.8, "drg_code":"001", "drg_definition":"HEART
TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC",
"provider_address":"5777 EAST MAYO BOULEVARD", "provider_city":"PHOENIX",
"provider_id":"030103", "provider_name":"MAYO CLINIC HOSPITAL", "provider_region":"AZ -
Phoenix", "provider_state":"AZ", "provider_zip":"85054", "total_discharges":20, "year":"2014”}

•  What are the top charges, and payments by Hospital?
•  by Hospital State?
•  What are the top charges, and payments by diagnosis?
•  Most common diagnosis ?
•  What are the statistics for payments for most common diagnosis (Sepsis )
•  What are the most expensive states for most common diagnosis sepsis
Now We Can Ask Questions Like:
State
DRG Total
Discharges
Average
Charges
Average
Total
payments
Avg
Medicare
Payments
1261770 CA
Heart
Transplant
13 1,172,866 251,876 244,457

InPatient Prospective Payment System payments

To Download the code : Streaming data pipeline blogs
•  https://mapr.com/blog/etl-pipeline-healthcare-dataset-with-spark-json-mapr-db/
•  https://mapr.com/blog/streaming-data-pipeline-transform-store-explore-healthcare-
dataset-mapr-db/

http://mapr.com/ebooks/
New Spark Ebook

Read about Streaming Architectures
mapr.com/ebooks

Read about Machine Learning Logistics
mapr.com/ebooks

MapR Free ODT http://learn.mapr.com/
To Learn More: New Spark 2.0 training

https://mapr.com/blog/
MapR Blog

To Learn More:
•  https://mapr.com/blog/ml-iot-connected-medical-devices/

To Learn More:
•  https://mapr.com/blog/how-stream-first-architecture-patterns-are-revolutionizing-
healthcare-platforms/

Applying Machine Learning to Live Patient Data
•  https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to-live-
patient-data

Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with MapR Database

More Related Content

What's hot (19)

Similar to Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with MapR Database (20)

More from Carol McDonald (12)

Recently uploaded (20)

Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with MapR Database