SlideShare a Scribd company logo
Data	Pipeline	using:	Spark,	MapR	Event	Store	
for	Kafka,	and	MapR	Database
2 © 2018 MapR Technologies, Inc
•  Overview	of	Kafka	API	
•  Use	Spark	Structured	Streaming	to	
continously:	
•  Read	from	Kafka	topic	
•  Transfom		
•  Write	to	MapR	document	
database	
•  Analyze	using	Spark	SQL	
	
Agenda	
2
3 © 2018 MapR Technologies, Inc
Use	Case:	Open	Payment	Dataset	
•  Payments Drug and Device companies make to
•  Physicians and Teaching Hospitals for
•  Travel, Research, Gifts, Speaking fees, and Meals
•  https://www.cms.gov/openpayments/
4 © 2018 MapR Technologies, Inc
•  Medicare	payment	data	for	most	common	inpatient	diagnoses	
•  Data	shows	dramatically	different	charges	and	payments		
•  https://www.cms.gov/research-statistics-data-and-systems/statistics-trends-and-
reports/medicare-provider-charge-data/inpatient.html	
InPatient	Payment	Dataset	(IPPS)	
Provider	ID Provider	
State	
DRG Total
Discharges
Average		
Charges
Average	
Total	
payments	
Avg		
Medicare
Payments
1261770 CA
Heart	
Transplant	
13	 1,172,866	 251,876	 244,457
What	is	Streaming	Data?
6 © 2018 MapR Technologies, Inc
What	is	a	Stream	?	
•  A stream is a continuous sequence of events
•  Events are key-value pairs
7 © 2018 MapR Technologies, Inc
What	is	Streaming	Data?	
Fraud detection Smart Machinery Smart Meters Home Automation
Networks Manufacturing Security Systems Patient Monitoring
8 © 2018 MapR Technologies, Inc
Monitoring	devices	combined	with	ML	can	provide	alerts	for	Sepsis,	
which	is	one	of	the	leading	causes	for	death	in	hospitals	
– http://www.computerweekly.com/news/450422258/Putting-sepsis-
algorithms-into-electronic-patient-records	
Examples	of	Streaming	Data
9 © 2018 MapR Technologies, Inc
A	Stanford	team	has	shown	that	a	machine-learning	model	can	identify	arrhythmias	
from	an	EKG	better	than	an	expert	
•  https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-
to-play-doctor/	
Example	of	Streaming	Data	combined	with	Machine	Learning
10 © 2018 MapR Technologies, Inc
	the	ECG	app	on	the	Apple	Watch	can	check	heart	rhythms	and	send	a	notification	if	an	
irregular	heart	rhythm		is	identified.	
•  https://www.apple.com/newsroom/2018/12/ecg-app-and-irregular-heart-
rhythm-notification-available-today-on-apple-watch/	
Example	of	Streaming	Data	combined	with	Machine	Learning
11 © 2018 MapR Technologies, Inc
https://mapr.com/blog/ml-iot-connected-medical-devices/	
Applying	Machine	Learning	to	Live	Patient	Data
Kafka	API	and	Streaming	Data
13 © 2018 MapR Technologies, Inc
Intro	to	the	Kafka	API
14 © 2018 MapR Technologies, Inc
Topics:		
Logical	collection	of	events		
Organize	Events	into	Categories	
Organize	Data	into	Topics	with	the	MapR	Event	Store	for	Kafka
15 © 2018 MapR Technologies, Inc
Topics:		
Logical	collection	of	events		
Organize	Events	into	Categories	
Organize	Data	into	Topics	with	the	MapR	Event	Store	for	Kafka	
Consumers
MapR Cluster
Topic: Pressure
Topic: Temperature
Topic: Warnings
Consumers
Consumers
Kafka API Kafka API
16 © 2018 MapR Technologies, Inc
Topics	are	partitioned	for	throughput	and	
scalability	
	
Scalable	Messaging	with	MapR	Event	Streams	
Server 1
Partition1: Topic - Pressure
Partition1: Topic - Temperature
Partition1: Topic - Warning
Server 2
Partition2: Topic - Pressure
Partition2: Topic - Temperature
Partition2: Topic - Warning
Server 3
Partition3: Topic - Pressure
Partition3: Topic - Temperature
Partition3: Topic - Warning
17 © 2018 MapR Technologies, Inc
New	Messages	are		
Added	to	the	end	
Partition	is	like	an	Event	Log	
New
Message
6 5 4 3 2 1
Old
Message
18 © 2018 MapR Technologies, Inc
Messages	are	delivered	in	the	order	they	are	received	
Partition	is	like	a	Queue
19 © 2018 MapR Technologies, Inc
Events	remain	on	the	partition,	available	to	other	consumers	
	
	
Unlike	a	queue,	Events	are	not	deleted	when	Read
20 © 2018 MapR Technologies, Inc
Messages	can	be	persisted	forever		
Or			
Older	messages	can	be	deleted	automatically	based	on	time	to	live	
Not	deleting	messages,	minimizes	disk	read/writes	
		
	
When	Are	Messages	Deleted?	
MapR Cluster
6 5 4 3 2 1Partition
1
Older
message
21 © 2018 MapR Technologies, Inc
High	Performance	at	Scale:		
•  Partitioning	for	Parallel	operations		
•  Not	deleting	messages,	minimizes	disk	read/writes
22 © 2018 MapR Technologies, Inc
Processing	Same	Message	for	Different	Purposes
23 © 2018 MapR Technologies, Inc
Georgia	Health	Connect	
ALLOY Health:
Exchange State HIE
Clinical Data Viewer
Reporting and Analytics
Clinical Data
Financial Data
Provider
Organizations
What are the outcomes in the entire
state on diabetes?
Are there doctors that are doing this
better than others?
Georgia Health Connect
24 © 2018 MapR Technologies, Inc
Use	Case:	Streaming	System	of	Record	for	Healthcare
Spark	Structured	Streaming
26 © 2018 MapR Technologies, Inc
Spark	Distributed	Datasets	
partitioned
•  Distributed collection of objects
Dataset[T]
•  Partitioned across a cluster
•  Operated on in parallel
•  in memory can be Cached
27 © 2018 MapR Technologies, Inc
DataFrame	=	Dataset[Row]	can	use	Spark	SQL	
	
DataFrame is like a table
Dataset[Row]
row
columns
28 © 2018 MapR Technologies, Inc
Dataset[Typed	Object]		can	use	Spark	SQL	and	Functions	
A Dataset is a distributed collection of objects
Dataset[objects]
object
columns
partitioned
29 © 2018 MapR Technologies, Inc
Process	the	Data	with	Spark	Structured	Streaming
30 © 2018 MapR Technologies, Inc
Datasets	Read	from	Stream	
Task	
Cache		
Process
& Cache
Data
offsets
Stream
partition
Task	
Cache		
Process
& Cache
Data
Task	
Cache		
Process
& Cache
Data
Driver	
Stream
partition
Stream
partition
Data is cached for
aggregations
And windowed
functions
31 © 2018 MapR Technologies, Inc
new data in the
data stream
=
new rows appended
to an unbounded table
Data stream as an unbounded table
	
Treat	Stream	as	Unbounded	Tables
32 © 2018 MapR Technologies, Inc
The	Stream	is	continuously	processed
33 © 2018 MapR Technologies, Inc
Spark	automatically		streamifies	SQL	plans	
Image	reference	Databricks
34 © 2018 MapR Technologies, Inc
Stream	Processing
35 © 2018 MapR Technologies, Inc
Use	Case:	Payment	Data	
"NEW","Covered Recipient Physician",,,,"132655","GREGG","D","ALZATE",,"8745
AERO DRIVE","STE 200","SAN DIEGO","CA","92123","United States",,,"Medical
Doctor","Allopathic & Osteopathic Physicians|Radiology|Diagnostic
Radiology","CA",,,,,"DFINE, Inc","100000000326","DFINE, Inc","CA","United States",
90.87,"02/12/2016","1","In-kind items and services","Food and Beverage",,,,"No","No
Third Party
Payment",,,,,"No","346039438","No","Yes","Covered","Device","Radiology","StabiliT",
,"Covered","Device","Radiology","STAR Tumor Ablation
System",,,,,,,,,,,,,,,,,"2016","06/30/2017"
{
"_id":"317150_08/26/2016_346122858",
"physician_id":"317150",
"date_payment":"08/26/2016",
"record_id":"346122858",
"payer":"Mission Pharmacal Company",
"amount":9.23,
"Physician_Specialty":"Obstetrics & Gynecology",
"Nature_of_payment":"Food and Beverage"
}
36 © 2018 MapR Technologies, Inc
Scenario:	Payment	Data	
Provider	ID Date Payer Payer
State
Provider	
Specialty
Provider	
State	
Amount Payment
Nature
1261770 01/11/2016
Southern	
Anesthesia	&	
Surgical,	Inc	
CO	
Oral	and	
Maxillofacial	
Surgery	
CA	 117.5 Food	and	Beverage
37 © 2018 MapR Technologies, Inc
val df1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "maprdemo:9092")
.option("subscribe", "/apps/stream:payments”)
.option("startingOffsets", "earliest")
.option("failOnDataLoss", false)
.option("maxOffsetsPerTrigger", 1000)
.load()
	
Streaming	pipeline		Kafka	Data	source
38 © 2018 MapR Technologies, Inc
df1.printSchema()
root

|-- key: binary (nullable = true)

|-- value: binary (nullable = true)

|-- topic: string (nullable = true)

|-- partition: integer (nullable = true)

|-- offset: long (nullable = true)

|-- timestamp: timestamp (nullable = true)

|-- timestampType: integer (nullable = true)
Kafka	DataFrame	schema
39 © 2018 MapR Technologies, Inc
case class Payment(_id: String, physician_id: String,
date_payment: String, payer: String, payer_state: String,
amount: Double, physician_type: String,physician_specialty: String,
physician_state: String, nature_of_payment: String)
def parsePayment(str: String): Payment = {
val td = str.split(",(?=([^"]*"[^"]*")*[^"]*$)")
val physician_id =td(5)
val amount = td(30).toDouble
. . .
Payment(id, physician_id, date_payment, payer, payer_state, amount,
physician_type, focus, physician_state, nature_of_payment)
}
	
Function	to	Parse	CSV	data	to	Payment	Object
40 © 2018 MapR Technologies, Inc
//register a user-defined function (UDF) to deserialize the message
spark.udf.register("deserialize",
(message: String) => parsePayment(message))
//use the UDF in a select expression
val df2 = df1.selectExpr("""deserialize(CAST(value as STRING)) AS
message""").select($"message".as[Payment])
	
Parse	message	txt	to	Payment	Object
41 © 2018 MapR Technologies, Inc
Writing	to	a	Memory	Sink	
Write		results	to	Memory	
Start	running	the	query	
val	query	=	cdf	
	.writeStream	
	.queryName("uber")	
	.format("memory")	
	.outputMode("append”)		
	
query.start().awaitTermination()
42 © 2018 MapR Technologies, Inc
%sql	select	*	from	payments	limit	4:		
Streaming	Applicaton	
+--------------------+------------+------------+--------------------+-----------+--------+-------------------+--------------------+---------------+------------------+
| _id|physician_id|date_payment| payer|payer_state| amount| physician_type| physician_specialty|physician_state| nature_of_payment|
+--------------------+------------+------------+--------------------+-----------+--------+-------------------+--------------------+---------------+------------------+
|CA_Diagnostic Rad...| 132655| 02/12/2016| DFINE, Inc| CA| 90.87| Medical Doctor|Diagnostic Radiology| CA| Food and Beverage|
|AZ_Vascular & Int...| 1006832| 01/26/2016| DFINE, Inc| CA| 25.0| Medical Doctor|Vascular & Interv...| AZ|Travel and Lodging|
|AZ_Vascular & Int...| 1006832| 02/13/2016| DFINE, Inc| CA| 32.0| Medical Doctor|Vascular & Interv...| AZ|Travel and Lodging|
|AZ_Vascular & Int...| 1006832| 02/19/2016| DFINE, Inc| CA| 27.27| Medical Doctor|Vascular & Interv...| AZ|Travel and Lodging|
43 © 2018 MapR Technologies, Inc
%sql	select	*	from	payments	limit	3:		
Streaming	Applicaton
44 © 2018 MapR Technologies, Inc
%sql	select	physician_specialty,	count(*)	as	cnt	from	payment2		
									group	by	physician_specialty	order	by	cnt	desc	
Streaming	Applicaton
Spark		&	MapR-DB
46 © 2018 MapR Technologies, Inc
MapR-DB Connector for Apache Spark 	
Spark	Streaming	writing	to	MapR-DB	JSON
47 © 2018 MapR Technologies, Inc
Spark	MapR-DB	Connector
48 © 2018 MapR Technologies, Inc
Designed	for	Partitioning	and	Scaling
49 © 2018 MapR Technologies, Inc
MapR-DB	JSON	Document	Store	
Data is automatically partitioned
and sorted by _id row key!
{
"_id":”AK_08/26/2016_346122858",
"physician_id":"317150",
"date_payment":"08/26/2016",
"payer":"Mission Pharmacal Company",
"payer_state":”CO",
"amount":9.23,
”physician_specialty":”Gynecology",
“physician_state":”AK"
”nature_of_payment":"Food and Beverage"
}
50 © 2018 MapR Technologies, Inc
MapR	Database	shell	
maprdb mapr:> find /user/mapr/paytable --limit 2
{"_id":"AK_01/03/2016_1309545_391702408", "amount":94.21,
"date_payment":"01/03/2016", "nature_of_payment":"Food and Beverage",
"payer":"Stryker Corporation", "payer_state":"MI","physician_id":"1309545",
"physician_specialty":"Specialist", "physician_state":"AK",
"physician_type":"Medical Doctor"}
{"_id":"AK_01/04/2016_100801_396460394","amount":11.59,
"date_payment":"01/04/2016", "nature_of_payment":"Food and Beverage",
"payer":"AbbVie, Inc.", "payer_state":"IL", "physician_id":"100801",
"physician_specialty":"Family Medicine", "physician_state":"AK",
"physician_type":"Medical Doctor"}
2 document(s) found.	
Data is automatically partitioned
and sorted by _id row key!
51 © 2018 MapR Technologies, Inc
Writing	to	a	MapR-DB	Sink	
Write		Streaming	DataFrame	
Query	Results	to	MapR-DB	
	
Start	running	the	query	
	val	query	=	cdf.writeStream	
						.format(MapRDBSourceConfig.Format)	
						.option(MapRDBSourceConfig.TablePathOption,	tableName)	
						.option(MapRDBSourceConfig.IdFieldPathOption,	"_id")	
						.option(MapRDBSourceConfig.CreateTableOption,	false)	
						.option("checkpointLocation",	"/user/mapr/check")	
						.option(MapRDBSourceConfig.BulkModeOption,	true)	
						.option(MapRDBSourceConfig.SampleSizeOption,	1000)	
	
				query.start().awaitTermination()
52 © 2018 MapR Technologies, Inc
Streaming	Dataset	Read	from	Topic	partitions,	Transformed,	
Written	to	MapR	Database	partitions
53 © 2018 MapR Technologies, Inc
Store	in	MapR	Database
Explore	the	Data	With	Spark	SQL
55 © 2018 MapR Technologies, Inc
•  Spark SQL queries and updates to MapR-DB
•  With projection and filter pushdown, custom partitioning, and data locality
	
Spark	SQL	Querying	MapR-DB	JSON
56 © 2018 MapR Technologies, Inc
val pdf: Dataset[Payment] = spark
.loadFromMapRDB[Payment](tableName, schema)
.as[Payment]
	
Spark	Distributed	Datasets	read	from	MapR-DB	Partitions	
Worker	
Task	
Worker	
Driver	
Cache	1	
Cache	2	
Cache	3	
Process
& Cache
Data
Process
& Cache
Data
Process
& Cache
Data
Task	
Task	
Driver	
tasks
tasks
tasks
57 © 2018 MapR Technologies, Inc
Data
Frame
Load data
pdf.createOrReplaceTempView(”payment")	
pdf.select("_id",	"payer",	"amount”).show	
	
Load	the	data	into	a	Dataframe	
Data is automatically partitioned
and sorted by _id row key!
58 © 2018 MapR Technologies, Inc
•  Which	physician	specialties,	or	physician	states,	are	getting	the	most	payments	?		
•  Count	,	total	amount	$,		average	amount	$,	standard	deviation	
•  Which	companies,	or	company	states,		are	making	the	most	payments	?		
•  Count	,	total	amount	$,		average	amount	$,	standard	deviation	
•  What	are	the	Top	Nature	of	Payments	by	count,	amount		?	
•  What	are	the	payments	over	$2,000,000	?	
Now	We	Can	Ask	Questions	Like:	
Physician	ID Date Payer Payer
State
Physician	
Specialty
Physician	
State	
Amount Payment
Nature
1261770 01/11/2016
Southern	
Anesthesia	&	
Surgical,	Inc	
CO	
Oral	and	
Maxillofacial	
Surgery	
CA	 117.5 Food	and	Beverage
59 © 2018 MapR Technologies, Inc
Language	Integrated	Queries	
L	I	Query	 Description	
agg(expr, exprs) Aggregates	on	entire	DataFrame	
distinct Returns	new	DataFrame	with	unique	rows	
except(other) Returns	new	DataFrame	with	rows	from	this	DataFrame	not	in	other	
DataFrame	
filter(expr);
where(condition)
Filter	based	on	the	SQL	expression	or	condition	
groupBy(cols:
Columns)
Groups	DataFrame	using	specified	columns	
join (DataFrame,
joinExpr)
Joins	with	another	DataFrame	using	given	join	expression	
sort(sortcol) Returns	new	DataFrame	sorted	by	specified	column	
select(col) Selects	set	of	columns
60 © 2018 MapR Technologies, Inc
MapR	Data	Platform
61 © 2018 MapR Technologies, Inc
Top	5	Nature	of	Payment	by	count	
val res = pdf.groupBy(”Nature_of_Payment")
.count()
.orderBy(desc(count))
.show(5)
62 © 2018 MapR Technologies, Inc
Top		5	Nature	of	Payment	by	amount	of	payment	
%sql select Nature_of_payment,
sum(amount) as total from payments
group by Nature_of_payment order by total desc limit 5
63 © 2018 MapR Technologies, Inc
Top	5	Physician	Specialties	by	total	Amount	
%sql select physician_specialty, sum(amount) as total
from payments where physician_specialty IS NOT NULL
group by physician_specialty order by total desc limit 5
64 © 2018 MapR Technologies, Inc
Average	Payment	by	Specialty
65 © 2018 MapR Technologies, Inc
What	are	the	Nature	of	Payments	with	payments	>	$1000	?	
pdf.filter($"amount" > 1000)
.groupBy("Nature_of_payment")
.count().orderBy(desc("count")).show()
66 © 2018 MapR Technologies, Inc
Top	Payers	by	Total	Amount	with	count	
%sql select payer, payer_state, count(*) as cnt,
sum(amount) as total from payments
group by payer, payer_state order by total desc limit 10
67 © 2018 MapR Technologies, Inc
Scenario:	InPatient	Payment	Data	
Provider	ID Provider	
State	
DRG Total
Discharges
Average		
Charges
Average	
Total	
payments	
Avg		
Medicare
Payments
1261770 CA
Heart	
Transplant	
13	 1,172,866	 251,876	 244,457
68 © 2018 MapR Technologies, Inc
MapR	Database	shell	
maprdb mapr:> find /user/mapr/ippstable --limit 2
{"_id":"001_2014_010033", "avg_covered_charges":1172866.385, "avg_medicare_payments":
244457.9231, "avg_total_payments":251876.3077, "drg_code":"001", "drg_definition":"HEART
TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC", "provider_address":"619
SOUTH 19TH STREET", "provider_city":"BIRMINGHAM", "provider_id":"010033",
"provider_name":"UNIVERSITY OF ALABAMA HOSPITAL", "provider_region":"AL -
Birmingham", "provider_state":"AL", "provider_zip":"35233", "total_discharges":13,
"year":"2014"}
{"_id":"001_2014_030103", "avg_covered_charges":437531.3, "avg_medicare_payments":
133509.55, "avg_total_payments":240422.8, "drg_code":"001", "drg_definition":"HEART
TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC",
"provider_address":"5777 EAST MAYO BOULEVARD", "provider_city":"PHOENIX",
"provider_id":"030103", "provider_name":"MAYO CLINIC HOSPITAL", "provider_region":"AZ -
Phoenix", "provider_state":"AZ", "provider_zip":"85054", "total_discharges":20, "year":"2014”}
69 © 2018 MapR Technologies, Inc
•  What	are	the	top	charges,	and		payments	by	Hospital?		
•  	 	by	Hospital	State?		
•  What	are	the	top	charges,	and		payments	by	diagnosis?		
•  Most	common	diagnosis	?		
•  What	are	the	statistics	for	payments	for	most	common	diagnosis	(Sepsis	)	
•  What	are	the	most	expensive	states	for	most	common	diagnosis		sepsis	
Now	We	Can	Ask	Questions	Like:	
Provider	ID Provider	
State	
DRG Total
Discharges
Average		
Charges
Average	
Total	
payments	
Avg		
Medicare
Payments
1261770 CA
Heart	
Transplant	
13	 1,172,866	 251,876	 244,457
70 © 2018 MapR Technologies, Inc
InPatient	Prospective	Payment	System	payments
71 © 2018 MapR Technologies, Inc
To	Download	the	code	:	Streaming	data	pipeline	blogs	
•  https://mapr.com/blog/etl-pipeline-healthcare-dataset-with-spark-json-mapr-db/	
•  https://mapr.com/blog/streaming-data-pipeline-transform-store-explore-healthcare-
dataset-mapr-db/
72 © 2018 MapR Technologies, Inc
http://mapr.com/ebooks/	
New	Spark	Ebook
73 © 2018 MapR Technologies, Inc
Read	about	Streaming	Architectures	
mapr.com/ebooks
74 © 2018 MapR Technologies, Inc
Read	about	Machine	Learning	Logistics	
mapr.com/ebooks
75 © 2018 MapR Technologies, Inc
MapR	Free	ODT			http://learn.mapr.com/	
To	Learn	More:	New	Spark	2.0	training
76 © 2018 MapR Technologies, Inc
https://mapr.com/blog/	
MapR	Blog
77 © 2018 MapR Technologies, Inc
To	Learn	More:	
•  https://mapr.com/blog/ml-iot-connected-medical-devices/
78 © 2018 MapR Technologies, Inc
To	Learn	More:	
•  https://mapr.com/blog/how-stream-first-architecture-patterns-are-revolutionizing-
healthcare-platforms/
79 © 2018 MapR Technologies, Inc
Applying	Machine	Learning	to	Live	Patient	Data	
•  https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to-live-
patient-data

More Related Content

PDF
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
PDF
Predicting Flight Delays with Spark Machine Learning
PDF
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
PDF
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
PDF
How Big Data is Reducing Costs and Improving Outcomes in Health Care
PPTX
Apache Spark Machine Learning Decision Trees
PDF
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
PDF
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
Predicting Flight Delays with Spark Machine Learning
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
How Big Data is Reducing Costs and Improving Outcomes in Health Care
Apache Spark Machine Learning Decision Trees
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...

What's hot (19)

PDF
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
PDF
Applying Machine Learning to Live Patient Data
PDF
Introduction to machine learning with GPUs
PDF
Demystifying AI, Machine Learning and Deep Learning
PDF
Streaming patterns revolutionary architectures
PDF
Fast Cars, Big Data How Streaming can help Formula 1
PDF
Spark graphx
PPTX
Managing a Multi-Tenant Data Lake
PDF
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
PDF
Live Machine Learning Tutorial: Churn Prediction
PDF
The Keys to Digital Transformation
PPTX
MapR on Azure: Getting Value from Big Data in the Cloud -
PDF
Enterprise Data Lakes
PDF
Advanced Threat Detection on Streaming Data
PDF
Insight Platforms Accelerate Digital Transformation
PPTX
Building a Scalable Data Science Platform with R
PPTX
When Streaming Becomes Strategic
PPTX
Geo-Distributed Big Data and Analytics
PPTX
Deep Learning vs. Cheap Learning
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to Live Patient Data
Introduction to machine learning with GPUs
Demystifying AI, Machine Learning and Deep Learning
Streaming patterns revolutionary architectures
Fast Cars, Big Data How Streaming can help Formula 1
Spark graphx
Managing a Multi-Tenant Data Lake
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Live Machine Learning Tutorial: Churn Prediction
The Keys to Digital Transformation
MapR on Azure: Getting Value from Big Data in the Cloud -
Enterprise Data Lakes
Advanced Threat Detection on Streaming Data
Insight Platforms Accelerate Digital Transformation
Building a Scalable Data Science Platform with R
When Streaming Becomes Strategic
Geo-Distributed Big Data and Analytics
Deep Learning vs. Cheap Learning
Ad

Similar to Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with MapR Database (20)

PDF
Dive into Spark Streaming
PDF
Spark Streaming Data Pipelines
PDF
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
PDF
Avoiding Common Pitfalls: Spark Structured Streaming with Kafka
PPTX
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
PDF
What's new with Apache Spark's Structured Streaming?
PDF
Streaming Analytics for Financial Enterprises
PDF
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
PPTX
Using The Hadoop Ecosystem to Drive Healthcare Innovation
PDF
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...
PPTX
How Spark is Enabling the New Wave of Converged Cloud Applications
PPTX
How Spark is Enabling the New Wave of Converged Applications
PPTX
Architectual Comparison of Apache Apex and Spark Streaming
PDF
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
PPTX
Enabling Real-Time Business with Change Data Capture
PPTX
Spark Streaming Recipes and "Exactly Once" Semantics Revised
PPTX
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
PDF
Making Structured Streaming Ready for Production
PDF
Apache Big Data Europe 2015: Selected Talks
PDF
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Dive into Spark Streaming
Spark Streaming Data Pipelines
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Avoiding Common Pitfalls: Spark Structured Streaming with Kafka
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
What's new with Apache Spark's Structured Streaming?
Streaming Analytics for Financial Enterprises
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Using The Hadoop Ecosystem to Drive Healthcare Innovation
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ...
How Spark is Enabling the New Wave of Converged Cloud Applications
How Spark is Enabling the New Wave of Converged Applications
Architectual Comparison of Apache Apex and Spark Streaming
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Enabling Real-Time Business with Change Data Capture
Spark Streaming Recipes and "Exactly Once" Semantics Revised
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Making Structured Streaming Ready for Production
Apache Big Data Europe 2015: Selected Talks
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Ad

More from Carol McDonald (12)

PDF
Spark machine learning predicting customer churn
PDF
Streaming Patterns Revolutionary Architectures with the Kafka API
PDF
Apache Spark Machine Learning
PDF
Build a Time Series Application with Apache Spark and Apache HBase
PDF
Apache Spark streaming and HBase
PDF
Machine Learning Recommendations with Spark
PDF
Apache Spark Overview
PDF
Introduction to Spark
DOC
CU9411MW.DOC
PDF
Getting started with HBase
PDF
Introduction to Spark on Hadoop
PDF
NoSQL HBase schema design and SQL with Apache Drill
Spark machine learning predicting customer churn
Streaming Patterns Revolutionary Architectures with the Kafka API
Apache Spark Machine Learning
Build a Time Series Application with Apache Spark and Apache HBase
Apache Spark streaming and HBase
Machine Learning Recommendations with Spark
Apache Spark Overview
Introduction to Spark
CU9411MW.DOC
Getting started with HBase
Introduction to Spark on Hadoop
NoSQL HBase schema design and SQL with Apache Drill

Recently uploaded (20)

PDF
How to Choose the Right IT Partner for Your Business in Malaysia
PPTX
VVF-Customer-Presentation2025-Ver1.9.pptx
PDF
medical staffing services at VALiNTRY
PDF
System and Network Administration Chapter 2
PDF
Softaken Excel to vCard Converter Software.pdf
PDF
wealthsignaloriginal-com-DS-text-... (1).pdf
PDF
How to Migrate SBCGlobal Email to Yahoo Easily
PDF
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
PDF
Design an Analysis of Algorithms I-SECS-1021-03
PPTX
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
PPTX
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
PDF
System and Network Administraation Chapter 3
PDF
Navsoft: AI-Powered Business Solutions & Custom Software Development
PDF
Wondershare Filmora 15 Crack With Activation Key [2025
PPTX
Introduction to Artificial Intelligence
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 41
PDF
Digital Strategies for Manufacturing Companies
PDF
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
PDF
AI in Product Development-omnex systems
PDF
top salesforce developer skills in 2025.pdf
How to Choose the Right IT Partner for Your Business in Malaysia
VVF-Customer-Presentation2025-Ver1.9.pptx
medical staffing services at VALiNTRY
System and Network Administration Chapter 2
Softaken Excel to vCard Converter Software.pdf
wealthsignaloriginal-com-DS-text-... (1).pdf
How to Migrate SBCGlobal Email to Yahoo Easily
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
Design an Analysis of Algorithms I-SECS-1021-03
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
System and Network Administraation Chapter 3
Navsoft: AI-Powered Business Solutions & Custom Software Development
Wondershare Filmora 15 Crack With Activation Key [2025
Introduction to Artificial Intelligence
Internet Downloader Manager (IDM) Crack 6.42 Build 41
Digital Strategies for Manufacturing Companies
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
AI in Product Development-omnex systems
top salesforce developer skills in 2025.pdf

Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with MapR Database

  • 2. 2 © 2018 MapR Technologies, Inc •  Overview of Kafka API •  Use Spark Structured Streaming to continously: •  Read from Kafka topic •  Transfom •  Write to MapR document database •  Analyze using Spark SQL Agenda 2
  • 3. 3 © 2018 MapR Technologies, Inc Use Case: Open Payment Dataset •  Payments Drug and Device companies make to •  Physicians and Teaching Hospitals for •  Travel, Research, Gifts, Speaking fees, and Meals •  https://www.cms.gov/openpayments/
  • 4. 4 © 2018 MapR Technologies, Inc •  Medicare payment data for most common inpatient diagnoses •  Data shows dramatically different charges and payments •  https://www.cms.gov/research-statistics-data-and-systems/statistics-trends-and- reports/medicare-provider-charge-data/inpatient.html InPatient Payment Dataset (IPPS) Provider ID Provider State DRG Total Discharges Average Charges Average Total payments Avg Medicare Payments 1261770 CA Heart Transplant 13 1,172,866 251,876 244,457
  • 6. 6 © 2018 MapR Technologies, Inc What is a Stream ? •  A stream is a continuous sequence of events •  Events are key-value pairs
  • 7. 7 © 2018 MapR Technologies, Inc What is Streaming Data? Fraud detection Smart Machinery Smart Meters Home Automation Networks Manufacturing Security Systems Patient Monitoring
  • 8. 8 © 2018 MapR Technologies, Inc Monitoring devices combined with ML can provide alerts for Sepsis, which is one of the leading causes for death in hospitals – http://www.computerweekly.com/news/450422258/Putting-sepsis- algorithms-into-electronic-patient-records Examples of Streaming Data
  • 9. 9 © 2018 MapR Technologies, Inc A Stanford team has shown that a machine-learning model can identify arrhythmias from an EKG better than an expert •  https://www.technologyreview.com/s/608234/the-machines-are-getting-ready- to-play-doctor/ Example of Streaming Data combined with Machine Learning
  • 10. 10 © 2018 MapR Technologies, Inc the ECG app on the Apple Watch can check heart rhythms and send a notification if an irregular heart rhythm is identified. •  https://www.apple.com/newsroom/2018/12/ecg-app-and-irregular-heart- rhythm-notification-available-today-on-apple-watch/ Example of Streaming Data combined with Machine Learning
  • 11. 11 © 2018 MapR Technologies, Inc https://mapr.com/blog/ml-iot-connected-medical-devices/ Applying Machine Learning to Live Patient Data
  • 13. 13 © 2018 MapR Technologies, Inc Intro to the Kafka API
  • 14. 14 © 2018 MapR Technologies, Inc Topics: Logical collection of events Organize Events into Categories Organize Data into Topics with the MapR Event Store for Kafka
  • 15. 15 © 2018 MapR Technologies, Inc Topics: Logical collection of events Organize Events into Categories Organize Data into Topics with the MapR Event Store for Kafka Consumers MapR Cluster Topic: Pressure Topic: Temperature Topic: Warnings Consumers Consumers Kafka API Kafka API
  • 16. 16 © 2018 MapR Technologies, Inc Topics are partitioned for throughput and scalability Scalable Messaging with MapR Event Streams Server 1 Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Server 2 Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Server 3 Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning
  • 17. 17 © 2018 MapR Technologies, Inc New Messages are Added to the end Partition is like an Event Log New Message 6 5 4 3 2 1 Old Message
  • 18. 18 © 2018 MapR Technologies, Inc Messages are delivered in the order they are received Partition is like a Queue
  • 19. 19 © 2018 MapR Technologies, Inc Events remain on the partition, available to other consumers Unlike a queue, Events are not deleted when Read
  • 20. 20 © 2018 MapR Technologies, Inc Messages can be persisted forever Or Older messages can be deleted automatically based on time to live Not deleting messages, minimizes disk read/writes When Are Messages Deleted? MapR Cluster 6 5 4 3 2 1Partition 1 Older message
  • 21. 21 © 2018 MapR Technologies, Inc High Performance at Scale: •  Partitioning for Parallel operations •  Not deleting messages, minimizes disk read/writes
  • 22. 22 © 2018 MapR Technologies, Inc Processing Same Message for Different Purposes
  • 23. 23 © 2018 MapR Technologies, Inc Georgia Health Connect ALLOY Health: Exchange State HIE Clinical Data Viewer Reporting and Analytics Clinical Data Financial Data Provider Organizations What are the outcomes in the entire state on diabetes? Are there doctors that are doing this better than others? Georgia Health Connect
  • 24. 24 © 2018 MapR Technologies, Inc Use Case: Streaming System of Record for Healthcare
  • 26. 26 © 2018 MapR Technologies, Inc Spark Distributed Datasets partitioned •  Distributed collection of objects Dataset[T] •  Partitioned across a cluster •  Operated on in parallel •  in memory can be Cached
  • 27. 27 © 2018 MapR Technologies, Inc DataFrame = Dataset[Row] can use Spark SQL DataFrame is like a table Dataset[Row] row columns
  • 28. 28 © 2018 MapR Technologies, Inc Dataset[Typed Object] can use Spark SQL and Functions A Dataset is a distributed collection of objects Dataset[objects] object columns partitioned
  • 29. 29 © 2018 MapR Technologies, Inc Process the Data with Spark Structured Streaming
  • 30. 30 © 2018 MapR Technologies, Inc Datasets Read from Stream Task Cache Process & Cache Data offsets Stream partition Task Cache Process & Cache Data Task Cache Process & Cache Data Driver Stream partition Stream partition Data is cached for aggregations And windowed functions
  • 31. 31 © 2018 MapR Technologies, Inc new data in the data stream = new rows appended to an unbounded table Data stream as an unbounded table Treat Stream as Unbounded Tables
  • 32. 32 © 2018 MapR Technologies, Inc The Stream is continuously processed
  • 33. 33 © 2018 MapR Technologies, Inc Spark automatically streamifies SQL plans Image reference Databricks
  • 34. 34 © 2018 MapR Technologies, Inc Stream Processing
  • 35. 35 © 2018 MapR Technologies, Inc Use Case: Payment Data "NEW","Covered Recipient Physician",,,,"132655","GREGG","D","ALZATE",,"8745 AERO DRIVE","STE 200","SAN DIEGO","CA","92123","United States",,,"Medical Doctor","Allopathic & Osteopathic Physicians|Radiology|Diagnostic Radiology","CA",,,,,"DFINE, Inc","100000000326","DFINE, Inc","CA","United States", 90.87,"02/12/2016","1","In-kind items and services","Food and Beverage",,,,"No","No Third Party Payment",,,,,"No","346039438","No","Yes","Covered","Device","Radiology","StabiliT", ,"Covered","Device","Radiology","STAR Tumor Ablation System",,,,,,,,,,,,,,,,,"2016","06/30/2017" { "_id":"317150_08/26/2016_346122858", "physician_id":"317150", "date_payment":"08/26/2016", "record_id":"346122858", "payer":"Mission Pharmacal Company", "amount":9.23, "Physician_Specialty":"Obstetrics & Gynecology", "Nature_of_payment":"Food and Beverage" }
  • 36. 36 © 2018 MapR Technologies, Inc Scenario: Payment Data Provider ID Date Payer Payer State Provider Specialty Provider State Amount Payment Nature 1261770 01/11/2016 Southern Anesthesia & Surgical, Inc CO Oral and Maxillofacial Surgery CA 117.5 Food and Beverage
  • 37. 37 © 2018 MapR Technologies, Inc val df1 = spark.readStream.format("kafka") .option("kafka.bootstrap.servers", "maprdemo:9092") .option("subscribe", "/apps/stream:payments”) .option("startingOffsets", "earliest") .option("failOnDataLoss", false) .option("maxOffsetsPerTrigger", 1000) .load() Streaming pipeline Kafka Data source
  • 38. 38 © 2018 MapR Technologies, Inc df1.printSchema() root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true) Kafka DataFrame schema
  • 39. 39 © 2018 MapR Technologies, Inc case class Payment(_id: String, physician_id: String, date_payment: String, payer: String, payer_state: String, amount: Double, physician_type: String,physician_specialty: String, physician_state: String, nature_of_payment: String) def parsePayment(str: String): Payment = { val td = str.split(",(?=([^"]*"[^"]*")*[^"]*$)") val physician_id =td(5) val amount = td(30).toDouble . . . Payment(id, physician_id, date_payment, payer, payer_state, amount, physician_type, focus, physician_state, nature_of_payment) } Function to Parse CSV data to Payment Object
  • 40. 40 © 2018 MapR Technologies, Inc //register a user-defined function (UDF) to deserialize the message spark.udf.register("deserialize", (message: String) => parsePayment(message)) //use the UDF in a select expression val df2 = df1.selectExpr("""deserialize(CAST(value as STRING)) AS message""").select($"message".as[Payment]) Parse message txt to Payment Object
  • 41. 41 © 2018 MapR Technologies, Inc Writing to a Memory Sink Write results to Memory Start running the query val query = cdf .writeStream .queryName("uber") .format("memory") .outputMode("append”) query.start().awaitTermination()
  • 42. 42 © 2018 MapR Technologies, Inc %sql select * from payments limit 4: Streaming Applicaton +--------------------+------------+------------+--------------------+-----------+--------+-------------------+--------------------+---------------+------------------+ | _id|physician_id|date_payment| payer|payer_state| amount| physician_type| physician_specialty|physician_state| nature_of_payment| +--------------------+------------+------------+--------------------+-----------+--------+-------------------+--------------------+---------------+------------------+ |CA_Diagnostic Rad...| 132655| 02/12/2016| DFINE, Inc| CA| 90.87| Medical Doctor|Diagnostic Radiology| CA| Food and Beverage| |AZ_Vascular & Int...| 1006832| 01/26/2016| DFINE, Inc| CA| 25.0| Medical Doctor|Vascular & Interv...| AZ|Travel and Lodging| |AZ_Vascular & Int...| 1006832| 02/13/2016| DFINE, Inc| CA| 32.0| Medical Doctor|Vascular & Interv...| AZ|Travel and Lodging| |AZ_Vascular & Int...| 1006832| 02/19/2016| DFINE, Inc| CA| 27.27| Medical Doctor|Vascular & Interv...| AZ|Travel and Lodging|
  • 43. 43 © 2018 MapR Technologies, Inc %sql select * from payments limit 3: Streaming Applicaton
  • 44. 44 © 2018 MapR Technologies, Inc %sql select physician_specialty, count(*) as cnt from payment2 group by physician_specialty order by cnt desc Streaming Applicaton
  • 46. 46 © 2018 MapR Technologies, Inc MapR-DB Connector for Apache Spark Spark Streaming writing to MapR-DB JSON
  • 47. 47 © 2018 MapR Technologies, Inc Spark MapR-DB Connector
  • 48. 48 © 2018 MapR Technologies, Inc Designed for Partitioning and Scaling
  • 49. 49 © 2018 MapR Technologies, Inc MapR-DB JSON Document Store Data is automatically partitioned and sorted by _id row key! { "_id":”AK_08/26/2016_346122858", "physician_id":"317150", "date_payment":"08/26/2016", "payer":"Mission Pharmacal Company", "payer_state":”CO", "amount":9.23, ”physician_specialty":”Gynecology", “physician_state":”AK" ”nature_of_payment":"Food and Beverage" }
  • 50. 50 © 2018 MapR Technologies, Inc MapR Database shell maprdb mapr:> find /user/mapr/paytable --limit 2 {"_id":"AK_01/03/2016_1309545_391702408", "amount":94.21, "date_payment":"01/03/2016", "nature_of_payment":"Food and Beverage", "payer":"Stryker Corporation", "payer_state":"MI","physician_id":"1309545", "physician_specialty":"Specialist", "physician_state":"AK", "physician_type":"Medical Doctor"} {"_id":"AK_01/04/2016_100801_396460394","amount":11.59, "date_payment":"01/04/2016", "nature_of_payment":"Food and Beverage", "payer":"AbbVie, Inc.", "payer_state":"IL", "physician_id":"100801", "physician_specialty":"Family Medicine", "physician_state":"AK", "physician_type":"Medical Doctor"} 2 document(s) found. Data is automatically partitioned and sorted by _id row key!
  • 51. 51 © 2018 MapR Technologies, Inc Writing to a MapR-DB Sink Write Streaming DataFrame Query Results to MapR-DB Start running the query val query = cdf.writeStream .format(MapRDBSourceConfig.Format) .option(MapRDBSourceConfig.TablePathOption, tableName) .option(MapRDBSourceConfig.IdFieldPathOption, "_id") .option(MapRDBSourceConfig.CreateTableOption, false) .option("checkpointLocation", "/user/mapr/check") .option(MapRDBSourceConfig.BulkModeOption, true) .option(MapRDBSourceConfig.SampleSizeOption, 1000) query.start().awaitTermination()
  • 52. 52 © 2018 MapR Technologies, Inc Streaming Dataset Read from Topic partitions, Transformed, Written to MapR Database partitions
  • 53. 53 © 2018 MapR Technologies, Inc Store in MapR Database
  • 55. 55 © 2018 MapR Technologies, Inc •  Spark SQL queries and updates to MapR-DB •  With projection and filter pushdown, custom partitioning, and data locality Spark SQL Querying MapR-DB JSON
  • 56. 56 © 2018 MapR Technologies, Inc val pdf: Dataset[Payment] = spark .loadFromMapRDB[Payment](tableName, schema) .as[Payment] Spark Distributed Datasets read from MapR-DB Partitions Worker Task Worker Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data Task Task Driver tasks tasks tasks
  • 57. 57 © 2018 MapR Technologies, Inc Data Frame Load data pdf.createOrReplaceTempView(”payment") pdf.select("_id", "payer", "amount”).show Load the data into a Dataframe Data is automatically partitioned and sorted by _id row key!
  • 58. 58 © 2018 MapR Technologies, Inc •  Which physician specialties, or physician states, are getting the most payments ? •  Count , total amount $, average amount $, standard deviation •  Which companies, or company states, are making the most payments ? •  Count , total amount $, average amount $, standard deviation •  What are the Top Nature of Payments by count, amount ? •  What are the payments over $2,000,000 ? Now We Can Ask Questions Like: Physician ID Date Payer Payer State Physician Specialty Physician State Amount Payment Nature 1261770 01/11/2016 Southern Anesthesia & Surgical, Inc CO Oral and Maxillofacial Surgery CA 117.5 Food and Beverage
  • 59. 59 © 2018 MapR Technologies, Inc Language Integrated Queries L I Query Description agg(expr, exprs) Aggregates on entire DataFrame distinct Returns new DataFrame with unique rows except(other) Returns new DataFrame with rows from this DataFrame not in other DataFrame filter(expr); where(condition) Filter based on the SQL expression or condition groupBy(cols: Columns) Groups DataFrame using specified columns join (DataFrame, joinExpr) Joins with another DataFrame using given join expression sort(sortcol) Returns new DataFrame sorted by specified column select(col) Selects set of columns
  • 60. 60 © 2018 MapR Technologies, Inc MapR Data Platform
  • 61. 61 © 2018 MapR Technologies, Inc Top 5 Nature of Payment by count val res = pdf.groupBy(”Nature_of_Payment") .count() .orderBy(desc(count)) .show(5)
  • 62. 62 © 2018 MapR Technologies, Inc Top 5 Nature of Payment by amount of payment %sql select Nature_of_payment, sum(amount) as total from payments group by Nature_of_payment order by total desc limit 5
  • 63. 63 © 2018 MapR Technologies, Inc Top 5 Physician Specialties by total Amount %sql select physician_specialty, sum(amount) as total from payments where physician_specialty IS NOT NULL group by physician_specialty order by total desc limit 5
  • 64. 64 © 2018 MapR Technologies, Inc Average Payment by Specialty
  • 65. 65 © 2018 MapR Technologies, Inc What are the Nature of Payments with payments > $1000 ? pdf.filter($"amount" > 1000) .groupBy("Nature_of_payment") .count().orderBy(desc("count")).show()
  • 66. 66 © 2018 MapR Technologies, Inc Top Payers by Total Amount with count %sql select payer, payer_state, count(*) as cnt, sum(amount) as total from payments group by payer, payer_state order by total desc limit 10
  • 67. 67 © 2018 MapR Technologies, Inc Scenario: InPatient Payment Data Provider ID Provider State DRG Total Discharges Average Charges Average Total payments Avg Medicare Payments 1261770 CA Heart Transplant 13 1,172,866 251,876 244,457
  • 68. 68 © 2018 MapR Technologies, Inc MapR Database shell maprdb mapr:> find /user/mapr/ippstable --limit 2 {"_id":"001_2014_010033", "avg_covered_charges":1172866.385, "avg_medicare_payments": 244457.9231, "avg_total_payments":251876.3077, "drg_code":"001", "drg_definition":"HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC", "provider_address":"619 SOUTH 19TH STREET", "provider_city":"BIRMINGHAM", "provider_id":"010033", "provider_name":"UNIVERSITY OF ALABAMA HOSPITAL", "provider_region":"AL - Birmingham", "provider_state":"AL", "provider_zip":"35233", "total_discharges":13, "year":"2014"} {"_id":"001_2014_030103", "avg_covered_charges":437531.3, "avg_medicare_payments": 133509.55, "avg_total_payments":240422.8, "drg_code":"001", "drg_definition":"HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC", "provider_address":"5777 EAST MAYO BOULEVARD", "provider_city":"PHOENIX", "provider_id":"030103", "provider_name":"MAYO CLINIC HOSPITAL", "provider_region":"AZ - Phoenix", "provider_state":"AZ", "provider_zip":"85054", "total_discharges":20, "year":"2014”}
  • 69. 69 © 2018 MapR Technologies, Inc •  What are the top charges, and payments by Hospital? •  by Hospital State? •  What are the top charges, and payments by diagnosis? •  Most common diagnosis ? •  What are the statistics for payments for most common diagnosis (Sepsis ) •  What are the most expensive states for most common diagnosis sepsis Now We Can Ask Questions Like: Provider ID Provider State DRG Total Discharges Average Charges Average Total payments Avg Medicare Payments 1261770 CA Heart Transplant 13 1,172,866 251,876 244,457
  • 70. 70 © 2018 MapR Technologies, Inc InPatient Prospective Payment System payments
  • 71. 71 © 2018 MapR Technologies, Inc To Download the code : Streaming data pipeline blogs •  https://mapr.com/blog/etl-pipeline-healthcare-dataset-with-spark-json-mapr-db/ •  https://mapr.com/blog/streaming-data-pipeline-transform-store-explore-healthcare- dataset-mapr-db/
  • 72. 72 © 2018 MapR Technologies, Inc http://mapr.com/ebooks/ New Spark Ebook
  • 73. 73 © 2018 MapR Technologies, Inc Read about Streaming Architectures mapr.com/ebooks
  • 74. 74 © 2018 MapR Technologies, Inc Read about Machine Learning Logistics mapr.com/ebooks
  • 75. 75 © 2018 MapR Technologies, Inc MapR Free ODT http://learn.mapr.com/ To Learn More: New Spark 2.0 training
  • 76. 76 © 2018 MapR Technologies, Inc https://mapr.com/blog/ MapR Blog
  • 77. 77 © 2018 MapR Technologies, Inc To Learn More: •  https://mapr.com/blog/ml-iot-connected-medical-devices/
  • 78. 78 © 2018 MapR Technologies, Inc To Learn More: •  https://mapr.com/blog/how-stream-first-architecture-patterns-are-revolutionizing- healthcare-platforms/
  • 79. 79 © 2018 MapR Technologies, Inc Applying Machine Learning to Live Patient Data •  https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to-live- patient-data