DATASERVICES
PROCESSING (BIG) DATA THE
MICROSERVICE WAY
Dr. Josef Adersberger (@adersberger), QAware GmbH
http://www.datasciencecentral.com
ENTERPRISE
http://www.cardinalfang.net/misc/companies_list.html
?
PROCESSING
BIG DATA FAST DATA
SMART DATA
All things distributed:
‣ distributed processing
‣ distributed databases
Data to information:
‣ machine (deep) learning
‣ advanced statistics
‣ natural language processing
‣ semantic web
Low latency and high throughput:
‣ stream processing
‣ messaging
‣ event-driven
DATA PROCESSING
SYSTEM INTEGRATION
APIS
UIS
data -> information
information -> user
information -> systems
information -> blended information
SOLUTIONS
The {big,SMART,FAST} data Swiss Army Knives
node
Distributed Data
Distributed Processing
Driver data flow
icon credits to Nimal Raj (database), Arthur Shlain (console) and alvarobueno (tasklist)
DATA SERVICES
{BIG, FAST, SMART} DATA
MICROSERVICE
BASIC IDEA: DATA PROCESSING WITHIN A GRAPH OF MICROSERVICES
Microservice (aka Dataservice)
Message Queue
Sources Processors Sinks
DIRECTED GRAPH OF MICROSERVICES EXCHANGING DATA VIA MESSAGING
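The source / processor / sink idea can be illustrated in plain Java. This is only a sketch under loud assumptions: the in-memory queues stand in for a real message queue, everything runs in one process, and the class and method names (`DataserviceGraphSketch`, `run`) are hypothetical and not part of any platform named in this deck.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Sketch only: in-memory queues stand in for a real message queue.
class DataserviceGraphSketch {

    // Runs source -> processor -> sink; every hop goes through a queue.
    public static List<String> run(List<String> sourceData, Function<String, String> processor) {
        BlockingQueue<String> sourceToProcessor = new LinkedBlockingQueue<>();
        BlockingQueue<String> processorToSink = new LinkedBlockingQueue<>();

        // Source: emits its messages onto the outbound queue.
        sourceToProcessor.addAll(sourceData);

        // Processor: consumes message by message and emits the result.
        String message;
        while ((message = sourceToProcessor.poll()) != null) {
            processorToSink.add(processor.apply(message));
        }

        // Sink: drains its inbound queue into a final store (here: a list).
        List<String> stored = new ArrayList<>();
        processorToSink.drainTo(stored);
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hello", "world"), String::toUpperCase));
    }
}
```

In a real deployment each of the three roles would be its own microservice, and the queues would live in the broker, so the roles can be scaled and replaced independently.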
BASIC IDEA: COHERENT PLATFORM FOR MICRO- AND DATASERVICES
CLUSTER OPERATING SYSTEM
IAAS ON PREM LOCAL
MICROSERVICES
DATASERVICES
MICROSERVICES PLATFORM
DATASERVICES PLATFORM
OPEN SOURCE DATASERVICE PLATFORMS
‣ Open source project based on the Spring stack
‣ Microservices: Spring Boot
‣ Messaging: Kafka 0.9, Kafka 0.10, RabbitMQ
‣ Standardized API with several open source implementations
‣ Microservices: JavaEE micro container
‣ Messaging: JMS
‣ Open source by Lightbend (partly commercialised & proprietary)
‣ Microservices: Lagom, Play
‣ Messaging: Akka
ARCHITECT’S VIEW
- ON SPRING CLOUD DATA FLOW
DATASERVICES
BASIC IDEA: DATA PROCESSING WITHIN A GRAPH OF MICROSERVICES
Sources Processors Sinks
DIRECTED GRAPH OF SPRING BOOT MICROSERVICES EXCHANGING DATA VIA MESSAGING
Stream
App
Message Broker
Channel
THE BIG PICTURE
SPRING CLOUD DATA FLOW SERVER (SCDF SERVER)
TARGET RUNTIME
SPI
API
LOCAL
SCDF Shell
SCDF Admin UI
Flo Stream Designer
THE BIG PICTURE
SPRING CLOUD DATA FLOW SERVER (SCDF SERVER)
TARGET RUNTIME
MESSAGE BROKER
APP
SPRING BOOT
SPRING FRAMEWORK
SPRING CLOUD STREAM
SPRING INTEGRATION
BINDER
APP
APP
APP
CHANNELS (input/output)
THE VEINS: SCALABLE DATA LOGISTICS WITH MESSAGING
Sources Processors Sinks
STREAM PARTITIONING: TO BE ABLE TO SCALE MICROSERVICES
BACK PRESSURE HANDLING: TO BE ABLE TO COPE WITH PEAKS
STREAM PARTITIONING
input (provider)
output instances (consumer group)
message partitioning
PARTITION KEY -> PARTITION SELECTOR -> PARTITION INDEX
f(message) -> field, f(field) -> index, f(index) -> pindex
pindex = index % output instances
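The three-step mapping (message -> field -> index -> pindex) can be sketched in plain Java. The names below (`PartitionSketch`, `partitionIndex`, `keyExtractor`) are hypothetical, and using `hashCode()` as the selector is an assumption made for illustration; it is not claimed to be the platform's actual selector. `Math.floorMod` is used instead of `%` so that negative hash codes still yield a valid partition.

```java
import java.util.function.Function;

// Sketch only: hashCode() as partition selector is an illustrative assumption.
class PartitionSketch {

    public static <M> int partitionIndex(M message,
                                         Function<M, Object> keyExtractor,
                                         int outputInstances) {
        // f(message) -> field: extract the partition key from the message.
        Object field = keyExtractor.apply(message);
        // f(field) -> index: the selector turns the key into an integer.
        int index = field.hashCode();
        // f(index) -> pindex: pindex = index % output instances (never negative).
        return Math.floorMod(index, outputInstances);
    }

    public static void main(String[] args) {
        // Messages with the same key always land on the same output instance.
        System.out.println(partitionIndex("en", s -> s, 4));
        System.out.println(partitionIndex("en", s -> s, 4));
    }
}
```

Because the result depends only on the key and the number of output instances, all messages sharing a key are handled by the same consumer instance, which is what makes stateful processing per key possible when scaling out.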
BACK PRESSURE HANDLING
1. Signals if (message) pressure is too high
2. Regulates inbound (message) flow
3. (Data) retention lake
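The three roles above can be illustrated with a bounded in-memory buffer. This is a sketch under assumptions: in a real setup the retention role is played by the message broker, and the names `BackPressureSketch`, `pressureTooHigh`, `accept` and `highWaterMark` are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch only: a bounded queue stands in for the broker's retention.
class BackPressureSketch {
    private final BlockingQueue<String> buffer; // (3) retention of pending messages
    private final int highWaterMark;

    public BackPressureSketch(int capacity, int highWaterMark) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
        this.highWaterMark = highWaterMark;
    }

    // (1) Signals if message pressure is too high.
    public boolean pressureTooHigh() {
        return buffer.size() >= highWaterMark;
    }

    // (2) Regulates inbound flow: a full buffer rejects the message
    // (returns false), so the producer has to slow down or retry.
    public boolean accept(String message) {
        return buffer.offer(message);
    }

    // Consumer side: takes the oldest pending message, or null if none.
    public String next() {
        return buffer.poll();
    }

    public static void main(String[] args) {
        BackPressureSketch bp = new BackPressureSketch(2, 1);
        bp.accept("a");
        bp.accept("b");
        System.out.println("accepted third message: " + bp.accept("c"));
    }
}
```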
DISCLAIMER: THERE IS ALSO A TASK EXECUTION MODEL (WE WILL IGNORE)
‣ short-lived
‣ finite data set
‣ programming model = Spring Cloud Task
‣ starters available for JDBC and Spark as data source/sink
CONNECTED CAR PLATFORM
EDGE SERVICE
MQTT Broker (apigee Link)
MQTT Source
Data Cleansing
Realtime traffic analytics
KPI ANALYTICS
Spark
DASHBOARD
react-vis
Presto
Masterdata Blending
Camel
Kafka
Kafka
ESB
gRPC
DEVELOPER’S VIEW
- ON SPRING CLOUD DATA FLOW
DATASERVICES
ASSEMBLING A STREAM
▸ App starters: a set of pre-built apps (aka dataservices)
▸ Composition of apps with Linux-style pipe syntax:
http | magichappenshere | log
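To make the pipe syntax concrete, the following sketch splits a stream definition into the ordered list of app names it chains together. It is a deliberately naive illustration and not the parser used by Spring Cloud Data Flow: it assumes that `|` only ever appears as the pipe separator, and the names `StreamDefinitionSketch` and `appNames` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: naive splitting, not the real stream definition parser.
class StreamDefinitionSketch {

    // Returns the app names of a definition such as "http | log", in pipe order.
    public static List<String> appNames(String definition) {
        List<String> names = new ArrayList<>();
        for (String segment : definition.split("\\|")) {
            String trimmed = segment.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore empty segments, e.g. a trailing pipe
            }
            // The first token is the app name; the rest are --key=value options.
            names.add(trimmed.split("\\s+")[0]);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(appNames("http | magichappenshere | log"));
    }
}
```

The first name is the source, the last one the sink, and everything in between is a processor; each adjacent pair is connected by a channel on the message broker.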
Starter app
Custom app
https://www.pinterest.de/pin/272116002461148164
MORE PIPES
with parameters:
twitterstream --consumerKey=<CONSUMER_KEY> --consumerSecret=<CONSUMER_SECRET> --accessToken=<ACCESS_TOKEN> --accessTokenSecret=<ACCESS_TOKEN_SECRET> | log
with explicit input channel & analytics:
:tweets.twitterstream > field-value-counter --fieldName=lang --name=language
with SpEL expression and explicit output type:
:tweets.twitterstream > filter --expression=#jsonPath(payload,'$.lang')=='en' --outputType=application/json
OUR SAMPLE APPLICATION: WORLD MOOD
https://github.com/adersberger/spring-cloud-dataflow-samples
twitterstream
filter (lang=en)
log
twitter ingester (test data)
tweet extractor (text)
sentiment analysis (StanfordNLP)
field-value-counter
Starter app
Custom app
DEVELOPING CUSTOM APPS: THE VERY BEGINNING
https://start.spring.io
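PROGRAMMING MODEL: SOURCE
import java.util.Iterator;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.messaging.Source;
import org.springframework.context.annotation.Bean;
import org.springframework.integration.annotation.InboundChannelAdapter;
import org.springframework.integration.annotation.Poller;
import org.springframework.integration.core.MessageSource;
import org.springframework.messaging.support.GenericMessage;

@SpringBootApplication
@EnableBinding(Source.class)
public class TwitterIngester {
    private Iterator<String> lines;

    // Polled every 200 ms; each poll emits one tweet on the default output channel.
    @Bean
    @InboundChannelAdapter(value = Source.OUTPUT,
            poller = @Poller(fixedDelay = "200", maxMessagesPerPoll = "1"))
    public MessageSource<String> twitterMessageSource() {
        return () -> new GenericMessage<>(emitTweet());
    }

    // Replays the test data endlessly: starts over once the iterator is exhausted.
    private String emitTweet() {
        if (lines == null || !lines.hasNext()) lines = readTweets();
        return lines.next();
    }

    private Iterator<String> readTweets() {
        //…
    }
}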
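PROGRAMMING MODEL: SOURCE TESTING
import static org.junit.Assert.assertTrue;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.cloud.stream.messaging.Source;
import org.springframework.cloud.stream.test.binder.MessageCollector;
import org.springframework.messaging.Message;
import org.springframework.test.context.junit4.SpringRunner;

@RunWith(SpringRunner.class)
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
public class TwitterIngesterTest {
    @Autowired
    private Source source;
    @Autowired
    private MessageCollector collector;

    @Test
    public void tweetIngestionTest() throws InterruptedException {
        for (int i = 0; i < 100; i++) {
            // take() blocks until the source has emitted the next message
            Message<String> message = (Message<String>)
                    collector.forChannel(source.output()).take();
            // assertTrue instead of the assert keyword, which is a no-op unless the JVM runs with -ea
            assertTrue(message.getPayload().length() > 0);
        }
    }
}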
PROGRAMMING MODEL: PROCESSOR (WITH STANFORD NLP)
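import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.annotation.StreamListener;
import org.springframework.cloud.stream.messaging.Processor;
import org.springframework.messaging.handler.annotation.SendTo;
import org.springframework.tuple.Tuple;
import org.springframework.tuple.TupleBuilder;

@SpringBootApplication
@EnableBinding(Processor.class)
public class TweetSentimentProcessor {
    @Autowired
    StanfordNLP nlp;

    @StreamListener(Processor.INPUT) // input channel with default name
    @SendTo(Processor.OUTPUT)        // output channel with default name
    public Tuple analyzeSentiment(String tweet) {
        return TupleBuilder.tuple().of("mood", findSentiment(tweet));
    }

    // Returns the predicted sentiment class of the longest sentence in the tweet.
    public int findSentiment(String tweet) {
        int mainSentiment = 0;
        if (tweet != null && tweet.length() > 0) {
            int longest = 0;
            Annotation annotation = nlp.process(tweet);
            for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
                Tree tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class);
                int sentiment = RNNCoreAnnotations.getPredictedClass(tree);
                String partText = sentence.toString();
                if (partText.length() > longest) {
                    mainSentiment = sentiment;
                    longest = partText.length();
                }
            }
        }
        return mainSentiment;
    }
}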
PROGRAMMING MODEL: PROCESSOR TESTING
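import static org.hamcrest.CoreMatchers.equalTo;
import static org.junit.Assert.assertThat;
import static org.springframework.cloud.stream.test.matcher.MessageQueueMatcher.receivesPayloadThat;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.cloud.stream.messaging.Processor;
import org.springframework.cloud.stream.test.binder.MessageCollector;
import org.springframework.messaging.support.GenericMessage;
import org.springframework.test.context.junit4.SpringRunner;
import org.springframework.tuple.TupleBuilder;

@RunWith(SpringRunner.class)
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
public class TweetSentimentProcessorTest {
    @Autowired
    private Processor processor;
    @Autowired
    private MessageCollector collector;
    @Autowired
    private TweetSentimentProcessor sentimentProcessor;

    @Test
    public void testAnalysis() {
        checkFor("I hate everybody around me!");
        checkFor("The world is lovely");
        checkFor("I f***ing hate everybody around me. They're from hell");
        checkFor("Sunny day today!");
    }

    // Sends the text to the input channel and expects the matching mood tuple on the output channel.
    private void checkFor(String msg) {
        processor.input().send(new GenericMessage<>(msg));
        assertThat(
                collector.forChannel(processor.output()),
                receivesPayloadThat(
                        equalTo(TupleBuilder.tuple().of("mood", sentimentProcessor.findSentiment(msg)))));
    }
}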
DEVELOPING THE STREAM DEFINITIONS WITH FLO
http://projects.spring.io/spring-flo/
RUNNING IT LOCALLY
RUNNING THE DATASERVICES
$ redis-server &
$ zookeeper-server-start.sh ./config/zookeeper.properties &
$ kafka-server-start.sh ./config/server.properties &
$ java -jar spring-cloud-dataflow-server-local-1.2.0.RELEASE.jar &
$ java -jar ../spring-cloud-dataflow-shell-1.2.0.RELEASE.jar
dataflow:> app import --uri [1]
dataflow:> app register --name tweetsentimentalyzer --type processor --uri file:///libs/worldmoodindex-0.0.2-SNAPSHOT.jar
dataflow:> stream create tweets-ingestion --definition "twitterstream --consumerKey=A --consumerSecret=B --accessToken=C --accessTokenSecret=D | filter --expression=#jsonPath(payload,'$.lang')=='en' | log" --deploy
dataflow:> stream create tweets-analyzer --definition ":tweets-ingestion.filter > tweetsentimentalyzer | field-value-counter --fieldName=mood --name=Mood"
dataflow:> stream deploy tweets-analyzer --properties "deployer.tweetsentimentalyzer.memory=1024m,deployer.tweetsentimentalyzer.count=8,app.transform.producer.partitionKeyExpression=payload.id"
[1] http://repo.spring.io/libs-release/org/springframework/cloud/stream/app/spring-cloud-stream-app-descriptor/Bacon.RELEASE/spring-cloud-stream-app-descriptor-Bacon.RELEASE.kafka-10-apps-maven-repo-url.properties
RUNNING IT IN THE CLOUD
RUNNING THE DATASERVICES
$ git clone https://github.com/spring-cloud/spring-cloud-dataflow-server-kubernetes
$ kubectl create -f src/etc/kubernetes/kafka-zk-controller.yml
$ kubectl create -f src/etc/kubernetes/kafka-zk-service.yml
$ kubectl create -f src/etc/kubernetes/kafka-controller.yml
$ kubectl create -f src/etc/kubernetes/mysql-controller.yml
$ kubectl create -f src/etc/kubernetes/mysql-service.yml
$ kubectl create -f src/etc/kubernetes/kafka-service.yml
$ kubectl create -f src/etc/kubernetes/redis-controller.yml
$ kubectl create -f src/etc/kubernetes/redis-service.yml
$ kubectl create -f src/etc/kubernetes/scdf-config-kafka.yml
$ kubectl create -f src/etc/kubernetes/scdf-secrets.yml
$ kubectl create -f src/etc/kubernetes/scdf-service.yml
$ kubectl create -f src/etc/kubernetes/scdf-controller.yml
$ kubectl get svc  # look up the external IP of the "scdf" service (<IP>)
$ java -jar ../spring-cloud-dataflow-shell-1.2.0.RELEASE.jar
dataflow:> dataflow config server --uri http://<IP>:9393
dataflow:> app import --uri [2]
dataflow:> app register --type processor --name tweetsentimentalyzer --uri docker:qaware/tweetsentimentalyzer-processor:latest
dataflow:> …
[2] http://repo.spring.io/libs-release/org/springframework/cloud/stream/app/spring-cloud-stream-app-descriptor/Bacon.RELEASE/spring-cloud-stream-app-descriptor-Bacon.RELEASE.stream-apps-kafka-09-docker
LESSONS LEARNED
PRO
specialized programming model -> efficient
specialized execution environment -> efficient
support for all types of data (big, fast, smart)
CON
disjoint programming model (data processing <-> services)
maybe a disjoint execution environment (data stack <-> service stack)
BEST USED
further on: as default for {big,fast,smart} data processing
PRO
coherent execution environment (runs on microservice stack)
coherent programming model with emphasis on separation of concerns
basically supports all types of data (big, fast, smart)
CON
has limitations on throughput (big & fast data) due to less optimization (like data affinity, query optimizer, …) and message-wise processing
technology immature in certain parts (e.g. diagnosability)
BEST USED FOR
hybrid applications of data processing, system integration, API, UI
moderate throughput data applications with existing dev team
Message by message processing
TWITTER.COM/QAWARE - SLIDESHARE.NET/QAWARE
Thank you!
Questions?
josef.adersberger@qaware.de
@adersberger
https://github.com/adersberger/spring-cloud-dataflow-samples
BONUS SLIDES
MORE…
▸ Reactive programming
▸ Diagnosability
public Flux<String> transform(@Input("input") Flux<String> input) {
    return input.map(s -> s.toUpperCase());
}
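PROGRAMMING MODEL: SINK (WITH KOTLIN)
import org.jetbrains.exposed.sql.Database
import org.jetbrains.exposed.sql.SchemaUtils
import org.jetbrains.exposed.sql.Table
import org.jetbrains.exposed.sql.insert
import org.jetbrains.exposed.sql.transactions.transaction
import org.springframework.beans.factory.annotation.Autowired
import org.springframework.boot.context.properties.EnableConfigurationProperties
import org.springframework.cloud.stream.annotation.EnableBinding
import org.springframework.cloud.stream.annotation.StreamListener
import org.springframework.cloud.stream.messaging.Sink

@EnableBinding(Sink::class)
@EnableConfigurationProperties(PostgresSinkProperties::class)
class PostgresSink {
    @Autowired
    lateinit var props: PostgresSinkProperties

    // Called for every message on the default input channel; stores it in PostgreSQL.
    @StreamListener(Sink.INPUT)
    fun processTweet(message: String) {
        Database.connect(props.url, user = props.user, password = props.password,
                driver = "org.postgresql.Driver")
        transaction {
            SchemaUtils.create(Messages) // creates the table if it does not exist yet
            Messages.insert {
                it[Messages.message] = message
            }
        }
    }
}

// Table definition for the stored messages
object Messages : Table() {
    val id = integer("id").autoIncrement().primaryKey()
    val message = text("message")
}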
MICRO ANALYTICS SERVICES
Microservice
Dashboard
Microservice …
BLUEPRINT ARCHITECTURE
ARCHITECT’S VIEW
THE SECRET OF BIG DATA PERFORMANCE
Rule 1: Be as close to the data as possible! (CPU cache > memory > local disk > network)
Rule 2: Reduce data volume as early as possible! (as long as you don’t sacrifice parallelization)
Rule 3: Parallelize as much as possible!
Rule 4: Premature diagnosability and optimization
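Rules 2 and 3 can be illustrated with plain Java streams. This is a sketch under assumptions: tweets are modelled as `{language, text}` string pairs, `expensiveScore` is a hypothetical stand-in for a costly step such as sentiment analysis, and a parallel stream within one JVM only hints at what a distributed engine does across nodes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: one JVM, hypothetical data model and scoring function.
class EarlyReductionSketch {

    // Hypothetical stand-in for an expensive per-message computation.
    static int expensiveScore(String text) {
        return text.length();
    }

    public static List<Integer> scoreEnglish(List<String[]> tweets) {
        return tweets.parallelStream()                // Rule 3: parallelize
                .filter(t -> "en".equals(t[0]))       // Rule 2: drop data before the costly step
                .map(t -> expensiveScore(t[1]))
                .collect(Collectors.toList());        // encounter order is preserved
    }

    public static void main(String[] args) {
        List<String[]> tweets = Arrays.asList(
                new String[] {"en", "hello"},
                new String[] {"de", "hallo welt"},
                new String[] {"en", "hi"});
        System.out.println(scoreEnglish(tweets));
    }
}
```

Filtering before mapping means the expensive function never sees the discarded messages, and because the filter is stateless it does not limit how far the work can be parallelized.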
THE BIG PICTURE
http://cloud.spring.io/spring-cloud-dataflow
BASIC IDEA: BI-MODAL SOURCES AND SINKS
Sources Processors Sinks
READ FROM / WRITE TO: FILE, DATABASE, URL, …
INGEST FROM / DIGEST TO: TWITTER, MQ, LOG, …
