Session 8 - Creating Data Processing Services | Train the Trainers Program

End-to-end AI Solution With PySpark & Real-time Data Processing With Apache NiFi
Rihab Feki, Machine Learning Engineer and Evangelist
Sherifa Fayed, Technical Expert and Evangelist
FIWARE Foundation

Learning goals
● Managing real-time data with the Context Broker
● Data transformation (JSON-LD to CSV) and persistence with Apache NiFi
● Setting up a Google Cloud environment
○ Creating a Dataproc cluster and connecting it to Jupyter Notebook
○ Using Google Cloud Storage (GCS)
● Modeling an ML solution based on PySpark for multi-class classification
● Deploying the ML model with Flask and getting predictions in real time

End to End AI service architecture powered by FIWARE

What is Apache NiFi?
● System to process and distribute data
● Supports powerful and scalable directed graphs of data routing and transformation
● Web-based user interface
● Tracks data flow from beginning to end

Connecting NiFi to the Context Broker
[Diagram: cURL or Postman, NGSI-LD Context Broker (1026:1026), NiFi or Draco (5050:5050), MongoDB (27017:27017)]

Entity: Steel plate geometric measurements
Link to dataset


Data processing and persistence with NiFi

Overview about NiFi workflow
● ListenHTTP: Configured as the source that receives notifications from the Context Broker
● GetFile: Reads data in JSON-LD format
● JoltTransformJSON: Transforms the nested JSON into a simple attribute-value JSON file, which is used to form the CSV file
● ConvertRecord: Converts each JSON file to a CSV file
● MergeContent: Merges the resulting CSV record files into an aggregated CSV dataset (a minimum number of entries can be set to trigger the merge, as well as a maximum number of flow files)
● PutGCSObject: Saves the resulting CSV in a Google Cloud Storage bucket
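
The transformation performed by JoltTransformJSON, ConvertRecord and MergeContent can be pictured in plain Python. This is only a sketch of the idea, not the NiFi configuration used in the demo. It assumes each NGSI-LD Property carries its payload under a "value" key; the entity ids, the "SteelPlate" type, the X_Maximum attribute and the function names are illustrative.

```python
import csv
import io


def flatten_entity(entity):
    """Reduce a nested NGSI-LD entity to a flat {attribute: value} mapping."""
    flat = {"id": entity["id"], "type": entity["type"]}
    for name, attr in entity.items():
        # Properties are nested objects holding their payload under "value";
        # plain members such as "id", "type" or "@context" are skipped here.
        if isinstance(attr, dict) and "value" in attr:
            flat[name] = attr["value"]
    return flat


def entities_to_csv(entities):
    """Flatten every entity and merge the rows into one CSV document."""
    rows = [flatten_entity(e) for e in entities]
    # Column order: union of all keys, in first-seen order.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Two illustrative entities (hypothetical ids and values).
sample = [
    {
        "id": "urn:ngsi-ld:SteelPlate:001",
        "type": "SteelPlate",
        "X_Minimum": {"type": "Property", "value": 42},
        "X_Maximum": {"type": "Property", "value": 50},
    },
    {
        "id": "urn:ngsi-ld:SteelPlate:002",
        "type": "SteelPlate",
        "X_Minimum": {"type": "Property", "value": 645},
        "X_Maximum": {"type": "Property", "value": 651},
    },
]

csv_text = entities_to_csv(sample)
print(csv_text)
```

In the NiFi flow the header is likewise written once for the merged file rather than once per record.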

Demo: Data transformation and persistence

What is PySpark?
PySpark is the Python interface for Apache Spark.
It is used for performing exploratory data analysis at scale, building machine learning pipelines, and creating ETL jobs for a data platform.

What is Cloud Dataproc?
Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for:
● Big data processing
● Batch processing, querying, streaming
● Machine Learning

The main benefits of Dataproc
● It’s a managed service: No need for a system administrator to set it up.
● It’s fast: Cluster creation takes about 90 seconds.
● It’s cheaper than building your own cluster: You can spin up a Dataproc cluster when you need to run a job and shut it down afterward, so you only pay while jobs are running.
● It’s integrated with other Google Cloud services, including Cloud Storage, BigQuery, and Cloud Bigtable, so it’s easy to get data into and out of it.

What makes Dataproc special?
The typical mode of operation of Hadoop/Spark, on premises or in the cloud, requires you to deploy a cluster first and then fill it up with jobs.

What makes Dataproc special?
Rather than submitting the job to an already-deployed cluster, you submit the job to Dataproc, which creates a cluster on your behalf on demand.
➢ A cluster is now a means to an end for job execution.

Let’s see how Dataproc makes it easy and scalable...
Data scientists are big fans of Jupyter Notebooks.
However, getting an Apache Spark cluster set up with Jupyter Notebooks can be complicated.

Apache Spark and JupyterLab architecture on Google Cloud

How it works
1. Setting up the Google Cloud environment and creating a project
2. Creating a Google Cloud Storage bucket for your cluster
3. Creating a Dataproc cluster with Jupyter and Component Gateway
4. Accessing the JupyterLab web UI on Dataproc
5. Creating a Notebook and developing the AI algorithm with PySpark

Creating a Dataproc cluster using cloud shell
# Assumes CLUSTER_NAME, REGION and BUCKET_NAME are already set in the shell
gcloud beta dataproc clusters create ${CLUSTER_NAME} \
    --region=${REGION} \
    --image-version=1.4 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-4 \
    --bucket=${BUCKET_NAME} \
    --optional-components=ANACONDA,JUPYTER \
    --enable-component-gateway

Component gateway for additional cluster components

Steel plates faults prediction
● Features: 27 geometric measurements of the steel plates
● Fault types: 7
○ Pastry
○ Z_Scratch
○ K_Scatch
○ Stains
○ Dirtiness
○ Bumps
○ Other_Faults
Dataset format: CSV | Number of Samples: 1941
Link to dataset
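
Before training a multi-class classifier, the seven fault types have to be expressed as a single label per sample. The sketch below assumes, without confirmation from the slides, that the CSV stores each fault type as its own 0/1 indicator column; the helper name `fault_label` is illustrative.

```python
# The seven fault types listed above, in the order used for class indices.
FAULT_TYPES = [
    "Pastry",
    "Z_Scratch",
    "K_Scatch",
    "Stains",
    "Dirtiness",
    "Bumps",
    "Other_Faults",
]


def fault_label(row):
    """Return (class_index, class_name) for a row with one flagged fault.

    Assumption: every fault type is a separate column holding "0" or "1".
    """
    flagged = [i for i, name in enumerate(FAULT_TYPES) if int(row[name]) == 1]
    if len(flagged) != 1:
        raise ValueError(f"expected exactly one fault flag, found {len(flagged)}")
    index = flagged[0]
    return index, FAULT_TYPES[index]


# Illustrative row as csv.DictReader would yield it (values are strings).
row = {name: "0" for name in FAULT_TYPES}
row["Stains"] = "1"

label = fault_label(row)
print(label)
```

The resulting index can then serve as the label column of a classifier; if the dataset already ships a single label column, this step is unnecessary.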

Demo:
Cloud environment set up
Modeling the ML solution based on PySpark

ML model deployment with Flask architecture
[Diagram: cURL or Postman, Orion Context Broker (1026:1026), model prediction service (5000:5000), saved model (.parquet) produced by model training in a Jupyter Notebook; a 27017:27017 port mapping and a "www" client are also shown]

Useful links
● Source code and documentation
https://github.com/RihabFekii/PySpark-AI-service_Data-processing-NiFi
● Jupyter Notebook for Steel faults classification based on PySpark
https://github.com/RihabFekii/PySpark-AI-service_Data-processing-NiFi/blob/master/PySpark/PySpark_Steel_faults_Classification.ipynb
● Data processing and persistence with Apache NiFi documentation
https://github.com/RihabFekii/PySpark-AI-service_Data-processing-NiFi/tree/master/Nifi
● NGSI-LD Context Broker
○ Docker hub: https://hub.docker.com/r/fiware/orion-ld
○ Documentation: https://github.com/FIWARE/context.Orion-LD
● Google Cloud Console: https://console.cloud.google.com/
● Flask Apps with Docker: https://runnable.com/docker/python/docker-compose-with-flask-apps

Summary
● The Context Broker does not persist historical data; it only holds the current context.
● The Google Cloud Dataproc service gives data scientists an easy way to set up, control and secure data science environments, and makes it simple and fast to integrate them with other open source data tools.
● Once the Dataproc cluster is created, it is not possible to change the configuration or install new dependencies or libraries.
● Dataproc jobs are limited to some programming languages.
● Apache NiFi might not be the easiest tool for data processing, but it manages and automates data flows, and it is a good fit for large-scale or real-time data.
● Other cloud platforms could be used (AWS, Azure, Databricks, ...)

Thank you!
http://fiware.org
Follow @FIWARE on Twitter

Creating an entity in the Context Broker
● Unique id and type
● Attributes of the created entity
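
An entity with a unique id, a type and a set of attributes can be assembled as below. This is a minimal sketch rather than the payload shown on the slide: the id, the "SteelPlate" type, the X_Maximum attribute and the values are illustrative, and the core context URL is used as a stand-in for whatever context the demo uses.

```python
import json

# Assumed context: the NGSI-LD core context.
CORE_CONTEXT = "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld"


def build_entity(entity_id, entity_type, measurements, context=CORE_CONTEXT):
    """Build an NGSI-LD entity whose measurements become Property attributes."""
    entity = {"id": entity_id, "type": entity_type, "@context": context}
    for name, value in measurements.items():
        entity[name] = {"type": "Property", "value": value}
    return entity


entity = build_entity(
    "urn:ngsi-ld:SteelPlate:001",          # hypothetical unique id
    "SteelPlate",                          # hypothetical type
    {"X_Minimum": 42, "X_Maximum": 50},    # illustrative measurements
)

# JSON body that would be POSTed to <broker>/ngsi-ld/v1/entities
# with the header Content-Type: application/ld+json.
body = json.dumps(entity, indent=2)
print(body)
```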

Subscribing to changes and listening
● Posting the subscription to Orion
● Subscribing to all entities of a certain type
● Subscribing to relevant attributes
● Sending notifications to the port NiFi is listening on
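
A subscription covering these points could look like the following sketch. It follows the general NGSI-LD subscription structure, but the entity type, the watched attribute list, the host name "nifi" and the "/notify" path are assumptions; only port 5050 comes from the architecture slide.

```python
import json

# Assumed context: the NGSI-LD core context.
CORE_CONTEXT = "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld"


def build_subscription(entity_type, attributes, notify_uri, context=CORE_CONTEXT):
    """Build an NGSI-LD subscription for all entities of one type."""
    return {
        "type": "Subscription",
        # Subscribe to every entity of the given type.
        "entities": [{"type": entity_type}],
        # Only changes to these attributes trigger a notification.
        "watchedAttributes": attributes,
        "notification": {
            "attributes": attributes,
            "format": "normalized",
            # Endpoint where the listener (here NiFi) receives notifications.
            "endpoint": {"uri": notify_uri, "accept": "application/json"},
        },
        "@context": context,
    }


subscription = build_subscription(
    "SteelPlate",                      # hypothetical entity type
    ["X_Minimum", "X_Maximum"],        # illustrative attributes
    "http://nifi:5050/notify",         # hypothetical host and path
)

# JSON body that would be POSTed to <broker>/ngsi-ld/v1/subscriptions
# with the header Content-Type: application/ld+json.
print(json.dumps(subscription, indent=2))
```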

Inducing a change and receiving a notification
● Changing the value of X_Minimum
● Processor Out Count jumps to 1
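
Changing the value of X_Minimum amounts to a PATCH on the entity's attributes. The sketch below only builds the request and does not send it, so it runs without a broker. The broker URL, entity id and new value are illustrative; if the entity was created with a custom @context, the same context would also have to be supplied with the request.

```python
import json
import urllib.request


def build_attribute_update(broker_url, entity_id, attribute, value):
    """Prepare a PATCH request that updates one Property of an entity."""
    url = f"{broker_url}/ngsi-ld/v1/entities/{entity_id}/attrs"
    payload = {attribute: {"type": "Property", "value": value}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


request = build_attribute_update(
    "http://localhost:1026",           # broker port from the architecture slide
    "urn:ngsi-ld:SteelPlate:001",      # hypothetical entity id
    "X_Minimum",
    45,                                # illustrative new value
)

print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Once the update is accepted, the broker notifies the subscribed endpoint, which is what makes the ListenHTTP processor's Out count increase.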

Setting up the cloud environment

Creating a project in Google Cloud Platform
We can manage the project via the Cloud Shell

Creating a Google Cloud Storage bucket
➢ Store datasets
➢ Store Notebooks
➢ Store logs
➢ Store output files

Creating a Dataproc cluster using cloud shell
# Assumes CLUSTER_NAME, REGION and BUCKET_NAME are already set in the shell
gcloud beta dataproc clusters create ${CLUSTER_NAME} \
    --region=${REGION} \
    --image-version=1.4 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-4 \
    --bucket=${BUCKET_NAME} \
    --optional-components=ANACONDA,JUPYTER \
    --enable-component-gateway

Creating a Dataproc cluster using GUI

Component gateway for additional cluster components

Overview of the Dataproc cluster

Dataproc cluster web interfaces

Dataproc cluster: JupyterLab interface

Creating a Jupyter Notebook and provisioning data from Google Cloud Bucket
Link to Notebook

Submitting a PySpark job using Dataproc GUI

Submitting a PySpark job to Dataproc cluster

www.egm.io
Fluid Machine Learning lifecycle with FIWARE
Benoit Orihuela – i4Trust Training Webinar

A TYPICAL ML LIFECYCLE
• A Data Scientist
  • Get and clean up data
  • Prepare and train an ML model
• An IT person
  • Package and deploy the ML model
• An end user
  • Discover the available ML models (with respect to privacy)
  • Ask to use one or more of them (and optionally pay for it)
  • Get real-time data (predictions, outliers, ...) from an ML model
ML lifecycle with FIWARE - i4Trust - 12/05/2021 3

WHAT DO WE AIM AT?
Bridge the gap between data scientists and operations (MLOps)
Develop the Machine Learning as a Service (MLaaS) model
And also:
More and more use cases requiring ML / AI activities
FIWARE needs to offer a rich variety of tools

THE TRAINING AND PREPARATION PHASE

THE DISCOVERY AND REGISTRATION PHASE

THE PREDICTION PHASE

DEMONSTRATIONS
• Demonstration #1 – End-to-end demonstration of ML model development, deployment and use
  • Use of Jupyter notebook as interface
  • Applied to a simplistic water flow calculation
• Demonstration #2 – Events generation from video stream analysis
  • Real-time extraction of context information from a video stream

Thank You!
Benoit ORIHUELA
Lead Architect
Tel: +33 687427107
E-mail: benoit.orihuela@egm.io
www.egm.io

MLaaS for Image analysis
Anwar ALFATAYRI

REAL LIFE EXAMPLE: SOCIAL DISTANCING
Number of people: 14
Groups of 2 people: 1
Groups of >4 people: 0

TWO APPROACHES
Machine learning on the edge
[Diagram: Street, Image, 3 people detected, FIWARE Cloud]

TWO APPROACHES
Machine learning as a service
[Diagram: Street, Image, REST API, FIWARE Cloud, 3 people detected]
