i4Trust Website
i4Trust Community
End-to-end AI Solution With
PySpark & Real-time Data
Processing With Apache NiFi
Rihab Feki, Machine Learning Engineer and Evangelist
Sherifa Fayed, Technical Expert and Evangelist
FIWARE Foundation
Learning goals
● Managing real-time data with the Context Broker
● Data transformation (JSON-LD to CSV) and persistence with Apache NiFi
● Setting up a Google Cloud environment
○ Creating a Dataproc cluster and connecting it to Jupyter Notebook
○ Using the Google Cloud Storage (GCS) service
● Modeling an ML solution based on PySpark for multi-class classification
● Deploying the ML model with Flask and getting predictions in real time
End to End AI service architecture powered by FIWARE
What is Apache NiFi?
● System to process and distribute data
● Supports powerful and scalable directed graphs of data routing and transformation
● Web-based user interface
● Tracking data flow from beginning to end
Connecting NiFi to the Context Broker
[Diagram: cURL or Postman sends requests to the NGSI-LD Context Broker (port mapping 1026:1026), which is backed by MongoDB (27017:27017) and notifies NiFi (or Draco) (5050:5050).]
Entity: Steel plate geometric measurements
Link to dataset
End to End AI service architecture powered by FIWARE
Dataflow overview
Ingesting
Data processing and persistence with NiFi
The overall NiFi workflow
Overview of the NiFi workflow
● ListenHTTP: Configured as the source that receives notifications from the Context Broker
● GetFile: Reads data in JSON-LD format
● JoltTransformJSON: Transforms the nested JSON into a simple attribute-value JSON file, which is then used to form the CSV file
● ConvertRecord: Converts each JSON file to a CSV file
● MergeContent: Merges the resulting CSV record files into one aggregated CSV dataset (note: a minimum number of entries can be set to trigger the merge, and a maximum number of flow files can also be set)
● PutGCSObject: Saves the resulting CSV in a Google Cloud Storage bucket
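To make the JoltTransformJSON, ConvertRecord and MergeContent steps concrete, here is a minimal Python sketch of the same transformation done outside NiFi. It assumes normalized NGSI-LD entities whose attributes are objects carrying a `value` key; the entity ids and sample values are hypothetical, and NiFi itself performs these steps with a Jolt specification and record readers/writers rather than with this code.

```python
import csv
import io


def flatten_entity(entity):
    """Reduce a nested NGSI-LD entity to a flat attribute -> value dict."""
    flat = {}
    for key, attr in entity.items():
        if key in ("id", "type", "@context"):
            continue  # entity metadata, not a measurement
        if isinstance(attr, dict) and "value" in attr:
            flat[key] = attr["value"]
    return flat


def merge_to_csv(entities):
    """Flatten every entity and merge the rows into one CSV document."""
    rows = [flatten_entity(entity) for entity in entities]
    fieldnames = sorted({name for row in rows for name in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample entities; only two of the measurements are shown.
sample = [
    {
        "id": "urn:ngsi-ld:SteelPlate:001",
        "type": "SteelPlate",
        "X_Minimum": {"type": "Property", "value": 42},
        "X_Maximum": {"type": "Property", "value": 50},
    },
    {
        "id": "urn:ngsi-ld:SteelPlate:002",
        "type": "SteelPlate",
        "X_Minimum": {"type": "Property", "value": 45},
        "X_Maximum": {"type": "Property", "value": 60},
    },
]

csv_text = merge_to_csv(sample)
print(csv_text)
```

Each flattened entity becomes one CSV row, which mirrors how the merged file ends up as a dataset with one sample per notification.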
Demo: Data transformation and persistence
End to End AI service architecture powered by FIWARE
What is PySpark?
PySpark is the Python interface for Apache Spark.
It is used for performing exploratory data analysis at scale, building machine learning pipelines, and creating ETL pipelines for a data platform.
What is Cloud Dataproc?
Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for:
● Batch processing, querying, streaming
● Machine learning
● Big data processing
The main benefits of Dataproc
● It’s a managed service: No need for a system administrator to set it up.
● It’s fast: Cluster creation takes about 90 seconds.
● It’s cheaper than building your own cluster: You can spin up a Dataproc cluster when you need to run a job and shut it down afterward, so you only pay when jobs are running.
● It’s integrated with other Google Cloud services, including Cloud Storage, BigQuery, and Cloud Bigtable, so it’s easy to get data into and out of it.
What makes Dataproc special?
The typical mode of operation of Hadoop/Spark, on premises or in the cloud, requires you to deploy a cluster and then fill it up with jobs.
What makes Dataproc special?
Rather than submitting the job to an already-deployed cluster, you submit the job to Dataproc, which creates a cluster on your behalf on demand.
➢ A cluster is now a means to an end for job execution.
Let’s see how Dataproc makes it easy and scalable...
Data scientists are big fans of Jupyter Notebooks.
However, setting up an Apache Spark cluster with Jupyter Notebooks can be complicated.
Apache Spark and JupyterLab architecture on Google Cloud
How it works
1. Setting up the Google Cloud environment and creating a project
2. Creating a Google Cloud Storage bucket for your cluster
3. Creating a Dataproc cluster with Jupyter and Component Gateway
4. Accessing the JupyterLab web UI on Dataproc
5. Creating a Notebook and developing the AI algorithm with PySpark
Creating a Dataproc cluster using Cloud Shell
gcloud beta dataproc clusters create ${CLUSTER_NAME} \
  --region=${REGION} \
  --image-version=1.4 \
  --master-machine-type=n1-standard-4 \
  --worker-machine-type=n1-standard-4 \
  --bucket=${BUCKET_NAME} \
  --optional-components=ANACONDA,JUPYTER \
  --enable-component-gateway
Component gateway for additional cluster components
Steel plates faults prediction
● Features: 27 geometric measurements of the steel plates
● Fault types: 7
○ Pastry
○ Z_Scratch
○ K_Scatch
○ Stains
○ Dirtiness
○ Bumps
○ Other_Faults
Dataset format: CSV | Number of Samples: 1941
Link to dataset
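The multi-class setup above can be sketched roughly as follows. This assumes the CSV encodes the fault type as seven 0/1 indicator columns with exactly one set per sample, which is an assumption about the file layout rather than something the slide states. The choice of `RandomForestClassifier`, the 80/20 split, and the function names `fault_label` and `train` are illustrative and not necessarily what the linked notebook uses.

```python
FAULT_COLUMNS = ["Pastry", "Z_Scratch", "K_Scatch", "Stains",
                 "Dirtiness", "Bumps", "Other_Faults"]


def fault_label(row):
    """Collapse the seven 0/1 fault indicators of one sample into a class index."""
    flags = [int(row[name]) for name in FAULT_COLUMNS]
    if sum(flags) != 1:
        raise ValueError("expected exactly one fault flag per sample")
    return flags.index(1)


def train(spark, csv_path, model_path):
    """Train and save a multi-class model; `spark` is an existing SparkSession."""
    # Imported lazily so fault_label() can be used without PySpark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import functions as F

    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    # Same collapse as fault_label(), expressed as a column expression:
    # the single set flag contributes its index, all others contribute 0.
    label = sum(F.col(name) * index for index, name in enumerate(FAULT_COLUMNS))
    df = df.withColumn("label", label.cast("double"))

    feature_columns = [c for c in df.columns if c not in FAULT_COLUMNS + ["label"]]
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    classifier = RandomForestClassifier(labelCol="label", featuresCol="features")

    train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
    model = Pipeline(stages=[assembler, classifier]).fit(train_df)
    model.write().overwrite().save(model_path)  # Spark stores model data as Parquet
    return model, test_df


# Pure-Python check of the label encoding on a hypothetical sample.
example = dict.fromkeys(FAULT_COLUMNS, 0)
example["K_Scatch"] = 1
print(fault_label(example))
```

Turning the seven indicators into a single label column is what makes this a multi-class problem for one classifier instead of seven separate binary problems.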
Demo:
● Cloud environment setup
● Modeling the ML solution based on PySpark
ML model deployment with Flask architecture
[Architecture diagram. Components: Jupyter Notebook (model training), saved model (.parquet), model prediction service, Orion Context Broker, cURL or Postman, www. Port mappings shown: 1026:1026, 5000:5000, 27017:27017.]
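A prediction service of this kind could look roughly like the sketch below. The route name `/predict`, the helper `features_from_entity`, and the `predict` callable (a stand-in for whatever wraps the saved model) are all assumptions made for illustration; only Flask and the port 5000 come from the diagram.

```python
def features_from_entity(entity, feature_names):
    """Pull plain numeric values out of an NGSI-LD entity, in model input order."""
    values = []
    for name in feature_names:
        attr = entity[name]
        # Normalized NGSI-LD wraps the value in an object; keyValues format does not.
        value = attr["value"] if isinstance(attr, dict) else attr
        values.append(float(value))
    return values


def create_app(predict, feature_names):
    """Build a Flask app exposing POST /predict.

    `predict` is a hypothetical callable that takes a list of floats and
    returns a JSON-serializable prediction (for example a class name).
    """
    # Imported lazily so the helper above can be used without Flask installed.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict_route():
        entity = request.get_json(force=True)
        features = features_from_entity(entity, feature_names)
        return jsonify({"id": entity.get("id"), "prediction": predict(features)})

    return app


# Hypothetical entity mixing both attribute representations.
sample_entity = {
    "id": "urn:ngsi-ld:SteelPlate:001",
    "X_Minimum": {"type": "Property", "value": 42},
    "X_Maximum": 50,
}
print(features_from_entity(sample_entity, ["X_Minimum", "X_Maximum"]))
```

With Flask installed, `create_app(my_predict, names).run(host="0.0.0.0", port=5000)` would serve it on the port shown in the diagram, where `my_predict` and `names` are supplied by the caller.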
Useful links
● Source code and documentation
https://github.com/RihabFekii/PySpark-AI-service_Data-processing-NiFi
● Jupyter Notebook for Steel faults classification based on PySpark
https://github.com/RihabFekii/PySpark-AI-service_Data-processing-NiFi/blob/master/PySpark/PySpark_Steel_faults_Classification.ipynb
● Data processing and persistence with Apache NiFi documentation
https://github.com/RihabFekii/PySpark-AI-service_Data-processing-NiFi/tree/master/Nifi
● NGSI-LD Context Broker
○ Docker hub: https://hub.docker.com/r/fiware/orion-ld
○ Documentation: https://github.com/FIWARE/context.Orion-LD
● Google Cloud Console: https://console.cloud.google.com/
● Flask Apps with Docker: https://runnable.com/docker/python/docker-compose-with-flask-apps
Summary
● The Context Broker does not persist data history; it only holds the current context, so persistence is handled by a separate component (here, NiFi).
● The Google Cloud Dataproc service gives data scientists an easy way to set up, control, and secure data science environments, and makes it simple and fast to integrate them with other open source data tools.
● Once the Dataproc cluster is created, changing its configuration or installing new dependencies and libraries is not straightforward, so plan them at creation time.
● Dataproc jobs are limited to some programming languages.
● Apache NiFi might not be the easiest tool for data processing, but it manages and automates data flows and is a good fit for large-scale or real-time data.
● Other cloud platforms could be used (AWS, Azure, Databricks, ...).
Thank you!
http://fiware.org
Follow @FIWARE on Twitter
Q&A
Annex
Creating an entity in the Context Broker
● Unique id and type
● Attributes of the created entity
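The request shown on this slide can be approximated with the sketch below, which builds an NGSI-LD entity payload and the matching HTTP request using only the standard library. The entity id, the type name `SteelPlate`, the sample values, and the broker address `http://localhost:1026` are assumptions chosen for illustration; the slide itself used cURL or Postman.

```python
import json
import urllib.request

CORE_CONTEXT = "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld"


def build_entity(entity_id, entity_type, measurements):
    """Build an NGSI-LD entity: unique id and type, plus one Property per measurement."""
    entity = {"id": entity_id, "type": entity_type, "@context": [CORE_CONTEXT]}
    for name, value in measurements.items():
        entity[name] = {"type": "Property", "value": value}
    return entity


def create_entity_request(broker_url, entity):
    """Prepare (but do not send) the POST request that creates the entity."""
    return urllib.request.Request(
        url=f"{broker_url}/ngsi-ld/v1/entities",
        data=json.dumps(entity).encode("utf-8"),
        # application/ld+json because the @context travels in the body.
        headers={"Content-Type": "application/ld+json"},
        method="POST",
    )


entity = build_entity(
    "urn:ngsi-ld:SteelPlate:001",  # hypothetical id
    "SteelPlate",                  # hypothetical type name
    {"X_Minimum": 42, "X_Maximum": 50},
)
request = create_entity_request("http://localhost:1026", entity)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it to a running Context Broker.
```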
Subscribing to changes and listening
● Posting the subscription to Orion
● Subscribing to all entities of a certain type
● Sending notifications to the port NiFi is listening on
● Subscribing to relevant attributes
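The four callouts above map onto the fields of an NGSI-LD subscription, as the following sketch tries to show. The type name `SteelPlate`, the host name `nifi`, and the path `/contentListener` are assumptions; only the port 5050 is taken from the deck. The payload would be posted to the broker's `/ngsi-ld/v1/subscriptions` endpoint.

```python
import json

CORE_CONTEXT = "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld"


def build_subscription(entity_type, attributes, notify_uri):
    """Build an NGSI-LD subscription payload for one entity type."""
    return {
        "type": "Subscription",
        # Subscribe to all entities of a certain type.
        "entities": [{"type": entity_type}],
        # Only changes to these attributes trigger a notification.
        "watchedAttributes": list(attributes),
        "notification": {
            # Attributes included in each notification.
            "attributes": list(attributes),
            # Where the broker sends notifications (the port NiFi listens on).
            "endpoint": {"uri": notify_uri, "accept": "application/json"},
        },
        "@context": [CORE_CONTEXT],
    }


subscription = build_subscription(
    "SteelPlate",                        # hypothetical type name
    ["X_Minimum", "X_Maximum"],
    "http://nifi:5050/contentListener",  # hypothetical host and path
)
print(json.dumps(subscription, indent=2))
```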
Subscribing to changes and listening
Inducing a change and receiving a notification
● Changing the value of X_Minimum
● Processor Out Count jumps to 1
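Changing an attribute value, which is what triggers the notification, can be sketched as a PATCH request against the entity's attributes. The entity id, the new value, and the broker address are hypothetical; the request is only built here, not sent.

```python
import json
import urllib.request


def update_attribute_request(broker_url, entity_id, name, value):
    """Prepare (but do not send) a PATCH that changes one attribute of an entity."""
    body = {name: {"type": "Property", "value": value}}
    return urllib.request.Request(
        url=f"{broker_url}/ngsi-ld/v1/entities/{entity_id}/attrs",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


patch = update_attribute_request(
    "http://localhost:1026",       # assumed broker address
    "urn:ngsi-ld:SteelPlate:001",  # hypothetical entity id
    "X_Minimum",
    43,
)
print(patch.get_method(), patch.full_url)
# urllib.request.urlopen(patch) would apply the change on a running broker,
# after which a subscribed ListenHTTP processor should receive one notification.
```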
Setting up the cloud environment
Creating a project in Google Cloud Platform
We can manage the project via the Cloud Shell
Creating a Google Cloud Storage bucket
➢ Store datasets
➢ Store Notebooks
➢ Store logs
➢ Store output files
Creating a Dataproc cluster using Cloud Shell
gcloud beta dataproc clusters create ${CLUSTER_NAME} \
  --region=${REGION} \
  --image-version=1.4 \
  --master-machine-type=n1-standard-4 \
  --worker-machine-type=n1-standard-4 \
  --bucket=${BUCKET_NAME} \
  --optional-components=ANACONDA,JUPYTER \
  --enable-component-gateway
Creating a Dataproc cluster using GUI
Component gateway for additional cluster components
Overview of the Dataproc cluster
Dataproc cluster web interfaces
Dataproc cluster: JupyterLab interface
Creating a Jupyter Notebook and provisioning data from a Google Cloud Storage bucket
Link to Notebook
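Reading the dataset from the bucket inside the notebook might look like the sketch below. It assumes a `SparkSession` is already available as `spark` in the notebook and that the cluster can read `gs://` paths; the bucket and file names are hypothetical.

```python
def gcs_uri(bucket, blob_path):
    """Build the gs:// URI for an object in a Cloud Storage bucket."""
    return f"gs://{bucket}/{blob_path.lstrip('/')}"


def load_dataset(spark, bucket, blob_path):
    """Read a CSV object from the bucket into a Spark DataFrame.

    `spark` is an existing SparkSession supplied by the caller.
    """
    return spark.read.csv(gcs_uri(bucket, blob_path), header=True, inferSchema=True)


# Hypothetical bucket and object names.
print(gcs_uri("my-bucket", "/data/steel_faults.csv"))
```

In the notebook this would be called as `load_dataset(spark, "my-bucket", "data/steel_faults.csv")`.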
Submitting a PySpark job using Dataproc GUI
Submitting a PySpark job to Dataproc cluster
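Besides the GUI, a job can also be submitted from the command line. The sketch below only assembles the `gcloud dataproc jobs submit pyspark` command as an argument list, so it can be inspected before being run. The script location, cluster name, and region are hypothetical values.

```python
import shlex


def submit_pyspark_command(main_file, cluster, region):
    """Assemble the gcloud command that submits a PySpark job to a cluster."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", main_file,
        f"--cluster={cluster}",
        f"--region={region}",
    ]


command = submit_pyspark_command(
    "gs://my-bucket/jobs/train.py",  # hypothetical script location
    "my-cluster",                    # hypothetical cluster name
    "europe-west1",                  # hypothetical region
)
print(shlex.join(command))
# subprocess.run(command, check=True) would execute it on a machine where
# gcloud is installed and authenticated.
```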
www.egm.io
Fluid Machine Learning lifecycle with FIWARE
Benoit Orihuela – i4Trust Training Webinar
A TYPICAL ML LIFECYCLE
• A Data Scientist
  • Get and clean up data
  • Prepare and train an ML model
• An IT person
  • Package and deploy the ML model
• An end user
  • Discover the available ML models (with respect to privacy)
  • Ask to use one or more of them (and optionally pay for it)
  • Get real-time data (predictions, outliers, …) from an ML model
ML lifecycle with FIWARE - i4Trust - 12/05/2021 3
WHAT DO WE AIM AT?
Bridge the gap between data scientists and operations (MLOps)
Develop the Machine Learning as a Service (MLaaS) model
And also:
More and more use cases requiring ML / AI activities
FIWARE needs to offer a rich variety of tools
THE TRAINING AND PREPARATION PHASE
THE DISCOVERY AND REGISTRATION PHASE
THE PREDICTION PHASE
DEMONSTRATIONS
• Demonstration #1 - End-to-end demonstration of an ML model development, deployment and use
  • Use of Jupyter notebook as interface
  • Applied to a simplistic water flow calculation
• Demonstration #2 - Events generation from video stream analysis
  • Real-time extraction of context information from a video stream
Thank You!
Benoit ORIHUELA
Lead Architect
Tel: +33 687427107
E-mail: benoit.orihuela@egm.io
www.egm.io
MLaaS for Image analysis
Anwar ALFATAYRI
REAL LIFE EXAMPLE: SOCIAL DISTANCING
Number of people: 14
Groups of 2 people: 1
Groups of 3 people: 2
Groups of 4 people: 1
Groups of more than 4 people: 0
TWO APPROACHES: Machine learning on the edge
[Diagram labels: Image, 3 people detected, Street, FIWARE Cloud]
TWO APPROACHES: Machine learning as a service
[Diagram labels: Image, 3 people detected, Street, FIWARE Cloud, REST API]

Session 8 - Creating Data Processing Services | Train the Trainers Program
