Simplifying AI Integration on Spark
Hemshankar Sahu
Principal Software Engineer @ Informatica
About Speaker
Hemshankar Sahu
Principal Software Engineer @ Informatica
M. Tech. in Computer Science and Engineering from IIT Roorkee
9+ years of experience in the IT industry, working as a Full Stack Developer and ML Engineer.
Currently developing a framework to help integrate Machine Learning algorithms and models into production systems.
About Informatica
Enterprise Cloud Data Management leader
▪ 9,500+ customers
▪ 18 trillion cloud transactions per month
▪ 85% of Fortune 100
▪ A Leader in five Gartner Magic Quadrants
Agenda
▪ Context for the Talk
  ▪ Personas Involved
  ▪ Informatica On Spark
▪ Problem Details
  ▪ AI/ML Integration Problems
▪ Solution Details
  ▪ New Offering: AISR
  ▪ Simplifying AI/ML integration on Spark
▪ Demo
  ▪ Deployment, Integration and Auto CI-CD of AI Solutions
▪ Summary
Context for the Talk
Personas Involved
Data Scientist vs Data Engineer: personas involved in operationalizing ML algorithms
Data Scientist
▪ Tasks: Data Exploring, Model Building, Model Training
▪ Languages: Python, R, Lisp
▪ Tools: Notebook, R Studio, Matlab
▪ Libraries: TensorFlow, Keras, Pandas, scikit-learn
Data Engineer
▪ Tasks: Data Ingestion, Data Pre-processing, Transformation and Cleansing
▪ Languages: SQL, Scala, Java/Python
▪ Tools: Spark, Data Engg. Tools (like Informatica)
▪ Libraries: Hadoop, Spark
Informatica On Spark
Informatica Data Engineering Integration (DEI)
▪ Data Engineering tool which uses Spark as its execution engine
▪ Generates Spark code
▪ Executes on the cluster
Informatica Intelligent Cloud Services: Cloud Data Integration Elastic
▪ Enabling Spark serverless support for auto-scaling and provisioning
▪ Same, familiar Informatica design-time
▪ Auto-scaling Spark cluster
▪ Deployed to your cloud network
Problem Details
AI/ML Integration Issues
Example problem use-case: Collaborating Data Engineers and Data Scientists
[Diagram: Informatica DEI and a Spark Cluster whose nodes are labelled Python 2.7; a Data Science Team of two Python Developers (one labelled Python 3.6) and an R Developer holding Master, V1 and V2; V2 shown on both the Data Science Team and Data Engineering Team sides, with "?" marks on the path to the cluster.]
Issues
▪ Team Collaboration Required
  ▪ Data Scientists and Data Engineers invest time to collaborate
▪ Manually Deploy the Binaries
  ▪ Downtime for each new version
▪ No Support for Different Runtimes
Solution Details
New Offering: AISR
▪ Repository of AI Solutions
▪ A Solution is
  ▪ Code and Metadata
  ▪ Dependencies
  ▪ Runtime Details
▪ A Solution can
  ▪ Be in any language*
  ▪ Have any dependency
  ▪ Run on GPU**
* Only Python is supported in the current release
** Provided the hardware is present, the drivers are installed, and the solution contains the respective code
[Diagram: AI Solutions Repository (AISR)]
Runtimes
▪ Tensorflow_Numpy
▪ Sickitlearn_OpenCV
Solutions (examples)
▪ Sentiment Analysis
▪ Image Processing
▪ Image Classification
▪ Image To Text
Each Solution contains
▪ Generated code for executing from various platforms
▪ Solution code, which can be in any language
▪ Dependencies: files, installed software, etc.
Based on a General Solutions Repository
▪ Solution languages: CPP, Python, R, Java
▪ Consumers: DEI, Spark, REST, Java
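The slide describes a Solution as code and metadata, dependencies, and runtime details. The following is a minimal sketch of how such a record might be modelled; the class names, field names, and the `validate` helper are hypothetical illustrations and are not taken from AISR itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Runtime:
    """Hypothetical description of an execution environment."""
    name: str                     # e.g. "Tensorflow_Numpy" from the slide
    language: str                 # the slide notes only Python is supported for now
    packages: List[str] = field(default_factory=list)
    needs_gpu: bool = False


@dataclass(frozen=True)
class Solution:
    """Hypothetical record for one entry in a solutions repository."""
    name: str
    version: str
    entry_point: str              # assumed "module:function" naming convention
    runtime: Runtime
    dependencies: List[str] = field(default_factory=list)  # files, installed software
    metadata: Dict[str, str] = field(default_factory=dict)


def validate(solution: Solution) -> List[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    if solution.runtime.language.lower() != "python":
        problems.append("only Python runtimes are supported in the current release")
    if ":" not in solution.entry_point:
        problems.append("entry_point must look like 'module:function'")
    return problems


tf_runtime = Runtime("Tensorflow_Numpy", "python", ["tensorflow", "numpy"])
sentiment = Solution(
    name="Sentiment Analysis",
    version="V1",
    entry_point="sentiment:predict",
    runtime=tf_runtime,
    dependencies=["vocab.txt"],
    metadata={"owner": "data-science-team"},
)
print(validate(sentiment))  # prints []
```

Keeping the runtime as a separate object mirrors the slide's split between Runtimes and Solutions, so several solutions could share one runtime definition.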
Simplifying AI/ML integration on Spark
Example use-case solution: Collaborating Data Scientists and Data Engineers
[Diagram: the Data Science Team (two Python Developers and an R Developer, labelled Python 2.7 and Python 3.6) publishes Master, V1 and V2 to AISR along with Runtime-1, Runtime-2 and Runtime-3; on the Data Engineering Team side, Informatica DEI runs on a Cluster in which each node holds V1 with its Runtime.]
Benefits
▪ Minimum Collaboration
  ▪ Between Data Scientist and Data Engineer
▪ Auto-deploy of New Versions
  ▪ No Downtime
▪ Multiple Versions Support
  ▪ Different versions of the same solution can be used.
▪ Support for Different Runtimes
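The benefits above (auto-deploy, no downtime, multiple versions) can be illustrated with a small sketch. It assumes a repository that keeps every published version and lets consumers ask for either a pinned version or the latest one; `SolutionRegistry`, `publish`, and `resolve` are hypothetical names, not the AISR API.

```python
from typing import Callable, Dict, List


class SolutionRegistry:
    """Hypothetical in-memory stand-in for a solutions repository."""

    def __init__(self) -> None:
        self._versions: Dict[str, Dict[str, Callable[[str], str]]] = {}
        self._order: Dict[str, List[str]] = {}

    def publish(self, name: str, version: str, fn: Callable[[str], str]) -> None:
        """Add a new version; earlier versions stay available."""
        self._versions.setdefault(name, {})[version] = fn
        self._order.setdefault(name, []).append(version)

    def resolve(self, name: str, version: str = "latest") -> Callable[[str], str]:
        """Look up a pinned version, or the most recently published one."""
        if version == "latest":
            version = self._order[name][-1]
        return self._versions[name][version]


registry = SolutionRegistry()
registry.publish("image-to-text", "V1", lambda image: f"v1:{image}")


def pipeline_step(image: str) -> str:
    # The consumer resolves at call time, so publishing V2 needs no change here.
    return registry.resolve("image-to-text")(image)


before = pipeline_step("cat.png")
registry.publish("image-to-text", "V2", lambda image: f"v2:{image}")
after = pipeline_step("cat.png")
pinned = registry.resolve("image-to-text", "V1")("cat.png")
print(before, after, pinned)  # prints v1:cat.png v2:cat.png v1:cat.png
```

Because the consumer looks the solution up by name on each call, the data engineering side picks up V2 without a redeploy, while a pipeline that needs stability can still pin V1.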
Demo
Demo Use Case
Easy Collaboration, No Downtime and CI-CD
[Diagram: the Data Scientist works in AISR and the Data Engineer works in DEI, sharing an Image Classification solution.]
Simplified Integration In Action
[Diagram: the Data Scientist publishes to the AI Solutions Repo; the Data Engineer builds a Mapping in Informatica DEI, which runs as a Spark Job on the cluster.]
AI Solutions Repo
▪ Runtimes: Python + TF + OpenCV, R Eco System
▪ Solutions: Image To Text V1, Object Detection V1
Each Solution contains
▪ Generated Java code for executing at Spark executors
▪ INFA wrapper and core code, which can be in any language
▪ Dependencies: files, installed software, etc.
Cluster
▪ YARN and HDFS across Node 1, Node 2 and Node 3
▪ Spark Job with Executor 1 and Executor 2
▪ Cached Binaries
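The diagram shows solution binaries cached on the cluster and invoked from the Spark executors. The deck says the generated wrapper is Java; the sketch below is only a Python analogy of the underlying pattern, in which the expensive load happens once per worker process and each partition of rows is then scored with the loaded solution. `load_solution`, `score_partition`, and the `LOAD_COUNT` counter are hypothetical.

```python
from functools import lru_cache
from typing import Callable, Iterable, Iterator, Tuple

LOAD_COUNT = 0  # only here to show how often the expensive load runs


@lru_cache(maxsize=None)
def load_solution(name: str, version: str) -> Callable[[str], str]:
    """Hypothetical loader; a real one would unpack cached binaries from local disk."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return lambda record: f"{name}/{version}:{record.upper()}"


def score_partition(rows: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Score one partition, loading the solution once rather than once per row."""
    predict = load_solution("object-detection", "V1")
    for row in rows:
        yield row, predict(row)


# Two lists stand in for two partitions handled by the same worker process.
partitions = [["a.jpg", "b.jpg"], ["c.jpg"]]
results = [list(score_partition(p)) for p in partitions]
print(results)
print(LOAD_COUNT)  # prints 1
```

In PySpark, a generator function of this shape could be passed to `RDD.mapPartitions`, so the load cost is paid per worker rather than per record.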
Demo Recap
▪ Easily Created Solution
  ▪ Easily added a new AI Solution from Jupyter Notebook
  ▪ Explored the details of the added solution
▪ Deployed and Tested
  ▪ Added Solution was deployed
  ▪ Explored various consumption options
  ▪ Created a REST endpoint and used it for testing
▪ Easily Integrated with Spark
  ▪ Created a mapping job using Informatica
  ▪ Created a new Transformation to use the deployed Solution
  ▪ Ran the mapping on Spark with the selected Solution
▪ CI-CD
  ▪ Retrained the Solution with a few clicks
  ▪ Used the retrained Solution without any changes or downtime
Summary
▪ Data Scientist vs Data Engineer
  ▪ Collaboration is challenging and time consuming
▪ Easy Spark Job Creation using DEI
  ▪ Drag-and-drop way of creating Spark jobs
▪ Easy Spark-AI Solution Integration using AISR
  ▪ Minimum Collaboration
▪ Processing happens at Spark scale within the Spark Cluster
  ▪ Better performance as compared to other serving platforms.
▪ Inbuilt CI-CD for AI Solutions
  ▪ No downtime in case of Solution upgrades
  ▪ No changes required in the Data Engineering environment
▪ AISR Framework
  ▪ Based on Generic Solutions Repository Implementation
  ▪ Partners can develop plugins to add or consume AI Solutions
▪ Overall Production Cost Reduction
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.

