Simplifying AI integration on Apache Spark

Hemshankar Sahu
Principal Software Engineer @ Informatica

About Speaker
Hemshankar Sahu
Principal Software Engineer @ Informatica
M.Tech. in Computer Science and Engineering from IIT Roorkee
9+ years of experience in the IT industry as a full-stack developer and ML engineer.
Currently developing a framework that helps integrate machine learning algorithms and models into production systems.

About Informatica
Enterprise Cloud Data Management leader
▪ 9,500+ customers
▪ 18 trillion cloud transactions per month
▪ 85% of Fortune 100
▪ A Leader in five Gartner Magic Quadrants

Agenda
▪ Context for the Talk
▪ Personas Involved
▪ Informatica On Spark
▪ Problem Details
▪ AI/ML Integration Problems
▪ Solution Details
▪ New Offering: AISR
▪ Simplifying AI/ML integration on Spark
▪ Demo
▪ Deploying, Integration, Auto CI-CD of AI Solutions
▪ Summary

Personas Involved
Data Scientist vs. Data Engineer: personas involved in operationalizing ML algorithms

Category | Data Scientist | Data Engineer
Tasks | Data exploration, model building, model training | Data ingestion, data pre-processing, transformation and cleansing
Languages | Python, R, Lisp | SQL, Scala, Java/Python
Tools | Notebook, RStudio, MATLAB | Spark, data engineering tools (like Informatica)
Libraries | TensorFlow, Keras, Pandas, scikit-learn | Hadoop, Spark

Informatica On Spark
Informatica Data Engineering Integration (DEI)
▪ Data engineering tool that uses Spark as its execution engine
▪ Generates Spark code
▪ Executes on the cluster

Informatica Intelligent Cloud Services: Cloud Data Integration Elastic
▪ Enables serverless Spark support for auto-scaling and provisioning
▪ Same, familiar Informatica design-time
▪ Auto-scaling Spark cluster, deployed to your cloud network

AI/ML Integration Issues
Example problem use-case: Collaborating Data Engineers and Data Scientists
[Diagram: Data Science team (Python developers, R developer; Python 2.7 and Python 3.6 environments; Master, V1, V2) and Data Engineering team (Informatica DEI, Spark cluster with Python 2.7 nodes). Question marks mark the open path from a new version to the cluster.]
Issues
▪ Team Collaboration Required
▪ Data scientists and data engineers invest time to collaborate
▪ Manually Deploy the Binaries
▪ Downtime for each new version
▪ No Support for Different Runtimes

New Offering: AISR
▪ Repository of AI Solutions
▪ A Solution is
▪ Code and Metadata
▪ Dependencies
▪ Runtime Details
▪ A Solution can
▪ Be in any language*
▪ With any dependency
▪ Run on GPU**
* Only Python is supported in the current release.
** Provided the hardware is present, the drivers are installed, and the solution contains the respective code.
[Diagram: AI Solutions Repository (AISR)]
▪ Runtimes: Tensorflow_Numpy, Sickitlearn_OpenCV
▪ Solutions: Sentiment Analysis, Image Processing, Image Classification, Image To Text
▪ Example solution contents: generated code for executing from various platforms; solution code, which can be in any language; dependencies (files, installed software, etc.)
▪ AISR is based on a general Solutions Repository: solutions in CPP, Python, R, Java; consumers DEI, Spark, REST, Java
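To make the "Code and Metadata, Dependencies, Runtime Details" breakdown concrete, here is a minimal sketch of how such a solution record could be modeled. It is not the AISR data model: the class names, field names, entry point, and file names (`model.h5`, `vocab.txt`) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Runtime:
    """Hypothetical runtime description: a language plus preinstalled libraries."""
    name: str
    language: str
    libraries: Tuple[str, ...] = ()


@dataclass
class Solution:
    """Hypothetical solution record: code and metadata, dependencies, runtime details."""
    name: str
    version: str
    entry_point: str                                        # code: function to call
    runtime: Runtime                                        # runtime details
    dependencies: List[str] = field(default_factory=list)   # files, installed software
    metadata: Dict[str, str] = field(default_factory=dict)
    needs_gpu: bool = False

    def missing_libraries(self, required):
        """Return the required libraries that the chosen runtime does not provide."""
        return sorted(set(required) - set(self.runtime.libraries))


# Runtime name taken from the slide; its library list is an assumption.
tf_runtime = Runtime("Tensorflow_Numpy", "python", ("tensorflow", "numpy"))

sentiment = Solution(
    name="Sentiment Analysis",
    version="V1",
    entry_point="sentiment.predict",         # hypothetical module and function
    runtime=tf_runtime,
    dependencies=["model.h5", "vocab.txt"],  # hypothetical files
    metadata={"input": "text", "output": "label"},
)
```

Keeping the runtime as a separate record mirrors the slide's split between Runtimes and Solutions: several solutions could point at the same runtime, and a check such as `sentiment.missing_libraries(["numpy", "opencv"])` would flag a mismatch before deployment.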

Simplifying AI/ML integration on Spark
Example use-case solution: Collaborating Data Scientists and Data Engineers
[Diagram: Data Science team (Python developers, R developer; Python 2.7 and Python 3.6 environments; Master, V1, V2) publishes to AISR, which holds Runtime-1, Runtime-2, and Runtime-3. Data Engineering team uses Informatica DEI on a cluster where each node holds V1 together with its runtime.]
Benefits
▪ Minimum Collaboration
▪ Between Data Scientist and Data Engineer
▪ Auto-Deploy of New Versions
▪ No Downtime
▪ Multiple Versions Support
▪ Different versions of the same solution can be used
▪ Support for Different Runtimes
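The "Multiple Versions Support" and "No Downtime" benefits can be pictured with a small registry sketch. This is an assumption about one way such behavior could be achieved, not a description of how AISR implements it: `SolutionRegistry`, `deploy`, and `resolve` are hypothetical names, and the lambdas stand in for real models.

```python
import threading


class SolutionRegistry:
    """Hypothetical in-memory registry: solution name -> deployed versions."""

    def __init__(self):
        self._lock = threading.Lock()
        self._versions = {}  # name -> {version: predict function}
        self._latest = {}    # name -> most recently deployed version

    def deploy(self, name, version, predict_fn):
        """Add a version; earlier versions stay available to callers that pin them."""
        with self._lock:
            self._versions.setdefault(name, {})[version] = predict_fn
            self._latest[name] = version

    def resolve(self, name, version=None):
        """Return (version, predict function); default to the latest deployment."""
        with self._lock:
            chosen = version or self._latest[name]
            return chosen, self._versions[name][chosen]


registry = SolutionRegistry()
registry.deploy("Image Classification", "V1", lambda image: "cat")
registry.deploy("Image Classification", "V2", lambda image: "tabby cat")

# A caller that does not pin a version picks up V2 without being restarted.
version, predict = registry.resolve("Image Classification")
# A caller that pins V1 keeps working after V2 is deployed.
pinned_version, pinned_predict = registry.resolve("Image Classification", "V1")
```

Because consumers look up the solution by name at call time, deploying a new version changes only the registry entry, which is the property the slide describes as requiring no change on the data engineering side.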

Demo Use Case
Easy Collaboration, No Downtime and CI-CD
[Diagram: Data Scientist (AISR) and Data Engineer (DEI) working on an Image Classification solution.]

Simplified Integration In Action
[Architecture diagram]
▪ AI Solutions Repo (Data Scientist): Runtimes (Python + TF + OpenCV; R ecosystem); Solutions (Image To Text V1, Object Detection V1)
▪ Each solution holds: generated Java code for executing at Spark executors; the INFA wrapper and core code, which can be in any language; dependencies (files, installed software, etc.)
▪ Informatica DEI (Data Engineer): a Mapping that runs as a Spark Job
▪ Cluster: YARN and HDFS across Node 1, Node 2, and Node 3; the Spark Job runs on Executor 1 and Executor 2, which use Cached Binaries
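The "Cached Binaries" box suggests that each executor obtains a solution once and reuses it across the rows it processes. The sketch below illustrates that general pattern in plain Python; it is not the generated Java code mentioned on the slide, and `load_solution`, `process_partition`, and `FETCH_LOG` are hypothetical names. In PySpark, a function shaped like `process_partition` is the kind of callable one would pass to `RDD.mapPartitions`.

```python
import functools

FETCH_LOG = []  # records every simulated fetch, for illustration only


@functools.lru_cache(maxsize=None)
def load_solution(name, version):
    """Simulate fetching a solution's binaries once per process and caching them."""
    FETCH_LOG.append((name, version))
    # Stand-in for a real model: reports the size of the image payload.
    return lambda image_bytes: f"{name} {version}: {len(image_bytes)} bytes"


def process_partition(rows, name="Image To Text", version="V1"):
    """Apply the cached solution to every row of one partition."""
    predict = load_solution(name, version)
    return [predict(row) for row in rows]


# Two partitions handled by the same process trigger only one fetch.
partitions = [[b"abc", b"de"], [b"fghi"]]
results = [process_partition(p) for p in partitions]
```

Keying the cache on both name and version means a newly deployed version is fetched on first use while the previous one remains cached, which fits the no-downtime upgrade described in the talk.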

Demo Recap
▪ Easily Created Solution
▪ Easily added a new AI Solution from Jupyter Notebook
▪ Explored the details of added solution
▪ Deployed and Tested
▪ Added Solution was deployed
▪ Explored various consumption options
▪ Created REST Endpoint and used it for testing
▪ Easily Integrated with Spark
▪ Created a mapping job using Informatica
▪ Created new Transformation to use the Deployed Solution
▪ Ran the mapping on Spark with selected Solution
▪ CI-CD
▪ Retrained the Solution with a few clicks
▪ Used the retrained Solution without any changes or downtime

Summary
▪ Data Scientist vs. Data Engineer
▪ Collaboration is challenging and time-consuming
▪ Easy Spark Job Creation using DEI
▪ Drag-and-drop Spark job creation
▪ Easy Spark-AI Solution Integration using AISR
▪ Minimum Collaboration
▪ Processing happens at Spark scale within the Spark cluster
▪ Better performance compared to other serving platforms
▪ Inbuilt CI-CD for AI Solutions
▪ No downtime during Solution upgrades
▪ No changes required in the Data Engineering environment
▪ AISR Framework
▪ Based on Generic Solutions Repository Implementation
▪ Partners can develop plugins to add or consume AI Solutions
▪ Overall Production Cost Reduction

