© Cloudera, Inc. All rights reserved.
Josh YEH | Software Engineer
https://www.linkedin.com/in/joshyeh/
Data Science in the Enterprise
This is the age of machine learning.
[Figure: cost of compute and data volume over time (1950s to 2010s); regions labeled "NO Machine Learning" and "Machine Learning"]
[Diagram labels: CDSW; Data Scientist; Machine Learning and Advanced Analytical Platform; CM / CDH; Data Engineer; Data Admin; ETL and ...]
Machine learning opportunities are everywhere.
[Diagram: pipeline stages Data Engineering, Data Science (Exploratory), Production (Operational); labels: Data Wrangling, Visualization and Analysis, Model Training & Testing, Production Data Pipelines, Batch Scoring, Online Scoring, Serving, Data Governance, Governance, Processing, Acquisition, Reports, Dashboards]
• Data has never been more plentiful
• Open source data science and machine learning libraries are rapidly evolving
• Commodity (and on-demand) compute makes scalable production machine learning affordable
But, there are challenges:
[Diagram: pipeline stages Data Engineering, Data Science (Exploratory), Production (Operational); labels: Data Wrangling, Visualization and Analysis, Model Training & Testing, Production Data Pipelines, Batch Scoring, Online Scoring, Serving, Data Governance, Governance, Processing, Acquisition, Reports, Dashboards]
• Most data science is done at small scale, individually, and is difficult to replicate
• Very few models reach production
• Teams have different, conflicting requests for languages & libraries
• Data needs to move across multiple different systems
Our enterprise customers say:
Access
For sensitive data, secure clusters are difficult to access, and IT typically doesn’t want arbitrary packages installed on a secure cluster. Popular open source tools don’t easily connect to these environments and don’t always support Hadoop data formats.
Scale
Laptops rarely have capacity for medium data, let alone big data, which leads to a lot of sampling. Popular frameworks don’t easily parallelize on a cluster, so code typically has to be rewritten for production.
Developer Experience
Notebooks, while awesome, don’t easily support virtual environments and dependency management, especially for teams. This makes sharing and reproducibility hard. Notebooks are also challenging to “put into production.”
Cloudera: The Enterprise Platform for Data Science and Machine Learning
The data is now here
30B connected devices
440x more data
Cloudera first to integrate Spark
Modern Platform for Machine Learning and Advanced Analytics
Leading adoption among enterprises
500 customers run Spark on
Our goal: open data science at enterprise scale
Help more data scientists use the power of Cloudera
• For the Data Scientist and Data Engineer: use a powerful, familiar environment with direct access to Cloudera data and compute
Make it easy and secure to add new users and use cases
• For the Enterprise Architect and Hadoop Admin: offer secure self-service analytics and a faster path to production on common, affordable infrastructure
An open ecosystem for agility and innovation
[Diagram labels: Open Ecosystem; Black Box]
Rich data science libraries and the power of Spark
• IT: drive adoption while maintaining compliance
• Data Scientist: explore, experiment, iterate
Supports the complete data science pipeline
From data to exploration to action
[Diagram: pipeline stages Data Engineering, Data Science (Exploratory), Production (Operational); labels: Data Wrangling, Visualization and Analysis, Model Training & Testing, Production Data Pipelines, Batch Scoring, Online Scoring, Serving, Data Governance, Governance, Processing, Acquisition, Reports, Dashboards]
[Diagram labels: CDSW; Data Scientist; Machine Learning and Advanced Analytical Platform; CM / CDH; Data Engineer; Data Administrator; ETL and ...]
Data science at enterprise scale requires a full stack.
• Support unlimited data: ✓ Hadoop
• Provide sufficient tools for Analysts: ✓ Impala, Hive, Hue
• Provide sufficient tools for Data Scientists + Data Engineers: ✓ Spark, Data Science Workbench
• Enable real-time use cases: ✓ Kafka, Spark Streaming
• Provide data governance: ✓ Navigator + Partners
• Provide full-stack security: ✓ Kerberos, Sentry, Record Service, KMS/KTS
• Deploy in the cloud: ✓ Cloudera Director
• Integrate with partner tools: ✓ Rich Ecosystem
• Be easy for IT to deploy/maintain: ✓ Cloudera Manager + Director
Introducing Cloudera Data Science Workbench
Self-service data science for the enterprise
Accelerates data science from development to production with:
• Secure self-service environments for data scientists to work against Cloudera clusters
• Support for Python, R, and Scala, plus project dependency isolation for multiple library versions
• Workflow automation, version control, collaboration and sharing
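The idea of per-project dependency isolation mentioned above can be illustrated with a small sketch. This is not how Cloudera Data Science Workbench implements isolation; it only shows the general principle using Python's standard `venv` module. The function name `create_project_env` and the project names are hypothetical.

```python
import tempfile
import venv
from pathlib import Path


def create_project_env(projects_root: Path, project_name: str) -> Path:
    """Create an isolated virtual environment for one project and return its path."""
    env_dir = projects_root / project_name / ".venv"
    # with_pip=False keeps the sketch offline and fast; pass True to bootstrap pip.
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    # Hypothetical project names; each gets its own environment directory,
    # so each can later pin different versions of the same library.
    with tempfile.TemporaryDirectory() as root:
        for name in ("churn-model", "fraud-model"):
            env = create_project_env(Path(root), name)
            print(name, "->", env)
```

Because each project owns its environment directory, installing or upgrading a library for one project leaves the others untouched, which is the property the slide calls project dependency isolation.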
With Cloudera Data Science Workbench…
Data scientists can:
• Use R, Python, or Scala from a web browser, with no desktop footprint
• Install any library or framework within isolated project environments
• Directly access data in secure clusters with Spark and Impala
• Share insights with their team for reproducible, collaborative research
• Automate and monitor data pipelines using built-in job scheduling
IT can:
• Give their data science team the freedom to work how they want, when they want
• Stay compliant with out-of-the-box support for full platform security, especially Kerberos
• Run on-premises or in the cloud, wherever data is managed
With Cloudera Data Science Workbench…
• Enable data scientists
• IT can focus on data engineering
Demo
CDH security integration, Share Report, Group Collaboration, Data Visualization, Anomaly Detection, GPU resource management, Train ML models
Connect to Kerberized Hadoop cluster with
• Requires only a single configuration per user.
• The CDSW application refreshes Kerberos authentication every 15 minutes.
• Supports both Kerberos password and keytab authentication.
• Revoking access is easy.
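A periodic ticket refresh like the one described above can be sketched as follows. This is an illustration only, not CDSW's actual mechanism: it assumes the MIT Kerberos `kinit` command is installed and uses keytab authentication (`-k -t`). The function names and the `max_rounds` parameter are hypothetical; only the 15-minute interval comes from the slide.

```python
import subprocess
import time
from typing import Callable, List, Optional

REFRESH_SECONDS = 15 * 60  # the 15-minute interval quoted on the slide


def build_kinit_command(principal: str, keytab_path: str) -> List[str]:
    """Build an MIT Kerberos `kinit` call that authenticates from a keytab."""
    return ["kinit", "-k", "-t", keytab_path, principal]


def refresh_tickets(
    principal: str,
    keytab_path: str,
    run: Callable[..., object] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
    max_rounds: Optional[int] = None,
) -> int:
    """Re-run kinit every REFRESH_SECONDS; return how many refreshes ran."""
    rounds = 0
    while True:
        # check=True makes subprocess.run raise if kinit exits non-zero.
        run(build_kinit_command(principal, keytab_path), check=True)
        rounds += 1
        if max_rounds is not None and rounds >= max_rounds:
            return rounds
        sleep(REFRESH_SECONDS)
```

Calling `refresh_tickets("alice@EXAMPLE.COM", "/path/to/alice.keytab")` with a placeholder principal and keytab path would loop until interrupted. The `run` and `sleep` parameters are injectable so the loop can be exercised without a real KDC.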
Cloudera Fast Forward Labs (CFFL)
Machine Learning Kickstart kit
What is Cloudera Fast Forward Labs (CFFL)?
Fast Forward Labs (FFL) is a small data science/machine learning research and consulting team.
FFL provides a subscription advisory service that helps technology leaders keep current on machine learning and artificial intelligence innovations that will be practical within the next 6-24 months.
1. Executives struggle with machine learning strategy
There are thousands of different ML techniques available, and innovation is incredibly fast. It's impossible for any company to know about all the options, let alone evaluate the few that may meet their needs.
The vendor ecosystem is increasingly complex, and hype muddies the waters. Analyst firms like Gartner and Forrester take a high-level and long view of the market, so they don't provide any real actionable intelligence about ML that could be useful today.
2. Cut through the ML hype
A Cloudera Fast Forward Labs research subscription is the best way to keep up with the latest in machine learning and artificial intelligence. Each quarter, we publish a research report and provide software prototypes highlighting emerging machine learning capabilities that will be useful in the next two years. The prototypes prove the algorithms, reduce risk and uncertainty, and shorten time to value. Our advising services expand on the research and help you apply machine learning to your specific line of work, keeping you ahead of the curve and above the confusion and hype.
3. Research and so much more
CFFL Customer Successes
• A bank's head of innovation read the CFFL report on Natural Language Generation. The bank now generates 80% of its regulatory compliance documentation automatically, saving 30% of the hours.
• One of the big four accounting firms is using Summarization techniques to reissue tax guidance to its customers. What were once static memos are now dynamically updated documents. Their ML system sits on a newsfeed of updates to judicial interpretation of the tax law, automatically alerting the CPAs and their customers.
• Another customer read our report on Image and video analysis with deep learning. CFFL conducted a feasibility study on enabling surgical robots to detect when a surgery is going wrong. The findings wound up radically changing the company's product roadmap.
Future Data Science Trends
Invest in the right direction; avoid building up technical debt
Trend 1: GPU for training; CPU for deployment
1. GPUs are well suited to machine learning (deep neural network training / AI).
2. Data retention (/revamp) is costly.
a. E.g., data from the development, test, and manufacturing processes for chips used in autonomous cars needs to be retained for 15 years.
3. Retrain and redeploy new DL models on ever-growing data sets daily, weekly, yearly, etc.
Trend 2: Invest in the Right Tool Sets
1. Don’t reinvent the wheel: leverage what’s already available to shorten the development cycle.
2. Use community support: mainstream tools have ample, enthusiastic support and contributions.
3. Easier troubleshooting
Trend 3: Application Containerization
1. Maintenance: legacy applications, complicated frameworks, third-party libraries, and evaluations of free open source projects. Too many things could go wrong.
2. Resource Optimization: easy resource management via cgroups in the cluster.
3. Deployment: easy testing, validation, and deployment.
4. Scale: Kubernetes to manage containers and scale up easily.
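The resource-management point above can be made concrete with a minimal sketch of a Kubernetes Pod manifest built as a Python dictionary. On Linux nodes, Kubernetes enforces CPU and memory limits through cgroups, and GPUs are requested through an extended resource such as `nvidia.com/gpu` (which assumes the NVIDIA device plugin is installed on the cluster). The function name `pod_manifest`, the pod names, and the image names are hypothetical placeholders, not anything from the deck.

```python
import json
from typing import Any, Dict


def pod_manifest(name: str, image: str, cpu: str, memory: str, gpus: int = 0) -> Dict[str, Any]:
    """Return a Pod manifest with CPU/memory requests and limits, plus optional GPUs."""
    limits: Dict[str, Any] = {"cpu": cpu, "memory": memory}
    if gpus > 0:
        # Extended resources such as GPUs are declared under limits.
        limits["nvidia.com/gpu"] = gpus
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {
                        "requests": {"cpu": cpu, "memory": memory},
                        "limits": limits,
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # A GPU-backed training pod and a CPU-only scoring pod (placeholder images).
    train = pod_manifest("train-job", "example.invalid/trainer:latest", "4", "16Gi", gpus=1)
    score = pod_manifest("score-api", "example.invalid/scorer:latest", "1", "2Gi")
    print(json.dumps(train, indent=2))
    print(json.dumps(score, indent=2))
```

The two example pods mirror the "GPU for training; CPU for deployment" split from the earlier slide: only the training pod asks for a GPU. The JSON output can be saved to a file and submitted with `kubectl apply -f`.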
Trend 4: Simplified Data Scientist Workflow
1. Work anywhere
2. Train at large scale
3. Seamless deployment
90+% resource spent on
Thank you!
jjyeh@cloudera.com
