© Hitachi Vantara Corporation 2018
Machine Learning: A Challenge for Architects
#DOAG2018, Nürnberg, 20 November 2018
Harald Erb
Solutions Engineer, EMEA Central
Data Analytics & IoT
Hitachi Vantara: What? And Why?
[Figure: Hitachi combines OT (107+ years) and IT (58+ years) into IoT insight. Labels on the figure: business, industrial, consumer, city, communications, big data analytics, artificial intelligence, cloud, IT systems.]
IoT Solution Portfolio of Hitachi Vantara
[Figure: layered portfolio diagram. Legend: ◼ Industry solutions (Hitachi BUs, ISVs, SIs) ◼ IoT applications ◼ IoT platform services ◼ Edge-to-cloud infrastructure]
• Industries: Manufacturing, Utilities, Logistics, Oil & Gas, Mining, Financial Services, Other
• Services: Co-Creation Services, Professional Services
• Applications: Maintenance Insights, Manufacturing Insights, Video Insights
• Analytics: Artificial Intelligence, Machine Learning, Stream Analytics, Batch Analytics
• Data Management: Data Orchestration, Data Engineering, Data Blending, Data Collection
• Asset: Asset Management, Asset Avatars
• Further building blocks: Data Stores; Business Connectors; APIs, App-Enabling Studio; Alerts, Notifications; Dashboards, UI, UX
• Edge: Asset Integration, Device Control, Data Caching, Filtering
• Infrastructure: edge controllers, appliances; converged and hyperconverged; block, file and object storage
• Products named on the figure: Foundry, Pentaho
public class AgendaForThisTalk {
    public String Topic1 = "Process, Architecture and 1 Example";
    public String Topic2 = "1 Dataset, 4 different Tools";
    public String Topic3 = "From Prototype to Production";
}
Process, Architecture and 1 Example
Data Warehouse & Analytics back in 1998
[Figure: architecture diagram; the only recoverable label is "Data Discovery".]
Architecting for Analytics & Machine Learning today
Source: Carlton E. Sapp: “Preparing and Architecting for Machine Learning”, Gartner, 2017
Analytic Dashboard Example
Source: Carlton E. Sapp: “Preparing and Architecting for Machine Learning”, Gartner, 2017
Fleet Optimization: End-to-end Fleet Management Solution
• Combine sensor data with contextual information
• Overcome the scarce availability of capable technicians
• Lower costs and reduce customer downtime
Truck Leasing Company
Issues: Trucks have become more and more technology-based, capable technicians are scarce, and managing a large number of maintenance centers is expensive.
Business Objectives: Better fleet management: lower costs and more efficient maintenance to reduce customer downtime.
Strategic Goals: Take competitive advantage of truck data. With 40,000-50,000 trucks purchased per year, the company has more data than any truck OEM. Gain a competitive edge through lower costs and predictive technology (automation in repairs and diagnostics).
Vehicle Sensor Data
Store models and sensor data
Asset model: Utility Vehicle Asset
Sensor data:
• Air Pressure
• Axle Vibration
• Lights
• Load Weight
• Movement
• Temperature
Sensor data journey (stages as labelled on the slide): Stream, Blend, Infer, Sense, Inspect, Embed & Integrate, Store
Adding Context to Sensor Data
IoT Data Refinery: Contextual Data + Sensor Data
Sensor data journey (stages as labelled on the slide): Stream, Blend, Infer, Sense, Inspect, Embed & Integrate, Store
Vehicle Location
• GPS
• Lat / Long
• Mapping
• Movement
Vehicle Profile
• Make
• Model
• Mileage
Operational Systems
• Maintenance History
• Maintenance Schedule
• Service Centers
• Parts Ordering
• Parts Inventory
Business Outcomes
• Real-Time Fleet Status and Health
• Repair Recommendations
• Optimized Maintenance Scheduling
• Automated Parts Ordering
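Adding context to sensor data can be pictured as a join between sensor readings and vehicle master data. The following is only a minimal sketch in Python: the field names (vehicle_id, axle_vibration, mileage_km), the sample values and the inspection rule with its thresholds are all assumptions made for illustration, not part of the solution shown in the talk.

```python
# Hypothetical contextual data, e.g. taken from a vehicle master-data system.
VEHICLE_PROFILES = {
    "truck-1": {"make": "Acme", "model": "T100", "mileage_km": 180_000},
    "truck-2": {"make": "Acme", "model": "T200", "mileage_km": 20_000},
}

# Hypothetical sensor readings as they might arrive from the vehicles.
readings = [
    {"vehicle_id": "truck-1", "axle_vibration": 0.9, "temperature_c": 71.0},
    {"vehicle_id": "truck-2", "axle_vibration": 0.9, "temperature_c": 65.0},
]


def blend(reading, profiles):
    """Join one sensor reading with the profile of the same vehicle."""
    profile = profiles.get(reading["vehicle_id"], {})
    return {**reading, **profile}


def needs_inspection(record, vibration_limit=0.8, mileage_limit_km=150_000):
    """Toy rule (assumption): high vibration on a high-mileage vehicle."""
    return (record["axle_vibration"] > vibration_limit
            and record.get("mileage_km", 0) > mileage_limit_km)


blended = [blend(r, VEHICLE_PROFILES) for r in readings]
flagged = [rec["vehicle_id"] for rec in blended if needs_inspection(rec)]
print(flagged)
```

Both trucks report the same vibration level, but only the one with high mileage is flagged; the sensor value alone would not separate them.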
Fleet Management Dashboard: Situation Overview
Fleet Management Dashboard: Contextual View
Data Science – Personas & Process Model
Source: K. Bollhöfer, Chief Data Scientist, *um
Cross-functional team
Architecture Challenge
Source: D. Sculley et al.: “Hidden Technical Debt in Machine Learning Systems”, 2015
1 Dataset, 4 different Tools
1 Dataset, 4 Tools
Dataset: California House Prices
github.com/ageron/handson-ml/tree/master/datasets/housing
Tools:
Jupyter Notebook
• End-to-end ML projects
• ML in your preferred programming language, such as Python, R or Julia
• Live code embedded in a markup document
Oracle Data Visualization
• For data exploration
• ML to explain dependencies in the dataset
• 1-click analytical functions and model training possible
H2O Flow
• End-to-end ML platform with its own compute engine, AutoML, …
• Notebook UI; H2O algorithms accessible from Python and R
• Java export of trained models
Pentaho Data Integration
• Embedding ML code in ETL dataflows
• ML orchestration: model training and management
• Plugin Machine Intelligence
Free trial versions available for all tools!
Machine Learning Coding with Jupyter Notebook & Python
Start here: jupyter.org
Machine Learning with Jupyter Notebook & Python
ML Process support: End-to-end; focus on experimentation, not on production ML
Personas: Data Scientists
Useful:
• Notebooks allow reproducible ML from data ingestion to model evaluation
• Multiple programming languages; access to the latest ML frameworks via the Python interface
• Sophisticated visualizations
Architecture & Development related:
• Trained ML models can be serialized/saved, e.g. via Pickle or Joblib
• Models can be published as REST API endpoints, e.g. via Flask
• Notebooks are stored as JSON files → code versioning and merging are not always easy; less comfortable than a full IDE (limited syntax highlighting and keyword completion)
• Large-scale ML is possible, e.g. via Apache Spark MLlib; cluster deployments are better done separately
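The serialization point can be illustrated with Python's built-in pickle module. This is a minimal sketch under stated assumptions: the "model" is just a dictionary of made-up linear-regression parameters and the feature name median_income is used for illustration only; a real notebook would typically pickle a trained estimator object instead.

```python
import pickle

# Hypothetical result of a training run: parameters of a toy linear model.
trained_model = {
    "feature": "median_income",
    "weight": 40_000.0,
    "bias": 50_000.0,
}


def predict(model, value):
    """Apply the toy linear model to one feature value."""
    return model["weight"] * value + model["bias"]


# Serialize to bytes; pickle.dump() would write to an open binary file instead.
blob = pickle.dumps(trained_model)

# Later, possibly in another process: restore the model and score new data.
restored = pickle.loads(blob)
print(predict(restored, 3.0))
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from trusted sources.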
Dataset Exploration with Oracle Data Visualization
Start here: www.oracle.com/technetwork/middleware/oracle-data-visualization
Dataset Exploration with Oracle Data Visualization
ML Process support: Data Understanding phase, Results Presentation & Storytelling
Personas: Business Analysts, Data Scientists
Useful:
• Highly interactive charts and filters, and intuitive analysis support (e.g. pattern brushing)
• Supporting functions to highlight and explain attribute/variable dependencies
• Formula editor, advanced (ML) functions, and model training and scoring available for experimentation (experienced users only)
Architecture & Development related:
• Use Oracle Analytics Cloud for better collaboration; it can be combined with Data Visualization Desktop
• Dataset preparation functionality is improving over time, but will not replace existing ETL platforms (scalability, job management)
• Good for rapid prototyping, but limited reusability of results (curated datasets, ML models)
Machine Learning Orchestration and more with Pentaho
Start here: community.hitachivantara.com/docs/DOC-1009931-downloads
Blog: community.hitachivantara.com/community/products-and-solutions/pentaho/blog/2018/10/16/deep-learning-coming-to-pentaho
Bring Your Own (ML) Code
ETL-Tools + Python/R – A Door Opener for Deep Learning?
Embedding ML Algorithms into Data Pipelines
“Model Zoo” managed by your Data Integration Solution
Machine Learning within Data Integration Platforms
ML Process support: Data Preparation phase, Operationalize ML, Experimentation?
Personas: Data Engineers, Data Scientists (supporting)
Useful:
• DI platforms provide advanced data preparation and automation features for effective creation of datasets
• Intuitive UI, drag & drop instead of coding
• Skilled DI team already in place
Architecture & Development related:
• DI platforms are optimized to use the full computing power for ETL/ELT, but not for ML tasks (e.g. parallel execution might need extra effort)
• When using R / Python interfaces: make sure the status information and performance metrics of your ML model execution are captured in the DI logging
• The “ML toolbox” of DI solutions is often incomplete: script-based workarounds may be needed, e.g. for imputation of missing values, working with the latest algorithms, or model management
• Limited and unintuitive data exploration features, e.g. visualizations
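Two of the points above, a script-based workaround for missing values and reporting status and performance metrics to the log, can be sketched together. This is a hedged illustration in plain Python, not Pentaho code: the logger name, the log format and the column name total_bedrooms are assumptions, and how such a script is wired into a DI step depends on the platform.

```python
import logging
import statistics
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("di.ml_step")  # hypothetical logger name


def impute_median(rows, column):
    """Return copies of the rows with None in `column` replaced by the median."""
    observed = [r[column] for r in rows if r[column] is not None]
    median = statistics.median(observed)
    return [{**r, column: median if r[column] is None else r[column]}
            for r in rows]


def run_step(rows, column):
    """Run the imputation and report status and timing to the log."""
    start = time.perf_counter()
    missing = sum(1 for r in rows if r[column] is None)
    result = impute_median(rows, column)
    elapsed = time.perf_counter() - start
    log.info("status=ok rows=%d imputed=%d seconds=%.4f",
             len(rows), missing, elapsed)
    return result


rows = [
    {"total_bedrooms": 300.0},
    {"total_bedrooms": None},
    {"total_bedrooms": 500.0},
]
cleaned = run_step(rows, "total_bedrooms")
print(cleaned[1]["total_bedrooms"])
```

The input rows are left unchanged, which keeps the step repeatable when the surrounding pipeline retries it.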
End-to-end Machine Learning for everyone with H2O (Flow)
Start here: www.h2o.ai/download
Accessing a 2-node H2O cluster from an R environment
End-to-end Machine Learning for everyone with H2O Flow
ML Process support: End-to-end; focus is on model training/evaluation and feature engineering
Personas: Data Scientists, ambitious Business Analysts(?)
Useful:
• H2O Flow: intuitive notebook-style UI + user guidance
• Use the programming language you already know, such as R or Python
• AutoML can be used to automate the ML workflow, including training and tuning of many models within a user-specified time limit
Architecture & Development related:
• Takes advantage of the computing power of distributed systems and in-memory computing to accelerate machine learning
• Works on existing big data infrastructure, on bare metal or on top of existing Hadoop or Spark clusters; can ingest data from HDFS, Spark, S3
• Model deployment into production with Java (POJO) and binary formats (MOJO), Hive UDF, or as API endpoint
• H2O Flow: limited data exploration (no visualizations; only in combination with R/Python)
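The idea of training and tuning many models within a user-specified time limit can be illustrated without H2O. The sketch below is not H2O's AutoML API; it is a plain-Python random search over a toy one-parameter model family, and every name in it (budgeted_search, max_seconds, max_models) is an assumption chosen for illustration.

```python
import random
import time


def fit_line(xs, ys, slope):
    """Fit only the intercept for a fixed slope; return (intercept, mse)."""
    intercept = sum(y - slope * x for x, y in zip(xs, ys)) / len(xs)
    mse = sum((slope * x + intercept - y) ** 2
              for x, y in zip(xs, ys)) / len(xs)
    return intercept, mse


def budgeted_search(xs, ys, max_seconds=0.2, max_models=500, seed=42):
    """Evaluate candidate models until the time or model budget is used up."""
    rng = random.Random(seed)
    deadline = time.monotonic() + max_seconds
    # Always start with a baseline: slope 0, i.e. predicting the mean.
    intercept, mse = fit_line(xs, ys, 0.0)
    best = (mse, 0.0, intercept)
    tried = 1
    while tried < max_models and time.monotonic() < deadline:
        slope = rng.uniform(-5.0, 5.0)
        intercept, mse = fit_line(xs, ys, slope)
        if mse < best[0]:
            best = (mse, slope, intercept)
        tried += 1
    return best, tried


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated by y = 2x + 1
(best_mse, best_slope, best_intercept), tried = budgeted_search(xs, ys)
print(tried, best_slope, best_mse)
```

Because the baseline is always evaluated first, the search can never return something worse than the mean predictor, however short the budget is.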
Another tool decision: Which ML Framework to choose?
Source: Dr. D. James: “Entscheidungsmatrix Machine Learning” (decision matrix for machine learning), it-novum.com
From Prototype to Production
From Exploratory Data Science to Production Workflows
[Figure: a model moves from exploration to production across a “line of governance”.]
Exploration:
• Unbounded discovery
• Self-service sandbox
• Wide toolset / IDEs
• Agile methods
Production:
• Commercial exploitation
• Integration to operations
• Non-functional requirements
• Standardisation & governance
Model deployment
Source: Sergei Izrailev: Design Patterns for Machine Learning in Production, 2017
• Data transformations must be the same in training and scoring
• Interface between building & scoring:
− In-memory: model is never persisted, train then score → single & applications
− Data only, e.g. PMML → code is independent
− Serialized objects (Pickle, R, Spark, custom) → reuse code
− Code + data, e.g. H2O’s POJO → code is generated
Detailed article in “Java aktuell” 06/2018
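The first bullet and the "data only" interface can be sketched together: the scaling parameters learned during training travel with the model parameters, so scoring applies exactly the same transformation. This is a minimal illustration only; JSON stands in here for an interchange format such as PMML, and the toy model and all names are assumptions.

```python
import json


def fit_scaler(values):
    """Learn min/max on the training data only."""
    return {"min": min(values), "max": max(values)}


def transform(value, scaler):
    """Shared transformation used by both training and scoring."""
    return (value - scaler["min"]) / (scaler["max"] - scaler["min"])


# --- Training side -------------------------------------------------------
train_x = [10.0, 20.0, 30.0]
train_y = [1.0, 2.0, 3.0]
scaler = fit_scaler(train_x)
scaled = [transform(x, scaler) for x in train_x]
# Toy model y = w * x_scaled + b, solved from the first and last point.
w = (train_y[-1] - train_y[0]) / (scaled[-1] - scaled[0])
b = train_y[0] - w * scaled[0]

# "Data only" hand-over: parameters travel as JSON, no code is shipped.
artifact = json.dumps({"scaler": scaler, "w": w, "b": b})

# --- Scoring side (could be a different program or language) -------------
params = json.loads(artifact)


def score(raw_value, params):
    """Apply the stored transformation, then the stored model."""
    x = transform(raw_value, params["scaler"])
    return params["w"] * x + params["b"]


print(score(25.0, params))
```

If the scoring side recomputed min and max from its own data instead of reading them from the artifact, the same raw value would be scaled differently and the prediction would silently drift.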
H2O POJO (Plain Old Java Object):
• ML model implemented as Java classes
• Depends on H2O-specific classes
Gradient Boosting Machines (GBM)
• A family of powerful machine-learning techniques for regression and classification problems
• GBMs produce a prediction model in the form of an ensemble of weak prediction models, typically decision trees
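How an ensemble of weak models adds up to a stronger one can be shown with a small from-scratch sketch. It is an illustration of the general technique for squared-error regression, using one-split decision stumps on a single feature; it is not H2O's GBM implementation, and the data and hyperparameters are made up.

```python
def fit_stump(xs, residuals):
    """Find the threshold split that minimises squared error on the residuals."""
    best = None  # (error, threshold, left_value, right_value)
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gbm(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the residuals are the negative gradient.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict_gbm(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict_gbm(model, x) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.5, 1.2, 4.0, 4.2, 3.9]
baseline = fit_gbm(xs, ys, n_rounds=0)   # mean predictor only
boosted = fit_gbm(xs, ys, n_rounds=20)
print(mse(baseline, xs, ys), mse(boosted, xs, ys))
```

Each stump alone is a poor predictor; the training error falls because every new stump corrects what the previous ones left over. The comparison here is on training data only and says nothing about generalisation.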
From Prototype to Machine Learning in Production
Source: Sergei Izrailev: Design Patterns for Machine Learning in Production, 2017
Applying ML Model for Scoring of new (unlabeled) Data
Source: K. Wähner: How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka
• Embed the ML model (a Java application) in an Apache Kafka stream processing application to apply it to new incoming events
• Spark Streaming also allows ML model serving via mini batches and makes use of in-memory processing and load distribution
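Scoring via mini batches can be sketched independently of Kafka or Spark. In the following plain-Python illustration, a generator stands in for the event stream and a threshold rule stands in for the trained model; both, along with all names and values, are assumptions and not the API of either system.

```python
from itertools import islice


def event_stream():
    """Stand-in for an unbounded source such as a message topic."""
    for i in range(7):
        yield {"event_id": i, "axle_vibration": 0.1 * i}


def mini_batches(events, batch_size):
    """Group an event iterator into lists of at most `batch_size` items."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def score_batch(batch, threshold=0.35):
    """Placeholder for a trained model: flags events above a threshold."""
    return [(e["event_id"], e["axle_vibration"] > threshold) for e in batch]


results = []
for batch in mini_batches(event_stream(), batch_size=3):
    results.extend(score_batch(batch))

alerts = [event_id for event_id, flagged in results if flagged]
print(alerts)
```

Batching trades a little latency for throughput: the model is invoked once per group of events instead of once per event.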
ML Model as API endpoint: Object detection example
Source: A. Rosebrock: “A scalable Keras + deep learning REST API”, pyimagesearch Blog
• “API first” approach instead of rewriting code to replicate the ML model in a programming language supported by the enterprise IT
• Web APIs make it easy for applications written in different languages to work together: a developer who needs a model for an ML-powered web application only needs the URL endpoint where the API is served
• Web service development frameworks like Flask allow prototyping in Python
• For production environments, an additional web server and a messaging component should be considered
Keras: a simple, modular, and extensible deep learning library, written in Python and designed to enable fast experimentation with deep neural networks
Redis (Remote Dictionary Server): a distributed, in-memory key-value database with optional durability
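The "model behind a URL endpoint" idea can be prototyped with the Python standard library alone. The sketch below uses a bare WSGI application instead of Flask so that it stays self-contained (Flask applications are WSGI applications as well); the /predict route, the JSON payload shape and the placeholder model are assumptions, not the setup from the cited blog post.

```python
import json
from io import BytesIO


def predict(features):
    """Placeholder model: a fixed linear formula, not a trained network."""
    return 2.0 * features["x"] + 1.0


def app(environ, start_response):
    """Minimal WSGI application exposing POST /predict with a JSON body."""
    if environ["REQUEST_METHOD"] != "POST" or environ["PATH_INFO"] != "/predict":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length))
    body = json.dumps({"prediction": predict(payload)}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]


def call(app, method, path, data=b""):
    """Invoke the WSGI app in-process, without opening a network socket."""
    status_holder = {}

    def start_response(status, headers):
        status_holder["status"] = status

    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(data)),
        "wsgi.input": BytesIO(data),
    }
    body = b"".join(app(environ, start_response))
    return status_holder["status"], json.loads(body)


status, response = call(app, "POST", "/predict", json.dumps({"x": 3}).encode())
print(status, response)
```

To try it over HTTP, the same app object can be served locally with wsgiref.simple_server.make_server("", 8000, app).serve_forever(); that server is meant for development, which matches the slide's advice to add a dedicated web server for production.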
Source: A. Rosebrock: “Building a simple Keras + deep learning REST API”, The Keras Blog
Object Detection – Use Case
PicSure AI platform for insurance solutions
Takeaway
• Find the right problem
• Define constraints
• Design components and interfaces
• Take into account organizational constraints
• Production can’t be an afterthought
• The process is a lot of work, but it’s not rocket science
Processing huge amounts of data with complex algorithms can take a bit too much time. Kudos to Randall Munroe / XKCD for the original.
Thank You
© Hitachi Vantara Corporation 2018

More Related Content

PDF
Webinar: Open Source Business Intelligence Intro
PPTX
BI Reporting Application Comparison
PPTX
Oracle business analytics best practices
PDF
Informatica to ODI Migration – What, Why and How | Informatica to Oracle Dat...
PDF
Open Source ETL vs Commercial ETL
PDF
Informatica Pentaho Etl Tools Comparison
PPT
Pentaho Partner Program Info
PDF
Webinar: Open Source Business Intelligence Intro
Webinar: Open Source Business Intelligence Intro
BI Reporting Application Comparison
Oracle business analytics best practices
Informatica to ODI Migration – What, Why and How | Informatica to Oracle Dat...
Open Source ETL vs Commercial ETL
Informatica Pentaho Etl Tools Comparison
Pentaho Partner Program Info
Webinar: Open Source Business Intelligence Intro

What's hot (19)

PDF
Eh p8 sp10_delta_scope_final
PDF
Pentaho data integration 4.0 and my sql
PDF
SAUG Melbourne plenary 2017 embedded analytics
PPTX
Transform Your Data Integration Platform From Informatica To ODI
DOC
Resume (3)
PDF
New BI Tools with HANA
PPTX
Informatica overview
PPTX
MS Cloud Day - Introduction to Windows Azure platform and real world case study
PPSX
Sap HANA Presentation to SAPnsight Dallas Breakfast Huddle in June 2014
PPT
Informatica session
PDF
Lecture about SAP HANA and Enterprise Comupting at University of Halle
PDF
Fulfilling real time analytics on obi apps platform
PPTX
Sap s4 hana logistics ppt
PPTX
Modern Reporting at Scale: How to Distribute Information and Answers to the M...
PPTX
Why and How Migrate Informatica to ODI | Infa to ODI Migration | Infa to ODI ...
PPT
Stratebi_Emilio_Arias_PCM14
DOCX
Digital economy with the speed of s4 hana
PPTX
Restart EAM at OSRAM with a lean approach
PDF
Self service BI overview + Power BI
Eh p8 sp10_delta_scope_final
Pentaho data integration 4.0 and my sql
SAUG Melbourne plenary 2017 embedded analytics
Transform Your Data Integration Platform From Informatica To ODI
Resume (3)
New BI Tools with HANA
Informatica overview
MS Cloud Day - Introduction to Windows Azure platform and real world case study
Sap HANA Presentation to SAPnsight Dallas Breakfast Huddle in June 2014
Informatica session
Lecture about SAP HANA and Enterprise Comupting at University of Halle
Fulfilling real time analytics on obi apps platform
Sap s4 hana logistics ppt
Modern Reporting at Scale: How to Distribute Information and Answers to the M...
Why and How Migrate Informatica to ODI | Infa to ODI Migration | Infa to ODI ...
Stratebi_Emilio_Arias_PCM14
Digital economy with the speed of s4 hana
Restart EAM at OSRAM with a lean approach
Self service BI overview + Power BI
Ad

Similar to Machine Learning - Eine Challenge für Architekten (20)

PDF
DATAOPS: THE NEXT BIG WAVE ON YOUR DATA JOURNEY - Big Data Expo
PDF
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
PPTX
Data analytics on Azure
PPTX
A practical guidance of the enterprise machine learning
PPTX
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
PDF
Belgrade R - Intro to H2O and Deep Water
PDF
H2O at BelgradeR Meetup
PDF
Bringing Deep Learning into production
PPTX
ISV Showcase: End-to-end Machine Learning using H2O on Azure
PPTX
AI and AutoML: Debunking Myths
PDF
Using PySpark to Process Boat Loads of Data
PDF
Resume
PDF
Data meets AI - ATP Roadshow India
PDF
Machine Learning on Google Cloud with H2O
PDF
Marvin Platform – Potencializando equipes de Machine Learning
PDF
Machine learning model to production
PPTX
Maximise investment in your LMS with Extended Enterprise
PDF
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
PPTX
H2O 0xdata MLconf
PDF
Leveraging Apache Spark to Develop AI-Enabled Products and Services at Bosch
DATAOPS: THE NEXT BIG WAVE ON YOUR DATA JOURNEY - Big Data Expo
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Data analytics on Azure
A practical guidance of the enterprise machine learning
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
Belgrade R - Intro to H2O and Deep Water
H2O at BelgradeR Meetup
Bringing Deep Learning into production
ISV Showcase: End-to-end Machine Learning using H2O on Azure
AI and AutoML: Debunking Myths
Using PySpark to Process Boat Loads of Data
Resume
Data meets AI - ATP Roadshow India
Machine Learning on Google Cloud with H2O
Marvin Platform – Potencializando equipes de Machine Learning
Machine learning model to production
Maximise investment in your LMS with Extended Enterprise
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
H2O 0xdata MLconf
Leveraging Apache Spark to Develop AI-Enabled Products and Services at Bosch
Ad

More from Harald Erb (13)

PDF
Actionable Insights with AI - Snowflake for Data Science
PDF
Snowflake for Data Engineering
PDF
Dataiku & Snowflake Meetup Berlin 2020
PDF
Does it only have to be ML + AI?
PDF
Delivering rapid-fire Analytics with Snowflake and Tableau
PDF
DOAG Big Data Days 2017 - Cloud Journey
PDF
Do you know what k-Means? Cluster-Analysen
PDF
Exploratory Analysis in the Data Lab - Team-Sport or for Nerds only?
PDF
Big Data Discovery + Analytics = Datengetriebene Innovation!
PDF
Big Data Discovery
PDF
DOAG News 2012 - Analytische Mehrwerte mit Big Data
PDF
Oracle Unified Information Architeture + Analytics by Example
PDF
Endeca Web Acquisition Toolkit - Integration verteilter Web-Anwendungen und a...
Actionable Insights with AI - Snowflake for Data Science
Snowflake for Data Engineering
Dataiku & Snowflake Meetup Berlin 2020
Does it only have to be ML + AI?
Delivering rapid-fire Analytics with Snowflake and Tableau
DOAG Big Data Days 2017 - Cloud Journey
Do you know what k-Means? Cluster-Analysen
Exploratory Analysis in the Data Lab - Team-Sport or for Nerds only?
Big Data Discovery + Analytics = Datengetriebene Innovation!
Big Data Discovery
DOAG News 2012 - Analytische Mehrwerte mit Big Data
Oracle Unified Information Architeture + Analytics by Example
Endeca Web Acquisition Toolkit - Integration verteilter Web-Anwendungen und a...

Recently uploaded (20)

PPTX
1_Introduction to advance data techniques.pptx
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PPT
Quality review (1)_presentation of this 21
PPT
Chapter 2 METAL FORMINGhhhhhhhjjjjmmmmmmmmm
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPTX
Introduction to Knowledge Engineering Part 1
PPTX
Computer network topology notes for revision
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PPT
Chapter 3 METAL JOINING.pptnnnnnnnnnnnnn
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
Database Infoormation System (DBIS).pptx
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
1_Introduction to advance data techniques.pptx
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Quality review (1)_presentation of this 21
Chapter 2 METAL FORMINGhhhhhhhjjjjmmmmmmmmm
Introduction-to-Cloud-ComputingFinal.pptx
Introduction to Knowledge Engineering Part 1
Computer network topology notes for revision
STUDY DESIGN details- Lt Col Maksud (21).pptx
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
Chapter 3 METAL JOINING.pptnnnnnnnnnnnnn
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
Acceptance and paychological effects of mandatory extra coach I classes.pptx
Business Ppt On Nestle.pptx huunnnhhgfvu
Database Infoormation System (DBIS).pptx
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”

Machine Learning - Eine Challenge für Architekten

  • 1. © Hitachi Vantara Corporation 2018© Hitachi Vantara Corporation 2018. Eine Challenge für Architekten #DOAG2018, Nürnberg, 20. November 2018 Machine Learning Harald Erb Solutions Engineer, EMEA Central Data Analytics & IoT
  • 2. © Hitachi Vantara Corporation 2018 OT 107+ YEARS BUSINESS INDUSTRIAL CONSUMER CITY IT 58+ YEARS COMMUNICATIONS BIG DATA ANALYTICS ARTIFICIAL INTELLIGENCE CLOUD IT SYSTEMS IoT INSIGHT Hitachi Hitachi Vantara: What? And Why?
  • 3. © Hitachi Vantara Corporation 2018 Analytics Artificial Intelligence Machine Learning Stream Analytics Batch Analytics Data Management Data Orchestration Data Engineering Data Blending Data Collection Asset Asset Management Asset Avatars Data Stores Business Connectors APIs, App-Enabling Studio Alerts, Notifications Dashboards, UI, UX Maintenance Insights Manufacturing Insights Video Insights Manufacturing Utilities Logistics Oil & Gas Mining Financial Services Other Co-Creation Services Professional Services Edge Asset Integration Device Control Data Caching, Filtering ◼ Industry solutions (Hitachi BUs, ISVs, SIs) ◼ IoT applications ◼ IoT platform services ◼ Edge-to-cloud infrastructure Foundry Edge controllers, appliances Converged and hyperconverged Block-, file- and object storage Pentaho IoT Solution Portfolio of Hitachi Vantara
  • 4. © Hitachi Vantara Corporation 2018
  • 5. © Hitachi Vantara Corporation 2018 public class AgendaForThisTalk { public String Topic1 = "Process, Architecture and 1 Example"; public String Topic2 = "1 Dataset, 4 different Tools"; public String Topic3 = "From Prototype to Production"; }
  • 6. © Hitachi Vantara Corporation 2018 Process, Architecture and 1 Example_
  • 7. © Hitachi Vantara Corporation 2018 Data Warehouse & Analytics back in 1998 z Data Discovery z
  • 8. © Hitachi Vantara Corporation 2018 Architecting for Analytics & Machine Learning today Source: Carlton E. Sapp: “Preparing and Architecting for Machine Learning”, Gartner, 2017
  • 9. © Hitachi Vantara Corporation 2018 Analytic Dashboard Example Source: Carlton E. Sapp: “Preparing and Architecting for Machine Learning”, Gartner, 2017
  • 10. © Hitachi Vantara Corporation 2018 Überschrift Source: XXXXXX End-to-end Fleet Management Solution • Combine Sensor Data with Contextual Information • Overcome scarce availability of capable technicians • Lower costs, reduced customer downtimes_ Fleet Optimization
  • 11. © Hitachi Vantara Corporation 2018 Truck Leasing Company Issues: Trucks have become more and more technology based, availability of capable technicians is scarce, and managing large number of maintenance centers is expensive. Business Objectives: Better Fleet Management - lowering costs, improving efficiency of maintenance to reduce customer downtime Strategic Goals: Need to take competitive advantage of truck data. 40-50,000 trucks purchased per year – more data than any truck OEM. Gain a competitive edge through lowered costs & predictive technology (automation in repairs & diagnostics).
  • 12. © Hitachi Vantara Corporation 2018 Vehicle Sensor Data Store models and sensor data Asset ModelSensor DataUtility Vehicle Asset Air Pressure Axle Vibration Lights Load Weight Movement Temperature Sensor Data Journey Stream Blend Infer Sense Inspect Embed & Integrate Store
  • 13. © Hitachi Vantara Corporation 2018 Adding Context to Sensor Data IoT Data Refinery Contextual DataSensor Data Sensor Data Journey Stream Blend Infer Sense Inspect Embed & Integrate Store Vehicle Location • GPS • Lat / Long • Mapping • Movement Vehicle Profile • Make • Model • Mileage Operational Systems • Maintenance History • Maintenance Schedule • Service Centers • Parts Ordering • Parts Inventory Business Outcomes • Real-Time Fleet Status and Health • Repair Recommendations • Optimized Maintenance Scheduling • Automated Parts Ordering
  • 14. © Hitachi Vantara Corporation 2018 Fleet Management Dashboard: Situation Overview Source: XXXXXX
  • 15. © Hitachi Vantara Corporation 2018 Fleet Management Dashboard: Contextual View Source: XXXXXX
  • 16. © Hitachi Vantara Corporation 2018 Data Science – Personas & Process Model Source: K. Bollhöfer, Chief Data Scientist, *um Cross-functional team
  • 17. © Hitachi Vantara Corporation 2018 Architecture Challenge Source: D. Sculley, et al.: “Hidden technical debt in Machine learning systems”, 2015
  • 18. © Hitachi Vantara Corporation 2018 1 Dataset, 4 different Tools_
  • 19. © Hitachi Vantara Corporation 2018 1 Dataset, 4 Tools Dataset: California House Prices github.com/ageron/handson-ml/tree/master/datasets/housing Tools: Jupyter Notebook • End-to-end ML Projects • ML with preferred programing language like Python, R, Julia • Live-Code embedded in Markup Document Oracle Data Visualization • For data exploration • ML to explain dependencies in dataset • 1-click analytical functions and model training possible H2O Flow • End-to-end ML Platform with own compute engine, AutoML,… • Notebook UI, H2O algorithms accessible from Python and R • Java-Export of trained Models Pentaho Data Integration • Embedding ML Code in ETL dataflows • ML Orchestration: Model training and management • Plugin Machine Intelligence Free trial versions available for all tools!
  • 20. © Hitachi Vantara Corporation 2018 Machine Learning Coding with Jupyter Notebook & Python Start here: jupyter.org
  • 21. © Hitachi Vantara Corporation 2018 Überschrift Source: XXXXXX
  • 22. © Hitachi Vantara Corporation 2018
  • 23. © Hitachi Vantara Corporation 2018 Überschrift Source: XXXXXX
  • 24. © Hitachi Vantara Corporation 2018 Machine Learning with Jupyter Notebook & Python ML Process support: End-to-end, focus on experimentation not on production ML Personas: Data Scientists Useful: • Notebooks allow reproduceable ML from data ingestion to model evaluation • Multiple programming languages, access to latest ML frameworks via Python interface • Sophisticated visualizations Architecture & Development related: • Trained ML models can be serialized/saved, i.e. via Pickle, Joblib • Models can be published as REST API endpoints, ie. via Flask • Notebooks stored as JSON files → Code versioning / merge not allways easy, less comfort compared to other IDE‘s (no syntax highlighting, no key word completion) • Large-scale ML possible, i.e. via Apache Spark MLlib; Cluster deployments are better done separately
  • 25. © Hitachi Vantara Corporation 2018 Dataset Exploration with Oracle Data Visualization Start here: www.oracle.com/technetwork/middleware/oracle-data-visualization
  • 26. © Hitachi Vantara Corporation 2018 Überschrift
  • 27. © Hitachi Vantara Corporation 2018 Dataset Exploration with Oracle Data Visualization ML Process support: Data Understanding phase, Results Presentation & Story Telling Personas: Business Analysts, Data Scientists Useful: • Highly interactive Charts and Filters and intuitive analysis support (i.e. pattern brushing) • Supporting functions to highlight and explain attribute/variable dependencies • Formular editor, advanced (ML) functions and Model training and scoring available for experimentation (experienced Users only) Architecture & Development related: • Use Oracle Analytics Cloud for better collaboration; can be combined with Data Visualization Desktop • Dataset preparation functionality is improving over time, but will not replace existing ETL platforms (Scalability, Job-Management) • Good for rapid prototyping, but limited reusability of results (curated datasets, ML Models)
  • 28. © Hitachi Vantara Corporation 2018 Machine Learning Orchestration and more with Pentaho Start here: community.hitachivantara.com/docs/DOC-1009931-downloads Blog: community.hitachivantara.com/community/products-and- solutions/pentaho/blog/2018/10/16/deep-learning-coming-to-pentaho
  • 29. © Hitachi Vantara Corporation 2018 Bring Your Own (ML) Code
  • 30. © Hitachi Vantara Corporation 2018 ETL-Tools + Python/R – A Door Opener for Deep Learning?
  • 31. © Hitachi Vantara Corporation 2018 Embedding ML Algorithms into Data Pipelines Source: XXXXXX
  • 32. © Hitachi Vantara Corporation 2018 „Model Zoo“ managed by your Data Integration Solution
  • 33. © Hitachi Vantara Corporation 2018 Machine Learning within Data Integration Platforms ML Process support: Data Preparation phase, Operationalize ML, Experimentation? Personas: Data Engineers, Data Scientists (supporting) Useful: • DI platforms provide advanced data preparation and automation features for effective creation of datasets • Intuitive UI, Drag & drop instead of coding • Skilled DI team already in place Architecture & Development related: • DI platforms are optimized to utilize full computing power for ETL/ELT, but not for ML Tasks (i.e. parallel execution might need extra effort) • When using R / Python interfaces: ensure to collect status infos + performance metrics of your ML model execution within DI logging • „ML toolbox“ of DI solutions is often not complete: i.e. script-based work arounds needed for imputation of missing values, working with latest algorithms, model management • Limited and not intuitive data exploration features, i.e. visualisations
  • 34. © Hitachi Vantara Corporation 2018 End-to-end Machine Learning for everyone with H20 (Flow) Start here: www.h2o.ai/download Accessing a 2 node H2O cluster in a R environment
  • 35. © Hitachi Vantara Corporation 2018 Überschrift Source: XXXXXX
  • 36. © Hitachi Vantara Corporation 2018 Überschrift Source: XXXXXX
  • 37. End-to-end Machine Learning for everyone with H2O Flow ML Process support: End-to-end, focus is on Model training/evaluation, Feature Engineering Personas: Data Scientists, ambitious Business Analysts(?) Useful: • H2O Flow: intuitive notebook-style UI + user guidance • Use the programming language you already know, such as R or Python • AutoML can be used for automating the ML workflow, including training and tuning of many models within a user-specified time limit Architecture & Development related: • Takes advantage of the computing power of distributed systems and in-memory computing to accelerate machine learning • Works on existing big data infrastructure, on bare metal or on top of existing Hadoop or Spark clusters; can ingest data from HDFS, Spark, S3 • Model deployment into production with Java (POJO) and binary formats (MOJO), Hive UDF, or as API endpoint • H2O Flow: limited data exploration (no visualizations; only in combination with R/Python)
  • 38. Another tool decision: Which ML Framework to choose? Source: Dr. D. James: “Entscheidungsmatrix ‘Machine Learning’” (decision matrix for machine learning), it-novum.com
  • 39. From Prototype to Production_
  • 40. From Exploratory Data Science to Production Workflows Line of Governance • Commercial exploitation • Integration to operations • Non-functional requirements • Standardisation & governance Model • Unbounded discovery • Self-service sandbox • Wide toolset / IDEs • Agile methods
  • 41. Model deployment Source: Sergei Izrailev: Design Patterns for Machine Learning in Production, 2017 • Data transformations must be the same in training and scoring • Interface between building & scoring: − In-memory - model is never persisted, train then score → single & applications − Data only, e.g. PMML → code is independent − Serialized objects - Pickle, R, Spark, custom → reuse code − Code + Data - e.g. H2O’s POJO → code is generated Detailed article in “Java aktuell” 06/2018
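As a rough illustration of the first bullet and of the "data only" interface, the sketch below trains a toy threshold model, exports nothing but its parameters, and scores with the same `transform` function that training used. JSON stands in for PMML here, and the toy model, the function names and the parameter keys (`mu`, `sigma`, `threshold`) are all assumptions for this example, not taken from the cited talk.

```python
import json
from statistics import mean, pstdev


def transform(x, params):
    """Single preprocessing function shared by training and scoring."""
    return (x - params["mu"]) / params["sigma"]


def train(xs, labels):
    """Fit scaling parameters and a midpoint threshold.

    Toy model: assumes one numeric feature and that class 1 has the
    larger feature values.
    """
    params = {"mu": mean(xs), "sigma": pstdev(xs) or 1.0}
    scaled = [transform(x, params) for x in xs]
    pos = [s for s, y in zip(scaled, labels) if y == 1]
    neg = [s for s, y in zip(scaled, labels) if y == 0]
    params["threshold"] = (mean(pos) + mean(neg)) / 2
    return params


def score(x, params):
    """Apply the stored transformation, then the stored threshold."""
    return 1 if transform(x, params) > params["threshold"] else 0


if __name__ == "__main__":
    exported = json.dumps(train([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
    loaded = json.loads(exported)  # what a separate scoring service would read
    print(score(9, loaded), score(2, loaded))
```

Because the scaling parameters travel with the model parameters, the scoring side cannot silently apply a different transformation than training did.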
  • 43. H2O POJO (Plain Old Java Object): • ML model implemented as Java classes • Depends on H2O-specific classes
  • 45. Gradient Boosting Machines (GBM) • A family of powerful machine-learning techniques for regression and classification problems • GBMs produce a prediction model in the form of an ensemble of weak prediction models, typically decision trees
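To make "an ensemble of weak prediction models" concrete, here is a minimal pure-Python sketch of gradient boosting for squared-error regression on a single feature. The weak learner is a one-split decision stump, and each new stump is fitted to the residuals of the current ensemble. It is a teaching sketch under these assumptions, not H2O's GBM implementation; `fit_stump`, `fit_gbm` and their parameters are names chosen for this example.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on 1-D inputs (the weak learner).

    Needs at least two distinct x values.
    """
    best = None
    for split in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, split, lv, rv)
    _, split, lv, rv = best
    return lambda x: lv if x <= split else rv


def fit_gbm(xs, ys, n_trees=50, learning_rate=0.1):
    """Boosting for squared loss: every stump fits the current residuals."""
    base = sum(ys) / len(ys)  # initial prediction: the mean target
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        # For squared loss the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


if __name__ == "__main__":
    model = fit_gbm([1, 2, 3, 4, 5, 6], [1, 1, 1, 5, 5, 5])
    print(round(model(2), 2), round(model(5), 2))
```

A single stump is a poor model on its own; the small learning rate and the repeated fitting to residuals are what turn many of them into a usable predictor.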
  • 46. From Prototype to Machine Learning in Production Source: Sergei Izrailev: Design Patterns for Machine Learning in Production, 2017
  • 47. Applying ML Model for Scoring of new (unlabeled) Data Source: K. Wähner: How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka • Embed the ML model (as Java code) in an Apache Kafka stream processing application to apply it to new incoming events • Spark Streaming also allows ML model serving via mini-batches and makes use of in-memory processing and load distribution
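The mini-batch idea from the second bullet can be sketched without any streaming framework: events are pulled from an iterator, grouped into small batches, and each batch is scored with an already trained model. A plain Python iterator stands in for the Kafka topic or Spark stream here, and `mini_batches`, `score_stream` and the example event layout are hypothetical names for this illustration, not Kafka or Spark APIs.

```python
from itertools import islice


def mini_batches(events, batch_size):
    """Group a (possibly unbounded) event iterator into lists of at most batch_size."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def score_stream(events, model, batch_size=3):
    """Yield (event, prediction) pairs, scoring one mini-batch at a time."""
    for batch in mini_batches(events, batch_size):
        predictions = [model(event) for event in batch]
        yield from zip(batch, predictions)


if __name__ == "__main__":
    # Stand-in for a trained model: flags large payment amounts.
    def is_suspicious(event):
        return event["amount"] > 100

    incoming = ({"amount": a} for a in (50, 150, 20, 300))
    for event, flagged in score_stream(incoming, is_suspicious, batch_size=2):
        print(event, flagged)
```

In a real deployment the per-batch scoring call is where a framework would distribute the load across workers.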
  • 48. ML Model as API endpoint: Object detection example Source: A. Rosebrock: “A scalable Keras + deep learning REST API”, pyimagesearch Blog • “API first” approach instead of rewriting code to replicate the ML model in a programming language supported by the enterprise IT • Web APIs make it easy for applications written in different languages to work together. A developer who needs a model for an ML-powered web application only needs the URL of the endpoint where the API is served • Web service development frameworks like Flask allow prototyping in Python • For production environments, an additional web server and a messaging system should be considered Keras: a simple, modular, and extensible Deep Learning library, written in Python and designed to enable fast experimentation with deep neural networks Redis (Remote Dictionary Server): implements a distributed, in-memory key-value database with optional durability
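A prediction endpoint of the kind described above can be prototyped in a few lines. The slide names Flask; to keep this sketch free of third-party dependencies it uses Python's standard `http.server` instead, which is a deliberate substitution. The `/predict` path, the `features` field, the placeholder `predict` function and `make_server` are assumptions for this example and do not come from the cited blog posts.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Placeholder model: a real service would call the trained model here."""
    return {"score": sum(features) / len(features)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            body = json.dumps(predict(payload["features"])).encode("utf-8")
        except (ValueError, KeyError, TypeError, ZeroDivisionError):
            self.send_error(400, "expected a JSON object with a 'features' list")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; production code would log requests


def make_server(host="127.0.0.1", port=8080):
    """Create the prototype server; call .serve_forever() on the result to run it."""
    return HTTPServer((host, port), PredictHandler)
```

Calling `make_server().serve_forever()` starts the prototype; a client in any language can then POST `{"features": [1, 2, 3]}` to `/predict`. As the slide notes, a production setup would put a proper web server and a messaging layer in front of this.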
  • 49. Source: A. Rosebrock: "Building a simple Keras + deep learning REST API", The Keras Blog
  • 51. Object Detection – Use Case PicSure AI Platform for Insurance Solutions
  • 52. Takeaway • Find the right problem • Define constraints • Design components and interfaces • Take into account organizational constraints • Production can’t be an afterthought • The process is a lot of work, but it’s not rocket science Processing huge amounts of data with complex algorithms can take a bit too much time. Kudos to Randall Munroe / XKCD for the original
  • 53. © Hitachi Vantara Corporation 2018. All Rights Reserved. Thank You