© OPITZ CONSULTING 2020
Information classification: Public
Data Lab Environment for Data Scientists
Automation with Infrastructure as Code
Fabian Hardt, Senior Consultant – Big Data / Analytics
OPITZ CONSULTING: Überraschend mehr Möglichkeiten (Surprisingly more possibilities)
#Figures
- Turnover in million euros: 49.0 (2017), 55.5 (2018), 56.0 (2019)
- Employees: 455 (2017), 482 (2018), 555 (2019)
- Ø Sector distribution: Retail/Logistics/Services, Chemical & pharma, Public, Industry, Finance, Other
- 91% customer satisfaction; recommendation: NPS 40.26
- > 130 actively managed service customers
- > 600 active customers
- > 5000 databases in support
- > 4000 systems in support
- > 99.5% SLA compliance
- > 90% of our customers’ projects use agile methods
“Sound management in conjunction with a long-term customer-oriented strategy guarantees financial success.” – Torsten Schlautmann
Agenda
1. Data Lake
2. Data Lab vs. Data Factory
3. Infrastructure as Code
4. Demo
5. Summary
1 Data Lake
- Purpose
- Requirements
- Architecture
Data Lake - Purpose
- Primary task: centralized data provisioning
  - All of a company's data is collected in one centralized data platform
  - External data is collected as well (social media, weather, stock market prices, etc.)
  - Like DWH systems, it integrates different data sources
- Store raw data without knowing its intended purpose
  - Structured data
  - Unstructured data
- Suitable for very large amounts of data
- Mostly implemented with big data technologies
Data Lake – Enterprise Requirements (Data Factory)
- Usability of stored data
  - Searchability of data
  - Evaluability of data
  - Availability of data
- Data security / data protection regulations
  - Authentication of users
  - Authorization of users
- Data governance
  - Quality assurance of data
  - Governs the compliance requirements
Users should only see the data that they’re allowed to see.
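The rule that users only see data they are allowed to see can be sketched as a simple filter over a data catalog. This is a minimal illustration, not how any particular data lake product enforces authorization; the `Dataset` class, the role names, and the sample catalog are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    required_role: str  # hypothetical: the role a user needs to see this dataset


def visible_datasets(user_roles: set[str], catalog: list[Dataset]) -> list[Dataset]:
    """Return only the datasets whose required role the user holds."""
    return [d for d in catalog if d.required_role in user_roles]


# Invented sample catalog, for illustration only.
catalog = [
    Dataset("sales_refined", "analyst"),
    Dataset("hr_salaries", "hr"),
    Dataset("sensor_raw", "engineer"),
]

analyst_view = visible_datasets({"analyst", "engineer"}, catalog)
print([d.name for d in analyst_view])
```

In a real platform this check would sit behind authentication, so that the roles passed in belong to a verified user.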
Core components of a Data Lake
- Refined Data: quality-assured data, typically data that could transition into a classical DWH
- Data Refinery: preprocessing area
- Raw Data: sensor data, streaming, social media, documents, images, …
- Metadata
2 Data Lab vs. Data Factory
- Purpose
- Requirements
- Architecture
Data Lab - Purpose
- Develop processes and algorithms to gain insights from data
- Development area for Data Scientists
  - Allows Data Scientists to install and use any software of their choice
  - Allows Data Scientists to run experimental processes with external data
  - Allows Data Scientists to manage all data and processes
Data Lab - Requirements
- Should give Data Scientists full freedom
  - Use all kinds of technologies and software (in containers)
  - Use data as close as possible to real enterprise data
  - Use data from external sources
- Should meet all requirements of a software development environment
  - Software artifacts should be runnable in the Data Factory
- Should prevent Data Scientists from seeing data they are not allowed to see
- Should ensure the Data Lake does not get messed up
- Should ensure that even resource-hungry processes do not affect the Data Lake
Data Lab – Requirements implementation
- Data Lab as a sandbox dedicated to Data Scientists
  - No authorization needed
  - Gives Data Scientists full freedom
- Access to the Data Lake to use data from there
  - Read access at most
  - Sometimes working only on copies of data from the Data Lake
- Sandbox architecture prevents the Data Lab from affecting Data Lake processes
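The two access patterns named above, read-only access and working on copies, can be illustrated in a few lines. This is a toy sketch: the in-memory `data_lake` dictionary and both helper functions are hypothetical stand-ins for whatever storage and access layer is actually in place.

```python
import copy
from types import MappingProxyType

# Invented sample data standing in for a Data Lake table.
data_lake = {"customers": [{"id": 1, "city": "Gummersbach"}]}


def read_only_view(lake: dict) -> MappingProxyType:
    """Expose the lake without allowing tables to be added, replaced, or removed.

    Note: the proxy is shallow, so nested objects are still mutable.
    """
    return MappingProxyType(lake)


def sandbox_copy(lake: dict, table: str) -> list:
    """Hand the Data Lab an independent deep copy of one table."""
    return copy.deepcopy(lake[table])


view = read_only_view(data_lake)
work = sandbox_copy(data_lake, "customers")
work[0]["city"] = "changed in the sandbox"  # the lake itself stays untouched
print(data_lake["customers"][0]["city"])
```

Because the read-only proxy is only shallow, the copy is the safer pattern whenever the Data Lab needs to modify records.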
Data Lab - Architecture
- Refined Data: quality-assured data, typically data that could transition into a classical DWH
- Data Refinery: preprocessing area
- Raw Data: sensor data, streaming, social media, documents, images, …
- Metadata
- Analytics Sandbox (inputs: External Data, Data from Lake)
Data Factory - Requirements
- New models and algorithms should be easy to integrate
- Established algorithms should be easy to update
- Insights found should be integrated into the Data Lake so they can be evaluated
  - Challenge: designing an extensible data model
- Individual Data Factory components should not affect other Data Lake processes
- Must be compliant
Data Science vs. Enterprise
Data Lab
- Looking for business insights in a huge amount of data
- Mainly used to train algorithms
- Not a production system
  - No operating team necessary
  - Usually no other target groups
- Often no further authorization is needed
  - Usually only authentication is enabled
Enterprise Data Lake
- Tightly integrated into daily business
- Production system
  - Typically many target applications are based on this collection of data
  - High availability
  - Backup & recovery
Data Science vs. Enterprise
Data Lab
- Explorative approach
- Generating insights
- Work with production-like data
- Use sandboxes
- Data Scientists
- Experts from the specialist departments
- Goal: train models and algorithms
Data Factory
- Generating value from trained models and algorithms
- Automatic processing of data from the Data Lake
- Further development as in other software development products
- Makes use of trained models and algorithms
Deployment from Data Lab to Factory
- Transferred from Data Lab to Data Factory: trained algorithms, generated insights from analysis
- No data transfer from Lab to Factory
- Diagram labels: Refined Data, Data Refinery, Raw Data, DWH, further data sources
Conclusion
Why do we need a Data Lab?
- Where is the problem? Why should we do this?
  - "Playground" / sandbox for Data Scientists
  - No risk of breaking something in the EDL / Data Factory
  - No dependency on release cycles, etc.
  - "Free" choice of tools
- Made possible by Infrastructure as Code
  - Setting up entire environments in minutes
  - Very similar or identical setup to the EDL / Data Factory
3 Infrastructure as Code (IaC)
- What is IaC
- Advantages
Keyword – Infrastructure as Code
- Abbreviated IaC
- Process of managing and provisioning computers, VMs, and networks
- Version-controlled infrastructure definitions
- Repeatable and reliable
- Frameworks
  - HashiCorp Terraform
  - Puppet
  - SaltStack
  - Red Hat Ansible
  - …
Advantages of IaC
- Costs
  - Reduce costs by
    - Avoiding repetitive tasks
    - Avoiding mistakes
    - Reusing code / infrastructure
- Speed
  - Provisioning on multiple machines in parallel
  - Calculations / regex in code
- Risk
  - Reduces risk through
    - Better preparation
    - Idempotent approach
    - Prepared updates
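The idempotent approach means that running the same provisioning step twice leaves the system in the same state as running it once. A minimal sketch of the idea, with a hypothetical `ensure_line` helper and invented settings:

```python
def ensure_line(lines: list[str], wanted: str) -> bool:
    """Append `wanted` only if it is missing; return True if anything changed."""
    if wanted in lines:
        return False  # already in the desired state, nothing to do
    lines.append(wanted)
    return True


# Invented configuration content, for illustration only.
config = ["locale=en_US.UTF-8"]

first_run = ensure_line(config, "timezone=UTC")   # changes the config
second_run = ensure_line(config, "timezone=UTC")  # no further change
print(first_run, second_run, config)
```

The returned flag mirrors the "changed / unchanged" reporting that provisioning tools typically give for each task.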
Challenges for IaC
- A few new skills needed
  - Typical development workflows (e.g. GitFlow)
  - Some basic coding paradigms
    - JSON
    - Ruby
    - YAML
    - HCL
    - …
- Monitoring
  - Who is provisioning what, where, and how?
  - Legacy tracking via worksheet is not possible
  - → Solution: e.g. Ansible Tower / AWX
Reuse / reproducible result
Use playbooks / scripts / templates for all stages
- Stages: DEV, TEST, PROD, each with its own inventory
- The same playbook / script is applied in every stage
- Development of scripts / playbooks in the DEV environment
- Reliable and easier deployment in the next stages
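The stage pattern on this slide, one shared playbook combined with one inventory per stage, can be sketched as follows. This is plain Python, not Ansible syntax; the step names, host names, and VM sizes are all hypothetical.

```python
# One shared list of steps, reused unchanged in every stage (hypothetical names).
PLAYBOOK = ["create_vm", "configure", "copy_data"]

# One inventory per stage: only the inventory differs (invented values).
INVENTORIES = {
    "DEV": {"hosts": ["dev-lab-01"], "vm_size": "small"},
    "TEST": {"hosts": ["test-lab-01"], "vm_size": "medium"},
    "PROD": {"hosts": ["prod-lab-01", "prod-lab-02"], "vm_size": "large"},
}


def plan(stage: str) -> list[tuple[str, str, str]]:
    """Combine the shared playbook with the inventory of one stage."""
    inventory = INVENTORIES[stage]
    return [
        (host, step, inventory["vm_size"])
        for host in inventory["hosts"]
        for step in PLAYBOOK
    ]


for stage in INVENTORIES:
    print(stage, plan(stage))
```

Because the steps are identical across stages, anything proven in DEV runs the same way in TEST and PROD, with only the targets and sizes changing.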
Often seen like this
TEST environment, with DEV Team, Data Science Team, and Product Management involved:
- "Oh, this is our Data Science environment!"
- "Yes, I trained this model on the TEST environment first."
- "No, we can't use TEST for the new release now, the Data Scientists are using it!"
Better like this…
- PROD
- TEST
- Data Lab / Sandbox
- Fork
- Trained models
- New reports
Terraform (HashiCorp)
- Open-source software tool
- Released in 2014
- Code in HCL (HashiCorp Configuration Language) or JSON
- Terraform CLI to trigger Terraform actions
- Manages public / private cloud infrastructure
- Extensible with providers – AWS, Azure, OCI, vSphere
- Performs CRUD operations to reach the target state
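The last point, performing create, update, and delete operations until the target state is reached, can be illustrated by comparing a desired state with the current one. This is a conceptual sketch only and not Terraform's actual implementation; the resource names and attributes are invented.

```python
def plan_changes(current: dict, desired: dict) -> dict:
    """Work out which resources to create, update, or delete to reach `desired`."""
    return {
        "create": sorted(name for name in desired if name not in current),
        "update": sorted(
            name
            for name in desired
            if name in current and current[name] != desired[name]
        ),
        "delete": sorted(name for name in current if name not in desired),
    }


# Invented example states, for illustration only.
current_state = {"vm-lab-01": {"cpus": 2}, "vm-old": {"cpus": 1}}
desired_state = {"vm-lab-01": {"cpus": 4}, "vm-lab-02": {"cpus": 2}}

print(plan_changes(current_state, desired_state))
```

Once current and desired state are equal, the plan is empty, which is what makes repeated runs safe.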
Cloud-init
- Customizes the OS directly after cloning from a template
- Many public / private clouds and OS types are supported
- Well suited for "low-level stuff"
  - Set default locale
  - Set hostnames
  - Set up SSH keys
  - Mount some shared folders
- It can also install some software packages
- Handover point to e.g. Ansible
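As a rough illustration of such low-level settings, the sketch below renders a small cloud-config document as text. The keys `hostname`, `locale`, `ssh_authorized_keys`, and `packages` are standard cloud-config keys; the helper function, the host name, the key placeholder, and the package list are hypothetical, and values are written without any YAML escaping.

```python
def render_cloud_config(
    hostname: str, locale: str, ssh_keys: list[str], packages: list[str]
) -> str:
    """Render a minimal cloud-config document (values are not YAML-escaped)."""
    lines = [
        "#cloud-config",
        f"hostname: {hostname}",
        f"locale: {locale}",
        "ssh_authorized_keys:",
    ]
    lines += [f"  - {key}" for key in ssh_keys]
    lines.append("packages:")
    lines += [f"  - {package}" for package in packages]
    return "\n".join(lines) + "\n"


# Invented values, for illustration only.
user_data = render_cloud_config(
    hostname="datalab-01",
    locale="en_US.UTF-8",
    ssh_keys=["ssh-ed25519 AAAA-placeholder admin@example"],
    packages=["python3", "git"],
)
print(user_data)
```

Installing a package such as `python3` at this stage is one way to prepare the machine for a later handover to Ansible.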
Ansible (Red Hat)
- Open-source software, started by AnsibleWorks in 2012
- Very well supported on many Unix-like systems, but also on Windows
- Strengths: provisioning, configuration management, some degree of application deployment
- Agentless: works over temporary SSH connections
- Features
  - Inventory-based configuration of different stages
  - Ansible Vault: store sensitive data
  - Playbooks can be written idempotently
  - Ansible Tower / upstream project AWX: REST API and web-based console for Ansible
4 Demo
- Here comes a small sample…
Data Lab creation workflow
1. Admin chooses a template according to requirements
2. Environment is provisioned
3. Data Lab is ready to use
Data for the Analytics Sandbox:
- Pull data through APIs
- Push data into the Data Lab
Sample workflow (on premise)
1. Admin triggers / runs a playbook
2. VM creation
3. Add DNS entries / AD or LDAP users
4. Do some configuration
5. Copy data into the Data Lab
6. Notify Data Scientists
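The on-premise workflow can be read as an ordered pipeline in which every step builds on the previous one. The sketch below models that order in plain Python; it does not call any real infrastructure, and all function names and the lab name are hypothetical.

```python
from typing import Callable

Context = dict  # shared state handed from step to step


def create_vm(ctx: Context) -> None:
    ctx["log"].append(f"created VM {ctx['lab_name']}")


def add_dns_and_users(ctx: Context) -> None:
    ctx["log"].append("added DNS entries and AD/LDAP users")


def configure(ctx: Context) -> None:
    ctx["log"].append("applied configuration")


def copy_data(ctx: Context) -> None:
    ctx["log"].append("copied data into the Data Lab")


def notify(ctx: Context) -> None:
    ctx["log"].append("notified Data Scientists")


# The order of the list is the order of the workflow.
WORKFLOW: list[Callable[[Context], None]] = [
    create_vm,
    add_dns_and_users,
    configure,
    copy_data,
    notify,
]


def run_workflow(lab_name: str) -> list[str]:
    """Run all steps in order and return the resulting log."""
    ctx: Context = {"lab_name": lab_name, "log": []}
    for step in WORKFLOW:
        step(ctx)
    return ctx["log"]


print(run_workflow("datalab-01"))
```

In practice each function would correspond to a play or role in the playbook that the admin triggers.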
Sample workflow (Oracle Cloud)
1. Admin triggers the use of a predefined stack
2. Create VM instances and networks
3. Optional: add some firewall rules
4. Do some configuration
5. Copy data into the Data Lab
6. Notify Data Scientists
Summary
- First of all → don't use TEST as the Data Science environment
  - Use a dedicated Data Lab environment to innovate
- IaC is the way to build and destroy a Data Lab environment
  - Scalable and repeatable
  - Mix multiple clouds or on-premise environments
- Use the mix of roles to succeed: admins, developers, data scientists
- Increase satisfaction (company-wide)
  - More flexibility for Data Scientists
  - Matching processes for administration and development in the Data Factory
  - Speed up the innovation lifecycle
Thank you for your interest. We look forward to the exchange with you!
Fabian Hardt
Senior Consultant, Business Intelligence & Analytics
Kirchstraße 6, 51647 Gummersbach
Fabian.Hardt@opitz-consulting.com
+49 (0) 2261 6001-1045
WWW.OPITZ-CONSULTING.COM
Social: @OC_WIRE, OPITZCONSULTING, opitzconsulting, opitz-consulting-bcb8-1009116

Deck title: Automated Provisioning of a Data Lab Environment for Data Scientists (original: Automatisierte Provisionierung einer Data Lab Umgebung für Data Scientists)