Automated Provisioning of a Data Lab Environment for Data Scientists

© OPITZ CONSULTING 2020
Information classification: Public
 Surprisingly more possibilities
Automation with Infrastructure as Code
Data Lab Environment for Data Scientists
Fabian Hardt,
Senior Consultant, Big Data / Analytics

Figures
 Turnover in million euros: 49.0 (2017), 55.5 (2018), 56.0 (2019)
 Employees: 455 (2017), 482 (2018), 555 (2019)
 Average sector distribution: Retail / Logistics / Services, Chemical & pharma, Industry, Finance, Public, Other
 91% customer satisfaction, recommendation: NPS 40.26
 > 600 active customers
 > 130 actively managed service customers
 > 5000 databases in support
 > 4000 systems in support
 > 99.5% SLA compliance
 > 90% of our customers' projects use agile methods
"Sound management in conjunction with a long-term customer-oriented strategy guarantees financial success." (Torsten Schlautmann)

Agenda
1 Data Lake
2 Data Lab vs. Data Factory
3 Infrastructure as Code
4 Demo
5 Summary

1 Data Lake
 Purpose
 Requirements
 Architecture

Data Lake - Purpose
 Primary task: centralized data provisioning
 All data of a company is collected in one centralized data platform
 External data is collected as well (social media, weather, stock market prices, etc.)
 Integrates different data sources, similar to DWH (data warehouse) systems
 Stores raw data before its intended purpose is known
 Structured data
 Unstructured data
 Suitable for very large amounts of data
 Mostly implemented with big data technologies

Data Lake – Enterprise Requirements (Data Factory)
 Usability of stored Data
 Searchability of data
 Evaluability of data
 Availability of data
 Data Security / Data protection regulations
 Authentication of users
 Authorization of users
 Data Governance
 Quality Assurance of data
 Governs the compliance requirements
 In short: users should only see the data that they are allowed to see
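
The rule that users only see the data they are allowed to see can be illustrated with a minimal Python sketch. It assumes a simple role-based model; the names `Dataset`, `User`, `visible_datasets` and the sample catalog are hypothetical and not taken from the slides or from any specific data lake product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    name: str
    allowed_roles: frozenset  # roles that may read this dataset (assumption)


@dataclass
class User:
    name: str
    authenticated: bool
    roles: set = field(default_factory=set)


def visible_datasets(user, catalog):
    """Return the names of the datasets this user is allowed to see."""
    # Authentication comes first: unauthenticated users see nothing.
    if not user.authenticated:
        return []
    # Authorization: at least one of the user's roles must be allowed.
    return [d.name for d in catalog if user.roles & d.allowed_roles]


# Hypothetical sample catalog for illustration only.
catalog = [
    Dataset("sensor_raw", frozenset({"engineer", "data_scientist"})),
    Dataset("hr_salaries", frozenset({"hr"})),
]
alice = User("alice", authenticated=True, roles={"data_scientist"})
print(visible_datasets(alice, catalog))
```

In a real Data Factory this check would normally be enforced by the platform's own security layer rather than by application code.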

Core components of a Data Lake
Refined Data
- quality-assured data
- typically data that could transition into a classical DWH
Data Refinery
- preprocessing area
Raw Data
- sensor data
- streaming
- social media
- documents
- images
- …
Metadata

2 Data Lab vs. Data Factory
 Purpose
 Requirements
 Architecture

Data Lab - Purpose
 Develop processes and algorithms to gain insights from data
 Development area for Data Scientists
 Allows Data Scientists to install and use any software of their choice
 Allows Data Scientists to run experimental processes with external data
 Allows Data Scientists to manage all data and processes

Data Lab - Requirements
 Should give Data Scientists full freedom
 Use all kinds of technologies and software (in containers)
 Use data as close as possible to real enterprise data
 Use data from external sources
 Should meet all requirements of a software development environment
 Software artifacts should be runnable in the Data Factory
 Should prevent Data Scientists from seeing data they are not allowed to see
 Should ensure that the Data Lake does not get corrupted
 Should ensure that even resource-hungry processes do not affect the Data Lake

Data Lab – Requirements implementation
 Data Lab as a sandbox dedicated to Data Scientists
 No authorization needed inside the sandbox
 Gives Data Scientists full freedom
 Access to the Data Lake to use data from there
 Read access at most
 Sometimes only working on copies of data from the Data Lake
 Sandbox architecture prevents the Data Lab from affecting Data Lake processes

Data Lab - Architecture
Refined Data
- quality-assured data
- typically data that could transition into a classical DWH
Data Refinery
- preprocessing area
Raw Data
- sensor data
- streaming
- social media
- documents
- images
- …
Metadata
Analytics Sandbox
- External Data
- Data from Lake

Data Factory - Requirements
 New models and algorithms should be easy to integrate
 Established algorithms should be easy to update
 Insights found should be integrated into the Data Lake so that they can be evaluated
 Challenge: designing an extensible data model
 Individual Data Factory components should not affect other Data Lake processes
 Must be compliant

Data Science vs. Enterprise
Data Lab
 Looks for business insights in a huge amount of data
 Mainly used to train algorithms
 Not a production system
 No operations team necessary
 Usually no other target groups
 Often no further authorization is needed
 Usually only authentication is enabled
Enterprise Data Lake
 Tightly integrated into daily business
 Production system
 Typically many target applications are based on this collection of data
 High availability
 Backup & recovery

Data Science vs. Enterprise
Data Lab
 Exploratory approach
 Generates insights
 Works with production-like data
 Uses sandboxes
 Data Scientists
 Experts from the specialist department
 Goal: train models and algorithms
Data Factory
 Generates value from trained models and algorithms
 Automatic processing of data from the Data Lake
 Further development as in other software development products
 Makes use of trained models and algorithms

Deployment from Data Lab to Factory
Diagram: Data Lab and Data Factory
 From Data Lab to Data Factory: trained algorithms, generated insights from analysis
 Data Lake layers: Raw Data, Data Refinery, Refined Data
 Also shown: DWH, further data sources
 No data transfer from Lab to Factory

Conclusion
Why do we need a Data Lab?
 What is the problem? Why should we do this?
 "Playground" / sandbox for Data Scientists
 No risk of breaking anything in the EDL (Enterprise Data Lake) / Data Factory
 No dependency on release cycles, etc.
 "Free" choice of tools
 Made possible by Infrastructure as Code
 Setting up entire environments in minutes
 Very similar or identical setup to the EDL / Data Factory

3 Infrastructure as Code (IaC)
 What is IaC
 Advantages

Keyword – Infrastructure as Code
 Abbreviated as IaC
 Process of managing and provisioning computers, VMs, and networks
 Version-controlled infrastructure definitions
 Repeatable and reliable
 Frameworks
 HashiCorp Terraform
 Puppet
 SaltStack
 Red Hat Ansible
 …

Advantages of IaC
 Costs
 Reduce costs by
 Avoiding repetitive tasks
 Avoiding mistakes
 Reuse code / infrastructure
 Speed
 Provisioning on multiple machines in parallel
 Calculations / regex in code
 Risk
 Reduces the risk through
 Better preparation
 Idempotent approach
 Prepared updates
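
The idempotent approach mentioned above can be sketched in a few lines of Python: running the same provisioning step twice leaves the system in the same state, and the second run changes nothing. This is only an illustration; the function `ensure_packages` and the in-memory `host` dictionary are hypothetical stand-ins for a real machine and a real IaC tool.

```python
def ensure_packages(state, wanted):
    """Bring `state` to the desired package set and return what was changed."""
    installed = state.setdefault("packages", set())
    # Only packages that are not present yet cause a change.
    missing = set(wanted) - installed
    installed |= missing  # in-place update of the host state
    return sorted(missing)


# `host` is an in-memory stand-in for a real machine (assumption).
host = {}
first_run = ensure_packages(host, ["jupyter", "python3"])
second_run = ensure_packages(host, ["jupyter", "python3"])
print(first_run)   # the first run installs both packages
print(second_run)  # the second run has nothing left to do
```

Because each run describes the target state instead of a sequence of commands, repeating it after a partial failure is safe.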

Challenges for IaC
 A few new skills needed
 Typical development workflows (e.g. GitFlow)
 Some basic coding paradigms
 JSON
 Ruby
 YAML
 HCL
 …
 Monitoring
 Who is provisioning what, where, and how?
 Legacy tracking via spreadsheet is no longer possible
 → Solution: e.g. Ansible Tower / AWX

Reuse / reproducible result
Use playbooks / scripts / templates for all stages
 Stages: DEV, TEST, PROD (one inventory per stage)
 The same playbook / script is used in every stage
 Development of scripts / playbooks in the DEV environment
 Reliable and easier deployment in the next stages
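
The reuse idea, one playbook shared by all stages plus one inventory per stage, can be sketched in Python as follows. This is not Ansible syntax; the template text, host names, and sizing values are invented for illustration.

```python
from string import Template

# One shared "playbook" for all stages (hypothetical content).
PLAYBOOK = Template("provision $vm_count VM(s) with $memory_gb GB RAM on $hosts")

# One inventory per stage; only these values differ between stages (assumed values).
INVENTORIES = {
    "DEV": {"hosts": "dev-lab", "vm_count": 1, "memory_gb": 8},
    "TEST": {"hosts": "test-lab", "vm_count": 2, "memory_gb": 16},
    "PROD": {"hosts": "prod-lab", "vm_count": 4, "memory_gb": 64},
}


def render(stage):
    """Combine the shared playbook with the inventory of the given stage."""
    return PLAYBOOK.substitute(INVENTORIES[stage])


for stage in ("DEV", "TEST", "PROD"):
    print(stage, "->", render(stage))
```

Since the logic lives in the shared template, whatever was developed and tested in DEV is exactly what runs in TEST and PROD, only with different parameters.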

Often seen like this
One shared TEST environment
 "Oh, this is our Data Science environment!"
 "Yes, I trained this model on the TEST environment first."
 "No, we can't use TEST for the new release now, the Data Scientists are using it!"
 Parties involved: DEV team, Data Science team, product management

Better like this…
PROD
TEST
Data Lab / Sandbox
Fork
Trained models
New Reports

Terraform (HashiCorp)
 Open source software tool
 Released 2014
 Code in HCL (HashiCorp Configuration Language) or JSON
 Terraform CLI to trigger Terraform actions
 Manages public / private cloud infrastructure
 Extensible with providers, e.g. AWS, Azure, OCI, vSphere
 Performs CRUD operations to reach the target state
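
How CRUD operations can be derived from a target state is shown in the following conceptual Python sketch. It is not Terraform's actual implementation (which also resolves dependencies between resources); the functions `plan` and `apply` and the resource names are hypothetical.

```python
def plan(current, desired):
    """Compare two {resource_name: config} mappings and list the required actions."""
    actions = []
    for name in sorted(desired.keys() - current.keys()):
        actions.append(("create", name))
    for name in sorted(desired.keys() & current.keys()):
        if desired[name] != current[name]:
            actions.append(("update", name))
    for name in sorted(current.keys() - desired.keys()):
        actions.append(("delete", name))
    return actions


def apply(current, desired):
    """Execute the plan on an in-memory state and return the new state."""
    state = dict(current)
    for action, name in plan(current, desired):
        if action == "delete":
            del state[name]
        else:  # create and update both write the desired config
            state[name] = desired[name]
    return state


# Hypothetical resources for illustration only.
current = {"vm-lab-1": {"cpus": 2}, "vm-old": {"cpus": 1}}
desired = {"vm-lab-1": {"cpus": 4}, "net-lab": {"cidr": "10.0.0.0/24"}}
print(plan(current, desired))
new_state = apply(current, desired)
print(plan(new_state, desired))  # nothing left to do once the target state is reached
```

Reading the current state and comparing it with the definition is the "R" in CRUD; the remaining actions follow from the difference.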

Cloud-init
 Customizes the OS directly after cloning from a template
 Supports many public and private clouds and OS types
 Perfect for "low-level stuff"
 Set default locale
 Set hostnames
 Set up SSH keys
 Mount some shared folders
 It can also install some software packages
 Handover point to e.g. Ansible

Ansible (Red Hat)
 Open source software, founded by AnsibleWorks in 2012
 Very well supported on many Unix-like systems, and also on Windows
 Strengths: provisioning, configuration management, some kind of application development
 Agentless: works over temporary SSH access
 Features
 Inventory-based configuration of different stages
 Ansible Vault: stores sensitive data
 Playbooks can be written idempotently
 Ansible Tower / upstream project AWX: REST API and web-based console for Ansible

4 Demo
 Here comes a small sample…

Data Lab creation workflow
1 Admin chooses a template according to requirements
2 Environment is provisioned
3 Data Lab (Analytics Sandbox) is ready to use
 Pull data through APIs
 Push data into the Data Lab

Sample workflow (on premise)
Admin triggers a playbook run, which performs these steps:
1 Create the VM
2 Add DNS entries / AD or LDAP users
3 Do some configuration
4 Copy data into the Data Lab
5 Notify the Data Scientists
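
The on-premise workflow can be pictured as an ordered list of steps that a single trigger runs from start to finish. The Python sketch below only mimics that control flow: every step is a stub that records what it would do, and all function and field names are hypothetical. In the setup described on the slides, these steps would be tasks in an Ansible playbook.

```python
def create_vm(lab):
    lab["vm"] = f"{lab['name']}-vm"  # stub: no real VM is created


def add_dns_and_users(lab):
    lab["dns"] = f"{lab['name']}.lab.example"  # stub: hypothetical DNS name
    lab["accounts"] = list(lab["users"])       # stub for AD or LDAP users


def configure(lab):
    lab["configured"] = True


def copy_data(lab):
    lab["data"] = "copy of selected Data Lake data"


def notify(lab):
    lab["notified"] = list(lab["users"])


# The order mirrors the workflow on the slide.
STEPS = [create_vm, add_dns_and_users, configure, copy_data, notify]


def run_playbook(name, users):
    """Run all steps in order and keep a log of what was executed."""
    lab = {"name": name, "users": list(users), "log": []}
    for step in STEPS:
        step(lab)
        lab["log"].append(step.__name__)
    return lab


lab = run_playbook("lab01", ["alice", "bob"])
print(lab["log"])
```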

Sample workflow (Oracle Cloud)
Admin triggers the use of a predefined stack, which performs these steps:
1 Create VM instances and networks
2 Optional: add some firewall rules
3 Do some configuration
4 Copy data into the Data Lab
5 Notify the Data Scientists

Summary
 First of all: don't use TEST as the Data Science environment
 Use a dedicated Data Lab environment to innovate
 IaC is the way to build and destroy a Data Lab environment
 Scalable and repeatable
 Mix multiple clouds or on-premise environments
 Use the mix of roles to succeed – admins, developers, data scientists
 Increase satisfaction (company-wide)
 More flexibility for Data Scientists
 Matching processes for administration and development in Data Factory
 Speed up innovation lifecycle

 Surprisingly more possibilities
@OC_WIRE
OPITZCONSULTING
opitzconsulting
opitz-consulting-bcb8-1009116
WWW.OPITZ-CONSULTING.COM
Thank you for your interest, we look forward to talking with you!
Fabian Hardt
Senior Consultant,
Business Intelligence & Analytics
Kirchstraße 6
51647 Gummersbach
Fabian.Hardt@opitz-consulting.com
+49 (0) 2261 6001-1045
