© OPITZ CONSULTING 2018
Information classification: Public
Überraschend mehr Möglichkeiten
From Theory to Practice: Big Data Stories from the Field
Matthias Diekstall, Roland Wammers, Manuel Marowski
Agenda
1. DWH Modernization with AWS BigData
2. Advanced Analytics & Complex Event Processing at congstar
3. Stream Analytics & Machine Learning with AWS OC Quickstarter
1. DWH Modernization with AWS BigData at an Insurance Company
- Once upon a Time …
- Defined Targets
- Challenges
- Our Proposal
- Technical Implementation
- … and they lived happily ever after
Once upon a Time …
- Mid-sized insurance company
  - 6,000 employees
  - 4 M clients
  - 14 M contracts
  - EUR 3.2 B in revenue
- Enterprise DWH established
  - Standard reporting in place
- Data mining in a few departments
  - Mostly using MS Excel
  - Some desktop usage of R
Defined Targets
- Get a feeling for new technologies (Hadoop ecosystem)
- Learn their approach to data processing
- Low investment
- "Big Data Test Drive"
- Increase flexibility for data sources
- Enable self-service for departments on a larger scale
Challenges
- No tangible use case initially
- No decision regarding products/license model
- No good grasp of the fundamental concepts of Big Data technologies
- Few resources for driving this project
- No hardware available (short-term)
- Direct connectivity to source systems questionable
Our Proposal
- Quick start with a cloud-based solution
- Start small and allow for growth
- Allow a wide variety of technologies without having to dedicate resources to administration and operation
- To be more precise:
  - Prepare environment for easy startup
  - Train/coach employees in essential aspects
  - Use AWS technologies
Technical Implementation
- AWS IAM for user management
- AWS S3 for data storage
- AWS EMR as the basis for data processing
  - Hive
  - Pig
  - Spark
  - Python
- Zeppelin as graphical frontend
  - Augmented with RStudio
- Mini tutorials for users
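
To make the IAM and S3 items above more concrete, here is a minimal sketch of a per-user policy document that restricts a user to their own prefix in a shared bucket. The talk does not show the policies that were actually used; the helper name `user_prefix_policy`, the bucket name and the `users/<name>/` layout are all hypothetical.

```python
import json


def user_prefix_policy(bucket: str, user: str) -> dict:
    """Build an IAM policy document limiting a user to one S3 prefix.

    The bucket name and the users/<name>/ layout are assumptions made
    for illustration only.
    """
    prefix = f"users/{user}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing only the keys under the user's own prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Allow reading and writing objects under that prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


# Hypothetical bucket and user, rendered as the JSON that IAM expects.
policy_json = json.dumps(user_prefix_policy("bigdata-testdrive", "alice"), indent=2)
```

The resulting JSON could be attached to a user or group; keeping one prefix per user is only one possible way to separate departments in a shared test environment.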
AWS Mini tutorials for users
… and they lived happily ever after
- Results
  - Targets achieved at minimal cost (< $500 in ~3 months)
  - Competency development
  - Better understanding of "how it works"
- Lessons learned
  - Focus on as few tools as possible
  - Create simple step-by-step tutorials
  - Even a hypothetical use case is better than none
2. Advanced Analytics & Complex Event Processing at congstar
- First Thoughts
- Creating the Base
- Working with the Data
- First Steps to Advanced Analytics
congstar GmbH
- Subsidiary of Telekom Deutschland GmbH
- Founded in July 2007
- Sells mobile contracts and DSL
- Over 4,500,000 customers
Motivation
- Understand the user better
- Improve the user experience
- Enhance existing systems
- Be prepared for future requirements
- Create new content in a reasonable time
Challenges
- Building a big data system for advanced analytics and complex event processing in AWS
- Finding the right technologies in the Hadoop ecosystem
- Finding suitable AWS services
- Keeping the costs low
- Provisioning
- Replacing old systems with new technology
- Securing data transfer between on-prem and AWS
- Living agile
Infrastructure as code
- Testing resources and services via the AWS Management Console
- Creating CloudFormation templates
  - Infrastructure as code
  - Create stacks for the development, test and production systems
- Working with stacks
  - Adjustments made in the code
  - Diff of old and new code
  - Rollback function in case of error
- Establishing a secure VPN connection
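
The template-and-diff workflow above can be sketched in a few lines. This is an illustration under stated assumptions, not the templates used at congstar: the resource name `DataLakeBucket`, the bucket name pattern and both helper functions are hypothetical. The sketch builds one template with an `Environment` parameter, so the same code could back the development, test and production stacks, and prints a diff between two template versions before a stack update.

```python
import difflib
import json

ENVIRONMENTS = ("development", "test", "production")


def build_template(versioning: bool = False) -> dict:
    """Return a minimal CloudFormation template with one S3 bucket."""
    bucket_properties = {
        # Hypothetical naming pattern; ${Environment} is resolved by Fn::Sub.
        "BucketName": {"Fn::Sub": "example-datalake-${Environment}"},
    }
    if versioning:
        bucket_properties["VersioningConfiguration"] = {"Status": "Enabled"}
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "Environment": {
                "Type": "String",
                "AllowedValues": list(ENVIRONMENTS),
            }
        },
        "Resources": {
            "DataLakeBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": bucket_properties,
            }
        },
    }


def template_diff(old: dict, new: dict) -> str:
    """Render a unified diff between two template versions."""
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return "\n".join(
        difflib.unified_diff(old_lines, new_lines, "old", "new", lineterm="")
    )


# Show what would change if bucket versioning were switched on.
print(template_diff(build_template(), build_template(versioning=True)))
```

Keeping the template in version control gives the same diff for free; the point of the sketch is only that a change to the infrastructure becomes a reviewable change to code.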
Overview of the basic Infrastructure
Collecting and loading data into S3
- Data transfer
  - Initial connection can only be established from the on-prem network
  - An on-prem solution is needed to transfer data into S3
- Apache NiFi
  - Web UI
  - Schedule flows
  - No programming skills needed
  - Limited to the available processors
  - Format: CSV, Avro
Process data
- Using Spark (Scala)
  - Fast data processing
  - Needs implementation
  - Format: Parquet or Avro (saves space, time and money)
- Organize the data
  - Layer
  - Partitions
  - Purpose
  - Source
  - …
Using spot instances
- Data-backup capabilities
- Set the maximum bid price you are willing to pay
- Saves time and money
- Cons:
  - You lose the instances when the spot price exceeds your maximum price
  - 2 minutes' notice to save your data
- Hybrid model for Hadoop
  - Master and 1/3 of the workers on on-demand instances
  - Rest on spot instances
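
The hybrid model above amounts to a small piece of arithmetic plus a cluster definition. The sketch below computes the split and returns instance groups in the shape used by the EMR `RunJobFlow` API (`Instances.InstanceGroups`). Several details are assumptions not stated in the talk: rounding the on-demand share up, placing on-demand workers in a CORE group and spot workers in a TASK group, and the group names.

```python
import math


def hybrid_instance_groups(workers: int, instance_type: str, bid_price: str) -> list:
    """Split workers so that about one third run on on-demand instances."""
    if workers < 1:
        raise ValueError("at least one worker is required")
    on_demand = math.ceil(workers / 3)  # rounding up is an assumption
    spot = workers - on_demand
    groups = [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Assumption: on-demand workers form the CORE group.
        {"Name": "core-on-demand", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": on_demand},
    ]
    if spot > 0:
        groups.append(
            # Assumption: spot workers form a TASK group, so losing them
            # removes compute capacity only.
            {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
             "BidPrice": bid_price,  # maximum price in USD, passed as a string
             "InstanceType": instance_type, "InstanceCount": spot}
        )
    return groups


# Nine workers: three on-demand, six on spot (instance type is hypothetical).
for group in hybrid_instance_groups(9, "m5.xlarge", "0.10"):
    print(group["Name"], group["Market"], group["InstanceCount"])
```

With this split, a spot interruption shrinks the cluster but leaves the master and the on-demand workers running.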
Making data available with SQL
- Create a Glue catalog with a Glue crawler
  - Scans all subfolders of an S3 path
  - Tries to recognize the right format
  - Classifies according to the file type
- Glue catalog
  - Used as Hive metastore on an EMR cluster
  - Used in Athena for ad hoc analytics
- Not all classifiers are perfect
  - Manual adjustments of the crawler are required
  - Manual adjustments of the table definitions are required
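
When a crawler misclassifies a dataset, one possible manual adjustment is to define the table explicitly. The talk does not say how the adjustments were made, so the following is only a sketch: it renders a `CREATE EXTERNAL TABLE` statement of the kind Athena and Hive accept for Parquet data, and the helper name, database, table, columns and S3 location are all hypothetical.

```python
def create_table_ddl(database, table, columns, partitions, location):
    """Render a CREATE EXTERNAL TABLE statement for Parquet data in S3.

    columns and partitions are lists of (name, type) pairs.
    """
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(f"{name} {dtype}" for name, dtype in partitions)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table definition for a date-partitioned dataset.
ddl = create_table_ddl(
    "analytics",
    "contracts",
    [("contract_id", "string"), ("amount", "double")],
    [("year", "int"), ("month", "int"), ("day", "int")],
    "s3://example-datalake/raw/crm/contracts/",
)
print(ddl)
```

Because Athena and the EMR Hive metastore can share the Glue catalog, a table defined once this way would be visible to both.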
Testing Exasol on AWS Marketplace
- Starting Exasol on an EC2 instance
  - Using an EBS-backed instance
- Testing various instances
  - Duplicating the instance for more freedom in testing
  - Testing different server types/sizes
- Testing licensed software (AWS Marketplace) before buying an expensive license
Amazon SageMaker
- JupyterHub
- Python-based API
- Focus on developing, learning, testing and distributing ML models
- Easy switching between several algorithms
Outlook
- Combine Exasol with ML models created by SageMaker
3. Stream Analytics & Machine Learning with AWS OC Quickstarter
- Use case
- DWH offloading
- Architectural overview
- The data flow
- Industrial use case
Use case: Twitter Stream Analytics
[Diagram labels: Twitter; Streaming Data; Machine Learning sentiment analysis]
DWH Offloading
[Diagram labels: Source; DWH; Integration Layer; Enterprise Layer; User View Layer]
DWH Offloading
[Diagram labels: Data; Integration Layer; Enterprise Layer; Offload; Refined Data Lake; User View Layer; ETL]
Advantages of DWH Offloading
- Cost savings through outsourcing to low-cost storage space
- Combining structured data with unstructured data
Technologies used
- Scala
- Hive, Oozie, Kafka, Spark, Sqoop
  ➢ Stream processing
  ➢ DWH offloading
  ➢ Scheduling
- Spark.ML
  ➢ Sentiment analysis
- AWS
  ➢ Infrastructure / Hadoop / HDFS / S3 / data lake
- ELK Stack (Elasticsearch, Logstash, Kibana)
  ➢ Visualization / indexed data access
Industrial use cases
- Predictive maintenance
- Real-time error detection in production processes
- Dynamic evaluation of component quality
Contact us!
WWW.OPITZ-CONSULTING.COM
@OC_WIRE, OPITZCONSULTING, opitzconsulting
Matthias Diekstall
Developer
+49 201 892994-1753
Matthias.Diekstall@opitz-consulting.com
Roland Wammers
Solution Architect
+49 201 892994-1757
Roland.Wammers@opitz-consulting.com
Manuel Marowski
Developer
+49 201 892994-1748
Manuel.Marowski@opitz-consulting.com


Analytics Web Day | From Theory to Practice: Big Data Stories from the Field
