Featured Project:
Marina Bay Sands Casino Resort, Singapore
Connecting teams project-wide
Big Data in a production environment:
Lessons Learnt
LAST Conference 2017
Mark Grebler - Aconex
What does Big Data mean to you?
Summary
• What is the Insights project
• Big Data for Data Science
• Big Data in a production, user-facing environment
• Lessons Learnt
• Problems still to solve
What’s an Aconex?
Pronounced: Ay-conn-ex
Highly flexible and customisable data model with low level concepts
=
useful for many types of projects
Aconex has Flexible data
The Insights Project
Highly flexible and customisable data model with low level concepts
=
Difficult to produce meaningful customer reports
Flexible Data Needs transformation
The Insights Project
What does it look like
The Insights Project
Typical Big Data Architectures
Insights architecture
Looks similar to the other architectures
But the differences exist
Quotes from Data Engineers interviewed:
● “We use the AWS console to deploy new infrastructure”
● “I add new hardware by buying a new box and connecting it to the network”
● “We deploy by copying the jar file to the cluster”
● “We don’t have any CI, I just build it on my box”
● “We test by running it over some data and ensuring it doesn’t crash”
● “We have some rudimentary tests”
What are the differences
Other Big Data Projects
● Internal client
● Simple authentication
● For Data Scientists
● Single environment
○ Sometimes 2 or 3
● Manual infrastructure management
● Sanity testing
● Manual integration
● Manual deployment
● Unrestricted data access
Insights Project
● External client
● Integrated authentication
● For end users
● Multiple environments
○ Due to data sovereignty (10 environments)
● Infrastructure as code
● Unit → end-to-end testing
● Continuous integration
● Single-step deployment
● Data access restrictions
It’s not always so black and white, but the left side represents quite a lot of other projects I’ve seen.
Lessons Learnt
● VPN to control data access
● Autoscaling application server
● Network independence
● Zero-downtime deployments with
automatic rollback
○ Elastic Beanstalk provides this
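
To make the idea of a zero-downtime deployment with automatic rollback concrete, here is a minimal sketch in Python. It is not how Elastic Beanstalk implements the feature; the function name, the dictionary keys and the health-check callback are all hypothetical and exist only for illustration. The point is that the new version is checked before traffic is switched, so a failed check leaves the live version untouched.

```python
from typing import Callable, Dict, Optional


def deploy_with_rollback(
    env: Dict[str, Optional[str]],
    new_version: str,
    health_check: Callable[[str], bool],
) -> Optional[str]:
    """Stage `new_version` and promote it only if it passes the health check.

    Returns the version that is live once the function finishes.
    All names here are illustrative assumptions, not a real AWS API.
    """
    # Stage the new version next to the live one; traffic is not touched yet.
    env["staged_version"] = new_version

    if health_check(new_version):
        # Healthy: switch traffic to the new version.
        env["live_version"] = new_version
    # Unhealthy: the previous live version keeps serving (automatic rollback).

    env["staged_version"] = None
    return env["live_version"]
```

A caller would pass a real health probe as `health_check`; in this sketch any function that returns True or False will do.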
Lessons Learnt: Infrastructure-as-code
● Must be easily reproducible because we need to do it 10+ times
● Automation of infrastructure management
○ Infrastructure is a core part of the Big Data project, so it must be treated as being as important as our
application code
○ Terraform is used to manage the infrastructure, including:
■ Networking and VPN management
■ Security
■ Provisioning VMs and other infrastructure
■ Replication and ingestion of data from Data Centres
■ Database Administration and Automation
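
The reproducibility requirement above can be sketched as follows. The project itself uses Terraform; this Python snippet only illustrates the underlying idea of deriving every environment from one declarative definition, so that creating the tenth environment is the same operation as creating the first. The region names, the `insights-` prefix and the address ranges are assumptions made for the example.

```python
import ipaddress
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EnvironmentSpec:
    """Hypothetical description of one deployable environment."""
    name: str
    region: str
    vpc_cidr: str


def build_environments(
    regions: List[str], base_cidr: str = "10.0.0.0/8"
) -> List[EnvironmentSpec]:
    """Derive one environment per region from a single definition.

    Each environment receives its own non-overlapping /16 network, carved
    out of `base_cidr` in order.
    """
    subnets = ipaddress.ip_network(base_cidr).subnets(new_prefix=16)
    return [
        EnvironmentSpec(name=f"insights-{region}", region=region, vpc_cidr=str(cidr))
        for region, cidr in zip(regions, subnets)
    ]
```

Because the output depends only on the input list, running the function twice with the same regions yields identical specifications, which is the property that makes an environment easy to recreate.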
Lessons Learnt: Access segregation
● Different accounts for testing and
production
● Separate VPCs for each environment
● Multiple user roles allow fine-grained
control of access
● VPN used as a further level to restrict
data access
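
The layering described above (per-environment separation, roles, then the VPN as a further gate) can be sketched as a single check. This is an illustration only: in practice these rules would live in the cloud provider's access controls rather than in application code, and the role fields and environment names below are assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Role:
    """Hypothetical user role; the field names are illustrative."""
    name: str
    environments: FrozenSet[str]
    can_read_customer_data: bool


def may_read_customer_data(role: Role, environment: str, on_vpn: bool) -> bool:
    """Allow access only when every layer agrees.

    The VPN acts as an extra gate on top of the role: a role that permits
    access is still refused when the request does not come through the VPN.
    """
    return (
        on_vpn
        and environment in role.environments
        and role.can_read_customer_data
    )
```

For example, a role limited to a test environment is refused in production even over the VPN, and a production role is refused when the request arrives from outside the VPN.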
Lessons Learnt: Integration and deployment
Continuous Integration
Once built, versioned artifacts are pushed to S3 buckets
Deployments
Ansible is used to roll out new versions of the
application and transformations
Infrastructure
Terraform controls the base infrastructure
● Deployments run in parallel across environments
● Docker image used for deployments to control
dependencies
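
The two ideas above, versioned artifacts and deployments that run in parallel across environments, can be sketched as below. The project uses Ansible for the actual rollout; this Python snippet only illustrates the shape of the process. The key layout, the `.zip` extension and the `deploy_one` callback are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def artifact_key(name: str, version: str) -> str:
    """Build a storage key that embeds the version (layout is an assumption)."""
    return f"{name}/{version}/{name}-{version}.zip"


def deploy_everywhere(
    environments: List[str],
    key: str,
    deploy_one: Callable[[str, str], str],
) -> Dict[str, str]:
    """Run `deploy_one(environment, key)` for every environment concurrently.

    Returns the result for each environment. An exception raised by any
    single deployment is re-raised when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(environments))) as pool:
        futures = {env: pool.submit(deploy_one, env, key) for env in environments}
        return {env: future.result() for env, future in futures.items()}
```

Because the version is part of the key, every environment is guaranteed to receive the same build, and an older version remains available if a rollout has to be repeated.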
Lessons Learnt: Automate Testing
● Big Data testing is hard
● Automated unit tests to ensure transformations are correct
○ We pair with our QA to generate the data, and validate the expected output for the unit tests
○ TDD-ish, but testing is often done after development
● Automated Integration tests using a large data set
○ To ensure regressions haven’t occurred
● Manual end-to-end sanity tests
○ This should be automated in the future
● Manual exploratory testing
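
A unit test for a transformation, in the style described above, might look like the following sketch. The transformation, the field names and the sample rows are invented for illustration; the pattern is a small hand-built input with an expected output agreed with QA in advance.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def documents_per_project(rows: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Hypothetical transformation: count documents per project.

    Rows with a missing or empty project id are ignored.
    """
    counts = Counter(row["project_id"] for row in rows if row.get("project_id"))
    return dict(counts)


def test_documents_per_project() -> None:
    # Hand-built input, including one malformed row.
    rows = [
        {"project_id": "p1", "doc_id": "d1"},
        {"project_id": "p1", "doc_id": "d2"},
        {"project_id": "p2", "doc_id": "d3"},
        {"project_id": "", "doc_id": "d4"},
    ]
    # Expected output validated up front, then asserted exactly.
    assert documents_per_project(rows) == {"p1": 2, "p2": 1}
```

Keeping the input this small makes a failure easy to diagnose; the large-data-set integration tests then cover regressions that only appear at volume.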
Problems to resolve
● Testing
○ Big Data testing is time consuming
■ Particularly around data generation
○ How to effectively automate testing of the infrastructure
○ How to automate end-to-end sanity testing
● Infrastructure
○ CI/CD with Terraform
○ So many moving parts make management difficult
● Ingestion and transformations
○ How to move from batch processing to incremental or streaming
○ Removing the database clones
● Effectively communicating to the business what we’re doing and why
○ Why are things so slow?
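
One common way to approach the batch-to-incremental question listed under ingestion and transformations is a watermark: remember the latest modification time already processed and pick up only newer rows on the next run. The sketch below shows that general technique, not the project's solution, which the talk leaves open. The `modified_at` field and the integer timestamps are assumptions.

```python
from typing import Dict, List, Tuple


def select_incremental(
    rows: List[Dict[str, int]], watermark: int
) -> Tuple[List[Dict[str, int]], int]:
    """Return the rows modified after `watermark` and the new watermark.

    The watermark only moves forward; when nothing is new it is returned
    unchanged.
    """
    fresh = [row for row in rows if row["modified_at"] > watermark]
    new_watermark = max((row["modified_at"] for row in fresh), default=watermark)
    return fresh, new_watermark
```

Each run then processes only the rows that changed since the previous run, rather than reprocessing the full data set.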


Editor's Notes

  • #6: Who's had a house built for them, built their own house, or organised a significant renovation? How many documents were needed? How many conversations were had? Think of the number of documents to build a skyscraper, or a refinery, etc.
  • #11: Looks similar. From the outside, no real differences.