© Hortonworks Inc. 2011
Hadoop Engineering Best Practices
Raja Aluri, Release Eng
Deepesh Khandelwal, Quality Eng
Ramya Sunil, Quality Eng
Agenda
• Source Mechanics
• Why do System Testing?
• Test Matrix
• Automated Testing Flow
• Test Planning
• Planning your own System Testing
• Q & A
Architecting the Future of Big Data
Apache-Hortonworks-Partner Source Mechanics
• Hortonworks Open Source Philosophy
• How we do Apache-first development
• How we incorporate fixes or features that have not made it into Apache yet
• How we integrate our partners' contributions into the source code
• Bookkeeping of the delta between Apache and Hortonworks
Apache-Hortonworks-Partner Source flow
[Figure: source flow diagram. Labels: Apache Git (Hadoop branch-2, Hadoop branch-2.4); Partner, HWX, ApacheRef, HDPRef and HDP repositories, each marked read-write or read-only; continuous merges; HDP Build CI; HDP package repo; HDP Maven repository; publish releases; QE workflow for testing.]
Issue Type      Course of Action
Normal issue    Patch in Apache first
Urgent issue    Patch in HWX repo first
Unit Testing
• Test individual parts of the program in isolation, white-box testing
• Homogeneous cluster, usually in-memory
• One configuration, usually one operating system with security disabled
• Limited dataset, usually a few kilobytes
[Figure: unit testing of components A, B and C, with question marks over the areas between them: DB interaction, concurrent user interaction, third-party connectors.]
System Testing
• Mimics production environment
– Multiple nodes in the cluster
– Multiple concurrent users
– Different workloads
• Multiple configurations to test
• Large dataset, more complex and richer
• Encompasses different types of testing
– Functional
– Performance, Stress and Reliability
– High Availability
– Backwards Compatibility
– Integration testing
– Third party connectors
– Upgrade testing
System Testing cont...
• Heterogeneous testing
– Cross version testing
– Cross operating system testing
– Hardware configs like Disk and CPU
– Security settings, level of encryption
Test Matrix
• Over 15,000 configurations to test in total!
OS
• CentOS
• SuSE
• Debian
• Ubuntu
• Windows
JDK
• Oracle JDK
• OpenJDK
• Different versions: 1.6.x, 1.7.x, 1.8.x
Security
• Disabled
• Enabled: MIT-only, AD-only, MIT-AD
• Ranger: enabled/disabled
Encryption
• Wire encryption: enabled/disabled
• Transparent Data Encryption: enabled/disabled
DB
• MySQL
• Oracle
• Postgres
• MSSQL
File system
• HDFS
• WASB
• Other vendor-specific file systems
Others
• Tez: enabled/disabled
• Slider apps vs. standalone
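To get a feel for how quickly such a matrix grows, the dimensions above can be enumerated mechanically. The following is a minimal Python sketch, not the speakers' tooling: the split into independent dimensions (for example, treating JDK vendor and JDK version separately) and the `is_valid` rule are assumptions made purely for illustration, so the raw product of the values listed here need not match the slide's figure.

```python
from itertools import product

# Dimension values copied from the slide. Treating each entry as an
# independent dimension is an assumption made for this sketch.
MATRIX = {
    "os": ["CentOS", "SuSE", "Debian", "Ubuntu", "Windows"],
    "jdk_vendor": ["Oracle JDK", "OpenJDK"],
    "jdk_version": ["1.6.x", "1.7.x", "1.8.x"],
    "kerberos": ["disabled", "MIT-only", "AD-only", "MIT-AD"],
    "ranger": [True, False],
    "wire_encryption": [True, False],
    "tde": [True, False],
    "db": ["MySQL", "Oracle", "Postgres", "MSSQL"],
    "filesystem": ["HDFS", "WASB", "other"],
    "tez": [True, False],
    "slider": [True, False],
}


def all_configurations(matrix):
    """Yield every combination of dimension values as a dict."""
    names = list(matrix)
    for values in product(*(matrix[name] for name in names)):
        yield dict(zip(names, values))


def is_valid(config):
    """Hypothetical pruning rule, shown only to illustrate the mechanism.

    It is NOT a rule from the talk; replace it with the constraints
    that actually apply to your stack.
    """
    if config["kerberos"] == "disabled" and config["wire_encryption"]:
        return False
    return True


configs = list(all_configurations(MATRIX))
valid = [c for c in configs if is_valid(c)]
print(len(configs), "raw combinations;", len(valid), "after pruning")
```

Keeping the pruning rules in one function makes it easy to see, and to review, which combinations are deliberately left untested.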
Automated Testing Flow
[Figure: automated testing flow. Labels: Apache repos and internal commits; build job (continuous integration); publishing builds to the staging repo; QE deploy trigger; provision VMs; deploy HDP stack (installer deploying bits from the staging repo to the test cluster); test setup and execution; test analysis; bug tracking system.]
Test Planning
20+ components in the HDP stack and growing!
[Figure: inputs to the test plan: internal developers, Apache JIRAs and community forums, product management, support tickets.]
Planning your own QATS
Typical user scenarios
• Fresh install
• Upgrade stack, going from an earlier release to a newer one
• Migration, changing distributions
• Applying changes to an existing cluster
– Upgrading hardware (CPU, memory, disks)
– Changing dependent software such as the OS or JDK
– Changing security settings, such as turning on Kerberos or encryption
– Changing component configs in *-site.xml, enabling HA
Planning your own QATS
E2E automation
• Preparation phase
– Collect requirements on the stack and workload
– Identify appropriate hardware
• CI development phase
– Build an in-house CI system for deployment and testing
• Testing phase
– Build basic acceptance tests
– End-to-end automation for your application
Preparation Phase
• Collect the stack requirements
– Identify all the stack components that will be installed, including third-party applications and connectors
– Identify the installer
– Identify configs
• Hardware selection
– Should be scaled appropriately to mimic production environment
– Prefer a multi-node cluster, with component services distributed across nodes, over a single node
• Collect workload information
– Use actual workload whenever possible
– If not, simulate the workload with the available tools: use Rumen to obtain a job trace from existing clusters, and Gridmix to generate the workload
– Data set size and complexity
– Number of concurrent users
CI Development phase
• Implement a CI system
– Modularize the CI system, e.g. individual Jenkins jobs for provision, deploy and test
• Determine the cadence of testing
• Establish reporting
[Figure: pipeline stages in order: Provision cluster, Deploy, Test.]
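The modular provision/deploy/test structure described above can be sketched in a few lines of Python. This is only an illustration of the idea of independent stages plus a report, not a Jenkins configuration: the `Stage` and `Pipeline` classes and the three placeholder stage functions are hypothetical names invented for this sketch, and a real system would call out to its provisioning and installer tooling instead.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Stage:
    """One independently runnable step (hypothetical helper type)."""
    name: str
    run: Callable[[Dict], Dict]  # receives shared context, returns updates


@dataclass
class Pipeline:
    stages: List[Stage]
    report: List[str] = field(default_factory=list)

    def execute(self, context: Dict) -> bool:
        """Run stages in order; stop at the first failure and record it."""
        for stage in self.stages:
            try:
                context.update(stage.run(context))
                self.report.append(f"{stage.name}: OK")
            except Exception as exc:
                self.report.append(f"{stage.name}: FAILED ({exc})")
                return False
        return True


# Placeholder stage bodies; real ones would drive actual tooling.
def provision(ctx: Dict) -> Dict:
    return {"nodes": [f"node{i}" for i in range(ctx["node_count"])]}


def deploy(ctx: Dict) -> Dict:
    if not ctx["nodes"]:
        raise RuntimeError("no nodes to deploy to")
    return {"deployed": True}


def run_tests(ctx: Dict) -> Dict:
    return {"tests_passed": ctx["deployed"]}


def build_pipeline() -> Pipeline:
    return Pipeline([
        Stage("provision", provision),
        Stage("deploy", deploy),
        Stage("test", run_tests),
    ])


pipeline = build_pipeline()
ok = pipeline.execute({"node_count": 3})
print(ok, pipeline.report)
```

Because each stage only reads and updates a shared context, any one of them can be rerun or replaced on its own, which is the point of splitting the CI system into separate jobs.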
Testing Phase
• Basic Acceptance Tests
– Basic service check for individual deployed components
– Basic acceptance tests to validate integrations
• Establish a baseline to track the performance of pipeline components in the future
• Compatibility tests (including apps, third-party connectors, dashboards, etc.)
• E2E automation to simulate production workloads
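One simple way to use a baseline is to compare each new run against it and flag components that have slowed down beyond a tolerance. The Python sketch below assumes runtimes in seconds keyed by component name; the function name `find_regressions`, the 10% default tolerance and the sample numbers are all made up for illustration.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.10,
) -> List[str]:
    """Return components whose runtime exceeds the baseline by more than
    `tolerance` (a fraction, e.g. 0.10 for 10%)."""
    regressions = []
    for component, base_seconds in baseline.items():
        now_seconds = current.get(component)
        if now_seconds is None:
            continue  # component was not measured in this run
        if now_seconds > base_seconds * (1 + tolerance):
            regressions.append(component)
    return regressions


# Illustrative numbers only (seconds per pipeline component).
baseline = {"ingest": 120.0, "transform": 300.0, "export": 60.0}
current = {"ingest": 125.0, "transform": 360.0, "export": 58.0}
print(find_regressions(baseline, current))  # → ['transform']
```

The tolerance absorbs normal run-to-run noise; how wide it should be depends on how stable the test cluster is.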
Q & A
Thank You!
