Hadoop:
Big Data Stacks validation w/ iTest
How to tame the elephant?

 Konstantin Boudnik, Ph.D., Cloudera Inc.
 cos@apache.org
 Hadoop Committer, Pig Contributor
 Roman Shaposhnik, Cloudera Inc.
 rvs@cloudera.com
 Hadoop Contributor, Oozie Committer
 Andre Arcilla
 belowsov@gmail.com
 Hadoop Integration Architect
This content is licensed under
Creative Commons BY-NC-SA 3.0
Agenda
● Problem we are facing
   ○ Big Data Stacks
   ○ Why validation
● What is "Success" and the effort to achieve it
● Solutions
   ○ Ops testing
   ○ Platform certification
   ○ Application testing
● Stack on stack
   ○ Test artifacts are First Class Citizen
   ○ Assembling validation stack (vstack)
   ○ Tailoring vstack for target clusters
   ○ D3: Deployment/Dependencies/Determinism
Not on Agenda
● Development cycles
   ○ Features, fixes, commits, patch testing
● Release engineering
   ○ Release process
   ○ Package preparations
   ○ Release notes
   ○ RE deployments
   ○ Artifact management
   ○ Branching strategies
● Application deployment
   ○ Cluster update strategies (rolling vs. offline)
   ○ Cluster upgrades
   ○ Monitoring
   ○ Statistics collection
● We aren't done yet, but...
What's a Big Data Stack anyway?

        Just a base layer!
What is a Big Data Stack?

         Guess again...
What is a Big Data Stack?


          A bit more...
What is a Big Data Stack?


          A bit or two more...
What is a Big Data Stack?


           A bit or two more + a bit = HIVE
What is a Big Data Stack?
What is a Big Data Stack?
            And a Sqoop of flavor...
What is a Big Data Stack?
             A Babylon tower?
What is the glue that holds
        the bricks?

● Packaging
   ○ RPM, DEB, YINST, EINST?
   ○ Not the best fit for Java
● Maven
   ○ Part of the Java ecosystem
   ○ Not the best tool for non-Java
     artifacts
● Common APIs we will assume
   ○ Versioned artifacts
   ○ Dependencies
   ○ BOMs
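The "common APIs" above boil down to one idea: a single, versioned description of the whole stack. In Maven terms that is a BOM. The fragment below is only a sketch — the groupId `com.example.stack` and the artifact list are made up for illustration; the component versions mirror ones used later in this deck.

```xml
<!-- Hypothetical BOM: coordinates and versions are illustrative only -->
<project>
  <groupId>com.example.stack</groupId>
  <artifactId>bigdata-stack-bom</artifactId>
  <version>1.0</version>
  <packaging>pom</packaging>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>0.20.2-CDH3B4-SNAPSHOT</version>
      </dependency>
      <dependency>
        <groupId>org.apache.pig</groupId>
        <artifactId>pig</artifactId>
        <version>0.9.0-SNAPSHOT</version>
      </dependency>
    </dependencies>
  </dependencyManagement>
</project>
```

Consumers would pull this in with `<scope>import</scope>` inside their own `<dependencyManagement>`, so every module of the stack resolves the same component versions from one place.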
How can we possibly guarantee that everything is
                    right?
Development & Deployment
       Discipline !
Is it really enough?
Of course not...
Components:
 ● I want Pig 0.7
 ● You need Pig 0.8
Components:
 ● I want Pig 0.7
 ● You need Pig 0.8
Configurations:
 ● Have I added the 5th data volume to the DN conf of my
   6TB nodes?
 ● Have I _not_ accidentally copied it to my 4TB
   nodes?
Components:
 ● I want Pig 0.7
 ● You need Pig 0.8
Configurations:
 ● Have I added the 5th data volume to the DN conf of my
   6TB nodes?
 ● Have I _not_ accidentally copied it to my 4TB
   nodes?
Auxiliary services:
 ● Does my Sqoop have the Oracle connector?
Honestly:
 ● Can anyone remember all these details?
What if you've missed some?
How would you like your 10th re-spin of a release?
Make it bloody this time, please...
Redeployments...
Angry customers...
LOST REVENUE ;(
And don't you have anything better to do with that
                 life of yours?
Who Needs A Stack Anyway?

● Developers
   ○ reference platform which is complete and works
   ○ less handholding for downstream groups
● QE
   ○ definitive testable artifact, something to explore and
     certify
   ○ free QE to do QE
● Operations
   ○ deploy a "Hadoop", not a "Hadoop erector set"
   ○ confidence in dev/QE effort

● Eliminate cross-team effort "bleed"
● Align teams on a common "axis"
Successful Stack Deployment System

               "Customers Are Happy!"

● Deploy stack configuration X to a cluster Y, for such X & Y:
   ○ satisfy devs, QE and Ops
   ○ variable enough to absorb all the crazy use cases
Who Needs Stack Testing Anyway?
● All stack customers
   ○ Yes, QE for sure (integration, system, CI levels...), but
      also
   ○ Devs: did my latest patch break security auth or slow
      HDFS to a crawl?
   ○ Ops: you ask for "striping". Did YOU test striping? Can
      WE test it as well?
   ○ Customers: did you test MY stack?
Successful Integration Testing System

                "Customers Are Happy!"

 1. QE: easy to deploy, configure, run, add tests
 2. Stakeholders: provides relevant info about stack quality
     ○ requires plenty of relevant tests and datapoints
        ○ see 1)
Effort Required To Deploy/Test Stacks
The Parable of the Garage
● Lots of stuff coexisting together
    ○ fancy, polished machines (mountain bike)
    ○ utility tools (hedge trimmer, shop vacuum)
    ○ ugly misfits (old paint pan, smelly dry mushrooms)
● Cannot be "simplified" due to complexity and evolving
  nature
● Need to manage complexity: use, change, check
● Keep objects inviolate: cannot expect a bucket to mount
  directly on a wall
● Reliance on external services: studs, drywall, garage door
The Complexity Management Solution

● Provide the Framework and the means for objects to plug
  into this framework
● Framework
    ○ holds everything together, enables management
    ○ sophisticated design, well engineered, sturdy
● "Glue"/"Bubblegum logic"
    ○ binds components to the framework via a minimally-invasive API
    ○ easy to design, engineering complexity is lower
    ○ follows well-documented process
● Components
    ○ actual participants of the garage
    ○ require no modifications
Engineering Effort Matrix

              Development   Level of      External         Investment       Develop in
              Effort        Complexity    Dependencies                      Parallel

Framework     Medium        High          Many             One time         No

Component     Low-Medium    Low-Medium    Few or none      Incremental,     Yes
Glue                        by example    (provided by     per component
                                          framework)

     ● Framework - high complexity one time effort
     ● Components - per-component incremental effort,
       average complexity, code-by-example
Validation Stack for Big Data
              A Babylon tower vs Tower of Hanoi
Validation Stack (tailored)
              Or something like this...
Use accepted platform, tools, practices
● JVM is simply The Best
    ○ Disclaimer: not to start religious war
● Yet, Java isn't dynamic enough (as in JDK6)
    ○ But we don't care what's your implementation language
    ○ Groovy, Scala, Clojure, Jython (?)
● Everyone knows JUnit/TestNG
    ○ alas not everyone can use it effectively
● Dependency tracking and packaging
    ○ Maven
● Information radiators facilitate data comprehension and
  sharing
    ○ TeamCity
    ○ Jenkins
A few more pieces
● Tests/workloads have to be artifact'ed
   ○ It's not good to go fishing for test classes
● Artifacts have to be self-contained
   ○ Reading 20 pages to find a URL to copy a file from?
      "Forget about it" (C)
● Standard execution interface
   ○ JUnit's Runner is as good as any custom one
● A recognizable reporting format
   ○ XML sucks, but at least it has a structure
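The "standard execution interface + recognizable reporting format" pair maps naturally onto a stock Maven integration-test run. The fragment below is a sketch, not the deck's actual build: the plugin version and the `itest-reports` directory name are illustrative.

```xml
<!-- Illustrative: drive artifact'ed tests through the standard JUnit runner
     and collect the familiar XML reports in one predictable place -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-failsafe-plugin</artifactId>
  <version>2.8</version>
  <configuration>
    <reportsDirectory>${project.build.directory}/itest-reports</reportsDirectory>
  </configuration>
  <executions>
    <execution>
      <goals>
        <goal>integration-test</goal>
        <goal>verify</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

Anything that can read JUnit-style XML (Jenkins, TeamCity) then picks the results up for free — which is exactly the point of sticking to a recognizable format.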
A test artifact (PigSmoke 0.9-SNAPSHOT)
<project>
 <groupId>org.apache.pig</groupId>
 <artifactId>pigsmoke</artifactId>
 <packaging>jar</packaging>
 <version>0.9.0-SNAPSHOT</version>
 <dependencies>
  <dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pigunit</artifactId>
    <version>0.9.0-SNAPSHOT</version>
  </dependency>
  <dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pig</artifactId>
    <version>0.9.0-SNAPSHOT</version>
  </dependency>
 </dependencies>
</project>
How do we write iTest artifacts
$ cat TestHadoopTinySmoke.groovy
     class TestHadoopTinySmoke {
         ....
           @BeforeClass
            static void setUp() throws IOException {
                   String pattern = null; //Let's unpack everything
                   JarContent.unpackJarContainer(TestHadoopSmoke.class, '.' , pattern);
                    .....
             }

          @Test
          void testCacheArchive() {
              def conf = (new Configuration()).get("fs.default.name");
              ....
              sh.exec("hadoop fs -rmr ${testDir}/cachefile/out",
                        "hadoop ....
Add suitable dependencies (if desired)
<project>
 ...
   <dependency>
     <groupId>org.apache.pig</groupId>
     <artifactId>pigsmoke</artifactId>
     <version>0.9-SNAPSHOT</version>
     <scope>test</scope>
   </dependency>
   <!-- OMG: Hadoop dependency _WAS_ missed -->
   <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-core</artifactId>
     <version>0.20.2-CDH3B4-SNAPSHOT</version>
   </dependency>
 ...
Unpack data (if needed)
...
      <execution>
         <id>unpack-testartifact-jar</id>
       <phase>generate-test-resources</phase>
       <goals>
        <goal>unpack</goal>
       </goals>
       <configuration>
        <artifactItems>
         <artifactItem>
           <groupId>org.apache.pig</groupId>
           <artifactId>pigsmoke</artifactId>
           <version>0.9-SNAPSHOT</version>
           <type>jar</type>
           <outputDirectory>${project.build.directory}</outputDirectory>
           <includes>test/data/**/*</includes>
         </artifactItem>
        </artifactItems>
       </configuration>
      </execution>
...
Find runtime libraries (if required)
 ...
       <execution>
         <id>find-lzo-jar</id>
         <phase>pre-integration-test</phase>
         <goals> <goal>execute</goal> </goals>
         <configuration>
          <source>
            try {
              project.properties['lzo.jar'] = new File("${HADOOP_HOME}/lib").list(
                [accept:{d, f-> f ==~ /hadoop.*lzo.*.jar/ }] as FilenameFilter
              ).toList().get(0);
            } catch (java.lang.IndexOutOfBoundsException ioob) {
              log.error "No lzo.jar has been found under ${HADOOP_HOME}/lib. Check your installation.";
              throw ioob;
            }
          </source>
         </configuration>
       </execution>
  ...
Take it easy: iTest will do the rest
...
      <execution>
       <id>check-testslist</id>
       <phase>pre-integration-test</phase>
       <goals> <goal>execute</goal> </goals>
       <configuration>
        <source><![CDATA[
          import org.apache.itest.*

         if (project.properties['org.codehaus.groovy.maven.destination'] &&
             project.properties['org.codehaus.groovy.maven.jar']) {
              def prefix = project.properties['org.codehaus.groovy.maven.destination'];
             JarContent.listContent(project.properties['org.codehaus.groovy.maven.jar']).
               each {
                 TestListUtils.touchTestFiles(prefix, it);
               };
         }]]>
        </source>
       </configuration>
      </execution>
...
Tailored validation stack
<project>
 <groupId>com.cloudera.itest</groupId>
 <artifactId>smoke-tests</artifactId>
 <packaging>pom</packaging>
 <version>1.0-SNAPSHOT</version>
 <name>hadoop-stack-validation</name>

 ...

  <!-- List of modules which should be executed as a part of stack testing run -->
  <modules>
   <module>pig</module>
   <module>hive</module>
   <module>hadoop</module>
   <module>hbase</module>
   <module>sqoop</module>
  </modules>

 ...
</project>
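Tailoring the vstack for a target cluster can be as simple as a Maven profile that trims the module list. The profile id below is made up for illustration — the idea is just that `mvn verify -Phbase-cluster` validates only the components that cluster actually runs.

```xml
<!-- Hypothetical profile: validate only the components this cluster runs -->
<profiles>
  <profile>
    <id>hbase-cluster</id>
    <modules>
      <module>hadoop</module>
      <module>hbase</module>
    </modules>
  </profile>
</profiles>
```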
Just let Jenkins do its job (results)
Just let Jenkins do its job (trending)
What else needs to be taken care of?
  ● Packaged deployment
     ○ packaged artifact verification
     ○ stack validation

Little Puppet on top of JVM:

static PackageManager pm;

@BeforeClass
 static void setUp() {
 ....
 pm = PackageManager.getPackageManager()
 pm.addBinRepo("default", "http://archive.canonical.com/", key)
 pm.refresh()
 pm.install("hadoop-0.20")
 ...
}
The coolest thing about a single platform:
void commonPackageTest(String[] pkgs, Closure smoke, ...) {
   pkgs.each { pm.install(it) }
   pkgs.each { assertTrue("package ${it.name} is not installed",
                          pm.isInstalled(it)) }
   pkgs.each { pm.svc_do(it, "start") }
   smoke.call(args)
}
@Test
void testHadoop() {
  commonPackageTest(["hadoop-0.20", ...],
                        this.&commonSmokeTestsuiteRun,
                        TestHadoopSmoke.class)
  commonPackageTest(["hadoop-0.20", ...],
                        { sh.exec("hadoop fs -ls /") })
}
Fully automated unified reporting
iTest: current status

 ● Version 0.1 available at http://github.com/cloudera/iTest
 ● Apache 2.0 licensed
 ● Contributors are welcome (free cookies to first 25)
Putting all technologies together:
 ● Puppet, iTest, Whirr:
    1. Change hits a SCM repo
    2. Hudson build produces Maven + Packaged artifacts
    3. Automatic deployment of modified stacks
    4. Automatic validation using corresponding stack of
       integration tests
    5. Rinse and repeat

 ● Challenges:
    ○ Maven versions vs. packaged versions vs. source
    ○ Strict, draconian discipline in test creation
    ○ Battling combinatoric explosion of stacks
    ○ Size of the cluster (pseudo-distributed <-> 500 nodes)
    ○ Self contained dependencies (JDK to the rescue!)
    ○ Sure, but does it brew espresso?
Definition of Success
● Build a powerful platform that allows us to
  sustain a high rate of product innovation
  without accumulating technical debt
● Tighten organizational feedback loop to
  accelerate product evolution
● Improve culture and processes to achieve
  Agility with Stability as the organization
  builds new technologies
