Hadoop gets Groovy
Steve Loughran, Hortonworks
stevel at hortonworks.com
@steveloughran

Berlin, June 2012




© Hortonworks Inc. 2012
Where are you in this diagram?

[Diagram: Hadoop skills plotted against Groovy skills. Labelled points: Doug, Owen, Arun, Jakob (Hadoop side); @steveloughran (in between); James Strachan, Guillaume Laforge (Groovy side)]
Grumpy : Groovy Hadoop Library

• Something lightweight for testing

• Wanted to play in the M/R layer

• Already using Groovy

• Liked: JVM integration, tooling, libraries, IntelliJ IDEA, books…



        git@github.com:steveloughran/grumpy.git


What is Groovy?
A dynamic language within the JVM

• Java++
   – Maps, lists, tuples, closures

• Flavours of Ruby and Python
   – 'Duck' typing, Grails, (scripting)



A way to do things in the JVM that Sun didn't imagine
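
A few lines of plain Groovy make the "Java++" claim concrete. This is only a sketch of the features named above (list and map literals, tuple-style assignment, closures, duck typing); the variable names are made up for illustration.

```groovy
// List and map literals: no constructors or put() calls needed
def langs = ['Java', 'Groovy', 'Ruby']
def skills = [hadoop: 3, groovy: 7]

// Tuple-style multiple assignment from a list
def (first, second) = langs

// A closure is a block of code held in a variable
def shout = { s -> s.toUpperCase() + '!' }
def shouted = langs.collect(shout)

// Duck typing: anything with a size() method will do
def sizeOf = { thing -> thing.size() }
def sizes = [sizeOf('abc'), sizeOf(langs), sizeOf(skills)]

println "${first}, ${second}: ${shouted} ${sizes}"
```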



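Can use & subclass Java classes:
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Mapper

class LineCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Created once and reused for every record
  static final def emitKey = new Text("lines")
  static final def one = new IntWritable(1)

  // Emits ("lines", 1) for each input line
  void map(LongWritable key,
           Text value,
           Mapper.Context context) {
    context.write(emitKey, one)
  }
}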




Closures & lists

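import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Reducer

class CountReducer2
    extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Generic types and a void return match the signature of Reducer.reduce()
    void reduce(Text k,
                Iterable<IntWritable> values,
                Reducer.Context ctx) {

        // Unwrap each IntWritable, then add up the resulting list
        def sum = values.collect() { it.get() }.sum()

        ctx.write(k, new IntWritable(sum))
    }

}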




Closures & lists

values.collect() {
    it.get()
  }.sum()

Iterable<IntWritable> -> List<Integer> -> Integer
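
The same unwrap-then-sum chain can be tried without Hadoop on the classpath. In this sketch the JDK's AtomicInteger stands in for IntWritable, purely because it also has a get() method; that substitution is an assumption for illustration, not something from the talk.

```groovy
import java.util.concurrent.atomic.AtomicInteger

// Stand-in for the Iterable<IntWritable> a reducer receives
Iterable<AtomicInteger> values = [1, 1, 1, 1].collect { new AtomicInteger(it) }

// Step 1: unwrap every element into a plain list of integers
List<Integer> unwrapped = values.collect { it.get() }

// Step 2: fold the list into a single number
int sum = unwrapped.sum()

// sum() on an empty list returns null unless an initial value is given
def emptySum = [].sum()
def emptySumWithDefault = [].sum(0)

println "unwrapped=${unwrapped} sum=${sum}"
```

A reducer is only called for keys that have at least one value, so the empty case should not arise there; it matters when the same idiom is reused elsewhere.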




Result: MR jobs in Groovy
In:
gate1,b46cca4d3f5f313176e50a0e38e7fde3,,2006-10-30,16:06:17,Fleurball
gate1,f1191b79236083ce59981e049d863604,,2006-10-30,16:06:20,vklaptop
gate1,b45c7795f5be038dda8615ab44676872,,2006-10-30,16:06:21,Franky Panky
gate1,02e73779c77fcd4e9f90a193c4f3e7ff,,2006-10-30,16:06:23,
gate1,eef1836efddf8dbfe5e2a3cd5c13745f,,2006-10-30,16:06:24,Vas
gate1,b46cca4d3f5f313176e50a0e38e7fde3,,2006-10-30,16:06:32,Fleurball
gate1,f1191b79236083ce59981e049d863604,,2006-10-30,16:06:36,vklaptop
gate1,b45c7795f5be038dda8615ab44676872,,2006-10-30,16:06:37,Franky Panky
gate1,eef1836efddf8dbfe5e2a3cd5c13745f,,2006-10-30,16:06:38,Vas
gate1,02e73779c77fcd4e9f90a193c4f3e7ff,,2006-10-30,16:06:43,
gate1,2afaf990ce75f0a7208f7f012c8d12ad,,2006-10-30,16:06:54,Smiley



Out: 163,198,223 device sightings!




Why no Pig? Sliding Window Debounce
void map(LongWritable key, BlueEvent event,
           Mapper.Context context) {

    BlueEvent ev2 = window.insert(event)
    List<BlueEvent> expired = window.purgeExpired(event)
    expired.each { evt ->
        emit(context, evt)
    }
}

void cleanup(Mapper.Context context) {
  window.each { evt ->
    emit(context, evt)
  }
}
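
The slide does not show the window class itself, so the following is a guess at how such a debounce window could behave, not the Grumpy implementation. Sighting, DebounceWindow and their fields are hypothetical names; the assumed rule is that repeat sightings of a device within the window extend one pending event instead of counting again. Unlike the mapper above, this sketch purges before inserting, so that a stale entry is never extended.

```groovy
// Hypothetical stand-in for the talk's BlueEvent: one device seen from start to end (ms)
class Sighting {
    String device
    long start
    long end
}

// Assumed semantics: at most one pending sighting per device
class DebounceWindow {
    long windowMillis
    Map<String, Sighting> pending = [:]

    // Merge a raw sighting into the pending entry for its device, if there is one
    Sighting insert(Sighting s) {
        Sighting existing = pending[s.device]
        if (existing != null) {
            existing.end = s.end
            return existing
        }
        pending[s.device] = s
        return s
    }

    // Remove and return every sighting last seen more than windowMillis before 'now'
    List<Sighting> purgeExpired(long now) {
        List<Sighting> expired = pending.values().findAll { now - it.end > windowMillis }.toList()
        expired.each { pending.remove(it.device) }
        return expired
    }
}

def window = new DebounceWindow(windowMillis: 60000)
def emitted = []
// Raw input: device 'a' is seen three times, 'b' once
[['a', 0], ['b', 3000], ['a', 15000], ['a', 100000]].each { device, time ->
    emitted.addAll(window.purgeExpired(time))
    window.insert(new Sighting(device: device, start: time, end: time))
}
// Equivalent of cleanup(): flush whatever is still pending
emitted.addAll(window.pending.values())

println emitted.collect { "${it.device}:${it.start}-${it.end}" }
```

Four raw sightings collapse to three events here, because the first two sightings of 'a' fall inside one 60-second window. State like this, carried from record to record inside a mapper, is the kind of logic the slide title suggests was awkward to express in Pig.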

Device sightings by day for 2007

[Chart: device sightings per day; y-axis 0 to 1,600,000 sightings, x-axis day of year 1 to 366]
Improving Hadoop APIs
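import org.apache.hadoop.conf.Configuration

// Groovy turns conf[key] = val into putAt(key, val), so the method must be named putAt
Configuration.metaClass.putAt = { String key, val ->
  set(key, val.toString())
}

// conf[key] becomes getAt(key)
Configuration.metaClass.getAt = { String key ->
  get(key)
}

// Copy every entry of a map into the configuration as strings
Configuration.metaClass.add = { map ->
  map.each { elt ->
    set((elt.key).toString(),
        (elt.value).toString())
  }
}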


& Configuration gets better

conf['mapscript'] = new File(src).text

String scriptText = conf['mapscript']

conf.add([
  window:60000,
  'redscript':reduceScript
  ])



Extending to the Job class is trickier: subclassing is better
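
The metaclass trick can be tried without Hadoop. In this sketch FakeConfiguration is a hypothetical stand-in that copies only the set(String, String) and get(String) methods of the real Configuration class; everything else is plain Groovy. Groovy dispatches conf[key] = val to putAt(key, val) and conf[key] to getAt(key), which is why those two names are used.

```groovy
// Hypothetical stand-in for org.apache.hadoop.conf.Configuration
class FakeConfiguration {
    private final Map<String, String> props = [:]
    void set(String key, String value) { props[key] = value }
    String get(String key) { props[key] }
}

// Subscript assignment: conf[key] = val
FakeConfiguration.metaClass.putAt = { String key, val ->
    delegate.set(key, val.toString())
}

// Subscript read: conf[key]
FakeConfiguration.metaClass.getAt = { String key ->
    delegate.get(key)
}

// Bulk add from a map; 'self' captures the instance before entering the inner closure
FakeConfiguration.metaClass.add = { Map map ->
    def self = delegate
    map.each { k, v -> self.set(k.toString(), v.toString()) }
}

def conf = new FakeConfiguration()
conf['mapscript'] = 'emit(key, 1)'
conf.add([window: 60000, redscript: 'emit(key, values.sum())'])

println "window=${conf['window']} mapscript=${conf['mapscript']}"
```

Every value is stored as a string, so the number 60000 comes back as '60000' and has to be parsed again by whoever reads it.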


New today! Script-driven MR jobs!
protected void setup(Mapper.Context ctx) {
   this.ctx = ctx
   this.conf = ctx.configuration
   ScriptCompiler comp = new ScriptCompiler(conf)
   String scriptText = conf['mapscript']
   map = comp.parse(scriptText, this, ctx)
 }

 protected void map(Writable key, Writable value,
    Mapper.Context ctx) {
   map.setProperty('key',key)
   map.setProperty('value',value)
   map.run()
 }
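
ScriptCompiler belongs to the talk's own library and its internals are not shown, so this sketch swaps in Groovy's standard GroovyShell to illustrate the same pattern: parse the script text once, then set key and value and call run() for every record. The emit closure, the script text and the sample records are all made up for illustration.

```groovy
// Hypothetical map script, as it might be stored under conf['mapscript']
def scriptText = 'emit(key, value.length())'

// Collect output in a list instead of writing to a Hadoop context
def output = []
def binding = new Binding([emit: { k, v -> output << [k, v] }])

// Parse once, as setup() does
Script mapScript = new GroovyShell(binding).parse(scriptText)

// Run once per record, as map() does
[a: 'hello', b: 'hi'].each { k, v ->
    mapScript.setProperty('key', k)
    mapScript.setProperty('value', v)
    mapScript.run()
}

println output
```

Parsing in setup() rather than in map() matters because compiling a script is far more expensive than running an already compiled one.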



Things to consider
• Performance: Groovy 2 on Java7
• 'False friends': types, if(), exceptions

• If you can use Pig, use it.
• Use Groovy for testing, extending Hadoop
  classes (output formatter, etc)
• Play with YARN and Giraph with it
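
The slide does not say which 'false friends' are meant, so the following is one reading of that bullet: places where Groovy code looks like Java but behaves differently. Each case is standard Groovy behaviour; the choice of cases is an assumption.

```groovy
// if(): Groovy truth treats empty strings, zero and empty collections as false
def truthy = { x -> x ? true : false }
def truths = [truthy(''), truthy(0), truthy([]), truthy('false')]

// Types: == calls equals(), and is() is the identity check
def a = new String('x')
def b = new String('x')
def sameValue = (a == b)
def sameObject = a.is(b)

// Types: dividing two integers gives a BigDecimal, not a truncated int
def half = 1 / 2

// Exceptions: checked exceptions need no 'throws' clause and no forced catch
void risky() { throw new IOException('not declared anywhere') }
def caught = false
try {
    risky()
} catch (IOException e) {
    caught = true
}

println "truths=${truths} sameValue=${sameValue} sameObject=${sameObject} half=${half}"
```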




Questions?

hortonworks.com




Performance?
• Groovy 1 over-introspects
• HLL hides a lot of overhead



• If your work is I/O bound, less important
• Speed of development vs execution
• Need to benchmark on Java 7




Editor's Notes

  • #3: What is the knowledge/skill level of the audience? I'm talking about Groovy; there is no time to cover Hadoop as well. Disclaimer: I am a Groovy user, not an expert.
  • #5: How to describe Groovy? It depends on what you are doing with it; you can look at it and come to different conclusions based on your use. It's like Java with the datatypes they forgot. It's got Ruby concepts (closures), keywords from Python, and dynamic 'duck' typing. It lets you do things in the JVM that the original Java authors didn't expect.
  • #8: It's a map and reduce, in the reduce. Turtles all the way down. Or at least elephants.