Spring for Apache Hadoop
By Zenyk Matchyshyn
Agenda
• Goals of the project
• Hadoop Map/Reduce
• Scripting
• HBase
• Hive
• Pig
• Other
• Alternatives
Big Data – Why?
Because of Terabytes and Petabytes:
• Smart meter analysis
• Genome processing
• Sentiment & social media analysis
• Network capacity trending & management
• Ad targeting
• Fraud detection
Goals
• Provide programmatic model to work with Hadoop ecosystem
• Simplify client libraries usage
• Provide Spring friendly wrappers
• Enable real-world usage as a part of Spring Batch & Spring Integration
• Leverage Spring features
Supported distros
• Apache Hadoop
• Cloudera CDH
• Greenplum HD
HADOOP
Hadoop
[Diagram: the Hadoop stack: Map/Reduce, HDFS, HBase, Pig, Hive]
Hadoop basics
Split → Map → Shuffle → Reduce
• Split: "Dog ate the bone" | "Cat ate the fish"
• Map: (Dog, 1) (Ate, 1) (The, 1) (Bone, 1) | (Cat, 1) (Ate, 1) (The, 1) (Fish, 1)
• Shuffle: (Dog, 1) (Ate, {1, 1}) (The, {1, 1}) (Bone, 1) (Cat, 1) (Fish, 1)
• Reduce: (Dog, 1) (Ate, 2) (The, 2) (Bone, 1) (Cat, 1) (Fish, 1)
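The word-count stages can be imitated in plain Java to make the data flow concrete. This is a minimal single-JVM sketch, not Hadoop code: the class and method names (WordCountSketch, map, shuffle, reduce, count) are made up for illustration, and words are lowercased so that "ate" from both sentences lands on the same key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    /** Map stage: emit a (word, 1) pair for every word of one input split. */
    public static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // a blank split produces no pairs
            }
            pairs.add(new AbstractMap.SimpleEntry<>(word.toLowerCase(), 1));
        }
        return pairs;
    }

    /** Shuffle stage: group all emitted values under their key. */
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce stage: sum the grouped values of each key. */
    public static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int one : group.getValue()) {
                sum += one;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    /** Runs map over every split, then shuffle and reduce over all pairs. */
    public static Map<String, Integer> count(List<String> splits) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : splits) {
            pairs.addAll(map(split));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("Dog ate the bone", "Cat ate the fish")));
    }
}
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network; here everything happens in one list for readability.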
Configuration
<?xml version="1.0" encoding="UTF-8"?>
<beans:beans xmlns="http://www.springframework.org/schema/hadoop"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:beans="http://www.springframework.org/schema/beans"
    xmlns:context="http://www.springframework.org/schema/context"
    xsi:schemaLocation="http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/context
        http://www.springframework.org/schema/context/spring-context.xsd
        http://www.springframework.org/schema/hadoop
        http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">
    <context:property-placeholder location="hadoop.properties"/>
    <configuration>
        fs.default.name=${hd.fs}
        mapred.job.tracker=${hd.jt}
    </configuration>
    ………………….
</beans:beans>
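The ${hd.fs} and ${hd.jt} placeholders are filled in from hadoop.properties before the configuration is built. The idea can be sketched in a few lines of plain Java. This is only an illustration of placeholder substitution, not how Spring implements it; the class name PlaceholderSketch and the sample value hdfs://localhost:9000 are assumptions.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderSketch {

    // Matches ${name} and captures the name between the braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every ${name} in the text with the matching property value. */
    public static String resolve(String text, Properties props) {
        Matcher matcher = PLACEHOLDER.matcher(text);
        StringBuffer resolved = new StringBuffer();
        while (matcher.find()) {
            String value = props.getProperty(matcher.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Unresolved placeholder: " + matcher.group(1));
            }
            // quoteReplacement keeps '$' and '\' in the value from being interpreted.
            matcher.appendReplacement(resolved, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(resolved);
        return resolved.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("hd.fs", "hdfs://localhost:9000"); // assumed sample value
        System.out.println(resolve("fs.default.name=${hd.fs}", props));
    }
}
```

Keeping cluster addresses in a properties file lets the same bean definitions be reused across environments by swapping only that file.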
Job definition
<hdp:job id="hadoopJob"
    input-path="${wordcount.input.path}"
    output-path="${wordcount.output.path}"
    libs="file:${app.repo}/supporting-lib-*.jar"
    mapper="org.company.Mapper"
    reducer="org.company.Reducer"/>
// The same job set up by hand with the Hadoop API
Configuration conf = new Configuration();
Job job = new Job(conf, "hadoopJob");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Mapper.class);    // org.company.Mapper
job.setReducerClass(Reducer.class);  // org.company.Reducer
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
Job Execution
<hdp:job-runner id="runner" run-at-startup="true"
    pre-action="someScript"
    post-action="someOtherScript"
    job-ref="hadoopJob"/>
Other approaches
• Hadoop Streaming:
<hdp:streaming id="streaming"
    input-path="/input/" output-path="/output/"
    mapper="${path.cat}" reducer="${path.wc}"/>
• Hadoop Tool Executor:
<hdp:tool-runner id="someTool" tool-class="org.foo.SomeTool" run-at-startup="true">
    <hdp:arg value="data/in.txt"/>
    <hdp:arg value="data/out.txt"/>
    property=value
</hdp:tool-runner>
SCRIPTING
Details
• Supports JVM languages from JSR-223 (Groovy, JRuby, Jython, Rhino)
• Exposes SimplerFileSystem
• Provides implicit variables
• Exposes FsShell to mimic HDFS shell
• Exposes DistCp to mimic distcp from Hadoop
Example
<hdp:script-tasklet id="script-tasklet">
    <hdp:script language="groovy">
        inputPath = "/user/gutenberg/input/word/"
        outputPath = "/user/gutenberg/output/word/"
        if (fsh.test(inputPath)) {
            fsh.rmr(inputPath)
        }
        if (fsh.test(outputPath)) {
            fsh.rmr(outputPath)
        }
        inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"
        fsh.put(inputFile, inputPath)
    </hdp:script>
</hdp:script-tasklet>
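The script uses fsh without declaring it because the variable is bound into the script engine beforehand. A rough sketch of that mechanism with the standard javax.script (JSR-223) API might look as follows. FakeShell and ImplicitVariableSketch are hypothetical stand-ins, not Spring for Apache Hadoop classes, and the code assumes nothing about which script engines are installed.

```java
import java.util.Collections;
import java.util.Set;
import javax.script.Bindings;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleBindings;

public class ImplicitVariableSketch {

    /** Hypothetical stand-in for a shell helper; it only knows a fixed set of paths. */
    public static class FakeShell {
        private final Set<String> existing;

        public FakeShell(Set<String> existing) {
            this.existing = existing;
        }

        public boolean test(String path) {
            return existing.contains(path);
        }
    }

    /** Builds the variables a script may use without declaring them. */
    public static Bindings implicitVariables() {
        Bindings bindings = new SimpleBindings();
        bindings.put("fsh", new FakeShell(Collections.singleton("/user/gutenberg/input/word/")));
        return bindings;
    }

    /**
     * Evaluates the script with the implicit variables bound.
     * Returns null when no engine is registered for the language.
     */
    public static Object run(String language, String script) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(language);
        if (engine == null) {
            return null;
        }
        try {
            return engine.eval(script, implicitVariables());
        } catch (ScriptException e) {
            throw new IllegalStateException("Script failed", e);
        }
    }
}
```

With a Groovy JSR-223 engine on the classpath, a call such as run("groovy", "fsh.test('/user/gutenberg/input/word/')") would be evaluated against the bound FakeShell; without one, run simply returns null.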
HBASE
HBase
[Diagram: HBase within the Hadoop stack (Map/Reduce, HDFS, HBase, Pig, Hive)]
HBase basics
• Distributed, column oriented store
• Independent of Hadoop
• No translation into Map/Reduce
• Stores data in MapFiles (indexed SequenceFiles)
create 'sometable', 'clmnfamily1'
put 'sometable', 'row_id1', 'clmnfamily1:c1', 'some values'
scan 'sometable'
Features
• Easy connection interface
• Thread safe
• DAO friendly support and wrappers:
• HbaseTemplate
• TableCallback
• RowMapper
• ResultsExtractor
• Binding table to current thread
Example - beans
<hdp:hbase-configuration/>
<bean id="hbaseTemplate"
    class="org.springframework.data.hadoop.hbase.HbaseTemplate"
    p:configuration-ref="hbaseConfiguration"/>
Example - code
// Write one cell: row "SomeRow", family "SomeColumn", qualifier "SomeQualifier"
template.execute("MyTable", new TableCallback<Object>() {
    @Override
    public Object doInTable(HTable table) throws Throwable {
        Put p = new Put(Bytes.toBytes("SomeRow"));
        p.add(Bytes.toBytes("SomeColumn"), Bytes.toBytes("SomeQualifier"), Bytes.toBytes("AValue"));
        table.put(p);
        return null;
    }
});
// Read every row of the "SomeColumn" family and map each result to a String
List<String> rows = template.find("MyTable", "SomeColumn", new RowMapper<String>() {
    @Override
    public String mapRow(Result result, int rowNum) throws Exception {
        return result.toString();
    }
});
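The template and callback pairing follows a familiar Spring idea: the template obtains and releases the resource, and the callback only contains the work. The sketch below shows that division of labour over an in-memory map. It is an assumption-laden illustration, not the HbaseTemplate implementation; MiniTemplate, its openHandles counter, and the simplified TableCallback and RowMapper signatures are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TemplateCallbackSketch {

    /** Work to run against one table; here a table is just row key -> value. */
    public interface TableCallback<T> {
        T doInTable(Map<String, String> table) throws Exception;
    }

    /** Converts one row into a result object. */
    public interface RowMapper<T> {
        T mapRow(Map.Entry<String, String> row, int rowNum) throws Exception;
    }

    public static class MiniTemplate {
        private final Map<String, Map<String, String>> store = new HashMap<>();
        public int openHandles = 0; // counts tables currently handed to a callback

        /** Acquires the table, runs the callback, and always releases the table. */
        public <T> T execute(String tableName, TableCallback<T> action) {
            Map<String, String> table = store.computeIfAbsent(tableName, n -> new LinkedHashMap<>());
            openHandles++;
            try {
                return action.doInTable(table);
            } catch (Exception e) {
                // Checked exceptions are translated so callers need no try/catch.
                throw new IllegalStateException("Table access failed: " + tableName, e);
            } finally {
                openHandles--;
            }
        }

        /** Maps every row of the table, built on top of execute. */
        public <T> List<T> find(String tableName, RowMapper<T> mapper) {
            return execute(tableName, table -> {
                List<T> results = new ArrayList<>();
                int rowNum = 0;
                for (Map.Entry<String, String> row : table.entrySet()) {
                    results.add(mapper.mapRow(row, rowNum++));
                }
                return results;
            });
        }
    }
}
```

Because cleanup and exception translation live in one place, calling code stays as short as the two snippets on the slide, and a forgotten close cannot leak a table handle.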
HIVE
Hive
[Diagram: Hive within the Hadoop stack (Map/Reduce, HDFS, HBase, Pig, Hive)]
Hive basics
• SQL-like interface - HiveQL
• Has its own structure
• Not a pipeline like Pig
• Basically a distributed data warehouse
• Has execution optimization
Features
• Hive server
• DAO friendly Hive Thrift Client simplification
• Hive JDBC driver within Spring DAO ecosystem
• Hive scripting
• Thread safe
Example - beans
<hdp:hive-server host="hivehost" port="10001"/>
<hdp:hive-template/>
<hdp:hive-client-factory host="some-host" port="some-port">
    <hdp:script location="classpath:org/company/hive/script.q">
        <arguments>ignore-case=true</arguments>
    </hdp:script>
</hdp:hive-client-factory>
<hdp:hive-runner id="hiveRunner" run-at-startup="true">
    <hdp:script>
        DROP TABLE IF EXISTS testHiveBatchTable;
        CREATE TABLE testHiveBatchTable (key int, value string);
    </hdp:script>
    <hdp:script location="hive-scripts/script.q"/>
</hdp:hive-runner>
Example - template
return hiveTemplate.execute(new HiveClientCallback<List<String>>() {
    @Override
    public List<String> doInHive(HiveClient hiveClient) throws Exception {
        return hiveClient.get_all_databases();
    }
});
PIG
Pig basics
[Diagram: Pig within the Hadoop stack (Map/Reduce, HDFS, HBase, Pig, Hive)]
Pig
• High level language for data analysis
• Uses PigLatin to describe data flows (translates into MapReduce)
• Filters, Joins, Projections, Groupings, Counts, etc.
• Example:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
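For readers more at home in Java, the three Pig statements amount to "parse each line into typed fields, then keep only the name". The sketch below mirrors that on a local list. It illustrates the semantics only, since Pig itself compiles the script into MapReduce jobs; the class PigProjectionSketch and the sample rows are made up, and the tab delimiter matches the PigStorage default.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PigProjectionSketch {

    /** One tuple of relation A, typed as (name:chararray, age:int, gpa:float). */
    public static final class Student {
        public final String name;
        public final int age;
        public final float gpa;

        public Student(String name, int age, float gpa) {
            this.name = name;
            this.age = age;
            this.gpa = gpa;
        }
    }

    /** A = LOAD ... AS (...): split one tab-delimited line into typed fields. */
    public static Student parse(String line) {
        String[] fields = line.split("\t");
        return new Student(fields[0], Integer.parseInt(fields[1]), Float.parseFloat(fields[2]));
    }

    /** B = FOREACH A GENERATE name: project every tuple down to its name. */
    public static List<String> projectNames(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            names.add(parse(line).name);
        }
        return names;
    }

    public static void main(String[] args) {
        // DUMP B: print the projected relation (sample rows are invented).
        System.out.println(projectNames(Arrays.asList("John\t18\t4.0", "Mary\t19\t3.8")));
    }
}
```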
Features
• Scripts execution
• DAO friendly template
• Thread safe
Example - beans
<hdp:pig-factory exec-type="LOCAL" job-name="pig-script" configuration-ref="hadoopConfiguration"
    properties-location="pig-dev.properties">
    source=${pig.script.src}
    <script location="org/company/pig/script.pig"/>
</hdp:pig-factory>
<hdp:pig-runner id="pigRunner" run-at-startup="true">
    <hdp:script>
        A = LOAD 'src/test/resources/logs/apache_access.log' USING PigStorage() AS (name:chararray, age:int);
        B = FOREACH A GENERATE name;
        DUMP B;
    </hdp:script>
    <hdp:script location="pig-scripts/script.pig">
        <arguments>electric=sea</arguments>
    </hdp:script>
</hdp:pig-runner>
<hdp:pig-template/>
Example - template
return pigTemplate.execute(new PigCallback<Set<String>>() {
    @Override
    public Set<String> doInPig(PigServer pig) throws ExecException, IOException {
        return pig.getAliasKeySet();
    }
});
Other features
• Cascading support
• Works well with Hadoop security
• Spring Batch tasklets
• Spring Integration support
Alternatives & related
• Apache Flume – distributed data collection
• Apache Oozie – workflow scheduler
• Apache Sqoop – SQL bulk import/export
Q/A
?
