Building Applications on Hadoop
Mark Grover
Software Engineer, Cloudera
@mark_grover
Jfokus 2014 (February 4th, 2014)

©2014 Cloudera, Inc. All Rights
Reserved.
Agenda
•  Brief intro to Hadoop and the ecosystem
•  Developing apps on Hadoop
•  What’s the current problem?
•  How are we fixing it?
  

What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is…
•  Scalable
•  Fault tolerant
•  Distributed

Has the Flexibility to Store and Mine Any Type of Data
•  Ask questions across structured and unstructured data that were previously impossible to ask or solve
•  Not bound by a single schema

Excels at Processing Complex Data
•  Scale-out architecture divides workloads across multiple nodes
•  Flexible file system eliminates ETL bottlenecks

Scales Economically
•  Can be deployed on commodity hardware
•  Open source platform guards against vendor lock-in

CORE HADOOP SYSTEM COMPONENTS
•  Hadoop Distributed File System (HDFS): Self-Healing, High Bandwidth Clustered Storage
•  MapReduce: Distributed Computing Framework
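
To make the MapReduce idea above concrete, here is a minimal single-process sketch in plain Java. It is an illustration under stated assumptions, not Hadoop code: the names `WordCountSketch`, `map`, `reduce`, and `wordCount` are invented for this example, and the shuffle that a cluster performs between nodes is reduced to an in-memory grouping step.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of the map, shuffle, and reduce phases.
// Hypothetical names; this does not use the Hadoop API.
final class WordCountSketch {
    /** Map phase: emit a (word, 1) pair for every word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle and reduce phases: group pairs by word, then sum each group's counts. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    /** Runs the map phase over every line, then reduces all emitted pairs. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            // On a cluster, each node would map its own block of the input.
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("hadoop stores data", "hadoop processes data")));
    }
}
```

On a real cluster the map calls would run in parallel on the nodes that hold each block of input, which is what lets the workload scale out.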
  

Developing apps on Hadoop
Kite SDK
  

“[I]t’s not enough to just build a scalable and stable system; the system also has to be easy enough for thousands of internal developers of all types and all skill levels to use.”
http://gigaom.com/data/how-disney-built-a-big-data-platform-on-a-startup-budget/
  
Hadoop is incredibly powerful

Hadoop is incredibly flexible

Hadoop is incredibly low-level

Hadoop is incredibly complex
  

A typical system (zoom 100:1)

A typical system (zoom 10:1)

A typical system (zoom 5:1)
  

What you actually care about
•  Getting data from A to B
•  Using it later
  

Infrastructure details
•  Serialization, file formats, and compression
•  Metadata capture and maintenance
•  Dataset organization and partitioning
•  Durability and delivery guarantees
•  Well-defined failure semantics
•  Performance and health instrumentation
  

Kite SDK
•  Make Hadoop accessible to the enterprise developer
•  Address the most common cases
•  Codify expert patterns and practices for building data-oriented systems and applications.
•  Let developers focus on business logic, not plumbing or infrastructure.
•  Provide smart defaults for platform choices.
•  Support piecemeal adoption via loosely-coupled modules
  
Kite SDK
•  An open source set of libraries, guides, and examples for building data-oriented systems and applications
•  Provides higher level APIs atop existing components of CDH
•  Supports piecemeal adoption via loosely coupled modules
  

Kite SDK
•  Data – logical abstractions of records, datasets and repositories with implementations for HDFS and HBase (upcoming)
   •  APIs to drastically simplify working with datasets in Hadoop filesystems. The Data module:
   •  Handles automatic serialization and deserialization of Java POJOs as well as Avro Records.
   •  Automatic compression.
   •  File and directory layout and management.
   •  Automatic partitioning based on configurable functions.
   •  A metadata provider plugin interface to integrate with centralized metadata management systems.
•  Morphlines – declarative ETL stream processing library
•  Maven Plugin – tools for working with datasets and running jobs
•  TODO: Add more!!!
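
The metadata provider plugin idea can be pictured as a small interface with swappable implementations. The sketch below is hypothetical: `SketchMetadataProvider`, `InMemoryMetadataProvider`, and their methods are names invented here and are not Kite's actual interface. The in-memory map merely stands in for a centralized metadata service.

```java
import java.util.HashMap;
import java.util.Map;

// Demo entry point; all types in this file are hypothetical illustrations.
final class MetadataProviderDemo {
    public static void main(String[] args) {
        SketchMetadataProvider provider = new InMemoryMetadataProvider();
        provider.save("events", "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[]}");
        System.out.println(provider.load("events"));
    }
}

// Hypothetical plugin contract; not Kite's own metadata provider API.
interface SketchMetadataProvider {
    /** Stores the schema text for a dataset, replacing any earlier version. */
    void save(String datasetName, String schemaJson);

    /** Returns the stored schema text, or fails if the dataset is unknown. */
    String load(String datasetName);
}

/** In-memory stand-in for a centralized metadata service. */
final class InMemoryMetadataProvider implements SketchMetadataProvider {
    private final Map<String, String> schemas = new HashMap<>();

    @Override
    public void save(String datasetName, String schemaJson) {
        schemas.put(datasetName, schemaJson);
    }

    @Override
    public String load(String datasetName) {
        String schema = schemas.get(datasetName);
        if (schema == null) {
            throw new IllegalStateException("No metadata for dataset: " + datasetName);
        }
        return schema;
    }
}
```

Because callers depend only on the interface, the in-memory version could be swapped for one backed by a shared metadata store without touching application code.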
  

Co-authoring O’Reilly book
•  Titled ‘Hadoop Application Architectures’
•  How to build end-to-end solutions using Apache Hadoop and related tools
•  Updates on Twitter: @hadooparchbook
•  http://www.hadooparchitecturebook.com/
  

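Code
// Kite package names are assumed from the Kite 0.x releases (older CDK releases used com.cloudera.cdk);
// verify them against your version. The wrapper class name is illustrative.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.kitesdk.data.*;
import org.kitesdk.data.filesystem.FileSystemDatasetRepository;

public class CreateEventsDataset {
  public static void main(String[] args) throws Exception {
    // Open a dataset repository rooted at /data on the default Hadoop filesystem.
    DatasetRepository repo = new FileSystemDatasetRepository.Builder()
        .fileSystem(FileSystem.get(new Configuration()))
        .directory(new Path("/data"))
        .get();
    // Create the "events" dataset: Avro schema from event.avsc, hash-partitioned on userId into 53 buckets.
    Dataset events = repo.create("events",
        new DatasetDescriptor.Builder()
            .schema(new File("event.avsc"))
            .partitionStrategy(
                new PartitionStrategy.Builder().hash("userId", 53).get()
            ).get()
    );
    // The slide used an undeclared `schema`; here it is read back from the dataset descriptor.
    Schema schema = events.getDescriptor().getSchema();
    // Write one record, then close the writer to flush it.
    DatasetWriter<GenericRecord> writer = events.getWriter();
    writer.open();
    writer.write(
        new GenericRecordBuilder(schema)
            .set("userId", 1)
            .set("timeStamp", System.currentTimeMillis())
            .build()
    );
    writer.close();
  }
}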

Data
/data
  /events
    /.metadata
      /schema.avsc
      /descriptor.properties
    /userId=0
      /10000000.avro
      /10000001.avro
    /userId=1
      /20000000.avro
    /userId=2
      /30000000.avro
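
The `userId=N` directories above come from the hash partition strategy. As a rough sketch of the idea, assuming a simple modulo of the key's hash code (Kite's real hash function and file naming may differ), a record's partition directory could be derived like this. `HashPartitionSketch`, `bucketFor`, and `partitionPath` are hypothetical names.

```java
// Illustrative only: not Kite's actual hash function or directory naming.
final class HashPartitionSketch {
    /** Maps a key to one of `buckets` partitions; floorMod keeps the result non-negative. */
    static int bucketFor(Object key, int buckets) {
        return Math.floorMod(key.hashCode(), buckets);
    }

    /** Builds a partition directory path in the "field=value" style of the layout above. */
    static String partitionPath(String datasetRoot, String field, Object key, int buckets) {
        return datasetRoot + "/" + field + "=" + bucketFor(key, buckets);
    }

    public static void main(String[] args) {
        for (int userId : new int[] {1, 54, 107, 2}) {
            System.out.println(userId + " -> " + partitionPath("/data/events", "userId", userId, 53));
        }
    }
}
```

With 53 buckets, user IDs 1, 54, and 107 all land in `userId=1`, since each leaves a remainder of 1 when divided by 53. That is the trade-off of hash partitioning: the number of directories stays bounded, but unrelated keys share a partition.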
Kite SDK Morphlines Module
•  Pluggable, configuration-driven data transform library
•  Born out of Cloudera Search, but general purpose
•  Configure record transform stages in a container library
•  Use the library in Flume, MapReduce jobs, Storm, and other Java applications
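
A configuration-driven transform chain of the kind described above might look roughly like the following in plain Java. This is not the Morphlines API: the stage names (`trimMessage`, `addLength`), the `STAGES` registry, and the use of a list of strings as the "configuration" are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical names throughout; this is not the Morphlines API.
final class TransformPipelineSketch {
    /** Registry of named stages, standing in for pluggable commands. */
    static final Map<String, UnaryOperator<Map<String, Object>>> STAGES = new HashMap<>();

    static {
        // Stage 1: strip surrounding whitespace from the "message" field, if present.
        STAGES.put("trimMessage", r -> {
            Map<String, Object> out = new HashMap<>(r);
            out.computeIfPresent("message", (k, v) -> v.toString().trim());
            return out;
        });
        // Stage 2: add a "length" field holding the message length.
        STAGES.put("addLength", r -> {
            Map<String, Object> out = new HashMap<>(r);
            out.put("length", String.valueOf(out.get("message")).length());
            return out;
        });
    }

    /** Runs the record through the stages in the order the configuration lists them. */
    static Map<String, Object> run(List<String> config, Map<String, Object> record) {
        Map<String, Object> current = record;
        for (String name : config) {
            UnaryOperator<Map<String, Object>> stage = STAGES.get(name);
            if (stage == null) {
                throw new IllegalArgumentException("Unknown stage: " + name);
            }
            current = stage.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> record = new HashMap<>();
        record.put("message", "  hello hadoop  ");
        System.out.println(run(Arrays.asList("trimMessage", "addLength"), record));
    }
}
```

Changing the pipeline means editing the list of stage names rather than the code, which is the appeal of the configuration-driven approach.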
  

Other Modules
Maven plugin
•  Package, deploy, and execute “apps”
•  Execute dataset operations

Examples
•  POJO, generic, and generated entity ingest
•  Dataset administrative operations
•  Crunch and MR integration
•  ...
  
Future
HBase
•  Extending data APIs to support random access
•  Same automatic serialization, schema management, etc.

Higher-order data management
•  Common tasks
•  Think background compaction, conversion, etc.

Integration with existing middleware frameworks
Give us all your good ideas (and code)!
  
Kite SDK Resources
•  Docs
   •  http://kitesdk.org/docs/current/
•  Examples
   •  https://github.com/kite-sdk/kite-examples
•  Source code
   •  https://github.com/kite-sdk/
Binary artifacts available from Cloudera’s Maven repository
•  Twitter: @mark_grover
•  Slides at
•  LinkedIn: linkedin.com/in/grovermark
  