Applications on Hadoop

Building Applications on Hadoop

Mark Grover
Software Engineer, Cloudera
@mark_grover

Jfokus 2014 (February 4th, 2014)

©2014 Cloudera, Inc. All Rights Reserved.

Agenda

•  Brief intro to Hadoop and the ecosystem
•  Developing apps on Hadoop
•  What’s the current problem?
•  How are we fixing it?

What is Apache Hadoop?

Apache Hadoop is an open source platform for data storage and processing that is…

✓  Scalable
✓  Fault tolerant
✓  Distributed

Has the Flexibility to Store and Mine Any Type of Data

•  Ask questions across structured and unstructured data that were previously impossible to ask or solve
•  Not bound by a single schema

CORE HADOOP SYSTEM COMPONENTS

Hadoop Distributed File System (HDFS)
Self-Healing, High Bandwidth Clustered Storage

MapReduce
Distributed Computing Framework

Excels at Processing Complex Data

•  Scale-out architecture divides workloads across multiple nodes
•  Flexible file system eliminates ETL bottlenecks

Scales Economically

•  Can be deployed on commodity hardware
•  Open source platform guards against vendor lock
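To make the MapReduce model concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce phases using word count. It is plain Java and does not use the Hadoop MapReduce API; the class and method names are hypothetical and chosen only for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: an in-memory imitation of the map -> shuffle -> reduce
// phases. It does not use the Hadoop MapReduce API.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Drive the three phases over a list of input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or not", "to be")));
    }
}
```

On a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; the sketch only shows how the phases relate to one another.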

Developing apps on Hadoop

Kite SDK

“[I]t’s not enough to just build a scalable and stable system; the system also has to be easy enough for thousands of internal developers of all types and all skill levels to use.”

http://gigaom.com/data/how-disney-built-a-big-data-platform-on-a-startup-budget/

Hadoop is incredibly powerful

Hadoop is incredibly flexible

Hadoop is incredibly low-level

Hadoop is incredibly complex

A typical system (zoom 100:1)

A typical system (zoom 10:1)

A typical system (zoom 5:1)

What you actually care about

•  Getting data from A to B
•  Using it later

Infrastructure details

•  Serialization, file formats, and compression
•  Metadata capture and maintenance
•  Dataset organization and partitioning
•  Durability and delivery guarantees
•  Well-defined failure semantics
•  Performance and health instrumentation

Kite SDK

•  Make Hadoop accessible to the enterprise developer
•  Address the most common cases
•  Codify expert patterns and practices for building data-oriented systems and applications.
•  Let developers focus on business logic, not plumbing or infrastructure.
•  Provide smart defaults for platform choices.
•  Support piecemeal adoption via loosely-coupled modules

Kite SDK

•  An open source set of libraries, guides, and examples for building data-oriented systems and applications
•  Provides higher level APIs atop existing components of CDH
•  Supports piecemeal adoption via loosely coupled modules

Kite SDK

•  Data – logical abstractions of records, datasets and repositories with implementations for HDFS and HBase (upcoming)
   APIs to drastically simplify working with datasets in Hadoop filesystems. The Data module:
   •  Handles automatic serialization and deserialization of Java POJOs as well as Avro Records.
   •  Automatic compression.
   •  File and directory layout and management.
   •  Automatic partitioning based on configurable functions.
   •  A metadata provider plugin interface to integrate with centralized metadata management systems.
•  Morphlines – declarative ETL stream processing library
•  Maven Plugin – tools for working with datasets and running jobs

TODO: Add more!!!
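The layering of records, datasets, and repositories can be pictured with a small in-memory sketch. Every name below (`InMemoryRepository`, `InMemoryDataset`, `Event`) is hypothetical and exists only to show how the three concepts relate. It is not the Kite API and it performs no serialization, compression, or partitioning.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout: this mirrors the record/dataset/repository
// layering described above, not the actual Kite API.
class DataModelSketch {

    // A record is a plain Java object (POJO).
    static class Event {
        final long userId;
        final long timeStamp;

        Event(long userId, long timeStamp) {
            this.userId = userId;
            this.timeStamp = timeStamp;
        }
    }

    // A dataset is a named, appendable collection of records of one type.
    static class InMemoryDataset<E> {
        private final String name;
        private final List<E> records = new ArrayList<>();

        InMemoryDataset(String name) {
            this.name = name;
        }

        String name() {
            return name;
        }

        void write(E record) {
            records.add(record);
        }

        // Return a copy so callers cannot modify the dataset's contents.
        List<E> read() {
            return new ArrayList<>(records);
        }
    }

    // A repository is a namespace that creates and looks up datasets.
    static class InMemoryRepository {
        private final Map<String, InMemoryDataset<?>> datasets = new HashMap<>();

        <E> InMemoryDataset<E> create(String name) {
            if (datasets.containsKey(name)) {
                throw new IllegalStateException("Dataset already exists: " + name);
            }
            InMemoryDataset<E> dataset = new InMemoryDataset<>(name);
            datasets.put(name, dataset);
            return dataset;
        }

        boolean exists(String name) {
            return datasets.containsKey(name);
        }
    }

    public static void main(String[] args) {
        InMemoryRepository repo = new InMemoryRepository();
        InMemoryDataset<Event> events = repo.create("events");
        events.write(new Event(1L, System.currentTimeMillis()));
        System.out.println(events.name() + " holds " + events.read().size() + " record(s)");
    }
}
```

The point of the separation is that application code talks to the dataset and repository, while storage concerns such as file formats and directory layout stay behind those two types.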

Co-authoring O’Reilly book

•  Titled ‘Hadoop Application Architectures’
•  How to build end-to-end solutions using Apache Hadoop and related tools
•  Updates on Twitter: @hadooparchbook
•  http://www.hadooparchitecturebook.com/

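Code

// Imports assume Kite 0.10+ package names (org.kitesdk); adjust for your version.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.kitesdk.data.*;
import org.kitesdk.data.filesystem.FileSystemDatasetRepository;

// Open a dataset repository rooted at /data on the default filesystem.
DatasetRepository repo = new FileSystemDatasetRepository.Builder()
    .fileSystem(FileSystem.get(new Configuration()))
    .directory(new Path("/data"))
    .get();

// Create "events": Avro schema from event.avsc, hash-partitioned on userId into 53 buckets.
Dataset events = repo.create("events",
    new DatasetDescriptor.Builder()
        .schema(new File("event.avsc"))
        .partitionStrategy(
            new PartitionStrategy.Builder().hash("userId", 53).get()
        ).get()
);

// The slide used an undefined `schema`; here it is read back from the dataset's descriptor.
Schema schema = events.getDescriptor().getSchema();

// Write one record, closing the writer even if the write fails.
DatasetWriter<GenericRecord> writer = events.getWriter();
writer.open();
try {
    writer.write(
        new GenericRecordBuilder(schema)
            .set("userId", 1)
            .set("timeStamp", System.currentTimeMillis())
            .build()
    );
} finally {
    writer.close();
}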

Data

/data
  /events
    /.metadata
      /schema.avsc
      /descriptor.properties
    /userId=0
      /10000000.avro
      /10000001.avro
    /userId=1
      /20000000.avro
    /userId=2
      /30000000.avro
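The directory names in the listing above come from the partition strategy. As a rough sketch of the idea, the following plain Java maps a key to one of N buckets and builds a `field=value` path from it. The hash function and path format that Kite actually uses may differ; `bucketFor` and `partitionPath` are hypothetical helpers.

```java
// A sketch of hash partitioning: map a key to one of N buckets and derive a
// directory path from it. The hash function Kite actually uses may differ.
class HashPartitionSketch {

    // Bucket index in [0, buckets) for a key; floorMod keeps it non-negative
    // even when hashCode() is negative.
    static int bucketFor(Object key, int buckets) {
        return Math.floorMod(key.hashCode(), buckets);
    }

    // Directory for a record, following the "field=value" naming seen in the listing.
    static String partitionPath(String datasetRoot, String field, Object key, int buckets) {
        return datasetRoot + "/" + field + "=" + bucketFor(key, buckets);
    }

    public static void main(String[] args) {
        for (int userId : new int[] {1, 54, 107, -3}) {
            System.out.println(userId + " -> " + partitionPath("/data/events", "userId", userId, 53));
        }
    }
}
```

With 53 buckets, user IDs 1, 54, and 107 all land in the same directory, which is the intended effect: the number of partitions stays bounded regardless of how many distinct users there are.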

Kite SDK Morphlines Module

•  Pluggable, configuration-driven data transform library
•  Born out of Cloudera Search, but general purpose
•  Configure record transform stages in a container library
•  Use the library in Flume, MapReduce jobs, Storm, and other Java applications
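The idea of a configuration-driven chain of record transform stages can be sketched in a few lines of plain Java. This is not the Morphlines API or its configuration format: the stage names (`trimMessage`, `lowercaseMessage`, `addLength`) and the `compile` method are hypothetical, and a record is modelled simply as a `Map<String, Object>`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Illustrative only: a tiny configuration-driven record pipeline. The stage
// names and this API are hypothetical, not the Morphlines command set.
class TransformChainSketch {

    // Registry of available stages, keyed by the name used in configuration.
    static final Map<String, UnaryOperator<Map<String, Object>>> STAGES = new HashMap<>();

    static {
        STAGES.put("trimMessage", r -> {
            r.put("message", String.valueOf(r.get("message")).trim());
            return r;
        });
        STAGES.put("lowercaseMessage", r -> {
            r.put("message", String.valueOf(r.get("message")).toLowerCase());
            return r;
        });
        STAGES.put("addLength", r -> {
            r.put("length", String.valueOf(r.get("message")).length());
            return r;
        });
    }

    // Build a pipeline from an ordered list of stage names (the "configuration").
    static UnaryOperator<Map<String, Object>> compile(List<String> stageNames) {
        List<UnaryOperator<Map<String, Object>>> chain = new ArrayList<>();
        for (String name : stageNames) {
            UnaryOperator<Map<String, Object>> stage = STAGES.get(name);
            if (stage == null) {
                throw new IllegalArgumentException("Unknown stage: " + name);
            }
            chain.add(stage);
        }
        return record -> {
            // Work on a copy so the caller's input record is left untouched.
            Map<String, Object> current = new HashMap<>(record);
            for (UnaryOperator<Map<String, Object>> stage : chain) {
                current = stage.apply(current);
            }
            return current;
        };
    }

    public static void main(String[] args) {
        Map<String, Object> input = new HashMap<>();
        input.put("message", "  Hello Hadoop ");
        UnaryOperator<Map<String, Object>> pipeline =
            compile(Arrays.asList("trimMessage", "lowercaseMessage", "addLength"));
        System.out.println(pipeline.apply(input));
    }
}
```

Because the pipeline is assembled from a list of names, changing the transform means editing configuration rather than recompiling the host application, which is what lets the same library sit inside Flume, MapReduce jobs, or other Java programs.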

Other Modules

Maven plugin
•  Package, deploy, and execute “apps”
•  Execute dataset operations

Examples
•  POJO, generic, and generated entity ingest
•  Dataset administrative operations
•  Crunch and MR integration
•  ...

Future

HBase
•  Extending data APIs to support random access
•  Same automatic serialization, schema management, etc.

Higher-order data management
•  Common tasks
•  Think background compaction, conversion, etc.

Integration with existing middleware frameworks

Give us all your good ideas (and code)!

Kite SDK Resources

•  Docs
   •  http://kitesdk.org/docs/current/
•  Examples
   •  https://github.com/kite-sdk/kite-examples
•  Source code
   •  https://github.com/kite-sdk/
•  Binary artifacts available from Cloudera’s Maven repository

•  Twitter: @mark_grover
•  Slides at
•  LinkedIn: linkedin.com/in/grovermark
