LEVERAGING HADOOP IN POLYGLOT
ARCHITECTURES
Thanigai Vellore
Enterprise Architect at Art.com
@tvellore
AGENDA
 Background on Art.com
 Polyglot Architecture at Art.com
 Hadoop Integration Use cases
 Frameworks and Integration Components
 Q&A
BACKGROUND ON ART.COM
OUR MISSION
POLYGLOT ARCHITECTURE
WEB: .NET, JAVA, NODE.JS
SERVICES/API: .NET, JAVA, NODE.JS
DATABASE: SQL Server, MongoDB
SEARCH: ENDECA, SOLR
HADOOP @ ART.COM
 Use Hadoop to implement data-driven capabilities
via a centralized platform that can be consumed by all
our brands
 Intelligent data platform that supports different types
of workloads (batch, stream processing, search, etc)
 Use frameworks that enable interoperability between
different technologies and teams
 We use Cloudera’s Enterprise Data Hub (EDH)
 Data Governance
 Centralized Management
 Security and Compliance
HADOOP USE CASES
CLICKSTREAM ANALYTICS
GOALS
 Platform to collect, ingest, analyze and report
aggregate information on clickstream activities at
scale
 Seamless integration with existing systems
 Traditional BI tools (Business Objects)
 Web analytics (Google Analytics)
 Marketing platforms (Email, SEM, etc)
 Provide foundation for building near real time closed
loop predictive engines
HIGH LEVEL ARCHITECTURE
(Diagram: Websites (Java / .NET / Node.js) with a GA Logger → Clickstream Service → Apache Flume (Avro source → ETL with Morphlines → Avro serializer) → clickstream data in HDFS → Sessionization (Apache Crunch, scheduled by Apache Oozie) → Sessions table in Hive on HDFS → Business Objects via the Hive ODBC driver)
CLICKSTREAM COLLECTION
 Google Analytics provides local logging capability
(_setLocalRemoteServerMode)
 Capture all pageviews and GA events via a simple
JavaScript file included on all pages
 Clickstream events are sent to a clickstream service that
transforms incoming events and emits Avro records
 Flume Client SDK (NettyAvroRpcClient) is used to
send data into the agent
 Factory – org.apache.flume.api.RpcClientFactory:
RpcClient getInstance(properties)
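As a rough illustration of this flow (not code from the deck; the agent host, port and payload are placeholders), a service can hand events to the Flume agent like this:
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class ClickstreamForwarder {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.setProperty("client.type", "default");                    // default = NettyAvroRpcClient
    props.setProperty("hosts", "h1");
    props.setProperty("hosts.h1", "flume-agent.example.com:41414"); // placeholder agent address
    props.setProperty("batch-size", "100");
    RpcClient client = RpcClientFactory.getInstance(props);
    try {
      // In the real service the body is a serialized clickstream Avro record.
      byte[] body = "example clickstream payload".getBytes(StandardCharsets.UTF_8);
      Event event = EventBuilder.withBody(body);
      client.append(event);                                         // send the event to the Avro source
    } finally {
      client.close();
    }
  }
}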
CLICKSTREAM INGESTION
 Clickstream events are ingested using Apache Flume from an
AvroSource
 Kite Morphlines (used as Flume Source Interceptor) is
used for ETL transformation into Avro
 AvroSerializer used to write Avro records to HDFS (HDFS
Sink)
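A sketch of what the corresponding Flume agent configuration can look like (agent, channel and interceptor names, port, paths and the serializer class are assumptions, not the production config):
agent.sources = clickstream-src
agent.channels = mem-ch
agent.sinks = hdfs-sink

# Avro source receiving events from the clickstream service
agent.sources.clickstream-src.type = avro
agent.sources.clickstream-src.bind = 0.0.0.0
agent.sources.clickstream-src.port = 41414
agent.sources.clickstream-src.channels = mem-ch

# Kite Morphlines used as a source interceptor for ETL into Avro
agent.sources.clickstream-src.interceptors = etl
agent.sources.clickstream-src.interceptors.etl.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
agent.sources.clickstream-src.interceptors.etl.morphlineFile = /etc/flume-ng/conf/clickstream-morphline.conf
agent.sources.clickstream-src.interceptors.etl.morphlineId = morphline1

agent.channels.mem-ch.type = memory

# HDFS sink writing Avro records into hourly partitions
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = mem-ch
agent.sinks.hdfs-sink.hdfs.path = /data/clickstream/year=%Y/month=%m/day=%d/hour=%H
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
agent.sinks.hdfs-sink.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder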
AVRO
 Storage format (Persistence) and wire protocol
(Serialization)
 Self describing (schema stored with data)
 Supports data compression and is MapReduce
friendly
 Supports easier schema evolution
 Read/write data in Java, C, C++, Python, PHP, and
other languages
Platform   Library                  Link
.NET       Microsoft Avro Library   https://hadoopsdk.codeplex.com/
Node.js    node-avro-io             https://www.npmjs.com/package/node-avro-io
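For illustration only, a minimal clickstream schema in Avro's JSON syntax might look like this (the actual schema used at Art.com is not shown in the deck):
{
  "type": "record",
  "name": "Clickstream",
  "namespace": "com.art.clickstream",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "sessionId", "type": "string"},
    {"name": "pageUrl",   "type": "string"},
    {"name": "eventType", "type": "string"},
    {"name": "referrer",  "type": ["null", "string"], "default": null}
  ]
}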
KITE SDK
 Open source SDK (www.kitesdk.org) - Apache 2.0
licensed
 High level data layer for Hadoop
 Codify best practices for building data-oriented
systems
 Loosely coupled modular Design
 Kite Data Module
 Kite Morphlines
 Kite Maven Plugin
KITE DATA MODULE
 Set of APIs for interacting with data in Hadoop
 Entities
 A single record in a dataset
 Simple or complex and nested (avro or POJO)
 Dataset
 A collection of entities/records
 Data types and field names defined by Avro schema
 Dataset Repository
 Physical storage location for datasets
Kite Abstraction     Relational Equivalent
Entity               Record
Dataset              Table
Dataset Repository   Database
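A small sketch of these abstractions with the Kite Data API (the URI, schema location and use of the Clickstream Avro class are assumptions based on the examples later in the deck):
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.DatasetWriter;
import org.kitesdk.data.Datasets;

public class CreateClickstreamDataset {
  public static void main(String[] args) throws Exception {
    // The Avro schema defines the entity's field names and data types.
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schemaUri("resource:clickstream.avsc")
        .build();

    // URI = dataset repository (HDFS) + dataset name, analogous to database + table.
    Dataset<Clickstream> clickstream =
        Datasets.create("dataset:hdfs:/data/clickstream", descriptor, Clickstream.class);

    Clickstream event = new Clickstream();   // populate fields before writing in real code
    DatasetWriter<Clickstream> writer = clickstream.newWriter();
    try {
      writer.write(event);                   // each entity is one record in the dataset
    } finally {
      writer.close();
    }
  }
}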
KITE DATA MODULE
 Unified Storage Interface
 Support for Data Format, Partition Strategies and
Compression Formats
 Command Line Interface
 Utility commands to create, load, update datasets
 http://kitesdk.org/docs/0.17.1/cli-reference.html
(Diagram: Application → Kite Data → HDFS / HBase)
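Hypothetical examples of the CLI commands referenced above (dataset name and schema file are placeholders; see the CLI reference for the exact options):
kite-dataset create dataset:hdfs:/data/clickstream --schema clickstream.avsc
kite-dataset show dataset:hdfs:/data/clickstream
kite-dataset csv-import events.csv dataset:hdfs:/data/clickstream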
KITE MORPHLINES
 Open source framework for simple ETL in Hadoop
Applications
 Consume any kind of data from any kind of data
source, process it and load it into any app or storage
system
 Simple and flexible data mapping and transformation
 Similar to Unix pipelines with extensible set of
transformation commands
KITE MAVEN PLUGIN
 Maven goals for packaging, deploying, and running
distributed applications
 Create, update and delete datasets
mvn kite:create-dataset -Dkite.rootDirectory=/data \
  -Dkite.datasetName=clickstream \
  -Dkite.avroSchemaFile=/etc/flume-ng/schemas/clickstream.avsc
 Submit Jobs to oozie
mvn package kite:deploy-app -Dkite.applicationType=coordinator
mvn kite:run-app -Dkite.applicationType=coordinator \
  -Dstart="$(date -d '1 hour ago' +"%Y-%m-%dT%H:%MZ")"
SESSIONIZATION
 MapReduce program to transform raw clickstream
logs into an aggregate session summary using Apache
Crunch
 Hourly coordinator job scheduled using Apache
Oozie
 Triggered based on the presence of the hourly HDFS
partition folder (see the coordinator sketch below)
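A stripped-down sketch of such an hourly coordinator (names, dates and paths are placeholders, not the actual job); the empty done-flag is what makes Oozie trigger on the existence of the hourly partition directory itself:
<coordinator-app name="sessionize-hourly" frequency="${coord:hours(1)}"
                 start="2015-01-01T00:00Z" end="2099-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="clickstream" frequency="${coord:hours(1)}"
             initial-instance="2015-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs:///data/clickstream/year=${YEAR}/month=${MONTH}/day=${DAY}/hour=${HOUR}</uri-template>
      <!-- empty done-flag: the partition directory's existence triggers the run -->
      <done-flag></done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="clickstream">
      <instance>${coord:current(-1)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/sessionize/workflow.xml</app-path>
    </workflow>
  </action>
</coordinator-app>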
KITE CRUNCH INTEGRATION
 Enables loading Kite Dataset into Crunch Programs
 CrunchDatasets helper class
 CrunchDatasets.asSource(View view)
PCollection<Clickstream> clickstreamEvents = getPipeline().read(
    CrunchDatasets.asSource("dataset:hdfs:/data/clickstream", Clickstream.class));
 CrunchDatasets.asTarget(View view)
 Supports Crunch write modes and repartitioning
PCollection<Clickstream> clickstreamLogs = getPipeline().read(
    CrunchDatasets.asSource("dataset:hdfs:/data/clickstream", Clickstream.class));
DatasetRepository hcatRepo = DatasetRepositories.open(hiveRepoUri);
View<Session> sessionView = hcatRepo.load("sessions");
PCollection<Session> sessions = clickstreamLogs
    .by(new GetSessionId(), Avros.strings())
    .groupByKey()
    .parallelDo(new MakeSession(), Avros.specifics(Session.class));
getPipeline().write(sessions, CrunchDatasets.asTarget(sessionView), Target.WriteMode.APPEND);
APACHE HIVE ODBC DRIVER
 Used to read Hive Tables from Business Objects
 Fully compliant ODBC driver supporting multiple
Hadoop distributions
 High performance and throughput with support for
Hive2
 Supports the Hive grammar and standard SQL with a
wide range of data types
(Screenshot: semantically related clusters for a search such as "flowers", grouped by theme, artist and art style)
SEMANTIC SEARCH CLUSTERING
 Provide a new type of discovery experience that is visual and semantically related
 Identify semantically related clusters for searches and categories instead of relying only on hierarchical, taxonomy-based navigation
HIGH LEVEL ARCHITECTURE
(Diagram: Websites → Search Service (Node.js) → SOLRCLOUD; searches with no cluster results are logged to RabbitMQ → Search Processor (Node.js) → Clustering Engine (Carrot2) → HBase → Lily HBase Indexer → SOLRCLOUD; Hue is used for manual curation of the HBase data)
CLUSTERING ENGINE
 Carrot2 – open source search results clustering
engine
 Dynamically identifies semantically related
“clusters” based on search results
 Multiple clustering algorithms – Lingo, STC, K-means
 Pluggable Search Component in SOLR - runs on top
of SOLR search results
(Diagram: the Search Processor issues a search, and search clustering runs on top of the SOLR search results)
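As a rough illustration (not the production configuration), Carrot2 is wired into SOLR via the clustering search component in solrconfig.xml; the field names and the Lingo algorithm below are example choices:
<searchComponent name="clustering" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">lingo</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
  </lst>
</searchComponent>

<requestHandler name="/clustering" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <str name="clustering.engine">lingo</str>
    <bool name="clustering.results">true</bool>
    <!-- example field mappings: which fields feed the clustering algorithm -->
    <str name="carrot.title">title</str>
    <str name="carrot.snippet">description</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>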
NODE.JS & BIG DATA INTEGRATION
 NODE.JS
 Evented, non-blocking I/O – built on V8 runtime
 Ideal for scalable concurrent applications
Component   Protocol   NPM module
HBASE       Thrift     https://www.npmjs.com/package/node-thrift
HBASE       REST       https://www.npmjs.com/package/hbase
HDFS        REST       https://www.npmjs.com/package/node-webhdfs
Hive        Thrift     https://github.com/forward/node-hive
SOLR        REST       https://github.com/artlabs/solr-node-client
LILY HBASE INDEXER
 Acts as an HBase replication sink
 Horizontal Scalability via Zookeeper
 Automatic Failure Handling (inherits the HBase
replication system)
(Diagram: HBase Region Server (Memstore, HLog/WAL) → SEP Replication Source → HBase Indexer (Morphlines) → SOLRCLOUD)
LILY HBASE INDEXER
 Indexer Configuration Setup
hbase-indexer add-indexer \
  --name search_indexer \
  --indexer-conf /.search-indexer.xml \
  --connection-param solr.zk=ZK_HOST/solr \
  --connection-param solr.collection=search_meta \
  --zookeeper ZK_HOST:2181
 search-indexer.xml
<indexer table="search_meta"
         mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper"
         mapping-type="row" unique-key-field="id" row-field="keyword">
  <param name="morphlineFile" value="morphlines.conf"/>
</indexer>
KITE MORPHLINES IN HBASE INDEXER
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "cf:column1"
              outputField : "field1"
              type : string
              source : value
            }
          ]
        }
      }
      {
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
MORPHLINE AVRO SUPPORT
readAvroContainer – parses an Apache Avro binary container and emits a morphline
record for each contained Avro datum
extractAvroPaths – extracts specific values from an Avro object
commands : [
  {
    extractHBaseCells {
      mappings : [
        {
          inputColumn : "cf:column1"
          outputField : "_attachment_body"
          type : "byte[]"
          source : value
        }
      ]
    }
  }
  { readAvroContainer {} }
  {
    extractAvroPaths {
      paths : {
        meta : /meta_data
      }
    }
  }
]
REAL TIME TRENDING
GOALS
 Scalable Real time Stream
Processing Engine
 Based on clickstream data,
provide real-time trending
capability for all websites:
 Top Products Added to Cart
 Top Searches/Galleries visited
 Top User Galleries visited
 Low latency aggregations on
moving time window and
configurable time slices
HIGH LEVEL ARCHITECTURE
(Diagram: Websites → clickstream events → Apache Flume (Avro source → Morphlines ETL → two channels, one drained by the HDFS sink into HDFS and one by the Spark sink) → Spark Streaming on the clickstream → RabbitMQ → Node.js Aggregation Processor → notifications pushed back to the websites via socket.io)
WHY SPARK?
 Fast and Expressive
Cluster Computing Engine
 Leverages distributed
memory
 Linear scalability and fault
tolerance
 Rich DAG expressions for
data parallel computations
 Seamless Hadoop
Integration – Runs with
YARN and works with
HDFS
 Great Libraries (MLlib,
Spark Streaming,
SparkSQL, Graphx)
(Diagram: Apache Spark core with the Spark SQL, Streaming, MLlib and GraphX libraries; deployable standalone with local storage, or on Mesos or YARN over HDFS / S3)
SPARK STREAMING
• Extension of Spark Core API for large
scale stream processing of live data
streams
• Integrates with Spark’s batch and
interactive processing
• Runs as a series of small, deterministic
batch jobs
• DStream provides a continuous stream
of data (sequence of RDDs)
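A minimal sketch (not code from the deck; host, port, window sizes and the event parsing are assumptions) of a windowed trending aggregation over a DStream pulled from the Flume Spark sink described on the next slide:
import java.nio.charset.StandardCharsets;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;
import scala.Tuple2;

public class TrendingJob {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("realtime-trending");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

    // Pull events from the Flume SparkSink (see the agent configuration on the next slide).
    JavaReceiverInputDStream<SparkFlumeEvent> events =
        FlumeUtils.createPollingStream(jssc, "flume-agent.example.com", 9988);

    // Hypothetical parsing step: the real job deserializes the Avro clickstream record.
    JavaDStream<String> productIds = events.map(
        e -> new String(e.event().getBody().array(), StandardCharsets.UTF_8));

    // Sliding-window aggregation: counts over the last 10 minutes, recomputed every minute.
    JavaPairDStream<String, Integer> trending = productIds
        .mapToPair(id -> new Tuple2<>(id, 1))
        .reduceByKeyAndWindow((a, b) -> a + b, Durations.minutes(10), Durations.minutes(1));

    trending.print();
    jssc.start();
    jssc.awaitTermination();
  }
}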
FLUME SPARK SINK
 Pull-based Flume Sink
 Polling Flume Receiver to pull data from sink
 Strong Reliability and Fault Tolerance
 Flume Agent Configuration
 Custom Sink JAR available on Flume Classpath
 Flume Configuration
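The agent-side configuration for the custom sink looks roughly like this (channel name, hostname and port are placeholders):
agent.sinks = spark-sink
agent.sinks.spark-sink.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.spark-sink.hostname = flume-agent.example.com
agent.sinks.spark-sink.port = 9988
agent.sinks.spark-sink.channel = spark-ch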
SPARK ON YARN
 Leverage both hardware and expertise in dealing
with YARN
 Eliminate cost of maintaining a separate cluster
 Take advantage of YARN scheduler features
 Spark supports YARN cluster and client mode
(Diagram: the client submits the application to the YARN ResourceManager; the ApplicationMaster runs the Spark driver in a YARN container and requests additional containers, in which the YARN NodeManagers launch Spark executors that run the Spark tasks)
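For reference, a cluster-mode submission can look like this (class, jar and resource sizes are placeholders):
spark-submit --master yarn-cluster \
  --num-executors 4 --executor-memory 2g --executor-cores 2 \
  --class com.art.spark.TrendingJob trending-job.jar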
DATA SERIALIZATION & KITE SUPPORT
 Data Serialization - Key for good network performance
and memory usage
 Use Kryo Serialization – compact and faster
 Initialize with conf.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
 Register classes for best performance
conf.set("spark.kryo.registrator", “com.art.spark.AvroKyroRegistrator");
 Kite Dataset Support
 DatasetKeyInputFormat – read kite dataset from HDFS into RDDs
 DatasetKeyOutputFormat – write RDDs as Kite dataset
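Putting the pieces together, a sketch of reading a Kite dataset into an RDD with Kryo enabled (the registrator class name comes from the deck; the dataset URI and the newAPIHadoopRDD wiring follow the Kite examples and are assumptions here):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.kitesdk.data.mapreduce.DatasetKeyInputFormat;

public class ClickstreamRddJob {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("clickstream-rdd")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrator", "com.art.spark.AvroKyroRegistrator");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Point the Kite input format at the clickstream dataset.
    Job job = Job.getInstance(new Configuration());
    DatasetKeyInputFormat.configure(job)
        .readFrom("dataset:hdfs:/data/clickstream")
        .withType(Clickstream.class);

    // Keys are the Avro entities; DatasetKeyInputFormat uses Void values.
    JavaPairRDD<Clickstream, Void> clickstream = sc.newAPIHadoopRDD(
        job.getConfiguration(), DatasetKeyInputFormat.class, Clickstream.class, Void.class);

    System.out.println("records: " + clickstream.count());
    sc.stop();
  }
}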
THANK YOU
tvellore@art.com
@tvellore
Editor's Notes
  • #3: Moving over to the agenda, I will start off by providing a brief background on art.com. I will also explain the technology landscape within art.com across multiple tiers of the stack. Then, I will present 3 different usecases in detail where Hadoop was utilized and integrated within the stack. For each usecase, we will review the different tools, frameworks and components used to integrate hadoop with the rest of our stack. And this will be the focus of the presentation. Then, hopefully we will have time for Q&A
  • #4: A brief background on art.com – Art.com is a leading online retailer for wall art. We are headquartered in the SF bay area and have about 700 employees worldwide. We have two distribution centers – one in the US and another in Europe. We have the world’s largest selection of curated images for wall art, providing over 3M images from different publishers. Our websites and brands are global and have a strong international presence – we have about 35 websites in over 25 countries and 17 languages. Over the years, we have built some unique technologies and proprietary online tools that help in simplifying the art buying process.
  • #5: Our mission is to make art accessible to all by transforming the way the world discovers, personalizes, shares and purchases art. We have a portfolio of 5 different brands that aim to fulfill that mission. The art.com brand focuses on home décor, providing easy access to the world’s largest selection of hand-picked art images. Our brands include AllPosters.com and PosterRevolution, which are online destinations focusing on the latest trends in the wall décor category. Zenfolio is an “all-in-one” solution for photographers to organize, display and sell their work online. Artist Rising is an online community of independent and emerging artists connecting art enthusiasts with rising artists from around the globe.
  • #6: We have a heterogeneous stack at art.com as there are multiple websites and brands some of which were acquired by the company. In addition, we are also working on evolving and upgrading the stack to leverage latest technologies. As a result, we have to deal with multiple technologies across the stack. On the web tier, we have .NET, Java and have been migrating and developing new features on node.js. On the services and API side, again we have .NET and Java and we have developed aggregation layer using Node.js. On the database side, we use both SQL Server and MongoDB and for search, we have historically used Endeca but we are also moving towards SOLR. So, as you can see, we have a polyglot system architecture which presents its own INTEGRATION challenges
  • #7: So, for us, we looked at Hadoop as a way to create a centralized data platform to implement data-driven capabilities that can be consumed by all our brands. Instead of having silos of multi-structured data, we used Hadoop as a centralized data hub as one place to store all data and for as long as required in its original fidelity. In addition to storing the data, we wanted to build an intelligent data platform that would allow us to run a variety of enterprise workloads – whether it is batch processing, interactive SQL or Search, stream processing or machine learning. In addition, it should be based on open architecture so that it is interoperable with the rest of the stack. We went with Cloudera’s Enterprise Data Hub distribution. The enterprise data hub provides data governance capabilities of allowing complete metadata management and audit logging and reporting. In addition, it provides robust access controls (using Sentry) and shared security policies.
  • #8: Now, I would like to go over a few implementation use cases where we leveraged Hadoop to create information-driven solutions
  • #10: The first use case that I would like to talk about is the implementation of “Clickstream Analytics”. The goal of Clickstream Analytics was to create a scalable platform to collect, ingest, analyze and report aggregate information on user visits. A key requirement of the platform was also that it should have seamless integration with existing systems. For example, we use Business Objects as our BI system and it was very important that we are able to provide a consolidated view of the clickstream data within Business Objects. Similarly, we use Google Analytics for web analytics and wanted to capture all the events and custom variables tracked in GA into the clickstream platform. Also, we want to use the clickstream data to further strengthen our internal marketing platforms. Finally, in addition to advanced analytics, we want to make sure that the data pipeline that we create to ingest clickstream data provides a foundation for near-real-time closed-loop predictive analytics.
  • #11: This is a high-level architecture diagram for the clickstream analytics infrastructure. As I had mentioned earlier, our web stack consists of multiple technologies for the different brands. However, we use Google Analytics on all the pages in all our websites. So, we created this “GA Logger”, which is a common JavaScript library that pipes out all GA events and sends them to a “Clickstream Service”. The clickstream service transforms the raw incoming events and projects them into an Avro source. We use Apache Flume for distributed log ingestion. The Flume agent listens for events from the Avro source and then processes them through an ETL library called Morphlines to transform the events into well-defined clickstream Avro records. Using the Avro serializer, the transformed Avro records are written into an HDFS sink. Then, we have an hourly sessionization MapReduce program which analyzes the data for the previous hour and aggregates session summaries for behavioral analysis and understanding user sessions. This hourly job is scheduled via Apache Oozie and the sessionization program is written using Apache Crunch. The output of the sessionization is stored in the sessions table, which is an externally managed table in Hive. We use Business Objects as our BI tool and we have built dimensions and facts on the session data using the Hive ODBC driver. Now, let’s look at the individual components in a bit more detail.
  • #12: Now, talking about how we collect the clickstream events. We use GA for tracking all user events and page views, and the GA tracking code allows you to store a backup copy of the data Google collects. It enables that through a property called local logging (_setLocalRemoteServerMode). So, we enabled this property in our GA initialization script which is included on all pages. With that script in effect, we can capture all page views and GA events and send them to a clickstream service. The clickstream service is a simple HTTP listener which transforms incoming events and emits Avro records. The Flume Client SDK provides RPC client implementations for Avro and Thrift via RpcClientFactory, and the clickstream service uses the NettyAvroRpcClient to send the Avro records to the Flume agent.
  • #13: As I had mentioned earlier, we use Flume for ingesting the events. Flume is a distributed log collection and ingestion service. Flume collects data using configurable "agents”. Agents can receive data from many sources, including other agents. Flume also supports inspection and modification of in-flight data through interceptors, which get invoked on events as they travel between a source and channel. Multiple interceptors can be chained together in a sequence. In our case, we use Kite Morphlines as our interceptor for transforming the incoming events to the required Avro schema. We use the AvroSerializer to write the Avro records into the HDFS sink.
  • #14: I want to take a minute to talk about the data format that we used throughout this platform, which is Avro. Avro defines a data format designed to support data-intensive applications and is widely supported throughout Hadoop and its ecosystem – it provides both an RPC and a serialization framework. It is self-describing as the schema is stored with the data, which allows for schema evolution in a scalable manner. The producers and consumers can use different versions of the schema and continue to work fine, which is very important for us. In addition, both Hive and Impala support Avro-formatted records. Also, with Avro we can read and write data in a variety of languages and platforms like Java, C, C++ and Python, which are supported out of the box. Also for us, we wanted a format that works well with .NET and Node.js, and libraries are available for both.
  • #15: One of the open source data frameworks that we used throughout this project was the Kite SDK. The Kite SDK is a set of libraries, tools and documentation that make it easier to build systems on top of the Hadoop stack. It acts as a high-level data layer on top of Hadoop. It codifies expert patterns and practices for building data-oriented systems. It provides smart defaults for platform choices and abstracts the plumbing and infrastructure of the data layer, letting us focus on the business logic. The project is organized as loosely coupled modules so that we can pick and choose the modules that are of interest. At a high level, it has 3 modules: the Data module, Morphlines and the Maven Plugin, which I will explain more in the next few slides.
  • #16: The Kite Data module is a set of APIs for interacting with data in Hadoop; specifically, direct reading and writing of datasets in storage subsystems such as the Hadoop Distributed File System (HDFS) and HBase. The Kite Data module reflects best practices for default choices, data organization, and metadata system integration. It does that by providing well-defined interfaces and abstractions for data organization. At a high level, the data module contains the following abstractions. The first one is entities. An entity is a single record in a dataset. Entities can be simple types, representing data structures with a few string attributes, or as complex as required, containing maps, lists, or other POJOs. A dataset is a collection of zero or more entities, represented by the interface Dataset. The relational database analog of a dataset is a table. Datasets are identified by URIs. The HDFS implementation of a dataset is stored as Snappy-compressed Avro data files by default. A dataset repository is a physical storage location for datasets (similar to a database). We can organize datasets into different dataset repositories. This is to enable logical grouping, security and access control, backup policies, and so on.
  • #17: So, the data module provides a consistent and unified interface for working with your data. We have control of implementation details, such as whether to use Avro or Parquet format, HDFS or HBase storage, and snappy compression or LZO or Gzip. We can just specify the option and Kite handles the implementation part. The datasets that we create with Kite can be queried with Hive and Impala just like hadoop datasets. Kite also comes with a command line interface that can be used to create, update, delete or load data into datasets. So, as you can see here through these well defined data abstraction, there is a clear separation of business logic which resides in the application and then all mechanics of data management is handled by Kite.
  • #18: Morphlines is another module that is part of the Kite SDK. Morphlines provides a simple ETL framework for Hadoop applications. A morphline is a rich configuration file that makes it easy to define a transformation chain that consumes any kind of data from any kind of data source, processes the data and loads the results into a Hadoop component. It replaces Java programming with simple configuration steps, and correspondingly reduces the cost and integration effort associated with developing and maintaining custom ETL projects. Morphlines can be seen as an evolution of Unix pipelines where the data model is generalized to work with streams of generic records. It is an efficient way to consume a stream of records and pipe it through a set of easily configurable transformations on the way to a target application. Since Morphlines is a library, it can also be embedded in any Java codebase. We use morphlines as a Flume interceptor to transform incoming clickstream events into a well-defined Avro Kite dataset that is stored in HDFS.
  • #19: The Kite Maven Plugin provides maven goals for packaging, deploying and running distributed applications. Using the plugin, we can create or delete datasets. For example, here is a command to create a kite dataset and the parameters provided are the root directory in HDFS, the name of the dataset and the avro schema file for the dataset. The plugin also exposes goals for deploying a packaged application to HDFS. This command packages oozie coordinator job and deploys it to HDFS. Also, we can run an app as job on the cluster. The command here does exactly that.
  • #20: The next important piece of the clickstream analytics infrastructure is the Sessionization program. This is a mapreduce program that runs hourly which analyzes the clickstream data for the previous hour and generates aggregate session summary It is implemented using Apache Crunch which is a High-level API for map-reduce providing the full power and expressiveness of the language. The goal of Crunch is to make pipelines that are composed of many user-defined functions that is simple to develop, test, and efficient to run. The hourly Crunch job is invoked as a coordinator scheduled using Apache Oozie which is a scheduling system for Hadoop. The workflow is triggered based on the presence of HDFS files for that hour.
  • #21: We utilize the Kite Data Module within crunch too. Kite exposes helper classes to both read and write Kite Datasets inside Crunch. The CrunchDatasets.asSource returns a Pcollection as a Crunch Source – The statement over here shows how the Clickstream kite dataset from HDFS is loaded inside Crunch. Now, subsequent distributed operations can be performed on this Pcollection. Similarly CrunchDatasets.asTarget can be used to expose the dataset as a Crunch Target. Also, it supports different write modes such as Overwrite and Append and also allows for repartition before writing. Here’s a short snippet of code that shows how Kite is used inside Crunch. You can see how a dataset is loaded and a sequence of operations are performed which results in a Pcollection<Session>. This Pcollection is then written into sessions dataset repository.
  • #22: The final integration component within the Clickstream Analytics setup is the Apache Hive ODBC driver, which is used to integrate Business Objects, our BI tool, with the Sessions Hive table. This is a fully compliant ODBC driver supporting multiple Hadoop distributions. We have built dimensions and facts on the session data using the Hive ODBC driver. The driver works with Hive2 and supports the Hive grammar and standard SQL with a wide range of data types.
  • #24: The goal of this initiative is to build a capability to provide a new type of discovery experience that is visual and semantically related. Conventional discovery experiences on ecommerce sites are typically taxonomy based which is very hierarchical in nature. However, we wanted to build a visually engaging experience that is contextually and semantically relevant and not tied to a typical parametric search experience. So, the goal was to build semantically related clusters for all searches and categories. Here’s a screenshot of that feature in action – where you can see a search for “flowers” and the semantically related clusters are shown based on theme, artists and art style. For example, clusters like poppy, roses and lilies are semantically related but also provide a strong visual coherence that provides a good shopping experience
  • #25: Now, let’s move on to the architecture of this platform. The web client calls the search service, which is built using Node.js. The search service gets all search terms and refinements and queries SOLR to retrieve the semantically related clusters for that state. The semantic clusters are stored in SOLR for search. When it does not find any result for a search term, it logs an event into a RabbitMQ queue. The semantic engine, which is responsible for running the clustering algorithm, listens to the queue, retrieves incoming messages and processes them. It invokes the clustering engine (which is built using the Carrot2 library) to run the clustering algorithm and generate the clusters, which are stored in HBase. Then, by enabling replication on the column family that stores the cluster data, the mutations that happen on HBase are replicated to SOLR through the Lily HBase Indexer component. Also, in addition to algorithmically generated clusters for searches, we also use Hue to manually curate content for some top searches, and these updates also get replicated to SOLR. So, this is how the search cluster index is continuously built, based on demand-driven searches. Now, let’s look at the key components of this infrastructure.
  • #26: Now, talking about the clustering Engine – we use Carrot2 as the clustering library. Carrot2 is an open source search results clustering engine. It can automatically organize small collections of documents into thematic categories. It is implemented in Java but can be integrated quickly with a wide variety of APIs such as Google , Yahoo and Bing. It can also be integrated as a search component in SOLR, which is what we use. Carrot2 is a pluggable search component and the clustering algorithm runs on top of the SOLR search results which is accessible via a search handler in SOLR. There are multiple clustering algorithms available in Carrot2 library with a rich set of tokenizers and stop word lists that can be configured to our needs.
  • #27: We use Node.js a lot for backend I/O-bound processes. Node.js is a platform built on Chrome’s V8 JavaScript runtime for building fast, scalable network applications. It provides evented, non-blocking I/O with an event loop, which makes all I/O operations asynchronous. There are some npm modules available for integrating with the Hadoop ecosystem. The table here shows the modules available for different components like HBase, HDFS, Hive, Solr and Zookeeper. Most of the modules are built on top of the Thrift or REST gateway exposed within each component. We use the modules highlighted here in orange.
  • #28: The next important integration component of this stack is the Lily HBase Indexer. The Lily HBase Indexer Service is a scalable, fault-tolerant system for processing a continuous stream of HBase cell updates into live search indexes. The HBase Indexer works by acting as an HBase replication sink. As updates are written to HBase region servers, they are written to the HLog (WAL), and HBase replication continuously polls the HLog files to get the latest changes, which are "replicated" asynchronously to the HBase Indexer processes. The indexer analyzes incoming HBase mutation events, creates Solr documents and pushes them to SolrCloud servers. All information about indexers is stored in ZooKeeper. So, new indexer hosts can always be added to a cluster, in the same way that HBase region servers can be added to an HBase cluster. We use the Kite Morphlines library for mapping the column names in the HBase table to fields in the SOLR collection.
  • #29: This slide shows how an indexer configuration is set up. The Lily HBase Indexer service provides a command line utility that can be used to add, list, update and delete indexer configurations. The first command shown here registers and adds an indexer configuration to the HBase Indexer. This is done by passing an indexer configuration XML file along with the ZooKeeper ensemble used for HBase and SOLR and the SOLR collection. The XML configuration file provides the option to specify the HBase table which needs to be replicated and a mapper. Here we use the morphline framework again to transform the columns in the HBase table to SOLR fields, and we pass the morphline file which has the pipeline of commands to do the transformation.
  • #30: As we had seen earlier, the morphline file contains a chain of ETL transformations that are executed as a pipeline. It can have any number of commands – I’m just showing here a very basic morphline file where the first command is extractHBaseCells, which is a morphline command that extracts cells from an HBase Result and transforms the values into a SolrInputDocument. The command consists of an array of zero or more mapping specifications. We can list an array of such mappings here. The second command is to sanitize unknown fields from being written to SOLR. The mapper that we used, MorphlineResultToSolrMapper, has the implementation to write the morphline fields into SOLR documents.
  • #31: Morphline comes with some handy utilities for reading and writing Avro formatted objects. This can be combined with the extracthbaseCells command to transform a kite avro formatted dataset persisted in Hbase as byte arrays. The readAvroContainer command parses an InputStream or byte array that contains Apache Avro binary container file data. For each Avro datum, the command emits a morphline record containing the datum. The Avro schema that was used to write the Avro data is retrieved from the Avro container. Optionally, the Avro schema that shall be used for reading can be supplied with a configuration option; otherwise it is assumed to be the same as the writer schema. The input stream or byte array is read from the first attachment of the input record.
  • #32: The next implementation use case that I would like to present is the “real time trending”
  • #33: The goal of this project was to implement a scalable real-time stream processing engine that provides trend aggregations. Based on clickstream data, we want the ability to build real-time trends on user activity on the website. Some use cases that were implemented on top of the stream processing engine are computing the top products being added to cart, the top searches or category pages visited, and the top user galleries visited. One of the main requirements for this engine was to be able to compute aggregations based on a configurable sliding window of data so that trends can be recomputed at set intervals.
  • #34: Now, lets look at how this platform was implemented. As I had mentioned earlier, we use Flume for ingesting our clickstream data from our websites. The client transmits the events into an Avro Source. The source that is receiving the event passes the event through the source interceptors where the event is transformed into a clickstream avro record using the kite morphlines ETL framework. After it passes through the source interceptors, the event is fanned out to two different channels. Each channel is connected to a sink which drains the events from their respective channel. In this case, one sink is the HDFS sink which writes the avro records into HDFS as kite dataset that is used for clickstream analytics while the other channel is configured to connect to a Spark Sink which is a custom flume sink where events get buffered for consumption by the Spark engine. We use Spark Streaming which is an extension of the Spark Core API for large scale stream processing from live datastreams. Spark streaming uses a reliable Flume receiver to pull data from the sink and the trend aggregates are computed as a series of small batch programs. The computed trend summaries are then written to a Rabbitmq exchange. Then, we have a node aggregation process that subscribes to messages from the queue which retrieves the messages from different queues and creates a trend object in JSON. This trend object is then broadcasted to all clients via socket.io. Socket.io is a library that enables real-time bidirectional event based communication via websockets. This way the UI module on the websites get automatically updated with new trend information as they happen without requiring a page refresh or a periodic poll to the server. Now, let’s look at the key components in this stack in more detail
  • #35: With this real-time trending platform, we decided to use Apache Spark for building the stream processing engine. For people not familiar with Spark, Spark is a fast and general purpose cluster computing engine that leverages distributed memory. It has well defined and clean APIs available in multiple languages which can be used to write programs in terms of transformations on distributed datasets. It uses a data abstraction called RDDs which stand for resilient distributed datasets which is a collection of objects spread across a cluster stored in RAM. RDDs are built through parallel transformations and can be automatically rebuilt on failure and thus provides linear scalability and fault tolerance. It also seamlessly integrates with the rest of the hadoop ecosystem both in terms of data compatibility and deployments Spark supports different deployment models – it can run on YARN, MESOS or a standalone mode. Plus, it comes with great libraries for machine learning, streaming processing, SQL engine and graph computations. Foundation for machine learning (recommendation systems)
  • #36: Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, ZeroMQ, Kinesis or TCP sockets and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems and databases. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams. Internally, Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
  • #37: In order to send the clickstream events to Spark Streaming, we use a custom Spark sink. This Spark sink uses a pull-based approach – events in the sink get buffered. Spark Streaming uses a reliable Flume receiver to pull data from the sink using transactions. The transactions succeed only after the data is received and replicated by Spark Streaming. This ensures strong reliability and guarantees fault tolerance. In order to configure this custom sink, it needs to be downloaded and be available on the Flume agent’s classpath. Then, in the Flume configuration, we specify the sink type to be the SparkSink.
  • #38: As I mentioned earlier, Spark supports pluggable cluster management and can run as either standalone or on YARN or mesos. We run our spark processes on Yarn for multiple reasons. One of the main reason is to leverage the same hadoop cluster hardware and also our existing expertise in running YARN applications. This leads to better utilization of the cluster and also eliminate the cost of maintaining a separate cluster. Also, we can take advantage of the features of the YARN scheduler for categorizing, isolating and prioritizing workloads. Spark can be run in YARN under two modes – a cluster mode and a client mode. The cluster mode is suitable for production deployments where the Spark driver runs inside an application master process which is managed by YARN on the cluster, so that the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. This is useful when you need the spark shell interactivity or for debugging purposes.
  • #39: Because of the in-memory nature of most Spark computations, serialization plays an important role in the performance of the application. Formats that are slow to serialize objects into, or that consume a large number of bytes, will greatly slow down the computation. Spark by default uses Java serialization, which is very flexible and works with most classes but is also very slow. We use Kryo serialization, which uses the Kryo library and is about 10 times more compact and much faster than Java serialization, but it does not support all Serializable types. You can switch to using Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This setting configures the serializer used not only for shuffling data between worker nodes but also when serializing RDDs to disk. Another requirement for the Kryo serializer is to register the classes in advance for best performance. The Scala snippet here shows how to register the Avro classes. This will register the use of Avro's specific binary serialization for the Clickstream class. Also, the Kite SDK can be used to read data from Kite dataset repositories residing in HDFS or Hive-backed tables into RDDs, and the results of the computed RDDs can be written back as Kite datasets into the respective repository. The DatasetKeyInputFormat and DatasetKeyOutputFormat classes can be configured to achieve just that. So, you can see here how the Kite SDK makes it easier to work with data across multiple platforms in the ecosystem.