LEVERAGING HADOOP IN POLYGLOT
ARCHITECTURES
Thanigai Vellore
Enterprise Architect at Art.com
@tvellore
AGENDA
 Background on Art.com
 Polyglot Architecture at Art.com
 Hadoop Integration Use cases
 Frameworks and Integration Components
 Q&A
BACKGROUND ON ART.COM
OUR MISSION
POLYGLOT ARCHITECTURE
WEB: .NET, JAVA, NODE.JS
SERVICES/API: .NET, JAVA, NODE.JS
DATABASE: SQL Server, MongoDB
SEARCH: ENDECA, SOLR
HADOOP @ ART.COM
 Use Hadoop to implement data-driven capabilities
via a centralized platform that can be consumed by all
our brands
 Intelligent data platform that supports different types
of workloads (batch, stream processing, search, etc)
 Use frameworks that enable interoperability between
different technologies and teams
 We use Cloudera’s Enterprise Data Hub (EDH)
 Data Governance
 Centralized Management
 Security and Compliance
HADOOP USE CASES
CLICKSTREAM ANALYTICS
GOALS
 Platform to collect, ingest, analyze and report
aggregate information on clickstream activities at
scale
 Seamless integration with existing systems
 Traditional BI tools (Business Objects)
 Web analytics (Google Analytics)
 Marketing platforms (Email, SEM, etc)
 Provide foundation for building near real time closed
loop predictive engines
HIGH LEVEL ARCHITECTURE
(Diagram: Websites (Java / .NET / Node.js) with a GA Logger → Clickstream Service → Apache Flume (Avro source → ETL with Morphlines → Avro serializer) → clickstream data in HDFS → Sessionization (Apache Crunch, scheduled by Apache Oozie) → Sessions table in Hive on HDFS → Business Objects via the Hive ODBC driver)
CLICKSTREAM COLLECTION
 Google Analytics provides local logging capability
(_setLocalRemoteServerMode)
 Capture all pageviews and GA events via a simple
JavaScript file included on all pages
 Clickstream events are sent to a clickstream service that
transforms incoming events and emits Avro records
 Flume Client SDK (NettyAvroRpcClient) is used to
send data into the agent
 Factory – org.apache.flume.api.RpcClientFactory:
RpcClient getInstance(properties)
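As a rough illustration of this flow (not code from the deck; the agent host, port and payload are placeholders), a service can hand events to the Flume agent like this:
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class ClickstreamForwarder {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.setProperty("client.type", "default");                    // default = NettyAvroRpcClient
    props.setProperty("hosts", "h1");
    props.setProperty("hosts.h1", "flume-agent.example.com:41414"); // placeholder agent address
    props.setProperty("batch-size", "100");
    RpcClient client = RpcClientFactory.getInstance(props);
    try {
      // In the real service the body is a serialized clickstream Avro record.
      byte[] body = "example clickstream payload".getBytes(StandardCharsets.UTF_8);
      Event event = EventBuilder.withBody(body);
      client.append(event);                                         // send the event to the Avro source
    } finally {
      client.close();
    }
  }
}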
CLICKSTREAM INGESTION
 Clickstream events are ingested using Apache Flume from an
AvroSource
 Kite Morphlines (used as Flume Source Interceptor) is
used for ETL transformation into Avro
 AvroSerializer used to write Avro records to HDFS (HDFS
Sink)
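A sketch of what the corresponding Flume agent configuration can look like (agent, channel and interceptor names, port, paths and the serializer class are assumptions, not the production config):
agent.sources = clickstream-src
agent.channels = mem-ch
agent.sinks = hdfs-sink

# Avro source receiving events from the clickstream service
agent.sources.clickstream-src.type = avro
agent.sources.clickstream-src.bind = 0.0.0.0
agent.sources.clickstream-src.port = 41414
agent.sources.clickstream-src.channels = mem-ch

# Kite Morphlines used as a source interceptor for ETL into Avro
agent.sources.clickstream-src.interceptors = etl
agent.sources.clickstream-src.interceptors.etl.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
agent.sources.clickstream-src.interceptors.etl.morphlineFile = /etc/flume-ng/conf/clickstream-morphline.conf
agent.sources.clickstream-src.interceptors.etl.morphlineId = morphline1

agent.channels.mem-ch.type = memory

# HDFS sink writing Avro records into hourly partitions
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = mem-ch
agent.sinks.hdfs-sink.hdfs.path = /data/clickstream/year=%Y/month=%m/day=%d/hour=%H
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
agent.sinks.hdfs-sink.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder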
AVRO
 Storage format (Persistence) and wire protocol
(Serialization)
 Self describing (schema stored with data)
 Supports data compression and is MapReduce
friendly
 Supports easier schema evolution
 Read/write data in Java, C, C++, Python, PHP, and
other languages
Platform   Library                  Link
.NET       Microsoft Avro Library   https://hadoopsdk.codeplex.com/
Node.js    node-avro-io             https://www.npmjs.com/package/node-avro-io
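For illustration only, a minimal clickstream schema in Avro's JSON syntax might look like this (the actual schema used at Art.com is not shown in the deck):
{
  "type": "record",
  "name": "Clickstream",
  "namespace": "com.art.clickstream",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "sessionId", "type": "string"},
    {"name": "pageUrl",   "type": "string"},
    {"name": "eventType", "type": "string"},
    {"name": "referrer",  "type": ["null", "string"], "default": null}
  ]
}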
KITE SDK
 Open source SDK (www.kitesdk.org) - Apache 2.0
licensed
 High level data layer for Hadoop
 Codify best practices for building data-oriented
systems
 Loosely coupled modular Design
 Kite Data Module
 Kite Morphlines
 Kite Maven Plugin
KITE DATA MODULE
 Set of APIs for interacting with data in Hadoop
 Entities
 A single record in a dataset
 Simple or complex and nested (avro or POJO)
 Dataset
 A collection of entities/records
 Data types and field names defined by Avro schema
 Dataset Repository
 Physical storage location for datasets
Kite Abstraction     Relational Equivalent
Entity               Record
Dataset              Table
Dataset Repository   Database
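A small sketch of these abstractions with the Kite Data API (the URI, schema location and use of the Clickstream Avro class are assumptions based on the examples later in the deck):
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.DatasetWriter;
import org.kitesdk.data.Datasets;

public class CreateClickstreamDataset {
  public static void main(String[] args) throws Exception {
    // The Avro schema defines the entity's field names and data types.
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schemaUri("resource:clickstream.avsc")
        .build();

    // URI = dataset repository (HDFS) + dataset name, analogous to database + table.
    Dataset<Clickstream> clickstream =
        Datasets.create("dataset:hdfs:/data/clickstream", descriptor, Clickstream.class);

    Clickstream event = new Clickstream();   // populate fields before writing in real code
    DatasetWriter<Clickstream> writer = clickstream.newWriter();
    try {
      writer.write(event);                   // each entity is one record in the dataset
    } finally {
      writer.close();
    }
  }
}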
KITE DATA MODULE
 Unified Storage Interface
 Support for Data Format, Partition Strategies and
Compression Formats
 Command Line Interface
 Utility commands to create, load, update datasets
 http://kitesdk.org/docs/0.17.1/cli-reference.html
(Diagram: Application → Kite Data → HDFS / HBase)
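Hypothetical examples of the CLI commands referenced above (dataset name and schema file are placeholders; see the CLI reference for the exact options):
kite-dataset create dataset:hdfs:/data/clickstream --schema clickstream.avsc
kite-dataset show dataset:hdfs:/data/clickstream
kite-dataset csv-import events.csv dataset:hdfs:/data/clickstream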
KITE MORPHLINES
 Open source framework for simple ETL in Hadoop
Applications
 Consume any kind of data from any kind of data
source, process it and load it into any app or storage
system
 Simple and flexible data mapping and transformation
 Similar to Unix pipelines with extensible set of
transformation commands
KITE MAVEN PLUGIN
 Maven goals for packaging, deploying, and running
distributed applications
 Create, update and delete datasets
mvn kite:create-dataset -Dkite.rootDirectory=/data \
  -Dkite.datasetName=clickstream \
  -Dkite.avroSchemaFile=/etc/flume-ng/schemas/clickstream.avsc
 Submit Jobs to oozie
mvn package kite:deploy-app -Dkite.applicationType=coordinator
mvn kite:run-app -Dkite.applicationType=coordinator \
  -Dstart="$(date -d '1 hour ago' +"%Y-%m-%dT%H:%MZ")"
SESSIONIZATION
 MapReduce program to transform raw clickstream
logs into an aggregate session summary using Apache
Crunch
 Hourly coordinator job scheduled using Apache
Oozie
 Triggered based on the presence of the hourly HDFS
partition folder (see the coordinator sketch below)
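A stripped-down sketch of such an hourly coordinator (names, dates and paths are placeholders, not the actual job); the empty done-flag is what makes Oozie trigger on the existence of the hourly partition directory itself:
<coordinator-app name="sessionize-hourly" frequency="${coord:hours(1)}"
                 start="2015-01-01T00:00Z" end="2099-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="clickstream" frequency="${coord:hours(1)}"
             initial-instance="2015-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs:///data/clickstream/year=${YEAR}/month=${MONTH}/day=${DAY}/hour=${HOUR}</uri-template>
      <!-- empty done-flag: the partition directory's existence triggers the run -->
      <done-flag></done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="clickstream">
      <instance>${coord:current(-1)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/sessionize/workflow.xml</app-path>
    </workflow>
  </action>
</coordinator-app>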
KITE CRUNCH INTEGRATION
 Enables loading Kite Dataset into Crunch Programs
 CrunchDatasets helper class
 CrunchDatasets.asSource(View view)
PCollection<Clickstream> clickstreamEvents = getPipeline().read(
    CrunchDatasets.asSource("dataset:hdfs:/data/clickstream", Clickstream.class));
 CrunchDatasets.asTarget(View view)
 Supports Crunch write modes and repartitioning
PCollection<Clickstream> clickstreamLogs = getPipeline().read(
    CrunchDatasets.asSource("dataset:hdfs:/data/clickstream", Clickstream.class));
DatasetRepository hcatRepo = DatasetRepositories.open(hiveRepoUri);
View<Session> sessionView = hcatRepo.load("sessions");
PCollection<Session> sessions = clickstreamLogs
    .by(new GetSessionId(), Avros.strings())
    .groupByKey()
    .parallelDo(new MakeSession(), Avros.specifics(Session.class));
getPipeline().write(sessions, CrunchDatasets.asTarget(sessionView), Target.WriteMode.APPEND);
APACHE HIVE ODBC DRIVER
 Used to read Hive Tables from Business Objects
 Fully compliant ODBC driver supporting multiple
Hadoop distributions
 High performance and throughput with support for
Hive2
 Supports the Hive grammar and standard SQL with a
wide range of data types
(Screenshot: semantically related clusters for a search such as "flowers", grouped by theme, artist and art style)
SEMANTIC SEARCH CLUSTERING
 Provide a new type of discovery experience that is visual and semantically related
 Identify semantically related clusters for searches and categories instead of relying only on hierarchical, taxonomy-based navigation
HIGH LEVEL ARCHITECTURE
(Diagram: Websites → Search Service (Node.js) → SOLRCLOUD; searches with no cluster results are logged to RabbitMQ → Search Processor (Node.js) → Clustering Engine (Carrot2) → HBase → Lily HBase Indexer → SOLRCLOUD; Hue is used for manual curation of the HBase data)
CLUSTERING ENGINE
 Carrot2 – open source search results clustering
engine
 Dynamically identifies semantically related
“clusters” based on search results
 Multiple clustering algorithms – Lingo, STC, K-means
 Pluggable Search Component in SOLR - runs on top
of SOLR search results
(Diagram: the Search Processor issues a search, and search clustering runs on top of the SOLR search results)
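As a rough illustration (not the production configuration), Carrot2 is wired into SOLR via the clustering search component in solrconfig.xml; the field names and the Lingo algorithm below are example choices:
<searchComponent name="clustering" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">lingo</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
  </lst>
</searchComponent>

<requestHandler name="/clustering" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <str name="clustering.engine">lingo</str>
    <bool name="clustering.results">true</bool>
    <!-- example field mappings: which fields feed the clustering algorithm -->
    <str name="carrot.title">title</str>
    <str name="carrot.snippet">description</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>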
NODE.JS & BIG DATA INTEGRATION
 NODE.JS
 Evented, non-blocking I/O – built on V8 runtime
 Ideal for scalable concurrent applications
Component   Protocol   NPM module
HBASE       Thrift     https://www.npmjs.com/package/node-thrift
HBASE       REST       https://www.npmjs.com/package/hbase
HDFS        REST       https://www.npmjs.com/package/node-webhdfs
Hive        Thrift     https://github.com/forward/node-hive
SOLR        REST       https://github.com/artlabs/solr-node-client
LILY HBASE INDEXER
 Acts as an HBase replication sink
 Horizontal Scalability via Zookeeper
 Automatic Failure Handling (inherits the HBase
replication system)
(Diagram: HBase Region Server (Memstore, HLog/WAL) → SEP Replication Source → HBase Indexer (Morphlines) → SOLRCLOUD)
LILY HBASE INDEXER
 Indexer Configuration Setup
hbase-indexer add-indexer \
  --name search_indexer \
  --indexer-conf /.search-indexer.xml \
  --connection-param solr.zk=ZK_HOST/solr \
  --connection-param solr.collection=search_meta \
  --zookeeper ZK_HOST:2181
 search-indexer.xml
<indexer table="search_meta"
         mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper"
         mapping-type="row" unique-key-field="id" row-field="keyword">
  <param name="morphlineFile" value="morphlines.conf"/>
</indexer>
KITE MORPHLINES IN HBASE INDEXER
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "cf:column1"
              outputField : "field1"
              type : string
              source : value
            }
          ]
        }
      }
      {
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
MORPHLINE AVRO SUPPORT
readAvroContainer – parses an Apache Avro binary container and emits a morphline
record for each contained Avro datum
extractAvroPaths – extracts specific values from an Avro object
commands : [
  {
    extractHBaseCells {
      mappings : [
        {
          inputColumn : "cf:column1"
          outputField : "_attachment_body"
          type : "byte[]"
          source : value
        }
      ]
    }
  }
  { readAvroContainer {} }
  {
    extractAvroPaths {
      paths : {
        meta : /meta_data
      }
    }
  }
]
REAL TIME TRENDING
GOALS
 Scalable Real time Stream
Processing Engine
 Based on clickstream data,
provide real-time trending
capability for all websites:
 Top Products Added to Cart
 Top Searches/Galleries visited
 Top User Galleries visited
 Low latency aggregations on
moving time window and
configurable time slices
HIGH LEVEL ARCHITECTURE
(Diagram: Websites → clickstream events → Apache Flume (Avro source → Morphlines ETL → two channels, one drained by the HDFS sink into HDFS and one by the Spark sink) → Spark Streaming on the clickstream → RabbitMQ → Node.js Aggregation Processor → notifications pushed back to the websites via socket.io)
WHY SPARK?
 Fast and Expressive
Cluster Computing Engine
 Leverages distributed
memory
 Linear scalability and fault
tolerance
 Rich DAG expressions for
data parallel computations
 Seamless Hadoop
Integration – Runs with
YARN and works with
HDFS
 Great Libraries (MLlib,
Spark Streaming,
SparkSQL, Graphx)
(Diagram: Apache Spark core with the Spark SQL, Streaming, MLlib and GraphX libraries; deployable standalone with local storage, or on Mesos or YARN over HDFS / S3)
SPARK STREAMING
• Extension of Spark Core API for large
scale stream processing of live data
streams
• Integrates with Spark’s batch and
interactive processing
• Runs as a series of small, deterministic
batch jobs
• DStream provides a continuous stream
of data (sequence of RDDs)
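A minimal sketch (not code from the deck; host, port, window sizes and the event parsing are assumptions) of a windowed trending aggregation over a DStream pulled from the Flume Spark sink described on the next slide:
import java.nio.charset.StandardCharsets;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;
import scala.Tuple2;

public class TrendingJob {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("realtime-trending");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

    // Pull events from the Flume SparkSink (see the agent configuration on the next slide).
    JavaReceiverInputDStream<SparkFlumeEvent> events =
        FlumeUtils.createPollingStream(jssc, "flume-agent.example.com", 9988);

    // Hypothetical parsing step: the real job deserializes the Avro clickstream record.
    JavaDStream<String> productIds = events.map(
        e -> new String(e.event().getBody().array(), StandardCharsets.UTF_8));

    // Sliding-window aggregation: counts over the last 10 minutes, recomputed every minute.
    JavaPairDStream<String, Integer> trending = productIds
        .mapToPair(id -> new Tuple2<>(id, 1))
        .reduceByKeyAndWindow((a, b) -> a + b, Durations.minutes(10), Durations.minutes(1));

    trending.print();
    jssc.start();
    jssc.awaitTermination();
  }
}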
FLUME SPARK SINK
 Pull-based Flume Sink
 Polling Flume Receiver to pull data from sink
 Strong Reliability and Fault Tolerance
 Flume Agent Configuration
 Custom Sink JAR available on Flume Classpath
 Flume Configuration
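The agent-side configuration for the custom sink looks roughly like this (channel name, hostname and port are placeholders):
agent.sinks = spark-sink
agent.sinks.spark-sink.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.spark-sink.hostname = flume-agent.example.com
agent.sinks.spark-sink.port = 9988
agent.sinks.spark-sink.channel = spark-ch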
SPARK ON YARN
 Leverage both hardware and expertise in dealing
with YARN
 Eliminate cost of maintaining a separate cluster
 Take advantage of YARN scheduler features
 Spark supports YARN cluster and client mode
(Diagram: the client submits the application to the YARN ResourceManager; the ApplicationMaster runs the Spark driver in a YARN container and requests additional containers, in which the YARN NodeManagers launch Spark executors that run the Spark tasks)
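For reference, a cluster-mode submission can look like this (class, jar and resource sizes are placeholders):
spark-submit --master yarn-cluster \
  --num-executors 4 --executor-memory 2g --executor-cores 2 \
  --class com.art.spark.TrendingJob trending-job.jar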
DATA SERIALIZATION & KITE SUPPORT
 Data Serialization - Key for good network performance
and memory usage
 Use Kryo Serialization – compact and faster
 Initialize with conf.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
 Register classes for best performance
conf.set("spark.kryo.registrator", “com.art.spark.AvroKyroRegistrator");
 Kite Dataset Support
 DatasetKeyInputFormat – read kite dataset from HDFS into RDDs
 DatasetKeyOutputFormat – write RDDs as Kite dataset
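Putting the pieces together, a sketch of reading a Kite dataset into an RDD with Kryo enabled (the registrator class name comes from the deck; the dataset URI and the newAPIHadoopRDD wiring follow the Kite examples and are assumptions here):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.kitesdk.data.mapreduce.DatasetKeyInputFormat;

public class ClickstreamRddJob {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("clickstream-rdd")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrator", "com.art.spark.AvroKyroRegistrator");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Point the Kite input format at the clickstream dataset.
    Job job = Job.getInstance(new Configuration());
    DatasetKeyInputFormat.configure(job)
        .readFrom("dataset:hdfs:/data/clickstream")
        .withType(Clickstream.class);

    // Keys are the Avro entities; DatasetKeyInputFormat uses Void values.
    JavaPairRDD<Clickstream, Void> clickstream = sc.newAPIHadoopRDD(
        job.getConfiguration(), DatasetKeyInputFormat.class, Clickstream.class, Void.class);

    System.out.println("records: " + clickstream.count());
    sc.stop();
  }
}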
THANK YOU
tvellore@art.com
@tvellore
Editor's Notes
  • #3: Moving over to the agenda, I will start off by providing a brief background on art.com. I will also explain the technology landscape within art.com across multiple tiers of the stack. Then, I will present 3 different usecases in detail where Hadoop was utilized and integrated within the stack. For each usecase, we will review the different tools, frameworks and components used to integrate hadoop with the rest of our stack. And this will be the focus of the presentation. Then, hopefully we will have time for Q&A
  • #4: A brief background on art.com – Art.com is a leading online retailer for wall art. We are headquartered in the SF bay area and have about 700 employees worldwide. We have two distribution centers – one in the US and another in Europe. We have the world’s largest selection of curated images for wall art, providing over 3M images from different publishers. Our websites and brands are global and have a strong international presence – we have about 35 websites in over 25 countries and 17 languages. Over the years, we have built some unique technologies and proprietary online tools that help in simplifying the art buying process.
  • #5: Our mission is to make art accessible to all by transforming the way the world discovers, personalizes, shares and purchases art. We have a portfolio of 5 different brands that aim to fulfill that mission. The art.com brand focuses on home décor, providing easy access to the world’s largest selection of hand-picked art images. Our brands include AllPosters.com and PosterRevolution, which are online destinations focusing on the latest trends in the wall décor category. Zenfolio is an “all-in-one” solution for photographers to organize, display and sell their work online. Artist Rising is an online community of independent and emerging artists connecting art enthusiasts with rising artists from around the globe.
  • #6: We have a heterogeneous stack at art.com as there are multiple websites and brands some of which were acquired by the company. In addition, we are also working on evolving and upgrading the stack to leverage latest technologies. As a result, we have to deal with multiple technologies across the stack. On the web tier, we have .NET, Java and have been migrating and developing new features on node.js. On the services and API side, again we have .NET and Java and we have developed aggregation layer using Node.js. On the database side, we use both SQL Server and MongoDB and for search, we have historically used Endeca but we are also moving towards SOLR. So, as you can see, we have a polyglot system architecture which presents its own INTEGRATION challenges
  • #7: So, for us, we looked at Hadoop as a way to create a centralized data platform to implement data-driven capabilities that can be consumed by all our brands. Instead of having silos of multi-structured data, we used Hadoop as a centralized data hub as one place to store all data and for as long as required in its original fidelity. In addition to storing the data, we wanted to build an intelligent data platform that would allow us to run a variety of enterprise workloads – whether it is batch processing, interactive SQL or Search, stream processing or machine learning. In addition, it should be based on open architecture so that it is interoperable with the rest of the stack. We went with Cloudera’s Enterprise Data Hub distribution. The enterprise data hub provides data governance capabilities of allowing complete metadata management and audit logging and reporting. In addition, it provides robust access controls (using Sentry) and shared security policies.
  • #8: Now, I would like to go over a few implementation use cases where we leveraged Hadoop to create information-driven solutions
  • #10: The first use case that I would like to talk about is the implementation of “Clickstream Analytics”. The goal of Clickstream Analytics was to create a scalable platform to collect, ingest, analyze and report aggregate information on user visits. A key requirement of the platform was also that it should have seamless integration with existing systems. For example, we use Business Objects as our BI system and it was very important that we are able to provide a consolidated view of the clickstream data within Business Objects. Similarly, we use Google Analytics for web analytics and wanted to capture all the events and custom variables tracked in GA into the clickstream platform. Also, we want to use the clickstream data to further strengthen our internal marketing platforms. Finally, in addition to advanced analytics, we want to make sure that the data pipeline that we create to ingest clickstream data provides a foundation for near-real-time closed-loop predictive analytics.
  • #11: This is a high-level architecture diagram for the clickstream analytics infrastructure. As I had mentioned earlier, our web stack consists of multiple technologies for the different brands. However, we use Google Analytics on all the pages in all our websites. So, we created this “GA Logger”, which is a common JavaScript library that pipes out all GA events and sends them to a “Clickstream Service”. The clickstream service transforms the raw incoming events and projects them into an Avro source. We use Apache Flume for distributed log ingestion. The Flume agent listens for events from the Avro source and then processes them through an ETL library called Morphlines to transform the events into well-defined clickstream Avro records. Using the Avro serializer, the transformed Avro records are written into an HDFS sink. Then, we have an hourly sessionization MapReduce program which analyzes the data for the previous hour and aggregates session summaries for behavioral analysis and understanding user sessions. This hourly job is scheduled via Apache Oozie and the sessionization program is written using Apache Crunch. The output of the sessionization is stored in the sessions table, which is an externally managed table in Hive. We use Business Objects as our BI tool and we have built dimensions and facts on the session data using the Hive ODBC driver. Now, let’s look at the individual components in a bit more detail.
  • #12: Now, talking about how we collect the clickstream events. We use GA for tracking all user events and page views, and the GA tracking code allows you to store a backup copy of the data Google collects. It enables that through a property called local logging (_setLocalRemoteServerMode). So, we enabled this property in our GA initialization script which is included on all pages. With that script in effect, we can capture all page views and GA events and send them to a clickstream service. The clickstream service is a simple HTTP listener which transforms incoming events and emits Avro records. The Flume Client SDK provides RPC client implementations for Avro and Thrift via RpcClientFactory, and the clickstream service uses the NettyAvroRpcClient to send the Avro records to the Flume agent.
  • #13: As I had mentioned earlier, we use Flume for ingesting the events. Flume is a distributed log collection and ingestion service. Flume collects data using configurable "agents”. Agents can receive data from many sources, including other agents. Flume also supports inspection and modification of in-flight data through interceptors, which get invoked on events as they travel between a source and channel. Multiple interceptors can be chained together in a sequence. In our case, we use Kite Morphlines as our interceptor for transforming the incoming events to the required Avro schema. We use the AvroSerializer to write the Avro records into the HDFS sink.
  • #14: I want to take a minute to talk about the data format that we used throughout this platform, which is Avro. Avro defines a data format designed to support data-intensive applications and is widely supported throughout Hadoop and its ecosystem – it provides both an RPC and a serialization framework. It is self-describing as the schema is stored with the data, which allows for schema evolution in a scalable manner. The producers and consumers can use different versions of the schema and continue to work fine, which is very important for us. In addition, both Hive and Impala support Avro-formatted records. Also, with Avro we can read and write data in a variety of languages and platforms like Java, C, C++ and Python, which are supported out of the box. Also for us, we wanted a format that works well with .NET and Node.js, and libraries are available for both.
  • #15: One of the open source data frameworks that we used throughout this project was the Kite SDK. The Kite SDK is a set of libraries, tools and documentation that make it easier to build systems on top of the Hadoop stack. It acts as a high-level data layer on top of Hadoop. It codifies expert patterns and practices for building data-oriented systems. It provides smart defaults for platform choices and abstracts the plumbing and infrastructure of the data layer, letting us focus on the business logic. The project is organized as loosely coupled modules so that we can pick and choose the modules that are of interest. At a high level, it has 3 modules: the Data module, Morphlines and the Maven Plugin, which I will explain more in the next few slides.
  • #16: The Kite Data module is a set of APIs for interacting with data in Hadoop; specifically, direct reading and writing of datasets in storage subsystems such as the Hadoop Distributed File System (HDFS) and HBase. The Kite Data module reflects best practices for default choices, data organization, and metadata system integration. It does that by providing well-defined interfaces and abstractions for data organization. At a high level, the data module contains the following abstractions. The first one is entities. An entity is a single record in a dataset. Entities can be simple types, representing data structures with a few string attributes, or as complex as required, containing maps, lists, or other POJOs. A dataset is a collection of zero or more entities, represented by the interface Dataset. The relational database analog of a dataset is a table. Datasets are identified by URIs. The HDFS implementation of a dataset is stored as Snappy-compressed Avro data files by default. A dataset repository is a physical storage location for datasets (similar to a database). We can organize datasets into different dataset repositories. This is to enable logical grouping, security and access control, backup policies, and so on.
  • #17: So, the data module provides a consistent and unified interface for working with your data. We have control of implementation details, such as whether to use Avro or Parquet format, HDFS or HBase storage, and snappy compression or LZO or Gzip. We can just specify the option and Kite handles the implementation part. The datasets that we create with Kite can be queried with Hive and Impala just like hadoop datasets. Kite also comes with a command line interface that can be used to create, update, delete or load data into datasets. So, as you can see here through these well defined data abstraction, there is a clear separation of business logic which resides in the application and then all mechanics of data management is handled by Kite.
  • #18: Morphlines is another module that is part of the Kite SDK. Morphlines provides a simple ETL framework for Hadoop applications. A morphline is a rich configuration file that makes it easy to define a transformation chain that consumes any kind of data from any kind of data source, processes the data and loads the results into a Hadoop component. It replaces Java programming with simple configuration steps, and correspondingly reduces the cost and integration effort associated with developing and maintaining custom ETL projects. Morphlines can be seen as an evolution of Unix pipelines where the data model is generalized to work with streams of generic records. It is an efficient way to consume a stream of records and pipe it through a set of easily configurable transformations on the way to a target application. Since Morphlines is a library, it can also be embedded in any Java codebase. We use morphlines as a Flume interceptor to transform incoming clickstream events into a well-defined Avro Kite dataset that is stored in HDFS.
  • #19: The Kite Maven Plugin provides maven goals for packaging, deploying and running distributed applications. Using the plugin, we can create or delete datasets. For example, here is a command to create a kite dataset and the parameters provided are the root directory in HDFS, the name of the dataset and the avro schema file for the dataset. The plugin also exposes goals for deploying a packaged application to HDFS. This command packages oozie coordinator job and deploys it to HDFS. Also, we can run an app as job on the cluster. The command here does exactly that.
  • #20: The next important piece of the clickstream analytics infrastructure is the Sessionization program. This is a mapreduce program that runs hourly which analyzes the clickstream data for the previous hour and generates aggregate session summary It is implemented using Apache Crunch which is a High-level API for map-reduce providing the full power and expressiveness of the language. The goal of Crunch is to make pipelines that are composed of many user-defined functions that is simple to develop, test, and efficient to run. The hourly Crunch job is invoked as a coordinator scheduled using Apache Oozie which is a scheduling system for Hadoop. The workflow is triggered based on the presence of HDFS files for that hour.
  • #21: We utilize the Kite Data Module within crunch too. Kite exposes helper classes to both read and write Kite Datasets inside Crunch. The CrunchDatasets.asSource returns a Pcollection as a Crunch Source – The statement over here shows how the Clickstream kite dataset from HDFS is loaded inside Crunch. Now, subsequent distributed operations can be performed on this Pcollection. Similarly CrunchDatasets.asTarget can be used to expose the dataset as a Crunch Target. Also, it supports different write modes such as Overwrite and Append and also allows for repartition before writing. Here’s a short snippet of code that shows how Kite is used inside Crunch. You can see how a dataset is loaded and a sequence of operations are performed which results in a Pcollection<Session>. This Pcollection is then written into sessions dataset repository.
  • #22: The final integration component within the Clickstream Analytics setup is the Apache Hive ODBC driver, which is used to integrate Business Objects, our BI tool, with the Sessions Hive table. This is a fully compliant ODBC driver supporting multiple Hadoop distributions. We have built dimensions and facts on the session data using the Hive ODBC driver. The driver works with Hive2 and supports the Hive grammar and standard SQL with a wide range of data types.
  • #24: The goal of this initiative is to build a capability to provide a new type of discovery experience that is visual and semantically related. Conventional discovery experiences on ecommerce sites are typically taxonomy based which is very hierarchical in nature. However, we wanted to build a visually engaging experience that is contextually and semantically relevant and not tied to a typical parametric search experience. So, the goal was to build semantically related clusters for all searches and categories. Here’s a screenshot of that feature in action – where you can see a search for “flowers” and the semantically related clusters are shown based on theme, artists and art style. For example, clusters like poppy, roses and lilies are semantically related but also provide a strong visual coherence that provides a good shopping experience
  • #25: Now, let’s move on to the architecture of this platform. The web client calls the search service, which is built using Node.js. The search service gets all search terms and refinements and queries SOLR to retrieve the semantically related clusters for that state. The semantic clusters are stored in SOLR for search. When it does not find any result for a search term, it logs an event into a RabbitMQ queue. The semantic engine, which is responsible for running the clustering algorithm, listens to the queue, retrieves incoming messages and processes them. It invokes the clustering engine (which is built using the Carrot2 library) to run the clustering algorithm and generate the clusters, which are stored in HBase. Then, by enabling replication on the column family that stores the cluster data, the mutations that happen on HBase are replicated to SOLR through the Lily HBase Indexer component. Also, in addition to algorithmically generated clusters for searches, we also use Hue to manually curate content for some top searches, and these updates also get replicated to SOLR. So, this is how the search cluster index is continuously built, based on demand-driven searches. Now, let’s look at the key components of this infrastructure.
  • #26: Now, talking about the clustering Engine – we use Carrot2 as the clustering library. Carrot2 is an open source search results clustering engine. It can automatically organize small collections of documents into thematic categories. It is implemented in Java but can be integrated quickly with a wide variety of APIs such as Google , Yahoo and Bing. It can also be integrated as a search component in SOLR, which is what we use. Carrot2 is a pluggable search component and the clustering algorithm runs on top of the SOLR search results which is accessible via a search handler in SOLR. There are multiple clustering algorithms available in Carrot2 library with a rich set of tokenizers and stop word lists that can be configured to our needs.
  • #27: We use Node.js a lot for backend I/O-bound processes. Node.js is a platform built on Chrome’s V8 JavaScript runtime for building fast, scalable network applications. It provides evented, non-blocking I/O with an event loop, which makes all I/O operations asynchronous. There are some npm modules available for integrating with the Hadoop ecosystem. The table here shows the modules available for different components like HBase, HDFS, Hive, Solr and Zookeeper. Most of the modules are built on top of the Thrift or REST gateway exposed within each component. We use the modules highlighted here in orange.
  • #28: The next important integration component of this stack is the Lily HBase Indexer. The Lily HBase Indexer Service is a scalable, fault-tolerant system for processing a continuous stream of HBase cell updates into live search indexes. The HBase Indexer works by acting as an HBase replication sink. As updates are written to HBase region servers, they are written to the HLog (WAL), and HBase replication continuously polls the HLog files to get the latest changes, which are "replicated" asynchronously to the HBase Indexer processes. The indexer analyzes incoming HBase mutation events, creates Solr documents and pushes them to SolrCloud servers. All information about indexers is stored in ZooKeeper. So, new indexer hosts can always be added to a cluster, in the same way that HBase region servers can be added to an HBase cluster. We use the Kite Morphlines library for mapping the column names in the HBase table to fields in the SOLR collection.
  • #29: This slide shows how an indexer configuration is set up. The Lily HBase Indexer service provides a command line utility that can be used to add, list, update and delete indexer configurations. The first command shown here registers and adds an indexer configuration to the HBase Indexer. This is done by passing an indexer configuration XML file along with the ZooKeeper ensemble used for HBase and SOLR and the SOLR collection. The XML configuration file provides the option to specify the HBase table which needs to be replicated and a mapper. Here we use the morphline framework again to transform the columns in the HBase table to SOLR fields, and we pass the morphline file which has the pipeline of commands to do the transformation.
  • #30: As we had seen earlier, the morphline file contains a chain of ETL transformations that are executed as a pipeline. It can have any number of commands – I’m just showing here a very basic morphline file where the first command is extractHBaseCells, which is a morphline command that extracts cells from an HBase Result and transforms the values into a SolrInputDocument. The command consists of an array of zero or more mapping specifications. We can list an array of such mappings here. The second command is to sanitize unknown fields from being written to SOLR. The mapper that we used, MorphlineResultToSolrMapper, has the implementation to write the morphline fields into SOLR documents.
  • #31: Morphline comes with some handy utilities for reading and writing Avro formatted objects. This can be combined with the extracthbaseCells command to transform a kite avro formatted dataset persisted in Hbase as byte arrays. The readAvroContainer command parses an InputStream or byte array that contains Apache Avro binary container file data. For each Avro datum, the command emits a morphline record containing the datum. The Avro schema that was used to write the Avro data is retrieved from the Avro container. Optionally, the Avro schema that shall be used for reading can be supplied with a configuration option; otherwise it is assumed to be the same as the writer schema. The input stream or byte array is read from the first attachment of the input record.
  • #32: The next implementation use case that I would like to present is the “real time trending”
  • #33: The goal of this project was to implement a scalable real-time stream processing engine that provides trend aggregations. Based on clickstream data, we want the ability to build real-time trends on user activity on the website. Some use cases that were implemented on top of the stream processing engine are computing the top products being added to cart, the top searches or category pages visited, and the top user galleries visited. One of the main requirements for this engine was to be able to compute aggregations based on a configurable sliding window of data so that trends can be recomputed at set intervals.
  • #34: Now, lets look at how this platform was implemented. As I had mentioned earlier, we use Flume for ingesting our clickstream data from our websites. The client transmits the events into an Avro Source. The source that is receiving the event passes the event through the source interceptors where the event is transformed into a clickstream avro record using the kite morphlines ETL framework. After it passes through the source interceptors, the event is fanned out to two different channels. Each channel is connected to a sink which drains the events from their respective channel. In this case, one sink is the HDFS sink which writes the avro records into HDFS as kite dataset that is used for clickstream analytics while the other channel is configured to connect to a Spark Sink which is a custom flume sink where events get buffered for consumption by the Spark engine. We use Spark Streaming which is an extension of the Spark Core API for large scale stream processing from live datastreams. Spark streaming uses a reliable Flume receiver to pull data from the sink and the trend aggregates are computed as a series of small batch programs. The computed trend summaries are then written to a Rabbitmq exchange. Then, we have a node aggregation process that subscribes to messages from the queue which retrieves the messages from different queues and creates a trend object in JSON. This trend object is then broadcasted to all clients via socket.io. Socket.io is a library that enables real-time bidirectional event based communication via websockets. This way the UI module on the websites get automatically updated with new trend information as they happen without requiring a page refresh or a periodic poll to the server. Now, let’s look at the key components in this stack in more detail
  • #35: With this real-time trending platform, we decided to use Apache Spark for building the stream processing engine. For people not familiar with Spark, Spark is a fast and general purpose cluster computing engine that leverages distributed memory. It has well defined and clean APIs available in multiple languages which can be used to write programs in terms of transformations on distributed datasets. It uses a data abstraction called RDDs which stand for resilient distributed datasets which is a collection of objects spread across a cluster stored in RAM. RDDs are built through parallel transformations and can be automatically rebuilt on failure and thus provides linear scalability and fault tolerance. It also seamlessly integrates with the rest of the hadoop ecosystem both in terms of data compatibility and deployments Spark supports different deployment models – it can run on YARN, MESOS or a standalone mode. Plus, it comes with great libraries for machine learning, streaming processing, SQL engine and graph computations. Foundation for machine learning (recommendation systems)
  • #36: Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, ZeroMQ, Kinesis or TCP sockets and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems and databases. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams. Internally, Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
  • #37: In order to send the clickstream events to Spark Streaming, we use a custom Spark sink. This Spark sink uses a pull-based approach – events in the sink get buffered. Spark Streaming uses a reliable Flume receiver to pull data from the sink using transactions. The transactions succeed only after the data is received and replicated by Spark Streaming. This ensures strong reliability and guarantees fault tolerance. In order to configure this custom sink, it needs to be downloaded and be available on the Flume agent’s classpath. Then, in the Flume configuration, we specify the sink type to be the SparkSink.
  • #38: As I mentioned earlier, Spark supports pluggable cluster management and can run as either standalone or on YARN or mesos. We run our spark processes on Yarn for multiple reasons. One of the main reason is to leverage the same hadoop cluster hardware and also our existing expertise in running YARN applications. This leads to better utilization of the cluster and also eliminate the cost of maintaining a separate cluster. Also, we can take advantage of the features of the YARN scheduler for categorizing, isolating and prioritizing workloads. Spark can be run in YARN under two modes – a cluster mode and a client mode. The cluster mode is suitable for production deployments where the Spark driver runs inside an application master process which is managed by YARN on the cluster, so that the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. This is useful when you need the spark shell interactivity or for debugging purposes.
  • #39: Because of the in-memory nature of most Spark computations, serialization plays an important role in the performance of the application. Formats that are slow to serialize objects into, or that consume a large number of bytes, will greatly slow down the computation. Spark by default uses Java serialization, which is very flexible and works with most classes but is also very slow. We use Kryo serialization, which uses the Kryo library and is about 10 times more compact and much faster than Java serialization, but it does not support all Serializable types. You can switch to using Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This setting configures the serializer used not only for shuffling data between worker nodes but also when serializing RDDs to disk. Another requirement for the Kryo serializer is to register the classes in advance for best performance. The Scala snippet here shows how to register the Avro classes. This will register the use of Avro's specific binary serialization for the Clickstream class. Also, the Kite SDK can be used to read data from Kite dataset repositories residing in HDFS or Hive-backed tables into RDDs, and the results of the computed RDDs can be written back as Kite datasets into the respective repository. The DatasetKeyInputFormat and DatasetKeyOutputFormat classes can be configured to achieve just that. So, you can see here how the Kite SDK makes it easier to work with data across multiple platforms in the ecosystem.