HADOOP INFRASTRUCTURE AND
SOFTSERVE EXPERIENCE
 Pacemaker BigData, Lviv
February, 2015
Agenda
• Business needs
• Hadoop Infrastructure
• Hadoop Distributions
• SoftServe Experience
Presentation drivers
• Hadoop competence development
• Hadoop isn’t MapReduce only
• Components for solution building
• Case studies
Big Analytics Engineering Challenges
• Real-Time Intelligence (Intelligent Agents/Consumers): How to achieve low latency for a personalized customer experience in real time?
• Data Discovery (Data Scientists/Analysts): How to improve system performance for the data science/analytics team?
• Business Reporting (Business Users): How to implement self-service with high data quality over terabytes and petabytes?
A distributed file system
• Files are split into blocks
• Each block is replicated (3 replicas by default)
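To make the block and replica bullets concrete, here is a minimal sketch of how much cluster storage a single file consumes. The 128 MiB block size and the replication factor of 3 are assumed Hadoop 2.x defaults (`dfs.blocksize`, `dfs.replication`), not values stated on the slide, and `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024**2  # assumption: 128 MiB default block size
REPLICATION = 3             # assumption: default replication factor


def hdfs_footprint(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, raw bytes stored across the cluster)."""
    # Ceiling division: the last block may be only partially filled.
    blocks = -(-file_size_bytes // block_size)
    # A partial block occupies only its real size, so raw usage is size * replicas.
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes


blocks, raw_bytes = hdfs_footprint(1024**3)  # a 1 GiB file
print(blocks, raw_bytes // 1024**3)  # 8 blocks, 3 GiB of raw storage
```

Every block is stored on several DataNodes, so usable capacity is roughly the raw disk capacity divided by the replication factor.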
A distributed computing framework
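The programming model behind this framework (MapReduce) can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's Java API; the function names are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "big analytics"]))
# {'big': 2, 'data': 1, 'analytics': 1}
```

On a real cluster the map and reduce steps run in parallel on many nodes, and the shuffle moves data between them over the network.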
Apache YARN
A resource manager (Yet Another Resource Negotiator)
A more complex resource management
An SQL interpreter for MapReduce
Apache Pig
A script language to query HDFS
Real-Time Queries in Apache Hadoop
Runs Everywhere
Engine for large-scale data processing. Can be used from Java, Scala, and Python
Apache Sqoop
SQL to HADOOP – data load tool for RDBMS
Other Databases on top of Hadoop
Column oriented Key-Value datastore
Graph oriented Database
A distributed service for collecting, aggregating, transforming, and moving
large amounts of log data
Distributed real-time computation service. Can be used for real-time
analytics, online machine learning, continuous computation, distributed
RPC, ETL, and more
Apache ZooKeeper
Distributed Service for:
• maintaining configuration information
• naming
• providing distributed synchronization
• providing group services
Service is fault tolerant:
• A ZooKeeper cluster is called an “ensemble”
• There is one “leader” in an “ensemble”
• If the “leader” goes down, a new “leader” is elected by a quorum
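The quorum rule behind leader election can be shown with simple arithmetic: a quorum is a strict majority of the ensemble. The helper names below are illustrative and not part of ZooKeeper's API.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a quorum can still be formed."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

A 4-node ensemble tolerates no more failures than a 3-node one, which is why ensembles usually have an odd number of servers.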
Distributed messaging service
• Large amount of data
• Scalable
• Durable (messages are persisted on disk)
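A toy model can show why such a messaging service scales: a topic is split into partitions, each partition is an append-only log, and consumers read by offset. The sketch is in-memory only, so it does not show durability. `ToyTopic` is a hypothetical class, not a real client API, and CRC32 is used here only as a stand-in for the partitioner.

```python
import zlib


class ToyTopic:
    """In-memory model of a partitioned, append-only message log."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        """Append a message; the same key always lands in the same partition."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset, max_messages=10):
        """Read messages from a partition, starting at a consumer-chosen offset."""
        return self.partitions[partition][offset:offset + max_messages]


topic = ToyTopic(num_partitions=3)
partition, first_offset = topic.send("user-1", "page_view")
topic.send("user-1", "click")
print(topic.read(partition, first_offset))  # ['page_view', 'click']
```

Because messages are never modified after being appended, consumers can re-read from an older offset, and partitions can be spread across brokers to add capacity.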
Popular Distributions
The latest architecture trends
Lambda Architecture
http://lambda-architecture.net/
SoftServe Lambda Architecture
Accelerator
• Lambda Architecture is a highly scalable and reliable data processing architecture based
on Twitter's successful experience in Big Data and analytics
• Supports the majority of use cases: real-time analytics, data discovery, and business reports
• SoftServe's pre-built Lambda Architecture stack shortens the customer's time to market by
15-20+ man-months
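The core idea of the Lambda Architecture is that a query combines a precomputed batch view with a small real-time view. The sketch below is a generic illustration with invented event data and function names; it is not SoftServe's accelerator.

```python
from collections import Counter


def build_view(events):
    """Count page views per page; stands in for a batch or a streaming job."""
    return Counter(event["page"] for event in events)


def query(page, batch_view, speed_view):
    """Serving layer: merge the batch view with the real-time increment."""
    return batch_view[page] + speed_view[page]


# Hypothetical data: the master dataset covers everything up to the last batch run,
# recent_events holds what has arrived since then.
master_dataset = [{"page": "/home"}, {"page": "/deals"}, {"page": "/home"}]
recent_events = [{"page": "/home"}]

batch_view = build_view(master_dataset)  # batch layer (e.g. a MapReduce job)
speed_view = build_view(recent_events)   # speed layer (e.g. a stream processor)

print(query("/home", batch_view, speed_view))  # 3
```

When the next batch run absorbs the recent events, the speed view for that period is discarded, so errors in the real-time path are eventually corrected by the batch path.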
Business Goals:
• Build a centralized platform for log data analysis that
collects data from ~270-300 web servers
• Provide online monitoring to answer the question: “What
is going on with the systems now?”
• Provide retrospective analytics: strategic management,
capacity management/planning, root cause analysis, ad-hoc
analysis
Business Area:
Retail industry. A leading travel site in the world
Big Data Lab: Log Management
Log Data Analysis Platform Details
Key Facts:
• ~270-300 Web Servers
• Log Types: HTTPD Access logs, Error logs, Application Server Servlet logs, OS Service logs
• ~500K events per minute
• 150GB of data per day
Technologies:
• Flume
• Hadoop/HDFS, MapReduce
• Hive, Impala
• Oozie
• Elasticsearch, Kibana
• MicroStrategy Analytics
platform
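As a rough sanity check on the key facts for this platform (~500K events per minute, 150GB per day), the implied sustained rates can be worked out as follows. The sketch assumes a steady event rate and 1 GB = 10^9 bytes; both source figures are approximate, so the results are order-of-magnitude estimates only.

```python
EVENTS_PER_MINUTE = 500_000   # from the slide (approximate)
BYTES_PER_DAY = 150 * 10**9   # from the slide, assuming 1 GB = 10**9 bytes

events_per_second = EVENTS_PER_MINUTE / 60
events_per_day = EVENTS_PER_MINUTE * 60 * 24
avg_event_bytes = BYTES_PER_DAY / events_per_day
ingest_bytes_per_second = BYTES_PER_DAY / (24 * 60 * 60)

print(round(events_per_second))                   # about 8333 events per second
print(events_per_day)                             # 720000000 events per day
print(round(avg_event_bytes))                     # about 208 bytes per event
print(round(ingest_bytes_per_second / 10**6, 2))  # about 1.74 MB per second
```

Numbers like these help size the collection tier (Flume agents) and the HDFS capacity needed for a given retention period.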
Solution Architecture
Business Goals:
• Build an in-house analytics platform for ROI measurement
and performance analysis of every product and feature
delivered by the e-commerce platform
• Provide the ability to understand how end users
interact with service content, products, and features on
the sites
• Do clickstream analysis
• Perform A/B testing
Business Area:
Retail. A platform for e-commerce and
collecting feedback from customers
Case Study #1: Clickstream for retail website
Architectural Decisions
Architecture Drivers:
▪ Volume (45 TB)
▪ Sources (Semi-structured - JSON)
▪ Throughput (> 20K/sec)
▪ Latency (1 hour/real-time)
▪ Extensibility (Custom tags)
▪ Data Quality (Not critical)
▪ Reliability (24/7)
▪ Security (Multitenancy)
▪ Self-Service (Canned reports, Data science)
▪ Cost (The lower, the better)
▪ Constraints (Public Cloud)
Technology Stack:
Lambda Architecture
• Apache Kafka
• Apache Storm
• Amazon S3
• Hadoop/HDFS, MapReduce (CDH 5)
• HBase
• Oozie, ZooKeeper
• Cloudera Manager
Solution Architecture
Business Goals:
• In-house web analytics platform for conversion
funnel analysis, marketing campaign optimization, and
user behavior analytics (based on server log
analysis, page tagging, and external data)
• Perform A/B testing and platform feature usage
analysis
Business Area:
Retail. The world's largest digital coupon
marketplace. The company owns the largest
coupon sites in the US, the UK, Germany,
the Netherlands, and France
Case Study #2: Coupon Marketplace
Coupon Marketplace: Project Details
Project Facts:
• 500 million visits a year
• 25TB+ HP Vertica Data Warehouse
• 50TB+ Hadoop Cluster
• Near-real-time data visualization
Technology Stack:
• Hadoop Cluster (Amazon EMR)/Hive/Hue/MapReduce/Flume/Spark
• HP Vertica, MySQL
• Python
• Tableau
Major Activities:
• Design and implementation of near-real-time data integration processes
• Hadoop cluster optimization
• Data Warehouse re-design and optimization
• Data Science algorithms design
Coupon Web Analytics Platform
[Diagram components: coupon web sites (JS libs, web logs, operational databases), 3rd-party API, raw data Hadoop cluster, ETL, MPP data warehouse cluster, additional data stores, REST/SOAP, data scientists, BI/marketing team]
Business Goals:
• Insights into and optimization of all web, mobile,
and social channels
• Optimization of recommendations for
each visitor
• High return on online marketing
investments
Business Area:
A web analytics platform from a Fortune 100
company that stores and analyzes data on
visitors' digital journeys
Case Study #3: Online Analytics Platform
Online Analytics Platform Details
Key Facts:
• Big Data > 1PB
• 10+ GB per customer/day
• 10+ Hadoop Clusters
• 15+ Aster Data Clusters
Technologies:
• Hadoop/HBase/HiveQL
• Aster Data
• Oracle
• Java/Flex
Solution Architecture
[Diagram components: customer marketing team, customer web server environment, web analytics platform, web analytics data, offerings, business rules, schedule, recommendation rule engine]
Further learning
http://bigdatauniversity.com/
http://blog.cloudera.com/blog/
http://hortonworks.com/blog/
https://www.mapr.com/blog
Hadoop: The Definitive Guide, 3rd Edition
Any questions, Dude?


Editor's Notes

  • #26: Client: Our client is a leading travel site in the world. Engagement: Partnering with SoftServe, the combined teams developed and implemented a Hadoop cluster that collects log data from ~270-300 web servers, including HTTPD access and error logs as well as application server servlet and OS service logs, for further operational and retrospective analysis. Result: The client has reduced the time it takes to react to issues with its web servers and gained more insight into ROI analysis for marketing campaigns, which enabled the company to increase its number of visitors.
  • #32: Clickstream Data: Google Analytics; SiteCatalyst, a SaaS app from Adobe (previously Omniture); Apache web logs; Beacon JavaScript library. Financial Data: data provided by affiliate networks through API, FTP, etc. Marketing Data: Kenshoo, used as a platform to analyze the effectiveness of pay-per-click Google Ad campaigns. The Kenshoo Conversion Feed provides sales and commission data to measure ROI on campaigns.
  • #36: Tools & Technologies Extended List: SaaS, Hadoop/HDFS, Hadoop/HBase, Aster Data, Java/Flex, J2EE, JavaScript, Scape SSH/SFTP library, Velocity, Linux, Bash RDL, SQL, XSL Java, XML, Oracle database, JMS, Java Servlet, JDBC, JBoss, Flash RDL, Macromedia Flash.
  • #37: Hadoop/HiveQL: raw data about website users' behavior; aggregated information for historical analytics; customized scheduled reports. HBase: online queries for immediate data access: user geographic and demographic information; recent user purchase, search, and unsubscribe activities.