Hadoop at eBay
Anil Madan (amadan@ebay.com), Analytics Platform Development

2007: Research team builds a 4-node cluster
  • Subset of click stream and EDW data
  • Innovation with Mobius Query Language
  • Visualization and click path analysis

September 2009: Search clusters
  • Machine learning ranking cluster of 28 nodes
  • Search relevance cluster of 10 nodes
  • Subset of click stream and EDW data

May 2010: Athena*, an exploratory cluster of 532 nodes
  • Platform teams join hands with Search/Research to build a larger cluster
  • Build it as a core competency for advanced insights into complex data
  • Rapid build-out, with timelines pulled in by a couple of months

* Athena is the goddess of civilization, wisdom, strength, strategy, craft, justice, and skill in Greek mythology. MIT's Athena ushered in a new era of distributed systems when it started in the mid-1980s.

Infrastructure

Enterprise Nodes
  • Sun 64-bit, Red Hat Linux
  • 2 quad-core Nehalem, 72GB RAM, 4TB
  • Servers: NameNode(s), JobTracker, ZooKeeper, HBase Master, Ganglia server, eBay (Cloudera) HUE

Data Nodes
  • SGI-Rackables, CentOS, 1U, 5.3PB
  • 2 quad-core Nehalem, 36GB RAM, 10TB
  • HBase on 20 nodes

Network
  • TOR: 1Gbps
  • Core switches: uplink 40Gbps
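
The 5.3PB figure is consistent with the 532-node Athena cluster at 10TB per data node. The arithmetic below is only a rough sketch: it assumes all 532 nodes are data nodes with 10TB of disk each, uses decimal units (1PB = 1000TB), and assumes the HDFS default replication factor of 3, which the slides do not state.

```python
def capacity_pb(nodes, tb_per_node, replication=3):
    """Return (raw_pb, usable_pb) for a cluster, using decimal units.

    `replication` is an assumption (the HDFS default); the slides do not
    say which replication factor the cluster used.
    """
    raw_pb = nodes * tb_per_node / 1000   # 1 PB = 1000 TB
    usable_pb = raw_pb / replication      # each block is stored `replication` times
    return raw_pb, usable_pb


raw, usable = capacity_pb(nodes=532, tb_per_node=10)
print(f"raw: {raw:.2f} PB, usable at 3x replication: {usable:.2f} PB")
```

Under those assumptions the raw figure rounds to the 5.3PB quoted on the slide, and roughly a third of it is available for unique data.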

Ecosystem
  • Hadoop Core (HDFS, Common)
  • MapReduce (Java, Streaming, Pipes, Scala)
  • Data Access (HBase, Pig, Hive)
  • Tools & Libraries (HUE, UC4, Oozie, Mobius, Mahout)
  • Monitoring & Alerting (Ganglia, Nagios)

MapReduce
  • Sourcing data: primarily Java
  • Applications: Perl, Scala, Python…

Data Access Frameworks
  • HBase: EDW data
  • Pig: data pipelines
  • Hive: ad hoc queries
  • MQL: Mobius Query Language

Monitoring & Alerting
  • Ganglia, Nagios

Tools
  • HUE/Mobius: lifecycle of user jobs
  • UC4: scheduling
  • Oozie: user workflows and data pipelines
  • Mahout: data mining
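
Hadoop Streaming, listed above, lets scripts in languages such as Python or Perl act as mappers and reducers: each reads lines on standard input and writes tab-separated key/value lines on standard output. The sketch below counts events per session. The record layout (a tab-separated line whose first field is a session ID) and the function names are hypothetical, chosen only for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'session_id<TAB>1' for every well-formed input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:          # skip malformed records
            yield f"{fields[0]}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; input must already be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_locally(lines):
    """Imitate map, shuffle/sort, and reduce in one process for testing."""
    return list(reducer(sorted(mapper(lines))))
```

In an actual Streaming job the mapper and reducer would be separate scripts reading `sys.stdin`, and the framework would perform the sort between them; `run_locally` only stands in for that step.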

Administration

Groups
  • Built to support multiple groups
  • Job invocation uses the group name

Fair Scheduler
  • Allocations based on investment
  • Weights
  • Minimum share of mappers and reducers
  • poolMaxJobsDefault
  • userMaxJobsDefault
  • defaultMinSharePreemptionTimeout
  • fairSharePreemptionTimeout

Authentication & Authorization
  • HUE: custom module to use corporate credentials
  • CLI*: custom PAM module
  • Security*: implement token interface to replace Kerberos with SAML

* Work in progress
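
To make the interaction of weights and minimum shares concrete, here is a deliberately simplified sketch: every pool first receives its minimum share, and the remaining slots are split in proportion to pool weight. This is not the Fair Scheduler's actual algorithm, and the pool names, weights, and slot counts are invented for illustration.

```python
def allocate_slots(total_slots, pools):
    """Split slots across pools.

    pools maps a pool name to (weight, min_share). Each pool gets its
    minimum share first; leftover slots are divided by weight.
    """
    min_total = sum(min_share for _, min_share in pools.values())
    if min_total > total_slots:
        raise ValueError("minimum shares exceed cluster capacity")
    remaining = total_slots - min_total
    weight_total = sum(weight for weight, _ in pools.values())
    return {
        name: min_share + remaining * weight / weight_total
        for name, (weight, min_share) in pools.items()
    }


# Hypothetical pools: a group that invested twice as much gets weight 2.0.
pools = {
    "search": (2.0, 100),
    "research": (1.0, 50),
    "platform": (1.0, 50),
}
print(allocate_slots(1000, pools))
```

With 1000 slots, the hypothetical "search" pool ends up with half the cluster and the other two pools with a quarter each, which is the sense in which allocations can follow investment.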

Data Sourcing Patterns

Diagram labels: Click Stream, EDW, Images, Search Indices; Analytics, Reporting, Algorithmic Models
Table columns as listed on the slide: Acquisition, Description, Source, Preparation, Format, Pattern

Click Stream
  • Session, Event, Session Container
  • Session/Event: streamed as LZO/Text
  • Session Container: generated as SequenceFiles
  • Session/Event data: build an index and use LzoTextInputFormat for splits, based on the work done by Johan Oskarsson/Twitter
  • Session Container: 'Value to Type Conversion' pattern; secondary sort with reduce-side join

EDW
  • Item, Transaction, User, Feedback, Bids
  • Streamed as GZIP/Text
  • Generate SequenceFile/HBase snapshot from the previous day's snapshot and the current day's data
  • Hive StorageHandlers point to the SequenceFile/HBase snapshot
  • TotalOrderPartitioner with RandomSamplers to identify partition ranges for reducers
  • Create HBase regions using HFile
  • Update RegionServers using the Ruby script loadtable.rb
  • Concerns: HBase append performance, HFile flush (HBASE-1923)
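
The "secondary sort with reduce-side join" pattern can be imitated in plain Python. This is a sketch under assumptions, not eBay's implementation: the record layouts (a session record of session ID and user ID, an event record of session ID, timestamp, and event type) and all names are hypothetical. Records from both inputs are tagged, sorted on a composite key so that each session record arrives ahead of its time-ordered events, and grouped on the session ID alone.

```python
from itertools import groupby

SESSION, EVENT = 0, 1  # tag order: the session record sorts ahead of its events


def map_records(sessions, events):
    """Emit (composite_key, value) pairs, as two mappers would."""
    for session_id, user_id in sessions:
        yield (session_id, SESSION, 0), user_id
    for session_id, timestamp, event_type in events:
        yield (session_id, EVENT, timestamp), event_type


def join_sessions_with_events(sessions, events):
    """Return {session_id: (user_id, [event_type, ...])}, events in time order."""
    # Shuffle: sort on the full composite key (the secondary sort).
    shuffled = sorted(map_records(sessions, events), key=lambda kv: kv[0])
    containers = {}
    # Group on the natural key only, as a grouping comparator would.
    for session_id, group in groupby(shuffled, key=lambda kv: kv[0][0]):
        user_id, event_types = None, []
        for (_, tag, _), value in group:
            if tag == SESSION:
                user_id = value
            else:
                event_types.append(value)
        if user_id is not None:  # inner join: drop events with no session record
            containers[session_id] = (user_id, event_types)
    return containers
```

Because the session record is guaranteed to come first within each group, a real reducer could stream the events through without buffering them; the list here is kept only to make the result easy to inspect.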

Search Use Case – Machine Learned Ranking

Diagram labels: ClickStream, Items, Users, Feedback; Classifiers; Ranking Function; Great Search Results

Goal
  • Enhance search relevance for eBay's items

Hadoop Usage
  • Build a ranking function that takes multiple factors into account, such as price, listing format, seller track record, and relevance
  • Ability to add new factors to validate hypotheses

Research Use Case – Description Data Mining

Goal
  • Extend catalog coverage

Hadoop Usage
  • Leverage data mining/machine learning techniques to turn inventory into name-value pairs in a completely unsupervised way

Example listing description
  BARBIE 1999 "PREMIERE NIGHT" Home Shopping Special Edition
  Gorgeous Doll With Beautiful Blond Hair / In A Gown Of Purple And Silver
  New / Never Removed From Box / Doll Is In Mint Condition / Remember This Beauty Is 11 Years Old
  Free Shipping To US Only / Will Ship International / Please E-mail For Cost
  Feel Free To Ask Me Any Questions Or Concerns
  Smoke-Free Environment
  Free Shipping

Extracted name-value pairs
  Year: 1999
  Model: premiere night
  Edition: home shopping special
  Hair: blond
  Gown: purple and silver
  Condition: new / never removed from box / mint

Platform         Details
Metrics          Job statistics, system/disk consumption, utilization
Infrastructure   Publish/subscribe ETL tools, low-latency data movement
Development      Tools, environment, IDE
Architecture     Schemas, metadata, governance, policies
Operations       Administration, configuration, monitoring
Reporting        Visualization, BI generation, information delivery
Security         User & group management, authentication & authorization

Clusters            Details
Exploratory         Strategic investment, 1000-5000 nodes
Production          Site facing, low latency, high availability
Use Case Specific   Advertising, Trust & Safety, Merchandizing

Acknowledgments
  • Athena Team
  • Cloudera Inc.
  • Community
