Data Platform and Services

  Vipul Sharma and Eyal Reuveni
Agenda


            Eventbrite
           Data Products
           Data Platform
         Recommendations
            Questions
•   A social event ticketing and discovery platform
•   50 millionth ticket sold
•   Revenue doubled YOY
•   180 Employees in SOMA SF
•   Solving significant engineering problems
    • Data
    • Data, Infrastructure, Mobile, Web, Scale, Ops, QA
• Firing all cylinders and hiring blazing fast
www.eventbrite.com/jobs
Data Products
Eventbrite dataplatform and services - Interest graph based recommendations
Analytics




            • Ad hoc queries by analysts
Fraud and Spam
Data Platform
Hadoop Cluster




•   30 persistent EC2 High-Memory Instances
•   30TB disk with replication factor of 2, ext3 formatted
•   CDH3
•   Fair Scheduler
•   HBase
Infrastructure

• Search
   • Solr
   • Incremental updates towards event driven
• Recommendation/Graph
   • Hadoop
   • Native Java MapReduce
   • Bash for workflow
• Persistence
   •   MySQL
   •   HDFS
   •   HBase
   •   MongoDB (Investigating Cassandra and Riak)
Infrastructure


• Stream
   • RabbitMQ
   • Internal Fire hose (Investigating Kafka)
• Offline
   •   MapReduce
   •   Streaming
   •   Hive
   •   Hue
Infrastructure - Sqoozie



• Workflow for MySQL imports to HDFS
    • Generate Sqoop commands
    • Run these imports in parallel
•   Transparent to schema changes
•   Include or exclude on column, data types, table level
•   Data Type Casting tinyint(1) → Integer
•   Distributed Table Imports
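
A minimal sketch of the "generate Sqoop commands, run in parallel" idea. This is not the actual Sqoozie code: the table specs, JDBC URL, and target paths are hypothetical, and in practice the column lists would presumably be read from the live MySQL schema so that imports stay transparent to schema changes. Only standard `sqoop import` options are used (`--connect`, `--table`, `--columns`, `--target-dir`, `--num-mappers`, `--map-column-java`).

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

# Hypothetical table specs; names, columns, and casts are illustrative only.
TABLES = [
    {"name": "orders",
     "columns": ["id", "event_id", "user_id", "notes"],
     "exclude": ["notes"],                 # column-level exclusion
     "casts": {}},
    {"name": "events",
     "columns": ["id", "title", "is_public"],
     "exclude": [],
     "casts": {"is_public": "Integer"}},   # tinyint(1) -> Integer
]


def build_sqoop_command(spec, jdbc_url, target_root, num_mappers=4):
    """Build one `sqoop import` argument list for a table spec."""
    columns = [c for c in spec["columns"] if c not in spec["exclude"]]
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", spec["name"],
        "--columns", ",".join(columns),
        "--target-dir", f"{target_root}/{spec['name']}",
        "--num-mappers", str(num_mappers),
    ]
    if spec["casts"]:
        mapping = ",".join(f"{col}={jtype}"
                           for col, jtype in sorted(spec["casts"].items()))
        cmd += ["--map-column-java", mapping]
    return cmd


def run_imports(commands, runner=subprocess.run, max_parallel=4):
    """Run the generated commands in parallel; `runner` is injectable for dry runs."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(runner, commands))


# Building the commands has no side effects; nothing is executed here.
commands = [
    build_sqoop_command(t, "jdbc:mysql://db.example.com/app", "/warehouse/raw")
    for t in TABLES
]
```

Passing `runner=lambda c: " ".join(c)` to `run_imports` gives a dry run that returns the command lines instead of launching Sqoop.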
Infrastructure - Blammo



•   Raw logs are imported to HDFS via Flume
•   Almost real-time – 5 min latency
•   Logs are key-value pairs in JSON
•   Each log producer publishes schema in YAML
•   Hive schema and schema YAML in sync using Thrift
•   Control exclusion and inclusion
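
A rough sketch of keeping a Hive table definition in step with a producer-published schema. The slides say the sync happens over Thrift; this example only shows the schema-to-DDL step and the field filtering, with a hypothetical schema layout (`table`, `fields`, `exclude`) shown as the dict a YAML parser might return. The SerDe clause needed to read JSON is left out.

```python
import json

# Hypothetical schema, as it might look after parsing a producer's YAML file.
SCHEMA = {
    "table": "page_views",
    "fields": [
        {"name": "user_id", "type": "bigint"},
        {"name": "event_id", "type": "bigint"},
        {"name": "url", "type": "string"},
        {"name": "debug_blob", "type": "string", "exclude": True},
    ],
}


def included_fields(schema):
    """Fields that survive the inclusion/exclusion control."""
    return [f for f in schema["fields"] if not f.get("exclude", False)]


def hive_ddl(schema, location):
    """Render a CREATE EXTERNAL TABLE statement for the included fields."""
    cols = [f"  `{field['name']}` {field['type'].upper()}"
            for field in included_fields(schema)]
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{schema['table']}` (\n"
        + ",\n".join(cols)
        + f"\n)\nLOCATION '{location}'"
    )


def parse_log_line(line, schema):
    """Keep only the schema's included keys from one JSON log line."""
    allowed = {field["name"] for field in included_fields(schema)}
    record = json.loads(line)
    return {k: v for k, v in record.items() if k in allowed}


ddl = hive_ddl(SCHEMA, "/logs/page_views")
row = parse_log_line('{"user_id": 1, "url": "/e/42", "debug_blob": "x"}', SCHEMA)
```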
Recommendations
You will like to attend this event
Recommendation Engines



• Item Hierarchy
   • You bought camera so you need batteries - Amazon
• Collaborative Filtering – User-User Similarity
   • People who bought camera also bought batteries - Amazon
• Collaborative Filtering – Item-Item Similarity
   • You like Godfather so you will like Scarface - Netflix
• Social Graph Based
   • Your friends like Lady Gaga so you will like Lady Gaga; PYMK - Facebook, LinkedIn
• Interest Graph Based
   • Your friends who like rock music, like you, are attending an Eric Clapton event - Eventbrite
Why Interest?




• Events are Social
• Events are Interest
• Dense Graph is Irrelevant
• Interests are Changing
How do we know your Interest?


• We ask you
• Based on your activity
   • Events Attended
   • Events Browsed
• Facebook Interests
   • User Interest has to match Event category
   • Static
• Machine Learning
   • Logistic Regression using MLE
   • Sparse Matrix is generated using MapReduce
   • A model for each interest
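
To make "logistic regression using MLE, a model for each interest" concrete, here is a small sketch that maximizes the log-likelihood by stochastic gradient ascent over sparse feature dicts. The feature names and training rows are invented for illustration, and the deck does not say which optimizer was used or how the MapReduce-generated sparse matrix was fed in.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, features):
    """P(user has this interest | sparse features)."""
    z = sum(weights.get(k, 0.0) * v for k, v in features.items())
    return sigmoid(z)


def fit(rows, epochs=200, lr=0.5):
    """Gradient ascent on the log-likelihood.

    rows: list of (sparse feature dict, label in {0, 1}).
    The per-example gradient of the log-likelihood is (label - p) * x.
    """
    weights = {}
    for _ in range(epochs):
        for features, label in rows:
            error = label - predict(weights, features)
            for k, v in features.items():
                weights[k] = weights.get(k, 0.0) + lr * error * v
    return weights


# Hypothetical sparse rows: one training set, and so one model, per interest.
TRAINING = {
    "music": [
        ({"bias": 1, "attended:concert": 1}, 1),
        ({"bias": 1, "attended:conference": 1}, 0),
        ({"bias": 1, "browsed:festival": 1}, 1),
        ({"bias": 1, "browsed:workshop": 1}, 0),
    ],
    "tech": [
        ({"bias": 1, "attended:concert": 1}, 0),
        ({"bias": 1, "attended:conference": 1}, 1),
        ({"bias": 1, "browsed:festival": 1}, 0),
        ({"bias": 1, "browsed:workshop": 1}, 1),
    ],
}

models = {interest: fit(rows) for interest, rows in TRAINING.items()}
```

Scoring a user against every entry in `models` yields one probability per interest, which can then be thresholded or ranked.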
Model Based vs Clustering

            Item-Item vs User-User

     Building Social Graph is Clustering Step

Social Graph Recommendation is a Ranking Problem
Implicit Social Graph


                                 U1


                            E1        E4

                  U2                       U3


             E2        E3

        U4                       U5
Mixed Social Graph


                                U1


                           E1

                 U2                  U3


            E2        E3
                                          FB
       U4                       U5
                                          LI
15M * 260 * 260 ≈ 1.01 Trillion Edges
               4 Billion edges ranked
   Each node is a feature vector representing a User

Each edge is a feature vector representing a Relationship
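
The edge estimate is easy to check. The sketch below assumes 15M is the number of users and each factor of 260 is an average fan-out per hop; the slide does not spell out what the factors mean, so the variable names are an interpretation.

```python
USERS = 15_000_000        # "15M" from the slide
FANOUT_PER_HOP = 260      # assumed meaning of each "260" factor

candidate_edges = USERS * FANOUT_PER_HOP * FANOUT_PER_HOP
ranked_edges = 4_000_000_000  # "4 Billion edges ranked"

print(f"{candidate_edges:,}")                   # 1,014,000,000,000
print(f"{ranked_edges / candidate_edges:.2%}")  # 0.39%
```

So the product works out to roughly 1.01 trillion candidate edges, of which about 0.4% are ranked.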
Feature Generation

•   Mixed Features
•   A series of map-reduce jobs
•   Output on HDFS in flat files; Input to subsequent jobs
•   Orders = Event → Attendees
    • MAP: eid: uid
    • REDUCE: eid:[uid]
• Attendees → Social Graph
    • Input: eid:[uid]
    • MAP: uid_i:[uid]
    • REDUCE: uid:[neighbors]
• Interest based features, user specific, graph mining etc
• Upload feature values to HBase
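
The two chained jobs above can be sketched with an in-memory stand-in for MapReduce. The production jobs were native Java MapReduce per the earlier slides; this Python version only illustrates the key/value flow, and the sample orders are made up.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for one job: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    output = []
    for key in sorted(shuffled):
        output.extend(reducer(key, shuffled[key]))
    return output


# Job 1: orders -> event attendees (MAP: eid:uid, REDUCE: eid:[uid])
def orders_mapper(order):
    yield order["eid"], order["uid"]


def attendees_reducer(eid, uids):
    yield eid, sorted(set(uids))


# Job 2: attendees -> social graph (MAP: uid_i:[uid], REDUCE: uid:[neighbors])
def coattendance_mapper(record):
    _eid, uids = record
    for uid in uids:
        yield uid, [other for other in uids if other != uid]


def neighbors_reducer(uid, neighbor_lists):
    neighbors = set()
    for lst in neighbor_lists:
        neighbors.update(lst)
    yield uid, sorted(neighbors)


# Hypothetical orders, one (event, user) pair per ticket order.
ORDERS = [
    {"eid": "E1", "uid": "U1"}, {"eid": "E1", "uid": "U2"},
    {"eid": "E4", "uid": "U1"}, {"eid": "E4", "uid": "U3"},
    {"eid": "E2", "uid": "U2"}, {"eid": "E2", "uid": "U4"},
    {"eid": "E3", "uid": "U2"}, {"eid": "E3", "uid": "U5"},
]

# Output of job 1 is the input of job 2, as with flat files on HDFS.
attendees = run_mapreduce(ORDERS, orders_mapper, attendees_reducer)
graph = dict(run_mapreduce(attendees, coattendance_mapper, neighbors_reducer))
```

Here `graph["U2"]` is `["U1", "U4", "U5"]`: U2 shares an event with each of those users.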
U1




U2        U3
HBase
HBase




• Collect data from multiple Map Reduce jobs
   • Stores entire social graph
   • Over one million writes per second
HBase




    rowid     neighbors   events   featureX
    2718282   101         3        0.3678795
HBase




rowid     314159:n   314159:e   314159:fx   161803:n   161803:e   161803:fx
2718282   31         1          0.3183      83         2          0.618
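
One reading of the wide-row layout above is that each column qualifier is `<neighborId>:<feature>`, so a user's whole neighborhood lives in a single row. The sketch below builds such a row as a plain dict; it does not use an HBase client, and the helper names are hypothetical.

```python
def edge_columns(neighbor_id, features):
    """Flatten one neighbor's edge features into `<neighborId>:<feature>` qualifiers."""
    return {f"{neighbor_id}:{name}": value for name, value in features.items()}


def build_row(user_id, edges):
    """Return (row key, {qualifier: value}) for one user and all of its edges."""
    row = {}
    for neighbor_id, features in edges.items():
        row.update(edge_columns(neighbor_id, features))
    return str(user_id), row


def neighbors_of(row):
    """Recover the neighbor ids from the column qualifiers."""
    return sorted({qualifier.split(":", 1)[0] for qualifier in row})


# Values taken from the table above: n = neighbors, e = events, fx = featureX.
rowkey, row = build_row(2718282, {
    314159: {"n": 31, "e": 1, "fx": 0.3183},
    161803: {"n": 83, "e": 2, "fx": 0.618},
})
```

With this layout, a single row read returns every edge feature for a user, and adding a neighbor only adds columns.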
Tips & Tricks




• Distributed cache database
   • Sped up some Map Reduce jobs by hours
   • Be sure to use counters!
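
A sketch of the pattern behind this tip: load a small side table once per task and join against it inside the mapper, with counters tracking hits and misses. The deck mentions a database shipped through the distributed cache; this example stands in a tab-separated file and an in-memory dict for it, and all data and counter names are invented.

```python
from collections import Counter


def load_lookup(lines):
    """Parse the cached side table (uid<TAB>city) into a dict, once per task."""
    lookup = {}
    for line in lines:
        uid, city = line.rstrip("\n").split("\t")
        lookup[uid] = city
    return lookup


def map_orders(orders, lookup, counters):
    """Map-side join: enrich each order from the lookup, counting hits and misses."""
    for eid, uid in orders:
        city = lookup.get(uid)
        if city is None:
            counters["lookup_miss"] += 1
            continue
        counters["lookup_hit"] += 1
        yield eid, (uid, city)


# Hypothetical cached file contents and input records.
CACHED_FILE = ["U1\tSan Francisco\n", "U2\tOakland\n"]
ORDER_RECORDS = [("E1", "U1"), ("E1", "U2"), ("E4", "U3")]

counters = Counter()
joined = list(map_orders(ORDER_RECORDS, load_lookup(CACHED_FILE), counters))
```

The join avoids a shuffle for the side table, and a rising `lookup_miss` count is an early sign that the cached data is stale or incomplete.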
Tips & Tricks




• Hive (ab)uses
   •   Almost as many hive jobs as custom ones
   •   “flip join”
   •   Statistical functions using hive
   •   UDF
Tips & Tricks


•   Memory Memory Memory
•   LZO, WAL
•   Combiners are great until
•   Shuffle and Sorting stage
•   Hadoop ecosystem is still new
Questions?
