Rebuilding Web Tracking Infrastructure for Scale
Stephen Oakley
Principal Engineer
Marketo
What is Marketo?
Marketo Proprietary and Confidential | © Marketo, Inc.
10/31/2016
What is Web Tracking at Marketo?
• Ingest web page visits and clicks on customers’ websites
• Trigger campaigns in response to web activity
• Trigger real-time personalization of the web experience
• Provide lead-level analytics for known leads
• Provide aggregate analytics for all lead activity
• Known leads are typically < 10% of all traffic
Legacy Web Tracking Infrastructure
Legacy Problems
• Throughput limitations – 2 million activities per day
• Processing delays can be on the order of hours
• Large customers cause web server brownouts
• Web reporting does not scale
• Fixed-sized clusters prohibit horizontal scaling
• Brittle infrastructure prevents feature development
The Vision
Orion Initiative
• Increase scale to support IoT for Marketers
• Support billions of marketing activities each day
• Trigger on activities in near real time (< 2 minutes at the 99th percentile)
• Reduce operational costs
• Improve multitenancy and QoS
Requirements
Business Requirements
• 200 MM activities per customer per day
• Near real-time web activity processing (SLA of < 1 minute lag)
• Improve cost efficiency
• Improve flexibility for feature enhancements
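To put the volume requirement in perspective, the daily figures can be converted into average per-second rates. This is only back-of-the-envelope arithmetic on the numbers quoted in the slides (the 2 million/day legacy limit and the 200 MM/day target); real traffic is bursty, so peak rates would be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate_per_second(activities_per_day: int) -> float:
    """Average sustained rate implied by a daily activity volume."""
    return activities_per_day / SECONDS_PER_DAY


legacy = average_rate_per_second(2_000_000)    # legacy throughput limit
target = average_rate_per_second(200_000_000)  # per-customer requirement

# prints: legacy: 23.1/s, target: 2314.8/s
print(f"legacy: {legacy:.1f}/s, target: {target:.1f}/s")
```

The target is a 100x increase over the legacy limit, which is why the later slides focus on horizontal scaling rather than tuning the existing stack.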
Technical Requirements
• Multitenancy support with brownout protections
• Infrastructure must scale horizontally
• Decouple web processing from downstream processing
• Anonymous leads should cost next to nothing to track
Architecture & Design
Why HBase + Phoenix?
• Horizontally scalable
• Leverages the Hadoop cluster for storage and scaling
• Provides secondary indices for query patterns through Phoenix
• Natural integration with JDBC and Spark JDBC RDDs
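The speaker notes mention that cookies and anonymous leads live in a salted HBase table keyed by subscription-cookie-leadid. The sketch below illustrates the general idea of salting, in the spirit of Phoenix's `SALT_BUCKETS` table option: a small hash-derived prefix spreads otherwise adjacent keys across regions. The bucket count, the MD5 hash, and the function name are assumptions made for illustration; this is not Phoenix's actual hashing scheme or Marketo's code.

```python
import hashlib

NUM_SALT_BUCKETS = 16  # hypothetical bucket count


def salted_row_key(subscription: str, cookie: str, lead_id: int,
                   buckets: int = NUM_SALT_BUCKETS) -> bytes:
    """Prefix the logical key with one salt byte derived from its hash."""
    logical = f"{subscription}-{cookie}-{lead_id}".encode("utf-8")
    # The salt is deterministic, so a point lookup can recompute it.
    salt = hashlib.md5(logical).digest()[0] % buckets
    return bytes([salt]) + logical


key = salted_row_key("sub1", "cookieA", 42)
```

The trade-off is that a range scan over one subscription has to fan out across every salt bucket, which is one reason a secondary index for the common query patterns is useful.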
Marketo Lambda Architecture
[Architecture diagram. Component labels: Spark Streaming, Consumers, Campaign Triggers, Solr Indexing, Solr, Spark Streaming Indexer, Ingestion Processor, Scala/Tomcat, HBase, Kafka, CRM Sync, Partner APIs, Other Marketing Activities, Web Activity, RTP Activity, Mobile Activity, Marketo UI, Campaign Detail, Lead Detail, Other Clients, CRM Sync, Revenue Cycle Analytics, APIs, Email Report Loader, Web Activity Processor]
Why Spark Streaming?
• Micro-batching provides sink-side efficiencies
• This is especially important with MySQL touchpoints
• Great integration with Kafka
• No strict real-time processing requirements
• Great community and industry adoption
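The "sink-side efficiencies" point can be illustrated with a small sketch: writing a micro-batch in one transaction costs one commit, where per-event writes would cost one commit each. The example uses Python's built-in `sqlite3` as a stand-in for MySQL; the table name, columns, and batch size are hypothetical and not taken from the talk.

```python
import sqlite3
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[str, str]  # (lead_id, url), a hypothetical event shape


def micro_batches(events: Iterable[Event], size: int) -> Iterator[List[Event]]:
    """Group a stream of events into lists of at most `size` items."""
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_batches(conn: sqlite3.Connection, events: Iterable[Event],
                  size: int) -> int:
    """Write events batch by batch; return the number of transactions used."""
    conn.execute("CREATE TABLE IF NOT EXISTS activity (lead_id TEXT, url TEXT)")
    transactions = 0
    for batch in micro_batches(events, size):
        with conn:  # one transaction (and one commit) per micro-batch
            conn.executemany("INSERT INTO activity VALUES (?, ?)", batch)
        transactions += 1
    return transactions
```

With 10 events and a batch size of 4, this uses 3 transactions instead of 10. The same amortization applies to metadata lookups, which is where the MySQL touchpoints benefit most.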
Multitenancy
• One topic per customer (sized by volume)
• Traffic storms are isolated to a single customer
• Fairness/throttling is easy to control
• Spark Streaming job consumes from many topics
• Allows us to turn a customer off under error conditions
• See “Elastic Streaming” by Neelesh Shastry (Spark Summit)
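The isolation properties listed above follow from keeping each customer's events in a separate queue. The sketch below models that with in-memory deques standing in for Kafka topics: each batch takes at most a fixed number of events per customer, and a customer can be switched off without losing its backlog. The class and method names are hypothetical, and the real system described in the talk is a Spark Streaming job over Kafka, not this toy scheduler.

```python
from collections import deque
from typing import Deque, Dict, List, Set, Tuple


class TenantScheduler:
    """Toy model of one-topic-per-customer consumption with throttling."""

    def __init__(self, max_per_tenant: int) -> None:
        self.max_per_tenant = max_per_tenant
        self.topics: Dict[str, Deque[str]] = {}  # customer -> pending events
        self.disabled: Set[str] = set()

    def publish(self, tenant: str, event: str) -> None:
        self.topics.setdefault(tenant, deque()).append(event)

    def disable(self, tenant: str) -> None:
        self.disabled.add(tenant)

    def enable(self, tenant: str) -> None:
        self.disabled.discard(tenant)

    def next_batch(self) -> List[Tuple[str, str]]:
        """Take up to max_per_tenant events from each enabled customer."""
        batch: List[Tuple[str, str]] = []
        for tenant, queue in self.topics.items():
            if tenant in self.disabled:
                continue  # backlog stays queued until re-enabled
            for _ in range(min(self.max_per_tenant, len(queue))):
                batch.append((tenant, queue.popleft()))
        return batch
```

A traffic storm from one customer only lengthens that customer's own queue; every other customer still gets its share of each batch.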
Making Spark Streaming Performant
• Coalesce small partitions for the same customer
• Aggressive caching of metadata (mostly from MySQL)
• Heavily leverage Scala future composition for parallelism
• Persist RDDs that are used for multiple outputs
• e.g. write to Kafka and Activity Service
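The future-composition point can be sketched as follows. The talk uses Scala futures; this is a Python `asyncio` analogue of the same pattern, where independent lookups for an event are started together and their results are combined once both complete. The two lookup functions are hypothetical stand-ins for the outbound RPC calls mentioned in the speaker notes, with a short sleep simulating network latency.

```python
import asyncio
from typing import Dict, List


async def lookup_lead(cookie: str) -> Dict[str, str]:
    await asyncio.sleep(0.01)  # stands in for an RPC to a lead service
    return {"lead_id": f"lead-for-{cookie}"}


async def lookup_page(url: str) -> Dict[str, str]:
    await asyncio.sleep(0.01)  # stands in for a page metadata lookup
    return {"page_title": f"title-of-{url}"}


async def enrich(event: Dict[str, str]) -> Dict[str, str]:
    # Both lookups run concurrently; total latency is roughly the slower one.
    lead, page = await asyncio.gather(
        lookup_lead(event["cookie"]),
        lookup_page(event["url"]),
    )
    return {**event, **lead, **page}


async def enrich_batch(events: List[Dict[str, str]]) -> List[Dict[str, str]]:
    # All events in the batch are enriched concurrently as well.
    return await asyncio.gather(*(enrich(e) for e in events))
```

Calling `asyncio.run(enrich_batch(events))` enriches a whole batch in about the time of one round trip, instead of two sequential round trips per event.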
Making Anonymous Traffic Cheap
• High costs of web traffic in legacy system
• MySQL storage for all traffic
• Downstream processing of all events (even anonymous)
• V2 only processes and stores known traffic in MySQL
• Defer triggering for anonymous data until promotion
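The deferral idea can be sketched as a small buffer: events from unknown cookies are only stored, and they are handed to downstream processing when the cookie is promoted to a known lead. This is a minimal in-memory illustration under assumed names (`AnonymousBuffer`, `promote`, the `downstream` callback); per the speaker notes, the real system keeps anonymous data in HBase rather than in process memory.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List


class AnonymousBuffer:
    """Hold anonymous events cheaply; release them on promotion."""

    def __init__(self, downstream: Callable[[str, dict], None]) -> None:
        self.downstream = downstream
        self.known: Dict[str, str] = {}  # cookie -> lead id
        self.pending: DefaultDict[str, List[dict]] = defaultdict(list)

    def on_event(self, cookie: str, event: dict) -> None:
        lead_id = self.known.get(cookie)
        if lead_id is None:
            # Anonymous: store only, no triggering or other downstream work.
            self.pending[cookie].append(event)
        else:
            self.downstream(lead_id, event)

    def promote(self, cookie: str, lead_id: str) -> None:
        """Cookie became a known lead (e.g. via a form fill-out)."""
        self.known[cookie] = lead_id
        for event in self.pending.pop(cookie, []):
            self.downstream(lead_id, event)
```

Because most traffic never converts, most events never reach the expensive downstream path at all.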
Impact and Results
• Rolled out to our highest volume customers
• Processing latencies < 30s (at the 99.9th percentile)
• Allowed key customers to scale from ~2 MM/day to > 20 MM/day
Future Work
• Mitigating straggler effects on processing delays
• Adding sessionization for web reporting
• Scaling Kafka topics as customers increase volume
• Globally distributed ingestion for a single customer
We’re Hiring!
Http://Marketo.Jobs
Q & A

Editor's Notes

  • #3: Next phase was when we were ready to validate our newly built event ingestion system. Marketo is a powerful engagement marketing platform. Several applications make up the platform, such as ABM, marketing analytics, predictive content, digital ads, and marketing automation. Marketing automation is what we are focusing on today. It enables the marketer to create, automate, and measure marketing campaigns across channels. A simple example of an automated campaign or workflow: (1) a user visits your website and fills out a form; (2) web tracking sees that they spent most of their time looking at pages about Spark Streaming; (3) an email is automatically sent to the user inviting them to a webinar on Spark Streaming services; (4) if they attend the webinar, their interest is registered in your CRM and a salesperson is asked to contact them. Campaigns can be complex and can reach out to and track customers across channels like web, email, mobile, and social.
  • #4: Explain what a known vs. anonymous lead is: a known lead is targetable on other channels; an anonymous lead has only web activity. Speak to how the traffic patterns are heavily skewed toward anonymous given our customer base. Talk about how anonymous converts to known. Aggregate analytics include the company web report, landing page reports, etc.
  • #5: Speak to the pod. Mention how there are many, many pods.
  • #6: An additional complication is the fact that the same two web servers also serve the MLM app, the SOAP APIs, and the landing pages.
  • #8: Although the talk isn’t about the project, we have a few slides up front to set the context around what we are working on. If you have been near technology at all in the last couple of years, you know that the world has become very connected. The number of connected devices blows my mind. It’s not just phones anymore: Amazon Dash buttons, coffee makers, propane tanks, garage doors. These devices are sending tens of billions of activities and user interactions every day. Orion is our platform. Our marketing platform ingests the user interactions and processes them into relevant marketing touchpoints. It enables marketers to create marketing campaigns around these activities to build relationships with their customers. Become the fabric for marketers. It’s been a great experience building this.
  • #9: Here are a few of the requirements. Near real-time processing. At least 1 billion activities per customer per day. Customer demands from increasing devices caused us to evaluate next-gen queueing and streaming. Reduction in infrastructure COGS, primarily from expensive enterprise-class filers. Reduction in people COGS through efficiency gained by reducing a tech stack that used too many similar technologies. Multitenant, of course. Secure. Customer isolation and improved resource management.
  • #12: Architecture requirements are driven by business requirements. Improve utilization over the existing system. Lots of customers in the same infrastructure, without starving. Encryption from day 1 for safe data storage. Aim for horizontal scalability, coming from a standard 3-tier app. Radically reduce processing latency. Eliminate backlogs. Brownout protection.
  • #13: A few words about the architecture. The main goal is to ingest, process, and store marketing events.
  • #14: Detailed overview of the Munchkin FE (MFE) component. Spray.io for MFE. The frontend has the simple job of verifying subscription status, collecting metrics, and persisting to Kafka. Use Avro to allow for schema evolution, strong typing, and a compact representation in the topic. Use Schema Registry to allow the schema to be upgraded by the producer and then automatically picked up by the Spark Streaming component. Use the asynchronous API for Kafka to allow high throughput.
  • #15: Detailed overview of the LeadService component. Spray.io for LeadService. HBase for cookie and anonymous lead storage: salted table; key structure is subscription-cookie-leadid; secondary index for subscription-lead-createdat. MySQL for known lead storage. Masterdata for reverse IP information enrichments.
  • #17: Overall view of the system. Describe how there is a Kafka topic per subscription. Spark Streaming transforms the raw events into activities by enriching them with web page metadata from MySQL and with lead and reverse IP information from LeadService. Persist activities to AS for storage and secondary processing (e.g. triggering and Solr indexing). Push enriched web events to Kafka for the downstream Druid OLAP infrastructure.
  • #18: High-level diagram of our event processor: an enhanced Lambda Architecture. Inbound activities are written to the Ingestion Processor (HBase and then Kafka). High-volume (e.g. web) activities are first written to Kafka, then enriched. Spark Streaming applications consume events from Kafka: Solr indexing, email reports, campaign processing. HBase is used for simple historical queries and is the system of record.
  • #19: While it is not “true” streaming, micro-batching is exactly the optimization we need.
  • #21: Our multitenant Kafka framework coalesces small Kafka partitions into large Spark RDD partitions to improve batch utilization. Several components of the event enrichment require outbound RPC calls; using async clients, performing the calls in parallel, and then composing the futures pipelines the computation and significantly improves throughput. Web assets and cookies are cached for temporal locality; the cache is > 60% of the executor memory. Enriched events are written out to multiple outputs, and being selective about persisting RDDs prevents recomputing expensive transformations (multiple RPC calls or MySQL queries).
  • #22: Traditionally, both anonymous and known data were treated equally in MLM. This is problematic because anonymous volumes are usually 10-20x higher than known. Additionally, there is very little intrinsic value in performing downstream processing on anonymous data, since you cannot target anonymous leads for campaigns. To improve this, in Munchkin V2 we only allow known traffic to flow to downstream processing. Anonymous data is passed to downstream processing when the lead converts to a known lead via form fill-out, API calls, etc.
  • #26: Reiterates my points on the last slide. I included in case you wanted to look at the slides later
  • #27: Give a quick overview of the activities architecture. Introduce Kafka in the presentation
  • #28: Spend more time on this: purple is our code, teal is Spark standard. SubscriptionRegistry uses ZooKeeper (ZK). OffsetManager is a library that uses the low-level Kafka consumer API. Provisioning framework: Sirius; a new subscription is provisioned to the registry via Oozie.
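
The partition coalescing described in the note for #21 (and on the "Making Spark Streaming Performant" slide) can be sketched as a greedy packing step: small Kafka partitions belonging to the same customer are grouped until a target batch size is reached, and each group would then be read as one Spark partition. The tuple layout, the `target` threshold, and the greedy strategy are assumptions for illustration; the talk does not describe the actual algorithm.

```python
from typing import Dict, List, Tuple

# (customer, Kafka partition id, pending record count); a hypothetical shape
KafkaPartition = Tuple[str, int, int]


def coalesce(partitions: List[KafkaPartition],
             target: int) -> List[List[KafkaPartition]]:
    """Greedily group each customer's partitions up to `target` records."""
    by_customer: Dict[str, List[KafkaPartition]] = {}
    for part in partitions:
        by_customer.setdefault(part[0], []).append(part)

    groups: List[List[KafkaPartition]] = []
    for parts in by_customer.values():
        current: List[KafkaPartition] = []
        size = 0
        for part in parts:
            # Close the current group if adding this partition would overflow.
            if current and size + part[2] > target:
                groups.append(current)
                current, size = [], 0
            current.append(part)
            size += part[2]
        if current:
            groups.append(current)
    return groups
```

Groups never mix customers, so per-customer isolation is preserved while many tiny tasks are replaced by fewer, fuller ones.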