Powering Social Business Intelligence
Cassandra and Hadoop at the Dachis Group
Social Business Wha?
Big Data meets Big Budgets




• Brand marketers spend
  • $450 (~£270) billion annually on traditional media
  • $50 (~£30) billion annually on SEO/SEM
• Starting to transition to social media
Effectiveness - Traditional
Measure all the things!




            Measuring traditional marketing effectiveness
Effectiveness - Social
Measure all the things!




             Measuring social marketing effectiveness
The Dachis Group
Measure all the things!




• Jeff Dachis amasses small army of social strategists
• Funds team to create social analytics platform
  • Measure business outcomes of social media strategies
  • Track social media surrounding Forbes Global 2000
  • Include all brands, all subsidiaries, all social media types
Architecture

• Raw data in S3
• Cassandra
  • Realtime queries to return raw data
  • Hadoop analytic integration for foundational measures
  • Horizontal scalability
  • Operationally simple
• RDBMS
  • Time rollups of measures
  • Aggregates and composite measures
  • Arbitrary dimensional queries
  • Mini data warehouse
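A minimal sketch of what a "time rollup of measures" could look like, assuming hourly measure rows are summed into daily ones. The brand, the measure name and the values are hypothetical and are not taken from the actual metrics store.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical hourly measure rows: (brand, measure, timestamp, value).
hourly_rows = [
    ("acme", "signal_volume", datetime(2011, 9, 1, 9), 120),
    ("acme", "signal_volume", datetime(2011, 9, 1, 17), 80),
    ("acme", "signal_volume", datetime(2011, 9, 2, 9), 50),
]

def rollup_daily(rows):
    """Sum hourly values into one total per (brand, measure, day)."""
    totals = defaultdict(int)
    for brand, measure, ts, value in rows:
        totals[(brand, measure, ts.date())] += value
    return dict(totals)

daily = rollup_daily(hourly_rows)
```

In a relational store the same rollup would typically be a GROUP BY over a truncated timestamp; the dictionary here only stands in for that table.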
Pipeline



[Pipeline diagram]
• Stores, left to right: AWS S3 (raw signal storage), Cassandra (signal repository), Postgres with Memcached (metrics store)
• Stages between the stores, in order: Normalization, Enrichment, Analysis
Normalization




• Parallel copy from S3 to HDFS
• MapReduce to Cassandra from Raw to Normalized CF
• Normalized data model
  • Decent investment to get right
  • Mostly for conceptual reasons rather than concerns about queries
  • Secondary indexes vs app maintained indexes
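The normalization step and the "app maintained index" option can be sketched roughly as follows. The source types, field names and the author index are illustrative assumptions; the real job ran as MapReduce writing into a Cassandra column family, which this plain-Python stand-in does not attempt to reproduce.

```python
from collections import defaultdict

def normalize(raw):
    """Map a source-specific raw record onto one common signal shape."""
    if raw["source"] == "microblog":
        return {"id": raw["id"], "author": raw["user"],
                "text": raw["status"], "kind": "microblog"}
    if raw["source"] == "blog":
        return {"id": raw["id"], "author": raw["byline"],
                "text": raw["body"], "kind": "blog"}
    raise ValueError(f"unknown source: {raw['source']}")

def build_author_index(signals):
    """App-maintained index: author -> ids of that author's signals."""
    index = defaultdict(list)
    for signal in signals:
        index[signal["author"]].append(signal["id"])
    return dict(index)

raw_records = [
    {"source": "microblog", "id": "m1", "user": "ann", "status": "hello"},
    {"source": "blog", "id": "b1", "byline": "ann", "body": "a long post"},
]
signals = [normalize(r) for r in raw_records]
author_index = build_author_index(signals)
```

With an app-maintained index the application writes the index entries itself alongside the data; a secondary index would leave that bookkeeping to the database.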
Enrichment



• Enrich with
  • Unique company/brand information
  • Sentiment
  • Relationships
  • Conversations
  • Social graph information
• Enter Pig
• Enter Oozie
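Enrichment is largely a join between signals and reference data. The sketch below shows that join logic in plain Python rather than Pig, assuming hypothetical `account_id`, `brand` and `followers` fields; it illustrates the idea only and is not the production script.

```python
def enrich(signals, account_snapshots):
    """Left-join each signal to the snapshot of the account that wrote it."""
    by_account = {snap["account_id"]: snap for snap in account_snapshots}
    enriched = []
    for signal in signals:
        snap = by_account.get(signal["account_id"], {})
        enriched.append({
            **signal,
            "brand": snap.get("brand"),          # None when no snapshot exists
            "followers": snap.get("followers"),
        })
    return enriched

signals = [
    {"id": "s1", "account_id": "a1", "text": "love the new phone"},
    {"id": "s2", "account_id": "a9", "text": "no snapshot for this one"},
]
snapshots = [{"account_id": "a1", "brand": "acme", "followers": 1200}]
enriched = enrich(signals, snapshots)
```

A left join keeps signals whose account has no snapshot yet, so they can be enriched on a later run instead of being dropped.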
The Bleeding Edge
Pig


 • newlogicalplan in 0.8.0
 • Debugging/tracing?
 • Incremental development
 • Working with Cassandra
      • Pygmalion - facilitating to and from Cassandra
 • Experience, unit test framework, UDFs, community



The Bleeding Edge
Oozie
• Learning curve and common errors
  • User impersonation
  • Logs, we haz them, lots of them
  • Web UI needs love
• Specific to Cassandra
  • mapreduce.fileoutputcommitter.marksuccessfuljobs
  • See http://wiki.apache.org/cassandra/HadoopSupport#Oozie
• Still very good DAG workflow crunching tool
  • Subworkflows, fork/join, regular scheduling, dataset detection
  • Extensible
  • Apache Incubator (@oozie on twitter, #oozie on freenode)
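For the Cassandra-specific property named above, a small sketch that renders a Hadoop-style `<property>` element to paste into a job configuration. The value `false` is an assumption on our part (the marker file written on job success has no sensible home when the output is Cassandra); the wiki page referenced on the slide is the authority.

```python
import xml.etree.ElementTree as ET

def hadoop_property(name, value):
    """Build a Hadoop-style <property> element with <name> and <value> children."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop

# Assumed value: "false". Check the HadoopSupport wiki page before relying on it.
prop = hadoop_property("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
snippet = ET.tostring(prop, encoding="unicode")
```

The resulting element would go inside the `<configuration>` block of the relevant workflow action.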
The Bleeding Edge
Cassandra

 • Rack aware snitch and replication
   • Always rotate racks in order in topology
   • In EC2 this likely means rotate AZs
 • Dealing with scanning over column families
   • Project early
 • General tuning and unique workload
   • Mahout and other higher memory hadoop tasks
   • EC2 instance types
 • Visualization tool helped (OpsCenter, Acunu has Control Center)
 • Community++
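Why rotating racks in ring order matters can be seen with a toy model of rack-aware replica placement. This is a simplified simulation under stated assumptions (one token per node, a walk around the ring that prefers unused racks); it is not Cassandra's actual strategy code, and the node and rack names are made up.

```python
from collections import Counter

def place_replicas(ring, start, rf):
    """Walk the ring clockwise from `start`, preferring racks not yet used."""
    chosen, seen_racks, skipped = [], set(), []
    for step in range(len(ring)):
        node, rack = ring[(start + step) % len(ring)]
        if rack not in seen_racks:
            chosen.append(node)
            seen_racks.add(rack)
        else:
            skipped.append(node)
        if len(chosen) == rf:
            break
    # Fewer racks than replicas: fall back to the nodes skipped earlier.
    chosen.extend(skipped[: rf - len(chosen)])
    return chosen

def replica_load(ring, rf=3):
    """Count how many token ranges each node ends up holding a replica for."""
    load = Counter()
    for start in range(len(ring)):
        load.update(place_replicas(ring, start, rf))
    return dict(load)

# Racks rotate in ring order (a, b, c, a, b, c) versus clumped (a, a, b, b, c, c).
rotated = [("n0", "a"), ("n1", "b"), ("n2", "c"), ("n3", "a"), ("n4", "b"), ("n5", "c")]
clumped = [("n0", "a"), ("n1", "a"), ("n2", "b"), ("n3", "b"), ("n4", "c"), ("n5", "c")]

rotated_load = replica_load(rotated)
clumped_load = replica_load(clumped)
```

In this model the rotated ring gives every node the same replica count, while the clumped ring piles most replicas onto the first node of each rack. In EC2 the same reasoning applies with availability zones in place of racks.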
Social Business Index
Launches September 2011


                          • Global Ranking of Companies
                          • Industry Rankings
                          • Visualization of strategy
This might actually work!


 • Fall 2011, built up the team
 • Expertise in Pig, Lucene/Solr, machine learning, statistics, event
  prediction and analysis
 • Making everlasting gobstoppers
Social Performance Monitor
The measures behind the score
Topics topics topics




• Black Friday
  • Science project!
  • Mallet, Pig
  • Custom analysis
• Superbowl
• Oscars
Productizing Topics



• Ongoing automated topic detection
• Lessons from one-off topic analysis
• Represented by term distributions
• Threads with detail like
  • Signal volume
  • Participants
  • Links
  • Sentiment gauge
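A topic "represented by term distributions" can be pictured with the sketch below. It uses plain term frequencies and cosine similarity as a stand-in; it is not the Mallet-based topic modeling mentioned earlier, and the sample texts are invented.

```python
import math
from collections import Counter

def term_distribution(texts):
    """Relative frequency of each term across a set of texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def cosine(p, q):
    """Cosine similarity between two term distributions."""
    dot = sum(weight * q.get(term, 0.0) for term, weight in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

topic = term_distribution(["black friday deals", "black friday sale"])
on_topic = term_distribution(["black friday doorbuster deals"])
off_topic = term_distribution(["oscars red carpet"])

on_score = cosine(topic, on_topic)
off_score = cosine(topic, off_topic)
```

Scoring an incoming signal against each known topic distribution is one simple way to attach it to a thread; signal volume, participants and links can then be counted per thread.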
Advocates




• Auto-discovery of potential advocates
• Curated set of known advocates
• Example signal (from Cassandra)
• Reports and other useful bits
Lessons learned




• Emerging products are sometimes frustrating, but well worth the pain in
 their respective niche.
• “Never underestimate the massive impact of small bugs in big
 data.” (@peteskomoroch at LinkedIn)
• Community karma
A Note on Community

• Community involvement
  • IRC, mailing lists, twitter, conferences, meetups
  • Newer projects have little or outdated docs
  • Some features may be
    • Deprecated
    • Not ready for primetime
    • Not a fit for your use case
• Community karma
  • Don’t just take
  • Be a bridge builder
  • Positive karma helps
Questions?




• We’re hiring
• Ping me @jeromatron (Twitter and IRC)

Editor's Notes

  • #4: This is how they see their capability after, for example, the Superbowl.
  • #5: When managers ask how effective the campaign was, the marketing department says it was awesome. When asked how they know that, they say that Zoltar told them so. In reality there are a lot of home-grown methods, some good, some not so good. Some of what we did grew out of a spreadsheet that was manually updated, validated and refined over time with one of our major customers.
  • #6: What brands does Berkshire Hathaway have under its gigantic umbrella?!? Can mention Red Bull, Disney, HP, Levis, Samsung, Honda, etc.
  • #7: Operationally simple doesn’t mean that you don’t need to learn a lot about it, just that there aren’t a lot of moving parts. Unique use case in that it’s hybrid: both lots of writes and analytics and reads.
  • #9: It’s just scads of text, but we do classify: conversations long/short, difference between microblogs and blogs. We may use Hadoop to generate alternate CFs for specific queries as we need them.
  • #10: Company information is unique because we had to buy, borrow, steal and yes, crowdsource that data. Pig handles joins really well, for example account snapshots and signal for enrichment.
  • #11: Mention Brandon’s work to make things better with CassandraStorage and newer versions of Pig, including regression tests. Speculative execution.
  • #12: Mention having looked at Azkaban as well. No real way around the logs, just takes getting used to. User impersonation is a product of the authorization framework, patch added to DSE.
  • #13: Mention consistency level choices. Rotate racks: yeah, wasn’t documented except in the code. Backup/restore. Root causes sometimes difficult to determine. Scaling up: each order of magnitude jump has its own problems.
  • #14: But the long sleepless Summer finally pays off...
  • #15: Everlasting gobstoppers are a fun phase for the projects.
  • #16: Reveals numbers.
  • #17: Explanation. Great working as a team. Mention Boxing Day.
  • #18: Also customer curated topics in the future.
  • #20: Data consistency: periodic checks, staging cluster, unit tests, integration testing. Reparable data. Sometimes incredibly painful, but possible. Mention backup/restore. Mention root causes.
  • #21: Be active in communities of these new projects. If necessary start building communities around them. Don’t just take: answer questions, follow mailing lists but have a filter, docs, bug submission, feature requests, votes, representation, tests, patches/pull requests.