scrazzl
analysing science
the team
the place
the idea
the idea: what problem?


• Does this product work?

• Difficult to gather info

• Time-consuming
the idea: what solution?


• Open up

• Extract info

• Help manufacturers too
the product
the product: highlighting

• Highlight entities within articles

• Popup with supplementary info

• Further data on scrazzl.com
the product: scrazzl.com

• Repository of info

• Extracted from articles

• Cross-referenced

• Linked data
the product: analytics


• Brands

• Phrases

• Products

• Locations
the product: feeds

• Distribute exposed data across the web

• Gain inbound traffic

• Citations

• Ratings
demo
(should work)
the tech
the tech: architecture
the tech: scrazzl.com

• Varnish

• Nginx

• PHP / Zend Framework

• APC

• MySQL
the tech: deployment


• Git-based deployment
• Auto-pull from master every minute (danger, Will Robinson!)
• Work off develop branches and merge
the tech: configuration



• Currently local file read
• Unexpected annoyance
• Looking at Doozer / Zookeeper
the tech: scaling

• Every machine can disappear

• Ignore FS

• Uploads to S3

• Next: sessions off the box, then ready!

• Not quite auto-scaling but almost there

• Plan to fail
the tech: highlighting



• Index

• Analyse
the tech: index


• Apache Solr
• Learning curve not steep - just hard to find!
• ~25m documents
• Three servers
the tech: index


• Index documents by sentence
• Prevents cross sentence mismatches
• NLTK
• Not 100%
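
A minimal sketch of sentence-level indexing, assuming one index document per sentence. The deck names NLTK for the splitting; to keep this example dependency-free it uses a naive regex splitter as a stand-in, and a real tokenizer such as NLTK's `sent_tokenize` could be passed as `splitter` instead. The field names (`id`, `article_id`, `position`, `text`) are illustrative assumptions, not the actual scrazzl schema.

```python
import re

# Naive stand-in for a real sentence tokenizer (assumption, not NLTK itself).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'. Deliberately imperfect."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


def sentence_documents(article_id, text, splitter=split_sentences):
    """Turn one article into one index document per sentence."""
    return [
        {
            "id": f"{article_id}-{position}",  # hypothetical id scheme
            "article_id": article_id,
            "position": position,
            "text": sentence,
        }
        for position, sentence in enumerate(splitter(text))
    ]


docs = sentence_documents(
    "a1",
    "Cells were stained with DAPI. Images were taken on a Zeiss microscope.",
)

# The "Not 100%" caveat in practice: abbreviations fool a simple splitter.
wrong = split_sentences("Buffers, e.g. PBS, were used.")
```

Because each sentence is its own document, a proximity match can never span two sentences. The `wrong` example shows the cost: the abbreviation "e.g." is split into two fragments.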
the tech: index

• Performance factors
 • Distribute workload
 • Commit frequency
 • Data size
 • Caching
 • Memory
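
Of the factors above, commit frequency is the easiest to illustrate: sending documents in batches and committing only every few batches avoids paying the commit cost per document. The sketch below assumes a hypothetical `RecordingIndex` class that only counts calls; it stands in for whatever Solr client is in use, and the batch sizes are made-up values, not the ones scrazzl used.

```python
from itertools import islice


def batches(documents, size):
    """Yield lists of at most `size` documents from any iterable."""
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


class RecordingIndex:
    """Hypothetical stand-in for an index client; it only counts calls."""

    def __init__(self):
        self.added = 0
        self.commits = 0

    def add(self, batch):
        self.added += len(batch)

    def commit(self):
        self.commits += 1


def index_all(index, documents, batch_size=1000, commit_every=10):
    """Add documents in batches, committing once every `commit_every` batches."""
    pending = 0
    for batch in batches(documents, batch_size):
        index.add(batch)
        pending += 1
        if pending == commit_every:
            index.commit()
            pending = 0
    if pending:
        index.commit()  # flush whatever is left after the last full group


index = RecordingIndex()
index_all(index, ({"id": i} for i in range(25_000)), batch_size=1000, commit_every=10)
```

With 25,000 documents this issues 25 add calls but only 3 commits (after batches 10 and 20, plus the final flush).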
the tech: index

• 2 - 3 days to index full text
• 1 week if any issues arise
• Not a runner
• Reduced to 9 hours with optimisations
• ~450k / hr | ~125 / s
• Distributed index = distributed search
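
The figures on this slide can be checked with a little arithmetic. The sketch below only restates the slide's own numbers; the slide does not say whether the unit is documents or sentences, so the code leaves it unnamed.

```python
SECONDS_PER_HOUR = 3600


def per_second(per_hour):
    """Convert an hourly rate to a per-second rate."""
    return per_hour / SECONDS_PER_HOUR


def speedup(before_hours, after_hours):
    """How many times faster the new run is than the old one."""
    return before_hours / after_hours


rate = per_second(450_000)   # the slide's ~450k / hr
low = speedup(2 * 24, 9)     # 2 days down to 9 hours
high = speedup(3 * 24, 9)    # 3 days down to 9 hours
```

Here `rate` is 125.0, matching the slide's ~125 / s, and the optimisations work out to roughly a 5x to 8x speedup.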
the tech: analyse


• Gearman-like approach
• One job queue server
• Many analysis servers
• Many workers per analysis server
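
The shape described above (one queue, many workers pulling from it) can be sketched in a single process with threads. This is only an illustration of the pattern, not the scrazzl implementation: the real system spreads workers across analysis servers, and `run_workers` and `analyse` are hypothetical names.

```python
import queue
import threading


def run_workers(jobs, analyse, worker_count=4):
    """Push jobs onto one queue and let `worker_count` workers drain it."""
    job_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            job = job_queue.get()
            try:
                if job is None:  # sentinel: no more work for this worker
                    return
                outcome = analyse(job)
                with results_lock:
                    results.append(outcome)
            finally:
                job_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for job in jobs:
        job_queue.put(job)
    for _ in threads:
        job_queue.put(None)  # one sentinel per worker
    job_queue.join()
    for thread in threads:
        thread.join()
    return results


results = run_workers(range(10), lambda n: n * n, worker_count=3)
```

Results arrive in whatever order the workers finish, so callers that care about order need to sort or key them.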
the tech: analyse


• How
 • Solr proximity search
 • Magic filters \o/
 • Store in Mongo
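
Solr inherits proximity search from the Lucene query syntax: a quoted phrase followed by `~N` matches when the terms occur within N positions of each other. A small sketch of building such a query string follows; the field name `text` and the example terms are assumptions for illustration, and the function only builds the string that would be sent as the `q` parameter.

```python
def proximity_query(field, terms, slop):
    """Build a Lucene/Solr proximity query such as text:"foo bar"~5."""
    if slop < 0:
        raise ValueError("slop must be non-negative")
    # Escape backslashes and quotes so terms cannot break out of the phrase.
    escaped = [term.replace("\\", "\\\\").replace('"', '\\"') for term in terms]
    phrase = " ".join(escaped)
    return f'{field}:"{phrase}"~{slop}'


query = proximity_query("text", ["anti-GFP", "Abcam"], 5)
```

Here `query` is `text:"anti-GFP Abcam"~5`. Combined with sentence-level indexing, the slop value bounds how far apart a product and a brand may sit within one sentence.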
the tech: analyse


• Filters
 • Chained
 • Pattern matching
 • NLP entity identification
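
One way to read "chained" is that each filter either passes a candidate on (possibly annotated) or rejects it, and the chain stops at the first rejection. The sketch below assumes that reading. The filters themselves, the catalogue-number pattern, and the `sentence` field are invented for illustration; the NLP entity identification stage is not shown.

```python
import re

# Hypothetical pattern: two letters followed by 3 to 6 digits, e.g. "ab290".
CATALOGUE_NUMBER = re.compile(r"\b[A-Za-z]{2}\d{3,6}\b")


def min_length(candidate):
    """Reject very short sentences, which rarely describe a product."""
    return candidate if len(candidate["sentence"]) >= 20 else None


def tag_catalogue_number(candidate):
    """Pattern-matching filter: keep and annotate sentences with a catalogue number."""
    match = CATALOGUE_NUMBER.search(candidate["sentence"])
    if match is None:
        return None
    return {**candidate, "catalogue_number": match.group(0)}


def run_chain(candidate, filters):
    """Apply filters in order; None from any filter rejects the candidate."""
    for apply_filter in filters:
        candidate = apply_filter(candidate)
        if candidate is None:
            return None
    return candidate


chain = [min_length, tag_catalogue_number]
kept = run_chain({"sentence": "Cells were stained with anti-GFP (ab290, Abcam)."}, chain)
dropped = run_chain({"sentence": "No product here at all, sorry."}, chain)
```

Adding another filter is then just a matter of appending one more function to `chain`.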
the tech: analyse


• Where next
 • More magic filters
 • More NLP
 • Automated multi-threaded PHP setup
the tech: analytics


• Easy setup

• Fast writes

• Fast reads
the tech: analytics

• Data
 • Articles
 • Hits
 • Events
 • Aggregation
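
As an illustration of the aggregation step, the sketch below rolls raw events up into per-article, per-day hit counts in memory. The event shape (`type`, `article_id`, `at`) is an assumption made for the example; the deck does not show the actual schema or how the aggregation is run.

```python
from collections import Counter
from datetime import datetime


def aggregate_hits(events):
    """Count hit events per (article_id, ISO date); other event types are ignored."""
    counts = Counter()
    for event in events:
        if event["type"] != "hit":
            continue
        day = event["at"].date().isoformat()
        counts[(event["article_id"], day)] += 1
    return counts


events = [
    {"type": "hit", "article_id": "a1", "at": datetime(2012, 3, 1, 9, 30)},
    {"type": "hit", "article_id": "a1", "at": datetime(2012, 3, 1, 17, 5)},
    {"type": "popup", "article_id": "a1", "at": datetime(2012, 3, 1, 17, 6)},
    {"type": "hit", "article_id": "a2", "at": datetime(2012, 3, 2, 8, 0)},
]
daily = aggregate_hits(events)
```

Pre-aggregating like this keeps reads fast, because a dashboard only needs the small summary rather than every raw event.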
the tech: analytics


• MongoDB
 • Easy setup
 • PHP driver
 • Common use of analytics
the tech: analytics
• MongoDB becomes trickier
 • Replication
 • Sharding
• Primary
• Secondary
• Arbiters
• Configs
the tech: analytics

• Performance
 • 20,000 writes/s
• Key factors:
 • Index / data in memory
 • SSD (not us!)
the tech: architecture
@free2panik
scrazzl.com
Questions?

scrazzl - A technical overview
