scrazzl - A technical overview

the idea: what problem?

• Does this product work?

• Difficult to gather info

• Time consuming

the idea: what solution?

• Open up

• Extract info

• Help manufacturers too

the product: highlighting

• Highlight entities within articles

• Popup with supplementary info

• Further data on scrazzl.com

the product: scrazzl.com

• Repository of info

• Extracted from articles

• Cross referenced

• Linked data

the product: analytics

• Brands

• Phrases

• Products

• Locations

the product: feeds

• Distribute exposed data across the web

• Gain inbound traffic

• Citations

• Ratings

the tech: scrazzl.com

• Varnish

• Nginx

• PHP / Zend Framework

• APC

• MySQL

the tech: deployment

• Git based deployment
• Auto-pull from master every minute (danger, Will Robinson!)
• Work off develop branches and merge

the tech: configuration

• Currently local file read
• Unexpected annoyance
• Looking at Doozer / Zookeeper

the tech: scaling

• Every machine can disappear

• Ignore FS

• Uploads to S3

• Next: sessions off the box - then ready!

• Not quite auto-scaling but almost there

• Plan to fail

the tech: highlighting

• Index

• Analyse

the tech: index

• Apache Solr
• Learning curve not steep - just hard to find!
• ~25m documents
• Three servers

the tech: index

• Index documents by sentence
• Prevents cross sentence mismatches
• NLTK
• Not 100%
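
The sentence-level indexing described above can be sketched roughly as follows. This is only an illustration: the slides say NLTK does the sentence splitting, whereas the splitter here is a naive regular expression used as a stand-in so the example runs without downloading tokenizer data. The field names (`id`, `article_id`, `position`, `text`) and the function names are assumptions, not the actual scrazzl schema.

```python
import re


def split_sentences(text):
    """Naive stand-in for an NLTK sentence tokenizer.

    Splits after '.', '!' or '?' when followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_documents(article_id, text):
    """Build one index document per sentence (hypothetical field names)."""
    return [
        {
            "id": f"{article_id}-{n}",
            "article_id": article_id,
            "position": n,
            "text": sentence,
        }
        for n, sentence in enumerate(split_sentences(text))
    ]


docs = sentence_documents(
    "a1",
    "Cells were stained with DAPI. Images were taken on a Zeiss microscope.",
)
# Two documents, so a query for "DAPI" near "Zeiss" cannot match
# across the sentence boundary.

# Splitting is "not 100%": abbreviations confuse a simple splitter.
broken = split_sentences("See Fig. 2 for details.")  # two fragments, not one
```

Because each sentence becomes its own document, a match can never straddle two sentences; the cost is that any splitting mistake (as with "Fig." above) also splits the index entry.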

the tech: index

• Performance factors
• Distribute workload
• Commit frequency
• Data size
• Caching
• Memory

the tech: index

• 2 - 3 days to index full text
• 1 week if any issues arise
• Not a runner
• Reduced to 9 hours with optimisations
• ~450k / hr | ~125 / s
• Distributed index = distributed search
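
As a quick check of the throughput figures above, the per-hour and per-second rates are the same number in different units. The last line is an inference, not a figure from the slides: it assumes the quoted rate held steady for the full 9 hours, and the slides do not say which unit (article or sentence) is being counted.

```python
SECONDS_PER_HOUR = 3600

per_hour = 450_000                        # "~450k / hr" from the slide
per_second = per_hour / SECONDS_PER_HOUR  # 125.0, matching "~125 / s"

# Assumption: constant rate over the optimised 9-hour run.
items_in_nine_hours = per_hour * 9        # 4,050,000
```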

the tech: analyse

• Gearman-like approach
• One job queue server
• Many analysis servers
• Many workers per analysis server
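
The queue-and-workers layout above might look something like this in miniature. It is a single-process sketch using Python's standard library rather than Gearman or the real scrazzl workers: threads stand in for worker processes spread over analysis servers, and `analyse` is a hypothetical placeholder for the real analysis step.

```python
import queue
import threading


def analyse(article_id):
    """Hypothetical placeholder for the real per-article analysis."""
    return f"analysed:{article_id}"


def run_workers(article_ids, worker_count=4):
    """Push jobs onto one queue and let several workers drain it."""
    jobs = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            article_id = jobs.get()
            if article_id is None:  # sentinel: no more work for this worker
                break
            outcome = analyse(article_id)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for article_id in article_ids:
        jobs.put(article_id)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results


finished = run_workers(["a1", "a2", "a3"], worker_count=2)
```

Results arrive in whatever order the workers finish, so anything downstream should not rely on job order.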

the tech: analyse

• How
• Solr proximity search
• Magic filters \o/
• Store in Mongo
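
A proximity search in Solr's standard query syntax is a quoted phrase followed by `~N`, where N is the maximum distance between the terms. The helper below only builds that query string; the field name `text`, the example terms, and the function name are assumptions for illustration, and the slides do not show the queries scrazzl actually sends.

```python
def proximity_query(field, terms, slop):
    """Build a Lucene/Solr proximity query such as text:"a b"~5."""
    # Escape embedded double quotes so the phrase stays well-formed.
    phrase = " ".join(term.replace('"', '\\"') for term in terms)
    return f'{field}:"{phrase}"~{slop}'


query = proximity_query("text", ["zeiss", "microscope"], 5)
# query == 'text:"zeiss microscope"~5'
```

Run against an index built one sentence per document, a query like this can only match terms that sit close together within a single sentence.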

the tech: analyse

• Filters
• Chained
• Pattern matching
• NLP entity identification
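
One way to picture chained filters: each filter takes a list of candidate matches and returns the ones it accepts, and the output of one feeds the next. Everything in this sketch is hypothetical, including the candidate fields, the usage-verb pattern, and the fixed brand set that stands in for real NLP entity identification; the slides do not describe what the actual filters check.

```python
import re

# Hypothetical pattern: sentences that describe using a product.
USAGE_PATTERN = re.compile(
    r"\b(used|using|purchased from|obtained from)\b", re.IGNORECASE
)


def usage_pattern_filter(candidates):
    """Pattern-matching filter: keep sentences with a usage verb."""
    return [c for c in candidates if USAGE_PATTERN.search(c["sentence"])]


def known_entity_filter(candidates, known=frozenset({"zeiss", "sigma"})):
    """Stand-in for NLP entity identification: keep known brands only."""
    return [c for c in candidates if c["brand"].lower() in known]


def run_chain(filters, candidates):
    """Apply each filter in order to the survivors of the previous one."""
    for apply_filter in filters:
        candidates = apply_filter(candidates)
    return candidates


candidates = [
    {"brand": "Zeiss", "sentence": "Images were acquired using a Zeiss microscope."},
    {"brand": "Zeiss", "sentence": "Zeiss was founded in Jena."},
    {"brand": "Acme", "sentence": "We used an Acme centrifuge."},
]
accepted = run_chain([usage_pattern_filter, known_entity_filter], candidates)
# Only the first candidate survives both filters.
```

Keeping each filter as a plain function makes it cheap to add "more magic filters" later: a new check is one more entry in the list.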

the tech: analyse

• Where next
• More magic filters
• More NLP
• Automated multi-threaded PHP setup

the tech: analytics

• Easy setup

• Fast writes

• Fast reads

the tech: analytics

• Data
• Articles
• Hits
• Events
• Aggregation
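
The aggregation step could be as simple as counting hits per article per day. The sketch below does this in plain Python and also builds the filter and update documents for a MongoDB `$inc` upsert, which is one common way to keep such counters. The event shape, field names, and function names are assumptions; the slides do not show the real collections.

```python
from collections import Counter


def aggregate_hits(events):
    """Count 'hit' events per (article_id, day)."""
    counts = Counter()
    for event in events:
        if event["type"] == "hit":
            counts[(event["article_id"], event["day"])] += 1
    return counts


def hit_upsert(article_id, day):
    """Filter and update documents for an incrementing counter.

    With pymongo these would be passed as
    collection.update_one(filter_doc, update_doc, upsert=True).
    """
    filter_doc = {"article_id": article_id, "day": day}
    update_doc = {"$inc": {"hits": 1}}
    return filter_doc, update_doc


events = [
    {"type": "hit", "article_id": "a1", "day": "2012-05-01"},
    {"type": "hit", "article_id": "a1", "day": "2012-05-01"},
    {"type": "event", "article_id": "a1", "day": "2012-05-01"},
    {"type": "hit", "article_id": "a2", "day": "2012-05-01"},
]
totals = aggregate_hits(events)
```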

the tech: analytics

• MongoDB
• Easy setup
• PHP driver
• Common use of analytics

the tech: analytics

• MongoDB becomes trickier
• Replication
• Sharding
• Primary
• Secondary
• Arbiters
• Configs

the tech: analytics

• Performance
• 20,000 writes/s
• Key factors:
• Index / data in memory
• SSD (not us!)

@free2panik
scrazzl.com
Questions?
