http://www.flickr.com/photos/pio1976/3330670980/

Link extraction and classification
Bruno Pedro
December 2010
Bruno Pedro
An experienced Web developer and
entrepreneur. Has extensive background in
large-scale projects and technical writing.

http://tarpipe.com/user/bpedro
What is tarpipe?
twitter
• Average ~1.1M tweets/hour
• ~300 new tweets every second
source: mashable.com
Challenge
• ~300 reads/second
• 160 bytes × 300 reads/second = 48 KB/second ≈ 4 GB/day
  (approximate calculation)
• How to process all this information?
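
The approximate calculation on this slide can be checked in a few lines. This is a minimal sketch that assumes the 160 on the slide means bytes per tweet and that decimal units (1 KB = 1000 bytes) are intended; the constant names are illustrative only.

```python
# Assumed reading of the slide: ~160 bytes per tweet, ~300 tweets per second.
BYTES_PER_TWEET = 160
TWEETS_PER_SECOND = 300
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_second = BYTES_PER_TWEET * TWEETS_PER_SECOND
bytes_per_day = bytes_per_second * SECONDS_PER_DAY

print(f"{bytes_per_second / 1000:.0f} KB/second")  # 48 KB/second
print(f"{bytes_per_day / 1e9:.2f} GB/day")         # 4.15 GB/day
```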
Strategy
• Read and store immediately:
 • High performance write storage
 • No locks allowed
 • Expect many read errors
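
One way to picture "read and store immediately" is an append-only sink that writes each raw message as it arrives and simply counts failed reads. This is only a sketch under stated assumptions: `ingest`, `read_next` and `max_errors` are hypothetical names, and an in-memory buffer stands in for whatever high-performance write storage is actually used.

```python
import io
from typing import Callable, Optional, TextIO, Tuple


def ingest(read_next: Callable[[], Optional[str]],
           sink: TextIO,
           max_errors: int = 100) -> Tuple[int, int]:
    """Append raw messages to sink without parsing them; skip failed reads."""
    stored = errors = 0
    while errors < max_errors:
        try:
            raw = read_next()
        except OSError:
            errors += 1  # a failed read is expected: count it and move on
            continue
        if raw is None:  # assumption: None marks the end of the stream
            break
        sink.write(raw.replace("\n", " ") + "\n")  # one message per line
        stored += 1
    return stored, errors


# Usage with a fake reader that fails once.
events = iter(["tweet one", OSError("timeout"), "tweet two", None])


def fake_read() -> Optional[str]:
    item = next(events)
    if isinstance(item, Exception):
        raise item
    return item


buffer = io.StringIO()
result = ingest(fake_read, buffer)
print(result)  # (2, 1)
```

Nothing is parsed or classified at this stage, so the write path stays short; all analysis is deferred to the later processing step.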
Strategy
• Process later:
 • Regular expressions
 • Term extraction
 • Machine learning
Link extraction
[Flowchart: new tweet → launch process → has URL? If yes: extract user → extract URL → save → show. If no: dump.]
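
The link extraction flow can be sketched with a regular expression. This is an illustration, not the pipeline used by tarpipe: the tweet is assumed to be a dictionary with hypothetical `user` and `text` keys, and the URL pattern is deliberately simple.

```python
import re
from typing import List, Optional, Tuple

# Simple pattern: http(s) scheme followed by anything that is not
# whitespace, a quote or an angle bracket.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_link(tweet: dict) -> Optional[Tuple[str, List[str]]]:
    """Return (user, urls) when the tweet has a URL, else None ("dump")."""
    matches = URL_RE.findall(tweet.get("text", ""))
    urls = [u.rstrip(".,;:!?)") for u in matches]  # drop trailing punctuation
    if not urls:
        return None
    return tweet.get("user", ""), urls


print(extract_link({"user": "bpedro", "text": "Reading http://tarpipe.com/ now."}))
# ('bpedro', ['http://tarpipe.com/'])
print(extract_link({"user": "bpedro", "text": "No links here"}))
# None
```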
Content extraction

  • Short URLs
  • HTTP redirects
  • HTTP errors — retry?
  • Content analysis
Content extraction
[Flowchart: URL → fetch contents → redirect? (yes: fetch the new URL) → error? (yes: retry?; no: save)]
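
The fetch, redirect and retry loop might look like the sketch below. To keep it runnable without network access, the HTTP call is injected as a hypothetical `fetch(url)` function returning `(status, headers, body)`; the limits `max_redirects` and `max_retries`, and the choice to retry only on server errors and network failures, are assumptions rather than anything stated on the slides.

```python
from typing import Callable, Dict, Optional, Tuple
from urllib.parse import urljoin

Response = Tuple[int, Dict[str, str], str]  # (status, headers, body)
REDIRECTS = (301, 302, 303, 307, 308)


def fetch_contents(url: str,
                   fetch: Callable[[str], Response],
                   max_redirects: int = 5,
                   max_retries: int = 2) -> Optional[Tuple[str, str]]:
    """Follow redirects (e.g. short URLs) and retry errors; return (final_url, body)."""
    redirects = retries = 0
    while True:
        try:
            status, headers, body = fetch(url)
        except OSError:
            # Assumption: treat a network failure like a retryable server error.
            status, headers, body = 503, {}, ""
        if status in REDIRECTS and "Location" in headers:
            redirects += 1
            if redirects > max_redirects:
                return None  # give up on redirect loops
            url = urljoin(url, headers["Location"])  # Location may be relative
            continue
        if status >= 500:
            retries += 1
            if retries > max_retries:
                return None
            continue
        if status >= 400:
            return None  # client errors are not retried
        return url, body


# Usage with a fake fetcher standing in for a short-URL service.
pages = {
    "http://short.example/abc": (301, {"Location": "http://example.com/article"}, ""),
    "http://example.com/article": (200, {}, "<html>...</html>"),
}
result = fetch_contents("http://short.example/abc", lambda url: pages[url])
print(result)  # ('http://example.com/article', '<html>...</html>')
```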
Content analysis

• Assume malformed (X)HTML
• Either: regular expressions
• Or: convert to XHTML, then traverse the DOM
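
As a stand-in for the convert-and-traverse option, the sketch below uses Python's standard `html.parser`, which is event-based and tolerates unclosed tags. It is not an XHTML conversion or a real DOM; the class name `OutlineParser` and the set of collected tags are assumptions made for illustration.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect text inside <title>, <h1>-<h6> and <p>, tolerating unclosed tags."""

    WANTED = {"title", "h1", "h2", "h3", "h4", "h5", "h6", "p"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.current = None  # (tag, [text chunks]) for the open element
        self.found = []      # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self._flush()  # a new block implicitly closes the previous one
            self.current = (tag, [])

    def handle_endtag(self, tag):
        if self.current and tag == self.current[0]:
            self._flush()

    def handle_data(self, data):
        if self.current:
            self.current[1].append(data)

    def _flush(self):
        if self.current:
            text = " ".join("".join(self.current[1]).split())
            if text:
                self.found.append((self.current[0], text))
        self.current = None

    def close(self):
        super().close()
        self._flush()  # keep text from an element left open at end of input


# Malformed input: unclosed <p> elements, no </body> or </html>.
html = ("<html><head><title>Link extraction</title><body>"
        "<h1>Tweets</h1><p>First paragraph<p>Second <b>bold</b> one")
parser = OutlineParser()
parser.feed(html)
parser.close()
print(parser.found)
```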
Content classification
• Tag cloud:
 • HTML title, description, keywords
 • H1, H2, H3, ...
 • Paragraphs
• Graph placement:
 • Who shared the link?
 • Internal and external links
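
Separating internal from external links, one of the inputs to graph placement, can be sketched as follows. The rule used here (same host as the page means internal) is an assumption, and `LinkCollector` and `split_links` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def split_links(page_url: str, html: str) -> Tuple[List[str], List[str]]:
    """Return (internal, external) absolute http(s) links found in html."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.netloc.lower() == host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


page = ('<a href="/about">About</a> <a href="http://tarpipe.com/">tarpipe</a> '
        '<a href="mailto:someone@example.com">mail</a> <a>no href</a>')
links = split_links("http://example.com/post/1", page)
print(links)  # (['http://example.com/about'], ['http://tarpipe.com/'])
```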
Content classification
[Flowchart: (X)HTML → head found? (yes: extract head elements) → H1, H2, ... found? (yes: extract text) → paragraphs found? (yes: extract text) → save]
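
A tag cloud can be approximated by counting terms and weighting them by where they appear. In this sketch the input is assumed to be a list of `(tag, text)` pairs already extracted from the page; the weights and the stopword list are invented for illustration and would need tuning.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical weights: a term in the title counts more than one in a paragraph.
WEIGHTS = {"title": 5, "h1": 3, "h2": 2, "h3": 2, "p": 1}
STOPWORDS = {"the", "and", "for", "with", "that", "this"}


def tag_cloud(sections: Iterable[Tuple[str, str]], top: int = 5) -> List[Tuple[str, int]]:
    """Score each term by the summed weight of the elements it appears in."""
    scores = Counter()
    for tag, text in sections:
        weight = WEIGHTS.get(tag, 0)
        if not weight:
            continue  # ignore elements without a weight
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            if len(term) > 2 and term not in STOPWORDS:
                scores[term] += weight
    return scores.most_common(top)


sections = [
    ("title", "Link extraction"),
    ("h1", "Link classification"),
    ("p", "Extraction of links from tweets"),
]
cloud = tag_cloud(sections, top=2)
print(cloud)  # [('link', 8), ('extraction', 6)]
```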
The big picture
[Flowchart: new tweet → extract URL → extract content → classify content → save]
Food for thought
• PubSubHubbub
  http://code.google.com/p/pubsubhubbub/
• Activity Streams
  http://activitystrea.ms/
• Twitter streaming API
  http://dev.twitter.com/pages/streaming_api
tarpipe streamlines your updates to various social web sites, creating simple or complex workflows to update several buckets in one fell swoop.
Adam Pash, lifehacker

tarpipe is one of the most curious experiments in social media that I've seen lately. The service has the potential to be the answer to the lament I first talked about in The looming crisis: Personal syndication overload.
Rafe Needleman, CNET news

Today I had a chance to spend time experimenting with tarpipe and I have to say that I am intrigued by the concept and impressed by the implementation.
Jeff Barr, Amazon.com

thank you
automated publishing
