feedparser: Because RSS is Hairy
http://www.feedparser.org/
Lindsey Smith, @turbodog

feedparser: because RSS is hairy
  • RSS formats bundle HTML
  • User input via HTML is hairy
  • There are several syndication formats and versions (RSS, Atom, etc.)
  • [Diagram labels: RSS, HTML, Micro-format]

feedparser: because RSS is hairy
  • Download and parse just about any feed type, including:
      • Various flavors of Atom and RSS
      • Format extensions (iTunes)
      • Micro-formats (GeoRSS, hCard)
  • Ensures that you can treat all feeds the same way, regardless of format or version

feedparser: because RSS is hairy
  • Digests whatever crap you throw at it
  • Sanitizes HTML
  • Normalizes dates
  • Resolves relative links
  • Detects feed type, version and encoding
  • Flags non-well-formed feeds ("bozo" detection) without blowing up

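Three of the features above (date normalization, relative link resolution, and bozo detection) can be illustrated with the standard library alone. This is a minimal sketch of the ideas, not feedparser's own implementation: the function names `normalize_date`, `resolve_link`, and `check_bozo` are hypothetical, and the choice to treat dates without a time zone as UTC is an assumption made for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


def normalize_date(text):
    """Return a UTC time.struct_time for an ISO 8601 (Atom) or RFC 822 (RSS) date."""
    text = text.strip()
    # Older Pythons do not accept a trailing "Z" in fromisoformat().
    iso = text[:-1] + "+00:00" if text.endswith("Z") else text
    try:
        dt = datetime.fromisoformat(iso)
    except ValueError:
        # Not ISO 8601: fall back to RFC 822 style, e.g. "Wed, 09 Nov 2005 ...".
        dt = parsedate_to_datetime(text)
    if dt.tzinfo is None:
        # Assumption for this sketch: dates without a zone are UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.utctimetuple()


def resolve_link(base_url, href):
    """Resolve a possibly relative link against the feed's base URL."""
    return urljoin(base_url, href)


def check_bozo(xml_text):
    """Return (bozo, exception) instead of raising on non-well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return 1, exc
    return 0, None


atom_date = normalize_date("2005-11-09T11:56:34Z")
rss_date = normalize_date("Wed, 09 Nov 2005 11:56:34 GMT")
print(atom_date == rss_date)
print(resolve_link("http://example.org/feed/atom10.xml", "/entry/3"))
print(check_bozo("<feed><title>oops</feed>")[0])
```

Both date strings end up as the same UTC `struct_time`, which is what lets calling code compare entries from Atom and RSS feeds without caring which format they came from.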
feedparser: because RSS is hairy
  • Parses a URL, local file or string data
  • Handles the 304 Not Modified HTTP return code
  • HTTP basic auth
  • Custom request headers
  • Custom handlers
  • Captures response headers

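The 304 Not Modified support rests on HTTP conditional requests: the client sends back the `ETag` and `Last-Modified` values it saw last time, and the server answers 304 with no body if nothing changed. feedparser exposes this through the `etag` and `modified` arguments to `feedparser.parse()` and the `status` value on the result. The sketch below shows only the underlying HTTP mechanics with the standard library; `build_conditional_request` and `is_not_modified` are hypothetical helper names, and no request is actually sent.

```python
from urllib.request import Request


def build_conditional_request(url, etag=None, modified=None, extra_headers=None):
    """Build a GET request that asks the server to skip unchanged feeds."""
    headers = dict(extra_headers or {})  # custom request headers, if any
    if etag:
        headers["If-None-Match"] = etag
    if modified:
        headers["If-Modified-Since"] = modified
    return Request(url, headers=headers)


def is_not_modified(status):
    """True when the server reports that the cached copy is still current."""
    return status == 304


req = build_conditional_request(
    "http://example.org/feed.xml",          # placeholder URL for illustration
    etag='"abc123"',                        # value saved from an earlier response
    modified="Wed, 09 Nov 2005 11:56:34 GMT",
    extra_headers={"User-Agent": "demo-reader/1.0"},
)
for name, value in sorted(req.header_items()):
    print(name, "=", value)
```

Polling this way keeps both bandwidth and parsing cost down, since an unchanged feed never has to be downloaded or parsed again.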
feedparser: the good ol' days
  • Created circa 2002 by Mark Pilgrim of Dive Into Python fame
  • Powers feedvalidator.org
  • v4.1 released in 2007
  • Open source
  • Well-documented
  • 3000 unit tests
  • Available in popular Linux distros

feedparser: the lean years
  • Development slows to a trickle
  • No official releases
  • Atom & RSS continue to evolve
  • iTunes enclosures
  • v4.1 released in 2007
  • Still available in popular Linux distros

feedparser 5.0: a new hope
  • Small group of developers starts working on feedparser
  • v5.0 released January 2011
  • Supports Python 3
  • Micro-formats
  • CSS & HTML5 sanitization
  • Bug fixes, bug fixes, bug fixes

>>> import feedparser
>>> d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml")
>>> d['feed']['title']  # feed data is a dictionary
u'Sample Feed'
>>> d.feed.title  # get values attr-style or dict-style
u'Sample Feed'
>>> d.channel.title  # use RSS or Atom terminology anywhere
u'Sample Feed'
>>> d.feed.link  # resolves relative links
u'http://example.org/'
>>> d.feed.subtitle  # parses escaped HTML
u'For documentation <em>only</em>'

>>> len(d['entries'])  # entries are a list
1
>>> d['entries'][0]['title']  # each entry is a dictionary
u'First entry title'
>>> d.entries[0].title  # attr-style works here too
u'First entry title'
>>> d['items'][0].title  # RSS terminology works here too
u'First entry title'
>>> e = d.entries[0]
>>> e.link  # easy access to alternate link
u'http://example.org/entry/3'
>>> e.links[1].rel  # full access to all Atom links
u'related'
>>> e.links[0].href  # resolves relative links here too
u'http://example.org/entry/3'

>>> e.updated_parsed  # parses all date formats
time.struct_time(tm_year=2005, tm_mon=11, tm_mday=9, tm_hour=11, tm_min=56, tm_sec=34, tm_wday=2, tm_yday=313, tm_isdst=0)
>>> e.content[0].value  # sanitizes dangerous HTML
u'<div>Watch out for <em>nasty tricks</em></div>'
>>> d.version  # reports feed type and version
u'atom10'
>>> d.encoding  # auto-detects character encoding
u'utf-8'
>>> d.headers.get('Content-type')  # full access to all HTTP headers
u'application/xml'
>>> d.bozo  # well-formed?
0

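The session above reads the same value dict-style, attr-style, and under either RSS or Atom names. One way such a result object could be built is a `dict` subclass with attribute fallback and a small alias table. This is a rough sketch under that assumption, not feedparser's actual class; `FeedDict` and its `aliases` mapping are hypothetical names.

```python
class FeedDict(dict):
    """Hypothetical stand-in for a parse result; not feedparser's own class."""

    # RSS terminology mapped onto the Atom-style keys actually stored.
    aliases = {"channel": "feed", "items": "entries"}

    def __getitem__(self, key):
        # Translate an alias to its canonical key before the lookup.
        return super().__getitem__(self.aliases.get(key, key))

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so real dict
        # methods such as .items() keep working; use d['items'] for the alias.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


d = FeedDict(
    feed=FeedDict(title="Sample Feed"),
    entries=[FeedDict(title="First entry title")],
)
print(d["feed"]["title"])    # dict-style
print(d.feed.title)          # attr-style
print(d.channel.title)       # RSS name for the same data
print(d["items"][0].title)   # RSS name for the entries list
```

Keeping the alias lookup in `__getitem__` means both access styles share one code path, so a new alias only needs to be added in one place.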
feedparser: caveats
  • Fairly slow and CPU intensive
  • FriendFeed rolled their own parser and fall back on feedparser
  • Team is looking at ways to speed it up

feedparser: the project details
  • Home page: http://www.feedparser.org
  • Discussion: http://code.google.com/p/feedparser
