What is RSS?(2)

What is RSS?
by Mark Pilgrim

Despite being RDF/XML, RSS 1.0 is structurally similar to previous versions of RSS -- similar enough that we can simply treat it as XML and write a single function to extract information out of either an RSS 0.91 or RSS 1.0 feed. However, there are some significant differences that our code will need to be aware of:

  1. The root element is rdf:RDF instead of rss. We'll either need to handle both explicitly or just ignore the name of the root element altogether and blindly look for useful information inside it.

  2. RSS 1.0 uses namespaces extensively. The RSS 1.0 namespace is http://purl.org/rss/1.0/, and it's defined as the default namespace. The feed also uses http://www.w3.org/1999/02/22-rdf-syntax-ns# for the RDF-specific elements (which we'll simply be ignoring for our purposes) and http://purl.org/dc/elements/1.1/ (Dublin Core) for the additional metadata of article authors and publishing dates.

    We can go in one of two ways here: if we don't have a namespace-aware XML parser, we can blindly assume that the feed uses the standard prefixes and default namespace and look for item elements and dc:creator elements within them. This will actually work in a large number of real-world cases; most RSS feeds use the default namespace and the same prefixes for common modules like Dublin Core. This is a horrible hack, though. There's no guarantee that a feed won't use a different prefix for a namespace (which would be perfectly valid XML and RDF). If or when it does, we'll miss it.

    If we have a namespace-aware XML parser at our disposal, we can construct a more elegant solution that handles both RSS 0.91 and 1.0 feeds. We can look for items in no namespace; if that fails, we can look for items in the RSS 1.0 namespace. (Not shown: RSS 0.90 feeds also use a namespace, though not the same one as RSS 1.0. So what we really need is a list of namespaces to search.)

  3. Less obvious but still important, the item elements are outside the channel element. (In RSS 0.91, the item elements were inside the channel. In RSS 0.90, they were outside; in RSS 2.0, they're inside. Whee.) So we can't be picky about where we look for items.

  4. Finally, you'll notice there is an extra items element within the channel. It's only useful to RDF parsers, and we're going to ignore it and assume that the order of the items within the RSS feed is given by the order of the item elements.
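
To make the two approaches in point 2 concrete, here is a minimal Python 3 sketch (the article's own listings further down are Python 2). The feed fragment is invented for illustration: it binds Dublin Core to the prefix `meta` instead of the usual `dc`, which is perfectly valid XML. The variable names are ours.

```python
from xml.dom import minidom

DC_NS = 'http://purl.org/dc/elements/1.1/'

# Hypothetical feed fragment: Dublin Core is bound to "meta", not "dc".
ODD_PREFIX_FEED = """<rss version="2.0" xmlns:meta="http://purl.org/dc/elements/1.1/">
<channel>
<item>
<title>Normalizing XML, Part 2</title>
<meta:creator>Will Provost</meta:creator>
</item>
</channel>
</rss>"""

doc = minidom.parseString(ODD_PREFIX_FEED)

# Prefix-based guess: look for the literal qualified name "dc:creator".
by_prefix = doc.getElementsByTagName('dc:creator')

# Namespace-aware lookup: match on namespace URI plus local name.
by_namespace = doc.getElementsByTagNameNS(DC_NS, 'creator')

print(len(by_prefix))     # the guess finds nothing
print(len(by_namespace))  # the namespace lookup finds the element
print(by_namespace[0].firstChild.data)
```

The prefix-based lookup silently misses the author here, which is exactly the failure mode described above.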

But what about RSS 2.0? Luckily, once we've written code to handle RSS 0.91 and 1.0, RSS 2.0 is a piece of cake. Here's the RSS 2.0 version of the same feed:

<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>XML.com</title>
<link>http://www.xml.com/</link>
<description>XML.com features a rich mix of information and services for the XML community.</description>
<language>en-us</language>
<item>
<title>Normalizing XML, Part 2</title>
<link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
<description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
<dc:creator>Will Provost</dc:creator>
<dc:date>2002-12-04</dc:date>
</item>
<item>
<title>The .NET Schema Object Model</title>
<link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
<description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.</description>
<dc:creator>Priya Lakshminarayanan</dc:creator>
<dc:date>2002-12-04</dc:date>
</item>
<item>
<title>SVG's Past and Promising Future</title>
<link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
<description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
<dc:creator>Antoine Quint</dc:creator>
<dc:date>2002-12-04</dc:date>
</item>
</channel>
</rss>

As this example shows, RSS 2.0 uses namespaces like RSS 1.0, but it's not RDF. Like RSS 0.91, there is no default namespace and items are back inside the channel. If our code is liberal enough to handle the differences between RSS 0.91 and 1.0, RSS 2.0 should not present any additional wrinkles.
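
These structural claims can be checked directly against a trimmed copy of the feed above. The following is a small Python 3 sketch using the standard library's `minidom`; the feed is cut down to one item and the variable names are ours.

```python
from xml.dom import minidom

DC_NS = 'http://purl.org/dc/elements/1.1/'

RSS20 = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>XML.com</title>
<link>http://www.xml.com/</link>
<item>
<title>SVG's Past and Promising Future</title>
<link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
<dc:creator>Antoine Quint</dc:creator>
<dc:date>2002-12-04</dc:date>
</item>
</channel>
</rss>"""

doc = minidom.parseString(RSS20)
root = doc.documentElement
item = doc.getElementsByTagNameNS(None, 'item')[0]
creator = item.getElementsByTagNameNS(DC_NS, 'creator')[0]

root_name = root.tagName                   # 'rss', not 'rdf:RDF'
item_namespace = item.namespaceURI         # None: there is no default namespace
item_parent = item.parentNode.tagName      # 'channel': items are inside it
creator_namespace = creator.namespaceURI   # the Dublin Core namespace

print(root_name, item_namespace, item_parent, creator_namespace)
```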

How can I read RSS?

Now let's get down to actually reading these sample RSS feeds from Python. The first thing we'll need to do is download some RSS feeds. This is simple in Python; most distributions come with both a URL retrieval library and an XML parser. (Note to Mac OS X 10.2 users: your copy of Python does not come with an XML parser; you will need to install PyXML first.)

from xml.dom import minidom
import urllib

def load(rssURL):
    return minidom.parse(urllib.urlopen(rssURL))

This takes the URL of an RSS feed and returns a parsed representation of the DOM, as native Python objects.
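
The listing above targets Python 2, where `urlopen` lives in the top-level `urllib` module. On Python 3 the same function lives in `urllib.request`. A minimal equivalent, offered as a sketch and not as part of the original article, might look like this; the `with` block is our addition so that the connection is closed once parsing finishes.

```python
from urllib.request import urlopen
from xml.dom import minidom

def load(rssURL):
    """Download the feed at rssURL and return it as a minidom Document."""
    with urlopen(rssURL) as response:
        return minidom.parse(response)
```

`urlopen` also accepts `file:` URLs, which is handy for trying the function on a saved copy of a feed without touching the network.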

The next bit is the tricky part. To compensate for the differences in RSS formats, we'll need a function that searches for specific elements in any number of namespaces. Python's XML library includes a getElementsByTagNameNS method, which takes a namespace and a tag name, so we'll use that to make our code general enough to handle RSS 0.91 through 0.94 and 2.0 (which have no default namespace), RSS 1.0, and even RSS 0.90. getElementsByTagNameNS finds all elements with a given name, anywhere within a node. That's a good thing; it means that we can search for item elements within the root node and always find them, whether they are inside or outside the channel element.

DEFAULT_NAMESPACES = \
    (None, # RSS 0.91, 0.92, 0.93, 0.94, 2.0
     'http://purl.org/rss/1.0/', # RSS 1.0
     'http://my.netscape.com/rdf/simple/0.9/' # RSS 0.90
    )

def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    for namespace in possibleNamespaces:
        children = node.getElementsByTagNameNS(namespace, tagName)
        if len(children): return children
    return []
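
To see the fallback search at work, the sketch below repeats the function in Python 3 form and runs it against two heavily abbreviated feeds that we made up for the purpose: an RSS 0.91 feed with its item inside the channel and no namespace, and an RSS 1.0 feed with its item outside the channel in the default namespace.

```python
from xml.dom import minidom

DEFAULT_NAMESPACES = (
    None,                                     # RSS 0.91-0.94, 2.0
    'http://purl.org/rss/1.0/',               # RSS 1.0
    'http://my.netscape.com/rdf/simple/0.9/'  # RSS 0.90
)

def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    # Try each namespace in turn; return the first non-empty match.
    for namespace in possibleNamespaces:
        children = node.getElementsByTagNameNS(namespace, tagName)
        if len(children):
            return children
    return []

# Abbreviated feeds, invented for this example.
RSS091 = """<rss version="0.91"><channel>
<title>XML.com</title>
<item><title>inside the channel</title></item>
</channel></rss>"""

RSS10 = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://www.xml.com/"><title>XML.com</title></channel>
<item rdf:about="http://www.xml.com/a"><title>outside the channel</title></item>
</rdf:RDF>"""

for feed in (RSS091, RSS10):
    doc = minidom.parseString(feed)
    items = getElementsByTagName(doc, 'item')
    # Print how many items were found and which namespace matched.
    print(len(items), items[0].namespaceURI)
```

In both cases the search starts from the document node, so the position of the item relative to the channel does not matter.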

Finally, we need two utility functions to make our lives easier. First, our getElementsByTagName function will return a list of elements, but most of the time we know there's only going to be one. An item only has one title, one link, one description, and so on. We'll define a first function that returns the first element of a given name (again, searching across several different namespaces). Second, Python's XML libraries are great at parsing an XML document into nodes, but not that helpful at putting the data back together again. We'll define a textOf function that returns the entire text of a particular XML element.

def first(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    children = getElementsByTagName(node, tagName, possibleNamespaces)
    return len(children) and children[0] or None

def textOf(node):
    return node and "".join([child.data for child in node.childNodes]) or ""
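
The reason textOf joins over all child nodes is that an element's text can be stored as more than one child node, for example when part of it is wrapped in a CDATA section. The Python 3 sketch below illustrates this with an invented fragment. It writes the same logic with a conditional expression, a construct that Python gained after the `and`/`or` idiom above was written; that rewrite is our variation, not the article's code.

```python
from xml.dom import minidom

def textOf(node):
    # Same logic as the listing above, written with a conditional expression.
    return "".join(child.data for child in node.childNodes) if node else ""

# Invented fragment: the description mixes plain text and a CDATA section.
doc = minidom.parseString(
    "<item><description>Use <![CDATA[<dc:creator>]]> for authors</description></item>"
)
description = doc.getElementsByTagName('description')[0]

print(textOf(description))   # the pieces are joined back into one string
print(repr(textOf(None)))    # a missing element yields an empty string
```

Like the original, this version assumes every child is a text or CDATA node; an element containing nested markup would need extra handling.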

That's it. The actual parsing is easy. We'll take a URL on the command line, download it, parse it, get the list of items, and then get some useful information from each item:

DUBLIN_CORE = ('http://purl.org/dc/elements/1.1/',)

if __name__ == '__main__':
    import sys
    rssDocument = load(sys.argv[1])
    for item in getElementsByTagName(rssDocument, 'item'):
        print 'title:', textOf(first(item, 'title'))
        print 'link:', textOf(first(item, 'link'))
        print 'description:', textOf(first(item, 'description'))
        print 'date:', textOf(first(item, 'date', DUBLIN_CORE))
        print 'author:', textOf(first(item, 'creator', DUBLIN_CORE))
        print
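
For readers on Python 3, the pieces above can be assembled into one self-contained sketch. Two things here are our own assumptions and not part of the article: the helper name `parseItems`, which returns one dictionary per item instead of printing, and the inline sample, which is a single item taken from the RSS 2.0 feed shown earlier so that the example runs without network access.

```python
from xml.dom import minidom

DEFAULT_NAMESPACES = (None,
                      'http://purl.org/rss/1.0/',
                      'http://my.netscape.com/rdf/simple/0.9/')
DUBLIN_CORE = ('http://purl.org/dc/elements/1.1/',)

def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    for namespace in possibleNamespaces:
        children = node.getElementsByTagNameNS(namespace, tagName)
        if len(children):
            return children
    return []

def first(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    children = getElementsByTagName(node, tagName, possibleNamespaces)
    return children[0] if len(children) else None

def textOf(node):
    return "".join(child.data for child in node.childNodes) if node else ""

def parseItems(rssDocument):
    """Hypothetical helper: return one dict per item in a parsed feed."""
    return [{
        'title': textOf(first(item, 'title')),
        'link': textOf(first(item, 'link')),
        'description': textOf(first(item, 'description')),
        'date': textOf(first(item, 'date', DUBLIN_CORE)),
        'author': textOf(first(item, 'creator', DUBLIN_CORE)),
    } for item in getElementsByTagName(rssDocument, 'item')]

# One item from the sample RSS 2.0 feed, shortened for the example.
SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel><title>XML.com</title>
<item>
<title>The .NET Schema Object Model</title>
<link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
<dc:creator>Priya Lakshminarayanan</dc:creator>
<dc:date>2002-12-04</dc:date>
</item>
</channel></rss>"""

for entry in parseItems(minidom.parseString(SAMPLE)):
    for key, value in entry.items():
        print(key + ':', value)
```

Returning dictionaries keeps the extraction separate from the output format, so the same function could feed a template or a database as easily as the console.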

Running it with our sample RSS 0.91 feed prints only title, link, and description (since the feed didn't include any other information on dates or authors):

$ python rss1.py http://www.xml.com/2002/12/18/examples/rss091.xml.txt
title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date:
author:

title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
date:
author:

title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
date:
author:

For both the sample RSS 1.0 feed and sample RSS 2.0 feed, we also get dates and authors for each item. We reuse our custom getElementsByTagName function, but pass in the Dublin Core namespace and appropriate tag name. We could reuse this same function to extract information from any of the basic RSS modules. (There are a few advanced modules specific to RSS 1.0 that would require a full RDF parser, but they are not widely deployed in public RSS feeds.)

Here's the output against our sample RSS 1.0 feed:

$ python rss1.py http://www.xml.com/2002/12/18/examples/rss10.xml.txt
title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date: 2002-12-04
author: Will Provost

title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
date: 2002-12-04
author: Priya Lakshminarayanan

title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
date: 2002-12-04
author: Antoine Quint

Running against our sample RSS 2.0 feed produces the same results.

This technique will handle about 90% of the RSS feeds out there; the rest are ill-formed in a variety of interesting ways, mostly caused by non-XML-aware publishing tools building feeds out of templates and not respecting basic XML well-formedness rules. Next month we'll tackle the thorny problem of how to handle RSS feeds that are almost, but not quite, well-formed XML.
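
To show what "ill-formed" means in practice, here is a minimal Python 3 sketch. The two feeds are invented: they differ only in that the second contains an unescaped ampersand, a typical template mistake. `tryParse` is a hypothetical helper name. The sketch only detects the problem; it does not attempt any kind of recovery.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def tryParse(feedText):
    """Hypothetical helper: return a Document, or None if not well-formed."""
    try:
        return minidom.parseString(feedText)
    except ExpatError as error:
        # The parser rejects the whole document, not just the bad element.
        print('not well-formed:', error)
        return None

GOOD = "<rss version='0.91'><channel><title>AT&amp;T news</title></channel></rss>"
BAD = "<rss version='0.91'><channel><title>AT&T news</title></channel></rss>"

print(tryParse(GOOD) is not None)
print(tryParse(BAD) is not None)
```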

Comments on this article

  • Your article helped me -
    2005-03-06 08:25:14 katykoot

    I just want to make sure I did it right. I am hoping that my feed.rss will be picked up by news agencies and syndicated. Can you check it for me...and by looking at this - is this the correct way to be doing a press release...by rss? versus prweb.com?


    http://www.amethystlive.com/feed.rss

  • My Company doesn't want RSS
    2004-04-07 08:47:00 Raleigh Swick

    After trying to get my company to get RSS feeds.... their reply:


    >> However, we are holding off on deploying them in a widespread fashion
    >> while we craft a general strategy for syndication and mobility. Though
    >> RSS feeds may be useful in news aggregators, and could increase page
    >> views to articles, the risk is that they may actually reduce page
    >> views to our own index pages, which carry display advertising that RSS
    >> inherently cannot.


    How can I argue this? What to say?

    • My Company doesn't want RSS
      2004-08-07 20:09:08 Thogek

      Note that an RSS feed often contains only the opening paragraph (or short teaser/summary) of each posted article (or announcement or whatever). The bulk of the article is generally kept on the Web site, the URL for which is included in the RSS feed, so that users who are interested can click through and read the whole thing. So, your RSS feed is basically another way for users to subscribe to announcements of new articles, features, etc., for which they still have to come to your site, and view your pages (and ads) in order to view the whole article. (Kinda like direct-emailing of new article announcements, but easier.)

    • My Company doesn't want RSS
      2004-06-11 09:09:38 prakashnambiar

      Hey, you can deliver an advert with the image/logo of your RSS feed; still, you can use the comments tag!

  • Please don't break XML!
    2003-01-03 01:35:35 Henri Sivonen

    Using the namespace prefixes instead of proper namespace processing is dirty. Getting a namespace-aware XML parser is not that hard. Please don't break namespaces by using prefix-based guessing.


    Even more worrying is the last sentence: "Next month we'll tackle the thorny problem of how to handle RSS feeds that are almost, but not quite, well-formed XML."


    What's there to tackle? The only correct way to handle ill-formed XML is to firmly reject it. Please enforce the XML well-formedness requirements in order to protect XML from degenerating into tag soup.

    • Please don't break XML!
      2005-01-19 04:20:03 Looking_past_XML

      Oh my freakin' god, XML is not a sacred standard, shit happens, the offical W3C docs encourage browser/user-agent creators to attempt to properly render imperfect html/xhtml.


      The spirit of RFC's and protocols that has made the internet work(able) is:
      "Be liberal in what you accept and conservative in what you send" and its variations, by Jon Postel.


      Also, TOG et al would probably assail a system that was so anal and rigid and non-resilient (and they would say lazy) that it couldn't route around some minor formatting transgressions and give the user 50%, 80% or whatever percent of the feed that it could decipher.
      But this brings up the elephant that no one is allowed to talk about - XML and its main parsers are extremely brittle and complex.


      Flame on...

      • Please don't break XML!
        2005-01-19 04:33:16 Looking_past_XML

        Since people may not get the "TOG" reference:
        TOG is a usability expert who puts much more responsibility on system creators for making systems that "Just Work++" than many programmers would like. After reading a lot of his stuff, you start thinking "Wow, programs should do a much better job for the user in many cases." http://www.asktog.com/Bughouse/10MostPersistentBugs.html

    • Please don't break XML!
      2005-01-11 05:27:50 despil

      I absolutely agree.


      What is the sense in making standards if we throw them out the window so easily?


      Either don't make standards or use them.
      There is no third way.

  • RDF makes life difficult
    2002-12-19 12:39:02 Mario Diana

    If there is some reason that sites wish to use RDF, they ought to include an XSL transformation of the document to RSS. Is that really so difficult?


    I was writing a Web service to gather an RSS feed from a client and return transformed HTML. When I ran into RDF, I was completely thrown. (Okay, maybe I live under a rock.)


    RDF is for machines; RSS is far more human-friendly. It's a pain to have to deal with it if you're not interested in its features.

    • RDF makes life difficult
      2004-09-24 13:06:41 kes

      Can you tell me more about the project you are working on? It sounds really interesting, and I would love to learn more. Thanks.

  • So that's what it is ;-)
    2002-12-19 11:59:51 Danny Ayers

    Good piece, refreshingly practical. Also a refreshingly balanced comparison between the different formats, though (predictable quibble) the recommendation of RSS 2.0 for "general-purpose, metadata-rich syndication" seems a little strange when RSS 1.0-based feeds can be much more general purpose and metadata rich, thanks to RDF.