`
javasee
  • 浏览: 961244 次
  • 性别: Icon_minigender_1
  • 来自: 北京
文章分类
社区版块
存档分类
最新评论
阅读更多

RSS(Rich Site Summary或者RDF Site Summary)是一种用于网站内容集成的技术。这种最初源自浏览器“新闻频道”的技术,现在却在企业门户(portal)、企业应用集成(EAI)等方面得到了更加宽广的用武之地。

————————————————

What is RSS?
By Mark Pilgrim <!-- CS_PAGE_BREAK --><!-- sidebar begins --><!-- don't move sidebars --><!-- sidebar ends -->

RSS is a format for syndicating news and the content of news-like sites, including major news sites like Wired, news-oriented community sites like Slashdot, and personal weblogs. But it's not just for news. Pretty much anything that can be broken down into discrete items can be syndicated via RSS: the "recent changes" page of a wiki, a changelog of CVS checkins, even the revision history of a book. Once information about each item is in RSS format, an RSS-aware program can check the feed for changes and react to the changes in an appropriate way.

RSS-aware programs called news aggregators are popular in the weblogging community. Many weblogs make content available in RSS. A news aggregator can help you keep up with all your favorite weblogs by checking their RSS feeds and displaying new items from each of them.

A brief history<!-- sidebar begins --><!-- sidebar ends -->

But coders beware. The name "RSS" is an umbrella term for a format that spans several different versions of at least two different (but parallel) formats. The original RSS, version 0.90, was designed by Netscape as a format for building portals of headlines to mainstream news sites. It was deemed overly complex for its goals; a simpler version, 0.91, was proposed and subsequently dropped when Netscape lost interest in the portal-making business. But 0.91 was picked up by another vendor, UserLand Software, which intended to use it as the basis of its weblogging products and other web-based writing software.

In the meantime, a third, non-commercial group split off and designed a new format based on what they perceived as the original guiding principles of RSS 0.90 (before it got simplified into 0.91). This format, which is based on RDF, is called RSS 1.0. But UserLand was not involved in designing this new format, and, as an advocate of simplifying 0.90, it was not happy when RSS 1.0 was announced. Instead of accepting RSS 1.0, UserLand continued to evolve the 0.9x branch, through versions 0.92, 0.93, 0.94, and finally 2.0.

What a mess.

So which one do I use?

That's 7 -- count 'em, 7! -- different formats, all called "RSS". As a coder of RSS-aware programs, you'll need to be liberal enough to handle all the variations. But as a content producer who wants to make your content available via syndication, which format should you choose?

RSS versions and recommendations Version Owner Pros Status Recommendation 0.90 0.91 0.92, 0.93, 0.94 1.0 2.0
Netscape Obsoleted by 1.0 Don't use
UserLand Drop dead simple Officially obsoleted by 2.0, but still quite popular Use for basic syndication. Easy migration path to 2.0 if you need more flexibility
UserLand Allows richer metadata than 0.91 Obsoleted by 2.0 Use 2.0 instead
RSS-DEV Working Group RDF-based, extensibility via modules, not controlled by a single vendor Stable core, active module development Use for RDF-based applications or if you need advanced RDF-specific modules
UserLand Extensibility via modules, easy migration path from 0.9x branch Stable core, active module development Use for general-purpose, metadata-rich syndication

What does RSS look like?

Imagine you want to write a program that reads RSS feeds, so that you can publish headlines on your site, build your own portal or homegrown news aggregator, or whatever. What does an RSS feed look like? That depends on which version of RSS you're talking about. Here's a sample RSS 0.91 feed (adapted from XML.com's RSS feed):

<rss version="0.91">
<channel>
<title>XML.com</title>
<link>http://www.xml.com/</link>
<description>XML.com features a rich mix of information and services for the XML community.</description>
<language>en-us</language>
<item>
<title>Normalizing XML, Part 2</title>
<link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
<description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
</item>
<item>
<title>The .NET Schema Object Model</title>
<link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
<description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.</description>
</item>
<item>
<title>SVG's Past and Promising Future</title>
<link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
<description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
</item>
</channel>
</rss>

Simple, right? A feed comprises a channel, which has a title, link, description, and (optional) language, followed by a series of items, each of which have a title, link, and description.

Now look at the RSS 1.0 version of the same information:

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://purl.org/rss/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
>
<channel rdf:about="http://www.xml.com/cs/xml/query/q/19">
<title>XML.com</title>
<link>http://www.xml.com/</link>
<description>XML.com features a rich mix of information and services for the XML community.</description>
<language>en-us</language>
<items>
<rdf:Seq>
<rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/normalizing.html"/>
<rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/som.html"/>
<rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/svg.html"/>
</rdf:Seq>
</items>
</channel>
<item rdf:about="http://www.xml.com/pub/a/2002/12/04/normalizing.html">
<title>Normalizing XML, Part 2</title>
<link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
<description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
<dc:creator>Will Provost</dc:creator>
<dc:date>2002-12-04</dc:date>
</item>
<item rdf:about="http://www.xml.com/pub/a/2002/12/04/som.html">
<title>The .NET Schema Object Model</title>
<link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
<description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.</description>
<dc:creator>Priya Lakshminarayanan</dc:creator>
<dc:date>2002-12-04</dc:date>
</item>
<item rdf:about="http://www.xml.com/pub/a/2002/12/04/svg.html">
<title>SVG's Past and Promising Future</title>
<link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
<description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
<dc:creator>Antoine Quint</dc:creator>
<dc:date>2002-12-04</dc:date>
</item>
</rdf:RDF>

Quite a bit more verbose. People familiar with RDF will recognize this as an XML serialization of an RDF document; the rest of the world will at least recognize that we're syndicating essentially the same information. In fact, we're including a bit more information: item-level authors and publishing dates, which RSS 0.91 does not support.

<!-- CS_PAGE_INDEX -->

<!-- CS_PAGE_BREAK -->

by Mark Pilgrim <!-- CS_PAGE_INDEX -->

Despite being RDF/XML, RSS 1.0 is structurally similar to previous versions of RSS -- similar enough that we can simply treat it as XML and write a single function to extract information out of either an RSS 0.91 or RSS 1.0 feed. However, there are some significant differences that our code will need to be aware of:

  1. The root element is rdf:RDF instead of rss. We'll either need to handle both explicitly or just ignore the name of the root element altogether and blindly look for useful information inside it.

  2. RSS 1.0 uses namespaces extensively. The RSS 1.0 namespace is http://purl.org/rss/1.0/, and it's defined as the default namespace. The feed also uses http://www.w3.org/1999/02/22-rdf-syntax-ns# for the RDF-specific elements (which we'll simply be ignoring for our purposes) and http://purl.org/dc/elements/1.1/ (Dublin Core) for the additional metadata of article authors and publishing dates.

    We can go in one of two ways here: if we don't have a namespace-aware XML parser, we can blindly assume that the feed uses the standard prefixes and default namespace and look for item elements and dc:creator elements within them. This will actually work in a large number of real-world cases; most RSS feeds use the default namespace and the same prefixes for common modules like Dublin Core. This is a horrible hack, though. There's no guarantee that a feed won't use a different prefix for a namespace (which would be perfectly valid XML and RDF). If or when it does, we'll miss it.

    If we have a namespace-aware XML parser at our disposal, we can construct a more elegant solution that handles both RSS 0.91 and 1.0 feeds. We can look for items in no namespace; if that fails, we can look for items in the RSS 1.0 namespace. (Not shown, but RSS 0.90 feeds also use a namespace, but not the same one as RSS 1.0. So what we really need is a list of namespaces to search.)

  3. Less obvious but still important, the item elements are outside the channel element. (In RSS 0.91, the item elements were inside the channel. In RSS 0.90, they were outside; in RSS 2.0, they're inside. Whee.) So we can't be picky about where we look for items.

  4. Finally, you'll notice there is an extra items element within the channel. It's only useful to RDF parsers, and we're going to ignore it and assume that the order of the items within the RSS feed is given by their order of the item elements.

But what about RSS 2.0? Luckily, once we've written code to handle RSS 0.91 and 1.0, RSS 2.0 is a piece of cake. Here's the RSS 2.0 version of the same feed:

<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>XML.com</title>
<link>http://www.xml.com/</link>
<description>XML.com features a rich mix of information and services for the XML community.</description>
<language>en-us</language>
<item>
<title>Normalizing XML, Part 2</title>
<link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
<description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
<dc:creator>Will Provost</dc:creator>
<dc:date>2002-12-04</dc:date>
</item>
<item>
<title>The .NET Schema Object Model</title>
<link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
<description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.</description>
<dc:creator>Priya Lakshminarayanan</dc:creator>
<dc:date>2002-12-04</dc:date>
</item>
<item>
<title>SVG's Past and Promising Future</title>
<link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
<description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
<dc:creator>Antoine Quint</dc:creator>
<dc:date>2002-12-04</dc:date>
</item>
</channel>
</rss>

As this example shows, RSS 2.0 uses namespaces like RSS 1.0, but it's not RDF. Like RSS 0.91, there is no default namespace and items are back inside the channel. If our code is liberal enough to handle the differences between RSS 0.91 and 1.0, RSS 2.0 should not present any additional wrinkles.

How can I read RSS?

Now let's get down to actually reading these sample RSS feeds from Python. The first thing we'll need to do is download some RSS feeds. This is simple in Python; most distributions come with both a URL retrieval library and an XML parser. (Note to Mac OS X 10.2 users: your copy of Python does not come with an XML parser; you will need to install PyXML first.)

from xml.dom import minidom
import urllib

def load(rssURL):
return minidom.parse(urllib.urlopen(rssURL))

This takes the URL of an RSS feed and returns a parsed representation of the DOM, as native Python objects.

The next bit is the tricky part. To compensate for the differences in RSS formats, we'll need a function that searches for specific elements in any number of namespaces. Python's XML library includes a getElementsByTagNameNS which takes a namespace and a tag name, so we'll use that to make our code general enough to handle RSS 0.9x/2.0 (which has no default namespace), RSS 1.0 and even RSS 0.90. This function will find all elements with a given name, anywhere within a node. That's a good thing; it means that we can search for item elements within the root node and always find them, whether they are inside or outside the channel element.

DEFAULT_NAMESPACES = \
(None, # RSS 0.91, 0.92, 0.93, 0.94, 2.0
'http://purl.org/rss/1.0/', # RSS 1.0
'http://my.netscape.com/rdf/simple/0.9/' # RSS 0.90
)

def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
for namespace in possibleNamespaces:
children = node.getElementsByTagNameNS(namespace, tagName)
if len(children): return children
return []

Finally, we need two utility functions to make our lives easier. First, our getElementsByTagName function will return a list of elements, but most of the time we know there's only going to be one. An item only has one title, one link, one description, and so on. We'll define a first function that returns the first element of a given name (again, searching across several different namespaces). Second, Python's XML libraries are great at parsing an XML document into nodes, but not that helpful at putting the data back together again. We'll define a textOf function that returns the entire text of a particular XML element.

def first(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
children = getElementsByTagName(node, tagName, possibleNamespaces)
return len(children) and children[0] or None

def textOf(node):
return node and "".join([child.data for child in node.childNodes]) or ""

That's it. The actual parsing is easy. We'll take a URL on the command line, download it, parse it, get the list of items, and then get some useful information from each item:

DUBLIN_CORE = ('http://purl.org/dc/elements/1.1/',)

if __name__ == '__main__':
import sys
rssDocument = load(sys.argv[1])
for item in getElementsByTagName(rssDocument, 'item'):
print 'title:', textOf(first(item, 'title'))
print 'link:', textOf(first(item, 'link'))
print 'description:', textOf(first(item, 'description'))
print 'date:', textOf(first(item, 'date', DUBLIN_CORE))
print 'author:', textOf(first(item, 'creator', DUBLIN_CORE))
print

Running it with our sample RSS 0.91 feed prints only title, link, and description (since the feed didn't include any other information on dates or authors):

$ python rss1.py http://www.xml.com/2002/12/18/examples/rss091.xml.txt
title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date:
author:

title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
date:
author:

title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
date:
author:

For both the sample RSS 1.0 feed and sample RSS 2.0 feed, we also get dates and authors for each item. We reuse our custom getElementsByTagName function, but pass in the Dublin Core namespace and appropriate tag name. We could reuse this same function to extract information from any of the basic RSS modules. (There are a few advanced modules specific to RSS 1.0 that would require a full RDF parser, but they are not widely deployed in public RSS feeds.)

Here's the output against our sample RSS 1.0 feed:

$ python rss1.py http://www.xml.com/2002/12/18/examples/rss10.xml.txt
title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date: 2002-12-04
author: Will Provost

title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
date: 2002-12-04
author: Priya Lakshminarayanan

title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
date: 2002-12-04
author: Antoine Quint

Running against our sample RSS 2.0 feed produces the same results.

This technique will handle about 90% of the RSS feeds out there; the rest are ill-formed in a variety of interesting ways, mostly caused by non-XML-aware publishing tools building feeds out of templates and not respecting basic XML well-formedness rules. Next month we'll tackle the thorny problem of how to handle RSS feeds that are almost, but not quite, well-formed XML.

<!-- sidebar begins -->

Related resources

分享到:
评论

相关推荐

    RSS案例视频,RSS阅读器

    什么是RSS?RSS及其发展历程 ------主讲:天涯浪子  RSS是2004年最热门的互联网词汇之一,不过,相对于博客(BLOG)来说,RSS的知名度相应会低很多,而且至今还没有一个非常贴切的中文词汇,也许以后无需...

    语音交互的RSS阅读器

    首先,我们要理解什么是RSS。RSS(Really Simple Syndication)是一种内容发布格式,它允许用户通过订阅的方式获取网站的更新内容,而无需直接访问网站。RSS阅读器则扮演了聚合和展示这些订阅内容的角色,帮助用户一...

    RSS是RSS的jar包

    RSS(Really Simple Syndication)是一种基于XML的网络内容发布协议,它允许网站提供自己的新闻提要,以便用户可以通过RSS阅读器订阅和获取更新。在这个压缩包中,包含了一系列与Java实现RSS相关的库和源代码,这将...

    RSS代码RSS 代码RSS 代码RSS 代码

    【标题】:“RSS代码解析与应用” 【描述】:“RSS(Really Simple Syndication)是一种用于发布和订阅信息的XML格式,它使得用户可以方便地获取网站的更新内容,如新闻、博客文章等。RSS代码是实现RSS订阅功能的...

    RSS订阅

    什么是RSS? RSS文档(通常称为“源”或“频道”)包含了一个站点或服务的最新条目列表,这些条目通常包括标题、摘要和链接,有时还会包含完整的文章内容。RSS文件遵循特定的XML结构,这使得其他程序(如RSS阅读器...

    RSS RSS RSS

    RSS,全称“Really Simple Syndication”或者“Rich Site Summary”,中文通常译为“简易聚合”或“富站点摘要”,是一种用于发布和订阅网站内容的标准化格式。它允许用户通过RSS阅读器或新闻聚合器获取网站的更新,...

    提交RSS工具英文站RSS提交,英文站RSS提交

    标题中的“提交RSS工具英文站RSS提交,英文站RSS提交”和描述中的“提交RSS工具搜索引擎的RSS方式的提交提交RSS工具”都指向了一个主题,即利用RSS(Really Simple Syndication)工具向英文网站和搜索引擎提交RSS ...

    如何在IE浏览器中使用和管理RSS订阅源?.docx

    什么是 RSS? ---------------- RSS 就像微博一样,在你的源有更新的时候把更新推送给你,或者说在网站有更新的时候告诉你更新的内容。而源就好比是那些微博账号。现在绝大多数网站都会提供自动的 RSS 源。 RSS ...

    RSS Announcer(国外RSS推广)

    RSS Announcer是一款专门针对国外市场设计的RSS推广工具,它为用户提供了一种高效的方式来分发和宣传他们的RSS(Really Simple Syndication) feed,以扩大在线影响力和吸引更多的读者。RSS是一种标准格式,允许用户...

    Rss插件-帝国CMS

    【Rss插件-帝国CMS】是专门为帝国内容管理系统(Empire CMS)设计的一款扩展功能插件,旨在增强系统对RSS(Really Simple Syndication)的支持。RSS是一种互联网内容发布格式,它允许用户订阅网站更新,无需频繁访问...

    RSS DEMO 支持RSS定阅

    【RSS模型】 RSS,全称Really Simple Syndication(真正简单的聚合),是一种用于发布和获取网站内容的标准化格式。它允许用户通过订阅RSS feed来跟踪更新,无需频繁地访问各个网站。RSS订阅使得新闻、博客文章和...

    一个基于新浪RSS的android RSS阅读器源码

    《基于新浪RSS的Android RSS阅读器源码解析与学习指南》 RSS(Really Simple Syndication)是一种内容聚合格式,常用于新闻、博客等网站,让用户能够方便地获取和订阅更新内容。在移动设备上,RSS阅读器应用是访问...

    常用Rss,生成解析Rss,

    RSS(Really Simple Syndication)是一种基于XML的网络内容发布协议,它使得用户能够轻松地获取网站的更新信息,如新闻、博客文章等。RSS通过订阅(subscribe)和发布(publish)的方式,允许用户无需直接访问网站就...

    java实现rss的发布和订阅

    RSS(Really Simple Syndication)是一种基于XML的网络内容聚合格式,它允许用户通过RSS阅读器或聚合器获取网站的更新信息,如新闻、博客文章等。在Java中实现RSS的发布和订阅,需要理解RSS的结构以及如何使用Java...

    phpcms rss调取内容,rss 全文输出怎么修改.doc

    【phpcms RSS全文输出修改详解】 在PHP CMS系统中,RSS(Really Simple Syndication)是一种标准,用于聚合网站内容,让订阅者通过RSS阅读器获取更新。然而,默认情况下,phpcms的RSS功能可能只提供文章的摘要,而...

    rss.jar 一个生成rss的jar包

    《RSS.jar:轻松生成RSS的Java工具包》 在当今信息爆炸的时代,RSS(Really Simple Syndication)成为了人们获取实时信息的重要方式。RSS允许用户订阅感兴趣的信息源,通过阅读器集中查看更新,避免了频繁访问各个...

    RSS.zip_RSS_RSS Java_android RSS_rss android_rss android

    首先,我们要理解RSS是什么。RSS(Really Simple Syndication)是一种内容聚合格式,它允许用户订阅网站的更新,而无需直接访问这些网站。RSS feed通常包含文章标题、摘要、发布日期等信息,通过RSS阅读器,用户可以...

    Rss全国各省市天气预报

    【Rss全国各省市天气预报】项目是一个利用RSS(Really Simple Syndication)技术获取并解析全国各地天气信息的应用。RSS是一种基于XML的格式,用于发布和订阅新闻、博客、天气等实时信息,使得用户能轻松地获取和...

    C# RSS阅读器 能添加和阅读订阅

    C# RSS阅读器是一款基于C#编程语言开发的应用程序,专为用户管理和阅读RSS(Really Simple Syndication)订阅而设计。RSS是一种XML格式,用于发布新闻、博客和其他定期更新的内容,使得用户可以方便地获取并聚合来自...

Global site tag (gtag.js) - Google Analytics