
ECON: An Approach to Extract Content from Web News Page

 

Title: ECON: An Approach to Extract Content from Web News Page

Authors: Yan Guo, Huifeng Tang, Linhai Song, Yu Wang, Guodong Ding

Conference: The 12th International Asia-Pacific Web Conference (APWEB 2010)

Abstract: This paper presents ECON, a simple but effective approach for fully automatic content extraction from Web news pages. ECON represents a Web news page as a DOM tree and leverages substantial features of that tree. It first finds a snippet-node, which wraps part of the news content, and then backtracks from the snippet-node until it reaches a summary-node, which wraps the entire news content. ECON removes noise during backtracking. Experimental results show that ECON achieves high accuracy and fully satisfies the requirements of scalable extraction. Moreover, ECON can be applied to Web news pages written in many popular languages, such as Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, and Arabic. ECON is also easy to implement.
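
The snippet-node and summary-node idea from the abstract can be illustrated with a rough Python sketch. This is not the authors' implementation: the abstract does not say how the snippet-node is chosen, when backtracking stops, or how noise is detected, so the heuristics below are assumptions made only for illustration. Here the snippet-node is taken to be the element whose direct text contains the most punctuation, backtracking stops once the parent adds no further punctuation, and children of the summary-node without any punctuation are treated as noise. All names (`Node`, `TreeBuilder`, `find_snippet_node`, `backtrack_to_summary`, `extract_content`, `SAMPLE_PAGE`) are hypothetical.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input"}   # tags with no closing tag
SKIP = {"script", "style"}                            # text inside these is ignored
PUNCT = set(",.;!?，。；！？")                          # assumed content signal


class Node:
    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text = tag, parent, text
        self.children = []

    def full_text(self):
        # Text nodes carry text; element nodes only aggregate their children.
        return self.text + "".join(c.full_text() for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree (tolerant of unclosed tags)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        if self.cur.tag not in SKIP:
            self.cur.children.append(Node("#text", self.cur, data))


def punct_count(s):
    return sum(ch in PUNCT for ch in s)


def find_snippet_node(root):
    # Assumption: the element with the most punctuation in its own text
    # wraps at least part of the news content.
    best, best_score, stack = None, 0, [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if node.tag == "#text":
            continue
        direct = "".join(c.text for c in node.children if c.tag == "#text")
        if punct_count(direct) > best_score:
            best, best_score = node, punct_count(direct)
    return best


def backtrack_to_summary(snippet):
    # Assumption: keep climbing while the parent still adds punctuated text.
    node = snippet
    while node.parent is not None and node.parent.tag != "root":
        if punct_count(node.parent.full_text()) <= punct_count(node.full_text()):
            break
        node = node.parent
    return node


def extract_content(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    snippet = find_snippet_node(builder.root)
    if snippet is None:
        return ""
    summary = backtrack_to_summary(snippet)
    parts = []
    for child in summary.children:
        text = " ".join(child.full_text().split())
        if punct_count(text) > 0:        # assumed noise filter: drop link lists etc.
            parts.append(text)
    return "\n".join(parts)


SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/world">World</a></div>
<div id="article">
<p>The council approved the budget on Monday, after a long debate.</p>
<p>Officials said the plan, which covers roads, schools, and parks, takes effect in July.</p>
<div class="related"><a href="/a">Related story one</a> <a href="/b">Related story two</a></div>
</div>
<div id="footer">Copyright Example News</div>
</body></html>
"""

print(extract_content(SAMPLE_PAGE))
```

On this made-up page the second paragraph is picked as the snippet-node, the `article` block becomes the summary-node, and the related-links block is dropped as noise. Punctuation as the only signal is a deliberately crude choice; a footer with punctuated text, for example, would push this sketch's backtracking too far up the tree.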
