The Easy Way to Extract Useful Text from Arbitrary HTML
Author: alexjc
Translator: 恋花蝶 (http://blog.csdn.net/lanphaday)
Original article: http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
Translator's note: This article presents a widely applicable method for extracting the genuinely useful body text from many different kinds of HTML files. It does roughly what CSDN's recently launched "剪影" feature does: it strips out the irrelevant headers, footers and sidebars, which makes it very practical. The method is simple and effective yet surprising; after reading it you may well find yourself thinking that you never realised it could be done this way. The writing is clear and easy to follow, and although it uses an artificial neural network, FANN's clean wrapper means no ANN background is required. All of the examples are written in Python, which makes the article all the more readable; it has a popular-science flavour and is well worth a read.
You’ve finally got your hands on the diverse collection of HTML documents you needed. But the content you’re interested in is hidden amidst adverts, layout tables or formatting markup, and various other links. Even worse, there’s visible text in the menus, headers and footers that you want to filter out. If you don’t want to write a complex scraping program for each type of HTML file, there is a solution.
This article shows you how to write a relatively simple script to extract text paragraphs from large chunks of HTML code, without knowing its structure or the tags used. It works on news articles and blog pages with worthwhile text content, among others…
Do you want to find out how statistics and machine learning can save you time and effort mining text?

The concept is rather simple: use information about the density of text vs. HTML code to work out if a line of text is worth outputting. (This isn’t a novel idea, but it works!) The basic process works as follows:
  1. Parse the HTML code and keep track of the number of bytes processed.
  2. Store the text output on a per-line, or per-paragraph basis.
  3. Associate with each text line the number of bytes of HTML required to describe it.
  4. Compute the text density of each line by calculating the ratio of text to bytes.
  5. Then decide if the line is part of the content by using a neural network.
You can get pretty good results just by checking if the line’s density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning — not to mention that it’s easier to implement!
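To make step 4 concrete, here is a toy illustration of the text-to-markup ratio (a quick sketch; the numbers are made up, not from the article):
# A line whose visible text is 60 characters but which took 240 bytes of
# HTML markup to produce has a density of 0.25 -- probably navigation chrome.
text_chars = 60
html_bytes = 240
density = text_chars / float(html_bytes)   # 0.25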
Let’s take it from the top…
Converting the HTML to Text
What you need is the core of a text-mode browser, which is already set up to read files with HTML markup and display raw text. By reusing existing code, you won’t have to spend too much time handling invalid XML documents, which are very common — as you’ll realise quickly.
As a quick example, we’ll be using Python along with a few built-in modules: htmllib for the parsing and formatter for outputting formatted text. This is what the top-level function looks like:
# Imports needed by the snippets below (Python 2 standard library).
import htmllib, formatter, StringIO
from formatter import AbstractFormatter

def extract_text(html):
    # Derive from formatter.AbstractWriter to store paragraphs.
    writer = LineWriter()
    # Default formatter sends commands to our writer.
    formatter = AbstractFormatter(writer)
    # Derive from htmllib.HTMLParser to track parsed bytes.
    parser = TrackingParser(writer, formatter)
    # Give the parser the raw HTML data.
    parser.feed(html)
    parser.close()
    # Filter the paragraphs stored and output them.
    return writer.output()
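As a quick sanity check, the function can be driven from any saved page. A minimal usage sketch (the file name sample.html is just a placeholder):
if __name__ == '__main__':
    # Any saved news article or blog page will do.
    html = open('sample.html').read()
    print extract_text(html)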
The TrackingParser itself overrides the callback functions for parsing start and end tags, as they are given the current parse index in the buffer. You don’t have access to that normally, unless you start diving into frames in the call stack — which isn’t the best approach! Here’s what the class looks like:
class TrackingParser(htmllib.HTMLParser):
    """Try to keep accurate pointer of parsing location."""
    def __init__(self, writer, *args):
        htmllib.HTMLParser.__init__(self, *args)
        self.writer = writer
    def parse_starttag(self, i):
        index = htmllib.HTMLParser.parse_starttag(self, i)
        self.writer.index = index
        return index
    def parse_endtag(self, i):
        self.writer.index = i
        return htmllib.HTMLParser.parse_endtag(self, i)
The LineWriter class does the bulk of the work when called by the default formatter. If you have any improvements or changes to make, most likely they’ll go here. This is where we’ll put our machine learning code later. But you can keep the implementation rather simple and still get good results. Here’s the simplest possible code:
class Paragraph:
    def __init__(self):
        self.text = ''
        self.bytes = 0
        self.density = 0.0

class LineWriter(formatter.AbstractWriter):
    def __init__(self, *args):
        self.last_index = 0
        self.index = 0          # current parse position, updated by TrackingParser
        self.lines = [Paragraph()]
        formatter.AbstractWriter.__init__(self)
    def send_flowing_data(self, data):
        # Work out the length of this text chunk.
        t = len(data)
        # We've parsed more text, so increment index.
        self.index += t
        # Calculate the number of bytes since last time.
        b = self.index - self.last_index
        self.last_index = self.index
        # Accumulate this information in current line.
        l = self.lines[-1]
        l.text += data
        l.bytes += b
    def send_paragraph(self, blankline):
        """Create a new paragraph if necessary."""
        if self.lines[-1].text == '':
            return
        self.lines[-1].text += '\n' * (blankline+1)
        self.lines[-1].bytes += 2 * (blankline+1)
        self.lines.append(Paragraph())
    def send_literal_data(self, data):
        self.send_flowing_data(data)
    def send_line_break(self):
        self.send_paragraph(0)
This code doesn’t do any outputting yet, it just gathers the data. We now have a bunch of paragraphs in an array, we know their length, and we know roughly how many bytes of HTML were necessary to create them. Let’s see what emerges from our statistics.
Examining the Data
Luckily, there are some patterns in the data. In the raw output below, you’ll notice there are definite spikes in the number of HTML bytes required to encode lines of text, notably around the title, both sidebars, headers and footers.
While the number of HTML bytes spikes in places, it remains below average for quite a few lines. On these lines, the text output is rather high. Calculating the density of text to HTML bytes gives us a better understanding of this relationship.
The patterns are more obvious in this density value, so it gives us something concrete to work with.
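If you want to reproduce that inspection yourself, a small helper like the following sketch (not part of the original article) prints the byte count, text length and density for each stored line, with a crude bar so the spikes stand out:
def dump_stats(lines):
    """Print per-line statistics gathered by LineWriter (a rough sketch)."""
    for i, l in enumerate(lines):
        density = len(l.text) / float(max(l.bytes, 1))
        bar = '#' * int(density * 40)
        print '%3d  %5d bytes  %5d chars  %.2f  %s' % (i, l.bytes, len(l.text), density, bar)
It can be called with writer.lines after parser.close() has run.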
Filtering the Lines
The simplest way we can filter lines now is by comparing the density to a fixed threshold, such as 50% or the average density. Finishing the LineWriter class:
    def compute_density(self):
        """Calculate the density for each line, and the average."""
        total = 0.0
        for l in self.lines:
            l.density = len(l.text) / float(l.bytes)
            total += l.density
        # Store for optional use by the neural network.
        self.average = total / float(len(self.lines))

    def output(self):
        """Return a string with the useless lines filtered out."""
        self.compute_density()
        output = StringIO.StringIO()
        for l in self.lines:
            # Check density against threshold.
            # Custom filter extensions go here.
            if l.density > 0.5:
                output.write(l.text)
        return output.getvalue()
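For reference, a variant that keeps lines denser than the document average mentioned above, rather than the fixed 0.5 cut-off, might look like this (a sketch; the method name is made up and this is not part of the original article):
    def output_above_average(self):
        """Like output(), but keep lines denser than the document average."""
        self.compute_density()
        result = StringIO.StringIO()
        for l in self.lines:
            if l.density > self.average:
                result.write(l.text)
        return result.getvalue()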
This rough threshold filter typically gets most of the lines right. All the header, footer and sidebar text is usually stripped as long as it’s not too long. However, if there are long copyright notices, comments, or descriptions of other stories, then those are output too. Also, if there are short lines around inline graphics or adverts within the text, these are not output.
To fix this, we need a more complex filtering heuristic. But instead of spending days working out the logic manually, we’ll just grab loads of information about each line and use machine learning to find patterns for us.
Supervised Machine Learning
Here’s an example of an interface for tagging lines of text as content or not.
The idea of supervised learning is to provide examples for an algorithm to learn from. In our case, we give it a set of documents that were tagged by humans, so we know which line must be output and which line must be filtered out. For this we’ll use a simple neural network known as the perceptron. It takes floating point inputs and filters the information through weighted connections between “neurons” and outputs another floating point number. Roughly speaking, the number of neurons and layers affects the ability to approximate functions precisely; we’ll use both single-layer perceptrons (SLP) and multi-layer perceptrons (MLP) for prototyping.
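To give an intuition for what one of those “neurons” computes, here is a minimal sketch of a single neuron’s output: a weighted sum of the inputs squashed through a sigmoid. FANN handles all of this internally; the sketch is purely illustrative.
import math

def neuron_output(inputs, weights, bias):
    """One neuron: weighted sum of inputs plus bias, through a sigmoid."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-activation))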
To get the neural network to learn, we need to gather some data. This is where the earlier LineWriter.output() function comes in handy; it gives us a central point to process all the lines at once, and make a global decision which lines to output. Starting with intuition and experimenting a bit, we discover that the following data is useful to decide how to filter a line (a sketch for assembling these features follows the list):
  • Density of the current line.
  • Number of HTML bytes of the line.
  • Length of output text for this line.
  • These three values for the previous line,
  • … and the same for the next line.
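A sketch of how those nine values might be assembled per line, assuming each stored Paragraph has been hand-tagged with a boolean is_content flag (the flag and the helper name are hypothetical, not part of the article's code), and that compute_density() has already been run:
def build_patterns(lines):
    """Return (inputs, target) pairs: density, bytes and text length for the
    previous, current and next line, with the hand-tagged label as target."""
    def stats(l):
        if l is None:
            return [0.0, 0.0, 0.0]
        return [l.density, float(l.bytes), float(len(l.text))]
    patterns = []
    for i, l in enumerate(lines):
        prev = lines[i-1] if i > 0 else None
        nxt = lines[i+1] if i + 1 < len(lines) else None
        target = [1.0 if getattr(l, 'is_content', False) else 0.0]
        patterns.append((stats(prev) + stats(l) + stats(nxt), target))
    return patterns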
For the implementation, we’ll be using Python to interface with FANN, the Fast Artificial Neural Network Library. The essence of the learning code goes like this:
from pyfann import fann, libfann
# This creates a new single-layer perceptron with 1 output and 3 inputs.
obj = libfann.fann_create_standard_array(2, (3, 1))
ann = fann.fann_class(obj)
# Load the data we described above.
patterns = fann.read_train_from_file('training.txt')
ann.train_on_data(patterns, 1000, 1, 0.0)
# Then test it with different data.
for datin, datout in validation_data:
    result = ann.run(datin)
    print 'Got:', result, ' Expected:', datout
Trying out different data and different network structures is a rather mechanical process. Don’t have too many neurons or you may train too well for the set of documents you have (overfitting), and conversely try to have enough to solve the problem well. Here are the results, varying the number of lines used (1L-3L) and the number of attributes per line (1A-3A):
The interesting thing to note is that 0.5 is already a pretty good guess at a fixed threshold (see first set of columns). The learning algorithm cannot find a much better solution when comparing the density alone (1 Attribute in the second column). With 3 Attributes, the next SLP does better overall, though it gets more false negatives. Using multiple lines also increases the performance of the single layer perceptron (fourth set of columns). And finally, using a more complex neural network structure works best overall — making 80% fewer errors in filtering the lines.
Note that you can tweak how the error is calculated if you want to punish false positives more than false negatives.
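A possible shape for such a weighted error measure, counting mistakes at a 0.5 cut-off, is sketched below; the weights are arbitrary and the function is not part of FANN or the original code.
def weighted_error(predicted, expected, fp_weight=2.0, fn_weight=1.0):
    """Penalise false positives (junk kept) more than false negatives."""
    error = 0.0
    for p, e in zip(predicted, expected):
        if p >= 0.5 and e < 0.5:
            error += fp_weight   # kept a line that should have been filtered
        elif p < 0.5 and e >= 0.5:
            error += fn_weight   # filtered a line that should have been kept
    return error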
Conclusion
Extracting text from arbitrary HTML files doesn’t necessarily require scraping the file with custom code. You can use statistics to get pretty amazing results, and machine learning to get even better. By tweaking the threshold, you can avoid the worst false positives that pollute your text output. But it’s not so bad in practice; where the neural network makes mistakes, even humans have trouble classifying those lines as “content” or not.
Now all you have to figure out is what to do with that clean text content!