PostBack:http://www.codeproject.com/KB/HTML/HTML_to_Plain_Text.aspx
Introduction
This article presents a procedure for stripping out HTML tags while preserving most of the basic formatting. In other words, it converts HTML to plain text.
Background
This example relies heavily on regular expressions, in particular the System.Text.RegularExpressions.Regex.Replace() method. You may also find this reference on regular expression syntax useful.
Using the Code
The code uses the System.Text.RegularExpressions namespace and consists of a single function, StripHTML().
First, development formatting is removed, such as the tabs used for indentation and repeated whitespace. As a result, the input HTML is "flattened" into one continuous string. This serves two purposes:
- To remove the formatting ignored by browsers
- To make the regexes work reliably (by default, . does not match \n, so a pattern such as <head>.*</head> would miss a header that spans several lines)
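The effect of flattening can be seen in a small sketch. This is an illustration rather than the article's code: FlattenDemo and Flatten are hypothetical names, and the sample HTML is made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FlattenDemo
{
    // Replace line breaks with spaces, drop tabs, then collapse runs of spaces.
    public static string Flatten(string html)
    {
        string flat = html.Replace("\r", " ")
                          .Replace("\n", " ")
                          .Replace("\t", string.Empty);
        return Regex.Replace(flat, " +", " ");
    }

    public static void Main()
    {
        string html = "<head>\n\t<title>Demo</title>\n</head><p>Body</p>";

        // '.' does not match '\n' by default, so the multi-line header is missed.
        Console.WriteLine(Regex.IsMatch(html, "<head>.*</head>"));          // False

        // After flattening, the same pattern finds the header.
        Console.WriteLine(Regex.IsMatch(Flatten(html), "<head>.*</head>")); // True
    }
}
```

Passing RegexOptions.Singleline would also let . match \n, but flattening additionally discards whitespace that a browser would ignore anyway.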
Then the header is removed by deleting everything between the <head> and </head> tags, inclusive.
Then all scripts are removed by cutting out everything between the <script> and </script> tags, inclusive. Styles are handled the same way.
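As a rough sketch of this step, the snippet below removes script and style blocks in a single pass. It is a variation on the approach described here, not the article's own code: BlockStripper and RemoveBlocks are hypothetical names, and it uses a lazy .*? with a backreference so that text sitting between two separate script blocks is kept.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlockStripper
{
    // Matches <script ...>...</script> or <style ...>...</style>.
    // \1 refers back to whichever tag name opened the block; .*? stops
    // at the first matching close tag instead of the last one.
    private const string BlockPattern =
        @"<\s*(script|style)\b[^>]*>.*?<\s*/\s*\1\s*>";

    public static string RemoveBlocks(string html)
    {
        return Regex.Replace(html, BlockPattern, string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        string html = "<script>a()</script>Keep<SCRIPT type='x'>b()</SCRIPT>";
        Console.WriteLine(RemoveBlocks(html)); // Keep
    }
}
```

With a greedy .* the match would run from the first opening tag to the last closing tag and swallow the text in between.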
Then the basic formatting tags, such as <BR> and <DIV>, are replaced with \r or \r\r. <TR> tags are also replaced by line breaks, and <TD> tags by tabs.
<LI> tags are replaced by line breaks, bullet entities (&bull;) by asterisks, and special characters such as &nbsp; are replaced with their corresponding values.
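The entity substitutions can also be driven from a table. The following is a minimal sketch under assumptions: EntityDemo and ReplaceEntities are hypothetical names, the table covers only a few entities, and the &amp; entry is an addition that the article's function does not handle.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EntityDemo
{
    // Entity / replacement pairs. "&amp;" is an addition for illustration;
    // it comes last so that "&amp;lt;" is not decoded twice.
    private static readonly string[,] Entities =
    {
        { "&nbsp;", " " }, { "&bull;", " * " },
        { "&lt;", "<" },   { "&gt;", ">" },
        { "&copy;", "(c)" }, { "&reg;", "(r)" },
        { "&trade;", "(tm)" }, { "&amp;", "&" }
    };

    public static string ReplaceEntities(string text)
    {
        for (int i = 0; i < Entities.GetLength(0); i++)
        {
            // Regex.Escape keeps the entity text literal inside the pattern.
            text = Regex.Replace(text, Regex.Escape(Entities[i, 0]),
                Entities[i, 1], RegexOptions.IgnoreCase);
        }
        return text;
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceEntities("&copy; 2005&nbsp;A &lt;b&gt; &amp; C"));
        // (c) 2005 A <b> & C
    }
}
```

Keeping the pairs in one table makes it easy to add an entity without repeating the Regex.Replace call.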
Finally all the remaining tags are replaced with empty strings.
By this stage, there are likely to be a lot of redundant repeating line breaks and tabs. Any sequence over 2 line breaks long is replaced by two line breaks. Similarly with tabs: sequences over 4 tabs long are replaced by 4 tabs.
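// Converts HTML to plain text while keeping basic formatting.
// MessageBox (used in the catch block) needs System.Windows.Forms.
private string StripHTML(string source)
{
    try
    {
        string result;

        // Flatten the input: line breaks become spaces and tabs are
        // dropped, so that '.' in later patterns can span the whole input
        result = source.Replace("\r", " ");
        result = result.Replace("\n", " ");
        result = result.Replace("\t", string.Empty);

        // Collapse repeated spaces into a single space
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"( )+", " ");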

        // Remove the header: normalize the opening and closing tags,
        // then delete everything between them
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"<( )*head([^>])*>","<head>",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"(<( )*(/)( )*head( )*>)","</head>",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            "(<head>).*?(</head>)",string.Empty,
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Remove all scripts the same way. The lazy .*? stops at the
        // first closing tag, so text between two script blocks is kept
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"<( )*script([^>])*>","<script>",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"(<( )*(/)( )*script( )*>)","</script>",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"(<script>).*?(</script>)",string.Empty,
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Remove all styles
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"<( )*style([^>])*>","<style>",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"(<( )*(/)( )*style( )*>)","</style>",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            "(<style>).*?(</style>)",string.Empty,
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Insert tabs in place of <td> tags
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"<( )*td([^>])*>","\t",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Insert a line break in place of <br> (including <br/>) and <li>
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"<( )*br( )*/?( )*>","\r",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"<( )*li( )*>","\r",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Insert a paragraph break in place of <div>, <tr> and <p>
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"<( )*div([^>])*>","\r\r",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"<( )*tr([^>])*>","\r\r",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"<( )*p([^>])*>","\r\r",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Remove all remaining tags
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"<[^>]*>",string.Empty,
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Replace special characters (HTML entities)
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"&nbsp;"," ",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"&bull;"," * ",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"&lsaquo;","<",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"&rsaquo;",">",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"&trade;","(tm)",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"&frasl;","/",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"&lt;","<",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"&gt;",">",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"&copy;","(c)",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"&reg;","(r)",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Remove all other entities
        result = System.Text.RegularExpressions.Regex.Replace(result,
            @"&(.{2,6});", string.Empty,
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Make line breaks consistent
        result = result.Replace("\n", "\r");

        // Remove spaces that sit between line breaks and tabs
        result = System.Text.RegularExpressions.Regex.Replace(result,
            "(\r)( )+(\r)","\r\r",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            "(\t)( )+(\t)","\t\t",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            "(\t)( )+(\r)","\t\r",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
            "(\r)( )+(\t)","\r\t",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Remove tabs that sit between two line breaks
        result = System.Text.RegularExpressions.Regex.Replace(result,
            "(\r)(\t)+(\r)","\r\r",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Keep at most one tab after a line break
        result = System.Text.RegularExpressions.Regex.Replace(result,
            "(\r)(\t)+","\r\t",
            System.Text.RegularExpressions.RegexOptions.IgnoreCase);
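
        // Collapse runs of three or more line breaks into two and runs
        // of five or more tabs into four. Quantifiers handle a run of
        // any length in one pass; replacing with a search string that
        // grows on every iteration can leave some runs too long
        // (for example, seven line breaks end up as three).
        result = System.Text.RegularExpressions.Regex.Replace(result,
            "\r{3,}", "\r\r");
        result = System.Text.RegularExpressions.Regex.Replace(result,
            "\t{5,}", "\t\t\t\t");

        return result;
    }
    catch
    {
        // On any failure, report it and fall back to the original text
        MessageBox.Show("Error");
        return source;
    }
}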
Points of Interest
Escape characters such as \n and \r had to be removed first because they cause the regexes to stop working as expected: by default, . does not match \n, so a pattern containing .* cannot span a line break.
Moreover, to make the result string display correctly in a textbox, one might need to split it up and set the textbox's Lines property instead of assigning to the Text property.
this.txtResult.Lines =
    StripHTML(this.txtSource.Text).Split('\r');
History
- 6th October, 2005: Initial post
About the Author
paceman
Occupation: Web Developer
Location: Australia