yl.fighter

UTF-8 BOM (web pages often show unexplained blank lines or garbled characters)

Blog category: PHP
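Reposted from: http://hi.baidu.com/lssbing/blog/item/d2074a1635763910972b43d8.html
UTF-8 is a Unicode character encoding that is widely used in web applications. Its advantage is that it is variable-length: ASCII characters are encoded in 1 byte each, so pages that consist largely of ASCII characters save a lot of network bandwidth. When web pages are written in UTF-8, however, the BOM (Byte Order Mark) often causes unexplained blank lines or garbled characters to appear in the page. This happens because the BOM is not mandatory in UTF-8, so files saved as UTF-8 are handled in different ways. For example, some browsers (Firefox) automatically filter out every UTF-8 BOM, while others (IE) filter out only one BOM. (Why only one? You will run into this problem as soon as a page includes several files.)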
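The variable-length property described above can be checked directly. The following is a minimal Python sketch, not part of the original post; the sample characters and the helper name `utf8_length` are chosen here purely for illustration.

```python
# Sample characters from different Unicode ranges (illustrative choices):
# an ASCII letter, a Latin-1 letter, a CJK ideograph, and an emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Return how many bytes the single character ch occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```

ASCII characters take one byte each, which is why mostly-ASCII pages stay compact in UTF-8.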

Reposted from: http://www.w3.org/International/questions/qa-utf8-bom
When using UTF-8 encoded pages in some user agents, I get an extra line or unwanted characters at the top of my web page or included file. How do I remove them?
Answer
If you are dealing with a file encoded in UTF-8, your display problems may be caused by the presence of a UTF-8 signature (BOM) that the user agent doesn't recognize. This used to be a problem for static HTML files, but is no longer in recent versions of major browsers. However, if you use PHP to generate your HTML, this was still an issue with PHP version 5.3.6.
The BOM is always at the beginning of the file, and so you would normally expect to see the display issues at the top of a page. However, you may also find blank lines appearing within the page if you include text from a separate file that begins with a UTF-8 signature.
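To see why an included file can put a signature in the middle of a page, here is a simplified Python model. It assumes an include mechanism that copies file bytes through unchanged; the `assemble` helper and the sample markup are hypothetical and stand in for whatever the server actually does.

```python
BOM = b"\xef\xbb\xbf"  # the UTF-8 signature bytes


def assemble(parts):
    """Concatenate raw file contents, as a byte-for-byte include would."""
    return b"".join(parts)


# Two hypothetical files, each saved with a UTF-8 signature.
header = BOM + b"<h1>Title</h1>\n"
footer = BOM + b"<p>Footer</p>\n"
page = assemble([header, footer])

# One signature sits at offset 0; the other is embedded mid-page,
# where a user agent that only skips a leading BOM will not expect it.
print("signatures in page:", page.count(BOM))
print("offset of embedded signature:", page.find(BOM, 1))
```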
We have a set of test pages and a summary of results for various recent browser versions that explore this behaviour.
This article will help you determine whether the UTF-8 signature is causing the problem. If there is no evidence of a UTF-8 signature at the beginning of the file, then you will have to look elsewhere for a solution.
What is a UTF-8 signature (BOM)?
Some applications insert a particular combination of bytes at the beginning of a file to indicate that the text contained in the file is Unicode. This combination of bytes is known as a signature or Byte Order Mark (BOM). Some applications - such as a text editor or a browser - will display the BOM as an extra line in the file, others will display unexpected characters, such as ï»¿.
More detailed information about the BOM follows.
The BOM is the Unicode codepoint U+FEFF, corresponding to the Unicode character 'ZERO WIDTH NON-BREAKING SPACE' (ZWNBSP).
In UTF-16 and UTF-32 encodings, unless there is some alternative indicator, the BOM is essential to ensure correct interpretation of the file's contents. Each character in the file is represented by 2 or 4 bytes of data and the order in which these bytes are stored in the file is significant; the BOM indicates this order.
In the UTF-8 encoding, the presence of the BOM is not essential because, unlike in the UTF-16 or UTF-32 encodings, there is only one possible order for the bytes of a character. The BOM may still occur in UTF-8 encoded text, however, either as a by-product of an encoding conversion or because it was added by an editor.
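As an illustration of the point about byte order, the Python sketch below encodes the single code point U+FEFF in several encodings. The dictionary name `encodings` is just a label used here; the codec names are standard Python codecs.

```python
import codecs

bom = "\ufeff"  # U+FEFF, the code point used as the BOM

# Encode the same code point with explicit-endianness codecs, which do not
# add a signature of their own, so each value is U+FEFF itself.
encodings = {
    name: bom.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}

for name, raw in encodings.items():
    print(f"{name:10s}", " ".join(f"{b:02X}" for b in raw))

# UTF-16 and UTF-32 each have two byte orders, and the BOM tells them apart.
# UTF-8 has a single byte sequence, so the signature carries no order info.
print(encodings["utf-8"] == codecs.BOM_UTF8)
```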
Detecting the BOM
First, we need to check whether there is indeed a BOM at the beginning of the file.
You can try looking for a BOM in your content, but if your editor handles the UTF-8 signature correctly you probably won't be able to see it. An editor which does not handle the UTF-8 signature correctly displays the bytes that compose that signature according to its own character encoding setting. (With the Latin 1 (ISO 8859-1) character encoding, the signature displays as characters ï»¿.) With a binary editor capable of displaying the hexadecimal byte values in the file, the UTF-8 signature displays as EF BB BF.
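The effect of the editor's encoding setting can be reproduced in a few lines of Python. This is only a sketch; the variable names and the sample text "hello" are illustrative.

```python
signature = b"\xef\xbb\xbf"      # EF BB BF, the UTF-8 signature
content = signature + b"hello"   # a tiny file body with a signature

# Decoded as Latin 1, each signature byte becomes a visible character.
as_latin1 = signature.decode("latin-1")

# Decoded as plain UTF-8, the signature becomes the invisible U+FEFF.
as_utf8 = content.decode("utf-8")

# Python's "utf-8-sig" codec drops a leading signature while decoding.
as_utf8_sig = content.decode("utf-8-sig")

print(as_latin1)
print(ascii(as_utf8))
print(ascii(as_utf8_sig))
```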
Alternatively, your editor may tell you in a status bar or a menu what encoding your file is in, including information about the presence or not of the UTF-8 signature.
If not, some kind of script-based test (see below) may help. Alternatively, you could try this small web-based utility. (Note, if it’s a file included by PHP or some other mechanism that you think is causing the problem, type in the URI of the included file.)
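A script-based test can be as small as the following Python sketch. It is not the utility referred to above; the function names `has_utf8_bom` and `file_has_utf8_bom` are made up for this example.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data starts with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def file_has_utf8_bom(path) -> bool:
    """Check only the first three bytes of the file at path."""
    with open(path, "rb") as f:  # binary mode, so no decoding takes place
        return has_utf8_bom(f.read(len(codecs.BOM_UTF8)))


print(has_utf8_bom(b"\xef\xbb\xbf<?php echo 'hi'; ?>"))
print(has_utf8_bom(b"<?php echo 'hi'; ?>"))
```

Reading in binary mode matters here: a text-mode read may already have decoded or hidden the signature.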
Removing the BOM
If you have an editor which shows the characters that make up the UTF-8 signature you may be able to delete them by hand. Chances are, however, that the BOM is there in the first place because you didn't see it.
Check whether your editor allows you to specify whether a UTF-8 signature is added or kept during a save. Such an editor provides a way of removing the signature by simply reading the file in then saving it out again. For example, if Dreamweaver detects a BOM the Save As dialogue box will have a check mark alongside the text "Include Unicode Signature (BOM)". Just uncheck the box and save.
One of the benefits of using a script is that you can remove the signature quickly, and from multiple files. In fact the script could be run automatically as part of your process. If you use Perl, you could use a simple script created by Martin Dürst.
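For readers who do not use Perl, a comparable batch removal might look like the Python sketch below. It is not Martin Dürst's script; the function names are hypothetical, and it removes only a single leading signature from each file.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_from_files(paths):
    """Rewrite each file that starts with a signature; return those changed."""
    changed = []
    for path in map(Path, paths):
        original = path.read_bytes()
        stripped = strip_utf8_bom(original)
        if stripped != original:
            path.write_bytes(stripped)  # overwrites in place; back up first
            changed.append(path)
    return changed
```

A call such as `strip_bom_from_files(Path("site").rglob("*.php"))` would process a whole tree, where the directory name `site` is only an example. Because the files are overwritten in place, it is worth running this on a copy or under version control.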
Note: You should check the process impact of removing the signature. It may be that some part of your content development process relies on the use of the signature to indicate that a file is in UTF-8. Bear in mind also that pages with a high proportion of Latin characters may look correct superficially but that occasional characters outside the ASCII range (U+0000 to U+007F) may be incorrectly encoded.
By the way
You will find that some text editors such as Windows Notepad will automatically add a UTF-8 signature to any file you save as UTF-8.
A UTF-8 signature at the beginning of a CSS file can sometimes cause the initial rules in the file to fail on certain user agents.
In some browsers, the presence of a UTF-8 signature will cause the browser to interpret the text as UTF-8 regardless of any character encoding declarations to the contrary.