[Original] Things That Must Be Said About UTF-8 Encoding

rmzdb


Blog category:

JAVA BASE

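    I have always preferred UTF-8 as the system-wide encoding. In one project, though, I implemented an upload feature that stores an XML string directly in the database.
        The stream was read as UTF-8 and the uploaded file was also UTF-8, so why was the content garbled after the upload? The damage was small:
        just one extra "?" character at the very start of the file. Yet the log output before the upload showed no problem at all. Where did that "?" come from?
        After some searching on Baidu, it turned out that UTF-8 files come in two flavors: with and without a BOM.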
BOM: Byte Order Mark
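The UTF-8 BOM is also called the UTF-8 signature. The BOM serves no byte-order purpose in UTF-8 itself; it exists for the sake of UTF-16 and UTF-32, where byte order matters. In a UTF-8 file the signature merely tells an editor which encoding the file uses,
which makes the encoding easier to detect. However, although editors do not display the BOM, it is still part of the file content and shows up in program output.
A UTF-8 file written by Java can be read back correctly by Java. But if the same content is saved from Notepad in UTF-8 format, a program reading the file gets one extra invisible character at the start.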
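The difference between the two kinds of file can be reproduced in memory. The following is a minimal sketch, not the code from the project above: the class name BomDecodeDemo and the helper decode are made up for illustration, and the byte array stands in for a file saved by Notepad.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    /** Decodes raw bytes as UTF-8 (hypothetical helper for this sketch). */
    public static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Hi" the way a BOM-writing editor stores it: EF BB BF, then the text.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i'};
        // "Hi" the way Java's own UTF-8 encoder writes it: no BOM.
        byte[] withoutBom = "Hi".getBytes(StandardCharsets.UTF_8);

        String a = decode(withBom);
        String b = decode(withoutBom);

        System.out.println(withoutBom.length);            // 2
        System.out.println(a.length());                   // 3
        System.out.println(b.length());                   // 2
        // The three BOM bytes decode to the single character U+FEFF.
        System.out.println((int) a.charAt(0) == 0xFEFF);  // true
    }
}
```

The extra character is U+FEFF, which most consoles cannot render and therefore print as "?".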

 
  import java.io.BufferedReader;
  import java.io.File;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStreamReader;

  // The class name is arbitrary; the original snippet showed only main().
  public class Utf8BomDemo {
   public static void main(String[] args) throws IOException {
    File f = new File("C:/utf.txt");
    // Read the file explicitly as UTF-8; try-with-resources closes the stream
    try (BufferedReader br = new BufferedReader(new InputStreamReader(
      new FileInputStream(f), "UTF-8"))) {
     String line = br.readLine();
     while (line != null) {
      byte[] allbytes = line.getBytes("UTF-8");
      for (int i = 0; i < allbytes.length; i++) {
       // Mask to 0..255 so negative bytes are not sign-extended,
       // then print exactly two upper-case hex digits
       System.out.print(String.format("%02X", allbytes[i] & 0xFF));
       System.out.print(" ");
      }
      // End the hex dump line before printing the decoded text
      System.out.println();
      System.out.println(line);
      line = br.readLine();
     }
    }
   }
  }

The output is as follows:

Quote

EF BB BF 54 68 69 73 20 69 73 20 74 68 65 20 66 69 72 73 74 20 6C 69 6E 65 2E
?This is the first line.
54 68 69 73 20 69 73 20 73 65 63 6F 6E 64 20 6C 69 6E 65 2E
This is second line.

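The leading "EF BB BF" is exactly the UTF-8 BOM. This shows that Java does not treat the BOM specially when reading a UTF-8 file: it decodes the first 3 bytes as ordinary text, which yields the invisible character U+FEFF at the start of the first line.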
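Because the BOM arrives as a single leading U+FEFF character after decoding, it can also be dropped at the string level once the first line has been read. This is only a sketch under that assumption: the class name FirstLineBom and the helper stripBom are hypothetical, and an in-memory byte array stands in for the file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class FirstLineBom {
    /** Removes a leading U+FEFF (the decoded BOM) from a string, if present. */
    public static String stripBom(String s) {
        if (s != null && !s.isEmpty() && s.charAt(0) == '\uFEFF') {
            return s.substring(1);
        }
        return s;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a file saved with a BOM: EF BB BF + "A\nB".
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A', '\n', 'B'};
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            String first = br.readLine();
            System.out.println(first.length());            // 2: U+FEFF plus 'A'
            System.out.println(stripBom(first).length());  // 1
        }
    }
}
```

Only the first line of a stream needs this check; later lines never start with the signature.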

Solutions:

Quote

JDK Bug 4508058

Java InputStreamReader will support BOM mark for UTF-16 files. But for some reason it does not recognize UTF-8 BOM marks. This is very unfortunate for all Windows (>win2k) users if text files are saved with Notepad using UTF-8 format. Notepad will add BOM bytes at the start of file, but Java's InputStreamReader does not skip it.

UnicodeInputStream.java class will help you to autorecognize and skip BOMs. This will support UTF-8 as well.

UnicodeReader.java class will do everything even more transparently. Just instantiate it and read text.

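1. As the description above shows, InputStreamReader does not handle the UTF-8 BOM, but the UnicodeInputStream and UnicodeReader classes mentioned there solve the problem. Note that these are standalone helper classes, not part of the JDK, so they have to be added to the project.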

  BufferedReader br = new BufferedReader(new UnicodeReader(in, Charset.defaultCharset().name()));

2. We can write our own code to skip the BOM.

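 /**
  * Reads the first bytes of the stream and checks for a UTF-8 BOM
  * (EF BB BF). If a BOM is present it is consumed and discarded;
  * otherwise the bytes read are pushed back unchanged.
  * 
  * @param in the raw input stream
  * @return a stream positioned after the BOM, if there was one
  * @throws IOException if reading from the stream fails
  */
 public static InputStream trimBOM(InputStream in) throws IOException {

  // The pushback buffer must hold up to 3 bytes; the default size is 1,
  // so pushing back two bytes would fail with "Push back buffer is full"
  PushbackInputStream testin = new PushbackInputStream(in, 3);
  byte[] head = new byte[3];
  int n = 0;
  int r;
  // read() may return fewer bytes than requested, and -1 at end of stream
  while (n < 3 && (r = testin.read(head, n, 3 - n)) > 0) {
   n += r;
  }
  boolean hasBom = n == 3 && (head[0] & 0xFF) == 0xEF
    && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF;
  if (!hasBom && n > 0) {
   // Not a BOM. EF BB followed by another byte can still be valid
   // UTF-8 text, so push everything back instead of throwing
   testin.unread(head, 0, n);
  }
  return testin;

 }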
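An alternative to the pushback approach is to rely on mark/reset, which BufferedInputStream supports. The sketch below is an assumption-laden variant, not the author's method: the class name BomSkipper and the method skipUtf8Bom are invented for this example, and a byte array stands in for the uploaded file.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class BomSkipper {
    /** Returns a stream positioned after a leading EF BB BF, if there is one. */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        BufferedInputStream bin = new BufferedInputStream(in);
        bin.mark(3); // remember the start; we look at no more than 3 bytes
        byte[] head = new byte[3];
        int n = 0;
        int r;
        while (n < 3 && (r = bin.read(head, n, 3 - n)) > 0) {
            n += r;
        }
        boolean hasBom = n == 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF;
        if (!hasBom) {
            bin.reset(); // no BOM: rewind so the caller sees every byte
        }
        return bin;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i'};
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8))) {
            System.out.println(br.readLine().length()); // 2
        }
    }
}
```

Either variant has to wrap the raw byte stream before it is handed to InputStreamReader, because the check works on bytes rather than decoded characters.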

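Editor issues:

               Windows Notepad saves UTF-8 files with a BOM.
               Notepad++ also saves UTF-8 files with a BOM, but it offers a separate encoding option: UTF-8 without BOM.
               EditPlus saves UTF-8 files without a BOM, and offers a separate encoding option: UTF-8 + BOM.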
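To find out what a particular editor actually wrote, the first three bytes of the saved file can be inspected. This is a rough sketch: the class name BomCheck and the method hasUtf8Bom are hypothetical, and the two temporary files merely imitate files saved with and without the signature.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomCheck {
    /** Returns true if the file starts with the bytes EF BB BF. */
    public static boolean hasUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            // read() returns 0..255, or -1 at end of file
            return in.read() == 0xEF && in.read() == 0xBB && in.read() == 0xBF;
        }
    }

    public static void main(String[] args) throws IOException {
        Path plain = Files.createTempFile("plain", ".txt");
        Path signed = Files.createTempFile("signed", ".txt");
        Files.write(plain, "Hi".getBytes(StandardCharsets.UTF_8));
        Files.write(signed, new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i'});

        System.out.println(hasUtf8Bom(plain));  // false
        System.out.println(hasUtf8Bom(signed)); // true

        Files.delete(plain);
        Files.delete(signed);
    }
}
```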


2013-04-21 12:35