stephen830

A puzzling problem when generating UTF-8 encoded files with Java

★★★ This is an original post. If you quote or repost it, please credit: http://stephen830.iteye.com/blog/259350. Thank you for your support! ★★★

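Generating a UTF-8 file with Java:

If the file content contains no Chinese characters, editors such as UltraEdit identify the generated file as ANSI-encoded;
if the file content contains Chinese characters, they identify the generated file as UTF-8-encoded.

In other words, a file whose content is entirely single-byte text shows up as ANSI even though UTF-8 was requested.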

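import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

/**
     * Writes a UTF-8 file.
     * Observed: if the content contains no Chinese characters, the generated file is identified as ANSI;
     * if the content contains Chinese characters, the generated file is identified as UTF-8.
     * @param fileName name of the file to generate (including the full path)
     * @param fileBody content of the file
     * @return true if the file was written, false if an exception occurred
     */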
    public static boolean writeUTFFile(String fileName,String fileBody){
        FileOutputStream fos = null;
        OutputStreamWriter osw = null;
        try {
            fos = new FileOutputStream(fileName);
            osw = new OutputStreamWriter(fos, "UTF-8");
            osw.write(fileBody);
            return true;
        } catch (Exception e) {
            e.printStackTrace();
            return false;
        }finally{
            if(osw!=null){
                try {
                    osw.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
            if(fos!=null){
                try {
                    fos.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
        }
    }

//main()
public static void main(String[] args){
    	writeUTFFile("C:\\test1.txt","aaa");//test1.txt is identified as an ANSI file
    	writeUTFFile("C:\\test2.txt","中文aaa");//test2.txt is identified as a UTF-8 file
}


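Thanks to [Liteos] for the suggestion about the UTF-8 BOM. Here are hex views of the two files for reference.

Viewed in UltraEdit's Hex mode (UltraEdit version 10.0; see the figures below):

(test1.txt, ANSI format; a: 61)


(test2.txt, UTF-8 format; 2D 4E: 中, 87 65: 文, 61 00: a)

If your UltraEdit shows the same two views, upgrade it right away: older versions of UltraEdit display UTF-8 files incorrectly in Hex mode. What they show are UTF-16LE code units (2D 4E is U+4E2D in little-endian byte order), not the UTF-8 bytes stored in the file.

Thanks to Liteos for pointing this out. I uninstalled UltraEdit 10 and installed the latest version, 14.20. Viewing test2.txt in Hex mode then gives the figure below:

(UltraEdit version 14.20, viewing test2.txt: E4 B8 AD is “中”, E6 96 87 is “文”, 61 is “a”)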

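At this point it is clear that UTF-8 is quite expensive for Chinese text: each of these Chinese characters takes 3 bytes! This reminds me of a criticism of UTF-8 I read in another article (that UTF-8 discriminates against Asian languages). The article is 《为什么用Utf-8编码?》 ("Why use UTF-8 encoding?"): http://stephen830.iteye.com/blog/258929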
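The byte counts in the hex view can be reproduced without an editor. The following is a minimal sketch, not part of the original program: the class name `Utf8HexDump` and the helper `toUtf8Hex` are made up for illustration. It encodes a string as UTF-8 and prints each byte in hex.

```java
import java.nio.charset.StandardCharsets;

class Utf8HexDump {
    /** Returns the UTF-8 bytes of text as space-separated uppercase hex pairs. */
    static String toUtf8Hex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+4E2D and U+6587 are the two Chinese characters from test2.txt,
        // written as escapes so the encoding of this source file does not matter.
        System.out.println(toUtf8Hex("\u4E2D\u6587a")); // E4 B8 AD E6 96 87 61
        System.out.println(toUtf8Hex("aaa"));           // 61 61 61
    }
}
```

Each of the two Chinese characters takes three bytes, while each ASCII letter takes one, which matches what UltraEdit 14.20 shows.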



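Back to the question of this post: why was no UTF-8 file generated? My guess is that Java's internal I/O handling produces an ANSI file when every character is single-byte (but the program explicitly asked for UTF-8, so why not give me UTF-8? Is this a bug?), and only applies the configured encoding (for example UTF-8) once it meets a multi-byte character.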
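One way to test this guess is to compare the bytes directly. The sketch below is an illustration only: the class `AsciiVsUtf8` and the method `writeAsUtf8` are hypothetical names, and a `ByteArrayOutputStream` stands in for the `FileOutputStream` so that nothing is written to disk. It pushes text through an `OutputStreamWriter` set to UTF-8 and compares the result with the US-ASCII and ISO 8859-1 encodings of the same text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiVsUtf8 {
    /** Encodes text through an OutputStreamWriter configured for UTF-8. */
    static byte[] writeAsUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] utf8 = writeAsUtf8("aaa");
        byte[] ascii = "aaa".getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = "aaa".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(utf8, ascii));  // true
        System.out.println(Arrays.equals(utf8, latin1)); // true
        System.out.println(utf8.length);                 // 3
    }
}
```

For ASCII-only text the UTF-8 output is byte-for-byte the same as the single-byte encodings, so an editor looking only at the bytes has nothing to tell the two apart. This fits the explanation Liteos gives in the comments.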

Below is a passage from the W3C on the UTF-8 BOM (source: http://www.w3.org/International/questions/qa-utf8-bom):

Quote

FAQ: Display problems caused by the UTF-8 BOM


Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), CSS coders, XSLT developers, Web project managers, and anyone who is trying to diagnose why blank lines or other strange items are displayed on their UTF-8 page.

Question

When using UTF-8 encoded pages in some user agents, I get an extra line or unwanted characters at the top of my web page or included file. How do I remove them?

Answer

If you are dealing with a file encoded in UTF-8, your display problems may be caused by the presence of a UTF-8 signature (BOM) that the user agent doesn't recognize.

The BOM is always at the beginning of the file, and so you would normally expect to see the display issues at the top of a page. However, you may also find blank lines appearing within the page if you include text from a separate file that begins with a UTF-8 signature.

We have a set of test pages and a summary of results for various recent browser versions that explore this behaviour.

This article will help you determine whether the UTF-8 is causing the problem. If there is no evidence of a UTF-8 signature at the beginning of the file, then you will have to look elsewhere for a solution.

What is a UTF-8 signature (BOM)?

Some applications insert a particular combination of bytes at the beginning of a file to indicate that the text contained in the file is Unicode. This combination of bytes is known as a signature or Byte Order Mark (BOM). Some applications - such as a text editor or a browser - will display the BOM as an extra line in the file, others will display unexpected characters, such as ï»¿.

See the side panel for more detailed information about the BOM.

The BOM is the Unicode codepoint U+FEFF, corresponding to the Unicode character 'ZERO WIDTH NON-BREAKING SPACE' (ZWNBSP).

In UTF-16 and UTF-32 encodings, unless there is some alternative indicator, the BOM is essential to ensure correct interpretation of the file's contents. Each character in the file is represented by 2 or 4 bytes of data and the order in which these bytes are stored in the file is significant; the BOM indicates this order.

In the UTF-8 encoding, the presence of the BOM is not essential because, unlike the UTF-16 or UTF-32 encodings, there is no alternative sequence of bytes in a character. The BOM may still occur in UTF-8 encoding text, however, either as a by-product of an encoding conversion or because it was added by an editor.

Detecting the BOM

First, we need to check whether there is indeed a BOM at the beginning of the file.

You can try looking for a BOM in your content, but if your editor handles the UTF-8 signature correctly you probably won't be able to see it. An editor which does not handle the UTF-8 signature correctly displays the bytes that compose that signature according to its own character encoding setting. (With the Latin 1 (ISO 8859-1) character encoding, the signature displays as characters ï»¿.) With a binary editor capable of displaying the hexadecimal byte values in the file, the UTF-8 signature displays as EF BB BF.

Alternatively, your editor may tell you in a status bar or a menu what encoding your file is in, including information about the presence or not of the UTF-8 signature.

If not, some kind of script-based test (see below) may help. Alternatively, you could try this small web-based utility. (Note, if it’s a file included by PHP or some other mechanism that you think is causing the problem, type in the URI of the included file.)

Removing the BOM

If you have an editor which shows the characters that make up the UTF-8 signature you may be able to delete them by hand. Chances are, however, that the BOM is there in the first place because you didn't see it.

Check whether your editor allows you to specify whether a UTF-8 signature is added or kept during a save. Such an editor provides a way of removing the signature by simply reading the file in then saving it out again. For example, if Dreamweaver detects a BOM the Save As dialogue box will have a check mark alongside the text "Include Unicode Signature (BOM)". Just uncheck the box and save.

One of the benefits of using a script is that you can remove the signature quickly, and from multiple files. In fact the script could be run automatically as part of your process. If you use Perl, you could use a simple script created by Martin Dürst.

Note: You should check the process impact of removing the signature. It may be that some part of your content development process relies on the use of the signature to indicate that a file is in UTF-8. Bear in mind also that pages with a high proportion of Latin characters may look correct superficially but that occasional characters outside the ASCII range (U+0000 to U+007F) may be incorrectly encoded.

By the way

You will find that some text editors such as Windows Notepad will automatically add a UTF-8 signature to any file you save as UTF-8.

A UTF-8 signature at the beginning of a CSS file can sometimes cause the initial rules in the file to fail on certain user agents.

In some browsers, the presence of a UTF-8 signature will cause the browser to interpret the text as UTF-8 regardless of any character encoding declarations to the contrary.
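(End of the quoted FAQ.) To make its "Detecting the BOM" and "Removing the BOM" steps concrete in Java, here is a minimal sketch. It is not taken from the W3C page: the class `Utf8BomDetector` and both method names are invented for illustration. It assumes the file content is already in memory as a byte array (for example from `java.nio.file.Files.readAllBytes`).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8BomDetector {
    /** Returns true if data starts with the UTF-8 signature EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Returns a copy of data without a leading UTF-8 signature, if one is present. */
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data.clone();
    }

    public static void main(String[] args) {
        // U+FEFF encodes to EF BB BF in UTF-8.
        byte[] withBom = "\uFEFFaaa".getBytes(StandardCharsets.UTF_8);
        byte[] withoutBom = "aaa".getBytes(StandardCharsets.UTF_8);
        System.out.println(hasUtf8Bom(withBom));           // true
        System.out.println(hasUtf8Bom(withoutBom));        // false
        System.out.println(stripUtf8Bom(withBom).length);  // 3
    }
}
```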





The method above relies on [class sun.nio.cs.StreamEncoder] (OutputStreamWriter delegates to it internally). Its source is posted below for reference:
/*
 * Copyright 2001-2005 Sun Microsystems, Inc.  All Rights Reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation.  Sun designates this
 * particular file as subject to the "Classpath" exception as provided
 * by Sun in the LICENSE file that accompanied this code.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Sun Microsystems, Inc., 4150 Network Circle, Santa Clara,
 * CA 95054 USA or visit www.sun.com if you need additional information or
 * have any questions.
 */

/*
 */

package sun.nio.cs;

import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.nio.charset.*;

public class StreamEncoder extends Writer
{

    private static final int DEFAULT_BYTE_BUFFER_SIZE = 8192;

    private volatile boolean isOpen = true;

    private void ensureOpen() throws IOException {
        if (!isOpen)
            throw new IOException("Stream closed");
    }

    // Factories for java.io.OutputStreamWriter
    public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                      Object lock,
                                                      String charsetName)
        throws UnsupportedEncodingException
    {
        String csn = charsetName;
        if (csn == null)
            csn = Charset.defaultCharset().name();
        try {
            if (Charset.isSupported(csn))
                return new StreamEncoder(out, lock, Charset.forName(csn));
        } catch (IllegalCharsetNameException x) { }
        throw new UnsupportedEncodingException (csn);
    }

    public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                      Object lock,
                                                      Charset cs)
    {
        return new StreamEncoder(out, lock, cs);
    }

    public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                      Object lock,
                                                      CharsetEncoder enc)
    {
        return new StreamEncoder(out, lock, enc);
    }


    // Factory for java.nio.channels.Channels.newWriter

    public static StreamEncoder forEncoder(WritableByteChannel ch,
                                           CharsetEncoder enc,
                                           int minBufferCap)
    {
        return new StreamEncoder(ch, enc, minBufferCap);
    }


    // -- Public methods corresponding to those in OutputStreamWriter --

    // All synchronization and state/argument checking is done in these public
    // methods; the concrete stream-encoder subclasses defined below need not
    // do any such checking.

    public String getEncoding() {
        if (isOpen())
            return encodingName();
        return null;
    }

    public void flushBuffer() throws IOException {
        synchronized (lock) {
            if (isOpen())
                implFlushBuffer();
            else
                throw new IOException("Stream closed");
        }
    }

    public void write(int c) throws IOException {
        char cbuf[] = new char[1];
        cbuf[0] = (char) c;
        write(cbuf, 0, 1);
    }

    public void write(char cbuf[], int off, int len) throws IOException {
        synchronized (lock) {
            ensureOpen();
            if ((off < 0) || (off > cbuf.length) || (len < 0) ||
                ((off + len) > cbuf.length) || ((off + len) < 0)) {
                throw new IndexOutOfBoundsException();
            } else if (len == 0) {
                return;
            }
            implWrite(cbuf, off, len);
        }
    }

    public void write(String str, int off, int len) throws IOException {
        /* Check the len before creating a char buffer */
        if (len < 0)
            throw new IndexOutOfBoundsException();
        char cbuf[] = new char[len];
        str.getChars(off, off + len, cbuf, 0);
        write(cbuf, 0, len);
    }

    public void flush() throws IOException {
        synchronized (lock) {
            ensureOpen();
            implFlush();
        }
    }

    public void close() throws IOException {
        synchronized (lock) {
            if (!isOpen)
                return;
            implClose();
            isOpen = false;
        }
    }

    private boolean isOpen() {
        return isOpen;
    }


    // -- Charset-based stream encoder impl --

    private Charset cs;
    private CharsetEncoder encoder;
    private ByteBuffer bb;

    // Exactly one of these is non-null
    private final OutputStream out;
    private WritableByteChannel ch;

    // Leftover first char in a surrogate pair
    private boolean haveLeftoverChar = false;
    private char leftoverChar;
    private CharBuffer lcb = null;

    private StreamEncoder(OutputStream out, Object lock, Charset cs) {
        this(out, lock,
         cs.newEncoder()
         .onMalformedInput(CodingErrorAction.REPLACE)
         .onUnmappableCharacter(CodingErrorAction.REPLACE));
    }

    private StreamEncoder(OutputStream out, Object lock, CharsetEncoder enc) {
        super(lock);
        this.out = out;
        this.ch = null;
        this.cs = enc.charset();
        this.encoder = enc;

        // This path disabled until direct buffers are faster
        if (false && out instanceof FileOutputStream) {
            ch = ((FileOutputStream)out).getChannel();
            if (ch != null)
                bb = ByteBuffer.allocateDirect(DEFAULT_BYTE_BUFFER_SIZE);
        }
        if (ch == null) {
            bb = ByteBuffer.allocate(DEFAULT_BYTE_BUFFER_SIZE);
        }
    }

    private StreamEncoder(WritableByteChannel ch, CharsetEncoder enc, int mbc) {
        this.out = null;
        this.ch = ch;
        this.cs = enc.charset();
        this.encoder = enc;
        this.bb = ByteBuffer.allocate(mbc < 0
                                  ? DEFAULT_BYTE_BUFFER_SIZE
                                  : mbc);
    }

    private void writeBytes() throws IOException {
        bb.flip();
        int lim = bb.limit();
        int pos = bb.position();
        assert (pos <= lim);
        int rem = (pos <= lim ? lim - pos : 0);

        if (rem > 0) {
            if (ch != null) {
                if (ch.write(bb) != rem)
                    assert false : rem;
            } else {
                out.write(bb.array(), bb.arrayOffset() + pos, rem);
            }
        }
        bb.clear();
    }

    private void flushLeftoverChar(CharBuffer cb, boolean endOfInput)
        throws IOException
    {
        if (!haveLeftoverChar && !endOfInput)
            return;
        if (lcb == null)
            lcb = CharBuffer.allocate(2);
        else
            lcb.clear();
        if (haveLeftoverChar)
            lcb.put(leftoverChar);
        if ((cb != null) && cb.hasRemaining())
            lcb.put(cb.get());
        lcb.flip();
        while (lcb.hasRemaining() || endOfInput) {
            CoderResult cr = encoder.encode(lcb, bb, endOfInput);
            if (cr.isUnderflow()) {
                if (lcb.hasRemaining()) {
                    leftoverChar = lcb.get();
                    if (cb != null && cb.hasRemaining())
                        flushLeftoverChar(cb, endOfInput);
                    return;
                }
                break;
            }
            if (cr.isOverflow()) {
                assert bb.position() > 0;
                writeBytes();
                continue;
            }
            cr.throwException();
        }
        haveLeftoverChar = false;
    }

    void implWrite(char cbuf[], int off, int len)
        throws IOException
    {
        CharBuffer cb = CharBuffer.wrap(cbuf, off, len);

        if (haveLeftoverChar)
            flushLeftoverChar(cb, false);

        while (cb.hasRemaining()) {
            CoderResult cr = encoder.encode(cb, bb, false);
            if (cr.isUnderflow()) {
                assert (cb.remaining() <= 1) : cb.remaining();
                if (cb.remaining() == 1) {
                    haveLeftoverChar = true;
                    leftoverChar = cb.get();
                }
                break;
            }
            if (cr.isOverflow()) {
                assert bb.position() > 0;
                writeBytes();
                continue;
            }
            cr.throwException();
        }
    }

    void implFlushBuffer() throws IOException {
        if (bb.position() > 0)
            writeBytes();
    }

    void implFlush() throws IOException {
        implFlushBuffer();
        if (out != null)
            out.flush();
    }

    void implClose() throws IOException {
        flushLeftoverChar(null, true);
        try {
            for (;;) {
                CoderResult cr = encoder.flush(bb);
                if (cr.isUnderflow())
                    break;
                if (cr.isOverflow()) {
                    assert bb.position() > 0;
                    writeBytes();
                    continue;
                }
                cr.throwException();
            }

            if (bb.position() > 0)
                writeBytes();
            if (ch != null)
                ch.close();
            else
                out.close();
        } catch (IOException x) {
            encoder.reset();
            throw x;
        }
    }

    String encodingName() {
        return ((cs instanceof HistoricallyNamedCharset)
            ? ((HistoricallyNamedCharset)cs).historicalName()
            : cs.name());
    }
}
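
The StreamEncoder constructor above gets its encoder from `cs.newEncoder()` and sets both error actions to REPLACE. A rough sketch that calls the same public `CharsetEncoder` API directly is shown below; the class `EncoderBomCheck` and its `encode` helper are hypothetical names, and it uses the one-shot `encode(CharBuffer)` method rather than StreamEncoder's buffered loop. It suggests that the UTF-8 encoder emits no signature of its own, whereas Java's "UTF-16" encoder does write a byte order mark.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncoderBomCheck {
    /** Encodes text with an encoder configured the way StreamEncoder configures its own. */
    static byte[] encode(Charset cs, String text) {
        try {
            ByteBuffer bb = cs.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[bb.remaining()];
            bb.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE actions; rethrown unchecked to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode(StandardCharsets.UTF_8, "aaa").length);  // 3: just 61 61 61
        System.out.println(encode(StandardCharsets.UTF_16, "aaa").length); // 8: FE FF + 3 x 2 bytes
    }
}
```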



If anyone can find the cause, please leave your answer!

-------------------------------------------------------------
Share knowledge, share happiness. I hope this post is of some small help to those who need it.
Comments
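#2 Liteos 2008-10-30
To make an editor recognize an ASCII-only file as UTF-8, you need to add a file header, namely the UTF-8 BOM (Byte Order Mark). However, browsers and editors handle the BOM inconsistently and it often causes problems, so use it with caution. Search Google for "utf 8 bom".
#1 Liteos 2008-10-30
This is a misunderstanding of UTF-8 on the author's part: UTF-8 leaves ASCII unchanged. The file you get after adding one Chinese character differs from the original file by only three bytes.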
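
For readers who still want an editor to label an ASCII-only file as UTF-8, Liteos's BOM suggestion can be sketched as follows. This is only an illustration under stated assumptions: the class `Utf8BomWriter` and the method `writeWithBom` are invented names, and a `ByteArrayOutputStream` is used in place of a `FileOutputStream` so the result can be inspected in memory. As the comments and the W3C FAQ warn, a BOM can cause trouble in other tools, so treat it as opt-in.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8BomWriter {
    /** Writes U+FEFF followed by text to out as UTF-8, then closes out. */
    static void writeWithBom(OutputStream out, String text) {
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write('\uFEFF'); // the encoder turns this into the bytes EF BB BF
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeWithBom(out, "aaa");
        StringBuilder hex = new StringBuilder();
        for (byte b : out.toByteArray()) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // EF BB BF 61 61 61
    }
}
```

Passing a `FileOutputStream` instead of the in-memory stream would produce a file that BOM-aware editors identify as UTF-8 even when the body is pure ASCII.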
