george.gu

Java String Charset Encoding


Charset

A Charset is a named mapping between sequences of sixteen-bit Unicode code units and sequences of bytes.

One byte is 8 bits, so a byte can represent the integer values 0x00 through 0xFF. ASCII uses a mapping from such values to characters. For example, charset names are strings composed of the following characters, shown here with their ASCII codes:
		The uppercase letters 'A' through 'Z' ('0x41' through '0x5a'), 
		The lowercase letters 'a' through 'z' ('0x61' through '0x7a'), 
		The digits '0' through '9' ('0x30' through '0x39'), 
		The dash character '-' ('0x2d', HYPHEN-MINUS), 
		The period character '.' ('0x2e', FULL STOP), 
		The colon character ':' ('0x3a', COLON), and 
		The underscore character '_' ('0x5f', LOW LINE). 
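As a minimal sketch of how these names behave in practice, the snippet below resolves a name to its canonical form and checks whether a name is legal. The class and helper names (CharsetNameDemo, canonicalName, isLegalName) are made up for illustration; Charset.forName and Charset.isSupported are the standard java.nio.charset calls.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

class CharsetNameDemo {
    /** Resolves a charset name (case-insensitive) to its canonical name. */
    static String canonicalName(String name) {
        return Charset.forName(name).name();
    }

    /** Returns false if the name contains characters that are not allowed in charset names. */
    static boolean isLegalName(String name) {
        try {
            Charset.isSupported(name); // throws for illegal names, returns false for unknown legal ones
            return true;
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("utf-8"));          // UTF-8
        System.out.println(canonicalName("us-ascii"));       // US-ASCII
        System.out.println(isLegalName("UTF-8"));            // true
        System.out.println(isLegalName("bad name!"));        // false: space and '!' are not allowed
        // Aliases differ between JVMs, so they are printed here rather than assumed.
        System.out.println(StandardCharsets.UTF_8.aliases());
    }
}
```

A legal name is not necessarily a supported one: Charset.isSupported returns false for a well-formed name that the JVM does not know.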
 

Unicode 4.0

The weakness of the previous mapping is that many characters, such as Chinese or Greek, cannot be represented. The Unicode 4.0 standard defines a character set that covers multiple languages; its Basic Multilingual Plane runs from \u0000 to \uFFFF, and characters beyond that range are called supplementary characters.
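To make the \u0000 to \uFFFF range concrete, here is a small sketch that checks which code points fit in a single 16-bit char. The class name BmpRangeDemo and the helper fitsInOneChar are made up for this example; the constants and methods come from java.lang.Character.

```java
class BmpRangeDemo {
    /** True if the code point fits in one 16-bit char (U+0000 through U+FFFF). */
    static boolean fitsInOneChar(int codePoint) {
        return Character.isBmpCodePoint(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(Character.MAX_VALUE));      // ffff
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT)); // 10ffff
        System.out.println(fitsInOneChar(0x03A9));        // Greek capital omega: true
        System.out.println(fitsInOneChar(0x4E2D));        // a common Chinese character: true
        System.out.println(fitsInOneChar(0x1F600));       // an emoji: false
        System.out.println(Character.charCount(0x1F600)); // 2 (needs a surrogate pair)
    }
}
```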

 

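Java stores characters in Unicode. In the String source (JDK 8 and earlier; newer JDKs use a byte array internally) the backing field is declared as: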

 

    /** The value is used for character storage. */
    private final char value[];

One char occupies 2 bytes (16 bits).

 

The String Javadoc describes this as follows:

 

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String. 

The String class provides methods for dealing with Unicode code points (i.e., characters), in addition to those for dealing with Unicode code units (i.e., char values).
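The difference between code units and code points is easiest to see with a supplementary character. The following is only a sketch; the class name CodePointDemo and its two helpers are invented for illustration, while length(), codePointCount() and Character.toChars() are standard API.

```java
class CodePointDemo {
    /** Number of UTF-16 code units (char values) in s. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in s. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 is a supplementary character, so it is stored as a surrogate pair.
        String s = "a" + new String(Character.toChars(0x1F600));
        System.out.println(codeUnits(s));                           // 3
        System.out.println(codePoints(s));                          // 2
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Character.BYTES);                        // 2 bytes per char
    }
}
```

Index-based methods such as charAt() work on code units, so charAt(1) above returns only the first half of the pair.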
 

Encoding and Decoding String

A String does not keep any Charset information, because all of its characters are stored in Unicode. A Charset is only used when encoding a String to bytes, or when decoding an array of bytes into String characters.

The following snippet helps to illustrate this:
	String a = "èç";
	// The bytes shown for the default encoding assume a single-byte Latin default such as ISO-8859-1.
	byte[] b_defaultEncoding = a.getBytes(); // 0xe8, 0xe7
	byte[] b_utf8 = a.getBytes("UTF-8"); // 0xc3, 0xa8, 0xc3, 0xa7
	byte[] b_ucs2 = a.getBytes("ISO-10646-UCS-2"); // 0x00, 0xe8, 0x00, 0xe7

	String a_defaultEncoding = new String(b_defaultEncoding); // decoded with the default charset
	String a_mix = new String(b_ucs2); // UCS-2 bytes decoded with the default charset: mismatch
	String a_ucs2 = new String(b_ucs2, "ISO-10646-UCS-2");
	String a_utf8 = new String(b_utf8, "UTF-8");

	System.out.println(a_defaultEncoding); // èç
	System.out.println(a_ucs2); // èç
	System.out.println(a_utf8); // èç
	System.out.println(a_mix); // è ç (each 0x00 byte was decoded as an extra NUL character)
We can see that a_mix is not converted correctly: the bytes were encoded with the Charset "ISO-10646-UCS-2", but the default charset was used to decode them back into a String.
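The same experiment can be written so that it does not depend on the platform default. This is a sketch under a few assumptions: the class RoundTripDemo and the helpers recode and hex are invented here, StandardCharsets.UTF_16BE stands in for "ISO-10646-UCS-2", and ISO_8859_1 plays the role of the single-byte default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    /** Encodes s with one charset and decodes the resulting bytes with another. */
    static String recode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    /** Formats bytes as space-separated two-digit hex values. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "èç" written with escapes so the source file encoding does not matter.
        String a = "\u00e8\u00e7";
        System.out.println(hex(a.getBytes(StandardCharsets.ISO_8859_1))); // e8 e7
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_8)));      // c3 a8 c3 a7
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_16BE)));   // 00 e8 00 e7

        // Same charset on both sides: the String survives the round trip.
        System.out.println(recode(a, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(a)); // true

        // Mismatched charsets: every 0x00 byte becomes an extra NUL character.
        String mix = recode(a, StandardCharsets.UTF_16BE, StandardCharsets.ISO_8859_1);
        System.out.println(mix.length());  // 4
        System.out.println(mix.equals(a)); // false
    }
}
```

Passing a Charset object instead of a charset name also avoids the checked UnsupportedEncodingException.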

Default Charset in java

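You can specify the default Charset with the system property "file.encoding", set when the JVM starts (for example -Dfile.encoding=UTF-8). If it is not specified, the default depends on the JDK version: since JDK 18 it is "UTF-8", while earlier versions derive it from the operating system's locale and charset. For more details please refer to Charset.defaultCharset().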
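A quick way to see which default is in effect is to print it. In this sketch the class name DefaultCharsetDemo and the helper usesDefault are made up for illustration; the printed values depend on your JVM and platform, so no expected output is shown.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class DefaultCharsetDemo {
    /** True if getBytes() without arguments matches encoding with the default charset. */
    static boolean usesDefault(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        // Both values are platform- and version-dependent.
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));

        // getBytes() with no argument is shorthand for encoding with the default charset.
        System.out.println(usesDefault("\u00e8\u00e7")); // true
    }
}
```

Because the default can differ between machines, passing an explicit charset to getBytes and new String is usually the safer choice.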

Unicode range of Chinese characters

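Normally, Chinese characters are taken to be the Unicode range from \u4e00 to \u9fa5. This range belongs to the CJK Unified Ideographs block and covers both Simplified and Traditional characters.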
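A range check like this is often used to detect Chinese characters. The sketch below shows it next to an alternative based on Character.UnicodeScript, which is not part of the original text and is included only for comparison; the class HanRangeDemo and both helper names are invented for this example.

```java
class HanRangeDemo {
    /** True if c lies in the U+4E00 through U+9FA5 range quoted above. */
    static boolean inQuotedRange(char c) {
        return c >= '\u4e00' && c <= '\u9fa5';
    }

    /** Alternative check: true if the code point belongs to the Han script. */
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    public static void main(String[] args) {
        System.out.println(inQuotedRange('\u4e2d')); // true
        System.out.println(inQuotedRange('A'));      // false
        System.out.println(isHan(0x4E2D));           // true
        // U+20000 is a Han ideograph outside the BMP, so a char range check cannot see it.
        System.out.println(isHan(0x20000));          // true
    }
}
```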


 
