How UTF-8 Encoding Works

dingjun1


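Summary:
In UTF-8, each character is encoded as a sequence of 1 to 6 octets.
In a single-octet sequence, the highest bit is set to 0 and the remaining 7 bits encode the character value (this covers ASCII).
In a sequence of N octets (N > 1), the first octet has its N highest bits set to 1, followed by one bit set to 0; the remaining bits of that octet carry bits of the character value. Each of the following N-1 octets has its highest bit set to 1 and the next bit set to 0, leaving 6 bits per octet for the character value.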
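
The rules above can be sketched as a small encoder. This is a minimal illustration of the 1 to 6 octet scheme, not a production codec; the function name encode_rfc2044 is hypothetical, and the sketch does not check for surrogates or other reserved values.

```python
def encode_rfc2044(code_point: int) -> bytes:
    """Encode one character value with the 1-6 octet scheme described above.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if code_point < 0x80:
        # Single octet: highest bit 0, remaining 7 bits carry the value.
        return bytes([code_point])

    # An N-octet sequence (N >= 2) has (7 - N) + 6 * (N - 1) = 5 * N + 1
    # payload bits. Pick the smallest N that can hold the value.
    n = 2
    while code_point >= 1 << (5 * n + 1):
        n += 1

    # Build the N-1 trailing octets (10xxxxxx) from the low bits upward.
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6

    # Leading octet: N one-bits, then a zero bit, then the remaining high bits.
    lead_marker = (0xFF << (8 - n)) & 0xFF
    return bytes([lead_marker | code_point] + tail[::-1])


if __name__ == "__main__":
    print(encode_rfc2044(0x41).hex())    # one octet
    print(encode_rfc2044(0x4E2D).hex())  # three octets
```

For values that modern UTF-8 also allows, the result should match Python's built-in encoder, for example encode_rfc2044(0x4E2D) == "中".encode("utf-8").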

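One octet: 7 payload bits, for character values 0 to 127.
Two octets: 5 bits remain in the first octet and 6 in the second, 11 bits in total, for values 128 to 2047 (2048 - 1).
Three octets: 4 bits remain in the first octet and 6 in each of the second and third, 16 bits in total, for values 2048 to 65535 (65536 - 1).
Longer sequences follow the same pattern.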

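The longest sequence is 6 octets, which leaves 1 + 5*6 = 31 payload bits, enough for values up to 2147483647 (2147483648 - 1). Note that this is the RFC 2044 definition; the later RFC 3629 limits UTF-8 to 4 octets (values up to U+10FFFF).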
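
The bit counts above can be checked with a few lines of arithmetic. This is only a sketch of the counting argument; payload_bits is a hypothetical helper name.

```python
def payload_bits(n: int) -> int:
    """Number of bits available for the character value in an n-octet sequence."""
    if n == 1:
        return 7  # 0xxxxxxx
    # Leading octet spends n one-bits plus a zero bit, leaving 7 - n bits;
    # each of the n - 1 trailing octets (10xxxxxx) contributes 6 bits.
    return (7 - n) + 6 * (n - 1)


if __name__ == "__main__":
    for n in range(1, 7):
        bits = payload_bits(n)
        # Largest value representable with this many payload bits.
        print(n, bits, (1 << bits) - 1)
```

For n = 6 this gives 31 bits and a maximum of 2147483647, matching the figure above.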
===================================================================================
Excerpt from RFC 2044 - UTF-8:
   In UTF-8, characters are encoded using sequences of 1 to 6 octets.
   The only octet of a "sequence" of one has the higher-order bit set to
   0, the remaining 7 bits being used to encode the character value. In
   a sequence of n octets, n>1, the initial octet has the n higher-order
   bits set to 1, followed by a bit set to 0. The remaining bit(s) of
   that octet contain bits from the value of the character to be
   encoded. The following octet(s) all have the higher-order bit set to
   1 and the following bit set to 0, leaving 6 bits in each to contain
   bits from the character to be encoded.

   The table below summarizes the format of these different octet types.
   The letter x indicates bits available for encoding bits of the UCS-4
   character value.

   UCS-4 range (hex.)           UTF-8 octet sequence (binary)
   0000 0000-0000 007F   0xxxxxxx
   0000 0080-0000 07FF   110xxxxx 10xxxxxx
   0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx
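
Reading the table in the other direction gives a decoder: the number of leading one-bits in the first octet tells how many octets belong to the sequence. The following is a minimal sketch under that reading; decode_rfc2044 is a hypothetical name, and the sketch deliberately does not reject overlong forms or values that later specifications forbid.

```python
from typing import List


def decode_rfc2044(data: bytes) -> List[int]:
    """Decode octets laid out as in the table above into character values.

    Hypothetical helper for illustration; no overlong-form checks.
    """
    values = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # 0xxxxxxx: a one-octet sequence.
            n, value = 1, lead
        else:
            # Count the leading one-bits of the first octet.
            n = 0
            while n < 8 and lead & (0x80 >> n):
                n += 1
            if n < 2 or n > 6:
                # n == 1 is a stray 10xxxxxx octet; n > 6 is not in the table.
                raise ValueError("invalid leading octet at offset %d" % i)
            # Keep the 7 - n payload bits of the leading octet.
            value = lead & (0x7F >> n)
        if i + n > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        for b in data[i + 1:i + n]:
            if (b & 0xC0) != 0x80:
                raise ValueError("expected a 10xxxxxx octet")
            # Append the 6 payload bits of each trailing octet.
            value = (value << 6) | (b & 0x3F)
        values.append(value)
        i += n
    return values


if __name__ == "__main__":
    print([hex(v) for v in decode_rfc2044("A中".encode("utf-8"))])
```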



2009-07-20 15:08