FROM:
http://en.wikipedia.org/wiki/UTF-8
UTF-8
From Wikipedia, the free encyclopedia
UTF-8 (UCS Transformation Format, 8-bit[1]) is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks (BOM). For these and other reasons, UTF-8 has become the dominant character encoding for the World Wide Web, accounting for more than half of all Web pages.[2][3][4] The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8.[5] The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8.[6] UTF-8 is also increasingly used as the default character encoding in operating systems, programming languages, APIs, and software applications.
UTF-8 encodes each of the 1,112,064[7] code points in the Unicode character set using one to four 8-bit bytes (termed "octets" in the Unicode Standard). Code points with lower numerical values (i.e., earlier code positions in the Unicode character set, which tend to occur more frequently in practice) are encoded using fewer bytes,[8] making the encoding scheme reasonably efficient. In particular, the first 128 characters of the Unicode character set, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as the corresponding ASCII character, so that valid ASCII text is also valid UTF-8-encoded Unicode text.
The official IANA code for the UTF-8 character encoding is UTF-8.[9]
History
By early 1992 the search was on for a good byte-stream encoding of multi-byte character sets. The draft ISO 10646 standard contained a non-required annex called UTF-1 that provided a byte-stream encoding of its 32-bit code points. This encoding was not satisfactory on performance grounds, but did introduce the notion that bytes in the ASCII range of 0–127 represent themselves in UTF, thereby providing backward compatibility.
In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multibyte sequences would include only bytes where the high bit was set.
In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties. Ken Thompson of the Plan 9 operating system group at Bell Labs then made a crucial modification to the encoding to make it self-synchronizing, meaning that it was not necessary to read from the beginning of the string to find code point boundaries. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. Over the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open.[10]
UTF-8 was first officially presented at the USENIX conference in San Diego, held January 25–29, 1993.
The original specification allowed for sequences of up to six bytes, covering numbers up to 31 bits (the original limit of the Universal Character Set). In November 2003 UTF-8 was restricted by RFC 3629 to four bytes covering only the range U+0000 to U+10FFFF, in order to match the constraints of the UTF-16 character encoding.
Design
The design of UTF‑8 as originally proposed by Dave Prosser and subsequently modified by Ken Thompson was intended to satisfy two objectives:
1. To be backward-compatible with ASCII; and
2. To enable encoding of up to at least 2^31 characters (the theoretical limit of the first draft proposal for the Universal Character Set).
Being backward-compatible with ASCII implied that every valid ASCII character (a 7-bit character set) also be a valid UTF‑8 character sequence, specifically, a one-byte UTF‑8 character sequence whose binary value equals that of the corresponding ASCII character:
Bits  Last code point  Byte 1
7     U+007F           0xxxxxxx
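The one-byte row above can be checked directly with Python's built-in codecs. The following is a minimal sketch; the function name `ascii_and_utf8_agree` is illustrative and does not come from the article.

```python
def ascii_and_utf8_agree(text: str) -> bool:
    """Return True if `text` is pure ASCII and encodes to identical bytes in UTF-8."""
    try:
        as_ascii = text.encode("ascii")
    except UnicodeEncodeError:
        # The text contains a character outside the 7-bit ASCII range.
        return False
    return as_ascii == text.encode("utf-8")


# Every one of the 128 ASCII byte values decodes to the same character in UTF-8.
all_single_bytes_match = all(
    bytes([value]).decode("utf-8") == chr(value) for value in range(128)
)
```

Here `all_single_bytes_match` covers the whole 0 to 127 range, while the function applies the same comparison to an arbitrary string.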
Prosser’s and Thompson’s challenge was to extend this scheme to handle code points with up to 31 bits. The solution proposed by Prosser as subsequently modified by Thompson was as follows:
Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
7     U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
21    U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
The salient features of the above scheme are as follows:
1. Every valid ASCII character is also a valid UTF-8 encoded Unicode character with the same binary value. (Thus, valid ASCII text is also valid UTF-8-encoded Unicode text.)
2. For every UTF-8 byte sequence corresponding to a single Unicode character, the first byte unambiguously indicates the length of the sequence in bytes.
3. All continuation bytes (bytes 2 to 6 in the table above) have 10 as their two most-significant bits (bits 7 and 6); in contrast, the first byte never has 10 as its two most-significant bits. As a result, it is immediately obvious whether any given byte anywhere in a (valid) UTF-8 stream is the first byte of a byte sequence corresponding to a single character, or a continuation byte of such a sequence.
4. As a consequence of no. 3 above, starting from any arbitrary byte in a (valid) UTF-8 stream, it is necessary to back up by at most five bytes to reach the beginning of the byte sequence corresponding to a single character (three bytes in actual UTF-8, as explained in the next section). If it is not possible to back up, or a byte is missing because of, e.g., a communication failure, a single character can be discarded and the next character read correctly.
5. Starting with the second row in the table above (two bytes), every additional byte extends the maximum number of bits by five (six additional bits from the additional continuation byte, minus one bit lost in the first byte).
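The byte classification and the backing-up procedure described in the list above can be sketched as follows. This is an illustration only: the function names are hypothetical, the input is assumed to be a valid stream, and the lead-byte lengths follow the original six-row table rather than the later four-byte limit.

```python
def is_continuation(byte: int) -> bool:
    """A continuation byte has 10 as its two most-significant bits."""
    return (byte & 0xC0) == 0x80


def sequence_length(first_byte: int) -> int:
    """Sequence length announced by a lead byte (original 1 to 6 byte scheme)."""
    if first_byte < 0x80:
        return 1  # 0xxxxxxx: a single-byte (ASCII) character
    if is_continuation(first_byte):
        raise ValueError("continuation byte cannot start a sequence")
    # Count the leading 1 bits; that count is the sequence length.
    length = 0
    mask = 0x80
    while first_byte & mask:
        length += 1
        mask >>= 1
    if length > 6:
        raise ValueError("0xFE and 0xFF are not lead bytes in this scheme")
    return length


def find_sequence_start(data: bytes, index: int) -> int:
    """Back up from `index` to the lead byte of the sequence that contains it.

    Assumes `data` is valid, so the loop steps back at most five bytes
    (three under the four-byte limit).
    """
    start = index
    while start > 0 and is_continuation(data[start]):
        start -= 1
    return start
```

For example, in the bytes 61 C2 A2 E2 82 AC, starting at the last byte (index 5) leads back to index 3, the lead byte E2 of a three-byte sequence.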
Prosser's and Thompson's scheme was general enough to be extended beyond 6-byte sequences. However, this would have allowed FE or FF bytes to occur in valid UTF-8 text (see under Advantages in the section "Compared to single-byte encodings" below), and indefinite extension would lose the desirable feature that the length of a sequence can be determined from the start byte alone.
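The five-bit growth per additional byte can be checked with a little arithmetic. The sketch below, with illustrative function names, computes the number of payload bits for each sequence length in the original 1 to 6 byte scheme and the largest code point that fits.

```python
def payload_bits(n_bytes: int) -> int:
    """Number of code point bits carried by an n-byte sequence (n from 1 to 6)."""
    if not 1 <= n_bytes <= 6:
        raise ValueError("the original scheme defines 1 to 6 byte sequences")
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    # The lead byte spends n_bytes one-bits plus a terminating zero on the length.
    lead_bits = 8 - (n_bytes + 1)
    # Each continuation byte spends two bits on the 10 marker and carries six.
    return lead_bits + 6 * (n_bytes - 1)


def last_code_point(n_bytes: int) -> int:
    """Largest value representable in an n-byte sequence."""
    return (1 << payload_bits(n_bytes)) - 1


bits_per_length = [payload_bits(n) for n in range(1, 7)]  # [7, 11, 16, 21, 26, 31]
```

From two bytes onward the result simplifies to 5n + 1 bits, which gives the 11, 16, 21, 26, and 31 of the table.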
Description
UTF-8 is a variable-width encoding, with each character represented by one to four bytes. If the character is encoded by just one byte, the high-order bit is 0 and the other bits give the code value (in the range 0..127). If the character is encoded by a sequence of more than one byte, the first byte has as many leading '1' bits as the total number of bytes in the sequence, followed by a '0' bit, and the succeeding bytes are all marked by a leading "10" bit pattern. The remaining bits in the byte sequence are concatenated to form the Unicode code point value (in the range hexadecimal 80 to 10FFFF). Thus a byte with lead bit '0' is a single-byte code, a byte with multiple leading '1' bits is the first of a multi-byte sequence, and a byte with a leading "10" bit pattern is a continuation byte of a multi-byte sequence. The format of the bytes thus allows the beginning of each sequence to be detected without decoding from the beginning of the string. UTF-16 limits Unicode to hexadecimal 10FFFF; therefore UTF-8 is not defined beyond that value, even though it could easily be defined to reach hexadecimal 7FFFFFFF.
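The rules in this paragraph translate into a short encoder. The following is a minimal sketch, not a replacement for a library codec: the name `encode_code_point` is illustrative, and the rejection of the surrogate range U+D800 to U+DFFF follows RFC 3629 rather than anything stated in the paragraph above.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as a UTF-8 byte sequence of 1 to 4 bytes."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        # Assumption taken from RFC 3629: surrogates are not encodable.
        raise ValueError("surrogate code points cannot be encoded")
    if cp < 0x80:
        return bytes([cp])  # 0xxxxxxx
    if cp < 0x800:
        return bytes([
            0xC0 | (cp >> 6),    # 110yyyyy
            0x80 | (cp & 0x3F),  # 10xxxxxx
        ])
    if cp < 0x10000:
        return bytes([
            0xE0 | (cp >> 12),          # 1110zzzz
            0x80 | ((cp >> 6) & 0x3F),  # 10yyyyyy
            0x80 | (cp & 0x3F),         # 10xxxxxx
        ])
    return bytes([
        0xF0 | (cp >> 18),           # 11110www
        0x80 | ((cp >> 12) & 0x3F),  # 10zzzzzz
        0x80 | ((cp >> 6) & 0x3F),   # 10yyyyyy
        0x80 | (cp & 0x3F),          # 10xxxxxx
    ])
```

The result can be compared against Python's own codec, for instance `encode_code_point(ord(ch)) == ch.encode("utf-8")` for any non-surrogate character `ch`.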
Code point range       Binary code point           UTF-8 bytes
U+0000 to U+007F       0xxxxxxx                    0xxxxxxx
U+0080 to U+07FF       00000yyy yyxxxxxx           110yyyyy 10xxxxxx
U+0800 to U+FFFF       zzzzyyyy yyxxxxxx           1110zzzz 10yyyyyy 10xxxxxx
U+010000 to U+10FFFF   000wwwzz zzzzyyyy yyxxxxxx  11110www 10zzzzzz 10yyyyyy 10xxxxxx

Examples:
character '$' = code point U+0024 = 00100100 → 00100100 → hexadecimal 24
character '¢' = code point U+00A2 = 00000000 10100010 → 11000010 10100010 → hexadecimal C2 A2
character '€' = code point U+20AC = 00100000 10101100 → 11100010 10000010 10101100 → hexadecimal E2 82 AC
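Reading the table in the other direction gives a decoder for a single sequence. The sketch below is illustrative only (the name `decode_sequence` is hypothetical), and it deliberately omits the checks a production decoder needs, such as rejecting overlong forms and values above U+10FFFF.

```python
def decode_sequence(data: bytes) -> int:
    """Decode one complete UTF-8 sequence (1 to 4 bytes) into a code point."""
    if not data:
        raise ValueError("empty input")
    first = data[0]
    # The lead byte gives the length and the most-significant payload bits.
    if first < 0x80:
        length, cp = 1, first
    elif (first & 0xE0) == 0xC0:
        length, cp = 2, first & 0x1F
    elif (first & 0xF0) == 0xE0:
        length, cp = 3, first & 0x0F
    elif (first & 0xF8) == 0xF0:
        length, cp = 4, first & 0x07
    else:
        raise ValueError("not a valid lead byte")
    if len(data) != length:
        raise ValueError("sequence length does not match the lead byte")
    # Each continuation byte contributes its low six bits.
    for byte in data[1:]:
        if (byte & 0xC0) != 0x80:
            raise ValueError("expected a continuation byte")
        cp = (cp << 6) | (byte & 0x3F)
    return cp
```

Applied to the byte sequences 24, C2 A2, and E2 82 AC, it returns the code points U+0024, U+00A2, and U+20AC respectively.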