leonzhx

    Unicode was invented to overcome the limitations of traditional character encoding schemes. Before Unicode, there were many different standards: ASCII in the United States, ISO 8859-1 for Western European languages, KOI-8 for Russian, GB18030 and BIG-5 for Chinese, and so on. This caused two problems. First, a particular code value corresponds to different letters in the various encoding schemes. Second, the encodings for languages with large character sets have variable length: some common characters are encoded as single bytes, while others require two or more bytes.

    Unicode was designed to solve these problems. When the unification effort started in the 1980s, a fixed 2-byte code was more than sufficient to encode all characters used in all languages in the world, with room to spare for future expansion (or so everyone thought at the time). In 1991, Unicode 1.0 was released, using slightly less than half of the available 65,536 code values. Java was designed from the ground up to use 16-bit Unicode characters, which was a major advance over other programming languages that used 8-bit characters.

    Unfortunately, over time, the inevitable happened. Unicode grew beyond 65,536 characters, primarily because of the addition of a very large set of ideographs used for Chinese, Japanese, and Korean. Now the 16-bit char type is insufficient to describe all Unicode characters. We need a bit of terminology to explain how this problem is resolved in Java, beginning with Java SE 5.0.


    A code point is a code value that is associated with a character in an encoding scheme. In the Unicode standard, code points are written in hexadecimal and prefixed with U+, such as U+0041 for the code point of the letter A. Unicode has code points that are grouped into 17 code planes. The first code plane, called the basic multilingual plane, consists of the classic Unicode characters with code points U+0000 to U+FFFF. Sixteen additional planes, with code points U+10000 to U+10FFFF, hold the supplementary characters.
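To make the plane terminology concrete, here is a small Java sketch (the class name is mine; Character.isBmpCodePoint requires Java 7 or later):

```java
public class CodePoints {
    public static void main(String[] args) {
        // U+0041 (the letter A) lies in the basic multilingual plane
        System.out.println(Character.isBmpCodePoint(0x0041));             // true
        // U+1D56B lies in a supplementary plane (U+10000..U+10FFFF)
        System.out.println(Character.isSupplementaryCodePoint(0x1D56B));  // true
        // The highest valid Unicode code point is U+10FFFF
        System.out.println(Character.MAX_CODE_POINT == 0x10FFFF);         // true
    }
}
```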


    The UTF-16 encoding is a method of representing all Unicode code points in a variable-length code. The characters in the basic multilingual plane are represented as 16-bit values, called code units. The supplementary characters are encoded as consecutive pairs of code units. Each of the values in such an encoding pair falls into a range of 2048 unused values of the basic multilingual plane, called the surrogates area (U+D800 to U+DBFF for the first code unit, U+DC00 to U+DFFF for the second code unit). This is rather clever, because you can immediately tell whether a code unit encodes a single character or whether it is the first or second part of a supplementary character. For example, the mathematical double-struck letter 𝕫 has code point U+1D56B and is encoded by the two code units U+D835 and U+DD6B. (See http://en.wikipedia.org/wiki/UTF-16 for a description of the encoding algorithm.)
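Java exposes this encoding directly through the Character class. A minimal sketch (class name is mine) reproducing the U+1D56B example:

```java
public class SurrogatePairs {
    public static void main(String[] args) {
        // Encode U+1D56B; it is outside the BMP, so it needs two code units
        char[] units = Character.toChars(0x1D56B);
        System.out.printf("U+%04X U+%04X%n", (int) units[0], (int) units[1]);
        // prints: U+D835 U+DD6B

        // Each unit is classifiable on its own -- the surrogate ranges don't overlap
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true

        // Recombine the pair back into the original code point
        System.out.println(Character.toCodePoint(units[0], units[1]) == 0x1D56B); // true
    }
}
```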


Why do we need Unicode?

In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ in the same document (I hope I didn't break any old browsers).

But for argument's sake, let's say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but it is not fine for Joe the software developer. Approximately half the world uses non-Latin characters, and using ASCII is arguably inconsiderate to these people; on top of that, he is closing off his software to a large and growing economy.

Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence also ASCII. In addition, the vast majority of commonly used characters are representable by only two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set, and as the question asks, I will concentrate on UTF-8 and UTF-16.
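The Latin-1 compatibility claim is easy to check in Java (a sketch; class name is mine, and StandardCharsets requires Java 7+): a byte decoded as ISO-8859-1 maps to the Unicode code point with the same numeric value.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Compat {
    public static void main(String[] args) {
        // Byte 0xF1 is ñ in ISO-8859-1, and Unicode assigns ñ the same value, U+00F1
        String s = new String(new byte[] { (byte) 0xF1 }, StandardCharsets.ISO_8859_1);
        System.out.println(s.codePointAt(0) == 0x00F1); // true
    }
}
```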

Memory considerations

So how many bytes give access to what characters in these encodings?

  • UTF-8:
    • 1 byte: Standard ASCII
    • 2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
    • 3 bytes: BMP
    • 4 bytes: All Unicode characters
  • UTF-16:
    • 2 bytes: BMP
    • 4 bytes: All Unicode characters

It's worth mentioning now that characters not in the BMP include ancient scripts, mathematical symbols, musical symbols, and rarer Chinese/Japanese/Korean (CJK) characters.
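The byte counts listed above can be verified directly in Java; this sketch (class name mine) encodes one character from each tier. UTF_16BE is used here so the encoder does not prepend a byte order mark.

```java
import java.nio.charset.StandardCharsets;

public class EncodedLengths {
    public static void main(String[] args) {
        // "A" (ASCII), "é" (Latin-1), "中" (BMP CJK), "𝕫" (supplementary, U+1D56B)
        String[] samples = { "A", "\u00E9", "\u4E2D", new String(Character.toChars(0x1D56B)) };
        for (String s : samples) {
            System.out.printf("UTF-8: %d bytes, UTF-16: %d bytes%n",
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16BE).length);
        }
        // UTF-8:  1, 2, 3, 4 bytes respectively
        // UTF-16: 2, 2, 2, 4 bytes respectively
    }
}
```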

If you'll be working mostly with ASCII characters, then UTF-8 is certainly more memory efficient. However, if you're working mostly with non-European scripts, using UTF-8 could be up to 1.5 times less memory efficient than UTF-16. When dealing with large amounts of text, such as large web pages or lengthy Word documents, this could impact performance.

Encoding basics

Note: If you know how UTF-8 and UTF-16 are encoded, skip to the next section for practical applications.

  • UTF-8: For the standard ASCII (0-127) characters, the UTF-8 codes are identical. This makes UTF-8 ideal if backwards compatibility is required with existing ASCII text. Other characters require anywhere from 2-4 bytes. This is done by reserving some bits in each of these bytes to indicate that it is part of a multi-byte character. In particular, the first bit of each byte is 1 to avoid clashing with the ASCII characters.
  • UTF-16: For valid BMP characters, the UTF-16 representation is simply its code point. However, for non-BMP characters UTF-16 introduces surrogate pairs. In this case a combination of two two-byte portions map to a non-BMP character. These two-byte portions come from the BMP numeric range, but are guaranteed by the Unicode standard to be invalid as BMP characters. In addition, since UTF-16 has two bytes as its basic unit, it is affected by endianness. To compensate, a reserved byte order mark can be placed at the beginning of a data stream which indicates endianness. Thus, if you are reading UTF-16 input, and no endianness is specified, you must check for this.
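Both bullet points can be observed from Java (a sketch, class name mine): the byte order mark that Java's "UTF-16" charset writes when encoding, and the high bit that marks every byte of a UTF-8 multi-byte sequence.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteOrderMark {
    public static void main(String[] args) {
        // Java's "UTF-16" charset writes a big-endian BOM (FE FF) before the data
        byte[] bytes = "A".getBytes(StandardCharsets.UTF_16);
        System.out.println(Arrays.toString(bytes)); // [-2, -1, 0, 65]

        // UTF-8 leaves ASCII bytes unchanged; every byte of a multi-byte
        // sequence has its high bit set, so it can't clash with ASCII
        for (byte b : "\u00F1".getBytes(StandardCharsets.UTF_8)) {
            System.out.println((b & 0x80) != 0); // true, true
        }
    }
}
```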

As can be seen, UTF-8 and UTF-16 are nowhere near compatible with each other. So if you're doing I/O, make sure you know which encoding you are using! For further details on these encodings, please see the UTF FAQ.

Practical programming considerations

Character and String data types: How are they encoded in the programming language? If they are raw bytes, the minute you try to output non-ASCII characters, you may run into a few problems. Also, even if the character type is based on a UTF, that doesn't mean the strings are proper UTF. They may allow byte sequences that are illegal. Generally, you'll have to use a library that supports UTF, such as ICU for C, C++ and Java. In any case, if you want to input/output something other than the default encoding, you will have to convert it first.
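For example, a Java String is a sequence of UTF-16 code units and may legally contain an unpaired surrogate, which no UTF can represent; on encoding, the default behavior is to substitute a replacement byte. A sketch (class name mine):

```java
import java.nio.charset.StandardCharsets;

public class IllFormedStrings {
    public static void main(String[] args) {
        // A Java String may hold a lone high surrogate -- legal as a String,
        // but not well-formed UTF-16
        String lone = "\uD835";
        // String.getBytes replaces the unpaired surrogate (with '?' for UTF-8)
        byte[] utf8 = lone.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);    // 1
        System.out.println((char) utf8[0]); // ?
    }
}
```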

Recommended/default/dominant encodings: When given a choice of which UTF to use, it is usually best to follow recommended standards for the environment you are working in. For example, UTF-8 is dominant on the web, and since HTML5, it has been the recommended encoding. Conversely, both .NET and Java environments are founded on a UTF-16 character type. Confusingly (and incorrectly), references are often made to the "Unicode encoding", which usually refers to the dominant UTF encoding in a given environment.

Library support: Which encodings do the libraries you are using support? Do they support the corner cases? Since necessity is the mother of invention, UTF-8 libraries will generally support 4-byte characters properly, since 1-, 2-, and even 3-byte characters occur frequently. However, not all purported UTF-16 libraries support surrogate pairs properly, since they occur very rarely.
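Java itself illustrates the corner case: String.length() counts code units, so a supplementary character inflates the count unless you use the code-point-aware APIs. A sketch (class name mine):

```java
public class SupplementaryCounting {
    public static void main(String[] args) {
        // 𝕫 (U+1D56B) is stored as a surrogate pair, so naive length() over-counts
        String z = new String(Character.toChars(0x1D56B));
        System.out.println(z.length());                      // 2 code units
        System.out.println(z.codePointCount(0, z.length())); // 1 code point
    }
}
```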

Counting characters: There exist combining characters in Unicode. For example, the code points U+006E (n) and U+0303 (a combining tilde) together form ñ, while the single code point U+00F1 also forms ñ. They should look identical, but a simple counting algorithm will return 2 for the first example and 1 for the latter. This isn't necessarily wrong, but may not be the desired outcome either.
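In Java this difference is visible directly, and java.text.Normalizer can bring both forms to a canonical representation before counting or comparing (a sketch, class name mine):

```java
import java.text.Normalizer;

public class CombiningMarks {
    public static void main(String[] args) {
        String decomposed = "n\u0303";  // n + combining tilde
        String precomposed = "\u00F1";  // ñ as a single code point
        System.out.println(decomposed.length());            // 2
        System.out.println(precomposed.length());           // 1
        System.out.println(decomposed.equals(precomposed)); // false
        // Normalizing to NFC maps both spellings to the same representation
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                .equals(precomposed));                      // true
    }
}
```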

Comparing for equality: A, А, and Α look the same, but they're Latin, Cyrillic, and Greek respectively. You also have cases like C and Ⅽ, one is a letter, the other a Roman numeral. In addition, we have the combining characters to consider as well. For more info see Duplicate characters in Unicode.
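A quick Java sketch (class name mine) showing that such look-alikes compare unequal and live in different Unicode blocks:

```java
public class Confusables {
    public static void main(String[] args) {
        String latinA = "\u0041";    // LATIN CAPITAL LETTER A
        String cyrillicA = "\u0410"; // CYRILLIC CAPITAL LETTER A
        // They render identically but are distinct code points
        System.out.println(latinA.equals(cyrillicA)); // false
        System.out.println(Character.UnicodeBlock.of(latinA.codePointAt(0)));    // BASIC_LATIN
        System.out.println(Character.UnicodeBlock.of(cyrillicA.codePointAt(0))); // CYRILLIC
    }
}
```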

Surrogate pairs: These come up often enough in practice that they deserve special attention; the UTF FAQ mentioned above covers them in detail.

