Character information is based on the Unicode Standard, version 6.0.0.
The char data type (and therefore the value that a Character object encapsulates) is based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar values. (Refer to the definition of the U+n notation in the Unicode Standard.)
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
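As a minimal sketch of the surrogate-pair representation described above, the following example encodes a supplementary code point both through Character.toChars and by hand. The class name SurrogatePairSketch and its helper methods are hypothetical names chosen for illustration; the manual arithmetic follows the standard UTF-16 encoding rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```java
public class SurrogatePairSketch {

    // Encodes a code point as UTF-16 code units using the standard library.
    // Returns one char for a BMP code point, two for a supplementary one.
    public static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Hypothetical helper: derives the surrogate pair by hand to show the arithmetic.
    public static char[] manualSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));   // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x2F81A;
        char[] units = toUtf16(cp);
        // prints "U+2F81A -> D87E DC1A"
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) units[0], (int) units[1]);
        // Both checks print true: the first unit is a high surrogate, the second a low one.
        System.out.println(Character.isHighSurrogate(units[0]));
        System.out.println(Character.isLowSurrogate(units[1]));
        // Round trip back to the original code point; prints true.
        System.out.println(Character.toCodePoint(units[0], units[1]) == cp);
    }
}
```

In ordinary code, Character.toChars and Character.toCodePoint are usually preferable to the manual arithmetic, which is shown here only to make the two ranges concrete.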
A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero.
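The distinction between int values that are code points and those that are not can be sketched as follows. The class CodePointRangeSketch and its classify method are hypothetical; the Character methods and constants it calls are part of the standard API. Note that having the upper 11 bits clear is necessary but not sufficient: 0x110000 fits in 21 bits yet lies above U+10FFFF.

```java
public class CodePointRangeSketch {

    // Hypothetical helper: labels an int according to the ranges discussed above.
    public static String classify(int value) {
        if (!Character.isValidCodePoint(value)) {
            return "invalid";            // outside U+0000..U+10FFFF
        }
        if (Character.isSupplementaryCodePoint(value)) {
            return "supplementary";      // U+10000..U+10FFFF, needs two char values
        }
        if (value >= Character.MIN_SURROGATE && value <= Character.MAX_SURROGATE) {
            return "surrogate";          // U+D800..U+DFFF, a UTF-16 code unit
        }
        return "bmp";                    // any other code point up to U+FFFF
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xD800, 0x2F81A, 0x110000, -1 };
        for (int value : samples) {
            System.out.println(Integer.toHexString(value) + " -> " + classify(value));
        }
        // Number of char values needed per code point: prints 1, then 2.
        System.out.println(Character.charCount(0x41));
        System.out.println(Character.charCount(0x2F81A));
    }
}
```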
Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:
- The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value, if followed by any low-surrogate value in a string, would represent a letter.
- The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
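The two cases above can be checked directly. This is a small sketch using the same values as the text; the class name OverloadSketch and its wrapper methods are hypothetical and exist only to make the choice of overload explicit.

```java
public class OverloadSketch {

    // Calls the char overload: a lone surrogate is treated as an undefined character.
    public static boolean letterViaChar(char c) {
        return Character.isLetter(c);
    }

    // Calls the int overload: the argument is interpreted as a full code point.
    public static boolean letterViaInt(int codePoint) {
        return Character.isLetter(codePoint);
    }

    public static void main(String[] args) {
        char high = '\uD840';
        char low = '\uDC00';

        System.out.println(letterViaChar(high));      // prints false
        System.out.println(letterViaInt(0x2F81A));    // prints true

        // Combining the high surrogate with a low surrogate gives U+20000,
        // a CJK ideograph, which the int overload recognizes as a letter.
        int combined = Character.toCodePoint(high, low);
        System.out.println(Integer.toHexString(combined)); // prints 20000
        System.out.println(letterViaInt(combined));        // prints true
    }
}
```

One practical consequence is that code iterating over a String char by char and calling the char overloads may misclassify supplementary characters; passing code points to the int overloads avoids this.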
In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding. For more information on Unicode terminology, refer to the Unicode Glossary.
A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs. Index values refer to char code units, so a supplementary character uses two positions in a String.
The String class provides methods for dealing with Unicode code points (i.e., characters), in addition to those for dealing with Unicode code units (i.e., char values).
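To illustrate the difference between code-unit indexing and code-point handling in a String, here is a minimal sketch. The class StringCodePointSketch and its codePointsOf helper are hypothetical; length, codePointCount, codePointAt, and offsetByCodePoints are standard String methods, and the loop is written without newer stream APIs so that it also compiles on older Java versions.

```java
public class StringCodePointSketch {

    // Hypothetical helper: walks a String by code point rather than by char.
    public static int[] codePointsOf(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // reads one or two char values
            result[n++] = cp;
            i += Character.charCount(cp);    // advance by 1 or 2 code units
        }
        return result;
    }

    public static void main(String[] args) {
        // "a", then the supplementary character U+2F81A, then "b".
        String s = new StringBuilder()
                .append('a').appendCodePoint(0x2F81A).append('b').toString();

        System.out.println(s.length());                       // prints 4 (code units)
        System.out.println(s.codePointCount(0, s.length()));  // prints 3 (code points)

        // Index of the third code point, counted in code units; prints 3.
        System.out.println(s.offsetByCodePoints(0, 2));

        for (int cp : codePointsOf(s)) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```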