Characters are NOT the same as bytes.
The term character is a logical term (meaning it defines something in terms of the way people think of things). The term byte is a device term (meaning it defines something in terms of the way the hardware was designed). The difference is in the encoding.
Encoding
The character 'A' must be encoded into a certain bit pattern that the machine can use. This is a mostly arbitrary decision on the part of hardware implementors. You could say that A=1, B=2, and so on. For ASCII (the American Standard Code for Information Interchange), designers chose a 7-bit encoding (A=1000001), giving room for 128 different characters in those seven bits.
Since a byte is 8 bits on most hardware today, the hardware just pads the extra bit with a zero (A=01000001). So even in the case of simple ASCII, characters are not the same thing as bytes.
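As a small illustration of the 7-bit pattern and its zero-padded 8-bit form, here is a minimal Python sketch. The variable names are only for illustration.

```python
# The character 'A' as a number, as a 7-bit pattern, and as a stored byte.
code_point = ord("A")                 # the number assigned to 'A' in ASCII
seven_bit = format(code_point, "07b") # the 7-bit ASCII pattern
eight_bit = format(code_point, "08b") # the same value padded to a full byte
stored = "A".encode("ascii")          # what actually lands in memory or on disk

print(code_point, seven_bit, eight_bit, stored)
```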
ASCII reserves 33 encodings for control characters like TAB, NEWLINE, and FORMFEED (values 0-31, plus DELETE at 127), and the remaining 95 encodings are for printing characters like SPACE, !, @, #, $, A-Z, a-z, 0-9, and so on.
Beyond ASCII
What can the eighth bit be used for? It gives 128 more character encoding numbers. Each manufacturer has used the other 128 encodings above ASCII (values 128-255) for special printing characters, such as GRAY-BLOCK, N-WITH-TILDE, and YEN. They've been pretty horrible about consistency, though: IBM's GRAY-BLOCK character is not the same number as Commodore PET's GRAY-BLOCK character. Some computers have a YEN symbol while others don't.
ASCII is not the only way to encode characters into bit patterns. EBCDIC was a popular eight-bit encoding scheme used on some mainframes. If you transferred a file from one machine to another, you would have to convert each encoded character according to a look-up table, so A=193 (EBCDIC) would end up in the new file as A=65 (ASCII), or vice versa. Characters that the two encodings did not have in common would have to be dropped or assigned a replacement value, essentially destroying information.
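The table-driven conversion can be tried out with Python's built-in codecs. This is only a sketch: it assumes the `cp037` codec (one common EBCDIC variant) stands in for "EBCDIC" in general, and it uses the cent sign as an example of a character that EBCDIC has but ASCII lacks.

```python
# Assumption: cp037 is used here as a representative EBCDIC code page.
EBCDIC = "cp037"

# The same character has different byte values in the two encodings.
ebcdic_bytes = "A".encode(EBCDIC)   # one byte, value 193
ascii_bytes = ebcdic_bytes.decode(EBCDIC).encode("ascii")  # one byte, value 65

# A character with no ASCII equivalent must be dropped or replaced.
cent_in_ebcdic = "\u00a2".encode(EBCDIC)  # the cent sign exists in cp037
lossy = cent_in_ebcdic.decode(EBCDIC).encode("ascii", errors="replace")

print(ebcdic_bytes[0], ascii_bytes[0], lossy)
```

The `errors="replace"` policy substitutes a question mark, which is exactly the kind of information loss described above.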
There are special-purpose encodings for certain applications. On the DEC PDP-11, which usually interacted with the user in ASCII, another encoding called RADIX-50 was a common way to pack three letters or digits into two bytes. Its 40-symbol alphabet had room for almost no punctuation or special characters, so it was suited only for specific tasks. FORTRAN compilers often used it to pack six-letter variable names in less space. This also allowed "six-dot-three" filenames to fit in six-byte data records on the storage devices. "Eight-dot-three" filenames came later.
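The three-characters-in-two-bytes packing described above (commonly known as RADIX-50) works because 40 x 40 x 40 = 64,000, which fits in the 65,536 values of a 16-bit word. The sketch below assumes the usual PDP-11 symbol table (space, A-Z, $, ., an unused slot shown here as %, and 0-9) and little-endian storage; the function names are hypothetical.

```python
# Assumed 40-symbol table: index 0 is space, 1-26 are A-Z, then $ . % 0-9.
RAD50_CHARS = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789"

def pack3(text: str) -> bytes:
    """Pack up to three characters into one 16-bit word (two bytes)."""
    text = text.upper().ljust(3)
    if len(text) != 3:
        raise ValueError("at most three characters per word")
    value = 0
    for ch in text:
        # str.index raises ValueError for characters outside the table.
        value = value * 40 + RAD50_CHARS.index(ch)
    return value.to_bytes(2, "little")

def unpack3(data: bytes) -> str:
    """Recover the three characters from a packed two-byte word."""
    value = int.from_bytes(data, "little")
    chars = []
    for _ in range(3):
        value, index = divmod(value, 40)
        chars.append(RAD50_CHARS[index])
    return "".join(reversed(chars))

print(pack3("ABC").hex(), unpack3(pack3("ABC")))
```

A nine-character "six-dot-three" name is three such words, hence six bytes.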
Windows and MS-DOS offered code pages to help international users fit their most important data into one-byte character encodings. If you were using a Russian code page, then a byte value of 136 meant one thing, but it meant an entirely different thing in the Netherlands' code page. Web pages use a similar scheme: a page declares a character set such as ISO-8859-1 so that the browser knows how to interpret its bytes.
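To see the same byte mean two different things, here is a short Python sketch. It assumes `cp866` as the Russian DOS code page and `cp850` as a Western European one; the text does not say which code pages were meant, so both choices are illustrative.

```python
# One byte, value 136, interpreted under two different code pages.
raw = bytes([136])

russian = raw.decode("cp866")   # assumed Russian DOS code page
western = raw.decode("cp850")   # assumed Western European DOS code page

print(repr(russian), repr(western))
```

The byte never changes; only the table used to interpret it does.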
The problem of tying content to a particular character set gets worse in a global community. Unicode was designed to replace all that swapping around of character sets and make one canonical numbering. Every known character in every language would get its own permanent number.
Unicode
Unicode has well over a hundred thousand characters, from Hebrew to Arabic to Kanji to Cyrillic. It has more control characters to deal with the minimum text positioning requirements of various languages, such as the doubled layering of Japanese Kana over Kanji which helps readers with uncommon words.
The numbers range up to 1,114,111 (10FFFF in hexadecimal). You clearly can't encode those into single bytes anymore. Even if you don't have the which-language-is-supported character set problem, you still haven't gotten rid of the encoding problem. You never will: characters are human concepts and bytes are device concepts.
The brute force encoding would just take, say, four bytes for every character. You still would have to agree on whether the least-significant byte is written first or last, but there's room for every conceivable current and future Unicode character number. Oh, but it wastes a LOT of space, since the vast majority of commerce can still just fit in the ASCII range. Just saying hello takes twenty bytes.
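The four-bytes-per-character approach exists today as UTF-32. A minimal Python sketch of the size and byte-order points made above:

```python
text = "hello"

# Same characters, same numbers, two byte orders.
little_endian = text.encode("utf-32-le")  # least-significant byte first
big_endian = text.encode("utf-32-be")     # least-significant byte last

print(len(little_endian), little_endian[:4], big_endian[:4])
```

Using the plain `"utf-32"` codec instead would prepend a four-byte byte-order mark so that a reader can tell which order was used.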
To optimize the storage, a few specialized encodings have come up. The Unicode character numbers don't change, but the way the numbers are packed into bytes and bits does. Instead of four bytes per character, which was really excessive, let's do two bytes, and if it just happens to be one of those rare numbers that won't fit in two bytes, use a reserved range of two-byte values to indicate that the rest of the number follows in a second two-byte unit, for four bytes in total. This is the UTF-16 encoding. It's getting pretty complicated to encode or decode strings, but it saves a lot of space.
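In UTF-16, the two-byte scheme in actual use, a number too large for one 16-bit unit is split across two reserved units called a surrogate pair. The sketch below shows the arithmetic; the function name is hypothetical, and it does not reject the reserved surrogate range itself, which a real encoder would.

```python
def utf16_units(code_point: int) -> list[int]:
    """Split a Unicode number into one or two 16-bit units."""
    if code_point < 0x10000:
        return [code_point]              # fits in a single two-byte unit
    offset = code_point - 0x10000        # 20 bits remain to be stored
    high = 0xD800 + (offset >> 10)       # top 10 bits, in a reserved range
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits, in another range
    return [high, low]

units = utf16_units(0x1F600)             # an emoji beyond the two-byte range
as_bytes = b"".join(u.to_bytes(2, "big") for u in units)

print([hex(u) for u in units], as_bytes == "\U0001F600".encode("utf-16-be"))
```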
UTF-8
Currently, the most popular Unicode encoding is even tighter. It's called UTF-8. It allows any all-ASCII content to remain in the old one-byte-per-character encoding, completely unchanged. That's popular and efficient for all that ASCII content. The eighth (high) bit doesn't offer you 128 more characters; when it is set, it tells the decoder that this byte is part of a non-ASCII character and that more bits are in the next byte. Or the third byte. Or the fourth byte. The original design allowed sequences of up to six bytes, but since Unicode numbers stop at 10FFFF (hexadecimal), the current standard never needs more than four bytes for a single character.
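The bit layout can be made concrete with a hand-written encoder. This is a minimal sketch of the current four-byte-maximum form of UTF-8; the function name is hypothetical, and unlike a real encoder it does not reject the surrogate range D800-DFFF.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode number as 1 to 4 UTF-8 bytes."""
    if code_point < 0x80:        # 7 bits: plain ASCII, high bit clear
        return bytes([code_point])
    if code_point < 0x800:       # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:   # 21 bits: 11110xxx and three 10xxxxxx bytes
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("not a Unicode code point")

# Compare against Python's built-in encoder for 1-, 2-, 3- and 4-byte cases.
for ch in "A\u00e9\u20ac\U0001F600":
    print(hex(ord(ch)), utf8_encode(ord(ch)).hex(), ch.encode("utf-8").hex())
```

Every byte of a multi-byte sequence has its high bit set, so none of them can be mistaken for an ASCII character.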
UTF-8 takes advantage of the fact that Unicode assigns lower numbers to many of the most widely used characters, and lower numbers need fewer bytes. Rare or specialized characters, such as Egyptian hieroglyphs or cuneiform, take more space to store. (Klingon and Tolkien's Tengwar, often mentioned in this context, have no official Unicode assignments; they exist only as private-use conventions.)
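The size difference is easy to measure. The sample characters below are arbitrary picks, one per script, chosen only for illustration.

```python
# One sample character per script, with its UTF-8 size in bytes.
samples = {
    "Latin": "A",
    "Cyrillic": "\u0416",
    "CJK": "\u65e5",
    "Egyptian hieroglyph": "\U00013000",
    "Cuneiform": "\U00012000",
}

sizes = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, size in sizes.items():
    print(f"{name}: {size} byte(s)")
```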
Beyond UTF-8
There will still be application-specific encodings, even using Unicode's numbering scheme. Just like RADIX-50 was designed to store FORTRAN variable names and filenames in less space, an encoding called Punycode can encode just about any Unicode string using only the letters, digits, and hyphen allowed in a domain name registration record. The domain registrar may see "xn--egbpdaj6bu4bxfgehfvwxn.com" (the "xn--" prefix marks a Punycode-encoded label), but a Punycode-compliant decoder in a modern web browser would decode those characters and render a domain name spelled with flowing Egyptian Arabic characters.
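Python ships codecs for both the raw Punycode transformation and the domain-name form that adds the "xn--" prefix. The sketch below uses a German word rather than the Arabic example above, purely as an assumption-free illustration of the round trip.

```python
name = "b\u00fccher"  # "bucher" with u-umlaut, a common textbook example

# Raw Punycode: ASCII letters first, then the encoded non-ASCII part.
raw = name.encode("punycode")

# Domain-name form: each non-ASCII label gets the "xn--" prefix.
domain = (name + ".com").encode("idna")

# Decoding restores the original characters.
restored = raw.decode("punycode")

print(raw, domain, restored)
```

Python's built-in `idna` codec follows the older IDNA 2003 rules; stricter handling of newer domain rules would need a third-party library.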