Reading the String Source Code

leonzhx

1. The CharSequence interface defines a read-only sequence of char values. String implements CharSequence, Serializable, and Comparable<String>.

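2. A String is made up of four fields: char[] value, int offset, int count, and int hash. (This is the layout in the JDK version read here; JDK 7u6 and later dropped offset and count.)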

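3. The constructor public String(String original) is mostly useless: String is immutable, so there is no need to copy one. It does have one use, which is saving memory. For example, a String produced by substring reuses the original char[]. If you no longer need the original String (this condition matters), you can rebuild the String with this constructor so that the larger array can be released.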

public String(String original) {
        int size = original.count;
        char[] originalValue = original.value;
        char[] v;
        if (originalValue.length > size) {
            // The array representing the String is bigger than the new
            // String itself.  Perhaps this constructor is being called
            // in order to trim the baggage, so make a copy of the array.
            int off = original.offset;
            v = Arrays.copyOfRange(originalValue, off, off+size);
        } else {
            // The array representing the String is the same
            // size as the String, so no point in making a copy.
            v = originalValue;
        }
        this.offset = 0;
        this.count = size;
        this.value = v;
    }
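The use case from item 3 can be sketched roughly as follows. The class name SubstringTrimDemo and the helper detachedPrefix are hypothetical, and whether the copy saves any memory depends on the JDK: with the offset/count layout shown above substring shares the array, while later JDKs already copy inside substring.

```java
final class SubstringTrimDemo {
    // Hypothetical helper: keep the first n chars of a large string and
    // return a String that no longer depends on the large backing array.
    static String detachedPrefix(String large, int n) {
        String view = large.substring(0, n); // shares large's char[] on the JDK read here
        return new String(view);             // copy constructor: always a new String object
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100000; i++) {
            sb.append('x');
        }
        String prefix = detachedPrefix(sb.toString(), 3);
        System.out.println(prefix); // prints "xxx"
    }
}
```

Once the large string becomes unreachable, only the small copy keeps memory alive.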

4. A group of constructors decodes a byte array into a string using a given Charset.
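As a small illustration of item 4, the sketch below decodes the same two bytes with two different charsets. DecodeDemo and its helper names are made up for this example; StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
    // Decode the bytes as UTF-8.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decode the same bytes as ISO-8859-1, where every byte is one char.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 encoding of U+00E9
        System.out.println(decodeUtf8(bytes).length());   // prints 1
        System.out.println(decodeLatin1(bytes).length()); // prints 2
    }
}
```

The same bytes give different strings depending on the Charset, which is why passing it explicitly is usually the safer choice.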

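5. The constructors that take a StringBuilder or StringBuffer first call toString() on the argument, then assign the value, offset, and count of the resulting String to the current String. The char array is copied in the process, so later changes to the StringBuilder or StringBuffer do not affect the new String. But doesn't this create one extra String object?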
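The copy semantics from item 5 can be checked with a short sketch. BuilderCopyDemo and snapshotThenMutate are hypothetical names used only here.

```java
final class BuilderCopyDemo {
    // Take a String snapshot of the builder, then mutate the builder.
    static String snapshotThenMutate(StringBuilder sb) {
        String snapshot = new String(sb); // copies the builder's current chars
        sb.append("-changed");            // later edits touch only the builder
        return snapshot;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("abc");
        String snapshot = snapshotThenMutate(sb);
        System.out.println(snapshot); // prints "abc"
        System.out.println(sb);       // prints "abc-changed"
    }
}
```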

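6. length() returns count, that is, the number of code units rather than the number of code points. For a string containing supplementary characters, length() is therefore not the number of characters. Use codePointCount to get the number of code points.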
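To make item 6 concrete, here is a minimal sketch using one supplementary character (U+1F600, written as a surrogate pair). The class and method names are illustrative.

```java
final class CodePointDemo {
    // Number of UTF-16 code units, which is what length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00"; // 'a' followed by U+1F600
        System.out.println(codeUnits(s));  // prints 3
        System.out.println(codePoints(s)); // prints 2
    }
}
```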

7. getBytes encodes the string with the platform default Charset when no Charset is passed.
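Because the default Charset varies between machines, the encoded length can vary too. The sketch below passes the Charset explicitly instead; EncodeDemo and its helpers are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class EncodeDemo {
    // Byte length of s when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Byte length of s when encoded as big-endian UTF-16 (no byte order mark).
    static int utf16beLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // a single CJK character
        System.out.println(utf8Length(s));    // prints 3
        System.out.println(utf16beLength(s)); // prints 2
    }
}
```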

8. public boolean contentEquals(StringBuffer sb) synchronizes on sb so that it cannot be modified during the comparison:

public boolean contentEquals(StringBuffer sb) {
        synchronized(sb) {
            return contentEquals((CharSequence)sb);
        }
    }

9. public boolean contentEquals(CharSequence cs) special-cases StringBuilder and StringBuffer, which avoids the bounds check performed by every call to their charAt method:

public boolean contentEquals(CharSequence cs) {
        if (count != cs.length())
            return false;
        // Argument is a StringBuffer, StringBuilder
        if (cs instanceof AbstractStringBuilder) {
            char v1[] = value;
            char v2[] = ((AbstractStringBuilder)cs).getValue();
            int i = offset;
            int j = 0;
            int n = count;
            while (n-- != 0) {
                if (v1[i++] != v2[j++])
                    return false;
            }
            return true;
        }
        // Argument is a String
        if (cs.equals(this))
            return true;
        // Argument is a generic CharSequence
        char v1[] = value;
        int i = offset;
        int j = 0;
        int n = count;
        while (n-- != 0) {
            if (v1[i++] != cs.charAt(j++))
                return false;
        }
        return true;
    }
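From the caller's side, the two contentEquals overloads in items 8 and 9 behave as in this sketch. ContentEqualsDemo and sameChars are names invented for the example.

```java
final class ContentEqualsDemo {
    // Compare a String with any CharSequence by content only.
    static boolean sameChars(String s, CharSequence cs) {
        return s.contentEquals(cs);
    }

    public static void main(String[] args) {
        System.out.println(sameChars("abc", new StringBuilder("abc"))); // prints true
        System.out.println(sameChars("abc", new StringBuffer("abd")));  // prints false
        // equals() is type-sensitive, so a builder never equals a String.
        System.out.println("abc".equals(new StringBuilder("abc")));     // prints false
    }
}
```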

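10. Case-insensitive comparison is somewhat expensive, because each mismatching pair of characters may be converted to upper case and then to lower case before the method gives up: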

public boolean regionMatches(boolean ignoreCase, int toffset,
                           String other, int ooffset, int len) {
        char ta[] = value;
        int to = offset + toffset;
        char pa[] = other.value;
        int po = other.offset + ooffset;
        // Note: toffset, ooffset, or len might be near -1>>>1.
        if ((ooffset < 0) || (toffset < 0) || (toffset > (long)count - len) ||
                (ooffset > (long)other.count - len)) {
            return false;
        }
        while (len-- > 0) {
            char c1 = ta[to++];
            char c2 = pa[po++];
            if (c1 == c2) {
                continue;
            }
            if (ignoreCase) {
                // If characters don't match but case may be ignored,
                // try converting both characters to uppercase.
                // If the results match, then the comparison scan should
                // continue.
                char u1 = Character.toUpperCase(c1);
                char u2 = Character.toUpperCase(c2);
                if (u1 == u2) {
                    continue;
                }
                // Unfortunately, conversion to uppercase does not work properly
                // for the Georgian alphabet, which has strange rules about case
                // conversion.  So we need to make one last check before
                // exiting.
                if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                    continue;
                }
            }
            return false;
        }
        return true;
    }
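One way to use regionMatches with ignoreCase set is a case-insensitive suffix check. This is only a sketch; endsWithIgnoreCase is a hypothetical helper, not a JDK method.

```java
final class RegionMatchesDemo {
    // Hypothetical helper: true if s ends with suffix, ignoring case.
    static boolean endsWithIgnoreCase(String s, String suffix) {
        int start = s.length() - suffix.length();
        // A negative start makes regionMatches return false, so no extra check is needed.
        return s.regionMatches(true, start, suffix, 0, suffix.length());
    }

    public static void main(String[] args) {
        System.out.println(endsWithIgnoreCase("report.PDF", ".pdf")); // prints true
        System.out.println(endsWithIgnoreCase("report.PDF", ".doc")); // prints false
        System.out.println(endsWithIgnoreCase("a", "long"));          // prints false
    }
}
```

Compared with lower-casing both strings first, this allocates nothing and only converts characters that differ.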

11. How hashCode is computed (the result is cached in the hash field, with 0 meaning "not computed yet"):

public int hashCode() {
        int h = hash;
        if (h == 0 && count > 0) {
            int off = offset;
            char val[] = value;
            int len = count;

            for (int i = 0; i < len; i++) {
                h = 31*h + val[off++];
            }
            hash = h;
        }
        return h;
    }
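The loop above evaluates the polynomial s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] in int arithmetic. The sketch below recomputes it by hand; HashDemo and polyHash are illustrative names.

```java
final class HashDemo {
    // Recompute the String hash with the same 31-based polynomial.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(polyHash("ABC"));                      // prints 64578
        System.out.println(polyHash("ABC") == "ABC".hashCode());  // prints true
        // Different strings can collide: 65*31 + 97 == 66*31 + 66 == 2112.
        System.out.println(polyHash("Aa") == polyHash("BB"));     // prints true
    }
}
```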

12. String, StringBuilder, and StringBuffer share one piece of indexOf logic:

static int indexOf(char[] source, int sourceOffset, int sourceCount,
                       char[] target, int targetOffset, int targetCount,
                       int fromIndex) {
        if (fromIndex >= sourceCount) {
            return (targetCount == 0 ? sourceCount : -1);
        }
        if (fromIndex < 0) {
            fromIndex = 0;
        }
        if (targetCount == 0) {
            return fromIndex;
        }

        char first  = target[targetOffset];
        int max = sourceOffset + (sourceCount - targetCount);

        for (int i = sourceOffset + fromIndex; i <= max; i++) {
            /* Look for first character. */
            if (source[i] != first) {
                while (++i <= max && source[i] != first);
            }

            /* Found first character, now look at the rest of v2 */
            if (i <= max) {
                int j = i + 1;
                int end = j + targetCount - 1;
                for (int k = targetOffset + 1; j < end && source[j] ==
                         target[k]; j++, k++);

                if (j == end) {
                    /* Found whole string. */
                    return i - sourceOffset;
                }
            }
        }
        return -1;
    }

As the code shows, the only way to make indexOf(String substr, int fromIndex) return String.length() is to pass fromIndex >= String.length() together with an empty substring.
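The edge cases handled at the top of indexOf can be observed directly. In this sketch IndexOfEdgeDemo and find are hypothetical names; find only forwards to String.indexOf.

```java
final class IndexOfEdgeDemo {
    // Thin wrapper so the edge cases are easy to list side by side.
    static int find(String source, String target, int fromIndex) {
        return source.indexOf(target, fromIndex);
    }

    public static void main(String[] args) {
        System.out.println(find("abc", "", 3));   // prints 3: empty target at the end
        System.out.println(find("abc", "", 10));  // prints 3: fromIndex clamped to length
        System.out.println(find("abc", "", -5));  // prints 0: negative fromIndex treated as 0
        System.out.println(find("abc", "c", 3));  // prints -1: nothing left to search
    }
}
```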

13. lastIndexOf is written a little oddly:

static int lastIndexOf(char[] source, int sourceOffset, int sourceCount,
                           char[] target, int targetOffset, int targetCount,
                           int fromIndex) {
        /*
         * Check arguments; return immediately where possible. For
         * consistency, don't check for null str.
         */
        int rightIndex = sourceCount - targetCount;
        if (fromIndex < 0) {
            return -1;
        }
        if (fromIndex > rightIndex) {
            fromIndex = rightIndex;
        }
        /* Empty string always matches. */
        if (targetCount == 0) {
            return fromIndex;
        }

        int strLastIndex = targetOffset + targetCount - 1;
        char strLastChar = target[strLastIndex];
        int min = sourceOffset + targetCount - 1;
        int i = min + fromIndex;

    startSearchForLastChar:
        while (true) {
            while (i >= min && source[i] != strLastChar) {
                i--;
            }
            if (i < min) {
                return -1;
            }
            int j = i - 1;
            int start = j - (targetCount - 1);
            int k = strLastIndex - 1;

            while (j > start) {
                if (source[j--] != target[k--]) {
                    i--;
                    continue startSearchForLastChar;
                }
            }
            return start - sourceOffset + 1;
        }
    }

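I don't quite see why the comparison has to start from the last character. Even though we are looking for the last index, matching the target does not need to begin at its final character, and the code also uses a label, which is not pleasant to read. I tried writing my own version, only to discover that getting the logic airtight takes some effort: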

private static int myLastIndexOf(char[] source, int sourceOffset,
			int sourceCount, char[] target, int targetOffset, int targetCount,
			int fromIndex) {

		if (fromIndex < 0) {
			return -1;
		}

		int rightIndex = sourceCount - targetCount;

		if (fromIndex > rightIndex) {
			fromIndex = rightIndex;
		}

		if (targetCount == 0) {
			return fromIndex;
		}

		char firstChar = target[targetOffset];
		int max = targetOffset + targetCount;

		for (int i = sourceOffset + fromIndex; i >= sourceOffset; i--) {
			if (source[i] != firstChar) {
				while (--i >= sourceOffset && source[i] != firstChar)
					;
			}

			if (i >= sourceOffset) {
				int j = targetOffset + 1;
				for (int k = i + 1; j < max && source[k] == target[j]; k++, j++)
					;

				if (j == max) {
					return i - sourceOffset;
				}
			}

		}
		return -1;

	}
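Any rewrite has to reproduce the argument handling of the original, so it helps to pin down what the public String.lastIndexOf returns in the corner cases. A short sketch, with LastIndexOfEdgeDemo and findLast as made-up names:

```java
final class LastIndexOfEdgeDemo {
    // Thin wrapper around String.lastIndexOf(String, int).
    static int findLast(String source, String target, int fromIndex) {
        return source.lastIndexOf(target, fromIndex);
    }

    public static void main(String[] args) {
        System.out.println(findLast("abcabc", "bc", 5));  // prints 4: last match overall
        System.out.println(findLast("abcabc", "bc", 3));  // prints 1: match must start at index <= 3
        System.out.println(findLast("abcabc", "", 100));  // prints 6: empty target, fromIndex clamped
        System.out.println(findLast("abcabc", "a", -1));  // prints -1: negative fromIndex
    }
}
```

These four results are a reasonable minimum for comparing a hand-written version against the JDK one.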

14. substring reuses the original char[].

15. public String replace(CharSequence target, CharSequence replacement) matches purely literally. Among the replace methods, all except replaceFirst find and replace every match.
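The difference between literal and regex replacement shows up clearly when the target looks like a regex. A minimal sketch, where ReplaceDemo and its three helpers are hypothetical names:

```java
final class ReplaceDemo {
    // Literal match: the two characters '\' and 'w' are replaced as a unit.
    static String literal(String s) {
        return s.replace("\\w", "s");
    }

    // Regex match: \w matches each word character, every occurrence replaced.
    static String regexAll(String s) {
        return s.replaceAll("\\w", "s");
    }

    // Regex match, first occurrence only.
    static String regexFirst(String s) {
        return s.replaceFirst("\\w", "s");
    }

    public static void main(String[] args) {
        String input = "\\w\\w";               // the four characters \w\w
        System.out.println(literal(input));    // prints ss
        System.out.println(regexAll(input));   // prints \s\s
        System.out.println(regexFirst(input)); // prints \s\w
    }
}
```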

16. trim treats every character less than or equal to the space character (U+0020) as whitespace:

public String trim() {
        int len = count;
        int st = 0;
        int off = offset;      /* avoid getfield opcode */
        char[] val = value;    /* avoid getfield opcode */

        while ((st < len) && (val[off + st] <= ' ')) {
            st++;
        }
        while ((st < len) && (val[off + len - 1] <= ' ')) {
            len--;
        }
        return ((st > 0) || (len < count)) ? substring(st, len) : this;
    }
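A consequence of the `<= ' '` test is that control characters are trimmed while Unicode spaces above U+0020 are not. The sketch below shows both sides; TrimDemo and trimmed are illustrative names.

```java
final class TrimDemo {
    // Thin wrapper around String.trim().
    static String trimmed(String s) {
        return s.trim();
    }

    public static void main(String[] args) {
        // Control characters such as U+0001 and U+0000 count as whitespace for trim.
        System.out.println(trimmed("\u0001\t abc \n\u0000"));        // prints abc
        // U+00A0 (no-break space) and U+3000 (ideographic space) are above ' ' and stay.
        System.out.println(trimmed("\u00A0abc\u3000").length());     // prints 5
    }
}
```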

17. The Javadoc of the intern method is already precise:

Returns a canonical representation for the string object.

A pool of strings, initially empty, is maintained privately by the class String .

When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned.

It follows that for any two strings s and t , s.intern() == t.intern() is true if and only if s.equals(t) is true .

All literal strings and string-valued constant expressions are interned.
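The last sentence of the Javadoc, about literals and constant expressions, can be illustrated with reference comparisons. InternDemo and sameObject are names made up for this sketch.

```java
final class InternDemo {
    // Reference comparison, kept in a helper to make the intent explicit.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal = "ABC";
        String constant = "AB" + "C";      // constant expression, folded at compile time
        String built = new String("ABC");  // always a new object on the heap

        System.out.println(sameObject(literal, constant));       // prints true
        System.out.println(sameObject(literal, built));          // prints false
        System.out.println(sameObject(literal, built.intern())); // prints true
    }
}
```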

18. Finally, a test I wrote to check some of the points above:

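// Assumes JUnit 4 on the classpath; the class name StringTest is illustrative.
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class StringTest {
	@Test
	public void testString() {
		String str = new String("ABC");
		String str1 = new String("ABC");

		// new String(...) always creates distinct objects
		assertFalse(str == str1);
		// equal strings intern to the same pooled instance
		assertTrue(str.intern() == str1.intern());
		// empty target with fromIndex == length() returns length()
		assertEquals(str.length(), str.indexOf("", str.length()));
		// replace(CharSequence, CharSequence) is literal, not regex
		assertEquals("ss", "\\w\\w".replace((CharSequence)"\\w", "s"));
	}
}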


2013-01-11 16:08