Ruby Weekly Quiz - Truncating Mixed Chinese/English Strings

庄表伟 (senior member, Shanghai), posted 2008-06-11:

    def truncate_u(text, length = 30, truncate_string = "...")
      l = 0
      # "U*" decodes every UTF-8 character into its code point
      char_array = text.unpack("U*")
      char_array.each_with_index do |c, i|
        # ASCII counts as half a unit, anything else as a full unit
        l = l + (c < 127 ? 0.5 : 1)
        if l >= length
          return char_array[0..i].pack("U*") + (i < char_array.length - 1 ? truncate_string : "")
        end
      end
      return text
    end
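For readers on Ruby 1.9 or later, the same half-width counting can be sketched without pack/unpack. This is only a minimal sketch under stated assumptions: `truncate_by_width` and `omission` are hypothetical names that do not appear in the post, and it assumes every ASCII character counts as 0.5 and every other character as 1.

```ruby
# Hypothetical helper, not from the thread: width-weighted truncation.
# Assumption: ASCII characters weigh 0.5, all other characters weigh 1.
def truncate_by_width(text, length = 30, omission = "...")
  width = 0.0
  text.each_char.with_index do |ch, i|
    width += ch.ascii_only? ? 0.5 : 1.0
    if width >= length
      # Append the omission marker only if characters were actually dropped.
      return text[0..i] + (i < text.length - 1 ? omission : "")
    end
  end
  text
end

puts truncate_by_width("中文 and english", 6)  # 中文 and eng...
puts truncate_by_width("english string", 2)    # engl...
puts truncate_by_width("abc", 30)              # abc
```

Like the post's version, the loop stops on the first character that reaches the limit, so the kept text can overshoot the limit by half a unit when a wide character follows an odd number of ASCII characters.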

40hood (Xi'an), posted 2008-06-12:
This problem could actually be extended to truncating strings in Chinese, Japanese, Korean, and similar scripts...

QuakeWang (Shanghai), posted 2008-06-12:
Reading Lao Zhuang's (庄表伟) solution was the first time I learned about String's pack/unpack methods, haha. My solution uses a regular expression:

    def truncate_u(text, length = 30, truncate_string = "...")
      # Ruby 1.8: the 'n' argument makes the regex match raw bytes.
      # One unit is 1-2 single-byte characters or one 3-byte UTF-8 character.
      if r = Regexp.new("(?:(?:[^\xe0-\xef\x80-\xbf]{1,2})|(?:[\xe0-\xef][\x80-\xbf][\x80-\xbf])){#{length}}", true, 'n').match(text)
        r[0].length < text.length ? r[0] + truncate_string : r[0]
      else
        text
      end
    end

Compared with Lao Zhuang's solution it is much harder to read, but when length is small (<50) it performs somewhat better. Here is the benchmark code I used as well:

    require 'benchmark'

    test_suits = [
      ["english string", 2],
      ["中文字符串", 2],
      ["中文 and english", 6],
      ["中文 and english", 8],
      ["veryveryveryveryveryveryveryveryveryveryveryveryveryverylongstring", 20],
      ["很长verylong很长verylong很长verylong很长verylong很长很长很长很长很的字符串", 30]
    ]

    br = Benchmark.bmbm do |b|
      b.report("truncate_u benchmark") do
        5000.times { test_suits.each { |t| truncate_u(t[0], t[1]) } }
      end
    end
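On Ruby 1.9 and later, regular expressions match characters rather than bytes, so the unit-counting idea above can be sketched more directly. This is not the poster's code: `truncate_by_regex` and `omission` are hypothetical names, and the sketch assumes one unit is either one or two ASCII characters or a single non-ASCII character.

```ruby
# Hypothetical sketch, not the poster's code: unit counting with a
# character-aware regex (Ruby 1.9+).
# Assumption: a unit is 1-2 ASCII characters or one non-ASCII character.
def truncate_by_regex(text, length = 30, omission = "...")
  pattern = /\A(?:[\x00-\x7F]{1,2}|[^\x00-\x7F]){#{length}}/
  match = pattern.match(text)
  # Fewer than `length` units available: nothing to cut.
  return text unless match
  match[0].length < text.length ? match[0] + omission : match[0]
end

puts truncate_by_regex("中文 and english", 6)  # 中文 and eng...
puts truncate_by_regex("ab你c好d", 3)          # ab你c...
puts truncate_by_regex("abc", 30)              # abc
```

In this sketch a lone ASCII character sitting between two wide characters consumes a whole unit, so its results can differ from a strict 0.5-per-ASCII-character count on inputs such as "ab你c好d".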

庄表伟 (senior member, Shanghai), posted 2008-06-12:
String's unpack/pack are very powerful, and I only used one of their directives. I have not yet seen a reasonably complete introduction in Chinese, so I am pasting a section of the Ruby docs here in the hope that someone will translate it.

Decodes str (which may contain binary data) according to the format string, returning an array of each value extracted. The format string consists of a sequence of single-character directives, summarized in the table at the end of this entry. Each directive may be followed by a number, indicating the number of times to repeat with this directive. An asterisk (``*'') will use up all remaining elements. The directives sSiIlL may each be followed by an underscore (``_'') to use the underlying platform's native size for the specified type; otherwise, it uses a platform-independent consistent size. Spaces are ignored in the format string. See also Array#pack.

    "abc \0\0abc \0\0".unpack('A6Z6')   #=> ["abc", "abc "]
    "abc \0\0".unpack('a3a3')           #=> ["abc", " \000\000"]
    "abc \0abc \0".unpack('Z*Z*')       #=> ["abc ", "abc "]
    "aa".unpack('b8B8')                 #=> ["10000110", "01100001"]
    "aaa".unpack('h2H2c')               #=> ["16", "61", 97]
    "\xfe\xff\xfe\xff".unpack('sS')     #=> [-2, 65534]
    "now=20is".unpack('M*')             #=> ["now is"]
    "whole".unpack('xax2aX2aX1aX2a')    #=> ["h", "e", "l", "l", "o"]

This table summarizes the various formats and the Ruby classes returned by each.

Format | Returns | Function
-------+---------+-----------------------------------------
  A    | String  | with trailing nulls and spaces removed
-------+---------+-----------------------------------------
  a    | String  | string
-------+---------+-----------------------------------------
  B    | String  | extract bits from each character (msb first)
-------+---------+-----------------------------------------
  b    | String  | extract bits from each character (lsb first)
-------+---------+-----------------------------------------
  C    | Fixnum  | extract a character as an unsigned integer
-------+---------+-----------------------------------------
  c    | Fixnum  | extract a character as an integer
-------+---------+-----------------------------------------
  d,D  | Float   | treat sizeof(double) characters as
       |         | a native double
-------+---------+-----------------------------------------
  E    | Float   | treat sizeof(double) characters as
       |         | a double in little-endian byte order
-------+---------+-----------------------------------------
  e    | Float   | treat sizeof(float) characters as
       |         | a float in little-endian byte order
-------+---------+-----------------------------------------
  f,F  | Float   | treat sizeof(float) characters as
       |         | a native float
-------+---------+-----------------------------------------
  G    | Float   | treat sizeof(double) characters as
       |         | a double in network byte order
-------+---------+-----------------------------------------
  g    | Float   | treat sizeof(float) characters as a
       |         | float in network byte order
-------+---------+-----------------------------------------
  H    | String  | extract hex nibbles from each character
       |         | (most significant first)
-------+---------+-----------------------------------------
  h    | String  | extract hex nibbles from each character
       |         | (least significant first)
-------+---------+-----------------------------------------
  I    | Integer | treat sizeof(int) (modified by _)
       |         | successive characters as an unsigned
       |         | native integer
-------+---------+-----------------------------------------
  i    | Integer | treat sizeof(int) (modified by _)
       |         | successive characters as a signed
       |         | native integer
-------+---------+-----------------------------------------
  L    | Integer | treat four (modified by _) successive
       |         | characters as an unsigned native
       |         | long integer
-------+---------+-----------------------------------------
  l    | Integer | treat four (modified by _) successive
       |         | characters as a signed native
       |         | long integer
-------+---------+-----------------------------------------
  M    | String  | quoted-printable
-------+---------+-----------------------------------------
  m    | String  | base64-encoded
-------+---------+-----------------------------------------
  N    | Integer | treat four characters as an unsigned
       |         | long in network byte order
-------+---------+-----------------------------------------
  n    | Fixnum  | treat two characters as an unsigned
       |         | short in network byte order
-------+---------+-----------------------------------------
  P    | String  | treat sizeof(char *) characters as a
       |         | pointer, and return len characters
       |         | from the referenced location
-------+---------+-----------------------------------------
  p    | String  | treat sizeof(char *) characters as a
       |         | pointer to a null-terminated string
-------+---------+-----------------------------------------
  Q    | Integer | treat 8 characters as an unsigned
       |         | quad word (64 bits)
-------+---------+-----------------------------------------
  q    | Integer | treat 8 characters as a signed
       |         | quad word (64 bits)
-------+---------+-----------------------------------------
  S    | Fixnum  | treat two (different if _ used)
       |         | successive characters as an unsigned
       |         | short in native byte order
-------+---------+-----------------------------------------
  s    | Fixnum  | treat two (different if _ used)
       |         | successive characters as a signed short
       |         | in native byte order
-------+---------+-----------------------------------------
  U    | Integer | UTF-8 characters as unsigned integers
-------+---------+-----------------------------------------
  u    | String  | UU-encoded
-------+---------+-----------------------------------------
  V    | Fixnum  | treat four characters as an unsigned
       |         | long in little-endian byte order
-------+---------+-----------------------------------------
  v    | Fixnum  | treat two characters as an unsigned
       |         | short in little-endian byte order
-------+---------+-----------------------------------------
  w    | Integer | BER-compressed integer (see Array.pack)
-------+---------+-----------------------------------------
  X    | ---     | skip backward one character
-------+---------+-----------------------------------------
  x    | ---     | skip forward one character
-------+---------+-----------------------------------------
  Z    | String  | with trailing nulls removed
       |         | up to first null with *
-------+---------+-----------------------------------------
  @    | ---     | skip to the offset given by the
       |         | length argument
-------+---------+-----------------------------------------
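As a small illustration of the U directive from the table, the sketch below round-trips a mixed string through unpack and pack. The sample string and variable names are assumptions chosen for illustration; they are not from the thread.

```ruby
# Illustrative values only; the sample string is an assumption.
sample = "ab你好"

# "U*" decodes every UTF-8 character into its code point.
codepoints = sample.unpack("U*")
p codepoints                 # [97, 98, 20320, 22909]

# Without a count or asterisk, only the first character is decoded.
p sample.unpack("U")         # [97]

# Array#pack with "U*" rebuilds the UTF-8 string from the code points.
restored = codepoints.pack("U*")
p restored == sample         # true
```

The second call shows why the asterisk matters when the whole string has to be processed: a bare "U" consumes a single character.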

seemoon (Shanghai), posted 2008-06-12:
quake wang: your solution has some problems when tested with 'ab你c好d'.
Both experts' solutions are quite tricky, so I wrote a more low-level one:

    require 'stringio'
    $KCODE = "u"

    def truncate_u(text, length = 30, truncate_string = "...")
      return text if text.size <= length
      ios = StringIO.new(text)
      while c = ios.getc
        break if length <= 0
        if c > 127
          length -= 1
          ios.seek(ios.tell + 2)  # skip to next 'char'
        else
          length -= 0.5
        end
        cursor = ios.tell
      end
      if length < 0  # 1.5 happens!!!
        sub_str = text[0..(cursor - 4)]
      else
        sub_str = text[0..cursor - 1]
      end
      if sub_str.size < text.size
        sub_str << truncate_string
      else
        sub_str
      end
    end

The solution above was inspired by simohayha and by Lao Zhuang's 0.5 trick, and it works under UTF-8 encoding.
This quiz was really well chosen: I learned about StringIO, unpack, regular expressions, and benchmark. Well worth it....

nnnnon (Beijing), posted 2008-06-16:
With Ruby 1.9 this becomes especially simple:

    # -*- coding: utf-8 -*-
    puts "Once u你好pon a time in a world far far away"[0,15]
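The reason this works on Ruby 1.9 and later is that String indexing counts characters instead of bytes. A minimal sketch of the difference follows; the shortened sample string and the variable names are assumptions for illustration.

```ruby
# Illustrative sample; shortened from the string in the post.
text = "Once u你好pon a time"

puts text.length    # 18 (characters)
puts text.bytesize  # 22 (each Chinese character takes 3 bytes in UTF-8)

# Character-based slice: start index 0, 8 characters.
head = text[0, 8]
puts head           # Once u你好

# Byte-based slice for comparison: 6 ASCII bytes plus one 3-byte character.
puts text.byteslice(0, 9)  # Once u你
```

Note that `[0, n]` treats every character as one unit, so unlike the 0.5/1 weighting used earlier in the thread it does not account for Chinese characters being displayed wider than ASCII ones.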

hahahui (junior member), posted 2008-06-25:

    $KCODE = 'u'
    require 'jcode'
    require 'iconv'
    require 'benchmark'

    def truncate_u(text, length = 30, truncate_string = "...")
      return text << truncate_string if text.jsize <= length
      result = ""
      width = 0
      length = length * 2
      text.each_char { |c|
        if width < length
          if c.mbchar?
            result << c if width + 2 <= length
            width += 2
          else
            result << c
            width += 1
          end
        end
        if width >= length
          break
        end
      }
      result << truncate_string
    end

    puts truncate_u("Helloa中文aaabbbbbbbbb", 4)
    puts truncate_u("Helloworld", 4)
    puts truncate_u("He中文lloworld", 4)
    puts truncate_u("H中文中文elloworld", 4)
    puts truncate_u("H中", 4)

carlosbdw (Beijing), posted 2008-06-25:
sea gull wrote:
    With Ruby 1.9 this becomes especially simple:
    # -*- coding: utf-8 -*-
    puts "Once u你好pon a time in a world far far away"[0,15]
Could you also post the output?

nnnnon (Beijing), posted 2008-06-26:
carlosbdw wrote:
    Could you also post the output?

    ruby truncate_test.rb
    Once u你好pon a

sandybuster (junior member, Nanjing), posted 2008-06-27:
sea gull wrote:
    With Ruby 1.9 this becomes especially simple:
    # -*- coding: utf-8 -*-
    puts "Once u你好pon a time in a world far far away"[0,15]
Very good, very powerful. A nice option.
