Locked old thread. Topic: Ruby Weekly Quiz - truncating mixed Chinese and English strings
def truncate_u(text, length = 30, truncate_string = "...")
  l = 0
  char_array = text.unpack("U*")
  char_array.each_with_index do |c, i|
    # an ASCII code point counts as half a unit, anything else as a full unit
    l = l + (c < 127 ? 0.5 : 1)
    if l >= length
      return char_array[0..i].pack("U*") + (i < char_array.length - 1 ? truncate_string : "")
    end
  end
  return text
end
My solution uses a regular expression:

def truncate_u(text, length = 30, truncate_string = "...")
  if r = Regexp.new("(?:(?:[^\xe0-\xef\x80-\xbf]{1,2})|(?:[\xe0-\xef][\x80-\xbf][\x80-\xbf])){#{length}}", true, 'n').match(text)
    r[0].length < text.length ? r[0] + truncate_string : r[0]
  else
    text
  end
end

Compared with 老庄's solution it is much harder to read, but when length is fairly small (<50) it performs somewhat better. Here is the benchmark code I used as well:

require 'benchmark'

test_suits = [
  ["english string", 2],
  ["中文字符串", 2],
  ["中文 and english", 6],
  ["中文 and english", 8],
  ["veryveryveryveryveryveryveryveryveryveryveryveryveryverylongstring", 20],
  ["很长verylong很长verylong很长verylong很长verylong很长很长很长很长很的字符串", 30]
]

br = Benchmark.bmbm do |b|
  b.report("truncate_u benchmark") do
    5000.times { test_suits.each { |t| truncate_u(t[0], t[1]) } }
  end
end
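The regular expression in this post works on raw bytes, which is what Ruby 1.8 needed. As a rough sketch only, the same idea can be written at character level on Ruby 1.9 or later, where a regexp matches whole UTF-8 characters. The name `truncate_regex` is hypothetical and not from the thread; the sketch assumes a UTF-8 string and an integer `length`, and it treats one "unit" as either one non-ASCII character or up to two ASCII characters.

```ruby
# Hypothetical character-level variant of the regexp approach (Ruby 1.9+).
# One unit = one non-ASCII character, or one to two ASCII characters.
def truncate_regex(text, length = 30, truncate_string = "...")
  unit = /\A(?:[[:ascii:]]{1,2}|[^[:ascii:]]){0,#{Integer(length)}}/
  head = text[unit] # always matches, possibly as an empty string
  head.length < text.length ? head + truncate_string : text
end

puts truncate_regex("ab你c好d", 2)
puts truncate_regex("中文字符串", 2)
```

Note that a lone ASCII character followed by a Chinese character still consumes a whole unit in this pattern, so the result can be slightly shorter than a strict half-width count would allow.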
Decodes str (which may contain binary data) according to the format string, returning an array of each value extracted. The format string consists of a sequence of single-character directives, summarized in the table at the end of this entry. Each directive may be followed by a number, indicating the number of times to repeat with this directive. An asterisk ("*") will use up all remaining elements. The directives sSiIlL may each be followed by an underscore ("_") to use the underlying platform's native size for the specified type; otherwise, it uses a platform-independent consistent size. Spaces are ignored in the format string. See also Array#pack.
"abc \0\0abc \0\0".unpack('A6Z6') #=> ["abc", "abc "]
"abc \0\0".unpack('a3a3') #=> ["abc", " \000\000"]
"abc \0abc \0".unpack('Z*Z*') #=> ["abc ", "abc "]
"aa".unpack('b8B8') #=> ["10000110", "01100001"]
"aaa".unpack('h2H2c') #=> ["16", "61", 97]
"\xfe\xff\xfe\xff".unpack('sS') #=> [-2, 65534]
"now=20is".unpack('M*') #=> ["now is"]
"whole".unpack('xax2aX2aX1aX2a') #=> ["h", "e", "l", "l", "o"]
This table summarizes the various formats and the Ruby classes returned by each. Format | Returns | Function |
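The solutions in this thread rely on the `U` directive, which the quoted documentation does not illustrate. A minimal sketch, assuming Ruby 2.0 or later with a UTF-8 source file (the variable names are illustrative, not from the thread): `unpack("U*")` turns a UTF-8 string into an array of code points, and `pack("U*")` turns that array back into a UTF-8 string.

```ruby
text = "ab你c好d"

# Decode the UTF-8 string into an array of Unicode code points.
codepoints = text.unpack("U*")

# Re-encode the code points; with a leading U directive the result is UTF-8.
rebuilt = codepoints.pack("U*")

# Weighted width in the style used in this thread:
# ASCII counts as 0.5, everything else as 1.
width = codepoints.inject(0) { |acc, c| acc + (c < 128 ? 0.5 : 1) }

p codepoints
p rebuilt == text
p width
```

Working on code points rather than bytes is what lets the first solution cut the string without splitting a multi-byte character.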
quake wang: your solution has some problems when tested with 'ab你c好d'.
The two experts' solutions are quite tricky, so I wrote a more low-level one:

require 'stringio'
$KCODE = "u"

def truncate_u(text, length = 30, truncate_string = "...")
  return text if text.size <= length
  ios = StringIO.new(text)
  while c = ios.getc
    break if length <= 0
    if c > 127
      length -= 1
      ios.seek(ios.tell + 2) # skip to next 'char'
    else
      length -= 0.5
    end
    cursor = ios.tell
  end
  if length < 0 # 1.5 happens!!!
    sub_str = text[0..(cursor - 4)]
  else
    sub_str = text[0..cursor - 1]
  end
  if sub_str.size < text.size
    sub_str << truncate_string
  else
    sub_str
  end
end

The solution above was inspired by simohayha and by 老庄's 0.5 weighting, and it works under UTF-8 encoding. This quiz was really well chosen: I learned about StringIO, unpack, regular expressions, and Benchmark. Well worth it....
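For readers on Ruby 1.9 or later, the 0.5 weighting mentioned above can be sketched without byte arithmetic, because `each_char` already yields whole characters. This is only an illustration under stated assumptions: `truncate_by_width` is a hypothetical name, the input is assumed to be UTF-8, and widths are counted in integer half-units (ASCII = 1, anything else = 2) to avoid floating-point comparisons.

```ruby
# Hypothetical helper: truncate so the weighted width never exceeds `length`,
# where an ASCII character weighs half a unit and any other character one unit.
def truncate_by_width(text, length = 30, truncate_string = "...")
  budget = length * 2 # budget expressed in half-units
  used = 0
  kept = []
  text.each_char do |ch|
    cost = ch.ascii_only? ? 1 : 2
    break if used + cost > budget
    kept << ch
    used += cost
  end
  result = kept.join
  result.length < text.length ? result + truncate_string : result
end

puts truncate_by_width("ab你c好d", 2)
puts truncate_by_width("H中", 4)
```

Unlike the first solution in the thread, this sketch stops before the character that would exceed the budget, so the kept text never exceeds the requested width.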
#-*- coding:utf-8 -*-
puts "Once u你好pon a time in a world far far away"[0,15]
$KCODE = 'u'
require 'jcode'
require 'iconv'
require 'benchmark'

def truncate_u(text, length = 30, truncate_string = "...")
  return text << truncate_string if text.jsize <= length
  result = ""
  width = 0
  length = length * 2 # count in half-width units
  text.each_char { |c|
    if width < length
      if c.mbchar?
        result << c if width + 2 <= length
        width += 2
      else
        result << c
        width += 1
      end
    end
    if width >= length
      break
    end
  }
  result << truncate_string
end

puts truncate_u("Helloa中文aaabbbbbbbbb", 4)
puts truncate_u("Helloworld", 4)
puts truncate_u("He中文lloworld", 4)
puts truncate_u("H中文中文elloworld", 4)
puts truncate_u("H中", 4)
sea gull wrote: With Ruby 1.9 this is especially simple:
#-*- coding:utf-8 -*-
puts "Once u你好pon a time in a world far far away"[0,15]
Could you post the output as well?
carlosbdw wrote: sea gull wrote: With Ruby 1.9 this is especially simple:
#-*- coding:utf-8 -*-
puts "Once u你好pon a time in a world far far away"[0,15]
Could you post the output as well?
ruby truncate_test.rb
Once u你好pon a
sea gull wrote: With Ruby 1.9 this is especially simple:
#-*- coding:utf-8 -*-
puts "Once u你好pon a time in a world far far away"[0,15]
Very nice, very powerful. A good option.
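The one-liner works because, from Ruby 1.9 on, String indexing counts characters rather than bytes. A small sketch of the difference, assuming a UTF-8 source file (the variable names are illustrative only):

```ruby
s = "u你好pon"

char_count = s.length               # characters
byte_count = s.bytesize             # bytes: each Chinese character takes 3 in UTF-8
first_three = s[0, 3]               # slices by character
first_four_bytes = s.byteslice(0, 4) # slices by byte

p char_count
p byte_count
p first_three
p first_four_bytes
```

Note that `[0, 15]` counts every character as one, so it does not apply the half-width weighting for ASCII that the earlier solutions in this thread use.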