
Ruby Weekly Quiz - Truncating Mixed Chinese/English Strings

Posted: 2008-06-11
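def truncate_u(text, length = 30, truncate_string = "...")
  l = 0
  char_array = text.unpack("U*")   # UTF-8 string -> array of code points
  char_array.each_with_index do |c, i|
    l += (c < 127 ? 0.5 : 1)       # ASCII counts as half a width, anything else as one
    if l >= length
      # Append the suffix only if something was actually cut off.
      return char_array[0..i].pack("U*") + (i < char_array.length - 1 ? truncate_string : "")
    end
  end
  text
end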
Posted: 2008-06-12
This problem can actually be generalized to truncating strings in Chinese, Japanese, Korean, and similar scripts...
Posted: 2008-06-12
Reading 老庄's solution was the first time I learned about String's pack/unpack methods, heh.
My solution uses a regular expression:
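def truncate_u(text, length = 30, truncate_string = "...")
  # 'n' turns off multibyte handling, so the pattern matches raw bytes.
  # Each of the `length` units is either one or two bytes outside the
  # 3-byte-sequence ranges, or one complete 3-byte UTF-8 sequence.
  if r = Regexp.new("(?:(?:[^\xe0-\xef\x80-\xbf]{1,2})|(?:[\xe0-\xef][\x80-\xbf][\x80-\xbf])){#{length}}", true, 'n').match(text)
    r[0].length < text.length ? r[0] + truncate_string : r[0]
  else
    text
  end
end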

Compared with 老庄's solution it is much harder to read, but when length is small (<50) it performs somewhat better. Here is the benchmark code I used:
require 'benchmark'

test_suits = [
["english string", 2],
["中文字符串", 2],
["中文 and english", 6],
["中文 and english", 8],
["veryveryveryveryveryveryveryveryveryveryveryveryveryverylongstring", 20],
["很长verylong很长verylong很长verylong很长verylong很长很长很长很长很的字符串", 30]
]

br = Benchmark.bmbm do |b|
  b.report("truncate_u benchmark") do
    5000.times {
        test_suits.each {|t| truncate_u(t[0], t[1])}
    }
  end
end
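Posted: 2008-06-12

String's unpack/pack is very powerful; I only used one of its directives.

I have not yet seen a reasonably complete introduction to it in Chinese.

Here is an excerpt from the Ruby docs; hopefully someone will translate it.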

Decodes str (which may contain binary data) according to the format string, returning an array of each value extracted. The format string consists of a sequence of single-character directives, summarized in the table at the end of this entry. Each directive may be followed by a number, indicating the number of times to repeat with this directive. An asterisk ("*") will use up all remaining elements. The directives sSiIlL may each be followed by an underscore ("_") to use the underlying platform's native size for the specified type; otherwise, it uses a platform-independent consistent size. Spaces are ignored in the format string. See also Array#pack.


   "abc \0\0abc \0\0".unpack('A6Z6')   #=> ["abc", "abc "]
   "abc \0\0".unpack('a3a3')           #=> ["abc", " \000\000"]
   "abc \0abc \0".unpack('Z*Z*')       #=> ["abc ", "abc "]
   "aa".unpack('b8B8')                 #=> ["10000110", "01100001"]
   "aaa".unpack('h2H2c')               #=> ["16", "61", 97]
   "\xfe\xff\xfe\xff".unpack('sS')     #=> [-2, 65534]
   "now=20is".unpack('M*')             #=> ["now is"]
   "whole".unpack('xax2aX2aX1aX2a')    #=> ["h", "e", "l", "l", "o"]


This table summarizes the various formats and the Ruby classes returned by each.

   Format | Returns | Function
   -------+---------+-----------------------------------------
     A    | String  | with trailing nulls and spaces removed
   -------+---------+-----------------------------------------
     a    | String  | string
   -------+---------+-----------------------------------------
     B    | String  | extract bits from each character (msb first)
   -------+---------+-----------------------------------------
     b    | String  | extract bits from each character (lsb first)
   -------+---------+-----------------------------------------
     C    | Fixnum  | extract a character as an unsigned integer
   -------+---------+-----------------------------------------
     c    | Fixnum  | extract a character as an integer
   -------+---------+-----------------------------------------
     d,D  | Float   | treat sizeof(double) characters as
          |         | a native double
   -------+---------+-----------------------------------------
     E    | Float   | treat sizeof(double) characters as
          |         | a double in little-endian byte order
   -------+---------+-----------------------------------------
     e    | Float   | treat sizeof(float) characters as
          |         | a float in little-endian byte order
   -------+---------+-----------------------------------------
     f,F  | Float   | treat sizeof(float) characters as
          |         | a native float
   -------+---------+-----------------------------------------
     G    | Float   | treat sizeof(double) characters as
          |         | a double in network byte order
   -------+---------+-----------------------------------------
     g    | Float   | treat sizeof(float) characters as a
          |         | float in network byte order
   -------+---------+-----------------------------------------
     H    | String  | extract hex nibbles from each character
          |         | (most significant first)
   -------+---------+-----------------------------------------
     h    | String  | extract hex nibbles from each character
          |         | (least significant first)
   -------+---------+-----------------------------------------
     I    | Integer | treat sizeof(int) (modified by _)
          |         | successive characters as an unsigned
          |         | native integer
   -------+---------+-----------------------------------------
     i    | Integer | treat sizeof(int) (modified by _)
          |         | successive characters as a signed
          |         | native integer
   -------+---------+-----------------------------------------
     L    | Integer | treat four (modified by _) successive
          |         | characters as an unsigned native
          |         | long integer
   -------+---------+-----------------------------------------
     l    | Integer | treat four (modified by _) successive
          |         | characters as a signed native
          |         | long integer
   -------+---------+-----------------------------------------
     M    | String  | quoted-printable
   -------+---------+-----------------------------------------
     m    | String  | base64-encoded
   -------+---------+-----------------------------------------
     N    | Integer | treat four characters as an unsigned
          |         | long in network byte order
   -------+---------+-----------------------------------------
     n    | Fixnum  | treat two characters as an unsigned
          |         | short in network byte order
   -------+---------+-----------------------------------------
     P    | String  | treat sizeof(char *) characters as a
          |         | pointer, and return len characters
          |         | from the referenced location
   -------+---------+-----------------------------------------
     p    | String  | treat sizeof(char *) characters as a
          |         | pointer to a  null-terminated string
   -------+---------+-----------------------------------------
     Q    | Integer | treat 8 characters as an unsigned
          |         | quad word (64 bits)
   -------+---------+-----------------------------------------
     q    | Integer | treat 8 characters as a signed
          |         | quad word (64 bits)
   -------+---------+-----------------------------------------
     S    | Fixnum  | treat two (different if _ used)
          |         | successive characters as an unsigned
          |         | short in native byte order
   -------+---------+-----------------------------------------
     s    | Fixnum  | Treat two (different if _ used)
          |         | successive characters as a signed short
          |         | in native byte order
   -------+---------+-----------------------------------------
     U    | Integer | UTF-8 characters as unsigned integers
   -------+---------+-----------------------------------------
     u    | String  | UU-encoded
   -------+---------+-----------------------------------------
     V    | Fixnum  | treat four characters as an unsigned
          |         | long in little-endian byte order
   -------+---------+-----------------------------------------
     v    | Fixnum  | treat two characters as an unsigned
          |         | short in little-endian byte order
   -------+---------+-----------------------------------------
     w    | Integer | BER-compressed integer (see Array.pack)
   -------+---------+-----------------------------------------
     X    | ---     | skip backward one character
   -------+---------+-----------------------------------------
     x    | ---     | skip forward one character
   -------+---------+-----------------------------------------
     Z    | String  | with trailing nulls removed
          |         | up to first null with *
   -------+---------+-----------------------------------------
     @    | ---     | skip to the offset given by the
          |         | length argument
   -------+---------+-----------------------------------------
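For the quiz in this thread, the 'U' directive is the relevant one. The following is a minimal sketch, assuming a modern Ruby (1.9 or later) and a UTF-8 source file, of a round trip through code points; the variable names are illustrative and not part of the quoted documentation.

```ruby
# encoding: utf-8
# Round-trip a UTF-8 string through its code points with the 'U' directive.
text = "ab你c好d"

codepoints  = text.unpack("U*")           # one Integer per character
first_three = codepoints[0, 3].pack("U*") # back to a UTF-8 string

# Code points below 128 are ASCII; this sketch treats everything else as "wide".
wide_count = codepoints.count { |c| c >= 128 }

puts codepoints.inspect
puts first_three
puts wide_count
```

Because unpack("U*") yields one integer per character rather than per byte, slicing the array never splits a multibyte character.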

Posted: 2008-06-12
quake wang: your solution has some problems when tested with 'ab你c好d'.

Both experts' solutions are quite tricky, so I wrote a more low-level one:

require 'stringio'
$KCODE = "u" 

def truncate_u(text, length = 30, truncate_string ="...")
  
  return text if text.size<=length
  
  ios=StringIO.new(text)
  while c=ios.getc
    break if length<=0
    if c>127
      length-=1
      ios.seek(ios.tell+2) #skip to next 'char'
    else
      length-=0.5
    end
    cursor=ios.tell
  end
  
  if length<0 #1.5 happens!!!
    sub_str=text[0..(cursor-4)]
  else
    sub_str=text[0..cursor-1]
  end
  
  if sub_str.size<text.size
    sub_str << truncate_string
  else
    sub_str
  end
end


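The solution above was inspired by simohayha and by 老庄's 0.5 weighting; it works under UTF-8 encoding.
This quiz was really well chosen. I learned about StringIO, unpack, regular expressions, and Benchmark. Well worth it...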
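The 0.5 weighting idea can also be sketched without byte-level cursor arithmetic. The helper below is hypothetical (the name truncate_by_width and the parameter omission are not from this thread) and assumes Ruby 1.9 or later, where each_char yields whole characters. Unlike the solutions above, it stops before the character that would exceed the budget; that is a design choice, not the quiz's required behavior.

```ruby
# encoding: utf-8
# Hypothetical helper, not from the thread: width-weighted truncation.
# Assumption: ASCII counts as 0.5 and every other character as 1.0, which
# ignores combining marks and half-width non-ASCII forms.
def truncate_by_width(text, length = 30, omission = "...")
  width = 0.0
  text.each_char.with_index do |ch, i|
    width += ch.ord < 128 ? 0.5 : 1.0
    # Cut before the first character that pushes the width past the budget.
    return text[0, i] + omission if width > length
  end
  text # everything fit, so nothing is appended
end

puts truncate_by_width("ab你c好d", 2)        # ab你...
puts truncate_by_width("中文", 2)            # 中文
puts truncate_by_width("english string", 2) # engl...
```

The weights are multiples of 0.5, so the floating-point sums stay exact and the comparison against length is reliable.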
Posted: 2008-06-16
With Ruby 1.9 this becomes especially simple:

#-*- coding:utf-8 -*-
puts "Once u你好pon a time in a world far far away"[0,15]
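As a small illustration of why this works: in Ruby 1.9 and later, String#[] and String#length operate on characters, while bytesize and byteslice operate on bytes. The sketch below assumes a UTF-8 source file; note that character counts are still not display widths, so a Chinese character and an ASCII letter each count as one.

```ruby
# encoding: utf-8
s = "Once u你好pon a time"

char_count = s.length          # counts characters: 18
byte_count = s.bytesize        # counts UTF-8 bytes: 22
by_chars   = s[0, 8]           # character-based slice: "Once u你好"
by_bytes   = s.byteslice(0, 8) # byte-based slice, cuts 你 in the middle

puts char_count
puts byte_count
puts by_chars
puts by_bytes.valid_encoding?  # false, the last character is incomplete
```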

Posted: 2008-06-25
$KCODE='u'
require 'jcode'
require 'iconv' 
require 'benchmark'

def truncate_u(text, length = 30, truncate_string = "...") 
  return text if text.jsize<=length  # already short enough: return it unchanged
  result = ""
  width = 0
  length = length*2
  text.each_char { |c|
    if width<length
      if c.mbchar?
        result<<c if width+2<=length
        width+=2
      else
        result<<c
        width+=1
      end
    end
    if width>=length
      break
    end
  } 
  result<<truncate_string
end

puts truncate_u("Helloa中文aaabbbbbbbbb",4)
puts truncate_u("Helloworld",4)
puts truncate_u("He中文lloworld",4)
puts truncate_u("H中文中文elloworld",4)
puts truncate_u("H中",4)


Posted: 2008-06-25
sea gull wrote:
With Ruby 1.9 this becomes especially simple:

#-*- coding:utf-8 -*-
puts "Once u你好pon a time in a world far far away"[0,15]




Could you post the output as well?

Posted: 2008-06-26
carlosbdw wrote:
sea gull wrote:
With Ruby 1.9 this becomes especially simple:

#-*- coding:utf-8 -*-
puts "Once u你好pon a time in a world far far away"[0,15]




Could you post the output as well?



ruby truncate_test.rb

Once u你好pon a
Posted: 2008-06-27
sea gull wrote:
With Ruby 1.9 this becomes especially simple:

#-*- coding:utf-8 -*-
puts "Once u你好pon a time in a world far far away"[0,15]


Very nice, very powerful. A good option.