gelongmei
[awk] Common Awk String-Handling Functions

 

gsub(regexp, replacement [, target])
Search target for all of the longest, leftmost, nonoverlapping matching substrings it can find and replace them with replacement. The ‘g’ in gsub() stands for “global,” which means replace everywhere. For example:
          { gsub(/Britain/, "United Kingdom"); print }
replaces all occurrences of the string ‘Britain’ with ‘United Kingdom’ for all input records.
The gsub() function returns the number of substitutions made. If the variable to search and alter (target) is omitted, then the entire input record ($0) is used. As in sub(), the characters ‘&’ and ‘\’ are special, and the third argument must be assignable.
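The return value and the explicit target can be seen together in a small sketch. The variable names (s, n, gsub_demo) and the sample string are made up for illustration; only POSIX awk features are assumed:

```shell
# Bracket every run of digits in the variable s; gsub() returns the count.
gsub_demo=$(awk 'BEGIN {
    s = "a1b22c333"
    n = gsub(/[0-9]+/, "[&]", s)   # "&" stands for the matched text
    print n, s
}')
echo "$gsub_demo"   # prints: 3 a[1]b[22]c[333]
```

Because a target is given, $0 is left untouched; only s is modified.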

index(in, find)
Search the string in for the first occurrence of the string find, and return the position in characters where that occurrence begins in the string in. Consider the following example:
          $ awk 'BEGIN { print index("peanut", "an") }'
          -| 3
If find is not found, index() returns zero. (Remember that string indices in awk start at one.)
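Since zero means "not found" and positions start at one, the result of index() can be used directly in a test. A minimal sketch, with an illustrative sample string and variable names:

```shell
# index() returns a 1-based position, or 0 when the substring is absent.
index_demo=$(awk 'BEGIN {
    s = "peanut"
    print index(s, "nut"), index(s, "x")
    if (index(s, "pea") == 1)
        print "starts with pea"
}')
echo "$index_demo"
```

The first line of output is "4 0"; the second confirms the prefix test.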

length([string])
Return the number of characters in string. If string is a number, the length of the digit string representing that number is returned. For example, length("abcde") is five. By contrast, length(15 * 35) works out to three. In this example, 15 * 35 = 525, and 525 is then converted to the string "525", which has three characters.
If no argument is supplied, length() returns the length of $0.
NOTE: In older versions of awk, the length() function could be called without any parentheses. Doing so is considered poor practice, although the 2008 POSIX standard explicitly allows it, to support historical practice. For programs to be maximally portable, always supply the parentheses.
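As a sketch of the no-argument form (written with parentheses, as recommended above), the following prints only the input lines longer than ten characters. The sample input and the threshold are arbitrary:

```shell
# length() with no argument measures $0; length(15 * 35) measures "525".
length_demo=$(printf '%s\n' 'short' 'a much longer line' |
    awk 'length() > 10 { print length() ": " $0 }
         END { print length(15 * 35) }')
echo "$length_demo"
```

This prints "18: a much longer line" followed by "3".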
If length() is called with a variable that has not been used, gawk forces the variable to be a scalar. Other implementations of awk leave the variable without a type. (d.c.) Consider:

          $ gawk 'BEGIN { print length(x) ; x[1] = 1 }'
          -| 0
          error--> gawk: fatal: attempt to use scalar `x' as array
         
          $ nawk 'BEGIN { print length(x) ; x[1] = 1 }'
          -| 0
If --lint has been specified on the command line, gawk issues a warning about this.

With gawk and several other awk implementations, when given an array argument, the length() function returns the number of elements in the array. (c.e.) This is less useful than it might seem at first, as the array is not guaranteed to be indexed from one to the number of elements in it. If --lint is provided on the command line (see Options), gawk warns that passing an array argument is not portable. If --posix is supplied, using an array argument is a fatal error.

match(string, regexp [, array])
Search string for the longest, leftmost substring matched by the regular expression regexp, and return the character position, or index, at which that substring begins (one, if it starts at the beginning of string). If no match is found, return zero.
The regexp argument may be either a regexp constant (/.../) or a string constant ("..."). In the latter case, the string is treated as a regexp to be matched.
The order of the first two arguments is backwards from most other string functions that work with regular expressions, such as sub() and gsub(). It might help to remember that for match(), the order is the same as for the ‘~’ operator: ‘string ~ regexp’.
The match() function sets the built-in variable RSTART to the index. It also sets the built-in variable RLENGTH to the length in characters of the matched substring. If no match is found, RSTART is set to zero, and RLENGTH to -1.
For example:
          {
               if ($1 == "FIND")
                    regex = $2
               else {
                    where = match($0, regex)
                    if (where != 0)
                         print "Match of", regex, "found at",
                                   where, "in", $0
               }
          }
This program looks for lines that match the regular expression stored in the variable regex. This regular expression can be changed. If the first word on a line is ‘FIND’, regex is changed to be the second word on that line. Therefore, if given:
          FIND ru+n
          My program runs
          but not very quickly
          FIND Melvin
          JF+KM
          This line is property of Reality Engineering Co.
          Melvin was here.    
awk prints:
          Match of ru+n found at 12 in My program runs
          Match of Melvin found at 1 in Melvin was here.
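RSTART and RLENGTH are most often used together with substr() to pull out the matched text. The following is a minimal sketch using only POSIX awk; the input lines and the variable name match_demo are invented for illustration:

```shell
# Extract the first run of digits from each line, if there is one.
match_demo=$(printf '%s\n' 'order id=4711 shipped' 'no id here' | awk '{
    if (match($0, /[0-9]+/))
        print substr($0, RSTART, RLENGTH), RSTART, RLENGTH
    else
        print "none", RSTART, RLENGTH
}')
echo "$match_demo"
```

The first line yields "4711 10 4"; the second yields "none 0 -1", showing the values RSTART and RLENGTH take after a failed match.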
If the optional third argument array is present (this argument is a gawk extension), it is cleared, and then the zeroth element of array is set to the entire portion of string matched by regexp. If regexp contains parentheses, the integer-indexed elements of array are set to contain the portion of string matching the corresponding parenthesized subexpression. For example:

          $ echo foooobazbarrrrr |
          > gawk '{ match($0, /(fo+).+(bar*)/, arr)
          >         print arr[1], arr[2] }'
          -| foooo barrrrr
In addition, multidimensional subscripts are available providing the start index and length of each matched subexpression:
          $ echo foooobazbarrrrr |
          > gawk '{ match($0, /(fo+).+(bar*)/, arr)
          >           print arr[1], arr[2]
          >           print arr[1, "start"], arr[1, "length"]
          >           print arr[2, "start"], arr[2, "length"]
          > }'
          -| foooo barrrrr
          -| 1 5
          -| 9 7
There may not be subscripts for the start and length of every parenthesized subexpression, since they may not all have matched text; thus they should be tested for with the in operator (see Reference to Elements).


split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string (much as FS can be a regexp describing where to split input records; see Regexp Field Splitting). If fieldsep is omitted, the value of FS is used. split() returns the number of elements created. seps is a gawk extension with seps[i] being the separator string between array[i] and array[i+1]. If fieldsep is a single space then any leading whitespace goes into seps[0] and any trailing whitespace goes into seps[n] where n is the return value of split() (that is, the number of elements in array).
The split() function splits strings into pieces in a manner similar to the way input lines are split into fields. For example:

          split("cul-de-sac", a, "-", seps)
splits the string ‘cul-de-sac’ into three fields using ‘-’ as the separator. It sets the contents of the array a as follows:
          a[1] = "cul"
          a[2] = "de"
          a[3] = "sac"
and sets the contents of the array seps as follows:

          seps[1] = "-"
          seps[2] = "-"
The value returned by this call to split() is three.

As with input field-splitting, when the value of fieldsep is " ", leading and trailing whitespace is ignored in values assigned to the elements of array but not in seps, and the elements are separated by runs of whitespace. Also as with input field-splitting, if fieldsep is the null string, each individual character in the string is split into its own array element. (c.e.)

Note, however, that RS has no effect on the way split() works. Even though ‘RS = ""’ causes newline to also be an input field separator, this does not affect how split() splits strings.

Modern implementations of awk, including gawk, allow the third argument to be a regexp constant (/abc/) as well as a string. (d.c.) The POSIX standard allows this as well. See Computed Regexps, for a discussion of the difference between using a string constant or a regexp constant, and the implications for writing your program correctly.

Before splitting the string, split() deletes any previously existing elements in the arrays array and seps.

If string is null, the array has no elements. (So this is a portable way to delete an entire array with one statement. See Delete.)

If string does not match fieldsep at all (but is not null), array has one element only. The value of that element is the original string.
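The three cases described above (a normal split, a string containing no separator, and a null string) can be checked with a short sketch. It avoids the gawk-only seps argument so that it should run in any POSIX awk; the sample strings and array names are illustrative:

```shell
split_demo=$(awk 'BEGIN {
    n = split("2024-01-15", parts, "-")
    print n, parts[1], parts[3]
    n = split("no separators", one, "-")   # no match: one element, the whole string
    print n, one[1]
    n = split("", parts)                   # null string: the array is emptied
    print n
}')
echo "$split_demo"
```

The output is "3 2024 15", then "1 no separators", then "0".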

sprintf(format, expression1, ...)
Return (without printing) the string that printf would have printed out with the same arguments (see Printf). For example:
          pival = sprintf("pi = %.2f (approx.)", 22/7)
assigns the string ‘pi = 3.14 (approx.)’ to the variable pival.
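sprintf() is also handy for building padded or aligned strings before they are printed or stored. A small sketch, with made-up values and variable names, assuming a locale whose decimal point is ".":

```shell
sprintf_demo=$(awk 'BEGIN {
    id  = sprintf("%05d", 42)                     # zero-padded to width 5
    row = sprintf("%-6s|%6.1f", "temp", 21.456)   # left-justified label, right-justified number
    print id
    print row
}')
echo "$sprintf_demo"
```

This prints "00042" and then "temp  |  21.5".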


sub(regexp, replacement [, target])
Search target, which is treated as a string, for the leftmost, longest substring matched by the regular expression regexp. Modify the entire string by replacing the matched text with replacement. The modified string becomes the new value of target. Return the number of substitutions made (zero or one).
The regexp argument may be either a regexp constant (/.../) or a string constant ("..."). In the latter case, the string is treated as a regexp to be matched. See Computed Regexps, for a discussion of the difference between the two forms, and the implications for writing your program correctly.
This function is peculiar because target is not simply used to compute a value, and not just any expression will do: it must be a variable, field, or array element so that sub() can store a modified value there. If this argument is omitted, then the default is to use and alter $0. For example:
          str = "water, water, everywhere"
          sub(/at/, "ith", str)
sets str to ‘wither, water, everywhere’, by replacing the leftmost longest occurrence of ‘at’ with ‘ith’.

If the special character ‘&’ appears in replacement, it stands for the precise substring that was matched by regexp. (If the regexp can match more than one string, then this precise substring may vary.) For example:

          { sub(/candidate/, "& and his wife"); print }
changes the first occurrence of ‘candidate’ to ‘candidate and his wife’ on each input line. Here is another example:

          $ awk 'BEGIN {
          >         str = "daabaaa"
          >         sub(/a+/, "C&C", str)
          >         print str
          > }'
          -| dCaaCbaaa
This shows how ‘&’ can represent a nonconstant string and also illustrates the “leftmost, longest” rule in regexp matching (see Leftmost Longest).

The effect of this special character (‘&’) can be turned off by putting a backslash before it in the string. As usual, to insert one backslash in the string, you must write two backslashes. Therefore, write ‘\\&’ in a string constant to include a literal ‘&’ in the replacement. For example, the following shows how to replace the first ‘|’ on each line with an ‘&’:

          { sub(/\|/, "\\&"); print }
As mentioned, the third argument to sub() must be a variable, field or array element. Some versions of awk allow the third argument to be an expression that is not an lvalue. In such a case, sub() still searches for the pattern and returns zero or one, but the result of the substitution (if any) is thrown away because there is no place to put it. Such versions of awk accept expressions like the following:

          sub(/USA/, "United States", "the USA and Canada")
For historical compatibility, gawk accepts such erroneous code. However, using any other nonchangeable object as the third parameter causes a fatal error and your program will not run.

Finally, if the regexp is not a regexp constant, it is converted into a string, and then the value of that string is treated as the regexp to match.
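The points about string-valued regexps and the escaped ‘&’ can be combined in one sketch. The variables pat, s, and t are hypothetical names chosen for this example:

```shell
sub_demo=$(awk 'BEGIN {
    pat = "[0-9]+"                 # a string whose value is used as the regexp
    s = "room 12, floor 3"
    n = sub(pat, "#", s)           # only the first match is replaced
    print n, s
    t = "R and D"
    sub(/and/, "\\&", t)           # "\\&" yields a literal ampersand
    print t
}')
echo "$sub_demo"
```

The output is "1 room #, floor 3" followed by "R & D". Replacing every match instead would call for gsub().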

substr(string, start [, length])
Return a length-character-long substring of string, starting at character number start. The first character of a string is character number one. For example, substr("washington", 5, 3) returns "ing".
If length is not present, substr() returns the whole suffix of string that begins at character number start. For example, substr("washington", 5) returns "ington". The whole suffix is also returned if length is greater than the number of characters remaining in the string, counting from character start.

If start is less than one, substr() treats it as if it was one. (POSIX doesn't specify what to do in this case: Brian Kernighan's awk acts this way, and therefore gawk does too.) If start is greater than the number of characters in the string, substr() returns the null string. Similarly, if length is present but less than or equal to zero, the null string is returned.

The string returned by substr() cannot be assigned. Thus, it is a mistake to attempt to change a portion of a string, as shown in the following example:

          string = "abcdef"
          # try to get "abCDEf", won't work
          substr(string, 3, 3) = "CDE"
It is also a mistake to use substr() as the third argument of sub() or gsub():

          gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG
(Some commercial versions of awk treat substr() as assignable, but doing so is not portable.)

If you need to replace bits and pieces of a string, combine substr() with string concatenation, in the following manner:

          string = "abcdef"
          ...
          string = substr(string, 1, 2) "CDE" substr(string, 6)
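The same concatenation idiom can be wrapped in a user-defined function. The helper name replace_at and its parameters are hypothetical, not part of awk, and the sketch assumes start is between one and the length of the string plus one:

```shell
substr_demo=$(awk '
# Hypothetical helper: replace len characters of s, beginning at start, with repl.
function replace_at(s, start, len, repl) {
    return substr(s, 1, start - 1) repl substr(s, start + len)
}
BEGIN {
    print replace_at("abcdef", 3, 3, "CDE")
    print replace_at("washington", 5, 3, "")
}')
echo "$substr_demo"
```

This prints "abCDEf" and "washton"; passing an empty replacement simply deletes the selected characters.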

tolower(string)
Return a copy of string, with each uppercase character in the string replaced with its corresponding lowercase character. Nonalphabetic characters are left unchanged. For example, tolower("MiXeD cAsE 123") returns "mixed case 123".
toupper(string)
Return a copy of string, with each lowercase character in the string replaced with its corresponding uppercase character. Nonalphabetic characters are left unchanged. For example, toupper("MiXeD cAsE 123") returns "MIXED CASE 123".
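A common use of these two functions is case-insensitive comparison: normalize one side with tolower() (or toupper()) before comparing. A minimal sketch with invented log lines:

```shell
case_demo=$(printf '%s\n' 'Error: disk full' 'warning: low battery' 'ERROR: no route' | awk '
tolower($1) == "error:" { count++; print toupper($1), $2, $3 }
END { print count " errors" }')
echo "$case_demo"
```

Both spellings of the first word are matched, so the output is "ERROR: disk full", "ERROR: no route", and "2 errors".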