awk - 10 examples to group data in a CSV or text file

 

awk is very powerful when it comes to file formatting. In this article, we will discuss some of awk's grouping features. awk can group data based on a single column (field) or on a set of columns, and it uses associative arrays to do so. If you are new to awk, this article will be easier to follow after reading the article on how to parse a simple CSV file using awk.

Let us take a sample CSV file with the contents below. The file is a simple expense report containing items and their prices. As seen, some expense items have multiple entries.
$ cat file
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
1. To find the total of all the numbers in the second column, i.e., the sum of all the prices:
$ awk -F"," '{x+=$2}END{print x}' file
3000
 The delimiter (-F) is a comma since it is a comma-separated file. x+=$2 stands for x=x+$2. As each line is parsed, the second column ($2), which is the price, is added to the variable x. At the end, x contains the sum. This is the same as the awk example of finding the sum of all numbers in a file.

   If your input file is a text file that differs only in having whitespace instead of commas between the columns, a single change is needed: remove -F"," from the above command. This works because the default delimiter in awk is whitespace.
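The whitespace-separated case described above can be sketched as follows. The data is the sample file with spaces in place of commas; it is fed through a pipe rather than read from a file purely to keep the snippet self-contained, and the variable names data and total are illustrative only.

```shell
# Same records as the sample file, separated by spaces instead of commas.
data='Item1 200
Item2 500
Item3 900
Item2 800
Item1 600'

# No -F option: awk splits each line on runs of blanks by default.
total=$(printf '%s\n' "$data" | awk '{x += $2} END {print x}')
echo "$total"
```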


2. To find the total for one particular group alone, in this case "Item1":

$ awk -F, '$1=="Item1"{x+=$2;}END{print x}' file
800
  This gives us the total sum of all the items pertaining to "Item1". In the earlier example, no condition was specified since we wanted awk to work on every line or record. In this case, we want awk to work on only the records whose first column($1) is equal to Item1.


3. If the data to be worked upon is present in a shell variable:

 

$ VAR="Item1"
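$ awk -F, -v inp="$VAR" '$1==inp{x+=$2;}END{print x}' file
800
 -v passes the value of the shell variable to awk as the awk variable inp; the rest is the same as the previous example. Quoting "$VAR" keeps the value intact if it contains spaces or wildcard characters.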
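As a hedged side note, -v is not the only way to hand a shell value to awk: the standard ENVIRON array exposes environment variables. The sketch below runs both approaches on the sample data; ITEM, sum_v and sum_env are names made up for this illustration.

```shell
data='Item1,200
Item2,500
Item3,900
Item2,800
Item1,600'

VAR="Item2"

# Option 1: -v with the shell expansion quoted.
sum_v=$(printf '%s\n' "$data" |
  awk -F, -v inp="$VAR" '$1 == inp {x += $2} END {print x}')

# Option 2: export the value for this one command and read it via ENVIRON.
sum_env=$(printf '%s\n' "$data" |
  ITEM="$VAR" awk -F, '$1 == ENVIRON["ITEM"] {x += $2} END {print x}')

echo "$sum_v $sum_env"
```

One difference worth knowing: awk interprets backslash escape sequences in a -v value, whereas ENVIRON delivers the value untouched, so ENVIRON is the safer choice when the value may contain backslashes.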


4. To find unique values of first column

$ awk -F, '{a[$1];}END{for (i in a)print i;}' file
Item1
Item2
Item3
 Arrays in awk are associative, which is a very powerful feature. An associative array has an index and a corresponding value. Example: a["Jan"]=30 means that in the array a, "Jan" is an index with the value 30. In our case, we use only the index, without values. The statement a[$1] works like this: when the first record is processed, the index "Item1" is stored in the array a. The second record adds a new index "Item2", and the third adds "Item3". For the 4th record, the "Item2" index is already there, so no new index is added, and the same goes for the 5th record ("Item1").

  Once the file is processed completely, control goes to the END block, where we print all the indexes. The for loop in awk comes in 2 variants: the C-style for loop, and the one used for associative arrays.

  for (i in a) : This means for every index in the array a. The variable "i" holds the index value; any variable name can be used in place of "i". Since there are 3 elements in the array, the loop runs 3 times, each time holding one index in "i". By printing "i", we get the index values printed.


 To understand the for loop better, look at this:

for (i in a)
{
  print i;
}
 

Note: The order of the output in the above command may vary from system to system. Associative arrays do not store their indexes in sequence, so the output order need not match the order in which the indexes were entered.
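If a predictable order matters, one simple option is to pipe awk's output through sort. This is a minimal sketch on the sample data; LC_ALL=C is set here only so the ordering does not depend on the locale, and the variable names are illustrative.

```shell
data='Item1,200
Item2,500
Item3,900
Item2,800
Item1,600'

# for (i in a) visits the indexes in an unspecified order,
# so sort the result to make it deterministic.
unique=$(printf '%s\n' "$data" |
  awk -F, '{a[$1]} END {for (i in a) print i}' |
  LC_ALL=C sort)
echo "$unique"
```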

5. To find the sum of individual group records. i.e, to sum all records pertaining to Item1 alone, Item2 alone, and so on.

$ awk -F, '{a[$1]+=$2;}END{for(i in a)print i", "a[i];}' file
Item1, 800
Item2, 1300
Item3, 900
 a[$1]+=$2 can be written as a[$1]=a[$1]+$2. It works like this: when the first record is processed, a["Item1"] is assigned 200 (a["Item1"]=200). On the second "Item1" record, a["Item1"] becomes 800 (200+600), and so on. In this way, every index in the array ends up holding the sum of its group.
   In the END block, we print both the index (i) and the value (a[i]), which is the sum.
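The same idea extends to grouping on a set of columns, which the introduction mentions: build the array index from more than one field. The sketch below uses a hypothetical three-column input (item, month, price) that is not part of the original sample file.

```shell
# Hypothetical data: item, month, price.
data='Item1,Jan,200
Item1,Jan,300
Item1,Feb,600
Item2,Jan,500'

# The index is "item,month", so each distinct pair forms its own group.
sums=$(printf '%s\n' "$data" |
  awk -F, '{a[$1 FS $2] += $3} END {for (k in a) print k FS a[k]}' |
  LC_ALL=C sort)
echo "$sums"
```

Joining the fields with FS keeps the key printable as-is; awk's a[$1,$2] syntax would also work, but it joins the parts with the unprintable SUBSEP character.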

6. To find the sum of all entries in second column and add it as the last record.

$ awk -F"," '{x+=$2;print}END{print "Total,"x}' file
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
Total,3000
 This is the same as the first example, except that every record is also printed as its value is added, and at the end the "Total" record is printed.


7. To print the maximum or the biggest record of every group:

$ awk -F, '{if (a[$1] < $2)a[$1]=$2;}END{for(i in a){print i,a[i];}}' OFS=, file
Item1,600
Item2,800
Item3,900
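 Before storing the value ($2) in the array, the current second-column value is compared with the existing value and stored only if the value in the current record is bigger. In the end, the array contains only the maximum value for every group. Note that this relies on an unset array element comparing as 0, so it works only when the values are positive. For the same reason, simply changing the less-than (<) symbol to greater-than (>) does not give the smallest element of each group: the unset element (0) is never greater than a positive price, so nothing gets stored. To find the minimum, the first value of each group must be stored unconditionally, for example with the condition !($1 in a) || a[$1] > $2.
The syntax for if in awk is similar to the C language syntax: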


if (condition)
{
  <code for true condition>
} else {
  <code for false condition>
}
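Finding the smallest value per group needs a little more than flipping the comparison in example 7, because an element that has never been set compares as 0. A minimal sketch, assuming the sample data: the in operator tests whether the group has been seen yet without creating the element, and $2 + 0 forces a numeric value.

```shell
data='Item1,200
Item2,500
Item3,900
Item2,800
Item1,600'

# Store the first value of each group unconditionally,
# then replace it only when a smaller value turns up.
mins=$(printf '%s\n' "$data" |
  awk -F, -v OFS=, '
    { v = $2 + 0
      if (!($1 in a) || v < a[$1]) a[$1] = v }
    END { for (i in a) print i, a[i] }' |
  LC_ALL=C sort)
echo "$mins"
```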


8. To find the count of entries against every group:

$ awk -F, '{a[$1]++;}END{for (i in a)print i, a[i];}' file
Item1 2
Item2 2
Item3 1

 a[$1]++ : This can be written as a[$1]=a[$1]+1. When the first "Item1" record is parsed, a["Item1"] becomes 1, and every time another "Item1" record is encountered, this count is incremented; the same applies to the other entries. Finally, on printing the array, we get the items and their respective counts.
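Counts and sums combine naturally. As an illustrative extension not covered by the original examples, keeping one array for the sum and one for the count gives the average per group; the array names sum and cnt are arbitrary.

```shell
data='Item1,200
Item2,500
Item3,900
Item2,800
Item1,600'

# Print item, number of entries, and average price for each group.
avgs=$(printf '%s\n' "$data" |
  awk -F, -v OFS=, '
    { sum[$1] += $2; cnt[$1]++ }
    END { for (i in sum) print i, cnt[i], sum[i] / cnt[i] }' |
  LC_ALL=C sort)
echo "$avgs"
```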


9. To print only the first record of every group:

$ awk -F, '!a[$1]++' file
Item1,200
Item2,500
Item3,900
  This one is a little tricky. In this awk command, there is only a condition and no action statement. As a result, if the condition is true, the current record gets printed by default.
 !a[$1]++ : When the first record of a group is encountered, a[$1] evaluates to 0 since ++ is post-fix, and not (!) of 0 is 1, which is true, so the first record gets printed. When the second "Item1" record is parsed, a[$1] is 1 (it becomes 2 after the statement, since the increment is post-fix). Not (!) of 1 is 0, which is false, so the record does not get printed. In this way, only the first record of every group gets printed.
   Simply by removing the '!' operator, the above command will print all records other than the first record of each group.
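The complementary command, without the '!', can be checked on the sample data like this. Because the records are printed as they are read, the output keeps the input order and needs no sorting.

```shell
data='Item1,200
Item2,500
Item3,900
Item2,800
Item1,600'

# a[$1]++ is 0 (false) the first time a group is seen and non-zero afterwards,
# so only the repeated records of each group are printed.
rest=$(printf '%s\n' "$data" | awk -F, 'a[$1]++')
echo "$rest"
```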


10. To join or concatenate the values of all group items. Join the values of the second column with a colon separator:

$ awk -F, '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS=, file
Item1,200:600
Item2,500:800
Item3,900
   This if condition is pretty simple: if there is already some value in a[$1], append the current value to it using a colon delimiter; else just assign the current value to a[$1], since this is the first value.
To make the above if block clear, let me put it this way:  "if (a[$1])"  means "if a[$1] has some value", that is, a value that is neither empty nor zero.
if(a[$1])
 a[$1]=a[$1]":"$2;
else
 a[$1]=$2
 The same can be achieved using the awk ternary operator, which is the same as in the C language.

$ awk -F, '{a[$1]=a[$1]?a[$1]":"$2:$2;}END{for (i in a)print i, a[i];}' OFS=, file
Item1,200:600
Item2,500:800
Item3,900
 The ternary operator is a short form of the if-else condition. An example of the ternary operator: x=x>10?"Yes":"No" means if x is greater than 10, assign "Yes" to x, else assign "No".
In the same way: a[$1]=a[$1]?a[$1]":"$2:$2 means if a[$1] has some value, assign a[$1]":"$2 to a[$1], else simply assign $2 to a[$1].
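One caveat with testing the value itself is that a first entry of 0 counts as false and would be overwritten by the next value. A hedged variant is to test membership with the in operator instead. The data below is hypothetical, chosen only to show the zero case.

```shell
# Hypothetical data in which the first Item1 price is 0.
data='Item1,0
Item2,500
Item1,600'

# ($1 in a) is true once the group has been seen, whatever value it holds.
joined=$(printf '%s\n' "$data" |
  awk -F, -v OFS=, '
    { if ($1 in a) a[$1] = a[$1] ":" $2; else a[$1] = $2 }
    END { for (i in a) print i, a[i] }' |
  LC_ALL=C sort)
echo "$joined"
```

The if-else form is used here on purpose: the membership test then runs before the assignment touches a[$1], so the result does not depend on the order in which a particular awk evaluates the two sides of an assignment.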



Concatenate variables in awk:
One more thing to notice is the way string concatenation is done in awk. There is no concatenation operator: to concatenate 2 variables, write them next to each other, separated by a space.
Examples:

z=x y    #to concatenate x and y
z=x":"y  #to concatenate x and y with a colon separator.
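The two forms above can be tried directly in a BEGIN block, which needs no input file. The values assigned to x and y are arbitrary.

```shell
# z1 joins x and y directly; z2 puts a colon between them.
out=$(awk 'BEGIN {
  x = "Item1"; y = 200
  z1 = x y
  z2 = x ":" y
  print z1
  print z2
}')
echo "$out"
```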
 

 

