Counting the lines of a text file as fast as possible (in Java)

silent2N

Tags: Java, regular expressions, embedded, multithreading, Groovy

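Business scenario: a user uploads a file through the browser, and the back end has to check whether this text file exceeds the line limit (50 lines).

Technologies involved: the Apache (Struts) form file and the JDK's io classes.

Code snippet:

        // Check the line count; formFile is an org.apache.struts.upload.FormFile
        LineNumberReader lnr = new LineNumberReader(new InputStreamReader(formFile.getInputStream()));
        // skip() still reads the characters internally and counts line terminators as it goes
        lnr.skip(formFile.getFileSize());
        // the + 1 is meant to count a final line that has no trailing newline
        if (lnr.getLineNumber() + 1 > 50) {
            // report the error "The file must not exceed 50 lines!"
        }
        lnr.close();

Pros and cons: as I see it, the advantage is that no traversal is needed; the drawback is that blank lines cannot be told apart.

This is only a modest starting point to get the discussion going. Please comment on this code and share the most efficient way to count the lines of a text file in Java.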
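
For the 50-line limit described above, one option is to stop reading as soon as the limit is crossed instead of scanning the whole upload. The following is only a minimal sketch under assumptions that are not in the original post: the class and method names are hypothetical, the charset is passed in explicitly, and Java 7 or later is assumed for try-with-resources. In the scenario above the stream would come from `formFile.getInputStream()`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original post.
final class LineLimitCheck {

    /** Returns true as soon as more than maxLines lines have been seen. */
    static boolean exceedsLineLimit(InputStream in, Charset charset, int maxLines)
            throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            int lines = 0;
            while (reader.readLine() != null) {
                if (++lines > maxLines) {
                    return true; // stop early; the rest of the upload is never decoded
                }
            }
            return false;
        }
    }
}
```

`readLine()` counts a final line that has no terminator and does not report an extra empty line after a trailing newline. Blank lines are counted like any other line, so this does not address the blank-line drawback mentioned above.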

2010-12-13 11:46

#60 ldbjakyo 2011-01-07

One point: LineNumberReader is just a decorator that Java provides; the traversal still has to happen.
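
One way to check this claim is to put a counting wrapper underneath LineNumberReader and see how many characters `skip()` pulls through it. This is a minimal sketch; `DecoratorDemo` and `CountingReader` are hypothetical names introduced only for illustration, and Java 7 or later is assumed.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical demo class, not from the thread.
final class DecoratorDemo {

    /** Reader wrapper that records how many characters were read through it. */
    static final class CountingReader extends FilterReader {
        long charsRead = 0;

        CountingReader(Reader in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            if (c != -1) charsRead++;
            return c;
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            int n = super.read(cbuf, off, len);
            if (n > 0) charsRead += n;
            return n;
        }
    }

    /** Skips to the end via LineNumberReader and reports how many chars were actually read. */
    static long charsPulledBySkip(String text) throws IOException {
        CountingReader counting = new CountingReader(new StringReader(text));
        try (LineNumberReader lnr = new LineNumberReader(counting)) {
            lnr.skip(Long.MAX_VALUE);
        }
        return counting.charsRead;
    }
}
```

If the reported count equals the length of the text, every character went through the underlying reader, which is what "still traverses" means here.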

#59 mercyblitz 2010-12-27

wezly wrote (in #57, replying to yangyi's #56): [...] this implementation can indeed be more efficient when reading from the cache (data in memory) [...] you still have to wait for external storage to hand you the data, and that has nothing to do with file size, only with the file read/write rate. [...]

The cache that sits between IO and the CPU is on the CPU, not in memory. Nothing here uses IO mapping (that is the approach that actually puts the data in memory), so don't talk like a layman!

On a first run there is no cache at all.

The key point is that as long as the IO operations stay within the bandwidth of the IO bus, multiple cores do help; otherwise what would multiple cores be for? The reason we monitor system IO is precisely to see how much of the IO bus bandwidth is being used.

If file size doesn't matter, what about 10 GB and 1 TB files? Go learn the basics first.
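
For reference, memory-mapped IO (what this reply calls IO mapping) is available in Java through `FileChannel.map`. The following is a minimal sketch and makes no claim about speed; the class name and the `windowSize` parameter are hypothetical, Java 7 or later is assumed, and `windowSize` is assumed to be positive. The file is mapped in windows because a single mapping cannot exceed `Integer.MAX_VALUE` bytes.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical class, not from the thread.
final class MappedLineCounter {

    /** Counts '\n' bytes by mapping the file in windows of at most windowSize bytes. */
    static long countNewlines(Path file, int windowSize) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            long newlines = 0;
            for (long pos = 0; pos < size; pos += windowSize) {
                long len = Math.min(windowSize, size - pos);
                MappedByteBuffer window = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (window.hasRemaining()) {
                    if (window.get() == '\n') {
                        newlines++;
                    }
                }
            }
            return newlines;
        }
    }
}
```

Counting the byte `'\n'` is only valid for ASCII-compatible encodings such as UTF-8; it would miscount UTF-16 text.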

#58 yangyi 2010-12-26

wezly wrote (in #57): [...] this implementation can indeed be more efficient when reading from the cache (data in memory), but in the end it cannot raise the speed of the operation as a whole [...]

Please run some tests before replying. What you said earlier holds for large files, where the disk has to be read. For small files, though, especially when they are read repeatedly, the JVM, the operating system and even the disk all have caches, and that is when the CPU gets to show its strength. You keep making subjective assumptions and then jumping to conclusions. Why not actually measure it? You say it cannot improve overall efficiency; how is that defined? That is far too loose.

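#57 wezly 2010-12-17

yangyi wrote (in #56): [...] hard disks and operating systems both have caches [...] The tenfold figure I got is really the result of small files, a few MB to a few tens of MB, sitting in the disk cache [...]

As I said, your implementation can indeed be more efficient when reading from the cache (data in memory), but in the end it cannot raise the speed of the operation as a whole. As you put it yourself, "hard disks and operating systems both have caches", so the operating system and the disk are already doing everything they can, with well-optimized algorithms, to make reading cached data efficient. What more could this implementation of yours add? So although the idea behind it is good, it ultimately has no effect.

This time you did get to the point: however fast IO on external storage (disks and so on) is, it is nowhere near as fast as memory reads and writes, not even the same order of magnitude (and that will stay true for a long time). No amount of multi-core or concurrency tuning helps, and such strategies should certainly not be applied to reading a file directly! In other words, however fast you are, you still have to wait for external storage to hand you the data, and that has nothing to do with file size, only with the file read/write rate. The test results also show that this kind of implementation does not improve efficiency.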

#56 yangyi 2010-12-16

wezly wrote (in #55): I don't even need to test your code to reject the so-called "more than ten times faster". [...]

You have a point, but bear in mind that hard disks and operating systems both have caches. The bottleneck here is indeed the disk rather than the CPU, but that does not mean the CPU side cannot be optimized once disk IO has already been optimized. On machines with more powerful hardware, faster spindles and larger caches, the gap between the two strategies will be even wider. The tenfold figure I got is really the result of small files, a few MB to a few tens of MB, sitting in the disk cache, where throughput can reach 1.5 GB/s.

Of course, for files over a gigabyte the advantage is much less noticeable.

#55 wezly 2010-12-16

yangyi wrote (in #54): Let's speak with numbers: LineCounter can be more than ten times faster than the original poster's result, and faster still with more cores. [LineCounter source as posted in #54]

I don't even need to test your code to reject the so-called "more than ten times faster". Yesterday I used LineNumberReader to count a text file of 846,750 lines (one mobile phone number per line, a little over 10 MB), and it took only about 1.9 seconds. I tested on a laptop whose disk reads at 47 MB/s; call it 50 MB/s. That means the operating system alone needs 0.2 seconds just to read the test file, without examining its contents at all. "More than ten times faster" means your program takes at most 0.19 seconds, so it would be faster than the operating system and driver can read the file. On top of that you have put forward the theory that multiple cores can eliminate the disk IO bottleneck. Your conclusion is nothing short of revolutionary.

#54 yangyi 2010-12-15

Let's speak with numbers: LineCounter can be more than ten times faster than the original poster's result, and faster still with more cores.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LineCounter {
	
	public static void main(String[] args) throws Exception{
		File file = new File(args[0]);
		int bufferSize = 32768;
		//Fix buffer size, let's test thread number
		for(int i=1;i<100;i++){
			long start = System.currentTimeMillis();
			long result = calculate(file, i, bufferSize);
			long used = System.currentTimeMillis() - start;
			System.out.println("Thread number:"+i+"\tTime cost:"+used);
		}
		System.out.println("------------------I am the separator---------------------");
		//Fix thread Number test buffer size
		int threads = Runtime.getRuntime().availableProcessors();
		for(int i=1;i<Integer.MAX_VALUE >>> 1;i<<=1){
			long start = System.currentTimeMillis();
			long result = calculate(file, threads, i);
			long used = System.currentTimeMillis() - start;
			System.out.println("Buffer size:"+i+"\tTime cost:"+used);
		}
			
	}
	
	private static class FilePartCallable implements Callable<Long>{
		private int bufferSize;
		private File file;
		private long start;
		private long count;
		
		public FilePartCallable(File file, long start, long count, int bufferSize){
			this.file = file;
			this.start = start;
			this.count = count;
			this.bufferSize = bufferSize;
		}
		
		private long read(File file, long start, long count) throws IOException{
			if(count == 0){
				return 0;
			}
			RandomAccessFile raf = new RandomAccessFile(file, "r");
			raf.seek(start);
			
			int len = count < bufferSize ? (int)count : bufferSize;
			long lines = 0L,c=count/len;
			byte[] buffer = new byte[len];
			
			for(int i=0;i<c;i++){
				raf.readFully(buffer); // read() may return fewer bytes than requested
				for(int j=0;j<len;j++){
					if(buffer[j] == '\n'){
						lines++;
					}
				}
			}
			int rest = (int)(count%len);
			raf.readFully(buffer, 0, rest); // read() may return fewer bytes than requested
			for(int j=0;j<rest;j++){
				if(buffer[j] == '\n'){
					lines++;
				}
			}
			raf.close();
			return lines;
		}
		
		public Long call() throws Exception {
			return Long.valueOf(read(file, start, count));
		}
	}
	
	private static long calculate(File file, int threads, int bufferSize) throws Exception{
		ExecutorService es = Executors.newFixedThreadPool(threads);
		CompletionService<Long> cs = new ExecutorCompletionService<Long>(es);
		long size = file.length();
		
		long count = size/threads;
		
		for(int i=0;i<threads;i++){
			long start = count*i;
			cs.submit(new FilePartCallable(file, start, count, bufferSize));
		}
		
		long start = count*threads;
		long restCount = size - start;
		cs.submit(new FilePartCallable(file, start, restCount, bufferSize));
		
		long lines = 0L;
		for(int i=0;i<threads+1;i++){
			lines += cs.take().get();
		}
		
		es.shutdown();
		return lines + 1; // + 1 counts a final line that has no trailing '\n'
	} 
}

#53 yangyi 2010-12-15

wezly wrote (in #52): [...] having LineCounter walk the file in chunks with RandomAccessFile is a really bad idea: the disk head will jump all over the place as the file is split into chunks [...]

Part of what you say is right and part is wrong. The second paragraph in particular is pure assumption (forgive me for putting it so bluntly) with no actual evidence behind it. A seek followed by a bulk read is equivalent to calling read(offset), that is, the pread system call. Each file descriptor corresponds to one file object in the kernel, and each file object corresponds to one inode. If a process opens the same file twice it gets two file descriptors, which map to two file objects in the kernel but only one inode. File reads and writes are ultimately carried out through the inode object. So if the reader threads open the same file, then even though each has its own file descriptor, they all end up operating on the same inode, and the disk access itself is performed by the kernel calling the driver according to the inode information. That said, this does waste FDs; a plain read(offset) would have been enough, since it makes no difference how many FDs are used concurrently.

Note that filtering the stream is a CPU-intensive operation, so multiple cores are needed here.
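
In Java, a read at an explicit offset is available as `FileChannel.read(ByteBuffer, long)`, which does not move the channel's own position, so several threads can share one channel instead of opening one RandomAccessFile each. The following is a minimal sketch of that idea, not the code from the thread: the class and method names are hypothetical, Java 8 or later is assumed, `threads` and `bufferSize` are assumed to be positive, and only `'\n'` bytes are counted. Whether it is any faster than a single sequential pass is exactly what the thread disputes, and the sketch does not settle it.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class, not from the thread.
final class SharedChannelLineCounter {

    /** Splits the file into one slice per thread and sums the '\n' counts of the slices. */
    static long countNewlines(Path file, int threads, int bufferSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            long slice = (size + threads - 1) / threads; // ceiling division
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                long from = Math.min(size, i * slice);
                long to = Math.min(size, from + slice);
                Callable<Long> task = () -> countRange(ch, from, to, bufferSize);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    /** Counts '\n' bytes in [from, to) using positional reads on the shared channel. */
    private static long countRange(FileChannel ch, long from, long to, int bufferSize)
            throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(bufferSize);
        long newlines = 0;
        long pos = from;
        while (pos < to) {
            buf.clear();
            buf.limit((int) Math.min(bufferSize, to - pos));
            int n = ch.read(buf, pos); // positional read: channel position is not changed
            if (n < 0) {
                break; // file became shorter than expected
            }
            for (int i = 0; i < n; i++) {
                if (buf.get(i) == '\n') {
                    newlines++;
                }
            }
            pos += n;
        }
        return newlines;
    }
}
```

Because slices are cut at byte offsets rather than line boundaries, counting newline bytes per slice and summing them is safe; decoding characters per slice would not be.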

#52 wezly 2010-12-15

night_stalker wrote (in #44): [...] The disk is ten thousand times slower than the CPU; a single thread reading sequentially through a BufferedInputStream is enough.

He is right.

The crux of this problem is the disk IO bottleneck. Concurrency and multithreading are beside the point here; using a buffer is the way to go.

LineNumberReader, which is built on BufferedReader, is already quite efficient. It does count lines by traversing, but the traversal runs over BufferedReader's buffer. Besides, I would like to ask: who can count lines without traversing at all? If you can, your intellect is far ahead of the rest of us earthlings.

I tested LineNumberReader on my machine with a text file of 846,750 lines and it took under 2 seconds, whereas the results from that gentleman's LineCounter program were painful to look at.

Moreover, having LineCounter walk the file in chunks with RandomAccessFile is a really bad idea. The disk head will jump all over the place as the file is split into chunks, which is much slower than traversing with an unbuffered FileReader. Unless you are on a RAID array, you are simply courting disaster. Of course, the same strategy could be used when reading from the buffer, and there it might possibly improve efficiency.

One last thing: unless you are Doug Lea or Reinhold, stick to what the JDK already provides. And even if you have made some earth-shattering discovery, please test it yourself before posting code; otherwise you risk misleading people.

#51 smildlzj 2010-12-15

Rated as a beginner thread again..

I think this thread is pretty good. The discussion goes fairly low-level.

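#50 zhangcong170 2010-12-14

Quoted exchange, oldest first:

mercyblitz wrote: What if the file is 2 GB? How would you count it then?

zhangcong170 wrote: You can use an NIO file channel: read part of the data into memory, remember the current position, count the lines, then read the next part and add to the count, until the whole file has been read, and return the line count.

mercyblitz wrote: Random file access would be a better fit!

zhangcong170 wrote: FileChannel itself offers random file access.

mercyblitz wrote: You misunderstood. Random file access, for example java.io.RandomAccessFile in Java, can be driven through either BIO or NIO, but underneath it always goes through the FD (file descriptor; the Java class is java.io.FileDescriptor), whichever way you use. I suggest having a look at the source of RandomAccessFile!

Right, I was only looking at FileChannel. As far as this problem goes, it really is still random file access.
We were just focusing on different things.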
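
The chunked FileChannel approach described in the quoted exchange could look roughly like the sketch below. It is only an illustration under assumptions: the class name and the `bufferSize` parameter are hypothetical, Java 7 or later is assumed, `bufferSize` is assumed to be positive, and only `'\n'` bytes are counted. The channel keeps track of the current position itself, so memory use stays at one buffer regardless of file size.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical class, not from the thread.
final class ChannelLineCounter {

    /** Reads the file one buffer at a time and accumulates the number of '\n' bytes. */
    static long countNewlines(Path file, int bufferSize) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(bufferSize);
            long newlines = 0;
            while (ch.read(buf) != -1) {   // sequential read; advances the channel position
                buf.flip();                // switch the buffer from filling to draining
                while (buf.hasRemaining()) {
                    if (buf.get() == '\n') {
                        newlines++;
                    }
                }
                buf.clear();               // make the buffer ready for the next chunk
            }
            return newlines;
        }
    }
}
```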

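#49 yangyi 2010-12-14

night_stalker wrote (in #44): [...] for a 100,000-line file, wc -l finishes faster than the JVM starts up. [...]

william_ai wrote: I recommend using wc -l for this. On Linux and Unix, wc -l is more practical and more efficient. You can even combine it with awk, sort, uniq, sed, eval, grep, find, split, xargs and so on to do extremely complex things, and most of the time a single line handles something quite complicated, which is very satisfying.
But there is still one question: what about the Windows platform?
Finally, talking tech and shooting the breeze on javaeye is a happy thing.

You all have a point, but even wc -l ends up in system calls. Unless the difference is an order of magnitude it can hardly be called a bottleneck, and saying the JVM has become the bottleneck is an exaggeration.
My program was slow because every single byte read was an IO operation. After switching to a buffer, that is, reading into a byte array, the time dropped to a few milliseconds, yet dual core is still more than 50% faster than single core. I am stating this here so as not to mislead anyone. :)
I had a look at the BufferedReader implementation, and it too reads some bytes ahead into memory. What puzzles me is that many InputStream implementations read one byte at a time, which must be very slow. Evidently my earlier belief that the JVM does the buffering was badly wrong.
One more correction: the disk is slower than memory, rather than slower than the CPU.
How large the buffer should be is still worth discussing, though; it depends on the memory currently available, the chunk size and so on.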

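#48 william_ai 2010-12-14

night_stalker wrote:

Dual core... utter nonsense.
Try it and you will see how silly that is: for a 100,000-line file, wc -l finishes faster than the JVM starts up.

The disk is ten thousand times slower than the CPU; a single thread reading sequentially through a BufferedInputStream is enough.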

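#47 kraft 2010-12-14

In the ordinary case, a single thread that reads sequentially and counts the \n characters is enough, but the size of the buffer used for each read should be tuned, or left to Java. Digging deeper, it depends on how the file is actually stored: the storage layout, the partition format and the operating system all have an effect.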
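
A single-threaded sequential pass with a tunable buffer, as suggested here, might look like the following sketch. The class name and the `bufferSize` parameter are hypothetical, `bufferSize` is assumed to be positive, and the rule for the last line (count it when the input does not end in `'\n'`) is a choice made for this example rather than something the thread specifies.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical class, not from the thread.
final class SequentialLineCounter {

    /**
     * Counts lines by counting '\n' bytes, plus one for a final line
     * that is not terminated by '\n'. Empty input yields 0.
     */
    static long countLines(InputStream in, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long newlines = 0;
        int last = '\n'; // so that empty input counts as zero lines
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                if (buf[i] == '\n') {
                    newlines++;
                }
            }
            if (n > 0) {
                last = buf[n - 1]; // remember the last byte seen so far
            }
        }
        return last == '\n' ? newlines : newlines + 1;
    }
}
```

Because the method reads into its own byte array, wrapping the stream in a BufferedInputStream is not needed; the `bufferSize` argument is the knob to experiment with. Counting the byte `'\n'` is only valid for ASCII-compatible encodings such as UTF-8.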

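#46 seeker 2010-12-14

My colleague did not believe it at first and tried a pile of approaches to make it faster. I suggested just trying readline, and once he saw the result he had to believe it. Sometimes the dumbest method is also the fastest, haha.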

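#45 seeker 2010-12-14

I have only tried readline: a bit over 2 million lines, and getting the count took a little over 2 seconds on a T400. My colleague's dual-core Dell, a machine from about the year before last, is slower and still needs only a bit over 4 seconds.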

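#44 night_stalker 2010-12-14

Dual core... utter nonsense.
Try it and you will see how silly that is: for a 100,000-line file, wc -l finishes faster than the JVM starts up.

The disk is ten thousand times slower than the CPU; a single thread reading sequentially through a BufferedInputStream is enough.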

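#43 mercyblitz 2010-12-13

Quoted exchange, oldest first:

yangyi wrote:

1 Get the size of the file in bytes, M
2 Choose the number of pointers N such that M > N
3 Using random access, point them at different positions in the file and have each scan M/N bytes, adding 1 to the line count whenever '\n' is met
4 Remember that the final M%N bytes also have to be checked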

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LineCounter {
	public static void main(String[] args) throws Exception{
		File file = new File(args[0]);
		int threads = Runtime.getRuntime().availableProcessors();
		long size = file.length();
		System.out.println(calculate(file, size, threads));
	}
	
	private static class FilePartCallable implements Callable<Long>{
		private File file;
		private long start;
		private long count;
		
		public FilePartCallable(File file, long start, long count){
			this.file = file;
			this.start = start;
			this.count = count;
		}
		
		private long read(File file, long start, long count) throws IOException{
			RandomAccessFile raf = new RandomAccessFile(file, "r");
			raf.seek(start);
			long lines = 0L;
			for(long i=0;i<count;i++){
				if(raf.readByte() == '\n'){
					lines++;
				}
			}
			raf.close();
			return lines;
		}
		
		public Long call() throws Exception {
			return Long.valueOf(read(file, start, count));
		}
		
	}
	
	private static long calculate(File file, long size, int threads) throws Exception{
		ExecutorService es = Executors.newFixedThreadPool(threads);
		CompletionService<Long> cs = new ExecutorCompletionService<Long>(es);
		
		long count = size/threads;
		
		for(int i=0;i<threads;i++){
			long start = count*i;
			cs.submit(new FilePartCallable(file, start, count));
		}
		
		long start = count*threads;
		long restCount = size - start;
		cs.submit(new FilePartCallable(file, start, restCount));
		
		long lines = 0L;
		for(int i=0;i<threads+1;i++){
			lines += cs.take().get();
		}
		
		es.shutdown();
		return lines + 1;
	} 
}

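mercyblitz wrote: This implementation is nice, but it is too wasteful: N processors and only N threads, heh!

javeaye wrote: For a single file, split it into slices, count each slice on its own thread, then sum. For multiple files, queue them.

mercyblitz wrote: Several files arriving together is not a problem either; the result will be eventually consistent, provided the counter is atomic and that many FDs can be opened.

javeaye wrote: No problem: with one pool they have to queue. With multiple pools, context switching eats up part of the resources.

yangyi wrote: This part of the work can simply be left to the thread pool and its built-in queue. The thread count follows the CPU count because filtering bytes like this is very CPU-hungry, so more tasks is generally not better; it also saves FDs. In my test on a PC with a 100,000-line file, dual core took about 7 seconds and single core a bit over 10.

Indeed, more is not always better, but it depends on the situation; the limits on threads and FDs can be adjusted through kernel settings.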

#42 yangyi 2010-12-13

javeaye wrote (in #41; the earlier exchange and the LineCounter source are quoted in #43): No problem: with one pool they have to queue. With multiple pools, context switching eats up part of the resources.

This part of the work can simply be left to the thread pool and its built-in queue. The thread count follows the CPU count because filtering bytes like this is very CPU-hungry, so more tasks is generally not better; it also saves FDs.

#41 javeaye 2010-12-13

mercyblitz wrote (the earlier exchange and the LineCounter source are quoted in #43): Several files arriving together is not a problem either; the result will be eventually consistent, provided the counter is atomic and that many FDs can be opened.

No problem: with one pool they have to queue.
With multiple pools, context switching eats up part of the resources.
