`
tmrp
  • 浏览: 44768 次
  • 性别: Icon_minigender_1
  • 来自: 深圳
社区版块
存档分类
最新评论

Java tip : How to read files quickly

阅读更多
Java has several classes for reading files, with and without buffering, random access, thread safety, and memory mapping. Some of these are much faster than the others. This tip benchmarks 13 ways to read bytes from a file and shows which ways are the fastest.

Table of Contents
A quick review of file reading classes
FileInputStream with byte reads
FileInputStream with byte array reads
BufferedInputStream with byte reads
BufferedInputStream with byte array reads
RandomAccessFile with byte reads
RandomAccessFile with byte array reads
FileChannel with ByteBuffer and byte gets
FileChannel with ByteBuffer and byte array gets
FileChannel with array ByteBuffer and byte array access
FileChannel with direct ByteBuffer and byte gets
FileChannel with direct ByteBuffer and byte array gets
FileChannel with MappedByteBuffer and byte gets
FileChannel with MappedByteBuffer and byte array gets
FileReader and BufferedReader
Benchmarks
Full plot
Zoomed plot
Really zoomed plot
Comparison to C
Conclusions
Further reading

A quick review of file reading classes
Let's quickly run through several ways to open and read a file of bytes in Java. To give us a way to compare them, let's compute a file checksum.

FileInputStream with byte reads
FileInputStream f = new FileInputStream( name );
int b;
long checkSum = 0L;
while ( (b=f.read()) != -1 )
    checkSum += b;FileInputStream opens a file by name or File object. It's read() method reads byte after byte from the file.

FileInputStream uses synchronization to make it thread-safe.

FileInputStream with byte array reads
FileInputStream f = new FileInputStream( name );
byte[] barray = new byte[SIZE];
long checkSum = 0L;
int nRead;
while ( (nRead=f.read( barray, 0, SIZE )) != -1 )
    for ( int i=0; i<nRead; i++ )
        checkSum += barray[i];FileInputStream does an I/O operation on every read and it synchronizes on all method calls to make it thread-safe. To reduce this overhead, read multiple bytes at once into a buffer array of bytes

BufferedInputStream with byte reads
BufferedInputStream f = new BufferedInputStream(    new FileInputStream( name ) );
int b;
long checkSum = 0L;
while ( (b=f.read()) != -1 )
    checkSum += b;BufferedInputStream handles FileInputStream buffering for you. It wraps an input stream, creates a large internal byte array (usually 8 kilobytes), and fills it in large reads from the stream. It's read() method gets the next byte from the buffer.

BufferedInputStream uses synchronization to be thread-safe.

BufferedInputStream with byte array reads
BufferedInputStream f = new BufferedInputStream(    new FileInputStream( name ) );
byte[] barray = new byte[SIZE];
long checkSum = 0L;
int nRead;
while ( (nRead=f.read( barray, 0, SIZE )) != -1 )
    for ( int i=0; i<nRead; i++ )
        checkSum += barray[i];BufferedInputStream synchronizes on all method calls to make it thread-safe. To reduce synchronization and method call overhead, make fewer read() calls by reading multiple bytes at a time.

RandomAccessFile with byte reads
RandomAccessFile f = new RandomAccessFile( name );
int b;
long checkSum = 0L;
while ( (b=f.read()) != -1 )
    checkSum += b;RandomAccessFile opens a file by name or File object. It can read, write, or read and write at your choice of position within the file. Its read() method reads the next byte from the current file position.

RandomAccessFile is thread-safe.

RandomAccessFile with byte array reads
RandomAccessFile f = new RandomAccessFile( name );
byte[] barray = new byte[SIZE];
long checkSum = 0L;
int nRead;
while ( (nRead=f.read( barray, 0, SIZE )) != -1 )
    for ( int i=0; i<nRead; i++ )
        checkSum += barray[i];Like a FileInputStream, RandomAccessFile issues an I/O operation on every read and synchronizes on all method calls to make it thread-safe. To reduce this overhead, make fewer method calls by reading into an array of bytes.

FileChannel with ByteBuffer and byte gets
FileInputStream f = new FileInputStream( name );
FileChannel ch = f.getChannel( );
ByteBuffer bb = ByteBuffer.allocate( SIZE );
long checkSum = 0L;
int nRead;
while ( (nRead=ch.read( bb )) != -1 )
{
    if ( nRead == 0 )
        continue;
    bb.position( 0 );
    bb.limit( nRead );
    while ( bb.hasRemaining() )
        checkSum += bb.get( );
    bb.clear( );
}FileInputStream and RandomAccessFile both can return a FileChannel for low-level I/O. FileChannel's read() method fills a ByteBuffer created using the allocate() method on ByteBuffer. ByteBuffer's get() method gets the next byte from the buffer.

FileChannel and ByteBuffer are not thread-safe.

FileChannel with ByteBuffer and byte array gets
FileInputStream f = new FileInputStream( name );
FileChannel ch = f.getChannel( );
ByteBuffer bb = ByteBuffer.allocate( BIGSIZE );
byte[] barray = new byte[SIZE];
long checkSum = 0L;
int nRead, nGet;
while ( (nRead=ch.read( bb )) != -1 )
{
    if ( nRead == 0 )
        continue;
    bb.position( 0 );
    bb.limit( nRead );
    while( bb.hasRemaining( ) )
    {
        nGet = Math.min( bb.remaining( ), SIZE );
        bb.get( barray, 0, nGet );
        for ( int i=0; i<nGet; i++ )
            checkSum += barray[i];
    }
    bb.clear( );
}To reduce method call overhead, get an array of bytes at a time. The array and ByteBuffer can be different sizes.

FileChannel with array ByteBuffer and byte array access
FileInputStream f = new FileInputStream( name );
FileChannel ch = f.getChannel( );
byte[] barray = new byte[SIZE];
ByteBuffer bb = ByteBuffer.wrap( barray );
long checkSum = 0L;
int nRead;
while ( (nRead=ch.read( bb )) != -1 )
{
    for ( int i=0; i<nRead; i++ )
        checkSum += barray[i];
    bb.clear( );
}A ByteBuffer created with the allocate() method above uses hidden internal storage. Instead, call wrap() to wrap a ByteBuffer around your own byte array. This lets you access your array directly after each read, reducing method call overhead and data copying.

FileChannel with direct ByteBuffer and byte gets
FileInputStream f = new FileInputStream( name );
FileChannel ch = f.getChannel( );
ByteBuffer bb = ByteBuffer.allocateDirect( SIZE );
long checkSum = 0L;
int nRead;
while ( (nRead=ch.read( bb )) != -1 )
{
    bb.position( 0 );
    bb.limit( nRead );
    while ( bb.hasRemaining() )
        checkSum += bb.get( );
    bb.clear( );
}A ByteBuffer created with the allocateDirect() method may directly use storage deeper in the JVM or OS. This can reduce copying of data upward into your application's array, saving some overhead.

FileChannel with direct ByteBuffer and byte array gets
FileInputStream f = new FileInputStream( name );
FileChannel ch = f.getChannel( );
ByteBuffer bb = ByteBuffer.allocateDirect( BIGSIZE );
byte[] barray = new byte[SIZE];
long checkSum = 0L;
int nRead, nGet;
while ( (nRead=ch.read( bb )) != -1 )
{
    if ( nRead == 0 )
        continue;
    bb.position( 0 );
    bb.limit( nRead );
    while( bb.hasRemaining( ) )
    {
        nGet = Math.min( bb.remaining( ), SIZE );
        bb.get( barray, 0, nGet );
        for ( int i=0; i<nGet; i++ )
            checkSum += barray[i];
    }
    bb.clear( );
}And of course, you can get arrays of bytes to save on method call overhead. The buffer and array sizes can differ.

FileChannel with MappedByteBuffer and byte gets
FileInputStream f = new FileInputStream( name );
FileChannel ch = f.getChannel( );
MappedByteBuffer mb = ch.map( ch.MapMode.READ_ONLY,
    0L, ch.size( ) );
long checkSum = 0L;
while ( mb.hasRemaining( ) )
    checkSum += mb.get( );
FileChannel's map method returns a MappedByteBuffer that memory maps part or all of the file into the address space of the application. This gives more direct access to the file without an intermediate buffer. Call the get() method on MappedByteBuffer to get the next byte.

FileChannel with MappedByteBuffer and byte array gets
FileInputStream f = new FileInputStream( name );
FileChannel ch = f.getChannel( );
MappedByteBuffer mb = ch.map( ch.MapMode.READ_ONLY,
    0L, ch.size( ) );
byte[] barray = new byte[SIZE];
long checkSum = 0L;
int nGet;
while( mb.hasRemaining( ) )
{
    nGet = Math.min( mb.remaining( ), SIZE );
    mb.get( barray, 0, nGet );
    for ( int i=0; i<nGet; i++ )
        checkSum += barray[i];
}And getting arrays of bytes saves method call overhead.

FileReader and BufferedReader
What about FileReader and BufferedReader? Both of these classes read characters, not bytes, so they are not included in these comparisons.

To handle characters, FileReader reads bytes from an internal FileInputStream and decodes them as UTF-8 multi-byte characters. BufferedReader does the same thing by wrapping a Reader and adding an internal buffer. Character decoding takes time, giving worse performance than simpler byte-reading classes.

Benchmarks
All of these approaches were benchmarked reading a 100 Mbyte file from a local disk. To make sure benchmarks included the cost of issuing I/O calls and their OS overhead, I measured wall clock time for the task, not CPU or user time. The results plotted below are from an unloaded Mac with JVM 5, but similar results were found on a Windows PC and JVM 6. The JVM was run with the following options:

java -server -XX:CompileThreshold=2 -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xmx1000m Test

The -server option uses larger memory defaults and a parallel garbage collector to improve performance. The -XX:CompileThreshold=2 option forced the JVM to compile methods after just two calls, insuring that compilation occurred early, and not in the middle of the benchmarks. The -XX:+AggressiveOpts option enables fancier optimizations, and -XX:+UseFastAccessorMethods inlines some get() methods. The -Xmx1000m option set a 1Gbyte memory limit so that the largest tests could run easily. Each of these options were tested independently and they all made a difference.

There are three variables to test: the read approach, the internal buffer size (for BufferedInputStream and ByteBuffer), and the read array size. Buffer and array sizes were varied separately and together from 1 byte to 128 Mbytes in size.

Full plot




The cliff on the left is from very slow byte-by-byte reads. The peak is near 600 seconds, or 10 minutes! Over a 100 Mbyte file, that's a dismal 10 Mbytes per minute. For comparison, hard disks themselves can transfer data at better than 40 Mbytes per second.

This is what we expect to see. Issuing an I/O request is very expensive. To avoid this expense, make fewer I/O requests by reading an array of bytes at a time. Even a small array of just 64 bytes dramatically improves performance.

To see more, we need to zoom in to the bottom few seconds of the plot.

Zoomed plot


The curves drops down and level out as we increase the array sizes. By about 8Kbytes the curves flatten. Increasing array sizes further doesn't gain much. This is why BufferedInputStream's default buffer size is 8Kbytes.

Performance levels out into about three groups:

At 5 seconds is BufferedInputStream and getting one byte at a time from an internal buffer. Each byte requires a method call, and each method call requires a thread synchronization lock, hurting read performance.
At just under 1 second is getting one byte at a time from a FileChannel and a large ByteBuffer. FileChannel doesn't use synchronization locks, but a method call per byte still hurts performance.
At about 1/2 second are all of the rest. All of these read arrays of bytes to minimize I/O operations, method calls, and thread synchronization, giving them better performance.
A few approaches stand out at the left side of the plot with curves that drop faster than the others:

BufferedInputStream with an internal 8Kbyte buffer and FileChannel with an internal 8Kbyte ByteBuffer do better simply because there is that 8Kbyte buffer behind the reads. The left-shifted curves are really a plotting artifact. They should, perhaps, be plotted based upon the internal read buffer size, not the application's read array size.
FileChannel with a MappedByteBuffer does well because there is effectively an internal buffer involved in memory mapping. That buffer is a multiple of the OS page size, which is typically 4Kbytes.
To be sure we've seen everything, let's zoom in even more.

Really zoomed plot


Zoomed in this far, the lines vary by about 200ms in the "flat" area above 8Kbytes. Monitoring the JVM's compilation and garbage collection found that neither of these were occurring during the tests. So these variations are due to other overhead within the JVM.

If we have to pick a fastest approach, it would be one of three:

FileChannel with a MappedByteBuffer and array reads (the magenta line with round dots that hits a low at 2Kbytes to 4Kbytes).
FileChannel with a direct ByteBuffer and array reads (the dark blue line with triangle dots that hits a low at 128Kbytes).
FileChannel with a wrapped array ByteBuffer and direct array access (the light blue line with triangle dots that hits a low at 256Kbytes).
All three of these do better than the others by reducing the amount of data copying. They all enable the JVM to read new data into the application's own array without going through multiple intermediate buffers.

Comparison to C


How does Java I/O compare to C I/O? The upper thick line above shows C's buffered I/O (fopen() and fread()), the middle line shows memory mapping (mmap()), and the lower thick line shows unbuffered I/O (open() and read()). All of these C tests were compiled with gcc and the -fast option on a Mac.

C's buffered I/O uses an internal buffer filled by large reads from the file. Bytes from the buffer are copied to an application array on each fread() call. This is similar to FileChannel with a ByteBuffer.

C's unbuffered I/O reads bytes into the application's array without an internal buffer and an extra layer of byte copying. This gives better performance and is a similar approach to FileChannel with a wrapped array ByteBuffer.

C's memory mapping gives more direct access to the OS's paged-in chunks of the file. Memory mapping has more OS overhead, reducing its performance compared to unbuffered I/O. The approach is similar to FileChannel with a MappedByteBuffer.

C's performance is clearly better. But comparisons between C and Java are, perhaps, unfair. Java does array bounds checking, runtime type checking, thread synchronization, garbage collection, and so on. C does not. This lets C go faster, but with a higher risk of crashing if you've got a pointer error.

Conclusions
For the best Java read performance, there are four things to remember:

Minimize I/O operations by reading an array at a time, not a byte at a time. An 8Kbyte array is a good size.
Minimize method calls by getting data an array at a time, not a byte at a time. Use array indexing to get at bytes in the array.
Minimize thread synchronization locks if you don't need thread safety. Either make fewer method calls to a thread-safe class, or use a non-thread-safe class like FileChannel and MappedByteBuffer.
Minimize data copying between the JVM/OS, internal buffers, and application arrays. Use FileChannel with memory mapping, or a direct or wrapped array ByteBuffer.
Further reading
Tuning Java I/O Performance. This article at Sun's Developer Network discusses I/O approaches and also gives tips on text file handling, message printing, file attribute queries, and more.
Tweak your IO performance for faster runtime. This older article at JavaWorld discusses several I/O approaches and benchmarks them on five JVMs of the day. It's general points remain valid today, though newer JVMs are much more sophisticated and we now have low-level FileChannel I/O for better performance.
http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly
分享到:
评论

相关推荐

    JavaTip

    JavaTip.chm

    markdown-tip:markdown提示组件

    link rel =" stylesheet " href =" ./node_modules/markdown-tip/dist/markdown-tip.css " /&gt;vuejs组件 npm i markdown-tip-vue import { MarkdownTip } from "markdown-tip-vue"app . component ( 'markdown-...

    Java balloon Tip Tool

    Java Balloon Tip Tool 是一个基于Java编程语言的实用工具,主要用于在用户界面上显示临时性的提示信息,类似于操作系统中的气泡提示。这种工具通常用于快速传递简短的通知或信息,而不打断用户的正常工作流程。在...

    wechat2551000

    描述中提到的“WeChat requires the use browser cookies to help you log in to allow the web application to function.”,揭示了微信网页版登录机制的一个关键点。浏览器cookies是网站在用户计算机上存储的小型...

    UE(官方下载)

    UltraEdit and UEStudio provide multiple methods to quickly open files without using the standard Open File dialog. A favorite method among power users is the Quick Open in the File menu. The benefit ...

    Optimal Tip-To-Tip Efficiency(弟高值d2f算法).pdf

    Optimal Tip-To-Tip Efficiency(弟高值d2f算法).pdf Optimal Tip-To-Tip Efficiency(弟高值d2f算法).pdf Optimal Tip-To-Tip Efficiency(弟高值d2f算法).pdf Optimal Tip-To-Tip Efficiency(弟高值d2f算法).pdf ...

    tip:带箭头的tip组件

    "tip:带箭头的tip组件"这个标题所指的,是一个专门用于创建带有箭头指示的提示框的组件。这样的组件通常在用户界面上用于提供额外的信息或者指导,提升用户体验。 描述中的"tip"一词,通常在UI设计中指的是Tooltip...

    golang-tip:活跃的Go开发分支的日常构建

    docker pull aleksi/golang-tip:master docker pull ghcr.io/aleksi/golang-tip:master 当前构建的分支/ Docker标签(请参阅): master又名技巧,下一个Go版本 dev.boringcrypto – dev.fuzz – dev.go2go – ...

    drm-tip:DRM当前的发展和夜间树木

    drm-tip:DRM当前的发展和夜间树木

    Java balloon Tip Tool (transparent-window)

    Java Balloon Tip Tool 是一个Java开发的工具,用于创建具有透明窗口效果的提示信息,通常在用户界面设计中用于提供轻量级的通知或提示。这个工具的独特之处在于它利用了Java图形库的能力来实现气球形状的提示框,...

    【字节跳动】HRBP三面真题_嵌入式-常用知识&面试题库_大厂面试真题.pdf

    * 电话面试的 Tip:保持专业形象,简洁地回答问题,展示积极的态度 二、 一面面试 * 自我介绍的目的:展示自己的经历和技能,展示自己的优势 * 自我介绍的内容:简历、经历、技能、兴趣爱好等 * 自我介绍的 Tip:...

    完全CSS写的鼠标悬停tip效果

    - `a#tip:link, a#tip:hover` 控制链接未被访问及鼠标悬停时的样式,去除了下划线并设置了颜色。 - `a#tip:visited` 设置已访问链接的颜色和下划线样式。 - **提示信息样式**: - `a#tip span` 默认隐藏提示信息...

    leetcode1124java-arts:艺术

    Tip:At least study one technical skill each week Share:At least write one thoughtful article each week Algorithm:LC 230. Kth Smallest Element in a BST Review:高效的一些建议 Tip:Elastic Search VS ...

    jdCheeseFive_java_tip3ja_

    【标题】"jdCheeseFive_java_tip3ja_" 指的是一款基于Java语言开发的简单双人对战五子棋游戏。这个项目可能是为了帮助开发者或者爱好者在没有实体棋盘的情况下,能与朋友通过电脑进行游戏消遣,且由于不需要网络联机...

    传智播客 方立勋 javaweb 完全笔记

    **7.29 Tip:使用 JAXP 进行 DOM 解析** - **步骤**: 1. 加载 XML 文件。 2. 解析 XML。 3. 处理节点。 **7.30 Tip:DOM 编程** - **概念**: 一种标准模型,用于表示 XML 或 HTML 文档的结构。 - **操作**: - ...

    d3-v6-tip:一个著名的 d3-tip 库,适配了最新的 - d3.v6 版本

    d3-v6-tip 一个著名的 d3-tip 库,适用于最新 - d3.v6 版本。前言d3.v6 引入了一些更改,其中一些与 d3-tip 相关。 那些是: 全局 d3.event 已被删除从现在开始,每个事件处理程序都将接收事件作为第一个参数位于此...

    高级java笔试题-tip:自写博客、博客收藏、开源项目收藏

    高级java笔试题 自写博客或笔记 博客收藏 开源 并发 线程安全set:ConcurrentSkipListSet、CopyOnWriteArraySet、ConcurrentHashMap.newKeySet()、Collections.synchronizedSet(set) 分布式技术 缓存穿透: 请求去...

    pandoc-latex-tip:用于在LaTeX中添加提示的pandoc过滤器

    它使用icon_font_to_png包来生成即时图像。 文献资料 请参阅。 用法 要应用过滤器,请对pandoc使用以下选项: --filter pandoc-latex-tip 安装 pandoc-latex-tip需要 ,这是一种预安装在linux和Mac OS X上的编程...

    TIP:提示的静态程序分析

    Java 8(或更高版本) 我们建议您使用IntelliJ进行提示,但是以下所有选项都是可行的。 IntelliJ IDEA 在IntelliJ中安装Scala插件。 (按照有关的说明进行操作,在插件菜单中搜索“ Scala”。) 打开一个终端,...

Global site tag (gtag.js) - Google Analytics