Hadoop Source Code Analysis: copyFromLocal

zhangbaoming815

Hadoop copyFromLocal: a walk through the source

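I was curious how distributed storage is actually implemented: how does a file end up stored on HDFS? The HDFS directory tree is only a shell of metadata; the data itself lives on the DataNodes. So what work does the cluster do when we put a file onto HDFS? In other words, what happens when the copyFromLocal command runs?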

The command has to map to some method in the source code. That method is copyFromLocal in the FsShell class:

    void copyFromLocal(Path[] srcs, String dstf) throws IOException
    {
        // Build the destination path
        Path dstPath = new Path(dstf);

        // Get the file system that the destination path belongs to
        FileSystem dstFs = dstPath.getFileSystem(getConf());
        
        if (srcs.length == 1 && srcs[0].toString().equals("-"))
        {
            copyFromStdin(dstPath, dstFs);
        }
        else
        {
            
            dstFs.copyFromLocalFile(false, false, srcs, dstPath);
        }
    }

The copy itself is implemented by the copy method of the FileUtil class, which copyFromLocalFile delegates to:

    public static boolean copy(FileSystem srcFS, Path src, FileSystem dstFS,
            Path dst, boolean deleteSource, boolean overwrite,
            Configuration conf) throws IOException
    {
        // Validate the destination path
        dst = checkDest(src.getName(), dstFS, dst, overwrite);

        if (srcFS.getFileStatus(src).isDir())
        {
            // Check that the destination is a sensible directory
            // (e.g. not the source itself or a directory inside it)
            checkDependencies(srcFS, src, dstFS, dst);
            if (!dstFS.mkdirs(dst))
            {
                return false;
            }
            FileStatus contents[] = srcFS.listStatus(src);
            for (int i = 0; i < contents.length; i++)
            {
                // Recurse into this method; when the entry is a file,
                // the else-if branch below handles it
                copy(srcFS, contents[i].getPath(), dstFS, new Path(dst,
                        contents[i].getPath().getName()), deleteSource,
                        overwrite, conf);
            }
        }
        else if (srcFS.isFile(src))
        {
            InputStream in = null;
            OutputStream out = null;
            try
            {
                in = srcFS.open(src);
                
                // Create the destination file; how this is done on a
                // distributed file system is the important part
                out = dstFS.create(dst, overwrite);
                
                IOUtils.copyBytes(in, out, conf, true);
            }
            catch (IOException e)
            {
                IOUtils.closeStream(out);
                IOUtils.closeStream(in);
                throw e;
            }
        }
        // ...... (the rest of the method, including its return value,
        // is omitted from this excerpt)
    }
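
The recursion in copy can be tried out with nothing but the JDK's java.nio.file API standing in for Hadoop's FileSystem. The following is a minimal sketch under that assumption, not Hadoop's code: LocalCopySketch, copyTree and demo are hypothetical names, and the sketch leaves out deleteSource, overwrite handling and the checkDest/checkDependencies validation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for FileUtil.copy that only touches the local file system.
class LocalCopySketch {

    // Copies src to dst. Directories are created and walked recursively;
    // regular files are copied through a pair of streams.
    static void copyTree(Path src, Path dst) throws IOException {
        if (Files.isDirectory(src)) {
            Files.createDirectories(dst);
            try (DirectoryStream<Path> children = Files.newDirectoryStream(src)) {
                for (Path child : children) {
                    // Same shape as the recursive call in FileUtil.copy:
                    // the child keeps its name under the new parent.
                    copyTree(child, dst.resolve(child.getFileName().toString()));
                }
            }
        } else if (Files.isRegularFile(src)) {
            try (InputStream in = Files.newInputStream(src);
                 OutputStream out = Files.newOutputStream(dst)) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) >= 0) {
                    out.write(buf, 0, n);
                }
            }
        } else {
            throw new IOException(src + ": No such file or directory");
        }
    }

    // Builds a tiny tree in a temp directory, copies it,
    // and returns the text of the copied file.
    static String demo() {
        try {
            Path root = Files.createTempDirectory("copy-sketch");
            Path src = root.resolve("src");
            Files.createDirectories(src.resolve("sub"));
            Files.write(src.resolve("sub").resolve("a.txt"),
                    "hello".getBytes(StandardCharsets.UTF_8));
            Path dst = root.resolve("dst");
            copyTree(src, dst);
            byte[] copied = Files.readAllBytes(dst.resolve("sub").resolve("a.txt"));
            return new String(copied, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

On HDFS the interesting difference is the file branch: the output stream is not a local file stream but the one returned by dstFS.create.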

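Copying a file means opening a stream on the source and a stream on the destination. The destination stream comes from the create method of DFSClient, which constructs a DFSOutputStream: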

    public OutputStream create(String src, FsPermission permission,
            boolean overwrite, boolean createParent, short replication,
            long blockSize, Progressable progress, int buffersize)
            throws IOException
    {
        checkOpen();
        if (permission == null)
        {
            permission = FsPermission.getDefault();
        }
        FsPermission masked = permission
                .applyUMask(FsPermission.getUMask(conf));
        LOG.debug(src + ": masked=" + masked);
        
        // src is the destination path being written to. blockSize must be
        // a whole multiple of io.bytes.per.checksum, otherwise an exception is thrown
        OutputStream result = new DFSOutputStream(src, masked, overwrite,
                createParent, replication, blockSize, progress, buffersize,
                conf.getInt("io.bytes.per.checksum", 512));
        leasechecker.put(src, result);
        return result;
    }
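
Two rules are visible in create: the requested permission is masked by the configured umask, and the block size has to be a whole multiple of io.bytes.per.checksum. A minimal sketch of both, using plain integers instead of FsPermission; CreateChecksSketch and its method names are hypothetical, and the exception type is an assumption made for illustration rather than what Hadoop throws.

```java
// Hypothetical illustration of the two checks implied by DFSClient.create.
class CreateChecksSketch {

    // Clears every permission bit that is set in the umask.
    static int applyUMask(int permission, int umask) {
        return permission & ~umask & 0777;
    }

    // Rejects a block size that is not a whole number of checksum chunks.
    static void checkBlockSize(long blockSize, int bytesPerChecksum) {
        if (bytesPerChecksum < 1 || blockSize % bytesPerChecksum != 0) {
            throw new IllegalArgumentException("blockSize " + blockSize
                    + " is not a multiple of bytesPerChecksum " + bytesPerChecksum);
        }
    }

    public static void main(String[] args) {
        // Permission 0666 with umask 022 leaves 0644.
        System.out.println(Integer.toOctalString(applyUMask(0666, 022)));
        // 64 MB is a multiple of 512, so this passes silently.
        checkBlockSize(64L * 1024 * 1024, 512);
    }
}
```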

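What happens when the DFSOutputStream is created? Look at its constructor: after asking the NameNode to create the file, DFSOutputStream starts the DataStreamer thread, which plays an important role later when the data is written: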

        DFSOutputStream(String src, FsPermission masked, boolean overwrite,
                boolean createParent, short replication, long blockSize,
                Progressable progress, int buffersize, int bytesPerChecksum)
                throws IOException
        {
            this(src, blockSize, progress, bytesPerChecksum, replication);

            computePacketChunkSize(writePacketSize, bytesPerChecksum);

            try
            {
                namenode.create(src, masked, clientName, overwrite,
                        createParent, replication, blockSize);
            }
            catch (RemoteException re)
            {
                throw re.unwrapRemoteException(AccessControlException.class,
                        FileAlreadyExistsException.class,
                        FileNotFoundException.class,
                        NSQuotaExceededException.class,
                        DSQuotaExceededException.class);
            }
            streamer.start();
        }

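Once the DataStreamer thread is running, it opens a channel to the target file, waits until the dataQueue holds data, and then writes that data to the target. The target is really a file on a DataNode, commonly called a block. How the right block is located can be traced along the other main line above, the file-creation code.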

    ......

    // get packet to be sent.
    if (dataQueue.isEmpty())
    {
        one = new Packet(); // heartbeat packet
    }
    else
    {
        // Take a packet from the queue
        one = dataQueue.getFirst(); // regular data packet
    }

    ......

    // If there is no open block stream (the previous block is complete),
    // open a connection for the next block
    if (blockStream == null)
    {
        LOG.debug("Allocating new block");
        nodes = nextBlockOutputStream(src);
        this.setName("DataStreamer for file " + src
                + " block " + block);
        response = new ResponseProcessor(nodes);
        response.start();
    }

    ......

    // blockStream sends the data to the other end of the connection (a DataNode)
    blockStream.write(buf.array(), buf.position(),
            buf.remaining());

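This thread waits for data to arrive. So where does the data come from? Look at the copyBytes method of IOUtils. It checks whether the output is a PrintStream, which is the case when printing to the console: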

    public static void copyBytes(InputStream in, OutputStream out, int buffSize)
            throws IOException
    {

        PrintStream ps = out instanceof PrintStream ? (PrintStream) out : null;
        byte buf[] = new byte[buffSize];
        int bytesRead = in.read(buf);
        while (bytesRead >= 0)
        {
            // There is more going on here than it seems: do not think of
            // out as a plain OutputStream. This write call eventually
            // reaches DFSClient.DFSOutputStream.writeChunk(..)
            out.write(buf, 0, bytesRead);
            if ((ps != null) && ps.checkError())
            {
                throw new IOException("Unable to write to output stream.");
            }
            bytesRead = in.read(buf);
        }
    }

Judging from the code above, out should be an FSDataOutputStream. Tracing its write method leads to the writeChecksumChunk method of the FSOutputSummer class:

    private void writeChecksumChunk(byte b[], int off, int len, boolean keep)
            throws IOException
    {
        int tempChecksum = (int) sum.getValue();
        if (!keep)
        {
            sum.reset();
        }
        int2byte(tempChecksum, checksum);
        writeChunk(b, off, len, checksum);
    }
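
What writeChecksumChunk does per chunk (take the running checksum, reset it, turn the int into bytes) can be sketched with the JDK alone. This is a simplified illustration, not Hadoop's code: ChunkChecksumSketch and checksums are hypothetical names, the use of java.util.zip.CRC32 as the checksum and the big-endian byte order in int2byte are assumptions, and the sketch processes a whole byte array at once instead of a stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical illustration of per-chunk checksumming.
class ChunkChecksumSketch {

    // Packs an int into 4 bytes, most significant byte first (assumed order).
    static byte[] int2byte(int value) {
        return new byte[] {
            (byte) (value >>> 24),
            (byte) (value >>> 16),
            (byte) (value >>> 8),
            (byte) value
        };
    }

    // Splits data into chunks of bytesPerChecksum bytes (the last one may be
    // shorter) and returns one 4-byte checksum per chunk.
    static List<byte[]> checksums(byte[] data, int bytesPerChecksum) {
        List<byte[]> result = new ArrayList<>();
        CRC32 sum = new CRC32();
        for (int off = 0; off < data.length; off += bytesPerChecksum) {
            int len = Math.min(bytesPerChecksum, data.length - off);
            sum.reset();
            sum.update(data, off, len);
            result.add(int2byte((int) sum.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300];
        // 1300 bytes with 512-byte chunks gives chunks of 512, 512 and 276 bytes.
        System.out.println(checksums(data, 512).size());
        byte[] text = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.println(checksums(text, 512).size());
    }
}
```

Each data chunk and its 4-byte checksum are then handed to writeChunk together.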

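This method writes the data by calling its own abstract method writeChunk, which DFSOutputStream implements. writeChunk loads the source file's data into the dataQueue, so that the DataStreamer thread started earlier can read from the dataQueue and write to the designated block. The code shows the details: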

        private synchronized void enqueueCurrentPacket()
        {
            synchronized (dataQueue)
            {
                if (currentPacket == null)
                    return;
                dataQueue.addLast(currentPacket);
                dataQueue.notifyAll();
                lastQueuedSeqno = currentPacket.seqno;
                currentPacket = null;
            }
        }
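
The handoff between the writing thread and DataStreamer is a producer/consumer exchange on a LinkedList guarded by its own monitor. The sketch below shows only that pattern, under simplifying assumptions: PacketQueueSketch and its methods are hypothetical names, packets are plain strings, and heartbeats, the ack queue and the actual network write are left out.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical illustration of the dataQueue handoff.
class PacketQueueSketch {
    private final LinkedList<String> dataQueue = new LinkedList<>();
    private final List<String> sent = new ArrayList<>();
    private boolean closed = false;

    // Producer side: same shape as enqueueCurrentPacket().
    void enqueue(String packet) {
        synchronized (dataQueue) {
            dataQueue.addLast(packet);
            dataQueue.notifyAll(); // wake the streamer if it is waiting
        }
    }

    // Signals that no more packets will arrive.
    void close() {
        synchronized (dataQueue) {
            closed = true;
            dataQueue.notifyAll();
        }
    }

    // Consumer side: waits for packets and "sends" them in queue order.
    void drain() throws InterruptedException {
        while (true) {
            String packet;
            synchronized (dataQueue) {
                while (dataQueue.isEmpty() && !closed) {
                    dataQueue.wait();
                }
                if (dataQueue.isEmpty()) {
                    return; // closed and nothing left to send
                }
                packet = dataQueue.removeFirst();
            }
            sent.add(packet); // only the streamer thread touches this list while running
        }
    }

    // Runs one producer (the caller) against one streamer thread.
    static List<String> demo() {
        PacketQueueSketch queue = new PacketQueueSketch();
        Thread streamer = new Thread(() -> {
            try {
                queue.drain();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "streamer");
        streamer.start();
        for (int i = 0; i < 5; i++) {
            queue.enqueue("packet-" + i);
        }
        queue.close();
        try {
            streamer.join(); // join makes the streamer's writes visible here
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return queue.sent;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the queue is first-in, first-out and there is a single consumer, the packets are sent in the order they were enqueued.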

2013-08-07 17:51