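The demo below writes one hundred IntWritable/Text pairs to a SequenceFile (the output path is taken from args[0]), printing the byte offset at which each record will be written: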
```java
package cn.edu.xmu.dm.mpdemo.ioformat;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

/**
 * desc: SequenceFileWriter
 * <code>SequenceFileWriteDemo</code>
 *
 * @author chenwq (irwenqiang@gmail.com)
 * @version 1.0 2012/05/19
 */
public class SequenceFileWriteDemo {

    private static final String[] DATA = {
            "One, two, buckle my shoe",
            "Three, four, shut the door",
            "Five, six, pick up sticks",
            "Seven, eight, lay them straight",
            "Nine, ten, a big fat hen" };

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            /*
             * fs:    the FileSystem the file is written to
             * conf:  the Configuration object
             * key:   the key's class
             * value: the value's class
             */
            writer = SequenceFile.createWriter(fs, conf, path,
                    key.getClass(), value.getClass());
            // Block-compressed variant (see the size comparison below):
            // writer = SequenceFile.createWriter(fs, conf, path,
            //         key.getClass(), value.getClass(), CompressionType.BLOCK);
            for (int i = 0; i < 100; i++) {
                key.set(100 - i);
                value.set(DATA[i % DATA.length]);
                // getLength() is the current file offset, i.e. where this
                // record will start.
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```
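The matching read demo iterates over every record in the file. A SequenceFile stores its key and value class names in the file header, so the reader can instantiate them via ReflectionUtils without compile-time knowledge of the types; records that fall immediately after a sync point are flagged with a `*`: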
```java
package cn.edu.xmu.dm.mpdemo.ioformat;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * desc: SequenceFileReader
 * <code>SequenceFileReadDemo</code>
 *
 * @author chenwq (irwenqiang@gmail.com)
 * @version 1.0 2012/05/19
 */
public class SequenceFileReadDemo {

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            Writable key = (Writable) ReflectionUtils.newInstance(
                    reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(
                    reader.getValueClass(), conf);
            long position = reader.getPosition();
            while (reader.next(key, value)) {
                String syncSeen = reader.syncSeen() ? "*" : "";
                System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
                position = reader.getPosition(); // beginning of next record
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}
```
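The sync points surfaced above also support random access. This is a minimal sketch, not part of the original post, reusing the reader, key, and value from the demo; the offset 360 is an arbitrary illustrative value:

```java
// Hedged sketch: random access on the SequenceFile.Reader created above.
// The byte offset below is illustrative only.
reader.sync(360);                     // advance to the first sync point after byte 360
long boundary = reader.getPosition(); // now positioned on a record boundary
reader.next(key, value);              // read the first record past the sync point

reader.seek(boundary);                // seek() is only valid at an exact record boundary
reader.next(key, value);              // re-reads the same record
```

Note that sync(p) positions the reader at the end of the file when no sync point follows p, in which case next() simply returns false.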
Size comparison after enabling block compression:
```
root@ubuntu:~# hadoop fs -ls mpdemo/
Found 2 items
-rw-r--r--   3 root supergroup       4788 2012-05-19 00:11 /user/root/mpdemo/seqinput
-rw-r--r--   3 root supergroup        484 2012-05-19 00:17 /user/root/mpdemo/seqinputblock
```
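The block-compressed file is roughly a tenth of the uncompressed size (484 vs. 4788 bytes), which is expected here because the demo cycles through the same five strings, giving the codec highly repetitive blocks to work with. Below is a minimal sketch of the block-compressed writer, matching the commented-out line in the write demo above; it relies on whatever default codec the cluster is configured with:

```java
// Hedged sketch: the same writer, with block compression enabled.
// CompressionType.BLOCK compresses batches of records together, which
// usually compresses better than per-record compression.
writer = SequenceFile.createWriter(fs, conf, path,
        key.getClass(), value.getClass(), CompressionType.BLOCK);
```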