MapReduce Algorithm - Secondary Sort

Blog category:

Hadoop
Algorithm

Secondary sort is used to allow some records to arrive at a reducer ahead of other records. It requires an understanding of both data arrangement and data flow (partitioning, sorting and grouping) and how they're integrated into MapReduce. As the figure below shows: The partitioner is invoked a ...

2013-07-25 19:34
Views: 1263
Comments (0)
Category: Enterprise Architecture
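
The three stages named in the secondary sort entry above (partitioning, sorting and grouping) can be imitated in plain Python. This is only a minimal in-memory sketch, not Hadoop code: the function name `secondary_sort`, the byte-sum partitioner and the sample records are all assumptions made for illustration.

```python
from itertools import groupby


def secondary_sort(records, num_partitions=2):
    """Simulate partition -> sort -> group for (natural_key, value) records."""
    partitions = [[] for _ in range(num_partitions)]
    for natural_key, value in records:
        # Partition on the natural key only, so every value of a key
        # lands in the same (simulated) reducer. The byte sum is a toy,
        # deterministic stand-in for a real hash partitioner.
        index = sum(natural_key.encode("utf-8")) % num_partitions
        partitions[index].append((natural_key, value))

    output = []
    for partition in partitions:
        # Sort on the composite key (natural key, value).
        partition.sort()
        # Group on the natural key only; values arrive already ordered.
        for key, group in groupby(partition, key=lambda kv: kv[0]):
            output.append((key, [value for _, value in group]))
    return output


if __name__ == "__main__":
    sample = [("b", 3), ("a", 2), ("b", 1), ("a", 1)]
    for key, values in secondary_sort(sample):
        print(key, values)
```

The point of the sketch is that partitioning and grouping look only at the natural key, while sorting looks at the composite key, which is why each reducer call sees its values in order.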

MapReduce Algorithm - Semi-joins

Blog category:

Algorithm
Hadoop

In the relational world, a semi-join can be defined as a join between two tables that returns rows from the first table where one or more matches are found in the second table. The difference between a semi-join and a conventional join is that rows in the first table will be returned at most once. Even if the ...

2013-07-25 18:15
Views: 1203
Comments (0)
Category: Enterprise Architecture
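
The "at most once" property described in the semi-join entry above can be shown with a small in-memory sketch. It is not a MapReduce implementation; the function name `semi_join` and the sample tables are hypothetical.

```python
def semi_join(left_rows, right_rows, key):
    """Return the rows of left_rows whose key appears in right_rows.

    Each left row is returned at most once, however many matching
    rows exist on the right side.
    """
    # Collapse the right table to its distinct join keys.
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] in right_keys]


if __name__ == "__main__":
    users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
    # User 1 has two log rows, user 2 has none.
    logs = [{"id": 1, "page": "/a"}, {"id": 1, "page": "/b"}]
    print(semi_join(users, logs, "id"))
```

A conventional inner join on the same data would return the row for user 1 twice, once per matching log row.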

MapReduce Algorithm - Another Way to Do Map-side Join

Blog category:

Algorithm
Hadoop

Map-side join is also known as replicated join, and gets its name from the fact that the smallest of the datasets is replicated to all the map hosts. You can find an implementation in Hadoop in Action. Another implementation uses CompositeInputFormat, which is shown in this blog post. The goal of ...

2013-07-25 17:51
Views: 2443
Comments (0)
Category: Enterprise Architecture

Homework - HBase Shell, Java Client and MapReduce Job

Blog category:

HBase

Env: Single node with CentOS 6.2 x86_64, 2 processors, 4 GB memory; CDH4.3 with Cloudera Manager 4.5; HBase 0.94.6-cdh4.3.0. HBase shell exercise: [root@n8 ~]# hbase shell 13/07/21 21:11:25 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native ...

2013-07-21 23:36
Views: 4425
Comments (1)
Category: Enterprise Architecture

Running MapReduce Job with HBase

Blog category:

Hadoop
HBase

Generally there are three different ways of interacting with HBase from a MapReduce application. HBase can be used as a data source at the beginning of a job, as a data sink at the end of a job, or as a shared resource. HBase as a data source: The following example using HBase as a MapReduce sourc ...

2013-07-21 01:50
Views: 4392
Comments (0)
Category: Enterprise Architecture

Adding HBase Library into Java Classpath

Blog category:

HBase
Hadoop

Suppose you write some Java code that operates on HBase via the HBase Java client interface, and you compile and package the Java source code into a jar called examples.jar. In a Hadoop cluster you can use "hbase classpath" to get the class path needed. $ java -cp examples.jar:`hbase classpath` hbase ...

2013-07-20 14:17
Views: 1042
Comments (0)
Category: Enterprise Architecture

Moving Data in/out of Hadoop Filesystem

Blog category:

Hadoop

Hadoop has a number of built-in mechanisms that can facilitate ingress and egress operations, to name a few: Embedded NameNode HTTP server; WebHDFS and Hadoop interfaces; HBase built-in API, specifically the org.apache.hadoop.hbase.mapreduce.TableInputFormat and org.apache.hadoop.hbase.mapredu ...

2013-07-18 23:11
Views: 1603
Comments (0)
Category: Enterprise Architecture

Enabling Oozie Web Console in CDH3, CDH4 with/without Cloudera Manager

Blog category:

Hadoop
Oozie

To enable Oozie's web console, you must download and add the ExtJS library to the Oozie server. If you have not already done this, proceed as follows. If you use CDH3, you must do: Download the ExtJS version 2.2 library from http://extjs.com/deploy/ext-2.2.zip and place it in a convenient loc ...

2013-07-16 23:36
Views: 1303
Comments (0)
Category: Enterprise Architecture

Specifying the Log Classification Level in Flume

Blog category:

Hadoop
Flume

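When receiving syslog-format logs over UDP or TCP, for example: flume dump 'syslogUdp(5140)' This command receives logs over UDP on port 5140. Suppose you then want to test from the command line whether logs are received successfully: echo '<37>Hello from cmd.' |nc -u localhost 5140 You must prefix the test text with <37>, which is used to classify the log entry; otherwise Flume throws the following error: 2013-07-16 08:26:49,614 [logicalNode dump-10] WARN syslog.SyslogUdpSource: 1 rejected pack ...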

2013-07-16 08:41
Views: 3733
Comments (0)
Category: Enterprise Architecture
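
The `<37>` prefix in the Flume entry above is a standard syslog priority value, computed as facility * 8 + severity, so 37 corresponds to facility 4 (security/authorization) and severity 5 (notice). The sketch below builds such a message and can send it over UDP. The helper names are hypothetical, and the host and port simply mirror the `nc -u localhost 5140` test from the entry.

```python
import socket


def syslog_pri(facility, severity):
    """Compute the syslog PRI value: facility * 8 + severity."""
    if not (0 <= facility <= 23 and 0 <= severity <= 7):
        raise ValueError("facility must be 0-23 and severity 0-7")
    return facility * 8 + severity


def build_message(facility, severity, text):
    """Prefix the text with <PRI>, as a syslog receiver expects."""
    return f"<{syslog_pri(facility, severity)}>{text}".encode("utf-8")


def send_udp(message, host="localhost", port=5140):
    """Send one datagram; host and port are assumptions taken from the entry."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, (host, port))


if __name__ == "__main__":
    # Facility 4, severity 5 gives the <37> prefix used in the entry.
    send_udp(build_message(4, 5, "Hello from cmd."))
```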

PageRank Algorithm in MapReduce

Blog category:

Hadoop
Algorithm

Chapter 5 of Data-Intensive Text Processing with MapReduce introduces how to implement the PageRank algorithm the MapReduce way. I am not going to say more about PageRank itself here; please refer to Wikipedia or other papers for further explanation. What I'm going to talk about is how to imple ...

2013-07-14 12:12
Views: 4396
Comments (0)
Category: Enterprise Architecture
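
As a rough illustration of the PageRank entry above, one iteration can be written as a map step that spreads each node's rank over its outgoing links and a reduce step that sums what each node receives. This is a simplified in-memory sketch, not the post's implementation: the function name, the damping factor of 0.85 and the assumption that every node has at least one outgoing link (no dangling nodes) are all mine.

```python
from collections import defaultdict


def pagerank_iteration(graph, ranks, damping=0.85):
    """Run one PageRank iteration over an adjacency-list graph.

    Assumes every node has at least one outgoing link and that every
    link target is itself a key of graph.
    """
    # "Map": each node sends an equal share of its rank to each neighbor.
    received = defaultdict(float)
    for node, neighbors in graph.items():
        share = ranks[node] / len(neighbors)
        for neighbor in neighbors:
            received[neighbor] += share

    # "Reduce": combine the summed shares with the random-jump term.
    n = len(graph)
    return {node: (1 - damping) / n + damping * received[node] for node in graph}


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    ranks = {node: 1 / len(graph) for node in graph}
    for _ in range(20):
        ranks = pagerank_iteration(graph, ranks)
    print(ranks)
```

In a real job each iteration would be one MapReduce pass, with the graph structure passed along next to the rank values.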

Breadth-first Graph Search in MapReduce

Blog category:

Algorithm
Hadoop

Chapter 5 of the book "Data-Intensive Text Processing with MapReduce" introduces how to parallelize breadth-first graph search with MapReduce. This parallel algorithm is a variant of Dijkstra's algorithm. I'm not going to talk about the sequential version of Dijkstra's algorithm, for ...

2013-07-13 20:44
Views: 5832
Comments (0)
Category: Enterprise Architecture
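
The iterative idea behind the breadth-first search entry above can be sketched in memory: in each round every reached node proposes distance + 1 to its neighbors, each node keeps the minimum it has seen, and the rounds repeat until nothing changes. This sketch assumes unit edge weights and that every neighbor is also a key of the graph; the function names are hypothetical and this is not the post's code.

```python
import math


def bfs_iteration(graph, dist):
    """One round: propose distances to neighbors, then keep the minimum."""
    # Every node starts with its own current distance as a candidate.
    candidates = {node: [d] for node, d in dist.items()}
    # "Map": each reached node proposes distance + 1 to its neighbors.
    for node, neighbors in graph.items():
        if dist[node] != math.inf:
            for neighbor in neighbors:
                candidates[neighbor].append(dist[node] + 1)
    # "Reduce": each node keeps the smallest candidate distance.
    return {node: min(values) for node, values in candidates.items()}


def parallel_bfs(graph, source):
    """Repeat rounds until no distance changes."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0
    while True:
        updated = bfs_iteration(graph, dist)
        if updated == dist:
            return dist
        dist = updated


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": []}
    print(parallel_bfs(graph, "a"))
```

Unreachable nodes keep an infinite distance, and the number of rounds is bounded by the longest shortest path from the source plus one final round that detects no change.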

Homework - How to Configure Hadoop Task Scheduler

Blog category:

Hadoop

To configure the MapReduce or YARN task scheduler, go to Services -> mapreduce1/yarn1 -> Configuration. Then click the 'view and edit' tab and search for the property 'mapred.jobtracker.taskScheduler'. You will see the options shown in the screenshot below:

2013-07-13 01:00
Views: 946
Comments (0)
Category: Enterprise Architecture

Homework - NASA Access Log Processing

Blog category:

Hadoop

Hadoop workshop homework. For privacy, the blog post will not show source code at all, only the job output logs and counters. Copy the packaged jar file into the Hadoop cluster: [root@n1 hadoop-examples]# scp gsun@192.168.1.102:~/prog/hadoop/cdh4-examples/cdh4-examples.jar . Password: cdh4-ex ...

2013-07-13 00:36
Views: 1295
Comments (0)
Category: Enterprise Architecture

Homework - Running Hadoop WordCount Examples

Blog category:

Hadoop

Hadoop workshop homework. I am an IntelliJ IDEA guy now (I switched to IntelliJ IDEA from Eclipse several months ago because IntelliJ IDEA is much, much better than Eclipse now). Currently IntelliJ doesn't have any Hadoop plugins, so I package the output into a jar file, then copy the jar (c ...

2013-07-12 23:44
Views: 1634
Comments (0)
Category: Enterprise Architecture

Homework - Benchmarking Hadoop Cluster

Blog category:

Hadoop

In this blog post I introduce some of the benchmarking and testing tools in the Apache Hadoop distribution. Namely, I'll look at TeraSort, NNBench and MRBench. These are popular choices to benchmark a Hadoop cluster. Before we start, let me show you the clusters on which the tests will run: ...

2013-07-12 22:20
Views: 1912
Comments (0)
Category: Enterprise Architecture
