1. A zettabyte is 10²¹ bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes.
2. It has been said that “More data usually beats better algorithms,” which is to say that for some problems (such as recommending movies or music based on past preferences), however fiendish your algorithms are, they can often be beaten simply by having more data (and a less sophisticated algorithm).
3. While the storage capacities of hard drives have increased massively over the years, access speeds—the rate at which data can be read from drives—have not kept up.
4. The first problem with reading and writing data in parallel to or from multiple disks is hardware failure. The second problem is that most analysis tasks need to be able to combine the data in some way: various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging.
5. Hadoop provides a reliable, shared storage and analysis system. The storage is provided by HDFS and the analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.
6. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative.
7. Seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data; it characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than it would to stream through it, which operates at the transfer rate. For updating a small proportion of records in a database, a traditional B-Tree (which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
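To make the seek-versus-streaming trade-off concrete, here is a back-of-envelope sketch in Java. The figures (10 ms average seek, 100 MB/s transfer rate, a 1 TB dataset of 100-byte records, touching 1% of them) are illustrative assumptions, not numbers from the text:

```java
// Back-of-envelope comparison: streaming through a dataset at the
// transfer rate vs. seeking to individual records.
// All figures below are assumed for illustration only.
public class SeekVsStream {
    public static void main(String[] args) {
        double seekMs = 10.0;             // assumed average seek time
        double transferMBps = 100.0;      // assumed sustained transfer rate
        double datasetMB = 1_000_000.0;   // 1 TB dataset
        long records = 10_000_000_000L;   // 100-byte records

        // Streaming: read the entire dataset at the transfer rate.
        double streamHours = (datasetMB / transferMBps) / 3600.0;

        // Seek-dominated: update just 1% of the records, one seek each.
        long touched = records / 100;
        double seekHours = touched * seekMs / 1000.0 / 3600.0;

        System.out.printf("Stream whole dataset: %.1f hours%n", streamHours);
        System.out.printf("Seek to 1%% of records: %.1f hours%n", seekHours);
    }
}
```

With these numbers, streaming the whole terabyte takes under three hours, while seeking to just 1% of the records takes on the order of hundreds of hours, which is why rebuilding the whole dataset with Sort/Merge can beat in-place B-Tree updates once the update fraction is large.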
8. MapReduce can be seen as a complement to an RDBMS. MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times for a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.
RDBMS compared to MapReduce

|            | Traditional RDBMS          | MapReduce                    |
|------------|----------------------------|------------------------------|
| Data size  | Gigabytes                  | Petabytes                    |
| Access     | Interactive and batch      | Batch                        |
| Updates    | Read and write many times  | Write once, read many times  |
| Structure  | Static schema              | Dynamic schema               |
| Integrity  | High                       | Low                          |
| Scaling    | Nonlinear                  | Linear                       |
9. Another difference between MapReduce and an RDBMS is the amount of structure in the datasets they operate on. MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not an intrinsic property of the data, but are chosen by the person analyzing the data.
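As a hypothetical illustration of choosing the keys at processing time, suppose each input line is a semi-structured log record like `2024-03-01 alice login` (this record format and the class below are invented for the example, not taken from the text). A mapper sketch against the Hadoop Java API:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The raw line carries no intrinsic key. This analysis chooses the
// user field; a different job could key on the date or the action
// instead, without any change to the stored data.
public class UserActionMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text user = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\\s+"); // date user action
        if (fields.length >= 3) {
            user.set(fields[1]);      // the analyst's choice of key
            context.write(user, ONE);
        }
    }
}
```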
10. One of the central assumptions that MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes.
11. MapReduce is a linearly scalable programming model. The programmer writes two functions—a map function and a reduce function—each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster they are operating on, so they can be used unchanged for a small dataset and for a massive one.
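As a minimal sketch of this, here is the classic word-count pair of functions written against the Hadoop Java MapReduce API (org.apache.hadoop.mapreduce); the class names are mine, but the pattern is the standard one. Nothing in either function depends on how big the input or the cluster is:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (byte offset, line of text) -> (word, 1) for each word.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce: (word, [1, 1, ...]) -> (word, total count).
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

The same two classes run unchanged whether the input is a single file on one machine or petabytes spread across thousands of nodes; the framework, not the programmer, decides how to partition and move the data.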
12. The approach in high-performance computing (HPC) is to distribute the work across a cluster of machines, which access a shared filesystem hosted by a SAN. This works well for predominantly compute-intensive jobs, but becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which MapReduce really starts to shine), since the network bandwidth is the bottleneck and compute nodes become idle.
13. MPI (Message Passing Interface) gives great control to the programmer, but requires the programmer to explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithm for the analysis. MapReduce operates only at the higher level: the programmer thinks in terms of functions of key and value pairs, and the data flow is implicit.