Chapter 1. Meet Hadoop


1.      A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes.


2.      It has been said that “More data usually beats better algorithms,” which is to say that for some problems (such as recommending movies or music based on past preferences), however fiendish your algorithms are, they can often be beaten simply by having more data (and a less sophisticated algorithm).

 

3.      While the storage capacities of hard drives have increased massively over the years, access speeds—the rate at which data can be read from drives—have not kept up.

 

4.      The first problem with reading and writing data in parallel to or from multiple disks is hardware failure. The second problem is that most analysis tasks need to be able to combine the data in some way. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging.

 

5.      Hadoop provides a reliable shared storage and analysis system: the storage is provided by HDFS and the analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.

 

6.      MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative.

 

7.      Seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. For updating a small proportion of records in a database, a traditional B-Tree (which is limited by the rate it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
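The seek-versus-streaming trade-off above can be made concrete with a back-of-envelope calculation. The sketch below uses illustrative (assumed) drive figures of a 100 MB/s transfer rate and a 10 ms average seek time; they are not from the original text, but the conclusion holds for any realistic pairing of the two numbers.

```python
TRANSFER_RATE = 100 * 1024**2   # assumed: 100 MB/s sequential transfer
SEEK_TIME = 0.01                # assumed: 10 ms average seek

def streaming_time(total_bytes):
    """Time to read the whole dataset sequentially (transfer-rate bound)."""
    return total_bytes / TRANSFER_RATE

def seeking_time(num_records, record_size):
    """Time to read records one at a time, paying a seek for each (seek bound)."""
    return num_records * (SEEK_TIME + record_size / TRANSFER_RATE)

one_tb = 1024**4
# Streaming roughly 1 TB sequentially:
print(f"streaming: {streaming_time(one_tb) / 3600:.1f} h")   # about 2.9 h
# Reading roughly the same 1 TB as 10 million 100 KB records via seeks:
print(f"seeking:   {seeking_time(10_000_000, 100 * 1024) / 3600:.1f} h")  # about 30.5 h
```

The seek-bound pattern is an order of magnitude slower for the same volume of data, which is why MapReduce streams through whole datasets rather than seeking to individual records.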

 

8.      MapReduce can be seen as a complement to an RDBMS. MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once, and read many times, whereas a relational database is good for datasets that are continually updated.

RDBMS compared to MapReduce

               Traditional RDBMS            MapReduce
Data size      Gigabytes                    Petabytes
Access         Interactive and batch        Batch
Updates        Read and write many times    Write once, read many times
Structure      Static schema                Dynamic schema
Integrity      High                         Low
Scaling        Nonlinear                    Linear

 

9.      Another difference between MapReduce and an RDBMS is the amount of structure in the datasets that they operate on. MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not an intrinsic property of the data, but they are chosen by the person analyzing the data.

 

10.  One of the central assumptions that MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes.

 

11.  MapReduce is a linearly scalable programming model. The programmer writes two functions—a map function and a reduce function—each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster that they are operating on, so they can be used unchanged for a small dataset and for a massive one.
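The two-function model can be sketched without Hadoop at all. The pure-Python word count below is a minimal illustration (not Hadoop's actual API): `map_fn` and `reduce_fn` each turn one set of key-value pairs into another, and neither knows anything about the size of the input or the cluster, which is what lets the same code run unchanged on a small or a massive dataset.

```python
from collections import defaultdict

def map_fn(offset, line):
    """Map: (line offset, line text) -> (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: (word, [1, 1, ...]) -> (word, total count)."""
    yield word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """A single-machine stand-in for the MapReduce framework."""
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    # Reduce phase: sorted keys, one reducer call per key.
    out = {}
    for k, vs in sorted(groups.items()):
        for rk, rv in reducer(k, vs):
            out[rk] = rv
    return out

lines = enumerate(["the quick brown fox", "the lazy dog"])
print(run_mapreduce(lines, map_fn, reduce_fn))
# -> {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

Note how the input keys (line offsets) are simply discarded by the mapper: as point 9 below observes, the keys and values are chosen by the analyst, not dictated by the data.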

 

12.  The approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem, hosted by a SAN. This works well for predominantly compute-intensive jobs, but becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which MapReduce really starts to shine), since the network bandwidth is the bottleneck and compute nodes become idle.

 

13.  MPI (Message Passing Interface) gives great control to the programmer, but requires the programmer to explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithm for the analysis. MapReduce operates only at the higher level: the programmer thinks in terms of functions of key and value pairs, and the data flow is implicit.
