Chapter 1. Meet Hadoop


1.      A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes.


2.      It has been said that “More data usually beats better algorithms,” which is to say that for some problems (such as recommending movies or music based on past preferences), however fiendish your algorithms are, they can often be beaten simply by having more data (and a less sophisticated algorithm).

 

3.      While the storage capacities of hard drives have increased massively over the years, access speeds—the rate at which data can be read from drives—have not kept up.

 

4.      The first problem with reading and writing data in parallel to or from multiple disks is hardware failure. The second problem is that most analysis tasks need to be able to combine the data in some way. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging.

 

5.      Hadoop provides a reliable shared storage and analysis system: the storage is provided by HDFS and the analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.

 

6.      MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative.

 

7.      Seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. For updating a small proportion of records in a database, a traditional B-Tree (which is limited by the rate it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
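The seek-versus-streaming trade-off above can be made concrete with a back-of-envelope calculation. The sketch below uses illustrative (assumed) drive figures of a 100 MB/s transfer rate and a 10 ms average seek time; they are not from the original text, but the conclusion holds for any realistic pairing of the two numbers.

```python
TRANSFER_RATE = 100 * 1024**2   # assumed: 100 MB/s sequential transfer
SEEK_TIME = 0.01                # assumed: 10 ms average seek

def streaming_time(total_bytes):
    """Time to read the whole dataset sequentially (transfer-rate bound)."""
    return total_bytes / TRANSFER_RATE

def seeking_time(num_records, record_size):
    """Time to read records one at a time, paying a seek for each (seek bound)."""
    return num_records * (SEEK_TIME + record_size / TRANSFER_RATE)

one_tb = 1024**4
# Streaming roughly 1 TB sequentially:
print(f"streaming: {streaming_time(one_tb) / 3600:.1f} h")   # about 2.9 h
# Reading roughly the same 1 TB as 10 million 100 KB records via seeks:
print(f"seeking:   {seeking_time(10_000_000, 100 * 1024) / 3600:.1f} h")  # about 30.5 h
```

The seek-bound pattern is an order of magnitude slower for the same volume of data, which is why MapReduce streams through whole datasets rather than seeking to individual records.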

 

8.      MapReduce can be seen as a complement to an RDBMS. MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once, and read many times, whereas a relational database is good for datasets that are continually updated.

RDBMS compared to MapReduce

               Traditional RDBMS            MapReduce
Data size      Gigabytes                    Petabytes
Access         Interactive and batch        Batch
Updates        Read and write many times    Write once, read many times
Structure      Static schema                Dynamic schema
Integrity      High                         Low
Scaling        Nonlinear                    Linear

 

9.      Another difference between MapReduce and an RDBMS is the amount of structure in the datasets that they operate on. MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not an intrinsic property of the data, but they are chosen by the person analyzing the data.

 

10.  One of the central assumptions that MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes.

 

11.  MapReduce is a linearly scalable programming model. The programmer writes two functions—a map function and a reduce function—each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster that they are operating on, so they can be used unchanged for a small dataset and for a massive one.
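The two-function model can be sketched without Hadoop at all. The pure-Python word count below is a minimal illustration (not Hadoop's actual API): `map_fn` and `reduce_fn` each turn one set of key-value pairs into another, and neither knows anything about the size of the input or the cluster, which is what lets the same code run unchanged on a small or a massive dataset.

```python
from collections import defaultdict

def map_fn(offset, line):
    """Map: (line offset, line text) -> (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: (word, [1, 1, ...]) -> (word, total count)."""
    yield word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """A single-machine stand-in for the MapReduce framework."""
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    # Reduce phase: sorted keys, one reducer call per key.
    out = {}
    for k, vs in sorted(groups.items()):
        for rk, rv in reducer(k, vs):
            out[rk] = rv
    return out

lines = enumerate(["the quick brown fox", "the lazy dog"])
print(run_mapreduce(lines, map_fn, reduce_fn))
# -> {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

Note how the input keys (line offsets) are simply discarded by the mapper: as point 9 below observes, the keys and values are chosen by the analyst, not dictated by the data.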

 

12.  The approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem, hosted by a SAN. This works well for predominantly compute-intensive jobs, but becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which MapReduce really starts to shine), since the network bandwidth is the bottleneck and compute nodes become idle.

 

13.  MPI (Message Passing Interface) gives great control to the programmer, but requires the programmer to explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithm for the analysis. MapReduce operates only at the higher level: the programmer thinks in terms of functions of key and value pairs, and the data flow is implicit.
