mj4d

Hadoop overview

  • The Hadoop ecosystem

It seems that every book introducing Hadoop has to open with this, so here it is:

Quote:
The project includes these subprojects:
Hadoop Common: The common utilities that support the other Hadoop subprojects.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects at Apache include:
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper™: A high-performance coordination service for distributed applications.

 

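  • Data gone wild

How do we analyze data, and what makes it hard? Over the years disk storage capacity has grown rapidly, while access speed (the rate at which data can be read from disk) has not kept pace.

An example:

Quote:
In 1990 a typical disk could hold 1370 MB of data and had a read rate of 4.4 MB/s. Today, 20 years later, 1 TB disks are commonplace, with a read rate of about 100 MB/s. At those rates, reading an entire disk used to take about 5 minutes and now takes more than 2.5 hours.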

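Reading all the data on a disk takes a long time, and writing it is slower still. This suggests a simple idea: read from many disks at once to cut the read time. Split the 1 TB of data across different disks and read them all simultaneously!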
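The arithmetic behind this idea can be sketched in a few lines of Python. This is only a back-of-the-envelope model: it assumes 1 TB = 1,000,000 MB, a perfectly even split of the data, and no coordination overhead. The figure of 100 disks and the function name `read_time_seconds` are illustrative assumptions, not something from the quoted text.

```python
def read_time_seconds(capacity_mb: float, rate_mb_per_s: float, disks: int = 1) -> float:
    """Time to read capacity_mb when it is split evenly over `disks` drives read in parallel."""
    return capacity_mb / disks / rate_mb_per_s


# Figures from the example above; 1 TB is taken as 1,000,000 MB (assumption).
disk_1990 = read_time_seconds(1370, 4.4)
disk_today = read_time_seconds(1_000_000, 100)
# Hypothetical setup: the same 1 TB spread over 100 disks.
parallel = read_time_seconds(1_000_000, 100, disks=100)

print(f"1990 disk, full read:  {disk_1990 / 60:.1f} min")
print(f"1 TB disk, full read:  {disk_today / 3600:.1f} h")
print(f"1 TB over 100 disks:   {parallel / 60:.1f} min")
```

Under these assumptions the full read drops from hours to a couple of minutes, which is the whole motivation for spreading the data out.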


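Of course, spreading the data over different disks and reading them in parallel raises a number of problems:

  1. With more hardware involved, the chance that some single component fails goes up. How do we avoid losing data?
  2. The results that are read have to be analyzed together. How do we guarantee the data is correct (for example, nothing read twice and nothing missed)?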
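For the first problem, one common answer is to keep several copies of each block of data on different disks, which is also the approach HDFS takes. The sketch below is a toy illustration only: the round-robin placement and the names `place_replicas` and `surviving_blocks` are assumptions made for this example and are not HDFS's actual placement policy or API.

```python
def place_replicas(num_blocks: int, num_disks: int, replicas: int = 3) -> dict[int, list[int]]:
    """Toy round-robin placement: block b is copied to `replicas` consecutive disks."""
    if replicas > num_disks:
        raise ValueError("cannot place more replicas than there are disks")
    return {
        block: [(block + r) % num_disks for r in range(replicas)]
        for block in range(num_blocks)
    }


def surviving_blocks(placement: dict[int, list[int]], failed_disks: set[int]) -> set[int]:
    """Blocks that still have at least one copy on a disk that has not failed."""
    return {
        block
        for block, disks in placement.items()
        if any(disk not in failed_disks for disk in disks)
    }


# Hypothetical cluster: 8 blocks on 5 disks, 3 copies each, then disks 0 and 1 fail.
placement = place_replicas(num_blocks=8, num_disks=5, replicas=3)
after_failure = surviving_blocks(placement, failed_disks={0, 1})

# Same data with a single copy per block, then disk 0 fails.
single_copy = place_replicas(num_blocks=8, num_disks=5, replicas=1)
after_single_failure = surviving_blocks(single_copy, failed_disks={0})

print(len(after_failure), "of 8 blocks survive with 3 replicas")
print(len(after_single_failure), "of 8 blocks survive with 1 replica")
```

With three copies on distinct disks, any two disk failures still leave one copy of every block; with a single copy, every block on the failed disk is gone.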

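This is where the two core pieces of Hadoop, HDFS and MapReduce, come in. In their words:

Quote:
Hadoop provides a reliable shared storage and analysis system. HDFS provides the storage, and MapReduce provides the analysis.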

 

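They also say:

Quote:
MapReduce may look like a brute-force approach: every query processes the entire dataset, or at least a large part of it. Seen the other way around, that is also its power.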
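To make the "process the whole dataset" idea concrete, here is a minimal single-process sketch of the map, group, and reduce steps in plain Python. It is not the Hadoop MapReduce API; `map_reduce`, `word_mapper`, and `count_reducer` are hypothetical names chosen for this example, and the two input lines are made-up sample data.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable


def map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterable[tuple[Hashable, int]]],
    reducer: Callable[[Hashable, list[int]], int],
) -> dict[Hashable, int]:
    """Run mapper over every record, group the emitted values by key, then reduce each group."""
    groups: dict[Hashable, list[int]] = defaultdict(list)
    # Map phase: every record in the dataset is visited (the "brute force" part).
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: each key's values are combined independently of the other keys.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line: str) -> Iterable[tuple[str, int]]:
    """Emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def count_reducer(key: Hashable, values: list[int]) -> int:
    """Sum the counts emitted for one word."""
    return sum(values)


lines = ["HDFS stores data", "MapReduce analyzes data"]
counts = map_reduce(lines, word_mapper, count_reducer)
print(counts)
```

Because each record is mapped independently and each key is reduced independently, both phases could in principle be spread across many machines, which is what the real framework does.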

 

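  • Versus relational databases

This also raises the question of what kind of data is suited to analysis with MapReduce. In general it is strong at large data (TB/PB scale) and at unstructured or semi-structured data. Small files, on the other hand, are a poor fit. One fairly obvious reason: with a large number of small files, seek time is no longer small compared with transfer time, so the benefit of minimizing seek overhead relative to the transfer rate largely disappears.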
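The small-file point can be illustrated with a rough calculation. The sketch below assumes one seek per file, a seek time of 10 ms, and a transfer rate of 100 MB/s; these figures and the function name `seek_fraction` are illustrative assumptions, not measurements.

```python
def seek_fraction(file_mb: float, seek_ms: float = 10.0, rate_mb_per_s: float = 100.0) -> float:
    """Share of the total read time spent seeking, assuming one seek per file."""
    seek_s = seek_ms / 1000
    transfer_s = file_mb / rate_mb_per_s
    return seek_s / (seek_s + transfer_s)


# Compare a small file with progressively larger ones.
for size_mb in (0.1, 1, 100):
    print(f"{size_mb:>6} MB file: {seek_fraction(size_mb):.1%} of read time is seeking")
```

Under these assumptions a 0.1 MB file spends most of its read time seeking, while for a 100 MB file the seek is about 1% of the total, which is why large files and large blocks suit this style of processing.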

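A simple comparison
Comparison of relational databases and MapReduce

                    Relational database               MapReduce
Data size           GB                                PB
Access              Interactive and batch             Batch
Updates             Read and written many times       Written once, read many times
Structure           Static schema                     Dynamic schema
Integrity           High                              Low
Scaling out         Nonlinear                         Linear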

 
