Quoted from http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
Hadoop has the concept of “Rack Awareness”. A Hadoop administrator can manually define the rack number of each slave Data Node in the cluster. Why would you go through the trouble of doing this? There are two key reasons: data loss prevention and network performance. Remember that each block of data is replicated to multiple machines so that the failure of a single machine does not lose every copy of the data. Wouldn’t it be unfortunate if all copies of a block happened to be located on machines in the same rack, and that rack experienced a failure, such as a switch failure or power failure? That would be a mess. So to avoid this, somebody needs to know where Data Nodes are located in the network topology and use that information to make an intelligent decision about where data replicas should exist in the cluster. That “somebody” is the Name Node.
There is also an assumption that two machines in the same rack have more bandwidth and lower latency between each other than two machines in two different racks. This is true most of the time. The rack switch uplink bandwidth is usually (but not always) less than its downlink bandwidth. Furthermore, in-rack latency is usually lower than cross-rack latency (but not always). If at least one of those two basic assumptions is true, wouldn’t it be cool if Hadoop could use the same Rack Awareness that protects data to also optimally place work streams in the cluster, improving network performance? Well, it does!
Quoted from http://developer.yahoo.com/hadoop/tutorial/module2.html#rack
Rack Awareness
For small clusters in which all servers are connected by a single switch, there are only two levels of locality: "on-machine" and "off-machine." When loading data from a DataNode's local drive into HDFS, the NameNode will schedule one copy to go into the local DataNode, and will pick two other machines at random from the cluster.
For larger Hadoop installations which span multiple racks, it is important to ensure that replicas of data exist on multiple racks. This way, the loss of a switch does not render portions of the data unavailable due to all replicas being underneath it.
HDFS can be made rack-aware by the use of a script which allows the master node to map the network topology of the cluster. While alternate configuration strategies can be used, the default implementation allows you to provide an executable script which returns the "rack address" of each of a list of IP addresses.
The network topology script receives as arguments one or more IP addresses of nodes in the cluster. It returns on stdout a list of rack names, one for each input. The input and output order must be consistent.
To set the rack mapping script, specify the key topology.script.file.name in conf/hadoop-site.xml. This provides a command to run to return a rack id; it must be an executable script or program. By default, Hadoop will attempt to send a set of IP addresses to the script as several separate command-line arguments. You can control the maximum acceptable number of arguments with the topology.script.number.args key.
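As a concrete sketch, the corresponding entries in conf/hadoop-site.xml might look like the following. The script path /etc/hadoop/topology.sh is only a placeholder for wherever your executable actually lives, and the argument limit of 1 matches the single-argument example script shown later.

<property>
  <name>topology.script.file.name</name>
  <!-- placeholder path; point this at your own executable script -->
  <value>/etc/hadoop/topology.sh</value>
</property>
<property>
  <name>topology.script.number.args</name>
  <value>1</value>
</property>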
Rack ids in Hadoop are hierarchical and look like path names. By default, every node has a rack id of /default-rack. You can set rack ids for nodes to any arbitrary path, e.g., /foo/bar-rack. Path elements further to the left are higher up the tree. Thus a reasonable structure for a large installation may be /top-switch-name/rack-name.
Hadoop rack ids are not currently expressive enough to handle an unusual routing topology such as a 3-d torus; they assume that each node is connected to a single switch which in turn has a single upstream switch. This is not usually a problem, however. Actual packet routing will be directed using the topology discovered by or set in switches and routers. The Hadoop rack ids will be used to find "near" and "far" nodes for replica placement (and in 0.17, MapReduce task placement).
The following example script performs rack identification based on IP addresses given a hierarchical IP addressing scheme enforced by the network administrator. This may work directly for simple installations; more complex network configurations may require a file- or table-based lookup process. Care should be taken in that case to keep the table up-to-date as nodes are physically relocated, etc. This script requires that the maximum number of arguments be set to 1.
#!/bin/bash
# Set rack id based on IP address.
# Assumes the network administrator has complete control
# over IP addresses assigned to nodes and that they are
# in the 10.x.y.z address space. Assumes that
# IP addresses are distributed hierarchically, e.g.,
# 10.1.y.z is one data center segment and 10.2.y.z is another;
# 10.1.1.z is one rack, 10.1.2.z is another rack in
# the same segment, etc.
#
# This is invoked with an IP address as its only argument.

# get IP address from the input
ipaddr=$1

# select "x.y" and convert it to "x/y"
segments=`echo $ipaddr | cut --delimiter=. --fields=2-3 --output-delimiter=/`
echo /${segments}
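To sanity-check the script, you might save it under a hypothetical name such as ip-topology.sh, make it executable, and run it by hand; with the addressing scheme described above, an address like 10.1.2.15 should map to the rack id /1/2:

# ip-topology.sh is a hypothetical file name; any executable path
# configured in topology.script.file.name will do
$ chmod +x ip-topology.sh
$ ./ip-topology.sh 10.1.2.15
/1/2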
Topology script sample from the Hadoop Wiki.
Topology scripts are used by Hadoop to determine the rack location of nodes. This information is used by Hadoop to replicate block data to redundant racks. Here is a sample script that uses a separate data file. You can specify the rack mapping script via the key topology.script.file.name in conf/hadoop-site.xml; it must be an executable script or program.
Topology Script
#!/bin/bash
# Look up each argument (host name or IP address) in the
# topology.data mapping file and print its rack id; fall back
# to /default/rack for nodes that are not listed.
HADOOP_CONF=/etc/hadoop/conf

while [ $# -gt 0 ] ; do
  nodeArg=$1
  exec< ${HADOOP_CONF}/topology.data
  result=""
  while read line ; do
    ar=( $line )
    if [ "${ar[0]}" = "$nodeArg" ] ; then
      result="${ar[1]}"
    fi
  done
  shift
  if [ -z "$result" ] ; then
    echo -n "/default/rack "
  else
    echo -n "$result "
  fi
done
Data file topology.data used by the topology script above.
hadoopdata1.ec.com     /dc1/rack1
hadoopdata1            /dc1/rack1
10.1.1.1               /dc1/rack2
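Assuming the script above is saved as an executable (say topology.sh, a placeholder name) and topology.data sits in /etc/hadoop/conf as the HADOOP_CONF variable expects, a manual run with a mix of known and unknown nodes should behave roughly like this, with unmatched nodes falling back to /default/rack:

# topology.sh is a placeholder name for the script above;
# topology.data must be present in /etc/hadoop/conf
$ ./topology.sh hadoopdata1.ec.com 10.1.1.1 some-unknown-host
/dc1/rack1 /dc1/rack2 /default/rack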
OpenFlow
Even more interesting would be an OpenFlow network, where the Name Node could query the OpenFlow controller about a node’s location in the topology. Refer to http://bradhedlund.com/2011/04/21/data-center-scale-openflow-sdn/