- 浏览: 610939 次
- 性别:
- 来自: 上海
文章分类
最新评论
-
月光杯:
问题解决了吗?
Exceptions in HDFS -
iostreamin:
神,好厉害,这是我找到的唯一可以ac的Java代码,厉害。
[leetcode] word ladder II -
standalone:
One answer I agree with:引用Whene ...
How many string objects are created? -
DiaoCow:
不错!,一开始对这些确实容易犯迷糊
erlang中的冒号 分号 和 句号 -
standalone:
Exception in thread "main& ...
one java interview question
http://www.ebaytechblog.com/2010/10/29/hadoop-the-power-of-the-elephant/
Hadoop – The Power of the Elephant
by Anil Madan on 10/29/2010
in Machine Learning
In a previous post, Junling discussed data mining and our need to process petabytes of data to gain insights from information. We use several tools and systems to help us with this task; the one I’ll discuss here is Apache Hadoop.
Created by Doug Cutting in 2006 who named it after his son’s stuffed yellow elephant, and based on Google’s MapReduce paper in 2004, Hadoop is an open source framework for fault tolerant, scalable, distributed computing on commodity hardware.
MapReduce is a flexible programming model for processing large data sets:
Map takes key/value pairs as input and generates an intermediate output of another type of key/value pairs, while Reduce takes the keys produced in the Map step along with a list of values associated with the same key to produce the final output of key/value pairs.
Map (key1, value1) -> list (key2, value2)
Reduce (key2, list (value2)) -> list (key3, value3)
Ecosystem
Athena, our first large cluster was put in use earlier this year.
Let’s look at the stack from bottom to top:
Core – The Hadoop runtime, some common utilities, and the Hadoop Distributed File System (HDFS). The File System is optimized for reading and writing big blocks of data (128 MB to 256 MB).
MapReduce – provides the APIs and components to develop and execute jobs.
Data Access – the most prominent data access frameworks today are HBase, Pig and Hive.
HBase – Column oriented multidimensional spatial database inspired by Google’s BigTable. HBase provides sorted data access by maintaining partitions or regions of data. The underlying storage is HDFS.
Pig (Latin) – A procedural language which provides capabilities to load, filter, transform, extract, aggregate, join and group data. Developers use Pig for building data pipelines and factories.
Hive – A declarative language with SQL syntax used to build data warehouse. The SQL interface makes Hive an attractive choice for developers to quickly validate data, for product managers and for analysts.
Tools & Libraries – UC4 is an enterprise scheduler used by eBay to automate data loading from multiple sources.
Libraries: Statistical (R), machine learning (Mahout), and mathematical libraries (Hama), and eBay’s homegrown library for parsing web logs (Mobius).
Monitoring & Alerting – Ganglia is a distributed monitoring system for clusters. Nagios is used for alerting on key events like servers being unreachable or disks being full.
Infrastructure
Our enterprise servers run 64-bit RedHat Linux.
NameNode is the master server responsible for managing the HDFS.
JobTracker is responsible for coordination of the Jobs and Tasks associated to the Jobs.
HBaseMaster stores the root storage for HBase and facilitates the coordination with blocks or regions of storage.
Zookeeper is a distributed lock coordinator providing consistency for HBase.
The storage and compute nodes are 1U units running Cent OS with 2 quad core machines and storage space of 12 to 24TB. We pack our racks with 38 to 42 of these units to have a highly dense grid.
On the networking side, we use top of rack switches with a node bandwidth of 1Gbps. The rack switches uplink to the core switches with a line rate of 40Gpbs to support the high bandwidth necessary for data to be shuffled around.
Scheduling
Our cluster is used by many teams within eBay, for production as well as one-time jobs. We use Hadoop’s Fair Scheduler to manage allocations, define job pools for teams, assign weights, limit concurrent jobs per user and team, set preemption timeouts and delayed scheduling.
Data Sourcing
On a daily basis we ingest about 8 to 10 TB of new data.
Road Ahead
Here are some of the challenges we are working on as we build out our infrastructure:
Scalability
In its current incarnation, the master server NameNode has scalability issues. As the file system of the cluster grows, so does the memory footprint as it keeps the entire metadata in memory. For 1 PB of storage approximately 1 GB of memory is needed. Possible solutions are hierarchical namespace partitioning or leveraging Zookeeper in conjunction with HBase for metadata management.
Availability
NameNode’s availability is critical for production workloads. The open source community is working on several cold, warm, and hot standby options like Checkpoint and Backup nodes; Avatar nodes switching avatar from the Secondary NameNode; journal metadata replication techniques. We are evaluating these to build our production clusters.
Data Discovery
Support data stewardship, discovery, and schema management on top of a system which inherently does not support structure. A new project is proposing to combine Hive’s metadata store and Owl into a new system, called Howl. Our effort is to tie this into our analytics platform so that our users can easily discover data across the different data systems.
Data Movement
We are working on publish/subscription data movement tools to support data copy and reconciliation across our different subsystems like the Data Warehouse and HDFS.
Policies
Enable good Retention, Archival, and Backup policies with storage capacity management through quotas (the current Hadoop quotas need some work). We are working on defining these across our different clusters based on the workload and the characteristics of the clusters.
Metrics, Metrics, Metrics
We are building robust tools which generate metrics for data sourcing, consumption, budgeting, and utilization. The existing metrics exposed by some of the Hadoop enterprise servers are either not enough, or transient which make patterns of cluster usage hard to see.
eBay is changing how it collects, transforms, and uses data to generate business intelligence. We’re hiring, and we’d love to have you come help.
Anil Madan
Director of Engineering, Analytics Platform Development
发表评论
-
hadoop-2.2.0 build failure due to missing dependancy
2014-01-06 13:18 746The bug and fix is at https://i ... -
HDFS中租约管理源代码分析
2013-07-05 18:05 0HDFS中Client写文件的时候要获得一个租约,用来保证Cl ... -
Question on HBase source code
2013-05-22 15:05 1098I'm reading source code of hbas ... -
Using the libjars option with Hadoop
2013-05-20 15:03 963As I have said in my last post, ... -
学习hadoop之基于protocol buffers的 RPC
2012-11-15 23:23 10101现在版本的hadoop各种serv ... -
学习hadoop之基于protocol buffers的 RPC
2012-11-15 22:59 2现在版本的hadoop各种server、client RPC端 ... -
Hadoop RPC 一问
2012-11-14 14:43 121看代码时候发现好像有个地方做得多余,不知道改一下会不会有好处, ... -
Hadoop Version Graph
2012-11-14 11:47 921可以到这里看全文: http://cloudblog.8km ... -
Hadoop 2.0 代码分析---MapReduce
2012-10-25 18:27 7089本文参考hadoop的版本: hadoop-2.0.1-alp ... -
how to study hadoop?
2012-04-27 15:34 1523From StackOverflow http://stack ... -
首相发怒记之hadoop篇
2012-03-23 12:14 794我在youtube上看到的,某位能翻*墙的看一下吧,挺好笑的。 ... -
一个HDFS Error
2011-06-11 21:53 1528ERROR: hdfs.DFSClient: Excep ... -
hadoop cluster at ebay
2011-06-11 21:39 1152Friday, December 17, 2010Hadoop ... -
【读书笔记】Data warehousing and analytics infrastructure at facebook
2011-03-18 22:03 1946这好像是sigmod2010上的paper。 读了之后做了以 ... -
impact of total region numbers?
2011-01-19 16:31 924这几天tune了hbase的几个参数,有些有意思的结果。具体看 ... -
Will all HFiles managed by a regionserver kept open
2011-01-19 10:29 1463code 没看仔细,所以在hbase 的mail list上面 ... -
problems in building hadoop
2010-12-22 10:28 1005When I try to modify some code ... -
HDFS scalability: the limits to growth
2010-11-30 12:52 2029Abstract: The Hadoop Distr ... -
hadoop-0.20.2+737 and hbase-0.20.6 not compatible?
2010-11-11 13:28 1304master log里面发现 0 region server ... -
Implementing WebGIS on Hadoop: A Case Study of Improving Small File IO Performan
2010-07-26 22:46 1669Implementing WebGIS on Hadoo ...
相关推荐
《Hadoop在eBay中的使用历程》是一篇深入探讨大数据处理技术如何在电子商务巨头eBay中发挥关键作用的文章。文章作者通过分享eBay在使用Hadoop进行数据处理和分析的实践经验,揭示了这一开源框架在实际业务场景中的...
Ideal for enterprise architects, IT managers, application architects, and data engineers, this book shows you how to overcome the many challenges that emerge during Hadoop projects. You'll explore the...
Hadoop Performance at LinkedIn
,Hadoop 技术已经在互联网领域得到了广泛的应用。互联网公司往往需要 存储海量的数据并对其进行处理,而这正是Hadoop 的强项。如Facebook 使用Hadoop 存储 内部的日志拷贝,以及数据挖掘和日志统计;Yahoo !利用...
文章标题《Hadoop at 10-the History and Evolution of the Apache Hadoop Ecosystem》和描述“ArchSummit”提示本文内容可能与Apache Hadoop生态系统的发展历史、现状以及未来展望相关,且很可能是在某个架构师峰会...
【第五届HADOOP中国2011大会_PPT】ebay公司的对hadoop的改进。 Evolution and Revolution of eBay’s Hadoop Stack, Juhan Lee, Director of Hadoop EngineeringPlatform, eBay inc
Hadoop是Apache软件基金会开发的一个开源分布式计算框架,它允许在普通硬件上高效处理大量数据。在Windows环境下,Hadoop的使用与Linux有所不同,因为它的设计最初是针对Linux操作系统的。"winutils"和"hadoop.dll...
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) at ...
在IT行业中,Hadoop是一个广泛使用的开源框架,主要用于大数据处理和分布式存储。Hadoop 2.7.3是这个框架的一个稳定版本,它包含了多个改进和优化,以提高性能和稳定性。在这个版本中,Winutils.exe和hadoop.dll是两...
Hadoop是一个开源的分布式计算框架,由Apache基金会开发,它主要设计用于处理和存储大量数据。在提供的信息中,我们关注的是"Hadoop的dll文件",这是一个动态链接库(DLL)文件,通常在Windows操作系统中使用,用于...
在Hadoop生态系统中,`hadoop.dll`和`winutils.exe`是两个关键组件,尤其对于Windows用户来说,它们在本地开发和运行Hadoop相关应用时必不可少。`hadoop.dll`是一个动态链接库文件,主要用于在Windows环境中提供...
在IT行业中,Hadoop是一个广泛使用的开源框架,主要用于大数据处理和分布式存储。Hadoop 2.7.3是Hadoop发展中的一个重要版本,它包含了众多的优化和改进,旨在提高性能、稳定性和易用性。在这个版本中,`hadoop.dll`...
在大数据处理领域,Hadoop是一个不可或缺的开源框架,它提供了分布式存储和计算的能力。本文将详细探讨与"Hadoop.dll"和"winutils.exe"相关的知识点,以及它们在Hadoop-2.7.1版本中的作用。 Hadoop.dll是Hadoop在...
Hadoop 是一个处理、存储和分析海量的分布式、非结构化数据的开源框架。最初由 Yahoo 的工程师 Doug Cutting 和 Mike Cafarella Hadoop 是一个处理、存储和分析海量的分布式、非结构化数据的开源框架。最初由 Yahoo...
### Hadoop 在 Cloudera 的应用 #### 一、Cloudera 的背景与构建 Cloudera 是一家致力于提供企业级大数据解决方案的公司,由 Jeff Hammerbacher 等人创立。作为 Cloudera 的首席科学家兼产品副总裁,Hammerbacher ...
在Hadoop生态系统中,Hadoop 2.7.7是一个重要的版本,它为大数据处理提供了稳定性和性能优化。Hadoop通常被用作Linux环境下的分布式计算框架,但有时开发者或学习者在Windows环境下也需要进行Hadoop相关的开发和测试...
For the latest information about Hadoop, please visit our website at: http://hadoop.apache.org/ and our wiki, at: http://wiki.apache.org/hadoop/ This distribution includes cryptographic software. ...
Hadoop是一个由Apache基金会所开发的分布式系统基础架构。用户可以在不了解分布式底层细节的情况下,开发分布式程序。充分利用集群的威力进 Hadoop是一个由Apache基金会所开发的分布式系统基础架构。用户可以在不...
标题 "hadoop2.6 hadoop.dll+winutils.exe" 提到的是Hadoop 2.6版本中的两个关键组件:`hadoop.dll` 和 `winutils.exe`,这两个组件对于在Windows环境中配置和运行Hadoop至关重要。Hadoop原本是为Linux环境设计的,...
Hadoop是Apache软件基金会开发的一个开源分布式计算框架,它允许在大规模集群中高效处理和存储海量数据。这个压缩包文件包含的"hadop实用案例"很可能是为了帮助初学者理解和应用Hadoop技术。以下是关于Hadoop的一些...