[Repost] Hadoop at eBay

Hadoop – The Power of the Elephant
by Anil Madan on 10/29/2010

in Machine Learning

In a previous post, Junling discussed data mining and our need to process petabytes of data to gain insights from information. We use several tools and systems to help us with this task; the one I’ll discuss here is Apache Hadoop.

Hadoop is an open source framework for fault-tolerant, scalable, distributed computing on commodity hardware. It was created in 2006 by Doug Cutting, who named it after his son’s stuffed yellow elephant, and it is based on Google’s 2004 MapReduce paper.

MapReduce is a flexible programming model for processing large data sets:
Map takes key/value pairs as input and generates an intermediate output of another type of key/value pairs, while Reduce takes the keys produced in the Map step along with a list of values associated with the same key to produce the final output of key/value pairs.

Map (key1, value1) -> list (key2, value2)
Reduce (key2, list (value2)) -> list (key3, value3)
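
To make the two signatures above concrete, here is a minimal single-process sketch of the model using the classic word-count example. It is not Hadoop's API: the function names (`map_words`, `reduce_counts`, `run_mapreduce`) and the sample documents are illustrative assumptions, and the in-memory grouping step only stands in for the shuffle that Hadoop performs across machines.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple


def map_words(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map(key1, value1) -> list(key2, value2): emit (word, 1) for each word."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Iterator[Tuple[str, int]]:
    """Reduce(key2, list(value2)) -> list(key3, value3): sum the counts."""
    yield word, sum(counts)


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    mapper: Callable,
    reducer: Callable,
) -> List[Tuple[str, int]]:
    # Shuffle stand-in: group every intermediate value under its key.
    groups = defaultdict(list)
    for key1, value1 in records:
        for key2, value2 in mapper(key1, value1):
            groups[key2].append(value2)
    # Reduce each key's value list; keys are sorted for deterministic output.
    output = []
    for key2 in sorted(groups):
        output.extend(reducer(key2, groups[key2]))
    return output


# Hypothetical input: (document id, document text) pairs.
docs = [("d1", "big data big cluster"), ("d2", "big elephant")]
word_counts = run_mapreduce(docs, map_words, reduce_counts)
print(word_counts)
# [('big', 3), ('cluster', 1), ('data', 1), ('elephant', 1)]
```

Because each mapper call sees only one record and each reducer call sees only one key, both phases can be spread across many machines, which is what makes the model scale.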

Ecosystem

Athena, our first large cluster, was put into use earlier this year.
Let’s look at the stack from bottom to top:

Core – The Hadoop runtime, some common utilities, and the Hadoop Distributed File System (HDFS). The File System is optimized for reading and writing big blocks of data (128 MB to 256 MB).
MapReduce – provides the APIs and components to develop and execute jobs.
Data Access – the most prominent data access frameworks today are HBase, Pig and Hive.
HBase – Column-oriented, sparse, multidimensional database inspired by Google’s BigTable. HBase provides sorted data access by maintaining partitions or regions of data. The underlying storage is HDFS.
Pig (Latin) – A procedural language which provides capabilities to load, filter, transform, extract, aggregate, join and group data. Developers use Pig for building data pipelines and factories.
Hive – A declarative language with SQL syntax used to build data warehouses. The SQL interface makes Hive an attractive choice for developers who want to validate data quickly, as well as for product managers and analysts.
Tools & Libraries – UC4 is an enterprise scheduler used by eBay to automate data loading from multiple sources.
Libraries: Statistical (R), machine learning (Mahout), and mathematical libraries (Hama), and eBay’s homegrown library for parsing web logs (Mobius).
Monitoring & Alerting – Ganglia is a distributed monitoring system for clusters. Nagios is used for alerting on key events like servers being unreachable or disks being full.
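
The HBase entry in the stack above mentions sorted data access through regions. As a rough sketch of that idea only (not HBase's client API), the snippet below keeps hypothetical region start keys in sorted order and uses binary search to find the region responsible for a row key. The region names and key boundaries are made up for illustration.

```python
from bisect import bisect_right

# Hypothetical regions, identified by their start row keys in sorted order.
# The first region starts at the empty key so that every row key has a home.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_NAMES = ["region-0", "region-1", "region-2", "region-3"]


def find_region(row_key: str) -> str:
    """Return the region whose range [start, next_start) contains row_key."""
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_NAMES[index]


print(find_region("apple"))  # region-0
print(find_region("g"))      # region-1
print(find_region("zebra"))  # region-3
```

Since rows are kept sorted by key, neighbouring keys land in the same or adjacent regions, which is why range scans over sorted keys are cheap in this kind of design.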
Infrastructure
Our enterprise servers run 64-bit RedHat Linux.

NameNode is the master server responsible for managing the HDFS.
JobTracker is responsible for coordinating jobs and the tasks associated with them.
HBaseMaster stores the root storage for HBase and facilitates the coordination with blocks or regions of storage.
Zookeeper is a distributed lock coordinator providing consistency for HBase.
The storage and compute nodes are 1U units running CentOS, each with two quad-core processors and 12 to 24 TB of storage. We pack our racks with 38 to 42 of these units to build a highly dense grid.
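
A quick back-of-the-envelope calculation shows what these figures imply per rack. The node counts and per-node storage come from the paragraph above. The replication factor of 3 is an assumption (it is the HDFS default, but the article does not say what this cluster uses).

```python
ASSUMED_REPLICATION = 3  # HDFS default; not stated in the article


def rack_raw_capacity_tb(nodes_per_rack: int, tb_per_node: int) -> int:
    """Raw disk capacity of one rack, in TB."""
    return nodes_per_rack * tb_per_node


low = rack_raw_capacity_tb(38, 12)
high = rack_raw_capacity_tb(42, 24)
print(f"Raw capacity per rack: {low} to {high} TB")
# Raw capacity per rack: 456 to 1008 TB

# Space left for unique data if every block is stored ASSUMED_REPLICATION times.
print(f"After replication: {low / ASSUMED_REPLICATION:.0f} "
      f"to {high / ASSUMED_REPLICATION:.0f} TB")
# After replication: 152 to 336 TB
```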

On the networking side, we use top-of-rack switches with a node bandwidth of 1 Gbps. The rack switches uplink to the core switches at a line rate of 40 Gbps to support the high bandwidth needed to shuffle data around.
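
One way to read these numbers is as an oversubscription ratio: the total bandwidth the nodes in a rack could demand, divided by the rack's uplink. The sketch below assumes the 40 Gbps figure is the total uplink capacity of one rack, which the article does not state explicitly, and takes the 38 to 42 nodes per rack from the previous paragraph.

```python
NODE_GBPS = 1.0
UPLINK_GBPS = 40.0  # assumed to be the total uplink capacity per rack


def oversubscription(nodes_per_rack: int) -> float:
    """Ratio of aggregate node bandwidth to rack uplink bandwidth."""
    return nodes_per_rack * NODE_GBPS / UPLINK_GBPS


for nodes in (38, 42):
    print(f"{nodes} nodes: {oversubscription(nodes):.2f}:1")
# 38 nodes: 0.95:1
# 42 nodes: 1.05:1
```

Under that assumption the ratio stays close to 1:1, so the nodes in a rack can send traffic to other racks at nearly full speed, which matters during the shuffle phase of MapReduce jobs.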

Scheduling
Our cluster is used by many teams within eBay, for production as well as one-time jobs. We use Hadoop’s Fair Scheduler to manage allocations, define job pools for teams, assign weights, limit concurrent jobs per user and team, set preemption timeouts, and configure delayed scheduling.
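
To illustrate how pool weights translate into allocations, here is a simplified sketch of weighted fair sharing. It is not the Fair Scheduler's actual algorithm, which also takes minimum shares, per-pool demand, and preemption into account. The pool names, weights, and slot count are hypothetical.

```python
from typing import Dict


def fair_shares(total_slots: int, pool_weights: Dict[str, float]) -> Dict[str, float]:
    """Split total_slots across pools in proportion to their weights."""
    total_weight = sum(pool_weights.values())
    return {
        pool: total_slots * weight / total_weight
        for pool, weight in pool_weights.items()
    }


# Hypothetical pools: one team with double weight, two with standard weight.
shares = fair_shares(1000, {"search": 2.0, "analytics": 1.0, "adhoc": 1.0})
print(shares)
# {'search': 500.0, 'analytics': 250.0, 'adhoc': 250.0}
```

A pool with weight 2 receives twice the share of a pool with weight 1 when all pools have work to run.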

Data Sourcing

On a daily basis we ingest about 8 to 10 TB of new data.

Road Ahead
Here are some of the challenges we are working on as we build out our infrastructure:

Scalability
In its current incarnation, the master server NameNode has scalability issues. As the cluster’s file system grows, so does the NameNode’s memory footprint, because it keeps all of the metadata in memory. For 1 PB of storage, approximately 1 GB of memory is needed. Possible solutions are hierarchical namespace partitioning or leveraging Zookeeper in conjunction with HBase for metadata management.
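
The rule of thumb above (about 1 GB of NameNode memory per 1 PB of storage) can be turned into a small estimator. This is only a linear extrapolation of the figure quoted in the article. In practice the footprint depends on the number of files and blocks rather than on bytes stored, so treat the output as a rough guide. The example inputs are hypothetical.

```python
GB_MEMORY_PER_PB = 1.0  # rule of thumb quoted in the article


def namenode_memory_gb(storage_pb: float) -> float:
    """Estimated NameNode memory (GB) for a given amount of storage (PB)."""
    return storage_pb * GB_MEMORY_PER_PB


def max_storage_pb(memory_gb: float) -> float:
    """Storage (PB) that a given NameNode memory budget (GB) could track."""
    return memory_gb / GB_MEMORY_PER_PB


print(namenode_memory_gb(5))  # 5.0
print(max_storage_pb(32))     # 32.0
```

Because all of this memory sits on one machine, the size of a single server's memory bounds the size of the namespace, which is what motivates the partitioning ideas mentioned above.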
Availability
NameNode’s availability is critical for production workloads. The open source community is working on several cold, warm, and hot standby options, such as Checkpoint and Backup nodes, Avatar nodes switching avatar from the Secondary NameNode, and journal metadata replication techniques. We are evaluating these to build our production clusters.
Data Discovery
Support data stewardship, discovery, and schema management on top of a system which inherently does not support structure. A new project is proposing to combine Hive’s metadata store and Owl into a new system, called Howl. Our effort is to tie this into our analytics platform so that our users can easily discover data across the different data systems.
Data Movement
We are working on publish/subscription data movement tools to support data copy and reconciliation across our different subsystems like the Data Warehouse and HDFS.
Policies
Enable good Retention, Archival, and Backup policies with storage capacity management through quotas (the current Hadoop quotas need some work). We are working on defining these across our different clusters based on the workload and the characteristics of the clusters.
Metrics, Metrics, Metrics
We are building robust tools which generate metrics for data sourcing, consumption, budgeting, and utilization. The existing metrics exposed by some of the Hadoop enterprise servers are either insufficient or transient, which makes patterns of cluster usage hard to see.

eBay is changing how it collects, transforms, and uses data to generate business intelligence. We’re hiring, and we’d love to have you come help.

Anil Madan
Director of Engineering, Analytics Platform Development

Reposted: 2011-06-11 21:09