[Repost] Hadoop at eBay


Hadoop – The Power of the Elephant
by Anil Madan on 10/29/2010

in Machine Learning

In a previous post, Junling discussed data mining and our need to process petabytes of data to gain insights from information. We use several tools and systems to help us with this task; the one I’ll discuss here is Apache Hadoop.

Hadoop was created in 2006 by Doug Cutting, who named it after his son’s stuffed yellow elephant. Based on Google’s 2004 MapReduce paper, it is an open source framework for fault-tolerant, scalable, distributed computing on commodity hardware.

MapReduce is a flexible programming model for processing large data sets:
Map takes key/value pairs as input and generates an intermediate output of another type of key/value pairs, while Reduce takes the keys produced in the Map step along with a list of values associated with the same key to produce the final output of key/value pairs.

Map (key1, value1) -> list (key2, value2)
Reduce (key2, list (value2)) -> list (key3, value3)
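
To make the two signatures above concrete, here is a minimal single-process sketch of the model applied to word counting. It is not Hadoop's API: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names chosen for illustration, and the "shuffle" is just an in-memory grouping by key.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map (key1, value1) -> list (key2, value2): emit (word, 1) for each word."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce (key2, list(value2)) -> list (key3, value3): sum the counts."""
    return [(word, sum(counts))]


def run_mapreduce(records, mapper, reducer):
    """Run mapper over all records, group values by key, then run reducer."""
    intermediate = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            intermediate[k2].append(v2)  # stand-in for the shuffle step
    output = []
    for k2 in sorted(intermediate):  # reducers see keys in sorted order
        output.extend(reducer(k2, intermediate[k2]))
    return output


# Input records are (line offset, line text) pairs.
records = [(0, "the elephant"), (1, "the yellow elephant")]
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)  # [('elephant', 2), ('the', 2), ('yellow', 1)]
```

In a real cluster the map and reduce calls run on different machines and the grouping happens over the network, but the data flow is the same.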

Ecosystem

Athena, our first large cluster, was put into use earlier this year.
Let’s look at the stack from bottom to top:

Core – The Hadoop runtime, some common utilities, and the Hadoop Distributed File System (HDFS). The File System is optimized for reading and writing big blocks of data (128 MB to 256 MB).
MapReduce – provides the APIs and components to develop and execute jobs.
Data Access – the most prominent data access frameworks today are HBase, Pig and Hive.
HBase – A column-oriented, multidimensional, sparse database inspired by Google’s BigTable. HBase provides sorted data access by maintaining partitions or regions of data. The underlying storage is HDFS.
Pig (Latin) – A procedural language which provides capabilities to load, filter, transform, extract, aggregate, join and group data. Developers use Pig for building data pipelines and factories.
Hive – A declarative language with SQL syntax used to build data warehouses. The SQL interface makes Hive an attractive choice for developers who want to quickly validate data, and for product managers and analysts.
Tools & Libraries – UC4 is an enterprise scheduler used by eBay to automate data loading from multiple sources.
Libraries: statistical (R), machine learning (Mahout), mathematical (Hama), and eBay’s homegrown library for parsing web logs (Mobius).
Monitoring & Alerting – Ganglia is a distributed monitoring system for clusters. Nagios is used for alerting on key events like servers being unreachable or disks being full.
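
The HBase entry in the stack above mentions sorted data access through regions. As a rough illustration of why sorted partitions make lookups cheap, the sketch below finds the region for a row key with a binary search over region start keys. The region boundaries and names are hypothetical, and this is not HBase's client API, only the underlying idea.

```python
import bisect

# Hypothetical regions: each covers row keys in [start_key, next_start_key).
# The first region starts at the empty string, so it catches every small key.
region_start_keys = ["", "g", "n", "t"]
region_names = ["region-0", "region-1", "region-2", "region-3"]


def find_region(row_key):
    """Return the name of the region whose key range contains row_key."""
    # bisect_right returns the index of the first start key greater than
    # row_key, so the region that owns row_key is the one just before it.
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_names[idx]


print(find_region("apple"))  # region-0
print(find_region("g"))      # region-1
print(find_region("zebra"))  # region-3
```

Because the keys are kept sorted, a range scan only has to visit the regions whose ranges overlap the requested interval.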
Infrastructure
Our enterprise servers run 64-bit RedHat Linux.

NameNode is the master server responsible for managing the HDFS.
JobTracker is responsible for coordinating jobs and the tasks associated with them.
HBaseMaster holds the root storage for HBase and coordinates the blocks, or regions, of storage.
Zookeeper is a distributed lock coordinator providing consistency for HBase.
The storage and compute nodes are 1U units running CentOS, each with 2 quad-core processors and 12 to 24 TB of storage. We pack 38 to 42 of these units into each rack to build a highly dense grid.

On the networking side, we use top-of-rack switches with a node bandwidth of 1 Gbps. The rack switches uplink to the core switches at a line rate of 40 Gbps to support the high bandwidth needed to shuffle data around.
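
The figures above imply some quick arithmetic, sketched here under two assumptions that the post does not state: that capacity is counted as raw disk space before HDFS replication, and that each rack has a single 40 Gbps uplink. The function names are illustrative.

```python
def rack_raw_capacity_tb(nodes_per_rack, tb_per_node):
    """Raw (pre-replication) storage in one rack, in TB."""
    return nodes_per_rack * tb_per_node


def oversubscription_ratio(nodes_per_rack, node_gbps, uplink_gbps):
    """Aggregate node bandwidth divided by uplink bandwidth."""
    return (nodes_per_rack * node_gbps) / uplink_gbps


# Smallest and largest rack configurations mentioned in the text.
low = rack_raw_capacity_tb(38, 12)
high = rack_raw_capacity_tb(42, 24)
print(low, high)  # 456 1008

# Assumption: one 40 Gbps uplink per rack, 1 Gbps per node.
ratio_low = oversubscription_ratio(38, 1, 40)
ratio_high = oversubscription_ratio(42, 1, 40)
print(ratio_low, ratio_high)  # 0.95 1.05
```

Under these assumptions a rack holds roughly 0.45 to 1 PB of raw disk, and the uplink is close to 1:1 with the combined node bandwidth, which fits the stated goal of supporting heavy data shuffling between racks.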

Scheduling
Our cluster is used by many teams within eBay, for production as well as one-time jobs. We use Hadoop’s Fair Scheduler to manage allocations, define job pools for teams, assign weights, limit concurrent jobs per user and team, and set preemption timeouts and delay scheduling.
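
As a rough illustration of what pool weights mean, the sketch below splits a fixed number of task slots among pools in proportion to their weights. This is a deliberate simplification and not the Fair Scheduler's actual algorithm, which also considers minimum shares, each pool's current demand, and preemption. The pool names, weights, and slot count are hypothetical.

```python
def fair_shares(total_slots, pool_weights):
    """Split total_slots among pools in proportion to their weights."""
    total_weight = sum(pool_weights.values())
    return {
        pool: total_slots * weight / total_weight
        for pool, weight in pool_weights.items()
    }


# Hypothetical pools: "search" is weighted twice as heavily as the others.
pools = {"search": 2.0, "analytics": 1.0, "adhoc": 1.0}
shares = fair_shares(400, pools)
print(shares)  # {'search': 200.0, 'analytics': 100.0, 'adhoc': 100.0}
```

A pool with weight 2 is entitled to twice the slots of a pool with weight 1 when both have enough work to use them.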

Data Sourcing

On a daily basis we ingest about 8 to 10 TB of new data.

Road Ahead
Here are some of the challenges we are working on as we build out our infrastructure:

Scalability
In its current incarnation, the master server NameNode has scalability issues. As the cluster’s file system grows, so does the NameNode’s memory footprint, because it keeps all metadata in memory. For 1 PB of storage, approximately 1 GB of memory is needed. Possible solutions are hierarchical namespace partitioning or leveraging Zookeeper in conjunction with HBase for metadata management.
Availability
NameNode’s availability is critical for production workloads. The open source community is working on several cold, warm, and hot standby options, such as Checkpoint and Backup nodes, Avatar nodes switching avatar from the Secondary NameNode, and journal metadata replication techniques. We are evaluating these to build our production clusters.
Data Discovery
Support data stewardship, discovery, and schema management on top of a system which inherently does not support structure. A new project is proposing to combine Hive’s metadata store and Owl into a new system, called Howl. Our effort is to tie this into our analytics platform so that our users can easily discover data across the different data systems.
Data Movement
We are working on publish/subscription data movement tools to support data copy and reconciliation across our different subsystems like the Data Warehouse and HDFS.
Policies
Enable good Retention, Archival, and Backup policies with storage capacity management through quotas (the current Hadoop quotas need some work). We are working on defining these across our different clusters based on the workload and the characteristics of the clusters.
Metrics, Metrics, Metrics
We are building robust tools that generate metrics for data sourcing, consumption, budgeting, and utilization. The existing metrics exposed by some of the Hadoop enterprise servers are either insufficient or transient, which makes patterns of cluster usage hard to see.
eBay is changing how it collects, transforms, and uses data to generate business intelligence. We’re hiring, and we’d love to have you come help.

Anil Madan
Director of Engineering, Analytics Platform Development

2011-06-11 21:09