http://www.ebaytechblog.com/2010/10/29/hadoop-the-power-of-the-elephant/
Hadoop – The Power of the Elephant
by Anil Madan on 10/29/2010
in Machine Learning
In a previous post, Junling discussed data mining and our need to process petabytes of data to gain insights from information. We use several tools and systems to help us with this task; the one I’ll discuss here is Apache Hadoop.
Created in 2006 by Doug Cutting, who named it after his son’s stuffed yellow elephant, and based on Google’s 2004 MapReduce paper, Hadoop is an open source framework for fault-tolerant, scalable, distributed computing on commodity hardware.
MapReduce is a flexible programming model for processing large data sets:
Map takes key/value pairs as input and generates an intermediate output of key/value pairs of another type. Reduce takes each key produced in the Map step, along with the list of values associated with that key, and produces the final output of key/value pairs.
Map (key1, value1) -> list (key2, value2)
Reduce (key2, list (value2)) -> list (key3, value3)
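To make the contract concrete, below is a minimal word-count sketch against Hadoop’s new-style (org.apache.hadoop.mapreduce) Java API; class names and the combiner choice are illustrative, not from the post. The mapper emits (word, 1) for every token, and the reducer sums the list of counts per word.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map(key1, value1) -> list(key2, value2): (line offset, line) -> (word, 1)
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);   // emit intermediate (word, 1) pair
      }
    }
  }

  // Reduce(key2, list(value2)) -> list(key3, value3): (word, [1,1,...]) -> (word, count)
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      context.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // local pre-aggregation on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}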
Ecosystem
Athena, our first large cluster, was put into use earlier this year.
Let’s look at the stack from bottom to top:
Core – The Hadoop runtime, some common utilities, and the Hadoop Distributed File System (HDFS). The File System is optimized for reading and writing big blocks of data (128 MB to 256 MB).
MapReduce – provides the APIs and components to develop and execute jobs.
Data Access – the most prominent data access frameworks today are HBase, Pig and Hive.
HBase – A column-oriented, multidimensional, sparse, sorted data store inspired by Google’s BigTable. HBase provides sorted data access by maintaining partitions, or regions, of data; the underlying storage is HDFS (see the client sketch after this list).
Pig (Latin) – A procedural language which provides capabilities to load, filter, transform, extract, aggregate, join and group data. Developers use Pig for building data pipelines and factories.
Hive – A declarative language with SQL syntax used to build data warehouses. The SQL interface makes Hive an attractive choice for developers who want to quickly validate data, as well as for product managers and analysts.
Tools & Libraries – UC4 is an enterprise scheduler used by eBay to automate data loading from multiple sources.
Libraries: statistical (R), machine learning (Mahout), and mathematical (Hama) libraries, plus eBay’s homegrown library for parsing web logs (Mobius).
Monitoring & Alerting – Ganglia is a distributed monitoring system for clusters. Nagios is used for alerting on key events like servers being unreachable or disks being full.
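To illustrate the sorted, region-partitioned access that the HBase item above refers to, here is a minimal client sketch against the 0.20-era HBase Java API; the table name, column family, and row keys are hypothetical.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "items");       // hypothetical table

    // Write: row key -> column family:qualifier -> value
    Put put = new Put(Bytes.toBytes("item#42"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("title"), Bytes.toBytes("elephant"));
    table.put(put);

    // Point read by row key
    Result row = table.get(new Get(Bytes.toBytes("item#42")));
    System.out.println(
        Bytes.toString(row.getValue(Bytes.toBytes("d"), Bytes.toBytes("title"))));

    // Range scan: rows come back in sorted row-key order, served by the
    // region(s) covering that key range
    Scan scan = new Scan(Bytes.toBytes("item#"), Bytes.toBytes("item$"));
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      System.out.println(Bytes.toString(r.getRow()));
    }
    scanner.close();
    table.close();
  }
}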
Infrastructure
Our enterprise servers run 64-bit Red Hat Linux.
NameNode is the master server responsible for managing the HDFS.
JobTracker is responsible for coordinating jobs and the tasks associated with them.
HBaseMaster stores the root catalog for HBase and coordinates access to the blocks, or regions, of storage.
Zookeeper is a distributed lock coordinator providing consistency for HBase.
The storage and compute nodes are 1U units running CentOS, each with two quad-core CPUs and 12 to 24 TB of storage. We pack 38 to 42 of these units into a rack to build a highly dense grid.
On the networking side, we use top-of-rack switches with a per-node bandwidth of 1 Gbps. The rack switches uplink to the core switches at a line rate of 40 Gbps to support the high bandwidth needed when data is shuffled around.
Scheduling
Our cluster is used by many teams within eBay, for production as well as one-time jobs. We use Hadoop’s Fair Scheduler to manage allocations: define job pools for teams, assign weights, limit concurrent jobs per user and team, set preemption timeouts, and configure delayed scheduling.
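As a small illustration (assuming the 0.20-era Fair Scheduler and a hypothetical pool name, neither of which is from the post), a job opts into a team’s pool through a single configuration property; the pool’s weight, concurrent-job limits, and preemption timeouts live in the scheduler’s allocations file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PoolAssignment {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Route this job to the (hypothetical) "analytics" pool defined in the
    // Fair Scheduler allocations file. If the property is unset, the scheduler
    // falls back to a default pool-name property (by default, the user name).
    conf.set("mapred.fairscheduler.pool", "analytics");
    Job job = new Job(conf, "nightly aggregation");   // illustrative job name
    // ... set mapper/reducer/input/output as usual, then submit:
    // job.waitForCompletion(true);
  }
}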
Data Sourcing
On a daily basis we ingest about 8 to 10 TB of new data.
Road Ahead
Here are some of the challenges we are working on as we build out our infrastructure:
Scalability
In its current incarnation, the master server NameNode has scalability issues. As the file system of the cluster grows, so does its memory footprint, because the NameNode keeps the entire metadata in memory. For 1 PB of storage, approximately 1 GB of memory is needed. Possible solutions are hierarchical namespace partitioning, or leveraging Zookeeper in conjunction with HBase for metadata management.
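A back-of-the-envelope check of that ratio (a sketch assuming 128 MB blocks and the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object; neither figure is from the post):

\frac{1\ \mathrm{PB}}{128\ \mathrm{MB/block}} \approx 8 \times 10^{6}\ \mathrm{blocks},
\qquad 8 \times 10^{6}\ \mathrm{objects} \times 150\ \mathrm{bytes/object} \approx 1.2\ \mathrm{GB\ of\ heap}

Files add their own objects on top of blocks, so real footprints run somewhat higher, but the order of magnitude matches the 1 GB per PB figure above.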
Availability
NameNode’s availability is critical for production workloads. The open source community is working on several cold, warm, and hot standby options, such as Checkpoint and Backup nodes, AvatarNodes that switch avatar from the Secondary NameNode, and journal metadata replication techniques. We are evaluating these to build our production clusters.
Data Discovery
We need to support data stewardship, discovery, and schema management on top of a system that does not inherently enforce structure. A new project proposes to combine Hive’s metadata store and Owl into a new system called Howl. Our effort is to tie this into our analytics platform so that our users can easily discover data across the different data systems.
Data Movement
We are working on publish/subscribe data movement tools to support data copy and reconciliation across our different subsystems, like the Data Warehouse and HDFS.
Policies
We want to enable good retention, archival, and backup policies, with storage capacity managed through quotas (the current Hadoop quotas need some work). We are defining these across our different clusters based on the workloads and characteristics of the clusters.
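For reference, HDFS already exposes namespace quotas (a cap on the number of files and directories) and space quotas (a cap on raw bytes, counting all replicas). Below is a minimal sketch of setting both programmatically; the directory and limits are hypothetical, and the hadoop dfsadmin -setQuota and -setSpaceQuota commands do the same from the shell.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SetQuotas {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes the default file system is HDFS, so the cast below is safe.
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    Path dir = new Path("/data/teamA");                  // hypothetical team directory
    long nsQuota = 1000000L;                             // max files + directories
    long spaceQuota = 10L * 1024 * 1024 * 1024 * 1024;   // 10 TB raw, counts all replicas
    dfs.setQuota(dir, nsQuota, spaceQuota);
  }
}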
Metrics, Metrics, Metrics
We are building robust tools to generate metrics on data sourcing, consumption, budgeting, and utilization. The existing metrics exposed by some of the Hadoop enterprise servers are either insufficient or transient, which makes patterns of cluster usage hard to see.
eBay is changing how it collects, transforms, and uses data to generate business intelligence. We’re hiring, and we’d love to have you come help.
Anil Madan
Director of Engineering, Analytics Platform Development