
Who is using Hadoop?

Hadoop is already in very wide use, but I only started learning about it recently, and its strong showing in the big data space sparked my interest. Browsing its official website, I found a page dedicated to listing which companies around the world are using Hadoop. They span all kinds of industries and include big names such as Alibaba, eBay, Amazon, Google, Facebook and Adobe, mostly applying Hadoop to log analysis, data mining, machine learning, index building, business reporting and similar scenarios, which made me even more eager to learn it. A rough translation of part of that page follows.

Copyright: 雁飞蓝天. When reposting, please credit the source: http://bingyingao.iteye.com/blog/1832048


This page documents an alphabetical list of institutions that are using Hadoop for educational or production uses. Companies that offer services on or based around Hadoop are listed in Distributions and Commercial Support. Please include details about your cluster hardware and size. Entries without this may be mistaken for spam references and deleted.
This page lists, in alphabetical order, institutions that use Hadoop for educational or production purposes.
Companies that offer services on or around Hadoop are listed under Distributions and Commercial Support.
Please describe your cluster hardware and size in detail; entries without such details may be mistaken for spam and deleted.


To add entries you need write permission to the wiki, which you can get by subscribing to the common-dev@hadoop.apache.org mailing list and asking for permissions on the wiki account username you've registered yourself as. If you are using Hadoop in production you ought to consider getting involved in the development process anyway, by filing bugs, testing beta releases, reviewing the code and turning your notes into shared documentation. Your participation in this process will ensure your needs get met.
To add your own entry you need write permission on the wiki, which you can get by subscribing to the common-dev@hadoop.apache.org mailing list and asking for permission for the wiki username you registered. If you are using Hadoop in production you should consider getting involved in its development anyway: file bugs, test beta releases, review code, and turn your notes into shared documentation. Taking part in the process helps ensure your needs get met.
Contents
1 A
2 B
3 C
4 D
5 E
6 F
7 G
A
A9.com - Amazon*
We build Amazon's product search indices using the streaming API and pre-existing C++, Perl, and Python tools.
We process millions of sessions daily for analytics, using both the Java and streaming APIs.
Our clusters vary from 1 to 100 nodes

We build Amazon's product search indices using the streaming API along with pre-existing C++, Perl, and Python tools.
Every day we process millions of sessions for analytics, using both the Java and streaming APIs.
Our clusters range from 1 to 100 nodes.
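As a rough illustration of how such streaming jobs are put together (a generic sketch, not A9's actual code; the input layout, file names and field positions are assumptions), a Python mapper/reducer pair that counts events per session could look like this:

#!/usr/bin/env python
# mapper.py -- reads raw log lines from stdin and emits "session_id<TAB>1"
# (assumes the session id is the first tab-separated field; adjust for the real layout)
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if fields and fields[0]:
        print("%s\t1" % fields[0])

#!/usr/bin/env python
# reducer.py -- sums the counts per session id; Hadoop delivers the keys already sorted
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, count))

Such a pair is submitted with the standard streaming jar, e.g. hadoop jar hadoop-streaming.jar -input logs/ -output session_counts/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the exact jar path depends on the Hadoop installation).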

Accela Communications
We use a Hadoop cluster to rollup registration and view data each night.
Our cluster has 10 1U servers, with 4 cores, 4GB ram and 3 drives
Each night, we run 112 Hadoop jobs
It is roughly 4X faster to export the transaction tables from each of our reporting databases, transfer the data to the cluster, perform the rollups, then import back into the databases than to perform the same rollups in the database.
Every night we use a Hadoop cluster to roll up registration and view data.
Our cluster has ten 1U servers, each with 4 cores, 4GB of RAM and 3 drives.
Each night we run 112 Hadoop jobs.
Exporting the transaction tables from each of our reporting databases, transferring the data to the cluster, performing the rollups and importing the results back is roughly four times faster than doing the same rollups inside the databases.

Adobe
We use Hadoop and HBase in several areas from social services to structured data storage and processing for internal use.
We currently have about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes on both production and development. We plan a deployment on an 80 nodes cluster.
We constantly write data to HBase and run MapReduce jobs to process then store it back to HBase or external systems.
Our production cluster has been running since Oct 2008.
We use Hadoop and HBase in several areas, from social services to structured data storage and processing for internal use.
We currently have about 30 nodes running HDFS, Hadoop and HBase, in clusters of 5 to 14 nodes across both production and development, and we plan to deploy an 80-node cluster.
We constantly write data to HBase and run MapReduce jobs to process it, then store the results back into HBase or external systems.
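The entry does not say which client libraries Adobe uses, and their processing actually runs as MapReduce jobs; purely as a hedged illustration of the "write to HBase, process, store back" shape, here is a small Python sketch using the happybase Thrift client (the hostname, table and column names are invented):

# hypothetical sketch: write rows into HBase, derive a summary, write it back
# assumes an HBase Thrift gateway at "hbase-gw" and pre-created tables "events" and "event_summaries"
import happybase

connection = happybase.Connection("hbase-gw")
events = connection.table("events")
summaries = connection.table("event_summaries")

# write raw data in
events.put(b"2013-05-01-0001", {b"d:payload": b'{"user": "u1", "action": "view"}'})

# read it back and aggregate per day (real deployments would do this step in MapReduce)
counts = {}
for row_key, _data in events.scan(columns=[b"d:payload"]):
    day = row_key[:10]  # row keys here are assumed to start with YYYY-MM-DD
    counts[day] = counts.get(day, 0) + 1

# store the derived result back into another HBase table
for day, n in counts.items():
    summaries.put(day, {b"s:event_count": str(n).encode("utf-8")})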

adyard
We use Flume, Hadoop and Pig for log storage and report generation as well as ad-Targeting.
We currently have 12 nodes running HDFS and Pig and plan to add more from time to time.
50% of our recommender system is pure Pig because of its ease of use.
Some of our more deeply-integrated tasks are using the streaming API and ruby as well as the excellent Wukong-Library.
We use Flume, Hadoop and Pig for log storage and report generation, as well as ad targeting.
We currently have 12 nodes running HDFS and Pig and plan to add more from time to time.
Half of our recommender system is pure Pig because of its ease of use.
Some of our more deeply integrated tasks use the streaming API and Ruby, together with the excellent Wukong library.



Able Grape - Vertical search engine for trustworthy wine information
We have one of the world's smaller Hadoop clusters (2 nodes @ 8 CPUs/node)
Hadoop and Nutch used to analyze and index textual information
A vertical search engine for trustworthy wine information.
We run one of the world's smaller Hadoop clusters (2 nodes @ 8 CPUs/node).
Hadoop and Nutch are used to analyze and index textual information.


Adknowledge - Ad network
Hadoop used to build the recommender system for behavioral targeting, plus other clickstream analytics
We handle 500MM clickstream events per day
Our clusters vary from 50 to 200 nodes, mostly on EC2.
Investigating use of R clusters atop Hadoop for statistical analysis and modeling at scale.
Hadoop is used to build the recommender system for behavioral targeting, plus other clickstream analytics.
We handle 500 million clickstream events per day.
Our clusters range from 50 to 200 nodes, mostly on EC2.
We are investigating running R clusters on top of Hadoop for statistical analysis and modeling at scale.

Aguja- E-Commerce Data analysis
We use hadoop, pig and hbase to analyze search log, product view data, and analyze all of our logs
3 node cluster with 48 cores in total, 4GB RAM and 1 TB storage each.
We use Hadoop, Pig and HBase to analyze search logs and product view data, and to analyze all of our logs.
A 3-node cluster with 48 cores in total, and 4GB of RAM and 1TB of storage per node.


Alibaba
A 15-node cluster dedicated to processing sorts of business data dumped out of database and joining them together. These data will then be fed into iSearch, our vertical search engine.
Each node has 8 cores, 16G RAM and 1.4T storage.
A 15-node cluster dedicated to processing all sorts of business data dumped out of our databases and joining it together; the results are then fed into iSearch, our vertical search engine.
Each node has 8 cores, 16GB of RAM and 1.4TB of storage.

AOL
We use Hadoop for variety of things ranging from ETL style processing and statistics generation to running advanced algorithms for doing behavioral analysis and targeting.
The Cluster that we use for mainly behavioral analysis and targeting has 150 machines, Intel Xeon, dual processors, dual core, each with 16GB Ram and 800 GB hard-disk.
We use Hadoop for a variety of things, from ETL-style processing and statistics generation to running advanced algorithms for behavioral analysis and targeting.
The cluster we use mainly for behavioral analysis and targeting has 150 machines, each with dual dual-core Intel Xeon processors, 16GB of RAM and an 800GB hard disk.


ARA.COM.TR - Ara Com Tr - Turkey's first and only search engine
We build Ara.com.tr search engine using the Python tools.
We use Hadoop for analytics.
We handle about 400TB per month
Our clusters vary from 10 to 100 nodes
Turkey's first and only search engine.
We built the Ara.com.tr search engine using Python tools.
We use Hadoop for analytics.
We handle about 400TB per month.
Our clusters range from 10 to 100 nodes.

Archive.is
HDFS, Accumulo, Scala
Currently 3 nodes (16Gb RAM, 6Tb storage)
Currently 3 nodes (16GB RAM, 6TB storage).
Atbrox
We use Hadoop for information extraction & search, and data analysis consulting
Cluster: we primarily use Amazon's Elastic MapReduce
We use Hadoop for information extraction and search, and for data analysis consulting.
For clusters we primarily use Amazon's Elastic MapReduce.

B
BabaCar
4 nodes cluster (32 cores, 1TB).
We use Hadoop for searching and analysis of millions of rental bookings.
A 4-node cluster (32 cores, 1TB storage).
We use Hadoop for searching and analyzing millions of rental bookings.

Basenfasten
Experimental installation - various TB storage for logs and digital assets
Currently 4 nodes cluster
Using hadoop for log analysis/data mining/machine learning
An experimental installation, with several TB of storage for logs and digital assets.
Currently a 4-node cluster.
Hadoop is used for log analysis, data mining and machine learning.

Benipal Technologies - Big Data. Search. AI.
35 Node Cluster
We have been running our cluster with no downtime for over 2 ½ years and have successfully handled over 75 Million files on a 64 GB Namenode with 50 TB cluster storage.
We are heavy MapReduce and HBase users and use Hadoop with HBase for semi-supervised Machine Learning, AI R&D, Image Processing & Analysis, and Lucene index sharding using katta.
A 35-node cluster.
We have been running our cluster with no downtime for over two and a half years and have successfully handled over 75 million files on a 64GB NameNode with 50TB of cluster storage.
We are heavy MapReduce and HBase users, and use Hadoop with HBase for semi-supervised machine learning, AI R&D, image processing and analysis, and Lucene index sharding with katta.

Beebler
14 node cluster (each node has: 2 dual core CPUs, 2TB storage, 8GB RAM)
We use Hadoop for matching dating profiles
A 14-node cluster (each node: 2 dual-core CPUs, 2TB storage, 8GB RAM).
We use Hadoop for matching dating profiles.

Bixo Labs - Elastic web mining
The Bixolabs elastic web mining platform uses Hadoop + Cascading to quickly build scalable web mining applications.
We're doing a 200M page/5TB crawl as part of the public terabyte dataset project.
This runs as a 20 machine Elastic MapReduce cluster.
The Bixolabs elastic web mining platform uses Hadoop + Cascading to quickly build scalable web mining applications.
We're doing a 200M-page/5TB crawl as part of the public terabyte dataset project, which runs as a 20-machine Elastic MapReduce cluster.
BrainPad - Data mining and analysis
We use Hadoop to summarize users' tracking data, and for analysis.
We use Hadoop to summarize users' tracking data and to analyze it.

Brilig - Cooperative data marketplace for online advertising
We use Hadoop/MapReduce and Hive for data management, analysis, log aggregation, reporting, ETL into Hive, and loading data into distributed K/V stores
Our primary cluster is 10 nodes, each member has 2x4 Cores, 24 GB RAM, 6 x 1TB SATA.
We also use AWS EMR clusters for additional reporting capacity on 10 TB of data stored in S3. We usually use m1.xlarge, 60 - 100 nodes.
We use Hadoop/MapReduce and Hive for data management, analysis, log aggregation, reporting, ETL into Hive, and loading data into distributed key/value stores.
Our primary cluster is 10 nodes; each node has 2x4 cores, 24GB of RAM and 6x1TB SATA drives.


Brockmann Consult GmbH - Environmental informatics and Geoinformation services
We use Hadoop to develop the Calvalus system - parallel processing of large amounts of satellite data.
Focus on generation, analysis and validation of environmental Earth Observation data products.
Our cluster is a rack with 20 nodes (4 cores, 8 GB RAM each),
112 TB diskspace total.
We use Hadoop to develop the Calvalus system for parallel processing of large amounts of satellite data.
The focus is on generating, analyzing and validating environmental Earth Observation data products.
Our cluster is a single rack with 20 nodes (4 cores and 8GB of RAM each)
and 112TB of disk space in total.

C
Caree.rs
Hardware: 15 nodes
We use Hadoop to process company and job data and run Machine learning algorithms for our recommendation engine.
Hardware: 15 nodes.
We use Hadoop to process company and job data and to run machine learning algorithms for our recommendation engine.
CDU now!
We use Hadoop for our internal searching, filtering and indexing
We use Hadoop for our internal searching, filtering and indexing.


Cloudspace
Used on client projects and internal log reporting/parsing systems designed to scale to infinity and beyond.
Client project: Amazon S3-backed, web-wide analytics platform
Internal: cross-architecture event log aggregation & processing
Used on client projects as well as internal log reporting/parsing systems designed to scale to infinity and beyond.
Contextweb - Ad Exchange
We use Hadoop to store ad serving logs and use it as a source for ad optimizations, analytics, reporting and machine learning.
Currently we have a 50 machine cluster with 400 cores and about 140TB raw storage. Each (commodity) node has 8 cores and 16GB of RAM.
We use Hadoop to store ad serving logs and as a source for ad optimization, analytics, reporting and machine learning.
We currently have a 50-machine cluster with 400 cores and about 140TB of raw storage; each (commodity) node has 8 cores and 16GB of RAM.

Cooliris - Cooliris transforms your browser into a lightning fast, cinematic way to browse photos and videos, both online and on your hard drive.
We have a 15-node Hadoop cluster where each machine has 8 cores, 8 GB ram, and 3-4 TB of storage.
We use Hadoop for all of our analytics, and we use Pig to allow PMs and non-engineers the freedom to query the data in an ad-hoc manner.
Cooliris turns your browser into a lightning-fast, cinematic way to browse photos and videos, both online and on your hard drive.
We have a 15-node Hadoop cluster where each machine has 8 cores, 8GB of RAM and 3-4TB of storage.
We use Hadoop for all of our analytics, and we use Pig so that PMs and non-engineers can query the data in an ad-hoc manner.

Cornell University Web Lab
Generating web graphs on 100 nodes (dual 2.4GHz Xeon Processor, 2 GB RAM, 72GB Hard Drive)
Generating web graphs on 100 nodes (dual 2.4GHz Xeon processors, 2GB RAM, 72GB hard drive each).
CRS4
Hadoop deployed dynamically on subsets of a 400-node cluster
node: two quad-core 2.83GHz Xeons, 16 GB RAM, two 250GB HDDs
most deployments use our high-performance GPFS (3.8PB, 15GB/s random r/w)
Computational biology applications
Hadoop is deployed dynamically on subsets of a 400-node cluster.
Applications in computational biology.

crowdmedia
Crowdmedia has a 5 Node Hadoop cluster for statistical analysis
We use Hadoop to analyse trends on Facebook and other social networks
Crowdmedia has a 5-node Hadoop cluster for statistical analysis.
We use Hadoop to analyze trends on Facebook and other social networks.

D
Datagraph
We use Hadoop for batch-processing large RDF datasets, in particular for indexing RDF data.
We also use Hadoop for executing long-running offline SPARQL queries for clients.
We use Amazon S3 and Cassandra to store input RDF datasets and output files.
We've developed RDFgrid, a Ruby framework for map/reduce-based processing of RDF data.
We primarily use Ruby, RDF.rb and RDFgrid to process RDF data with Hadoop Streaming.
We primarily run Hadoop jobs on Amazon Elastic MapReduce, with cluster sizes of 1 to 20 nodes depending on the size of the dataset (hundreds of millions to billions of RDF statements).
We use Hadoop for batch-processing large RDF datasets, in particular for indexing RDF data.
We also use Hadoop to execute long-running offline SPARQL queries for clients.
We use Amazon S3 and Cassandra to store input RDF datasets and output files.
We have developed RDFgrid, a Ruby framework for map/reduce-based processing of RDF data.
We primarily use Ruby, RDF.rb and RDFgrid to process RDF data with Hadoop Streaming.
We mostly run Hadoop jobs on Amazon Elastic MapReduce, with cluster sizes of 1 to 20 nodes depending on the size of the dataset (hundreds of millions to billions of RDF statements).

Dataium
We use a combination of Pig and Java based Map/Reduce jobs to sort, aggregate and help make sense of large amounts of data.
We use a combination of Pig and Java-based MapReduce jobs to sort and aggregate large amounts of data and help make sense of it.
Deepdyve
Elastic cluster with 5-80 nodes
We use Hadoop to create our indexes of deep web content and to provide a high availability and high bandwidth storage service for index shards for our search cluster.
An elastic cluster with 5-80 nodes.
We use Hadoop to create our indexes of deep web content and to provide a high-availability, high-bandwidth storage service for the index shards of our search cluster.

Detektei Berlin
We are using Hadoop in our data mining and multimedia/internet research groups.
3 node cluster with 48 cores in total, 4GB RAM and 1 TB storage each.
We are using Hadoop in our data mining and multimedia/internet research groups.
Detikcom - Indonesia's largest news portal
We use Hadoop, pig and HBase to analyze search log, generate Most View News, generate top wordcloud, and analyze all of our logs
Currently We use 9 nodes
We use Hadoop, Pig and HBase to analyze search logs, generate the Most Viewed News, generate the top word cloud, and analyze all of our logs.
We currently use 9 nodes.

devdaily.com
We use Hadoop and Nutch to research data on programming-related websites, such as looking for current trends, story originators, and related information.
We're currently using three nodes, with each node having two cores, 4GB RAM, and 1TB storage. We'll expand these once we settle on our related technologies (Scala, Pig, HBase, other).
We use Hadoop and Nutch to research data on programming-related websites, for example looking for current trends, story originators and related information. We currently use three nodes, each with two cores, 4GB of RAM and 1TB of storage, and will expand once we settle on our related technologies (Scala, Pig, HBase, etc.).

DropFire
We generate Pig Latin scripts that describe structural and semantic conversions between data contexts
We use Hadoop to execute these scripts for production-level deployments
Eliminates the need for explicit data and schema mappings during database integration
We generate Pig Latin scripts that describe structural and semantic conversions between data contexts.
We use Hadoop to execute these scripts in production-level deployments, which eliminates the need for explicit data and schema mappings during database integration.
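To make the idea of generating Pig Latin scripts for structural conversions concrete, here is a small hypothetical Python helper (the paths, field names and mapping below are invented, and DropFire's real generator is certainly more involved) that emits a Pig script projecting and renaming fields:

# hypothetical generator: turn a source schema plus a rename mapping into a Pig Latin script
def generate_pig_script(input_path, output_path, fields, mapping):
    """fields: ordered (name, pig_type) pairs of the source; mapping: {target_name: source_name}."""
    schema = ", ".join("%s:%s" % (name, ptype) for name, ptype in fields)
    projection = ", ".join("%s AS %s" % (src, dst) for dst, src in mapping.items())
    return "\n".join([
        "raw = LOAD '%s' USING PigStorage(',') AS (%s);" % (input_path, schema),
        "converted = FOREACH raw GENERATE %s;" % projection,
        "STORE converted INTO '%s' USING PigStorage(',');" % output_path,
    ])

print(generate_pig_script(
    "/data/in/customers", "/data/out/customers",
    [("cust_id", "chararray"), ("signup", "chararray"), ("spend", "double")],
    {"customer_id": "cust_id", "signup_date": "signup", "lifetime_value": "spend"},
))

The generated script can then be run on the cluster with the pig command.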

E
EBay
532 nodes cluster (8 * 532 cores, 5.3PB).
Heavy usage of Java MapReduce, Pig, Hive, HBase
Using it for Search optimization and Research.
A 532-node cluster (8 x 532 cores, 5.3PB).
Heavy usage of Java MapReduce, Pig, Hive and HBase.
Used for search optimization and research.

eCircle
two 60 nodes cluster each >1000 cores, total 5T Ram, 1PB
mostly HBase, some M/R
marketing data handling
Two 60-node clusters, each with more than 1,000 cores, and 5TB of RAM and 1PB in total.
Mostly HBase, some MapReduce.
Marketing data handling.

Enormo
4 nodes cluster (32 cores, 1TB).
We use Hadoop to filter and index our listings, removing exact duplicates and grouping similar ones.
We plan to use Pig very shortly to produce statistics.
A 4-node cluster (32 cores, 1TB).
We use Hadoop to filter and index our listings, removing exact duplicates and grouping similar ones.
We plan to start using Pig very soon to produce statistics.

ESPOL University (Escuela Superior Politécnica del Litoral) in Guayaquil, Ecuador
4 nodes proof-of-concept cluster.
We use Hadoop in a Data-Intensive Computing capstone course. The course projects cover topics like information retrieval, machine learning, social network analysis, business intelligence, and network security.
The students use on-demand clusters launched using Amazon's EC2 and EMR services, thanks to its AWS in Education program.
A 4-node proof-of-concept cluster.
We use Hadoop in a data-intensive computing capstone course. The course projects cover topics such as information retrieval, machine learning, social network analysis, business intelligence and network security.
The students use on-demand clusters launched with Amazon's EC2 and EMR services, thanks to its AWS in Education program.

ETH Zurich Systems Group
We are using Hadoop in a course that we are currently teaching: "Massively Parallel Data Analysis with MapReduce". The course projects are based on real use-cases from biological data analysis.
Cluster hardware: 16 x (Quad-core Intel Xeon, 8GB RAM, 1.5 TB Hard-Disk)
We are using Hadoop in a course we currently teach, "Massively Parallel Data Analysis with MapReduce"; the course projects are based on real use cases from biological data analysis.
Eyealike - Visual Media Search Platform
Facial similarity and recognition across large datasets.
Image content based advertising and auto-tagging for social media.
Image based video copyright protection.
Facial similarity and recognition across large datasets.
Image-content-based advertising and auto-tagging for social media.
Image-based video copyright protection.

Explore.To Yellow Pages - Explore To Yellow Pages
We use Hadoop for our internal search, filtering and indexing
Elastic cluster with 5-80 nodes
We use Hadoop for our internal search, filtering and indexing.
An elastic cluster with 5-80 nodes.

F
Facebook
We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
Currently we have 2 major clusters:
A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
A 300-machine cluster with 2400 cores and about 3 PB raw storage.
Each (commodity) node has 8 cores and 12 TB of storage.
We are heavy users of both streaming as well as the Java APIs. We have built a higher level data warehousing framework using these features called Hive (see the http://hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
We use Hadoop to store copies of internal log and dimension data sources, and use it as a source for reporting/analytics and machine learning.
Currently we have two major clusters:
An 1100-machine cluster with 8800 cores and about 12PB of raw storage.
A 300-machine cluster with 2400 cores and about 3PB of raw storage.
Each (commodity) node has 8 cores and 12TB of storage.
We are heavy users of both the streaming and the Java APIs. On top of these features we have built a higher-level data warehousing framework called Hive (see http://hadoop.apache.org/hive/), and we have also developed a FUSE implementation over HDFS.
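Hive's point, as described above, is that reporting logic becomes an SQL-like query compiled down to MapReduce rather than hand-written Java. As a hedged sketch (the table ad_log and its columns are invented here), such a query can be driven from Python through the hive command-line client:

# hypothetical: run a HiveQL aggregation through the "hive -e" command-line client
import subprocess

query = """
SELECT dt, COUNT(*) AS impressions
FROM ad_log
WHERE dt >= '2013-05-01'
GROUP BY dt
ORDER BY dt;
"""

# "hive -e" executes the query string non-interactively; Hive compiles it into MapReduce jobs
result = subprocess.run(["hive", "-e", query], capture_output=True, text=True, check=True)
print(result.stdout)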

FOX Audience Network
40 machine cluster (8 cores/machine, 2TB/machine storage)
70 machine cluster (8 cores/machine, 3TB/machine storage)
30 machine cluster (8 cores/machine, 4TB/machine storage)
Use for log analysis, data mining and machine learning
Used for log analysis, data mining and machine learning.
Forward3D
5 machine cluster (8 cores/machine, 5TB/machine storage)
Existing 19 virtual machine cluster (2 cores/machine 30TB storage)
Predominantly Hive and Streaming API based jobs (~20,000 jobs a week) using our Ruby library, or see the canonical WordCount example.
Daily batch ETL with a slightly modified clojure-hadoop
Log analysis
Data mining
Machine learning
A 5-machine cluster (8 cores/machine, 5TB/machine storage).
An existing 19-virtual-machine cluster (2 cores/machine, 30TB storage).
Predominantly Hive and Streaming API based jobs (about 20,000 jobs a week) using our Ruby library, or the canonical WordCount example.
Daily batch ETL with a slightly modified clojure-hadoop.
Log analysis, data mining and machine learning.


Freestylers - Image retrieval engine
We, the Japanese company Freestylers, use Hadoop to build the image processing environment for image-based product recommendation system mainly on Amazon EC2, from April 2009.
Our Hadoop environment produces the original database for fast access from our web application.
We also use Hadoop to analyze similarities in users' behavior.
We, the Japanese company Freestylers, have used Hadoop since April 2009 to build the image processing environment for our image-based product recommendation system, mainly on Amazon EC2.
Our Hadoop environment produces the original database for fast access from our web application.
We also use Hadoop to analyze similarities in user behavior.

G
GBIF (Global Biodiversity Information Facility) - nonprofit organization that focuses on making scientific data on biodiversity available via the Internet
18 nodes running a mix of Hadoop and HBase
Hive ad hoc queries against our biodiversity data
Regular Oozie workflows to process biodiversity data for publishing
All work is Open source (e.g. Oozie workflow, Ganglia)
A nonprofit organization focused on making scientific data on biodiversity available via the Internet.
18 nodes running a mix of Hadoop and HBase.
Hive for ad hoc queries against our biodiversity data, and regular Oozie workflows to process biodiversity data for publishing.
All of this work is open source (e.g. the Oozie workflows, Ganglia).

GIS.FCU
Feng Chia University
3 machine cluster (4 cores, 1TB/machine)
storage for sensor data
A 3-machine cluster (4 cores, 1TB per machine).
Storage for sensor data.


Google
University Initiative to Address Internet-Scale Computing Challenges
A university initiative to address Internet-scale computing challenges.
Gruter. Corp.
30 machine cluster (4 cores, 1TB~2TB/machine storage)
storage for blog data and web documents
used for data indexing by MapReduce
link analyzing and Machine Learning by MapReduce
A 30-machine cluster (4 cores, 1TB-2TB storage per machine).
Storage for blog data and web documents.
Used for data indexing with MapReduce.
Link analysis and machine learning with MapReduce.

Gewinnspiele
6 node cluster (each node has: 4 dual core CPUs, 1,5TB storage, 4GB RAM, RedHat OS)
Using Hadoop for our high speed data mining applications in corporation with Twilight
A 6-node cluster (each node: 4 dual-core CPUs, 1.5TB storage, 4GB RAM, Red Hat OS).
We use Hadoop for our high-speed data mining applications.

GumGum
9 node cluster (Amazon EC2 c1.xlarge)
Nightly MapReduce jobs on Amazon Elastic MapReduce process data stored in S3
MapReduce jobs written in Groovy use Hadoop Java APIs
Image and advertising analytics
A 9-node cluster (Amazon EC2 c1.xlarge).
Nightly MapReduce jobs on Amazon Elastic MapReduce process data stored in S3.
The MapReduce jobs are written in Groovy and use the Hadoop Java APIs.
Used for image and advertising analytics.

Original English text:
http://wiki.apache.org/hadoop/PoweredBy


