Tailrank Architecture - Learn How to Track Memes Across the Entire Blogosphere
Reposted from: http://www.highscalability.com/tailrank-architecture-learn-how-track-memes-across-entire-blogosphere
Ever feel like the blogosphere is 500 million channels with nothing on? Tailrank finds the internet's hottest channels by indexing over 24M weblogs and feeds per hour. That's 52TB of raw blog content (no, not sewage) a month and requires continuously processing 160Mbits of IO. How do they do that?
This is an email interview with Kevin Burton, founder and CEO of Tailrank.com. Kevin was kind enough to take the time to explain how they scale to index the entire blogosphere.
Sites
Tailrank - We track the hottest news in the blogosphere!
Spinn3r - A blog spider you can specialize with your own behavior instead of creating your own.
Kevin Burton's Blog - his blog is an indexing mix of politics and technical talk. Both are always interesting.
Platform
MySQL
Java
Linux (Debian)
Apache
Squid
PowerDNS
DAS storage.
Federated database.
ServerBeach hosting.
Job scheduling system for work distribution.
Interview
What is your system for?
Tailrank was originally a memetracker to track the hottest news being discussed
within the blogosphere.
We started having a lot of requests to license our crawler and we shipped that
in the form of Spinn3r about 8 months ago.
Spinn3r is a self-contained crawler for companies that want to index the full
blogosphere and consumer generated media.
Tailrank is still a very important product alongside Spinn3r and we're working
on Tailrank 3.0 which should be available in the future. No ETA at the moment
but it's actively being worked on.
What particular design/architecture/implementation challenges does your system have?
The biggest challenge we have is the sheer amount of data we have to process and
keeping that data consistent within a distributed system.
For example, we process 52TB of content per month. This has to be indexed in a
highly available storage architecture so the normal distributed database
problems arise.
What did you do to meet these challenges?
We've spent a lot of time building out a distributed system that can scale
and handle failure.
For example, we've built a tool called Task/Queue that is analogous to Google's
MapReduce. It has a centralized queue server which hands out units of work to
robots which make requests.
It works VERY well for crawlers in that slower machines just fetch work at a
slower rate while more modern machines (or better tuned machines) request work
at a higher rate.
This ends up easily solving one of the main distributed computing fallacies:
that the network is homogeneous.
Task/Queue is generic enough that we could actually use it to implement
MapReduce on top of the system.
We'll probably open source it at some point. Right now it has too many
tentacles wrapped into other parts of our system.
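The pull model described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Tailrank's Task/Queue code: the class name `PullQueueSketch`, the in-process `LinkedBlockingQueue` standing in for the centralized queue server, and the sleep that simulates fetch time are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a pull-based work queue; not Tailrank's Task/Queue code.
final class PullQueueSketch {

    // Starts one robot per entry in delaysMillis and returns tasks completed per robot.
    static Map<String, Integer> run(int taskCount, long[] delaysMillis) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < taskCount; i++) {
            queue.add(i); // each integer stands in for one unit of work
        }

        List<AtomicInteger> done = new ArrayList<>();
        List<Thread> robots = new ArrayList<>();
        for (long delay : delaysMillis) {
            AtomicInteger count = new AtomicInteger();
            done.add(count);
            Thread robot = new Thread(() -> {
                // A robot asks for work only when it is ready for more,
                // so a slow machine simply asks less often.
                while (queue.poll() != null) {
                    try {
                        Thread.sleep(delay); // simulated fetch time
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    count.incrementAndGet();
                }
            });
            robots.add(robot);
            robot.start();
        }

        for (Thread robot : robots) {
            try {
                robot.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        Map<String, Integer> result = new TreeMap<>();
        for (int r = 0; r < done.size(); r++) {
            result.put("robot-" + r, done.get(r).get());
        }
        return result;
    }

    public static void main(String[] args) {
        // robot-0 is "fast" (2 ms per task), robot-1 is "slow" (40 ms per task).
        System.out.println(run(60, new long[] {2, 40}));
    }
}
```

Because the queue never pushes work, no scheduler has to know how fast each machine is. The faster robot ends up with most of the tasks simply by coming back sooner.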
How big is your system?
We index 24M weblogs and feeds per hour and process content at about
160-200Mbps.
At the raw level we're writing to our disks at about 10-15MBps continuously.
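The figures quoted in this interview are consistent with each other, which a quick calculation shows. The sketch below assumes decimal units (1TB = 10^12 bytes) and a 30-day month; neither assumption comes from the interview, and the class name `ThroughputCheck` is made up for the example.

```java
// Back-of-the-envelope check of the quoted numbers.
// Assumptions (not from the interview): decimal terabytes and a 30-day month.
final class ThroughputCheck {

    // Average megabits per second needed to move `terabytes` within `days`.
    static double averageMbps(double terabytes, int days) {
        double bits = terabytes * 1e12 * 8;
        double seconds = days * 24.0 * 60 * 60;
        return bits / seconds / 1e6;
    }

    // Converts an hourly rate into a per-second rate.
    static double perSecond(double perHour) {
        return perHour / 3600.0;
    }

    public static void main(String[] args) {
        System.out.printf("52TB/month is about %.1f Mbps%n", averageMbps(52, 30));
        System.out.printf("24M feeds/hour is about %.0f feeds/s%n", perSecond(24_000_000));
    }
}
```

Under those assumptions, 52TB a month works out to roughly 160 Mbps sustained, which matches the low end of the 160-200Mbps range, and 24M feeds per hour is several thousand feed checks every second.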
How many documents do you serve? How many images? How much data?
Right now the database is about 500GB. We're expecting it to grow well beyond
this in 2008 as we expand our product offering.
What is your rate of growth?
It's mostly a function of customer feature requests. If our customers want more data we sell it to them.
In 2008 we're planning on expanding our cluster to index larger portions of the
web and consumer generated media.
What is the architecture of your system?
We use Java, MySQL and Linux for our cluster.
Java is a great language for writing crawlers. The library support is pretty
solid (though it seems like Java 7 is going to be killer when they add
closures).
We use MySQL with InnoDB. We're mostly happy with it though it seems I end up
spending about 20% of my time fixing MySQL bugs and limitations.
Of course nothing is perfect. MySQL for example was really designed to be used
on single core systems.
The MySQL 5.1 release goes a bit farther to fix multi-core scalability locks.
I recently blogged about how these new multi-core machines should really be
considered N machines instead of one logical unit: Distributed Computing Fallacy #9.
How is your system architected to scale?
We use a federated database system so that we can split the write load as we see
more IO.
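One common way to split write load across a federated set of databases is to route each record to a shard by hashing its key. The interview does not say how Tailrank routes writes, so the sketch below is an assumption throughout: the class `ShardRouterSketch`, the shard names, and the hash-modulo scheme are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical hash-based shard router; the interview does not describe
// Tailrank's actual routing scheme.
final class ShardRouterSketch {
    private final List<String> shards;

    ShardRouterSketch(List<String> shards) {
        if (shards.isEmpty()) {
            throw new IllegalArgumentException("need at least one shard");
        }
        this.shards = new ArrayList<>(shards);
    }

    // Maps a key (for example a feed URL) to one shard, always the same one.
    String shardFor(String key) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(key.hashCode(), shards.size());
        return shards.get(bucket);
    }

    public static void main(String[] args) {
        ShardRouterSketch router =
                new ShardRouterSketch(Arrays.asList("db-a", "db-b", "db-c"));
        System.out.println(router.shardFor("http://example.com/feed.xml"));
    }
}
```

Each shard then sees only a fraction of the writes. The trade-off with plain modulo routing is that adding a shard remaps most keys, so real systems often use a lookup table or consistent hashing instead.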
We've released a lot of our code as Open Source, and this will probably be
released as Open Source as well.
We've already opened up a lot of our infrastructure code:
http://code.tailrank.com/lbpool - load balancing JDBC driver for use with DB connection pools.
http://code.tailrank.com/feedparser - Java RSS/Atom parser designed to elegantly support all versions of RSS.
http://code.google.com/p/benchmark4j/ - Java (and UNIX) equivalent of Windows' perfmon.
http://code.google.com/p/spinn3r-client/ - Client bindings to access the Spinn3r web service.
http://code.google.com/p/mysqlslavesync/ - Clone a MySQL installation and set up replication.
http://code.google.com/p/log5j/ - Logger facade that supports printf style message format for both performance and ease of use.
How many servers do you have?
About 15 machines so far. We've spent a lot of time tuning our infrastructure
so it's pretty efficient. That said, building a scalable crawler is not an easy
task so it does take a lot of hardware.
We're going to be expanding FAR past this in 2008 and will probably hit about
2-3 racks of machines (~120 boxes).
What operating systems do you use?
Linux via Debian Etch on 64-bit Opterons. I'm a big Debian fan. I don't know
why more hardware vendors don't support Debian.
Debian is the big secret in the valley that no one talks about. Most of the big
web 2.0 shops like Technorati, Digg, etc. use Debian.
Which web server do you use?
Apache 2.0. Lighttpd is looking interesting as well.
Which reverse proxy do you use?
About 95% of the pages of Tailrank are served from Squid.
How is your system deployed in data centers?
We use ServerBeach for hosting. It's a great model for small to medium sized
startups. They rack the boxes, maintain inventory, handle network, etc. We
just buy new machines and pay a flat markup.
I wish Dell, SUN, HP would sell directly to clients in this manner.
We're in one data center right now. We're looking to expand into two for redundancy.
What is your storage strategy?
Directly attached storage. We buy two SATA drives per box and set them up in
RAID 0.
We use the redundant array of inexpensive databases solution so if an individual
machine fails there's another copy of the data on another box.
Cheap SATA disks rule for what we do. They're cheap, commodity, and fast.
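With RAID 0 there is no redundancy inside a box, so the safety net is the second copy of the data on another machine. A reader-side failover over such copies might look like the sketch below. It is only an illustration: `ReplicaPickerSketch`, the host names, and the health check are hypothetical and are not taken from Tailrank's code or from lbpool.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical failover helper: try each box that holds a copy of the data
// and use the first one that passes the health check.
final class ReplicaPickerSketch {

    static Optional<String> firstHealthy(List<String> replicas, Predicate<String> isHealthy) {
        for (String host : replicas) {
            if (isHealthy.test(host)) {
                return Optional.of(host);
            }
        }
        return Optional.empty(); // every copy is unavailable
    }

    public static void main(String[] args) {
        List<String> copies = Arrays.asList("box-1", "box-2");
        Set<String> down = new HashSet<>(Arrays.asList("box-1")); // simulate a dead box
        System.out.println(firstHealthy(copies, host -> !down.contains(host)));
    }
}
```

Losing one box, including both of its striped disks, then costs a failover rather than the data, which is why cheap unprotected disks are acceptable in this design.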
Do you have a standard API to your website?
Tailrank has RSS feeds for every page.
The Spinn3r service is itself an API and we have extensive documentation on the
protocol.
It's also free to use for researchers, so if any of your readers are pursuing a
Ph.D. or generally doing research work and need access to blog data, we'd love
to help them out.
We already have Ph.D. students at the University of Washington and the University
of Maryland (my alma mater) using Spinn3r.
Which DNS service do you use?
PowerDNS. It's a great product. We only use the recursor daemon but it's FAST.
It uses async IO though so it doesn't really scale across processors on
multicore boxes. Apparently there's a hack to get it to run across cores but it
isn't very reliable.
AAA caching might be broken though. I still need to look into this.
Who do you admire?
Donald Knuth is the man!
How are you thinking of changing your architecture in the future?
We're still working on finishing up a fully sharded database. MySQL fault
tolerance and autopromotion are also issues.