Tailrank Architecture - Learn How to Track Memes Across the
Reposted from: http://www.highscalability.com/tailrank-architecture-learn-how-track-memes-across-entire-blogosphere
Ever feel like the blogosphere is 500 million channels with nothing on? Tailrank finds the internet's hottest channels by indexing over 24M weblogs and feeds per hour. That's 52TB of raw blog content (no, not sewage) a month and requires continuously processing 160Mbits of IO. How do they do that?
This is an email interview with Kevin Burton, founder and CEO of Tailrank.com. Kevin was kind enough to take the time to explain how they scale to index the entire blogosphere.
Sites
Tailrank - We track the hottest news in the blogosphere!
Spinn3r - A blog spider you can specialize with your own behavior instead of creating your own.
Kevin Burton's Blog - his blog is an indexing mix of politics and technical talk. Both are always interesting.
Platform
MySQL
Java
Linux (Debian)
Apache
Squid
PowerDNS
DAS storage.
Federated database.
ServerBeach hosting.
Job scheduling system for work distribution.
Interview
What is your system for?
Tailrank was originally a memetracker to track the hottest news being discussed
within the blogosphere.
We started having a lot of requests to license our crawler and we shipped that
in the form of Spinn3r about 8 months ago.
Spinn3r is a self-contained crawler for companies that want to index the full
blogosphere and consumer generated media.
Tailrank is still a very important product alongside Spinn3r and we're working
on Tailrank 3.0 which should be available in the future. No ETA at the moment
but it's actively being worked on.
What particular design/architecture/implementation challenges does your system have?
The biggest challenge we have is the sheer amount of data we have to process and
keeping that data consistent within a distributed system.
For example, we process 52TB of content per month. This has to be indexed in a
highly available storage architecture so the normal distributed database
problems arise.
What did you do to meet these challenges?
We've spent a lot of time building out a distributed system that can scale
and handle failure.
For example, we've built a tool called Task/Queue that is analogous to Google's
MapReduce. It has a centralized queue server which hands out units of work to
robots which make requests.
It works VERY well for crawlers in that slower machines just fetch work at a
slower rate while more modern machines (or better tuned machines) request work
at a higher rate.
This ends up easily addressing one of the main distributed computing fallacies:
that the network is homogeneous.
Task/Queue is generic enough that we could actually use it to implement
MapReduce on top of the system.
We'll probably open source it at some point. Right now it has too many
tentacles wrapped into other parts of our system.
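The pull model described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Tailrank's Task/Queue code (which was not published): the function name, the robot names, and the per-task timings are all made up for the example. It shows why a central queue balances itself when robots ask for work only when they are idle.

```python
import heapq
from collections import deque


def simulate_pull_queue(num_tasks, seconds_per_task):
    """Simulate robots pulling work from one central queue.

    seconds_per_task maps a (hypothetical) robot name to the time it
    needs per unit of work. Returns robot name -> units completed.
    """
    queue = deque(range(num_tasks))           # the central queue of work units
    completed = {name: 0 for name in seconds_per_task}
    # Each heap entry is (time the robot becomes idle, robot name).
    idle_at = [(0.0, name) for name in sorted(seconds_per_task)]
    heapq.heapify(idle_at)

    while queue:
        now, name = heapq.heappop(idle_at)    # next robot to request work
        queue.popleft()                       # the server hands out one unit
        completed[name] += 1
        heapq.heappush(idle_at, (now + seconds_per_task[name], name))
    return completed


if __name__ == "__main__":
    print(simulate_pull_queue(100, {"old-box": 4.0, "new-box": 1.0}))
```

With these made-up timings the faster robot ends up with four times as many units as the slower one, without the queue server knowing anything about either machine's speed.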
How big is your system?
We index 24M weblogs and feeds per hour and process content at about
160-200Mbps.
At the raw level we're writing to our disks at about 10-15MBps continuously.
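The figures quoted in this interview can be cross-checked with a little arithmetic. The sketch below assumes a 30-day month and decimal units (1TB = 10^12 bytes); neither assumption is stated in the interview.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60    # assumption: a 30-day month


def tb_per_month_to_mbps(terabytes):
    """Convert decimal terabytes per month to megabits per second."""
    bits = terabytes * 10**12 * 8
    return bits / SECONDS_PER_MONTH / 10**6


def per_hour_to_per_second(count_per_hour):
    """Convert an hourly rate to a per-second rate."""
    return count_per_hour / 3600


if __name__ == "__main__":
    print(f"52 TB/month  ~ {tb_per_month_to_mbps(52):.1f} Mbps")
    print(f"24M feeds/hr ~ {per_hour_to_per_second(24_000_000):.0f} feeds/s")
    print(f"10-15 MB/s   = {10 * 8}-{15 * 8} Mbps written to disk")
```

Under those assumptions 52TB a month works out to roughly 160Mbps sustained, which agrees with the low end of the 160-200Mbps range, and 24M feeds per hour is a little under 6,700 feeds per second.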
How many documents do you serve? How many images? How much data?
Right now the database is about 500G. We're expecting it to grow well beyond
this in 2008 as we expand our product offering.
What is your rate of growth?
It's mostly a function of customer feature requests. If our customers want more data we sell it to them.
In 2008 we're planning on expanding our cluster to index larger portions of the
web and consumer generated media.
What is the architecture of your system?
We use Java, MySQL and Linux for our cluster.
Java is a great language for writing crawlers. The library support is pretty
solid (though it seems like Java 7 is going to be killer when they add
closures).
We use MySQL with InnoDB. We're mostly happy with it though it seems I end up
spending about 20% of my time fixing MySQL bugs and limitations.
Of course nothing is perfect. MySQL for example was really designed to be used
on single core systems.
The MySQL 5.1 release goes a bit farther to fix multi-core scalability locks.
I recently blogged about how these new multi-core machines should really be
considered N machines instead of one logical unit: Distributed Computing Fallacy #9.
How is your system architected to scale?
We use a federated database system so that we can split the write load as we see
more IO.
We've released a lot of our infrastructure code as Open Source, and
this will probably be released as Open Source as well.
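One common way to split write load across a federated set of databases is to route each row by a stable hash of its key. The interview does not say how Tailrank partitions its data, so the sketch below is only an illustration; `shard_for`, `group_writes`, and the example keys are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and could route the same key to a different
    shard after a restart.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def group_writes(keys, num_shards):
    """Group keys by destination shard so each shard receives one batch."""
    batches = {shard: [] for shard in range(num_shards)}
    for key in keys:
        batches[shard_for(key, num_shards)].append(key)
    return batches


if __name__ == "__main__":
    feeds = [f"http://example.com/feed/{i}" for i in range(10)]
    for shard, batch in group_writes(feeds, 3).items():
        print(shard, len(batch))
```

Plain modulo routing is the simplest choice, but it remaps most keys whenever the number of shards changes, so real systems often add a lookup table or consistent hashing on top.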
We've already opened up a lot of our infrastructure code:
http://code.tailrank.com/lbpool - load balancing JDBC driver for use with DB connection pools.
http://code.tailrank.com/feedparser - Java RSS/Atom parser designed to elegantly support all versions of RSS
http://code.google.com/p/benchmark4j/ - Java (and UNIX) equivalent of Windows' perfmon
http://code.google.com/p/spinn3r-client/ - Client bindings to access the Spinn3r web service
http://code.google.com/p/mysqlslavesync/ - Clone a MySQL installation and setup replication.
http://code.google.com/p/log5j/ - Logger facade that supports printf style message format for both performance and ease of use.
How many servers do you have?
About 15 machines so far. We've spent a lot of time tuning our infrastructure
so it's pretty efficient. That said, building a scalable crawler is not an easy
task so it does take a lot of hardware.
We're going to be expanding FAR past this in 2008 and will probably hit about
2-3 racks of machines (~120 boxes).
What operating systems do you use?
Linux via Debian Etch on 64 bit Opterons. I'm a big Debian fan. I don't know
why more hardware vendors don't support Debian.
Debian is the big secret in the valley that no one talks about. Most of the big
web 2.0 shops like Technorati, Digg, etc. use Debian.
Which web server do you use?
Apache 2.0. Lighttpd is looking interesting as well.
Which reverse proxy do you use?
About 95% of the pages of Tailrank are served from Squid.
How is your system deployed in data centers?
We use ServerBeach for hosting. It's a great model for small to medium sized
startups. They rack the boxes, maintain inventory, handle network, etc. We
just buy new machines and pay a flat markup.
I wish Dell, Sun, and HP would sell directly to clients in this manner.
We're in one data center right now. We're looking to expand into two for redundancy.
What is your storage strategy?
Directly attached storage. We buy two SATA drives per box and set them up in
RAID 0.
We use the redundant array of inexpensive databases solution so if an individual
machine fails there's another copy of the data on another box.
Cheap SATA disks rule for what we do. They're cheap, commodity, and fast.
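The idea of keeping a second copy of the data on another box, rather than relying on disk-level redundancy, can be sketched as follows. This is a toy in-memory model and not how MySQL replication works internally; the `Box` and `ReplicatedStore` classes are hypothetical stand-ins for database servers.

```python
class Box:
    """One machine holding a copy of the data (hypothetical stand-in)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.alive = True

    def put(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.rows[key] = value

    def get(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.rows[key]


class ReplicatedStore:
    """Write every row to all boxes; read from the first box that answers."""

    def __init__(self, boxes):
        self.boxes = boxes

    def put(self, key, value):
        for box in self.boxes:
            box.put(key, value)

    def get(self, key):
        last_error = None
        for box in self.boxes:
            try:
                return box.get(key)
            except ConnectionError as error:
                last_error = error        # try the next copy
        raise ConnectionError("no replica available") from last_error


if __name__ == "__main__":
    primary, replica = Box("db1"), Box("db2")
    store = ReplicatedStore([primary, replica])
    store.put("post:1", "hello")
    primary.alive = False                 # simulate losing a RAID 0 box
    print(store.get("post:1"))
```

Because RAID 0 loses the whole array when either disk fails, the redundancy in this design has to come from the second box, which is what the read fallback models.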
Do you have a standard API to your website?
Tailrank has RSS feeds for every page.
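Since the pages are exposed as RSS, a client can consume them with any XML parser. The sketch below parses a made-up RSS 2.0 document with Python's standard library; the sample feed, its URLs, and the `parse_items` helper are illustrative assumptions, not Tailrank's actual feed contents.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document standing in for a real feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>
"""


def parse_items(rss_text):
    """Return (title, link) pairs for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for title, link in parse_items(SAMPLE_RSS):
        print(title, link)
```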
The Spinn3r service is itself an API and we have extensive documentation on the
protocol.
It's also free to use for researchers, so if any of your readers are pursuing a
Ph.D. or generally doing research work and need access to blog data, we'd love
to help them out.
We already have Ph.D. students at the University of Washington and the University
of Maryland (my alma mater) using Spinn3r.
Which DNS service do you use?
PowerDNS. It's a great product. We only use the recursor daemon but it's FAST.
It uses async IO though so it doesn't really scale across processors on
multicore boxes. Apparently there's a hack to get it to run across cores but it
isn't very reliable.
AAA caching might be broken though. I still need to look into this.
Who do you admire?
Donald Knuth is the man!
How are you thinking of changing your architecture in the future?
We're still working on finishing up a fully sharded database. MySQL fault
tolerance and autopromotion are also issues.