Tailrank Architecture - Learn How to Track Memes Across the Entire Blogosphere
Reposted from: http://www.highscalability.com/tailrank-architecture-learn-how-track-memes-across-entire-blogosphere
Ever feel like the blogosphere is 500 million channels with nothing on? Tailrank finds the internet's hottest channels by indexing over 24M weblogs and feeds per hour. That's 52TB of raw blog content (no, not sewage) a month and requires continuously processing 160Mbits of IO. How do they do that?
This is an email interview with Kevin Burton, founder and CEO of Tailrank.com. Kevin was kind enough to take the time to explain how they scale to index the entire blogosphere.
Sites
Tailrank - We track the hottest news in the blogosphere!
Spinn3r - A blog spider you can specialize with your own behavior instead of creating your own.
Kevin Burton's Blog - His blog is a mix of politics and technical talk. Both are always interesting.
Platform
MySQL
Java
Linux (Debian)
Apache
Squid
PowerDNS
DAS storage.
Federated database.
ServerBeach hosting.
Job scheduling system for work distribution.
Interview
What is your system for?
Tailrank was originally a memetracker for tracking the hottest news being
discussed within the blogosphere.
We started having a lot of requests to license our crawler and we shipped that
in the form of Spinn3r about 8 months ago.
Spinn3r is a self-contained crawler for companies that want to index the full
blogosphere and consumer generated media.
Tailrank is still a very important product alongside Spinn3r and we're working
on Tailrank 3.0 which should be available in the future. No ETA at the moment
but it's actively being worked on.
What particular design/architecture/implementation challenges does your system have?
The biggest challenge we have is the sheer amount of data we have to process and
keeping that data consistent within a distributed system.
For example, we process 52TB of content per month. This has to be indexed in a
highly available storage architecture so the normal distributed database
problems arise.
What did you do to meet these challenges?
We've spent a lot of time building out a distributed system that can scale
and handle failure.
For example, we've built a tool called Task/Queue that is analogous to Google's
MapReduce. It has a centralized queue server which hands out units of work to
robots which make requests.
It works VERY well for crawlers in that slower machines just fetch work at a
slower rate while more modern machines (or better tuned machines) request work
at a higher rate.
This ends up easily addressing one of the main distributed computing fallacies:
that the network is homogeneous.
Task/Queue is generic enough that we could actually use it to implement
MapReduce on top of the system.
We'll probably open source it at some point. Right now it has too many
tentacles wrapped into other parts of our system.
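The pull model described above can be illustrated with a small sketch. This is not Task/Queue itself, which has not been released; it is a minimal in-process stand-in that assumes a single shared queue and two hypothetical workers (`old-box` and `new-box`) whose per-task delays are invented for illustration. Each worker asks for its next unit of work only when it is ready, so the slower one naturally takes fewer tasks.

```python
import queue
import threading
import time


def run_workers(tasks, worker_delays):
    """Hand out tasks from one central queue to workers that pull at their own pace.

    worker_delays maps a hypothetical worker name to the seconds it needs per task.
    Returns a dict of worker name -> list of tasks that worker completed.
    """
    work = queue.Queue()
    for task in tasks:
        work.put(task)

    done = {name: [] for name in worker_delays}

    def worker(name, delay):
        while True:
            try:
                # The worker requests work; the queue never pushes work to it.
                task = work.get_nowait()
            except queue.Empty:
                return  # nothing left to fetch
            time.sleep(delay)  # simulate crawling on a slower or faster machine
            done[name].append(task)

    threads = [
        threading.Thread(target=worker, args=(name, delay))
        for name, delay in worker_delays.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


results = run_workers(
    tasks=[f"feed-{i}" for i in range(40)],
    worker_delays={"old-box": 0.01, "new-box": 0.001},
)
for name, completed in results.items():
    print(name, "completed", len(completed), "tasks")
```

No scheduler has to know how fast each machine is. The split between workers falls out of how often each one comes back for more work, and every task is still completed exactly once.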
How big is your system?
We index 24M weblogs and feeds per hour and process content at about
160-200Mbps.
At the raw level we're writing to our disks at about 10-15MBps continuously.
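As a quick sanity check, the figures quoted here line up with the 52TB per month mentioned earlier. The sketch below assumes decimal units (1 Mbit = 1,000,000 bits, 1 TB = 10^12 bytes) and a 30-day month; neither assumption is stated in the interview.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: a 30-day month


def monthly_terabytes(megabits_per_second):
    """Convert a sustained rate in Mbit/s to decimal terabytes per month."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8
    return bytes_per_second * SECONDS_PER_DAY * DAYS_PER_MONTH / 1e12


def per_second(items_per_hour):
    """Convert an hourly count to an average per-second rate."""
    return items_per_hour / 3600


print(round(monthly_terabytes(160), 2))  # 51.84
print(round(monthly_terabytes(200), 2))  # 64.8
print(round(per_second(24_000_000)))     # 6667
```

So a sustained 160Mbps works out to roughly 52TB a month, and 24M feeds per hour is on the order of 6,700 feed checks per second.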
How many documents do you serve? How many images? How much data?
Right now the database is about 500G. We're expecting it to grow well beyond
this in 2008 as we expand our product offering.
What is your rate of growth?
It's mostly a function of customer feature requests. If our customers want more data we sell it to them.
In 2008 we're planning on expanding our cluster to index larger portions of the
web and consumer generated media.
What is the architecture of your system?
We use Java, MySQL and Linux for our cluster.
Java is a great language for writing crawlers. The library support is pretty
solid (though it seems like Java 7 is going to be killer when they add
closures).
We use MySQL with InnoDB. We're mostly happy with it though it seems I end up
spending about 20% of my time fixing MySQL bugs and limitations.
Of course nothing is perfect. MySQL for example was really designed to be used
on single core systems.
The MySQL 5.1 release goes a bit farther to fix multi-core scalability locks.
I recently blogged about how these new multi-core machines should really be
considered N machines instead of one logical unit: Distributed Computing Fallacy #9.
How is your system architected to scale?
We use a federated database system so that we can split the write load as we see
more IO.
We've released a lot of our infrastructure code as open source, and this will
probably be released as open source as well.
We've already opened up a lot of our infrastructure code:
http://code.tailrank.com/lbpool - load balancing JDBC driver for use with DB connection pools.
http://code.tailrank.com/feedparser - Java RSS/Atom parser designed to elegantly support all versions of RSS
http://code.google.com/p/benchmark4j/ - Java (and UNIX) equivalent of Windows' perfmon
http://code.google.com/p/spinn3r-client/ - Client bindings to access the Spinn3r web service
http://code.google.com/p/mysqlslavesync/ - Clone a MySQL installation and set up replication.
http://code.google.com/p/log5j/ - Logger facade that supports printf style message format for both performance and ease of use.
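The interview does not say how Tailrank's federated database routes writes. One common way to split write load across several databases is to hash each record key to a shard, roughly as sketched below. The shard names and the hash-modulo scheme are assumptions made for illustration, not details from Tailrank.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to database hosts.
SHARDS = ["db-0", "db-1", "db-2", "db-3"]


def shard_for(key, shards=SHARDS):
    """Pick a shard for a key using a hash that is stable across processes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def group_by_shard(keys, shards=SHARDS):
    """Batch keys by destination shard so each database receives only its own writes."""
    batches = {shard: [] for shard in shards}
    for key in keys:
        batches[shard_for(key, shards)].append(key)
    return batches


urls = [f"http://example.com/feed/{i}" for i in range(1000)]
for shard, batch in group_by_shard(urls).items():
    print(shard, len(batch))
```

Plain modulo routing is simple, but it remaps most keys whenever the number of shards changes, which is one reason growing such a system takes planning.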
How many servers do you have?
About 15 machines so far. We've spent a lot of time tuning our infrastructure
so it's pretty efficient. That said, building a scalable crawler is not an easy
task so it does take a lot of hardware.
We're going to be expanding FAR past this in 2008 and will probably hit about
2-3 racks of machines (~120 boxes).
What operating systems do you use?
Linux via Debian Etch on 64 bit Opterons. I'm a big Debian fan. I don't know
why more hardware vendors don't support Debian.
Debian is the big secret in the valley that no one talks about. Most of the big
web 2.0 shops like Technorati, Digg, etc. use Debian.
Which web server do you use?
Apache 2.0. Lighttpd is looking interesting as well.
Which reverse proxy do you use?
About 95% of the pages of Tailrank are served from Squid.
How is your system deployed in data centers?
We use ServerBeach for hosting. It's a great model for small to medium sized
startups. They rack the boxes, maintain inventory, handle network, etc. We
just buy new machines and pay a flat markup.
I wish Dell, SUN, HP would sell directly to clients in this manner.
We're in one data center right now. We're looking to expand into two for redundancy.
What is your storage strategy?
Directly attached storage. We buy two SATA drives per box and set them up in
RAID 0.
We use the redundant array of inexpensive databases solution so if an individual
machine fails there's another copy of the data on another box.
Cheap SATA disks rule for what we do. They're cheap, commodity, and fast.
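The idea of keeping a full copy of the data on more than one box, so that losing a machine (or its RAID 0 array) loses nothing, can be sketched in memory as follows. The `Box` and `ReplicatedStore` classes are hypothetical and only illustrate the principle; in the setup described here MySQL replication would do the copying, and the sketch ignores how a recovered box catches up.

```python
class Box:
    """A hypothetical database machine holding an in-memory copy of the data."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.rows = {}


class ReplicatedStore:
    """Write every row to all live boxes; read from the first live box that has it."""

    def __init__(self, boxes):
        self.boxes = boxes

    def put(self, key, value):
        written = 0
        for box in self.boxes:
            if box.alive:
                box.rows[key] = value
                written += 1
        if written == 0:
            raise RuntimeError("no live box accepted the write")

    def get(self, key):
        for box in self.boxes:
            if box.alive and key in box.rows:
                return box.rows[key]
        raise KeyError(key)


box_a, box_b = Box("box-a"), Box("box-b")
store = ReplicatedStore([box_a, box_b])
store.put("post:1", "hello blogosphere")

box_a.alive = False         # simulate losing one machine
print(store.get("post:1"))  # still served from the copy on box-b
```

Because redundancy lives at the machine level, each box can use striped disks for speed without a single disk failure being fatal to the data set.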
Do you have a standard API to your website?
Tailrank has RSS feeds for every page.
The Spinn3r service is itself an API and we have extensive documentation on the
protocol.
It's also free to use for researchers, so if any of your readers are pursuing a
Ph.D. or otherwise doing research work and need access to blog data, we'd love
to help them out.
We already have Ph.D. students at the University of Washington and the University
of Maryland (my alma mater) using Spinn3r.
Which DNS service do you use?
PowerDNS. It's a great product. We only use the recursor daemon but it's FAST.
It uses async IO though so it doesn't really scale across processors on
multicore boxes. Apparently there's a hack to get it to run across cores but it
isn't very reliable.
AAA caching might be broken though. I still need to look into this.
Who do you admire?
Donald Knuth is the man!
How are you thinking of changing your architecture in the future?
We're still working on finishing up a fully sharded database. MySQL fault
tolerance and autopromotion are also issues.