`
wutao8818
  • 浏览: 622012 次
  • 性别: Icon_minigender_1
  • 来自: 杭州
社区版块
存档分类
最新评论

Tailrank Architecture - Learn How to Track Memes Across the

阅读更多
转自:http://www.highscalability.com/tailrank-architecture-learn-how-track-memes-across-entire-blogosphere


Ever feel like the blogosphere is 500 million channels with nothing on? Tailrank finds the internet's hottest channels by indexing over 24M weblogs and feeds per hour. That's 52TB of raw blog content (no, not sewage) a month and requires continuously processing 160Mbits of IO. How do they do that?

This is an email interview with Kevin Burton, founder and CEO of Tailrank.com. Kevin was kind enough to take the time to explain how they scale to index the entire blogosphere.


Sites


Tailrank - We track the hottest news in the blogosphere!

Spinn3r - A blog spider you can specialize with your own behavior instead of creating your own.

Kevin Burton's Blog - his blog is an indexing mix of politics and technical talk. Both are always interesting.


Platform


MySQL

Java

Linux (Debian)

Apache

Squid

PowerDNS

DAS storage.

Federated database.

ServerBeach hosting.

Job scheduling system for work distribution.


Interview


What is your system is for?

Tailrank originally a memetracker to track the hottest news being discussed
within the blogosphere.

We started having a lot of requests to license our crawler and we shipped that
in the form of Spinn3r about 8 months ago.

Spinn3r is self contained crawler for companies that want to index the full
blogosphere and consumer generated media.

Tailrank is still a very important product alongside Spinn3r and we're working
on Tailrank 3.0 which should be available in the future. No ETA at the moment
but it's actively being worked on.


What particular design/architecture/implementation challenges does your system have?

The biggest challenge we have is the sheer amount of data we have to process and
keeping that data consistent within a distributed system.

For example, we process 52TB of content per month. this has to be indexed in a
highly available storage architecture so the normal distributed database
problems arise.


What did you do to meet these challenges?

We've spent a lot of time in building out a distributed system that can scale
and handle failure.

For example, we've built a tool called Task/Queue that is analogous to Google's
MapReduce. It has a centralized queue server which hands out units of work to
robots which make requests.

It works VERY well for crawlers in that slower machines just fetch work at a
slower rate while more modern machines (or better tuned machines) request work
at a higher rate.

This ends up easily solving one of the main distributed computing fallacies that
the network is homogeneous.

Task/Queue is generic enough that we could actually use it to implement
MapReduce on top of the system.

We'll probably open source it at some point. Right now it has too many
tentacles wrapped into other parts of our system.


How big is your system?

We index 24M weblogs and feeds per hour and process content at about
160-200Mbps.

At the raw level we're writing to our disks at about 10-15MBps continuously.


How many documents, do you serve? How many images? How much data?

Right now the database is about 500G. We're expecting it to grow well beyond
this in 2008 as we expand our product offering.


What is your rate of growth?

It's mostly a function of customer feature requests. If our customers want more data we sell it to them.

In 2008 we're planning on expanding our cluster to index larger portions of the
web and consumer generated media.


What is the architecture of your system?

We use Java, MySQL and Linux for our cluster.

Java is a great language for writing crawlers. The library support is pretty
solid (though it seems like Java 7 is going to be killer when they add
closures).

We use MySQL with InnoDB. We're mostly happy with it though it seems I end up
spending about 20% of my time fixing MySQL bugs and limitations.

Of course nothing is perfect. MySQL for example was really designed to be used
on single core systems.

The MySQL 5.1 release goes a bit farther to fix multi-core scalability locks.

I recently blogged about how these the new multi-core machines should really be
considered N machines instead of one logical unit: Distributed Computing Fallacy #9.


How is your system architected to scale?

We use a federated database system so that we can split the write load as we see
more IO.

We've released a lot of our code as Open Source a lot of our infrastructure and
this will probably be released as Open Source as well.

We've already opened up a lot of our infrastructure code:


http://code.tailrank.com/lbpool - load balancing JDBC driver for use with DB connection pools.

http://code.tailrank.com/feedparser - Java RSS/Atom parser designed to elegantly support all versions of RSS

http://code.google.com/p/benchmark4j/ - Java (and UNIX) equivalent of Windows' perfmon

http://code.google.com/p/spinn3r-client/ - Client bindings to access the Spinn3r web service

http://code.google.com/p/mysqlslavesync/ - Clone a MySQL installation and setup replication.

http://code.google.com/p/log5j/ - Logger facade that supports printf style message format for both performance and ease of use.


How many servers do you have?

About 15 machines so far. We've spent a lot of time tuning our infrastructure
so it's pretty efficient. That said, building a scalable crawler is not an easy
task so it does take a lot of hardware.

We're going to be expanding FAR past this in 2008 and will probably hit about
2-3 racks of machines (~120 boxes).


What operating systems do you use?

Linux via Debian Etch on 64 bit Opterons. I'm a big Debian fan. I don't know
why more hardware vendors don't support Debian.

Debian is the big secret in the valley that no one talks about. Most of the big
web 2.0 shops like Technorati, Digg, etc use Debian.


Which web server do you use?

Apache 2.0. Lighttpd is looking interesting as well.


Which reverse proxy do you use?

About 95% of the pages of Tailrank are served from Squid.


How is your system deployed in data centers?

We use ServerBeach for hosting. It's a great model for small to medium sized
startups. They rack the boxes, maintain inventory, handle network, etc. We
just buy new machines and pay a flat markup.

I wish Dell, SUN, HP would sell directly to clients in this manner.

One right now. We're looking to expand into two for redundancy.


What is your storage strategy?

Directly attached storage. We buy two SATA drives per box and set them up in
RAID 0.

We use the redundant array of inexpensive databases solution so if an individual
machine fails there's another copy of the data on another box.

Cheap SATA disks rule for what we do. They're cheap, commodity, and fast.


Do you have a standard API to your website?

Tailrank has RSS feeds for every page.

The Spinn3r service is itself an API and we have extensive documentation on the
protocol.

It's also free to use for researchers so if any of your readers are pursuing a
Ph.D and generally doing research work and needs access to blog data we'd love
to help them out.

We already have the Ph.D students at the University of Washington and University
of Maryland (my Alma Matter) using Spinn3r.


Which DNS service do you use?

PowerDNS. It's a great product. We only use the recursor daemon but it's FAST.
It uses async IO though so it doesn't really scale across processors on
multicore boxes. Apparenty there's a hack to get it to run across cores but it
isn't very reliable.

AAA caching might be broken though. I still need to look into this.


Who do you admire?

Donald Knuth is the man!


How are you thinking of changing your architecture in the future?

We're still working on finishing up a fully sharded database. MySQL fault
tolerance and autopromotion is also an issue.

分享到:
评论

相关推荐

    50大最酷网站

    - **Tailrank**:使用实时算法分析博客和社交媒体上的内容,以快速识别热门话题和趋势。这涉及到自然语言处理、文本分析和大数据处理技术。 ### 4. 社交媒体与即时通讯 - **MySpace**:早期的社交网络平台,允许...

    大型网站架构技术方案集锦

    众多大型网站架构技术方案集锦,包括PlentyOfFish、YouTube、WikiPedia、Tailrank、Yahoo、Craigslist

    [AB PLC例程源码][MMS_044666]Translation N-A.zip

    AB PLC例程代码项目案例 【备注】 1、该资源内项目代码都经过测试运行成功,功能ok的情况下才上传的,请放心下载使用!有问题请及时沟通交流。 2、适用人群:计算机相关专业(如计科、信息安全、数据科学与大数据技术、人工智能、通信、物联网、自动化、电子信息等)在校学生、专业老师或者企业员工下载使用。 3、用途:项目具有较高的学习借鉴价值,不仅适用于小白学习入门进阶。也可作为毕设项目、课程设计、大作业、初期项目立项演示等。 4、如果基础还行,或热爱钻研,亦可在此项目代码基础上进行修改添加,实现其他不同功能。 欢迎下载!欢迎交流学习!不清楚的可以私信问我!

    kolesar_3cd_01_0716.pdf

    kolesar_3cd_01_0716

    latchman_01_0108.pdf

    latchman_01_0108

    matlab程序代码项目案例:matlab程序代码项目案例MPC在美国高速公路场景中移动的车辆上的实现.zip

    matlab程序代码项目案例 【备注】 1、该资源内项目代码都经过测试运行成功,功能ok的情况下才上传的,请放心下载使用!有问题请及时沟通交流。 2、适用人群:计算机相关专业(如计科、信息安全、数据科学与大数据技术、人工智能、通信、物联网、自动化、电子信息等)在校学生、专业老师或者企业员工下载使用。 3、用途:项目具有较高的学习借鉴价值,不仅适用于小白学习入门进阶。也可作为毕设项目、课程设计、大作业、初期项目立项演示等。 4、如果基础还行,或热爱钻研,亦可在此项目代码基础上进行修改添加,实现其他不同功能。 欢迎下载!欢迎交流学习!不清楚的可以私信问我!

    pimpinella_3cd_01_0716.pdf

    pimpinella_3cd_01_0716

    petrilla_01_0308.pdf

    petrilla_01_0308

    [AB PLC例程源码][MMS_041452]Speed Controls in Plastic Extrusion.zip

    AB PLC例程代码项目案例 【备注】 1、该资源内项目代码都经过测试运行成功,功能ok的情况下才上传的,请放心下载使用!有问题请及时沟通交流。 2、适用人群:计算机相关专业(如计科、信息安全、数据科学与大数据技术、人工智能、通信、物联网、自动化、电子信息等)在校学生、专业老师或者企业员工下载使用。 3、用途:项目具有较高的学习借鉴价值,不仅适用于小白学习入门进阶。也可作为毕设项目、课程设计、大作业、初期项目立项演示等。 4、如果基础还行,或热爱钻研,亦可在此项目代码基础上进行修改添加,实现其他不同功能。 欢迎下载!欢迎交流学习!不清楚的可以私信问我!

    强化学习驱动下DeepSeek技术创新及其对AI发展的影响

    内容概要:本文档由张卓老师讲解,重点探讨DeepSeek的技术革新及强化学习对未来AI发展的重要性。文章回顾了AI的历史与发展阶段,详细解析Transformer架构在AI上半场所起到的作用,深入介绍了MoE混合专家以及MLA低秩注意机制等技术特点如何帮助DeepSeek在AI中场建立优势,并探讨了当前强化学习的挑战和边界。文档不仅提及AlphaGo和小游戏等成功案例来说明强化学习的强大力量,还提出了关于未来人工通用智能(AGI)的展望,特别是如何利用强化学习提升现有LLMs的能力和性能。 适用人群:本资料适宜对深度学习感兴趣的研究人员、开发者以及想要深入了解人工智能最新进展的专业人士。 使用场景及目标:通过了解最新的AI技术和前沿概念,在实际工作中能够运用更先进的工具和技术解决问题。同时为那些寻求职业转型或者学术深造的人提供了宝贵的参考。 其他说明:文中提到了许多具体的例子和技术细节,如DeepSeek的技术特色、RL的理论背景等等,有助于加深读者对于现代AI系统的理解和认识。

    有师傅小程序开源版v2.4.14+前端.zip

    有师傅小程序开源版v2.4.14 新增报价短信奉告 优化部分细节

    [AB PLC例程源码][MMS_047333]Motor Sequence Starter with timers to start.zip

    AB PLC例程代码项目案例 【备注】 1、该资源内项目代码都经过测试运行成功,功能ok的情况下才上传的,请放心下载使用!有问题请及时沟通交流。 2、适用人群:计算机相关专业(如计科、信息安全、数据科学与大数据技术、人工智能、通信、物联网、自动化、电子信息等)在校学生、专业老师或者企业员工下载使用。 3、用途:项目具有较高的学习借鉴价值,不仅适用于小白学习入门进阶。也可作为毕设项目、课程设计、大作业、初期项目立项演示等。 4、如果基础还行,或热爱钻研,亦可在此项目代码基础上进行修改添加,实现其他不同功能。 欢迎下载!欢迎交流学习!不清楚的可以私信问我!

    商城二级三级分销系统(小程序+后台含源码).zip

    商城二级三级分销系统(小程序+后台含源码).zip

    li_3ck_01b_0918.pdf

    li_3ck_01b_0918

    nicholl_3cd_01_0516.pdf

    nicholl_3cd_01_0516

    1995-2022年 网络媒体关注度、报刊媒体关注度与媒体监督相关数据.zip

    媒体关注度是一个衡量公众对某个事件、话题或个体关注程度的重要指标。它主要反映了新闻媒体、社交媒体、博客等对于某一事件、话题或个体的报道和讨论程度。 媒体监督的J-F系数(Janis-Fadner系数)是一种用于测量媒体关注度的指标,特别是用于评估媒体对企业、事件或话题的监督力度。J-F系数基于媒体报道的正面和负面内容来计算,从而为公众、研究者或企业提供一个量化工具,以了解媒体对其关注的方向和强度。 本数据含原始数据、参考文献、代码do文件、最终结果。参考文献中JF系数计算公式。 指标 代码、年份、标题出现该公司的新闻总数、内容出现该公司的新闻总数、正面新闻数全部、中性新闻数全部、负面新闻数全部、正面新闻数原创、中性新闻数原创、负面新闻数原创,媒体监督JF系数。

    [AB PLC例程源码][MMS_040315]Double INC and Double DEC of INT datatype.zip

    AB PLC例程代码项目案例 【备注】 1、该资源内项目代码都经过测试运行成功,功能ok的情况下才上传的,请放心下载使用!有问题请及时沟通交流。 2、适用人群:计算机相关专业(如计科、信息安全、数据科学与大数据技术、人工智能、通信、物联网、自动化、电子信息等)在校学生、专业老师或者企业员工下载使用。 3、用途:项目具有较高的学习借鉴价值,不仅适用于小白学习入门进阶。也可作为毕设项目、课程设计、大作业、初期项目立项演示等。 4、如果基础还行,或热爱钻研,亦可在此项目代码基础上进行修改添加,实现其他不同功能。 欢迎下载!欢迎交流学习!不清楚的可以私信问我!

    [AB PLC例程源码][MMS_047773]Convert Feet to Millimeters.zip

    AB PLC例程代码项目案例 【备注】 1、该资源内项目代码都经过测试运行成功,功能ok的情况下才上传的,请放心下载使用!有问题请及时沟通交流。 2、适用人群:计算机相关专业(如计科、信息安全、数据科学与大数据技术、人工智能、通信、物联网、自动化、电子信息等)在校学生、专业老师或者企业员工下载使用。 3、用途:项目具有较高的学习借鉴价值,不仅适用于小白学习入门进阶。也可作为毕设项目、课程设计、大作业、初期项目立项演示等。 4、如果基础还行,或热爱钻研,亦可在此项目代码基础上进行修改添加,实现其他不同功能。 欢迎下载!欢迎交流学习!不清楚的可以私信问我!

    [AB PLC例程源码][MMS_042349]How to read-write data to-from a PLC using OPC in Visual Basic 6.zip

    AB PLC例程代码项目案例 【备注】 1、该资源内项目代码都经过测试运行成功,功能ok的情况下才上传的,请放心下载使用!有问题请及时沟通交流。 2、适用人群:计算机相关专业(如计科、信息安全、数据科学与大数据技术、人工智能、通信、物联网、自动化、电子信息等)在校学生、专业老师或者企业员工下载使用。 3、用途:项目具有较高的学习借鉴价值,不仅适用于小白学习入门进阶。也可作为毕设项目、课程设计、大作业、初期项目立项演示等。 4、如果基础还行,或热爱钻研,亦可在此项目代码基础上进行修改添加,实现其他不同功能。 欢迎下载!欢迎交流学习!不清楚的可以私信问我!

    matlab程序代码项目案例:matlab程序代码项目案例论文代码 多篇RMPC 鲁棒模型预测控制Paper-code-implementation.zip

    matlab程序代码项目案例 【备注】 1、该资源内项目代码都经过测试运行成功,功能ok的情况下才上传的,请放心下载使用!有问题请及时沟通交流。 2、适用人群:计算机相关专业(如计科、信息安全、数据科学与大数据技术、人工智能、通信、物联网、自动化、电子信息等)在校学生、专业老师或者企业员工下载使用。 3、用途:项目具有较高的学习借鉴价值,不仅适用于小白学习入门进阶。也可作为毕设项目、课程设计、大作业、初期项目立项演示等。 4、如果基础还行,或热爱钻研,亦可在此项目代码基础上进行修改添加,实现其他不同功能。 欢迎下载!欢迎交流学习!不清楚的可以私信问我!

Global site tag (gtag.js) - Google Analytics