`
wutao8818
  • 浏览: 623886 次
  • 性别: Icon_minigender_1
  • 来自: 杭州
社区版块
存档分类
最新评论

Tailrank Architecture - Learn How to Track Memes Across the

阅读更多
转自:http://www.highscalability.com/tailrank-architecture-learn-how-track-memes-across-entire-blogosphere


Ever feel like the blogosphere is 500 million channels with nothing on? Tailrank finds the internet's hottest channels by indexing over 24M weblogs and feeds per hour. That's 52TB of raw blog content (no, not sewage) a month and requires continuously processing 160Mbits of IO. How do they do that?

This is an email interview with Kevin Burton, founder and CEO of Tailrank.com. Kevin was kind enough to take the time to explain how they scale to index the entire blogosphere.


Sites


Tailrank - We track the hottest news in the blogosphere!

Spinn3r - A blog spider you can specialize with your own behavior instead of creating your own.

Kevin Burton's Blog - his blog is an indexing mix of politics and technical talk. Both are always interesting.


Platform


MySQL

Java

Linux (Debian)

Apache

Squid

PowerDNS

DAS storage.

Federated database.

ServerBeach hosting.

Job scheduling system for work distribution.


Interview


What is your system is for?

Tailrank originally a memetracker to track the hottest news being discussed
within the blogosphere.

We started having a lot of requests to license our crawler and we shipped that
in the form of Spinn3r about 8 months ago.

Spinn3r is self contained crawler for companies that want to index the full
blogosphere and consumer generated media.

Tailrank is still a very important product alongside Spinn3r and we're working
on Tailrank 3.0 which should be available in the future. No ETA at the moment
but it's actively being worked on.


What particular design/architecture/implementation challenges does your system have?

The biggest challenge we have is the sheer amount of data we have to process and
keeping that data consistent within a distributed system.

For example, we process 52TB of content per month. this has to be indexed in a
highly available storage architecture so the normal distributed database
problems arise.


What did you do to meet these challenges?

We've spent a lot of time in building out a distributed system that can scale
and handle failure.

For example, we've built a tool called Task/Queue that is analogous to Google's
MapReduce. It has a centralized queue server which hands out units of work to
robots which make requests.

It works VERY well for crawlers in that slower machines just fetch work at a
slower rate while more modern machines (or better tuned machines) request work
at a higher rate.

This ends up easily solving one of the main distributed computing fallacies that
the network is homogeneous.

Task/Queue is generic enough that we could actually use it to implement
MapReduce on top of the system.

We'll probably open source it at some point. Right now it has too many
tentacles wrapped into other parts of our system.


How big is your system?

We index 24M weblogs and feeds per hour and process content at about
160-200Mbps.

At the raw level we're writing to our disks at about 10-15MBps continuously.


How many documents, do you serve? How many images? How much data?

Right now the database is about 500G. We're expecting it to grow well beyond
this in 2008 as we expand our product offering.


What is your rate of growth?

It's mostly a function of customer feature requests. If our customers want more data we sell it to them.

In 2008 we're planning on expanding our cluster to index larger portions of the
web and consumer generated media.


What is the architecture of your system?

We use Java, MySQL and Linux for our cluster.

Java is a great language for writing crawlers. The library support is pretty
solid (though it seems like Java 7 is going to be killer when they add
closures).

We use MySQL with InnoDB. We're mostly happy with it though it seems I end up
spending about 20% of my time fixing MySQL bugs and limitations.

Of course nothing is perfect. MySQL for example was really designed to be used
on single core systems.

The MySQL 5.1 release goes a bit farther to fix multi-core scalability locks.

I recently blogged about how these the new multi-core machines should really be
considered N machines instead of one logical unit: Distributed Computing Fallacy #9.


How is your system architected to scale?

We use a federated database system so that we can split the write load as we see
more IO.

We've released a lot of our code as Open Source a lot of our infrastructure and
this will probably be released as Open Source as well.

We've already opened up a lot of our infrastructure code:


http://code.tailrank.com/lbpool - load balancing JDBC driver for use with DB connection pools.

http://code.tailrank.com/feedparser - Java RSS/Atom parser designed to elegantly support all versions of RSS

http://code.google.com/p/benchmark4j/ - Java (and UNIX) equivalent of Windows' perfmon

http://code.google.com/p/spinn3r-client/ - Client bindings to access the Spinn3r web service

http://code.google.com/p/mysqlslavesync/ - Clone a MySQL installation and setup replication.

http://code.google.com/p/log5j/ - Logger facade that supports printf style message format for both performance and ease of use.


How many servers do you have?

About 15 machines so far. We've spent a lot of time tuning our infrastructure
so it's pretty efficient. That said, building a scalable crawler is not an easy
task so it does take a lot of hardware.

We're going to be expanding FAR past this in 2008 and will probably hit about
2-3 racks of machines (~120 boxes).


What operating systems do you use?

Linux via Debian Etch on 64 bit Opterons. I'm a big Debian fan. I don't know
why more hardware vendors don't support Debian.

Debian is the big secret in the valley that no one talks about. Most of the big
web 2.0 shops like Technorati, Digg, etc use Debian.


Which web server do you use?

Apache 2.0. Lighttpd is looking interesting as well.


Which reverse proxy do you use?

About 95% of the pages of Tailrank are served from Squid.


How is your system deployed in data centers?

We use ServerBeach for hosting. It's a great model for small to medium sized
startups. They rack the boxes, maintain inventory, handle network, etc. We
just buy new machines and pay a flat markup.

I wish Dell, SUN, HP would sell directly to clients in this manner.

One right now. We're looking to expand into two for redundancy.


What is your storage strategy?

Directly attached storage. We buy two SATA drives per box and set them up in
RAID 0.

We use the redundant array of inexpensive databases solution so if an individual
machine fails there's another copy of the data on another box.

Cheap SATA disks rule for what we do. They're cheap, commodity, and fast.


Do you have a standard API to your website?

Tailrank has RSS feeds for every page.

The Spinn3r service is itself an API and we have extensive documentation on the
protocol.

It's also free to use for researchers so if any of your readers are pursuing a
Ph.D and generally doing research work and needs access to blog data we'd love
to help them out.

We already have the Ph.D students at the University of Washington and University
of Maryland (my Alma Matter) using Spinn3r.


Which DNS service do you use?

PowerDNS. It's a great product. We only use the recursor daemon but it's FAST.
It uses async IO though so it doesn't really scale across processors on
multicore boxes. Apparenty there's a hack to get it to run across cores but it
isn't very reliable.

AAA caching might be broken though. I still need to look into this.


Who do you admire?

Donald Knuth is the man!


How are you thinking of changing your architecture in the future?

We're still working on finishing up a fully sharded database. MySQL fault
tolerance and autopromotion is also an issue.

分享到:
评论

相关推荐

    50大最酷网站

    - **Tailrank**:使用实时算法分析博客和社交媒体上的内容,以快速识别热门话题和趋势。这涉及到自然语言处理、文本分析和大数据处理技术。 ### 4. 社交媒体与即时通讯 - **MySpace**:早期的社交网络平台,允许...

    大型网站架构技术方案集锦

    众多大型网站架构技术方案集锦,包括PlentyOfFish、YouTube、WikiPedia、Tailrank、Yahoo、Craigslist

    计算机术语.pdf

    计算机术语.pdf

    包括缺陷和有限视场效应的Etalon模型 matlab代码.rar

    1.版本:matlab2014/2019a/2024a 2.附赠案例数据可直接运行matlab程序。 3.代码特点:参数化编程、参数可方便更改、代码编程思路清晰、注释明细。 4.适用对象:计算机,电子信息工程、数学等专业的大学生课程设计、期末大作业和毕业设计。

    基于PLC和组态软件的智能停车场收费系统:电气控制与梯形图程序详解

    内容概要:本文详细介绍了基于PLC(可编程逻辑控制器)和组态软件的智能停车场收费系统的实现方法和技术细节。首先,文章概述了系统的总体架构,指出PLC用于控制停车场的电气设备,而组态软件则提供直观的操作界面。接着,深入解析了PLC梯形图程序的具体逻辑,包括车辆检测、闸门控制、收费计算等功能模块。此外,文章还讨论了接线图的设计原则和注意事项,如防止电磁干扰、确保系统稳定性的措施。最后,介绍了组态画面的设计思路及其优化方法,如动态显示车位状态、实时更新收费信息等。通过这些内容,读者能够全面了解智能停车场收费系统的内部运作机制。 适合人群:从事自动化控制、工业物联网、智能交通等领域的工作技术人员,尤其是对PLC编程和组态软件应用感兴趣的工程师。 使用场景及目标:适用于新建或改造停车场项目的规划与实施阶段,帮助工程师理解和设计类似的自动化控制系统,提高停车场管理效率和服务质量。 其他说明:文中提供了大量实际案例和技术细节,有助于读者更好地掌握相关技术和应对实际工程中的挑战。

    MATLAB实现电-气-热综合能源系统耦合优化调度模型

    内容概要:本文详细介绍了利用MATLAB及其工具箱YALMIP和求解器CPLEX/Gurobi构建电-气-热综合能源系统耦合优化调度模型的方法。首先,文章描述了电网部分采用39节点系统进行直流潮流建模,气网部分则使用比利时20节点配气网,并对Weymouth方程进行了线性化处理,将非线性问题转化为线性规划问题。热网部分引入了热电联产(CHP)和电转气(P2G)设备,实现了热电耦合。通过模块化设计,代码能够灵活地添加新的能量存储或转换设备。实验结果显示,相比单一网络优化,三网耦合优化降低了12.6%的系统总成本,并显著改善了负荷峰谷差。 适合人群:从事能源系统优化研究的专业人士,尤其是熟悉MATLAB编程和优化理论的研究人员和技术人员。 使用场景及目标:适用于希望深入了解综合能源系统耦合优化调度机制的研究人员和技术人员。主要目标是掌握如何使用MATLAB搭建电-气-热耦合优化模型,理解各个子系统的数学建模方法以及它们之间的相互作用。 其他说明:文中提供了详细的代码片段和解释,帮助读者更好地理解和复现模型。此外,还讨论了一些实际应用中的注意事项,如求解器的选择、参数调优等。

    计算机三级网络机试考试试题及答案(下).pdf

    计算机三级网络机试考试试题及答案(下).pdf

    NX MCD时序仿真中机械臂抓取仿真的参数配置与PLC联动实现

    内容概要:本文详细介绍了使用NX MCD进行机械臂抓取仿真的方法和技术要点。首先探讨了运行时参数的配置,如夹爪力度的动态调整和位置控制的脚本编写。接着讨论了条件仿真序列的设计,包括状态机跳转、阻塞等待、异步响应和超时保护等关键概念。此外,文章还讲解了与PLC的联合仿真,展示了如何通过TIA Portal实现抓取力度的动态补偿以及信号同步。最后分享了一些实用的调试技巧,如使用半速模式观察力学变化、设置碰撞检测触发器等。 适合人群:从事自动化设备开发、机械臂控制系统设计的技术人员,尤其是对NX MCD和PLC有一定了解的工程师。 使用场景及目标:适用于需要进行复杂机械臂抓取仿真的项目,帮助工程师更好地理解和掌握NX MCD与时序仿真的核心技术,提高仿真精度和可靠性。 其他说明:文中提供了大量具体的代码片段和配置示例,便于读者快速上手实践。同时强调了参数化配置的重要性,指出这是为了在现场调试时提供更大的灵活性。

    计算机数控系统.pdf

    计算机数控系统.pdf

    基于Qt框架的音频采集与播放工具

    本人创作,禁止商用

    大型流水线贴膜机PLC与触摸屏程序:初学者必备的工业控制项目

    内容概要:本文详细介绍了一款大型流水线贴膜机的PLC程序和触摸屏程序,涵盖多个控制工艺如上下气缸控制、输送带电机控制、贴膜伺服控制等。程序适用于西门子S7-1200 PLC和KTP700触摸屏,支持V13及以上版本。文中提供了详细的代码示例和分析,解释了各个控制部分的工作原理及其优化技巧。此外,还介绍了异常处理机制、报警处理模块、以及触摸屏界面上的一些实用功能,如动画流程图显示和参数微调。 适合人群:工业自动化领域的初学者,尤其是对PLC编程和运动控制感兴趣的工程师和技术人员。 使用场景及目标:① 学习PLC编程和触摸屏程序设计的基础知识;② 掌握常见工业控制元件的编程方法和优化技巧;③ 提高对复杂控制系统的设计和调试能力。 其他说明:文章强调了程序中的关键技术和注意事项,如定时器保护、光电开关连锁、位置补偿算法等,有助于初学者避免常见错误并提高系统的可靠性和安全性。

    基于51单片机的多点测温系统:利用DS18B20传感器与LCD1602实现实时温度监测

    内容概要:本文详细介绍了基于51单片机的多点测温系统的构建方法。系统采用五个DS18B20数字温度传感器进行温度采集,并将数据实时显示在LCD1602屏幕上。文中涵盖了硬件连接、单总线通信协议、温度读取与显示的具体实现细节,以及常见问题的解决方案。特别强调了ROM匹配算法的应用,确保多个传感器在同一总线上能够正确通信。此外,还提供了Proteus仿真的注意事项和一些调试技巧。 适合人群:对嵌入式系统开发感兴趣的初学者和有一定单片机基础的研发人员。 使用场景及目标:适用于恒温箱监控、多房间温控等应用场景,旨在帮助开发者掌握多点温度监测系统的搭建方法和技术要点。 其他说明:文中附有完整的硬件连接图和核心代码片段,便于读者理解和实践。同时提到了一些扩展功能,如温度单位切换、阈值报警等,增加了项目的趣味性和实用性。

    直流电机模糊PID控制技术详解及其Python与C语言实现

    内容概要:本文详细介绍了将模糊控制与传统PID相结合应用于直流电机控制的方法。首先阐述了传统PID控制在面对负载突变或转速大幅变化时的局限性,随后引入模糊PID的概念并展示了具体的实现步骤。文中提供了完整的Python和C语言代码示例,涵盖模糊规则表的设计、隶属度函数的选择以及参数自适应调整机制。此外,作者还分享了多个实用的经验技巧,如参数调整范围限制、误差量化因子选择、抗积分饱和算法的应用等。并通过实验数据对比证明了模糊PID相比传统PID在响应速度和稳定性方面的优势。 适合人群:具有一定自动化控制理论基础和技术实践经验的研发人员,尤其是从事电机控制系统开发的技术人员。 使用场景及目标:适用于需要提高直流电机控制系统鲁棒性和响应速度的实际工程项目。主要目标是在保持系统稳定的前提下,缩短调节时间和减少超调量,从而提升整体性能。 其他说明:尽管模糊PID能够显著改善某些特定条件下的控制效果,但仍需注意合理设置初始参数和调整幅度限制。同时,对于不同类型的电机和应用场景,可能还需要进一步优化模糊规则和隶属度函数。

    计算机试题office应用.pdf

    计算机试题office应用.pdf

    强化学习算法的功能实现,举了一个小例子,运行无问题 matlab代码.rar

    1.版本:matlab2014/2019a/2024a 2.附赠案例数据可直接运行matlab程序。 3.代码特点:参数化编程、参数可方便更改、代码编程思路清晰、注释明细。 4.适用对象:计算机,电子信息工程、数学等专业的大学生课程设计、期末大作业和毕业设计。

    基于多目标粒子群算法的CCHP联供系统MATLAB优化代码解析与应用

    内容概要:本文详细介绍了用于冷热电联供系统(CCHP)的多目标粒子群优化(MOPSO)算法MATLAB实现。该代码通过动态惯性权重、轮盘赌全局最优选取和约束集成等特性,解决了燃气轮机出力与风光发电波动的平衡问题,优化了电制冷机和锅炉的启停策略,从而提高系统的经济性和环保性能。文中展示了核心代码片段,如粒子位置更新、适应度函数构建、约束处理策略以及帕累托前沿筛选等,强调了工程化思维的应用,如设备启停控制、风光预测处理等。 适合人群:从事能源系统优化的研究人员、工程师和技术爱好者,尤其是对MATLAB编程和多目标优化算法有一定了解的人士。 使用场景及目标:适用于需要优化冷热电联供系统运行策略的场合,旨在实现系统运行成本最小化和碳排放量最低的目标。具体应用场景包括但不限于:工业园区能源管理、分布式能源系统调度、智能电网优化等。 其他说明:该代码不仅提供了理论上的优化方案,还通过实际案例验证了其有效性,如在夏季负荷高峰场景下的动态调度策略。此外,代码具有良好的扩展性和实用性,支持多种设备模型和目标函数的定制化修改。

    计算机求职笔试内容与分类

    计算机求职笔试内容与分类

    料箱输送线程序:WCS与PLC的Socket接口及分拣控制详解

    内容概要:本文详细介绍了欧洲进口料箱分拣系统的程序架构及其核心技术。系统采用西门子S7-1500 PLC作为控制器,通过Socket接口实现WCS(仓储控制系统)与PLC之间的高效通信。文中展示了PLC端的Socket服务端代码,以及分拣逻辑的具体实现,包括动态权重算法优化分拣路径、异常处理机制、变频器控制和报警处理模块的设计。此外,文章还探讨了硬件配置如扫码枪、直流辊筒电机和变频器的作用,以及程序中的模块化设计和工业级代码规范。 适合人群:从事工业自动化领域的工程师和技术人员,尤其是对PLC编程、WCS集成和工业物联网感兴趣的读者。 使用场景及目标:适用于需要深入了解料箱输送线控制系统的工作原理、优化分拣效率、提高系统可靠性和稳定性的应用场景。目标是帮助读者掌握WCS与PLC的Socket通信设计、分拣逻辑优化及硬件配置的最佳实践。 其他说明:文章不仅提供了详细的代码示例,还分享了许多实际调试经验和设计思路,有助于读者更好地理解和应用相关技术。

    三菱FX5U PLC ST语言螺丝机程序详解:涵盖轴控制、气缸逻辑及触摸屏交互的标准模板

    内容概要:本文详细介绍了基于三菱FX5U PLC的螺丝机项目的ST语言程序,涵盖了轴控制、气缸逻辑以及威纶通触摸屏交互的设计与实现。程序采用功能块(FB)封装的方式,将复杂的运动控制和气缸操作简化为易于理解和使用的模块。轴控制部分利用状态机实现了伺服回原点等功能,并加入了类型校验和异常处理机制。气缸控制则通过状态机和超时保护确保可靠性。触摸屏程序通过严格的变量映射和结构化设计,实现了PLC与HMI的无缝对接。此外,还包括详细的注释和报警处理机制,使得系统更加健壮。 适合人群:具备PLC编程基础的技术人员,尤其是对三菱FX系列PLC和ST语言感兴趣的工程师。 使用场景及目标:适用于需要深入了解PLC编程和工业自动化系统的工程师,帮助他们在实际项目中快速掌握ST语言的应用技巧,提高开发效率并减少调试时间。 其他说明:文中提供了大量实际案例和技术细节,如轴控制、气缸控制、触摸屏交互等,有助于读者更好地理解和应用相关技术。同时,丰富的注释和错误处理机制也为后续维护提供了便利。

    地铁线路最短路径规划1.1版本

    帮助用户规划地铁出行路线

Global site tag (gtag.js) - Google Analytics