Tailrank Architecture - Learn How to Track Memes Across the Entire Blogosphere

Reposted from: http://www.highscalability.com/tailrank-architecture-learn-how-track-memes-across-entire-blogosphere


Ever feel like the blogosphere is 500 million channels with nothing on? Tailrank finds the internet's hottest channels by indexing over 24M weblogs and feeds per hour. That's 52TB of raw blog content (no, not sewage) a month, and it requires continuously processing 160Mbps of IO. How do they do that?

This is an email interview with Kevin Burton, founder and CEO of Tailrank.com. Kevin was kind enough to take the time to explain how they scale to index the entire blogosphere.


Sites


Tailrank - We track the hottest news in the blogosphere!

Spinn3r - A blog spider you can specialize with your own behavior instead of creating your own.

Kevin Burton's Blog - His blog is a mix of politics and technical talk. Both are always interesting.


Platform


MySQL

Java

Linux (Debian)

Apache

Squid

PowerDNS

DAS storage.

Federated database.

ServerBeach hosting.

Job scheduling system for work distribution.


Interview


What is your system for?

Tailrank was originally a memetracker to track the hottest news being discussed
within the blogosphere.

We started having a lot of requests to license our crawler and we shipped that
in the form of Spinn3r about 8 months ago.

Spinn3r is a self-contained crawler for companies that want to index the full
blogosphere and consumer generated media.

Tailrank is still a very important product alongside Spinn3r and we're working
on Tailrank 3.0 which should be available in the future. No ETA at the moment
but it's actively being worked on.


What particular design/architecture/implementation challenges does your system have?

The biggest challenge we have is the sheer amount of data we have to process and
keeping that data consistent within a distributed system.

For example, we process 52TB of content per month. This has to be indexed in a
highly available storage architecture, so the normal distributed database
problems arise.


What did you do to meet these challenges?

We've spent a lot of time in building out a distributed system that can scale
and handle failure.

For example, we've built a tool called Task/Queue that is analogous to Google's
MapReduce. It has a centralized queue server which hands out units of work to
robots which make requests.

It works VERY well for crawlers in that slower machines just fetch work at a
slower rate while more modern machines (or better tuned machines) request work
at a higher rate.

This ends up easily solving one of the main distributed computing fallacies:
that the network is homogeneous.

Task/Queue is generic enough that we could actually use it to implement
MapReduce on top of the system.

We'll probably open source it at some point. Right now it has too many
tentacles wrapped into other parts of our system.
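
The pull model described above can be illustrated with a small sketch. This is
not Task/Queue itself, which has not been published; it is a minimal
in-process stand-in, assuming a single shared queue and worker threads whose
per-task delay plays the role of machine speed. The names run_workers and
worker_delays are hypothetical.

```python
import queue
import threading
import time
from collections import Counter


def run_workers(tasks, worker_delays):
    """Let workers pull tasks from one central queue at their own pace.

    worker_delays maps a worker name to the seconds it spends per task
    (an assumed stand-in for machine speed). Returns tasks done per worker.
    """
    work = queue.Queue()
    for task in tasks:
        work.put(task)

    done = Counter()
    lock = threading.Lock()

    def worker(name, delay):
        while True:
            try:
                # The worker asks for work; nothing is pushed to it.
                work.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            time.sleep(delay)  # simulate processing one unit of work
            with lock:
                done[name] += 1

    threads = [
        threading.Thread(target=worker, args=(name, delay))
        for name, delay in worker_delays.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return done


if __name__ == "__main__":
    # The faster worker ends up with most of the tasks without any scheduler
    # having to know how fast each machine is.
    print(run_workers(range(40), {"fast": 0.002, "slow": 0.05}))
```

Because each worker only requests a new unit when it has finished the last
one, load balancing across uneven hardware falls out of the design rather than
being computed by the queue server.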


How big is your system?

We index 24M weblogs and feeds per hour and process content at about
160-200Mbps.

At the raw level we're writing to our disks at about 10-15MBps continuously.
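
The bandwidth and monthly volume figures quoted in this interview are
consistent with each other, which a quick back-of-the-envelope calculation
shows. The sketch below assumes decimal units (1 TB = 10^12 bytes), a 30-day
month, and a link that is saturated around the clock; none of these
assumptions come from the interview.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def monthly_terabytes(megabits_per_second, days=30):
    """Convert a sustained rate in Mbps to decimal terabytes per month."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8
    return bytes_per_second * SECONDS_PER_DAY * days / 1e12


for rate in (160, 200):
    print(f"{rate} Mbps sustained is about {monthly_terabytes(rate):.1f} TB per month")
# 160 Mbps sustained is about 51.8 TB per month
# 200 Mbps sustained is about 64.8 TB per month
```

A sustained 160Mbps therefore works out to roughly the 52TB per month
mentioned earlier.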


How many documents do you serve? How many images? How much data?

Right now the database is about 500G. We're expecting it to grow well beyond
this in 2008 as we expand our product offering.


What is your rate of growth?

It's mostly a function of customer feature requests. If our customers want more data we sell it to them.

In 2008 we're planning on expanding our cluster to index larger portions of the
web and consumer generated media.


What is the architecture of your system?

We use Java, MySQL and Linux for our cluster.

Java is a great language for writing crawlers. The library support is pretty
solid (though it seems like Java 7 is going to be killer when they add
closures).

We use MySQL with InnoDB. We're mostly happy with it though it seems I end up
spending about 20% of my time fixing MySQL bugs and limitations.

Of course nothing is perfect. MySQL for example was really designed to be used
on single core systems.

The MySQL 5.1 release goes a bit further toward fixing multi-core scalability
locks.

I recently blogged about how these new multi-core machines should really be
considered N machines instead of one logical unit: Distributed Computing Fallacy #9.


How is your system architected to scale?

We use a federated database system so that we can split the write load as we see
more IO.
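
The interview does not say how rows are assigned to databases in the
federation. One common approach, shown here purely as an illustration, is to
hash a key and take it modulo the number of shards, so that writes spread
roughly evenly. The host names and the shard_for helper are hypothetical.

```python
import hashlib

# Hypothetical shard hosts; the real layout is not described in the interview.
SHARDS = [
    "db0.example.internal",
    "db1.example.internal",
    "db2.example.internal",
    "db3.example.internal",
]


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because the built-in is
    randomized per process for strings and would route inconsistently.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for url in ("http://example.com/feed.xml", "http://example.org/atom"):
        print(url, "->", SHARDS[shard_for(url, len(SHARDS))])
```

Plain modulo hashing is simple, but changing the number of shards remaps most
keys, so a system that adds shards as IO grows would likely need a lookup
table or consistent hashing on top of this.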

We've released a lot of our infrastructure code as Open Source, and this will
probably be released as Open Source as well.

We've already opened up a lot of our infrastructure code:


http://code.tailrank.com/lbpool - load balancing JDBC driver for use with DB connection pools.

http://code.tailrank.com/feedparser - Java RSS/Atom parser designed to elegantly support all versions of RSS

http://code.google.com/p/benchmark4j/ - Java (and UNIX) equivalent of Windows' perfmon

http://code.google.com/p/spinn3r-client/ - Client bindings to access the Spinn3r web service

http://code.google.com/p/mysqlslavesync/ - Clone a MySQL installation and setup replication.

http://code.google.com/p/log5j/ - Logger facade that supports printf style message format for both performance and ease of use.


How many servers do you have?

About 15 machines so far. We've spent a lot of time tuning our infrastructure
so it's pretty efficient. That said, building a scalable crawler is not an easy
task so it does take a lot of hardware.

We're going to be expanding FAR past this in 2008 and will probably hit about
2-3 racks of machines (~120 boxes).


What operating systems do you use?

Linux via Debian Etch on 64-bit Opterons. I'm a big Debian fan. I don't know
why more hardware vendors don't support Debian.

Debian is the big secret in the valley that no one talks about. Most of the big
web 2.0 shops like Technorati, Digg, etc. use Debian.


Which web server do you use?

Apache 2.0. Lighttpd is looking interesting as well.


Which reverse proxy do you use?

About 95% of the pages of Tailrank are served from Squid.


How is your system deployed in data centers?

We use ServerBeach for hosting. It's a great model for small to medium sized
startups. They rack the boxes, maintain inventory, handle network, etc. We
just buy new machines and pay a flat markup.

I wish Dell, SUN, HP would sell directly to clients in this manner.

We're in one data center right now. We're looking to expand into two for redundancy.


What is your storage strategy?

Directly attached storage. We buy two SATA drives per box and set them up in
RAID 0.

We use the redundant array of inexpensive databases solution so if an individual
machine fails there's another copy of the data on another box.
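
The idea of tolerating a dead box by reading from another copy can be sketched
as follows. This is an assumption-laden illustration, not Tailrank's code:
each replica is modeled as a plain callable that either returns a result or
raises ConnectionError, and read_with_failover is a hypothetical helper.

```python
def read_with_failover(replicas, query):
    """Try each replica in order and return the first successful result.

    replicas is a list of callables taking the query; a ConnectionError from
    one of them is treated as "that box is down, try the next copy".
    """
    last_error = None
    for fetch in replicas:
        try:
            return fetch(query)
        except ConnectionError as err:
            last_error = err  # remember the failure and move on
    raise RuntimeError("all replicas failed") from last_error


def dead_box(query):
    # Simulates a machine whose disk or network has failed.
    raise ConnectionError("db1 unreachable")


def healthy_box(query):
    # Simulates a replica that holds another copy of the data.
    return f"rows for: {query}"


if __name__ == "__main__":
    print(read_with_failover([dead_box, healthy_box], "SELECT 1"))
```

With RAID 0 on each machine, a single disk failure takes the whole box out, so
redundancy has to live at the database level in roughly this fashion.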

Cheap SATA disks rule for what we do. They're cheap, commodity, and fast.


Do you have a standard API to your website?

Tailrank has RSS feeds for every page.

The Spinn3r service is itself an API and we have extensive documentation on the
protocol.

It's also free to use for researchers, so if any of your readers are pursuing a
Ph.D. or generally doing research work and need access to blog data, we'd love
to help them out.

We already have Ph.D. students at the University of Washington and University
of Maryland (my alma mater) using Spinn3r.


Which DNS service do you use?

PowerDNS. It's a great product. We only use the recursor daemon but it's FAST.
It uses async IO though, so it doesn't really scale across processors on
multicore boxes. Apparently there's a hack to get it to run across cores but it
isn't very reliable.

AAA caching might be broken though. I still need to look into this.


Who do you admire?

Donald Knuth is the man!


How are you thinking of changing your architecture in the future?

We're still working on finishing up a fully sharded database. MySQL fault
tolerance and autopromotion are also issues.
