Flickr Architecture
Wed, 11/14/2007 - 10:04 — Todd Hoff
Update: Flickr hits 2 Billion photos served. That's a lot of hamburgers.
Flickr is both my favorite bird and the web's leading photo sharing site. Flickr faces an amazing challenge: it must handle a vast sea of ever-expanding new content, ever-increasing legions of users, and a constant stream of new features, all while providing excellent performance. How do they do it?
Site: http://www.flickr.com/
Information Sources
Platform
The Stats
The Architecture
-- Pair of ServerIrons
---- Squid Caches
------ NetApps
---- PHP App Servers
------ Storage Manager
------ Master-master shards
------ Dual Tree Central Database
------ Memcached Cluster
------ Big Search Engine
- The Dual Tree structure is a custom set of changes to MySQL that
allows scaling by incrementally adding masters without a ring
architecture. This allows cheaper scaling because you need less
hardware than with master-master setups, which always require
double the hardware.
- The central database includes data like the 'users' table, which includes primary user
keys (a few different IDs) and a pointer to the shard on which a user's data can be found.
- Shards: My data is stored on my shard, but the record of an action I perform on your content, such as commenting on someone else's blog, is stored on your shard.
- Global Ring: It's like DNS: you need to know where to go and who controls where you go. On every page view, calculate where your data is at that moment in time.
- PHP logic to connect to the shards and keep the data consistent (10 lines of code with comments!)
- Slice of the main database
- Active Master-Master Ring Replication: has a few drawbacks in MySQL 4.1, such as honoring commits in Master-Master. Auto-increment IDs are automated to keep it Active-Active.
- New accounts are assigned to a shard by random number.
- Migration is done from time to time, so certain power users can be moved off a shard. Shards need to be rebalanced if a user has a lot of photos: migrating 192,000 photos and 700,000 tags takes about 3-4 minutes. Migration is done manually.
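The central lookup, random shard assignment, and manual migration described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Flickr's code: the class `CentralDirectory` and its methods `assign_shard`, `shard_for`, and `migrate` are hypothetical names, and an in-memory dict stands in for the central 'users' table and its cache.

```python
import random


class CentralDirectory:
    """Hypothetical stand-in for the central 'users' table: user id -> shard id."""

    def __init__(self, shard_ids, rng=None):
        self.shard_ids = list(shard_ids)
        self.user_to_shard = {}
        self.rng = rng or random.Random()

    def assign_shard(self, user_id):
        # New accounts get a shard picked at random.
        shard = self.rng.choice(self.shard_ids)
        self.user_to_shard[user_id] = shard
        return shard

    def shard_for(self, user_id):
        # Looked up on every page view, so a migrated user is found at once.
        return self.user_to_shard[user_id]

    def migrate(self, user_id, new_shard):
        # Manual rebalancing: copying the user's rows is omitted here;
        # only the pointer in the directory is updated.
        if new_shard not in self.shard_ids:
            raise ValueError(f"unknown shard: {new_shard}")
        self.user_to_shard[user_id] = new_shard
```

Because every request goes through `shard_for`, moving a heavy user only requires copying their rows and then updating one pointer.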
- Pulls the photo owner's account from cache to get the shard location (say, shard-5)
- Pulls my information from cache to get my shard location (say, shard-13)
- Starts a “distributed transaction” to answer the questions: Who favorited the photo? What are my favorites?
- On every page load, the user is assigned to a bucket
- If a host is down, go to the next host in the list; if all hosts are down, display an error page. They don't use persistent connections; they build connections and tear them down. Every page load thus tests the connection.
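The failover rule above can be sketched as a short loop. This is an illustration only: `connect_with_failover` is a hypothetical name, and `connect` is an assumed caller-supplied function that opens a fresh connection or raises `ConnectionError`.

```python
def connect_with_failover(hosts, connect):
    """Try each host in order and return the first connection that opens.

    `connect` is a caller-supplied function (hypothetical here) that takes a
    host name and either returns a connection or raises ConnectionError.
    """
    failed = []
    for host in hosts:
        try:
            # A fresh connection per page load doubles as a health check.
            return connect(host)
        except ConnectionError:
            failed.append(host)
    # Every host failed: the caller should render the error page.
    raise ConnectionError(f"all hosts down: {failed}")
```

Skipping persistent connections costs a connection setup per request, but it means a dead host is noticed on the very next page load.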
- A lot of data is stored twice. For example, a comment is part of the relation between the commenter and the commentee. Where is the comment stored? How about both places? Transactions are used to prevent out-of-sync data: open transaction 1, write commands, open transaction 2, write commands, commit the 1st transaction if all is well, commit the 2nd transaction if the 1st committed. But there is still a chance of failure when a box goes down during the 1st commit.
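The two-transaction write described above might look like the sketch below. It is an assumption-laden illustration: in-memory SQLite databases stand in for two MySQL shards, and `make_shard`, `write_comment`, and the `comments` table layout are hypothetical.

```python
import sqlite3


def make_shard():
    # An in-memory SQLite database stands in for one MySQL shard.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (author TEXT, owner TEXT, body TEXT)")
    conn.commit()
    return conn


def write_comment(author_shard, owner_shard, author, owner, body):
    """Store the same comment on both shards, committing only if both writes succeed."""
    row = (author, owner, body)
    try:
        # Open transaction 1 and write; then open transaction 2 and write.
        author_shard.execute("INSERT INTO comments VALUES (?, ?, ?)", row)
        owner_shard.execute("INSERT INTO comments VALUES (?, ?, ?)", row)
    except sqlite3.Error:
        author_shard.rollback()
        owner_shard.rollback()
        raise
    # Commit the 1st, then the 2nd. A crash between these two lines
    # still leaves the shards out of sync, as the text notes.
    author_shard.commit()
    owner_shard.commit()
```

This is not a true two-phase commit; it only narrows the window in which the two copies can diverge.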
- Two search back-ends: the shards (35k qps on a few shards) and Yahoo!'s (proprietary) web search
- An owner's single-tag search or a batch tag change (say, via Organizr) goes to the shards due to real-time requirements; everything else goes to Yahoo!'s engine (probably about 90% behind the real-time goodness)
- Think of it as a Lucene-like search
- EM64T w/RHEL4, 16GB RAM
- 6-disk 15K RPM RAID-10.
- Data size is at 12 TB of user metadata (these are not photos, this is just InnoDB ibdata files; the photos are a lot larger).
- 2U boxes. Each shard has ~120GB of data.
- ibbackup on a cron job that runs across various shards at different times. Hot backup to a spare.
- Snapshots are taken every night across the entire cluster of databases.
- Writing or deleting several huge backup files at once to a replication filestore can wreck performance on that filestore for the next few hours as it replicates the backup files. Doing this to an in-production photo storage filer is a bad idea.
- However much it costs to keep multiple days of backups of all of your data, it's worth it. Keeping staggered backups is good for when you discover that something went wrong a few days earlier: something like 1, 2, 10 and 30 day backups.
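One way to read the staggered 1, 2, 10 and 30 day scheme is as a retention rule over nightly backups. The sketch below is an assumption about how such a rule could be expressed, not a description of Flickr's tooling; `KEEP_AGES` and `backups_to_keep` are hypothetical names.

```python
from datetime import date, timedelta

# Ages in days of the backups to keep; taken from the example in the text.
KEEP_AGES = (1, 2, 10, 30)


def backups_to_keep(today, available, keep_ages=KEEP_AGES):
    """Return the backup dates to retain out of `available` nightly backups.

    For each target age, keep the newest backup that is at least that old.
    """
    keep = set()
    for age in keep_ages:
        cutoff = today - timedelta(days=age)
        candidates = [d for d in available if d <= cutoff]
        if candidates:
            keep.add(max(candidates))
    return sorted(keep)
```

Anything not returned by `backups_to_keep` is a candidate for deletion, ideally not all at once, given the note above about large deletes on a replicating filestore.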
- Tags do not fit well with traditional normalized RDBMS schema design. Denormalization or heavy caching is the only way to generate a tag cloud in milliseconds for hundreds of millions of tags.
- Some of their data views are calculated offline by dedicated processing clusters, which save the results into MySQL, because some relationships are so complicated that calculating them would absorb all the database CPU cycles.
- Make it faster with real-time BCP, so all data centers can receive writes to the data layer (db, memcache, etc.) at the same time. Everything is active; nothing will ever be idle.
Lessons Learned
- What is the maximum something that every server can do?
- How close are you to that maximum, and how is it trending?
- MySQL (disk IO?)
- SQUID (disk IO? or CPU?)
- memcached (CPU? or network?)
- Do you have event-related growth? For example: disaster, news event.
- Flickr gets 20-40% more uploads on the first work day of the year than at any peak of the previous year.
- 40-50% more uploads on Sundays than on the rest of the week, on average
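The capacity questions above (what is the maximum, how close are you, how is it trending) can be turned into a rough estimate. The sketch below assumes one peak measurement per day and a straight-line trend, which is a simplification; `days_until_ceiling` is a hypothetical helper, not something from the original talk.

```python
def days_until_ceiling(samples, ceiling):
    """Estimate days until a metric reaches `ceiling`.

    `samples` holds one peak measurement per day, oldest first. A straight
    line is fitted by least squares; returns None if the trend is flat or
    falling, and 0 if the ceiling has already been reached.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if samples[-1] >= ceiling:
        return 0
    if slope <= 0:
        return None
    return (ceiling - samples[-1]) / slope
```

A linear fit ignores event-driven spikes such as the first work day of the year, so the ceiling passed in should already leave headroom for peaks of that size.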