`

Digg Architecture

阅读更多

Digg Architecture

Update 2: : How Digg Works and How Digg Really Works (wear ear plugs). Brought to you straight from Digg's blog. A very succinct explanation of the major elements of the Digg architecture while tracing a request through the system. I've updated this profile with the new information.
Update: Digg now receives 230 million plus page views per month and 26 million unique visitors - traffic that necessitated major internal upgrades .

Traffic generated by Digg's over 22 million famously info-hungry users and 230 million page views can crash an unsuspecting website head-on into its CPU, memory, and bandwidth limits. How does Digg handle billions of requests a month?

Site: http://digg.com

Information Sources

  • How Digg Works by Digg
  • How Digg.com uses the LAMP stack to scale upward
  • Digg PHP 's Scalability and Performance

    Platform

  • MySQL
  • Linux
  • PHP
  • Lucene
  • Python
  • APC PHP Accelerator
  • MCache
  • Gearman - job scheduling system
  • MogileFS - open source distributed filesystem
  • Apache
  • Memcached

    The Stats

  • Started in late 2004 with a single Linux server running Apache 1.3, PHP 4, and MySQL. 4.0 using the default MyISAM storage engine
  • Over 22 million users.
  • 230 million plus page views per month
  • 26 million unique visitors per month
  • Several billion page views per month
  • None of the scaling challenges faced had anything to do with PHP. The biggest issues faced were database related.
  • Dozens of web servers.
  • Dozens of DB servers.
  • Six specialized graph database servers to run the Recommendation Engine.
  • Six to ten machines that serve files from MogileFS.

    What's Inside

  • Specialized load balancer appliances monitor the application servers, handle failover, constantly adjust the cluster according to health, balance incoming requests and caching JavaScript, CSS and images. If you don't have the fancy load balancers take a look at Linux Virtual Server and Squid as a replacement.
  • Requests are passed to the Application Server cluster. Application servers consist of: Apache+PHP, Memcached, Gearman and other daemons. They are responsible for making coordinating access to different services (DB, MogileFS, etc) and creating the response sent to the browser.
  • Uses a MySQL master-slave setup.
    - Four master databases are partitioned by functionality: promotion, profiles, comments, main. Many slave databases hang off each master.
    - Writes go to the masters and reads go to the slaves.
    - Transaction-heavy servers use the InnoDB storage engine.
    - OLAP-heavy servers use the MyISAM storage engine.
    - They did not notice a performance degradation moving from MySQL 4.1 to version 5.
    - The schema is denormalized more than "your average database design."
    - Sharding is used to break the database into several smaller ones.
  • Digg's usage pattern makes it easier for them to scale. Most people just view the front page and leave. Thus 98% of Digg's database accesses are reads. With this balance of operations they don't have to worry about the complex work of architecting for writes, which makes it a lot easier for them to scale.
  • They had problems with their storage system telling them writes were on disk when they really weren't. Controllers do this to improve the appearance of their performance. But what it does is leave a giant data integrity whole in failure scenarios. This is really a pretty common problem and can be hard to fix, depending on your hardware setup.
  • To lighten their database load they used the APC PHP accelerator MCache.
  • Memcached is used for caching and memcached servers seemed to be spread across their database and application servers. A specialized daemon monitors connections and kills connections that have been open too long.
  • You can configure PHP not parse and compile on each load using a combination of Apache 2’s worker threads, FastCGI, and a PHP accelerator. On a page's first load the PHP code is compiles so any subsequent page loads are very fast.
  • MogileFS, a distributed file system, serves story icons, user icons, and stores copies of each story’s source. A distributed file system spreads and replicates files across a lot of disks which supports fast and scalable file access.
  • A specialized Recommendation Engine service was built to act as their distributed graph database. Relational databases are not well structured for generating recommendations so a separate service was created. LinkedIn did something similar for their graph.

    Lessons Learned

  • The number of machines isn't as important what the pieces are and how they fit together.
  • Don't treat the database as a hammer. Recommendations didn't fit will with the relational model so they made a specialized service.
  • Tune MySQL through your database engine selection. Use InnoDB when you need transactions and MyISAM when you don't. For example, transactional tables on the master can use MyISAM for read-only slaves.
  • At some point in their growth curve they were unable to grow by adding RAM so had to grow through architecture.
  • People often complain Digg is slow. This is perhaps due to their large javascript libraries rather than their backend architecture.
  • One way they scale is by being careful of which application they deploy on their system. They are careful not to release applications which use too much CPU. Clearly Digg has a pretty standard LAMP architecture, but I thought this was an interesting point. Engineers often have a bunch of cool features they want to release, but those features can kill an infrastructure if that infrastructure doesn't grow along with the features. So push back until your system can handle the new features. This goes to capacity planning, something the Flickr emphasizes in their scaling process.
  • You have to wonder if by limiting new features to match their infrastructure might Digg lose ground to other faster moving social bookmarking services? Perhaps if the infrastructure was more easily scaled they could add features faster which would help them compete better? On the other hand, just adding features because you can doesn't make a lot of sense either.
  • The data layer is where most scaling and performance problems are to be found and these are language specific. You'll hit them using Java , PHP, Ruby, or insert your favorite language here.

    Related Articles

    * LinkedIn Architecture
    * Live Journal Architecture
    * Flickr Architecture
    * An Unorthodox Approach to Database Design : The Coming of the Shard

  • Ebay Architecture
  • 分享到:
    评论

    相关推荐

      高性能高并发服务器架构大全

       Digg PHP's Scalability and Performance 251  YouTube Architecture 253 Information Sources 254 Platform 254 What's Inside? 254 The Stats 254 Recipe for handling rapid growth 255 Web Servers ...

      BobBuilder_app

      Digg Facebook Del.icio.us Reddit Stumbleupon Newsvine Technorati Mr. Wong Yahoo! Google Windows Live Send as Email Add to your CodeProject bookmarks Discuss this article 85 ...

      一个基于Qt Creator(qt,C++)实现中国象棋人机对战

      qt 一个基于Qt Creator(qt,C++)实现中国象棋人机对战.

      热带雨林自驾游自然奇观探索.doc

      热带雨林自驾游自然奇观探索

      冰川湖自驾游冰雪交融景象.doc

      冰川湖自驾游冰雪交融景象

      C51 单片机数码管使用 Keil项目C语言源码

      C51 单片机数码管使用 Keil项目C语言源码

      基于智能算法的无人机路径规划研究 附Matlab代码.rar

      1.版本:matlab2014/2019a/2024a 2.附赠案例数据可直接运行matlab程序。 3.代码特点:参数化编程、参数可方便更改、代码编程思路清晰、注释明细。 4.适用对象:计算机,电子信息工程、数学等专业的大学生课程设计、期末大作业和毕业设计。

      前端分析-2023071100789s12

      前端分析-2023071100789s12

      Delphi 12.3控件之Laz-制作了一些窗体和对话框样式.7z

      Laz_制作了一些窗体和对话框样式.7z

      ocaml-docs-4.05.0-6.el7.x64-86.rpm.tar.gz

      1、文件内容:ocaml-docs-4.05.0-6.el7.rpm以及相关依赖 2、文件形式:tar.gz压缩包 3、安装指令: #Step1、解压 tar -zxvf /mnt/data/output/ocaml-docs-4.05.0-6.el7.tar.gz #Step2、进入解压后的目录,执行安装 sudo rpm -ivh *.rpm 4、更多资源/技术支持:公众号禅静编程坊

      学习笔记-沁恒第六讲-米醋

      学习笔记-沁恒第六讲-米醋

      工业机器人技术讲解【36页】.pptx

      工业机器人技术讲解【36页】

      基于CentOS 7和Docker环境下安装和配置Elasticsearch数据库

      内容概要:本文档详细介绍了在 CentOS 7 上利用 Docker 容器化环境来部署和配置 Elasticsearch 数据库的过程。首先概述了 Elasticsearch 的特点及其主要应用场景如全文检索、日志和数据分析等,并强调了其分布式架构带来的高性能与可扩展性。之后针对具体的安装流程进行了讲解,涉及创建所需的工作目录,准备docker-compose.yml文件以及通过docker-compose工具自动化完成镜像下载和服务启动的一系列命令;同时对可能出现的问题提供了应对策略并附带解决了分词功能出现的问题。 适合人群:从事IT运维工作的技术人员或对NoSQL数据库感兴趣的开发者。 使用场景及目标:该教程旨在帮助读者掌握如何在一个Linux系统中使用现代化的应用交付方式搭建企业级搜索引擎解决方案,特别适用于希望深入了解Elastic Stack生态体系的个人研究与团队项目实践中。 阅读建议:建议按照文中给出的具体步骤进行实验验证,尤其是要注意调整相关参数配置适配自身环境。对于初次接触此话题的朋友来说,应该提前熟悉一下Linux操作系统的基础命令行知识和Docker的相关基础知识

      基于CNN和FNN的进化神经元模型的快速响应尖峰神经网络 附Matlab代码.rar

      1.版本:matlab2014/2019a/2024a 2.附赠案例数据可直接运行matlab程序。 3.代码特点:参数化编程、参数可方便更改、代码编程思路清晰、注释明细。 4.适用对象:计算机,电子信息工程、数学等专业的大学生课程设计、期末大作业和毕业设计。

      网络小说的类型创新、情节设计与角色塑造.doc

      网络小说的类型创新、情节设计与角色塑造

      毕业设计-基于springboot+vue开发的学生考勤管理系统【源码+sql+可运行】50311.zip

      毕业设计_基于springboot+vue开发的学生考勤管理系统【源码+sql+可运行】【50311】.zip 全部代码均可运行,亲测可用,尽我所能,为你服务; 1.代码压缩包内容 代码:springboo后端代码+vue前端页面代码 脚本:数据库SQL脚本 效果图:运行结果请看资源详情效果图 2.环境准备: - JDK1.8+ - maven3.6+ - nodejs14+ - mysql5.6+ - redis 3.技术栈 - 后台:springboot+mybatisPlus+Shiro - 前台:vue+iview+Vuex+Axios - 开发工具: idea、navicate 4.功能列表 - 系统设置:用户管理、角色管理、资源管理、系统日志 - 业务管理:班级信息、学生信息、课程信息、考勤记录、假期信息、公告信息 3.运行步骤: 步骤一:修改数据库连接信息(ip、port修改) 步骤二:找到启动类xxxApplication启动 4.若不会,可私信博主!!!

      57页-智慧办公园区智能化设计方案.pdf

      在智慧城市建设的大潮中,智慧园区作为其中的璀璨明珠,正以其独特的魅力引领着产业园区的新一轮变革。想象一下,一个集绿色、高端、智能、创新于一体的未来园区,它不仅融合了科技研发、商业居住、办公文创等多种功能,更通过深度应用信息技术,实现了从传统到智慧的华丽转身。 智慧园区通过“四化”建设——即园区运营精细化、园区体验智能化、园区服务专业化和园区设施信息化,彻底颠覆了传统园区的管理模式。在这里,基础设施的数据收集与分析让管理变得更加主动和高效,从温湿度监控到烟雾报警,从消防水箱液位监测到消防栓防盗水装置,每一处细节都彰显着智能的力量。而远程抄表、空调和变配电的智能化管控,更是在节能降耗的同时,极大地提升了园区的运维效率。更令人兴奋的是,通过智慧监控、人流统计和自动访客系统等高科技手段,园区的安全防范能力得到了质的飞跃,让每一位入驻企业和个人都能享受到“拎包入住”般的便捷与安心。 更令人瞩目的是,智慧园区还构建了集信息服务、企业服务、物业服务于一体的综合服务体系。无论是通过园区门户进行信息查询、投诉反馈,还是享受便捷的电商服务、法律咨询和融资支持,亦或是利用云ERP和云OA系统提升企业的管理水平和运营效率,智慧园区都以其全面、专业、高效的服务,为企业的发展插上了腾飞的翅膀。而这一切的背后,是大数据、云计算、人工智能等前沿技术的深度融合与应用,它们如同智慧的大脑,让园区的管理和服务变得更加聪明、更加贴心。走进智慧园区,就像踏入了一个充满无限可能的未来世界,这里不仅有科技的魅力,更有生活的温度,让人不禁对未来充满了无限的憧憬与期待。

      一种欠定盲源分离方法及其在模态识别中的应用 附Matlab代码.rar

      1.版本:matlab2014/2019a/2024a 2.附赠案例数据可直接运行matlab程序。 3.代码特点:参数化编程、参数可方便更改、代码编程思路清晰、注释明细。 4.适用对象:计算机,电子信息工程、数学等专业的大学生课程设计、期末大作业和毕业设计。

      Matlab实现基于BO贝叶斯优化Transformer结合GRU门控循环单元时间序列预测的详细项目实例(含完整的程序,GUI设计和代码详解)

      内容概要:本文介绍了使用 Matlab 实现基于 BO(贝叶斯优化)的 Transformer 结合 GRU 门控循环单元时间序列预测的具体项目案例。文章首先介绍了时间序列预测的重要性及其现有方法存在的限制,随后深入阐述了该项目的目标、挑战与特色。重点描述了项目中采用的技术手段——结合 Transformer 和 GRU 模型的优点,通过贝叶斯优化进行超参数调整。文中给出了模型的具体实现步骤、代码示例以及完整的项目流程。同时强调了数据预处理、特征提取、窗口化分割、超参数搜索等关键技术点,并讨论了系统的设计部署细节、可视化界面制作等内容。 适合人群:具有一定机器学习基础,尤其是熟悉时间序列预测与深度学习的科研工作者或从业者。 使用场景及目标:适用于金融、医疗、能源等多个行业的高精度时间序列预测。该模型可通过捕捉长时间跨度下的复杂模式,提供更为精准的趋势预判,辅助相关机构作出合理的前瞻规划。 其他说明:此项目还涵盖了从数据采集到模型发布的全流程讲解,以及GUI图形用户界面的设计实现,有助于用户友好性提升和技术应用落地。此外,文档包含了详尽的操作指南和丰富的附录资料,包括完整的程序清单、性能评价指标等,便于读者动手实践。

      漫画与青少年教育关系.doc

      漫画与青少年教育关系

    Global site tag (gtag.js) - Google Analytics