Scaling Twitter: Making Twitter 10000 Percent Faster

Update 2: a commenter in Twitter Fails Macworld Keynote Test said this entry needs to be updated. LOL. My uneducated guess is it's not a language or architecture problem, but more a problem of not being able to add hardware fast enough into their data center. The predictability of this problem is debatable, but once you have it, it's hard to fix.
Update: Twitter releases Starling, a lightweight persistent queue server that speaks the MemCache protocol. It was built to drive Twitter's backend, and is in production across Twitter's cluster.

Twitter started as a side project and blew up fast, going from 0 to millions of page views within a few terrifying months. Early design decisions that worked well in the small melted under the crush of new users chirping tweets to all their friends. Web darling Ruby on Rails was fingered early for the scaling problems, but Blaine Cook, Twitter's lead architect, held Ruby blameless:


For us, it’s really about scaling horizontally - to that end, Rails and Ruby haven’t been stumbling blocks, compared to any other language or framework. The performance boosts associated with a “faster” language would give us a 10-20% improvement, but thanks to architectural changes that Ruby and Rails happily accommodated, Twitter is 10000% faster than it was in January.

If Ruby on Rails wasn't to blame, how did Twitter learn to scale ever higher and higher?

Update: added slides Small Talk on Getting Big. Scaling a Rails App & all that Jazz

Site: http://twitter.com

Information Sources

  • Scaling Twitter Video by Blaine Cook.
  • Scaling Twitter Slides
  • Good News blog post by Rick Denatale
  • Scaling Twitter blog post by Patrick Joyce.
  • Twitter API Traffic is 10x Twitter’s Site.
  • A Small Talk on Getting Big. Scaling a Rails App & all that Jazz - really cute dog pics

    The Platform

  • Ruby on Rails
  • Erlang
  • MySQL
  • Mongrel - hybrid Ruby/C HTTP server designed to be small, fast, and secure
  • Munin
  • Nagios
  • Google Analytics
  • AWStats - real-time logfile analyzer to get advanced statistics
  • Memcached

    The Stats

  • Over 350,000 users. The actual numbers are, as always, very super super top secret.
  • 600 requests per second.
  • Average 200-300 connections per second. Spiking to 800 connections per second.
  • MySQL handled 2,400 requests per second.
  • 180 Rails instances. Uses Mongrel as the "web" server.
  • 1 MySQL Server (one big 8 core box) and 1 slave. Slave is read only for statistics and reporting.
  • 30+ processes for handling odd jobs.
  • 8 Sun X4100s.
  • Process a request in 200 milliseconds in Rails.
  • Average time spent in the database is 50-100 milliseconds.
  • Over 16 GB of memcached.
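
As a rough sanity check on these figures, Little's law (requests in flight = arrival rate × time per request) shows how the request rate, the Rails processing time, and the instance count relate. This is a back-of-envelope sketch, not a calculation from the source: it assumes each Rails instance handles one request at a time, and the variable names are illustrative.

```python
# Back-of-envelope check with Little's law: L = arrival_rate * time_in_system.
requests_per_second = 600   # from the stats above
ms_per_request = 200        # Rails processing time, from the stats above
rails_instances = 180       # from the stats above

# Average number of requests being processed at any instant.
in_flight = requests_per_second * ms_per_request / 1000

# Assumption (not stated in the source): one request per instance at a time.
utilization = in_flight / rails_instances

# Throughput ceiling if every instance were busy all the time.
max_throughput = rails_instances * 1000 / ms_per_request

print(f"requests in flight:   {in_flight:.0f}")
print(f"instance utilization: {utilization:.0%}")
print(f"throughput ceiling:   {max_throughput:.0f} req/s")
```

Under that assumption, the numbers work out to about 120 requests in flight, roughly 67% utilization across the 180 instances, and a ceiling of about 900 requests per second.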

    The Architecture

  • Ran into very public scaling problems. The little bird of failure popped up a lot for a while.
  • Originally they had no monitoring, no graphs, and no statistics, which made it hard to pinpoint and solve problems. Added Munin and Nagios. There were difficulties using tools on Solaris. They had Google Analytics, but the pages weren't loading so it wasn't that helpful :-)
  • Use caching with memcached a lot.
    - For example, if getting a count is slow, you can memoize the count into memcache in a millisecond.
    - Getting your friends' statuses is complicated. There are security and other issues. So rather than doing a query, a friend's status is updated in the cache instead. It never touches the database. This gives a predictable response time (upper bound 20 msecs).
    - ActiveRecord objects are huge, which is why they aren't cached. They want to store critical attributes in a hash and lazy load the other attributes on access.
    - 90% of requests are API requests, so they don't do any page/fragment caching on the front-end. The pages are so time sensitive that it doesn't do any good. But they do cache API requests.
  • Messaging
    - Use messaging a lot. Producers produce messages, which are queued and then distributed to consumers. Twitter's main functionality is to act as a messaging bridge between different formats (SMS, web, IM, etc).
    - Send a message to invalidate a friend's cache in the background instead of invalidating each one individually and synchronously.
    - Started with DRb, which stands for distributed Ruby. It is a library that allows you to send and receive messages from remote Ruby objects via TCP/IP. But it was a little flaky and a single point of failure.
    - Moved to Rinda, which is a shared queue that uses a tuplespace model, along the lines of Linda. But the queues are not persistent, so messages are lost on failure.
    - Tried Erlang. Problem: How do you get a broken server running on a Sunday or Monday with 20,000 users waiting? The developer didn't know. There was not a lot of documentation. So it violates the "use what you know" rule.
    - Moved to Starling, a distributed queue written in Ruby.
    - Distributed queues were made to survive system crashes by writing them to disk. Other big websites take this simple approach as well.
  • SMS is handled using an API supplied by third-party gateways. It's very expensive.
  • Deployment
    - They do a review and push out new mongrel servers. No graceful way yet.
    - An internal server error is given to the user if their mongrel server is replaced.
    - All servers are killed at once. A rolling blackout isn't used because the message queue state is in the mongrels and a rolling approach would cause all the queues in the remaining mongrels to fill up.
  • Abuse
    - A lot of downtime came from people crawling the site and adding everyone as friends, for example 9000 friends in 24 hours. It would take down the site.
    - Build tools to detect these problems so you can pinpoint when and where they are happening.
    - Be ruthless. Delete them as users.
  • Partitioning
    - Plan to partition in the future. Currently they don't. These changes have been enough so far.
    - The partition scheme will be based on time, not users, because most requests are very temporally local.
    - Partitioning will be difficult because of automatic memoization. They can't guarantee read-only operations will really be read-only. An operation may write to a read-only slave, which is really bad.
  • Twitter's API Traffic is 10x Twitter’s Site
    - Their API is the most important thing Twitter has done.
    - Keeping the service simple allowed developers to build on top of their infrastructure and come up with ideas that are way better than Twitter could come up with. For example, Twitterrific, which is a beautiful way to use Twitter that a small team with different priorities could create.
  • Monit is used to kill processes if they get too big.
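
Two of the ideas above, memoizing a slow count in the cache and invalidating cache entries in the background through a message queue, can be sketched together. This is a minimal illustration, not Twitter's code: a dict stands in for memcached, Python's in-process `queue.Queue` stands in for a queue server such as Starling, and `friend_count`, `add_friend`, and `invalidation_worker` are hypothetical names.

```python
import queue
import threading

cache = {}                              # stand-in for memcached
invalidations = queue.Queue()           # stand-in for a message queue
friends = {"alice": {"bob", "carol"}}   # stand-in for the database
db_queries = 0                          # counts "slow" database lookups


def friend_count(user):
    """Return the friend count, computing it only on a cache miss."""
    global db_queries
    key = f"friend_count:{user}"
    if key not in cache:
        db_queries += 1                 # the expensive query happens here
        cache[key] = len(friends[user])
    return cache[key]


def add_friend(user, friend):
    """Write to the database, then queue a cache invalidation message."""
    friends[user].add(friend)
    invalidations.put(f"friend_count:{user}")


def invalidation_worker():
    """Consume invalidation messages off the request path."""
    while True:
        key = invalidations.get()
        if key is not None:
            cache.pop(key, None)
        invalidations.task_done()
        if key is None:                 # sentinel: shut the worker down
            break


worker = threading.Thread(target=invalidation_worker, daemon=True)
worker.start()

first = friend_count("alice")    # miss: goes to the "database"
second = friend_count("alice")   # hit: served from the cache
add_friend("alice", "dave")
invalidations.join()             # wait for the worker (demo only)
third = friend_count("alice")    # miss again after invalidation

invalidations.put(None)
worker.join()
```

The request that adds a friend only enqueues a message and returns. The cost is a short window in which readers may still see the old count, which is the usual trade-off of asynchronous invalidation.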

    Lessons Learned

  • Talk to the community. Don't hide and try to solve all problems yourself. Many brilliant people are willing to help if you ask.
  • Treat your scaling plan like a business plan. Assemble a board of advisers to help you.
  • Build it yourself. Twitter spent a lot of time trying other people's solutions that just almost seemed to work, but not quite. It's better to build some things yourself so you at least have some control and you can build in the features you need.
  • Build in user limits. People will try to bust your system. Put in reasonable limits and detection mechanisms to protect your system from being killed.
  • Don't make the database the central bottleneck of doom. Not everything needs to require a gigantic join. Cache data. Think of other creative ways to get the same result. A good example is talked about in Twitter, Rails, Hammers, and 11,000 Nails per Second.
  • Make your application easily partitionable from the start. Then you always have a way to scale your system.
  • Realize your site is slow. Immediately add reporting to track problems.
  • Optimize the database.
    - Index everything. Rails won't do this for you.
    - Use EXPLAIN to see how your queries are running. Indexes may not be used as you expect.
    - Denormalize a lot. This single-handedly saved them. For example, they store all of a user's friend IDs together, which prevented a lot of costly joins.
    - Avoid complex joins.
    - Avoid scanning large sets of data.
  • Cache the hell out of everything. Individual active records are not cached, yet. The queries are fast enough for now.
  • Test everything.
    - You want to know when you deploy an application that it will render correctly.
    - They have a full test suite now. So when the caching broke they were able to find the problem before going live.
  • Long running processes should be abstracted to daemons.
  • Use exception notifier and exception logger to get immediate notification of problems so you can address them right away.
  • Don't do stupid things.
    - Scale changes what can be stupid.
    - Trying to load 3000 friends at once into memory can bring a server down, but when there were only 4 friends it worked great.
  • Most performance comes not from the language, but from application design.
  • Turn your website into an open service by creating an API. Their API is a huge reason for Twitter's success. It allows users to create an ever-expanding ecosystem around Twitter that is difficult to compete with. You can never do all the work your users can do and you probably won't be as creative. So open your application up and make it easy for others to integrate your application with theirs.
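
The "build in user limits" lesson can be illustrated with a sliding-window limiter on friend additions. This is a sketch under stated assumptions, not a description of what Twitter built: the `FriendAddLimiter` class, its parameters, and the tiny limit used in the demo are all hypothetical.

```python
from collections import defaultdict, deque


class FriendAddLimiter:
    """Sliding-window limit on friend additions per user (hypothetical)."""

    def __init__(self, max_adds, window_seconds):
        self.max_adds = max_adds
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)   # user -> timestamps of recent adds

    def allow(self, user, now):
        """Permit and record an add at time `now` if the user is under the limit."""
        events = self._events[user]
        # Drop timestamps that have fallen out of the window.
        while events and events[0] <= now - self.window_seconds:
            events.popleft()
        if len(events) >= self.max_adds:
            return False                    # over the limit: reject and flag
        events.append(now)
        return True


# Demo with a deliberately tiny limit: 3 adds per 24 hours.
DAY = 24 * 60 * 60
limiter = FriendAddLimiter(max_adds=3, window_seconds=DAY)
results = [limiter.allow("crawler", now=t) for t in (0, 10, 20, 30)]
later = limiter.allow("crawler", now=DAY + 5)
```

In the demo the fourth add inside the window is rejected, and an add is accepted again once the oldest one ages out. A rejected call is also a natural place to record the event for the abuse detection tools mentioned earlier.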

    Related Articles

  • For a discussion of partitioning take a look at Amazon Architecture, An Unorthodox Approach to Database Design: The Coming of the Shard, and Flickr Architecture.
  • The Mailinator Architecture has good strategies for abuse protection.
  • GoogleTalk Architecture addresses some interesting issues when scaling social networking sites.