A Twitter optimization that improved performance by 50%

 

The Anatomy of a Whale

Sometimes it's really hard to figure out what's causing problems in a web site like Twitter. But over time we have learned some techniques that help us to solve the variety of problems that occur in our complex web site.

A few weeks ago, we noticed something unusual: over 100 visitors to Twitter per second saw what is popularly known as "the fail whale". Normally these whales are rare; 100 per second was cause for alarm. Although even 100 per second is a very small fraction of our overall traffic, it still means that a lot of users had a bad experience when visiting the site. So we mobilized a team to find out the cause of the problem.

What Causes Whales?

What is the thing that has come to be known as "the fail whale"? It is a visual representation of the HTTP "503: Service Unavailable" error. It means that Twitter does not have enough capacity to serve all of its users. To be precise, we show this error message when a request would wait for more than a few seconds before resources become available to process it. So rather than make users wait forever, we "throw away" their requests by displaying an error.
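The load-shedding rule described above can be sketched as follows. This is a minimal illustration, not Twitter's implementation: the `Response` type, the `admit_or_shed` function, and the 3-second threshold are all assumptions made for the example (the article only says "a few seconds").

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Response:
    status: int
    body: str


# Assumption: the article says "a few seconds"; 3.0 is an illustrative value.
MAX_WAIT_SECONDS = 3.0


def admit_or_shed(estimated_wait_seconds: float, handler: Callable[[], str]) -> Response:
    """Serve the request, or shed it with a 503 if it would wait too long."""
    if estimated_wait_seconds > MAX_WAIT_SECONDS:
        # Rather than make the user wait, answer immediately with an error page.
        return Response(503, "Service Unavailable")
    return Response(200, handler())


shed = admit_or_shed(5.0, lambda: "timeline")
served = admit_or_shed(0.2, lambda: "timeline")
print(shed.status, served.status)
```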

This can sometimes happen because too many users try to use Twitter at once and we don't have enough computers to handle all of their requests. But much more likely is that some component part of Twitter suddenly breaks and starts slowing down.

Discovering the root cause can be very difficult because Whales are an indirect symptom of a root cause that can be one of many components. In other words, the only concrete fact that we knew at the time was that there was some problem, somewhere. We set out to uncover exactly where in the Twitter requests' lifecycle things were breaking down.

Debugging performance issues is really hard. But it's not hard due to a lack of data; in fact, the difficulty arises because there is too much data. We measure dozens of metrics per individual site request, which, when multiplied by the overall site traffic, is a massive amount of information about Twitter's performance at any given moment. Investigating performance problems in this world is more of an art than a science. It's easy to confuse causes with symptoms and even the data recording software itself is untrustworthy.

In the analysis below, we used a simple strategy: start from the most aggregate measures of the system as a whole and, at each step, get more fine-grained, looking at smaller and smaller parts.

How is a Web Page Built?

Composing a web page for a Twitter request often involves two phases. First, data is gathered from remote sources called "network services". For example, on the Twitter homepage your tweets are displayed as well as how many followers you have. These data are pulled respectively from our tweet caches and our social graph database, which keeps track of who follows whom on Twitter. The second phase of the page composition process assembles all this data in an attractive way for the user. We call the first phase the IO phase and the second the CPU phase. In order to discover which phase was causing problems, we checked data that records how much time was spent in each phase when composing Twitter's web pages.
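One way to record time per phase is to wrap each phase in a timer. The sketch below is only an illustration: `timed_phases`, `fetch_data`, and `render` are hypothetical names, and the article does not say how Twitter actually collected these timings.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed_phases(
    fetch_data: Callable[[], Any],
    render: Callable[[Any], str],
) -> Tuple[str, Dict[str, float]]:
    """Run the IO phase, then the CPU phase, and report seconds spent in each."""
    start = time.perf_counter()
    data = fetch_data()            # IO phase: gather data from network services
    after_io = time.perf_counter()
    page = render(data)            # CPU phase: assemble the page from the data
    after_cpu = time.perf_counter()
    return page, {"io_s": after_io - start, "cpu_s": after_cpu - after_io}


# Stand-ins for real network calls and templates (hypothetical).
page, timings = timed_phases(
    fetch_data=lambda: {"tweets": ["hello", "world"], "followers": 42},
    render=lambda d: f"{len(d['tweets'])} tweets, {d['followers']} followers",
)
print(page, timings)
```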

The green line in this graph represents the time spent in the IO phase and the blue line represents the CPU phase. This graph represents about 1 day of data. You can see that the relationships change over the course of the day. During non-peak traffic, CPU time is the dominant portion of our request, with our network services responding relatively quickly. However, during peak traffic, IO latency almost doubles and becomes the primary component of total request latency.

Understanding Performance Degradation

There are two possible interpretations for this ratio changing over the course of the day. One possibility is that the way people use Twitter during one part of the day differs from other parts of the day. The other possibility is that some network service degrades in performance as a function of use. In an ideal world, each network service would have equal performance for equal queries; but in the worst case, the same queries actually get slower as you run more simultaneously. Checking various metrics confirmed that users use Twitter the same way during different parts of the day. So we hypothesized that the problem must be in a network service degrading poorly. We were still unsure; in any good investigation one must constantly remain skeptical. But we decided that we had enough information to transition from this more general analysis of the system into something more specific, so we looked into IO latency data.

This graph represents the total amount of time waiting for our network services to deliver data. Since the amount of traffic we get changes over the course of the day, we expect any total to vary proportionally. But this graph is actually traffic independent; that is, we divide the measured latency by the amount of traffic at any given time. If any traffic-independent total latency changes over the course of the day, we know the corresponding network service is degrading with traffic. You can see that the purple line in this graph (which represents Memcached) degrades dramatically as traffic increases during peak hours. Furthermore, because it is at the top of the graph it is also the biggest proportion of time waiting for network services. So this correlates with the previous graph and we now have a stronger hypothesis: Memcached performance degrades dramatically during the course of the day, which leads to slower response times, which leads to whales.
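The traffic normalization can be sketched in a few lines. The function names, the 10% tolerance, and the sample numbers are invented for illustration; they are not Twitter's data or tooling.

```python
from typing import List


def latency_per_request(total_latency_s: List[float], request_counts: List[int]) -> List[float]:
    """Divide each interval's total latency by its traffic."""
    return [t / n if n else 0.0 for t, n in zip(total_latency_s, request_counts)]


def degrades_with_traffic(
    total_latency_s: List[float],
    request_counts: List[int],
    tolerance: float = 0.10,  # assumption: ignore changes under 10%
) -> bool:
    """True if per-request latency in the busiest interval exceeds the quietest one."""
    per_request = latency_per_request(total_latency_s, request_counts)
    by_traffic = sorted(zip(request_counts, per_request))
    quietest, busiest = by_traffic[0][1], by_traffic[-1][1]
    return busiest > quietest * (1 + tolerance)


# Made-up samples: three intervals with rising traffic.
counts = [1000, 2000, 4000]
healthy_totals = [10.0, 20.0, 40.0]     # total latency grows only with traffic
degrading_totals = [10.0, 30.0, 100.0]  # total latency grows faster than traffic

print(degrades_with_traffic(healthy_totals, counts))
print(degrades_with_traffic(degrading_totals, counts))
```

A service whose normalized latency stays flat is merely busier; one whose normalized latency rises with traffic is the kind of degradation the article attributes to Memcached.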

This sort of behavior is consistent with insufficient resource capacity. When a service with limited resources, such as Memcached, is taxed to its limits, requests begin contending with each other for Memcached's computing time. For example, if Memcached can only handle 10 requests at a time but it gets 11 requests at a time, the 11th request needs to wait in line to be served.

Focus on the Biggest Contributor to the Problem

If we can add sufficient Memcached capacity to reduce this sort of resource contention, we could increase the throughput of Twitter.com substantially. If you look at the above graph, you can infer that this optimization could increase Twitter performance by 50%.

There are two ways to add capacity. We could do this by adding more computers (memcached servers). But we can also change the software that talks to Memcached to be as efficient with its requests as possible. Ideally we do both.

We decided to first look at how we query Memcached, to see if there was an easy way to optimize it by reducing the overall number of queries. But there are many types of queries to Memcached, and some may take longer than others. We want to spend our time wisely and focus on optimizing the queries that are most expensive in aggregate.

We sampled a live process to record some statistics on which queries take the longest. The following is each type of Memcached query and how long they take on average:

get         0.003s
get_multi   0.008s
add         0.003s
delete      0.003s
set         0.003s
incr        0.003s
prepend     0.002s

You can see that get_multi is more expensive than the rest, and everything else is roughly the same. But that doesn't mean it's the source of the problem. We also need to know how many requests per second there are for each type of query.

get         71.44%
get_multi    8.98%
set          8.69%
delete       5.26%
incr         3.71%
add          1.62%
prepend      0.30%

If you multiply average latency by the percentage of requests you get a measure of the total contribution to slowness. Here, we found that gets were the biggest contributor to slowness. So, we wanted to see if we could reduce the number of gets.
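The multiplication can be checked directly against the two tables above. The sketch below uses only the figures quoted in the article; the variable names are ours.

```python
# Average latency per query type, in seconds (from the first table).
avg_latency_s = {
    "get": 0.003, "get_multi": 0.008, "add": 0.003, "delete": 0.003,
    "set": 0.003, "incr": 0.003, "prepend": 0.002,
}

# Share of all Memcached requests, in percent (from the second table).
request_share_pct = {
    "get": 71.44, "get_multi": 8.98, "set": 8.69, "delete": 5.26,
    "incr": 3.71, "add": 1.62, "prepend": 0.30,
}

# Contribution to slowness = average latency x share of requests.
contribution = {op: avg_latency_s[op] * request_share_pct[op] for op in avg_latency_s}
total = sum(contribution.values())
ranked = sorted(contribution.items(), key=lambda item: item[1], reverse=True)

for op, value in ranked:
    print(f"{op:<10} {value / total:6.1%}")
```

Under this measure, plain gets account for roughly 62% of the total and get_multi comes second, which is why reducing the number of gets was the natural next target even though get_multi is slower per call.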

Tracing Program Flow

Since we make Memcached queries from all over the Twitter software, it was initially unclear where to start looking for optimization opportunities. Our first step was to begin collecting stack traces, which are logs that represent what the program is doing at any given moment in time. We instrumented one of our computers to sample a small percentage of Memcached get calls and record what sorts of things caused them.

Unfortunately, we collected a huge amount of data and it was hard to understand. Following our precedent of using visualizations in order to gain insight into large sets of data, we took some inspiration from the Google perf-tools project and wrote a small program that generated a cloud graph of the various paths through our code that were resulting in Memcached Gets. Here is a simplified picture:
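A rough sketch of this kind of sampling follows. It is not Twitter's tool: `sample_get_callers` and `cache_get` are hypothetical, and the two application functions are empty stubs that only borrow names mentioned in the article. Each sampled call records caller-to-callee edges, which is the raw material for a call graph.

```python
import random
import traceback
from collections import Counter


def sample_get_callers(edge_counts: Counter, sample_rate: float, rng=random.random) -> None:
    """With probability sample_rate, record caller -> callee edges of the current stack."""
    if rng() >= sample_rate:
        return
    # Drop the last frame, which is this sampling function itself.
    names = [frame.name for frame in traceback.extract_stack()[:-1]]
    for caller, callee in zip(names, names[1:]):
        edge_counts[(caller, callee)] += 1


edge_counts = Counter()


def cache_get(key):
    # Hypothetical stand-in for a Memcached get; sample every call in this demo.
    sample_get_callers(edge_counts, sample_rate=1.0)
    return None


def check_api_rate_limit():  # stub borrowing a name from the article
    cache_get("limit:count:api")


def attempt_basic_auth():  # stub borrowing a name from the article
    cache_get("User:auth")
    check_api_rate_limit()


attempt_basic_auth()
for (caller, callee), count in edge_counts.most_common(5):
    print(f"{caller} -> {callee}: {count}")
```

In production the sample rate would be a small fraction rather than 1.0, and the edge counts would feed a graph renderer, with circle sizes proportional to the number of gets attributed to each function.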

Each circle represents one component/function. The size of the circle represents how big a proportion of Memcached get queries come from that function. The lines between the circles show which function caused the other function to occur. The biggest circle is check_api_rate_limit but it is caused mostly by authenticate_user and attempt_basic_auth. In fact, attempt_basic_auth is the main opportunity for enhancement. It helps us compute who is requesting a given web page so we can serve personalized (and private) information to just the right people.

Any Memcached optimizations that we can make here would have a large effect on the overall performance of Twitter. By counting the number of actual get queries made per request, we found that, on average, a single call to attempt_basic_auth was making 17 calls. The next question is: can any of them be removed?

To figure this out we need to look very closely at all of the queries. Here is a "history" of the most popular web page that calls attempt_basic_auth. This is the API request for http://twitter.com/statuses/friends_timeline.format, the most popular page on Twitter!

get(["User:auth:missionhipster",                       # maps screen name to user id
get(["User:15460619",                                  # gets user object given user id (used to match passwords)
get(["limit:count:login_attempts:...",                 # prevents dictionary attacks
set(["limit:count:login_attempts:...",                 # unnecessary in most cases, bug
set(["limit:timestamp:login_attempts:...",             # unnecessary in most cases, bug
get(["limit:timestamp:login_attempts:...",
get(["limit:count:login_attempts:...",                 # can be memoized
get(["limit:count:login_attempts:...",                 # can also be memoized
get(["user:basicauth:...",                             # an optimization to avoid calling bcrypt
get(["limit:count:api:...",                            # global API rate limit
set(["limit:count:api:...",                            # unnecessary in most cases, bug
set(["limit:timestamp:api:...",                        # unnecessary in most cases, bug
get(["limit:timestamp:api:...",
get(["limit:count:api:...",                            # can be memoized from previous query
get(["home_timeline:15460619",                         # determine which tweets to display
get(["favorites_timeline:15460619",                    # determine which tweets are favorited
get_multi([["Status:fragment:json:7964736693",         # load, in parallel, all of the tweets we're gonna display.

Note that all of the "limit:" queries above come from attempt_basic_auth. We noticed a few other (relatively minor) unnecessary queries as well. It seems like from this data we can eliminate seven out of seventeen total Memcached calls -- a 42% improvement for the most popular page on Twitter.

At this point, we need to write some code to make these bad queries go away. Some of them we cache (so we don't make the exact same query twice), some are just bugs and are easy to fix. Some we might try to parallelize (do more than one query at the same time). But this 42% optimization (especially if combined with new hardware) has the potential to eliminate the performance degradation of our Memcached cluster and also make most page loads that much faster. It is possible we could see a (substantially) greater than 50% increase in the capacity of Twitter with these optimizations.
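The "don't make the exact same query twice" fix can be sketched as a request-scoped memoizing wrapper. Both classes below are illustrative assumptions: `CountingBackend` is an in-memory stand-in for a Memcached client, and the cache key is a made-up placeholder rather than one of the real keys from the trace.

```python
class CountingBackend:
    """In-memory stand-in for a Memcached client that counts get calls."""

    def __init__(self, data):
        self.data = dict(data)
        self.gets = 0

    def get(self, key):
        self.gets += 1
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


class RequestLocalCache:
    """Memoizes gets for the lifetime of one request; create a new one per request."""

    def __init__(self, backend):
        self.backend = backend
        self._seen = {}

    def get(self, key):
        if key not in self._seen:
            # First lookup in this request: go to the backend and remember the result.
            self._seen[key] = self.backend.get(key)
        return self._seen[key]

    def set(self, key, value):
        self.backend.set(key, value)
        self._seen[key] = value  # keep the local copy consistent with the write


backend = CountingBackend({"limit:count:demo": 3})  # hypothetical key
cache = RequestLocalCache(backend)
values = [cache.get("limit:count:demo") for _ in range(3)]
print(values, backend.gets)
```

Scoping the memo to a single request keeps the trade-off small: repeated reads inside one page load are served locally, while the next request still sees fresh values from Memcached.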

This story presents a couple of the fundamental principles that we use to debug the performance problems that lead to whales. First, always proceed from the general to the specific. Here, we progressed from looking first at I/O and CPU timings to finally focusing on the specific Memcached queries that caused the issue. And second, live by the data, but don't trust it. Despite the promise of a 50% gain that the data implies, it's unlikely we'll see any performance gain anywhere near that. Even still, it'll hopefully be substantial.

— @asdf and @nk
