The Anatomy of a Whale
Sometimes it's really hard to figure out what's causing problems in a web site like Twitter. But over time we have learned some techniques that help us to solve the variety of problems that occur in our complex web site.
A few weeks ago, we noticed something unusual: over 100 visitors to Twitter per second saw what is popularly known as "the fail whale". Normally these whales are rare; 100 per second was cause for alarm. Although even 100 per second is a very small fraction of our overall traffic, it still means that a lot of users had a bad experience when visiting the site. So we mobilized a team to find out the cause of the problem.
What Causes Whales?
What is the thing that has come to be known as "the fail whale"? It is a visual representation of the HTTP "503: Service Unavailable" error. It means that Twitter does not have enough capacity to serve all of its users. To be precise, we show this error message when a request would wait for more than a few seconds before resources become available to process it. So rather than make users wait forever, we "throw away" their requests by displaying an error.
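The "throw away" rule described above can be sketched in a few lines. This is only an illustration, not Twitter's actual code: the function name, the `render` callback, and the 3-second threshold are assumptions (the post only says "a few seconds").

```python
MAX_WAIT_SECONDS = 3.0  # assumed value; the post only says "a few seconds"

def handle(enqueued_at, now, render):
    """Return (status, body), shedding the request if it has queued too long."""
    waited = now - enqueued_at
    if waited > MAX_WAIT_SECONDS:
        # Give up instead of making the user wait: this is the whale.
        return 503, "Service Unavailable"
    return 200, render()
```

A request that has already waited 5 seconds gets the 503, while one that waited 1 second is rendered normally.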
This can sometimes happen because too many users try to use Twitter at once and we don't have enough computers to handle all of their requests. But much more likely is that some component part of Twitter suddenly breaks and starts slowing down.
Discovering the root cause can be very difficult because Whales are an indirect symptom of a root cause that can be one of many components. In other words, the only concrete fact that we knew at the time was that there was some problem, somewhere. We set out to uncover exactly where in the Twitter requests' lifecycle things were breaking down.
Debugging performance issues is really hard. But it's not hard due to a lack of data; in fact, the difficulty arises because there is too much data. We measure dozens of metrics per individual site request, which, when multiplied by the overall site traffic, is a massive amount of information about Twitter's performance at any given moment. Investigating performance problems in this world is more of an art than a science. It's easy to confuse causes with symptoms and even the data recording software itself is untrustworthy.
In the analysis below, we used a simple strategy: start from the most aggregate measures of the system as a whole and, at each step, get more fine-grained, looking at smaller and smaller parts.
How is a Web Page Built?
Composing a web page for a Twitter request often involves two phases. First, data is gathered from remote sources called "network services". For example, on the Twitter homepage your tweets are displayed as well as how many followers you have. These data are pulled respectively from our tweet caches and our social graph database, which keeps track of who follows whom on Twitter. The second phase of the page composition process assembles all this data in an attractive way for the user. We call the first phase the IO phase and the second the CPU phase. To discover which phase was causing problems, we checked data that records how much time was spent in each phase when composing Twitter's web pages.
The green line in this graph represents the time spent in the IO phase and the blue line represents the CPU phase. This graph represents about 1 day of data. You can see that the relationships change over the course of the day. During non-peak traffic, CPU time is the dominant portion of our request, with our network services responding relatively quickly. However, during peak traffic, IO latency almost doubles and becomes the primary component of total request latency.
Understanding Performance Degradation
There are two possible interpretations for this ratio changing over the course of the day. One possibility is that the way people use Twitter during one part of the day differs from other parts of the day. The other possibility is that some network service degrades in performance as a function of use. In an ideal world, each network service would have equal performance for equal queries; but in the worst case, the same queries actually get slower as you run more of them simultaneously. Checking various metrics confirmed that users use Twitter the same way during different parts of the day. So we hypothesized that the problem must be a network service that degrades under load. We were still unsure; in any good investigation one must constantly remain skeptical. But we decided that we had enough information to move from this more general analysis of the system to something more specific, so we looked into IO latency data.
This graph represents the total amount of time waiting for our network services to deliver data. Since the amount of traffic we get changes over the course of the day, we expect any total to vary proportionally. But this graph is actually traffic independent; that is, we divide the measured latency by the amount of traffic at any given time. If any traffic-independent total latency changes over the course of the day, we know the corresponding network service is degrading with traffic. You can see that the purple line in this graph (which represents Memcached) degrades dramatically as traffic increases during peak hours. Furthermore, because it is at the top of the graph, it also accounts for the biggest proportion of time waiting for network services. So this correlates with the previous graph and we now have a stronger hypothesis: Memcached performance degrades dramatically during the course of the day, which leads to slower response times, which leads to whales.
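The normalization described above is just a division per time interval. A minimal sketch, using made-up numbers for a single service (the interval totals and request counts are hypothetical, not Twitter's data):

```python
def latency_per_request(total_latency_s, request_count):
    """Divide the summed wait time in each interval by that interval's traffic."""
    return [
        total / count if count else 0.0
        for total, count in zip(total_latency_s, request_count)
    ]

# Hypothetical samples: one off-peak interval, then one peak interval.
totals = [30.0, 240.0]        # summed seconds spent waiting on the service
requests = [10_000, 40_000]   # requests served in the same intervals

per_request = latency_per_request(totals, requests)
degrades_with_traffic = per_request[1] > per_request[0]
```

Traffic grew 4x here but total wait grew 8x, so the per-request figure doubles. A service that scaled well would show a flat per-request line.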
This sort of behavior is consistent with insufficient resource capacity. When a service with limited resources, such as Memcached, is taxed to its limits, requests begin contending with each other for Memcached's computing time. For example, if Memcached can only handle 10 requests at a time but it gets 11 requests at a time, the 11th request needs to wait in line to be served.
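The 10-versus-11 example can be made concrete with a toy model. It assumes all requests arrive at the same instant and each takes the same fixed service time, which is a deliberate simplification rather than a description of how Memcached schedules work.

```python
def completion_times(n_requests, capacity, service_time_s):
    """Finish time of each request when only `capacity` can run at once."""
    return [(i // capacity + 1) * service_time_s for i in range(n_requests)]

times = completion_times(n_requests=11, capacity=10, service_time_s=0.003)
```

The first ten requests finish after one service time. The eleventh has to wait for a free slot and finishes after two, so its observed latency doubles even though the server is no slower per query.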
Focus on the Biggest Contributor to the Problem
If we can add sufficient Memcached capacity to reduce this sort of resource contention, we could increase the throughput of Twitter.com substantially. If you look at the above graph, you can infer that this optimization could increase Twitter performance by 50%.
There are two ways to add capacity. We could add more computers (Memcached servers). But we can also change the software that talks to Memcached to be as efficient with its requests as possible. Ideally we do both.
We decided to look first at how we query Memcached, to see if there was an easy way to reduce the overall number of queries. But there are many types of Memcached queries, and some may take longer than others. We want to spend our time wisely and focus on optimizing the queries that are most expensive in aggregate.
We sampled a live process to record some statistics on which queries take the longest. The following is each type of Memcached query and how long they take on average:
get        0.003s
get_multi  0.008s
add        0.003s
delete     0.003s
set        0.003s
incr       0.003s
prepend    0.002s
You can see that get_multi is a little more expensive than the rest, while everything else takes roughly the same time. But that doesn't mean it's the source of the problem. We also need to know how many requests per second there are for each type of query.
get        71.44%
get_multi   8.98%
set         8.69%
delete      5.26%
incr        3.71%
add         1.62%
prepend     0.30%
If you multiply average latency by the percentage of requests you get a measure of the total contribution to slowness. Here, we found that gets were the biggest contributor to slowness. So, we wanted to see if we could reduce the number of gets.
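The multiplication can be reproduced directly from the two tables above. The variable names below are illustrative; the numbers are the ones quoted in the post.

```python
# Average latency per query type, in seconds (first table).
avg_latency_s = {
    "get": 0.003, "get_multi": 0.008, "add": 0.003, "delete": 0.003,
    "set": 0.003, "incr": 0.003, "prepend": 0.002,
}
# Share of all Memcached requests, in percent (second table).
share_pct = {
    "get": 71.44, "get_multi": 8.98, "set": 8.69, "delete": 5.26,
    "incr": 3.71, "add": 1.62, "prepend": 0.30,
}

# Latency weighted by request share: a relative measure of time spent per type.
contribution = {op: avg_latency_s[op] * share_pct[op] for op in avg_latency_s}
ranked = sorted(contribution, key=contribution.get, reverse=True)
get_share = contribution["get"] / sum(contribution.values())
```

With these figures, get comes out on top at roughly 62% of the weighted total. get_multi is second at about 21%, even though it is the slowest individual query.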
Tracing Program Flow
Since we make Memcached queries from all over the Twitter software, it was initially unclear where to start looking for optimization opportunities. Our first step was to begin collecting stack traces, which are logs that represent what the program is doing at any given moment in time. We instrumented one of our computers to sample a small percentage of Memcached get calls and record what sorts of things caused them.
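One way to do this kind of sampling is sketched below. It is not the instrumentation Twitter used: `cache_get`, `record_caller_stack`, and the 1% sample rate are all assumptions for illustration, and the cache call itself is a stand-in.

```python
import random
import traceback

SAMPLE_RATE = 0.01  # assumed; the post only says a small percentage
samples = []        # each entry is a list of function names, outermost first

def record_caller_stack(rate):
    """With probability `rate`, save the names of the functions on the call stack."""
    if random.random() < rate:
        # Drop the last frame (this helper) so the stack ends at our caller.
        frames = traceback.extract_stack()[:-1]
        samples.append([frame.name for frame in frames])

def cache_get(key, rate=SAMPLE_RATE):
    """Hypothetical wrapper around a Memcached get that samples its callers."""
    record_caller_stack(rate)
    return None  # stand-in for the real client call
```

Sampling keeps the overhead low on a live process, at the cost of needing enough traffic for the rare code paths to show up.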
Unfortunately, we collected a huge amount of data and it was hard to understand. Following our precedent of using visualizations in order to gain insight into large sets of data, we took some inspiration from the Google perf-tools project and wrote a small program that generated a cloud graph of the various paths through our code that were resulting in Memcached Gets. Here is a simplified picture:
Each circle represents one component/function. The size of the circle represents how big a proportion of Memcached get queries come from that function. The lines between the circles show which function caused the other function to occur. The biggest circle is check_api_rate_limit, but it is caused mostly by authenticate_user and attempt_basic_auth. In fact, attempt_basic_auth is the main opportunity for enhancement. It helps us compute who is requesting a given web page so we can serve personalized (and private) information to just the right people.
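The data behind such a graph can be built by counting over the sampled stacks. The sketch below assumes each sample is a list of function names ordered from outermost caller to the function that issued the get. The three sample stacks are invented for illustration and only borrow the function names mentioned above.

```python
from collections import Counter

def build_call_graph(stacks):
    """Return (circle sizes, line weights) for a set of sampled stacks."""
    node_weight = Counter()  # how many sampled gets each function issued directly
    edge_weight = Counter()  # how often each caller -> callee pair was seen
    for stack in stacks:
        node_weight[stack[-1]] += 1
        edge_weight.update(zip(stack, stack[1:]))
    return node_weight, edge_weight

# Hypothetical samples, not real Twitter data.
stacks = [
    ["authenticate_user", "attempt_basic_auth", "check_api_rate_limit"],
    ["authenticate_user", "attempt_basic_auth", "check_api_rate_limit"],
    ["authenticate_user", "attempt_basic_auth"],
]
nodes, edges = build_call_graph(stacks)
```

Here the node counts give the circle sizes and the edge counts say who is responsible for them: check_api_rate_limit issues the most gets, but every one of them arrives through attempt_basic_auth.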
Any Memcached optimizations that we can make here would have a large effect on the overall performance of Twitter. By counting the number of actual get queries made per request, we found that, on average, a single call to attempt_basic_auth was making 17 calls. The next question is: can any of them be removed?
To figure this out, we need to look very closely at all of the queries. Here is a "history" of the most popular web page that calls attempt_basic_auth. This is the API request for http://twitter.com/statuses/friends_timeline.format, the most popular page on Twitter!
get(["User:auth:missionhipster", # maps screen name to user id
get(["User:15460619", # gets user object given user id (used to match passwords)
get(["limit:count:login_attempts:...", # prevents dictionary attacks
set(["limit:count:login_attempts:...", # unnecessary in most cases, bug
set(["limit:timestamp:login_attempts:...", # unnecessary in most cases, bug
get(["limit:timestamp:login_attempts:...",
get(["limit:count:login_attempts:...", # can be memoized
get(["limit:count:login_attempts:...", # can also be memoized
get(["user:basicauth:...", # an optimization to avoid calling bcrypt
get(["limit:count:api:...", # global API rate limit
set(["limit:count:api:...", # unnecessary in most cases, bug
set(["limit:timestamp:api:...", # unnecessary in most cases, bug
get(["limit:timestamp:api:...",
get(["limit:count:api:...", # can be memoized from previous query
get(["home_timeline:15460619", # determine which tweets to display
get(["favorites_timeline:15460619", # determine which tweets are favorited
get_multi([["Status:fragment:json:7964736693", # load, in parallel, all of the tweets we're gonna display.
Note that all of the "limit:" queries above come from attempt_basic_auth.
We noticed a few other (relatively minor) unnecessary queries as well. From this data, it looks like we can eliminate seven out of seventeen total Memcached calls -- a 42% improvement for the most popular page on Twitter.
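The count of seven can be checked by tallying the annotations in the trace. In the sketch below, each call is labeled by hand from its comment: "bug" for the sets marked unnecessary, "memoize" for the repeated gets, and "needed" for everything else. The labels are our reading of the trace, not something the trace itself records.

```python
# One (operation, verdict) pair per line of the trace, in order.
trace = [
    ("get", "needed"), ("get", "needed"), ("get", "needed"),
    ("set", "bug"), ("set", "bug"), ("get", "needed"),
    ("get", "memoize"), ("get", "memoize"), ("get", "needed"),
    ("get", "needed"), ("set", "bug"), ("set", "bug"),
    ("get", "needed"), ("get", "memoize"), ("get", "needed"),
    ("get", "needed"), ("get_multi", "needed"),
]

total = len(trace)
removable = sum(1 for _, verdict in trace if verdict != "needed")
fraction_removable = removable / total
```

Four buggy sets plus three memoizable gets give seven of seventeen, which is about 41%, close to the 42% quoted.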
At this point, we need to write some code to make these bad queries go away. Some of them we cache (so we don't make the exact same query twice), some are just bugs and are easy to fix. Some we might try to parallelize (do more than one query at the same time). But this 42% optimization (especially if combined with new hardware) has the potential to eliminate the performance degradation of our Memcached cluster and also make most page loads that much faster. It is possible we could see a (substantially) greater than 50% increase in the capacity of Twitter with these optimizations.
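The "don't make the exact same query twice" fix is usually done with a cache that lives only for the duration of one request. A minimal sketch, assuming a client object with `get` and `set` methods (both classes and the key name below are hypothetical, not Twitter's code or a real Memcached library):

```python
class CountingClient:
    """In-memory stand-in for a Memcached client that counts round trips."""

    def __init__(self):
        self.data = {}
        self.gets = 0

    def get(self, key):
        self.gets += 1
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


class RequestScopedCache:
    """Remember get results so a repeated key costs only one round trip."""

    def __init__(self, client):
        self._client = client
        self._seen = {}

    def get(self, key):
        if key not in self._seen:
            self._seen[key] = self._client.get(key)
        return self._seen[key]

    def set(self, key, value):
        self._client.set(key, value)
        self._seen[key] = value  # keep the local copy in step with the write


client = CountingClient()
cache = RequestScopedCache(client)
cache.get("limit:count:api:example")
cache.get("limit:count:api:example")  # served locally, no second round trip
```

The wrapper should be created per request and thrown away afterwards. Within a request it trades a small risk of reading a slightly stale value for fewer round trips.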
This story presents a couple of the fundamental principles that we use to debug the performance problems that lead to whales. First, always proceed from the general to the specific. Here, we progressed from looking first at I/O and CPU timings to finally focusing on the specific Memcached queries that caused the issue. And second, live by the data, but don't trust it. Despite the promise of a 50% gain that the data implies, it's unlikely we'll see any performance gain anywhere near that. Even still, it'll hopefully be substantial.
— @asdf and @nk