Last week, between Thursday, January 24 and Friday, January 25, we experienced a critical outage of our newly launched Code Search service. As always, we strive to provide detailed, transparent post-mortems about these incidents. We’ll do our best to explain what happened and how we’ve mitigated the problems to prevent this kind of outage from occurring again.
But first, I’d like to apologize on behalf of GitHub for this outage. While it did not affect the availability of any component but Code Search, the severity and the length of the outage are both completely unacceptable to us. I'm very sorry this happened, especially so soon after the launch of a feature we’ve been working on for a very long time.
Background
Our previous search implementation used a technology called Solr. With the launch of our new and improved search, we had finally finished migrating all search results served by GitHub to multiple new search clusters built on elasticsearch.
Since the code search index is quite large, we have a cluster dedicated to it. The cluster currently consists of 26 storage nodes and 8 client nodes. The storage nodes are responsible for holding the data that comprises the search index, while the client nodes are responsible for coordinating query activity. Each of the storage nodes has 2TB of SSD-based storage.
At the time of the outage, we were storing roughly 17TB of code in this cluster. The data is sharded across the cluster, and each shard has a single replica on another node for redundancy, for a total of around 34TB of space in use. This put the total storage utilization of the cluster at around 67%. The Code Search cluster ran on Java 6 and elasticsearch 0.19.9, and had been operating without problems for several months while we backfilled all the code into the index.
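As a quick sanity check on those numbers, here is the back-of-the-envelope math (a sketch using only the figures quoted above; per-shard overhead is ignored, which is why the result lands a little under the ~67% figure):

```python
# Back-of-the-envelope capacity math for the Code Search cluster, using the
# figures quoted above: 26 storage nodes with 2TB of SSD each, ~17TB of
# indexed code, and one replica per shard.
storage_nodes = 26
capacity_per_node_tb = 2.0
index_size_tb = 17.0
replicas_per_shard = 1  # each primary shard has a single replica

total_capacity_tb = storage_nodes * capacity_per_node_tb      # 52 TB
space_in_use_tb = index_size_tb * (1 + replicas_per_shard)    # ~34 TB
utilization = space_in_use_tb / total_capacity_tb             # roughly two-thirds

print(f"capacity: {total_capacity_tb:.0f} TB")
print(f"in use:   {space_in_use_tb:.0f} TB")
print(f"utilization: {utilization:.0%}")
```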
On Thursday, January 17 we were preparing to launch our Code Search service to complete the rollout of our new, unified search implementation. Prior to doing so, we noted that elasticsearch had since released version 0.20.2 which contained a number of fixes and some performance improvements.
We decided that delaying the Code Search launch long enough to upgrade our elasticsearch cluster from version 0.19.9 to 0.20.2 would help ensure a smooth public launch.
We completed this upgrade successfully on Thursday, January 17. All nodes in the cluster came back online and recovered the cluster state.
What went wrong?
Since this upgrade, we have experienced two outages in the Code Search cluster.
Unlike some other search services that use massive, single indexes to store data, elasticsearch uses a sharding pattern to divide data up so it can be easily distributed around the cluster in manageable chunks. Each of these shards is itself a Lucene index, and elasticsearch aggregates search queries across these shards using Lucene merge indexes.
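To make the sharding model concrete, here is a minimal sketch of creating an index with a fixed number of primary shards and one replica per shard through elasticsearch's HTTP API (the index name, counts, and host are hypothetical, not our production values):

```python
# Minimal sketch: create an index split into a fixed number of primary shards,
# each with one replica, so elasticsearch can distribute the pieces across the
# cluster. Index name, counts, and host are illustrative only.
import json
import requests

settings = {
    "settings": {
        "index": {
            "number_of_shards": 10,   # each shard is its own Lucene index
            "number_of_replicas": 1,  # one redundant copy of every shard
        }
    }
}

resp = requests.put(
    "http://localhost:9200/code-search-example",
    headers={"Content-Type": "application/json"},
    data=json.dumps(settings),
)
print(resp.status_code, resp.json())
```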
The first outage occurred roughly 2 hours after the upgrade, during the recovery process that takes place as part of a cluster restart. We found error messages in the index logs indicating that some shards could not be assigned or allocated to certain nodes. Upon further inspection, we discovered that while some of these data shards had corrupted segment cache files, others were missing from disk entirely. elasticsearch was able to recover the shards with corrupted segment cache files and the shards missing only one of their copies, but 7 shards (out of 510) had lost both the primary copy and the replica.
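That condition shows up directly in the cluster health API. Here is a sketch of the kind of check involved (the host is illustrative; verify the response fields against your elasticsearch version):

```python
# Sketch: poll cluster health and flag the case where shards have no live copy.
# Host and output handling are illustrative.
import requests

health = requests.get("http://localhost:9200/_cluster/health").json()

print("cluster status:   ", health.get("status"))            # green / yellow / red
print("unassigned shards:", health.get("unassigned_shards"))

if health.get("status") == "red":
    # "red" means at least one primary shard (and every replica of it) is not
    # allocated anywhere, i.e. that slice of the index is currently unavailable.
    print("Some shards are missing both their primary and replica copies.")
```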
We reviewed the circumstances of the outage and determined at the time that the problems we saw stemmed from the high load during cluster recovery. Our research into the problem did not turn up reports of other elasticsearch users encountering these sorts of issues. The cluster was happy and healthy over the weekend, and so we decided to send it out to the world.
The second outage began on Thursday, January 24. We first noticed problems as our exception tracking and monitoring systems detected a large spike in exceptions. Further review indicated that the majority of these exceptions were coming from timeouts in code search queries and from the background jobs that update our code search indexes with data from new pushes.
At this time, we began to examine both the overall state of all members of the cluster and elasticsearch's logs. We were able to identify massive levels of load on a seemingly random subset of storage nodes. While most nodes were using single-digit percentages of CPU, several were consuming nearly 100% of all of the available CPU cores. We were able to eliminate system-induced load and IO-induced load as culprits: the only thing contributing to the massive load on these servers was the java process elasticsearch was running in.

With the search and index timeouts still occurring, we also noticed in the logs that a number of nodes were being rapidly elected to and then removed from the master role in the cluster. In order to mitigate potential problems resulting from this rapid exchange of the master role around the cluster, we determined that the best course of action was to full-stop the cluster and bring it back up in "maintenance mode", which disables allocation and rebalancing of shards.
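That "maintenance mode" amounts to telling the cluster not to move shards around while nodes come and go. The sketch below shows what this looks like through the cluster settings API; note that the setting names have changed across elasticsearch versions, and in the 0.x era this could also be set in elasticsearch.yml at startup.

```python
# Sketch: disable shard allocation before a full-cluster restart so that
# elasticsearch does not start shuffling shards while nodes rejoin.
# Setting names differ across elasticsearch versions; check yours.
import json
import requests

maintenance = {
    "transient": {
        # 0.x / 0.90-era form; newer releases use
        # "cluster.routing.allocation.enable": "none" instead.
        "cluster.routing.allocation.disable_allocation": True
    }
}

resp = requests.put(
    "http://localhost:9200/_cluster/settings",
    headers={"Content-Type": "application/json"},
    data=json.dumps(maintenance),
)
print(resp.status_code, resp.json())
```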
We were able to bring the cluster back online this way, but we noted a number of problems in the elasticsearch logs.
Recovery
After the cluster restart, we noticed that some nodes were completely unable to rejoin the cluster, and some data shards were trying to double-allocate to the same node. At this point, we reached out to Shay and Drew from elasticsearch, the company that develops and supports elasticsearch.
We were able to confirm with Shay and Drew that these unallocatable shards (23 primaries plus their replicas) had all suffered data loss. In addition to the data loss, the cluster spent a great deal of time trying to recover the remaining shards. During the course of this recovery, we had to restart the cluster several times as we rolled out further upgrades and configuration changes, which meant verifying and recovering shards again each time. This ended up being the most time-consuming part of the outage, as loading 17TB of indexed data off disk multiple times is a slow process.
Working with Shay and Drew, we were able to discover some areas where our cluster was either misconfigured or the configuration required further tuning for optimal performance. They were also able to identify two bugs in elasticsearch itself (see these two commits for further details on those bugs) based on the problems we encountered, and within a few hours released a new version with fixes included. Lastly, we were running a version of Java 6 that was released in early 2009. This version contains multiple critical bugs that affect both elasticsearch and Lucene, as well as problems with large memory allocation which can lead to high load.
Based on their suggestions, we immediately rolled out upgrades for Java and elasticsearch, and updated our configuration with their recommendations. This was done by creating a topic branch and environment on our Puppetmaster for these specific changes, and running Puppet on each of these nodes in that environment.
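Mechanically, that rollout is just pointing each node's Puppet agent at the topic environment. A sketch with hypothetical host and environment names:

```python
# Sketch: run the Puppet agent against a topic environment on each node.
# Host names and the environment name are hypothetical.
import subprocess

nodes = ["codesearch-storage1.example.com", "codesearch-storage2.example.com"]
environment = "elasticsearch-upgrade"

for node in nodes:
    subprocess.run(
        ["ssh", node,
         "sudo", "puppet", "agent", "--test", "--environment", environment],
        check=True,
    )
```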
While these audits increased the length of the outage by a few hours, we believe the time was well spent gathering feedback from experts in large elasticsearch deployments.
With the updated configuration, new elasticsearch version with the fixes for the bugs we encountered, and the performance improvements in Java 7, we have not been able to reproduce any of the erratic load or rapid master election problems we witnessed in the two outages discussed so far.
Outage Monday
We suffered an additional outage to our Code Search cluster on Monday, January 28. This outage was unrelated to any of the previous incidents and was the result of human error.
An engineer was merging the feature branch containing the Java and elasticsearch upgrades back into our production environment. In the process, the engineer rolled the Puppet environment on the Code Search nodes back to the production environment before deploying the merged code. This resulted in elasticsearch being restarted on nodes as Puppet ran on them. We immediately recognized the source of the problem and stopped the cluster to prevent any problems caused by running multiple versions of Java and elasticsearch in the same cluster. Once the merged code was deployed, we ran Puppet on all the Code Search nodes again and brought the cluster back online. Rather than enabling Code Search indexing and querying while the cluster was in a degraded state, we opted to wait for full recovery. Once the cluster finished recovering, we turned Code Search back on.
Mitigating the problem
We did not sufficiently test the 0.20.2 release of elasticsearch on our infrastructure prior to rolling this upgrade out to our code search cluster, nor had we tested it on any other clusters beforehand. A contributing factor to this was the lack of a proper staging environment for the code search cluster. We are in the process of provisioning a staging environment for the code search cluster so we can better test infrastructure changes surrounding it.
The bug fixes included in elasticsearch 0.20.3 make us confident that we won’t encounter those particular problems again. We’re also now running a Java version that is actively tested by the elasticsearch team and is known to be more stable and performant when running elasticsearch. Additionally, our code search cluster configuration has been audited by the team at elasticsearch, with future audits scheduled to ensure it remains optimal for our use case.
As for Monday’s outage, we are currently working on automation to make a Puppet run in a given environment impossible in cases where the branch on GitHub is ahead of the environment on the Puppetmaster.
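A rough sketch of the guard we have in mind (all names and paths here are hypothetical): compare the branch's tip on GitHub with what the environment on the Puppetmaster has checked out, and refuse to run Puppet if the branch is ahead.

```python
# Hypothetical sketch of the guard described above: refuse to run Puppet when
# the environment's checkout on the Puppetmaster is behind its branch on GitHub.
import subprocess
import sys

def git(*args, cwd=None):
    return subprocess.check_output(["git", *args], cwd=cwd, text=True).strip()

def environment_is_current(env_path, branch):
    """True if the environment checkout already contains the branch tip."""
    git("fetch", "origin", branch, cwd=env_path)
    remote_tip = git("rev-parse", f"origin/{branch}", cwd=env_path)
    local_head = git("rev-parse", "HEAD", cwd=env_path)
    # The branch is "ahead" when its tip is not an ancestor of the checkout.
    return git("merge-base", local_head, remote_tip, cwd=env_path) == remote_tip

if not environment_is_current("/etc/puppet/environments/production", "production"):
    sys.exit("Refusing to run Puppet: the branch on GitHub is ahead of this environment.")
```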
Finally, there are some specific notes from the elasticsearch team regarding our configuration that we'd like to share in hopes of helping others who may be running large clusters (a consolidated sketch of these settings follows the list):
- Set the ES_HEAP_SIZE environment variable so that the JVM uses the same value for its minimum and maximum memory. Configuring the JVM with different minimum and maximum values means that each time the JVM needs additional memory (up to the maximum), it will block the Java process to allocate it. Combined with the old Java version, this explains the pauses our nodes exhibited once they were opened up to public searches and faced higher load and continuous memory allocation. The elasticsearch team recommends a setting of 50% of system RAM.
- Our cluster was configured with a recover_after_time set to 30 minutes. The elasticsearch team recommended a change so that recovery would begin immediately rather than after a timed period.
- We did not have minimum_master_nodes configured, so the cluster became unstable when nodes experienced long pauses as subsets of nodes would attempt to form their own clusters.
- During the initial recovery, some of our nodes ran out of disk space. It's unclear why this happened since our cluster was only operating at 67% utilization before the initial event, but it's believed this is related to the high load and old Java version. The elasticsearch team continues to investigate to understand the exact circumstances.
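Pulled together, those notes boil down to a handful of settings. The sketch below summarizes them with illustrative values (not a drop-in config): ES_HEAP_SIZE is an environment variable, while the gateway and discovery keys belong in elasticsearch.yml.

```python
# Illustrative summary of the configuration guidance above. The numbers are
# examples, not our production values: ES_HEAP_SIZE is set in the environment,
# and the remaining keys belong in elasticsearch.yml.
system_ram_gb = 64
expected_nodes = 34          # all nodes expected in the cluster (illustrative)
master_eligible_nodes = 34   # nodes that can be elected master (illustrative)

recommended = {
    # Fixed JVM heap (min == max) at ~50% of system RAM, via ES_HEAP_SIZE.
    "ES_HEAP_SIZE": f"{system_ram_gb // 2}g",
    # Begin recovery as soon as the expected nodes are present instead of
    # waiting out a recover_after_time window.
    "gateway.expected_nodes": expected_nodes,
    # Require a quorum of master-eligible nodes before electing a master, so
    # paused nodes cannot split off and form their own cluster.
    "discovery.zen.minimum_master_nodes": master_eligible_nodes // 2 + 1,
}

for key, value in recommended.items():
    print(f"{key}: {value}")
```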
Summary
I’m terribly sorry about the availability problems of our Code Search feature since its launch. It has not been up to our standards, and we are taking each and every one of the lessons these outages have taught us to heart. We can and will do better. Thank you for supporting us at GitHub, especially during difficult times like this.