How HipChat Stores And Indexes Billions Of Messages Using ElasticSearch & Redis


This article is from an interview with Zuhaib Siddique, a production engineer at HipChat, makers of group chat and IM for teams.

HipChat started in an unusual space, one you might not think would have much promise: enterprise group messaging. But as we are learning, there is gold in them there enterprise hills, which is why Atlassian, makers of well-regarded tools like JIRA and Confluence, acquired HipChat in 2012.

And in a tale not often heard, the resources and connections of a larger parent have actually helped HipChat enter an exponential growth cycle. Having reached the 1.2 billion message storage mark they are now doubling the number of messages sent, stored, and indexed every few months.

That kind of growth puts a lot of pressure on a once adequate infrastructure. HipChat exhibited a common scaling pattern. Start simple, experience traffic spikes, and then think what do we do now? Using bigger computers is usually the first and best answer. And they did that. That gave them some breathing room to figure out what to do next. On AWS, after a certain inflection point, you start going Cloud Native, that is, scaling horizontally. And that’s what they did.

But there's a twist to the story. Security concerns have driven the development of an on-premises version of HipChat in addition to its cloud/SaaS version. We'll talk more about this interesting development in a post later this week.

While HipChat isn’t Google scale, there is good stuff to learn from HipChat about how they index and search billions of messages in a timely manner, which is the key difference between something like IRC and HipChat. Indexing and storing messages under load while not losing messages is the challenge.

This is the road that HipChat took, what can you learn? Let’s see…

Stats

  • 60 messages per second.

  • 1.2 Billion documents stored

  • 4 TB of EBS RAID

  • 8 ElasticSearch servers on AWS

  • 26 front-end proxy servers. Double that in backend app servers.

  • 18 people

  • 0.5 terabytes of search data.

Platform

  • Hosting: AWS EC2 East with 75 instances, currently all Ubuntu 12.04 LTS

  • Database: CouchDB currently for chat history, transitioning to ElasticSearch. MySQL (RDS) for everything else

  • Caching: Redis

  • Search: ElasticSearch

  • Queue/Workers server: Gearman (queue) and Curler (worker)

  • Language: Twisted Python (XMPP Server) and PHP (Web front end)

  • System Configuration: Open Source Chef + Fabric

  • Code Deployment: Capistrano

  • Monitoring: Sensu and monit pumping alerts to Pagerduty

  • Graphing: statsd + Graphite

Product 

  • Traffic is very bursty. On weekends and holidays it will be quiet. Peak load is hundreds of requests per second. Chat messages don't actually make up the majority of traffic; it's presence information (away, idle, available), people connecting/disconnecting, etc. So 60 messages per second may seem low, but it is just an average.

  • HipChat wants to be your notification center, where you go to collaborate with your team and get all the information coming from your tools and other systems. Helps keep everyone in the loop, especially with remote offices.

  • The big reason to use HipChat over, say, IRC is that HipChat stores and indexes every conversation so you can search it later. Search is emphasized so you can stay inside HipChat. The win of this feature for teams is that you can go back at any time and remember what happened and what you agreed to do. It will also route messages to multiple devices owned by the same user, as well as temporarily cache and retry messages if your device is unreachable when a message is sent.

  • Growth is from more customers, but also greater engagement as more customers use the product at each site. Also seeing growth from their API integration.

  • Storage and searching of messages is the major scalability bottleneck in their system.

  • HipChat uses XMPP so any XMPP client can connect into the system, which is a huge win for adoption. And they’ve built their own native clients (Windows, Linux, Mac, iOS, Android) with extensions like PDF viewing, custom emoticons, automatic user registration, etc.

  • Not long ago bringing tools like wikis into a corporate culture was almost impossible. Now enterprise-level tools seem to be accepted. Why now?

    • Text based communication is now common and accepted. We have texting, IM, and Skype, so using chatting tools is now natural.

    • Rise of distributed work teams. Teams are more and more distributed. We can’t just sit down together and get a lecture. There’s a need to document everything which means organizing communication is perceived as a great asset.

    • Enhanced features. Features like inline images, animated gifs, etc make it fun, which appeals to wider groups of people.

  • HipChat has an API which makes it possible to write tools analogous to IRC bots. An example use is Bitbucket commits. At 10:08 developer X commits some code to fix a bug. A message is sent via HipChat with links directly to the code commit and the commit log, all automated. The Bitbucket commit hits a web hook, which uses one of the addons to post the information. Addons help with writing your own bots. Go to your Bitbucket account and say: I have my API token and I want to post to this API whenever a commit happens. Works similarly for GitHub.
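
As a sketch of that flow, a commit hook might build a HipChat v1 `rooms/message` payload like the following. Only the endpoint and parameter names follow the public v1 API; the helper function, token, and room id are illustrative.

```python
# Hypothetical commit-notification helper; an actual bot would POST the
# returned params to https://api.hipchat.com/v1/rooms/message.

def build_commit_notification(token, room_id, author, sha, url):
    """Build form parameters for a HipChat v1 rooms/message call."""
    message = f'{author} committed <a href="{url}">{sha[:7]}</a>'
    return {
        "auth_token": token,        # API token for the posting account
        "room_id": room_id,         # numeric id of the target room
        "from": "Bitbucket",        # sender name shown in the room
        "message": message,
        "message_format": "html",   # allow the commit link to render
        "notify": 1,                # ping room members
    }

params = build_commit_notification(
    "MY_TOKEN", 42, "alice",
    "9fceb02d0ae598e95dc970b74767f19372d61af8",
    "https://bitbucket.org/acme/repo/commits/9fceb02",
)
```

A real integration would send these with something like `requests.post`, but the payload shape is the interesting part.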

  • Started with Adobe Air on the client side, it leaked memory and would take down machines. So moved to native apps. A pain, but a good pain. They have users across all platforms in all parts of a company. You need to be where users are at. Want users to have a great experience on all OSs. Users are everyone, not just techies.

XMPP Server Architecture

  • HipChat is based on XMPP; a message is anything inside an XMPP stanza, which could be anything from a line of text to a long section of log output. They didn’t want to talk about their XMPP architecture, so there aren’t a lot of details.

  • Instead of using a 3rd party XMPP server, they’ve built their own using Twisted Python and XMPP libraries. This allows the creation of a scalable backend, user management, and easy addition of features without fighting someone else’s code base.

  • RDS on AWS is used for user authentication and other areas where transactions and SQL would be useful. It’s a stable, proven technology. For the on-premises product they use MariaDB.

  • Redis for caching. Information like which users are in which rooms, presence information, who is online, etc, so it doesn’t matter which XMPP server you connect to, the XMPP server itself is not a limit.

    • Pain point is Redis doesn’t have clustering (yet). For high availability a hot/cold model is used so a slave is ready to go. Failing over from a master to a slave takes about 7 minutes. Slave promotion is done by hand; it’s not automated.
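
The room-membership and presence caching described above can be sketched with a Redis-style key scheme. This in-memory stand-in mirrors the SADD/SREM/SET commands a real deployment would issue through redis-py; the key names are assumptions, not HipChat's actual schema.

```python
# Minimal in-memory model of a shared presence cache. Because this state
# lives in one place (Redis), any XMPP server can answer presence queries.

class PresenceCache:
    def __init__(self):
        self.store = {}  # stands in for Redis; real code would use redis-py

    def join_room(self, room, user):
        self.store.setdefault(f"room:{room}:members", set()).add(user)  # SADD

    def leave_room(self, room, user):
        self.store.get(f"room:{room}:members", set()).discard(user)     # SREM

    def set_presence(self, user, state):
        self.store[f"presence:{user}"] = state                          # SET

    def room_members(self, room):
        return self.store.get(f"room:{room}:members", set())            # SMEMBERS

cache = PresenceCache()
cache.join_room("ops", "alice")
cache.set_presence("alice", "available")
# Any XMPP server can now answer "who is in ops?" from the shared cache
# instead of from its own local connection state.
```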

  • Increased loads exposed a weakness in the proxy server and how many clients they could handle.

    • This is a real problem as not losing messages is a high priority. Customers indicate that not dropping messages is more important than low latency. Users would rather get messages late than not at all.

    • With 6 XMPP servers the system worked well. As the connection counts increased they started seeing unacceptable latency. Connections don’t just come from clients, but also bots supporting their programmatic interfaces.

    • As a first pass they split out front-end servers and app servers. Proxies handle connections and the backend app servers process stanzas. The number of front-end servers is driven by the number of active listening clients and not the number of messages sent. It's a challenge to keep so many connections open while providing timely service.

    • After fixing the datastore issues the plan is to look into how to optimize the connection management. Twisted works well, but they have a lot of connections, so have to figure out how to handle that better.

Storage Architecture

  • At 1 billion messages sent, and still growing, HipChat pushed the limits of their CouchDB and Lucene solution for storing and searching messages.

    • Thought Redis would be the failure point. Thought Couch/Lucene would be good enough. Didn’t do proper capacity planning or look at the message growth rate. They were growing faster than they thought, and shouldn’t have focused so much on Redis instead of data storage.

    • At that time they believed in scaling by adding more horsepower, moving up to larger and larger Amazon instances. They got to the point where with their growth this approach would only work for another 2 months. So they had to do something different.

    • Couch/Lucene had not been updated in over a year and couldn’t do faceting. Another reason for something different.

  • About half a billion messages was a tipping point on Amazon. With a dedicated server and 200 gigs of RAM their previous architecture might have still worked, but not in a cloud with limited resources.

  • They wanted to stay on Amazon.

    • Like its flexibility. Just spin up a new instance and chug along.

    • Amazon’s flakiness makes you develop better. Don’t put all your eggs in one basket; if a node goes down you have to deal with it or traffic will be lost for some users.

    • Uses a dynamic model. Can lose an instance quickly and bring up new instances. Cloud native type stuff. Kill a node at any time. Kill a Redis master. Recover in 5 minutes. Split across all 4 US-East availability zones currently, but not yet multi-region.

    • EBS only lets you have 1 TB of data. They didn’t know about this limit until they ran into it. With Couch they ran into problems with EBS disk size limitations. HipChat’s data was 0.5 terabytes. To do compaction, Couch had to copy data into a compaction file, which doubles the space. They were hitting limits on a 2 TB RAID during weekend compactions. They didn’t want a RAID solution.

  • Amazon DynamoDB wasn’t an option because they are launching a HipChat Server, a hosted service behind the firewall.

    • HipChat Server drives decisions on the tech stack. The private version is a host-your-own solution. Certain customers can’t use the cloud/SaaS solution, like banks and financial institutions. The NSA has spooked international customers. They hired two engineers to create an installable version of the product.

    • Redis Cluster can be self-hosted and it works on AWS as can ElasticSearch. Instead of RDS they use MariaDB in the on-premises version.

    • Can’t consider a full SaaS solution as that would be a lock-in.

  • Now transitioning to ElasticSearch.

    • Moved to ElasticSearch as their storage and search backend because it can eat all the data they can feed it, it is highly available, it scales transparently by simply adding more nodes, it is multi-tenant, it can handle a node loss transparently via sharding and replication, and it’s built on top of Lucene.

    • Didn’t really need MapReduce functionality. Looked into BigCouch and Riak Search (which didn’t look well baked). But ES performance on a GET was pretty good. Liked the idea of removing a component: one less thing to troubleshoot. ES’s HA has made them feel really confident in the solidity of their system.

    • Lucene compatible is a big win as all their queries are already Lucene compatible, so it was a natural migration path.

    • Customer data is quite varied, ranging from chat logs to images, so response types are all over the place. They need to be able to quickly query data directly from over 1.2 billion docs.

    • In a move that is becoming more and more common, HipChat is using ElasticSearch as their key-value store too, reducing the number of database systems they need and thus reducing overall complexity. Performance and response times were so good they thought why not use it? Between 10ms and 100ms response times. It was beating Couch in certain areas without any caching in front of it. Why have multiple tools?

    • With ES a node can go down and nobody notices, you will get an alert about high CPU usage while it is rebalancing, but it keeps chugging along.

    • Have 8 ES nodes to handle growth in traffic.

    • As with all Java based products JVM tuning can be tricky.

      • To use ES you must capacity plan for how much heap space you’ll be using

      • Test out caching. ES can cache filter results. It is very fast, but you need a lot of heap space. With 22 gigs of heap on 8 boxes, memory was exhausted with caching turned on. So turn the cache off unless you’ve planned for it.

      • Had problems with caching because it would hit an out of memory error and then fail. The cluster self healed in minutes. Only a handful of users noticed a problem.

  • Automated failover on Amazon is problematic because of the network unreliability issues. In a cluster it can cause an election to occur errantly.

    • Ran into this problem with ElasticSearch. Originally had 6 ES nodes running as master-electable. A node would run out of memory or hit a GC pause, with a network loss on top of that. Then the others could no longer see the master, would hold an election, and declare a new master. A flaw in the election architecture is that a quorum is not required, so you get a split brain: two masters. Caused a lot of problems.

    • Solution is to run ElasticSearch masters on dedicated nodes. Being the master is all they do. Since then there have been no problems. Masters handle how shard allocation is done, who is primary, and map where replica shards are located. This makes rebalancing much easier because the masters can handle all the rebalancing with excellent performance. You can query any node and it will do the internal routing itself.
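
Dedicated masters correspond to a node configuration roughly like the following `elasticsearch.yml` sketch (settings from ElasticSearch of that era). The quorum setting, `discovery.zen.minimum_master_nodes`, is the standard guard against split-brain elections, assuming three dedicated master-eligible nodes.

```yaml
# Dedicated master: eligible to coordinate the cluster, but holds no data
# and serves no queries.
node.master: true
node.data: false

# With 3 dedicated masters, require a majority (2) before an election can
# succeed, so an isolated node can never declare itself master.
discovery.zen.minimum_master_nodes: 2
```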

  • Uses monthly indexes. Every month is a separate index. Each index has 8 primary shards with two replicas each. If one node is lost the system still functions.
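
A minimal sketch of that monthly layout, assuming index names like `messages-YYYY-MM` (the actual naming scheme is not given in the article):

```python
from datetime import datetime

def index_for(ts):
    """Route a message timestamp to its monthly index, e.g. messages-2013-12."""
    return f"messages-{ts.year:04d}-{ts.month:02d}"

# Settings matching the article: 8 primary shards with two replicas each,
# so the loss of a single node never takes a shard's last copy offline.
INDEX_SETTINGS = {"number_of_shards": 8, "number_of_replicas": 2}

# Each new month, an index like this would be created with INDEX_SETTINGS
# and writes routed to it by index_for().
current_index = index_for(datetime(2013, 12, 5))
```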

  • Not moving RDS into ES. Stuff they need SQL for is staying in RDS/MariaDB, typically user management data.

  • A lot of caching in Redis in a master/slave setup until Redis Cluster is released. Redis stores state like who is in a room and who is offline. The Redis history cache holds the last 75 messages to prevent constantly hitting the database when loading a conversation for the first time. Redis is also used for internal status and quick data, like the number of people logged in.
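
The last-75-messages history cache behaves like a capped list. In Redis this is commonly done with `LPUSH` followed by `LTRIM`; a bounded deque models the same behavior in memory (the key name and exact commands are illustrative, not HipChat's code):

```python
from collections import deque

HISTORY_LEN = 75

def make_history():
    # Equivalent Redis pattern per room:
    #   LPUSH room:history:<id> <message>
    #   LTRIM room:history:<id> 0 74
    return deque(maxlen=HISTORY_LEN)

history = make_history()
for i in range(100):
    history.append(f"message {i}")
# Only the most recent 75 messages survive, so a first-time room load is
# served from the cache instead of the database.
```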

General 

  • Gearman used for async jobs like iOS push and delivering email.

  • AWS West is used for disaster recovery. Everything is replicated to AWS West.

  • Chef is used for all configuration. ElasticSearch has a nice cookbook for Chef. Easy to get started. Like Chef because you can start writing Ruby code instead of using a Puppet style DSL. It also has a nice active community.

  • Acquisition experience. They now have access to core company assets and talent, but Atlassian doesn’t mess with what works: "we bought you for a reason." They can ask internally, for example, how to scale ElasticSearch, and when others in Atlassian need help they can pitch in. Good experience overall.

  • Team structure is flat. Still a small team. Currently have about 18 people. Two people in devops. Developers for platforms, iOS, Android, a few on the server side, and a web development engineer (in France).

  • Capistrano is used to deploy to all boxes.

  • Sensu for monitoring apps. They forgot to monitor an ElasticSearch node for heap space and then hit OOM problems without any notifications. Currently at 75% of heap, which is the sweet spot they want.

  • Bamboo is used for continuous integration.

  • Client releases not formal yet, developer driven. Have a staging area for testing.

  • Group flags. Can control which groups get a feature, in order to test features, release them slowly, and control load on boxes.

  • Feature flags. Great for protection during the ElasticSearch rollout, for example. If they notice a bug they can turn off a feature and roll back to Couch, and users don’t notice a difference. During the transition phase between Couch and ElasticSearch, applications replicate to both stores.
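
The rollback path described above can be sketched as a flag-guarded dual-write. The `Store` class and flag name here are stand-ins, not HipChat's code; the point is that writes always go to both stores while a single flag decides which one serves reads:

```python
class Store:
    """In-memory stand-in for a document store (Couch or ElasticSearch)."""
    def __init__(self):
        self.docs = []
    def save(self, doc):
        self.docs.append(doc)
    def search(self, term):
        return [d for d in self.docs if term in d]

FLAGS = {"search_backend": "elasticsearch"}  # flip to "couch" to roll back

def store_message(doc, couch, es):
    couch.save(doc)  # dual-write: both stores stay in sync during migration
    es.save(doc)

def search_messages(term, couch, es):
    backend = es if FLAGS["search_backend"] == "elasticsearch" else couch
    return backend.search(term)

couch, es = Store(), Store()
store_message("deploy finished", couch, es)
```

Because reads are routed by the flag and both stores hold every message, flipping the flag back to Couch is invisible to users.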

  • New API version will use OAuth so developers can deploy on their own servers using the HipChat API. Having customers use their own servers is a more scalable model.

The Future 

  • Will hit two billion messages in a few months and it’s estimated ElasticSearch can handle about two billion messages. Not sure how to handle the expected increases in load. Expect to go to Amazon West to get more datacenter availability and possibly put more users in different datacenters.

  • Looking to use AWS’s auto scale capabilities.

  • Moving into voice, private 1-1 video, audio chat, basic conferencing.

  • Might use RabbitMQ for messaging in the future.

  • Greater integration with Confluence (a wiki). Use HipChat to talk then use Confluence page to capture details.

Lessons

  • There’s money in them there Enterprise apps. Selling into an enterprise can be torture. Long sales cycles mean it can be hit or miss. But if you hit, it’s a lucrative relationship. So you may want to consider the enterprise market. Times are changing. Enterprises may still be stodgy and slow, but they are also adopting new tools and ways of doing things, and there’s an opportunity there.

  • Privacy is becoming more important and it impacts your stack choices when selling into the enterprise. HipChat is making an on-premises version of their product to satisfy customers who don’t trust public networks or clouds with their data. To a programmer the cloud makes perfect sense as a platform. To an enterprise the cloud can be evil. What this means is you must make flexible tech stack choices. If you rely 100% on AWS services, it will be nearly impossible to move your system to another datacenter. This may not matter for Netflix, but it matters if you want to sell into the enterprise market.

  • Scale up to get some breathing space. While you are waiting to figure out what you are going to do next in your architecture, for very little money you can scale up and give yourself a few more months of breathing room.

  • Pick what can’t fail. HipChat made never losing user chat history a priority, so their architecture reflects this priority to the point of saving chat to disk and then later reloading when the downed systems recover.

  • Go native. Your customers are on lots of different platforms and a native app would give the best experience. For a startup that’s a lot of resources, too many. So at some point it makes sense to sell to a company with more resources so you can build a better product.

  • Feature and group flags make a better release practice. If you can select which groups see a feature and if you can turn off features in production and test, then you don’t have to be so afraid of releasing new builds.

  • Select technologies you feel really confident with. ElasticSearch’s ability to expand horizontally to deal with growth helped HipChat feel very confident in their solution. That users would be happy. And that’s a good feeling.

  • Become part of the process flow and you become more valuable and hard to remove. HipChat as a natural meeting point between people and tools is also a natural point to write bots that implement all sorts of useful work flow hacks. This makes HipChat a platform play in the enterprise. It enables features that wouldn’t otherwise be buildable. And if you can do that you are hard to get rid of.

  • AWS needs a separate node presence bus. It’s ridiculous that in a cloud, which is a thing in itself, machine presence information isn’t available from an objective third-party source. If you look at a rack, it will often have a separate presence bus so slots can know if other slots are available. This way you don’t have to guess. In the cloud, software uses primitive TCP-based connection techniques and heartbeats to guess when another node is down, which can lead to split-brain problems and data loss on failovers. It’s time to evolve and take a huge step towards reliability.

  • Product decisions drive stack decisions. HipChat Server drives decisions on the tech stack. Redis Cluster can be self-hosted. Amazon DynamoDB wasn’t an option because they are launching a hosted service behind the firewall.

  • Throwing horsepower at a problem will only get you so far. You need to capacity plan, even in the cloud. Unless your architecture is completely Cloud Native from the start, it will have load inflection points past which it can no longer handle the load. Look at your growth rate. Project out. What will break? What will you do about it? And don’t make the same mistake again. How will HipChat deal with 4 billion messages? That’s not clear.

  • Know the limits of your system. EBS has a 1 TB storage limit, which is a lot, but if your storage is approaching that limit there needs to be a plan. Similarly, if your database, like Couch, uses double the disk space during a compaction phase that will impact your limits.

  • The world will surprise you. Six months ago HipChat thought Redis was going to be the weakest link, but it’s still going strong; Couch and EBS turned out to be the weakest links.

[Reposted from http://www.cnblogs.com/huangfox/p/3625004.html]

 

Chinese translation of this article: http://www.csdn.net/article/2014-01-16/2818165-how-hipchat-stores-and-indexes-billions-of-messages
