The Problem
For about two months, we’ve been working on a static website that exposes the results of a complicated economics model to non-economists. We decided to make the site static because of the overhead involved in computing the results and the proprietary nature of the model. We would simply pre-generate the output for all valid permutations of the inputs. The visitor could then choose her inputs from a questionnaire, click a button, and immediately be shown the results.
The caveat of this decision is that in addition to the numerical outputs, three graphs and a summary (both in HTML and PDF) would need to be generated for each permutation. Since there were 3600 permutations, this would amount to 18000 files in total. Initial local runs of our generation process took about 30 seconds for each permutation, mostly due to embedding the graph images into the PDF. On a single machine, that would take 30 hours of uninterrupted processing! Clearly, this was a job for “the cloud”.
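The arithmetic behind those figures can be checked with a quick sketch. The five files per permutation are inferred from "three graphs and a summary (both in HTML and PDF)", and `ideal_hours` is a hypothetical helper that assumes jobs spread perfectly evenly with no startup or transfer overhead:

```ruby
# Back-of-the-envelope workload estimate using the figures quoted above.
PERMUTATIONS          = 3600
FILES_PER_PERMUTATION = 5    # 3 graphs + HTML summary + PDF summary (inferred)
SECONDS_PER_JOB       = 30   # measured on a local machine

TOTAL_FILES  = PERMUTATIONS * FILES_PER_PERMUTATION
SERIAL_HOURS = PERMUTATIONS * SECONDS_PER_JOB / 3600.0

# Ideal wall-clock hours if the jobs are split evenly across `workers`
# machines (hypothetical helper; ignores all overhead).
def ideal_hours(jobs, seconds_per_job, workers)
  (jobs.to_f / workers).ceil * seconds_per_job / 3600.0
end

puts "#{TOTAL_FILES} files, #{SERIAL_HOURS} hours on one machine"
puts "#{ideal_hours(PERMUTATIONS, SECONDS_PER_JOB, 10)} hours across ten machines"
```

The per-job time on a different machine will of course differ from the local measurement, so this is only a planning estimate.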
The Tools
Before we get into a discussion of the process of configuring and running the jobs, here’s an overview of the tools we used to tackle the problem.
We initially considered using Amazon’s Elastic MapReduce to run the generation jobs, but it requires Java and Hadoop, and we had already invested a lot of time in our Ruby tool chain. It is also nigh impossible to automatically install Ruby and ImageMagick on an EMR node. Thus, we decided to use vanilla EC2 with the tools shown below.
Prawn
Prawn is the new kid in town for generating PDF in Ruby. Prawn is pretty well-written and easy to start using, and greatly improves on PDF::Writer.
Gruff
Gruff was not the most obvious choice for this project. We liked the flexibility and hackability of Scruffy, but translating its output to PDF was a nightmare and its rendering had some strange inconsistencies. In the end, Gruff proved fast, reliable, and simple. The major caveat, as described above, is that embedding images in Prawn is orders of magnitude slower than simply drawing on the canvas.
Haml, Sass, Compass
Haml has been around for 3 years now. Many people cringe at the indentation-sensitive syntax, but it prevents so much frustration that it was a good fit for the project. Naturally, we also used its cousin Sass, and the new-ish CSS/Sass meta-framework Compass. The combination of these three made it really quick to get started with the static site and make design changes as we iterated.
Chef
You may have already heard of the awesome configuration management tool, Chef. Chef allows you to ensure consistent configuration of your servers using a nice Ruby DSL and a huge library of community-developed “cookbooks” that cover many common use-cases. We were given the chance to try out an alpha of their “Chef Platform”, which is essentially a scalable, hosted, multi-tenant version of the server component of Chef and uses the pre-release version of Chef 0.8. With that, “knife” (the new CLI tool for interacting with the Chef server API), and the custom Opscode AMI, we were well-equipped to quickly deploy a bunch of EC2 nodes. We’ll talk more about the details of the Chef recipes below.
AMQP and RabbitMQ
What’s the best way to distribute a bunch of one-time jobs to a slew of independent machines? A message queue, of course! Despite the version packaged with Ubuntu 9.04 being pretty old, we chose RabbitMQ, having used it on another project. AMQP is also well supported in Ruby.
The Process
Preparing
The first step to start our processing job was to get the data up to S3. You could do this any number of ways, but we created a bucket solely for the data and uploaded all 3600 CSV files with a desktop client.
Next, we created the scripts for the workers and the job initiator. We would potentially need to run the process multiple times, so we chose Aman Gupta’s EventMachine-based AMQP client.
Here’s the worker script, which was set up as a daemon using runit:
#!/usr/bin/env ruby
$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib'))
require 'rubygems'
require 'eventmachine'
require 'mq'
require 'custom_libraries'

Signal.trap('INT') { AMQP.stop{ EM.stop } }
Signal.trap('TERM') { AMQP.stop{ EM.stop } }

AMQP.start(:host => ARGV.shift) do
  MQ.prefetch(1)
  MQ.queue('jobs').bind(MQ.direct('jobs')).subscribe do |header, body|
    GenerationJob.new(body).generate
  end
end
Basically, it connects to the RabbitMQ host specified on the command line, subscribes to the job queue, and starts processing messages.
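The idea of several workers pulling from one shared job queue can be tried without a broker. The following is only an in-process analogy built on Ruby’s standard `Queue` and threads, not the AMQP client used above; `run_workers` and the file names are made up for illustration:

```ruby
require 'thread'

# In-process stand-in for the 'jobs' queue: each worker thread pops one
# message at a time, so a slow job never holds up messages that another
# worker could be processing. All names here are illustrative.
def run_workers(messages, worker_count)
  jobs    = Queue.new
  results = Queue.new
  messages.each { |m| jobs << m }
  worker_count.times { jobs << :done }   # one stop marker per worker

  workers = (1..worker_count).map do |id|
    Thread.new do
      while (msg = jobs.pop) != :done
        results << [id, msg]             # stand-in for GenerationJob#generate
      end
    end
  end
  workers.each(&:join)

  # Drain the results into an array of [worker_id, message] pairs.
  Array.new(results.size) { results.pop }
end

processed = run_workers(%w[a.csv b.csv c.csv d.csv e.csv], 3)
puts processed.map(&:last).sort.inspect
```

Which worker handles which message is not deterministic, but every message is processed exactly once, which is the property the real queue gives us across machines.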
The job initiation script is almost as simple:
#!/usr/bin/env ruby
$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib'))
require 'rubygems'
require 'eventmachine'
require 'mq'

AWSID = (ENV['AMAZON_ACCESS_KEY_ID'] || 'XXXXXXXXXXXXXXXXXXXX')
AWSKEY = (ENV['AMAZON_SECRET_ACCESS_KEY'] || 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXX')

Signal.trap('INT') { AMQP.stop{ EM.stop } }
Signal.trap('TERM') { AMQP.stop{ EM.stop } }

host = ARGV.shift
input_bucket = "custom-data"
output_bucket = "custom-output"
output_prefix = Time.now.strftime("/%Y%m%d%H%M%S")
count = 0

AMQP.start(:host => host) do
  exchange = MQ.direct('jobs')
  STDIN.each_line do |file|
    count += 1
    $stdout.print "."; $stdout.flush
    payload = {
      :input => [input_bucket, file.strip],
      :output => [output_bucket, output_prefix],
      :s3id => AWSID,
      :s3key => AWSKEY
    }
    exchange.publish(Marshal.dump(payload))
  end
  AMQP.stop { EM.stop }
end
puts "#{count} data enqueued for generation."
It reads from STDIN the names of the files to add to the queue; the files themselves are stored in the S3 bucket. Before running the job, we created a text file that listed each of the 3600 files, one per line, which could then be piped to this script on the command line. For each file, the script publishes a message containing all the information a worker needs to find the data and where to put the output when completed. We scoped the output by the time the job was enqueued, making it easier to tell older runs from newer ones.
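To make the message format concrete, here is a small sketch of one payload being serialized and decoded again. The bucket names come from the script above; `build_payload`, the credentials, the file name, and the timestamp are placeholders, and the decoding step is an assumption about what `GenerationJob` does with the message body, since that class isn’t shown:

```ruby
# Build a payload shaped like the one the initiator publishes, then
# round-trip it the way a worker could decode it. Credentials and the
# file name are placeholders, not real values.
def build_payload(file, enqueued_at)
  {
    :input  => ['custom-data', file.strip],
    :output => ['custom-output', enqueued_at.strftime('/%Y%m%d%H%M%S')],
    :s3id   => 'PLACEHOLDER_ID',
    :s3key  => 'PLACEHOLDER_KEY'
  }
end

payload = build_payload("scenario-0001.csv\n", Time.utc(2009, 11, 2, 14, 30, 0))
wire    = Marshal.dump(payload)   # binary string sent as the message body
decoded = Marshal.load(wire)      # what a worker could reconstruct (assumed)

puts decoded[:input].inspect
puts decoded[:output].inspect
```

`Marshal.load` should only ever be used on data from a trusted source, which holds here because we control both ends of the queue.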
Configuring the cloud
Now that the meat of the job was ready, we dived into configuring the servers with Chef. We created a Chef repository, added the Opscode cookbooks as a submodule, and uploaded these default cookbooks to the server:
- apt
- build-essential
- erlang
- imagemagick
- runit
- ruby
We created some additional cookbooks to fill out the generic setup:
- rabbitmq - Installs and configures RabbitMQ
- gemcutter - Upgrades Rubygems, installs Gemcutter and makes gemcutter.org the default gem source
Lastly we created our custom cookbook, which sets up all the libraries we need, downloads the code, and sets up the worker process as a runit service. Let’s walk through the default recipe in that cookbook:
%w{haml gruff fastercsv activesupport prawn prawn-core prawn-format prawn-layout eventmachine amqp aws-s3}.each do |g|
  gem_package g
end
This simply installs all of the gems that we need to run the job.
# Find the node that has the job queue
q = search(:node, "run_list:role*job_queue*")[0].first
Here we use Chef’s search feature to find the node that has RabbitMQ installed and running so we can pass it to the worker script.
# Create directory to put the code in
directory "/srv"

# Unzip the code if necessary
execute "Unpack code" do
  command "tar xzf generationjobs.tar.gz"
  cwd "/srv"
  action :nothing
end

# Download the code
remote_file "/srv/generationjobs.tar.gz" do
  source "generationjobs.tar.gz"
  notifies :run, resources(:execute => "Unpack code"), :immediate
end

# Create the directory where output goes
directory "/srv/generationjobs/tmp" do
  recursive true
end
In these four resources, we set up the working directory for the worker process, download the project code (stored on the Chef server as a tarball), and unpack it. The interesting thing about this sequence is that we don’t automatically unpack the tarball. Since the Chef client runs periodically in the background, we don’t want to be unpacking the code every time, but only when it has changed. We use an immediate notification from the remote_file resource to tell the unpacking to run when the tarball is a new version; remote_file won’t download the tarball unless the file checksum has changed.
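The "only act when the checksum changed" idea can be illustrated outside of Chef. This is a plain-Ruby sketch of the pattern, not how Chef implements it internally; `run_if_changed` and the state file are hypothetical:

```ruby
require 'digest'
require 'tmpdir'

# Runs the block only when the file's SHA-256 differs from the checksum
# recorded on the previous run. Returns true if the block ran.
# (Hypothetical helper; Chef tracks this state itself.)
def run_if_changed(path, state_file)
  current  = Digest::SHA256.file(path).hexdigest
  previous = File.exist?(state_file) ? File.read(state_file) : nil
  return false if current == previous

  yield
  File.write(state_file, current)
  true
end

Dir.mktmpdir do |dir|
  tarball = File.join(dir, 'generationjobs.tar.gz')
  state   = File.join(dir, 'last_checksum')

  File.write(tarball, 'version 1')
  puts run_if_changed(tarball, state) { puts 'unpacking' }  # first run
  puts run_if_changed(tarball, state) { puts 'unpacking' }  # unchanged
  File.write(tarball, 'version 2')
  puts run_if_changed(tarball, state) { puts 'unpacking' }  # new version
end
```

The second call skips the block because nothing changed, which mirrors why the `execute` resource is declared with `action :nothing` and only runs when notified.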
# Create runit service for worker
runit_service "generationworker" do
  options({:worker_bin => "/srv/generationjobs/bin/worker", :queue_host => q})
  only_if { q }
end
The last step is a pseudo-resource defined in the “runit” cookbook that creates all the pieces of a runit daemon for you; we only had to create the configuration templates for the daemon and put them in our cookbook. The additional options passed to runit_service tell the templates the location of the worker code and the RabbitMQ host. We also take advantage of the only_if option so the service won’t be created if there’s no host with RabbitMQ on it yet.
The last step in the Chef configuration was to create two roles, one for the queue and one for the worker. Naturally, the node that has the queue can also act as a worker. Here’s what the role JSON documents look like:
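The daemon templates themselves aren’t reproduced here, so the following is only a guess at what a run script template might look like, rendered with Ruby’s standard ERB. The template text, the `RunScript` class, the assumption that the options hash is exposed to the template as `@options`, and the IP address are all hypothetical:

```ruby
require 'erb'

# Hypothetical run-script template; the real one lives in the custom
# cookbook and is not shown in the article.
RUN_TEMPLATE = <<~TEMPLATE
  #!/bin/sh
  exec 2>&1
  exec <%= @options[:worker_bin] %> <%= @options[:queue_host] %>
TEMPLATE

# Minimal stand-in for the template context: exposes the options hash
# as @options (an assumption about how the options reach the template).
class RunScript
  def initialize(options)
    @options = options
  end

  def render
    ERB.new(RUN_TEMPLATE).result(binding)
  end
end

puts RunScript.new(:worker_bin => '/srv/generationjobs/bin/worker',
                   :queue_host => '10.0.0.5').render
```

Passing the queue host as the first argument matches the worker script, which reads it with `ARGV.shift`.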
// The queue role
{
  "name": "job_queue",
  "chef_type": "role",
  "json_class": "Chef::Role",
  "default_attributes": {},
  "description": "Provides a message queue for sending jobs out to the workers.",
  "recipes": [
    "erlang",
    "rabbitmq"
  ],
  "override_attributes": {}
}

// The worker role
{
  "name": "job_worker",
  "chef_type": "role",
  "json_class": "Chef::Role",
  "default_attributes": {},
  "description": "Processes the data from a queue into the PDF, PNG and HTML output.",
  "recipes": [
    "apt",
    "build-essential",
    "ruby",
    "gemcutter",
    "imagemagick::rmagick",
    "runit",
    "custom"
  ],
  "override_attributes": {}
}
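To see what a node carrying both roles ends up running, here is a rough sketch that expands a run list into a flat recipe list using trimmed copies of the two role documents. `expand_run_list` is a hypothetical helper for illustration only; Chef’s own run-list expansion does more than this:

```ruby
require 'json'

# Trimmed copies of the two role documents (only the fields used here).
ROLES_JSON = {
  'job_queue'  => '{"name": "job_queue", "recipes": ["erlang", "rabbitmq"]}',
  'job_worker' => '{"name": "job_worker", "recipes": ["apt", "build-essential", ' \
                  '"ruby", "gemcutter", "imagemagick::rmagick", "runit", "custom"]}'
}

# Expand entries such as "role[job_queue]" into a flat, de-duplicated
# recipe list. Entries that are not roles are passed through unchanged.
def expand_run_list(run_list)
  run_list.flat_map do |item|
    role = item[/\Arole\[(.+)\]\z/, 1]
    next [item] unless role
    JSON.parse(ROLES_JSON.fetch(role), :create_additions => false)['recipes']
  end.uniq
end

puts expand_run_list(['role[job_queue]', 'role[job_worker]']).inspect
```

The order matters: listing the queue role first means Erlang and RabbitMQ are set up before the worker recipes run on that node.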
Running the jobs on EC2
Now comes the fun (and easy) part! Armed with an AWS account, an EC2 certificate, and knife, we began firing up nodes to run the job. With Opscode’s preconfigured Chef AMI, you can pass a JSON node configuration in the EC2 initial data. First we generated the configuration for the job queue node:
$ knife instance_data --run-list="role[job_queue] role[job_worker]" | pbcopy
With the JSON configuration in the clipboard, we could paste it into ElasticFox (or the AWS Management console) and fire up the first EC2 node. Several minutes later, the node was ready to go. Now, we created a similar configuration, but with only the worker role:
$ knife instance_data --run-list="role[job_worker]" | pbcopy
Then we fired up nine more nodes with that configuration and proceeded to initiate the job from the queue node:
$ ssh -i ~/ec2-keys/my-ec2-cert.pem root@ec2-public-hostname
[root@ec2-public-hostname]$ cd /srv/generationjobs
[root@ec2-public-hostname]$ bin/startjobs localhost < manifest.txt
After all the preparation, that’s all there was to it! A little over an hour later, we had generated PNG graphs, PDF, and HTML from all 3600 datasets.
Conclusion
It’s no mystery why “cloud computing” is so popular. The ability to quickly and cheaply access computational power, utilize it, and then dispose of it is really appealing, and tools like Chef and EC2 make it really easy to accomplish. What can you cook up?