
Generating Thousands of PDFs on EC2 with Ruby


The Problem

For about two months, we’ve been working on a static website that exposes the results of a complicated economics model to non-economists. We decided to make the site static because of the overhead involved in computing the results and the proprietary nature of the model: we would simply pre-generate the output for all valid permutations of the inputs. The visitor could then choose her inputs from a questionnaire, click a button, and immediately be shown the results.

The caveat of this decision is that in addition to the numerical outputs, three graphs and a summary (both in HTML and PDF) would need to be generated for each permutation. Since there were 3600 permutations, this would amount to 18000 files in total. Initial local runs of our generation process took about 30 seconds for each permutation, mostly due to embedding the graph images into the PDF. On a single machine, that would take 30 hours of uninterrupted processing! Clearly, this was a job for “the cloud”.
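The back-of-envelope math behind those numbers is straightforward:

```ruby
# Back-of-envelope numbers from the paragraph above.
permutations = 3600
files_per_permutation = 5     # 3 graphs + a summary in both HTML and PDF
seconds_per_permutation = 30  # measured in initial local runs

total_files  = permutations * files_per_permutation
serial_hours = permutations * seconds_per_permutation / 3600.0

puts "#{total_files} files, #{serial_hours} hours on a single machine"
# prints "18000 files, 30.0 hours on a single machine"
```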

The Tools

Before we get into a discussion of the process of configuring and running the jobs, here’s an overview of the tools we used to tackle the problem.

We initially considered using Amazon’s Elastic MapReduce to run the generation jobs, but it requires Java and Hadoop, and we had already invested a lot of time in our Ruby tool chain; it is also nigh impossible to automatically install Ruby and ImageMagick on an EMR node. Thus, we decided to use vanilla EC2 with the tools shown below.

Prawn

Prawn is the new kid in town for generating PDFs in Ruby. It is well-written, easy to start using, and a great improvement on PDF::Writer.

Gruff

Gruff was not the most obvious choice for this project. We liked the flexibility and hackability of Scruffy, but translating its output to PDF was a nightmare, and there were some strange inconsistencies in it. In the end, Gruff proved fast, reliable, and simple. The major caveat, as described above, is that embedding images in Prawn is orders of magnitude slower than simply drawing on the canvas.

Haml, Sass, Compass

Haml has been around for 3 years now. Many people cringe at the indentation-sensitive syntax, but it prevents so much frustration that it was a good fit for the project. Naturally, we also used its cousin Sass, and the new-ish CSS/Sass meta-framework Compass. The combination of these three made it really quick to get started with the static site and make design changes as we iterated.

Chef

You may have already heard of the awesome configuration management tool, Chef. Chef allows you to ensure consistent configuration of your servers using a nice Ruby DSL and a huge library of community-developed “cookbooks” that cover many common use-cases. We were given the chance to try out an alpha of their “Chef Platform”, which is essentially a scalable, hosted, multi-tenant version of the server component of Chef and uses the pre-release version of Chef 0.8. With that, “knife” (the new CLI tool for interacting with the Chef server API), and the custom Opscode AMI, we were well-equipped to quickly deploy a bunch of EC2 nodes. We’ll talk more about the details of the Chef recipes below.

AMQP and RabbitMQ

What’s the best way to distribute a bunch of one-time jobs to a slew of independent machines? A message queue, of course! Despite the version packaged with Ubuntu 9.04 being pretty old, we chose RabbitMQ, having used it on another project. AMQP is also well supported in Ruby.
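The distribution pattern itself is easy to sketch without any broker: a thread-safe queue feeding independent workers that pull jobs as fast as they can handle them. Here Ruby’s stdlib `Queue` stands in for RabbitMQ (the names and job count are illustrative, not from the project):

```ruby
# Local stand-in for the RabbitMQ work-queue pattern: one shared queue,
# several independent consumers, each pulling the next job when free.
jobs    = Queue.new   # stands in for the 'jobs' queue on RabbitMQ
results = Queue.new

workers = 4
20.times { |i| jobs << "dataset-#{i}.csv" }
workers.times { jobs << :done }   # one stop marker per worker

threads = Array.new(workers) do
  Thread.new do
    # The same "subscribe and process" loop a real worker runs,
    # until it sees the stop marker.
    while (job = jobs.pop) != :done
      results << "processed #{job}"
    end
  end
end
threads.each(&:join)

puts results.size   # => 20
```

The broker version adds durability and network transparency, but the shape of the work, independent jobs pulled by whoever is idle, is the same.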

The Process

Preparing

The first step to start our processing job was to get the data up to S3. You could do this any number of ways, but we created a bucket solely for the data and uploaded all 3600 CSV files with a desktop client.

Next, we created the scripts for the workers and the job initiator. We would potentially need to run the process multiple times, so we chose Aman Gupta’s EventMachine-based AMQP client.

Here’s the worker script, which was set up as a daemon using runit:

#!/usr/bin/env ruby

$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib'))

require 'rubygems'
require 'eventmachine'
require 'mq'
require 'custom_libraries'

Signal.trap('INT')  { AMQP.stop { EM.stop } }
Signal.trap('TERM') { AMQP.stop { EM.stop } }

AMQP.start(:host => ARGV.shift) do
  MQ.prefetch(1)
  MQ.queue('jobs').bind(MQ.direct('jobs')).subscribe do |header, body|
    GenerationJob.new(body).generate
  end
end

Basically, it connects to the RabbitMQ host specified on the command line, subscribes to the job queue, and starts processing messages.

The job initiation script is almost as simple:

#!/usr/bin/env ruby

$: << File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib'))

require 'rubygems'
require 'eventmachine'
require 'mq'

AWSID  = (ENV['AMAZON_ACCESS_KEY_ID']     || 'XXXXXXXXXXXXXXXXXXXX')
AWSKEY = (ENV['AMAZON_SECRET_ACCESS_KEY'] || 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXX')

Signal.trap('INT')  { AMQP.stop { EM.stop } }
Signal.trap('TERM') { AMQP.stop { EM.stop } }

host          = ARGV.shift
input_bucket  = "custom-data"
output_bucket = "custom-output"
output_prefix = Time.now.strftime("/%Y%m%d%H%M%S")
count = 0

AMQP.start(:host => host) do
  exchange = MQ.direct('jobs')

  STDIN.each_line do |file|
    count += 1
    $stdout.print "."; $stdout.flush
    payload = {
      :input  => [input_bucket, file.strip],
      :output => [output_bucket, output_prefix],
      :s3id   => AWSID,
      :s3key  => AWSKEY
    }
    exchange.publish(Marshal.dump(payload))
  end

  AMQP.stop { EM.stop }
end

puts "#{count} data enqueued for generation."

It reads the names of the files stored in the S3 bucket from STDIN and adds a job for each to the queue. Before running the job, we created a text file listing each of the 3600 files, one per line, which could then be piped to this script on the command line. The script passes along all the information each worker needs to find the data, and where to put it when completed. We scoped the output by the time the job was enqueued, making it easier to discern older runs from newer ones.
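Both the timestamped output prefix and the marshaled payload can be illustrated with the stdlib alone (the file and bucket names below are illustrative):

```ruby
# A run started at a fixed moment produces a sortable output prefix,
# the same way the initiator script does with Time.now.
t = Time.utc(2009, 11, 5, 14, 30, 0)
output_prefix = t.strftime("/%Y%m%d%H%M%S")
# => "/20091105143000"

# The payload each worker receives round-trips through Marshal unchanged,
# so the worker can reconstruct the exact hash the initiator built.
payload = {
  :input  => ["custom-data", "permutation-0001.csv"],
  :output => ["custom-output", output_prefix]
}
restored = Marshal.load(Marshal.dump(payload))
```

Because the prefix sorts lexicographically by timestamp, listing the output bucket naturally groups each run’s 18000 files together.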

Configuring the cloud

Now that the meat of the job was ready, we dived into configuring the servers with Chef. We created a Chef repository, added the Opscode cookbooks as a submodule, and uploaded these default cookbooks to the server:

  • apt
  • build-essential
  • erlang
  • imagemagick
  • runit
  • ruby

We created some additional cookbooks to fill out the generic setup:

  • rabbitmq - Installs and configures RabbitMQ
  • gemcutter - Upgrades Rubygems, installs Gemcutter and makes gemcutter.org the default gem source

Lastly we created our custom cookbook, which sets up all the libraries we need, downloads the code, and sets up the worker process as a runit service. Let’s walk through the default recipe in that cookbook:


%w{haml gruff fastercsv activesupport prawn prawn-core prawn-format prawn-layout eventmachine amqp aws-s3}.each do |g|
  gem_package g
end


This simply installs all of the gems we need to run the job.


# Find the node that has the job queue
q = search(:node, "run_list:role*job_queue*")[0].first

Here we use Chef’s search feature to find the node that has RabbitMQ installed and running so we can pass it to the worker script.


# Create directory to put the code in
directory "/srv"

# Unzip the code if necessary
execute "Unpack code" do
  command "tar xzf generationjobs.tar.gz"
  cwd "/srv"
  action :nothing
end

# Download the code
remote_file "/srv/generationjobs.tar.gz" do
  source "generationjobs.tar.gz"
  notifies :run, resources(:execute => "Unpack code"), :immediate
end

# Create the directory where output goes
directory "/srv/generationjobs/tmp" do
  recursive true
end


In these four resources, we set up the working directory for the worker process, download the project code (stored on the Chef server as a tarball), and unpack it. The interesting thing about this sequence is that we don’t automatically unpack the tarball. Since the Chef client runs periodically in the background, we don’t want to unpack the code on every run, only when it has changed. We use an immediate notification from the remote_file resource to tell the unpacking to run when the tarball is a new version; remote_file won’t download the tarball unless the file checksum has changed.
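The change-detection idea Chef applies here can be sketched in plain Ruby, outside Chef: record a checksum, and only redo the expensive step when it changes (the function and file names below are ours, for illustration):

```ruby
require 'digest'
require 'tmpdir'

# Re-run an expensive step only when the source file's checksum changes --
# the same idempotence remote_file + notifies give the recipe above.
def unpack_if_changed(tarball, stamp)
  sum = Digest::SHA256.file(tarball).hexdigest
  return :skipped if File.exist?(stamp) && File.read(stamp) == sum
  # ... the expensive unpack would happen here ...
  File.write(stamp, sum)
  :unpacked
end

results = []
Dir.mktmpdir do |dir|
  tarball = File.join(dir, "generationjobs.tar.gz")
  stamp   = File.join(dir, ".tarball.sha256")

  File.write(tarball, "v1")
  results << unpack_if_changed(tarball, stamp)  # first run: unpack
  results << unpack_if_changed(tarball, stamp)  # unchanged: skip
  File.write(tarball, "v2")
  results << unpack_if_changed(tarball, stamp)  # new version: unpack again
end
```

The periodic Chef client run then becomes cheap: most runs find nothing changed and do nothing.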


# Create runit service for worker
runit_service "generationworker" do
  options({:worker_bin => "/srv/generationjobs/bin/worker", :queue_host => q})
  only_if { q }
end

The last step is a pseudo-resource defined in the “runit” cookbook that creates all the pieces of a runit daemon for you; we only had to create the configuration templates for the daemon and put them in our cookbook. The additional options passed to the runit_service tell the templates the location of the worker code and the RabbitMQ host. We also take advantage of the only_if option so the service won’t be created if there’s no host with RabbitMQ on it yet.

The last step in the Chef configuration was to create two roles, one for the queue and one for the worker. Naturally, the node that has the queue can also act as a worker. Here’s what the role JSON documents look like:


// The queue role
{
  "name": "job_queue",
  "chef_type": "role",
  "json_class": "Chef::Role",
  "default_attributes": {},
  "description": "Provides a message queue for sending jobs out to the workers.",
  "recipes": [
    "erlang",
    "rabbitmq"
  ],
  "override_attributes": {}
}

// The worker role
{
  "name": "job_worker",
  "chef_type": "role",
  "json_class": "Chef::Role",
  "default_attributes": {},
  "description": "Processes the data from a queue into the PDF, PNG and HTML output.",
  "recipes": [
    "apt",
    "build-essential",
    "ruby",
    "gemcutter",
    "imagemagick::rmagick",
    "runit",
    "custom"
  ],
  "override_attributes": {}
}

Running the jobs on EC2

Now comes the fun (and easy) part! Armed with an AWS account, an EC2 certificate, and knife, we began firing up nodes to run the job. With Opscode’s preconfigured Chef AMI, you can pass a JSON node configuration in the EC2 initial data. First we generated the configuration for the job queue node:

$ knife instance_data --run-list="role[job_queue] role[job_worker]" | pbcopy

With the JSON configuration in the clipboard, we could paste it into ElasticFox (or the AWS Management console) and fire up the first EC2 node. Several minutes later, the node was ready to go. Now, we created a similar configuration, but with only the worker role:

$ knife instance_data --run-list="role[job_worker]" | pbcopy

Then we fired up nine of the nodes with that configuration and proceeded to initiate the job:

$ ssh -i ~/ec2-keys/my-ec2-cert.pem root@ec2-public-hostname
[root@ec2-public-hostname]$ cd /srv/generationjobs
[root@ec2-public-hostname]$ bin/startjobs localhost > manifest.txt

After all the preparation, that’s all there was to it! A little over an hour later, we had generated PNG graphs, PDF, and HTML from all 3600 datasets.

Conclusion

It’s no mystery why “cloud computing” is so popular. The ability to quickly and cheaply access computational power, utilize it, and then dispose of it is really appealing, and tools like Chef and EC2 make it really easy to accomplish. What can you cook up?
