
Spring Batch: a framework for large-volume batch and parallel processing

Read more:
Spring Batch Documentation:
http://static.springsource.org/spring-batch/reference/index.html
Use Cases for Spring Batch:
http://static.springsource.org/spring-batch/cases/index.html
Spring Batch Tutorial:
http://www.mkyong.com/tutorials/spring-batch-tutorial/comment-page-1/#comment-138186



Anything Spring Batch can do, Hadoop can do as well. But Spring Batch provides a Java batch-job framework whose value lies in managing your jobs for you (monitoring, flow control, restart, and so on); you can also regard it as a standard, so that when rolling your own framework you don't overlook a lot of details. That said, pushing all of Spring Batch's work into Hadoop would likely be overkill. Spring Batch is quite good for ordinary batch tasks, while Hadoop is better suited to things like data mining: Spring Batch fits data processing at a modest scale, whereas Hadoop is clearly meant for large-scale computation and processing.

The batch framework in Java EE 7 (JSR-352) is essentially aligned with Spring Batch. For a comparison of the two, see:
https://blog.codecentric.de/en/2013/07/spring-batch-and-jsr-352-batch-applications-for-the-java-platform-differences/

On Step:
http://docs.spring.io/spring-batch/reference/html/configureStep.html
http://www.mkyong.com/spring-batch/spring-batch-hello-world-example/
http://java.dzone.com/articles/chunk-oriented-processing
Quote:
Spring Batch provides two kinds of step:
1. Chunk-oriented task, also called a READ-PROCESS-WRITE task.
2. TaskletStep-oriented task, also called a single-operation task (i.e. the Tasklet interface). The Tasklet is a simple interface that has one method, execute, which will be called repeatedly by the TaskletStep until it either returns RepeatStatus.FINISHED or throws an exception to signal a failure. Each call to the Tasklet is wrapped in a transaction (i.e. all DB operations within one TaskletStep invocation happen in a single transaction, so you need not worry about a failure during the invocation corrupting the data). Tasklet implementors might call a stored procedure, a script, or a simple SQL update statement. To create a TaskletStep, the 'ref' attribute of the <tasklet/> element should reference a bean defining a Tasklet object; no <chunk/> element should be used within the <tasklet/>. (A minimal Tasklet sketch follows this quote.)
In Spring Batch, a Job consists of many Steps and each Step consists of a READ-PROCESS-WRITE task or a single-operation task (tasklet).
1 Job = Many Steps.
1 Step = 1 READ-PROCESS-WRITE or 1 Tasklet. (Exactly one: either a chunk-oriented task or a TaskletStep-oriented task.)
Job = {Step 1 -> Step 2 -> Step 3} (chained together).
Basically, for a step that fits the IN-PROCESS-OUT model, use a chunk-oriented task; for a step that only needs IN or OUT (or neither, e.g. cleaning up resources or truncating a DB table), use a TaskletStep-oriented task.
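
As a minimal sketch of a TaskletStep-oriented task (not from the referenced docs; the table name and the use of JdbcTemplate are assumptions for illustration), a clean-up Tasklet might look like this:

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical single-operation task: truncate a staging table before the next step runs.
public class TruncateStagingTableTasklet implements Tasklet {

    private final JdbcTemplate jdbcTemplate;

    public TruncateStagingTableTasklet(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        // The whole call runs inside one transaction managed by the TaskletStep.
        jdbcTemplate.execute("TRUNCATE TABLE staging_orders"); // table name is made up
        // Returning FINISHED tells the TaskletStep to stop calling execute().
        return RepeatStatus.FINISHED;
    }
}

A bean defining this class is what the 'ref' attribute of <tasklet/> would point at.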



Mixing Spring Batch with Spring Integration:
http://blog.springsource.org/2010/02/15/practical-use-of-spring-batch-and-spring-integration/
http://static.springsource.org/spring-batch-admin/trunk/spring-batch-integration/


Open questions:
1. Passing data between steps? A workable single-threaded approach (see also the sketch after this list):
http://wangxiangblog.blogspot.com/2013/02/spring-batch-pass-data-across-steps.html
2. Can the processor handle items in bulk? Likewise, can items be read in bulk? Related:
http://forum.spring.io/forum/spring-projects/batch/63873-itemreader-returning-one-list
3. If one item is read but the processor turns it into multiple items (i.e. the processor's input is a single object but its output is a list), how is that handed to the writer? Related:
http://forum.spring.io/forum/spring-projects/batch/111650-itemprocessor-receiving-one-item-returning-more-than-one
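
For question 1, the mechanism Spring Batch itself offers is to write values into the step ExecutionContext and promote them to the job ExecutionContext with an ExecutionContextPromotionListener. A hedged sketch (the key, value and class names are hypothetical):

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.listener.ExecutionContextPromotionListener;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

// Step 1: store a value in the *step* ExecutionContext (key and value are made up).
public class ProducingTasklet implements Tasklet {

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        chunkContext.getStepContext().getStepExecution()
                .getExecutionContext().put("exportFile", "/tmp/out.xml");
        return RepeatStatus.FINISHED;
    }

    // Register this listener on step 1; when the step finishes it copies the listed keys
    // from the step ExecutionContext to the job ExecutionContext, where a later step can
    // read them (e.g. via stepExecution.getJobExecution().getExecutionContext()).
    public static ExecutionContextPromotionListener promotionListener() {
        ExecutionContextPromotionListener listener = new ExecutionContextPromotionListener();
        listener.setKeys(new String[] {"exportFile"});
        return listener;
    }
}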

   
Spring Batch ref:
A Job has one or more Steps, each of which has exactly one ItemReader, ItemProcessor, and ItemWriter. A Job needs to be launched (JobLauncher), and metadata about the currently running process needs to be stored (JobRepository):

Batch Stereotypes (Chapter 3. The Domain Language of Batch)

A JobLauncher uses the JobRepository to create new JobExecution objects and run them. Job and Step implementations later use the same JobRepository for basic updates of the same executions during the running of a Job. The basic operations suffice for simple scenarios, but in a large batch environment with hundreds of batch jobs and complex scheduling requirements, more advanced access of the meta data is required (4.5. Advanced Meta-Data Usage)
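
To make that concrete, a hedged sketch of launching a Job through the JobLauncher (the bean names and the parameter key are assumptions):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class LaunchExample {

    // jobLauncher and exportJob would be injected Spring beans (names are hypothetical).
    public JobExecution launch(JobLauncher jobLauncher, Job exportJob) throws Exception {
        // JobParameters identify the JobInstance: same Job + same parameters = same JobInstance.
        JobParameters params = new JobParametersBuilder()
                .addString("targetDate", "2014-06-13")
                .toJobParameters();
        // The launcher obtains a JobExecution from the JobRepository and runs the job.
        return jobLauncher.run(exportJob, params);
    }
}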








2.4. Meta Data Access Improvements






3.1. Job



Spring Batch uses a 'Chunk Oriented' processing style within its most common implementation. Chunk oriented processing refers to reading the data one at a time, and creating 'chunks' that will be written out, within a transaction boundary (one chunk-oriented read/write cycle runs in a single transaction!). One item is read in from an ItemReader, handed to an ItemProcessor, and aggregated. Once the number of items read equals the commit interval, the entire chunk is written out via the ItemWriter, and then the transaction is committed.

5.1. Chunk-Oriented Processing
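
Conceptually, the quoted behaviour reduces to roughly the following loop. This is only an illustrative sketch, not the framework's actual implementation; transaction demarcation and the commit interval really come from the step configuration:

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;

public class ChunkLoopSketch {

    // One chunk: read up to commitInterval items, process each one, write them all together,
    // after which the surrounding transaction is committed.
    static <I, O> void processOneChunk(ItemReader<I> reader,
                                       ItemProcessor<I, O> processor,
                                       ItemWriter<O> writer,
                                       int commitInterval) throws Exception {
        List<O> chunk = new ArrayList<>();
        for (int i = 0; i < commitInterval; i++) {
            I item = reader.read();                 // null means the reader is exhausted
            if (item == null) {
                break;
            }
            O processed = processor.process(item);  // may return null to filter the item out
            if (processed != null) {
                chunk.add(processed);
            }
        }
        writer.write(chunk);                        // the whole chunk is written in one call
    }
}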




APIs:
Quote:

Job - A Job is an entity that encapsulates an entire batch process.
JobInstance - A JobInstance refers to the concept of a logical job run. You can think of it as: JobInstance = Job + JobParameters.
JobExecution - A JobExecution refers to the technical concept of a single attempt to run a Job. An execution may end in failure or success, but the JobInstance corresponding to a given execution will not be considered complete unless the execution completes successfully.
JobParameters - JobParameters is a set of parameters used to start a batch job. "how is one JobInstance distinguished from another?" The answer is: JobParameters.
Job conclusion - A Job defines what a job is and how it is to be executed, and JobInstance is a purely organizational object to group executions together, primarily to enable correct restart semantics. A JobExecution, however, is the primary storage mechanism for what actually happened during a run


Step - A Step is a domain object that encapsulates an independent, sequential phase of a batch job. Therefore, every Job is composed entirely of one or more steps. A Step contains all of the information necessary to define and control the actual batch processing. As with Job, a Step has an individual StepExecution that corresponds with a unique JobExecution.
StepExecution - A StepExecution represents a single attempt to execute a Step. A new StepExecution will be created each time a Step is run, similar to JobExecution. However, if a step fails to execute because the step before it fails, there will be no execution persisted for it. A StepExecution will only be created when its Step is actually started.

Tasklet - A simple interface with a single execute method, called repeatedly by a TaskletStep until it returns RepeatStatus.FINISHED or throws an exception (see the Step section above).
Chunk - The group of items read and processed within one transaction and written out together once the commit interval is reached (see Chunk-Oriented Processing above).
ExecutionContext - An ExecutionContext is a collection of key/value pairs that are persisted by the framework and provide a place to store persistent data that is scoped to a StepExecution or JobExecution. This storage is useful for example in stateful ItemReaders where the current row being read from needs to be recorded.
JobListener -

JobRepository - JobRepository is the persistence mechanism for all of the Stereotypes such as JobInstance/JobParameters/JobExecution/StepExecution/ExecutionContext and so on. It provides CRUD operations for JobLauncher, Job, and Step implementations. When a Job is first launched, a JobExecution is obtained from the repository, and during the course of execution StepExecution and JobExecution implementations are persisted by passing them to the repository.
JobLauncher - JobLauncher represents a simple interface for launching a Job with a given set of JobParameters. It is expected that implementations will obtain a valid JobExecution from the JobRepository and execute the Job.
JobExplorer - Provides the ability to query the repository for existing executions; you can think of it as a read-only version of the JobRepository.
JobRegistry - A JobRegistry (and its parent interface JobLocator) is not mandatory, but it can be useful if you want to keep track of which jobs are available in the context. It is also useful for collecting jobs centrally in an application context when they have been created elsewhere (e.g. in child contexts). Custom JobRegistry implementations can also be used to manipulate the names and other properties of the jobs that are registered.
JobOperator - the JobRepository provides CRUD operations on the meta-data, and the JobExplorer provides read-only operations on the meta-data. However, those operations are most useful when used together to perform common monitoring tasks such as stopping, restarting, or summarizing a Job, as is commonly done by batch operators. Spring Batch provides for these types of operations via the JobOperator interface.
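
As an illustration of the kind of operator task the JobOperator covers (a hedged sketch; the job name is hypothetical and jobOperator would be an injected bean):

import java.util.Set;
import org.springframework.batch.core.launch.JobOperator;

public class OperatorExample {

    // Print a summary of each running execution of the named job and ask it to stop.
    public void stopRunningExecutions(JobOperator jobOperator) throws Exception {
        Set<Long> running = jobOperator.getRunningExecutions("exportJob");
        for (Long executionId : running) {
            System.out.println(jobOperator.getSummary(executionId)); // human-readable status line
            jobOperator.stop(executionId); // request a graceful stop; restart(executionId)
                                           // can be used once the execution has actually stopped
        }
    }
}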

ItemReader - ItemReader is an abstraction that represents the retrieval of input for a Step, one item at a time. When the ItemReader has exhausted the items it can provide, it will indicate this by returning null. The basic contract of the ItemReader is that it is forward only.
ItemProcessor - ItemProcessor is an abstraction that represents the business processing of an item. While the ItemReader reads one item, and the ItemWriter writes them, the ItemProcessor provides access to transform or apply other business processing.
ItemWriter - ItemWriter is an abstraction that represents the output of a Step, one batch or chunk of items at a time. Generally, an item writer has no knowledge of the input it will receive next, only the item that was passed in its current invocation.
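
For example, a minimal ItemProcessor (the types and filtering rule are made up for illustration) that transforms one item and can also filter it by returning null:

import org.springframework.batch.item.ItemProcessor;

// Hypothetical processor: trims a raw line and drops blank ones (returning null
// filters the item out, so it never reaches the ItemWriter).
public class TrimmingProcessor implements ItemProcessor<String, String> {

    @Override
    public String process(String item) {
        String trimmed = item.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }
}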




Introducing Spring Batch series (three parts):
http://keyholesoftware.com/2012/06/22/introducing-spring-batch/




Batch processing in Java with Spring batch (four parts):
http://java-success.blogspot.com/2012/06/batch-processing-in-java-with-spring.html




Sources:
A rough overview (slides in Chinese):
http://www.slideshare.net/chijq/spring-batch
Spring Batch – Imperfect Yet Worthwhile:
http://www.summa-tech.com/blog/2012/01/23/spring-batch-imperfect-yet-worthwhile/
http://www.davenkin.me/post/2012-10-17/40039048526

Looking for some good examples?
Spring Batch - Hello World:
http://java.dzone.com/news/spring-batch-hello-world-1
Quote:
A batch Job is composed of one or more Steps. A JobInstance represents a given Job, parametrized with a set of typed properties called JobParameters. Each run of a JobInstance is a JobExecution. Imagine a job reading entries from a database, generating an XML representation of them, and then doing some clean-up. We have a Job composed of 2 steps: reading/writing and clean-up. If we parametrize this job by the date of the generated data, then our Friday the 13th job is a JobInstance. Each time we run this instance (if a failure occurs, for instance) is a JobExecution. This model gives great flexibility regarding how jobs are launched and run. This naturally brings us to launching jobs with their job parameters, which is the responsibility of JobLauncher. Finally, various objects in the framework require a JobRepository to store runtime information related to the batch execution. In fact, the Spring Batch domain model is much more elaborate, but this will suffice for our purpose.
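
A hedged Java-configuration sketch of the two-step job that quote describes (read/write then clean-up), in the JobBuilderFactory/StepBuilderFactory style; the reader, writer and clean-up tasklet beans are assumed to exist elsewhere:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class ExportJobConfig {

    // Step 1 is chunk-oriented read/write; Step 2 is a tasklet-based clean-up.
    @Bean
    public Job exportJob(JobBuilderFactory jobs, StepBuilderFactory steps,
                         ItemReader<String> reader, ItemWriter<String> writer,
                         Tasklet cleanupTasklet) {
        Step readWrite = steps.get("readWrite")
                .<String, String>chunk(100)   // commit interval: 100 items per transaction
                .reader(reader)
                .writer(writer)
                .build();
        Step cleanup = steps.get("cleanup")
                .tasklet(cleanupTasklet)
                .build();
        return jobs.get("exportJob")
                .start(readWrite)
                .next(cleanup)                // Job = {Step 1 -> Step 2}, chained together
                .build();
    }
}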
What happens if a process throws an exception?
http://alain-cieslik.com/2011/06/06/springbatch-what-append-if-a-process-throws-an-exception/
http://forum.springsource.org/showthread.php?61042-Spring-Batch-beginners-tutorial
http://stackoverflow.com/questions/1609793/how-can-i-get-started-with-spring-batch