
Spring Batch: a framework for large-volume batch and parallel processing

Read more:
Spring Batch Documentation:
http://static.springsource.org/spring-batch/reference/index.html
Use Cases for Spring Batch:
http://static.springsource.org/spring-batch/cases/index.html
Spring Batch Tutorial:
http://www.mkyong.com/tutorials/spring-batch-tutorial/comment-page-1/#comment-138186



Anything Spring Batch can do, Hadoop can do as well. But Spring Batch provides a Java batch-job framework whose value lies in managing your jobs for you (monitoring, flow control, restart, and so on); you can also regard it as a standard, so that when rolling your own framework you don't overlook a lot of details. That said, pushing all of Spring Batch's work into Hadoop would likely be overkill. Spring Batch is quite good for ordinary batch tasks, while Hadoop is better suited to things like data mining: Spring Batch fits data processing at a modest scale, whereas Hadoop is clearly meant for large-scale computation and processing.

The batch framework in Java EE 7 (JSR-352) is essentially aligned with Spring Batch. For a comparison of the two, see:
https://blog.codecentric.de/en/2013/07/spring-batch-and-jsr-352-batch-applications-for-the-java-platform-differences/

On Step:
http://docs.spring.io/spring-batch/reference/html/configureStep.html
http://www.mkyong.com/spring-batch/spring-batch-hello-world-example/
http://java.dzone.com/articles/chunk-oriented-processing
Quote:
Spring Batch provides two kinds of step:
1. Chunk-oriented task, also called a READ-PROCESS-WRITE task.
2. TaskletStep-oriented task, also called a single-operation task (i.e. the Tasklet interface). The Tasklet is a simple interface that has one method, execute, which will be called repeatedly by the TaskletStep until it either returns RepeatStatus.FINISHED or throws an exception to signal a failure. Each call to the Tasklet is wrapped in a transaction (i.e. all DB operations within one TaskletStep invocation happen in a single transaction, so you need not worry about a failure during the invocation corrupting the data). Tasklet implementors might call a stored procedure, a script, or a simple SQL update statement. To create a TaskletStep, the 'ref' attribute of the <tasklet/> element should reference a bean defining a Tasklet object; no <chunk/> element should be used within the <tasklet/>. (A minimal Tasklet sketch follows this quote.)
In Spring Batch, a Job consists of many Steps and each Step consists of a READ-PROCESS-WRITE task or a single-operation task (tasklet).
1 Job = Many Steps.
1 Step = 1 READ-PROCESS-WRITE or 1 Tasklet. (Exactly one: either a chunk-oriented task or a TaskletStep-oriented task.)
Job = {Step 1 -> Step 2 -> Step 3} (chained together).
Basically, for a step that fits the IN-PROCESS-OUT model, use a chunk-oriented task; for a step that only needs IN or OUT (or neither, e.g. cleaning up resources or truncating a DB table), use a TaskletStep-oriented task.
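
As a minimal sketch of a TaskletStep-oriented task (not from the referenced docs; the table name and the use of JdbcTemplate are assumptions for illustration), a clean-up Tasklet might look like this:

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical single-operation task: truncate a staging table before the next step runs.
public class TruncateStagingTableTasklet implements Tasklet {

    private final JdbcTemplate jdbcTemplate;

    public TruncateStagingTableTasklet(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        // The whole call runs inside one transaction managed by the TaskletStep.
        jdbcTemplate.execute("TRUNCATE TABLE staging_orders"); // table name is made up
        // Returning FINISHED tells the TaskletStep to stop calling execute().
        return RepeatStatus.FINISHED;
    }
}

A bean defining this class is what the 'ref' attribute of <tasklet/> would point at.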



Mixing Spring Batch with Spring Integration:
http://blog.springsource.org/2010/02/15/practical-use-of-spring-batch-and-spring-integration/
http://static.springsource.org/spring-batch-admin/trunk/spring-batch-integration/


Open questions:
1. Passing data between steps? A workable single-threaded approach (see also the sketch after this list):
http://wangxiangblog.blogspot.com/2013/02/spring-batch-pass-data-across-steps.html
2. Can the processor handle items in bulk? Likewise, can items be read in bulk? Related:
http://forum.spring.io/forum/spring-projects/batch/63873-itemreader-returning-one-list
3. If one item is read but the processor turns it into multiple items (i.e. the processor's input is a single object but its output is a list), how is that handed to the writer? Related:
http://forum.spring.io/forum/spring-projects/batch/111650-itemprocessor-receiving-one-item-returning-more-than-one
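
For question 1, the mechanism Spring Batch itself offers is to write values into the step ExecutionContext and promote them to the job ExecutionContext with an ExecutionContextPromotionListener. A hedged sketch (the key, value and class names are hypothetical):

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.listener.ExecutionContextPromotionListener;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

// Step 1: store a value in the *step* ExecutionContext (key and value are made up).
public class ProducingTasklet implements Tasklet {

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        chunkContext.getStepContext().getStepExecution()
                .getExecutionContext().put("exportFile", "/tmp/out.xml");
        return RepeatStatus.FINISHED;
    }

    // Register this listener on step 1; when the step finishes it copies the listed keys
    // from the step ExecutionContext to the job ExecutionContext, where a later step can
    // read them (e.g. via stepExecution.getJobExecution().getExecutionContext()).
    public static ExecutionContextPromotionListener promotionListener() {
        ExecutionContextPromotionListener listener = new ExecutionContextPromotionListener();
        listener.setKeys(new String[] {"exportFile"});
        return listener;
    }
}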

   
Spring Batch ref:
A Job has one or more Steps, each of which has exactly one ItemReader, ItemProcessor, and ItemWriter. A Job needs to be launched (JobLauncher), and metadata about the currently running process needs to be stored (JobRepository):

Batch Stereotypes (Chapter 3. The Domain Language of Batch)

A JobLauncher uses the JobRepository to create new JobExecution objects and run them. Job and Step implementations later use the same JobRepository for basic updates of the same executions during the running of a Job. The basic operations suffice for simple scenarios, but in a large batch environment with hundreds of batch jobs and complex scheduling requirements, more advanced access of the meta data is required (4.5. Advanced Meta-Data Usage)
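
To make that concrete, a hedged sketch of launching a Job through the JobLauncher (the bean names and the parameter key are assumptions):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class LaunchExample {

    // jobLauncher and exportJob would be injected Spring beans (names are hypothetical).
    public JobExecution launch(JobLauncher jobLauncher, Job exportJob) throws Exception {
        // JobParameters identify the JobInstance: same Job + same parameters = same JobInstance.
        JobParameters params = new JobParametersBuilder()
                .addString("targetDate", "2014-06-13")
                .toJobParameters();
        // The launcher obtains a JobExecution from the JobRepository and runs the job.
        return jobLauncher.run(exportJob, params);
    }
}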








2.4. Meta Data Access Improvements






3.1. Job



Spring Batch uses a 'Chunk Oriented' processing style within its most common implementation. Chunk oriented processing refers to reading the data one at a time, and creating 'chunks' that will be written out, within a transaction boundary (one chunk-oriented read/write cycle runs in a single transaction!). One item is read in from an ItemReader, handed to an ItemProcessor, and aggregated. Once the number of items read equals the commit interval, the entire chunk is written out via the ItemWriter, and then the transaction is committed.

5.1. Chunk-Oriented Processing
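
Conceptually, the quoted behaviour reduces to roughly the following loop. This is only an illustrative sketch, not the framework's actual implementation; transaction demarcation and the commit interval really come from the step configuration:

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;

public class ChunkLoopSketch {

    // One chunk: read up to commitInterval items, process each one, write them all together,
    // after which the surrounding transaction is committed.
    static <I, O> void processOneChunk(ItemReader<I> reader,
                                       ItemProcessor<I, O> processor,
                                       ItemWriter<O> writer,
                                       int commitInterval) throws Exception {
        List<O> chunk = new ArrayList<>();
        for (int i = 0; i < commitInterval; i++) {
            I item = reader.read();                 // null means the reader is exhausted
            if (item == null) {
                break;
            }
            O processed = processor.process(item);  // may return null to filter the item out
            if (processed != null) {
                chunk.add(processed);
            }
        }
        writer.write(chunk);                        // the whole chunk is written in one call
    }
}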




APIs:
Quote:

Job - A Job is an entity that encapsulates an entire batch process.
JobInstance - A JobInstance refers to the concept of a logical job run. You can think of it as: JobInstance = Job + JobParameters.
JobExecution - A JobExecution refers to the technical concept of a single attempt to run a Job. An execution may end in failure or success, but the JobInstance corresponding to a given execution will not be considered complete unless the execution completes successfully.
JobParameters - JobParameters is a set of parameters used to start a batch job. "how is one JobInstance distinguished from another?" The answer is: JobParameters.
Job conclusion - A Job defines what a job is and how it is to be executed, and JobInstance is a purely organizational object to group executions together, primarily to enable correct restart semantics. A JobExecution, however, is the primary storage mechanism for what actually happened during a run


Step - A Step is a domain object that encapsulates an independent, sequential phase of a batch job. Therefore, every Job is composed entirely of one or more steps. A Step contains all of the information necessary to define and control the actual batch processing. As with Job, a Step has an individual StepExecution that corresponds with a unique JobExecution.
StepExecution - A StepExecution represents a single attempt to execute a Step. A new StepExecution will be created each time a Step is run, similar to JobExecution. However, if a step fails to execute because the step before it fails, there will be no execution persisted for it. A StepExecution will only be created when its Step is actually started.

Tasklet - A simple interface with a single execute method, called repeatedly by a TaskletStep until it returns RepeatStatus.FINISHED or throws an exception (see the Step section above).
Chunk - The group of items read and processed within one transaction and written out together once the commit interval is reached (see Chunk-Oriented Processing above).
ExecutionContext - An ExecutionContext is a collection of key/value pairs that are persisted by the framework and provide a place to store persistent data that is scoped to a StepExecution or JobExecution. This storage is useful for example in stateful ItemReaders where the current row being read from needs to be recorded.
JobListener -

JobRepository - JobRepository is the persistence mechanism for all of the Stereotypes such as JobInstance/JobParameters/JobExecution/StepExecution/ExecutionContext and so on. It provides CRUD operations for JobLauncher, Job, and Step implementations. When a Job is first launched, a JobExecution is obtained from the repository, and during the course of execution StepExecution and JobExecution implementations are persisted by passing them to the repository.
JobLauncher - JobLauncher represents a simple interface for launching a Job with a given set of JobParameters. It is expected that implementations will obtain a valid JobExecution from the JobRepository and execute the Job.
JobExplorer - Provides the ability to query the repository for existing executions; you can think of it as a read-only version of the JobRepository.
JobRegistry - A JobRegistry (and its parent interface JobLocator) is not mandatory, but it can be useful if you want to keep track of which jobs are available in the context. It is also useful for collecting jobs centrally in an application context when they have been created elsewhere (e.g. in child contexts). Custom JobRegistry implementations can also be used to manipulate the names and other properties of the jobs that are registered.
JobOperator - the JobRepository provides CRUD operations on the meta-data, and the JobExplorer provides read-only operations on the meta-data. However, those operations are most useful when used together to perform common monitoring tasks such as stopping, restarting, or summarizing a Job, as is commonly done by batch operators. Spring Batch provides for these types of operations via the JobOperator interface.
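
As an illustration of the kind of operator task the JobOperator covers (a hedged sketch; the job name is hypothetical and jobOperator would be an injected bean):

import java.util.Set;
import org.springframework.batch.core.launch.JobOperator;

public class OperatorExample {

    // Print a summary of each running execution of the named job and ask it to stop.
    public void stopRunningExecutions(JobOperator jobOperator) throws Exception {
        Set<Long> running = jobOperator.getRunningExecutions("exportJob");
        for (Long executionId : running) {
            System.out.println(jobOperator.getSummary(executionId)); // human-readable status line
            jobOperator.stop(executionId); // request a graceful stop; restart(executionId)
                                           // can be used once the execution has actually stopped
        }
    }
}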

ItemReader - ItemReader is an abstraction that represents the retrieval of input for a Step, one item at a time. When the ItemReader has exhausted the items it can provide, it will indicate this by returning null. The basic contract of the ItemReader is that it is forward only.
ItemProcessor - ItemProcessor is an abstraction that represents the business processing of an item. While the ItemReader reads one item, and the ItemWriter writes them, the ItemProcessor provides access to transform or apply other business processing.
ItemWriter - ItemWriter is an abstraction that represents the output of a Step, one batch or chunk of items at a time. Generally, an item writer has no knowledge of the input it will receive next, only the item that was passed in its current invocation.
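
For example, a minimal ItemProcessor (the types and filtering rule are made up for illustration) that transforms one item and can also filter it by returning null:

import org.springframework.batch.item.ItemProcessor;

// Hypothetical processor: trims a raw line and drops blank ones (returning null
// filters the item out, so it never reaches the ItemWriter).
public class TrimmingProcessor implements ItemProcessor<String, String> {

    @Override
    public String process(String item) {
        String trimmed = item.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }
}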




Introducing Spring Batch series (three parts):
http://keyholesoftware.com/2012/06/22/introducing-spring-batch/




Batch processing in Java with Spring batch (four parts):
http://java-success.blogspot.com/2012/06/batch-processing-in-java-with-spring.html




Sources:
A rough overview (slides in Chinese):
http://www.slideshare.net/chijq/spring-batch
Spring Batch – Imperfect Yet Worthwhile:
http://www.summa-tech.com/blog/2012/01/23/spring-batch-imperfect-yet-worthwhile/
http://www.davenkin.me/post/2012-10-17/40039048526

Looking for some good examples?
Spring Batch - Hello World:
http://java.dzone.com/news/spring-batch-hello-world-1
Quote:
A batch Job is composed of one or more Steps. A JobInstance represents a given Job, parametrized with a set of typed properties called JobParameters. Each run of a JobInstance is a JobExecution. Imagine a job reading entries from a database, generating an XML representation of them, and then doing some clean-up. We have a Job composed of 2 steps: reading/writing and clean-up. If we parametrize this job by the date of the generated data, then our Friday the 13th job is a JobInstance. Each time we run this instance (if a failure occurs, for instance) is a JobExecution. This model gives great flexibility regarding how jobs are launched and run. This naturally brings us to launching jobs with their job parameters, which is the responsibility of JobLauncher. Finally, various objects in the framework require a JobRepository to store runtime information related to the batch execution. In fact, the Spring Batch domain model is much more elaborate, but this will suffice for our purpose.
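
A hedged Java-configuration sketch of the two-step job that quote describes (read/write then clean-up), in the JobBuilderFactory/StepBuilderFactory style; the reader, writer and clean-up tasklet beans are assumed to exist elsewhere:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class ExportJobConfig {

    // Step 1 is chunk-oriented read/write; Step 2 is a tasklet-based clean-up.
    @Bean
    public Job exportJob(JobBuilderFactory jobs, StepBuilderFactory steps,
                         ItemReader<String> reader, ItemWriter<String> writer,
                         Tasklet cleanupTasklet) {
        Step readWrite = steps.get("readWrite")
                .<String, String>chunk(100)   // commit interval: 100 items per transaction
                .reader(reader)
                .writer(writer)
                .build();
        Step cleanup = steps.get("cleanup")
                .tasklet(cleanupTasklet)
                .build();
        return jobs.get("exportJob")
                .start(readWrite)
                .next(cleanup)                // Job = {Step 1 -> Step 2}, chained together
                .build();
    }
}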
What happens if a process throws an exception?
http://alain-cieslik.com/2011/06/06/springbatch-what-append-if-a-process-throws-an-exception/
http://forum.springsource.org/showthread.php?61042-Spring-Batch-beginners-tutorial
http://stackoverflow.com/questions/1609793/how-can-i-get-started-with-spring-batch