Spring Batch: a powerful batch processing framework — do you know how to use it?

Introduction to Spring Batch

Spring Batch is a batch processing framework provided by the Spring ecosystem. Many applications in the enterprise domain require batch processing to perform business operations in mission-critical environments. These business operations include:

  • Automated, complex processing of large volumes of information, handled as efficiently as possible without user interaction. These operations are often triggered by time-based events (such as month-end calculations, notifications, or communications).
  • Periodic application of complex business rules (for example, insurance benefit determination or rate adjustments), processed repeatedly across very large data sets.
  • Integration of information received from internal and external systems, which typically needs to be formatted, validated, and processed transactionally into a system of record. Enterprises use batch processing to handle billions of transactions every day.

Spring Batch is a lightweight, comprehensive batch processing framework designed to develop robust batch applications that are critical to the day-to-day operation of enterprise systems. Spring Batch builds on the characteristic features of the Spring Framework (productivity, a POJO-based development approach, and general ease of use) while making it easy for developers to access and leverage higher-level enterprise services when necessary. Note that Spring Batch is not a scheduling framework.

Spring Batch provides reusable functionality that is critical for processing large amounts of data, including logging/tracing, transaction management, job processing statistics, job restart, skipping, and resource management. It also provides more advanced technical services and features, enabling extremely high-volume and high-performance batch jobs through optimization and partitioning techniques.

Spring Batch can be used for both simple use cases (such as reading a file into a database or running a stored procedure) as well as complex high-volume use cases (such as moving large amounts of data between databases, transforming it, etc.). High-volume batch jobs can leverage this framework to process large amounts of information in a highly scalable manner.

Introduction to Spring Batch Architecture

A typical batch application looks like this:

  • Read a large number of records from a database, file, or queue.
  • Process the data in some way.
  • Write the data back in a modified form.

The overall architecture of Spring Batch is described below.

In Spring Batch, a job can define many steps. Each step can define its own ItemReader for reading data, ItemProcessor for processing data, and ItemWriter for writing data. Every defined job is stored in the JobRepository, and we can launch a job through the JobLauncher.

Introduction to the core concepts of Spring Batch

The following are the core concepts of the Spring Batch framework.

What is a Job

Job and Step are the two core concepts that Spring Batch uses to execute batch tasks.

A Job is the concept that encapsulates an entire batch process. Job is a top-level abstraction in the Spring Batch hierarchy, and in code it is simply a top-level interface:

 
 

```java
/**
 * Batch domain object representing a job. Job is an explicit abstraction
 * representing the configuration of a job specified by a developer. It should
 * be noted that restart policy is applied to the job as a whole and not to a
 * step.
 */
public interface Job {

    String getName();

    boolean isRestartable();

    void execute(JobExecution execution);

    JobParametersIncrementer getJobParametersIncrementer();

    JobParametersValidator getJobParametersValidator();
}
```

The Job interface defines five methods, and its implementations fall mainly into two types: SimpleJob and FlowJob. In Spring Batch, Job is the top-level abstraction; below it there are two lower-level abstractions, JobInstance and JobExecution.

A job is the basic unit we operate on, and it is composed of steps. A job can essentially be regarded as a container of steps: it combines steps in a specified logical order and provides a way to set properties shared by all steps, such as event listeners and skip policies.

Spring Batch provides a default simple implementation of the Job interface in the form of the SimpleJob class, which adds some standard functionality on top of Job. An example using Java config is as follows:

 
 

```java
@Bean
public Job footballJob() {
    return this.jobBuilderFactory.get("footballJob")
            .start(playerLoad())
            .next(gameLoad())
            .next(playerSummarization())
            .end()
            .build();
}
```

The meaning of this configuration is: the job is first named footballJob, and then its three steps are specified, implemented by the methods playerLoad, gameLoad, and playerSummarization respectively.

What is JobInstance

We mentioned JobInstance above; it is a lower-level abstraction under Job, and it is defined as follows:

 
 

```java
public interface JobInstance {

    /**
     * Get unique id for this JobInstance.
     * @return instance id
     */
    public long getInstanceId();

    /**
     * Get job name.
     * @return value of 'id' attribute from <job>
     */
    public String getJobName();
}
```

Its methods are very simple: one returns the id of the instance, and the other returns the name of the Job.

A JobInstance represents one logical run of a Job — "instance" means exactly that, a single logical instance of the job.

For example, suppose there is a batch job that runs once at the end of each day; assume its name is 'EndOfDay'. In that case there is one logical JobInstance per day, and we must record each run of the job.

What is JobParameters

As mentioned above, if the same job runs once a day, there is a JobInstance each day, yet their job definitions are identical — so how do we distinguish the different JobInstances of one job? Take a guess first: although the JobInstances share the same job definition, something about them differs, for example the run time.

What Spring Batch provides to identify a JobInstance is JobParameters. A JobParameters object holds a set of parameters used to start a batch job; they can be used for identification, or even as reference data during the run. The run time we just hypothesized can serve as a JobParameter.

For example, our earlier 'EndOfDay' job now has two instances, one generated on January 1 and one on January 2. We can then define two JobParameters objects: one with the parameter 01-01, the other with 01-02. The rule for identifying a JobInstance can therefore be expressed as: JobInstance = Job + identifying JobParameters.

Thus we can locate the correct JobInstance through its JobParameters.
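As a sketch, identifying parameters for the hypothetical 'EndOfDay' job could be built with Spring Batch's JobParametersBuilder. The parameter key "schedule.date" below is illustrative, not something mandated by the framework:

```java
// Hedged sketch: building identifying JobParameters for the 'EndOfDay' job.
// The key "schedule.date" is an illustrative choice, not Spring Batch API.
JobParameters jan1 = new JobParametersBuilder()
        .addString("schedule.date", "01-01")
        .toJobParameters();

JobParameters jan2 = new JobParametersBuilder()
        .addString("schedule.date", "01-02")
        .toJobParameters();

// Launching the same Job with jan1 and then jan2 yields two distinct JobInstances.
```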

What is JobExecution

A JobExecution represents a single attempt to run a Job that we have defined. An execution may fail or succeed; the JobInstance corresponding to the execution is considered complete only when an execution finishes successfully.

Still taking the EndOfDay job described above as an example, suppose the first run of the 01-01-2019 JobInstance fails. If we run it again with the same JobParameters as the first run (that is, 01-01-2019), a new JobExecution is created for the same JobInstance — there is still only one JobInstance.

The interface definition of JobExecution is as follows:

 
 

```java
public interface JobExecution {

    /**
     * Get unique id for this JobExecution.
     * @return execution id
     */
    public long getExecutionId();

    /**
     * Get job name.
     * @return value of 'id' attribute from <job>
     */
    public String getJobName();

    /**
     * Get batch status of this execution.
     * @return batch status value.
     */
    public BatchStatus getBatchStatus();

    /**
     * Get time execution entered STARTED status.
     * @return date (time)
     */
    public Date getStartTime();

    /**
     * Get time execution entered end status: COMPLETED, STOPPED, FAILED
     * @return date (time)
     */
    public Date getEndTime();

    /**
     * Get execution exit status.
     * @return exit status.
     */
    public String getExitStatus();

    /**
     * Get time execution was created.
     * @return date (time)
     */
    public Date getCreateTime();

    /**
     * Get time execution was last updated.
     * @return date (time)
     */
    public Date getLastUpdatedTime();

    /**
     * Get job parameters for this execution.
     * @return job parameters
     */
    public Properties getJobParameters();
}
```

The Javadoc on each method explains it clearly, so no further explanation is needed here. One thing worth mentioning is BatchStatus: JobExecution provides a getBatchStatus method to obtain the status of a job at a given point in its execution. BatchStatus is an enumeration representing job status, defined as follows:

 
 

```java
public enum BatchStatus {
    STARTING, STARTED, STOPPING, STOPPED, FAILED, COMPLETED, ABANDONED
}
```

These attributes are critical information about a job's execution, and Spring Batch persists them to the database. When Spring Batch is used, it automatically creates a number of tables to store job-related information; the table that stores JobExecution is batch_job_execution.

What is a Step

Each Step object encapsulates an independent phase of a batch job. In fact, every Job is essentially composed of one or more steps, and each step contains all the information needed to define and control the actual batch processing. What that processing looks like is at the discretion of the developer writing the Job.

A step can be very simple or very complex. For example, if a step's purpose is to load data from a file into the database, then with Spring Batch's built-in support it requires almost no code. A more complex step may carry complex business logic as part of its processing.

Like Job, Step has a corresponding StepExecution that is analogous to JobExecution.

What is StepExecution

A StepExecution represents one attempt to execute a Step. A new StepExecution is created every time a Step runs, just as with JobExecution. However, if a step cannot execute because a preceding step failed, no StepExecution is persisted for it: a StepExecution is created only when its Step actually starts.

An instance of a step execution is represented by an object of the StepExecution class. Each StepExecution holds a reference to its corresponding Step and JobExecution, plus transaction-related data such as commit and rollback counts and start and end times.

Additionally, each StepExecution contains an ExecutionContext, which holds any data a developer needs to persist across batch runs, such as statistics or state information required for restarts.

What is ExecutionContext

The ExecutionContext is the execution environment of each StepExecution. It contains a series of key-value pairs. We can obtain the ExecutionContext with the following code:

 
 

```java
ExecutionContext ecStep = stepExecution.getExecutionContext();
ExecutionContext ecJob = jobExecution.getExecutionContext();
```

What is JobRepository

The JobRepository is the mechanism that persists all of the concepts above (Job, Step, and so on). It provides CRUD operations for JobLauncher, Job, and Step implementations.

When a Job is first launched, a JobExecution is obtained from the repository; during execution, StepExecutions and JobExecutions are persisted to the repository.

The @EnableBatchProcessing annotation can provide automatic configuration for JobRepository.
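A minimal configuration sketch: with @EnableBatchProcessing on a configuration class, Spring Batch wires up the JobRepository along with the other batch infrastructure beans. The class name here is illustrative:

```java
@Configuration
@EnableBatchProcessing  // auto-configures JobRepository, JobLauncher, etc.
public class BatchConfig {
    // The infrastructure beans can now simply be injected where needed, e.g.:
    // @Autowired private JobRepository jobRepository;
    // @Autowired private JobBuilderFactory jobBuilderFactory;
}
```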

What is JobLauncher

The JobLauncher interface has a very simple function: it launches a job with a given set of JobParameters. Why emphasize JobParameters here? Because, as mentioned earlier, a Job combined with its JobParameters forms a JobExecution. The interface looks like this:

 
 

```java
public interface JobLauncher {

    public JobExecution run(Job job, JobParameters jobParameters)
            throws JobExecutionAlreadyRunningException, JobRestartException,
                   JobInstanceAlreadyCompleteException, JobParametersInvalidException;
}
```

The run method obtains a JobExecution from the JobRepository and executes the Job according to the given job and jobParameters.
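As a usage sketch, launching the footballJob defined earlier could look like this. The bean names and the parameter key "schedule.date" are illustrative assumptions:

```java
@Component
public class FootballJobRunner {

    // Hedged sketch: jobLauncher and footballJob are assumed to be
    // Spring beans; the parameter key "schedule.date" is illustrative.
    @Autowired
    private JobLauncher jobLauncher;

    @Autowired
    private Job footballJob;

    public void launch() throws Exception {
        JobParameters params = new JobParametersBuilder()
                .addString("schedule.date", "01-01")
                .toJobParameters();

        // run() obtains/creates the JobExecution via the JobRepository
        // and executes the job with the given parameters.
        JobExecution execution = jobLauncher.run(footballJob, params);
        System.out.println(execution.getStatus());
    }
}
```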

What is ItemReader

ItemReader is an abstraction over reading data; its function is to provide the data input for each Step. When an ItemReader has read all its data, it returns null to tell subsequent operations that the input is exhausted. Spring Batch provides many useful ItemReader implementations, such as JdbcPagingItemReader and JdbcCursorItemReader.

The data sources ItemReader supports are also very rich, including various types of databases, files, data streams, and so on — covering almost all common scenarios.

The following is a sample code of JdbcPagingItemReader:

 
 

```java
@Bean
public JdbcPagingItemReader itemReader(DataSource dataSource, PagingQueryProvider queryProvider) {
    Map<String, Object> parameterValues = new HashMap<>();
    parameterValues.put("status", "NEW");

    return new JdbcPagingItemReaderBuilder<CustomerCredit>()
            .name("creditReader")
            .dataSource(dataSource)
            .queryProvider(queryProvider)
            .parameterValues(parameterValues)
            .rowMapper(customerCreditMapper())
            .pageSize(1000)
            .build();
}

@Bean
public SqlPagingQueryProviderFactoryBean queryProvider() {
    SqlPagingQueryProviderFactoryBean provider = new SqlPagingQueryProviderFactoryBean();
    provider.setSelectClause("select id, name, credit");
    provider.setFromClause("from customer");
    provider.setWhereClause("where status=:status");
    provider.setSortKey("id");
    return provider;
}
```

JdbcPagingItemReader must specify a PagingQueryProvider, which is responsible for providing SQL query statements to return data by paging.

The following is a sample code of JdbcCursorItemReader:

 
 

```java
private JdbcCursorItemReader<Map<String, Object>> buildItemReader(final DataSource dataSource, String tableName, String tenant) {
    JdbcCursorItemReader<Map<String, Object>> itemReader = new JdbcCursorItemReader<>();
    itemReader.setDataSource(dataSource);
    itemReader.setSql("sql here");
    // ColumnMapRowMapper maps each row to a Map<String, Object>
    // (RowMapper itself is an interface and cannot be instantiated directly)
    itemReader.setRowMapper(new ColumnMapRowMapper());
    return itemReader;
}
```

What is ItemWriter

Since ItemReader is the abstraction for reading data, ItemWriter is naturally the abstraction for writing: it provides the data-writing function for each step. The unit of writing is configurable: we can write one item at a time, or one chunk of items at a time (chunks are introduced below). An ItemWriter only writes; it does not read or process the data itself.

Spring Batch also provides many useful ItemWriter implementations, and of course we can implement our own writer as well.
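As a sketch of a custom writer (assuming the pre-5.0 ItemWriter interface, in which write receives the whole chunk as a list; Spring Batch 5 uses Chunk&lt;T&gt; instead):

```java
// Hedged sketch of a custom ItemWriter; the class name is illustrative.
public class LoggingItemWriter implements ItemWriter<String> {

    @Override
    public void write(List<? extends String> items) throws Exception {
        // All items of one chunk arrive together and are written in one go,
        // inside the chunk's transaction.
        for (String item : items) {
            System.out.println("writing: " + item);
        }
    }
}
```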

What is ItemProcessor

ItemProcessor is an abstraction over the business-logic processing of items. After the ItemReader reads a record and before the ItemWriter writes it, an ItemProcessor lets us apply business logic to the data and perform the corresponding operations. If we find in the ItemProcessor that a piece of data should not be written, we express that by returning null. ItemProcessor works seamlessly with ItemReader and ItemWriter, and passing data between them is straightforward.
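A sketch of a processor that transforms each item and filters out invalid records by returning null (the class name and the upper-casing logic are illustrative assumptions):

```java
// Hedged sketch of an ItemProcessor: items returned as null are filtered
// out and never reach the ItemWriter.
public class UpperCaseProcessor implements ItemProcessor<String, String> {

    @Override
    public String process(String item) throws Exception {
        if (item == null || item.trim().isEmpty()) {
            return null; // returning null filters this record out
        }
        return item.toUpperCase(); // the actual business logic
    }
}
```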

Chunk processing flow

Spring Batch provides us with the ability to process data chunk by chunk.

Because a batch task may involve a great deal of reading and writing, processing and committing records to the database one at a time would be inefficient. Spring Batch therefore provides the concept of a chunk: we set a chunk size, and Spring Batch processes the data item by item without committing; only when the number of processed items reaches the chunk size is the whole batch committed together.

An example Java definition is as follows:
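A sketch of such a step definition, consistent with the description below — chunk size 10, with hypothetical reader() and writer() beans, in the Spring Batch 4.x builder style:

```java
@Bean
public Step sampleStep() {
    // Hypothetical reader() and writer() beans. Chunk size 10 means a
    // commit happens after every 10 items are read, processed, and written.
    return this.stepBuilderFactory.get("sampleStep")
            .<String, String>chunk(10)
            .reader(reader())
            .writer(writer())
            .build();
}
```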

In the step above, the chunk size is set to 10: once the ItemReader has read 10 items, that batch of data is handed to the ItemWriter together, and the transaction is committed at the same time.
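The commit behaviour described above can be sketched in plain Java, with no Spring dependency — the names here are illustrative, not Spring Batch API:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the chunk-oriented loop: read one item at a time,
// buffer the processed results, and "commit" them together once the buffer
// reaches the chunk size.
public class ChunkLoopDemo {

    /** "Processes" each item (doubling it) and groups results into commits of chunkSize. */
    static List<List<Integer>> processInChunks(List<Integer> source, int chunkSize) {
        List<List<Integer>> commits = new ArrayList<>();
        List<Integer> buffer = new ArrayList<>();
        for (Integer item : source) {                 // "read" one item at a time
            buffer.add(item * 2);                     // "process" the item
            if (buffer.size() == chunkSize) {         // chunk boundary reached:
                commits.add(new ArrayList<>(buffer)); // "write" and commit together
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {                      // final partial chunk
            commits.add(new ArrayList<>(buffer));
        }
        return commits;
    }

    public static void main(String[] args) {
        List<Integer> source = new ArrayList<>();
        for (int i = 1; i <= 25; i++) {
            source.add(i);
        }
        // 25 items with chunk size 10 -> commits of 10, 10, and 5 items
        System.out.println(processInChunks(source, 10).size()); // prints 3
    }
}
```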

Skip policy and failure handling

A step of a batch job may process a very large amount of data, so errors are inevitable. Although the probability of an error is small, we have to plan for these situations, because the most important thing in data migration is ensuring the eventual consistency of the data. Spring Batch, of course, takes this into account and provides the relevant support. Please see the following bean configuration:
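A configuration consistent with the explanation that follows might look like this sketch — the step name and the reader()/writer() beans are illustrative assumptions:

```java
@Bean
public Step migrationStep() {
    // Hypothetical reader()/writer() beans. The step tolerates up to 10
    // skipped exceptions of any type, except FileNotFoundException,
    // which is never skipped and fails the step immediately.
    return this.stepBuilderFactory.get("migrationStep")
            .<String, String>chunk(10)
            .reader(reader())
            .writer(writer())
            .faultTolerant()
            .skipLimit(10)
            .skip(Exception.class)
            .noSkip(FileNotFoundException.class)
            .build();
}
```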

We need to pay attention to three methods: skipLimit(), skip(), and noSkip().

The skipLimit method sets the number of exceptions we allow this step to skip. If we set it to 10, then while the step is running the whole step will not fail as long as no more than 10 exceptions occur. Note that if skipLimit is not set, its default value is 0.

With the skip method we can specify which exceptions may be skipped, since some exceptions can safely be ignored.

The noSkip method says we do not want to skip a given exception — that is, it excludes that exception from the set of skippable exceptions. In the example above, it means skipping all exceptions except FileNotFoundException.

So for this step, FileNotFoundException is a fatal exception: when it is thrown, the step fails immediately.

Batch Operation Guide

This section collects some points worth noting when using Spring Batch.

Batch principle

When building a batch solution, the following key principles and considerations should be considered.

  • A batch architecture typically affects the online architecture, and vice versa; design with both in mind, using common building blocks where possible.

  • Simplify as much as possible and avoid building complex logical structures in a single batch application

  • Keep the processing and storage of data physically close together (in other words, keep the data where the processing occurs).

  • Minimize the use of system resources, especially I/O. Perform as many operations as possible in memory.

  • Review application I/O (analyze SQL statements) to ensure that unnecessary physical I/O is avoided. In particular, look for the following four common flaws:

  • Reading data for every transaction when the data could be read once and cached or kept in working storage.

  • Rereading data in a transaction when the same data was already read earlier in that transaction.

  • Causing unnecessary table or index scans.

  • Not specifying key values in the WHERE clause of an SQL statement.

  • Don't do the same thing twice in a batch run. For example, if data aggregation is required for reporting purposes, you should (if possible) increment the stored total when the data is initially processed, so your reporting application does not have to reprocess the same data.

  • Allocate enough memory at the start of a batch application to avoid time-consuming reallocations in the process.

  • Always assume the worst with regard to data integrity. Insert adequate checks and record validation to maintain data integrity.

  • Implement checksums for internal validation where possible. For example, a file of data should carry a trailer record giving the total number of records in the file and an aggregate of the key fields.

  • Plan and execute stress tests as early as possible, in a production-like environment with realistic data volumes.

  • In high-volume systems, data backup can be challenging, especially if the system runs online 24/7. Database backups are usually well handled in the online design, but file backups should be considered just as important. If the system depends on files, the file backup procedures should not only be in place and documented but also tested regularly.

How to keep jobs from running at startup by default

When using Spring Batch jobs with Java config, if no extra configuration is done, the project will run the defined batch jobs by default at startup. So how do we keep the project from running them automatically?

Spring Batch jobs run automatically when the project starts. If we do not want them to run at startup, we can add the following property to application.properties:

 
 

```properties
spring.batch.job.enabled=false
```

Running out of memory while reading data

While using Spring Batch for data migration, we found that some time after the job started it would get stuck at one point and the log would stop printing. After waiting a while, we got the following error:

The key message was: Resource exhaustion event: the JVM was unable to allocate memory from the heap.

In other words, the project raised a resource-exhaustion event telling us that the JVM could no longer allocate heap memory.

The cause of this error: the batch job's reader fetched all the data from the database in one go, without paging. When the data volume is too large, memory runs out. There are two solutions:

  • Adjust the reader logic to read page by page; this is somewhat more complicated to implement and runs more slowly.
  • Increase the memory available to the service.


Origin blog.csdn.net/BASK2312/article/details/131811737