Spring Batch optimization in practice: improving data processing efficiency and quality

1. Introduction to Spring Batch

1 Framework overview

Spring Batch is a batch processing framework built on the Spring Framework. It covers a wide range of enterprise batch requirements by reading, processing, and writing large volumes of data. It handles large data sets well, ships with a rich set of extensible components, and makes it straightforward to plug business logic into a standard sequence of processing steps at the framework level. With Spring Batch, developers can express standardized operations over large amounts of data with little boilerplate, which improves development efficiency and reduces the load placed on system resources such as databases.

2 Core concepts and components

Spring Batch is built around the following core concepts and components (a minimal wiring sketch follows the list):

  • Job: a complete, executable batch process composed of one or more Steps.
  • Step: an independent, self-contained phase within a Job.
  • ExecutionContext: created for each Job or Step execution to hold the context state of that run.
  • ItemReader: reads the input data.
  • ItemProcessor: applies business logic to the items read by the ItemReader.
  • ItemWriter: writes the processed items to the target store.
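
A minimal sketch of how these components fit together (class, bean, and Data type names are illustrative, not from any specific project):

@Configuration
@EnableBatchProcessing
public class SampleJobConfiguration {

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    private StepBuilderFactory stepBuilderFactory;

    @Bean
    public Job sampleJob(Step sampleStep) {
        // A Job is assembled from one or more Steps
        return jobBuilderFactory.get("sampleJob")
                .start(sampleStep)
                .build();
    }

    @Bean
    public Step sampleStep(ItemReader<Data> reader,
                           ItemProcessor<Data, Data> processor,
                           ItemWriter<Data> writer) {
        // A chunk-oriented Step: read and process items one at a time,
        // then write them out in chunks of 100
        return stepBuilderFactory.get("sampleStep")
                .<Data, Data>chunk(100)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}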

2. Batch optimization practice

1 Reduce the number of reads and writes

1.1 Paginated data processing

Batch jobs should avoid scanning the entire data set in one pass; reading and processing the data in batches keeps the pressure on system resources bounded. For jobs over large data sets, paging is recommended: split the work into multiple smaller tasks and read and process each one page by page.

@Bean
@StepScope
public ItemReader<Data> reader() {
    // Page through the repository 1000 records at a time
    RepositoryItemReader<Data> reader = new RepositoryItemReader<>();
    reader.setRepository(repository);
    reader.setMethodName(FIND_DATA_BY_NAME_AND_AGE);
    reader.setPageSize(1000);
    // RepositoryItemReader takes positional arguments, not a parameter map
    reader.setArguments(Arrays.asList("test", 20));
    // A stable sort is required so consecutive pages do not overlap
    reader.setSort(Collections.singletonMap("id", Sort.Direction.ASC));
    return reader;
}

The example above reads data in pages through a Spring Data JPA repository; setPageSize() sets the number of records fetched per page.

1.2 Using a read/write cache

For data that is read and written repeatedly, a read/write cache reduces the frequency of those operations. Caching cuts down on disk I/O and can significantly improve the throughput of batch processing. In a Spring application, caching is enabled with the @EnableCaching annotation.

@Bean
public ItemWriter<Data> writer() throws Exception {
    // Write processed items back through the Spring Data repository
    RepositoryItemWriter<Data> writer = new RepositoryItemWriter<>();
    writer.setRepository(repository);
    writer.setMethodName(SAVE);
    writer.afterPropertiesSet();
    return writer;
}

@Bean
public CacheManager cacheManager() {
    // Simple in-memory cache named "data", backed by a ConcurrentHashMap
    return new ConcurrentMapCacheManager("data");
}

The example above registers a Spring Cache manager. Add the @EnableCaching annotation to a configuration class and declare the cache names in the CacheManager; methods annotated with @Cacheable will then read from and populate the named cache.
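
As a sketch of how the cache would actually be consulted during processing, a lookup method can be annotated with @Cacheable so repeated reads for the same key hit the in-memory cache instead of the database (the service, repository, and method names here are illustrative):

@Service
public class DataLookupService {

    @Autowired
    private DataRepository repository;

    // Results are stored in the "data" cache; repeated calls with the
    // same name skip the database and return the cached value
    @Cacheable(value = "data", key = "#name")
    public Data findByName(String name) {
        return repository.findByName(name);
    }
}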

1.3 Batched write operations

When writing, avoid committing a very large amount of data in a single transaction. Instead, write in chunks: save and commit a bounded batch of rows at a time. This effectively avoids memory exhaustion and reduces I/O overhead.

@Bean
public ItemWriter<Data> writer(EntityManagerFactory entityManagerFactory) {
    // JpaItemWriter flushes once per chunk; the chunk size configured on
    // the step (e.g. chunk(5000)) controls how many rows each commit carries
    JpaItemWriter<Data> writer = new JpaItemWriter<>();
    writer.setEntityManagerFactory(entityManagerFactory);
    return writer;
}

The example above uses the JpaItemWriter provided by Spring Batch to save data in batches. Note that JpaItemWriter has no batch-size setter of its own; the number of rows written per commit is governed by the chunk size declared on the step.

2 Concurrent task processing

2.1 Asynchronous job launching

When a large amount of data must be processed, running tasks concurrently improves throughput. The idea is to split a large data set into multiple tasks, hand those tasks to separate workers, and exploit multi-core hardware to work on several tasks at once. Note that in Spring Batch this concurrency is thread-based rather than process-based; the launcher below runs each submitted job on its own thread.

@Bean
public SimpleAsyncTaskExecutor taskExecutor() {
    // Spawns a new thread per task rather than reusing a pool
    return new SimpleAsyncTaskExecutor("async-writer");
}

@Bean
public SimpleJobLauncher jobLauncher() throws Exception {
    // Launch jobs on the async executor so run() does not block the caller
    SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
    jobLauncher.setTaskExecutor(taskExecutor());
    jobLauncher.setJobRepository(jobRepository);
    jobLauncher.afterPropertiesSet();
    return jobLauncher;
}

The example above uses SimpleAsyncTaskExecutor, provided by Spring, to launch batch jobs asynchronously: each job runs on its own thread, and the operating system schedules those threads across the available CPU cores.
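
A usage sketch, assuming the jobLauncher and a job bean from above (the parameter name is illustrative); with an asynchronous executor, run() returns as soon as the job is handed off:

public JobExecution launchAsync(JobLauncher jobLauncher, Job job) throws Exception {
    // run() hands the job to the async executor and returns immediately;
    // the returned JobExecution can be polled for status
    JobParameters params = new JobParametersBuilder()
            .addLong("run.id", System.currentTimeMillis())
            .toJobParameters();
    JobExecution execution = jobLauncher.run(job, params);
    System.out.println("Status right after launch: " + execution.getStatus());
    return execution;
}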

2.2 Multi-thread processing

When processing a large amount of data, multi-threaded execution improves processing speed. The idea is to split the large data set into multiple tasks and use Java's threading support to process several tasks at the same time, improving overall efficiency.

@Bean
public TaskExecutor taskExecutor() {
    // Bounded thread pool for step-level (chunk) concurrency
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    taskExecutor.setCorePoolSize(10);
    taskExecutor.setMaxPoolSize(50);
    taskExecutor.setQueueCapacity(25);
    taskExecutor.setThreadNamePrefix("batch-thread-");
    taskExecutor.initialize();
    return taskExecutor;
}

@Bean
public SimpleAsyncTaskExecutor jobExecutor() {
    // Allow at most three jobs to run concurrently
    SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor("job-thread");
    executor.setConcurrencyLimit(3);
    return executor;
}

@Bean
public SimpleJobLauncher jobLauncher() throws Exception {
    SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
    jobLauncher.setTaskExecutor(jobExecutor());
    jobLauncher.setJobRepository(jobRepository);
    jobLauncher.afterPropertiesSet();
    return jobLauncher;
}

The example above uses Spring's ThreadPoolTaskExecutor to process batch work concurrently. The pool is sized with setCorePoolSize(), setMaxPoolSize(), and setQueueCapacity(), while a SimpleAsyncTaskExecutor with a concurrency limit caps the number of jobs executing at once.
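
To actually make a step multi-threaded, pass the executor to the step builder; throttleLimit() caps how many chunks run in parallel. A sketch, assuming a StepBuilderFactory is available and chunk size and limit are illustrative:

@Bean
public Step multiThreadedStep(ItemReader<Data> reader,
                              ItemProcessor<Data, Data> processor,
                              ItemWriter<Data> writer) {
    // Each chunk is read/processed/written on a pool thread; the reader
    // must be thread-safe (or wrapped in a SynchronizedItemStreamReader)
    return stepBuilderFactory.get("multiThreadedStep")
            .<Data, Data>chunk(1000)
            .reader(reader)
            .processor(processor)
            .writer(writer)
            .taskExecutor(taskExecutor())
            .throttleLimit(10)
            .build();
}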

3 Improving data validation accuracy

3.1 Validation before the batch starts

Before a batch job runs, the correctness of the input data and the validity of the planned read and write operations should be verified. Validating before the batch starts catches bad input before any data is touched and greatly improves data accuracy.

@Configuration
public class JobValidateListener {

    @Autowired
    private Validator validator;

    // registerJobExecutionListener() is defined on AbstractJob,
    // the runtime type of jobs built by the builder factories
    @Autowired
    private AbstractJob job;

    @PostConstruct
    public void init() {
        JobValidationListener validationListener = new JobValidationListener();
        validationListener.setValidator(validator);
        job.registerJobExecutionListener(validationListener);
    }
}

public class JobValidationListener implements JobExecutionListener {

    private Validator validator;

    public void setValidator(Validator validator) {
        this.validator = validator;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // Validate the job parameters before any step runs
        JobParameters parameters = jobExecution.getJobParameters();
        BatchJobParameterValidator validator = new BatchJobParameterValidator(parameters);
        validator.validate();
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // No post-job validation needed
    }
}

The example above validates a batch job's input parameters before the job runs: beforeJob() invokes a custom BatchJobParameterValidator against the supplied JobParameters.
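
BatchJobParameterValidator is a custom class; a minimal sketch of what it might look like, assuming the job requires a non-empty inputFile parameter (the parameter name is illustrative):

public class BatchJobParameterValidator {

    private final JobParameters parameters;

    public BatchJobParameterValidator(JobParameters parameters) {
        this.parameters = parameters;
    }

    // Throws if a required parameter is missing or blank, failing the
    // job in beforeJob() before any step runs
    public void validate() {
        String inputFile = parameters.getString("inputFile");
        if (inputFile == null || inputFile.trim().isEmpty()) {
            throw new IllegalArgumentException("Required job parameter 'inputFile' is missing");
        }
    }
}

Alternatively, Spring Batch's own JobParametersValidator interface can be attached when the job is built, via the job builder's validator() method.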

3.2 Read and write verification

Each batch of data read or written should also be validated, so that no invalid records reach the target data store.

@Bean
public ItemReader<Data> reader() {
    // Page through the result set 1000 rows at a time using a JPQL query
    JpaPagingItemReader<Data> reader = new JpaPagingItemReader<>();
    reader.setEntityManagerFactory(entityManagerFactory);
    reader.setPageSize(1000);
    reader.setQueryString(FIND_DATA_BY_NAME_AND_AGE);
    Map<String, Object> parameters = new HashMap<>();
    parameters.put("name", "test");
    parameters.put("age", 20);
    reader.setParameterValues(parameters);
    return reader;
}

The example above reads data with JpaPagingItemReader. Note that the reader itself has no validation hook; per-item validation belongs in the processing phase, as shown below.
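
The idiomatic place for per-item validation in Spring Batch is a validating processor between reader and writer. A sketch using BeanValidatingItemProcessor (available since Spring Batch 4.1), which assumes the Data class carries Bean Validation annotations such as @NotNull or @Min:

@Bean
public ValidatingItemProcessor<Data> validatingProcessor() {
    // Applies the Bean Validation constraints declared on Data
    // to every item read from the reader
    BeanValidatingItemProcessor<Data> processor = new BeanValidatingItemProcessor<>();
    // Drop invalid items instead of failing the whole step
    processor.setFilter(true);
    return processor;
}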

4 Monitoring batch tasks

4.1 Monitoring with Spring Boot Actuator

While batch jobs run, their execution and status should be observable at all times; Spring Boot Actuator serves this purpose. Actuator exposes a rich set of metrics and endpoints that help developers monitor running batch jobs in real time.

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

The snippet above adds the spring-boot-starter-actuator dependency to the pom.xml file to enable Actuator.
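
Out of the box only a few endpoints are exposed over HTTP; others, such as metrics, can be exposed in application.properties. A minimal sketch using the standard Spring Boot properties (the endpoint selection is a suggestion):

# Expose the health, info, and metrics endpoints over HTTP
management.endpoints.web.exposure.include=health,info,metrics
# Show full health details, useful while debugging batch runs
management.endpoint.health.show-details=always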

4.2 Monitoring with a management console

The execution and runtime status of batch tasks can also be monitored from a management console. Surfacing monitoring metrics and task logs on the console makes it possible to spot and handle anomalies in the tasks promptly.

@Configuration
public class BatchLoggingConfiguration {

    @Bean
    public BatchConfigurer configurer(DataSource dataSource) {
        return new DefaultBatchConfigurer(dataSource) {

            @Override
            public PlatformTransactionManager getTransactionManager() {
                return new ResourcelessTransactionManager();
            }

            @Override
            public JobLauncher getJobLauncher() throws Exception {
                SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
                jobLauncher.setJobRepository(getJobRepository());
                jobLauncher.afterPropertiesSet();
                return jobLauncher;
            }

            @Override
            public JobRepository getJobRepository() throws Exception {
                // Persist job/step execution metadata to the database so it
                // can be inspected by monitoring tools and the console
                JobRepositoryFactoryBean factory = new JobRepositoryFactoryBean();
                factory.setDataSource(dataSource);
                factory.setTransactionManager(getTransactionManager());
                factory.setIsolationLevelForCreate("ISOLATION_DEFAULT");
                factory.afterPropertiesSet();
                return factory.getObject();
            }
        };
    }
}

The example above customizes the BatchConfigurer so that batch execution metadata and logs are persisted and can be surfaced on a management console. Enable batch processing at startup with the @EnableBatchProcessing annotation; @EnableScheduling can additionally be used to start scheduled tasks automatically.
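
For completeness, a minimal application class enabling both (the class name is illustrative):

@SpringBootApplication
@EnableBatchProcessing  // registers JobRepository, JobLauncher, builder factories
@EnableScheduling       // activates @Scheduled methods for periodic job launches
public class BatchApplication {

    public static void main(String[] args) {
        SpringApplication.run(BatchApplication.class, args);
    }
}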

3. Practical examples

1 Case overview

In our project we need to analyze users' shopping behavior and store the results in a database. Because the data volume is large and must be kept up to date, we decided to tackle the problem with batch processing.

2 Problem Analysis

We ran into the following problems when using the batch processing framework:

  1. Data reading was inefficient, making batch processing slow;
  2. Exceptions raised during processing were not discovered and handled promptly.

3 Batch optimization practice

3.1 Modify the data source configuration

First, we modified the data source configuration to use a connection pool, improving the efficiency of data reading.

<bean id="dataSource"
      class="com.alibaba.druid.pool.DruidDataSource"
      init-method="init"
      destroy-method="close">
    <property name="driverClassName" value="${jdbc.driverClassName}" />
    <property name="url" value="${jdbc.url}" />
    <property name="username" value="${jdbc.username}" />
    <property name="password" value="${jdbc.password}" />
    <property name="initialSize" value="${druid.initialSize}" />
    <property name="minIdle" value="${druid.minIdle}" />
    <property name="maxActive" value="${druid.maxActive}" />
    <property name="maxWait" value="${druid.maxWait}" />
    <property name="timeBetweenEvictionRunsMillis" value="${druid.timeBetweenEvictionRunsMillis}" />
    <property name="minEvictableIdleTimeMillis" value="${druid.minEvictableIdleTimeMillis}" />
    <property name="validationQuery" value="${druid.validationQuery}" />
    <property name="testWhileIdle" value="${druid.testWhileIdle}" />
    <property name="testOnBorrow" value="${druid.testOnBorrow}" />
    <property name="testOnReturn" value="${druid.testOnReturn}" />
    <property name="poolPreparedStatements" value="${druid.poolPreparedStatements}" />
    <property name="maxPoolPreparedStatementPerConnectionSize" value="${druid.maxPoolPreparedStatementPerConnectionSize}" />
    <property name="filters" value="${druid.filters}" />
</bean>

The above code shows how we use Alibaba's Druid connection pool to optimize data reading efficiency.

3.2 Sharded batch processing

We decided to adopt a sharding strategy for the large data volume: split the batch job into multiple smaller tasks and execute them concurrently to improve processing efficiency.

@Configuration
public class BatchConfiguration {

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    private StepBuilderFactory stepBuilderFactory;

    @Autowired
    private DataSource dataSource;

    @Bean
    public Job job() {
        return jobBuilderFactory.get("job")
                .incrementer(new RunIdIncrementer())
                .start(step1())
                .next(step2())
                .build();
    }

    @Bean
    public Step step1() {
        // Chunked, multi-threaded step over the first id slice;
        // reader(null) is overridden by the step-scoped proxy at runtime
        return stepBuilderFactory.get("step1")
                .<User, User>chunk(10000)
                .reader(reader(null))
                .processor(processor())
                .writer(writer(null))
                .taskExecutor(taskExecutor())
                .build();
    }

    @Bean
    public Step step2() {
        return stepBuilderFactory.get("step2")
                .<User, User>chunk(10000)
                .reader(reader2(null))
                .processor(processor())
                .writer(writer2(null))
                .taskExecutor(taskExecutor())
                .build();
    }

    @Bean
    @StepScope
    public JdbcCursorItemReader<User> reader(@Value("#{stepExecutionContext['fromId']}") Long fromId) {
        // Reads one 10000-id slice; fromId is injected from the step execution
        // context. Note: JdbcCursorItemReader is not thread-safe, so with a
        // TaskExecutor on the step it should be wrapped in a SynchronizedItemStreamReader.
        JdbcCursorItemReader<User> reader = new JdbcCursorItemReader<>();
        reader.setDataSource(dataSource);
        reader.setSql("SELECT * FROM user WHERE id > ? AND id <= ?");
        reader.setPreparedStatementSetter(new PreparedStatementSetter() {
            @Override
            public void setValues(PreparedStatement ps) throws SQLException {
                ps.setLong(1, fromId);
                ps.setLong(2, fromId + 10000);
            }
        });
        reader.setRowMapper(new BeanPropertyRowMapper<>(User.class));
        return reader;
    }

    @Bean
    @StepScope
    public JdbcCursorItemReader<User> reader2(@Value("#{stepExecutionContext['fromId']}") Long fromId) {
        // Reads everything beyond the first slice
        JdbcCursorItemReader<User> reader = new JdbcCursorItemReader<>();
        reader.setDataSource(dataSource);
        reader.setSql("SELECT * FROM user WHERE id > ?");
        reader.setPreparedStatementSetter(new PreparedStatementSetter() {
            @Override
            public void setValues(PreparedStatement ps) throws SQLException {
                ps.setLong(1, fromId + 10000);
            }
        });
        reader.setRowMapper(new BeanPropertyRowMapper<>(User.class));
        return reader;
    }

    @Bean
    public ItemProcessor<User, User> processor() {
        return new UserItemProcessor();
    }

    @Bean
    public ItemWriter<User> writer(DataSource dataSource) {
        JdbcBatchItemWriter<User> writer = new JdbcBatchItemWriter<>();
        writer.setDataSource(dataSource);
        writer.setSql("INSERT INTO user(name, age) VALUES(?, ?)");
        writer.setItemPreparedStatementSetter(new UserPreparedStatementSetter());
        return writer;
    }

    @Bean
    public ItemWriter<User> writer2(DataSource dataSource) {
        JdbcBatchItemWriter<User> writer = new JdbcBatchItemWriter<>();
        writer.setDataSource(dataSource);
        writer.setSql("UPDATE user SET age = ? WHERE name = ?");
        writer.setItemPreparedStatementSetter(new UserUpdatePreparedStatementSetter());
        return writer;
    }

    @Bean(destroyMethod = "shutdown")
    public ThreadPoolTaskExecutor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);
        executor.setMaxPoolSize(20);
        executor.setQueueCapacity(30);
        executor.initialize();
        return executor;
    }

    @Bean
    public StepExecutionListener stepExecutionListener() {
        return new StepExecutionListenerSupport() {
            @Override
            public ExitStatus afterStep(StepExecution stepExecution) {
                // Surface skipped records in the step's exit status
                if (stepExecution.getSkipCount() > 0) {
                    return new ExitStatus("COMPLETED_WITH_SKIPS");
                } else {
                    return ExitStatus.COMPLETED;
                }
            }
        };
    }
}

The code above processes a large data set with sharded batch steps: the work is split into smaller slices that execute concurrently, improving batch throughput.
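
The readers above pull fromId out of the stepExecutionContext, which is the mechanism Spring Batch's partitioned steps use. A sketch of a Partitioner that could populate that key by splitting the id range into fixed-size slices (the class name and range bounds are illustrative, not from the original project):

public class IdRangePartitioner implements Partitioner {

    private final long minId;
    private final long maxId;
    private final long sliceSize;

    public IdRangePartitioner(long minId, long maxId, long sliceSize) {
        this.minId = minId;
        this.maxId = maxId;
        this.sliceSize = sliceSize;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // One ExecutionContext per id slice; each worker step then reads
        // rows where id > fromId and id <= fromId + sliceSize
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int index = 0;
        for (long from = minId; from < maxId; from += sliceSize) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("fromId", from);
            partitions.put("partition" + index++, context);
        }
        return partitions;
    }
}

A master step would then hand these contexts to the worker step via the step builder's partitioner() method, though the configuration above drives the split by hand with two fixed steps.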

3.3 Monitoring and exception-handling strategies

Finally, we added monitoring and exception-handling strategies to detect and deal with exceptions raised during batch runs.

@Configuration
public class BatchConfiguration {

    ...

    @Bean
    public Step step1() {
        return stepBuilderFactory.get("step1")
                .<User, User>chunk(10000)
                .reader(reader(null))
                .processor(processor())
                .writer(writer(null))
                .taskExecutor(taskExecutor())
                .faultTolerant()
                // Decide per exception whether a record may be skipped
                .skipPolicy(userSkipPolicy())
                // Retry transient failures up to the policy's attempt limit
                .retryPolicy(simpleRetryPolicy())
                .noRollback(NullPointerException.class)
                .listener(stepExecutionListener())
                .build();
    }

    @Bean
    public StepExecutionListener stepExecutionListener() {
        return new StepExecutionListenerSupport() {
            @Override
            public ExitStatus afterStep(StepExecution stepExecution) {
                if (stepExecution.getSkipCount() > 0) {
                    return new ExitStatus("COMPLETED_WITH_SKIPS");
                } else {
                    return ExitStatus.COMPLETED;
                }
            }
        };
    }

    @Bean
    public SkipPolicy userSkipPolicy() {
        // Skip everything except NullPointerException, which should fail the step
        return (Throwable t, int skipCount) -> !(t instanceof NullPointerException);
    }

    @Bean
    public RetryPolicy simpleRetryPolicy() {
        SimpleRetryPolicy retryPolicy = new SimpleRetryPolicy();
        retryPolicy.setMaxAttempts(3);
        return retryPolicy;
    }

    @Bean
    public ItemWriter<User> writer(DataSource dataSource) {
        // Chain the real writer with a logging writer
        CompositeItemWriter<User> writer = new CompositeItemWriter<>();
        List<ItemWriter<? super User>> writers = new ArrayList<>();
        writers.add(new UserItemWriter());
        writers.add(new LogUserItemWriter());
        writer.setDelegates(writers);
        writer.afterPropertiesSet();
        return writer;
    }

    public class UserItemWriter implements ItemWriter<User> {
        @Override
        public void write(List<? extends User> items) throws Exception {
            for (User item : items) {
                ...
            }
        }
    }

    // Implements ItemWriteListener as well, so onWriteError() is invoked
    // when a write fails and the failure can be logged
    public class LogUserItemWriter implements ItemWriter<User>, ItemWriteListener<User> {
        @Override
        public void write(List<? extends User> items) throws Exception {
            for (User item : items) {
                ...
            }
        }

        @Override
        public void beforeWrite(List<? extends User> items) {
        }

        @Override
        public void afterWrite(List<? extends User> items) {
        }

        @Override
        public void onWriteError(Exception exception, List<? extends User> items) {
            ...
        }
    }

    @Bean
    public BatchLoggingConfiguration batchLoggingConfiguration() {
        return new BatchLoggingConfiguration();
    }

}

The code above uses monitoring and exception-handling strategies to discover and deal with exceptions in batch tasks. faultTolerant() turns on fault-tolerant processing; skipPolicy() configures which failed records may be skipped; retryPolicy() configures the retry strategy; and noRollback() exempts chosen exceptions from triggering a rollback. A CompositeItemWriter chains writers so that exception handling (here, logging) can be composed with the real write and adapted to actual business needs. Spring Boot Actuator can also be used to monitor the tasks while they run.

4 Test results

After applying the optimizations above, testing showed the following results:

  1. Data-read efficiency improved by about 50%, and overall batch processing speed by about 40%;
  2. The exception rate dropped by 30%, and exceptions were detected and handled roughly four times faster.

4. Summary and review

As the analysis and practice in this article show, a batch processing framework is very effective for processing large volumes of data, but in real applications its efficiency and stability still need deliberate tuning. Techniques such as connection pooling, sharded batch processing, fault tolerance, and exception handling all help on that front. I hope the content of this article is useful to you.
