Using Spring Data Hadoop with HDFS: generating an Avro file and uploading it

1. First, create a Configuration instance

Configuration config = new Configuration();

2. Then create a DatasetRepositoryFactory

DatasetRepositoryFactory datasetRepositoryFactory = new DatasetRepositoryFactory();

It exposes three setters (combined in the snippet after this list):

setConf(): takes the Configuration instance created in step 1

setNamespace(): takes the data directory (the dataset namespace)

setBasePath(): takes the HDFS address, e.g. hdfs://......
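
Put together (the values match the test at the bottom; afterPropertiesSet() has to be called by hand when the factory is not managed by Spring):

DatasetRepositoryFactory factory = new DatasetRepositoryFactory();
factory.setConf(config);                         // the Configuration from step 1
factory.setNamespace("files");                   // data directory (dataset namespace)
factory.setBasePath("hdfs://localhost:9000/");   // HDFS address
factory.afterPropertiesSet();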


3. DatasetDefinition

DatasetDefinition definition = new DatasetDefinition();
definition.setFormat(Formats.AVRO.getName());     // file type
definition.setTargetClass(Entity.class);          // the POJO class to be written
definition.setAllowNullValues(false);             // whether null values are allowed
definition.setPartitionStrategy(
        new PartitionStrategy.Builder()
            .identity("partition property name in the entity", "partition field in the table")
            .build());

4. DataStoreWriter&lt;Entity&gt;

DataStoreWriter<ImportPojo> writer = new AvroPojoDatasetStoreWriter<ImportPojo>(
        ImportPojo.class, datasetRepositoryFactory, definition);   // the factory and definition created above


5. DatasetOperations

DatasetTemplate datasetOperations = new DatasetTemplate();
datasetOperations.setDatasetDefinitions(Collections.singletonList(definition));   // the DatasetDefinition created above
datasetOperations.setDatasetRepositoryFactory(datasetRepositoryFactory);          // the DatasetRepositoryFactory created above

DatasetOperations operations = (DatasetOperations) datasetOperations;   // the cast is fine: DatasetTemplate implements the DatasetOperations interface


After chaining all of this together, you end up with a DatasetOperations instance.

DatasetOperations

The write(Collection<T> records) method takes a Collection (e.g. a List<XXX>) and writes the data to HDFS in the format configured above; I configured Avro.
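
A minimal usage sketch (operations is the DatasetOperations instance from the previous step; the full test further down shows the end-to-end version):

List<ImportPojo> records = new ArrayList<ImportPojo>();
records.add(new ImportPojo());
operations.write(records);   // written to HDFS as Avro, partitioned as configured above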

Because of Kerberos I have not yet been able to verify that this works; I'll test it again later.
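
For reference, on a Kerberized cluster you would normally have to log in through org.apache.hadoop.security.UserGroupInformation before any of the code above runs. This is only a rough, untested sketch; the principal and keytab path are placeholders:

Configuration secureConf = new Configuration();
secureConf.set("hadoop.security.authentication", "kerberos");
UserGroupInformation.setConfiguration(secureConf);
// placeholder principal and keytab path
UserGroupInformation.loginUserFromKeytab("hadoop@EXAMPLE.COM", "/etc/security/keytabs/hadoop.keytab");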

==================================================

The notes above are a bit messy, so here is a simple test that turns an entity into an Avro file and uploads it to HDFS.

package com.avro;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.junit.Test;
import org.kitesdk.data.Formats;
import org.kitesdk.data.PartitionStrategy;
import org.springframework.data.hadoop.store.DataStoreWriter;
import org.springframework.data.hadoop.store.dataset.AvroPojoDatasetStoreWriter;
import org.springframework.data.hadoop.store.dataset.DatasetDefinition;
import org.springframework.data.hadoop.store.dataset.DatasetOperations;
import org.springframework.data.hadoop.store.dataset.DatasetRepositoryFactory;
import org.springframework.data.hadoop.store.dataset.DatasetTemplate;

import com.avro.pojo.ImportPOJO;

public class AVROTest {
	
	@Test
	public void test() throws Exception {
		
		Configuration config = new Configuration();
		
		DatasetRepositoryFactory datasetRepositoryFactory = new DatasetRepositoryFactory();
		datasetRepositoryFactory.setConf(config);
		datasetRepositoryFactory.setNamespace("files");
		datasetRepositoryFactory.setBasePath("hdfs://localhost:9000/");
		datasetRepositoryFactory.afterPropertiesSet();
		
		DatasetDefinition definition = new DatasetDefinition();
		definition.setFormat(Formats.AVRO.getName());
		definition.setTargetClass(ImportPOJO.class);
		definition.setAllowNullValues(false);
		definition.setPartitionStrategy(
		        new PartitionStrategy.Builder()
		            .identity("id", "pre_id")
		            .build());
		
		// Writer created for completeness; this test writes via DatasetOperations instead
		DataStoreWriter<ImportPOJO> dataStoreWriter = new AvroPojoDatasetStoreWriter<ImportPOJO>(ImportPOJO.class, datasetRepositoryFactory, definition);
		
		DatasetTemplate datasetOperations = new DatasetTemplate();
		datasetOperations.setDatasetDefinitions(Collections.singletonList(definition));
		datasetOperations.setDatasetRepositoryFactory(datasetRepositoryFactory);
		DatasetOperations operations = (DatasetOperations) datasetOperations;
		
		ImportPOJO pojo = new ImportPOJO();
		pojo.setId(4);
		pojo.setName("rihanna");
		pojo.setAge(24);
		List<ImportPOJO> list = new ArrayList<ImportPOJO>();
		list.add(pojo);
		
		operations.write(list);
	}
	
}
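
The ImportPOJO class is not shown above; a minimal version consistent with the setters used in the test (the field types are my assumption) would be:

package com.avro.pojo;

// Minimal POJO sketch; field types inferred from setId(4), setName("rihanna"), setAge(24)
public class ImportPOJO {

	private int id;
	private String name;
	private int age;

	public int getId() { return id; }
	public void setId(int id) { this.id = id; }

	public String getName() { return name; }
	public void setName(String name) { this.name = name; }

	public int getAge() { return age; }
	public void setAge(int age) { this.age = age; }
}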

I did not add anything else to the Configuration because everything runs on the same machine. My Linux user is root while the HDFS user is hadoop, and I was testing from Eclipse, so for convenience I added the JVM argument -DHADOOP_USER_NAME=hadoop when running, and that was enough. Test it first and go from there.
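
If you prefer not to pass the JVM flag, the same effect can usually be achieved in code, since Hadoop also reads HADOOP_USER_NAME from the JVM system properties; set it before the first HDFS access (untested alternative):

// Equivalent to running with -DHADOOP_USER_NAME=hadoop
System.setProperty("HADOOP_USER_NAME", "hadoop");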


Reposted from blog.csdn.net/u011856283/article/details/80311676