Hash deduplication with RandomAccessFile and with MapReduce, plus data cleaning with the bridge pattern

Is the data cleaning reasonable? Verify it along several dimensions: grouping and conversion.
ETL offline engineer

Method 1: Read the file with RandomAccessFile and find duplicates with a hash map

import java.io.RandomAccessFile;
import java.util.HashMap;

public class App
{
    public static void main( String[] args ) throws Exception {
        // Map from event id to the number of times it occurs
        HashMap<String, Integer> map = new HashMap<>();
        // Open the file (read-only access is enough here)
        RandomAccessFile raf = new RandomAccessFile("D:\\bgdata\\bgdata01\\events.csv", "r");
        // Skip the header line
        raf.readLine();
        String line = "";
        // Read the data line by line
        while ((line = raf.readLine()) != null) {
            // Split on commas and take the first field, the event id
            String eventid = line.split(",")[0];
            if (map.containsKey(eventid)) {
                // Seen before: increment the count
                map.put(eventid, map.get(eventid) + 1);
            } else {
                // First occurrence: set the count to 1
                map.put(eventid, 1);
            }
        }
        // Close the file
        raf.close();
        // If the map size is smaller than the number of data rows, there are duplicates
        System.out.println(map.size() + "==============");
    }
}

Disadvantage: the file used here has 3 million rows, and reading it this way was very slow. It took about 30 minutes o(╥﹏╥)o
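One likely reason for the slowness is that RandomAccessFile.readLine() reads the file one byte at a time without buffering. The sketch below does the same counting through a BufferedReader. It is not from the original project: the class name BufferedIdCounter and its method are made up for illustration, and no timing was measured for it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the original post.
class BufferedIdCounter {
    /** Counts how often each first-column value occurs, skipping the header line. */
    static Map<String, Integer> countIds(Reader source) {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            reader.readLine(); // skip the header line
            String line;
            while ((line = reader.readLine()) != null) {
                // Only the first field is needed, so split at most once
                String eventId = line.split(",", 2)[0];
                // merge() inserts 1 on first sight and adds 1 afterwards
                counts.merge(eventId, 1, Integer::sum);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }
}
```

To run it on the file from Method 1, pass something like new FileReader("D:\\bgdata\\bgdata01\\events.csv") and compare the size of the returned map with the number of data rows.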

The following method is recommended instead.

Method 2: Use MapReduce to count the event IDs

2.1 Configure the pom file

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>2.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.0</version>
    </dependency>

2.2 Configure the RcMapper class

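import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    IntWritable one = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Emit (eventid, 1) for every input line
        String eventid = value.toString().split(",")[0];
        context.write(new Text(eventid), one);
    }
}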

2.3 Configure the RcReduce class

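import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RcReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the values for this event id (each value is 1 as emitted by RcMapper)
        int count = 0;
        for (IntWritable it : values) {
            count += it.get();
        }
        context.write(key, new IntWritable(count));
    }
}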

2.4 Configure the RcCountDriver class

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RcCountDriver {
    public static void main(String[] args) throws Exception {
        // Instantiate the job object
        Job job = Job.getInstance(new Configuration());
        // Locate the job jar through the driver class
        job.setJarByClass(RcCountDriver.class);

        // Set the mapper class and its output key/value types
        job.setMapperClass(RcMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Set the reducer class and the final output key/value types
        job.setReducerClass(RcReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input file
        FileInputFormat.addInputPath(job, new Path("file:///D:\\bgdata\\bgdata01\\events.csv"));
        // Output directory for the result
        FileOutputFormat.setOutputPath(job, new Path("file:///d:/calres/cal01"));
        /**
         * The job is run through job.waitForCompletion(true).
         * true means that progress and other information are printed for the user while the job runs;
         * with false, the call just waits for the job to end.
         */
        job.waitForCompletion(true);
    }
}

Run the driver to test it!
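To make it clearer what the job computes, the three MapReduce phases can be imitated in plain Java on a handful of lines. This is only a sketch with no Hadoop involved; the class LocalCountSketch and its method are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the RcMapper/RcReduce pair; no Hadoop involved.
class LocalCountSketch {
    static Map<String, Integer> countByEventId(List<String> lines) {
        // "map" + "shuffle": emit (eventid, 1) and group the ones under each key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            String eventId = line.split(",")[0];
            grouped.computeIfAbsent(eventId, k -> new ArrayList<>()).add(1);
        }
        // "reduce": sum the values collected for each key
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }
}
```

Any id whose count is greater than 1 in the result is a duplicate, which is what the real job writes out as tab-separated id and count pairs.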

Method 3: Use MapReduce to remove duplicates

3.1 Configure the CfMapper class

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CfMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Each input line is "id<TAB>count", so split on the tab character
        String[] infos = value.toString().split("\t");
        // Only handle ids whose count is not 1
        if (!infos[1].equals("1")) {
            // Emit the id together with its count (trimmed before parsing)
            context.write(new Text(infos[0]), new IntWritable(Integer.parseInt(infos[1].trim())));
        }
    }
}

3.2 Configure the CfCombiner class

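import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CfCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Write the key once, with the first value received for it
        context.write(key, values.iterator().next());
    }
}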

3.3 Configure the CfCountDriver class

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CfCountDriver {
    public static void main(String[] args) throws Exception {
        // Instantiate the job object
        Job job = Job.getInstance(new Configuration());
        // Locate the job jar through the driver class
        job.setJarByClass(CfCountDriver.class);

        // Set the mapper class and its output key/value types
        job.setMapperClass(CfMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Set CfCombiner as the reducer and the final output key/value types
        // job.setCombinerClass(CfCombiner.class);
        job.setReducerClass(CfCombiner.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input file
        FileInputFormat.addInputPath(job, new Path("file:///d:/calres/cal01/a"));
        // Output directory for the result
        FileOutputFormat.setOutputPath(job, new Path("file:///d:/calres/ca102/"));
        /**
         * The job is run through job.waitForCompletion(true).
         * true means that progress and other information are printed for the user while the job runs;
         * with false, the call just waits for the job to end.
         */
        job.waitForCompletion(true);
    }
}
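The effect of this second job can be checked on a few lines without Hadoop. The sketch below applies the same rule as CfMapper to lines in the "id, tab, count" format that Method 2 produces. The class DuplicateFilterSketch is a hypothetical name, and the sketch compares the parsed number where CfMapper compares the raw string.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local version of the CfMapper filter; no Hadoop involved.
class DuplicateFilterSketch {
    /** Keeps only the ids whose count is not 1, in input order. */
    static Map<String, Integer> duplicatedIds(List<String> countLines) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String line : countLines) {
            String[] infos = line.split("\t");
            int count = Integer.parseInt(infos[1].trim());
            if (count != 1) {
                result.put(infos[0], count);
            }
        }
        return result;
    }
}
```

The output therefore lists the ids that occur more than once, each exactly one time.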

Implementing the Kafka code in IDEA (optimization: bridge pattern)

1. Import the Kafka dependencies into the pom file

Canal pulls data from the database in real time

2. Configure the yml file

The configuration covers manual offset commits, reading from the beginning, the deserializers, and the listener.

server:
  port: 8999
spring:
  application:
    name: userinterest
  kafka:
    bootstrap-servers: 192.168.64.210:9092
    # acks=0: send once regardless of success or failure; no confirmation is required
    # acks=1: only the leader has to confirm that it received the message
    # acks=all or -1: the leader and the ISR replicas must all confirm
    consumer:
      # Whether to commit offsets automatically
      enable-auto-commit: false
      # earliest: with no committed offset, consume from the beginning
      # latest: with no committed offset, consume from the next new message
      auto-offset-reset: earliest
      # Key deserializer
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      # Value deserializer
      value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
    # Listener configuration
    listener:
      # Takes effect only when enable-auto-commit is false; it has no effect when it is true
      # manual_immediate: Acknowledgment.acknowledge() must be called manually to commit
      ack-mode: manual_immediate
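For comparison, the consumer part of this yml corresponds roughly to the following plain Kafka client properties. This is a sketch rather than code from the project: the class ConsumerPropsSketch is a made-up name, and the group id "cm" is taken from the @KafkaListener annotations used later in the post. ack-mode is a Spring Kafka container setting and has no equivalent plain property.

```java
import java.util.Properties;

// Hypothetical helper showing the plain Kafka property names behind the yml keys.
class ConsumerPropsSketch {
    static Properties consumerProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "192.168.64.210:9092");
        props.setProperty("group.id", "cm");               // assumed from the listeners in this post
        props.setProperty("enable.auto.commit", "false");  // offsets are committed manually
        props.setProperty("auto.offset.reset", "earliest"); // start from the beginning if nothing is committed
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }
}
```

These properties could be handed to a KafkaConsumer when Spring is not used; the commit would then be done explicitly in the polling loop.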

3. Configure the bridge pattern template

3.1 Bridge pattern interfaces

public interface FillHbaseData<T> extends FillData<Put,T> {
    List<Put> fillData(List<T> lis);
}

/**
 * Interface that converts a raw line into entities, according to the user data and the entity class passed in
 * @param <T>
 */
public interface StringToEntity<T> {
    List<T> change(String line);
}

3.2 Bridge pattern implementation classes

# EventAttendessFillDataImpl class
public class EventAttendessFillDataImpl implements FillHbaseData<EventAttendees> {
    @Override
    public List<Put> fillData(List<EventAttendees> lst) {
        List<Put> puts = new ArrayList<>();
        lst.stream().forEach(ea->{
            Put put = new Put((ea.getEventid() + ea.getUserid() + ea.getAnswer()).getBytes());
            put.addColumn("base".getBytes(),"eventid".getBytes(),ea.getEventid().getBytes());
            put.addColumn("base".getBytes(),"userid".getBytes(),ea.getUserid().getBytes());
            put.addColumn("base".getBytes(),"answer".getBytes(),ea.getAnswer().getBytes());
            puts.add(put);
        });
        return puts;
    }
}

# EventFillDataImpl class
public class EventFillDataImpl implements FillHbaseData<Events> {
    @Override
    public List<Put> fillData(List<Events> lis) {
        List<Put> puts = new ArrayList<>();
        lis.stream().forEach(
                event -> {
                    Put put = new Put(event.getEventid().getBytes());
                    put.addColumn("base".getBytes(),"userid".getBytes(),event.getUserid().getBytes());
                    put.addColumn("base".getBytes(),"starttime".getBytes(),event.getStarttime().getBytes());
                    put.addColumn("base".getBytes(),"city".getBytes(),event.getCity().getBytes());
                    put.addColumn("base".getBytes(),"zip".getBytes(),event.getZip().getBytes());
                    put.addColumn("base".getBytes(),"state".getBytes(),event.getState().getBytes());
                    put.addColumn("base".getBytes(),"country".getBytes(),event.getCountry().getBytes());
                    put.addColumn("base".getBytes(),"lat".getBytes(),event.getLat().getBytes());
                    put.addColumn("base".getBytes(),"lng".getBytes(),event.getLng().getBytes());
                    puts.add(put);
                });
        return puts;
    }
}

# UserFriendsFillDataImpl class
/**
 * Converts the List built from the userfriends message queue into a List<Put>
 */
public class UserFriendsFillDataImpl implements FillHbaseData<UserFriends> {
    @Override
    public List<Put> fillData(List<UserFriends> lis) {
        List<Put> puts=new ArrayList<>();

        lis.stream().forEach(userFriends -> {
            Put put = new Put((userFriends.getUserid()+"-"+userFriends.getFriendid()).getBytes()); // row key
            put.addColumn("base".getBytes(),"userid".getBytes(),userFriends.getUserid().getBytes());
            put.addColumn("base".getBytes(),"friendid".getBytes(),userFriends.getFriendid().getBytes());
            // Collect the Put, otherwise nothing is returned to the caller
            puts.add(put);
        });
        return puts;
    }
}

# EventAttendeesChangeImpl class
public class EventAttendeesChangeImpl implements StringToEntity<EventAttendees> {
    /**
     * The input has the form: eventid,yes,maybe,invited,no
     * ex: 123,112233 34343,234234 45454,112233 23232 234234,3434343 34343
     * It is converted to: 123 112233 yes, 123 34343 yes, 123 234234 maybe...
     * @param line
     * @return
     */
    @Override 
    public List<EventAttendees> change(String line) { 
        String[] infos = line.split(",", -1);
        List<EventAttendees> eas = new ArrayList<>();
        //First collect everyone who answered yes
        if (!infos[1].trim().isEmpty()){
            Arrays.asList(infos[1].split(" ")).stream().forEach(
                    yes->{
                     EventAttendees ea = EventAttendees.builder()
                             .eventid(infos[0]).userid(yes).answer("yes")
                             .build();
                        eas.add(ea);
                    });
        }
        //Then collect everyone who answered maybe
        if (!infos[2].trim().isEmpty()){
            Arrays.asList(infos[2].split(" ")).stream().forEach(
                    maybe->{
                        EventAttendees ea = EventAttendees.builder()
                                .eventid(infos[0]).userid(maybe).answer("maybe")
                                .build();
                        eas.add(ea); 
                    }); 
        } 
        //Then collect everyone who was invited
        if (!infos[3].trim().isEmpty()){
            Arrays.asList(infos[3].split(" ")).stream().forEach(
                    invited->{
                        EventAttendees ea = EventAttendees.builder()
                                .eventid(infos[0]).userid(invited).answer("invited")
                                .build();
                        eas.add(ea);
                    });
        }
        //Finally collect everyone who answered no
        if (!infos[4].trim().isEmpty()){
            Arrays.asList(infos[4].split(" ")).stream().forEach(
                    no->{
                        EventAttendees ea = EventAttendees.builder()
                                .eventid(infos[0]).userid(no).answer("no")
                                .build();
                        eas.add(ea);
                    });
        }
        return eas;
    }
}
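The conversion is easier to follow on a small input. The sketch below performs the same flattening with plain strings instead of the EventAttendees entity; the class AttendeeFlattenSketch and the sample line are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, entity-free version of the attendee flattening.
class AttendeeFlattenSketch {
    // Column order after the event id, as described in the comment of the original class
    private static final String[] ANSWERS = {"yes", "maybe", "invited", "no"};

    /** Returns one "eventid userid answer" row per user id found in the line. */
    static List<String> flatten(String line) {
        String[] infos = line.split(",", -1); // -1 keeps empty trailing columns
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < ANSWERS.length && i + 1 < infos.length; i++) {
            String column = infos[i + 1].trim();
            if (column.isEmpty()) {
                continue; // nobody gave this answer
            }
            for (String userId : column.split("\\s+")) {
                rows.add(infos[0] + " " + userId + " " + ANSWERS[i]);
            }
        }
        return rows;
    }
}
```

For the line "123,11 22,33,,44" this yields four rows: two for yes, one for maybe, none for invited, and one for no.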

# EventsChangeImpl class
public class EventsChangeImpl implements StringToEntity<Events> {
    @Override
    public List<Events> change(String line) {
        String[] infos = line.split(",", -1);
        List<Events> events=new ArrayList<>();
        Events event = Events.builder().eventid(infos[0]).userid(infos[1]).starttime(infos[2])
                .city(infos[3]).state(infos[4]).zip(infos[5]).country(infos[6])
                .lat(infos[7]).lng(infos[8]).build();
        events.add(event);

        return events;
    }
}

# UserFriendsChangeImpl class
/**
 * Converts 123123,123435 435455 345345 => 123123,123435  123123,435455  123123,345345
 */

public class UserFriendsChangeImpl implements StringToEntity<UserFriends> {
    @Override
    public List<UserFriends> change(String line) {
        String[] infos = line.split(",");
        List<UserFriends> ufs = new ArrayList<>();
        Arrays.asList((infos[1]).split(" ")).stream().forEach(
                fid->{
                    UserFriends uf = UserFriends.builder().userid(infos[0]).friendid(fid).build();
                    ufs.add(uf);
                }
        );
        return ufs;
    }
}

3.3 Bridge pattern abstract classes

/**
 * #3.3.1 AbstractDataChanage abstract class
 * The abstraction role in the bridge pattern
 */
public abstract class AbstractDataChanage<E,T> implements DataChanage<T> {

    protected FillData<E,T> fillData;
    protected StringToEntity<T> stringToEntity;


    public AbstractDataChanage(FillData<E, T> fillData, StringToEntity<T> stringToEntity) {
        this.fillData = fillData;
        this.stringToEntity = stringToEntity;
    }

    @Override
    public abstract List<T> change(String line);

    public abstract void fill(ConsumerRecord<String,String> record);

}


/**
 * #3.3.2 DataChanage interface
 * Data conversion interface: Kafka data is converted into a common data format
 * (if there are several target databases such as Redis, HBase and Oracle, a fill interface has to be written for each)
 * @param <T>
 */
public interface DataChanage<T> {
    List<T> change(String line);
}

# 3.3.3 DataChangeFillHbaseDatabase class
public class DataChangeFillHbaseDatabase<T> extends AbstractDataChanage<Put,T> {
    public DataChangeFillHbaseDatabase(
            FillData<Put,T> fillData,
            StringToEntity<T> stringToEntity)
    {
        super(fillData,stringToEntity);
    }

    @Override
    public List<T> change(String line){
        return stringToEntity.change(line);
    }

    @Override
    public void fill(ConsumerRecord<String,String> record){
        //Take the value of the ConsumerRecord read from Kafka and convert it to a List<Put>
        List<Put> puts = fillData.fillData(change(record.value()));
        //Fill the collection into the corresponding HBase table
    }
}

# 3.3.4 FillData interface
public interface FillData<T,E> {
    List<T> fillData(List<E> lst);
}

3.4 Bridge pattern compared with abstract factory

Abstract factory:
factory a produces products a, b, c (3 products),
factory b produces products a, b, c (3 products)
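In the bridge pattern, by contrast, two independent dimensions are combined through composition: here, how a line is parsed and how the parsed entities are turned into rows. The following stripped-down sketch shows that wiring without Kafka or HBase. All names in it (BridgeSketch, LineParser, RowFiller, Pipeline) are hypothetical and only mirror the roles of StringToEntity, FillData and AbstractDataChanage.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, dependency-free sketch of the bridge wiring used in this post.
class BridgeSketch {
    // Dimension 1: turn a raw line into entities (role of StringToEntity)
    interface LineParser<T> { List<T> parse(String line); }

    // Dimension 2: turn entities into rows for some store (role of FillData)
    interface RowFiller<R, T> { List<R> fill(List<T> entities); }

    // Abstraction: holds one implementation of each dimension and only connects them
    static class Pipeline<R, T> {
        private final LineParser<T> parser;
        private final RowFiller<R, T> filler;

        Pipeline(LineParser<T> parser, RowFiller<R, T> filler) {
            this.parser = parser;
            this.filler = filler;
        }

        List<R> handle(String line) {
            return filler.fill(parser.parse(line));
        }
    }

    static List<String> demo() {
        // Parser for "userid,friend1 friend2 ..." lines
        LineParser<String[]> friendsParser = line -> {
            String[] infos = line.split(",");
            List<String[]> pairs = new ArrayList<>();
            for (String friendId : infos[1].trim().split("\\s+")) {
                pairs.add(new String[] {infos[0], friendId});
            }
            return pairs;
        };
        // Filler that builds "userid-friendid" row keys instead of HBase Put objects
        RowFiller<String, String[]> keyFiller = pairs -> {
            List<String> keys = new ArrayList<>();
            for (String[] pair : pairs) {
                keys.add(pair[0] + "-" + pair[1]);
            }
            return keys;
        };
        return new Pipeline<>(friendsParser, keyFiller).handle("123123,123435 435455");
    }
}
```

Swapping in another parser or another filler does not require touching Pipeline, which is the point of keeping the two dimensions separate.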

4. Write entity classes

@Data
@AllArgsConstructor
@NoArgsConstructor
@Builder
public class EventAttendees {
    private String eventid;
    private String userid;
    private String answer;
}

@Data
@AllArgsConstructor
@NoArgsConstructor
@Builder
public class Events {
    private String eventid;
    private String userid;
    private String starttime;
    private String city;
    private String state;
    private String zip;
    private String country;
    private String lat;
    private String lng;
}

@Data
@AllArgsConstructor
@NoArgsConstructor
@Builder
public class UserFriends {
    private String userid;
    private String friendid;
}

5. Write the configuration class

@Configuration
public class HbaseConfig {

        @Bean
        public org.apache.hadoop.conf.Configuration hbaseConfiguration(){
            org.apache.hadoop.conf.Configuration cfg= HBaseConfiguration.create();
            cfg.set(HConstants.ZOOKEEPER_QUORUM,"192.168.64.210:2181");
            return cfg;

        }

        @Bean
        @Scope("prototype")
        public Connection getConnection() {
            Connection connection=null;
            try {
                connection = ConnectionFactory.createConnection(hbaseConfiguration());
            } catch (IOException e) {
                e.printStackTrace();
            }
            return connection;
        }

        @Bean
        public Supplier<Connection> hbaseConSupplier(){
            return ()->{
                return getConnection();
            };
        }
}

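5.1 Solving data duplication in HBase

HBase stores key-value pairs underneath, and the key is the row key. Here the row key is userid + friendid, so writing the same record again lands on the same row.

What if the amount of data is too large? (More than 300,000 rows at tens of GB, or 30 million rows at about several GB.)
Split the data across databases and tables, or partition vertically (10 GB).
HBase: use the pre-partitioning (pre-split) feature.
Oracle: use partitioning (hash partition, range partition, list partition).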
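The deduplicating effect of the row key can be imitated with a sorted map. In the sketch below a TreeMap stands in for the HBase table; this is an analogy under that assumption, not HBase code, and the class RowKeyDedupSketch is a made-up name. HBase can additionally keep older cell versions, which the map does not model.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of an HBase table: one entry per row key, sorted by key.
class RowKeyDedupSketch {
    private final Map<String, String> table = new TreeMap<>();

    void put(String userId, String friendId) {
        // Same row key scheme as UserFriendsFillDataImpl: userid-friendid
        String rowKey = userId + "-" + friendId;
        // A second put with the same row key overwrites instead of adding a row
        table.put(rowKey, friendId);
    }

    int rowCount() {
        return table.size();
    }
}
```

Writing the pair (1, 2) twice and the pair (1, 3) once leaves two rows, so replaying the same Kafka message does not create duplicates.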

5.2 Write the service classes

# 5.2.1
/**
 * Accepts the converted List<Put> and writes it to the HBase table
 */
@Component
public class HbaseWriter {
    @Resource
    private Connection hbaseConnection;

    public void write(List<Put> puts,String tableName ){
        // try-with-resources closes the Table after the write
        try (Table table = hbaseConnection.getTable(TableName.valueOf(tableName))) {
            table.put(puts);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
# 5.2.2
@Component
public class KafkaReader {
    @KafkaListener(groupId = "cm",topics = {"events_raw"})
    public void readEventToHbase(ConsumerRecord<String ,String > record, Acknowledgment ack){
        AbstractDataChanage<Put,Events> eventsHandler = new DataChangeFillHbaseDatabase<Events>(
                new EventFillDataImpl(),
                new EventsChangeImpl()
        );
        eventsHandler.fill(record);
        ack.acknowledge();
    }
    @KafkaListener(groupId = "cm",topics = {"event_attendees_raw"})
    public void readEventToHbase1(ConsumerRecord<String ,String > record, Acknowledgment ack){
        AbstractDataChanage<Put,EventAttendees> eventsHandler = new DataChangeFillHbaseDatabase<EventAttendees>(
                new EventAttendessFillDataImpl(),
                new EventAttendeesChangeImpl()

        );
        eventsHandler.fill(record);
        ack.acknowledge();
    }
    @KafkaListener(groupId = "cm",topics = {"user_friends_raw"})
    public void readEventToHbase2(ConsumerRecord<String ,String > record, Acknowledgment ack){
        AbstractDataChanage<Put, UserFriends> eventsHandler = new DataChangeFillHbaseDatabase<UserFriends>(
                new UserFriendsFillDataImpl(),
                new UserFriendsChangeImpl()

        );
        eventsHandler.fill(record);
        ack.acknowledge();
    }

}

6. Create the tables and column families in HBase

# Automatically calculate the splits from the required number of regions and the split algorithm
create 'userfriends','base',{NUMREGIONS => 3, SPLITALGO => 'HexStringSplit'}
======================================================================
NUMREGIONS description:
HBase's default maximum HFile size is 10G (hbase.hregion.max.filesize=10737418240=10G)
If the source data is in Hive: recommended number of partitions ≈ HDFS size/10G * 10 * 1.2

HexStringSplit, UniformSplit, DecimalStringSplit description:

UniformSplit (takes up little space, the rowkey prefix is completely random): a split algorithm that divides the space of possible keys evenly. It is recommended when the keys are approximately uniform random bytes (such as hashes). Rows are raw byte values in the range 00 => FF, right-padded with 0s to keep the same memcmp() order. This is the natural algorithm for a byte[] environment and saves space, but it is not necessarily the easiest to read.

HexStringSplit (takes up more space, the rowkey prefix is a hexadecimal string): HexStringSplit is a typical RegionSplitter.SplitAlgorithm for choosing region boundaries. The format of a HexStringSplit region boundary is the ASCII representation of an MD5 checksum or of any other uniformly distributed hexadecimal value. Rows are hex-encoded long values in the range "00000000" => "FFFFFFFF", left-padded with 0s so that they sort lexicographically in the same order as binary. Because this split algorithm uses hexadecimal strings as keys, it is easy to read and write in the shell, but it takes up more space and may not be intuitive.

DecimalStringSplit: the rowkey prefix is a decimal string
======================================================================
create 'eventAttendees','base'
cat events.csv.COMPLETED | head -2
cd /opt/data/attendees
cat event_attendees_raw | head -2
create 'events','base'
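To get a feel for what NUMREGIONS => 3 with a hexadecimal key space means, the boundaries can be computed by hand. The sketch below cuts the range "00000000" to "FFFFFFFF" into equal parts. It illustrates the idea only and is not HBase's own implementation; the class HexSplitSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical calculation of evenly spaced hexadecimal split points.
class HexSplitSketch {
    /** Returns numRegions - 1 boundaries that cut 00000000..FFFFFFFF into equal ranges. */
    static List<String> splitPoints(int numRegions) {
        long range = 0xFFFFFFFFL + 1;   // size of the 8-hex-digit key space
        long step = range / numRegions; // width of one region
        List<String> points = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            // Left-pad with zeros so the strings sort like the numbers
            points.add(String.format("%08x", step * i));
        }
        return points;
    }
}
```

For three regions the boundaries are 55555555 and aaaaaaaa, so row keys should start with evenly distributed hexadecimal characters (for example a hash prefix) for the regions to fill evenly.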

6.1 HBase startup command

# Start HBase
start-hbase.sh
Or start it with a script! The script is as follows
# Browser address: http://192.168.64.210:60010/master-status