Is data cleaning justified? Multi-dimensional verification group group conversion ET2 offline engineer
Method 1: Use RandomAccessFile to read the file and a hash map to detect duplicates
import java.io.RandomAccessFile;
import java.util.HashMap;

public class App {
    public static void main(String[] args) throws Exception {
        // Prepare the hash map: event id -> number of occurrences
        HashMap<String, Integer> map = new HashMap<>();
        // Open the file (read-only access is enough here)
        RandomAccessFile raf = new RandomAccessFile("D:\\bgdata\\bgdata01\\events.csv", "r");
        // Skip the first (header) line
        raf.readLine();
        String line;
        // Loop over the remaining lines
        while ((line = raf.readLine()) != null) {
            // Split the line on commas and take the first field as the event id
            String eventid = line.split(",")[0];
            if (map.containsKey(eventid)) {
                // The id was seen before: increment its count
                map.put(eventid, map.get(eventid) + 1);
            } else {
                // First time this id appears: set its count to 1
                map.put(eventid, 1);
            }
        }
        // Close the file
        raf.close();
        // Print the map size; if it is smaller than the number of data rows, there are duplicates
        System.out.println(map.size() + "==============");
    }
}
Disadvantage: the file used here has about 3 million rows, and RandomAccessFile.readLine() is unbuffered (it reads one byte at a time), so reading is very slow. It took about 30 minutes o(╥﹏╥)o
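For comparison, here is a minimal sketch of the same count with a buffered reader, which usually avoids most of the per-byte overhead. The class name BufferedEventIdCounter and the method countIds are made up for this example, and it assumes the file is UTF-8 encoded with a header line. It is not the method used in these notes, just a single-machine alternative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of the original project.
class BufferedEventIdCounter {

    // Counts how often each value of the first CSV column occurs, skipping the header line.
    static Map<String, Integer> countIds(Path csv) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: the file is UTF-8 encoded.
        try (BufferedReader reader = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            reader.readLine(); // skip the header line
            String line;
            while ((line = reader.readLine()) != null) {
                // Only the first field is needed, so split into at most two parts
                String eventId = line.split(",", 2)[0];
                counts.merge(eventId, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // The default path mirrors the one used in Method 1; pass another path as the first argument.
        Path csv = Paths.get(args.length > 0 ? args[0] : "D:\\bgdata\\bgdata01\\events.csv");
        Map<String, Integer> counts = countIds(csv);
        long duplicated = counts.values().stream().filter(c -> c > 1).count();
        System.out.println(counts.size() + " distinct ids, " + duplicated + " ids occur more than once");
    }
}
```

The try-with-resources block closes the reader even if a line fails to parse, and Map.merge replaces the containsKey/put branch from Method 1.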
It is recommended to use the following method
Method 2: Use MapReduce to count how often each event id occurs
2.1 Configure the pom file
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0</version>
</dependency>
2.2 Configure the RcMapper class
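import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused constant value 1 emitted for every line
    IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // The first comma-separated field is the event id
        String eventid = value.toString().split(",")[0];
        context.write(new Text(eventid), one);
    }
}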
2.3 Configure the RcReduce class
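import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RcReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Count how many values arrived for this event id
        int count = 0;
        for (IntWritable it : values) {
            count += 1;
        }
        context.write(key, new IntWritable(count));
    }
}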
2.4 Configure the RcCountDriver class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RcCountDriver {
    public static void main(String[] args) throws Exception {
        // Instantiate the job object
        Job job = Job.getInstance(new Configuration());
        // Locate the jar through the driver class
        job.setJarByClass(RcCountDriver.class);
        // Set the mapper class and its output key/value types
        job.setMapperClass(RcMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Set the reducer class and the final output key/value types
        job.setReducerClass(RcReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input file
        FileInputFormat.addInputPath(job, new Path("file:///D:\\bgdata\\bgdata01\\events.csv"));
        // Output directory (must not exist yet)
        FileOutputFormat.setOutputPath(job, new Path("file:///d:/calres/cal01"));
        /**
         * The job is run through job.waitForCompletion(true).
         * true means that progress and other information is printed while the job runs;
         * false means it just waits for the job to end.
         */
        job.waitForCompletion(true);
    }
}
Just run it to test!
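To make the data flow of Method 2 easier to follow, here is a small in-memory sketch of the three phases (map, shuffle, reduce) in plain Java, without Hadoop. The class name LocalCountSimulation and its methods are invented for this illustration. It mirrors what RcMapper and RcReduce compute, but it is not how Hadoop executes the job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it only imitates the logic of RcMapper and RcReduce.
class LocalCountSimulation {

    // Map + shuffle: emit (eventid, 1) per line and group the emitted values by key.
    static Map<String, List<Integer>> mapAndShuffle(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // sorted keys, like reducer input
        for (String line : lines) {
            String eventId = line.split(",")[0];
            grouped.computeIfAbsent(eventId, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Reduce: count the values that arrived for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().size());
        }
        return counts;
    }

    static Map<String, Integer> run(List<String> lines) {
        return reduce(mapAndShuffle(lines));
    }

    public static void main(String[] args) {
        // Made-up sample rows; only the first column matters here.
        List<String> lines = Arrays.asList("e1,u1", "e2,u2", "e1,u3");
        // Print key and count separated by a tab, the default layout of Hadoop's text output.
        run(lines).forEach((id, count) -> System.out.println(id + "\t" + count));
    }
}
```

Like RcMapper, this sketch does not skip the header line, so the header's first field shows up as one more key with count 1.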
Method 3: Use MapReduce to pick out the duplicated ids
3.1 Configure the CfMapper class
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CfMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Each input line is "id<TAB>count" (the output of Method 2), so split on the tab
        String[] infos = value.toString().split("\t");
        // Trim the count before comparing, then keep only ids whose count is not 1
        String count = infos[1].trim();
        if (!count.equals("1")) {
            context.write(new Text(infos[0]), new IntWritable(Integer.parseInt(count)));
        }
    }
}
3.2 Configure the CfCombiner class
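import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CfCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Emit the key together with its (single) count value
        context.write(key, values.iterator().next());
    }
}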
3.3 Configure the CfCountDriver class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CfCountDriver {
    public static void main(String[] args) throws Exception {
        // Instantiate the job object
        Job job = Job.getInstance(new Configuration());
        // Locate the jar through the driver class
        job.setJarByClass(CfCountDriver.class);
        // Set the mapper class and its output key/value types
        job.setMapperClass(CfMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // CfCombiner is used as the reducer here, not as a combiner
        // job.setCombinerClass(CfCombiner.class);
        job.setReducerClass(CfCombiner.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input path
        FileInputFormat.addInputPath(job, new Path("file:///d:/calres/cal01/a"));
        // Output directory (must not exist yet)
        FileOutputFormat.setOutputPath(job, new Path("file:///d:/calres/ca102/"));
        /**
         * The job is run through job.waitForCompletion(true).
         * true means that progress and other information is printed while the job runs;
         * false means it just waits for the job to end.
         */
        job.waitForCompletion(true);
    }
}
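The filtering step of Method 3 can also be sketched in plain Java, which is handy for checking the logic on a few lines before running the job. DuplicateFilter and duplicatesOnly are hypothetical names for this example. The input is assumed to be the tab-separated "id, count" lines that Method 2 writes.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; it imitates the condition used in CfMapper.
class DuplicateFilter {

    // Keeps only the ids whose count is not 1, preserving input order.
    static Map<String, Integer> duplicatesOnly(List<String> countLines) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String line : countLines) {
            String[] parts = line.split("\t");
            if (parts.length < 2) {
                continue; // skip malformed lines instead of failing
            }
            int count = Integer.parseInt(parts[1].trim());
            if (count != 1) {
                result.put(parts[0], count);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample of Method 2 output
        List<String> lines = Arrays.asList("e1\t2", "e2\t1", "e3\t5");
        duplicatesOnly(lines).forEach((id, count) -> System.out.println(id + "\t" + count));
    }
}
```

An empty result means every id occurred exactly once, that is, the source file has no duplicated event ids.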
Kafka code implementation in IDEA (optimization: bridge mode)
1 Import the Kafka pom dependencies
Canal pulls data from the database in real time
2 Configure yml
The settings cover manual offset commit, reading from the beginning, deserialization, and the listener
server:
  port: 8999
spring:
  application:
    name: userinterest
  kafka:
    bootstrap-servers: 192.168.64.210:9092
    # acks=0: send only once regardless of success or failure; no confirmation is required
    # acks=1: only the leader needs to confirm that it received the message
    # acks=all or -1: the leader and all ISR replicas must confirm
    consumer:
      # Whether to commit the offset automatically
      enable-auto-commit: false
      # earliest: if there is no committed offset, consume from the beginning
      # latest: if there is no committed offset, consume from the next new message
      auto-offset-reset: earliest
      # Key deserializer
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      # Value deserializer
      value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
    # Configure the listener
    listener:
      # Takes effect only when enable.auto.commit is false; when it is true, it has no effect.
      # manual_immediate: you need to call Acknowledgment.acknowledge() manually to commit
      ack-mode: manual_immediate
3. Configure bridge mode template
3.1 Bridge mode interface class
import java.util.List;
import org.apache.hadoop.hbase.client.Put;

public interface FillHbaseData<T> extends FillData<Put, T> {
    List<Put> fillData(List<T> lis);
}

/**
 * Interface that converts a raw line into entity objects,
 * depending on the kind of user data and the entity class passed in
 * @param <T>
 */
public interface StringToEntity<T> {
    List<T> change(String line);
}
3.2 Bridge mode implementation class
# EventAttendessFillDataImpl class
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Put;

public class EventAttendessFillDataImpl implements FillHbaseData<EventAttendees> {
    @Override
    public List<Put> fillData(List<EventAttendees> lst) {
        List<Put> puts = new ArrayList<>();
        lst.forEach(ea -> {
            // Row key: eventid + userid + answer
            Put put = new Put((ea.getEventid() + ea.getUserid() + ea.getAnswer()).getBytes());
            // Each column gets its own field (the original wrote getAnswer() into all three)
            put.addColumn("base".getBytes(), "eventid".getBytes(), ea.getEventid().getBytes());
            put.addColumn("base".getBytes(), "userid".getBytes(), ea.getUserid().getBytes());
            put.addColumn("base".getBytes(), "answer".getBytes(), ea.getAnswer().getBytes());
            puts.add(put);
        });
        return puts;
    }
}

# EventFillDataImpl class
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Put;

public class EventFillDataImpl implements FillHbaseData<Events> {
    @Override
    public List<Put> fillData(List<Events> lis) {
        List<Put> puts = new ArrayList<>();
        lis.forEach(event -> {
            // Row key: eventid
            Put put = new Put(event.getEventid().getBytes());
            put.addColumn("base".getBytes(), "userid".getBytes(), event.getUserid().getBytes());
            put.addColumn("base".getBytes(), "starttime".getBytes(), event.getStarttime().getBytes());
            put.addColumn("base".getBytes(), "city".getBytes(), event.getCity().getBytes());
            put.addColumn("base".getBytes(), "zip".getBytes(), event.getZip().getBytes());
            put.addColumn("base".getBytes(), "state".getBytes(), event.getState().getBytes());
            put.addColumn("base".getBytes(), "country".getBytes(), event.getCountry().getBytes());
            put.addColumn("base".getBytes(), "lat".getBytes(), event.getLat().getBytes());
            put.addColumn("base".getBytes(), "lng".getBytes(), event.getLng().getBytes());
            puts.add(put);
        });
        return puts;
    }
}

# UserFriendsFillDataImpl class
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Put;

/**
 * Converts the List built from the userfriends message queue into a List<Put>
 */
public class UserFriendsFillDataImpl implements FillHbaseData<UserFriends> {
    @Override
    public List<Put> fillData(List<UserFriends> lis) {
        List<Put> puts = new ArrayList<>();
        lis.forEach(userFriends -> {
            // Row key: userid-friendid
            Put put = new Put((userFriends.getUserid() + "-" + userFriends.getFriendid()).getBytes());
            put.addColumn("base".getBytes(), "userid".getBytes(), userFriends.getUserid().getBytes());
            put.addColumn("base".getBytes(), "friendid".getBytes(), userFriends.getFriendid().getBytes());
            // The original never added the Put and returned null
            puts.add(put);
        });
        return puts;
    }
}

# EventAttendeesChangeImpl class
import java.util.ArrayList;
import java.util.List;

public class EventAttendeesChangeImpl implements StringToEntity<EventAttendees> {
    /**
     * The data comes in as: eventid,yes,maybe,invited,no (users inside a field are separated by spaces)
     * ex: 123,112233 34343,234234 45454,112233 23232,234234 3434343 34343
     * It is converted to: 123 112233 yes, 123 34343 yes, 123 234234 maybe ...
     * @param line
     * @return
     */
    @Override
    public List<EventAttendees> change(String line) {
        String[] infos = line.split(",", -1);
        List<EventAttendees> eas = new ArrayList<>();
        // Fields 1 to 4 hold the users who answered yes, maybe, invited and no
        addAnswers(eas, infos, 1, "yes");
        addAnswers(eas, infos, 2, "maybe");
        addAnswers(eas, infos, 3, "invited");
        addAnswers(eas, infos, 4, "no");
        return eas;
    }

    // Adds one EventAttendees per user in infos[index]; empty or missing fields are skipped
    // (the original condition was inverted and only processed empty fields)
    private void addAnswers(List<EventAttendees> eas, String[] infos, int index, String answer) {
        if (index >= infos.length || infos[index].trim().isEmpty()) {
            return;
        }
        for (String userid : infos[index].trim().split(" ")) {
            eas.add(EventAttendees.builder()
                    .eventid(infos[0]).userid(userid).answer(answer)
                    .build());
        }
    }
}

# EventsChangeImpl class
import java.util.ArrayList;
import java.util.List;

public class EventsChangeImpl implements StringToEntity<Events> {
    @Override
    public List<Events> change(String line) {
        String[] infos = line.split(",", -1);
        List<Events> events = new ArrayList<>();
        Events event = Events.builder().eventid(infos[0]).userid(infos[1]).starttime(infos[2])
                .city(infos[3]).state(infos[4]).zip(infos[5]).country(infos[6])
                .lat(infos[7]).lng(infos[8]).build();
        events.add(event);
        return events;
    }
}

# UserFriendsChangeImpl class
import java.util.ArrayList;
import java.util.List;

/**
 * Converts 123123,123435 435455 345345 => 123123,123435  123123,435455  123123,345345
 */
public class UserFriendsChangeImpl implements StringToEntity<UserFriends> {
    @Override
    public List<UserFriends> change(String line) {
        String[] infos = line.split(",");
        List<UserFriends> ufs = new ArrayList<>();
        // A user without friends produces no rows
        if (infos.length < 2) {
            return ufs;
        }
        for (String fid : infos[1].split(" ")) {
            UserFriends uf = UserFriends.builder().userid(infos[0]).friendid(fid).build();
            ufs.add(uf);
        }
        return ufs;
    }
}
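The attendee conversion is the least obvious of the three, so here is a dependency-free sketch of the same idea that can be run on its own. AttendeeLineFlattener and flatten are hypothetical names, the rows are plain String arrays instead of the EventAttendees entity, and the sample line is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it uses String[] rows instead of the EventAttendees entity.
class AttendeeLineFlattener {

    // Field order after the event id, as described in EventAttendeesChangeImpl.
    private static final String[] ANSWERS = {"yes", "maybe", "invited", "no"};

    // Turns "eventid,yes users,maybe users,invited users,no users" into rows {eventid, userid, answer}.
    static List<String[]> flatten(String line) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < ANSWERS.length && i + 1 < fields.length; i++) {
            String users = fields[i + 1].trim();
            if (users.isEmpty()) {
                continue; // nobody gave this answer
            }
            for (String user : users.split("\\s+")) {
                rows.add(new String[] {fields[0], user, ANSWERS[i]});
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample: two "yes", one "maybe", no "invited", one "no"
        for (String[] row : flatten("123,11 22,33,,44")) {
            System.out.println(String.join(" ", row));
        }
    }
}
```

Passing -1 as the limit to split matters here: without it, trailing empty fields would be dropped and the field positions could no longer be trusted.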
3.3 Bridge mode abstract class
/**
 * #3.3.1 AbstractDataChanage abstract class
 * The abstraction role in the bridge mode
 */
public abstract class AbstractDataChanage<E, T> implements DataChanage<T> {
    protected FillData<E, T> fillData;
    protected StringToEntity<T> stringToEntity;

    public AbstractDataChanage(FillData<E, T> fillData, StringToEntity<T> stringToEntity) {
        this.fillData = fillData;
        this.stringToEntity = stringToEntity;
    }

    @Override
    public abstract List<T> change(String line);

    public abstract void fill(ConsumerRecord<String, String> record);
}

/**
 * #3.3.2 DataChanage interface
 * Data conversion interface: Kafka data is converted into a common data format
 * (if there are several target databases such as redis, hbase, oracle, a filling interface has to be written for each)
 * @param <T>
 */
public interface DataChanage<T> {
    List<T> change(String line);
}

# 3.3.3 DataChangeFillHbaseDatabase class
public class DataChangeFillHbaseDatabase<T> extends AbstractDataChanage<Put, T> {
    public DataChangeFillHbaseDatabase(FillData<Put, T> fillData, StringToEntity<T> stringToEntity) {
        super(fillData, stringToEntity);
    }

    @Override
    public List<T> change(String line) {
        return stringToEntity.change(line);
    }

    @Override
    public void fill(ConsumerRecord<String, String> record) {
        // Take the string value of the ConsumerRecord read from Kafka and convert it to a List<Put>
        List<Put> puts = fillData.fillData(change(record.value()));
        // Fill the collection into the corresponding hbase table (the write is not shown in this snippet)
    }
}

# 3.3.4 FillData interface
public interface FillData<T, E> {
    List<T> fillData(List<E> lst);
}
3.4 Bridge mode
Abstract factory a, factory abc, 3 products, factory b, factory abc, 3 products
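The structure of section 3 can be condensed into a small stand-alone sketch of the bridge idea: one object holds a parser and a filler, and either side can be swapped without touching the other. All names here (BridgeSketch, LineParser, RowFiller, Pipeline, FriendParser, RowKeyFiller) are invented for the illustration, and a plain String row key stands in for the HBase Put, so this is only an approximation of the classes above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the parser/filler split; not the project's real classes.
class BridgeSketch {

    // Side 1: turns one raw line into entities of type T (compare StringToEntity).
    interface LineParser<T> {
        List<T> parse(String line);
    }

    // Side 2: turns entities of type T into store-specific rows of type E (compare FillData).
    interface RowFiller<E, T> {
        List<E> fill(List<T> entities);
    }

    // The bridge: combines any parser with any filler (compare AbstractDataChanage).
    static class Pipeline<E, T> {
        private final LineParser<T> parser;
        private final RowFiller<E, T> filler;

        Pipeline(LineParser<T> parser, RowFiller<E, T> filler) {
            this.parser = parser;
            this.filler = filler;
        }

        List<E> handle(String line) {
            return filler.fill(parser.parse(line));
        }
    }

    // Parses "user,f1 f2" into the pairs {user, f1} and {user, f2}.
    static class FriendParser implements LineParser<String[]> {
        @Override
        public List<String[]> parse(String line) {
            String[] fields = line.split(",", 2);
            List<String[]> pairs = new ArrayList<>();
            if (fields.length < 2) {
                return pairs;
            }
            for (String friend : fields[1].trim().split("\\s+")) {
                if (!friend.isEmpty()) {
                    pairs.add(new String[] {fields[0], friend});
                }
            }
            return pairs;
        }
    }

    // Builds "user-friend" row keys; a String stands in for an HBase Put here.
    static class RowKeyFiller implements RowFiller<String, String[]> {
        @Override
        public List<String> fill(List<String[]> pairs) {
            List<String> keys = new ArrayList<>();
            for (String[] pair : pairs) {
                keys.add(pair[0] + "-" + pair[1]);
            }
            return keys;
        }
    }

    public static void main(String[] args) {
        Pipeline<String, String[]> pipeline = new Pipeline<>(new FriendParser(), new RowKeyFiller());
        System.out.println(pipeline.handle("123123,123435 435455 345345"));
    }
}
```

Supporting another target store would mean writing one more RowFiller, and supporting another topic one more LineParser; Pipeline itself stays unchanged.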
4. Write entity classes
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@AllArgsConstructor
@NoArgsConstructor
@Builder
public class EventAttendees {
    private String eventid;
    private String userid;
    private String answer;
}

@Data
@AllArgsConstructor
@NoArgsConstructor
@Builder
public class Events {
    private String eventid;
    private String userid;
    private String starttime;
    private String city;
    private String state;
    private String zip;
    private String country;
    private String lat;
    private String lng;
}

@Data
@AllArgsConstructor
@NoArgsConstructor
@Builder
public class UserFriends {
    private String userid;
    private String friendid;
}
5. Write the configuration class
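import java.io.IOException;
import java.util.function.Supplier;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Scope;

@Configuration
public class HbaseConfig {
    // HBase client configuration pointing at the ZooKeeper quorum
    @Bean
    public org.apache.hadoop.conf.Configuration hbaseConfiguration() {
        org.apache.hadoop.conf.Configuration cfg = HBaseConfiguration.create();
        cfg.set(HConstants.ZOOKEEPER_QUORUM, "192.168.64.210:2181");
        return cfg;
    }

    // Prototype scope: a new connection is created each time the bean is requested
    @Bean
    @Scope("prototype")
    public Connection getConnection() {
        Connection connection = null;
        try {
            connection = ConnectionFactory.createConnection(hbaseConfiguration());
        } catch (IOException e) {
            e.printStackTrace();
        }
        return connection;
    }

    // Supplier that hands out connections on demand
    @Bean
    public Supplier<Connection> hbaseConSupplier() {
        return () -> {
            return getConnection();
        };
    }
}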
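5.1 Solving data duplication in HBase
HBase is a key-value store underneath, and the key is the row key. Here the row key is userid + friendid, so writing the same pair again lands on the same row instead of creating a duplicate.
What if the amount of data is too large? Rough figures from the notes: more than 300,000 records, dozens of GB; 30 million records, about several GB. Options: split into separate databases and tables (vertical partitioning); in HBase, use the pre-partition function (10G per region); in Oracle, use partitioning (hash partition, range partition, list partition).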
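The effect of choosing userid + friendid as the row key can be imitated with an ordinary sorted map: a second write with the same key replaces the first instead of adding a row. RowKeyDedupSketch and its methods are hypothetical, and a TreeMap only approximates HBase, which keeps cell versions and returns the newest one by default.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; a TreeMap stands in for an HBase table sorted by row key.
class RowKeyDedupSketch {

    // Same row key layout as UserFriendsFillDataImpl: userid-friendid
    static String rowKey(String userId, String friendId) {
        return userId + "-" + friendId;
    }

    // "Writes" every pair; a repeated pair hits the same key and does not add a row.
    static Map<String, String> store(List<String[]> pairs) {
        Map<String, String> table = new TreeMap<>();
        for (String[] pair : pairs) {
            table.put(rowKey(pair[0], pair[1]), pair[1]);
        }
        return table;
    }

    public static void main(String[] args) {
        // Made-up sample with one repeated pair
        List<String[]> pairs = Arrays.asList(
                new String[] {"1", "2"},
                new String[] {"1", "2"},
                new String[] {"1", "3"});
        Map<String, String> table = store(pairs);
        System.out.println(table.size() + " rows: " + table.keySet());
    }
}
```

This is why replaying a Kafka topic from the beginning does not multiply the rows: the writes are idempotent as long as the row key is derived only from the record's content.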
5.2 Writing service classes
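# 5.2.1
import java.io.IOException;
import java.util.List;
import javax.annotation.Resource;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.springframework.stereotype.Component;

/**
 * Accepts the converted List<Put> data set and writes it into the HBase database
 */
@Component
public class HbaseWriter {
    @Resource
    private Connection hbaseConnection;

    public void write(List<Put> puts, String tableName) {
        // try-with-resources closes the Table handle after the write
        try (Table table = hbaseConnection.getTable(TableName.valueOf(tableName))) {
            table.put(puts);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

# 5.2.2
import org.apache.hadoop.hbase.client.Put;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class KafkaReader {
    // events_raw topic -> Events entities
    @KafkaListener(groupId = "cm", topics = {"events_raw"})
    public void readEventToHbase(ConsumerRecord<String, String> record, Acknowledgment ack) {
        AbstractDataChanage<Put, Events> eventsHandler = new DataChangeFillHbaseDatabase<Events>(
                new EventFillDataImpl(),
                new EventsChangeImpl()
        );
        eventsHandler.fill(record);
        // Commit the offset manually once the record has been handled
        ack.acknowledge();
    }

    // event_attendees_raw topic -> EventAttendees entities
    @KafkaListener(groupId = "cm", topics = {"event_attendees_raw"})
    public void readEventToHbase1(ConsumerRecord<String, String> record, Acknowledgment ack) {
        AbstractDataChanage<Put, EventAttendees> eventsHandler = new DataChangeFillHbaseDatabase<EventAttendees>(
                new EventAttendessFillDataImpl(),
                new EventAttendeesChangeImpl()
        );
        eventsHandler.fill(record);
        ack.acknowledge();
    }

    // user_friends_raw topic -> UserFriends entities
    @KafkaListener(groupId = "cm", topics = {"user_friends_raw"})
    public void readEventToHbase2(ConsumerRecord<String, String> record, Acknowledgment ack) {
        AbstractDataChanage<Put, UserFriends> eventsHandler = new DataChangeFillHbaseDatabase<UserFriends>(
                new UserFriendsFillDataImpl(),
                new UserFriendsChangeImpl()
        );
        eventsHandler.fill(record);
        ack.acknowledge();
    }
}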
6 Create the tables and column families in the HBase database
# Automatically calculate the splits based on the required number of regions and the split algorithm
create 'userfriends','base',{NUMREGIONS => 3, SPLITALGO => 'HexStringSplit'}
======================================================================
NUMREGIONS description:
HBase's default maximum HFile size is 10G (hbase.hregion.max.filesize=10737418240=10G).
When the source data is in Hive: recommended number of partitions ≈ HDFS size / 10G * 10 * 1.2

HexStringSplit, UniformSplit, DecimalStringSplit description:
UniformSplit (small space footprint, the rowkey prefix is completely random): a split algorithm that evenly divides the space of possible keys. It is recommended when the keys are approximately uniform random bytes (such as hashes). Rows are raw byte values in the range 00 => FF, right-padded with 0s to keep the same memcmp() order. This is the natural algorithm for a byte[] environment and saves space, but it is not necessarily the easiest to read.
HexStringSplit (takes up more space, the rowkey prefix is a hexadecimal string): HexStringSplit is a typical RegionSplitter.SplitAlgorithm for choosing region boundaries. The region boundaries are the ASCII representation of an MD5 checksum or of any other uniformly distributed hexadecimal value. A row is a hex-encoded long value in the range "00000000" => "FFFFFFFF", left-padded with 0s so that it keeps the same lexicographic order as the binary value. Because this algorithm uses hexadecimal strings as keys, it is easy to read and write in the shell, but it takes up more space and may not be intuitive.
DecimalStringSplit: the rowkey prefix is a decimal string
======================================================================
create 'eventAttendees','base'
cat events.csv.COMPLETED | head -2
cd /opt/data/attendees
cat event_attendees_raw | head -2
create 'events','base'
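To get a feel for what NUMREGIONS and HexStringSplit produce, the arithmetic can be sketched in a few lines of Java: the key range "00000000" to "FFFFFFFF" is cut into equal parts, and the cut points become the region boundaries. HexSplitSketch, splitPoints and recommendedRegions are hypothetical names. The code illustrates the idea and the rule of thumb quoted above; it is not HBase's own implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of evenly splitting an 8-digit hexadecimal key space.
class HexSplitSketch {

    // Returns the numRegions - 1 boundaries that cut 00000000..FFFFFFFF into equal parts.
    static List<String> splitPoints(int numRegions) {
        if (numRegions < 1) {
            throw new IllegalArgumentException("numRegions must be at least 1");
        }
        long range = 0x100000000L; // number of keys from 00000000 to FFFFFFFF
        long step = range / numRegions;
        List<String> points = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            points.add(String.format("%08x", step * i)); // left-padded with zeros
        }
        return points;
    }

    // Rule of thumb from the notes: HDFS size / 10G * 10 * 1.2, rounded, at least 1.
    static long recommendedRegions(double hdfsSizeGb) {
        return Math.max(1, Math.round(hdfsSizeGb / 10.0 * 10 * 1.2));
    }

    public static void main(String[] args) {
        System.out.println("3 regions -> boundaries " + splitPoints(3));
        System.out.println("50 GB -> about " + recommendedRegions(50) + " regions");
    }
}
```

With 3 regions there are two boundaries, so row keys are spread evenly only if their prefix is itself evenly distributed hexadecimal text, for example the start of an MD5 hash.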
6.1 hbase startup command
# Start HBase: run start-hbase.sh, or start it with a script
# Browser address: http://192.168.64.210:60010/master-status