Setting input and output paths in Hadoop programming

When writing a MapReduce job in Hadoop to process data, you cannot avoid setting the input and output paths. The FileInputFormat base class provides the following APIs for this:




FileInputFormat provides four methods:
(1) addInputPath(): adds one input Path per call
(2) addInputPaths(): takes a comma-separated string of paths, so multiple inputs can be added at once
(3) setInputPaths(Job, Path...): sets the input paths, overwriting any paths set before
(4) setInputPaths(Job, String): sets multiple paths from a comma-separated string, likewise overwriting any paths set before

The code is as follows:

    FileInputFormat.setInputDirRecursive(job, true); // read the input directories recursively
    FileInputFormat.addInputPath(job, new Path("path1"));
    FileInputFormat.addInputPaths(job, "path1,path2,path3,...");
    FileInputFormat.setInputPaths(job, new Path("path1"), new Path("path2"));
    FileInputFormat.setInputPaths(job, "path1,path2,path3,...");




In practice, you only need one of the calls above, chosen according to your business needs.

Now that we know how to pass paths in, let's look at how to pick out exactly the files or directories we want on HDFS. HDFS paths support glob filtering out of the box, which is very powerful: as long as we can write the pattern, we can match almost any path or file we want.

For details, see the globStatus documentation: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)
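HDFS glob syntax is close to the familiar shell glob: *, ?, [abc], and {a,b} are all supported. As a rough local illustration (this uses java.nio's glob matcher, not HDFS itself, but the pattern language is very similar, and the paths here are made up):

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
    public static void main(String[] args) {
        // a pattern shaped like the HDFS examples below:
        // two wildcard levels, then a pv or uv directory
        PathMatcher m = FileSystems.getDefault()
                .getPathMatcher("glob:/Search/*/*/{pv,uv}");
        System.out.println(m.matches(Paths.get("/Search/2015-04-10/00/pv")));      // true
        System.out.println(m.matches(Paths.get("/Search/2015-04-10/00/keyword"))); // false
    }
}
```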
Here is an example from a real project, which should help illustrate how to use this.
First, consider the storage structure on HDFS:





A folder is generated each day, named by date. There are of course many ways to subdivide further, such as by year, month, day, and hour; the right choice depends on the business.

Now look at the next level of directories directly below that:




OK, the storage structure is clear. Now let's pose a few requirements:

(1) Filter out only the data in the pv directory
(2) Filter out only the data in the uv directory
(3) Filter out only the data in the keyword directory
(4) Filter out only the pv and uv data, i.e. data ending in v
(5) Filter out only the 2015 data
(6) Filter out data within a given time range, such as the pv data between 2015-04-10 and 2015-04-17


In fact, requirements (1) through (5) are simple and are really all the same kind of requirement:

Hadoop's FileSystem class supports path wildcards via globStatus(); the corresponding patterns are written as follows:



    FileSystem fs = FileSystem.get(conf);

    // match the pv or uv directory data
    //String basepath = "/user/d1/DataFileShare/Search/*/*/{pv,uv}";
    // match directory data ending in v
    //String basepath = "/user/d1/DataFileShare/Search/*/*/*v";
    // match the uv data
    //String basepath = "/user/d1/DataFileShare/Search/*/*/uv";
    // match the pv data
    //String basepath = "/user/d1/DataFileShare/Search/*/*/pv";

    // match the 2015 pv data
    String basepath = "/user/d1/DataFileShare/Search/2015*/*/pv";
    // run globStatus
    FileStatus[] status = fs.globStatus(new Path(basepath));
    for (FileStatus f : status) {
        // print the full path
        System.out.println(f.getPath().toString());
        // print only the last-level directory name
        //System.out.println(f.getPath().getName());
    }






The last requirement is more complex. Handling it with a glob alone would be cumbersome, and it becomes hard to control once other logic is involved; for example, after extracting a date you might look it up in Redis to check whether it exists, and then make some decision based on that.

Hadoop provides an overload of globStatus that also takes a PathFilter, which lets us filter the matched files once more, e.g. with a regex. With this class we can operate on and filter paths much more flexibly. For the date-range requirement above, for instance, we can cut the date out of the full path and apply our own checks to it, and we can also filter the lower-level directories again, such as the pv, uv, or keyword paths.
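The heart of such a filter is plain string work on the full path. This fragment (pure Java, no Hadoop needed; the sample path is hypothetical) shows the split-and-match that this kind of accept() method performs, and why array indices 7 and 9 line up with the date and business segments of the layout above:

```java
import java.util.regex.Pattern;

public class PathParts {
    public static void main(String[] args) {
        // hypothetical full HDFS path, laid out as in this article:
        // .../Search/<date>/<hour>/<business>
        String full = "hdfs://ns1/user/d1/DataFileShare/Search/2015-04-10/00/pv";
        String[] parts = full.split("/");
        String date = parts[7];     // the date segment (index depends on path depth)
        String business = parts[9]; // the business segment
        System.out.println(date);     // 2015-04-10
        System.out.println(business); // pv
        // the same kind of regex checks the filter applies
        System.out.println(Pattern.matches("pv|uv", business));           // true
        System.out.println(Pattern.matches("\\d{4}-\\d{2}-\\d{2}", date)); // true
    }
}
```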

The example code is as follows:

Calling code:

    FileStatus[] status = fs.globStatus(new Path(basepath), new RegexExcludePathAndTimeFilter(rexp_date, rexp_business, "2015-04-04", "2015-04-06"));



Filter implementation:

    /**
     * A PathFilter implementation that uses regexes to pick out
     * the required data; enhanced to also filter paths by time range.
     * @author qindongliang
     **/
    static class RegexExcludePathAndTimeFilter implements PathFilter {
        // regex for the date
        private final String regex;
        // start of the time range
        private final String start;
        // end of the time range
        private final String end;
        // regex for the business directory (pv, uv, keyword, ...)
        private final String regex_business;

        public RegexExcludePathAndTimeFilter(String regex, String regex_business, String start, String end) {
            this.regex = regex;
            this.start = start;
            this.end = end;
            this.regex_business = regex_business;
        }

        @Override
        public boolean accept(Path path) {
            // note: indices 7 and 9 depend on the depth of our storage
            // layout; adjust them if the base path changes
            String data[] = path.toString().split("/");
            String date = data[7];
            String business = data[9];
            return Pattern.matches(regex_business, business) && Pattern.matches(regex, date) && TimeTools.checkDate(start, end, date);
        }

    }

    /** Utility class for date comparison **/
    static class TimeTools {

        final static String DATE_FORMAT = "yyyy-MM-dd";

        final static SimpleDateFormat sdf = new SimpleDateFormat(DATE_FORMAT);

        public static boolean cnull(String checkString) {
            if (checkString == null || checkString.equals("")) {
                return false;
            }
            return true;
        }

        /**
         * @param start start of the range
         * @param end end of the range
         * @param path the date extracted from the path
         **/
        public static boolean checkDate(String start, String end, String path) {
            long startlong = 0;
            long endlong = 0;
            long pathlong = 0;
            try {
                if (cnull(start)) {
                    startlong = sdf.parse(start).getTime();
                }
                if (cnull(end)) {
                    endlong = sdf.parse(end).getTime();
                }
                if (cnull(path)) {
                    pathlong = sdf.parse(path).getTime();
                }
                // when end is empty, accept every date from start onward
                if (end == null || end.equals("")) {
                    return pathlong >= startlong;
                } else {
                    // when end is set, accept only dates inside the range
                    return pathlong >= startlong && pathlong <= endlong;
                }
            } catch (Exception e) {
                log.error("Failed to parse date from path: start: " + start + "  end: " + end + "  compared date: " + path + "  exception: " + e);
            }
            return false;
        }
    }
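One caveat about the code above: SimpleDateFormat is not thread-safe, so the shared static instance could misbehave if the filter were ever called concurrently. A standalone sketch (plain Java, no Hadoop; class and method names are illustrative) of the same range check written with the thread-safe java.time API:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateRange {
    // DateTimeFormatter is immutable and safe to share across threads
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    /** true if date falls in [start, end]; an empty end means open-ended */
    static boolean inRange(String start, String end, String date) {
        try {
            LocalDate d = LocalDate.parse(date, FMT);
            LocalDate s = LocalDate.parse(start, FMT);
            if (end == null || end.isEmpty()) {
                return !d.isBefore(s);
            }
            LocalDate e = LocalDate.parse(end, FMT);
            return !d.isBefore(s) && !d.isAfter(e);
        } catch (Exception ex) {
            return false; // unparsable dates are filtered out, as above
        }
    }

    public static void main(String[] args) {
        System.out.println(inRange("2015-04-10", "2015-04-17", "2015-04-12")); // true
        System.out.println(inRange("2015-04-10", "2015-04-17", "2015-04-20")); // false
        System.out.println(inRange("2015-04-10", "", "2016-01-01"));           // true
    }
}
```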





Summary:

(1) For simple path filtering, using glob wildcards directly in the path is the simplest and most powerful approach.

(2) For more complex path filtering, it is recommended to write a custom PathFilter that encapsulates the filtering code.

(3) Best of all is to plan the directory layout for each folder at build time. For example, if pv is one folder with the dates below it, and uv is another folder with the dates below it, the data is partitioned along business dimensions and becomes very convenient to process. This corresponds to the partitioning feature in Hive, and it avoids unnecessary extra work.
