Crawling Baidu, Google, and Bing images with Java

First, a look at the fetched results.

Baidu image results for the keyword "beauty":

Google image results for the keyword "beauty":

Bing image results for the keyword "beauty":


The project consists of 8 Java classes:
  • Startup.java - main function
  • ImageCrawler.java - crawler base class
  • BaiduImageCrawler.java - crawling implementation for Baidu images
  • GoogleImageCrawler.java - crawling implementation for Google images
  • BingImageCrawler.java - crawling implementation for Bing images
  • ImageWorker.java - worker that periodically takes image URLs from the queue and downloads them (100 workers are started by default)
  • ImageDownloader.java - image download
  • MD5Checksum.java - file MD5 calculation (used for file deduplication)
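
The interplay of the queue, the workers, and the MD5-based deduplication listed above can be sketched roughly as follows. This is only an illustration under assumptions, not the project's actual code: the class `DedupWorkerSketch`, the method names, and the `download` function (a stand-in for whatever ImageDownloader.java does) are all hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;

/** Hypothetical sketch: drain a URL queue and count only content not seen before. */
final class DedupWorkerSketch {

    // Hex MD5 digests of everything accepted so far; the set is safe to share between threads.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns the MD5 digest of the given bytes as a lowercase hex string. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Polls URLs until the queue is empty and returns how many downloads
     * had content that was not seen before. "download" is an assumed stand-in
     * for the real HTTP fetch.
     */
    int drain(ConcurrentLinkedQueue<String> queue, Function<String, byte[]> download) {
        int stored = 0;
        String url;
        while ((url = queue.poll()) != null) {
            byte[] body = download.apply(url);
            if (body != null && seen.add(md5Hex(body))) {
                stored++; // a real worker would write the file to disk here
            }
        }
        return stored;
    }
}
```

In a setup like the one described, each of the 100 workers would presumably call something like `drain` on a timer, all sharing one queue and one set of digests, so that two different URLs serving identical bytes are stored only once.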

The crawling implementations for the three search engines follow.
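import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BaiduImageCrawler extends ImageCrawler {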
	
	// Query parameters (with example values):
	// tn: resultjsonavatarnew
	// ie: utf-8    character encoding (ie = input, oe = output)
	// word: beauty search keyword
	// pn: 60       start offset
	// rn: 30       number of results per request
	// z: 0         size (0 all sizes, 9 extra large, 3 large, 2 medium, 1 small)
	// width: 1024  custom width
	// height: 768  custom height
	// ic: 0        color (0 all colors, 1 red, 2 yellow, 4 green, 8 cyan, 16 blue, 32 purple, 64 pink, 128 brown, 256 orange, 512 black, 1024 white, 2048 black and white)
	// s: 0         (3 = avatar images)
	// face: 0      (1 = face close-ups)
	// st: -1       type (-1 all types, 1 cartoon, 2 line drawing)
	// lm: -1       (6 animated images, 7 static images)
	// gsm: 3c      the pn value in hexadecimal
	private static final String BAIDU_IMAGE_SEARCH_URL = "http://image.baidu.com/search/avatarjson?tn=resultjsonavatarnew&ie=utf-8&word=%s&pn=%d&rn=%d&z=3&ic=0&s=0&face=0&st=-1&lm=-1&gsm=%s";
	private static final int PAGE_SIZE = 60;
	private static final String IMAGE_URL_REG = "\"objURL\":\"(https?://[^\"]+)\"";
	private static final Pattern IMAGE_PATTERN = Pattern.compile(IMAGE_URL_REG);
	
	@Override
	public String getSearchUrl(String keyword, int page) {
		int begin = page * PAGE_SIZE;
		return String.format(BAIDU_IMAGE_SEARCH_URL, keyword, begin, PAGE_SIZE, Integer.toHexString(begin));
	}

	@Override
	public int parseImageUrl(ConcurrentLinkedQueue<String> queue, StringBuffer data) {
		int count = 0;
		Matcher matcher = IMAGE_PATTERN.matcher(data);
		while (matcher.find()) {
			queue.offer(matcher.group(1));
			count++;
		}
		return count;
	}

}
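
The `getSearchUrl` method above inserts the keyword as given. Whether the `ImageCrawler` base class percent-encodes it beforehand is not shown, so the following sketch assumes it does not and encodes the keyword itself. It also shows how `pn` and `gsm` relate: `gsm` is simply the start offset written in hexadecimal. The class name `BaiduUrlSketch` is hypothetical, and `URLEncoder.encode(String, Charset)` requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of building the Baidu search URL with an encoded keyword. */
final class BaiduUrlSketch {

    private static final String TEMPLATE =
            "http://image.baidu.com/search/avatarjson?tn=resultjsonavatarnew&ie=utf-8"
            + "&word=%s&pn=%d&rn=%d&z=3&ic=0&s=0&face=0&st=-1&lm=-1&gsm=%s";
    private static final int PAGE_SIZE = 60;

    static String searchUrl(String keyword, int page) {
        int begin = page * PAGE_SIZE; // start offset, sent as pn
        // Percent-encode the keyword as UTF-8 to match ie=utf-8 in the query.
        String encoded = URLEncoder.encode(keyword, StandardCharsets.UTF_8);
        // gsm is the same offset in hexadecimal, e.g. 60 -> "3c".
        return String.format(TEMPLATE, encoded, begin, PAGE_SIZE, Integer.toHexString(begin));
    }

    public static void main(String[] args) {
        // Second page (index 1) for an ASCII keyword: pn=60, gsm=3c.
        System.out.println(searchUrl("beauty", 1));
        // A non-ASCII keyword is percent-encoded before it is inserted.
        System.out.println(searchUrl("\u7f8e\u5973", 0));
    }
}
```

Note that `URLEncoder` performs form encoding, so a space in the keyword becomes `+`, which is normally acceptable inside a query string.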

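import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GoogleImageCrawler extends ImageCrawler {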
	
	// Query parameters (with example values):
	// tbm=isch
	// q=beauty     search keyword
	// ijn=0        page number (Google only serves pages 0 to 8!)
	// start=0      start offset
	// tbs=isz:l    search filters
	// Size:
	//   tbs=isz:l        large
	//   tbs=isz:m        medium
	// Color:
	//   tbs=ic:color     color
	//   tbs=ic:gray      black and white
	//   tbs=ic:trans     transparent
	// Type:
	//   tbs=itp:face     face close-up
	//   tbs=itp:photo    photo
	//   tbs=itp:clipart  clip art
	//   tbs=itp:lineart  line drawing
	//   tbs=itp:animated animation
	// Combining filters:
	//   tbs=isz:l,ic:color,itp:face
	private static final String GOOGLE_IMAGE_SEARCH_URL = "https://www.google.com/search?tbm=isch&q=%s&ijn=%d&start=%d&tbs=isz:l";
	private static final int PAGE_SIZE = 100;
	private static final String IMAGE_URL_REG = "\"ou\":\"(https?://[^\"]+)\"";
	private static final Pattern IMAGE_PATTERN = Pattern.compile(IMAGE_URL_REG);
	
	@Override
	public String getSearchUrl(String keyword, int page) {
		int begin = page * PAGE_SIZE;
		return String.format(GOOGLE_IMAGE_SEARCH_URL, keyword, page, begin);
	}

	@Override
	public int parseImageUrl(ConcurrentLinkedQueue<String> queue, StringBuffer data) {
		int count = 0;
		Matcher matcher = IMAGE_PATTERN.matcher(data);
		while (matcher.find()) {
			queue.offer(matcher.group(1));
			count++;
		}
		return count;
	}
}
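
The class above hard-codes `tbs=isz:l`, while its comments describe several filters that can be combined with commas and note that only pages 0 to 8 are served. A minimal sketch of how both could be made explicit follows. The class `GoogleQuerySketch`, its method names, and the `encodedKeyword` parameter (assumed to be percent-encoded already) are hypothetical, and Google's current behavior may differ from what the comments describe.

```java
/** Hypothetical sketch: combine tbs filters and guard the page range. */
final class GoogleQuerySketch {

    private static final String TEMPLATE =
            "https://www.google.com/search?tbm=isch&q=%s&ijn=%d&start=%d&tbs=%s";
    private static final int PAGE_SIZE = 100;
    private static final int MAX_PAGE = 8; // per the comment in the crawler class

    /** Joins filters such as "isz:l" and "ic:color" into one tbs value. */
    static String tbs(String... filters) {
        return String.join(",", filters);
    }

    /** True if the page index lies in the range the crawler comments describe. */
    static boolean hasPage(int page) {
        return page >= 0 && page <= MAX_PAGE;
    }

    static String searchUrl(String encodedKeyword, int page, String... filters) {
        if (!hasPage(page)) {
            throw new IllegalArgumentException("page must be between 0 and " + MAX_PAGE);
        }
        return String.format(TEMPLATE, encodedKeyword, page, page * PAGE_SIZE, tbs(filters));
    }
}
```

For example, `GoogleQuerySketch.searchUrl("beauty", 2, "isz:l", "ic:color", "itp:face")` would request page 2 with `start=200` and `tbs=isz:l,ic:color,itp:face`.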

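import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BingImageCrawler extends ImageCrawler {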

	// Query parameters (with example values):
	// async=content
	// q=beauty     search keyword
	// first=118    start offset
	// count=35     number of results per request
	private static final String BING_IMAGE_SEARCH_URL = "http://www.bing.com/images/async?async=content&q=%s&first=%d&count=%d";
	private static final int PAGE_SIZE = 35;
	// In the response the image URL is wrapped in HTML-escaped quotes (&quot;).
	private static final String IMAGE_URL_REG = "imgurl:&quot;(https?://[^,]+)&quot;";
	private static final Pattern IMAGE_PATTERN = Pattern.compile(IMAGE_URL_REG);
	
	@Override
	public String getSearchUrl(String keyword, int page) {
		int begin = page * PAGE_SIZE;
		return String.format(BING_IMAGE_SEARCH_URL, keyword, begin, PAGE_SIZE);
	}

	@Override
	public int parseImageUrl(ConcurrentLinkedQueue<String> queue, StringBuffer data) {
		int count = 0;
		Matcher matcher = IMAGE_PATTERN.matcher(data);
		while (matcher.find()) {
			queue.offer(matcher.group(1));
			count++;
		}
		return count;
	}

}
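
To see how the extraction loop shared by all three crawlers behaves, the sketch below runs it on a made-up Bing-style fragment. It assumes the response wraps each image URL in HTML-escaped quotes (`&quot;`); the sample markup, the example URLs, and the class name `BingParseSketch` are invented for illustration and are not real Bing output.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the URL extraction step on sample data. */
final class BingParseSketch {

    // Assumes the URL sits between HTML-escaped quotes and contains no comma.
    private static final Pattern IMAGE_PATTERN =
            Pattern.compile("imgurl:&quot;(https?://[^,]+)&quot;");

    /** Offers every captured URL to the queue and returns how many were found. */
    static int parse(ConcurrentLinkedQueue<String> queue, CharSequence data) {
        int count = 0;
        Matcher matcher = IMAGE_PATTERN.matcher(data);
        while (matcher.find()) {
            queue.offer(matcher.group(1)); // group 1 is the URL without the quotes
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Invented sample fragment, shaped like an attribute with escaped quotes.
        StringBuffer sample = new StringBuffer(
                "<a m=\"{imgurl:&quot;http://example.com/a.jpg&quot;,oh:&quot;600&quot;}\"></a>"
                + "<a m=\"{imgurl:&quot;https://example.org/b.png&quot;,oh:&quot;480&quot;}\"></a>");
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        System.out.println(parse(queue, sample) + " URLs: " + queue);
    }
}
```

Because `[^,]+` is greedy, the matcher first runs up to the next comma and then backtracks until the trailing `&quot;` fits, so the captured group ends right before the closing escaped quote. A URL that itself contains a comma would not be matched by this pattern.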


The log of the crawling process:



