Crawling Baidu images for the keyword "beauty":
Crawling Google images for the keyword "beauty":
Crawling Bing images for the keyword "beauty":
8 Java classes:
- Startup.java - main function
- ImageCrawler.java - Crawler base class
- BaiduImageCrawler.java - Crawling implementation for Baidu images
- GoogleImageCrawler.java - Crawling implementation for Google images
- BingImageCrawler.java - Crawling implementation for Bing images
- ImageWorker.java - Periodically takes image URLs from the queue and downloads them (100 workers are enabled by default)
- ImageDownloader.java - Image download
- MD5Checksum.java - file MD5 calculation (file deduplication)
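The MD5-based deduplication mentioned for MD5Checksum.java can be sketched roughly as follows. This is not the author's class: the name `Md5DedupSketch` and its methods are hypothetical, and the sketch assumes that two files count as duplicates when their MD5 digests are equal.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not the original MD5Checksum.java.
class Md5DedupSketch {
    // Digests of all files registered so far.
    private final Set<String> seen = new HashSet<>();

    /** Streams the file through MD5 and returns the digest as lowercase hex. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Registers the file and returns true if the same content was seen before. */
    boolean isDuplicate(Path file) throws IOException, NoSuchAlgorithmException {
        return !seen.add(md5Hex(file));
    }
}
```

A downloader could call `isDuplicate` right after saving each image and delete the file when it returns true.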
The following are the crawling implementations for the three search engines.
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BaiduImageCrawler extends ImageCrawler {

    // Query parameters:
    // tn: resultjsonavatarnew
    // ie: utf-8    character encoding (ie = input, oe = output)
    // word: beauty search keyword
    // pn: 60       start offset
    // rn: 30       number of results to return
    // z: 0         size (0 all sizes, 9 extra large, 3 large, 2 medium, 1 small)
    // width: 1024  custom size, width
    // height: 768  custom size, height
    // ic: 0        color (0 all colors, 1 red, 2 yellow, 4 green, 8 cyan, 16 blue, 32 purple,
    //              64 pink, 128 brown, 256 orange, 512 black, 1024 white, 2048 black and white)
    // s: 0         (3 avatar image)
    // face: 0      (1 face close-up)
    // st: -1       (-1 all types, 1 cartoon drawing, 2 simple line drawing)
    // lm: -1       (6 animated images, 7 static images)
    // gsm: 3c      the pn value in hexadecimal
    private static final String BAIDU_IMAGE_SEARCH_URL = "http://image.baidu.com/search/avatarjson?tn=resultjsonavatarnew&ie=utf-8&word=%s&pn=%d&rn=%d&z=3&ic=0&s=0&face=0&st=-1&lm=-1&gsm=%s";
    private static final int PAGE_SIZE = 60;
    private static final String IMAGE_URL_REG = "\"objURL\":\"(https?://[^\"]+)\"";
    private static final Pattern IMAGE_PATTERN = Pattern.compile(IMAGE_URL_REG);

    @Override
    public String getSearchUrl(String keyword, int page) {
        int begin = page * PAGE_SIZE;
        return String.format(BAIDU_IMAGE_SEARCH_URL, keyword, begin, PAGE_SIZE, Integer.toHexString(begin));
    }

    @Override
    public int parseImageUrl(ConcurrentLinkedQueue<String> queue, StringBuffer data) {
        int count = 0;
        // Queue every image URL captured by the pattern and report how many were found.
        Matcher matcher = IMAGE_PATTERN.matcher(data);
        while (matcher.find()) {
            queue.offer(matcher.group(1));
            count++;
        }
        return count;
    }
}
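To make the paging arithmetic and the URL extraction concrete, here is a small standalone sketch. The class `BaiduUrlSketch`, the shortened URL template (paging parameters only) and the sample JSON fragment are assumptions for illustration; the real response format may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone demo; it does not extend ImageCrawler or touch the network.
class BaiduUrlSketch {
    // Shortened form of the URL template above, keeping only the paging parameters.
    static final String URL =
            "http://image.baidu.com/search/avatarjson?tn=resultjsonavatarnew&ie=utf-8&word=%s&pn=%d&rn=%d&gsm=%s";
    static final int PAGE_SIZE = 60;
    static final Pattern IMAGE_PATTERN = Pattern.compile("\"objURL\":\"(https?://[^\"]+)\"");

    /** Builds the search URL: pn is the start offset, gsm is the same offset in hex. */
    static String searchUrl(String keyword, int page) {
        int begin = page * PAGE_SIZE;
        return String.format(URL, keyword, begin, PAGE_SIZE, Integer.toHexString(begin));
    }

    /** Collects every URL captured by the objURL pattern, in order of appearance. */
    static List<String> extract(CharSequence data) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = IMAGE_PATTERN.matcher(data);
        while (matcher.find()) {
            urls.add(matcher.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("beauty", 1));
        // Made-up response fragment, used only to exercise the pattern.
        String sample = "{\"imgs\":[{\"objURL\":\"http://example.com/a.jpg\"},"
                + "{\"objURL\":\"https://example.com/b.png\"}]}";
        System.out.println(extract(sample));
    }
}
```

For page 1 the start offset is 60, so `gsm` becomes `3c`, which matches the values listed in the parameter comments.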
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GoogleImageCrawler extends ImageCrawler {

    // Query parameters:
    // tbm=isch
    // q=beauty    search keyword
    // ijn=0       page number (note: Google only provides pages 0 to 8!)
    // start=0     start offset
    // tbs=isz:l   search criteria
    //   Size
    //     tbs=isz:l        large
    //     tbs=isz:m        medium
    //   Color
    //     tbs=ic:color     color
    //     tbs=ic:gray      black and white
    //     tbs=ic:trans     transparent
    //   Type
    //     tbs=itp:face     face close-up
    //     tbs=itp:photo    photo
    //     tbs=itp:clipart  clip art
    //     tbs=itp:lineart  line drawing
    //     tbs=itp:animated animation
    //   Combined criteria
    //     tbs=isz:l,ic:color,itp:face
    private static final String GOOGLE_IMAGE_SEARCH_URL = "https://www.google.com/search?tbm=isch&q=%s&ijn=%d&start=%d&tbs=isz:l";
    private static final int PAGE_SIZE = 100;
    private static final String IMAGE_URL_REG = "\"ou\":\"(https?://[^\"]+)\"";
    private static final Pattern IMAGE_PATTERN = Pattern.compile(IMAGE_URL_REG);

    @Override
    public String getSearchUrl(String keyword, int page) {
        int begin = page * PAGE_SIZE;
        return String.format(GOOGLE_IMAGE_SEARCH_URL, keyword, page, begin);
    }

    @Override
    public int parseImageUrl(ConcurrentLinkedQueue<String> queue, StringBuffer data) {
        int count = 0;
        // Queue every image URL captured by the pattern and report how many were found.
        Matcher matcher = IMAGE_PATTERN.matcher(data);
        while (matcher.find()) {
            queue.offer(matcher.group(1));
            count++;
        }
        return count;
    }
}
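All three getSearchUrl methods insert the keyword into the URL unchanged. Whether ImageCrawler encodes the keyword beforehand is not shown here; if it does not, keywords containing spaces or non-ASCII characters would need percent-encoding first. A minimal sketch, assuming UTF-8 and using a hypothetical class name:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of the original crawler classes.
class KeywordEncodingSketch {
    static final String GOOGLE_IMAGE_SEARCH_URL =
            "https://www.google.com/search?tbm=isch&q=%s&ijn=%d&start=%d&tbs=isz:l";
    static final int PAGE_SIZE = 100;

    /** Same URL construction as GoogleImageCrawler, with the keyword encoded as UTF-8. */
    static String searchUrl(String keyword, int page) throws UnsupportedEncodingException {
        // URLEncoder produces form encoding: spaces become '+', other bytes become %XX.
        String encoded = URLEncoder.encode(keyword, "UTF-8");
        return String.format(GOOGLE_IMAGE_SEARCH_URL, encoded, page, page * PAGE_SIZE);
    }
}
```

The encoded keyword is passed as a format argument, not as part of the format string, so the `%` characters it contains are not interpreted by `String.format`.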
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BingImageCrawler extends ImageCrawler {

    // Query parameters:
    // async=content
    // q=beauty    search keyword
    // first=118   start offset
    // count=35    number of results to return
    private static final String BING_IMAGE_SEARCH_URL = "http://www.bing.com/images/async?async=content&q=%s&first=%d&count=%d";
    private static final int PAGE_SIZE = 35;
    // In the HTML response the quotes around the image URL are escaped as &quot;
    private static final String IMAGE_URL_REG = "imgurl:&quot;(https?://[^,]+)&quot;";
    private static final Pattern IMAGE_PATTERN = Pattern.compile(IMAGE_URL_REG);

    @Override
    public String getSearchUrl(String keyword, int page) {
        int begin = page * PAGE_SIZE;
        return String.format(BING_IMAGE_SEARCH_URL, keyword, begin, PAGE_SIZE);
    }

    @Override
    public int parseImageUrl(ConcurrentLinkedQueue<String> queue, StringBuffer data) {
        int count = 0;
        // Queue every image URL captured by the pattern and report how many were found.
        Matcher matcher = IMAGE_PATTERN.matcher(data);
        while (matcher.find()) {
            queue.offer(matcher.group(1));
            count++;
        }
        return count;
    }
}
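Each parseImageUrl implementation only fills the shared queue; the download itself is left to the workers. The consumer side might look roughly like the sketch below. It is not the author's ImageWorker.java: `QueueWorkerSketch`, `drain` and the `download` callback are hypothetical, and for simplicity the workers stop as soon as the queue is empty instead of polling it periodically.

```java
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical consumer side of the URL queue; the actual download is passed in as a callback.
class QueueWorkerSketch {
    /** Starts workerCount threads that drain the queue, then waits for them to finish. */
    static void drain(Queue<String> queue, int workerCount, Consumer<String> download)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workerCount);
        for (int i = 0; i < workerCount; i++) {
            pool.execute(() -> {
                String url;
                // poll() returns null once the queue is empty, which ends this worker.
                while ((url = queue.poll()) != null) {
                    download.accept(url);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

With a `ConcurrentLinkedQueue<String>` filled by one of the crawlers, a call such as `QueueWorkerSketch.drain(queue, 100, url -> System.out.println(url))` would hand every queued URL to the callback exactly once, since `poll()` removes each element atomically.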
The log of the crawling process: