[Crawler Basics] Extracting web page information in Java with regular expressions

For web crawling, Java is not as convenient as Python. This article extracts information with regular expressions only. If you want to extract information from HTML files more accurately, you should use an HTML parser — a third-party library such as Jsoup, for example.

We will extract the titles of Douban's Top 250 movies.

Without an HTML parser this is harder. First we need to fetch the page. The `java.net.http` package, introduced as an incubator module in JDK 9 and standardized in JDK 11, is much simpler than the old `HttpURLConnection`:
package newHTTP;

import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HttpClientDoPost
{
	public static void main(String[] args) throws InterruptedException, IOException
	{
		doPost();
	}

	public static void doPost() throws InterruptedException
	{
		try
		{
			// Create the client
			HttpClient client = HttpClient.newHttpClient();
			// Define the request and configure its parameters
			HttpRequest request = HttpRequest.newBuilder()
					.uri(URI.create("https://movie.douban.com/top250"))
					.header("User-Agent", "HTTPie/0.9.2")
					.header("Content-Type", "application/x-www-form-urlencoded;charset=utf-8")
					.POST(HttpRequest.BodyPublishers.ofString("tAddress=" + URLEncoder.encode("1 Market Street", "UTF-8")
					+ "&tCity=" + URLEncoder.encode("San Francisco", "UTF-8")))
					.build();
			// Fetch the page
			HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
			System.out.println(response.body());
		}
		catch (IOException e)
		{
			e.printStackTrace();
		}
	}
}

After getting the page, we notice that each movie title appears in an `alt` attribute, so we use a lookbehind (a zero-width assertion) to pin down its position. (Java's documentation only guarantees lookbehind with a bounded length, but `.*` inside a lookbehind is accepted in practice.)

Pattern pattern = Pattern.compile("[!\\s·,:()\\u4e00-\\u9fa5]*\\d{0,4}[!\\s·,:()\\u4e00-\\u9fa5]+(?<=alt.*)");
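For comparison, a simpler and more robust alternative (a sketch, not the author's pattern) is to capture the quoted attribute value directly with a group, which sidesteps the lookbehind and the character-class whitelist entirely:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AltExtractor
{
    // Capture everything between the quotes of an alt attribute
    static final Pattern ALT = Pattern.compile("alt=\"([^\"]+)\"");

    static List<String> titles(String html)
    {
        List<String> result = new ArrayList<>();
        Matcher m = ALT.matcher(html);
        while (m.find())
        {
            result.add(m.group(1)); // group(1) is the attribute value
        }
        return result;
    }

    public static void main(String[] args)
    {
        String html = "<img alt=\"肖申克的救赎\" src=\"a.jpg\"><img alt=\"霸王别姬\" src=\"b.jpg\">";
        System.out.println(titles(html)); // prints [肖申克的救赎, 霸王别姬]
    }
}
```

This still assumes the attribute is always double-quoted, which holds for Douban's markup but not for HTML in general.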

Not much code is needed:

package spider;

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpResponse.BodyHandlers;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DoubanTop250
{
	public static void main(String[] args) throws InterruptedException
	{
		int count = 0;
		for (int i = 0; i < 10; i++)
		{
			// Each page lists 25 movies: start=0, 25, 50, ...
			String content = getHtml("https://movie.douban.com/top250?start=" + i * 25).replace(" ", "");
			// Regular expression derived from the movie titles and their position in the page
			Pattern pattern = Pattern.compile("[!\\s·,:()\\u4e00-\\u9fa5]*\\d{0,4}[!\\s·,:()\\u4e00-\\u9fa5]+(?<=alt.*)");
			Matcher matcher = pattern.matcher(content);

			while (matcher.find())
			{
				String match = matcher.group();
				// Skip matches that are just the site name
				if (!match.contains("豆瓣"))
				{
					System.out.println(match);
					count++;
				}
			}
		}
		System.out.println(count);
	}

	private static String getHtml(String url) throws InterruptedException
	{
		try
		{
			HttpClient client = HttpClient.newHttpClient();
			HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
			HttpResponse<String> response = client.send(request, BodyHandlers.ofString(StandardCharsets.UTF_8));
			return response.body();
		}
		catch (IOException e)
		{
			// Return an empty string so the caller's replace() does not throw a NullPointerException
			return "";
		}
	}
}

Of course, the author did not manage to get all 250. Some titles contain characters the pattern does not match — 色，戒 (Lust, Caution), for example, was not printed. If you want to crawl content faster and more reliably, you need an HTML parser.
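Short of adopting a full parser like Jsoup, even a plain string scan over `alt="` avoids this problem, since it does not care which characters appear in the title — a minimal sketch, assuming the titles always sit in double-quoted `alt` attributes:

```java
import java.util.ArrayList;
import java.util.List;

public class AltScan
{
    // Collect every double-quoted alt attribute value without using regex
    static List<String> altValues(String html)
    {
        List<String> values = new ArrayList<>();
        int pos = 0;
        while ((pos = html.indexOf("alt=\"", pos)) != -1)
        {
            int start = pos + 5;                 // skip past alt="
            int end = html.indexOf('"', start);  // find the closing quote
            if (end == -1) break;                // malformed attribute: stop scanning
            values.add(html.substring(start, end));
            pos = end + 1;
        }
        return values;
    }

    public static void main(String[] args)
    {
        String html = "<img alt=\"色，戒\" src=\"x.jpg\">";
        System.out.println(altValues(html)); // prints [色，戒]
    }
}
```

A title like 色，戒 that defeats the character-class pattern comes through unchanged here, because the scan only looks for the surrounding quotes.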

Origin blog.csdn.net/m0_47202518/article/details/108330913