Simulating Login and Scraping Data with Java

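Problem:

I recently worked on a data-scraping project and found that much of the material online is incomplete, or that the posted code does not actually fetch the data when run. So I wrote my own login-and-scrape routine against my own site.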

Not logged in

After logging in: screenshot of the data being fetched normally (the expected target data)

Solution

(1) Find the login endpoint, then fetch the data programmatically.

In my case the endpoint is

http://192.168.1.119/blog11/login/login3.shtml?username=xxx&password=xxx
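
The `xxx` values in this URL are placeholders. Real credentials may contain characters such as `&`, `@`, or spaces, and those have to be percent-encoded before they go into a query string or a form body. OkHttp's FormBody.Builder, used in the code further down, handles this for you. The sketch below only illustrates the same encoding with the JDK alone; it assumes Java 10 or later, and the class name FormEncoder and the sample password are made up for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of OkHttp: turns key/value pairs into
// application/x-www-form-urlencoded text.
class FormEncoder {

	static String encode(Map<String, String> fields) {
		StringJoiner joined = new StringJoiner("&");
		for (Map.Entry<String, String> field : fields.entrySet()) {
			// Encode name and value separately so a literal '&' or '=' in a value stays data
			joined.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8) + "="
					+ URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
		}
		return joined.toString();
	}
}
```

With an insertion-ordered map holding username=admin and password=p@ss word&1, FormEncoder.encode returns username=admin&password=p%40ss+word%261.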

(2) Use okhttp3 to simulate the login and fetch the data.

https://mvnrepository.com/ (Maven repository)

JARs to add: okhttp-4.9.1.jar, kotlin-stdlib-1.5.0-M2.jar, okio-2.10.0.jar

package com.game.test;

import java.util.ArrayList;
import java.util.List;

import okhttp3.Cookie;
import okhttp3.CookieJar;
import okhttp3.FormBody;
import okhttp3.HttpUrl;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;

/**
 * Simulates a login request, then fetches data with the logged-in session
 * 
 * @author Leng
 *
 */
public class OkTest3 {

	public static void main(String[] args) {

		long s1 = System.currentTimeMillis();
		try {
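			// The cookie jar keeps the session cookie set at login so that later requests stay logged in.
			// It holds a single list for all hosts, which is enough when only one site is involved.
			OkHttpClient okHttpClient = new OkHttpClient().newBuilder().cookieJar(new CookieJar() {
				private List<Cookie> listCookie = new ArrayList<>();

				@Override
				public void saveFromResponse(HttpUrl url, List<Cookie> cookies) {
					listCookie = cookies;
				}

				@Override
				public List<Cookie> loadForRequest(HttpUrl url) {
					return listCookie;
				}
			}).build();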
			
			{
				System.out.println("----------------Login------------------");
				String url = "http://192.168.1.119/blog11/login/login3.shtml";
				RequestBody formBody = new FormBody.Builder().add("lang_type", "zh-cn").add("username", "admin")
						.add("password", "111111").build();
				// post() sends the form as a POST request; without it the request would default to GET
				final Request request = new Request.Builder().url(url).post(formBody).build();
				// try-with-resources closes the response once the body has been read
				try (Response response = okHttpClient.newCall(request).execute()) {
					String result = response.body().string();
					System.out.println(result);
				}
			}

			long s2 = System.currentTimeMillis();

			{
				System.out.println("--------------Fetch data--------------------");
				String url2 = "http://192.168.1.119/blog11/setting/getMySetting.shtml";
				// The browser-like headers below are optional; they make the crawler look like a regular browser.
				// GET is the default method, so .get() could be omitted.
				final Request request2 = new Request.Builder().url(url2).get()
						.header("User-Agent",
								"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36")
						.addHeader("Content-Type", "application/x-www-form-urlencoded")
						.addHeader("Upgrade-Insecure-Requests", "1")
						.build();
				// The cookie jar attaches the session cookie from the login step automatically
				try (Response response2 = okHttpClient.newCall(request2).execute()) {
					String result2 = response2.body().string();
					System.out.println(result2);
				}
			}

			long s3 = System.currentTimeMillis();
			System.out.println("----Login took: " + (s2 - s1) + " ms, fetching data took: " + (s3 - s2) + " ms");

		} catch (Exception e) {
			// Report the failure instead of silently swallowing it
			e.printStackTrace();
		}
	}

}
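
The anonymous CookieJar in the class above keeps one list and hands it to every request. That is enough for a single site, but it would also send the session cookie to any other host the same client talks to. One possible alternative, sketched here on the assumption that the JDK's java.net.CookieManager is acceptable, is to let it match cookies by domain and path. The class name SessionCookies and the cookie value abc123 are hypothetical. If the okhttp-urlconnection artifact is on the classpath, its JavaNetCookieJar should be able to wrap such a CookieManager for OkHttp, but verify that against the OkHttp version you use.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper around the JDK cookie manager (not OkHttp API).
class SessionCookies {

	// Passing null as the store selects the JDK's default in-memory cookie store
	private final CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);

	// Records the Set-Cookie header values of a response received from the given URL
	void save(String url, List<String> setCookieHeaders) {
		Map<String, List<String>> headers = new HashMap<>();
		headers.put("Set-Cookie", setCookieHeaders);
		try {
			manager.put(URI.create(url), headers);
		} catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}

	// Returns the name=value pairs that belong on a request to the given URL
	List<String> load(String url) {
		try {
			Map<String, List<String>> headers = manager.get(URI.create(url), new HashMap<>());
			List<String> cookies = headers.get("Cookie");
			return cookies == null ? Collections.emptyList() : cookies;
		} catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}
}
```

After saving "JSESSIONID=abc123; Path=/blog11" for the login URL, load() for the getMySetting.shtml URL on the same host returns that cookie, while load() for a URL on a different host returns an empty list.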

Result of running the scraper:

 

Follow-up question: what if the crawler needs to switch IPs frequently?

Fetch the data through a proxy IP, as follows

			// Extra imports for this variant: java.io.IOException, java.net.InetSocketAddress,
			// java.net.Proxy, java.net.ProxySelector, java.net.SocketAddress, java.net.URI
			OkHttpClient okHttpClient = new OkHttpClient().newBuilder().cookieJar(new CookieJar() {
				private List<Cookie> listCookie = new ArrayList<>();

				@Override
				public void saveFromResponse(HttpUrl arg0, List<Cookie> arg1) {
					listCookie = arg1;
				}

				@Override
				public List<Cookie> loadForRequest(HttpUrl arg0) {
					return listCookie;
				}
			}).proxySelector(new ProxySelector() {

				@Override
				public List<Proxy> select(URI uri) {
					// Route every request through one HTTP proxy. To proxy only some hosts,
					// check uri.getHost() here and return a list containing Proxy.NO_PROXY for the rest.
					List<Proxy> proxyList = new ArrayList<>();
					proxyList.add(new Proxy(Proxy.Type.HTTP, new InetSocketAddress("180.97.250.22", 10290)));
					return proxyList;
				}

				@Override
				public void connectFailed(URI uri, SocketAddress sa, IOException ioe) {
					// Called when the proxy could not be reached; report it rather than ignore it
					System.err.println("Proxy " + sa + " failed: " + ioe.getMessage());
				}
			}).build();
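
The selector above always returns the same proxy, so it does not yet switch IPs. A minimal sketch of one way to rotate is a round-robin ProxySelector that offers the next proxy on each call. The class name RotatingProxySelector, the helper httpProxy, and the 192.0.2.x addresses are placeholders for illustration, not real proxies. Since it is a plain java.net.ProxySelector, it should be usable with the proxySelector(...) builder call shown above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.SocketAddress;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin selector: each call to select() offers the next proxy in the list.
class RotatingProxySelector extends ProxySelector {

	private final List<Proxy> proxies;
	private final AtomicInteger next = new AtomicInteger();

	RotatingProxySelector(List<Proxy> proxies) {
		if (proxies.isEmpty()) {
			throw new IllegalArgumentException("at least one proxy is required");
		}
		this.proxies = new ArrayList<>(proxies);
	}

	// Convenience factory for an HTTP proxy given as IP address and port
	static Proxy httpProxy(String host, int port) {
		return new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port));
	}

	@Override
	public List<Proxy> select(URI uri) {
		// floorMod keeps the index valid even after the counter overflows
		int index = Math.floorMod(next.getAndIncrement(), proxies.size());
		return Collections.singletonList(proxies.get(index));
	}

	@Override
	public void connectFailed(URI uri, SocketAddress address, IOException failure) {
		System.err.println("Proxy " + address + " failed for " + uri + ": " + failure.getMessage());
	}
}
```

Rotation only happens when the HTTP client asks the selector for a proxy. If the client reuses pooled connections, that may be less often than once per request, so treat this as a starting point and measure the actual behavior.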

 

For how to obtain dynamic IPs, see "Java crawler techniques (IP proxies)": https://blog.csdn.net/u011628753/article/details/116026914

 

Follow-up question: setting timeouts

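// Requires: import java.util.concurrent.TimeUnit;
OkHttpClient client = new OkHttpClient.Builder()
		.connectTimeout(30, TimeUnit.SECONDS) // time allowed to establish the connection
		.callTimeout(120, TimeUnit.SECONDS)   // upper bound for the whole call, from request to response body
		.pingInterval(5, TimeUnit.SECONDS)    // interval between HTTP/2 and web socket pings
		.readTimeout(60, TimeUnit.SECONDS)    // maximum wait between reads of response data
		.writeTimeout(60, TimeUnit.SECONDS)   // maximum wait for a single write of request data
		.build();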

 


Origin blog.csdn.net/u011628753/article/details/115956940