Problem:
I recently worked on a data-scraping project and found that much of the material online is incomplete, or the sample code does not actually fetch data when run. So here I walk through logging in to my own site and scraping data from it.
(Screenshot: page before login)
(Screenshot: data fetched normally after login, the expected target data)
Solution:
(1) Find the login endpoint and fetch the data programmatically.
In my case the login endpoint is
http://192.168.1.119/blog11/login/login3.shtml?username=xxx&password=xxx
(2) Use okhttp3 to simulate the login and then scrape the data.
Maven repository: https://mvnrepository.com/
Required jars: okhttp-4.9.1.jar, kotlin-stdlib-1.5.0-M2.jar, okio-2.10.0.jar
package com.game.test;

import java.util.ArrayList;
import java.util.List;

import okhttp3.Cookie;
import okhttp3.CookieJar;
import okhttp3.FormBody;
import okhttp3.HttpUrl;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;

/**
 * Simulate a login request, then scrape data with the resulting session.
 *
 * @author Leng
 */
public class OkTest3 {
    public static void main(String[] args) {
        long s1 = System.currentTimeMillis();
        try {
            // The CookieJar keeps the session cookie from the login response
            // and replays it on every subsequent request from this client.
            OkHttpClient okHttpClient = new OkHttpClient().newBuilder().cookieJar(new CookieJar() {
                private List<Cookie> listCookie = new ArrayList<>();

                @Override
                public void saveFromResponse(HttpUrl url, List<Cookie> cookies) {
                    listCookie = cookies;
                }

                @Override
                public List<Cookie> loadForRequest(HttpUrl url) {
                    return listCookie;
                }
            }).build();
            {
                System.out.println("----------------Login------------------");
                String url = "http://192.168.1.119/blog11/login/login3.shtml";
                RequestBody formBody = new FormBody.Builder().add("lang_type", "zh-cn").add("username", "admin")
                        .add("password", "111111").build();
                // Passing a form body makes this a POST request.
                final Request request = new Request.Builder().url(url).post(formBody).build();
                Response response = okHttpClient.newCall(request).execute();
                String result = response.body().string();
                System.out.println(result);
            }
            long s2 = System.currentTimeMillis();
            {
                System.out.println("--------------Fetch data--------------------");
                String url2 = "http://192.168.1.119/blog11/setting/getMySetting.shtml";
                // The browser-like headers below are optional; they only disguise
                // the crawler as an ordinary browser.
                final Request request2 = new Request.Builder().url(url2)
                        .get() // GET is the default, so this call could be omitted
                        .header("User-Agent",
                                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36")
                        .addHeader("Content-Type", "application/x-www-form-urlencoded")
                        .addHeader("Upgrade-Insecure-Requests", "1")
                        .build();
                Response response2 = okHttpClient.newCall(request2).execute();
                String result2 = response2.body().string();
                System.out.println(result2);
            }
            long s3 = System.currentTimeMillis();
            System.out.println("----Login took: " + (s2 - s1) + " ms, fetching data took: " + (s3 - s2) + " ms");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
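The heart of the approach above is the cookie round trip: the login response sets a session cookie, and the data request must send it back. The same mechanism can be illustrated with only the JDK's java.net.CookieManager, with no external jars; this is a minimal sketch, the JSESSIONID value is made up, and it only demonstrates cookie storage, not a real login:

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.List;
import java.util.Map;

public class CookieDemo {
    public static void main(String[] args) throws IOException {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        URI loginUri = URI.create("http://192.168.1.119/blog11/login/login3.shtml");
        // Simulate a Set-Cookie header arriving with the login response
        manager.put(loginUri, Map.of("Set-Cookie", List.of("JSESSIONID=abc123; Path=/")));
        // A follow-up request to the same host now carries the session cookie
        Map<String, List<String>> headers = manager.get(
                URI.create("http://192.168.1.119/blog11/setting/getMySetting.shtml"),
                Map.of());
        System.out.println(headers.get("Cookie"));
    }
}
```

This prints the Cookie header that would accompany the data request, which is exactly what the anonymous CookieJar does for OkHttp.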
Result of running the scrape:
Follow-up question: what if the crawler needs to switch IPs frequently?
Use a proxy IP to fetch the data, as follows:
{
    // Additional imports needed: java.io.IOException, java.net.InetSocketAddress,
    // java.net.Proxy, java.net.ProxySelector, java.net.SocketAddress, java.net.URI
    OkHttpClient okHttpClient = new OkHttpClient().newBuilder().cookieJar(new CookieJar() {
        private List<Cookie> listCookie = new ArrayList<>();

        @Override
        public void saveFromResponse(HttpUrl url, List<Cookie> cookies) {
            listCookie = cookies;
        }

        @Override
        public List<Cookie> loadForRequest(HttpUrl url) {
            return listCookie;
        }
    }).proxySelector(new ProxySelector() {
        @Override
        public List<Proxy> select(URI uri) {
            // Optionally filter by host here, e.g. if (uri.getHost().endsWith(...)).
            // Note: select() must never return null; to bypass the proxy for
            // some hosts, return a list containing Proxy.NO_PROXY instead.
            List<Proxy> proxyList = new ArrayList<>();
            proxyList.add(new Proxy(Proxy.Type.HTTP, new InetSocketAddress("180.97.250.22", 10290)));
            return proxyList;
        }

        @Override
        public void connectFailed(URI uri, SocketAddress sa, IOException ioe) {
            // Log the failed proxy here, e.g. to drop it from a rotation pool.
        }
    }).build();
}
For how to obtain dynamic IPs, see "Java crawler techniques (IP proxies)": https://blog.csdn.net/u011628753/article/details/116026914
Follow-up question: setting timeouts
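The ProxySelector logic can be tried out on its own, without any network traffic, since select() is just a function from URI to proxy list. The sketch below uses only java.net (OkHttp accepts any java.net.ProxySelector) and routes LAN hosts direct while sending everything else through the proxy; the 192.168. prefix check is an illustrative assumption, and the proxy address is the sample one from above, which is almost certainly stale:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.SocketAddress;
import java.net.URI;
import java.util.List;

public class ProxySelectDemo {
    static final ProxySelector SELECTOR = new ProxySelector() {
        @Override
        public List<Proxy> select(URI uri) {
            String host = uri.getHost();
            if (host != null && host.startsWith("192.168.")) {
                // LAN targets: connect directly (never return null from select())
                return List.of(Proxy.NO_PROXY);
            }
            // Everything else goes through the HTTP proxy
            return List.of(new Proxy(Proxy.Type.HTTP,
                    new InetSocketAddress("180.97.250.22", 10290)));
        }

        @Override
        public void connectFailed(URI uri, SocketAddress sa, IOException ioe) {
            System.err.println("proxy failed for " + uri + ": " + ioe.getMessage());
        }
    };

    public static void main(String[] args) {
        System.out.println(SELECTOR.select(URI.create("http://192.168.1.119/blog11/")));
        System.out.println(SELECTOR.select(URI.create("http://example.com/")));
    }
}
```

The first call yields a direct connection and the second yields the proxy, which is the shape a per-host rotation policy would take.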
OkHttpClient client = new OkHttpClient.Builder()
        .connectTimeout(30, TimeUnit.SECONDS)   // time allowed to establish the TCP connection
        .callTimeout(120, TimeUnit.SECONDS)     // upper bound on the entire call, end to end
        .pingInterval(5, TimeUnit.SECONDS)      // keep-alive pings for WebSocket/HTTP2 connections
        .readTimeout(60, TimeUnit.SECONDS)      // max gap between reads from the server
        .writeTimeout(60, TimeUnit.SECONDS)     // max gap between writes to the server
        .build();
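For comparison, the JDK's built-in client (Java 11+) exposes a similar but coarser pair of knobs: a client-level connect timeout and a per-request timeout that roughly plays the role of callTimeout. A minimal sketch, assuming JDK 11+, mirroring the values above:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class TimeoutDemo {
    public static void main(String[] args) {
        // connectTimeout: time allowed to establish the TCP connection
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(30))
                .build();
        // The per-request timeout bounds the whole exchange, like callTimeout
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://192.168.1.119/blog11/login/login3.shtml"))
                .timeout(Duration.ofSeconds(120))
                .GET()
                .build();
        System.out.println(client.connectTimeout().orElseThrow());
        System.out.println(request.timeout().orElseThrow());
    }
}
```

Building the client and request performs no I/O, so the configured durations can be inspected without touching the network.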