Topic
The task is clear: crawl epidemic data from a website with Java and store it in a database. For the crawling we can use the Jsoup Java library.
Approach
- 1. To crawl data with Jsoup, we first need a website that displays the epidemic figures. Here we use Toutiao (Today's Headlines): [Toutiao epidemic data](https://i.snssdk.com/feoffline/hot_list/template/hot_list/forum_tab.html?activeWidget=1). Open the page in Firefox, press F12, and the interface we need can be found in the network panel; I won't walk through the search itself. The interface is: https://i.snssdk.com/forum/home/v1/info/?activeWidget=1&forum_id=1656784762444839
- 2. With the interface found, we analyze its response to work out where the data we want lives. Inspecting the returned JSON in the page, the list we need is `ncov_string_list`.
- 3. With the list in hand, all that remains is to extract the data and store it in the database.
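The per-province bookkeeping in step 3 (the code later sums each province's city counts into running totals) can be sketched in isolation. The only fact taken from the real data is that a city id's first two digits are its province id; the ids and counts below are made up for illustration:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of step 3's aggregation: city-level confirmed counts rolled up
// into per-province totals. Ids and counts here are invented examples.
public class ProvinceTotals {

    // Mirrors the crawler's cityid.substring(0, 2) trick.
    static String provinceOf(String cityId) {
        return cityId.substring(0, 2);
    }

    // Rolls per-city counts up into per-province sums, keyed by province id.
    static Map<String, Integer> sumByProvince(Map<String, Integer> confirmedByCity) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> e : confirmedByCity.entrySet()) {
            totals.merge(provinceOf(e.getKey()), e.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> cities = new TreeMap<>();
        cities.put("4201", 100); // two cities of a made-up province "42"
        cities.put("4202", 50);
        cities.put("1101", 10);  // one city of province "11"
        System.out.println(sumByProvince(cities)); // {11=10, 42=150}
    }
}
```

The real code keeps running `int` sums and resets them per province instead of building a map, but the grouping logic is the same.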
Screenshots
A screenshot of MySQL after the crawler has stored the data:
Code
```java
// Dependencies: jsoup, fastjson, commons-dbutils; DataSourceUtils is a
// project-local helper that supplies the connection pool.
import java.io.IOException;
import java.sql.SQLException;

import org.jsoup.Jsoup;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

import org.apache.commons.dbutils.QueryRunner;

public class EpidemicCrawler {

    // A few constants so the request looks like a normal browser visit
    // (to avoid anti-crawler checks).
    public static String USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:49.0) Gecko/20100101 Firefox/49.0";
    public static String HOST = "i.snssdk.com";
    public static String REFERER = "https://i.snssdk.com/feoffline/hot_list/template/hot_list/forum_tab.html?activeWidget=1";

    public static void main(String[] args) throws IOException, SQLException {
        // Root URL: the interface found through the browser dev tools
        String url = "https://i.snssdk.com/forum/home/v1/info/?activeWidget=1&forum_id=1656784762444839";
        String resultBody = Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .header("Host", HOST)
                .header("Referer", REFERER)
                .ignoreContentType(true) // the endpoint returns JSON, not HTML
                .execute().body();

        // The epidemic list is itself a JSON string nested inside the
        // response, so it has to be parsed a second time.
        JSONObject jsonObject = JSON.parseObject(resultBody);
        String ncovStringList = jsonObject.getJSONObject("forum").getJSONObject("extra").getString("ncov_string_list");
        JSONObject ncovListObj = JSON.parseObject(ncovStringList);
        JSONArray todaydata = ncovListObj.getJSONArray("provinces");

        QueryRunner queryRunner = new QueryRunner(DataSourceUtils.getDataSource());
        String sql = "insert into todaydata_copy1 values(?,?,?,?,?,?,?,?)";
        String confirmedNum, deathsNum, cityname, cityid, treatingNum, provinceid;
        String reprovinceid = null;
        int confirmedNumSum = 0, deathsNumSum = 0, treatingNumSum = 0;

        for (int i = 0; i < todaydata.size(); i++) {
            JSONObject todayData1 = todaydata.getJSONObject(i);
            String updateDate = todayData1.getString("updateDate");
            JSONArray city = todayData1.getJSONArray("cities");
            for (int j = 0; j < city.size(); j++) {
                JSONObject cities = city.getJSONObject(j);
                confirmedNum = cities.getString("confirmedNum");
                deathsNum = cities.getString("deathsNum");
                cityname = cities.getString("name");
                cityid = cities.getString("id");
                treatingNum = cities.getString("treatingNum");
                // The first two digits of a city id are its province id.
                provinceid = cityid.substring(0, 2);
                reprovinceid = provinceid;
                confirmedNumSum += Integer.parseInt(confirmedNum);
                deathsNumSum += Integer.parseInt(deathsNum);
                treatingNumSum += Integer.parseInt(treatingNum);
                // One row per city
                queryRunner.update(sql, updateDate, provinceid, cityname, confirmedNum, deathsNum, treatingNum, cityid, null);
            }
            // One summary row per province, then reset the running totals
            queryRunner.update(sql, updateDate, reprovinceid, null, confirmedNumSum, deathsNumSum, treatingNumSum, null, null);
            confirmedNumSum = 0;
            deathsNumSum = 0;
            treatingNumSum = 0;
        }
    }
}
```
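One fragility worth noting: the crawler feeds the count strings straight into `Integer.parseInt`, which throws `NumberFormatException` and aborts the whole run if the interface ever returns an empty or missing count. A defensive fallback could look like the sketch below; this is an assumption about possible bad data, not something the original run evidently needed:

```java
// Defensive parsing sketch: treat null, empty, or non-numeric count
// strings as zero instead of letting NumberFormatException kill the crawl.
public class SafeParse {

    static int parseOrZero(String s) {
        if (s == null || s.isEmpty()) return 0;
        try {
            return Integer.parseInt(s.trim());
        } catch (NumberFormatException e) {
            return 0; // e.g. the interface returns "-" or "" for a count
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrZero("42")); // 42
        System.out.println(parseOrZero(""));   // 0
        System.out.println(parseOrZero(null)); // 0
    }
}
```

In the crawler, `Integer.parseInt(confirmedNum)` and its siblings would become `parseOrZero(confirmedNum)`.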
Actual completion schedule
Estimated time: three hours
Date | Start time | End time | Interruption (min) | Net time (min) | Activity | Notes |
---|---|---|---|---|---|---|
3.10 | 16:00 | 16:50 | 20 | 30 | Coding preparation | |
3.10 | 16:50 | 17:50 | 0 | 60 | Finding the list | Finding the right list cost a lot of time |
3.10 | 17:50 | 18:30 | 10 | 30 | Writing code | Ten minutes in the middle for a bathroom and water break |
3.10 | 18:30 | 19:35 | 5 | 60 | Testing | Five minutes in the middle for a bathroom break |
3.10 | 19:35 | 19:45 | 0 | 10 | Wrap-up | |
Defect record
Date | No. | Type | Introduced in | Found in | Repair time | Fix | Description |
---|---|---|---|---|---|---|---|
3.10 | 1 | 1 | Coding | Coding | 50min | Searched more carefully | Could not find the correct list |
3.10 | 2 | 2 | Coding | Testing | 10min | Changed execute back to update | Wrong method used to write to the database: update had mistakenly been written as execute |
3.10 | 3 | 3 | Coding | Testing | 20min | Changed the SQL statement | The province data could not be displayed correctly in the database |
Summary
Crawling data with Jsoup gave me a deep sense of the hardest part of crawling: finding the right JSON and extracting it. I also decided to take this as an opportunity to try crawling with Python and compare; see the next blog post:
( https://www.cnblogs.com/wushenjiang/p/12466220.html )