Crawling Epidemic Data with Jsoup

Topic

The title says it all: crawl epidemic data from a website with Java and store it in a database. We can use Jsoup, a Java library, to do the crawling.
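The core anti-crawler trick used below is simply disguising the request with browser-like headers. As a point of comparison (not the author's code), the same idea can be expressed with the JDK's built-in HttpClient; the URL and header values here are placeholders. Note that HttpClient sets the Host header itself and rejects manual overrides, so only User-Agent and Referer are set:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative sketch: build (but do not send) a request carrying
// the "disguise" headers. Names and values are placeholders.
class RequestSketch {
    static HttpRequest build(String url, String userAgent, String referer) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent) // pretend to be a browser
                .header("Referer", referer)      // pretend we came from the page
                .GET()
                .build();
    }
}
```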

Approach

The data page is backed by a JSON interface: request it with browser-like headers, dig the epidemic figures out of the nested JSON (forum → extra → ncov_string_list, which itself holds a JSON string with a provinces array), then write the per-city and per-province rows to MySQL with DbUtils.

Result screenshot

A screenshot of the data stored in MySQL after the crawl:
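The table layout itself is never shown. From the eight-placeholder insert in the code below, a plausible (entirely guessed) DDL can be sketched; all column names here are my own invention, inferred from the parameter order, and the last column's meaning is unknown since it is always inserted as null:

```java
// Hypothetical DDL matching the eight insert placeholders.
// Column names are guesses from the parameter order in main().
class SchemaSketch {
    static final String CREATE_TABLE =
        "create table todaydata_copy1 (" +
        " update_date   varchar(32)," +  // updateDate from the JSON
        " province_id   varchar(2),"  +  // first two digits of the city id
        " city_name     varchar(64)," +  // null on province summary rows
        " confirmed_num int,"         +
        " deaths_num    int,"         +
        " treating_num  int,"         +
        " city_id       varchar(16)," +  // null on province summary rows
        " extra         varchar(64)"  +  // unknown; always inserted as null
        ")";
}
```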

Code

    import java.io.IOException;
    import java.sql.SQLException;
    import com.alibaba.fastjson.JSON;
    import com.alibaba.fastjson.JSONArray;
    import com.alibaba.fastjson.JSONObject;
    import org.apache.commons.dbutils.QueryRunner;
    import org.jsoup.Jsoup;

    public class EpidemicCrawler {
        // A few constants to disguise the request and get past anti-crawler checks
        public static String USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:49.0) Gecko/20100101 Firefox/49.0";
        public static String HOST = "i.snssdk.com";
        public static String REFERER = "https://i.snssdk.com/feoffline/hot_list/template/hot_list/forum_tab.html?activeWidget=1";

        public static void main(String[] args) throws IOException, SQLException {
            // Root URL of the data interface
            String url = "https://i.snssdk.com/forum/home/v1/info/?activeWidget=1&forum_id=1656784762444839";
            String resultBody = Jsoup.connect(url)
                    .userAgent(USER_AGENT)
                    .header("Host", HOST)
                    .header("Referer", REFERER)
                    .ignoreContentType(true) // the endpoint returns JSON, not HTML
                    .execute().body();
            // The epidemic data is buried in forum.extra.ncov_string_list as a JSON string
            JSONObject jsonObject = JSON.parseObject(resultBody);
            String ncovStringList = jsonObject.getJSONObject("forum").getJSONObject("extra").getString("ncov_string_list");
            JSONObject ncovListObj = JSON.parseObject(ncovStringList);
            JSONArray todaydata = ncovListObj.getJSONArray("provinces");
            // DataSourceUtils is the author's own helper returning a configured DataSource
            QueryRunner queryRunner = new QueryRunner(DataSourceUtils.getDataSource());
            String sql = "insert into todaydata_copy1 values(?,?,?,?,?,?,?,?)";
            String confirmedNum, deathsNum, cityname, cityid, treatingNum, provinceid;
            String reprovinceid = null;
            int confirmedNumSum = 0, deathsNumSum = 0, treatingNumSum = 0;
            for (int i = 0; i < todaydata.size(); i++) {
                JSONObject todayData1 = todaydata.getJSONObject(i);
                String updateDate = todayData1.getString("updateDate");
                JSONArray city = todayData1.getJSONArray("cities");
                for (int j = 0; j < city.size(); j++) {
                    JSONObject cities = city.getJSONObject(j);
                    confirmedNum = cities.getString("confirmedNum");
                    deathsNum = cities.getString("deathsNum");
                    cityname = cities.getString("name");
                    cityid = cities.getString("id");
                    treatingNum = cities.getString("treatingNum");
                    // The first two digits of a city id identify its province
                    provinceid = cityid.substring(0, 2);
                    reprovinceid = provinceid;
                    // Accumulate province-level totals from the city rows
                    confirmedNumSum += Integer.parseInt(confirmedNum);
                    deathsNumSum += Integer.parseInt(deathsNum);
                    treatingNumSum += Integer.parseInt(treatingNum);
                    queryRunner.update(sql, updateDate, provinceid, cityname, confirmedNum, deathsNum, treatingNum, cityid, null);
                }
                // Insert one summary row per province (city name and id left null)
                queryRunner.update(sql, updateDate, reprovinceid, null, confirmedNumSum, deathsNumSum, treatingNumSum, null, null);
                // Reset the accumulators for the next province
                confirmedNumSum = 0;
                deathsNumSum = 0;
                treatingNumSum = 0;
            }
        }
    }
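The per-province rollup in the loop above (deriving the province id from the first two digits of the city id and summing the city counts) can be isolated as a small helper. This is a sketch for illustration only; the class and method names are my own, not the author's:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper mirroring the rollup logic in main():
// a province id is the first two digits of a city id, and a
// province's totals are the sums over its cities.
class ProvinceRollup {

    // First two characters of the city id identify the province.
    static String provinceIdOf(String cityId) {
        return cityId.substring(0, 2);
    }

    // Sum per-city confirmed counts into per-province totals,
    // preserving encounter order like the crawl loop does.
    static Map<String, Integer> sumByProvince(String[] cityIds, int[] confirmed) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (int i = 0; i < cityIds.length; i++) {
            totals.merge(provinceIdOf(cityIds[i]), confirmed[i], Integer::sum);
        }
        return totals;
    }
}
```

Keeping this logic in its own method would also have made defect #3 below (province rows not displaying correctly) easier to test in isolation.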

Actual completion schedule

Estimated time: three hours

| Date | Start | End | Downtime (min) | Net time (min) | Activity | Remark |
|------|-------|-----|----------------|----------------|----------|--------|
| 3.10 | 16:00 | 16:50 | 20 | 30 | Preparation | Getting ready to code |
| 3.10 | 16:50 | 17:50 |    | 60 | Finding the list | Locating the right list wasted a lot of time |
| 3.10 | 17:50 | 18:30 | 10 | 30 | Writing code | Ten-minute break in the middle |
| 3.10 | 18:30 | 19:35 | 5  | 60 | Testing | Five-minute break in the middle |
| 3.10 | 19:35 | 19:45 |    | 10 | Wrapping up | |

Defect record form

| Date | No. | Type | Introduced in | Found in | Time to fix | Fix | Description |
|------|-----|------|---------------|----------|-------------|-----|-------------|
| 3.10 | 1 | 1 | Coding | Coding | 50 min | Kept searching | Could not find the correct list |
| 3.10 | 2 | 2 | Coding | Testing | 10 min | | Wrong method used for the database write: update was mistakenly written as execute |
| 3.10 | 3 | 3 | Coding | Testing | 20 min | Changed the SQL statement | Province data could not be displayed correctly in the database |

Summary

Using Jsoup to crawl data gave me a deep sense of the hardest part of crawling: finding the right JSON and extracting it. I also decided to take this as an opportunity to try crawling with Python; see the next blog:
( https://www.cnblogs.com/wushenjiang/p/12466220.html )

Origin www.cnblogs.com/wushenjiang/p/12466025.html