Web Crawler Summary

Preface

Most sites have no anti-crawling measures, so you can crawl them directly to get the data.
Some websites do have anti-crawling measures, and the common ones turn up again and again.
For sites that are harder to crawl, weigh the actual need against the budget to decide whether crawling is worth it.
Anti-crawling measures include, but are not limited to, IP restrictions, JavaScript encryption, and CAPTCHAs.

What you see is not what you get

Most sites are WYSIWYG, but some websites render their page content with JavaScript, so the data you fetch directly is not the same as what you see in the browser. These cases fall roughly into two kinds: synchronous loading and asynchronous loading.

Synchronous loading

http://xxgk.dingnan.gov.cn/bmgkxx/hbj/gzdt/gggs/index.htm

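The requirement is to obtain the data in the table, whose columns are: information index number (信息索取号), category (类别), information title (信息名称), date created (生成日期), disclosure method (公开方式), and disclosure time limit (公开时限).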

Open the page in a browser and inspect it with F12.

The inspector shows an a tag, but when you view the page source, that a tag is nowhere to be found.

The data fetched by the crawler does not contain the a tag either. In this case you need to parse the text of the script tag and split it on commas (,) to get the information.
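The script-parsing step might look like the following minimal sketch. The sample HTML, the `record` variable name, the field values, and the `FIELDS` keys are all hypothetical; the real page may embed its data in a different shape, so the regular expressions would need adjusting after reading the actual source.

```python
import re

# Hypothetical page: the table is empty in the source and a script fills it in.
SAMPLE_HTML = """
<html><body>
<table id="list"></table>
<script type="text/javascript">
var record = "IDX-001,Notice,Sample announcement,2019-09-18,Proactive,Long-term";
addRow(record);
</script>
</body></html>
"""

# Assumed English keys for the six table columns.
FIELDS = ["index_number", "category", "title",
          "date_created", "disclosure_method", "disclosure_period"]


def parse_script_rows(html):
    """Pull comma-separated records out of inline script text."""
    rows = []
    # Grab the text of every script tag.
    for script in re.findall(r"<script[^>]*>(.*?)</script>", html, flags=re.S | re.I):
        # Find the quoted payload assigned to the (hypothetical) `record` variable.
        for payload in re.findall(r'var\s+record\s*=\s*"([^"]*)"', script):
            values = [v.strip() for v in payload.split(",")]
            if len(values) == len(FIELDS):
                rows.append(dict(zip(FIELDS, values)))
    return rows


rows = parse_script_rows(SAMPLE_HTML)
```

Splitting on commas only works when no field contains a comma itself; if titles can contain commas, the delimiter logic needs to be stricter.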


This is only one example; other cases are similar. Older sites are the most likely to behave this way.

When an element cannot be located, look at the source code of the page.
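As a rough diagnostic, you can check where a value you see in the browser actually lives in the raw source. The helper below is a hypothetical sketch (the function name and return labels are made up for illustration), not part of any library.

```python
import re


def locate_text(raw_html, needle):
    """Classify where `needle` appears in the raw, unrendered page source."""
    if needle not in raw_html:
        # Not in the source at all: probably fetched later by a separate request.
        return "absent"
    # Remove script blocks and check whether the text survives in plain markup.
    without_scripts = re.sub(r"<script\b[^>]*>.*?</script>", "", raw_html,
                             flags=re.S | re.I)
    if needle in without_scripts:
        return "markup"
    # Present only inside a script tag: rendered by JavaScript at load time.
    return "script"
```

A result of "script" points to the synchronous case described above, while "absent" suggests asynchronous loading.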

Asynchronous loading

http://hbj.jxfz.gov.cn/col/col4374/index.html?uid=14940&pageNum=1

Asynchronous loading also comes in get and post variants; relatively speaking, get is generally the simpler one.
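The difference between the two variants can be sketched with the standard library alone. The `uid` and `pageNum` parameters come from the URL above; the post endpoint `http://example.com/dataproxy` and its `page`/`size` fields are placeholders, since the real ones have to be read from the browser's network panel. Nothing is sent here; the requests are only constructed.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_get(url, params):
    """get: the parameters travel in the query string."""
    return Request(url + "?" + urlencode(params))


def build_post(url, form):
    """post: the parameters travel in a form-encoded request body."""
    body = urlencode(form).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers)


get_req = build_get("http://hbj.jxfz.gov.cn/col/col4374/index.html",
                    {"uid": 14940, "pageNum": 2})
# Placeholder endpoint and field names, for illustration only.
post_req = build_post("http://example.com/dataproxy", {"page": 2, "size": 20})
```

With get, paging usually means changing one query parameter; with post, the body fields (and sometimes tokens taken from the page) have to be reproduced as well.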

On this page the data returned by the post request also needs some simple processing. For background on CDATA, see:
https://www.w3school.com.cn/xml/xml_cdata.asp

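The returned data is wrapped in CDATA markers, so the parser does not treat it as markup. Remove the markers first, then extract the information.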

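        # Assumes `from scrapy.selector import Selector` at module level.
        # Strip the CDATA markers so the wrapped markup is parsed as elements.
        res = response.text.replace('<![CDATA[', '').replace(']]>', '')
        data_xml = Selector(text=res)
        datas = data_xml.xpath("//record")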
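The same idea can be tried without Scrapy. The sketch below uses only the standard library on a hypothetical response body; the element names other than `record`, and the contents of each record, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical response: each record wraps an HTML fragment in CDATA.
SAMPLE = (
    "<datastore><recordset>"
    '<record><![CDATA[<li><a href="/art/1.html">Notice A</a>'
    "<span>2019-09-01</span></li>]]></record>"
    '<record><![CDATA[<li><a href="/art/2.html">Notice B</a>'
    "<span>2019-09-02</span></li>]]></record>"
    "</recordset></datastore>"
)


def extract_records(body):
    """Strip CDATA markers, then read each record as ordinary elements."""
    cleaned = body.replace("<![CDATA[", "").replace("]]>", "")
    root = ET.fromstring(cleaned)
    rows = []
    for record in root.iter("record"):
        link = record.find(".//a")
        date = record.find(".//span")
        if link is None:
            continue  # skip records that do not contain a link
        rows.append({
            "title": link.text,
            "href": link.get("href"),
            "date": date.text if date is not None else None,
        })
    return rows


records = extract_records(SAMPLE)
```

`ElementTree` only accepts well-formed XML, so a fragment with unclosed tags would raise a parse error; a lenient HTML parser such as the `Selector(text=...)` call shown above is more forgiving on real pages.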

When the page carries values similar to csrf_token, first collect the values of those hidden input fields, then make the post request.

Reference URL: http://www.pxedz.gov.cn/info/index.aspx?t=34

The browser may freeze while you debug this site.

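        # Collect every hidden input of the form (the csrf_token-like values)
        # so they can be sent back with the post request.
        inputs = response.xpath('//form[@id="aspnetForm"]//input[@type="hidden"]')
        for input_ in inputs:
            name = input_.xpath('./@name').get()
            value = input_.xpath('./@value').get()
            if name:  # skip inputs that have no name attribute
                self.datas[name] = value or ''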
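For experimenting outside a spider, the collection step can be sketched with the standard library. The sample form and its values are hypothetical, and using `__EVENTTARGET` / `__EVENTARGUMENT` for paging is an assumption about how this kind of ASP.NET page works; confirm the actual fields in the browser's network panel.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of hidden inputs inside one form."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self.in_form = True
        elif tag == "input" and self.in_form and attrs.get("type") == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


# Hypothetical form with made-up token values.
SAMPLE_HTML = """
<form id="aspnetForm" method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="keyword" value="" />
</form>
"""


def build_form_data(html, extra, form_id="aspnetForm"):
    """Merge the page's hidden fields with the fields the request itself needs."""
    collector = HiddenInputCollector(form_id)
    collector.feed(html)
    data = dict(collector.fields)
    data.update(extra)
    return data


form_data = build_form_data(
    SAMPLE_HTML, {"__EVENTTARGET": "pager", "__EVENTARGUMENT": "2"})
encoded_body = urlencode(form_data)  # form-encoded body for the post request
```

The hidden values usually change with every response, so they should be collected again from each page before the next post request is sent.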


Origin www.cnblogs.com/gaoyongjian/p/11571897.html