2020 Winter Life Learning Diary (XV)

Later on, a lot of problems came up while crawling the contents of the Beijing letters with Java.

So I crawled them with Python instead.

 

What I crawled out earlier is the URL suffix (the originalId) of each letter, so a full letter URL looks like this: http://www.beijing.gov.cn/hudong/hdjl/com.web.suggest.suggesDetail.flow?originalId=AH20021200370

Then I wrote the code:

import requests
import re
import xlwt
# #https://flightaware.com/live/flight/CCA101/history/80
url = 'http://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow?originalId=AH20021300174'
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
}
def get_page(url):
    # Fetch a page and return its HTML text; return None on failure.
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # print('Fetched the page successfully')
            return response.text
        else:
            print('Failed to fetch the page')
    except Exception as e:
        print(e)
fopen = open('C:\\Users\\hp\\Desktop\\list.txt', 'r')  # list.txt holds the originalId suffix of each letter
lines = fopen.readlines()
urls = ['http://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow?originalId={}'.format(line.strip()) for line in lines]
for url in urls:
    print(url)
    page = get_page(url)
    items = re.findall('', page, re.S)  # the regex pattern still has to be filled in
    print(items)
    print(len(items))
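The re.findall pattern in the loop is still empty. For illustration only, here is the general shape the extraction could take once the pattern is written; the tag and class names below are placeholders, not the real markup of the Beijing letters page:

import re

# Placeholder HTML for illustration; the real letter page uses different markup.
sample_page = '''
<div class="letter">
  <strong class="title">Sample letter title</strong>
  <div class="content">Sample letter body text.</div>
</div>
'''

# re.S lets '.' match newlines, so one pattern can span several lines of HTML,
# and the non-greedy '.*?' stops at the first closing tag instead of the last one.
pattern = r'<strong class="title">(.*?)</strong>.*?<div class="content">(.*?)</div>'
items = re.findall(pattern, sample_page, re.S)
print(items)       # [('Sample letter title', 'Sample letter body text.')]
print(len(items))  # 1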

But there are still some problems when extracting the content with regular expressions; I will edit this post again once those problems are solved.
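In the meantime, xlwt is imported at the top but not used yet, so presumably the extracted fields will be written to an Excel file. A minimal sketch of that step, assuming each extracted item is a (title, content) pair (an assumption, since the real regex groups are not shown above):

import xlwt

# rows stands in for whatever re.findall eventually returns; the (title, content)
# shape is an assumption, not the actual structure of the crawled data.
rows = [('Sample title', 'Sample content')]

workbook = xlwt.Workbook(encoding='utf-8')
sheet = workbook.add_sheet('letters')
sheet.write(0, 0, 'title')    # header row
sheet.write(0, 1, 'content')
for i, (title, content) in enumerate(rows, start=1):
    sheet.write(i, 0, title)
    sheet.write(i, 1, content)
workbook.save('letters.xls')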
