A lot of problems came up later in the process of crawling the contents of the Beijing government letters (this is for the Java course project). I do the crawling with Python.
Each letter I crawled earlier has a URL suffix (the originalId parameter), so the detail page of a letter looks like this: http://www.beijing.gov.cn/hudong/hdjl/com.web.suggest.suggesDetail.flow?originalId=AH20021200370
Then I wrote this code:
import requests
import re
import xlwt  # imported for a later Excel export; not used yet

# https://flightaware.com/live/flight/CCA101/history/80
url = 'http://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow?originalId=AH20021300174'
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
}

def get_page(url):
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # print('page fetched successfully')
            return response.text
        else:
            print('failed to fetch page')
    except Exception as e:
        print(e)

# list.txt holds the URL suffix (originalId) of each letter
fopen = open('C:\\Users\\hp\\Desktop\\list.txt', 'r')
lines = fopen.readlines()
# strip the trailing newline from each line, otherwise it ends up inside the URL
urls = ['http://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow?originalId={}'.format(line.strip()) for line in lines]
for url in urls:
    print(url)
    page = get_page(url)
    # the pattern below was lost when this post was published; see the sketch after the code
    items = re.findall('', page, re.S)
    print(items)
    print(len(items))
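The regular-expression pattern got eaten when this post was published (it shows up as an empty string above). As a minimal, self-contained sketch of how the extraction is meant to work, here is a non-greedy pattern with re.S run against a toy page; the tag and the class name letter-content are made up for the sketch, so check the actual page source and adjust the pattern:

import re

# A toy page standing in for the real letter page; the class name
# "letter-content" is hypothetical.
sample_page = '''
<div class="letter-title">Sample title</div>
<div class="letter-content">
First line of the letter body.
Second line of the letter body.
</div>
'''

# re.S lets '.' match newlines; '.*?' is non-greedy, so the match stops
# at the first closing </div> instead of running on to the last one.
items = re.findall(r'<div class="letter-content">(.*?)</div>', sample_page, re.S)
print(items[0].strip())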
But there are still some problems when extracting the content with the regular-expression approach. I will edit this post again once the problem is solved.
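If the regex keeps misbehaving, an HTML parser is usually more forgiving about whitespace and attribute order than a hand-written pattern. Here is a sketch using BeautifulSoup (pip install beautifulsoup4); the class name letter-content is again a placeholder, not the real class on the detail page:

import requests
from bs4 import BeautifulSoup

url = ('http://www.beijing.gov.cn/hudong/hdjl/'
       'com.web.consult.consultDetail.flow?originalId=AH20021300174')
headers = {"user-agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
response.encoding = 'utf-8'  # adjust if the text comes back garbled
soup = BeautifulSoup(response.text, 'html.parser')

# 'letter-content' is a placeholder; replace it with the class actually
# used on the page (inspect the page source in the browser).
node = soup.find('div', class_='letter-content')
if node is not None:
    print(node.get_text(strip=True))
else:
    print('content div not found; check the class name against the page source')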