1. The basic crawler workflow:
Fetch the target URL with the get method of the requests library
Open the page source in a browser and analyze the element nodes
Extract the desired data with BeautifulSoup or regular expressions
Store the data on the local disk or in a database
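As a quick preview before the walkthrough, the whole pipeline fits in a few lines. The target URL ('http://example.com') and output file name ('links.txt') here are placeholders, and step 2 (inspecting the source in a browser) is a manual step with no code:
import requests
from bs4 import BeautifulSoup

resp = requests.get('http://example.com')        # step 1: fetch the page
resp.encoding = resp.apparent_encoding           # guard against encoding problems
soup = BeautifulSoup(resp.text, 'html.parser')   # step 3: parse the HTML
links = [a.get('href') for a in soup.find_all('a', href=True)]  # extract all links
with open('links.txt', 'w') as f:                # step 4: store locally
    f.write('\n'.join(links))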
2. Getting started
import requests
from bs4 import BeautifulSoup

url = 'http://www.jianshu.com'
page = requests.get(url)  # returns status code 403, which signals a problem (200 means success; anything else means something went wrong)
# Checking the site's robots protocol shows that crawlers are indeed restricted: the default requests User-Agent is rejected. The solution is to send a browser-style User-Agent header:
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
# Fetch the HTML page again, this time with the headers
page = requests.get(url, headers=headers)
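With a browser-style User-Agent the request should now succeed. requests also provides raise_for_status(), which raises an HTTPError for any 4xx/5xx response, so a quick sanity check might look like:
page.raise_for_status()   # raises requests.exceptions.HTTPError on a 4xx/5xx response
print(page.status_code)   # should now print 200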
# Remember: encoding problems can occur, so set the encoding before reading .text
page.encoding = page.apparent_encoding
demo = page.text
#Convert the fetched content into BeautifulSoup format, using html.parser as the parser (make a pot of soup)
soup = BeautifulSoup(demo, 'html.parser')
# Print the HTML in a formatted (indented) form
print(soup.prettify()) #Useful for analyzing element nodes
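Besides prettify(), it often helps to probe individual nodes while exploring the tree; these are standard BeautifulSoup accessors:
print(soup.title.string)   # text of the <title> tag
print(soup.find('a'))      # the first <a> tag in the document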
#Find all <a> tags whose class is 'title'
titles = soup.find_all('a', 'title')
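The second positional argument of find_all() is shorthand for the CSS class, so the same query can also be written with the explicit class_ keyword or as a CSS selector; both forms below are equivalent to the line above:
titles = soup.find_all('a', class_='title')   # explicit keyword-argument form
titles = soup.select('a.title')               # equivalent CSS selector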
#Print the string and article link of each tag found
for title in titles:
    print(title.string)  # print the tag's text
    print("http://www.jianshu.com" + title.get('href'))  # a tag's get method returns an attribute value; dir(title) lists the available methods
#Write the fetched content to the local disk
with open('aa.txt', 'w') as f:
    for title in titles:
        f.write(title.string + '\n')
        f.write('http://www.jianshu.com' + title.get('href') + '\n\n')
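One caveat: title.string is None whenever a tag contains nested children, which would make the f.write above raise a TypeError. A slightly more defensive sketch uses get_text(), which always returns a string (the encoding='utf-8' argument is an extra precaution, not in the original):
with open('aa.txt', 'w', encoding='utf-8') as f:
    for title in titles:
        f.write(title.get_text(strip=True) + '\n')   # get_text() never returns None
        f.write('http://www.jianshu.com' + title.get('href', '') + '\n\n')  # '' as a fallback if href is missing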