The basic framework of a Python crawler

1. The basic crawler workflow (a minimal end-to-end sketch follows this list):

Fetch the target URL with the get method of the requests library

Open the page source in the browser and analyze the element nodes

Extract the desired data through BeautifulSoup or regular expressions

Store the data on the local disk or in a database
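Put together, the four steps (step 2 happens by hand in the browser) map to just a few lines of code. A minimal sketch, using http://example.com as a placeholder URL and printing instead of storing:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://example.com')       # step 1: fetch the page
page.encoding = page.apparent_encoding          # guard against encoding problems
soup = BeautifulSoup(page.text, 'html.parser')  # step 3: parse the HTML...
for tag in soup.find_all('a'):                  # ...and extract the data you want
    print(tag.get('href'))                      # step 4: store (here: just print) it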

2. Getting started

import requests
from bs4 import BeautifulSoup

url = 'http://www.jianshu.com'

page = requests.get(url)  # The returned status code is 403, which signals a problem (anything other than 200 indicates an issue)

# At this point, check the site's robots protocol; it does indeed restrict crawlers. The solution is to send a browser-like User-Agent header:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}

# Fetch the HTML page again, this time with the browser headers

page = requests.get(url, headers=headers)
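As an aside, the robots protocol mentioned above can be checked programmatically with the standard library's urllib.robotparser. A minimal sketch (robots.txt at the site root is the usual convention):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()                                # download and parse robots.txt
print(rp.can_fetch('Mozilla/5.0', '/'))  # may this User-Agent crawl the root path?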

# Remember: there may be encoding problems, so fix the encoding before reading the text

page.encoding = page.apparent_encoding

demo = page.text

# Parse the fetched content into a BeautifulSoup object, using html.parser as the parser (make a pot of soup)

soup = BeautifulSoup(demo, 'html.parser')

# Print the HTML in formatted form

print(soup.prettify())  # Useful for analyzing element nodes
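Besides prettify, you can drill into individual nodes of the soup directly; two illustrative lines:

first_link = soup.find('a')   # the first <a> tag in the document, or None
if first_link is not None:
    print(first_link.attrs)   # its attributes as a dict, e.g. href and class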

# Find all <a> tags with class='title'

titles = soup.find_all('a', 'title') 
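Note that a string passed as the second positional argument of find_all is treated as a CSS class filter, so the line above is equivalent to either of these forms:

titles = soup.find_all('a', class_='title')  # explicit keyword form
titles = soup.select('a.title')              # CSS-selector form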

#Print the string and article link of each tag found

for title in titles:

  print(title.string)  # Print the tag's string

  print("http://www.jianshu.com" + title.get('href'))  # Use title's get method to read the link; you can list the available methods with dir(title)
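One caveat: .string returns None when a tag contains nested children, and get('href') can also return None. A slightly more defensive version of the same loop, using get_text:

for title in titles:
    text = title.get_text(strip=True)  # works even when the tag has nested tags
    link = title.get('href')
    if link:                           # skip anchors without an href
        print(text, 'http://www.jianshu.com' + link)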

#Write the acquired content to the local disk

with open('aa.txt', 'w', encoding='utf-8') as f:

  for title in titles:

    f.write(title.string + '\n')

    f.write('http://www.jianshu.com' + title.get('href') + '\n\n')
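The workflow's last step also mentions a database. A minimal sketch using the standard library's sqlite3 (the file and table names here are assumptions):

import sqlite3

conn = sqlite3.connect('articles.db')  # assumed file name
conn.execute('CREATE TABLE IF NOT EXISTS articles (title TEXT, url TEXT)')
for title in titles:
    conn.execute('INSERT INTO articles VALUES (?, ?)',
                 (title.get_text(strip=True),
                  'http://www.jianshu.com' + title.get('href')))
conn.commit()
conn.close()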

 
