Python web crawler basics (2019 series): scraping Qiushibaike

**Because Qiushibaike changed its URLs, the regular expressions had to change as well, so a lot of the code found online no longer works. I wrote this post hoping it will help. Thank you!**

Without further ado, let's go straight to the code.

To make extracting the data easier, I use the requests and BeautifulSoup libraries.

![Use requests and bs4](https://img-blog.csdnimg.cn/20191017093920758.png)
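As a rough illustration of the BeautifulSoup lookups this post relies on, here is a minimal sketch that parses a small hand-written HTML string instead of a live page. The sample markup and the `extract_jokes` helper are made up for this example; only the tag names, `id` values, and class name mirror what the crawler looks for.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not copied from the real site).
SAMPLE_HTML = """
<div id="main">
  <div class="cat_llb">
    <h3>Alice</h3>
    <div id="endtext">First joke.</div>
  </div>
  <div class="cat_llb">
    <h3>Bob</h3>
    <div id="endtext">Second joke.</div>
  </div>
</div>
"""


def extract_jokes(html):
    """Return a list of (author, content) tuples found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    main = soup.find(id="main")
    if main is None:  # the page layout differs from what we expect
        return []
    jokes = []
    for block in main.find_all("div", class_="cat_llb"):
        title = block.find("h3")
        body = block.find("div", id="endtext")
        if title is None or body is None:
            continue  # skip incomplete entries instead of crashing
        jokes.append((title.get_text(strip=True), body.get_text(strip=True)))
    return jokes


jokes = extract_jokes(SAMPLE_HTML)
for author, content in jokes:
    print(author, "->", content)
```

Returning an empty list when the `main` container is missing is just one possible choice; it keeps a layout change on the site from raising an `AttributeError` in the middle of a crawl.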

## The code is as follows


```
import requests
from bs4 import BeautifulSoup


def download_page(url):
    # Send a browser-like User-Agent header with the request
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0"}
    r = requests.get(url, headers=headers)
    return r.text


def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    con = soup.find(id='main')
    con_list = con.find_all('div', class_="cat_llb")
    for i in con_list:
        author = i.find('h3').string  # get the name
        content = i.find('div', id="endtext").get_text()  # get the content
        save_txt(author, content)


def save_txt(*args):
    # Append each piece of text to qiubai.txt, followed by a blank line
    with open('qiubai.txt', 'a', encoding='utf-8') as f:
        for i in args:
            f.write(i + '\n' + '\n')


# def save_txt(str):
#     for i in str:
#         with open('qiubai.txt', 'a', encoding='utf-8') as f:
#             f.write(str + '\n')
#             f.write(i)


def main():
    # The page URLs can be built as follows
    for i in range(1, 20):
        url = 'http://www.lovehhy.net/Joke/Detail/QSBK/{}'.format(i)
        html = download_page(url)
        get_content(html)


if __name__ == '__main__':
    main()
```
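One detail worth checking in the loop above: `range(1, 20)` stops before 20, so it requests pages 1 through 19. The sketch below uses a hypothetical `build_page_urls` helper, which is not part of the original code, to make the page range explicit. The page numbering is taken from the URL pattern in this post and has not been verified against the site.

```python
BASE_URL = "http://www.lovehhy.net/Joke/Detail/QSBK/{}"


def build_page_urls(first_page, last_page):
    """Return the page URLs from first_page to last_page, both inclusive."""
    # range() excludes its stop value, hence the + 1
    return [BASE_URL.format(page) for page in range(first_page, last_page + 1)]


urls = build_page_urls(1, 19)  # the same pages that range(1, 20) visits
print(len(urls))
print(urls[0])
print(urls[-1])
```

To include page 20 as well, call `build_page_urls(1, 20)`.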

Oh, and by the way, the new site's address is http://www.lovehhy.net/Joke/Detail/QSBK/

If anything is unclear, feel free to leave a comment.


Origin: www.cnblogs.com/chx123/p/11692125.html