First, saving to CSV
The previous crawl retrieved the desired content but never wrote it to CSV; this time the results are saved into a CSV file. The code is as follows:
import requests
from bs4 import BeautifulSoup
import csv
import io
import sys

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

def get_url():  # build the index URLs for all letters A-Z
    urls = []
    for i in range(1, 27):
        letter = chr(i + 96)
        urls.append('http://www.thinkbabynames.com/start/0/%s' % letter)
    return urls

def get_text(url):  # crawl every name and its detail-page content
    headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/63.0.3239.132 Mobile Safari/537.36'}
    docx = requests.get(url, headers=headers)
    soup = BeautifulSoup(docx.content, 'html.parser')
    c_txt1 = soup.find('section', {'id': 'index'}).findAll('b')
    s = []
    for x in c_txt1:
        s = []
        if x.find('a'):
            name = x.find('a')['href'].split("/")[-1]  # split the href to get each name
            if name:
                # fetch the detail page for this name
                r = requests.get('http://www.thinkbabynames.com/meaning/0/%s' % name)
                bs = BeautifulSoup(r.text, 'html.parser')
                li = bs.find('div', class_='content').find('h1')
                Enname = li.text[8::1]   # slice syntax gets the detail-page name (s[x:y:z]: x start, y stop, z step)
                Gender = li.text[1:8:1]  # slice syntax gets the detail-page gender
                li1 = bs.find('section', id='meaning').find('p')
                Description = li1.text
                # store the name, gender and description in s, then save
                s.append(Enname)
                s.append(Gender)
                s.append(Description)
                save_text(s)
    return s

def save_text(s):  # append one row to the csv
    with open('text.csv', 'a', encoding='utf_8_sig', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(s)

if __name__ == '__main__':
    urls = get_url()
    for url in urls:
        get_text(url)
Once the name, gender, and introduction are collected in s, s is saved to the CSV.
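The append-mode CSV writing used by save_text can be sanity-checked with a short standalone snippet (the row values here are samples, not scraped data):

```python
import csv

# Sample row (hypothetical values): name, gender, description
row = ['Aaron', 'Boy', 'Exalted, strong.']

# Append-mode write, same options as save_text in the script above
with open('text.csv', 'a', encoding='utf_8_sig', newline='') as f:
    csv.writer(f).writerow(row)

# Read the file back to confirm the row survived the round trip
with open('text.csv', encoding='utf_8_sig', newline='') as f:
    read_back = list(csv.reader(f))[-1]
print(read_back)
```

The utf_8_sig encoding writes a BOM so the file opens correctly in Excel, and newline='' prevents blank lines between rows on Windows.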
Second, CSV file screenshot
Third, problems encountered and solutions
(1) The text content could not be retrieved when crawling all the names
Solution: choose an appropriate matching expression
docx = requests.get(url)
soup = BeautifulSoup(docx.content, 'html.parser')
c_txt1 = soup.find('section', {'id': 'index'}).findAll('b')
for x in c_txt1:
    s = []
    if x.find('a'):
        name = x.find('a')['href'].split("/")[-1]  # split the href to get each name
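The extraction in the last line is a plain string split on the link's href rather than a regular expression; a minimal illustration with a hypothetical href value:

```python
# Hypothetical href taken from an index-page link
href = '/meaning/0/aaron'

# The last path segment is the name itself
name = href.split('/')[-1]
print(name)  # aaron
```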
(2) When fetching the detail-page content, the name and gender are joined together.
Solution: use slice syntax to obtain the name and the gender separately.
li = bs.find('div', class_='content').find('h1')
Enname = li.text[8::1]   # slice syntax gets the detail-page name (s[x:y:z]: x start, y stop, z step)
Gender = li.text[1:8:1]  # slice syntax gets the detail-page gender
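The slice syntax s[x:y:z] (x start, y stop, z step) is easy to check on a plain string; the indices below are for illustration only and do not match the real page layout:

```python
s = 'abcdefghij'
middle = s[1:8:1]  # characters at indices 1 through 7
tail = s[8::1]     # from index 8 to the end of the string
print(middle)  # bcdefgh
print(tail)    # ij
```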
(3) When running on a laptop, the volume of requests is too large.
Solution: crawl in separate batches.
As shown in the figure above, change the numbers in the range() function so that only part of the sites is crawled in each run, reducing the request volume.
This still satisfies the crawling requirements without getting the crawler banned by the site.
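One way to make this batching explicit is to parameterize get_url instead of editing the range() numbers by hand each run (a sketch, not the original code):

```python
def get_url(start, end):
    # Build index URLs only for letters start..end-1 (1 = 'a', 26 = 'z')
    urls = []
    for i in range(start, end):
        urls.append('http://www.thinkbabynames.com/start/0/%s' % chr(i + 96))
    return urls

# Crawl only the first three letters in this run
batch = get_url(1, 4)
print(batch)
```

Running the script several times with different start/end values covers the whole alphabet while keeping each run's request count small.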