Python assignment (part 2)

First, saving to CSV

Last time the desired content was crawled but not saved anywhere; this time it is written into a CSV file. The code is as follows:

import requests
from bs4 import BeautifulSoup
import csv
import io
import sys
# re-wrap stdout so printed text uses the gb18030 encoding (Chinese Windows console)
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

def get_url():  # build the index URLs for the letters A-Z
    urls = []
    for i in range(1, 27):
        letter = chr(i + 96)  # 1..26 -> 'a'..'z'
        urls.append('http://www.thinkbabynames.com/start/0/%s' % letter)
    return urls

def get_text(url):  # crawl the desired content for every name and its detail link
    headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/63.0.3239.132 Mobile Safari/537.36'}
    docx = requests.get(url, headers=headers)
    soup = BeautifulSoup(docx.content, 'html.parser')
    c_txt1 = soup.find('section', {'id': 'index'}).findAll('b')
    for x in c_txt1:
        s = []
        if x.find('a'):
            name = x.find('a')['href'].split("/")[-1]  # get each name from its link's href
            # urls.append('http://www.thinkbabynames.com/meaning/0/%s' % name)  # link to each name's detail page
            if name:
                r = requests.get('http://www.thinkbabynames.com/meaning/0/%s' % name, headers=headers)
                result = r.text
                bs = BeautifulSoup(result, 'html.parser')
                li = bs.find('div', class_='content').find('h1')
                Enname = li.text[8::1]   # slice syntax gets the name on the detail page (s[x:y:z]: x start, y stop, z step)
                Gender = li.text[1:8:1]  # slice syntax gets the gender on the detail page
                li1 = bs.find('section', id='meaning').find('p')
                Description = li1.text
                # save the name, gender, and introduction into s
                s.append(Enname)
                s.append(Gender)
                s.append(Description)
        save_text(s)
    return s

def save_text(s):  # append one row to the CSV file
    with open('text.csv', 'a', encoding='utf_8_sig', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(s)

if __name__ == '__main__':
    urls = get_url()
    for url in urls:
        get_text(url)

In this way the name, gender, and introduction are collected into the list s, and s is then appended to the CSV file.

Second, a screenshot of the CSV file

Third, problems encountered and solutions

(1) The text content could not be obtained when crawling all the names.

Solution: choose an appropriate pattern to extract each name from its link's href:

docx = requests.get(url)
soup = BeautifulSoup(docx.content, 'html.parser')
c_txt1 = soup.find('section', {'id': 'index'}).findAll('b')
for x in c_txt1:
    s = []
    if x.find('a'):
        name = x.find('a')['href'].split("/")[-1]  # get all the names from the links
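
The same extraction could also be written with an explicit regular expression. A minimal sketch, assuming a made-up href value purely for illustration:

import re

href = 'http://www.thinkbabynames.com/meaning/0/Emma'  # hypothetical href, for illustration only
match = re.search(r'/([^/]+)$', href)                  # capture the last path segment
if match:
    name = match.group(1)
    print(name)                                        # -> Emma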

(2) When fetching the content of a name's detail page, the name and gender come back joined together.

Solution: use slice syntax to pull the name and the gender out separately and store them apart:

li = bs.find('div', class_='content').find('h1')
Enname = li.text[8::1]   # slice syntax gets the name on the detail page (s[x:y:z]: x start, y stop, z step)
Gender = li.text[1:8:1]  # slice syntax gets the gender on the detail page
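
For reference, a tiny self-contained illustration of the s[x:y:z] slice syntax; the string below is invented and does not reflect the site's actual h1 text:

text = '0ABCDEFGEmma'   # invented example string
gender = text[1:8:1]    # indices 1..7 -> 'ABCDEFG'
name = text[8::1]       # index 8 to the end -> 'Emma'
print(gender, name)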

(3) When the crawler runs on a laptop, the volume of requests is very large.

Solution: crawl in batches.

Change the numbers in the range() call to crawl only a subset of the index pages, as in the sketch below, which reduces the number of requests.
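
A minimal sketch of this batching idea, reusing the get_url() function from the code above; the letter range chosen here is arbitrary:

def get_url():  # build index URLs for only part of the alphabet
    urls = []
    for i in range(1, 10):  # e.g. letters 'a'..'i' only; adjust the bounds for the next batch
        letter = chr(i + 96)
        urls.append('http://www.thinkbabynames.com/start/0/%s' % letter)
    return urls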

Crawling this way still meets the assignment's requirements and keeps the site from blocking the crawler.
