Embarrassments Encyclopedia crawling

It counted on a practice climb, so long learning to be regular, but also ignorant force for some time, but once found to write regular true to force ah!


tips:

  • zip () function:  for an object as a parameter the iteration may be the corresponding element of the object packed into a tuple and returns a list of these tuples. If the number of elements of each iterator inconsistency, the shortest length of the list and returns the same object, using the asterisk operator, tuples will extract the list. Returns a list of tuples.
  • zip Syntax: zip ([ Iterable , ...])
  • Parameters: iterabl - one or more iterator

Method in different zip 3 Python 2 and Python: Python 3.x in order to reduce the memory, zip () returns an object. To display the list, you need to manually list () conversion. If you need to understand the application Pyhton3, refer to A Python 3 ZIP () .

  • strip () method:   for removing head and tail of the specified character string (line feed spaces or default) or character sequence.
  • \ s:   match blank, space, tab key
  • == re.DOTALL re.S:   . the match, including all characters, including newline
'' ' 
@Modify Time @author    
----------------------- -------     
2019/9/8 23:50 laoalo     
' '' 

Import Re
 Import Requests 

head = {
     ' the User-- Agent ' : ' the Mozilla / 5.0 (the Windows NT 10.0; Win64; x64-) AppleWebKit / 537.36 (KHTML, like the Gecko) the Chrome / 64.0.3282.140 Safari / 537.36 Edge / 17.17134 ' 
} 


DEF Spider (URL):
     ' '' 
        crawl a page the user name and his embarrassing stories to tell 
    : return: a list element is a dictionary 
    '' ' 
    the Response = requests.get (url = url, headers = head).text 

    # crawling articles 
    articals = re.findall (r'<div\sclass="content">\s<span>(.*?)</span>', response, re.DOTALL)
    str_artical = []
    for i in articals:
        temp = re.sub('<.*>', "", i).strip()
        str_artical.append(temp)

    # 爬取用户名
    users = re.findall(r'<div class="author clearfix">.*?<a.*?>.*?</a>.*?<a.*?>\s<h2>(.*?)</a>', response, re.DOTALL)
    str_user = []
    for i in users:
        temp = re.sub('<.*>', "", i)
        temp = re.sub('amp;', "", temp).strip()
        str_user.append(temp)

    # 汇总
    funny=[]
    for i,j in zip(str_user,str_artical):
        temp = {
            "用户名":i,
            "糗事":j
        }
        funny.append(temp)

    return funny

if __name__ == '__main__':
    for I in Range (1,10 ):
         Print ( " * " * 50, " which is the first page embarrassments {}: \ n- " .format (I)) 
        URL = " https://www.qiushibaike.com/ Hot / Page / {} / " .format (I)
         Print (Spider (URL))
Code

 

Guess you like

Origin www.cnblogs.com/chrysanthemum/p/11489463.html