Getting Started with Python Web Crawlers: Advanced Project Practice Exercises

First:

1. Use urllib to fetch a page from Jingdong
2. Try to crawl the home page of Zhihu
3. Extract the dynamic JSON data from Lagou to obtain the job title, company name, benefits, and salary
4. Simulate a Douban login with requests.session and fetch the home page data in HTML format
5. Optional: try to capture a single short TikTok video
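
For exercise 1, a minimal sketch of fetching a page with urllib might look like the following. The User-Agent string, the helper names build_request and fetch_html, and the example address https://www.jd.com/ are assumptions made for illustration, not part of the original notes.

```python
import urllib.request

# Hypothetical browser-like header; many sites reject urllib's default User-Agent.
DEFAULT_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}


def build_request(url, headers=None):
    """Build a GET request that carries the default headers plus any extras."""
    merged = {**DEFAULT_HEADERS, **(headers or {})}
    return urllib.request.Request(url, headers=merged)


def fetch_html(url, encoding='utf-8', timeout=10):
    """Download the page and decode it; undecodable bytes are replaced."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode(encoding, errors='replace')
```

Calling fetch_html('https://www.jd.com/') would return the page source as a string, assuming the site answers a plain GET request without further checks.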

Second:

'''
URL example:
    https://www.baidu.com/word?input=Altman

    http: Hypertext Transfer Protocol, a protocol for publishing and receiving HTML pages.
    Default port number: 80
    URL: Uniform Resource Locator

    https: http + SSL (Secure Sockets Layer), default port number: 443

    Domain name: maps to the server's IP address; the port defaults to the protocol's port

    path => the path, followed by the query parameters

Request methods: GET, POST (submit data), HEAD (fetch only the headers), DELETE


Douban PyPI mirror (pip source): http://pypi.douban.com/simple/
GET request: the page number goes in the URL
POST request: the page number goes in the data parameter

Free proxy: https://ip.ihuan.me/

Assignment 1: use requests to fetch a Baidu Tieba page and save it locally

Assignment 2: get Python job information from Lagou: job title, salary, company name

'''
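
The notes above on URL parts and on GET versus POST paging can be sketched with the standard library alone. The helper names, the example.com address, and the query parameter name page are hypothetical; a real site may name its paging parameter differently.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Default ports from the notes: http -> 80, https -> 443.
DEFAULT_PORTS = {'http': 80, 'https': 443}


def describe_url(url):
    """Split a URL into protocol, domain name, port, path, and query parameters."""
    parts = urlsplit(url)
    return {
        'scheme': parts.scheme,
        'host': parts.hostname,
        # Fall back to the protocol's default port when none is written in the URL.
        'port': parts.port or DEFAULT_PORTS.get(parts.scheme),
        'path': parts.path,
        'params': parse_qs(parts.query),
    }


def get_page_request(base_url, page):
    """GET paging: the page number is part of the URL's query string."""
    return Request(base_url + '?' + urlencode({'page': page}))


def post_page_request(base_url, page):
    """POST paging: the URL stays the same and the page number goes in the body."""
    return Request(base_url, data=urlencode({'page': page}).encode('utf-8'))
```

Nothing is sent over the network here; the two request builders only show where the page number ends up in each case.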

Third:

Download the pictures from https://www.1000tuku.com/tupiangushi/ and save them locally
    Remarks: store the pictures under three levels of folders: 1. images folder 2. picture story folder 3. folder named after the picture series title 4. the picture itself
    Use xpath for the extraction
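
The three-level folder layout in the remarks can be sketched with pathlib. The function names, the folder name picture_story, and the rule for replacing characters that are not allowed in file names are assumptions for illustration.

```python
import re
from pathlib import Path


def picture_path(root, category, series_title, image_url):
    """Return root/category/series_title/filename for one picture."""
    # Replace characters that are not allowed in Windows folder names.
    safe_title = re.sub(r'[\\/:*?"<>|]', '_', series_title).strip()
    # Use the last segment of the image URL as the file name.
    filename = image_url.rsplit('/', 1)[-1]
    return Path(root) / category / safe_title / filename


def save_picture(path, data):
    """Create the missing folders, then write the image bytes."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
```

With the bytes of a downloaded image in hand, save_picture(picture_path('images', 'picture_story', title, img_url), data) would place each picture in the folder of its series.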


    /html/body/div[4]/ul/li[1]/a/img # Absolute path
    A relative path on its own matched a lot of data we don't want

    When a relative path extracts unwanted data -> add a parent node to narrow the match
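
A small illustration of the parent-node idea follows. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, instead of lxml, and the HTML snippet with its class names is made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page: only the pictures inside div.list are wanted.
PAGE = """
<html><body>
  <div class="header"><a href="/"><img src="/static/logo.png"/></a></div>
  <div class="list">
    <ul>
      <li><a href="/a.html"><img src="/pics/a.jpg"/></a></li>
      <li><a href="/b.html"><img src="/pics/b.jpg"/></a></li>
    </ul>
  </div>
  <div class="footer"><img src="/static/qr.png"/></div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Positional path from the root, in the style of the absolute path in the notes.
first_picture = root.find('./body/div[2]/ul/li[1]/a/img').get('src')

# Bare relative path: matches every img, including the logo and the QR code.
too_broad = [img.get('src') for img in root.findall('.//img')]

# Relative path anchored on a parent node: only the pictures in the list.
anchored = [img.get('src') for img in root.findall(".//div[@class='list']/ul/li/a/img")]
```

The positional path breaks as soon as the page layout shifts by one div, while the anchored relative path keeps working as long as the parent node keeps its class.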

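# Page N of a listing is named <name>_N.html: drop the trailing '.html' and append '_N.html'
urls = url[:-5] + '_' + str(page) + '.html'
# Decode the raw bytes as GBK, the encoding this snippet assumes for the site
response = requests.get(urls, headers=headers).content.decode('gbk')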
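
The page URL rule above can be wrapped in a small helper. The function name page_url, the example.com address, and the assumption that page 1 is the unsuffixed URL are illustrative and not confirmed by the notes.

```python
def page_url(first_page_url, page):
    """Return the URL of listing page `page`, given the URL of page 1."""
    if not first_page_url.endswith('.html'):
        raise ValueError('expected a URL ending in .html')
    if page <= 1:
        # Assumption: the first page has no _N suffix.
        return first_page_url
    # Same rule as url[:-5] + '_' + str(page) + '.html'
    return first_page_url[:-len('.html')] + '_' + str(page) + '.html'


# Build the URLs for the first three pages of a hypothetical listing.
pages = [page_url('https://example.com/list/index.html', n) for n in range(1, 4)]
```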

Origin blog.csdn.net/weixin_45293202/article/details/112523509