First:
1. Use urllib to fetch a page from Jingdong
2. Try to crawl the home page of Zhihu
3. Extract the JSON dynamic data from Lagou to obtain the job name, company name, benefits, and salary
4. Simulate a Douban login with requests.session and get the homepage data in HTML format
5. Optional: try to capture a single short TikTok video
Second:
'''
Domain name:
https://www.baidu.com/word?input=Altman
http: Hypertext Transfer Protocol is a method of publishing and receiving HTML pages.
Default port number: 80
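URL: Uniform Resource Locator
https: http + SSL (Secure Sockets Layer); default port number: 443
Domain name: maps to the server IP; the port identifies the service
path => the resource path and the parameters of the path
Request methods: GET, POST (data submission), HEAD (only get the headers), DELETE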
Douban source (pip mirror): http://pypi.douban.com/simple/
GET request: paging is done in the URL
POST request: paging is done in the data parameter
Free proxy: https://ip.ihuan.me/
Assignment 1: Use requests to get a Baidu Tieba page and save it locally
Assignment 2: Get Lagou Python job information: job name, salary, company name
'''
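The URL parts and the GET/POST paging note above can be sketched roughly as follows. This is only an illustration: the parameter name `page`, the helper names, and the `https://example.com/list` address are assumptions, not taken from any real site in these notes.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# Split the example URL from the notes into its parts.
url = "https://www.baidu.com/word?input=Altman"
parts = urlsplit(url)

# The URL carries no explicit port, so fall back to the protocol default.
default_ports = {"http": 80, "https": 443}
port = parts.port or default_ports[parts.scheme]

host = parts.hostname            # domain name
path = parts.path                # path
params = parse_qs(parts.query)   # parameters of the path


def get_page_url(base, page):
    """GET paging: the page number travels in the URL itself.
    'page' is a hypothetical parameter name."""
    return base + "?" + urlencode({"page": page})


def post_page_data(page):
    """POST paging: the URL stays the same and the page number
    goes into the data parameter (hypothetical field name)."""
    return {"page": page}
```

With requests, the first helper would feed `requests.get(get_page_url(base, n))` and the second `requests.post(base, data=post_page_data(n))`.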
Third:
Download the pictures from https://www.1000tuku.com/tupiangushi/ and save them locally
Remarks: Store the pictures in a three-level folder structure: 1. images folder 2. picture story folder 3. folder named after the title of the picture series 4. the picture files
Use XPath:
/html/body/div[4]/ul/li[1]/a/img # Absolute path
Relative path extraction can return a lot of data we don't want
When a relative path extracts unwanted data -> add a parent node to narrow the match
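A small offline sketch of the parent-node idea. It uses the standard library's xml.etree.ElementTree on a made-up, well-formed snippet so it runs without network access; the markup and class names are assumptions, and on real pages a common choice is lxml's `etree.HTML(...).xpath(...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment: one unwanted banner image and two wanted list images.
html = """
<html><body>
  <div class="banner"><a href="/ad"><img src="ad.jpg"/></a></div>
  <ul class="list">
    <li><a href="/s/1.html"><img src="1.jpg"/></a></li>
    <li><a href="/s/2.html"><img src="2.jpg"/></a></li>
  </ul>
</body></html>
""".strip()

root = ET.fromstring(html)

# Bare relative path: matches every <img>, including the banner we don't want.
all_imgs = [img.get("src") for img in root.findall(".//img")]

# Relative path with a parent node added: only images inside the list.
wanted = [img.get("src") for img in root.findall(".//ul[@class='list']/li/a/img")]
```

Here `all_imgs` contains the banner image as well, while `wanted` holds only the two list images.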
urls = url[:-5] + '_' + str(page) + '.html'
response = requests.get(urls, headers=headers).content.decode('gbk')
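The paging expression and the folder layout from the remarks might be wrapped up like this. It is a sketch only: the helper names, the example.com address, and the folder names `images` and `picture story` are illustrative assumptions, and nothing is downloaded.

```python
import os


def page_url(url, page):
    """Build the URL of a later page of a series by inserting
    '_<page>' before the trailing '.html' (5 characters)."""
    return url[:-5] + '_' + str(page) + '.html'


def picture_path(series_title, filename,
                 root='images', category='picture story'):
    """Return the save path: images / picture story / series title / picture."""
    return os.path.join(root, category, series_title, filename)


# Hypothetical series URL used only to show the string manipulation.
first = 'https://example.com/tupiangushi/1.html'
second = page_url(first, 2)

target = picture_path('some series', '1.jpg')
# The folders would be created before saving, e.g.:
# os.makedirs(os.path.dirname(target), exist_ok=True)
```

The `.content.decode('gbk')` step in the notes is still needed when the site serves GBK-encoded pages, since decoding with the wrong charset garbles the Chinese titles.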