Usage of the requests library:
requests is a simple-to-use Python HTTP library.
Because it is a third-party library, it has to be installed from the command line before use:
pip install requests
After the installation is complete, import it and it is ready to use.
Basic usage:
import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.baidu.com')
print(response.status_code)  # print the status code
print(response.url)          # print the request URL
print(response.headers)      # print the header information
print(response.cookies)      # print the cookie information
print(response.text)         # print the page source as text
print(response.content)      # print the raw byte stream
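Since the snippet above goes out to the network, a quick offline way to see how requests assembles a request is to prepare one without sending it. This is just a sketch; the query string is illustrative:

```python
import requests

# Build a request object and prepare it without sending anything over
# the network; useful for inspecting the final URL that params produce.
req = requests.Request('GET', 'http://www.baidu.com/s', params={'wd': 'python'})
prepared = req.prepare()
print(prepared.url)     # http://www.baidu.com/s?wd=python
print(prepared.method)  # GET
```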
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import requests
from bs4 import BeautifulSoup
import pymongo
import json

db = pymongo.MongoClient().iaaf
def spider_iaaf():
    # URL: switched from the 100 metres to the long jump
    # url = 'https://www.iaaf.org/records/toplists/sprints/100-metres/outdoor/men/senior/2018?page={}'
    url = 'https://www.iaaf.org/records/toplists/jumps/long-jump/outdoor/men/senior/2018?regionType=world&windReading=regular&page={}&bestResultsOnly=true'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15',
    }
    for i in range(1, 23):
        res = requests.get(url.format(i), headers=headers)
        html = res.text
        print(i)
        soup = BeautifulSoup(html, 'html.parser')
        # tbody_l = soup.find_all('tbody')
        record_table = soup.find_all('table', class_='records-table')
        list_re = record_table[2]
        tr_l = list_re.find_all('tr')
        for tr in tr_l:  # one iteration per <tr>, i.e. per row
            td_l = tr.find_all('td')  # list of <td> cells; the one at index 3 carries the href
            # Assign each item of td_l into a dict to build the JSON data {} and insert it into mongo.
            # Later, read the hrefs back from mongo, fetch the career data, and store it back
            # into this collection; finally, export all the data to Excel.
            j_data = {}
            try:
                j_data['Rank'] = td_l[0].get_text().strip()
                j_data['Mark'] = td_l[1].get_text().strip()
                j_data['WIND'] = td_l[2].get_text().strip()
                j_data['Competitor'] = td_l[3].get_text().strip()
                j_data['DOB'] = td_l[4].get_text().strip()
                j_data['Nat'] = td_l[5].get_text().strip()
                j_data['Pos'] = td_l[6].get_text().strip()
                j_data['Venue'] = td_l[8].get_text().strip()
                j_data['Date'] = td_l[9].get_text().strip()
                j_data['href'] = td_l[3].find('a')['href']
            except IndexError:
                pass
            db.athletes.insert_one(j_data)

if __name__ == '__main__':
    spider_iaaf()
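The row-parsing loop in the spider can be tried offline against a tiny hand-made table before running it on the live site. This sketch uses a made-up two-column-plus-link layout, not the real IAAF markup, whose column order may differ:

```python
from bs4 import BeautifulSoup

# A miniature stand-in for the records table (illustrative HTML, not the
# real IAAF page): one <tr> per athlete, with the link cell carrying href.
html = '''
<table class="records-table">
  <tr><td>1</td><td>8.68</td><td><a href="/athletes/juan-echevarria">Juan Miguel ECHEVARRIA</a></td></tr>
  <tr><td>2</td><td>8.58</td><td><a href="/athletes/luvo-manyonga">Luvo MANYONGA</a></td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
rows = []
for tr in soup.find_all('tr'):       # one iteration per row, as in the spider
    td_l = tr.find_all('td')
    rows.append({
        'Rank': td_l[0].get_text().strip(),
        'Mark': td_l[1].get_text().strip(),
        'Competitor': td_l[2].get_text().strip(),
        'href': td_l[2].find('a')['href'],
    })
print(rows[0]['href'])  # /athletes/juan-echevarria
```

Because the dicts are built before any database call, this part can be tested without MongoDB running.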
bs4 usage:
BeautifulSoup is a third-party library and needs to be installed before use:
pip install bs4
Configuration:
(1) cd ~
(2) mkdir .pip
(3) vi ~/.pip/pip.conf
(4) edit the contents; same as on Windows
What is bs4?
It lets you extract specified content from a web page quickly and easily: give it a web page as a string, use its interface to turn that string into an object, and then call the object's methods to extract the data.
Learning bs4 syntax
Learn it with local files first; once it works, write the code that fetches over the network.
(1) Get a node by tag name
Only the first node that meets the requirements is found.
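A minimal sketch of tag-name access, using a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

# Accessing a node by tag name (soup.a) returns only the FIRST match,
# even though the document contains two <a> tags.
html = '<div><a href="/first">one</a><a href="/second">two</a></div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.a)          # the first <a> only
print(soup.a['href'])  # /first
```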
(2) Get text content and attributes
Attributes:
soup.a.attrs       # returns a dictionary of all the attributes and their values
soup.a['href']     # get the href attribute
Text:
soup.a.string
soup.a.text
soup.a.get_text()
[Note] When the tag contains another tag as well, .string comes back as None, while the other two still return the plain text content.
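The [Note] above can be demonstrated with a small made-up snippet in which the `<a>` tag holds both text and a nested tag:

```python
from bs4 import BeautifulSoup

# Mixed content inside <a>: a text node plus a nested <span>.
soup = BeautifulSoup('<a href="#">hello <span>world</span></a>', 'html.parser')
print(soup.a.string)      # None -- mixed content, so .string gives up
print(soup.a.text)        # hello world
print(soup.a.get_text())  # hello world
```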
(3) The find method
soup.find('a')
soup.find('a', class_='xxx')
soup.find('a', title='xxx')
soup.find('a', id='xxx')
soup.find('a', id=re.compile(r'xxx'))
[Note] find only returns the first tag that meets the requirements, as a single object.
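A sketch of the find calls above on a made-up two-link snippet (the `id` and `class` values are illustrative):

```python
import re
from bs4 import BeautifulSoup

html = '<a id="x1" class="nav" href="/a">A</a><a id="x2" class="nav" href="/b">B</a>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('a')['href'])                        # /a -- first match only
print(soup.find('a', id='x2')['href'])               # /b -- filtered by attribute
print(soup.find('a', id=re.compile(r'x\d'))['href']) # /a -- regex on the attribute
print(soup.find('a', class_='missing'))              # None -- no match at all
```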
(4) find_all
It returns a list of all the objects that meet the requirements.
soup.find_all('a')
soup.find_all('a', class_='wang')
soup.find_all('a', id=re.compile(r'xxx'))
soup.find_all('a', limit=2)  # extract only the first two matches
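And the same for find_all, again on an illustrative snippet (the class name "wang" matches the example above):

```python
from bs4 import BeautifulSoup

html = '<a class="wang">1</a><a class="wang">2</a><a class="li">3</a>'
soup = BeautifulSoup(html, 'html.parser')
print(len(soup.find_all('a')))                        # 3 -- every <a>
print(len(soup.find_all('a', class_='wang')))         # 2 -- filtered by class
print([t.text for t in soup.find_all('a', limit=2)])  # ['1', '2'] -- first two only
```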