Python usage summary

requests library usage:

requests is a simple, easy-to-use HTTP library implemented in Python.

Because it is a third-party library, it must be installed from the command line before use:

pip install requests

After the installation completes, import it and it is ready to use.

Basic usage:

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.baidu.com')
print(response.status_code)    # print the status code
print(response.url)            # print the request URL
print(response.headers)        # print the response headers
print(response.cookies)        # print the cookie information
print(response.text)           # print the page source as text
print(response.content)        # print the response body as bytes

#!/usr/bin/env python
# encoding=utf-8

from __future__ import print_function

import requests
from bs4 import BeautifulSoup
import pymongo

db = pymongo.MongoClient().iaaf


def spider_iaaf():
    # the URL was switched from the 100 metres top list to the long jump top list
    # url = 'https://www.iaaf.org/records/toplists/sprints/100-metres/outdoor/men/senior/2018?page={}'
    url = 'https://www.iaaf.org/records/toplists/jumps/long-jump/outdoor/men/senior/2018?regionType=world&windReading=regular&page={}&bestResultsOnly=true'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15',
    }

    for page in range(1, 23):
        res = requests.get(url.format(page), headers=headers)
        html = res.text
        print(page)
        soup = BeautifulSoup(html, 'html.parser')
        # tbody_l = soup.find_all('tbody')
        record_table = soup.find_all('table', class_='records-table')
        list_re = record_table[2]
        tr_l = list_re.find_all('tr')
        for tr in tr_l:    # one <tr> per row of the table
            td_l = tr.find_all('td')    # list of <td> cells; the fourth one carries the athlete's href
            # Assign each item of td_l to build a JSON document {} and insert it into MongoDB.
            # Later, read the href back from MongoDB, visit it to get the career data,
            # store that back into the collection, then export all the data to Excel.

            j_data = {}
            try:
                j_data['Rank'] = td_l[0].get_text().strip()
                j_data['Mark'] = td_l[1].get_text().strip()
                j_data['WIND'] = td_l[2].get_text().strip()
                j_data['Competitor'] = td_l[3].get_text().strip()
                j_data['DOB'] = td_l[4].get_text().strip()
                j_data['Nat'] = td_l[5].get_text().strip()
                j_data['Pos'] = td_l[6].get_text().strip()
                j_data['Venue'] = td_l[8].get_text().strip()
                j_data['Date'] = td_l[9].get_text().strip()

                j_data['href'] = td_l[3].find('a')['href']
            except Exception:    # header rows have no <td> cells, so skip them
                pass
            db.athletes.insert_one(j_data)


if __name__ == '__main__':
    spider_iaaf()
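
The comments in the loop describe a follow-up step: read the stored href back out of MongoDB, visit the athlete's page for career data, and eventually export everything to Excel. A minimal sketch of that read-back step, assuming the stored hrefs are relative paths under https://www.iaaf.org (that base URL is an assumption, not something the original code confirms):

import pymongo
import requests

db = pymongo.MongoClient().iaaf
headers = {'User-Agent': 'Mozilla/5.0'}    # same idea as the headers used above

for doc in db.athletes.find({'href': {'$exists': True}}, {'href': 1}):
    profile_url = 'https://www.iaaf.org' + doc['href']    # assumed base URL for the relative href
    res = requests.get(profile_url, headers=headers)
    # parse the career data out of res.text here, then write it back, e.g.
    # db.athletes.update_one({'_id': doc['_id']}, {'$set': {'career': ...}})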

bs4 usage:
BeautifulSoup is a third-party library, so it also needs to be installed before use:

pip install bs4

Configuration (on Linux/macOS):
(1) cd ~
(2) mkdir .pip
(3) vi ~/.pip/pip.conf
(4) edit its contents, the same as on Windows
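
The pip.conf edited in step (4) holds the index/mirror configuration. A minimal sketch of its contents, assuming you want to point pip at a mirror (the URL below is only an example; use the same index you configured on Windows):

[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple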

What is bs4?

Its purpose is to extract specified content from a web page quickly and easily: give it a web page as a string, use its interface to turn that string into an object, and then extract the data through that object's methods.
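
A minimal sketch of that idea (the HTML string here is made up for illustration):

from bs4 import BeautifulSoup

html = '<a href="/athlete/1">Tom</a>'        # any web page as a string
soup = BeautifulSoup(html, 'html.parser')    # the interface turns the string into an object
print(soup.a['href'])                        # extract data through the object's methods: /athlete/1
print(soup.a.get_text())                     # Tom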

bs4 syntax:

Learn with local files first, then write the real code against the network.
(1) Get a node by its tag name
This only finds the first node that matches (see the combined example after this list).
(2) Get the text content and the attributes
Attributes:

soup.a.attrs returns a dictionary of all the attributes and their values
soup.a['href'] gets the href attribute

Text:

soup.a.string
soup.a.text
soup.a.get_text()
[Note] When the tag contains another tag, .string returns None, while the other two still return the plain text content.

(3) The find method

soup.find('a')
soup.find('a', class_='xxx')
soup.find('a', title='xxx')
soup.find('a', id='xxx')
soup.find('a', id=re.compile(r'xxx'))
[Note] find only matches the first tag that meets the requirements, and it returns a single object.

(4) find_all

It returns a list of all the objects that match the requirements.
soup.find_all('a')
soup.find_all('a', class_='wang')
soup.find_all('a', id=re.compile(r'xxx'))
soup.find_all('a', limit=2)    # extract only the first two matches
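
Putting the points above together on a small local HTML string (the HTML below is made up just to exercise each call):

import re
from bs4 import BeautifulSoup

html = '''
<html><body>
  <a href="/a1" id="link1" title="first">one</a>
  <a href="/a2" class="wang">two <b>bold</b></a>
  <a href="/a3" class="wang">three</a>
</body></html>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.a)                              # (1) tag-name access: only the first <a>
print(soup.a.attrs)                        # (2) all attributes as a dict
print(soup.a['href'])                      #     a single attribute: /a1
second = soup.find_all('a')[1]
print(second.string)                       # None, because this <a> contains a <b> tag
print(second.get_text())                   # 'two bold'
print(soup.find('a', title='first'))       # (3) find: first matching tag only
print(soup.find('a', id=re.compile(r'link')))
print(soup.find_all('a', class_='wang'))   # (4) find_all: list of every match
print(soup.find_all('a', limit=2))         # only the first two matches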

Origin: blog.51cto.com/14259167/2409354