Bridge data: scraping a ranking list of the world's highest bridges (Python Crawler 120 Examples, No. 32)

This is the 32nd example in the "Crawler 120 Examples" column. It introduces the PyQuery parsing library, which is very friendly to people moving from front-end development to Python, because it mimics the way jQuery works.

Before starting, install pyquery in your local development environment with the command pip install pyquery. The version I am using is 1.4.3.

Basic usage looks like this, and it really is this simple once you understand it.

from pyquery import PyQuery as pq

s = '<html><title>橡皮擦的PyQuery小课堂</title></html>'
doc = pq(s)
print(doc('title'))

The output is the following:

<title>橡皮擦的PyQuery小课堂</title>

You can also pass the URL of the page to be parsed directly to the PyQuery object. The code is as follows:

from pyquery import PyQuery as pq

url = "https://www.bilibili.com/"
doc = pq(url=url,encoding="utf-8")

print(doc('title')) # <title>哔哩哔哩 (゜-゜)つロ 干杯~-bilibili</title>

In the same way, you can initialize the PyQuery object from a local file; just change the parameter to filename.

With the basics covered, we can move on to practice. Below is an analysis of the target site for this example.

Target site analysis

The target this time is the List of Highest International Bridges on highestbridges.com, where the data is presented as a ranked table.

While browsing, I found that most of the bridges were designed in China. Sure enough, we lead the world in infrastructure.

The pagination pattern is as follows:

http://www.highestbridges.com/wiki/index.php?title=List_of_Highest_International_Bridges/Page_1
http://www.highestbridges.com/wiki/index.php?title=List_of_Highest_International_Bridges/Page_2
# In testing, the data runs out once you reach page 13; roughly 1,200 bridges in total
http://www.highestbridges.com/wiki/index.php?title=List_of_Highest_International_Bridges/Page_13
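
The pattern above can be turned into a list of URLs with a short helper. This is only a sketch: the names BASE_URL and page_urls are introduced here for illustration, and the default of 13 pages comes from the observation above.

```python
# URL template taken from the pagination pattern; {} is the page number
BASE_URL = (
    "http://www.highestbridges.com/wiki/index.php"
    "?title=List_of_Highest_International_Bridges/Page_{}"
)


def page_urls(first=1, last=13):
    """Return the list-page URLs from `first` to `last`, inclusive."""
    return [BASE_URL.format(number) for number in range(first, last + 1)]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```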

Since the target data is presented as a table, it can be extracted column by column according to the table header: Rank, Name, Height (meters / feet), Main Span Length, Completed, Location, Country.

Coding time

Before writing the full crawler, practice on the first page.

from pyquery import PyQuery as pq

url = "http://www.highestbridges.com/wiki/index.php?title=List_of_Highest_International_Bridges/Page_1"
doc = pq(url=url, encoding='utf-8')
print(doc('title'))


def remove(text):
    # Strip line-break residue from the cell text
    return text.replace("<br/>", "").replace("\n", "")


# Get every row that holds data. These are CSS selectors; calling them jQuery selectors is fine too
items = doc.find('table.wikitable.sortable tr').items()
for item in items:
    td_list = item.find('td')
    if not td_list:
        # Rows without <td> cells (such as the header row) hold no data
        continue
    rank = td_list.eq(1).find("span.sorttext").text()
    name = td_list.eq(2).find("a").text()
    height = remove(td_list.eq(3).text())
    length = remove(td_list.eq(4).text())
    completed = td_list.eq(5).text()
    location = td_list.eq(6).text()
    country = td_list.eq(7).text()
    print(rank, name, height, length, completed, location, country)

With the code written out, you can see that it still relies heavily on selectors: you need to be comfortable with them to pick out the target elements and extract the final data.

Next, extend the code above to cover all the data by fetching the pages in a loop.

from pyquery import PyQuery as pq
import time


def remove(text):
    # Strip line-break residue and swap English commas for full-width ones,
    # so field text cannot clash with the comma used as the CSV delimiter
    return text.replace("<br/>", "").replace("\n", "").replace(",", ",")


def get_data(page):
    url = "http://www.highestbridges.com/wiki/index.php?title=List_of_Highest_International_Bridges/Page_{}".format(
        page)
    print(url)
    doc = pq(url=url, encoding='utf-8')
    print(doc('title'))

    # Get every row that holds data. These are CSS selectors; calling them jQuery selectors is fine too
    items = doc.find('table.wikitable.sortable tr').items()
    for item in items:
        td_list = item.find('td')
        if not td_list:
            # Rows without <td> cells (such as the header row) hold no data
            continue
        rank = td_list.eq(1).find("span.sorttext").text()
        name = remove(td_list.eq(2).find("a").text())
        height = remove(td_list.eq(3).text())
        length = remove(td_list.eq(4).text())
        completed = remove(td_list.eq(5).text())
        location = remove(td_list.eq(6).text())
        country = remove(td_list.eq(7).text())
        data_tuple = (rank, name, height, length, completed, location, country)

        save(data_tuple)


def save(data_tuple):
    try:
        my_str = ",".join(data_tuple) + "\n"
        with open("./data.csv", "a+", encoding="utf-8") as f:
            f.write(my_str)
            print("Write complete")
    except Exception as e:
        # Report the failure instead of silently swallowing it
        print("Write failed:", e)


if __name__ == '__main__':
    # Pages 1 to 13
    for page in range(1, 14):
        get_data(page)
        time.sleep(3)

Some fields turn out to contain English commas, which would break the comma-separated output, so they are all replaced uniformly. That is the job of the remove() function.
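
An alternative to replacing commas, not used in this article, is to let Python's standard csv module quote any field that contains one. The sketch below writes to an in-memory buffer so it can be run on its own; the helper name write_rows and the sample row are hypothetical.

```python
import csv
import io


def write_rows(rows, stream):
    """Write rows as CSV; fields containing commas are quoted automatically."""
    writer = csv.writer(stream)
    writer.writerows(rows)


# Made-up sample row whose last field contains a comma
sample_rows = [("1", "Example Bridge", "Some Town, Some Province")]

buffer = io.StringIO()
write_rows(sample_rows, buffer)

# Read the text back to confirm the comma survived as part of one field
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back)
```

When writing to a real file with this approach, open it with newline="" as the csv documentation recommends, so that line endings are not translated twice.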

Origin juejin.im/post/7079963162409173005