Data collection: using Selenium to scrape a website's CDN vendor rankings

Preface


  • I ran into this at work, so here is a brief write-up
  • Where my understanding falls short, please help correct me

There is only one real duty for every human being: to find himself. Then stick to his life in his heart, wholeheartedly, and never stop. All other paths are incomplete, man's way of escaping, a cowardly return to popular ideals, drifting with the tide, and fear of the heart - Hermann Hesse, Demian


Collection process:

  1. Log in automatically
  2. Get the data on the current page of the vendor ranking
  3. Get the total number of pages and the element for the next-page link
  4. Loop over the total number of pages, simulating a click on the next-page link to collect each page's data
  5. Summarize the data
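
Step 1 depends on a cookie file exported from a session that is already logged in. The sketch below shows one way such a file could be written and read back with the standard json module. It assumes the cookies come from Selenium's driver.get_cookies(), which returns a list of dicts; the helper names save_cookies and load_cookies, the file name, and the sample cookie values are hypothetical and not part of the original script.

```python
import json
import os
import tempfile


def save_cookies(cookies, path):
    """Write a list of cookie dicts (e.g. from driver.get_cookies()) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cookies, f, ensure_ascii=False, indent=2)


def load_cookies(path):
    """Read the cookie list back; each dict can then be passed to driver.add_cookie()."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical sample standing in for driver.get_cookies() after a manual login.
sample_cookies = [
    {"name": "session", "value": "abc123", "domain": ".chinaz.com", "path": "/"}
]

cookie_path = os.path.join(tempfile.gettempdir(), "cookie_demo.json")
save_cookies(sample_cookies, cookie_path)
restored = load_cookies(cookie_path)
print(restored == sample_cookies)
```

In a real run, sample_cookies would be replaced by the result of driver.get_cookies(), called once after logging in by hand.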
from seleniumwire import webdriver
import json
import time
from selenium.webdriver.common.by import By
import pandas as pd


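# Automatic login: load cookies saved from a logged-in session
driver = webdriver.Chrome()
# Raw string so the backslashes in the Windows path are not treated as escapes
with open(r'C:\Users\山河已无恙\Documents\GitHub\reptile_demo\demo\cookie.txt', 'r', encoding='utf-8') as f:
    cookies = json.load(f)

# Open the site first: cookies can only be added for the domain currently loaded
driver.get('https://cdn.chinaz.com/')
for cookie in cookies:
    driver.add_cookie(cookie)

# Reload so the page is served with the login cookies applied
driver.get('https://cdn.chinaz.com/')

# Give the page time to finish rendering
time.sleep(6)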
# Fetch the CDN vendor ranking from https://cdn.chinaz.com/
CDN_Manufacturer = []
new_div_element = driver.find_element(By.CSS_SELECTOR, ".toplist-main")
div_elements = new_div_element.find_elements(By.CSS_SELECTOR, ".ullist")
for mdn_ms in div_elements:
    a_target = mdn_ms.find_element(By.CSS_SELECTOR, ".tohome")
    home_url = a_target.get_attribute('href')
    print(mdn_ms.text)
    # Each row's text holds one field per line
    text_temp = str(mdn_ms.text).split("\n")
    CDN_Manufacturer.append({
        "公司名称": text_temp[0],      # company name
        "官网地址": home_url,          # official website
        "经营资质": text_temp[1],      # business qualification
        "CDN网站数量": text_temp[2],   # number of sites using the CDN
        "网站占比": text_temp[3],      # share of sites
        "IP节点": text_temp[4],        # number of IP nodes
        "IP占比": text_temp[5],        # share of IPs
    })
# The last-page link ('尾页') carries the total page count in its 'val' attribute
sum_page = driver.find_element(By.XPATH, "//a[contains(@title, '尾页')]")
attribute_value = sum_page.get_attribute('val')

print(attribute_value)
# Page 1 is already collected, so click the next-page link ('下一页') once per remaining page
for page in range(1, int(attribute_value)):
    next_page = driver.find_element(By.XPATH, "//a[contains(@title, '下一页')]")
    next_page.click()
    # Wait for the next page of rows to load
    time.sleep(5)
    new_div_element = driver.find_element(By.CSS_SELECTOR, ".toplist-main")
    div_elements = new_div_element.find_elements(By.CSS_SELECTOR, ".ullist")
    for mdn_ms in div_elements:
        a_target = mdn_ms.find_element(By.CSS_SELECTOR, ".tohome")
        home_url = a_target.get_attribute('href')
        print(mdn_ms.text)
        text_temp = str(mdn_ms.text).split("\n")
        CDN_Manufacturer.append({
            "公司名称": text_temp[0],      # company name
            "官网地址": home_url,          # official website
            "经营资质": text_temp[1],      # business qualification
            "CDN网站数量": text_temp[2],   # number of sites using the CDN
            "网站占比": text_temp[3],      # share of sites
            "IP节点": text_temp[4],        # number of IP nodes
            "IP占比": text_temp[5],        # share of IPs
        })

# Collect all rows into a DataFrame
df = pd.DataFrame(CDN_Manufacturer)

# Save the data as a CSV file
df.to_csv('CDN_Manufacturer.csv', index=False)

print("数据已保存为CSV文件")  # "The data has been saved as a CSV file"
# Print the DataFrame to check the result
print(df)
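
The fixed time.sleep(6) and time.sleep(5) pauses in the script either waste time or turn out too short on a slow connection. Selenium ships WebDriverWait in selenium.webdriver.support.ui for this purpose; the sketch below only illustrates the same polling idea without Selenium, so it can be run anywhere. The names wait_until and fake_find_rows are hypothetical and exist purely for illustration.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Hypothetical stand-in for a page whose rows only appear on the third poll.
calls = {"count": 0}


def fake_find_rows():
    calls["count"] += 1
    return ["row"] * 20 if calls["count"] >= 3 else []


rows = wait_until(fake_find_rows, timeout=2.0, interval=0.01)
print(len(rows), calls["count"])
```

With the driver from the script, the condition could be something like lambda: driver.find_elements(By.CSS_SELECTOR, ".ullist"), which returns an empty list until rows exist. After a page change the previous page's rows may still be present for a moment, so a stricter condition is probably needed there.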


Printing the pandas DataFrame directly shows the generated result:

数据已保存为CSV文件
       公司名称                                      官网地址  ...    IP节点   IP占比
0     百度云加速  https://cloud.baidu.com/product/cdn.html  ...   92100   4.7%
1       阿里云                   https://www.aliyun.com/  ...  238994  12.3%
2       腾讯云                https://cloud.tencent.com/  ...   57212   2.9%
3   知道创宇云防御                https://www.yunaq.com/jsl/  ...   16333   0.8%
4        网宿            http://www.chinanetcenter.com/  ...   67683   3.5%
..      ...                                       ...  ...     ...    ...
67    睿江CDN                       http://www.efly.cc/  ...       1   <0.1
68    领智云画科              http://www.linkingcloud.com/  ...       6   <0.1
69     郑州珑凌                    http://www.lonlife.cn/  ...       1   <0.1
70   中国联合网络                    http://www.wocloud.cn/  ...       2   <0.1
71   极兔云CDN                  https://www.jitucdn.com/  ...       9   <0.1
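
The output above shows every field stored as text, including percentages such as 4.7% and <0.1. Below is a minimal sketch of a helper that splits one row's text into fields and converts the numeric ones. It assumes each row's text has six newline-separated parts in the order the script uses, and that counts contain only digits and optional thousands separators. The names parse_percent and parse_row, the English field names (the script itself uses Chinese keys), and the sample row are all hypothetical.

```python
def parse_percent(text):
    """Convert '12.3%' to 12.3; return None for thresholds such as '<0.1'."""
    cleaned = text.strip().rstrip("%")
    if cleaned.startswith("<"):
        return None  # only an upper bound is shown, so no exact value exists
    return float(cleaned)


def parse_row(row_text, home_url):
    """Split one ranking row into a dict, assuming six newline-separated fields."""
    parts = row_text.split("\n")
    if len(parts) < 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {parts!r}")
    return {
        "name": parts[0],
        "home_url": home_url,
        "qualification": parts[1],
        "site_count": int(parts[2].replace(",", "")),
        "site_share": parse_percent(parts[3]),
        "ip_nodes": int(parts[4].replace(",", "")),
        "ip_share": parse_percent(parts[5]),
    }


# Made-up row text; real rows would come from mdn_ms.text in the script.
sample = "Vendor A\nLicensed\n1200\n3.4%\n238994\n12.3%"
record = parse_row(sample, "https://example.com/")
print(record["ip_nodes"], record["ip_share"], parse_percent("<0.1"))
```

Calling a helper like this from both the first-page block and the pagination loop would also remove the duplicated dictionary code, and the length check turns a silently misaligned row into an explicit error.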

Data visualization

A simple visualization of the data with pyecharts:

def to_echarts(CDN_Manufacturer):
    from pyecharts.charts import Bar
    from pyecharts import options as opts
    # Built-in theme types are listed in pyecharts.globals.ThemeType
    from pyecharts.globals import ThemeType
    # Chart only the first 10 vendors in the ranking
    xaxis = [cdn["公司名称"] for cdn in CDN_Manufacturer][:10]
    yaxis1 = [cdn["CDN网站数量"] for cdn in CDN_Manufacturer][:10]
    yaxis2 = [cdn["IP节点"] for cdn in CDN_Manufacturer][:10]
    bar = (
        Bar(init_opts=opts.InitOpts(theme=ThemeType.LIGHT))
        .add_xaxis(xaxis)
        .add_yaxis("CDN网站数量", yaxis1)
        .add_yaxis("IP节点", yaxis2)
        # Placeholder title ("主标题") and subtitle ("副标题")
        .set_global_opts(title_opts=opts.TitleOpts(title="主标题", subtitle="副标题"))
    )
    # Writes render.html to the current directory by default
    bar.render()
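
The values collected by the scraper are still strings when they reach the chart. A minimal sketch of preparing the three series as numbers first, assuming the count fields contain only digits; chart_series is a hypothetical helper name. In the sample records, the names and IP node counts are taken from the printed result earlier in this post, while the CDN网站数量 values are made up because that column is elided in the output.

```python
def chart_series(records, n=10):
    """Return (names, site_counts, ip_nodes) for the first n records, counts as ints."""
    top = records[:n]
    names = [r["公司名称"] for r in top]
    site_counts = [int(r["CDN网站数量"]) for r in top]
    ip_nodes = [int(r["IP节点"]) for r in top]
    return names, site_counts, ip_nodes


# Sample records; the CDN网站数量 values are invented for illustration.
records = [
    {"公司名称": "百度云加速", "CDN网站数量": "1000", "IP节点": "92100"},
    {"公司名称": "阿里云", "CDN网站数量": "2000", "IP节点": "238994"},
    {"公司名称": "腾讯云", "CDN网站数量": "3000", "IP节点": "57212"},
]

names, site_counts, ip_nodes = chart_series(records, n=2)
print(names, site_counts, ip_nodes)
```

The three lists could then be passed to add_xaxis and add_yaxis in to_echarts in place of xaxis, yaxis1 and yaxis2.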


Some other visualization tools are also worth considering:

Matplotlib: Matplotlib is one of the most commonly used data visualization libraries in Python, providing a wide range of drawing functions, including line charts, scatter plots, histograms, pie charts, etc. It can be used to create static charts and interactive graphs, and is highly customizable.

Seaborn: Seaborn is a statistical data visualization library based on Matplotlib, focusing on statistical charts and information visualization. Seaborn provides more advanced statistical chart types with better default styles and color themes.

Plotly: Plotly is an interactive visualization library for creating highly customized charts and visualizations. Plotly provides a wealth of chart types, including line charts, scatter plots, histograms, heat maps, etc., and supports the creation of interactive dashboards and visualization applications.

Bokeh: Bokeh is a library for creating interactive charts and visualizations with powerful drawing capabilities and cross-platform support. Bokeh can generate HTML, JavaScript, and WebGL, enabling cross-browser and cross-device visualizations.

Altair: Altair is a declarative data visualization library that uses simple Python syntax to generate visual charts. Altair is based on the Vega-Lite specification, with a clear syntax and a concise API.

References for parts of this post

© Copyright of the content at the reference links in this article belongs to the original authors; if there is any infringement, please let me know


<pyecharts: https://pyecharts.org/#/zh-cn/quickstart>

<Matplotlib: https://github.com/matplotlib/matplotlib>

<Seaborn: https://github.com/seaborn/seaborn>

<Plotly: https://github.com/plotly/plotly.py>

<Bokeh: https://github.com/bokeh/bokeh>

<Altair: https://github.com/altair-viz/altair>


© 2018-2023 [email protected], All rights reserved. Attribution-Non-Commercial-Share Alike (CC BY-NC-SA 4.0)

Origin blog.csdn.net/sanhewuyang/article/details/132452477