Preface
- I ran into this at work and wrote it up briefly
- My understanding is limited, so corrections are welcome
There is only one real duty for every human being: to find himself, then to hold to it in his heart for a lifetime, wholeheartedly and without ceasing. All other paths are incomplete: ways of escape, a cowardly return to popular ideals, drifting with the tide, and fear of one's own heart - Hermann Hesse, Demian
Collection process:
- Log in automatically
- Get the data on the current page of the provider ranking
- Get the total number of pages and the element for the next-page link
- Loop over the total number of pages, simulating a click on the next-page link each time to collect that page's data
- Combine the data
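The steps above come down to a read-then-advance loop. The following is only a sketch of that control flow, with the browser replaced by plain callables so it runs without Selenium. The names `collect_all_pages`, `read_rows`, and `go_next` are hypothetical and do not appear in the script itself.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def collect_all_pages(
    read_rows: Callable[[], List[Row]],
    go_next: Callable[[], None],
    total_pages: int,
) -> List[Row]:
    """Read page 1, then advance and read again for each remaining page."""
    rows = list(read_rows())         # page 1 is already displayed after login
    for _ in range(1, total_pages):  # one "next page" click per remaining page
        go_next()
        rows.extend(read_rows())
    return rows


# Stand-in for the browser: three fake pages (illustrative data only)
fake_pages = [[{"name": "a"}, {"name": "b"}], [{"name": "c"}], [{"name": "d"}]]
state = {"page": 0}


def read_rows() -> List[Row]:
    return fake_pages[state["page"]]


def go_next() -> None:
    state["page"] += 1


all_rows = collect_all_pages(read_rows, go_next, total_pages=len(fake_pages))
print([r["name"] for r in all_rows])
```

In the real script, `read_rows` would correspond to the `.ullist` extraction and `go_next` to clicking the next-page link and waiting for the page to load.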
from seleniumwire import webdriver
import json
import time
from selenium.webdriver.common.by import By
import pandas as pd
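# Automatic login: reuse cookies saved from an earlier logged-in session
driver = webdriver.Chrome()
with open(r'C:\Users\山河已无恙\Documents\GitHub\reptile_demo\demo\cookie.txt', 'r', encoding='utf-8') as f:
    cookies = json.load(f)
# Selenium only accepts cookies for the domain that is currently loaded, so open the site first
driver.get('https://cdn.chinaz.com/')
for cookie in cookies:
    driver.add_cookie(cookie)
# Reload the page so the request is sent with the login cookies
driver.get('https://cdn.chinaz.com/')
# Crude fixed wait for the page to finish rendering
time.sleep(6)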
# CDN provider ranking, from https://cdn.chinaz.com/
CDN_Manufacturer = []
new_div_element = driver.find_element(By.CSS_SELECTOR, ".toplist-main")
div_elements = new_div_element.find_elements(By.CSS_SELECTOR, ".ullist")
for mdn_ms in div_elements:
    # Link to the provider's home page
    a_target = mdn_ms.find_element(By.CSS_SELECTOR, ".tohome")
    home_url = a_target.get_attribute('href')
    print(mdn_ms.text)
    # The row text holds one field per line, in the page's column order
    text_temp = str(mdn_ms.text).split("\n")
    CDN_Manufacturer.append({
        "公司名称": text_temp[0],     # company name
        "官网地址": home_url,         # official website
        "经营资质": text_temp[1],     # business qualification
        "CDN网站数量": text_temp[2],  # number of sites using this CDN
        "网站占比": text_temp[3],     # share of sites
        "IP节点": text_temp[4],       # number of IP nodes
        "IP占比": text_temp[5],       # share of IPs
    })
# The "尾页" (last page) link carries the total page count in its 'val' attribute
sum_page = driver.find_element(By.XPATH, "//a[contains(@title, '尾页')]")
attribute_value = sum_page.get_attribute('val')
print(attribute_value)
# Page 1 is already collected, so click "下一页" (next page) once per remaining page
for page in range(1, int(attribute_value)):
    next_page = driver.find_element(By.XPATH, "//a[contains(@title, '下一页')]")
    next_page.click()
    time.sleep(5)
    new_div_element = driver.find_element(By.CSS_SELECTOR, ".toplist-main")
    div_elements = new_div_element.find_elements(By.CSS_SELECTOR, ".ullist")
    for mdn_ms in div_elements:
        a_target = mdn_ms.find_element(By.CSS_SELECTOR, ".tohome")
        home_url = a_target.get_attribute('href')
        print(mdn_ms.text)
        text_temp = str(mdn_ms.text).split("\n")
        CDN_Manufacturer.append({
            "公司名称": text_temp[0],
            "官网地址": home_url,
            "经营资质": text_temp[1],
            "CDN网站数量": text_temp[2],
            "网站占比": text_temp[3],
            "IP节点": text_temp[4],
            "IP占比": text_temp[5],
        })
df = pd.DataFrame(CDN_Manufacturer)
# Save the data as a CSV file
df.to_csv('CDN_Manufacturer.csv', index=False)
print("数据已保存为CSV文件")  # "Data saved as a CSV file"
print(df)
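Indexing `text_temp[0]` through `text_temp[5]` assumes every row renders exactly six lines of text. A row with a missing field would raise `IndexError` and abort the run. A more defensive variant might look like the sketch below. The helper name `parse_row` and the choice to pad missing fields with empty strings are assumptions, not something the script above does.

```python
from typing import Dict, List

# Column names in the order the row text is split by the script
TEXT_FIELDS: List[str] = ["公司名称", "经营资质", "CDN网站数量", "网站占比", "IP节点", "IP占比"]


def parse_row(row_text: str, home_url: str) -> Dict[str, str]:
    """Map the newline-separated text of one ranking row to a record."""
    parts = [p.strip() for p in row_text.split("\n")]
    # Pad short rows so a missing field becomes "" instead of an IndexError
    parts += [""] * (len(TEXT_FIELDS) - len(parts))
    record = dict(zip(TEXT_FIELDS, parts))
    record["官网地址"] = home_url
    return record


# Illustrative row: the middle three values are placeholders, not real data
sample = "阿里云\nplaceholder\n0\n0%\n238994\n12.3%"
print(parse_row(sample, "https://www.aliyun.com/"))
print(parse_row("only a name", ""))
```

Each `CDN_Manufacturer.append({...})` call could then become `CDN_Manufacturer.append(parse_row(mdn_ms.text, home_url))`, which also removes the duplicated dictionary literal.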
Printing the DataFrame directly shows the collected result:
数据已保存为CSV文件
公司名称 官网地址 ... IP节点 IP占比
0 百度云加速 https://cloud.baidu.com/product/cdn.html ... 92100 4.7%
1 阿里云 https://www.aliyun.com/ ... 238994 12.3%
2 腾讯云 https://cloud.tencent.com/ ... 57212 2.9%
3 知道创宇云防御 https://www.yunaq.com/jsl/ ... 16333 0.8%
4 网宿 http://www.chinanetcenter.com/ ... 67683 3.5%
.. ... ... ... ... ...
67 睿江CDN http://www.efly.cc/ ... 1 <0.1
68 领智云画科 http://www.linkingcloud.com/ ... 6 <0.1
69 郑州珑凌 http://www.lonlife.cn/ ... 1 <0.1
70 中国联合网络 http://www.wocloud.cn/ ... 2 <0.1
71 极兔云CDN https://www.jitucdn.com/ ... 9 <0.1
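All scraped values are strings, and the share column mixes forms such as `4.7%` and `<0.1`. Before sorting or charting, it may help to convert them to numbers. A minimal sketch follows. The helper names are hypothetical, and treating `<0.1` as its upper bound 0.1 is an assumption made only for illustration.

```python
from typing import Optional


def parse_count(text: str) -> Optional[int]:
    """Convert a count such as '238994' to int; return None if it is not numeric."""
    try:
        return int(text.strip().replace(",", ""))
    except ValueError:
        return None


def parse_share(text: str) -> Optional[float]:
    """Convert '12.3%' to 12.3 and '<0.1' to 0.1 (upper bound); None if unparseable."""
    cleaned = text.strip().lstrip("<").rstrip("%")
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_count("238994"), parse_share("12.3%"), parse_share("<0.1"))
```

Returning `None` for unparseable text keeps bad rows visible rather than silently turning them into zeros.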
Data visualization
A simple visualization of the data using pyecharts:
def to_echarts(CDN_Manufacturer):
    from pyecharts.charts import Bar
    from pyecharts import options as opts
    # Built-in themes are listed in pyecharts.globals.ThemeType
    from pyecharts.globals import ThemeType

    # Chart the first ten providers, in the order they were collected
    xaxis = [cdn["公司名称"] for cdn in CDN_Manufacturer][:10]
    yaxis1 = [cdn["CDN网站数量"] for cdn in CDN_Manufacturer][:10]
    yaxis2 = [cdn["IP节点"] for cdn in CDN_Manufacturer][:10]
    bar = (
        Bar(init_opts=opts.InitOpts(theme=ThemeType.LIGHT))
        .add_xaxis(xaxis)
        .add_yaxis("CDN网站数量", yaxis1)
        .add_yaxis("IP节点", yaxis2)
        # "主标题" / "副标题" are placeholder title and subtitle
        .set_global_opts(title_opts=opts.TitleOpts(title="主标题", subtitle="副标题"))
    )
    # Writes render.html to the current directory by default
    bar.render()
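`to_echarts` slices the first ten records in the order they were scraped, which is the site's own ranking. To chart the top ten by a particular column instead, the records can be sorted first. The sketch below assumes the counts are plain digit strings, as in the printed output. `top_n` is a hypothetical helper, not part of pyecharts.

```python
from typing import Dict, List


def top_n(records: List[Dict[str, str]], key: str, n: int = 10) -> List[Dict[str, str]]:
    """Return the n records with the largest integer value under `key`."""
    def as_int(record: Dict[str, str]) -> int:
        try:
            return int(record.get(key, "").strip())
        except ValueError:
            return 0  # non-numeric values sort last

    return sorted(records, key=as_int, reverse=True)[:n]


# IP节点 values taken from the printed DataFrame
records = [
    {"公司名称": "百度云加速", "IP节点": "92100"},
    {"公司名称": "阿里云", "IP节点": "238994"},
    {"公司名称": "腾讯云", "IP节点": "57212"},
]
print([r["公司名称"] for r in top_n(records, "IP节点", n=2)])
```

The sorted list can then be passed on, for example `to_echarts(top_n(CDN_Manufacturer, "IP节点"))`.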
Other visualization tools worth considering:
- Matplotlib: one of the most widely used data visualization libraries in Python. It covers line charts, scatter plots, histograms, pie charts, and more, produces static as well as interactive figures, and is highly customizable.
- Seaborn: a statistical data visualization library built on Matplotlib. It offers higher-level statistical chart types with better default styles and color themes.
- Plotly: an interactive visualization library with a wide range of chart types, including line charts, scatter plots, histograms, and heat maps. It can also be used to build interactive dashboards and visualization applications.
- Bokeh: a library for interactive charts that are rendered in the browser. It outputs HTML and JavaScript, with optional WebGL rendering, so the results work across browsers and devices.
- Altair: a declarative data visualization library that generates charts from concise Python syntax. It is based on the Vega-Lite specification.
References for parts of this post
© Copyright of the content at the referenced links belongs to the original authors. If anything infringes, please report it.
<pyecharts: https://pyecharts.org/#/zh-cn/quickstart>
<Matplotlib: https://github.com/matplotlib/matplotlib>
<Seaborn: https://github.com/seaborn/seaborn>
<Plotly: https://github.com/plotly/plotly.py>
<Bokeh: https://github.com/bokeh/bokeh>
<Altair: https://github.com/altair-viz/altair>
© 2018-2023 [email protected], All rights reserved. Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 4.0)