Crawling express delivery outlet information

https://www.kuaidi100.com/network/net_4117_all_all_2.htm

Get the shop links on each list page

import requests
from bs4 import BeautifulSoup

url = "https://www.kuaidi100.com/network/net_4117_all_all_1.htm"
try:
    r = requests.get(url)
    r.raise_for_status()   ## raise an exception for non-200 responses
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except requests.RequestException:
    print("crawl failed")

soup = BeautifulSoup(r.text, "html.parser")
networklist = soup.select(".networkListItem")
for i in networklist:
    print(i.find("a").attrs['href'])

The list-page links differ only in the trailing number: https://www.kuaidi100.com/network/net_4117_all_all_2.htm ends in 2.htm, the next page in 3.htm, and so on.

So we can keep incrementing the trailing number of https://www.kuaidi100.com/network/net_4117_all_all_2.htm and check whether each page still lists any shops; a page with 0 shop entries means we have gone past the last page.

url = "https://www.kuaidi100.com/network/net_4117_all_all_60.htm"
try:
    r = requests.get(url)
    r.raise_for_status()   ##
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except:
    print("爬取出错")
    
soup = BeautifulSoup(r.text, "html.parser")
networklist = soup.select(".networkListItem")
len(networklist)  
0  ##长度为0 /net_4117_all_all_60.htm  链接无信息

url = "https://www.kuaidi100.com/network/net_4117_all_all_1.htm"
soup = BeautifulSoup(r.text, "html.parser")
networklist = soup.select(".networkListItem")
len(networklist)  
10  ##长度为10 该连接有网店信息

Match the URL with a regular expression so the trailing page number can be incremented

import re 
pattern = re.compile(r"(.*_)(\d+)\.htm$", re.I)
url = "https://www.kuaidi100.com/network/net_4117_all_all_1.htm"
m = pattern.match(url)
m.group(0)
'https://www.kuaidi100.com/network/net_4117_all_all_1.htm'
m.group(1)
'https://www.kuaidi100.com/network/net_4117_all_all_'
m.group(2)
'1'
m.groups()
('https://www.kuaidi100.com/network/net_4117_all_all_', '1')
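With the base and the page number separated, the next page's URL can be rebuilt by incrementing the number. A minimal sketch, continuing from the match object m above:

base, num = m.groups()
next_url = f"{base}{int(num) + 1}.htm"   ## 'https://www.kuaidi100.com/network/net_4117_all_all_2.htm'
print(next_url)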

Keep incrementing the page number in the URL; when a page yields fewer than 10 shop links (a full page has 10), it is the last page and the loop stops.

## loop skeleton: keep incrementing a counter until the stop condition is met
i = 1
while True:
    if i < 50:
        i = i + 1
        print(i)
    else:
        print("you are over")
        break
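Putting the pieces together: a minimal sketch (the function name collect_shop_links is my own) that walks the numbered list pages, stops at the first empty or short page, and collects every shop link along the way:

import re

import requests
from bs4 import BeautifulSoup

pattern = re.compile(r"(.*_)(\d+)\.htm$", re.I)

def collect_shop_links(start_url):
    """Walk the numbered list pages and collect the shop detail links."""
    base, num = pattern.match(start_url).groups()
    page, links = int(num), []
    while True:
        url = f"{base}{page}.htm"
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        soup = BeautifulSoup(r.text, "html.parser")
        items = soup.select(".networkListItem")
        if not items:                        ## empty page: past the last list page
            break
        links.extend(i.find("a").attrs['href'] for i in items)
        if len(items) < 10:                  ## a short page is the last page
            break
        page += 1
    return links

links = collect_shop_links("https://www.kuaidi100.com/network/net_4117_all_all_1.htm")
print(len(links), links[:3])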

Analysis

## 1. Take the initial list URL and collect every shop detail link (covered by the pagination sketch above)

## 2. Based on step 1, fetch each collected detail URL (a minimal sketch follows)
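A minimal sketch for step 2, assuming links holds the hrefs collected in step 1; urljoin is used because the page may emit relative hrefs:

from urllib.parse import urljoin

import requests

pages = []
for href in links:                           ## links collected by the step-1 sketch above
    detail_url = urljoin("https://www.kuaidi100.com/", href)   ## handles relative or absolute hrefs
    r = requests.get(detail_url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    pages.append(r.text)                     ## keep the raw HTML for step 3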




## 3. Scrape the shop information from each detail URL
url = "https://www.kuaidi100.com/network/networkdt792925391984709.htm"
try:
    r = requests.get(url)
    r.raise_for_status()   ## raise an exception for non-200 responses
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except requests.RequestException:
    print("crawl failed")
    
## scrape the information
soup = BeautifulSoup(r.text, "html.parser")
kdinfo = soup.select(".kd-info")[0]   ## select() returns a list, take the first element
print(kdinfo.prettify())              ## print the fetched block
## kddlinfo = kdinfo.find_all("dl")   ## the dl tags inside kdinfo hold dt (field name) and dd (field detail) pairs
title = kdinfo.h1.text



ddlist = kdinfo.find_all("dd")   ## all the dd tags (field details)

for dd in ddlist:
    print("\n----------")
    print(dd.text)

## printed output:
-------------------------
河南,驻马店市,正阳县

-------------------------
正确路口东段北侧

-------------------------
查件电话:17744695161业务电话:17744695161

-------------------------
联系时,请一定说明是在快递100看到的信息,谢谢!

-------------------------
南环路以北、西环路以东,交警队、电视台以南,正付路-东环路以西。东、西、南、北大街,中心街、花园路、正大路、东、西顺河街、慎西路。县直各单位、局委、厂区、学校。铜钟街及铜钟全境。
陡沟镇、傅寨乡、兰青乡、永兴镇、彭桥乡、新阮店乡、熊寨镇、吕河乡。

延迟派送:岳城乡:1天,西严店乡:1天。

-------------------------
寒冻镇。

-------------------------
到付业务,代收货款

-------------------------
2019-11-04




Field list:
h1: location, e.g. 驻马店正阳县
  title = kdinfo.h1.text
所在地区 (region):
公司地址 (company address):
联系电话 (contact phone):
派送范围 (delivery area):
延迟派送 (delayed delivery):
不派送范围 (non-delivery area):
备注 (remarks):
本站更新 (updated on this site):
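To keep these fields together, the dt (name) / dd (detail) pairs inside each dl can be collected into a dict. A minimal sketch (the helper name parse_shop is my own), reusing the .kd-info structure noted above:

from bs4 import BeautifulSoup

def parse_shop(html):
    """Parse one shop detail page into a dict of field name -> field text."""
    soup = BeautifulSoup(html, "html.parser")
    kdinfo = soup.select(".kd-info")[0]
    info = {"标题": kdinfo.h1.text.strip()}
    for dl in kdinfo.find_all("dl"):          ## each dl holds a dt (field name) and a dd (detail)
        dt, dd = dl.find("dt"), dl.find("dd")
        if dt and dd:
            info[dt.text.strip()] = dd.text.strip()
    return info

shop = parse_shop(r.text)                     ## r.text is the detail page fetched above
for k, v in shop.items():
    print(k, ":", v)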






## Save the scraped information
http://www.python-excel.org/
https://www.jianshu.com/p/a8391a2b8c6c
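
A minimal sketch of saving the collected records to an .xlsx file with openpyxl (one of the libraries listed at python-excel.org); the rows argument and the file name are assumptions:

from openpyxl import Workbook

def save_to_excel(rows, filename="kuaidi_shops.xlsx"):
    """rows: a list of dicts such as the ones returned by parse_shop above."""
    wb = Workbook()
    ws = wb.active
    headers = sorted({key for row in rows for key in row})   ## union of all field names
    ws.append(headers)
    for row in rows:
        ws.append([row.get(h, "") for h in headers])
    wb.save(filename)

## save_to_excel([shop])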







https://www.kancloud.cn/xmsumi/pythonspider/160081
