Practical Python crawler teaching: a hands-on introduction

1. Introduction

This article was originally written for training newcomers, and they found it easy to follow, so I am sharing it here so we can all learn together. If you have picked up some Python and want to do something with it but lack a direction, you might as well try working through the cases below.

As usual: if you want the packaged code and materials, follow the editor and join QQ group 721195303 to receive them.

2. Environment preparation

Install the three libraries requests, lxml, and beautifulsoup4 (the code below has been tested under Python 3.5):

pip install requests lxml beautifulsoup4
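
If you want to confirm that everything is in place, here is a quick optional check (not part of the original article); each of these packages exposes a version string:

# Optional sanity check: import each library and print its version
import requests
import bs4
from lxml import etree

print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
print("lxml", etree.__version__)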



3. A few small crawler cases

  • Obtain the local machine's public IP address
  • Write a URL collector using the Baidu search interface
  • Automatically download Sogou wallpapers
  • Automatically fill in a questionnaire
  • Obtain public proxy IPs and check whether they work and how much latency they have

3.1 Obtain the local machine's public IP address

We use a public web interface that reports your IP, together with the Python requests library, to obtain the local machine's public IP address automatically.

import requests
import re

r = requests.get("http://2017.ip138.com/ic.asp")
r.encoding = r.apparent_encoding        # let requests detect the character encoding to avoid garbled Chinese text
print(r.text)
# You can also use the re module to pull out just the IP with a regular expression
print(re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", r.text))
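
The ip138 address used above may change over time. As a hedged alternative (not from the original article), a service such as api.ipify.org returns the public IP as plain text, so no regular expression is needed:

import requests

# Assumption: https://api.ipify.org is reachable from your network;
# it responds with the public IP as plain text
print(requests.get("https://api.ipify.org", timeout=10).text)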



3.2 Write a URL collector using the Baidu search interface

In this case we use requests together with the BeautifulSoup library. We need to set a User-Agent header in the program to get past Baidu's anti-crawler mechanism (try it without the User-Agent header and see whether you still get any data). Also note the pattern in the URLs of Baidu's result pages: the first page has the parameter pn=0, the second page pn=10, and so on. Here we use a CSS selector path to extract the data.

import requests
from bs4 import BeautifulSoup

# Set a User-Agent header to get past Baidu's anti-crawler mechanism
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
# Note the pattern in Baidu's result URLs: page 1 has pn=0, page 2 has pn=10, and so on.
# The loop below fetches the first 10 pages of results.
for i in range(0, 100, 10):
    bd_search = "https://www.baidu.com/s?wd=inurl:/dede/login.php?&pn=%s" % str(i)
    r = requests.get(bd_search, headers=headers)
    soup = BeautifulSoup(r.text, "lxml")
    # The select call below uses a CSS selector path to extract the result links
    url_list = soup.select(".t > a")
    for url in url_list:
        real_url = url["href"]
        r = requests.get(real_url)
        print(r.url)

After writing the program, we use the keyword inurl:/dede/login.php to collect the admin login addresses of DedeCMS (织梦 CMS) sites in batches. A small extension that saves the collected URLs to a file is sketched below.
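
This is only a hedged sketch, not part of the original code: it de-duplicates the results with a set and writes them to a file; the file name results.txt and the timeout value are illustrative choices.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
collected = set()                          # de-duplicate results across pages
for i in range(0, 100, 10):
    bd_search = "https://www.baidu.com/s?wd=inurl:/dede/login.php?&pn=%s" % str(i)
    r = requests.get(bd_search, headers=headers)
    soup = BeautifulSoup(r.text, "lxml")
    for url in soup.select(".t > a"):
        try:
            real_url = requests.get(url["href"], timeout=10).url   # follow Baidu's redirect link
        except requests.RequestException:
            continue                       # skip links that time out or fail
        collected.add(real_url)

with open("results.txt", "w") as f:        # results.txt is an illustrative file name
    f.write("\n".join(sorted(collected)))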



3.3 Automatically download Sogou wallpapers

In this example the crawler automatically downloads the wallpapers returned by a search. Change the path in the program to the directory where you want the images stored. One more point: we use the json library, because while observing the site we found that Sogou serves the wallpaper addresses in JSON format, so we parse that data with json.

import requests
import json

# Download images from Sogou's wallpaper interface
url = "http://pic.sogou.com/pics/channel/getAllRecomPicByTag.jsp?category=%E5%A3%81%E7%BA%B8&tag=%E6%B8%B8%E6%88%8F&start=0&len=15&width=1366&height=768"
r = requests.get(url)
data = json.loads(r.text)
for i in data["all_items"]:
    img_url = i["pic_url"]
    # Change the path on the line below to the directory where you want the images stored
    with open("/home/evilk0/Desktop/img/%s" % img_url[-10:] + ".jpg", "wb") as f:
        r2 = requests.get(img_url)
        f.write(r2.content)
    print("Download finished:", img_url)



3.4 Automatically fill in a questionnaire

In this example we post randomly generated answers to the submission interface of a questionnaire on wjx.cn; the submitdata field lists each question number followed by the chosen option.

import requests
import random

url = "https://www.wjx.cn/joinnew/processjq.ashx?submittype=1&curID=21581199&t=1521463484600&starttime=2018%2F3%2F19%2020%3A44%3A30&rn=990598061.78751211"
data = {
    "submitdata" : "1$%s}2$%s}3$%s}4$%s}5$%s}6$%s}7$%s}8$%s}9$%s}10$%s"
}
header = {
    "User-Agent" : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)",
    "Cookie": ".ASPXANONYMOUS=iBuvxgz20wEkAAAAZGY4MDE1MjctNWU4Ni00MDUwLTgwYjQtMjFhMmZhMDE2MTA3h_bb3gNw4XRPsyh-qPh4XW1mfJ41; spiderregkey=baidu.com%c2%a7%e7%9b%b4%e8%be%be%c2%a71; UM_distinctid=1623e28d4df22d-08d0140291e4d5-102c1709-100200-1623e28d4e1141; _umdata=535523100CBE37C329C8A3EEEEE289B573446F594297CC3BB3C355F09187F5ADCC492EBB07A9CC65CD43AD3E795C914CD57017EE3799E92F0E2762C963EF0912; WjxUser=UserName=17750277425&Type=1; LastCheckUpdateDate=1; LastCheckDesign=1; DeleteQCookie=1; _cnzz_CV4478442=%E7%94%A8%E6%88%B7%E7%89%88%E6%9C%AC%7C%E5%85%8D%E8%B4%B9%E7%89%88%7C1521461468568; jac21581199=78751211; CNZZDATA4478442=cnzz_eid%3D878068609-1521456533-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1521461319; Hm_lvt_21be24c80829bd7a683b2c536fcf520b=1521461287,1521463471; Hm_lpvt_21be24c80829bd7a683b2c536fcf520b=1521463471",
}

# Submit the questionnaire 500 times, picking a random answer for every question
for i in range(0,500):
    choice = (
        random.randint(1, 2),
        random.randint(1, 4),
        random.randint(1, 3),
        random.randint(1, 4),
        random.randint(1, 3),
        random.randint(1, 3),
        random.randint(1, 3),
        random.randint(1, 3),
        random.randint(1, 3),
        random.randint(1, 3),
    )
    data["submitdata"] = data["submitdata"] % choice
    r = requests.post(url = url,headers=header,data=data)
    print(r.text)
    data["submitdata"] = "1$%s}2$%s}3$%s}4$%s}5$%s}6$%s}7$%s}8$%s}9$%s}10$%s"

When we submit many questionnaires from the same IP, the target's anti-crawler mechanism is triggered and the server starts demanding a verification code.



We can use the X-Forwarded-For header to forge our IP address. The modified code is as follows:

import requests
import random

url = "https://www.wjx.cn/joinnew/processjq.ashx?submittype=1&curID=21581199&t=1521463484600&starttime=2018%2F3%2F19%2020%3A44%3A30&rn=990598061.78751211"
data = {
    "submitdata" : "1$%s}2$%s}3$%s}4$%s}5$%s}6$%s}7$%s}8$%s}9$%s}10$%s"
}
header = {
    "User-Agent" : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)",
    "Cookie": ".ASPXANONYMOUS=iBuvxgz20wEkAAAAZGY4MDE1MjctNWU4Ni00MDUwLTgwYjQtMjFhMmZhMDE2MTA3h_bb3gNw4XRPsyh-qPh4XW1mfJ41; spiderregkey=baidu.com%c2%a7%e7%9b%b4%e8%be%be%c2%a71; UM_distinctid=1623e28d4df22d-08d0140291e4d5-102c1709-100200-1623e28d4e1141; _umdata=535523100CBE37C329C8A3EEEEE289B573446F594297CC3BB3C355F09187F5ADCC492EBB07A9CC65CD43AD3E795C914CD57017EE3799E92F0E2762C963EF0912; WjxUser=UserName=17750277425&Type=1; LastCheckUpdateDate=1; LastCheckDesign=1; DeleteQCookie=1; _cnzz_CV4478442=%E7%94%A8%E6%88%B7%E7%89%88%E6%9C%AC%7C%E5%85%8D%E8%B4%B9%E7%89%88%7C1521461468568; jac21581199=78751211; CNZZDATA4478442=cnzz_eid%3D878068609-1521456533-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1521461319; Hm_lvt_21be24c80829bd7a683b2c536fcf520b=1521461287,1521463471; Hm_lpvt_21be24c80829bd7a683b2c536fcf520b=1521463471",
    "X-Forwarded-For" : "%s"
}

for i in range(0,500):
    choice = (
        random.randint(1, 2),
        random.randint(1, 4),
        random.randint(1, 3),
        random.randint(1, 4),
        random.randint(1, 3),
        random.randint(1, 3),
        random.randint(1, 3),
        random.randint(1, 3),
        random.randint(1, 3),
        random.randint(1, 3),
    )
    data["submitdata"] = data["submitdata"] % choice
    header["X-Forwarded-For"] = (str(random.randint(1,255))+".")+(str(random.randint(1,255))+".")+(str(random.randint(1,255))+".")+str(random.randint(1,255))
    r = requests.post(url = url,headers=header,data=data)
    print(header["X-Forwarded-For"],r.text)
    data["submitdata"] = "1$%s}2$%s}3$%s}4$%s}5$%s}6$%s}7$%s}8$%s}9$%s}10$%s"
    header["X-Forwarded-For"] = "%s"


I have written about this example in detail before, so I won't repeat the full explanation here.

3.5 Obtain public proxy IPs and check their availability and latency

In this example we crawl proxy IPs and check whether they are alive and how high their latency is. (You can add the crawled proxy IPs to proxychains and then carry out your usual penetration-testing tasks.) Here I directly call a Linux system command:

ping -c 1 " + ip.string + " | awk 'NR==2{print}' - 

If you want to run this program on Windows, change the command passed to os.popen near the end of the script to something Windows can execute (Windows ping uses -n instead of -c, and its output format differs). A cross-platform alternative that times an HTTP request through each proxy is sketched after the code.

from bs4 import BeautifulSoup
import requests
import os

# Crawl the first page of the xicidaili free-proxy list
url = "http://www.xicidaili.com/nn/1"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'}
r = requests.get(url=url, headers=headers)
soup = BeautifulSoup(r.text, "lxml")
# In the proxy table: the 2nd cell is the IP, the 3rd the port, the 4th the server location
server_address = soup.select(".odd > td:nth-of-type(4)")
ip_list = soup.select(".odd > td:nth-of-type(2)")
ports = soup.select(".odd > td:nth-of-type(3)")
for server, ip in zip(server_address, ip_list):
    if len(server.contents) != 1:
        print(server.a.string.ljust(8), ip.string.ljust(20), end='')
    else:
        print("unknown".ljust(8), ip.string.ljust(20), end='')
    # Ping each IP once and keep the line of output that contains the round-trip time
    delay_time = os.popen("ping -c 1 " + ip.string + " | awk 'NR==2{print}' -")
    delay_time = delay_time.read().split("time=")[-1].strip("\r\n")
    print("time = " + delay_time)



4. Conclusion

Of course, you can do many other interesting things with Python. If the examples above are still hard to follow, I'll finish by pointing to a beginner-level Python crawler tutorial: "Python web crawler introductory chapter --- even my grandfather can understand it". There is a huge amount of learning material on the Internet these days; I hope you can make good use of it.


I also want to recommend the Python learning group I built myself: 721195303. Everyone in it is learning Python; whether you want to start learning or are already learning, you are welcome to join. The members are all developers, and useful material is shared from time to time (Python development related only), including the latest Python advanced materials and zero-basics teaching that I compiled myself in 2021. Friends who are interested in taking Python further are welcome to join!
