A web crawler summary

Foreword

With the Big Data final project starting, the team's data-crawling task was assigned to me, so I found an example site to work from.

Let's get our hands dirty.

Preparation before crawling

Language used: Python
Libraries used: bs4 (BeautifulSoup), requests, urllib, re
The first two libraries are not built in, so install them with pip:

pip install beautifulsoup4
pip install requests

The other two ship with the standard library; if any other library turns out to be missing, install it with pip the same way.

When crawling

Target site: a Discuz! forum page, https://www.discuz.net/forum-10-1.html. Here is the crawling code:

import requests
import re
from bs4 import BeautifulSoup
from urllib import request

url = 'https://www.discuz.net/forum-10-1.html'
res = request.urlopen(url)  # open the URL
soup = BeautifulSoup(res, 'html.parser')  # parse the page source
# then use findAll to locate the tags we need
t_div = soup.findAll('a', attrs={"c": "1"})
a = 1
author = []

# authors' names: keep every other <a> tag (author, then last poster)
for i in t_div:
    if a % 2 == 0:
        a += 1
        continue
    a += 1
    author.append(i.text)

th = soup.findAll('a', attrs={"class": "showcontent y"})
d = str(th)
tid = re.findall(r'\'\d+\'', d)
uid = re.findall(r'uid=\d+', str(t_div))
print("tid", "author", "uid")
for i in range(len(tid)):
    print(tid[i][1:-1], author[i], uid[i][4:])

Inspecting the page source, the author links all carry the attribute c="1", so the following line collects every <a> tag whose c attribute is 1:

t_div = soup.findAll('a',attrs={"c":"1"})

If we loop through t_div directly, we find that each thread contributes two entries: the author and the last poster. So we skip every other entry and keep only the authors in the list.
Username crawling done, a look at the href attribute of each <a> tag shows that it contains the author's uid, which we extract with a regular expression:
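The same skip-every-other logic can also be written compactly with list slicing; a minimal sketch, using a made-up list of tag texts in place of the real findAll results:

```python
# Each thread row yields two <a c="1"> links: the author, then the last poster.
# Slicing with a step of 2 keeps only the even-indexed entries (the authors).
links = ["alice", "bob_last", "carol", "dave_last"]  # stand-in for [i.text for i in t_div]
authors = links[::2]
print(authors)  # ['alice', 'carol']
```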

uid = re.findall('uid=\d+',str(t_div))

This stores the matches as a list in uid.
For more on regular expressions, see this link: regular expression
User id crawling is also complete; next we look for the post tid:
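A capture group can pull out just the digits, so no slicing is needed afterwards; a small sketch on a sample link string (the href below is made up for illustration):

```python
import re

# Sample tag string shaped like the forum's author links (illustrative only)
html = '<a href="home.php?mod=space&uid=12345" c="1">alice</a>'

# 'uid=\d+' keeps the prefix, so the original code slices it off with [4:]
with_prefix = re.findall(r'uid=\d+', html)
# A capture group returns only the digits directly
digits = re.findall(r'uid=(\d+)', html)
print(with_prefix, digits)  # ['uid=12345'] ['12345']
```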
Inspecting the page source again, it is easy to spot a CONTENT_TID, which we also match with a regular expression:

th = soup.findAll('a',attrs={"class":"showcontent y"})
d = str(th)
tid = re.findall('\'\d+\'',d)

This collects every <a> tag whose class is 'showcontent y', then matches the tid values and stores them in tid.
To make the matches more precise, I anchored the patterns with surrounding text: for uid I match the literal 'uid=' followed by digits, then slice off the prefix; for tid, the single quotes around the digits are sliced off head and tail in the same way.

for i in range(len(tid)):
    print (tid[i][1:-1],author[i],uid[i][4:])
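The slicing works because each match has a fixed-width wrapper to strip; a quick check of both slices (the sample values are made up):

```python
tid_match = "'12345'"      # shape of each item from re.findall(r'\'\d+\'', d)
uid_match = "uid=67890"    # shape of each item from re.findall(r'uid=\d+', ...)

# [1:-1] drops the surrounding single quotes; [4:] drops the 'uid=' prefix
print(tid_match[1:-1], uid_match[4:])  # 12345 67890
```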

Running the script prints the tid, author, and uid columns: the data we wanted.

Precautions

Crawl with caution: scraping sensitive or private information about other people can land you in prison.
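Beyond legality, be gentle with the target server: identify your client and pace your requests. A minimal sketch using requests (the User-Agent string, timeout, and delay are arbitrary choices, not from the original code):

```python
import time
import requests

headers = {'User-Agent': 'Mozilla/5.0 (study-project crawler)'}  # arbitrary UA string

def fetch(url, delay=1.0):
    """Fetch a page politely: send an identifying header and pause between requests."""
    resp = requests.get(url, headers=headers, timeout=10)
    time.sleep(delay)  # pause so we don't hammer the server
    resp.raise_for_status()
    return resp.text
```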


Origin blog.csdn.net/YUK_103/article/details/102760115