Foreword
With our Big Data finals coming up, the team's data-crawling task fell to me, so I worked through an example.
Let's get right into it.
Preparation before crawling
Language used: Python
Libraries used: bs4 (BeautifulSoup), requests, urllib, re
If the first two libraries are not installed yet, download them with pip:
pip install beautifulsoup4
pip install requests
The other two, urllib and re, are part of the Python standard library, so no installation is needed.
Starting the crawl
Target site: the Discuz official forum, https://www.discuz.net/forum-10-1.html
Here is the crawling code:
import requests
import re
from bs4 import BeautifulSoup
from urllib import request

url = 'https://www.discuz.net/forum-10-1.html'
res = request.urlopen(url)  # open the link
print(res)
soup = BeautifulSoup(res, 'html.parser')  # parse the page source
# then use findAll to grab the tags we are interested in
t_div = soup.findAll('a', attrs={"c": "1"})
a = 1
author = []
# collect the authors' names (keep only every other entry, see below)
for i in t_div:
    if a % 2 == 0:
        a += 1
        continue
    a += 1
    author.append(i.text)
th = soup.findAll('a', attrs={"class": "showcontent y"})
d = str(th)
tid = re.findall(r'\'\d+\'', d)
uid = re.findall(r'uid=\d+', str(t_div))
print("tid", "author", "uid")
for i in range(len(tid)):
    print(tid[i][1:-1], author[i], uid[i][4:])
Notice that each author link carries the attribute c="1", so the following line is used to grab every <a> tag whose c attribute equals 1:
t_div = soup.findAll('a',attrs={"c":"1"})
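To see what findAll is doing here, we can run it on a tiny hand-written stand-in for the forum page (the HTML below is invented for illustration, assuming the same c="1" attribute):

```python
from bs4 import BeautifulSoup

# invented miniature page: one author link with c="1", one ordinary link
html = '<a c="1" href="space-uid-1.html">alice</a><a href="thread-1.html">t</a>'
soup = BeautifulSoup(html, 'html.parser')
t_div = soup.findAll('a', attrs={"c": "1"})
print([a.text for a in t_div])  # ['alice']
```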
If you loop over the result directly, you'll notice it contains both the thread author and the most recent poster, alternating. That's why the loop keeps only every other entry, so that only the authors end up in the list.
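Because of that alternating pattern, the counter trick can also be written as a step-2 slice; a minimal sketch with made-up names:

```python
# hypothetical result of the findAll call: authors and last posters alternate
links = ['alice', 'bob', 'carol', 'dave']
# the authors sit at the even indices (0, 2, ...), so a step-2 slice keeps them
authors = links[::2]
print(authors)  # ['alice', 'carol']
```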
With the usernames collected, a look at the href attribute of those <a> tags shows that each one contains the author's uid, so we use a regular expression to pull it out.
uid = re.findall(r'uid=\d+', str(t_div))
This stores every matched uid, as a list, in uid.
For background on regular expressions, see this link: regular expressions
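As a quick check of the pattern, here it is run on an invented fragment of what the stringified tag list might contain:

```python
import re

# invented sample of what str(t_div) might look like
s = '<a c="1" href="home.php?mod=space&uid=123456">alice</a>'
print(re.findall(r'uid=\d+', s))  # ['uid=123456']
```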
With the user ids collected, we next look for each post's tid:
Inspecting the page source, it is easy to spot a CONTENT_TID entry here; again we use a regular expression to match it out.
th = soup.findAll('a', attrs={"class": "showcontent y"})
d = str(th)
tid = re.findall(r'\'\d+\'', d)
This pulls every <a> tag whose class is 'showcontent y' out of the page, then matches the tids from the result and stores them in tid.
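The same idea can be checked on a made-up fragment of the stringified showcontent links, where the tid sits inside single quotes:

```python
import re

# invented sample of what str(th) might look like
d = "<a class=\"showcontent y\" onclick=\"CONTENT_TID('987654')\">view</a>"
print(re.findall(r'\'\d+\'', d))  # ["'987654'"]
```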
To make the matches more precise, I included some anchor text in the patterns: for the uid I matched the literal uid=xxxx, then slice the uid= prefix off so that only the bare number remains. The tid is handled the same way: slicing removes the quote characters at its head and tail.
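The slicing itself is plain string indexing; with invented values:

```python
tid_item = "'987654'"    # matched tid, still wrapped in quotes
uid_item = 'uid=123456'  # matched uid, still carrying the uid= prefix
print(tid_item[1:-1])  # 987654 -- drop the first and last character
print(uid_item[4:])    # 123456 -- drop the four characters of 'uid='
```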
for i in range(len(tid)):
    print(tid[i][1:-1], author[i], uid[i][4:])
That gives us the data we wanted.
Precautions
Crawl with caution: scraping other people's sensitive or private information can land you in prison.