[Python web crawler, part 21] Tianyancha enterprise data acquisition


Anti-scraping notice: original blog address

Putting this knowledge together was not easy, so please respect the author's work. This article is published only on CSDN; if you see it on any other site, it was scraped there maliciously without the author's authorization.

If you reprint it, please indicate the source. Thank you!

1. Target URL and crawling requirements

The goal is to crawl a company's detail data based on a search. The first step is to open the official Tianyancha website, enter the company name, and click into the first (default top-ranked) company in the returned results; the information displayed there is the content to be crawled. Xiaomi is used as the example here.

Step 1: Open the homepage of the Tianyancha website

[Screenshot: Tianyancha homepage]
Step 2: Enter "Xiaomi" (小米), press Enter to confirm, then scroll down to find the matching company

[Screenshot: search results for 小米]
Step 3: Click into the company page to view the details; the content in the red box below is what will finally be crawled

[Screenshot: company detail page with the target fields boxed in red]

2. The transition (search results) page

Because the search involves a page jump, getting the specific data takes two steps: first request the transition page (the search results page), extract from it the URL of the company page that contains the information we want, and finally request the detail information from that URL.

After opening the Tianyancha homepage, enter "小米" (Xiaomi) and press Enter. Copying the redirected URL into an editor shows that the keyword has been automatically percent-encoded. So, to crawl data for whatever company name the user enters, the Chinese characters must be encoded and decoded accordingly.

[Screenshot: the redirected search URL with the keyword percent-encoded]

Here the quote and unquote functions (from urllib.parse) are used for encoding and decoding, for example with the words 小米 (Xiaomi) and 华为 (Huawei), as follows:

[Screenshot: quote/unquote examples for 小米 and 华为]
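
For illustration, here is a minimal self-contained sketch of that encoding and decoding (the percent-encoded strings in the comments are the standard UTF-8 encodings of the two words):

from urllib.parse import quote, unquote

# encode a Chinese keyword so it can be embedded in the search URL
print(quote('小米'))                   # %E5%B0%8F%E7%B1%B3
print(quote('华为'))                   # %E5%8D%8E%E4%B8%BA

# decode a percent-encoded keyword back to the original characters
print(unquote('%E5%B0%8F%E7%B1%B3'))   # 小米
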
Therefore, the code for requesting the transition page is as follows (remember to fill in your own Cookie and User-Agent):

# request the transition (search results) page and pull out the first match
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote

headers = {
	'Cookie': 'aliyungf_tc=AQAAAAEVaWaoZwkALukNq6/ruqDOSG3n; csrfToken=VtbnK1tn9NoUb0fUqHVlS0Xc; jsid=SEM-BAIDU-PZ0824-SY-000001; bannerFlag=false; Hm_lvt_e92c8d65d92d534b0fc290df538b4758=1600135387; show_activity_id_4=4; _gid=GA1.2.339666032.1600135387; relatedHumanSearchGraphId=23402373; relatedHumanSearchGraphId.sig=xQxyUIDqVdMkulWk5m_htP28Pzw8_eM8tUMIyK4_qqs; refresh_page=0; RTYCID=69cd6d574b1c4116995bab3fd96a9466; CT_TYCID=a870d4ebb91849b094668d1d969c7702; token=899079c4b21e4d22916083d22f72e1e3; _utm=dac53239b45f49709262be264fd289f3; cloud_token=bb4c875aed794641966b7f7536187e80; Hm_lpvt_e92c8d65d92d534b0fc290df538b4758=1600147199; _gat_gtag_UA_123487620_1=1',
	'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

key_word = '小米'
# percent-encode the Chinese keyword before building the search URL
url = 'https://www.tianyancha.com/search?key={}'.format(quote(key_word))
html = requests.get(url,headers = headers)
soup = BeautifulSoup(html.text,'lxml')
# the first entry in the result list is the default top-ranked company
info = soup.select('.result-list .content a')[0]
company_name = info.text
company_url = info['href']
print(company_name,company_url)

The output is as follows (the name and URL of the company page to be crawled are obtained normally):

[Screenshot: the printed company name and detail-page URL]

3. Acquisition of specific data

Next, based on the company_url obtained above, issue another requests call and parse the returned page. The key is locating the right tags, as shown below:

[Screenshot: HTML structure of the detail table to be parsed]
Therefore, the data can be extracted by looping over the matched tags; the code is as follows. While crawling, the data can also be saved locally at the same time; since the items are obtained one by one, the file is opened in append mode.

# request the company detail page and grab every cell of the info table
html_detail = requests.get(company_url,headers = headers)
soup_detail = BeautifulSoup(html_detail.text,'lxml')
data_infos = soup_detail.select('.table.-striped-col tbody tr td')
for info in data_infos:
	print(info.text)
	# append mode: the cells arrive one by one, keep adding them to the same file
	with open('company.txt','a',encoding='utf-8') as f:
		f.write(info.text)

The output is as follows (only part of the result is shown):

[Screenshot: part of the extracted field values]

4. Extension and full code

The above fetches data for a single company, but multiple companies can be fetched in batch as well: simply store the company names in a list and iterate over it. Here "小米" (Xiaomi), "华为" (Huawei) and "知乎" (Zhihu) are used as examples; the data for each is fetched and written to a txt file, separated by newlines.

key_words = ['小米','华为','知乎']
for key_word in key_words:
	url = 'https://www.tianyancha.com/search?key={}'.format(quote(key_word))
	html = requests.get(url,headers = headers)
	soup = BeautifulSoup(html.text,'lxml')
	info = soup.select('.result-list .content a')[0]
	company_name = info.text
	company_url = info['href']
	print(company_name,company_url)

	html_detail = requests.get(company_url,headers = headers)
	soup_detail = BeautifulSoup(html_detail.text,'lxml')
	data_infos = soup_detail.select('.table.-striped-col tbody tr td')

	# append mode: write a company header line, then all of its table cells
	with open('company.txt','a+',encoding='utf-8') as f:
		f.write('\n'+company_name + " : " + '\n')
		for info in data_infos:
			print(info.text)
			f.write(info.text)

The output is shown below. Notice that there is an extra blank line before the first company, which is a little untidy. It can be removed with a counter num: when num is 0 (the first company) no leading newline is written, and for every later company one newline is written.

[Screenshot: company.txt with a blank line before the first company]
The complete code is as follows:

import requests
from bs4 import BeautifulSoup
from urllib.parse import quote

headers = {
	'Cookie': 'csrfToken=VtbnK1tn9NoUb0fUqHVlS0Xc; jsid=SEM-BAIDU-PZ0824-SY-000001; bannerFlag=false; Hm_lvt_e92c8d65d92d534b0fc290df538b4758=1600135387; show_activity_id_4=4; _gid=GA1.2.339666032.1600135387; relatedHumanSearchGraphId=23402373; relatedHumanSearchGraphId.sig=xQxyUIDqVdMkulWk5m_htP28Pzw8_eM8tUMIyK4_qqs; refresh_page=0; RTYCID=69cd6d574b1c4116995bab3fd96a9466; CT_TYCID=a870d4ebb91849b094668d1d969c7702; token=899079c4b21e4d22916083d22f72e1e3; _utm=dac53239b45f49709262be264fd289f3; cloud_token=bb4c875aed794641966b7f7536187e80; Hm_lpvt_e92c8d65d92d534b0fc290df538b4758=1600147199; _gat_gtag_UA_123487620_1=1',
	'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

key_words = ['小米','华为','知乎']
num = 0
for key_word in key_words:
	url = 'https://www.tianyancha.com/search?key={}'.format(quote(key_word))
	html = requests.get(url,headers = headers)
	soup = BeautifulSoup(html.text,'lxml')
	info = soup.select('.result-list .content a')[0]
	company_name = info.text
	company_url = info['href']
	print(company_name,company_url)


	html_detail = requests.get(company_url,headers = headers)
	soup_detail = BeautifulSoup(html_detail.text,'lxml')
	data_infos = soup_detail.select('.table.-striped-col tbody tr td')

	with open('company.txt','a+',encoding='utf-8') as f:
		# only prepend a blank line from the second company onwards
		if num != 0:
			f.write('\n')
		f.write(company_name + " : " + '\n')
		for info in data_infos:
			print(info.text)
			f.write(info.text)
		num += 1

The output is as follows (the extra blank-line problem is solved):

[Screenshot: company.txt without the leading blank line]
Note: the program can also be wrapped into a function that takes a list of company names and writes a local data file (the data could equally be joined into a string and then iterated). The advantage is that the company list can be prepared with pandas and then passed directly to this function, so batches of required companies can be looked up on Tianyancha in one go.

The demo code is as follows (only part of the code is shown; to fetch batch data from Tianyancha you will also need some other techniques, such as rotating proxy IPs, as sketched after the code):

# partial demo: look up each company's detail-page URL and save the results with pandas
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import quote

import Company  # local module exposing the list of company names as Company.company

# dic_h / dic_c are the request headers and cookies dicts shown earlier (omitted here)

def searchData(name,dic_h,dic_c):
	print(f'Searching for: {name}')
	name_quote = quote(name)  # percent-encode the Chinese company name
	url = f'https://www.tianyancha.com/search?key={name_quote}'
	html = requests.get(url, headers = dic_h,cookies = dic_c)
	soup = BeautifulSoup(html.text,'lxml')
	company = soup.select('div.header a')[0]  # first company in the result list
	href = company['href']
	print(href)
	return href

companys = Company.company
data_list = []
error_list = []
i = 1
for company in companys:
	try:
		dic = {}
		print('Crawling company No.{}'.format(i))
		i += 1
		dic['company_name'] = company
		dic['url'] = searchData(company,dic_h,dic_c)
		data_list.append(dic)
	except Exception:
		print('Failed to fetch the information')
		error_list.append(company)

# save the successful lookups and the failed names to separate Excel files
data = pd.DataFrame(data_list)
data.to_excel('final.xlsx',index = False)
data_error = pd.DataFrame(error_list)
data_error.to_excel('error.xlsx',index = False)
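
As noted above, fetching batch data at scale also calls for rotating proxy IPs ("dynamic IP"). Below is a minimal, hedged sketch of what that could look like using the proxies parameter of requests; the PROXY_POOL addresses and the helper function are placeholders for illustration, not working proxies or the author's code:

import random
import requests

# hypothetical proxy pool -- replace with proxy addresses you actually control
PROXY_POOL = [
	'http://123.45.67.89:8080',
	'http://98.76.54.32:3128',
]

def get_with_random_proxy(url, headers, cookies=None):
	# route the request through a randomly chosen proxy so the source IP keeps changing
	proxy = random.choice(PROXY_POOL)
	return requests.get(url, headers=headers, cookies=cookies,
		proxies={'http': proxy, 'https': proxy}, timeout=10)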

The Company module simply puts the names of all the companies in a file and assigns them to the company variable. A partial screenshot is shown below:

[Screenshot: part of the Company module]
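
A hypothetical Company.py matching that description might look like the sketch below (the names and the Excel file are placeholders for illustration, not the author's actual data):

# Company.py -- hypothetical sketch of the module described above
# Option 1: hard-code the list of company names
company = ['小米', '华为', '知乎']

# Option 2: load the names from a local spreadsheet with pandas
# ('companies.xlsx' and its '公司名称' column are assumptions for illustration)
# import pandas as pd
# company = pd.read_excel('companies.xlsx')['公司名称'].tolist()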

Original post: blog.csdn.net/lys_828/article/details/108594153