Tianyancha enterprise data acquisition
Anti-scraping notice: original blog address
Organizing this knowledge takes effort — please respect it. This article is published only on CSDN; copies found on other sites were scraped without the author's permission.
If reprinted, please credit the source. Thank you!
1. Target URL and crawling requirements
The goal is to crawl the detailed information of a company found via search: open the official Tianyancha website, enter a company name, and click into the first (top-ranked) result in the returned list; the page that opens contains the content to crawl. Xiaomi is used as the example here.
Step 1: open the Tianyancha homepage.
Step 2: enter "小米" (Xiaomi), press Enter, and scroll down to find the matching company.
Step 3: click into the company page, view the details, and crawl the content inside the red box below.
2. The transition (search-results) page
Because the site paginates results, getting the data takes two requests: first request the search-results page (the "transition" page), extract from it the URL of the company page that holds the information we want, and then request that URL for the detail data.
After opening the Tianyancha homepage, enter "小米" and press Enter. Copying the redirected URL into an editor shows that the Chinese keyword is automatically percent-encoded. So, to crawl by a user-supplied company name, the Chinese characters must be encoded and decoded explicitly.
Here the quote and unquote functions from urllib.parse handle the encoding and decoding; the words 小米 (Xiaomi) and 华为 (Huawei) serve as examples below.
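A quick sketch of the round trip (the percent-encoded values are the standard UTF-8 encodings of these words):

```python
from urllib.parse import quote, unquote

encoded = quote('小米')   # percent-encode the UTF-8 bytes
print(encoded)            # %E5%B0%8F%E7%B1%B3
print(unquote(encoded))   # 小米 — unquote restores the original text
print(quote('华为'))      # %E5%8D%8E%E4%B8%BA
```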
Therefore, the code for requesting the transition page is as follows (substitute your own Cookie and User-Agent):
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote

headers = {
    'Cookie': 'aliyungf_tc=AQAAAAEVaWaoZwkALukNq6/ruqDOSG3n; csrfToken=VtbnK1tn9NoUb0fUqHVlS0Xc; jsid=SEM-BAIDU-PZ0824-SY-000001; bannerFlag=false; Hm_lvt_e92c8d65d92d534b0fc290df538b4758=1600135387; show_activity_id_4=4; _gid=GA1.2.339666032.1600135387; relatedHumanSearchGraphId=23402373; relatedHumanSearchGraphId.sig=xQxyUIDqVdMkulWk5m_htP28Pzw8_eM8tUMIyK4_qqs; refresh_page=0; RTYCID=69cd6d574b1c4116995bab3fd96a9466; CT_TYCID=a870d4ebb91849b094668d1d969c7702; token=899079c4b21e4d22916083d22f72e1e3; _utm=dac53239b45f49709262be264fd289f3; cloud_token=bb4c875aed794641966b7f7536187e80; Hm_lpvt_e92c8d65d92d534b0fc290df538b4758=1600147199; _gat_gtag_UA_123487620_1=1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
key_word = '小米'
url = 'https://www.tianyancha.com/search?key={}'.format(quote(key_word))
html = requests.get(url, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
info = soup.select('.result-list .content a')[0]  # first (top-ranked) search result
company_name = info.text
company_url = info['href']
print(company_name, company_url)
The output (the name and URL of the company page to be crawled):
3. Getting the detail data
Next, issue another requests request against the obtained company_url and parse the response; the key is locating the right tags, as shown below.
The data can then be extracted by looping over the matched tags. The code is as follows; it also saves the data locally while crawling. Because the items arrive one at a time, the file is opened in append mode.
html_detail = requests.get(company_url, headers=headers)
soup_detail = BeautifulSoup(html_detail.text, 'lxml')
data_infos = soup_detail.select('.table.-striped-col tbody tr td')
for info in data_infos:
    print(info.text)
    with open('company.txt', 'a', encoding='utf-8') as f:  # append mode
        f.write(info.text)
The output (partial):
4. Extension and full code
The above fetches data for a single company, but batch acquisition works the same way: store the company names in a list and iterate over it. Here 小米 (Xiaomi), 华为 (Huawei), and 知乎 (Zhihu) serve as examples; each company's data is fetched and appended to the txt file, separated by newlines.
key_words = ['小米', '华为', '知乎']
for key_word in key_words:
    url = 'https://www.tianyancha.com/search?key={}'.format(quote(key_word))
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    info = soup.select('.result-list .content a')[0]
    company_name = info.text
    company_url = info['href']
    print(company_name, company_url)
    html_detail = requests.get(company_url, headers=headers)
    soup_detail = BeautifulSoup(html_detail.text, 'lxml')
    data_infos = soup_detail.select('.table.-striped-col tbody tr td')
    with open('company.txt', 'a+', encoding='utf-8') as f:
        f.write('\n' + company_name + ' : ' + '\n')
        for info in data_infos:
            print(info.text)
            f.write(info.text)
The output reveals one blemish: the file starts with an extra blank line, because the separator newline is written even before the first company. A counter fixes it: keep a num that starts at 0, write the leading newline only when num is nonzero, and increment num after each company.
The complete code is as follows:
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote

headers = {
    'Cookie': 'csrfToken=VtbnK1tn9NoUb0fUqHVlS0Xc; jsid=SEM-BAIDU-PZ0824-SY-000001; bannerFlag=false; Hm_lvt_e92c8d65d92d534b0fc290df538b4758=1600135387; show_activity_id_4=4; _gid=GA1.2.339666032.1600135387; relatedHumanSearchGraphId=23402373; relatedHumanSearchGraphId.sig=xQxyUIDqVdMkulWk5m_htP28Pzw8_eM8tUMIyK4_qqs; refresh_page=0; RTYCID=69cd6d574b1c4116995bab3fd96a9466; CT_TYCID=a870d4ebb91849b094668d1d969c7702; token=899079c4b21e4d22916083d22f72e1e3; _utm=dac53239b45f49709262be264fd289f3; cloud_token=bb4c875aed794641966b7f7536187e80; Hm_lpvt_e92c8d65d92d534b0fc290df538b4758=1600147199; _gat_gtag_UA_123487620_1=1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
key_words = ['小米', '华为', '知乎']
num = 0  # counter: suppresses the leading newline before the first company
for key_word in key_words:
    url = 'https://www.tianyancha.com/search?key={}'.format(quote(key_word))
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    info = soup.select('.result-list .content a')[0]
    company_name = info.text
    company_url = info['href']
    print(company_name, company_url)
    html_detail = requests.get(company_url, headers=headers)
    soup_detail = BeautifulSoup(html_detail.text, 'lxml')
    data_infos = soup_detail.select('.table.-striped-col tbody tr td')
    with open('company.txt', 'a+', encoding='utf-8') as f:
        if num != 0:
            num = 1  # cap at 1 so '\n' * num writes exactly one separator line
        f.write('\n' * num + company_name + ' : ' + '\n')
        for info in data_infos:
            print(info.text)
            f.write(info.text)
    num += 1
The output (the stray blank line is gone):
Note: the program can also be wrapped in a function that takes a list of company names and writes a local data file (the data could also be joined into one string and iterated). The advantage is that a company list extracted and cleaned with pandas can be passed straight to this function, enabling batch acquisition from Tianyancha.
The demo code is as follows (only part of the code; truly batch-crawling Tianyancha still needs other techniques, such as dynamic IP rotation):
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import quote
import Company  # local module holding the company-name list

def searchData(name, dic_h, dic_c):
    """Search a company name and return the URL of its top result."""
    print(f'Searching: {name}')
    name_quote = quote(name)  # percent-encode the Chinese name
    url = f'https://www.tianyancha.com/search?key={name_quote}'
    html = requests.get(url, headers=dic_h, cookies=dic_c)
    soup = BeautifulSoup(html.text, 'lxml')
    top_result = soup.select('div.header a')[0]
    href = top_result['href']
    print(href)
    return href

companys = Company.company
data_list = []
error_list = []
i = 1
for company in companys:
    try:
        dic = {}
        print('Crawling record {}'.format(i))
        i += 1
        dic['公司名称'] = company  # company name
        dic['url'] = searchData(company, dic_h, dic_c)  # dic_h / dic_c defined elsewhere
        data_list.append(dic)
    except Exception:
        print('Failed to fetch info')
        error_list.append(company)

data = pd.DataFrame(data_list)
data.to_excel('final.xlsx', index=False)
data_error = pd.DataFrame(error_list)
data_error.to_excel('error.xlsx', index=False)
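One of the extra techniques mentioned above, rotating (dynamic) IPs, can be sketched roughly like this with requests' proxies parameter. The proxy addresses below are placeholders, not working endpoints:

```python
import random
import requests

# Placeholder proxy pool: replace these with real proxy endpoints
PROXY_POOL = [
    'http://127.0.0.1:8001',
    'http://127.0.0.1:8002',
]

def get_with_random_proxy(url, headers):
    """Route the request through a randomly chosen proxy from the pool."""
    proxy = random.choice(PROXY_POOL)
    proxies = {'http': proxy, 'https': proxy}
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)
```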
The Company module simply reads all the company names from a file and assigns them to the company variable; partial screenshots are shown below.
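In case the screenshots do not render, here is a minimal sketch of what such a Company module could look like. The file name company_names.txt and the load_names helper are assumptions for illustration, not the author's actual code:

```python
# Hypothetical Company.py: one company name per line in a UTF-8 text file,
# loaded into the `company` list that the crawler script imports.

def load_names(path):
    """Read non-empty, stripped lines from a UTF-8 text file."""
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

# demo: write a small sample file, then load it
with open('company_names.txt', 'w', encoding='utf-8') as f:
    f.write('小米\n华为\n\n知乎\n')

company = load_names('company_names.txt')
print(company)  # ['小米', '华为', '知乎'] — blank lines are skipped
```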