(This article is also posted on my Zhihu.)
First, a caveat: whether this crawler still works depends on when you read this, and the way it parses the page content is fairly primitive.
I am no crawler expert. I only started crawling because our mathematical modeling project needed data, and as the only computer science major on the team, the job fell to me.
OK, enough talk. Here is the code:
# -*- coding: utf-8 -*-
"""
Created on Thu Feb 8 18:09:44 2018
@author: A white horse is not a horse
"""
#!/usr/bin/env python
# Requires selenium and phantomjs installed in advance. Not suitable for
# large-scale crawling -- it is too slow. Here only the 20 companies on the
# first page of each industry are crawled.
# Note: a small amount of manual supervision is needed. If the proxy IP
# unexpectedly fails mid-run (it is rotated every 4 industries), the program
# has to be stopped and restarted manually.
from selenium import webdriver
import time
import pymysql
from bs4 import BeautifulSoup  # web page parser
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.proxy import Proxy
from selenium.webdriver.common.proxy import ProxyType
import json
import urllib.request

# API for fetching a proxy IP
ipurl = "http://piping.mogumiao.com/proxy/api/get_ip_al?appKey=6d22aed70f7d0479cbce55dff726a8d8a&count=1&expiryDate=5&format=1"

# MySQL connection info
connect = pymysql.Connect(
    host='localhost',
    port=3306,
    user='root',
    passwd='1234',
    db='user',
    charset='utf8'
)

# Fetch a proxy IP from the API
def getip_port():
    req = urllib.request.Request(ipurl)
    data = urllib.request.urlopen(req).read()
    s1 = json.loads(data)  # loads: convert JSON to dict
    ipstrs = s1["msg"][0]["ip"] + ":" + s1["msg"][0]["port"]
    print("Proxy IP: " + ipstrs)
    return ipstrs

# Create the browser driver with the proxy attached
def driver_open():
    proxy = Proxy({
        'proxyType': ProxyType.MANUAL,
        'httpProxy': getip_port()  # proxy ip and port
    })
    desired_capabilities = dict(DesiredCapabilities.PHANTOMJS)
    desired_capabilities["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"
    )
    # Add the proxy ip to the capabilities
    proxy.add_to_capabilities(desired_capabilities)
    driver = webdriver.PhantomJS(
        executable_path='phantomjs.exe',
        desired_capabilities=desired_capabilities
    )
    return driver

# Fetch and parse the content of a web page
def get_content(driver, url):
    driver.get(url)
    # Wait for the dynamic page to load; tune this to your connection
    time.sleep(1)
    content = driver.page_source.encode('utf-8')
    soup = BeautifulSoup(content, 'lxml')
    return soup

# Parse the page content. The selectors are not perfect and do not match
# every page; about three quarters of Tianyancha pages parse correctly
# as of 2018-02-27. Improvements from crawler experts are welcome.
def get_basic_info(soup, instr):
    company = soup.find(attrs={'class': 'f18 in-block vertival-middle sec-c2'}).text
    fddbr = soup.find(attrs={'class': 'f18 overflow-width sec-c3'}).text
    baseinfo = soup.find_all(attrs={'class': 'baseinfo-module-content-value'})
    zczb = baseinfo[0].text   # registered capital
    zcrq = baseinfo[1].text   # registration date
    zt = baseinfo[2].text     # company status
    foundAllTd = soup.find_all("td")
    print(u'Company name: ' + company)
    print(u'Legal representative: ' + fddbr)
    print(u'Registered capital: ' + zczb)
    print(u'Company status: ' + zt)
    print(u'Registration date: ' + zcrq)
    # Roughly identify the page type by the number of td tags. There are two
    # types: large companies have richer reports, roughly 800-1000 td tags;
    # small companies are mostly below 500. A few companies fall in between
    # and cannot be identified reliably, but they are rare and the impact
    # is small. (2018-02-26)
    if len(foundAllTd) > 600:
        sql = ("INSERT INTO company (instr,company_name,industry,business_scope,"
               "type_enterprise,regist_capital,legal_represent,regist_date,"
               "company_status,operat_period,registrat_body,approval_date,"
               "address,people_num) VALUES ('%s','%s','%s','%s','%s','%s','%s',"
               "'%s','%s','%s','%s','%s','%s','%s')")
        data = (instr, company, foundAllTd[527].text, foundAllTd[539].text,
                foundAllTd[523].text, zczb, fddbr, zcrq, zt,
                foundAllTd[529].text, foundAllTd[533].text,
                foundAllTd[531].text, foundAllTd[537].text, foundAllTd[49].text)
    else:
        sql = ("INSERT INTO company (instr,company_name,industry,business_scope,"
               "type_enterprise,regist_capital,legal_represent,regist_date,"
               "company_status,operat_period,registrat_body,approval_date,"
               "address) VALUES ('%s','%s','%s','%s','%s','%s','%s','%s','%s',"
               "'%s','%s','%s','%s')")
        data = (instr, company, foundAllTd[18].text, foundAllTd[30].text,
                foundAllTd[14].text, zczb, fddbr, zcrq, zt,
                foundAllTd[20].text, foundAllTd[24].text,
                foundAllTd[22].text, foundAllTd[28].text)
    # Insert the row
    cursor.execute(sql % data)
    connect.commit()

# Get executive information (no longer works; has no effect on the rest)
def get_gg_info(soup):
    ggpersons = soup.find_all(attrs={"event-name": "company-detail-staff"})
    ggnames = soup.select('table.staff-table > tbody > tr > td.ng-scope > span.ng-binding')
    for i in range(len(ggpersons)):
        ggperson = ggpersons[i].text
        ggname = ggnames[i].text
        print(ggperson + " " + ggname)

# Get shareholder information (no longer works; has no effect on the rest)
def get_gd_info(soup):
    tzfs = soup.find_all(attrs={"event-name": "company-detail-investment"})
    for i in range(len(tzfs)):
        tzf_split = tzfs[i].text.replace("\n", "").split()
        tzf = ''.join(tzf_split)
        print(tzf)

# Get investment information (no longer works; has no effect on the rest)
def get_tz_info(soup):
    btzs = soup.select('a.query_name')
    for i in range(len(btzs)):
        btz_name = btzs[i].select('span')[0].text
        print(btz_name)

# Get the industry links on the homepage
def get_industry(soup):
    x = []
    buyao = 70  # skip counter -- delete/reset when starting to crawl data
    hangye = soup.find_all('a')
    for item in hangye:
        if 'https://www.tianyancha.com/search/oc' in str(item.get("href")):
            print(item.get("href"))
            if buyao > 0:
                buyao -= 1
            else:
                x.append(str(item.get("href")))
    print("Number of industries:")
    print(len(x))
    return x

# Get the company links under an industry
def get_industry_company(soup):
    y = []
    companylist = soup.find_all('a')
    for item in companylist:
        if 'https://www.tianyancha.com/company/' in str(item.get("href")):
            print(item.get("href"))
            y.append(str(item.get("href")))
    return y

if __name__ == '__main__':
    cursor = connect.cursor()  # connect to the database
    companycount = 0  # number of companies crawled
    instrcount = 0    # number of industries crawled; rotate the proxy IP every 4 industries
    driver = driver_open()
    url = "https://www.tianyancha.com/"
    soup = get_content(driver, url)
    instrlist = get_industry(soup)
    theinscount = len(instrlist)  # number of industry links to crawl
    for instr in instrlist:  # traverse the industry links
        instrcount += 1
        print(instrcount)
        print(instr)
        compsoup = get_content(driver, instr)
        complist = get_industry_company(compsoup)
        for comp in complist:  # traverse the company links under the industry
            print(comp)
            companycount += 1
            print("Industries crawled: " + str(instrcount))
            try:
                infosoup = get_content(driver, comp)
                print('----Get basic information----')
                get_basic_info(infosoup, instr)
            except:
                print('Exception skipped', end=' ')
        if instrcount % 4 == 0:
            # Switch the proxy IP every 4 industry links so the site does
            # not ban it. Sometimes the proxy times out; in that case,
            # stop the program or kill phantomjs manually.
            print("Switching proxy IP")
            driver = driver_open()
    cursor.close()
    connect.close()  # close the database connection
The comments in the code are fairly detailed, so you can read it directly.
The crawled results still need preprocessing, in particular because Tianyancha obfuscates ("encrypts") some of the data. The obfuscated fields include registered capital, registration date, and business period.
The "encryption" I came across looks like this:

digit        7 4 5 8 4 . 3 9 0 1 .
ciphertext   5 9 2 6 0 1 3 8 6 2 7
Yes, it's that simple. I nearly died laughing when I discovered it. Decoding it is trivial, so readers can practice it themselves.
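As a sketch, such a digit substitution can be undone with `str.translate`. The mapping table below is a made-up placeholder, not the real Tianyancha table; recover the real one by lining up one known value against its displayed ciphertext, as in the example above.

```python
# Undo a digit-substitution "cipher" like the one described above.
# NOTE: this mapping is a made-up placeholder, not the real table.
CIPHER_TO_PLAIN = str.maketrans("0123456789", "4605918327")

def decode_digits(ciphertext: str) -> str:
    """Replace each displayed digit with its real value; '.' passes through."""
    return ciphertext.translate(CIPHER_TO_PLAIN)

print(decode_digits("59.26"))  # -> 17.08 under the placeholder mapping
```

Once the real table is known, one `translate` call per field is enough to clean the whole dataset.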
Some people will ask why I didn't crawl the National Enterprise Credit Information Publicity System instead. There is only one reason: I really can't be bothered with slider CAPTCHAs, and the click-the-characters CAPTCHAs are just as annoying. There is a blog post about it whose author says his method has expired, but some of the experience is still worth learning from: [Crawler] About the enterprise credit information publicity system - Accelerate's latest anti-crawler mechanism.
Fortunately, Tianyancha had no CAPTCHA at the time, otherwise my modeling teammates would have been mad at me.
Also, if you really don't want to crawl the data yourself and just want the dataset, you can send me a private message for it. I did once consider buying data to get the modeling done, but I basically gave up when I saw the prices. Look at the picture.
The proxy service is Mogu Proxy, which I found myself: 6 yuan for 1000 high-anonymity IPs, and the API key above seems to have about 700 left, so feel free to use them; I don't plan to do any more crawling anyway. Someone has also built a crawler for harvesting proxy IPs. Click the link to get proxy IPs.
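The proxy fetching can also be sketched without selenium, using only the standard library, assuming the Mogu-style JSON response shape used in the script above (`{"msg": [{"ip": ..., "port": ...}]}`). The helper names `parse_proxy` and `open_with_proxy` are my own illustration, not part of the original code.

```python
import json
import urllib.request

def parse_proxy(payload: dict) -> str:
    """Extract 'ip:port' from a Mogu-style proxy API response."""
    entry = payload["msg"][0]
    return entry["ip"] + ":" + entry["port"]

def open_with_proxy(target_url: str, proxy: str, timeout: int = 10) -> bytes:
    """Route a single HTTP request through the given proxy."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    return opener.open(target_url, timeout=timeout).read()

# Example: parse a canned response of the shape the article's API returns
sample = json.loads('{"msg": [{"ip": "1.2.3.4", "port": "8080"}]}')
print(parse_proxy(sample))  # 1.2.3.4:8080
```

Rotating the proxy then amounts to calling the API again and building a fresh opener, which is what the script's `driver_open()` does for PhantomJS every 4 industries.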
In a word: doing technology is tiring, and learning technology is not only tiring but also hard.
Crawling is full of pits. This one has been filled in.