Crawling enterprise information - enterprise credit information query system (a Tianyancha crawler)

(This article is also posted on Zhihu.)

First, a note: whether this crawler still works depends on when you read this, and the way it parses the page content is fairly primitive.

I am no crawler expert. I only started writing crawlers because of a mathematical modeling contest where we had to collect the data ourselves; as the only computer science major on the team, the job fell to me.

OK, without further ado, here is the code.

# -*- coding: utf-8 -*-
"""
Created on Thu Feb  8 18:09:44 2018

@author: A white horse is not a horse
"""

#Requires selenium and PhantomJS to be installed in advance. Not suited to large-scale crawling (too slow);
#here only the ~20 companies on the first results page of each industry are crawled.
#Note: a little manual attention is needed. If the proxy IP fails unexpectedly partway through a batch of 4 industries,
#the program has to be stopped by hand; recovering automatically without intervention is basically not feasible here.
from selenium import webdriver    
import time
import pymysql
from bs4 import BeautifulSoup #Webpage code parser
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.proxy import Proxy
from selenium.webdriver.common.proxy import ProxyType
import json
import urllib.request

ipurl = "http://piping.mogumiao.com/proxy/api/get_ip_al?appKey=6d22aed70f7d0479cbce55dff726a8d8a&count=1&expiryDate=5&format=1"
#API for fetching a proxy IP


connect = pymysql.Connect(  
    host='localhost',  
    port=3306,  
    user='root',  
    passwd='1234',  
    db='user',  
    charset='utf8'  
)

#MySQL connection settings


#Get proxy IP
def getip_port():
    req = urllib.request.Request(ipurl)
    data = urllib.request.urlopen(req).read()
    #loads: convert json to dict  
    s1 = json.loads(data)
    #print (s1["msg"][0]["ip"] )
    #print (s1["msg"][0]["port"] )
    ipstrs=s1["msg"][0]["ip"]+":"+s1["msg"][0]["port"]
    print("Proxy IP:"+ipstrs)
    return ipstrs

#Create browser driver
def driver_open():
    
    
    proxy = Proxy(
        {
            'proxyType': ProxyType.MANUAL,
            'httpProxy': getip_port()  # proxy ip and port
        }
    )
    desired_capabilities = dict(DesiredCapabilities.PHANTOMJS)
    desired_capabilities["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"
    )
    # Add the proxy IP to the capabilities
    proxy.add_to_capabilities(desired_capabilities)
    driver = webdriver.PhantomJS(
        executable_path='phantomjs.exe',
        desired_capabilities=desired_capabilities
    )
    return driver

# get web content
def get_content(driver,url):
    driver.get(url)
    # Wait for the dynamic page to finish loading; adjust the sleep time as needed
    # sleeptime = random.randint(2, 3)
    time.sleep(1)
    content = driver.page_source.encode('utf-8')
    #driver.close()
    soup = BeautifulSoup(content, 'lxml')
    #print(soup)
    return soup

#Parse the webpage content. The selectors are not perfect and do not match every page layout;
#roughly three quarters of Tianyancha pages could be parsed correctly as of 2018-02-27.
#Improvements from more experienced crawler writers are welcome.
def get_basic_info(soup,instr):
    #com=soup.find_all("span")
    #print (com [6])
    
    company = soup.find(attrs={'class':'f18 in-block vertival-middle sec-c2'}).text
    fddbr = soup.find(attrs={'class':'f18 overflow-width sec-c3'}).text
    #fddbr=soup.find_all("a")           
    baseinfo = soup.find_all(attrs={'class':'baseinfo-module-content-value'})                
    zczb =baseinfo[0].text
    zt = baseinfo[2].text           
    zcrq =baseinfo[1].text
    
    foundAllTd = soup.find_all("td")
    #print(len(foundAllTd))
    
    #jyfw = soup.find(attrs={'class':'js-full-container hidden'}).text
    print(u'Company name: ' + company)
    print(u'Legal representative: ' + fddbr)
    print(u'Registered capital: ' + zczb)
    print(u'Company status: ' + zt)
    print(u'Registration date: ' + zcrq)
    
    #Roughly identify the page layout by counting the td tags:
    #large companies have richer reports, with roughly 800-1000 td tags,
    #while small companies are generally below 500.
    #A few companies fall in between and cannot be classified reliably, but they are rare enough not to matter much (as of 2018-02-26).
    if len(foundAllTd) > 600:
        """
    
        print (u'Number of employees:'+foundAllTd[50].text)
        print (u'industry:'+foundAllTd[527].text)
        print (u'Enterprise type:'+foundAllTd[523].text)
        
        #print (u'Business registration number:'+foundAllTd[517].text)
        print( u'Organization code: '+foundAllTd[519].text)
        print (u'Business period: '+foundAllTd[529].text)
        print( u'Registration agency:'+foundAllTd[533].text)
        print (u'Approval date:'+foundAllTd[531].text)
        print( u'Unified social credit code: '+foundAllTd[521].text)
        print (u'Registration address:'+foundAllTd[537].text)
        print (u'Business scope:'+foundAllTd[539].text)
        """
        sql = "INSERT INTO company (instr,company_name,industry,business_scope,type_enterprise,regist_capital,legal_represent,regist_date,company_status,operat_period,registrat_body,approval_date,address,people_num) VALUES ( '%s','%s','%s', '%s', '%s', '%s', '%s',  '%s', '%s', '%s', '%s', '%s', '%s', '%s' )"  
        data = (instr,company, foundAllTd[527].text, foundAllTd[539].text,foundAllTd[523].text ,zczb , fddbr,zcrq ,zt,foundAllTd[529].text ,foundAllTd[533].text ,foundAllTd[531].text ,foundAllTd[537].text,foundAllTd[49].text)  
        
    else:
        """
        print (u'industry:'+foundAllTd[18].text)
        #print (u'Business registration number:'+foundAllTd[8].text)
        print (u'Enterprise type:'+foundAllTd[14].text)
        print( u'Organization code: '+foundAllTd[10].text)
        print (u'Business period: '+foundAllTd[20].text)
        print( u'Registration agency:'+foundAllTd[24].text)
        print (u'Approval date:'+foundAllTd[22].text)
        print( u'Unified social credit code: '+foundAllTd[16].text)
        print (u'Registration address:'+foundAllTd[28].text)
        print (u'Business scope:'+foundAllTd[30].text)
        """
        sql = "INSERT INTO company (instr,company_name,industry,business_scope,type_enterprise,regist_capital,legal_represent,regist_date,company_status,operat_period,registrat_body,approval_date,address) VALUES ( '%s','%s', '%s', '%s', '%s', '%s',  '%s', '%s', '%s', '%s', '%s', '%s', '%s' )"  
        data = (instr,company, foundAllTd[18].text, foundAllTd[30].text,foundAllTd[14].text ,zczb , fddbr,zcrq ,zt,foundAllTd[20].text ,foundAllTd[24].text ,foundAllTd[22].text ,foundAllTd[28].text)  
    
     
  
    # insert the row into the database
    cursor.execute(sql % data)
    connect.commit()
    #print('Successfully inserted', cursor.rowcount, 'rows')


        
#Get executive information; this selector no longer works, but it has no effect on the rest of the code
def get_gg_info(soup):
    ggpersons = soup.find_all(attrs={"event-name": "company-detail-staff"})
    ggnames = soup.select('table.staff-table > tbody > tr > td.ng-scope > span.ng-binding')
# print(len(gg))
    for i in range(len(ggpersons)):
            ggperson = ggpersons[i].text
            ggname = ggnames[i].text
            print (ggperson+" "+ggname)
#Get shareholder/investor information; no longer works, has no effect on the rest of the code
def get_gd_info(soup):
    tzfs = soup.find_all(attrs={"event-name": "company-detail-investment"})
    for i in range(len(tzfs)):
            tzf_split = tzfs[i].text.replace("\n","").split()
            tzf = '' .join (tzf_split)
            print (tzf)
#Get outbound investment information; no longer works, has no effect on the rest of the code
def get_tz_info(soup):
    btzs = soup.select('a.query_name')
    for i in range(len(btzs)):
            btz_name = btzs[i].select('span')[0].text
            print (btz_name)

#Get industry links on the homepage         
def get_industry(soup):
   # print(soup.find(attrs={'class':'industry_container js-industry-container'}))
    #hangye = soup.find(attrs={'class':'industry_container js-industry-container'}).find_all("a")
    x=[]
    buyao = 70  #number of leading links to skip; remove this when you actually start crawling the data
    hangye = soup.find_all('a')
    for item in hangye:
        if 'https://www.tianyancha.com/search/oc' in str(item.get("href")):
            print (item.get("href"))
            if buyao>0:
                buyao-=1
            else:
                x.append(str(item.get("href")))
    print("Industry number")
    print (len (x))
    return x;

#Get the company link under the industry            
def get_industry_company(soup):
    y=[]
    companylist = soup.find_all('a')
    for item in companylist:
        if 'https://www.tianyancha.com/company/' in str(item.get("href")):
            print (item.get("href"))
            y.append(str(item.get("href")))
    return y

if __name__=='__main__':
    cursor = connect.cursor() #Get a database cursor
    
    companycount = 0  #Number of companies crawled so far
    instrcount = 0    #Number of industries crawled so far; the proxy IP is changed every 4 industries, and only the first results page (20 companies) of each industry is crawled
    theinscount = 0   #Total number of industry links to crawl
    
    driver = driver_open()
    url = "https://www.tianyancha.com/"
    soup = get_content(driver, url)
    instrlist=get_industry(soup)
    theinscount=len(instrlist)
    print()
    
    for instr in instrlist: #traverse industry links
        instrcount+=1
        print(instrcount)
        print(instr)
        compsoup = get_content(driver, instr)
        complist =get_industry_company(compsoup)
        for comp in complist: #traverse the company links under the industry  
            print(comp)
            companycount += 1
            #print(num)
            print("Industries crawled so far: " + str(instrcount))
            
            try:
                infosoup = get_content(driver, comp)
                print ('----Get basic information----')
                get_basic_info(infosoup,instr)
            except:
                print('Exception skipped', end=' ')
        
        if instrcount % 4 == 0:  #Change the proxy IP every 4 industry links so the site does not ban the proxy IP
            #Sometimes the proxy IP times out or otherwise fails; in that case, stop the program or kill the phantomjs process manually
            print("Switching proxy IP")
            #driver.close()  #Close the old driver; multiple phantomjs windows may pile up and need to be closed
            driver = driver_open()
    
            
    cursor.close()  
    connect.close() #Close the database connection

The code comments are fairly detailed and can be read directly.
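If you want to run the script as-is, it also expects a `company` table in the `user` database. Below is a minimal sketch of that table: the column names are taken from the INSERT statements in the code, but the column types and lengths are only my assumption, so adjust them to your data.

# Minimal sketch of the `company` table expected by the INSERT statements above.
# Column names come from the crawler code; the types/lengths are assumptions - adjust as needed.
import pymysql

create_sql = """
CREATE TABLE IF NOT EXISTS company (
    id INT AUTO_INCREMENT PRIMARY KEY,
    instr VARCHAR(255), company_name VARCHAR(255), industry VARCHAR(255),
    business_scope TEXT, type_enterprise VARCHAR(255), regist_capital VARCHAR(255),
    legal_represent VARCHAR(255), regist_date VARCHAR(255), company_status VARCHAR(255),
    operat_period VARCHAR(255), registrat_body VARCHAR(255), approval_date VARCHAR(255),
    address VARCHAR(500), people_num VARCHAR(255)
) DEFAULT CHARSET=utf8;
"""

connect = pymysql.Connect(host='localhost', port=3306, user='root',
                          passwd='1234', db='user', charset='utf8')
with connect.cursor() as cursor:
    cursor.execute(create_sql)
connect.commit()
connect.close()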

The data crawled by the code above still needs preprocessing, in particular because Tianyancha obfuscates ("encrypts") some of the numbers it displays.

The obfuscated fields include registered capital, registration date, and business period; the encryption method itself is quite primitive.

The substitution I have come across is the following:

Digit encryption:

Digit    Ciphertext
7        4
5        8
4        .
3        9
0        1
.        5
9        2
6        0
1        3
8        6
2        7

Yes, it's that simple - I nearly died laughing when I found it. Decoding it is straightforward and you can practice it yourself; a quick sketch follows below.
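As an illustration, here is a minimal decoding sketch. It assumes the left column of the table above is the real digit and the right column is the character shown on the page, so decoding maps page characters back to real digits; if the direction is actually the other way around, just invert the dictionary.

# Minimal sketch: undo the digit substitution using the mapping from the table above.
# Assumption: table columns are (real digit -> character shown on the page); invert the dict if reversed.
CIPHER_TO_PLAIN = {'4': '7', '8': '5', '.': '4', '9': '3', '1': '0',
                   '5': '.', '2': '9', '0': '6', '3': '1', '6': '8', '7': '2'}

def decode_number(ciphertext):
    # Replace each obfuscated character; leave anything else (e.g. unit text) untouched
    return ''.join(CIPHER_TO_PLAIN.get(ch, ch) for ch in ciphertext)

# Example: call decode_number(zczb) on the scraped registered capital before inserting it into the database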

Some people ask why I did not crawl the National Enterprise Credit Information Publicity System instead. There is only one reason: I really could not be bothered with its sliding CAPTCHAs, and the click-the-characters CAPTCHAs are annoying. Take a look at this person's blog - he says his approach has already expired, but there is still experience worth borrowing: [Crawler] About the enterprise credit information publicity system - the latest anti-crawler mechanism

Fortunately, Tianyancha has no CAPTCHA; otherwise my modeling teammates would have been furious with me.

In addition, if you really don't want to crawl the data yourself and only want the data, you can send me a private message to ask for it. I did consider just buying some data to get through the modeling contest, but after seeing the prices I basically gave up - see the picture.


The proxy IPs come from the Mogu ("mushroom") proxy service behind the API above: 6 yuan bought 1000 high-anonymity IPs, and there seem to be about 700 left on that API, so feel free to use them - I have no plans to keep crawling. Some experts have also written crawlers for free proxy IPs; click to open the link to get the proxy IP.

In a word: doing technical work is tiring, and learning it is not only tiring but also hard.

One more pit - a crawler pit - filled in.




