[Crawler] Crawling UN Comtrade data with Python using dynamic IP proxies and multithreading

        The United Nations trade statistics database, UN Comtrade, is maintained by the United Nations Statistics Division and aggregates the import and export trade flows reported by all member states, making it an indispensable data source for international trade analysis. It supports queries under a range of commodity classification standards, including HS2002, HS1996, HS1992 and SITC Revisions 1 through 4; it covers more than 250 countries and areas and over 5,000 commodities at the 6-digit HS code level (4-digit in some classifications), reaches back as far as 1962, and holds more than 1 billion trade records in total. An official API is provided for fetching data over the web, but it has several pain points in practice. This article addresses them in three parts: ① encapsulate the API so that it behaves like a typical Python data-access function; ② use a dynamic IP proxy (a rotating tunnel proxy, called PPTP in the original) to change the request IP and break UN Comtrade's restriction on fetching from a single IP; ③ use multithreading to extract data for multiple countries at the same time and speed up extraction.

 

Table of contents

1. Repackaging the UN Comtrade API

1.1 Introduction to the original API

1.2 The API encapsulated in this article

2. Using a dynamic IP proxy to obtain large amounts of data

3. Using multithreading to speed up data extraction

4. A few points worth noting


1. Repackaging the UN Comtrade API

1.1 Introduction to the original API

In short, you modify the parameters in the URL yourself and request that URL to obtain the data.

URL format: http://comtrade.un.org/api/get?parameters

Parameter introduction:

    max: the maximum number of records returned (100000 by default);
    r: reporter area; select the desired reporting country;
    freq: annual or monthly data (A or M);
    ps: the desired year (period);
    px: the classification standard, e.g. the commonly used SITC Revision 3 is S3;
    p: partner area; e.g. for China's exports to Russia, the reporter is China and the partner is Russia;
    rg: import or export (import is 1, export is 2);
    cc: the commodity code;
    fmt: the output format, csv or json; json by default (csv is faster in actual tests);
    type: the trade type, commodities (C) or services (S).

Example:

http://comtrade.un.org/api//get/plus?max=100000&type=C&freq=A&px=HS&ps=2021&r=156&p=all&rg=1&cc=TOTAL&fmt=csv
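For a quick sanity check, the raw API can also be called directly with requests; the following minimal sketch simply mirrors the example URL above as a params dict (nothing here depends on the wrapper built later):

import requests

# call the raw API directly; the parameter values mirror the example URL above
params = {"max": 100000, "type": "C", "freq": "A", "px": "HS", "ps": 2021,
          "r": 156, "p": "all", "rg": 1, "cc": "TOTAL", "fmt": "csv"}
resp = requests.get("http://comtrade.un.org/api/get", params=params, timeout=60)
print(resp.status_code)
print(resp.text[:200])  # first characters of the returned CSV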

 Data download page: (screenshot omitted)

1.2 The API encapsulated in this article

In actual use, problems such as [502 Bad Gateway] and [403 Forbidden] may occur, and exception handling is needed to record or deal with them. Moreover, import data and export data can only be fetched separately; if we organize data country by country, the newly packaged API needs to splice the import and export data together. In addition, each retrieved dataset is given a name to make it easier to understand later, so the encapsulated function returns a dictionary of the form {data name (str): data (DataFrame)}.

*Considering that a proxy will be needed later, proxy support is also built in here. When no proxy is needed, simply set the relevant parameters to False and None.

code:

import requests
import time
import pandas as pd
from pandas import json_normalize
import numpy as np
from tqdm import tqdm
from random import randint
import datetime
from io import StringIO

class proxy:
    proxyHost = "u8804.5.tn.16yun.cn"
    proxyPort = "6441"
    proxyUser = "16IHUBEP"
    proxyPass = "727634"
    user_agents = []
    proxies = {}

    def __init__(self, proxyHost, proxyPort, proxyUser, proxyPass, user_agents):
        self.proxyHost = proxyHost
        self.proxyPort = proxyPort
        self.proxyUser = proxyUser
        self.proxyPass = proxyPass
        self.user_agents = user_agents
        proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
            "host": proxyHost,
            "port": proxyPort,
            "user": proxyUser,
            "pass": proxyPass,
        }
        self.proxies = {
            "http": proxyMeta,
            "https": proxyMeta,
        }
    
    
def download_url(url, ifuse_proxy = False, proxy = None):
    if(ifuse_proxy):
        time.sleep(0.5)  # this line is not needed when calling from multiple threads
        random_agent = proxy.user_agents[randint(0, len(proxy.user_agents) - 1)]  # choose a random user agent from the list
        tunnel = randint(1, 10000)  # generate a tunnel id so the proxy service rotates the exit IP
        header = {
            "Proxy-Tunnel": str(tunnel),
            "User-Agent": random_agent
        }

    try:
        if(ifuse_proxy):
            content = requests.get(url, timeout=100, headers = header, proxies = proxy.proxies)
        else:
            content = requests.get(url, timeout=100)
        '''note that sometimes we only get error information in the response body, and here are some really dumb quick fixes'''
        if (
                content.text == "<html><body><h1>502 Bad Gateway</h1>\nThe server returned an invalid or incomplete response.\n</body></html>\n" or content.text == "Too Many Requests.\n" or content.text == "{\"Message\":\"An error has occurred.\"}"):
            with open("./uncomtrade_data/serverError.csv", 'a', encoding="utf-8") as log:
                log.write(str(datetime.datetime.now()) + "," + str(url) + "\n")
                print("\n" + content.content.decode())
                # retry, and return the result so the retried data reaches the caller
                if(ifuse_proxy):
                    return download_url(url, ifuse_proxy = True, proxy = proxy)
                else:
                    return download_url(url, ifuse_proxy = False, proxy = None)
        else:
            if('json' in url):
                return json_normalize(content.json()['dataset'])
            elif('csv' in url):
                return pd.read_csv(StringIO(content.text), on_bad_lines = 'skip')

    except requests.RequestException as e:
        '''I have absolutely no knowledge about request exception handling, so I write the error information to a log file'''
        print(type(e).__name__ + " has occurred, change proxy!")
        with open("./uncomtrade_data/exp.csv", 'a', encoding="utf-8") as log:
            log.write(
                str(datetime.datetime.now()) + "," + str(type(e).__name__) + "," + str(url) + "\n")
        # retry with a new tunnel (and therefore a new exit IP) and return the result
        if(ifuse_proxy):
            return download_url(url, ifuse_proxy = True, proxy = proxy)
        else:
            return download_url(url, ifuse_proxy = False, proxy = None)

def get_data_un_comtrade(max_un = 100000,r = '156',freq = 'A',ps = '2021',px = 'S4',p = 'all',rg = '2',cc = 'TOTAL',fmt = 'json',type_un ='C',ifuse_proxy = False ,proxy = None ):
    '''
    max_un: maximum number of records returned (100000 by default);
    r: reporter area, the desired reporting country;
    freq: annual or monthly data (A, M);
    ps: the desired year;
    px: the classification standard, e.g. the commonly used SITC Revision 3 is S3;
    p: partner area, e.g. for China's exports to Russia the reporter is China and the partner is Russia;
    rg: import or export (import is 1, export is 2);
    cc: the commodity code;
    fmt: the output format, csv or json, json by default (csv is faster in actual tests);
    type_un: the trade type, commodities or services;
    ifuse_proxy: whether to use a proxy;
    proxy: the proxy information.
    
    return: {data name: data}, i.e. {str: DataFrame}
    '''
    pre_url = "http://comtrade.un.org/api//get/plus?max={}&type={}&freq={}&px={}&ps={}&r={}&p={}&rg={}&cc={}&fmt={}"
    url_use = pre_url.format(max_un,type_un,freq,px,ps,r,p,rg,cc,fmt)
    print("Getting data from:"+url_use)
    data = download_url(url_use, ifuse_proxy = ifuse_proxy ,proxy = proxy)
    if(rg == '1'):  # rg is passed as a string, so compare against '1', not the integer 1
        ex_or_in = 'IMPORT'
    else:
        ex_or_in = 'EXPORT'
    data_name = ps+"_"+r+"_"+p+"_"+px+"_"+cc+"_"+ex_or_in+"_"+freq
    return {data_name:data}

At the same time, obtaining data for different countries requires each country's numeric code. The official list is provided at https://comtrade.un.org/Data/cache/reporterAreas.json (a sketch of loading it follows the dictionary below), or the following dictionary can be used directly:

countries = {'156': 'China','344': 'China, Hong Kong SAR','446': 'China, Macao SAR',
             '4': 'Afghanistan','8': 'Albania','12': 'Algeria','20': 'Andorra','24': 'Angola','660': 'Anguilla','28': 'Antigua and Barbuda','32': 'Argentina','51': 'Armenia',
             '533': 'Aruba','36': 'Australia','40': 'Austria','31': 'Azerbaijan','44': 'Bahamas','48': 'Bahrain','50': 'Bangladesh', '52': 'Barbados',
             '112': 'Belarus','56': 'Belgium','58': 'Belgium-Luxembourg','84': 'Belize','204': 'Benin','60': 'Bermuda','64': 'Bhutan','68': 'Bolivia (Plurinational State of)','535': 'Bonaire',
             '70': 'Bosnia Herzegovina','72': 'Botswana','92': 'Br. Virgin Isds','76': 'Brazil','96': 'Brunei Darussalam','100': 'Bulgaria','854': 'Burkina Faso','108': 'Burundi','132': 'Cabo Verde','116': 'Cambodia',
             '120': 'Cameroon','124': 'Canada','136': 'Cayman Isds','140': 'Central African Rep.','148': 'Chad','152': 'Chile',
             '170': 'Colombia','174': 'Comoros','178': 'Congo','184': 'Cook Isds','188': 'Costa Rica','384': "Côte d'Ivoire",'191': 'Croatia','192': 'Cuba','531': 'Curaçao','196': 'Cyprus','203': 'Czechia',
             '200': 'Czechoslovakia','408': "Dem. People's Rep. of Korea",'180': 'Dem. Rep. of the Congo','208': 'Denmark','262': 'Djibouti','212': 'Dominica','214': 'Dominican Rep.','218': 'Ecuador',
             '818': 'Egypt','222': 'El Salvador','226': 'Equatorial Guinea','232': 'Eritrea','233': 'Estonia','231': 'Ethiopia','234': 'Faeroe Isds','238': 'Falkland Isds (Malvinas)','242': 'Fiji','246': 'Finland',
             '251': 'France','254': 'French Guiana','258': 'French Polynesia','583': 'FS Micronesia','266': 'Gabon','270': 'Gambia','268': 'Georgia','276': 'Germany','288': 'Ghana','292': 'Gibraltar',
             '300': 'Greece','304': 'Greenland','308': 'Grenada','312': 'Guadeloupe','320': 'Guatemala','324': 'Guinea','624': 'Guinea-Bissau','328': 'Guyana','332': 'Haiti','336': 'Holy See (Vatican City State)',
             '340': 'Honduras','348': 'Hungary','352': 'Iceland','699': 'India','364': 'Iran','368': 'Iraq','372': 'Ireland','376': 'Israel','381': 'Italy','388': 'Jamaica','392': 'Japan',
             '400': 'Jordan','398': 'Kazakhstan','404': 'Kenya','296': 'Kiribati','414': 'Kuwait','417': 'Kyrgyzstan','418': "Lao People's Dem. Rep.",'428': 'Latvia','422': 'Lebanon','426': 'Lesotho',
             '430': 'Liberia','434': 'Libya','440': 'Lithuania','442': 'Luxembourg','450': 'Madagascar','454': 'Malawi','458': 'Malaysia','462': 'Maldives','466': 'Mali','470': 'Malta','584': 'Marshall Isds',
             '474': 'Martinique','478': 'Mauritania','480': 'Mauritius','175': 'Mayotte','484': 'Mexico','496': 'Mongolia','499': 'Montenegro','500': 'Montserrat','504': 'Morocco','508': 'Mozambique','104': 'Myanmar',
             '580': 'N. Mariana Isds','516': 'Namibia','524': 'Nepal','530': 'Neth. Antilles','532': 'Neth. Antilles and Aruba','528': 'Netherlands','540': 'New Caledonia','554': 'New Zealand','558': 'Nicaragua',
             '562': 'Niger','566': 'Nigeria','579': 'Norway','512': 'Oman','586': 'Pakistan','585': 'Palau','591': 'Panama','598': 'Papua New Guinea','600': 'Paraguay','459': 'Peninsula Malaysia','604': 'Peru','608': 'Philippines',
             '616': 'Poland','620': 'Portugal','634': 'Qatar','410': 'Rep. of Korea','498': 'Rep. of Moldova','638': 'Réunion','642': 'Romania','643': 'Russian Federation','646': 'Rwanda','647': 'Ryukyu Isd','461': 'Sabah',
             '652': 'Saint Barthelemy','654': 'Saint Helena','659': 'Saint Kitts and Nevis','662': 'Saint Lucia','534': 'Saint Maarten','666': 'Saint Pierre and Miquelon','670': 'Saint Vincent and the Grenadines',
             '882': 'Samoa','674': 'San Marino','678': 'Sao Tome and Principe','457': 'Sarawak','682': 'Saudi Arabia','686': 'Senegal','688': 'Serbia','690': 'Seychelles','694': 'Sierra Leone','702': 'Singapore',
             '703': 'Slovakia','705': 'Slovenia','90': 'Solomon Isds','706': 'Somalia','710': 'South Africa','728': 'South Sudan','724': 'Spain','144': 'Sri Lanka','275': 'State of Palestine',
             '729': 'Sudan','740': 'Suriname','748': 'Eswatini','752': 'Sweden','757': 'Switzerland','760': 'Syria','762': 'Tajikistan','807': 'North Macedonia','764': 'Thailand','626': 'Timor-Leste',
             '768': 'Togo','772': 'Tokelau','776': 'Tonga','780': 'Trinidad and Tobago','788': 'Tunisia','795': 'Turkmenistan','796': 'Turks and Caicos Isds','798': 'Tuvalu','800': 'Uganda',
             '804': 'Ukraine','784': 'United Arab Emirates','826': 'United Kingdom','834': 'United Rep. of Tanzania','858': 'Uruguay','850': 'US Virgin Isds','842': 'USA','860': 'Uzbekistan',
             '548': 'Vanuatu','862': 'Venezuela','704': 'Viet Nam','876': 'Wallis and Futuna Isds','887': 'Yemen','894': 'Zambia','716': 'Zimbabwe'}
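Alternatively, the dictionary above can be rebuilt from the official JSON; here is a minimal sketch, assuming the file keeps its usual {"results": [{"id": ..., "text": ...}]} layout:

import requests

# rebuild the reporter-code dictionary from the official list
resp = requests.get("https://comtrade.un.org/Data/cache/reporterAreas.json", timeout=60)
results = resp.json()["results"]
countries_from_api = {item["id"]: item["text"] for item in results if item["id"] != "all"}
print(len(countries_from_api), "reporter areas loaded")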

Example of calling the above API:

# Example of fetching data with the encapsulated get_data_un_comtrade() [no dynamic IP proxy] [no multithreading]
# Example: get all commodity trade data for China's exports to all countries in 2021
temp = get_data_un_comtrade(max_un = 100000,r = '156',freq = 'A',ps = '2021',px = 'S4',p = 'all',rg = '2',cc = 'TOTAL',fmt = 'json',type_un ='C',ifuse_proxy = False ,proxy = None )
temp_name = list(temp.keys())[0]
print("DATA NAME IS "+temp_name)
temp[temp_name]

 output:

Getting data from:http://comtrade.un.org/api//get/plus?max=100000&type=C&freq=A&px=S4&ps=2021&r=156&p=all&rg=2&cc=TOTAL&fmt=json
DATA NAME IS 2021_156_all_S4_TOTAL_EXPORT_A

Out[13]:

    pfCode    yr  period periodDesc  aggrLevel  IsLeaf  rgCode rgDesc  rtCode rtTitle  ...  qtAltCode qtAltDesc  TradeQuantity  AltQuantity  NetWeight  GrossWeight   TradeValue  CIFValue      FOBValue  estCode
0       H5  2021    2021       2021          5       0       0      X     156   China  ...         -1       N/A              0          0.0          0          0.0   2756859964       0.0  2.756860e+09        4
1       H5  2021    2021       2021          5       0       0      X     156   China  ...         -1       N/A              0          0.0          0          0.0   2312182195       0.0  2.312182e+09        4
2       H5  2021    2021       2021          5       0       0      X     156   China  ...         -1       N/A              0          0.0          0          0.0      1658462       0.0  1.658462e+06        4
3       H5  2021    2021       2021          5       0       0      X     156   China  ...         -1       N/A              0          0.0          0          0.0   3565158772       0.0  3.565159e+09        4
4       H5  2021    2021       2021          5       0       0      X     156   China  ...         -1       N/A              0          0.0          0          0.0    276740207       0.0  2.767402e+08        4
..      ..   ...     ...        ...        ...     ...     ...    ...     ...     ...  ...        ...       ...            ...          ...        ...          ...          ...       ...           ...      ...
214     H5  2021    2021       2021          5       0       0      X     156   China  ...         -1       N/A              0          0.0          0          0.0   1227521463       0.0  1.227521e+09        4
215     H5  2021    2021       2021          5       0       0      X     156   China  ...         -1       N/A              0          0.0          0          0.0   1950333100       0.0  1.950333e+09        4
216     H5  2021    2021       2021          5       0       0      X     156   China  ...         -1       N/A              0          0.0          0          0.0     88878914       0.0  8.887891e+07        4
217     H5  2021    2021       2021          5       0       0      X     156   China  ...         -1       N/A              0          0.0          0          0.0   2714022068       0.0  2.714022e+09        4
218     H5  2021    2021       2021          5       0       0      X     156   China  ...         -1       N/A              0          0.0          0          0.0  43630019204       0.0  4.363002e+10        4

[219 rows × 35 columns]
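Since the wrapper returns an ordinary DataFrame, persisting it takes one line; a minimal sketch (assuming the ./uncomtrade_data/ directory used elsewhere in this article exists):

# save the returned DataFrame for later use
temp[temp_name].to_csv("./uncomtrade_data/" + temp_name + ".csv", index=False)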

2. Using a dynamic IP proxy to obtain large amounts of data

Since the official site imposes data-fetching restrictions on a single IP (the screenshot of the limits is omitted here; at the time of writing, the public API allowed a guest IP roughly one request per second and about 100 requests per hour), obtaining a large amount of data in one run requires a dynamic IP (tunnel) proxy.

In addition, it is best to add a User-Agent field to the request header and set it to look like a request sent from a PC browser such as Chrome or IE (by default, requests identifies itself with a User-Agent like python-requests/x.x.x, and many servers reject such requests outright). A ready-made list of User-Agent strings is available at https://download.csdn.net/download/standingflower/86515035
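For illustration, a minimal sketch of attaching a browser-like User-Agent to a plain request (the UA string is just an example taken from the list below):

import requests

# send a browser-like User-Agent instead of the default python-requests/x.x.x
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"}
resp = requests.get("http://comtrade.un.org/api//get/plus?max=10&type=C&freq=A&px=HS&ps=2021&r=156&p=all&rg=1&cc=TOTAL&fmt=csv",
                    headers=headers, timeout=60)
print(resp.status_code)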

Use code example:

# Example of fetching data with the encapsulated get_data_un_comtrade() [with dynamic IP proxy] [no multithreading]
# proxyHost, proxyPort, proxyUser, proxyPass and user_agents must be set according to your own proxy service
proxyHost = "your proxyHost "
proxyPort = "your proxyPort "
proxyUser = "your proxyUser "
proxyPass = "your proxyPass "
user_agents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52"]

# initialize the proxy instance
proxy_use = proxy(proxyHost, proxyPort, proxyUser, proxyPass, user_agents)
# call the data-fetching function through the proxy so the IP does not get banned
# Example: get all countries' 2021 trade data
num = 0
for key in list(countries.keys()):
    print(countries[key]+" BEGINS! TIME:",datetime.datetime.now())
    temp_import = get_data_un_comtrade(max_un = 100000,r = key,freq = 'A',ps = '2021',px = 'HS',p = 'all',rg = '1',cc = 'TOTAL',fmt = 'csv',type_un ='C',ifuse_proxy = True ,proxy = proxy_use )
    temp_name_import = list(temp_import.keys())[0]
    temp_export = get_data_un_comtrade(max_un = 100000,r = key,freq = 'A',ps = '2021',px = 'HS',p = 'all',rg = '2',cc = 'TOTAL',fmt = 'csv',type_un ='C',ifuse_proxy = True ,proxy = proxy_use )
    temp_name_export = list(temp_export.keys())[0]
    if((temp_import[temp_name_import] is not None) or (temp_export[temp_name_export] is not None)):
        # drop the None side before concatenating, otherwise pd.concat raises a TypeError
        temp_data = pd.concat([d for d in [temp_import[temp_name_import], temp_export[temp_name_export]] if d is not None], axis=0)
        num = num +1
        if(not temp_data.empty):
            temp_data.to_excel("./uncomtrade_data/uncomtrade_data_test2/"+countries[key]+".xlsx")
            print("DATA NAME IS "+temp_name_import+" and "+temp_name_export + ". COMPLETED!  "+ str(len(list(countries.keys()))-num)+" remains!")
        else:
            print(temp_name_import +" or "+temp_name_export+" is empty! SKIP!")
          
    else:
        print(temp_name_import +" or "+temp_name_export+" is None! SKIP!")
    print("******************************************************************")

output: (console log omitted; one progress line is printed per country, as in the print statements above)

3. Using multithreading to speed up data extraction

Multithreading on top of the dynamic IP proxy can greatly speed up data extraction.

The relevant techniques used to build the multithreading here are: create thread instances by subclassing the threading.Thread class and overriding its run() method, and use a semaphore to limit the number of concurrent threads, as sketched below.
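A minimal, self-contained sketch of this pattern (the Worker class, the limit of 3 and the sleep are illustrative only, unrelated to UN Comtrade):

import threading
import time

semaphore = threading.BoundedSemaphore(3)  # at most 3 workers inside the guarded block at once

class Worker(threading.Thread):
    def __init__(self, task_id):
        super().__init__()
        self.task_id = task_id

    def run(self):
        with semaphore:  # acquire on entry, release automatically on exit
            print("task", self.task_id, "running")
            time.sleep(0.2)  # stand-in for real download work

threads = [Worker(i) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for all workers before the main thread exits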

code:

# Example of fetching data with the encapsulated get_data_un_comtrade() [with dynamic IP proxy] [with multithreading]
# proxyHost, proxyPort, proxyUser, proxyPass and user_agents are set according to your own proxy service, as in section 2
proxyHost = "your proxyHost "
proxyPort = "your proxyPort "
proxyUser = "your proxyUser "
proxyPass = "your proxyPass "
user_agents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52"]
# initialize the proxy instance
proxy_use = proxy(proxyHost, proxyPort, proxyUser, proxyPass, user_agents)

import threading
import queue
import random
import time

class DownloadData(threading.Thread):
    def __init__(self,country_code):
        super().__init__()
        self.country_code = country_code

    def run(self):
        with semaphore:
            print(countries[self.country_code]+" BEGINS! TIME:",datetime.datetime.now())
            temp_import = get_data_un_comtrade(max_un = 100000,r = self.country_code,freq = 'A',ps = '2021',px = 'HS',p = 'all',rg = '1',cc = 'TOTAL',fmt = 'csv',type_un ='C',ifuse_proxy = True ,proxy = proxy_use )
            temp_name_import = list(temp_import.keys())[0]
            temp_export = get_data_un_comtrade(max_un = 100000,r = self.country_code,freq = 'A',ps = '2021',px = 'HS',p = 'all',rg = '2',cc = 'TOTAL',fmt = 'csv',type_un ='C',ifuse_proxy = True ,proxy = proxy_use )
            temp_name_export = list(temp_export.keys())[0]
            if((temp_import[temp_name_import] is not None) or (temp_export[temp_name_export] is not None)):
                # drop the None side before concatenating, otherwise pd.concat raises a TypeError
                temp_data = pd.concat([d for d in [temp_import[temp_name_import], temp_export[temp_name_export]] if d is not None], axis=0)
                if(not temp_data.empty):
                    temp_data.to_excel("./uncomtrade_data/uncomtrade_data_test3/"+countries[self.country_code]+".xlsx")
                    print("DATA NAME IS "+temp_name_import+" and "+temp_name_export + ". COMPLETED!  ")
                else:
                    print(temp_name_import +" or "+temp_name_export+" is empty! SKIP!")
            else:
                print(temp_name_import +" or "+temp_name_export+" is None! SKIP!")
        return
    

thread_list = []  # list that collects all thread instances
MAX_THREAD_NUM = 5  # maximum number of concurrent threads
semaphore = threading.BoundedSemaphore(MAX_THREAD_NUM)  # or use threading.Semaphore
for country_code in list(countries.keys()):
    m = DownloadData(country_code)
    thread_list.append(m)
for m in thread_list:
    m.start()  # start the threads; the semaphore limits how many run at once

for m in thread_list:
    m.join()  # the main thread waits until every worker thread has finished

The more threads the proxy can support, the greater the speed-up, and the downloaded data is identical to what is obtained without multithreading.

4. A few points worth noting

(1) In actual tests, fetching data in csv format is much faster than in json;

(2) Taiwan's data is reported under code 490, "Other Asia, nes";

(3) When specifying the reporter country, only the numeric country code can be entered; the code list is at https://comtrade.un.org/Data/cache/reporterAreas.json

(4) When specifying the partner country, only the numeric country code can be entered; the code list is at https://comtrade.un.org/Data/cache/partnerAreas.json

(5) The HS codes of the relevant commodities can be looked up at https://comtrade.un.org/Data/cache/classificationHS.json

(6) If the freq parameter is set to M (meaning data is fetched in monthly units), the px (classification) parameter should not use the SITC family (ST, S1, S2, ..., S4), because there is no monthly SITC data and the returned table will be empty. Short sketches illustrating notes (2) and (6) follow this list.
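For illustration, here are minimal sketches of notes (2) and (6) using the get_data_un_comtrade() wrapper from section 1; the specific parameter values are assumptions chosen only for the example (for monthly data, ps is a period in YYYYMM form):

# Sketch for note (2): China's 2021 exports to Taiwan, fetched with partner code 490 ("Other Asia, nes")
temp_tw = get_data_un_comtrade(r = '156', p = '490', rg = '2', ps = '2021', px = 'HS', cc = 'TOTAL', fmt = 'csv')

# Sketch for note (6): monthly data needs freq='M' and an HS classification (not SITC);
# ps is then a month such as '202101'
temp_m = get_data_un_comtrade(freq = 'M', ps = '202101', r = '156', p = 'all', rg = '2', px = 'HS', cc = 'TOTAL', fmt = 'csv')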
