1. Problem description
Text data often mixes several languages, for example English, Japanese, Russian, and French. To process such data uniformly, the non-Chinese content needs to be batch-translated into Chinese. This can be done from Python by calling the translation API provided by Baidu.
For details of the API, see the Baidu Translate API documentation.
2. Solution
Development environment: Linux
First separate the Chinese and non-Chinese parts of the text, then translate only the non-Chinese parts.
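translate.py itself does not perform the separation step described above. The following is a minimal Python 3 sketch of one way to do it, not the author's method. It assumes that "Chinese" means characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF); the names split_by_language and non_chinese_parts are hypothetical. Japanese kanji fall in the same block, so mixed Chinese/Japanese text would need a more careful check.

```python
import re

# Assumption: a "Chinese" run is one or more characters in the
# CJK Unified Ideographs block (U+4E00 to U+9FFF).
CHINESE_RUN = re.compile(r'[\u4e00-\u9fff]+')


def split_by_language(text):
    """Return (is_chinese, segment) tuples that together cover the whole text."""
    segments = []
    pos = 0
    for match in CHINESE_RUN.finditer(text):
        if match.start() > pos:
            # Text between two Chinese runs is treated as non-Chinese.
            segments.append((False, text[pos:match.start()]))
        segments.append((True, match.group()))
        pos = match.end()
    if pos < len(text):
        segments.append((False, text[pos:]))
    return segments


def non_chinese_parts(text):
    """Return the stripped non-Chinese segments, skipping whitespace-only ones."""
    return [seg.strip() for is_cn, seg in split_by_language(text)
            if not is_cn and seg.strip()]
```

For example, non_chinese_parts("你好 hello 世界") returns ['hello']. The extracted segments could then be written one per line to the input file of the translation script.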
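The Python 2 script below, translate.py, handles the translation step: it reads the input file line by line and writes the Chinese translation of each line to the output file.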
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf8")

import json      # parse the JSON response
import urllib    # urllib.quote, used to URL-encode the text
from urllib2 import urlopen, URLError, HTTPError

def translate(inputFile, outputFile):
    fin = open(inputFile, 'r')     # open the input file for reading
    fout = open(outputFile, 'w')   # open the output file for writing

    for eachLine in fin:           # read the file line by line
        line = eachLine.strip()    # strip leading and trailing whitespace

        quoteStr = urllib.quote(line)   # URL-encode the text so it can be placed in the query string
        url = 'http://openapi.baidu.com/public/2.0/bmt/translate?client_id=WtzfFYTtXyTocv7wjUrfGR9W&q=' + quoteStr + '&from=auto&to=zh'
        try:
            resultPage = urlopen(url)   # call the Baidu translation API
        except HTTPError as e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
            continue                    # no response to parse, so skip this line
        except URLError as e:
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
            continue
        except Exception as e:
            print 'translate error.'
            print e
            continue

        resultJson = resultPage.read().decode('utf-8')   # the API returns the result as JSON
        try:
            js = json.loads(resultJson)   # convert the JSON text into a Python dictionary
        except Exception as e:
            print 'loads Json error.'
            print e
            continue

        key = u"trans_result"
        if key in js:
            outStr = js["trans_result"][0]["dst"]   # the translated text
        else:
            outStr = line                           # translation failed: output the original text
        fout.write(outStr.strip().encode('utf-8') + '\n')   # write the result

    fin.close()
    fout.close()

if __name__ == '__main__':
    # The input and output file names are taken from the command line.
    translate(sys.argv[1], sys.argv[2])
After saving the program, run it from the Linux command line (the script uses Python 2 syntax, so python must refer to a Python 2 interpreter):
python translate.py myinput.txt myoutput.txt
The translation results are written to the output file myoutput.txt.
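The response handling in translate.py can also be isolated into a small function, which makes the fallback to the original text easier to test without a network call. The following is a Python 3 sketch under stated assumptions: the only response field it relies on is trans_result[0]["dst"], which is what translate.py reads; the function name extract_translation and the sample payloads are made up for illustration.

```python
import json


def extract_translation(response_text, original):
    """Return the first translated string, or `original` if none is present."""
    try:
        data = json.loads(response_text)
    except ValueError:
        # The body was not valid JSON; fall back to the original text.
        return original
    if not isinstance(data, dict):
        return original
    results = data.get("trans_result")
    if isinstance(results, list) and results and isinstance(results[0], dict):
        return results[0].get("dst", original)
    # No usable "trans_result" key: treat it as a failed translation.
    return original


# Made-up payloads, shaped like what translate.py expects.
ok = '{"trans_result": [{"dst": "你好"}]}'
failed = '{"error": "hypothetical failure payload"}'
print(extract_translation(ok, "hello"))
print(extract_translation(failed, "hello"))
```

Keeping the fallback in one place means a failed request and a malformed response are handled the same way: the original line is kept instead of being dropped from the output.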
3. Notes
(1) The first few lines of the program (reload(sys) and sys.setdefaultencoding("utf8")) are a common Python 2 idiom for avoiding the Chinese encoding errors that otherwise occur frequently. They are neither needed nor available in Python 3.
(2) The text to be translated must be URL-encoded (percent-encoded) with the urllib.quote function before it is placed in the request URL.
(3) In the URL, "&from=auto&to=zh" sets the language pair: from is followed by the code of the source language and to by the code of the target language. For example, zh means Chinese, en means English, and auto means the source language is detected automatically.
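Notes (2) and (3) can be illustrated with a short Python 3 sketch, in which urllib.parse.quote replaces the Python 2 urllib.quote. The endpoint is the one used in translate.py; the function name build_url and the "YOUR_CLIENT_ID" placeholder are hypothetical, and whether that endpoint still accepts requests is not verified here.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from translate.py; not verified to be currently available.
BASE_URL = "http://openapi.baidu.com/public/2.0/bmt/translate"


def build_url(text, client_id, source="auto", target="zh"):
    """Build the request URL, percent-encoding every parameter value."""
    params = {
        "client_id": client_id,
        "q": text,          # the text to translate
        "from": source,     # source language code, "auto" to detect it
        "to": target,       # target language code, "zh" for Chinese
    }
    # quote_via=quote encodes a space as %20, as urllib.quote does in Python 2;
    # the default (quote_plus) would encode it as "+".
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


# "YOUR_CLIENT_ID" is a placeholder, not a real key.
print(build_url("hello world", "YOUR_CLIENT_ID"))
print(build_url("hello", "YOUR_CLIENT_ID", source="en", target="zh"))
```

Building the query with urlencode rather than string concatenation keeps the encoding of all parameters in one place, so text containing characters such as & or = cannot break the query string.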
I hope this helps. Thank you.