Scraping bioinformatics data from the ZINC site with a Python crawler

  Recently my laboratory mentor gave me a task. They have a large number of SMILES expressions similar to this one: C(=C(c1ccccc1)c1ccccc1)c1ccccc1 (SMILES is a common way of expressing small-molecule structures in bioinformatics). For each SMILES expression they need to run a search on the ZINC site (a bioinformatics data site) and retrieve the corresponding ZINC number, small-molecule suppliers, and conformation predictions. The basic steps are as follows:

 After clicking Search, the page jumps to a detail page with more information. From that page we need to extract the ZINC number, small-molecule suppliers, structure predictions, CAS number, and so on, as follows:

 

 

 

   Doing this by hand is rather unrealistic: they have more than a thousand such SMILES expressions, and working through them one by one would be exhausting. So they thought of me and asked me to write a crawler that extracts this information automatically, once and for all. To be honest I agreed without hesitation. For one thing, I had written a similar program before; for another, I have not really been busy lately - apart from my thesis I have mostly been playing games every day, so I might as well take on a task and treat it as practice. Without further ado, let's get to work.

  Before writing any code there are a few questions we have to clear up; we cannot just charge ahead. First we need to know what the front-end/back-end interaction looks like when we type in a SMILES expression and click Search - in other words, exactly what HTTP requests are sent to the backend. Note that all we enter is a SMILES expression, yet the page jumps straight to http://zinc15.docking.org/substances/ZINC000001758809/ . This is almost certainly a redirect issued after the backend resolves the SMILES expression to a ZINC number - that is my guess, anyway, so let's check the actual behaviour. Right-click in the browser and choose Inspect to see exactly which requests the browser sends to the backend while we perform these operations.

 

 

As the screenshot above shows, as soon as we type in the SMILES expression, the browser immediately sends a request to the backend, and the page then displays an image of the small molecule. Clearly this request fetches the molecule's conformation information in order to render the image; we won't dig into that flow. What we need to find out is exactly which request yields the redirected address and thus the genuinely useful page. Click Search and keep watching:

Among the subsequent requests we find a key one (marked in red in the screenshot above) whose response body is a serial number, as shown below:

Don't underestimate this serial number. I don't know exactly what it stands for, but the later requests make its importance clear: the SMILES expression must be sent together with this serial number in a follow-up HTTP request in order to obtain that crucial redirected page, as shown below:

 

  At this point the request logic of the site is clear. We just need to have Python imitate the browser and send the same requests: first obtain this inchikey serial number, then send another request carrying the serial number together with the SMILES expression to get the redirected URL, and finally request that redirected URL to fetch the key page. Everything we need is contained in that redirected page, so all that remains is to parse the HTML and pull out the fields we want. The approach is clear, so we can get to the code.
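Before the full listing, here is a minimal sketch of just those two requests. It uses the same convert and redirect endpoints as the getRedirectUrl() function in the complete program below; the request headers and all error handling are left out, so treat it as an illustration of the flow rather than the finished crawler.

import urllib
import urllib2

smile = "C(=C(c1ccccc1)c1ccccc1)c1ccccc1"
encodeSmile = urllib.quote(smile, 'utf-8')

# Step 1: ask the site to convert the SMILES expression into its inchikey "serial number"
url = 'http://zinc15.docking.org/apps/mol/convert?from=' + encodeSmile + '&to=inchikey&onfail=error'
inchikey = urllib2.urlopen(url).read()

# Step 2: search with the SMILES plus the inchikey; the server answers with a redirect,
# and geturl() gives us the final substance page address
url = 'http://zinc15.docking.org/substances/?highlight=' + encodeSmile + '&inchikey=' + inchikey + '&onperfect=redirect'
print urllib2.urlopen(url).geturl()   # e.g. http://zinc15.docking.org/substances/ZINC000001758809/

With that flow in mind, the complete Python code is as follows: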

 

#coding=utf-8

'''
@Author: [email protected]
@Date: 2019-6-1
@Description: 
This crawler runs under Python 2.7 and will not run under Python 3. Before running it, name the file containing the
SMILES expressions SMILE.txt and place it in the same directory as this script. When executed, the crawler reads the
contents of SMILE.txt, fetches and analyses the corresponding ZINC pages for each SMILES expression, and writes the
results to a file named after the current time in the current working directory.
PS: how fast the program runs depends on the current network speed and the size of SMILE.txt, so please be patient.
'''

import os, sys
import urllib
import urllib2
import json
import time
import re
from HTMLParser import HTMLParser
from datetime import datetime

headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
            "Host": "zinc15.docking.org",
            "Referer": "http://zinc15.docking.org/substances/home/",
            "Upgrade-Insecure-Requests": "1",
            "Cookie": "_ga=GA1.2.1842270709.1559278006; _gid=GA1.2.1095204289.1559278006; _gat=1; session=.eJw9zLEKgzAQANBfKTd3qcRFcEgJBIdLQE7hbhFqW6pRC20hGPHf26nvA94G3XCFYoPDBQoQCjlm7YCzTDLWk6N2xBSi2CoKcXSzjGJ0zqkvYT9C_37du88z3JZ_gXP98MTJWY6eesXUKG85RwonTs3q6BzEyOQMrmirzCWtUJe_bv8CllwtkQ.D9Kh8w.M2p5DfE-_En2mAGby_xvS01rLiU",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36"
            }

# Parse the SEA Predictions table
def getSeaPredictions(html):
    seaPrediction = []
    begin = html.find("SEA Predictions", 0)
    if(begin == -1):
        return seaPrediction
    end = html.find("</tbody>", begin)
    left = html.find("<td>", begin)
    right = html.find("</td>", left)
    pattern = re.compile('>(.*?)<')
    while(right < end):
        str = html[left:right+5]
        str = str.replace("\n", "").replace("\t", "")
        str = ''.join(pattern.findall(str))
        str = str.strip()
        seaPrediction.append(' '.join(str.split()))
        left = html.find("<td>", right)
        right = html.find("</td>", left)
        if(left == -1 or right == -1):
            break
    return seaPrediction

# Parse the vendors
def getVendors(zincNum):
    url = "http://zinc15.docking.org/substances/" + zincNum + "/catitems/subsets/for-sale/table.html"
    request = urllib2.Request(url, headers = headers)
    response = urllib2.urlopen(request)
    html = response.read()
    
    # Collect the end positions of every "More about " field; the vendor information sits right after it
    indexList = []
    begin = 0
    index = html.find("More about ", begin)
    while(index != -1):
        indexList.append(index + 11)
        begin = index + 11
        index = html.find("More about ", begin)
    
    vendors = []
    pattern = re.compile('>(.*?)<')
    for i in range(len(indexList)):
        begin = indexList[i]
        end = html.find('">', begin)
        vendors.append(html[begin:end])
        
        begin = html.find("<td>", end)
        end = html.find("</td>", begin)
        str = html[begin:end+5]
        vendors.append(''.join(pattern.findall(str)))
        
    return vendors
    
# Parse the CAS numbers
def getCasNum(html):
    result = re.search("<dt>CAS Numbers</dt>", html)
    if(result == None):
        return "None"
    begin = result.span()[1]
    begin = html.find("<dd>", begin, len(html))
    begin = begin + 4
    end = html.find("</dd>", begin, len(html))
    if(begin + 1 >= end):
        return "None"
    str = html[begin:end]
    casNumList = re.findall('[0-9]+-[0-9]+-[0-9]+', str)
    if(len(casNumList) == 0):
        return "None"
    casNumStr = ""
    for i in range(len(casNumList)):
        casNumStr = casNumStr + casNumList[i]
        if(i != len(casNumList)-1):
            casNumStr = casNumStr + ","
    return casNumStr

# Parse the ZINC number
def getZincNum(html):
    result = re.search("Permalink", html)
    if result is None:
        return None
    else:
        begin = result.span()[1]
        while(html[begin] != '\n'):
            begin = begin + 1
        begin = begin + 1
        end = begin
        while(html[end] != '\n'):
            end = end + 1
        zincNum = html[begin:end]
        return zincNum.strip()

# Parse the page data and write it to the output file
def parseHtmlAndWriteToFile(smile, html, output):

    zincNum = getZincNum(html)
    if zincNum is None:
        print "ZINC number:\tNone"
        output.write("ZINC number:\tNone\n")
        return
    else:
        print "ZINC number: " + zincNum
        output.write("ZINC number:\t" + zincNum + '\n')
    
    casNum = getCasNum(html)
    print "CAS numbers: " + casNum
    output.write("CAS numbers:\t" + casNum + '\n')
    
    output.write('\n')
    
    vendors = getVendors(zincNum)
    if(0 == len(vendors)):
        print "Vendors:\tNone"
        output.write("Vendors:\tNone\n")
    else:
        print "Vendors:\t"+str(len(vendors)/2)+" total"
        output.write("Vendors:\t"+str(len(vendors)/2)+" total\n")
        i = 0
        while(i < len(vendors)-1):
            output.write(vendors[i]+" | "+vendors[i+1]+"\n")
            i = i + 2
    
    output.write('\n')
    
    seaPrediction = getSeaPredictions(html)
    if(0 == len(seaPrediction)):
        print "SEA Prediction:\tNone"
        output.write("SEA Prediction:\tNone\n")
    else:
        print "SEA Prediction:\t"+str(len(seaPrediction)/5)+" total"
        output.write("SEA Prediction:\t"+str(len(seaPrediction)/5)+" total\n")
        i = 0
        while(i < len(seaPrediction)-4):
            output.write(seaPrediction[i] + " | " + seaPrediction[i+1] + " | " + seaPrediction[i+2] + " | " + seaPrediction[i+3] + " | " + seaPrediction[i+4] +"\n")
            i = i + 5
        
# Request the redirected URL and fetch the page data
def getPage(url):
    request = urllib2.Request(url, headers = headers)
    response = urllib2.urlopen(request)
    html = response.read()
    return html


# Build the request URLs: first obtain the inchikey, then use it to get the redirected address
def getRedirectUrl(smile):
    encodeSmile = urllib.quote(smile, 'utf-8')
    url = 'http://zinc15.docking.org/apps/mol/convert?from=' + encodeSmile + '&to=inchikey&onfail=error'
    request = urllib2.Request(url, headers = headers)
    response = urllib2.urlopen(request)
    inchikey = response.read()
    
    url = 'http://zinc15.docking.org/substances/?highlight=' + encodeSmile + '&inchikey=' + inchikey + '&onperfect=redirect'
    request = urllib2.Request(url, headers = headers)
    newUrl = urllib2.urlopen(request).geturl()
    return newUrl


def main():
    inputFilename = "D:\python\SMILE.txt"
    #outputFilename = datetime.now().strftime('%Y-%m-%d-%H-%M-%S') + ".txt"
    outputFilename = "result.txt"
    with open(inputFilename, "r") as input, open(outputFilename, "w") as output:
        for line in input.readlines():
            smile = line.strip()
            print "SMILE:\t" + smile
            output.write("SMILE:\t" + smile + '\n')
            newUrl = getRedirectUrl(smile)
            print newUrl
            html = getPage(newUrl)
            parseHtmlAndWriteToFile(smile, html, output)
            print '\n'
            output.write('\n\n\n')

if __name__ == "__main__":
    main()

 Below is a screenshot of the program running:

 

And a screenshot of the text information finally scraped:

 

 Problem solved perfectly!

 

To wrap up, here are the points to keep in mind when writing a crawler:

1. Work out the logic of the interaction between the site's front end and back end. Be clear about which pages your crawler needs, which pages contain the key information, and how to construct the requests that fetch them. Once the logic is clear, the approach follows, and the code is quick to write.

2. Be flexible when parsing the HTML: regular expressions and search-and-filter operations can be freely mixed. Parsing HTML ultimately comes down to string processing, and poorly formed HTML in particular is a real test of programming skill.
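As a tiny illustration of that mixing, the snippet below uses the same trick as getSeaPredictions() above - str.find() to locate table cells plus a regular expression to strip the tags - applied to a made-up HTML fragment (the fragment and variable names exist only for this example):

import re

html = "<tbody><td>\n  Protein <b>kinase</b> A\t</td><td> 1.23 </td></tbody>"   # made-up fragment

pattern = re.compile('>(.*?)<')   # grabs the text sitting between a '>' and the next '<'
cells = []
left = html.find("<td>")
while left != -1:
    right = html.find("</td>", left)
    if right == -1:
        break
    raw = html[left:right+5].replace("\n", "").replace("\t", "")
    cells.append(' '.join(''.join(pattern.findall(raw)).split()))
    left = html.find("<td>", right)

print cells   # ['Protein kinase A', '1.23']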

 

That's it. This may well be the last project of my undergraduate years. Four years of university life are about to end, and I have a lot of mixed feelings. After graduation I will start working; I feel both anticipation and some anxiety. The road to becoming a programmer has not been an easy one for me. I hope everything goes smoothly from here on - best wishes to all of us!

 

Origin www.cnblogs.com/jeysin/p/10962316.html