I started out in Java, but the company does PHP: after a few days reading the thinkphp docs I built a website, and so embarked on a road of no return. Overall it's much the same, and the style still feels like Java: build a bunch of objects, write methods, use inheritance, do MVC.
Today I set off down another road of no return: Python. I'm taking the wild route, without formally reading the basic syntax in the documentation tutorials; I prefer to poke around here and there. My respects to those who study the proper way!
I will explain each line of code; please see the official documentation for the precise meaning.
The full script:
import urllib2
from bs4 import BeautifulSoup

class indexAction:
    url = 'http://www.baidu.com'

    def getHtml(self, url):
        request = urllib2.Request(url)
        request.add_header('user-agent', 'Mozilla/5.0')
        response = urllib2.urlopen(request)
        return response

if __name__ == '__main__':
    index = indexAction()
    html = index.getHtml(index.url)
    html_str = html.read()
    soup = BeautifulSoup(html_str, 'html.parser', from_encoding='utf-8')
    links = soup.find_all('a')
    for link in links:
        print link.name, link.get('href'), link.get_text()
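A side note before the line-by-line walkthrough: the script above is Python 2 only. In Python 3, urllib2 was split into urllib.request and urllib.error, and print became a function. A rough equivalent sketch (the function names here are my own; only the urlopen call actually touches the network):

```python
import urllib.request


def build_request(url):
    """Assemble a Request object; no network traffic happens yet."""
    request = urllib.request.Request(url)
    request.add_header('user-agent', 'Mozilla/5.0')
    return request


def fetch(url):
    """Perform the actual round trip and return the body as bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

Usage would be e.g. `html_str = fetch('http://www.baidu.com')`, then feed `html_str` to BeautifulSoup exactly as in the script above.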
import urllib2
import is the keyword used to bring in a module
urllib2 is the Python 2 standard-library module for opening URLs over network protocols such as HTTP. Look up the details yourself.
Reference: http://python.usyiyi.cn/translate/python_278/library/urllib2.html
from bs4 import BeautifulSoup
bs4 (Beautiful Soup 4) is a third-party Python package for parsing text such as HTML; it needs to be installed before use.
Reference: http://helanhe.iteye.com/admin/blogs/2394806
BeautifulSoup is the class in bs4 that we use to find the content we need
class indexAction:
class is a Python keyword
indexAction is the class name I defined
The colon introduces the body of a block in Python; it plays the role of the curly braces in Java
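To make the colon-and-indentation point concrete, here is a minimal class of my own (not from the scraper) showing where Java would put braces:

```python
class Greeter:              # the colon opens the class body
    name = 'python'         # class attribute, no type declaration needed

    def hello(self):        # the colon opens the method body
        return 'hello ' + self.name
```

Calling `Greeter().hello()` returns `'hello python'`.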
url = 'http://www.baidu.com';
This defines a variable. Python is dynamically typed, a bit like PHP's weak typing, so variables don't need a declared type. Statements also don't need a terminating semicolon in Python; I add one out of habit because it looks more comfortable to me.
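A quick illustration of that point: the same name can be rebound to values of different types, and the semicolon is purely stylistic (the variable here is made up for the demo):

```python
page = 'http://www.baidu.com';   # the semicolon is allowed but unnecessary
assert isinstance(page, str)     # no type was declared, yet it's a str

page = 42                        # rebinding the same name to another type is fine
assert isinstance(page, int)
```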
def getHtml(self,url):
def is a keyword, roughly equivalent to function in PHP or a method declaration in Java; it indicates that this is a method or function
getHtml(self, url) getHtml is two parameters in the method name,
I don't know what the first parameter self is used for, but it should be there anyway. For information, please see http://python.jobbole.com/81921/,
The second parameter is the parameter url we need to pass in
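What self amounts to: Python fills in the instance itself as the first argument of every instance-method call, so `index.getHtml(url)` is effectively `indexAction.getHtml(index, url)`. A small demonstration (the class here is made up for illustration):

```python
class Demo:
    def whoami(self):
        # 'self' is simply the instance the method was called on
        return id(self)

d = Demo()
# These two calls are equivalent: the instance fills in 'self'.
assert d.whoami() == Demo.whoami(d)
```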
request = urllib2.Request(url)
This builds a request for the URL through urllib2 and gives us back a Request object. At this point no actual network request has been made; we are still assembling it.
request.add_header('user-agent', 'Mozilla/5.0')
This adds an HTTP header to the request through the object returned above. We can set whatever headers we want, depending on our mood; here a User-Agent makes us look like a browser.
response = urllib2.urlopen(request)
This actually sends the request to the target URL through urllib2 and returns a response object for the target website. If you're not sure what it is, print it out and have a look.
if __name__ == '__main__':
This is the program entry point. When the file is run directly, __name__ equals '__main__' and this block executes; when the file is imported as a module, it doesn't.
Reference: http://www.dengfeilong.com/post/60.html
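A minimal sketch of the guard on its own: the guarded call runs only when the file is executed directly, not when another script imports it.

```python
def main():
    print('running as a script')

# __name__ is '__main__' only when this file is executed directly;
# on import, it is the module's own name, so main() is not called.
if __name__ == '__main__':
    main()
```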
html_str = html.read();
This reads the HTML of the target website as a string, via the read method built into the response object
soup = BeautifulSoup(html_str,'html.parser',from_encoding='utf-8');
This is actually instantiating a class, not calling an ordinary method; in Python you don't write the new keyword before a class name.
It takes three arguments here:
the first is the string to parse
the second is the parser to use
the third is the encoding of the input
Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
The call gives us back a BeautifulSoup object
links=soup.find_all('a');
This calls the find_all() method on the object above, passing in the name of the HTML tag you are looking for. There are many other ways to search; see the documentation for details.
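find_all works on any HTML string, not just a fetched page, so it's easy to experiment offline. A small sketch (the HTML snippet is made up; assumes bs4 is installed, and uses .get('href') so an anchor without an href doesn't raise a KeyError):

```python
from bs4 import BeautifulSoup

html = '<div><a href="http://a.example">one</a><a href="http://b.example">two</a></div>'
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a')                      # every <a> tag in the document
hrefs = [link.get('href') for link in links]    # .get returns None instead of raising
print(hrefs)                                    # prints the two example URLs
```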
for link in links:
    print link.name, link.get('href'), link.get_text()
Finally, the loop prints the output
As for the rest, figure it out on your own. I hear there's also crawling paginated content, crawling behind a login, and so on; I'll tell everyone once I've figured it out.