Python crawler road (1)

As for me, I come from Java. At my company I picked up PHP, read up on thinkphp for a few days, and built a website, which set me on a road of no return. In general it's all much the same: the syntax still feels like Java, so you build a bunch of objects, write some methods, do some inheritance, and wire up MVC.

 

Today I set off down yet another road of no return: Python. I'm taking the wild route, without formally working through the basic syntax in the documentation or tutorials; I just poke around here and there and figure things out as I go. The properly trained regulars will have to bear with me!

 

I will explain each line of code below; for the precise meaning of anything, please see the official documentation.

 

The full code first:

 

 

import urllib2
from bs4 import BeautifulSoup

class indexAction:

    # the page we want to crawl
    url = 'http://www.baidu.com'

    def getHtml(self, url):
        # build the request, pretend to be a browser, then actually send it
        request = urllib2.Request(url)
        request.add_header('user-agent', 'Mozilla/5.0')
        response = urllib2.urlopen(request)
        return response


if __name__ == '__main__':
    index = indexAction();
    html = index.getHtml(index.url);
    html_str = html.read();

    # parse the returned HTML string as utf-8
    soup = BeautifulSoup(html_str, 'html.parser', from_encoding='utf-8');

    # grab every <a> tag on the page
    links = soup.find_all('a');

    for link in links:
        print link.name, link['href'], link.get_text();
    pass
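
One note before the line-by-line walkthrough: urllib2 and the print statement above only exist in Python 2. If you happen to be on Python 3, my guess at a rough equivalent (my own sketch, not the original code) would swap in the standard urllib.request module:

# A rough Python 3 version of the crawler above; this is my own sketch.
# In Python 3, urllib2 was folded into urllib.request and print became a function.
import urllib.request
from bs4 import BeautifulSoup

request = urllib.request.Request('http://www.baidu.com',
                                 headers={'user-agent': 'Mozilla/5.0'})
html_str = urllib.request.urlopen(request).read()

soup = BeautifulSoup(html_str, 'html.parser', from_encoding='utf-8')
for link in soup.find_all('a'):
    print(link.name, link.get('href'), link.get_text())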
import urllib2

     import is the keyword for bringing in a module so that we can use it.

     urllib2 is the Python 2 standard-library module responsible for network protocols and for opening URLs; dig into the details yourself.

     Documentation: http://python.usyiyi.cn/translate/python_278/library/urllib2.html
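
     To get a feel for it, here is a minimal sketch of fetching a page with urllib2 on its own (Python 2):

# Minimal urllib2 usage (Python 2): open a URL and poke at the response object.
import urllib2

response = urllib2.urlopen('http://www.baidu.com')
print response.getcode()    # HTTP status code, e.g. 200
print response.geturl()     # the final URL, after any redirects
print len(response.read())  # how many bytes of HTML came back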

 

from bs4 import BeautifulSoup

    bs4 (Beautiful Soup 4) is a third-party Python library responsible for parsing HTML text. It is not built in, so it has to be installed first; see the link below, and the install command in the sketch further down.

    Installation and reference: http://helanhe.iteye.com/admin/blogs/2394806

    BeautifulSoup is the class in bs4 that we use to parse the page and find what we need.
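
    If bs4 isn't installed yet, the usual command is pip install beautifulsoup4. Here is a tiny parsing sketch on a hard-coded string, no network needed, just to show what the class does:

# Minimal Beautiful Soup usage on a hard-coded HTML string.
# Install the library first with:  pip install beautifulsoup4
from bs4 import BeautifulSoup

html = '<html><body><a href="/index">home</a> <a href="/about">about</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    print a['href'], a.get_text()    # prints /index home, then /about about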

 

class indexAction:

    class is the Python keyword for defining a class

    indexAction is a class name I defined

    The colon, together with the indentation that follows it, marks the body of the block in Python; it plays the role of the curly braces in Java.
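
    A tiny illustration of the colon-plus-indentation style (the demoAction name below is made up for the example):

# The colon plus indentation plays the role of Java's { ... }.
class demoAction:
    name = 'demo'              # a class attribute

    def hello(self):           # a method; its body is the indented block below
        print 'hello', self.name

demo = demoAction()
demo.hello()                   # prints: hello demo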

 

url = 'http://www.baidu.com';

    This defines a variable. Python is dynamically typed, a bit like PHP, so a variable does not need a declared type. Semicolons are not required in Python either; I just keep them out of habit because the code looks more comfortable to me that way.
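
    A quick illustration of that dynamic typing:

# No declared types, no required semicolons; a name can be rebound to anything.
url = 'http://www.baidu.com'   # a string
print type(url)                # <type 'str'>
url = 42                       # the same name now holds an integer
print type(url)                # <type 'int'>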

def getHtml(self,url):

    def is the keyword for defining a method or function (like function in PHP or JavaScript; Java has no equivalent keyword)

    getHtml(self, url) getHtml is two parameters in the method name,

    The first parameter, self, refers to the instance the method is called on, roughly like this in Java; Python passes it in automatically, but it still has to be declared. For details, see http://python.jobbole.com/81921/

    The second parameter is the url we need to pass in.
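
    A small sketch of what self does (pageAction below is a made-up example class):

# self is the instance the method was called on, a bit like this in Java.
class pageAction:
    def setUrl(self, url):
        self.url = url            # store url on this particular instance

    def show(self):
        print self.url

page = pageAction()
page.setUrl('http://www.baidu.com')   # Python fills in self = page automatically
page.show()                           # prints: http://www.baidu.com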

 

request = urllib2.Request(url)

    This builds a Request object for the URL via urllib2 and hands it back as an object. At this point nothing has actually gone over the network; the request is only being assembled.

request.add_header('user-agent', 'Mozilla/5.0')

    This adds an HTTP header to the request object we just built. We can add whatever headers we want, depending on our mood.

  

response = urllib2.urlopen(request)

    This sends the request to the target URL through urllib2 and returns a response object from the target site. If you're not sure what's inside it, print it out and have a look.
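
    Putting those three lines together, with a timeout and some error handling that I added myself (they are not in the original code):

# Build the request, attach headers, send it, and inspect the response (Python 2).
import urllib2

request = urllib2.Request('http://www.baidu.com')         # nothing is sent yet
request.add_header('user-agent', 'Mozilla/5.0')           # pretend to be a browser
request.add_header('accept-language', 'zh-CN,zh;q=0.9')   # any header you like

try:
    response = urllib2.urlopen(request, timeout=10)       # the request goes out here
    print response.getcode()                              # e.g. 200
    print response.info().getheader('Content-Type')       # a response header
except urllib2.URLError as e:
    print 'request failed:', e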

if __name__ == '__main__':

    This is the entry point of the script. Why write it like this? The block below it only runs when the file is executed directly, not when it is imported as a module. More here:

    http://www.dengfeilong.com/post/60.html
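
    A tiny demonstration of the difference:

# Save this as demo.py and try both ways of using it:
#   python demo.py   -> prints "__name__ is __main__" and then "running as a script"
#   import demo      -> prints "__name__ is demo" only; the if block is skipped
print '__name__ is', __name__

if __name__ == '__main__':
    print 'running as a script'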

   

html_str = html.read();

    This reads the HTML of the target site from the response object via its built-in read() method; the return value is a string.

soup = BeautifulSoup(html_str, 'html.parser', from_encoding='utf-8');

    This actually instantiates a class rather than calling an ordinary function; in Python you don't need the new keyword, you just write the class name with parentheses.

    It takes three arguments:

    the first is the string to parse,

    the second is the parser to use ('html.parser' here),

    the third is the encoding of the source text (from_encoding='utf-8').

    https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html 

    and it finally hands us back a soup object.

links = soup.find_all('a');

    This calls find_all() on the soup object above and passes in the name of the HTML tag we are looking for. There are many other ways to search; see the documentation for details, or the sketch just below.
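
    A few other common search patterns, reusing the soup object built above (the id and class values are made up for illustration):

# Other common Beautiful Soup searches (attribute values below are made up).
first = soup.find('a')                        # only the first <a>
by_id = soup.find_all('a', id='nav')          # filter by an attribute
by_class = soup.find_all('a', class_='mnav')  # "class" is reserved, so class_
limited = soup.find_all('a', limit=5)         # at most five results
several = soup.find_all(['a', 'img'])         # several tag names at once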

    

for link in links:
    print link.name, link['href'], link.get_text();
pass

    Finally, the loop prints each link's tag name, its href attribute, and its text.
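
    One small caveat: not every <a> tag has an href, and link['href'] raises a KeyError when it is missing. A slightly more defensive version of the loop (my tweak, not in the original):

# Defensive variant: link.get('href') returns None instead of raising KeyError.
for link in links:
    href = link.get('href')
    if href:
        print link.name, href, link.get_text().strip()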

 

    As for the rest, you're on your own for now. I hear there are also things like crawling paginated content and crawling pages that require login; once I've figured those out I'll write them up for everyone.
