Python 3 Crawler (1): Web Crawling with urllib

Web Crawler

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Less commonly used names include ant, autoindex, emulator, and worm.

(Source: Baidu Encyclopedia; see https://baike.baidu.com/item/web crawler/5162711?fr=aladdin&fromid=22046949&fromtitle=%E7%88%AC%E8%99%AB)

The code and steps below are adapted from http://cuijiahua.com and https://blog.csdn.net/c406495762/article/details/58716886

urllib

urllib is a package for working with URLs. It bundles the following modules:

  1. urllib.request: opens and reads URLs.
  2. urllib.error: contains the exceptions raised by urllib.request, which can be caught with try/except.
  3. urllib.parse: contains functions for parsing URLs.
  4. urllib.robotparser: parses robots.txt files. It provides a single class, RobotFileParser, whose can_fetch() method tests whether the crawler is allowed to download a page.
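To make the list above a little more concrete, here is a minimal sketch of urllib.parse and urllib.robotparser that runs without network access. The URL with its query string, the crawler name "my-crawler", and the robots.txt rules are all made up for illustration; they are not taken from any real site.

```python
from urllib import parse, robotparser

# Hypothetical URL, used only to illustrate the parsing functions.
url = "http://www.cnblogs.com/ContactUs.aspx?page=2&tag=python"

# urllib.parse: split a URL into its components.
parts = parse.urlsplit(url)
print(parts.scheme)   # http
print(parts.netloc)   # www.cnblogs.com
print(parts.path)     # /ContactUs.aspx

# Turn the query string into a dict that maps each key to a list of values.
query = parse.parse_qs(parts.query)
print(query)          # {'page': ['2'], 'tag': ['python']}

# Build a query string from a dict; the space is escaped as "+".
print(parse.urlencode({"q": "web crawler"}))   # q=web+crawler

# urllib.robotparser: the (made-up) rules are passed in directly instead of
# being downloaded, so the example does not need a network connection.
rules = robotparser.RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])
allowed = rules.can_fetch("my-crawler", "http://example.com/index.html")
blocked = rules.can_fetch("my-crawler", "http://example.com/private/data.html")
print(allowed, blocked)   # True False
```

In a real crawler the rules would normally be loaded from the site itself by calling set_url() with the address of its robots.txt and then read(), before asking can_fetch().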

urllib_test01.py

from urllib import request

if __name__ == "__main__":
    # Open the URL and read the raw response body (a bytes object)
    response = request.urlopen("http://i.cnblogs.com")
    html = response.read()
    print(html)

Output:

>>>
 RESTART: C:\Users\DELL\AppData\Local\Programs\Python\Python36\urllib_test01.py
b'\r\n<!DOCTYPE html>\r\n<html>\r\n<head>\r\n    <meta charset="utf-8" />\r\n    <meta name="viewport" content="width=device-width" />\r\n    <title>\xe7\x94\xa8\xe6\x88\xb7\xe7\x99\xbb\xe5\xbd\x95 - \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</title>\r\n    <link rel="stylesheet" href="/scripts/bootstrap/css/bootstrap.min.css" />\r\n    <link href="/scripts/ladda/ladda-themeless.min.css" rel="stylesheet" />\r\n    <link href="/css/signin_bundle.css?v=L6jW_dned1XSxz8ohN2oMp1Q1fPUq1W5sWqqw6HNaH01" type="text/css" rel="stylesheet" />   \r\n    <script src="/scripts/jquery.min.js"></script>\r\n    <script src="/scripts/bootstrap/js/bootstrap.min.js"></script>\r\n    <script src="/scripts/ladda/spin.min.js"></script>\r\n    <script src="/scripts/ladda/ladda.min.js"></script>\r\n    <script src="/scripts/jsencrypt.min.js"></script>\r\n    <script>\r\n        var return_url = \'http://i.cnblogs.com/\';\r\n        var ajax_url = \'/user\' + \'/signin\';\r\n        var enable_captcha = false;\r\n        var is_in_progress = false;\r\n    </script>\r\n    <script src="/scripts/signin_bundle.js?v=1spnpY8gb0K9MfNetxJoLoPjd7dN7PIKB8kMqcak-RQ1"></script>\r\n\r\n</head>\r\n<body onload="setFocus()">\r\n    <div style="width: 100%;">\r\n        <div align="center">\r\n            <div id="Main">\r\n                <noscript>\r\n                    <div style="font-size:15px;margin-bottom:20px;">\r\n                        \xe6\x82\xa8\xe7\x9a\x84\xe6\xb5\x8f\xe8\xa7\x88\xe5\x99\xa8\xe6\x9c\xaa\xe5\x90\xaf\xe7\x94\xa8Javascript\xef\xbc\x8c\xe6\x97\xa0\xe6\xb3\x95\xe8\xbf\x9b\xe8\xa1\x8c\xe7\x99\xbb\xe5\xbd\x95\xe3\x80\x82\r\n                    </div>\r\n                    <style>\r\n                        form {\r\n                            display: none;\r\n                        }\r\n                    </style>\r\n                </noscript>\r\n                <form method="post" onsubmit="return false;">\r\n                    <div 
id="Heading">\xe7\x99\xbb\xe5\xbd\x95\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad - \xe4\xbb\xa3\xe7\xa0\x81\xe6\x94\xb9\xe5\x8f\x98\xe4\xb8\x96\xe7\x95\x8c</div>\r\n                    <div class="block">\r\n                        <label class="label-line">\xe7\x99\xbb\xe5\xbd\x95\xe7\x94\xa8\xe6\x88\xb7\xe5\x90\x8d(<a href="/GetUsername.aspx" tabindex="-1" class="tb_right">\xe6\x89\xbe\xe5\x9b\x9e</a>)</label>\r\n                        <input type="text" id="input1" value="" class="input-text" onkeydown="check_enter(event)" /> <span id="tip_input1" class="tip"></span>\r\n                    </div>\r\n                    <div class="block">\r\n                        <label class="label-line">\xe5\xaf\x86\xe7\xa0\x81(<a href="/GetMyPassword.aspx" tabindex="-1" class="tb_right">\xe9\x87\x8d\xe7\xbd\xae</a>)</label>\r\n                        <input type="password" id="input2" value="" class="input-text"  onkeydown="check_enter(event)" /> <span id="tip_input2" class="tip"></span>\r\n                    </div>\r\n\r\n                    <div class="modal fade" id="checkWay" tabindex="-1" role="dialog" aria-hidden="true">\r\n                        <div class="modal-dialog">\r\n                            <div class="modal-content center-block">\r\n                                <div class="modal-header">\r\n                                    <button type="button" class="close" data-dismiss="modal"><span aria-hidden="true">×</span><span class="sr-only">Close</span></button>\r\n                                    <h4 class="modal-title">\r\n                                        \xe8\xaf\xb7\xe5\xae\x8c\xe6\x88\x90\xe4\xba\xba\xe6\x9c\xba\xe8\xaf\x86\xe5\x88\xab\xe9\xaa\x8c\xe8\xaf\x81\r\n                                    </h4>\r\n                                </div>\r\n                                <div class="modal-body">\r\n                                    <div id="showLoading" class="ladda-button"data-style="zoom-in"></div>\r\n                                 
   <div id="captchaBox" class="center-block">\r\n                                        <span id="geetestLoading"> \xe9\xaa\x8c\xe8\xaf\x81\xe7\xa0\x81\xe7\xbb\x84\xe4\xbb\xb6\xe5\x8a\xa0\xe8\xbd\xbd\xe4\xb8\xad,\xe8\xaf\xb7\xe7\xa8\x8d\xe5\x90\x8e...</span>\r\n                                    </div>\r\n                                </div>\r\n                            </div>\r\n                        </div>\r\n                    </div>\r\n\r\n                    <div class="block">\r\n                        <input id="remember_me" type="checkbox" name="remember_me" onkeydown="check_enter(event)" /><label for="remember_me" onkeydown="check_enter(event)">\xe4\xb8\x8b\xe6\xac\xa1\xe8\x87\xaa\xe5\x8a\xa8\xe7\x99\xbb\xe5\xbd\x95</label>\r\n                    </div>\r\n                    <div class="block">\r\n                        <input type="submit" id="signin" class="button" value="\xe5\x8a\xa0\xe8\xbd\xbd\xe4\xb8\xad..." /> <span id="tip_btn" class="tip"></span>\r\n                    </div>\r\n                    <div class="block nav">\r\n                        » <a href="/register.aspx?ReturnUrl=http://i.cnblogs.com/" title="\xe6\xb3\xa8\xe5\x86\x8c\xe6\x88\x90\xe4\xb8\xba\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad\xe7\x94\xa8\xe6\x88\xb7">\xe7\xab\x8b\xe5\x8d\xb3\xe6\xb3\xa8\xe5\x86\x8c</a><br />\r\n                        » <a href="http://www.cnblogs.com/ContactUs.aspx">\xe5\x8f\x8d\xe9\xa6\x88\xe9\x97\xae\xe9\xa2\x98</a>\r\n                    </div>\r\n                </form>\r\n                <div style="clear: both" />\r\n            </div>\r\n        </div>\r\n    </div>\r\n</body>\r\n</html>\r\n'
>>>

What we get back from the site is not readable text but a bytes object: response.read() returns the page exactly as the server sent it, still encoded. A browser normally decodes and parses this data before showing it to us. We can do the decoding step ourselves by calling the decode() method with the page's encoding, which is UTF-8 here (the page declares it in its <meta charset="utf-8" /> tag). The updated code is:

from urllib import request

if __name__=="__main__":
    response=request.urlopen("http://i.cnblogs.com")
    html=response.read()
    html = html.decode("utf-8")
    print(html)
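The listing above hard-codes "utf-8" and does not handle failed requests. As a rough sketch of one way to deal with both, the helper below reads the charset from the response's Content-Type header when the server sends one, and catches the exceptions from urllib.error. The function name fetch_text, its parameters, and the choice to return None on failure are assumptions made for this example, not part of the original code.

```python
from urllib import error, request

# The first bytes of the <title> text from the raw output, decoded by hand.
raw_title = b"\xe7\x94\xa8\xe6\x88\xb7\xe7\x99\xbb\xe5\xbd\x95"
print(raw_title.decode("utf-8"))   # 用户登录


def fetch_text(url, fallback_encoding="utf-8", timeout=10):
    """Return the decoded body of `url`, or None if the request fails.

    Hypothetical helper: the name, parameters and None-on-error
    behaviour are choices made for this sketch.
    """
    try:
        with request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Charset from the Content-Type header, if the server sent one.
            charset = response.headers.get_content_charset()
    except error.HTTPError as exc:
        # The server answered, but with an error status such as 404.
        print("HTTP error:", exc.code)
        return None
    except error.URLError as exc:
        # The request itself failed (bad host name, refused connection, ...).
        print("URL error:", exc.reason)
        return None
    # Undecodable bytes are replaced instead of raising UnicodeDecodeError.
    return raw.decode(charset or fallback_encoding, errors="replace")
```

With this helper, print(fetch_text("http://i.cnblogs.com")) would print the same decoded page as the listing above when the request succeeds. HTTPError is caught before URLError because it is a subclass of URLError.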

The output is now readable text:

Python 3.6.3 (v3.6.3:2c5fed8, Oct  3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>>
 RESTART: C:\Users\DELL\AppData\Local\Programs\Python\Python36\urllib_test01.py


<!DOCTYPE html>

<html>

<head>

    <meta charset="utf-8" />

    <meta name="viewport" content="width=device-width" />

    <title>User Login - Blog Park</title>

    <link rel="stylesheet" href="/scripts/bootstrap/css/bootstrap.min.css" />

    <link href="/scripts/ladda/ladda-themeless.min.css" rel="stylesheet" />

    <link href="/css/signin_bundle.css?v=L6jW_dned1XSxz8ohN2oMp1Q1fPUq1W5sWqqw6HNaH01" type="text/css" rel="stylesheet" />  

    <script src="/scripts/jquery.min.js"></script>

    <script src="/scripts/bootstrap/js/bootstrap.min.js"></script>

    <script src="/scripts/ladda/spin.min.js"></script>

    <script src="/scripts/ladda/ladda.min.js"></script>

    <script src="/scripts/jsencrypt.min.js"></script>

    <script>

        var return_url = 'http://i.cnblogs.com/';

        var ajax_url = '/user' + '/signin';

        var enable_captcha = false;

        var is_in_progress = false;

    </script>

    <script src="/scripts/signin_bundle.js?v=1spnpY8gb0K9MfNetxJoLoPjd7dN7PIKB8kMqcak-RQ1"></script>

 

</head>

<body onload="setFocus()">

    <div style="width: 100%;">

        <div align="center">

            <div id="Main">

                <noscript>

                    <div style="font-size:15px;margin-bottom:20px;">

                        您的浏览器未启用Javascript,无法进行登录。

                    </div>

                    <style>

                        form {

                            display: none;

                        }

                    </style>

                </noscript>

                <form method="post" onsubmit="return false;">

                    <div id="Heading">登录博客园 - 代码改变世界</div>

                    <div class="block">

                        <label class="label-line">登录用户名(<a href="/GetUsername.aspx" tabindex="-1" class="tb_right">找回</a>)</label>

                        <input type="text" id="input1" value="" class="input-text" onkeydown="check_enter(event)" /> <span id="tip_input1" class="tip"></span>

                    </div>

                    <div class="block">

                        <label class="label-line">密码(<a href="/GetMyPassword.aspx" tabindex="-1" class="tb_right">重置</a>)</label>

                        <input type="password" id="input2" value="" class="input-text"  onkeydown="check_enter(event)" /> <span id="tip_input2" class="tip"></span>

                    </div>

 

                    <div class="modal fade" id="checkWay" tabindex="-1" role="dialog" aria-hidden="true">

                        <div class="modal-dialog">

                            <div class="modal-content center-block">

                                <div class="modal-header">

                                    <button type="button" class="close" data-dismiss="modal"><span aria-hidden="true">&times;</span><span class="sr-only">Close</span></button>

                                    <h4 class="modal-title">

                                        请完成人机识别验证

                                    </h4>

                                </div>

                                <div class="modal-body">

                                    <div id="showLoading" class="ladda-button" data-style="zoom-in"></div>

                                    <div id="captchaBox" class="center-block">

                                        <span id="geetestLoading"> 验证码组件加载中,请稍后...</span>

                                    </div>

                                </div>

                            </div>

                        </div>

                    </div>

 

                    <div class="block">

                        <input id="remember_me" type="checkbox" name="remember_me" onkeydown="check_enter(event)" /><label for="remember_me" onkeydown="check_enter(event)">下次自动登录</label>

                    </div>

                    <div class="block">

                        <input type="submit" id="signin" class="button" value="加载中..." /> <span id="tip_btn" class="tip"></span>

                    </div>

                    <div class="block nav">

                        &raquo; <a href="/register.aspx?ReturnUrl=http://i.cnblogs.com/" title="注册成为博客园用户">立即注册</a><br />

                        &raquo; <a href="http://www.cnblogs.com/ContactUs.aspx">反馈问题</a>

                    </div>

                </form>

                <div style="clear: both" />

            </div>

        </div>

    </div>

</body>

</html>

 

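Automatically detecting a web page's encoding

Install the third-party library chardet, a module for detecting character encodings. Open cmd and run the following command to download and install it:

pip install chardet

The new code: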

 

# -*- coding: UTF-8 -*-
from urllib import request
import chardet

if __name__ == "__main__":
    response = request.urlopen("http://i.cnblogs.com/")
    html = response.read()
    charset = chardet.detect(html)
    print(charset)

 

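The result is a dictionary that tells us the page's encoding (under the "encoding" key), together with a confidence value for the guess.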
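One possible next step is to pass the detected encoding on to decode(). The sketch below does that, falling back to a default when chardet is not installed or cannot make a guess. The function name decode_page and the UTF-8 default are assumptions made for this example.

```python
try:
    import chardet  # third-party; installed with "pip install chardet"
except ImportError:
    chardet = None  # the sketch then always uses the default encoding


def decode_page(raw, default="utf-8"):
    """Decode raw bytes using chardet's guess, or `default` if there is none.

    Hypothetical helper written for this example.
    """
    encoding = None
    if chardet is not None:
        # detect() returns a dict; its "encoding" entry may be None
        # when chardet cannot decide (for example on empty input).
        encoding = chardet.detect(raw)["encoding"]
    # Undecodable bytes are replaced instead of raising UnicodeDecodeError.
    return raw.decode(encoding or default, errors="replace")
```

In the script above, html = decode_page(response.read()) would then replace the hard-coded decode("utf-8") call. Detection is a statistical guess, so for very short pages the declared charset of the page, when available, is usually the more reliable source.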

 
