Crawling basic data with Python

1. What is a web crawler?

    In the era of big data, collecting information is an important task, and the data on the Internet is massive. If we rely only on manual collection, it is not only inefficient and cumbersome, but collection costs also rise. How to automatically and efficiently obtain the information we are interested in on the Internet and put it to work is an important problem, and crawler technology came into being to solve it.

    A web crawler (also known as a web robot) can automatically collect and organize data on the Internet in our place. It is a program or script that follows certain rules to automatically fetch information from the World Wide Web, and it can automatically collect the content of every page it is able to access in order to obtain the relevant data.

    Functionally speaking, a crawler is generally divided into three parts: data acquisition, processing, and storage. A crawler starts from the URLs of one or several initial pages, obtains the URLs found on those pages, and, while crawling, keeps extracting new URLs from the current page and putting them into a queue until a stop condition of the system is met.
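To make this loop concrete, here is a minimal sketch of a queue-driven crawl, written with the requests library used throughout this article. The function name, the max_pages stop condition, and the regex-based link extraction are our own simplifications; a real crawler would also respect robots.txt, add politeness delays, and parse HTML properly.

import re
from collections import deque
from urllib.parse import urljoin

import requests

def crawl(seed_url, max_pages=10):
    queue = deque([seed_url])   # URLs waiting to be crawled
    seen = {seed_url}           # URLs already queued, to avoid repeats
    while queue and len(seen) < max_pages:   # stop condition: enough URLs collected
        url = queue.popleft()
        try:
            r = requests.get(url, timeout=5)
            r.raise_for_status()
        except requests.RequestException:
            continue
        # extract new URLs from the current page and put them into the queue
        for link in re.findall(r'href="([^"]+)"', r.text):
            full = urljoin(url, link)
            if full.startswith('http') and full not in seen:
                seen.add(full)
                queue.append(full)
    return seen

print(crawl('http://www.baidu.com'))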

 

2. The role of web crawlers

1. Implementing a search engine

    After we learn to write crawlers, we can use them to automatically collect information on the Internet and then store or process the collected data. When we need to retrieve certain information, we simply search the data we collected — in effect, a personal search engine.
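As a toy illustration of this idea (the URL list and the in-memory dict below are placeholders of our own, not a real index):

import requests

pages = {}   # url -> page text: our tiny collected "index"

for url in ['http://www.baidu.com', 'http://www.example.com']:
    try:
        r = requests.get(url, timeout=5)
        r.encoding = r.apparent_encoding
        pages[url] = r.text
    except requests.RequestException:
        pass

# "retrieving" means searching the collected data locally
query = 'Example'
print([url for url, text in pages.items() if query in text])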

 

What if you want to fetch the information of a simple website?

import requests

# fetch the Baidu homepage
r = requests.get('http://www.baidu.com')

# view the status code
print(r.status_code)

# specify the character encoding
r.encoding = 'utf-8'

print(r.text)

First, we import the requests library and fetch the Baidu site with requests.get. Then we use r.status_code to view the status code of the response; if it is 200, the request succeeded.

r.encoding specifies the character encoding, and finally calling r.text gives us the page content to print.
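If you are unsure which encoding to set, requests offers two guesses you can compare: r.encoding comes from the HTTP response headers, while r.apparent_encoding is inferred from the page content itself (the exact output varies by site):

import requests

r = requests.get('http://www.baidu.com')
print(r.encoding)           # encoding declared in the HTTP headers
print(r.apparent_encoding)  # encoding inferred from the page content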

 

Running the program gives exactly the output we wanted.

 

What if we want to get product information from a shopping site?

First, find a product on the site, open its page, and copy the link.

Then write the program; this time we use try and except.

import requests

url = 'https://item.jd.com/100008348542.html'

try:
    r = requests.get(url)
    r.raise_for_status()   # check the status; raises an exception if the code is not 200
    # use the encoding detected from the page content
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except:
    print("crawling failed")

Run it to see the result.

Isn't it simple?
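The same pattern can be wrapped into a reusable fetch function. Here is a minimal sketch; the name get_html and the timeout value are our own choices, not part of the example above:

import requests

def get_html(url, timeout=10):
    """Fetch a page and return its text, or None on failure."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # a non-200 status raises an HTTPError
        r.encoding = r.apparent_encoding  # guess the encoding from the content
        return r.text
    except requests.RequestException:
        return None

html = get_html('https://item.jd.com/100008348542.html')
print(html[:1000] if html else "crawling failed")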

Example 2

import requests

url = 'https://www.amazon.cn/dp/B06XGXXDV9?ref_=Oct_DLandingSV2_PC_45268aee_0&smid=A26HDXW89ZT98L'

try:
    kv = {"User-Agent": 'Mozilla/5.0'}
    r = requests.get(url, headers=kv)
    # headers=kv replaces our request headers with the entries in the kv dict.
    # If we do not change the headers, we visit the site with the default
    # identity of the requests library, which sometimes makes the request fail.
    # Changing the User-Agent to Mozilla/5.0 makes the server believe
    # we are visiting the site as an ordinary browser user.
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text)
    print(r.request.headers)
except:
    print("crawling failed")

If you visit a site and find that the information cannot be extracted, the server may have inspected your request headers and refused your access. In that case, you need to change the headers of your own request.

 

Before the change, the request carries the default requests identity.

After the change, the User-Agent has been modified.
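You can verify this yourself by inspecting r.request.headers before and after the change; httpbin.org is an echo service we use here purely for illustration:

import requests

# default identity: the User-Agent is something like python-requests/2.x
r = requests.get('https://httpbin.org/get')
print(r.request.headers['User-Agent'])

# after changing the header, the server sees a browser-like identity
r = requests.get('https://httpbin.org/get', headers={'User-Agent': 'Mozilla/5.0'})
print(r.request.headers['User-Agent'])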

 

If you want to type in a query and see the results Baidu returns, you can obtain them through Baidu's search interface.

We can see that when you type a search term (for example 图片), the wd parameter above changes along with the input. From this we can conclude that simply changing the value of wd lets us view the corresponding results.

Now let's implement a Baidu search in code:

import requests

keyword = input("Enter what you want to search for: ")

try:
    params = {"wd": keyword}                  # the query goes into the URL parameters
    headers = {"User-Agent": 'Mozilla/5.0'}   # the identity goes into the headers
    r = requests.get('http://www.baidu.com/s', params=params, headers=headers)
    r.raise_for_status()
    print(len(r.text))
except:
    print("crawling failed")

Looking at the result, we simply use len to check the length of the returned data.
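To confirm that the wd parameter really was attached to the request, you can also print the composed URL, a small check we add here for illustration:

import requests

r = requests.get('http://www.baidu.com/s',
                 params={"wd": "python"},
                 headers={"User-Agent": 'Mozilla/5.0'})
print(r.url)        # the full request URL, wd included
print(len(r.text))  # size of the returned page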

 

What if we want to fetch an image from the web and save it to the local hard disk?

 

First, right-click the image and copy the image address.

Then write the program:

import requests
import os

# image URL
url = 'http://b-ssl.duitang.com/uploads/item/201210/03/20121003220216_xTBdK.jpeg'
root = 'D://pics//'
path = root + url.split('/')[-1]   # file name taken from the last part of the URL

try:
    if not os.path.exists(root):   # if the directory does not exist, create it
        os.mkdir(root)
    if not os.path.exists(path):   # if the file does not exist, fetch it from the URL
        r = requests.get(url)
        with open(path, 'wb') as f:
            # r.content is the binary content of the response; write it to
            # the file (the with block closes the file for us)
            f.write(r.content)
            print("file saved successfully")
    else:
        print("file already exists")
except:
    print("crawling failed")

Then check whether the file is on drive D.

You will find that the image has been saved locally, under the pics directory.
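For large files it is better not to hold the whole response in memory at once. requests supports streaming downloads with stream=True and iter_content; here is a minimal sketch using the same image URL (the local file name pic.jpeg is our own choice):

import requests

url = 'http://b-ssl.duitang.com/uploads/item/201210/03/20121003220216_xTBdK.jpeg'

# stream=True downloads the body lazily instead of all at once
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open('pic.jpeg', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)   # write the file piece by piece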

 

 
