Image categories and the detailed image-crawling process under big data (Part 1)

I originally intended to use the images for work on image-text relevance, but the model did not work out in the end. Along the way I crawled 2 million images; if you need them, leave a comment and I will share them for free.

Detailed process for image categories and image crawling under big data: this section is split into two crawlers. The first crawls the image categories, i.e. the labels attached to the images. The second uses those categories as keywords to crawl the images themselves: once the first crawler has pulled down the category names, I feed them into the second crawler, give it the desired image-category keyword and a count, and it pulls the images down from Baidu's image gallery. A sketch of how the two crawlers fit together follows.
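To make the division of labor concrete, here is a minimal sketch of the pipeline. It assumes the first crawler has saved one category name per line in test.txt (as the code later in this post does); fetch_images_for_keyword is a hypothetical placeholder for the Baidu-gallery crawler covered in the second part, not its actual implementation.

def fetch_images_for_keyword(keyword, count):
    # hypothetical stand-in for the second crawler (covered in Part 2)
    print("would download %d images for keyword: %s" % (count, keyword))

# read back the categories written by the first crawler
with open("test.txt", encoding="utf-8") as f:
    categories = [line.strip() for line in f if line.strip()]

for keyword in categories:
    fetch_images_for_keyword(keyword, count=100)  # count is an assumed parameter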

(A) The first crawler: the process of crawling the categories and their images

First, since my text-processing work needs images paired with category labels, I want to collect the image categories. Then I need to find a suitable site, as shown in Figure 1.

The URL of one of its pages: http://so.sccnn.com/search/%D1%F9%BB%FA/1.html

 

Figure 1

 

Analyzing the site's URLs:

This is the URL of a single page, but I need every category page on the site. Here I noticed a pattern: the head and the tail of the URL stay the same, and only the page number, running from 1 to 242, changes. So I can write:

 

for i in range(1, 243):  # loop over all page URLs (pages 1-242)
    str_i = str(i)  # convert the page number to a string
    url = 'http://so.sccnn.com/search/%D1%F9%BB%FA/' + str_i + '.html'

With the code above you can generate the URLs for every page.
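As a quick sanity check (a sketch), collect the URLs into a list and print the endpoints to confirm the pattern really covers pages 1 through 242:

urls = ['http://so.sccnn.com/search/%D1%F9%BB%FA/' + str(i) + '.html'
        for i in range(1, 243)]
print(len(urls))   # 242
print(urls[0])     # http://so.sccnn.com/search/%D1%F9%BB%FA/1.html
print(urls[-1])    # http://so.sccnn.com/search/%D1%F9%BB%FA/242.html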

 

Now let's analyze the page:

First, open Chrome's developer tools with F12 (I have also installed an XPath helper extension, which makes it convenient to test queries against the page directly). With the debugger open, use the element-picker arrow and click on an image; this locates the image's tag in the page source. See Figure 2.

 

 

Figure 2

Once the tag is located, right-click it and choose Copy; you will see Copy XPath. This copied path is what the crawler needs in order to extract the images during parsing. The process is shown in Figure 3.


Figure 3

Next we need to test whether our XPath actually resolves: the XPath helper extension lets us do this directly in Chrome, as shown in Figure 4. Note the point at the top of Figure 4: on some sites the XPath copied from the browser does not work directly. You can see in Figure 4 that the result pane on the right is empty, so in this case we need to write our own XPath.

 

Figure 4

 

After rewriting the XPath, verify that it now returns the data, as shown in Figure 5: on the right you can clearly see that every category has been captured.

 

We then test the image paths with the same method: as Figure 5 shows, the image paths appear clearly on the right (the same check is reproduced in code right after the figure). With the preliminary work done, we can start writing the crawler.

 

Figure 5
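The browser check can also be reproduced in code. Below is a minimal sketch that runs the same two XPath queries against a single page with lxml; the expressions are the ones the finished crawler uses, and the expected count of 16 per page comes from the site layout described next.

import requests
from lxml import etree

# fetch one page and run the XPath queries we just verified in the browser
res = requests.get('http://so.sccnn.com/search/%D1%F9%BB%FA/1.html')
html = etree.HTML(res.content)

titles = html.xpath('//td/div/a/text()')  # category names
imgs = html.xpath('//td/div/a/img/@src')  # image addresses

print(len(titles), titles[:3])  # expect 16 categories on a full page
print(len(imgs), imgs[:3])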

 

There is one question to consider before writing the crawler: each page holds 16 items, and there are 242 pages in total. What method should I use to extract them one page at a time? That is the problem here, so I need a loop, and I also need to keep the categories matched with their images as I go, as handled in the code below.

 

Below is the crawler's source; this crawler is only used to fetch the categories and the images from this site. Here is its complete code:

title is the length of the category list I fetch, i.e. 16, since each page displays 16 categories, and i is the page number. So the range ends at 16 for page 1; on page 2, I take values between 16 and 32, which guarantees I always know which global image number I am up to. I reset n to 0 on every page, because within a page both the categories and the images are indexed 0-16, and since each category corresponds to its image, the image numbering stays matched.
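Before the full listing, a quick worked example of that index arithmetic, using the same variable names as the code below:

# page i covers global image numbers 16*(i-1) through 16*i - 1
for i in (1, 2, 3):
    title = 16 * i  # end of the range for page i (exclusive)
    x = title - 16  # start of the range
    print(i, x, title - 1)
# prints: 1 0 15, then 2 16 31, then 3 32 47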

 

import requests
from lxml import etree

# category names are saved as text in test.txt
file = open("test.txt", 'w', encoding='utf-8')
for i in range(1, 243):  # loop over every list page (pages 1-242)
    url = 'http://so.sccnn.com/search/%D1%F9%BB%FA/' + str(i) + '.html'
    res = requests.get(url)         # fetch the page
    html = etree.HTML(res.content)  # parse the raw HTML
    # data extraction
    title1 = html.xpath('//td/div/a/text()')  # category names
    imgs = html.xpath('//td/div/a/img/@src')  # image addresses

    # index bookkeeping for this page
    title = 16 * i  # end (exclusive) of this page's global image numbers
    x = title - 16  # start of this page's global image numbers
    n = 0           # position within the current page, 0-15
    for m in range(x, title):
        # save the category name
        file.write(title1[n] + "\n")
        # download the matching image
        with open('D:\\Spideimages\\' + str(m) + '.jpg', 'wb') as fd:
            picture = requests.get(imgs[n + 1]).content  # n + 1: the page's first <img> is skipped
            fd.write(picture)
            print("Downloaded %s.jpg" % m)
        n = n + 1
# close the category file
file.close()
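One caveat: the listing above assumes every request succeeds. On a 242-page run it may be worth giving each fetch a timeout and a status check, along these lines (a sketch using standard requests options):

res = requests.get(url, timeout=10)  # give up on a slow page instead of hanging
res.raise_for_status()               # raise an error on a bad HTTP status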

The category-and-image crawling work is complete.
