Notes on writing a crawler for a nickname website

A classmate of mine landed an internship ... and part of her job was to write a crawler in Python that collects ten thousand nicknames users could use. (Why do they get jobs and not me QAQ)

Then she came to me ... and when I sat down to write it, I found I had basically forgotten everything about the crawlers I'd written before ... Unfortunately, my old code was written against a specific project, so I dug up an article to re-read, and now I'm writing this post to pull the scattered knowledge points back together.

I'm only covering what I needed for this task. If you want a thorough understanding of how to use BeautifulSoup, you can refer to the following article: () => article (reading that article made me want to write everything in Java 8's new lambda-expression style ... really super cool)

To skip the basics, click me.

Anyway, back to the topic. Let's begin.


For Python crawlers, you will probably use one of two packages (at least that used to be the case): BeautifulSoup or etree.

Let me briefly describe the two packages as I understand them (don't take this part as gospel ...)

BeautifulSoup:

  • A package for parsing pages you have fetched; it can pull a page apart and do simple filtering of the data on it
  • Import: from bs4 import BeautifulSoup
  • Install: pip install bs4

etree:

  • Before version 3.7 (it should be 3.7) it was particularly handy, because you could lock onto DOM elements directly with XPath, but its XPath compatibility got worse in later updates, so I dropped it. Note: you can copy an element's XPath straight out of Chrome's dev tools, so personally I don't think there is any need to memorize XPath syntax; after all, we don't make a living writing Python crawlers, you know, Python blew up because of artificial intelligence. (A tiny sketch of this style follows the list.)
  • Import: from lxml import etree (the name is short for ElementTree)
  • Install: pip install lxml
  • It is said that in version 4.1.1 you can still get etree via from lxml.html import etree
  • If you want to use it, you can pin a specific version with pip instead of installing the latest one
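Since the list above only mentions the idea, here is a minimal etree sketch of my own (not used in the rest of this post); the HTML string and the XPath are invented for illustration:

from lxml import etree

html = '<div id="app"><ul class="list"><li><p>hello</p></li></ul></div>'
tree = etree.HTML(html)  # parse the HTML string into an element tree
# an XPath like this one could just as well be copied from Chrome's dev tools
texts = tree.xpath('//ul[@class="list"]//p/text()')
print(texts)  # ['hello']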

Since I don't feel like fussing with that this time, I'll write this crawler with BeautifulSoup and won't introduce how to use etree here. If you do want to use it, look up an article that matches your installed lxml version so you don't waste time.

First, the four data types in BeautifulSoup, to pave the way for what comes later:

  1. Tag
  2. NavigableString
  3. BeautifulSoup
  4. Comment

Tag => a tag in the page;

NavigableString => if you want the text of a Tag, you can use TagName.string to get the content inside; the value you get back is an object of type NavigableString;

BeautifulSoup => the object with the most functionality; there are many ways to get at its child objects, and it is very simple to use. For example:

soup = BeautifulSoup(imagine_a_very_very_long_html_string_here, 'lxml')  # parse the page document; soup is a BeautifulSoup object

# if you like, you can check soup's type with print(type(soup)); I did not test it

item = soup.find_all(name='a', attrs={'class': 'hover', 'target': '_self'})
# find_all here means: find every object under the document we fetched (soup) that matches <a class='hover' target='_self'></a>

Comment => this type is special; it can find all the comments in a page. I haven't used it ...
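To make the four types concrete, here is a tiny standalone sketch of my own (the HTML snippet is invented):

from bs4 import BeautifulSoup

demo = BeautifulSoup("<p class='name'>hello</p><b><!--a hidden note--></b>", 'lxml')

tag = demo.find('p', attrs={'class': 'name'})   # bs4.element.Tag
text = tag.string                               # bs4.element.NavigableString -> 'hello'
comment = demo.b.string                         # bs4.element.Comment -> 'a hidden note'

print(type(demo), type(tag), type(text), type(comment))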

The BeautifulSoup object also has many select-style lookups you can use, for example (a short runnable sketch follows the list):

  • Find by tag name => soup.select('div')
  • Find by class => soup.select('.top-bar')
  • Find by id => soup.select('#app')
  • Combined (descendant) => soup.select('#app .top-bar')
  • Direct child => soup.select('#app > .top-bar')
  • By attribute => soup.select('div[data="NickName"][class="name"]')  # worth learning a bit of Emmet syntax here, it is quite similar ... square brackets hold attributes, curly braces hold content, and there are no spaces
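A quick standalone sketch of select in action (my own example; the HTML is invented). Note that select always returns a list of Tags:

from bs4 import BeautifulSoup

demo = BeautifulSoup('<div id="app"><div class="top-bar">hi</div></div>', 'lxml')

print(demo.select('div'))                     # every <div>
print(demo.select('#app > .top-bar'))         # direct children of #app with class top-bar
print(demo.select('#app .top-bar')[0].text)   # 'hi'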

Reference article => Now that I think about it ... why didn't I just use select to write this crawler ...

That's it for the basics ... reading it back, it probably still isn't very clear; if you have nothing better to do, read on.

On to the code

The target machine (doesn't using the phrase "target machine" sound especially pretentious)

http://www.oicq88.com

Approach

  1. Get the page elements

  2. Inspecting the page, we find the nicknames are split into many categories, more than fifty in total; clicking one takes you to a sub-path, so the inner links can be built by simple string concatenation

  3. Each category has many pages of data, so we first need to know the total number of pages before traversing them, or at least make sure we never run past the end. I happened to notice that appending a page number far beyond the real range does not produce an error; the site simply shows its last page (what am I even saying). | There is actually another approach: while crawling, check whether a "next page" link exists; if it does, keep going, otherwise break out of the loop (a sketch of this alternative follows the list).

  4. Use BeautifulSoup to grab all the nickname content (luckily the site is not split into separate front end and back end with the front end requesting JSON data, so I can take all the nicknames straight from the HTML ... I'm not sure whether that makes it pseudo-static or static)

  5. Write the data to a file with an IO stream
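For completeness, here is a minimal sketch of that alternative "is there a next page?" approach (my own addition; it reuses the get_html helper and base_url defined later in the post, and the '下一页' label of the next-page link is a guess at the site's markup, so check the real HTML before using it):

from bs4 import BeautifulSoup

page = 1
while True:
    soup = BeautifulSoup(get_html(base_url + '/shanggan/' + str(page) + '.htm'), 'lxml')
    # ... collect the nicknames on this page ...
    next_link = soup.find('a', string='下一页')  # assumed label of the "next page" link
    if next_link is None:
        break  # no next page, stop
    page += 1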

Outline

  1. Get the page elements, i.e. the document
  2. Get the concatenated paths for every nickname category
  3. Analyze
  4. Get the number of pages for each category
  5. Traverse every page and write the data to a file

Get the page elements

Fetching a page is very simple:

response = requests.get(url='http://www.oicq88.com')

But I'd rather define it as a function ...

import requests  # don't forget the import; if it errors, install the package with pip install requests (pip will prompt you if your pip version is too old and needs updating)

# simulate a user visiting from a browser
def get_html(url):
    headers = {
        'User-Agent': 'Mozilla/5.0(Macintosh; Intel Mac OS X 10_11_4)\
        AppleWebKit/537.36(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }  # pretend to be a browser

    response = requests.get(url, headers=headers)  # request the site
    html = response.content.decode()  # get the page source; it comes back as bytes, so decode it first so we can read it

    return html

Then, for convenience, I added a few global variables to use with the function below.

base_url = 'http://www.oicq88.com'  # base domain
file_name = 'result.txt'  # output file name
currentItem = 0  # running count of written items, just for progress feedback
params = []  # will hold the sub-path of every nickname category
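One gap worth flagging: the loops below read from a soup object holding the parsed homepage, but the post never shows it being created. A minimal bridge, assuming the get_html helper and base_url above (this wiring is my guess at how the original script fit together):

import re  # used for the href filtering below
from bs4 import BeautifulSoup

soup = BeautifulSoup(get_html(base_url), 'lxml')  # fetch and parse the homepage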

Get the concatenated paths for every nickname category

Let me first explain the difference between find and find_all: find returns the first entry that satisfies the condition, while find_all returns every entry that satisfies it.

For example:

params = [ "<a class='A'>item1</a>", "<a class='A'>item2</a>", "<a>item3</a>" ]

Searching both with (name='a', attrs={'class': 'A'}):

find is equivalent to finding "<a class='A'>item1</a>"
find_all is equivalent to finding ["<a class='A'>item1</a>", "<a class='A'>item2</a>"]
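The illustration above just uses plain strings; here is a runnable version of the same idea, as a standalone sketch of my own:

from bs4 import BeautifulSoup

demo = BeautifulSoup("<a class='A'>item1</a><a class='A'>item2</a><a>item3</a>", 'lxml')

first = demo.find(name='a', attrs={'class': 'A'})       # a single Tag: item1
every = demo.find_all(name='a', attrs={'class': 'A'})   # a list of Tags: [item1, item2]

print(first.string)                   # item1
print([a.string for a in every])      # ['item1', 'item2']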

Then, on to the code

# find all the sub-category paths
for name in soup.find_all(name='a'):  # grab every <a> tag on the homepage
    child_path = name.get('href')  # grab the content of the href attribute
    if child_path is None:  # some <a> tags have no href at all
        continue
    method = re.compile(r'/(.*?)/')  # regex used to filter out the addresses we do not need
    flag = re.findall(method, child_path)  # result after the regex filtering
    if len(flag) == 1:  # checking the length guards against index-out-of-range
        if flag[0]:
            params.append(child_path)

Analyze

So far, we know a few things:

  1. There are fifty-odd nickname categories
  2. Each category has more than one page
  3. Each page holds n entries

Breaking it down from there, we need to traverse every entry on every page of every category.

Let's start with the categories.

Get the number of pages for each category

Actually, I could have written one less loop here, cough ...

# nickname categories
for i in range(len(params)):
    index = 411  # this number does matter: compare http://www.oicq88.com/shanggan/97.htm and http://www.oicq88.com/shanggan/411.htm and see whether there is any difference
    soup = BeautifulSoup(get_html(base_url + params[i] + '%s.htm' % index), 'lxml')  # request a huge page number to find out how many pages there really are

    item = soup.find_all(name='a', attrs={'class': 'hover', 'target': '_self'})  # this is the lookup that grabs the "last page" element

    index = int(item[0].text)  # cast the last-page text to int so we can iterate over every page

    # the next two prints are just feedback, to make the experience a bit nicer
    print('Current module is => ' + params[i] + ' And ...')
    print('\nThe page count is => ' + str(index) + '\n')

Traverse every page and write the data to a file

One thing to mention here: the values you get by iterating over the result of soup.find with a for ... in loop are not BeautifulSoup objects, so you cannot use the same methods on them to look at their children.

But you can simply iterate over those children again (this stumped me for quite a while; by the time I wrote this post I had found n different ways to solve it ...)
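If you are curious what that iteration actually yields, here is a quick check of my own (not part of the original script; run it inside the category loop, where soup holds one of the category pages fetched above):

ul = soup.find(name='ul', attrs={'class': 'list'})
if ul is not None:
    for child in ul:
        print(type(child))  # a mix of Tag (<li>) and NavigableString (whitespace) children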

# pagination!
    for page in range(1, index + 1):  # + 1 so the last page is included as well

        print('\ncurrent page => ' + str(page) + '\n')

        # nickname items / page
        soup = BeautifulSoup(get_html(base_url + params[i] + str(page) + '.htm'), 'lxml')
        try:
            for ul in soup.find(name='ul', attrs={'class': 'list'}):
                for ls in ul:
                    for p in ls:
                        # write to file (codecs.open needs import codecs at the top of the script)
                        try:
                            with codecs.open(file_name, "a", 'utf-8') as f:  # "a" appends to the end of the file instead of rewriting it the way "w" would
                                f.write(p)
                                f.write('\n')
                                currentItem += 1
                                print('The ' + str(currentItem) + ' => Done')
                        except IOError as e:
                            print(e)
                        # no finally/close needed: the with statement already closes the file
        except TypeError:  # soup.find returns None when the list is missing, and None cannot be iterated
            print('The null item cannot be iterated...')

That's about it. This was just a quick write-up; you can copy the code and use it directly, assuming you have set up the same environment.

With Python, you should still surface errors as you go rather than letting problems silently block the process ... otherwise the crawl blows up halfway through and the time is wasted.

  1. It would be best to keep the crawled content under tighter control
  2. Add a logging system that records the errors that get thrown (thinking about it now, wrapping the file-stream operations in try/except is probably enough; a minimal sketch follows this list)
  3. That seems to be everything
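For item 2, a minimal sketch of what I mean, using Python's standard logging module (my own addition; it reuses file_name and the loop variable p from the code above):

import logging
import codecs

logging.basicConfig(filename='crawler.log', level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

try:
    with codecs.open(file_name, 'a', 'utf-8') as f:
        f.write(p + '\n')
except IOError as e:
    logging.error('write failed: %s', e)  # record the error instead of letting it kill the crawl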


Origin www.cnblogs.com/Arunoido/p/11140495.html