Scraping quiz questions from a page with BeautifulSoup

Suppose there is a page with n multiple-choice questions, each with several options. The stem of each question is marked up with an h6 tag. Each option is in a div tag under a td tag. As shown below:

The entire page is the following HTML fragment repeated n times.

    
                                    
    
        
            
                
                    >
                </ul>
            </div>
            <ul style="list-style-type:none">
                <li>
                    <table>                                                
                        <tbody><tr>
                                 <td>
                                    <div>A.一二三</div>        
                                 </td>
                                </tr>
                                <tr>
                                  <td>
                                     <div>B.四五六</div>                    
                                  </td>
                                 </tr>                                            
                         </tbody></table>
                 </li>
            </ul>
        </div>            
<div>                                

Below we use the BeautifulSoup library to extract the question stems and options from the page.

First, import the packages we need:

from bs4 import BeautifulSoup
from urllib import request,parse
import re

You can then load the page source in several ways:

# Either open an HTML file:
soup = BeautifulSoup(open("page_source_code.html"))
# or pass in the HTML code directly:
soup = BeautifulSoup("<html>data</html>")
# or send a request and parse the response:
url = "http://fake.html"
response = request.urlopen(url, timeout=20)
responseUTF8 = response.read().decode("utf-8")
soup = BeautifulSoup(responseUTF8, 'lxml')
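As a minimal sketch of the file-based option, the example below reads a page through a context manager and names the parser explicitly. The temporary file and its one-line content are hypothetical stand-ins for a real saved page, and "html.parser" (the parser bundled with Python) is used here only so the sketch runs without lxml installed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a tiny hypothetical page to a temporary file so the example is runnable.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<html><body><h6>Sample stem</h6></body></html>")
    path = tmp.name

try:
    # The context manager closes the file handle once parsing is done.
    # Naming the parser avoids the warning bs4 emits when it has to guess one.
    with open(path, encoding="utf-8") as fh:
        soup = BeautifulSoup(fh, "html.parser")
finally:
    os.remove(path)

print(soup.h6.get_text())
```

Different parsers can build slightly different trees from the same broken markup, so it is probably safest to use the same parser everywhere in one script.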

Either way, we now have a soup object. Next, we only need to locate the target tags according to the tag structure of the page.

The first approach is roughly as follows. It finds the tags by an "absolute path":

# All the question stems are in h6 tags, and the rest of the page uses no h6 tags, so find_all can grab every stem at once. It returns a list of tags.
h6lbs = soup.find_all('h6')
# A two-dimensional list for the scraped questions. Each member is a list holding one question's stem and options.
item_question = []
for i in range(len(h6lbs)):
    # A one-dimensional list for one stem and its options. Store the stem first.
    item = []
    item.append(h6lbs[i].text)
    # From the HTML structure above, the options are stored in a sibling of the stem's third-level ancestor, so we chain .parent and .next_sibling to reach it by an absolute path. .next_sibling is called twice because the first sibling is normally the whitespace text node between the two tags.
    tag1 = h6lbs[i].parent.parent.parent.next_sibling.next_sibling
    # Check whether this is a multiple-choice question; if not, it must be a yes/no question.
    if tag1 is not None and tag1.li is not None:
        # The options are under li > table > tbody of that sibling, so iterate over the children of tbody and extract each option.
        tag = h6lbs[i].parent.parent.parent.next_sibling.next_sibling.li.table.tbody
        for child in tag.children:
            # The children also include whitespace text nodes, so a check is needed. Here .string is None only for a tr row that has several children, which is the case in the indented HTML above.
            if child.string is None:
                tag_string = child.td.div.string
                # Extract the option text from each row with .string
                # and append it to the one-dimensional list created above.
                item.append(tag_string)
        # print(item)
    # Save each one-dimensional list into the two-dimensional list.
    item_question.append(item)
print(item_question)
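The walk from the stem to its options can be tried out on a small page. The sketch below is a variation, not the code above: it keeps the three .parent steps but swaps the chained .next_sibling calls for find_next_sibling("ul"), and the .string check for find_all("tr", recursive=False), so whitespace text nodes are skipped automatically. SAMPLE_HTML, the question text, and the function name extract_by_absolute_path are hypothetical. In particular, the markup around the h6 tag is an assumption chosen so that the stem sits three levels below a div that is followed by the options list.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the wrapper tags around <h6> are an assumption.
SAMPLE_HTML = """
<div value="11111#1">
    <div>
        <ul>
            <li><h6>1. Capital of France?</h6></li>
        </ul>
    </div>
    <ul style="list-style-type:none">
        <li>
            <table>
                <tbody><tr>
                    <td>
                        <div>A.Paris</div>
                    </td>
                </tr>
                <tr>
                    <td>
                        <div>B.Rome</div>
                    </td>
                </tr>
                </tbody></table>
        </li>
    </ul>
</div>
"""


def extract_by_absolute_path(html):
    """Return [[stem, option, ...], ...] by walking outward from each <h6>."""
    soup = BeautifulSoup(html, "html.parser")
    questions = []
    for h6 in soup.find_all("h6"):
        item = [h6.get_text(strip=True)]
        ancestor = h6.parent.parent.parent
        # find_next_sibling("ul") skips the whitespace text node that a
        # plain .next_sibling would return first.
        options_ul = ancestor.find_next_sibling("ul")
        if options_ul is not None and options_ul.li is not None:
            tbody = options_ul.li.table.tbody
            # recursive=False keeps only the direct <tr> children of <tbody>.
            for row in tbody.find_all("tr", recursive=False):
                item.append(row.td.div.get_text(strip=True))
        questions.append(item)
    return questions


result = extract_by_absolute_path(SAMPLE_HTML)
print(result)
```

Either form still depends on the exact nesting depth, so any change to the page layout around the stem is likely to break it.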

The tags can also be located by a relative path:

# Observation shows that each question sits inside a div tag whose value attribute is 11111#1, 11111#2, 11111#3, and so on. So we first grab these tags as reference points. re.compile builds the regular expression.
the_tag = soup.find_all('div', value=re.compile(r'11111#\d'))
print(len(the_tag))
# Create a two-dimensional list to hold the questions.
item_question = []
for tag in the_tag:
    # Create a one-dimensional list to hold one stem and its options.
    item = []
    # Traverse the descendants of the reference tag just selected.
    for child_tag in tag.descendants:
        # Each reference tag contains one h6 tag, the stem. Extract and save it.
        if child_tag.name == "h6":
            item.append(child_tag.get_text("", strip=True))
        # Each td tag under the reference tag holds one option. Extract and save it.
        elif child_tag.name == "td":
            if child_tag.div is not None:
                item.append(child_tag.div.get_text("", strip=True))
    item_question.append(item)
print(item_question)
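The relative approach can be sketched as a small function and run against a sample page. SAMPLE_HTML, the question texts, and the name extract_by_reference_div are hypothetical. The sketch also uses find and find_all inside each reference div instead of walking .descendants, which should give the same result for this structure. The second sample question has no options table, to stand in for a yes/no question.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample page with two questions; the second has no options.
SAMPLE_HTML = """
<div value="11111#1">
  <div><ul><li><h6>1. Capital of France?</h6></li></ul></div>
  <ul><li><table><tbody>
    <tr><td><div>A.Paris</div></td></tr>
    <tr><td><div>B.Rome</div></td></tr>
  </tbody></table></li></ul>
</div>
<div value="11111#2">
  <div><ul><li><h6>2. Water boils at 100 C at sea level.</h6></li></ul></div>
</div>
"""


def extract_by_reference_div(html):
    """Return [[stem, option, ...], ...], one list per reference <div>."""
    soup = BeautifulSoup(html, "html.parser")
    questions = []
    # Only divs whose value attribute looks like 11111#<digits> are matched;
    # the inner divs have no value attribute and are ignored.
    pattern = re.compile(r"^11111#\d+$")
    for block in soup.find_all("div", attrs={"value": pattern}):
        item = []
        stem = block.find("h6")
        if stem is not None:
            item.append(stem.get_text(strip=True))
        for td in block.find_all("td"):
            if td.div is not None:
                item.append(td.div.get_text(strip=True))
        questions.append(item)
    return questions


result = extract_by_reference_div(SAMPLE_HTML)
print(result)
```

Because nothing here counts parents or siblings, the sample can be written compactly without whitespace between the row tags and the function still finds every option.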

Origin www.cnblogs.com/testertry/p/11516536.html