Suppose a page contains n multiple-choice questions, each with several options. The question stem is marked up with an h6 tag, and each option sits in a div tag under a td, as shown below.
The whole page is the following HTML fragment repeated n times.
> </ul> </div> <ul style="list-style-type:none"> <li> <table> <tbody><tr> <td> <div>A.一二三</div> </td> </tr> <tr> <td> <div>B.四五六</div> </td> </tr> </tbody></table> </li> </ul> </div> <div>
Below we use the BeautifulSoup library to extract the question stems and options from the page.
First, import the packages we need:
from bs4 import BeautifulSoup
from urllib import request, parse
import re
You can then load the page source in several ways:
# Either open an HTML file:
soup = BeautifulSoup(open("page_source_code.html"))
# or pass the HTML code in directly:
soup = BeautifulSoup("<html>data</html>")
# or send a request and read the response:
url = "http://fake.html"
response = request.urlopen(url, timeout=20)
responseUTF8 = response.read().decode("utf-8")
soup = BeautifulSoup(responseUTF8, 'lxml')
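When no parser is named, BeautifulSoup picks one itself and emits a warning, so the result can differ between machines. The sketch below names the parser explicitly. It assumes Python's built-in "html.parser" instead of lxml, and the byte string `raw` is a hypothetical stand-in for what `response.read()` would return.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the bytes returned by response.read().
raw = "<html><body><h6>Question 1</h6></body></html>".encode("utf-8")

# Decode the bytes first, then name the parser explicitly.
# "html.parser" ships with Python, so no extra install is needed.
soup = BeautifulSoup(raw.decode("utf-8"), "html.parser")

first_stem = soup.h6.get_text(strip=True)
print(first_stem)
```

Swapping "html.parser" for 'lxml' should work the same way here, provided lxml is installed.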
Either way, we end up with a soup object. From here, we only need to navigate the tag structure of that object to reach the target tags.
The approach is roughly as follows, locating tags by "absolute path":
# Observation shows that every question stem is in an h6 tag (nested inside another tag),
# and the rest of the page does not use h6. So .find_all grabs all the stems
# and returns a list of tags.
h6lbs = soup.find_all('h6')
# Define a two-dimensional list to hold the extracted questions.
# Each member holds the stem and options of one question.
item_question = []
for i in range(len(h6lbs)):
    # Define a one-dimensional list to hold the stem and its options.
    # Store the stem we just found first.
    item = []
    item.append(h6lbs[i].text)
    # From the HTML structure above, the options are stored next to an ancestor of the stem:
    # climb three levels with .parent, then move two siblings forward with .next_sibling.
    # This locates the tag by its absolute path.
    tag1 = h6lbs[i].parent.parent.parent.next_sibling.next_sibling
    # Check whether this is a multiple-choice question or a yes/no question
    if tag1 is not None and tag1.li is not None:
        # The options live in the tbody of the table inside that sibling's li tag.
        # Traverse from here and extract each option.
        tag = h6lbs[i].parent.parent.parent.next_sibling.next_sibling.li.table.tbody
        for child in tag.children:
            # tbody's children also include the text nodes between rows, so a check is added here.
            # With the markup above, .string is None only for the tr rows, which are the ones we keep.
            if child.string is None:
                # Extract the option text from each row with .string
                tag_string = child.td.div.string
                # Add the option to the one-dimensional list created above
                item.append(tag_string)
    # print(item)
    # Save each one-dimensional list into the two-dimensional list
    item_question.append(item)
print(item_question)
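The double .next_sibling above is needed because the first sibling is usually the whitespace text between tags, which makes the path sensitive to how the page is formatted. As a hedged alternative, the sketch below uses `find_next_sibling`, which skips text nodes. The `SAMPLE` markup is a hypothetical reconstruction of one question (the stem part is not shown in the fragment above), and `extract_questions` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for one question; the nesting around h6 is assumed.
SAMPLE = """
<div value="11111#1">
  <div>
    <ul>
      <li><h6>Question 1</h6></li>
    </ul>
  </div>
  <ul style="list-style-type:none">
    <li><table><tbody>
      <tr><td><div>A.一二三</div></td></tr>
      <tr><td><div>B.四五六</div></td></tr>
    </tbody></table></li>
  </ul>
</div>
"""


def extract_questions(soup):
    """Return one [stem, option, option, ...] list per h6 tag."""
    questions = []
    for h6 in soup.find_all("h6"):
        item = [h6.get_text(strip=True)]
        # Climb three levels as before, then ask for the next <ul> sibling.
        # find_next_sibling ignores the whitespace text nodes in between.
        container = h6.parent.parent.parent
        options = container.find_next_sibling("ul")
        if options is not None:
            # find_all returns only tags, so no None check on text nodes is needed.
            for row in options.find_all("tr"):
                if row.td is not None and row.td.div is not None:
                    item.append(row.td.div.get_text(strip=True))
        questions.append(item)
    return questions


soup = BeautifulSoup(SAMPLE, "html.parser")
result = extract_questions(soup)
print(result)
```

The three .parent hops are still an absolute path, so this only removes the dependence on whitespace, not on the nesting depth.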
Tags can also be located by "relative path":
# Observation shows that each question is wrapped in a div tag whose value attribute is
# 11111#1, 11111#2, 11111#3, and so on. So we fetch these tags first and use them as the reference point.
# re.compile builds the regular expression.
the_tag = soup.find_all('div', value=re.compile(r'11111#\d'))
print(len(the_tag))
# Create a two-dimensional list to hold the questions
item_question = []
for tag in the_tag:
    # Create a one-dimensional list to hold the stem and options
    item = []
    # Traverse the descendants of the reference tag just selected
    for child_tag in tag.descendants:
        # Each reference tag has an h6 tag holding the stem; extract and save it.
        if child_tag.name == "h6":
            item.append(child_tag.get_text("", strip=True))
        # Each td tag under the reference tag holds an option; extract and save it.
        elif child_tag.name == "td":
            if child_tag.div is not None:
                item.append(child_tag.div.get_text("", strip=True))
    item_question.append(item)
print(item_question)
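The same relative-path idea can be written without walking every descendant, since `find_all` accepts a list of tag names and returns the matches in document order. This is a minimal sketch under assumptions: `TEMPLATE` is hypothetical markup for one question, repeated twice to imitate the page, and "html.parser" is used instead of lxml.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for one question; {n} is the question number.
TEMPLATE = """
<div value="11111#{n}">
  <div><ul><li><h6>Question {n}</h6></li></ul></div>
  <ul style="list-style-type:none"><li><table><tbody>
    <tr><td><div>A.一二三</div></td></tr>
    <tr><td><div>B.四五六</div></td></tr>
  </tbody></table></li></ul>
</div>
"""
SAMPLE = "".join(TEMPLATE.format(n=n) for n in (1, 2))

soup = BeautifulSoup(SAMPLE, "html.parser")

questions = []
# Anchor on the div whose value attribute looks like 11111#<number>.
for block in soup.find_all("div", value=re.compile(r"^11111#\d+$")):
    item = []
    # Only h6 and td tags are visited, in the order they appear in the block.
    for tag in block.find_all(["h6", "td"]):
        # The stem text is in the h6 itself; an option is in the div under the td.
        target = tag if tag.name == "h6" else tag.div
        if target is not None:
            item.append(target.get_text(strip=True))
    questions.append(item)

print(questions)
```

Because each question is found relative to its own wrapper div, a change in nesting depth elsewhere on the page should not affect the result.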