GNE v0.04, the general news page extractor, updated: now supports extracting body images and body source code

GeneralNewsExtractor (hereinafter GNE) is a general-purpose news page extractor. It can extract the body text of a news site without any extraction rules being specified.

Let's look at its basic usage.

Installing GNE

Install with pip:

pip install --upgrade git+https://github.com/kingname/GeneralNewsExtractor.git

Of course, you can also install it with pipenv:

pipenv install git+https://github.com/kingname/GeneralNewsExtractor.git#egg=gne

Getting the news page source

GNE does not, and will not, provide any page-fetching functionality, so you need to find your own way to obtain the rendered web page source code. You can use Selenium or Pyppeteer, or copy it directly from your browser.
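As one possible way to get the rendered source with Selenium, here is a minimal sketch. It is not part of GNE; the helper names fetch_rendered_html and fetch_with_headless_chrome are hypothetical, and it assumes Selenium and a matching Chrome driver are installed. Pages that load content asynchronously may need an explicit wait before reading page_source.

```python
def fetch_rendered_html(driver, url):
    """Load `url` in an existing browser driver and return the current DOM as HTML."""
    driver.get(url)
    # page_source reflects the DOM after the browser has loaded the page.
    return driver.page_source


def fetch_with_headless_chrome(url):
    """Hypothetical convenience wrapper: start headless Chrome, fetch, then quit."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        return fetch_rendered_html(driver, url)
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

The string returned by fetch_with_headless_chrome(url) can then be passed to extractor.extract() in place of the contents of a saved HTML file.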

Here we demonstrate how to copy the page's source code directly from the browser:

  1. Open the page in Chrome, then open the developer tools, as shown below:

  2. In the Elements tab, locate the <html> tag, right-click it, and select Copy > Copy outerHTML, as shown in the figure.

  3. Save the source code as 1.html.

Extracting the body information

Write the following code:

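from gne import GeneralNewsExtractor

# Read the saved page source; an explicit encoding avoids platform-dependent defaults
with open('1.html', encoding='utf-8') as f:
    html = f.read()

# Extract the news information without specifying any rules
extractor = GeneralNewsExtractor()
result = extractor.extract(html)
print(result)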

The run results are shown below:

What's new in this update

The latest version, v0.04, adds two features: extracting the images in the body text, and returning the source code of the body text. The image URL feature has already been demonstrated above: the images field of the result holds the images found within the body.
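The images field can be processed like any other list. Here is a small sketch that turns the image URLs into absolute URLs, assuming the page may contain relative image paths. The helper absolute_image_urls and the sample dictionary are hypothetical and hand-written for illustration; they are not real GNE output.

```python
from urllib.parse import urljoin


def absolute_image_urls(result, page_url):
    """Resolve every entry of result['images'] against the page's own URL."""
    # A missing 'images' key is treated as "no images found".
    return [urljoin(page_url, src) for src in result.get("images", [])]


# Hand-written stand-in for an extraction result (not real GNE output).
sample_result = {
    "images": [
        "/img/1.png",
        "pic/2.jpg",
        "https://cdn.example.com/3.png",
    ]
}

urls = absolute_image_urls(sample_result, "https://example.com/news/2020/a.html")
for url in urls:
    print(url)
```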

So how do you return the source code of the body? You only need to add the parameter with_body_html=True:

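from gne import GeneralNewsExtractor

# Read the saved page source; an explicit encoding avoids platform-dependent defaults
with open('1.html', encoding='utf-8') as f:
    html = f.read()

extractor = GeneralNewsExtractor()
# with_body_html=True also returns the HTML source of the body
result = extractor.extract(html, with_body_html=True)
print(result)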

The run results are shown below:

The body_html field in the returned result is the HTML source of the body.

For more in-depth information about using GNE, visit its GitHub repository: https://github.com/kingname/GeneralNewsExtractor.
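If you want to keep the body source for later inspection, one option is to write it to a file. This is a minimal sketch; the helper save_body_html is hypothetical, and it only assumes that the result is a dictionary that may or may not contain the body_html field.

```python
from pathlib import Path


def save_body_html(result, path):
    """Write result['body_html'] to `path`; return False if the field is absent or empty."""
    body_html = result.get("body_html")
    if not body_html:
        # The field is only present when with_body_html=True was passed to extract().
        return False
    Path(path).write_text(body_html, encoding="utf-8")
    return True
```

For example, save_body_html(result, 'body.html') would write the body source to body.html when the result was produced with with_body_html=True.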


Origin www.cnblogs.com/xieqiankun/p/gne_v_0_0_4.html