GeneralNewsExtractor
GeneralNewsExtractor (hereinafter GNE) is a general-purpose extractor for news pages. It extracts the body text of a news site's pages without requiring you to specify any extraction rules.
Let's look at its basic usage.
Installing GNE
Install with pip:
pip install --upgrade git+https://github.com/kingname/GeneralNewsExtractor.git
You can also install with pipenv:
pipenv install git+https://github.com/kingname/GeneralNewsExtractor.git#egg=gne
Getting the news page source
GNE does not provide a way to request pages, and never will, so you need to obtain the rendered page source yourself. You can use Selenium or Pyppeteer, or copy it directly from your browser.
Here we demonstrate how to copy the page's source code directly from the browser:
- Open the page in Chrome, then open the developer tools, as shown below:
- In the Elements tab, locate the <html> tag, right-click it, and select Copy > Copy outerHTML, as shown in the figure.
- Save the source code as 1.html.
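If you prefer to automate this step, the following is a minimal sketch of the Selenium route mentioned above, not part of GNE itself. The helper names save_rendered_source and fetch_with_chrome are hypothetical, and the sketch assumes Selenium plus a working Chrome setup are installed. Note that page_source reflects the DOM at the moment it is read, so pages that load content late may need an explicit wait first.

```python
from pathlib import Path


def save_rendered_source(driver, url, path="1.html"):
    """Load `url` in an already-started browser driver and save the rendered HTML."""
    driver.get(url)            # the browser loads the page and runs its JavaScript
    html = driver.page_source  # DOM serialized at the time of this read
    Path(path).write_text(html, encoding="utf-8")
    return html


def fetch_with_chrome(url, path="1.html"):
    """Convenience wrapper (hypothetical): start Chrome, save the page, quit."""
    # Imported lazily so save_rendered_source has no hard dependency on Selenium.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
    try:
        return save_rendered_source(driver, url, path)
    finally:
        driver.quit()
```

Calling fetch_with_chrome with the URL of a news page should leave a 1.html file that can be passed to GNE in the next step.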
Extracting the body text
Write the following code:
from gne import GeneralNewsExtractor
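# Read the saved page source; the explicit encoding avoids platform-dependent defaults
with open('1.html', encoding='utf-8') as f:
    html = f.read()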
extractor = GeneralNewsExtractor()
result = extractor.extract(html)
print(result)
The output is shown below:
What's new in this update
The latest release, v0.04, adds extraction of the images in the body and can also return the HTML source of the body. Image URL extraction was already demonstrated above: the images field of the result holds the images found within the body.
So how do you get the body's source code? Just add the parameter with_body_html=True:
from gne import GeneralNewsExtractor
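# Read the saved page source; the explicit encoding avoids platform-dependent defaults
with open('1.html', encoding='utf-8') as f:
    html = f.read()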
extractor = GeneralNewsExtractor()
result = extractor.extract(html, with_body_html=True)
print(result)
The output is shown below:
The body_html field of the result is the HTML source of the body.
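As a small follow-up, here is a sketch of one way to keep the extracted markup for later inspection. It assumes the result returned by extract() behaves like a dict, and the helper name save_body_html is hypothetical, not part of GNE.

```python
from pathlib import Path


def save_body_html(result, path="body.html"):
    """Write the extracted body markup to `path`; return False if it is absent."""
    # body_html is expected only when with_body_html=True was passed to extract()
    body_html = result.get("body_html")
    if not body_html:
        return False
    Path(path).write_text(body_html, encoding="utf-8")
    return True
```

Using .get() instead of indexing keeps the helper safe when the result was produced without with_body_html=True.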
To learn more about using GNE, visit its GitHub repository: https://github.com/kingname/GeneralNewsExtractor