Using HtmlUnit to fetch a JavaScript-rendered page

Purpose: run a command of the form someCmd someUrl on the Linux command line and get the page as it looks after its JavaScript has run.

If someCmd is wget or curl, you only get the raw HTML; the data that the page's JavaScript pulls in and renders is not included.

1. Several failed attempts

(1) PhantomJS

In my tests I could only get the HTML as it was before the JavaScript ran. I also tried window.setTimeout to wait for rendering to finish, but that did not work either.

(2) httrack

Testing from the command line was also unsuccessful: I only got the unrendered HTML, and I could not work out which options to set.

2. HtmlUnit

HtmlUnit is a Java library (the download contains a bunch of jars). First, following the reference link below, I created a Java project in Eclipse on Windows.

Reference: https://www.cnblogs.com/lavender-pansy/p/10845297.html

Right-click the project - Java Build Path - Add External JARs - browse to the folder where you downloaded and unpacked HtmlUnit beforehand.

For example, if it was extracted to htmlunit-3.3.0\lib, add all the JARs in that folder to the project.

However, the HtmlUnit import statements in the reference example need to be deleted and replaced with the ones Eclipse suggests for your version (HtmlUnit 3.x uses the org.htmlunit package, while older examples import com.gargoylesoftware.htmlunit). For example, mine are:

import org.htmlunit.BrowserVersion;
import org.htmlunit.FailingHttpStatusCodeException;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;

The source file must also be placed in a named package; do not leave it in the default package (this may affect whether an error is reported when running the jar on Linux).
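As a rough illustration of the kind of source file described above, here is a minimal sketch that uses the four imports listed earlier. The package name demo, the class name FetchRendered, the example.com URL and the 10-second wait are all assumptions chosen for illustration; they are not taken from the reference example.

```java
package demo; // hypothetical package name; the point is only that it is not the default package

import java.io.IOException;

import org.htmlunit.BrowserVersion;
import org.htmlunit.FailingHttpStatusCodeException;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;

public class FetchRendered { // hypothetical class name
    public static void main(String[] args) throws IOException {
        // Placeholder URL (assumption); replace it with the page you want.
        String url = "https://example.com/";

        // WebClient is AutoCloseable, so try-with-resources closes it for us.
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setCssEnabled(false);
            // Many real pages contain script errors; do not abort on them.
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage(url);
            // Give background scripts up to 10 seconds (assumed value) to finish.
            webClient.waitForBackgroundJavaScript(10_000);

            // Print the DOM as it stands after the scripts have run.
            System.out.println(page.asXml());
        } catch (FailingHttpStatusCodeException e) {
            System.err.println("HTTP error " + e.getStatusCode() + " for " + url);
        }
    }
}
```

Whether 10 seconds is enough depends on the page; a page that keeps polling the server may need a longer wait or a different strategy.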

After it runs successfully in Eclipse (that is, the output HTML contains the server data pulled in by the JavaScript), right-click the project - Export - Runnable JAR file and give it a name, for example HtmlUnit.jar.

This jar then contains all the dependent jars (more than 20 MB).

Upload this jar to the Linux server (with a Java environment already configured).

In the folder where the jar is located, run:

java -jar HtmlUnit.jar

That’s it.

At first the URL is hard-coded in the Java source; later it can be turned into a command-line parameter, or read from a configuration file (for example, one with a URL on each line).
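The parameter and configuration-file idea might be sketched as follows, using only the standard library. The class name UrlArgs, the -f flag, and the rule that blank lines and lines starting with # are skipped are assumptions made for this sketch; the actual fetching is left out and replaced by a print statement.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UrlArgs { // hypothetical class name
    /** Reads one URL per line, skipping blank lines and lines starting with '#'. */
    static List<String> readUrlFile(Path file) throws IOException {
        List<String> urls = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                urls.add(trimmed);
            }
        }
        return urls;
    }

    /** "-f <file>" (assumed flag) reads URLs from a file; otherwise each argument is a URL. */
    static List<String> resolveUrls(String[] args) throws IOException {
        if (args.length == 2 && args[0].equals("-f")) {
            return readUrlFile(Paths.get(args[1]));
        }
        return Arrays.asList(args);
    }

    public static void main(String[] args) throws IOException {
        for (String url : resolveUrls(args)) {
            // Stand-in for the real work: this is where the page would be fetched.
            System.out.println("would fetch: " + url);
        }
    }
}
```

With this shape, the jar could be started either with the URLs as arguments or with -f followed by the path of a text file that lists them.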

Improvement: running a 20 MB jar is rather clumsy. Later, when I had time, I learned how to run the class file directly; see:

"Example of compiling and running, from the Linux command line, an Eclipse (Windows) Java project that links multiple jars" (piggy514's blog, CSDN)

Reposted from: blog.csdn.net/piggy514/article/details/131354237