Foreword
"Java is used selenium and chrome browser download dynamic page" in one article, we demonstrate how to download dynamic web window environment by selenium and chrome. But our crawlers are generally run on linux server. Generally there is no GUI on the server environment. Unable to open chrome window interface. The previous time, crawler system is PhantomJS a non-browser interface to achieve. But now because FireFox, chrome after these browsers began to support the headless mode, PhantomJS have stopped updating, so now recommended to use FireFox and chrome headless mode to replace the PhantomJS. The so-called headless mode is no interface operating mode, just right for use in such situations without a GUI environment linux server.
The use of selenium drivers need to be installed chrome browser chrome and chrome webdriver in the linux environment. The following shows should do in centos 7 environment.
Install google chrome
First, download the offline installation package from https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm. Then run the following command to install the dependencies that chrome requires:
yum install libX11 libXcursor libXdamage libXext libXcomposite libXi libXrandr gtk3 libappindicator-gtk3 xdg-utils libXScrnSaver liberation-fonts
Then run the following command to install chrome:
rpm -ivh google-chrome-stable_current_x86_64.rpm
After the installation completes, run the following command to check the installed version:
[root@localhost ~]# google-chrome --version
Google Chrome 70.0.3538.110
The output shows that the installed version is 70.
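The major version in this output is what decides which chromedriver release to download. As a minimal sketch, the number could be extracted in Java as follows. The class name ChromeVersionParser and its method are hypothetical helpers for illustration, not part of the original program, and the sketch assumes the output always contains a dotted version number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original article): extracts the major
// version from the text printed by `google-chrome --version`.
class ChromeVersionParser {

    // Matches the first dotted version number, e.g. "70.0.3538.110",
    // and captures the part before the first dot.
    private static final Pattern VERSION = Pattern.compile("(\\d+)(?:\\.\\d+)+");

    /** Returns the major version, or -1 if no dotted version number is found. */
    static int majorVersion(String versionOutput) {
        if (versionOutput == null) {
            return -1;
        }
        Matcher matcher = VERSION.matcher(versionOutput);
        return matcher.find() ? Integer.parseInt(matcher.group(1)) : -1;
    }

    public static void main(String[] args) {
        // Uses the sample output shown above; prints 70.
        System.out.println(majorVersion("Google Chrome 70.0.3538.110"));
    }
}
```

Returning -1 instead of throwing keeps the check easy to use in a startup script, at the cost of the caller having to test for the sentinel value.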
Install chrome webdriver
As in the article "Java use selenium and chrome browser download dynamic page", find the chromedriver release that supports chrome version 70 and download the file for the Linux platform, chromedriver_linux64.zip.
Sample code for calling chrome in headless mode through selenium
The program is again the one from "Java use selenium and chrome browser download dynamic page", modified to run in headless mode:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

// The statements below belong inside a method of the crawler program.
WebDriver webDriver = null;
try {
    String url = "https://www.jianshu.com/p/675ea919230e";
    ChromeOptions chromeOptions = new ChromeOptions();
    // Enable chrome's headless mode
    chromeOptions.setHeadless(Boolean.TRUE);
    // Start a chrome instance
    webDriver = new ChromeDriver(chromeOptions);
    // Visit the URL
    webDriver.get(url);
    // Parse the rendered page source with Jsoup
    Document document = Jsoup.parse(webDriver.getPageSource());
    Element titleElement = document.selectFirst("div.article h1.title");
    Element authorElement = document.selectFirst("div.article div.author span.name");
    Element timeElement = document.selectFirst("div.article span.publish-time");
    Element wordCountElement = document.selectFirst("div.article span.wordage");
    Element viewCountElement = document.selectFirst("div.article span.views-count");
    Element commentCountElement = document.selectFirst("div.article span.comments-count");
    Element likeCountElement = document.selectFirst("div.article span.likes-count");
    Element contentElement = document.selectFirst("div.article div.show-content");
    if (titleElement != null) {
        System.out.println("标题:" + titleElement.text()); // title
    }
    if (authorElement != null) {
        System.out.println("作者:" + authorElement.text()); // author
    }
    if (timeElement != null) {
        System.out.println("发布时间:" + timeElement.text()); // publish time
    }
    if (wordCountElement != null) {
        System.out.println(wordCountElement.text()); // word count
    }
    if (viewCountElement != null) {
        System.out.println(viewCountElement.text()); // view count
    }
    if (commentCountElement != null) {
        System.out.println(commentCountElement.text()); // comment count
    }
    if (likeCountElement != null) {
        System.out.println(likeCountElement.text()); // like count
    }
    if (contentElement != null && contentElement.text() != null) {
        System.out.println("正文长度:" + contentElement.text().length()); // body length
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (webDriver != null) {
        // Quit chrome
        webDriver.quit();
    }
}
Compared with the code in the previous article, the only difference is the following:
ChromeOptions chromeOptions = new ChromeOptions();
// Enable chrome's headless mode
chromeOptions.setHeadless(Boolean.TRUE);
// Start a chrome instance
webDriver = new ChromeDriver(chromeOptions);
The setHeadless parameter determines whether chrome starts in headless mode.
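Headless mode ultimately comes down to the command-line flags chrome is started with. The sketch below, using only the Java standard library, shows one way a crawler could decide whether to run headless and assemble the matching flags. The class ChromeFlags, its methods, the DISPLAY check, and the extra flags (--disable-gpu, --window-size) are assumptions made for illustration and do not come from the original program.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the original article): builds the chrome
// command-line flags for a crawler process.
class ChromeFlags {

    /** Returns the flags to start chrome with; "--headless" is added only when requested. */
    static List<String> build(boolean headless) {
        List<String> flags = new ArrayList<>();
        if (headless) {
            flags.add("--headless");
            // Assumption: often paired with --headless on older chrome builds.
            flags.add("--disable-gpu");
        }
        // Assumption: a fixed window size keeps the page layout predictable.
        flags.add("--window-size=1920,1080");
        return flags;
    }

    /** Treats an unset or blank DISPLAY variable (X11 convention) as "no GUI available". */
    static boolean displayMissing(String displayEnv) {
        return displayEnv == null || displayEnv.trim().isEmpty();
    }

    public static void main(String[] args) {
        boolean headless = displayMissing(System.getenv("DISPLAY"));
        // With selenium on the classpath, the list could be applied with
        // chromeOptions.addArguments(build(headless));
        System.out.println(build(headless));
    }
}
```

Deciding from the environment means the same jar could open a visible window on a developer desktop and run headless on the server, although hard-coding headless mode as the article does is the simpler choice for a server-only crawler.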
Package the program, upload it to the Linux server, and run the following command:
java -jar -Dwebdriver.chrome.driver=/data/deploy/chromedriver spider_demo-0.0.1-SNAPSHOT.jar
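The -Dwebdriver.chrome.driver property tells selenium where the chromedriver executable is. A wrong path or a file without execute permission is an easy mistake after unzipping on the server, so a small pre-flight check may help. The following is a sketch only: the class DriverPathCheck and its messages are hypothetical and not part of the original program.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check (not from the original article).
class DriverPathCheck {

    /** Returns null if the path looks usable, otherwise a short description of the problem. */
    static String problemWith(String driverPath) {
        if (driverPath == null || driverPath.isEmpty()) {
            return "webdriver.chrome.driver is not set";
        }
        Path path = Paths.get(driverPath);
        if (!Files.isRegularFile(path)) {
            return "no file at " + driverPath;
        }
        if (!Files.isExecutable(path)) {
            return "file is not executable: " + driverPath;
        }
        return null;
    }

    public static void main(String[] args) {
        // Reads the same system property that is passed with -D on the command line.
        String problem = problemWith(System.getProperty("webdriver.chrome.driver"));
        System.out.println(problem == null ? "chromedriver path looks usable" : problem);
    }
}
```

Running this check before creating the ChromeDriver gives a short, readable message instead of a long selenium stack trace when the path is wrong.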
The console prints the following (title, author, publish time, word count, view count, comment count, like count, and body length):
标题:是什么支撑了淘宝双十一,没错就是它java编程语言。
作者:Java帮帮
发布时间:2018.08.29 14:49
字数 561
阅读 632
评论 0
喜欢 4
正文长度:655
This shows that the Java program successfully called chrome through selenium on Linux to visit the page. Without the headless mode setting above, the following error is reported at runtime:
org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
(Driver info: chromedriver=2.44.609551 (5d576e9a44fe4c5b6a07e568f1ebc753f1214634),platform=Linux 3.10.0-514.26.2.el7.x86_64 x86_64)
(WARNING: The server did not provide any stacktrace information)
Command duration or timeout: 399 milliseconds
Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: 'iz2ze9kvzy03hms75m3jzlz', ip: '172.17.251.3', os.name: 'Linux', os.arch: 'amd64', os.version: '3.10.0-514.26.2.el7.x86_64', java.version: '1.8.0_171'
Driver info: driver.version: ChromeDriver
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.openqa.selenium.remote.ErrorHandler.createThrowable(ErrorHandler.java:214)
    at org.openqa.selenium.remote.ErrorHandler.throwIfResponseFailed(ErrorHandler.java:166)
    at org.openqa.selenium.remote.JsonWireProtocolResponse.lambda$new$0(JsonWireProtocolResponse.java:53)
    at org.openqa.selenium.remote.JsonWireProtocolResponse.lambda$getResponseFunction$2(JsonWireProtocolResponse.java:91)
    at org.openqa.selenium.remote.ProtocolHandshake.lambda$createSession$0(ProtocolHandshake.java:122)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.Spliterators$ArraySpliterator.tryAdvance(Spliterators.java:958)
    at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
    at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:485)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:464)
    at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:125)
    at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:73)
    at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:136)
    at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:83)
    at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:548)
    at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:212)
    at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:130)
    at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:181)
    at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:168)
    at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:123)
    at com.yanggaochao.spider.SpiderDemoApplication.run(SpiderDemoApplication.java:34)
    at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:813)
    at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:797)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:324)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:1260)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:1248)
    at com.yanggaochao.spider.SpiderDemoApplication.main(SpiderDemoApplication.java:21)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:48)
    at org.springframework.boot.loader.Launcher.launch(Launcher.java:87)
    at org.springframework.boot.loader.Launcher.launch(Launcher.java:50)
    at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:51)
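The useful part of the trace above is its first few lines; the rest is framework noise. As a rough sketch, a crawler could turn the exception message into a short hint for whoever operates the server. The class StartupFailureHint and the wording of its hints are hypothetical, and the matched phrases are simply taken from the error text shown above, so other causes of the same message are not distinguished.

```java
// Hypothetical helper (not from the original article): maps the message of a
// chrome startup failure to a short hint for the operator.
class StartupFailureHint {

    static String hintFor(String message) {
        if (message == null) {
            return "no message available";
        }
        // Phrases copied from the error output shown in the article.
        if (message.contains("DevToolsActivePort file doesn't exist")
                || message.contains("Chrome failed to start")) {
            return "chrome could not start; on a server without a GUI, check that headless mode is enabled";
        }
        return "unrecognized failure; inspect the full stack trace";
    }

    public static void main(String[] args) {
        String sample = "unknown error: Chrome failed to start: exited abnormally";
        System.out.println(hintFor(sample));
    }
}
```

In the sample program, such a helper could be called from the catch block with e.getMessage() before printing the full stack trace.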
In this way, our crawler system can use a real browser to download pages and get a what-you-see-is-what-you-get result, so there is no longer any need to worry that dynamically rendered web content cannot be downloaded.
Author: Tuu not my
Link: https://www.jianshu.com/p/b2609ed57f07
Source: Jianshu
Copyright belongs to the author. For reproduction in any form, please contact the author for authorization and indicate the source.