Google set up a headless browser to fetch page mode service, laravel-> php-> python-> docker !!!

 background:

Company management system needs to obtain information about configuration parameters, such as enterprise micro-enterprise page letter name, logo, etc. and the number of operations, related to hide sensitive information and simplify the process of configuring a custom Enterprise

The first edition has achieved a scan code to obtain login cookie, the cookie will be able to use to obtain legal status and freely request page interface, the first edition of the simulated operation is mainly caught the interface, an interface to use with a no no

Source page second edition of this version requires some configuration parameters js is rendered up, there is no interface, an ordinary get page after page of the document can not get rendered, we can only use a headless browser to crawl and operation page

Implementation process:

laravel version

Laravel development project is to use, the first thought is integrated into the framework, and laravel does provide related components: Laravel Dusk

Although this plug-in is used for browser testing, but is often used to crawl the page

Handsome ,, but when the installation does not go up operation,

 

PHP version

Well, then realize it themselves, directly on the code

Their own package to a class, new time directly to the login cookie pass over before, so you can jump directly to the page

QyWebChrome class 
{ 
  # google-chrome download the corresponding version of the driver https://sites.google.com/a/chromium.org/chromedriver/downloads 
    Private envchromedriverpath $ = 'webdriver.chrome.driver = / usr / bin / chromedriver'; 

    $ Driver Private; 

    Private $ error; 

    public function the __construct ($ cookie_str) 
    { 
        the putenv ($ this-> envchromedriverpath); 
        $ :: Capabilities = DesiredCapabilities Chrome (); 
// $ cookie_str = '= sssf1 ;; _gxxxx = sdfn. 1' ; 

        // '- headless' headless mode: the browser is running in the background, the browser installed on the server desktop environment can preview the entire process to remove 
        $ capabilities-> setCapability ( 
            'chromeOptions',
            [ 'args' => [' --disable-gpu ',' - headless', '- user-agent = Mozilla / 5.0 (Windows NT 10.0; Win64; x64) AppleWebKit / 537.36 (KHTML, like Gecko) Chrome / Safari 71.0.3578.80 / 537.36 ']] 
        ); 

        $ this-> Driver = ChromeDriver :: Start ($ The Capabilities, null); 

        SLEEP (3); 
        // go to the index set the login cookie, after which wanted to jump on the jump page which page 
        $ this-> driver-> GET ( 'https://work.weixin.qq.com/'); 
        SLEEP (2); 
        // Adding the cookie 
        $ this-> driver-> the Manage () -> deleteAllCookies () ; 
        SLEEP (. 1); 
        $ = cookie_arr the explode ( ';', $ cookie_str); 
        the foreach (AS $ $ cookie_arr cookpair) { 
            $ = cookie_item the explode ( '=', $ cookpair);
            $ cookie = [
                'name' => TRIM ($ cookie_item [0]), 
                'value' => TRIM ($ cookie_item [. 1]), 
                'Domain' => 'work.weixin.qq.com', 
                'httpOnly' => to false, 
                'path' => '/', 
                'Secure' => to false, 
            ]; 
            $ this-> driver-> Manage () -> AddCookie ($ Cookie); 
        } 

        SLEEP (. 1); 

    } 

    public __destruct function () 
    { 
        $ this-> driver-> Close (); 
    } 

    // jump to my business enterprise information page for 
    public getProfilePage function () { 
        $ Data = []; 
        $ this-> driver-> GET ( 'HTTPS://work.weixin.qq.com/wework_admin/frame#profile/enterprise');
        SLEEP (. 3); 
        // business logoUrl 

        // enterprises referred 
        $ companynamespan = $ this-> driver-> findElement ( 
            WebDriverBy :: className ( 'profile_enterprise_item_shareName') 
        ); 
        $ Data [ 'CompanyName'] = $ companynamespan-> getText ( ); 

        return $ Data; 
    } 

    // Get the html renderer 
    // $ driver-> getPageSource (); 
    / * 
     the webdriver mainly provides two DOM API to the operating element to our 

RemoteWebDriver :: findElement (WebDriverBy) obtaining single element 
RemoteWebDriver :: findElements (WebDriverBy) to obtain a list of elements 
WebDriverBy is a query object provides the following several commonly used way 

WebDriverBy :: id ($ id) find elements based on ID  
WebDriverBy :: className ($ className) Find a class element according to
WebDriverBy :: cssSelector ($ selctor) according to general inquiries css selectors
WebDriverBy :: name ($ name) query based on the element's name attribute 
WebDriverBy :: linkText ($ text) according to the text anchor visible elements of the query 
WebDriverBy :: tagName ($ tagName) query based on the element tag name 
WebDriverBy :: xpath ($ xpath ) according xpath query expression, this is very powerful 
     * / 

    // Screenshot 
    public function takeScrenshot ($ savepath = "test.png") { 
        $ this-> driver-> takeScreenshot ($ savepath); 
    } 

    / ** 
     * @return Mixed 
     * / 
    public function the getError () 
    { 
        return $ this-> error; 
    } 

    / ** 
     * @param Mixed $ error 
     * / 
    public function the setError ($ error): void  
    {
        $ this-> $ error = error; 
    } 
}

  Deployment:

Install google-chrome

yum install google-chrome

  After installation is complete acquisition chrome version

Download the corresponding chromedriver https://sites.google.com/a/chromium.org/chromedriver/downloads ah this song in Valley

Page is like this, mainly of correspondence between googlechrome and chromedirver

 

 

Various versions here are saved to a network disk https://pan.baidu.com/s/1xykIgqKUAUm_l0iVHbh6kw extraction code: hbvz 

Run shot:

 

  

This completes the thought, I did not expect a problem can not be deployed on the line! !

wf ?? original operation and maintenance to ensure the server version of the software is compatible with low, very low reliance C version installed, so dependent on the underlying or do not move,

There are two solutions:

1 to find the server version of the installation of high GLIBC_2.14, GLIBC_2.16;

2 reptile inside this package to docker, provide external crawling service is a direct request to the time the interface, the interface to crawl back into the business of micro-channel page

Because the company has k8s clusters, so the direct build a docker more simply, so choose Option 2

Python docker version

It is as simple as using a docker point, direct use python scripts, crawlers, or use some of the more violent python, dependent on a variety of direct pip, 2017 use a headless browser to do the monitoring reptiles when driving or using phantomjs it, now chrome of headless direct switching over, api not changed,

First Package docker: go dockers in the environment take up and find out the relevant dependent

docker run -it -v /test:/test python:3.7.4 /bin/bash

  

Use / test as a shared directory, and the host to facilitate file transfer docker

Install google-chrome, python: 3.7.4 Direct Download deb installation package https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb

这有网盘共享 链接: https://pan.baidu.com/s/15rlArB7xCGOHXSko6UUkJA 提取码: p6d5

docker内安装google-chrome

然后就是解决依赖,

现在直接上Dockerfile

# Use an official Python runtime as a parent image
FROM python:3.7.4

# Set the working directory to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

#install depend
RUN apt update && apt -y --fix-broken install libnss3-dev fonts-liberation libappindicator3-1 libasound2 libatk-bridge2.0-0 libatk1.0-0 libatspi2.0-0 libcups2 libdbus-1-3 libgtk-3-0 libx11-xcb1 libxcomposite1 libxcursor1 libxdamage1 libxfixes3 libxi6 libxrandr2 libxtst6 lsb-release xdg-utils libdbusmenu-glib4 libdbusmenu-gtk3-4 libindicator3-7 libasound2-data libatk1.0-data libavahi-client3 libavahi-common3 adwaita-icon-theme libcolord2 libepoxy0 libjson-glib-1.0-0 librest-0.7-0 libsoup2.4-1 libwayland-client0 libwayland-cursor0 libwayland-egl1 libxinerama1 libxkbcommon0 libgtk-3-common libgtk-3-bin distro-info-data gtk-update-icon-cache libavahi-common-data gtk-update-icon-cache dconf-gsettings-backend libjson-glib-1.0-common libsoup-gnome2.4-1 glib-networking xkb-data dconf-service libdconf1 libproxy1v5 glib-networking-services glib-networking-common gsettings-desktop-schemas default-dbus-session-bus dbus libpam-systemd systemd systemd-sysv libapparmor1 libapparmor1 libcryptsetup12 libidn11 libip4tc0 libkmod2 libargon2-1 libdevmapper1.02.1  libjson-c3 dmsetup \
&& apt-get install -y fonts-wqy-zenhei \
&& dpkg -i google-chrome-stable_current_amd64.deb

# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt

d +x run.sh

# Make port 80 available to the world outside this container
EXPOSE 80

# Define environment variable
ENV NAME World

#v1 dev Run app.py when the container launches
#CMD ["python", "app.py"]
#v2 production
#CMD ["gunicorn","--config","gunicorn_config.py","app:app"]
#v3 
#ENTRYPOINT [
"gunicorn","--config","gunicorn_config.py","app:app"] #v4 ENTRYPOINT ["./run.sh"]

项目目录

app.py   处理请求

from flask import Flask
import os
import socket
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
from time import sleep


app = Flask(__name__)

@app.route("/hello/<cookie_str>/<aim_url>/<end_class>")
def hello(cookie_str,aim_url,end_class):
    print(cookie_str)
    print(aim_url)
    print(end_class)
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument('window-size=1200x600')
    user_ag='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'
    chrome_options.add_argument('user-agent=%s'%user_ag)

    profile_url = "https://work.weixin.qq.com/wework_admin/frame#profile"
    base_url = "https://work.weixin.qq.com"
    #chromedriver
    driver = webdriver.Chrome(executable_path=(r'/test/chromedriver'), chrome_options=chrome_options)

    #加载首页设置登录cookie
    driver.get(base_url + "/")
    driver.implicitly_wait(10)
    driver.save_screenshot('screen1.png')
    for coo in cookie_str.split(';'):
        cooki=coo.split('=')
        print(cooki[0].strip())
        print(cooki[1].strip())
        driver.add_cookie({'name':cooki[0].strip(),'value':cooki[1].strip(),'domain':'work.weixin.qq.com','httpOnly':False,'path':'/','secure':False})

    driver.implicitly_wait(10)

    #跳转目标页面
    driver.get(profile_url)
    WebDriverWait(driver,20,0.5).until(EC.presence_of_element_located((By.CLASS_NAME, 'ww_commonCntHead_title_inner_text')))
    #sleep(5)
    driver.save_screenshot('screen.png')

    driver.close()
    return "hhhhhh $s" % cookie_str

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=80)

  

run.sh

#!/bin/bash
set -e
pwd

touch access.log error.log

exec gunicorn app:app \
        --bind 0.0.0.0:80 \
        --workers 4 \
        --timeout 120 \
        --log-level debug \
        --access-logfile=access.log \
        --error-logfile=error.log

exec "$@"

  

requirements.txt

Flask
selenium
gunicorn

  

flask的内置服务器开发的时候能用,线上部署的时候使用官方推荐的gunicorn部署,这里直接用了gunicorn运行

gunicorn的启动配置后来写进run.sh了,所以gunicorn_config.py就没用了

docker 镜像构建

docker build -t mypythonflask:v6 .

docker启动命令

docker run -d -v /data:/data -p 8888:80 -v /dev/shm:/dev/shm mypythonflask:v6

这里的/dev/shm是为了解决当加载的页面过大或者加载大图docker内存不够浏览器爆掉 解决方案来着 https://stackoverflow.com/questions/39936240/selenium-error-in-python-webdriverexception-unknown-error-session-deleted-bec/57302028#57302028

Selenium error in python: WebDriverException: unknown error: session deleted because of page crash from tab crashed

  

 请求测试

[root@localhost testdockerchrome]# curl "http://localhost:8888/hello/sss=sss;%20_ssst=1/bb/cc"
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>

#处理时间太长导致超时,检查下截图

 

  

这曲折的实现历程。。。

至此,爬取服务搭建完毕,后面只要是处理一下业务相关的东西,比如拓展app.py的功能,使其支持更多的操作

总结下来就是使用docker部署了一个服务,该服务接收登录cookie,url,配置等参数,使用chrome的headless模式抓取页面操作页面,返回结果,拓展浏览器操作可以写在app.py中

 

^-^原创文章,转载请声明出处

 

Guess you like

Origin www.cnblogs.com/timseng/p/11287815.html