background:
Company management system needs to obtain information about configuration parameters, such as enterprise micro-enterprise page letter name, logo, etc. and the number of operations, related to hide sensitive information and simplify the process of configuring a custom Enterprise
The first edition has achieved a scan code to obtain login cookie, the cookie will be able to use to obtain legal status and freely request page interface, the first edition of the simulated operation is mainly caught the interface, an interface to use with a no no
Source page second edition of this version requires some configuration parameters js is rendered up, there is no interface, an ordinary get page after page of the document can not get rendered, we can only use a headless browser to crawl and operation page
Implementation process:
laravel version
Laravel development project is to use, the first thought is integrated into the framework, and laravel does provide related components: Laravel Dusk
Although this plug-in is used for browser testing, but is often used to crawl the page
Handsome ,, but when the installation does not go up operation,
PHP version
Well, then realize it themselves, directly on the code
Their own package to a class, new time directly to the login cookie pass over before, so you can jump directly to the page
QyWebChrome class { # google-chrome download the corresponding version of the driver https://sites.google.com/a/chromium.org/chromedriver/downloads Private envchromedriverpath $ = 'webdriver.chrome.driver = / usr / bin / chromedriver'; $ Driver Private; Private $ error; public function the __construct ($ cookie_str) { the putenv ($ this-> envchromedriverpath); $ :: Capabilities = DesiredCapabilities Chrome (); // $ cookie_str = '= sssf1 ;; _gxxxx = sdfn. 1' ; // '- headless' headless mode: the browser is running in the background, the browser installed on the server desktop environment can preview the entire process to remove $ capabilities-> setCapability ( 'chromeOptions', [ 'args' => [' --disable-gpu ',' - headless', '- user-agent = Mozilla / 5.0 (Windows NT 10.0; Win64; x64) AppleWebKit / 537.36 (KHTML, like Gecko) Chrome / Safari 71.0.3578.80 / 537.36 ']] ); $ this-> Driver = ChromeDriver :: Start ($ The Capabilities, null); SLEEP (3); // go to the index set the login cookie, after which wanted to jump on the jump page which page $ this-> driver-> GET ( 'https://work.weixin.qq.com/'); SLEEP (2); // Adding the cookie $ this-> driver-> the Manage () -> deleteAllCookies () ; SLEEP (. 1); $ = cookie_arr the explode ( ';', $ cookie_str); the foreach (AS $ $ cookie_arr cookpair) { $ = cookie_item the explode ( '=', $ cookpair); $ cookie = [ 'name' => TRIM ($ cookie_item [0]), 'value' => TRIM ($ cookie_item [. 1]), 'Domain' => 'work.weixin.qq.com', 'httpOnly' => to false, 'path' => '/', 'Secure' => to false, ]; $ this-> driver-> Manage () -> AddCookie ($ Cookie); } SLEEP (. 1); } public __destruct function () { $ this-> driver-> Close (); } // jump to my business enterprise information page for public getProfilePage function () { $ Data = []; $ this-> driver-> GET ( 'HTTPS://work.weixin.qq.com/wework_admin/frame#profile/enterprise'); SLEEP (. 3); // business logoUrl // enterprises referred $ companynamespan = $ this-> driver-> findElement ( WebDriverBy :: className ( 'profile_enterprise_item_shareName') ); $ Data [ 'CompanyName'] = $ companynamespan-> getText ( ); return $ Data; } // Get the html renderer // $ driver-> getPageSource (); / * the webdriver mainly provides two DOM API to the operating element to our RemoteWebDriver :: findElement (WebDriverBy) obtaining single element RemoteWebDriver :: findElements (WebDriverBy) to obtain a list of elements WebDriverBy is a query object provides the following several commonly used way WebDriverBy :: id ($ id) find elements based on ID WebDriverBy :: className ($ className) Find a class element according to WebDriverBy :: cssSelector ($ selctor) according to general inquiries css selectors WebDriverBy :: name ($ name) query based on the element's name attribute WebDriverBy :: linkText ($ text) according to the text anchor visible elements of the query WebDriverBy :: tagName ($ tagName) query based on the element tag name WebDriverBy :: xpath ($ xpath ) according xpath query expression, this is very powerful * / // Screenshot public function takeScrenshot ($ savepath = "test.png") { $ this-> driver-> takeScreenshot ($ savepath); } / ** * @return Mixed * / public function the getError () { return $ this-> error; } / ** * @param Mixed $ error * / public function the setError ($ error): void { $ this-> $ error = error; } }
Deployment:
Install google-chrome
yum install google-chrome
After installation is complete acquisition chrome version
Download the corresponding chromedriver https://sites.google.com/a/chromium.org/chromedriver/downloads ah this song in Valley
Page is like this, mainly of correspondence between googlechrome and chromedirver
Various versions here are saved to a network disk https://pan.baidu.com/s/1xykIgqKUAUm_l0iVHbh6kw extraction code: hbvz
Run shot:
This completes the thought, I did not expect a problem can not be deployed on the line! !
wf ?? original operation and maintenance to ensure the server version of the software is compatible with low, very low reliance C version installed, so dependent on the underlying or do not move,
There are two solutions:
1 to find the server version of the installation of high GLIBC_2.14, GLIBC_2.16;
2 reptile inside this package to docker, provide external crawling service is a direct request to the time the interface, the interface to crawl back into the business of micro-channel page
Because the company has k8s clusters, so the direct build a docker more simply, so choose Option 2
Python docker version
It is as simple as using a docker point, direct use python scripts, crawlers, or use some of the more violent python, dependent on a variety of direct pip, 2017 use a headless browser to do the monitoring reptiles when driving or using phantomjs it, now chrome of headless direct switching over, api not changed,
First Package docker: go dockers in the environment take up and find out the relevant dependent
docker run -it -v /test:/test python:3.7.4 /bin/bash
Use / test as a shared directory, and the host to facilitate file transfer docker
Install google-chrome, python: 3.7.4 Direct Download deb installation package https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
这有网盘共享 链接: https://pan.baidu.com/s/15rlArB7xCGOHXSko6UUkJA 提取码: p6d5
docker内安装google-chrome
然后就是解决依赖,
现在直接上Dockerfile
# Use an official Python runtime as a parent image FROM python:3.7.4 # Set the working directory to /app WORKDIR /app # Copy the current directory contents into the container at /app COPY . /app #install depend RUN apt update && apt -y --fix-broken install libnss3-dev fonts-liberation libappindicator3-1 libasound2 libatk-bridge2.0-0 libatk1.0-0 libatspi2.0-0 libcups2 libdbus-1-3 libgtk-3-0 libx11-xcb1 libxcomposite1 libxcursor1 libxdamage1 libxfixes3 libxi6 libxrandr2 libxtst6 lsb-release xdg-utils libdbusmenu-glib4 libdbusmenu-gtk3-4 libindicator3-7 libasound2-data libatk1.0-data libavahi-client3 libavahi-common3 adwaita-icon-theme libcolord2 libepoxy0 libjson-glib-1.0-0 librest-0.7-0 libsoup2.4-1 libwayland-client0 libwayland-cursor0 libwayland-egl1 libxinerama1 libxkbcommon0 libgtk-3-common libgtk-3-bin distro-info-data gtk-update-icon-cache libavahi-common-data gtk-update-icon-cache dconf-gsettings-backend libjson-glib-1.0-common libsoup-gnome2.4-1 glib-networking xkb-data dconf-service libdconf1 libproxy1v5 glib-networking-services glib-networking-common gsettings-desktop-schemas default-dbus-session-bus dbus libpam-systemd systemd systemd-sysv libapparmor1 libapparmor1 libcryptsetup12 libidn11 libip4tc0 libkmod2 libargon2-1 libdevmapper1.02.1 libjson-c3 dmsetup \ && apt-get install -y fonts-wqy-zenhei \ && dpkg -i google-chrome-stable_current_amd64.deb # Install any needed packages specified in requirements.txt RUN pip install --trusted-host pypi.python.org -r requirements.txt d +x run.sh # Make port 80 available to the world outside this container EXPOSE 80 # Define environment variable ENV NAME World #v1 dev Run app.py when the container launches #CMD ["python", "app.py"] #v2 production #CMD ["gunicorn","--config","gunicorn_config.py","app:app"] #v3
#ENTRYPOINT ["gunicorn","--config","gunicorn_config.py","app:app"] #v4 ENTRYPOINT ["./run.sh"]
项目目录
app.py 处理请求
from flask import Flask import os import socket from selenium import webdriver from selenium.webdriver.common.keys import Keys from selenium.webdriver.chrome.options import Options from selenium.webdriver.support.wait import WebDriverWait from selenium.webdriver.support import expected_conditions as EC from selenium.webdriver.common.by import By import time from time import sleep app = Flask(__name__) @app.route("/hello/<cookie_str>/<aim_url>/<end_class>") def hello(cookie_str,aim_url,end_class): print(cookie_str) print(aim_url) print(end_class) chrome_options = Options() chrome_options.add_argument("--headless") chrome_options.add_argument("--no-sandbox") chrome_options.add_argument("--disable-gpu") chrome_options.add_argument('window-size=1200x600') user_ag='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36' chrome_options.add_argument('user-agent=%s'%user_ag) profile_url = "https://work.weixin.qq.com/wework_admin/frame#profile" base_url = "https://work.weixin.qq.com" #chromedriver driver = webdriver.Chrome(executable_path=(r'/test/chromedriver'), chrome_options=chrome_options) #加载首页设置登录cookie driver.get(base_url + "/") driver.implicitly_wait(10) driver.save_screenshot('screen1.png') for coo in cookie_str.split(';'): cooki=coo.split('=') print(cooki[0].strip()) print(cooki[1].strip()) driver.add_cookie({'name':cooki[0].strip(),'value':cooki[1].strip(),'domain':'work.weixin.qq.com','httpOnly':False,'path':'/','secure':False}) driver.implicitly_wait(10) #跳转目标页面 driver.get(profile_url) WebDriverWait(driver,20,0.5).until(EC.presence_of_element_located((By.CLASS_NAME, 'ww_commonCntHead_title_inner_text'))) #sleep(5) driver.save_screenshot('screen.png') driver.close() return "hhhhhh $s" % cookie_str if __name__ == "__main__": app.run(host='0.0.0.0', port=80)
run.sh
#!/bin/bash set -e pwd touch access.log error.log exec gunicorn app:app \ --bind 0.0.0.0:80 \ --workers 4 \ --timeout 120 \ --log-level debug \ --access-logfile=access.log \ --error-logfile=error.log exec "$@"
requirements.txt
Flask selenium gunicorn
flask的内置服务器开发的时候能用,线上部署的时候使用官方推荐的gunicorn部署,这里直接用了gunicorn运行
gunicorn的启动配置后来写进run.sh了,所以gunicorn_config.py就没用了
docker 镜像构建
docker build -t mypythonflask:v6 .
docker启动命令
docker run -d -v /data:/data -p 8888:80 -v /dev/shm:/dev/shm mypythonflask:v6
这里的/dev/shm是为了解决当加载的页面过大或者加载大图docker内存不够浏览器爆掉 解决方案来着 https://stackoverflow.com/questions/39936240/selenium-error-in-python-webdriverexception-unknown-error-session-deleted-bec/57302028#57302028
Selenium error in python: WebDriverException: unknown error: session deleted because of page crash from tab crashed
请求测试
[root@localhost testdockerchrome]# curl "http://localhost:8888/hello/sss=sss;%20_ssst=1/bb/cc"
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
#处理时间太长导致超时,检查下截图
这曲折的实现历程。。。
至此,爬取服务搭建完毕,后面只要是处理一下业务相关的东西,比如拓展app.py的功能,使其支持更多的操作
总结下来就是使用docker部署了一个服务,该服务接收登录cookie,url,配置等参数,使用chrome的headless模式抓取页面操作页面,返回结果,拓展浏览器操作可以写在app.py中
^-^原创文章,转载请声明出处