Selenium & Playwright: obtaining a website's Authorization credentials to forge requests with requests

This article is a hands-on walkthrough. If you are not familiar with selenium or playwright, it is worth brushing up on the following first:

Cookie, session, request and header concepts

selenium: using get_log() to read the logged-in user's credential information, attaching to a specified browser to avoid logging in, and forging request headers

playwright: the Page, Request and Route classes; the Authentication and Network sections of the docs

The framework versions used in this article are as follows:

python-3.8.8
selenium-3.141.0
playwright-1.32.1
requests-2.27.1

Note that selenium 4 and selenium 3 differ in a few operations; those differences are not explored here.

A small complaint: there is very little material on playwright out there beyond the basics, so I had to rely on the official website myself. The docs are decent, and with enough experimentation you can get things working, but it is really tiring. woo woo woo~

Background

1. We need to log into a Google web console. Logging in through automation is detected by Google, which blocks the login request, and there is a serious risk of the account being banned (as mentioned in the previous article).

2. Use selenium or playwright to attach to a browser that is already logged into the Google account and capture the user's authentication information.

3. Forge the request headers and pull data from the corresponding API endpoints with requests.

The walkthrough in this article is based on the Firebase console, https://console.firebase.google.com/

If you have never used it, you can practice on any other Google product, such as Gmail; the point is the idea and the method. See the step-by-step analysis below.

Knowledge point: obtaining the logged-in user's authentication information with selenium

Straight to the code:

__author__ = "梦无矶小仔"
import json,time,requests
from datetime import datetime, timedelta
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

def get_headers():
    # 关键步骤 1:下面两行代码是用来设置特性,获取request的信息前提步骤。
    d = DesiredCapabilities.CHROME
    d['loggingPrefs'] = {'performance': 'ALL'}
    options = webdriver.ChromeOptions()
    options.add_experimental_option('useAutomationExtension', False)
    # # 防止打印一些无用的日志
    options.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])
    options.add_argument("--disable-software-rasterizer")

    chrome_options = Options()
    chrome_options.add_experimental_option('w3c', False)
    chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
    chrome_driver = "./chromedriver.exe"  # 我是把chromedriver驱动放在项目根目录下
    driver = webdriver.Chrome(executable_path=chrome_driver, options=options, chrome_options=chrome_options)
    driver.get("https://console.firebase.google.com/")
    info = driver.get_log('performance')
    cookie_list = []
    for i in info:
        dic_info = json.loads(i["message"])  # 把json格式转成字典。
        infom = dic_info["message"]  # request 信息,在字典的 键 ["message"]['params'] 中。
        if infom['method'] == 'Network.requestWillBeSentExtraInfo' and infom["params"]["headers"].get(":authority"):
            if infom["params"]["headers"][":authority"] == "mobilesdk-pa.clients6.google.com" and \
                    infom["params"]["headers"][":method"] == 'POST':
                cookie_list.append(infom["params"]["headers"])

    authorization = cookie_list[0]["authorization"]
    cookie = cookie_list[0]["cookie"]
 
    # 伪造请求头
    headers = {
        "Host": "crashlytics-pa.clients6.google.com",
        "content-type": "application/json",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/96.0.1054.62",
        "referer": "https://console.firebase.google.com/",
        "cookie": cookie,
        "origin": "https://console.firebase.google.com",
        "authorization": authorization
    }

    return headers

"Code Analysis"

1. There is not much to analyze: the code filters the performance log for the endpoint I care about and collects all of that request's header information. A short usage sketch follows this list.

2. Leave me a message if anything is unclear.
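
With the headers returned by get_headers(), pulling data is an ordinary requests call. Below is a minimal usage sketch; the endpoint URL and the JSON payload are placeholders I made up for illustration, not the real (desensitized) Crashlytics interface.

import requests

headers = get_headers()

# Placeholder endpoint and payload -- substitute the real (desensitized) interface and request body.
api_url = "https://crashlytics-pa.clients6.google.com/..."
payload = {}

resp = requests.post(api_url, headers=headers, json=payload)
print(resp.status_code)
print(resp.text)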

"Notice"

I am using selenium 3; if you are on selenium 4, use the following approach (selenium-wire) to obtain the headers:

from seleniumwire import webdriver  # pip install selenium-wire
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('http://....')  # open the target page (URL elided)

# Extract the Authorization header
Authorization_str = ''
for request in browser.requests:  # iterate over every captured request
    # if request.method == 'POST' and \
    #    request.url == 'http://....':  # optionally narrow this down to one specific request
    if 'Authorization' in request.headers:  # the header we are after
        Authorization_str = request.headers['Authorization']  # found it
        break

"Key points:"

1. After installing selenium, you also need to install selenium-wire and import webdriver from it:

from seleniumwire import webdriver  # pip install selenium-wire
instead of
from selenium import webdriver

2. Only the webdriver comes from seleniumwire; everything else, such as By and Keys, is still imported from selenium as usual.

Knowledge point: getting cookies with playwright

Playwright official cookie code: BrowserContext | Playwright Python

"method one:"

Automatically open the browser, save the cookie to the local through playwright after manual login, and then need to read the cookie directly through the file.

__author__ = "梦无矶小仔"
from playwright.sync_api import sync_playwright
import json
# 先手动登录,保存Cooies到文件。
def saveCookies():
    with sync_playwright() as p:
        # 显示浏览器,每步操作等待100毫秒
        browser = p.chromium.launch(headless=False, slow_mo=100)
        context = browser.new_context()
        page = context.new_page()
        page.goto('https://cq.meituan.com/', timeout=50000)  # 设置超时时间为50s
        time.sleep(80)  # 此处手动登录,然后到个人信息页再获取cookie
        cookies = context.cookies()
        print(page.title())
        browser.close()
        f = open('cookies.txt', 'w+',,encoding="utf-8")
        json.dump(cookies, f)
        time.sleep(2)
        browser.close()
        print("cookie获取完毕")
saveCookies()#执行函数
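
The saved file is read back later; here is a minimal sketch of that step, assuming the cookies.txt produced above. context.add_cookies() injects the saved cookies into a fresh context so the site treats you as logged in.

import json
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    with open('cookies.txt', encoding="utf-8") as f:
        cookies = json.load(f)
    context.add_cookies(cookies)  # restore the saved login state
    page = context.new_page()
    page.goto('https://cq.meituan.com/')
    print(page.title())
    browser.close()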

"Method Two:"

Open the specified browser manually with remote debugging enabled, have playwright attach to it, read the logged-in cookies and save them locally (a launch sketch follows the code below).

__author__ = "梦无矶小仔"
# 对已经打开的浏览器进行操作
import json
import subprocess
from pprint import pprint
from playwright.sync_api import Playwright,sync_playwright

playwright = sync_playwright().start()
# 连接已打开浏览器,指定端口
browser = playwright.chromium.connect_over_cdp("http://127.0.0.1:9222")
default_context = browser.contexts[0] # 注意这里不是browser.new_page()了
page = default_context.pages[0]
base_url = r"https://console.firebase.google.com/" # 我这里截去了项目网站的url进行脱敏
page.goto(base_url)
# page.wait_for_timeout(timeout=15000)
print(page.title()) #firebase标题
time.sleep(5)
cookies = default_context.cookies(urls=base_url) #指定url下的cookie值,不填则是所有的
pprint(cookies)
# 保存cookies到本地
filePath = r'cookies.txt'
with open(filePath,'w+',encoding="utf-8") as f:
    json.dump(cookies,f)

playwright.stop() 
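
Method two, and every connect_over_cdp / debuggerAddress example in this article, assumes Chrome was started beforehand with remote debugging enabled on port 9222. A minimal launch sketch is below; the chrome.exe path and the user-data directory are assumptions you should adapt to your own machine.

import subprocess

# Paths are assumptions -- point them at your own Chrome install and a dedicated profile directory.
chrome_path = r"C:\Program Files\Google\Chrome\Application\chrome.exe"
subprocess.Popen([
    chrome_path,
    "--remote-debugging-port=9222",              # the port that connect_over_cdp / debuggerAddress talks to
    r"--user-data-dir=D:\chrome_debug_profile",  # a separate profile keeps the debugging session isolated
])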

Knowledge point: extracting cookies from playwright's storage_state

Press F12 in the browser and you can see the locally stored information under the Application tab, such as cookies and session storage.

Official Tutorial: BrowserContext | Playwright Python

__author__ = "梦无矶小仔"
# 对已经打开的浏览器进行操作
import json
import subprocess
import time
from pprint import pprint
from playwright.sync_api import Playwright,sync_playwright

playwright = sync_playwright().start()
# 连接已打开浏览器,找好端口
browser = playwright.chromium.connect_over_cdp("http://127.0.0.1:9222")
default_context = browser.contexts[0] # 注意这里不是browser.new_page()了
page = default_context.pages[0]
base_url = r"https://console.firebase.google.com"  # 我这里截去了项目网站的url进行脱敏
page.goto(base_url)
print(page.title()) #firebase标题
filePath = r'storage_state.txt'
storage_state = default_context.storage_state(path=filePath)
pprint(storage_state)
playwright.stop()
# browser.close()

But there is a problem with this method: it dumps the entire stored state of your current browser context.

If all you need is the cookie for one site's API, that is a bit bloated: you have to filter the whole dump yourself, and you also have to keep the local copy fresh in time (I have run into very short validity periods). A filtering sketch follows.
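
For reference, a minimal filtering sketch: storage_state() returns a dict with "cookies" and "origins" keys, so you can keep only the cookies of the domain you care about and rebuild a Cookie header from them. The target domain below is an assumption based on the Firebase example above.

# storage_state comes from default_context.storage_state(...) above
target_domain = "console.firebase.google.com"  # assumption: the site whose cookies we want

site_cookies = [
    c for c in storage_state["cookies"]
    if target_domain.endswith(c["domain"].lstrip("."))  # keep the exact domain and its parent domains
]

# Rebuild a Cookie header value usable with requests
cookie_header = "; ".join(f'{c["name"]}={c["value"]}' for c in site_cookies)
print(cookie_header)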

A hair-loss incident

On May 10, 2023, Windows updated itself automatically, and in the process it also updated my pinned version of Chrome to the latest one.

I had already disabled Chrome's automatic updates, but this Windows update managed to undo that, which is outrageous.

So this section also covers how to stop Chrome from auto-updating on Windows. My current version is 113.0.5672.93 (official build) (64-bit).

So what problems does a browser update cause for me?

1. Selenium drives the browser through chromedriver. When the browser updates I have to update the driver too, and my setup has no automatic driver update.

2. I had been pinning the browser version, so one driver always worked; now I am facing three choices

  • Update the driver, disable updates again, and deal with it again next time (troublesome)

  • Add automatic driver updates (with unexpected pitfalls waiting down the road)

  • Switch the UI layer to playwright, which does not depend on a third-party driver (the lazy person's choice)

"So I looked into all three; let's go through them one by one."

Disabling Chrome's automatic updates on Windows

1. Find the Update folder under C:\Users\xiaozai\AppData\Local\Google

2. Right-click → Properties, open the Security tab, click Edit, and set every permission for all of these users to Deny

3. Still on the Security tab, click Advanced, click Disable inheritance, remove the allowed users, and click OK

4. While confirming, a bunch of pop-ups will appear because you have just denied yourself the permissions; keep clicking OK and it will be fine

5. To verify, double-click the Update folder and you will find you no longer have access

6. Open Chrome's update page and you will see it can no longer update itself

Having selenium download the driver automatically

Official introduction on how to install and use drivers: https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/

"Install the library first:"

pip install webdriver-manager

webdriver-manager supports both selenium 3 and selenium 4.

For details, please see the instructions on github: https://github.com/SergeyPirogov/webdriver_manager

Chrome demo based on selenium 3

# pip install webdriver-manager
#selenium3
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://www.baidu.com")
driver.maximize_window()
time.sleep(5)
driver.quit()

Chrome demo based on selenium 4

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

service = Service(executable_path=ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
driver.get("https://www.baidu.com")
driver.maximize_window()
time.sleep(5)
driver.quit()

Simple and convenient, right?

I won't paste the official examples here; have a look yourself if you are interested~

Playwright: driving an already opened browser without a driver

For details see the article I wrote earlier; I won't repeat it here. Links below:

Official account: playwright connects to an existing browser operation (qq.com)

CSDN:https://blog.csdn.net/qq_46158060/article/details/130429536?spm=1001.2014.3001.5501

Authorization authentication

All Google-style authentication carries an Authorization header, whose value is a SAPISIDHASH; I am not going to try to reverse that.

If the request headers do not carry this field, the relevant interfaces cannot be accessed at all; the API returns an error like the one below (a small reproduction sketch follows the error).

{
  "error": {
    "code": 401,
    "message": "Request is missing required authentication credential. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.",
    "status": "UNAUTHENTICATED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.ErrorInfo",
        "reason": "CREDENTIALS_MISSING",
        "domain": "googleapis.com",
        "metadata": {
          "service": "crashlytics-pa.googleapis.com",
          "method": "google.internal.crashlytics.dashboard.v1.CrashlyticsMetricsReadService.GetCrashStatistics"
        }
      }
    ]
  }
}
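
For illustration, the error above can be reproduced with a plain requests call that omits the authorization header; the endpoint path here is a placeholder, not the real (desensitized) one.

import requests

# Placeholder endpoint -- substitute the real (desensitized) Crashlytics interface.
resp = requests.post("https://crashlytics-pa.clients6.google.com/...", json={})
print(resp.status_code)               # expect 401 without the authorization header
print(resp.json()["error"]["status"])  # expect "UNAUTHENTICATED"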

With selenium we know that request information can be read from the performance log (demoed earlier), so does playwright have something similar?

Checking the official docs, there is indeed one: event listeners.

What we need from the request headers, in order to call the interface ourselves, is:

Authorization
Cookie
Origin
Referer
User-Agent

playwright event listener

Official documentation:

page.on event listener: https://playwright.dev/python/docs/api/class-page#page-event-request

request interception interface: https://playwright.dev/python/docs/api/class-request

We mainly use the event listener page.on("request", my_request); see the official site for the other events.

The Request object has an all_headers() method that returns the request's headers as a dictionary.

Request interception example:

def my_request(request):
    print(request.all_headers())

page.on("request", my_request)  # register the listener and print each request's headers
# "requestfinished" is recommended here instead of "request"

Note: page.on must be registered on the page instance before the events you care about occur. If it is registered after an event has already fired, that event is missed; only events happening after registration are captured.

For example, to listen on the Firebase console data page, the sample code is as follows:

from playwright.sync_api import sync_playwright

def my_request(request):
    print(request.all_headers())
    # hijack / process the headers here

playwright = sync_playwright().start()
# Connect to the already opened browser on the debugging port
browser = playwright.chromium.connect_over_cdp("http://127.0.0.1:9222")
default_context = browser.contexts[0]  # note: not browser.new_page() here
page = default_context.pages[0]
page.on("requestfinished", my_request)  # register the listener to capture each request's headers
base_url = r"https://console.firebase.google.com/"
page.goto(base_url)
page.wait_for_load_state('networkidle')  # wait until there are no more network requests, otherwise the capture is incomplete and the auth info may be missed

The console prints all captured request headers, including the authorization field. We can now modify the my_request method to keep only the header information we need.

(Screenshot: console output of the captured request headers, with the authorization field visible.)

But at this point I still hit a problem: the authorization captured here was not actually usable for me. I also needed to filter on the referer field, yet the request I was looking for never showed up; checking the Network tab in F12, the front end was still refreshing, and everything that got printed was just https://console.firebase.google.com/

You need to add the line below after the navigation; it waits for resources to load until there are no more network requests. See the official docs for details: https://playwright.dev/python/docs/api/class-page#page-wait-for-load-state

page.wait_for_load_state('networkidle')

But sometimes that is not quite enough, so I recommend a forced wait instead:

page.wait_for_timeout(timeout=20000)  # this timeout is in milliseconds

Next, these request headers need to be filtered. I only want the requests whose headers contain the Authorization field; together with the cookie obtained earlier, I can then forge my own headers.

At the same time, filter on the :authority field. Note that the header names you see in F12 are capitalized, but the official playwright docs state that the returned header names are all lowercase, so we have to read them with lowercase keys.

The modified my_request method:

# Global variable
user_headers = {}
def my_request(request):
    all_headers_dict = request.all_headers()
    # Filter the requests (I also filter on :path; the full path has been desensitized)
    if all_headers_dict.get(':path') == "/metrics:getCrashFree.." and not user_headers:
        if all_headers_dict[":method"] == 'POST' and all_headers_dict[":authority"] == 'crashlytics-pa.clients6.google.com':
            # Extract the information I need
            user_headers["cookie"] = all_headers_dict.get("cookie")
            user_headers["user-agent"] = all_headers_dict.get("user-agent")
            user_headers["authorization"] = all_headers_dict.get("authorization")

The headers from the first matching request end up recorded in the user_headers dictionary.

Then we can use requests, carrying these authenticated headers, to call the interfaces directly.

Using route hijacking

Official documentation: Route | Playwright Python

This method can also be used to obtain the request headers; in the end it still reads them from the request object (route.request).

While using it I found that requests sometimes get blocked, and I do not know why. If you have looked into this, please enlighten me, many thanks.

user_msg_list = []
def handler(route):
    headers_dict = {}
    all_headers_dict = route.request.headers
    if all_headers_dict.get('authorization') and all_headers_dict.get('cookie'):
        # if 'https://scone-pa.clients6.google.com/static/proxy' in all_headers_dict["referer"]:
            # Extract the information I need
            headers_dict["referer"] = all_headers_dict.get("referer")
            headers_dict["user-agent"] = all_headers_dict.get("user-agent")
            headers_dict["authorization"] = all_headers_dict.get("authorization")
            headers_dict["cookie"] = all_headers_dict.get("cookie")
            user_msg_list.append(headers_dict)
    route.continue_()

page.route("**/*", handler)
page.wait_for_load_state('networkidle')

Final code

"Note: Sensitive information has been desensitized"

# -*- coding: utf-8 -*-
'''
@Time : 2023/5/10 13:42
@Author : Vincent.xiaozai
@Email : [email protected]
@File : demo06_整合请求伪造.py
'''
__author__ = "梦无矶小仔"

import json
import subprocess
import time
from datetime import datetime, timedelta
from pprint import pprint

import requests
from playwright.sync_api import Playwright, sync_playwright

# Global variable
user_headers = {}
def my_request(request):
    all_headers_dict = request.all_headers()
    # Filter the requests
    if all_headers_dict.get(':path') == "/metrics:getCrashFreeTime..." and not user_headers:
        if all_headers_dict[":method"] == 'POST' and all_headers_dict[":authority"] == 'crashlytics-pa.clients6.google.com':
            # Extract the information I need
            user_headers["cookie"] = all_headers_dict.get("cookie")
            user_headers["user-agent"] = all_headers_dict.get("user-agent")
            user_headers["authorization"] = all_headers_dict.get("authorization")
            user_headers["referer"] = all_headers_dict.get("referer")
            user_headers["origin"] = all_headers_dict.get("origin")

playwright = sync_playwright().start()
# Connect to the already opened browser on the debugging port
browser = playwright.chromium.connect_over_cdp("http://127.0.0.1:9222")
default_context = browser.contexts[0]  # note: not browser.new_page() here
page = default_context.pages[0]
page.on("requestfinished", my_request)  # register the listener to capture each request's headers
base_url = "https://console.firebase.google.com/u/0/project/..."  # the full URL has been desensitized
page.goto(base_url)
# Force a wait if you need to be sure the page has fully refreshed
page.wait_for_timeout(timeout=20000)

# Forge the request headers
headers = {
    "Host": "crashlytics-pa.clients6.google.com",
    "content-type": "application/json",
    'user-agent': user_headers["user-agent"],
    "referer": user_headers["referer"],
    'cookie': user_headers["cookie"],
    "origin": user_headers["origin"],
    'authorization': user_headers["authorization"]
}
print("---------------User cookie and Authorization--------------------------")
print(f"Forged request headers: {headers}")
print("---------------User cookie and Authorization--------------------------")

## Make the requests call to pull the data
crashAndUsersUrl = "https://crashlytics-pa.clients6.google.com/v1/projects"  # the full URL has been desensitized
crashAndUsersNum = get_all_crashAndUser(day=2, headers=headers, url=crashAndUsersUrl, the_latest="None",
                                        version_bt_list="None", platform='Android', eventType=["FATAL"])
print("----------Interface response-------------------")
print(crashAndUsersNum)

Here get_all_crashAndUser is my own business code, which handles the actual interface request, so I will not publish it.

As you can see, we finally get the data from this interface.

From then on you can keep using requests for the interface calls. If the cookie expires, just use playwright to re-capture the authentication every so often and forge the request headers again; a minimal refresh sketch follows.
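
As a reference only, here is a minimal sketch of that refresh loop. It assumes the playwright capture shown above is wrapped in a hypothetical capture_headers() helper that returns the forged headers dict; this is not the author's business code.

import requests

def capture_headers():
    # Assumption: wrap the connect_over_cdp + page.on("requestfinished", ...) capture shown above
    # and return the forged headers dict built from user_headers.
    raise NotImplementedError

def post_with_refresh(url, payload):
    headers = capture_headers()
    resp = requests.post(url, headers=headers, json=payload)
    if resp.status_code == 401:  # credentials expired -> re-capture once and retry
        headers = capture_headers()
        resp = requests.post(url, headers=headers, json=payload)
    return resp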


Source: https://blog.csdn.net/weixin_50829653/article/details/130683629