Export cookies.txt from the browser with one line of JS, then import the cookies into Python's requests library as a dictionary

When crawling in Python, if you only use the requests library to open a webpage, the session.cookies of requests holds very little cookie information, and sometimes it is empty. When you open the same webpage in a browser, the cookies are much more detailed; for example, the browser's cookies retain the login state. To reach such a page from Python without logging in again, first export the page's cookies from the browser, then import them in Python with the requests library.

Step 1: The browser exports cookies.txt in text format

In the address bar of the browser, you can enter the following JS code to quickly export the cookies of the current page as a text file cookies.txt:

javascript: (function() { const a = document.createElement('a');  a.href = 'data:text/plain;charset=utf-8,' + encodeURIComponent(document.cookie);  a.download = 'cookies.txt';  a.target = '_blank';  a.style.display = 'none';  document.body.appendChild(a);  a.click();  setTimeout(function() {    document.body.removeChild(a);  }, 100);})();

Note: If you paste the code above directly, the browser automatically removes the "javascript:" at the beginning, so the code will not run.

The correct way is: paste the code into the address bar, press the Home key to move the cursor to the start of the code, manually type javascript: (note that the colon must be a half-width colon), and press Enter. This exports cookies.txt to the specified directory.

If the browser does not show a dialog box for saving the file, it is set to save automatically to the default directory, such as C:\Users\Administrator\Downloads\. If you want the browser to ask where to save a file every time, enable the corresponding option in the browser's download settings.

Step 2: Python's requests library imports cookies.txt as a dictionary

A Session in the requests library can import cookies; here the cookies need to be in dictionary format.

The cookies.txt exported by the JS code above contains only one line, for example:

mediav={"_refnf":0};ID=ed5026a06; param={"user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64)","token":""}; cookie_referer=; token=validation:token

This is a simplified example for demonstration; a real cookie string will be longer.

The common approach is to split the string on semicolons and then use dict() to convert the pieces into a dictionary. But in this example the user_agent value itself contains semicolons, so a plain semicolon split produces a wrong dictionary.
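To make the failure concrete, here is a minimal sketch of the semicolon-splitting approach applied to the example string. The function name naive_cookies_to_dict is hypothetical and only stands in for that approach; it is not code from any library.

```python
def naive_cookies_to_dict(cookie_str):
    """Split on ';' first, then on the first '=' of each piece."""
    result = {}
    for piece in cookie_str.split(';'):
        key, _, value = piece.strip().partition('=')
        result[key] = value
    return result


sample = ('mediav={"_refnf":0};ID=ed5026a06; '
          'param={"user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64)","token":""}; '
          'cookie_referer=; token=validation:token')

naive = naive_cookies_to_dict(sample)
print(naive['param'])   # the value is cut off at the first ';' inside it
print(sorted(naive))    # leftover fragments show up as stray keys
```

The param value stops at "Windows NT 10.0", and the fragments after each inner semicolon become bogus keys with empty values.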

My approach is to split on the equals sign and then remove the extra semicolons.

with open('cookies.txt', encoding='utf-8') as f:
    cookies = f.read()
'''
Load cookies.txt; assume its content is: 'mediav={"_refnf":0};ID=ed5026a06; param={"user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64)","token":""}; cookie_referer=; token=validation:token'
'''

l1=cookies.split('=')
print(l1)

'''
Output:
['mediav', '{"_refnf":0};ID', 'ed5026a06; param', '{"user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64)","token":""}; cookie_referer', '; token', 'validation:token']
'''

You can see that the elements between the first and the last in the l1 list mix a value with the next key. To remove the extra semicolons, these elements must be split apart. The following is my improved code:

def cookies_to_dict(cookies):
    l2 = []
    l1 = cookies.split('=')
    for i in l1:
        if ';' in i:
            # If the piece contains a semicolon, split it at the last semicolon
            ss = i.rpartition(';')
            # rpartition() searches from the right and returns a 3-tuple:
            # the substring left of the separator, the separator itself,
            # and the substring right of the separator,
            # so extend() adds ss[0] and ss[2] in one call
            l2.extend([ss[0].strip(), ss[2].strip()])
            # strip() removes extra whitespace from both ends of the substring
        else:
            l2.append(i.strip())
    c_dict = {}
    for i in range(0, len(l2), 2):
        c_dict.update({l2[i]: l2[i+1]})
    return c_dict


cookie_dict = cookies_to_dict(cookies)
print(cookie_dict)


'''
Output:
{'mediav': '{"_refnf":0}', 'ID': 'ed5026a06', 'param': '{"user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64)","token":""}', 'cookie_referer': '', 'token': 'validation:token'}
'''

In this way, cookies.txt is formatted as a dictionary. Then it can be imported into the session of requests.

import requests

se = requests.Session()

# Clear the Session's cookies first, then import cookie_dict
se.cookies.clear()
se.cookies.update(cookie_dict)
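As a quick check that the import worked, the session's cookie jar can be read back as a plain dictionary. The sketch below assumes the requests library is installed and uses two of the sample values from above; it makes no network request.

```python
import requests

# Two sample entries taken from the example cookie string above.
cookie_dict = {'ID': 'ed5026a06', 'token': 'validation:token'}

se = requests.Session()
se.cookies.clear()
se.cookies.update(cookie_dict)

# get_dict() returns the jar's contents as a name -> value dictionary.
loaded = se.cookies.get_dict()
print(loaded)
```

Any request sent through se afterwards carries these cookies.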

Update on February 27: solving it with re.findall and a regular expression

Today I debugged the code in this article several more times and found that when a cookie value contains nested data with an "=" inside it, the code above parses it incorrectly. For example, suppose the cookies contain:

param={"user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64)","token":"","url":"https://xxxxxxx.com/login/source=xxxxxxx"};

This cookie holds a nested, dictionary-style value. There is an equals sign after param, and another equals sign nested inside the curly braces. If you split on the equals sign first, as the code above does, and then build the dictionary, the conversion goes wrong.
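The failure can be reproduced with a short sketch. The function split_on_equals below is a condensed restatement of the first cookies_to_dict() (the name is hypothetical, chosen so it does not clash with the updated function), applied to the sample cookie without its trailing semicolon.

```python
def split_on_equals(cookies):
    # Condensed restatement of the first cookies_to_dict() above.
    parts = []
    for chunk in cookies.split('='):
        if ';' in chunk:
            left, _, right = chunk.rpartition(';')
            parts.extend([left.strip(), right.strip()])
        else:
            parts.append(chunk.strip())
    return {parts[i]: parts[i + 1] for i in range(0, len(parts), 2)}


nested = ('param={"user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64)",'
          '"token":"","url":"https://xxxxxxx.com/login/source=xxxxxxx"}')

broken = split_on_equals(nested)
print(broken['param'])  # truncated in the middle of the user_agent string

# With the trailing semicolon kept, the pieces no longer pair up.
try:
    split_on_equals(nested + ';')
except IndexError as exc:
    print('IndexError:', exc)
```

The inner "=" shifts the key/value pairing, so the param value is cut short and the rest of the JSON text ends up as a second, bogus key.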

The goal is simple: find each "xxx=xxxxxxxx" pair in the cookie string and split it out. After much debugging, the approach that worked for me was a regular expression from the re library. The expression was hard to tune; the following one splits the string successfully.

([a-zA-Z0-9_]+)=((?:{[^{}]*}|[^;]*)+)
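The pattern is easier to read when it is spread over several lines. The sketch below rewrites it with re.VERBOSE and comments; the name COOKIE_PAIR is hypothetical, and the braces are escaped only for readability, which does not change what the pattern matches.

```python
import re

COOKIE_PAIR = re.compile(r'''
    ([a-zA-Z0-9_]+)      # group 1: the cookie name
    =                    # the separator between name and value
    (                    # group 2: the cookie value
      (?:
          \{[^{}]*\}     # a {...} block with no braces nested inside it
        | [^;]*          # or a run of characters that are not semicolons
      )+
    )
''', re.VERBOSE)

sample = ('mediav={"_refnf":0};ID=ed5026a06; '
          'param={"user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64)","token":"",'
          '"url":"https://xxxxxxx.com/login/source=xxxxxxx"}; '
          'cookie_referer=; token=validation:token')

pairs = dict(COOKIE_PAIR.findall(sample))
for name, value in pairs.items():
    print(name, '->', value)
```

Because a whole {...} block is consumed as one unit, the semicolons and the "=" inside it are never treated as separators. One limitation to keep in mind: the brace alternative only covers a single level of braces, so a semicolon inside more deeply nested braces would still cut the value short.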

So the code is updated as follows:

import requests
import re

def cookies_to_dict(cookie_str):
    matches = re.findall(r'([a-zA-Z0-9_]+)=((?:{[^{}]*}|[^;]*)+)', cookie_str)
    result = {}
    for i in matches:
        result.update({i[0]:i[1]})
    return result



if __name__ == '__main__':
    # Load cookies.txt
    with open(r'C:\Users\Administrator\Downloads\cookies.txt', 'r', encoding='utf-8') as f:
        cookie_str = f.read()

    # Convert the cookies into dictionary format
    cookies_dict = cookies_to_dict(cookie_str)

    # Initialize the requests Session
    se = requests.Session()
    # Clear the Session's cookies first, then import cookies_dict
    se.cookies.clear()
    se.cookies.update(cookies_dict)

Assume that after cookies.txt is loaded, the content of the cookie_str string is 'mediav={"_refnf":0};ID=ed5026a06; param={"user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64)","token":"","url":"https://xxxxxxx.com/login/source=xxxxxxx"}; cookie_referer=; token=validation:token'

After conversion by the improved cookies_to_dict() function above, the content of cookies_dict is:

{
    'mediav': '{"_refnf":0}',
    'ID': 'ed5026a06',
    'param': '{"user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64)","token":"","url":"https://xxxxxxx.com/login/source=xxxxxxx"}',
    'cookie_referer': '',
    'token': 'validation:token'
}


Origin blog.csdn.net/Scott0902/article/details/129166907