[Python] From entry to advanced—basic applications of commonly used built-in modules (13)

datetime

datetime is Python's standard library for handling dates and times.

Get current date and time

from datetime import datetime
now = datetime.now()  # get the current datetime
print(now)  # 2023-09-13 10:28:48.621343
print(type(now))  # <class 'datetime.datetime'>
  • Note that datetime is a module, and the module also contains a class named datetime. The statement from datetime import datetime imports the datetime class.
    • If you only write import datetime, you must refer to the class by its full name, datetime.datetime.

Get the specified date and time

dt = datetime(2023, 9, 13, 12, 20)  # create a datetime from a specified date and time
print(dt)

Convert datetime to timestamp

  • In computers, time is actually represented as a number. We call the moment 1970-01-01 00:00:00 UTC+00:00 the epoch time and record it as 0 (times before 1970 have negative timestamps). The current time is the number of seconds relative to the epoch time, called the timestamp.
dt = datetime(2023, 9, 13, 12, 20)  # create a datetime from a specified date and time
print(dt.timestamp())  # convert the datetime to a timestamp
# 1694578800.0 (on a machine whose local time zone is UTC+8)
  • Note that Python's timestamp is a floating-point number; the integer part represents seconds.
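
The relation between the epoch and timestamps can be checked with timezone-aware datetimes, which makes the result independent of the machine's local time zone. This is a minimal sketch; the variable names are illustrative only.

```python
from datetime import datetime, timezone

# The epoch itself, expressed as an aware datetime in UTC.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
print(epoch.timestamp())  # 0.0

# A moment before 1970 has a negative timestamp.
before = datetime(1969, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
print(before.timestamp())  # -1.0

# 2023-09-13 12:20 in UTC+8 is 04:20 in UTC.
dt_utc = datetime(2023, 9, 13, 4, 20, tzinfo=timezone.utc)
print(dt_utc.timestamp())  # 1694578800.0
```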

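A timestamp can be converted back to a datetime in local time, and it can also be converted directly to UTC:

t = 1429417200.0
print(datetime.fromtimestamp(t))  # local time (UTC+8 here)
# 2015-04-19 12:20:00
print(datetime.utcfromtimestamp(t))  # UTC time (utcfromtimestamp() is deprecated since Python 3.12)
# 2015-04-19 04:20:00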
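
Both calls above return naive datetimes, so the printed values carry no time zone. As a hedged alternative, fromtimestamp() accepts a tz argument and then returns an aware datetime; the sketch below uses fixed offsets so the output does not depend on the local machine.

```python
from datetime import datetime, timedelta, timezone

t = 1429417200.0

# Aware datetime in UTC.
utc_dt = datetime.fromtimestamp(t, tz=timezone.utc)
print(utc_dt)  # 2015-04-19 04:20:00+00:00

# The same instant expressed in UTC+8.
bj_dt = datetime.fromtimestamp(t, tz=timezone(timedelta(hours=8)))
print(bj_dt)  # 2015-04-19 12:20:00+08:00

# Both objects describe the same moment.
print(utc_dt == bj_dt)  # True
```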

Convert str to datetime

  • This is done with datetime.strptime(), which requires a format string describing the date and time:
cday = datetime.strptime('2015-6-1 18:19:59', '%Y-%m-%d %H:%M:%S')
print(cday)
#2015-06-01 18:19:59

Convert datetime to str

now = datetime.now()
print(now.strftime('%a, %b %d %H:%M'))
#Mon, May 05 16:28

datetime addition and subtraction

  • Adding to or subtracting from a date and time means moving a datetime forward or backward to get a new datetime. You can use the + and - operators directly, but you need to import the timedelta class:
from datetime import datetime, timedelta
now = datetime.now()
print(now)  # 2023-09-13 10:38:44.709003

print(now + timedelta(hours=10))  # 2023-09-13 20:38:44.709003

print(now - timedelta(days=1))  # 2023-09-12 10:38:44.709003

print(now + timedelta(days=2, hours=12))  # 2023-09-15 22:38:44.709003
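
The - operator also works between two datetime objects, in which case the result is a timedelta. A short sketch with fixed (made-up) dates so the output is reproducible:

```python
from datetime import datetime, timedelta

start = datetime(2023, 9, 13, 10, 30)
end = datetime(2023, 9, 15, 22, 30)

# Subtracting two datetimes yields a timedelta.
delta = end - start
print(delta)                  # 2 days, 12:00:00
print(delta.days)             # 2
print(delta.total_seconds())  # 216000.0

# Adding the timedelta back reproduces the later datetime.
print(start + delta == end)   # True
```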

Convert local time to UTC time

from datetime import datetime, timedelta, timezone

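tz_utc_8 = timezone(timedelta(hours=8))  # create the time zone UTC+8:00
now = datetime.now()
print(now)

dt = now.replace(tzinfo=tz_utc_8)  # force the time zone to UTC+8:00 (the clock value is not converted)
print(dt)

# an aware datetime can also be constructed directly:
dt = datetime(2015, 9, 13, 10, 40, 13, 610986, tzinfo=timezone(timedelta(0, 28800)))
print(dt)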
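
It may help to see the difference between replace(tzinfo=...) and astimezone() side by side. In this sketch (fixed values chosen for illustration), replace() only attaches a time zone label to the existing clock value, while astimezone() computes the equivalent clock value in another zone.

```python
from datetime import datetime, timedelta, timezone

tz_utc_8 = timezone(timedelta(hours=8))
naive = datetime(2015, 9, 13, 10, 40)

# replace() keeps 10:40 and merely declares it to be UTC+8.
labelled = naive.replace(tzinfo=tz_utc_8)
print(labelled)  # 2015-09-13 10:40:00+08:00

# astimezone() converts the same instant to UTC, shifting the clock value.
as_utc = labelled.astimezone(timezone.utc)
print(as_utc)  # 2015-09-13 02:40:00+00:00

# Aware datetimes compare by instant, not by clock value.
print(labelled == as_utc)  # True
```

If the wrong offset is passed to replace(), the resulting datetime silently refers to a different instant, so replace() is only appropriate when the actual time zone of the naive value is known.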

Time zone conversion

from datetime import datetime, timedelta, timezone

# get the current UTC time and force its time zone to UTC+0:00
# (utcnow() is deprecated since Python 3.12; datetime.now(timezone.utc) gives an aware UTC datetime directly)
utc_dt = datetime.utcnow().replace(tzinfo=timezone.utc)
print(utc_dt)
# astimezone() converts the time zone to Beijing time:
bj_dt = utc_dt.astimezone(timezone(timedelta(hours=8)))
print(bj_dt)
# astimezone() converts the time zone to Tokyo time:
tokyo_dt = utc_dt.astimezone(timezone(timedelta(hours=9)))
print(tokyo_dt)
# astimezone() converts bj_dt to Tokyo time:
tokyo_dt2 = bj_dt.astimezone(timezone(timedelta(hours=9)))
print(tokyo_dt2)

Summary

  • The time represented by datetime requires time zone information to determine a specific time, otherwise it can only be regarded as local time.

  • If you want to store a datetime, the best way is to convert it to a timestamp and store that, because the value of a timestamp is completely independent of the time zone.
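
The storage advice can be sketched as a round trip: convert an aware datetime to a timestamp, then restore it in a different time zone. The offsets below are examples only.

```python
from datetime import datetime, timedelta, timezone

# An aware datetime in UTC+8.
bj_dt = datetime(2023, 9, 13, 12, 20, tzinfo=timezone(timedelta(hours=8)))

# Store only the timestamp, which carries no time zone.
stored = bj_dt.timestamp()
print(stored)  # 1694578800.0

# Restore it later in any time zone, here UTC+9.
restored = datetime.fromtimestamp(stored, tz=timezone(timedelta(hours=9)))
print(restored)           # 2023-09-13 13:20:00+09:00
print(restored == bj_dt)  # True
```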

base64

Base64 is a method of representing arbitrary binary data using 64 printable characters. It is commonly used to transmit small amounts of binary data in URLs, cookies, and web pages.

  • The principle of Base64 is very simple. First, prepare an array containing 64 characters:

    ['A', 'B', 'C', ... 'a', 'b', 'c', ... '0', '1', ... '+', '/']
    
  • Then the binary data is processed in groups of 3 bytes, 3x8=24 bits in total, and each group is divided into 4 pieces of exactly 6 bits each.


  • In this way, we get 4 numbers to use as indexes, and then look them up in the table to get the corresponding 4 characters, which form the encoded string.

    • Base64 therefore encodes 3 bytes of binary data as 4 bytes of text data, so the length increases by 33%. The advantage is that the encoded text can be displayed directly in the body of emails, web pages, etc.

    • What if the length of the binary data to be encoded is not a multiple of 3, so that 1 or 2 bytes are left at the end?

      • Base64 pads the end with \x00 bytes before encoding, and then appends 1 or 2 = signs to the encoded output to indicate how many bytes were padded. They are removed automatically when decoding.
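
The steps above can be written out by hand. The following is a teaching sketch, not how the standard library implements it; the names ALPHABET and b64encode_manual are hypothetical.

```python
import base64

# The 64-character lookup table: A-Z, a-z, 0-9, '+', '/'.
ALPHABET = (
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    'abcdefghijklmnopqrstuvwxyz'
    '0123456789+/'
)

def b64encode_manual(data: bytes) -> str:
    """Encode bytes using the 3-bytes-to-4-characters scheme."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)              # 0, 1 or 2 missing bytes
        chunk += b'\x00' * pad            # pad the last group with \x00
        n = int.from_bytes(chunk, 'big')  # the 24 bits as one integer
        # Split the 24 bits into four 6-bit indexes.
        indexes = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
        chars = [ALPHABET[j] for j in indexes]
        if pad:
            chars[-pad:] = '=' * pad      # mark the padded bytes with '='
        out.extend(chars)
    return ''.join(out)

sample = b'binary\x00string'
print(b64encode_manual(sample))  # YmluYXJ5AHN0cmluZw==
print(b64encode_manual(sample) == base64.b64encode(sample).decode('ascii'))  # True
```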

Python's built-in base64 can directly encode and decode base64:

import base64
a = base64.b64encode(b'binary\x00string')
print(a)  # b'YmluYXJ5AHN0cmluZw=='
b = base64.b64decode(b'YmluYXJ5AHN0cmluZw==')
print(b)  # b'binary\x00string'
  • b'str' denotes a bytes literal.

Since the characters + and / may appear in standard Base64 output, the result cannot be used directly as a URL parameter. There is therefore a "URL-safe" Base64 encoding, which simply replaces + and / with - and _ respectively:

c = base64.b64encode(b'i\xb7\x1d\xfb\xef\xff')
print(c)  # b'abcd++//'
d = base64.urlsafe_b64encode(b'i\xb7\x1d\xfb\xef\xff')
print(d)  # b'abcd--__'
e = base64.urlsafe_b64decode('abcd--__')
print(e)  # b'i\xb7\x1d\xfb\xef\xff'
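
One practical wrinkle, stated here as an assumption rather than something from the text above: URL-safe Base64 strings are sometimes transmitted with the trailing = signs removed. Since the encoded length is always a multiple of 4, the padding can be restored before decoding. The helper name urlsafe_b64decode_nopad is hypothetical.

```python
import base64

def urlsafe_b64decode_nopad(s: str) -> bytes:
    """Decode URL-safe Base64 text whose trailing '=' may have been removed."""
    padding = '=' * (-len(s) % 4)  # bring the length back to a multiple of 4
    return base64.urlsafe_b64decode(s + padding)

print(urlsafe_b64decode_nopad('YWJjZA'))    # b'abcd'
print(urlsafe_b64decode_nopad('YWJjZA=='))  # b'abcd'
print(urlsafe_b64decode_nopad('abcd--__'))  # b'i\xb7\x1d\xfb\xef\xff'
```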

hashlib

Python's hashlib provides common digest algorithms, such as MD5, SHA1, etc.

What is a digest algorithm?

  • A digest algorithm is also known as a hash algorithm. It uses a digest function f() to compute a fixed-length digest (usually represented as a hexadecimal string) from data of any length, with the purpose of discovering whether the original data has been tampered with.

  • Why a digest algorithm can indicate whether the data has been tampered with:

    • The digest function is a one-way function: computing f(data) is easy, but deriving data back from the digest is very difficult. Moreover, changing even a single bit of the original data results in a completely different digest.

Application scenarios

  • Suppose you wrote an article whose content is the string 'how to use python hashlib - by Michael', and attached its digest '2d73d4f15c0db7f5ecb321b6a65e5d6d'. If someone tampers with your article and publishes it as 'how to use python hashlib - by Bob', you can immediately point out that the article was tampered with, because the digest calculated from 'how to use python hashlib - by Bob' differs from the digest of the original article.

MD5 is the most common digest algorithm. It is very fast and produces a fixed-length result of 128 bits (16 bytes), usually represented as a 32-character hexadecimal string. For example:

import hashlib
md5 = hashlib.md5()
md5.update('how to use md5 in python hashlib?'.encode('utf-8'))
print(md5.hexdigest())
#d26a53750bc40b38b65a520292f69306
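
The tamper-detection scenario described earlier can be sketched with the same API. The helper md5_hex is a hypothetical convenience wrapper, and the digests are compared rather than printed.

```python
import hashlib

def md5_hex(text: str) -> str:
    """Return the MD5 digest of a str as a hexadecimal string."""
    return hashlib.md5(text.encode('utf-8')).hexdigest()

original = 'how to use python hashlib - by Michael'
tampered = 'how to use python hashlib - by Bob'

# The author publishes the digest together with the article.
published_digest = md5_hex(original)

# Anyone can recompute the digest and compare it.
print(md5_hex(original) == published_digest)  # True
print(md5_hex(tampered) == published_digest)  # False
print(len(published_digest))                  # 32
```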

If the amount of data is large, you can call update() multiple times in chunks; the final result is the same:

import hashlib

md5 = hashlib.md5()
md5.update('how to use md5 in '.encode('utf-8'))
md5.update('python hashlib?'.encode('utf-8'))
print(md5.hexdigest())
#d26a53750bc40b38b65a520292f69306
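
A typical use of chunked update() calls is hashing a file or stream without loading it into memory at once. The sketch below uses an in-memory io.BytesIO as a stand-in for a real file; md5_of_stream and the chunk size are illustrative choices.

```python
import hashlib
import io

def md5_of_stream(stream, chunk_size=8192):
    """Feed a binary stream to MD5 chunk by chunk and return the hex digest."""
    md5 = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object means end of stream
            break
        md5.update(chunk)
    return md5.hexdigest()

data = b'how to use md5 in python hashlib?'

# A tiny chunk size forces several update() calls.
chunked = md5_of_stream(io.BytesIO(data), chunk_size=8)
print(chunked == hashlib.md5(data).hexdigest())  # True
```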

Another common digest algorithm is SHA1. Calling SHA1 is exactly like calling MD5. The result of SHA1 is 160 bits (20 bytes), usually represented as a 40-character hexadecimal string.

import hashlib

sha1 = hashlib.sha1()
sha1.update('how to use sha1 in '.encode('utf-8'))
sha1.update('python hashlib?'.encode('utf-8'))
print(sha1.hexdigest())
#2c76b57293ce30acef38d98f6046927161b46a44

SHA256 and SHA512 are more secure than SHA1, but the more secure an algorithm is, the slower it runs and the longer its digest.
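
The digest lengths mentioned above can be checked directly. This sketch uses hashlib.new() to select each algorithm by name; the input bytes are arbitrary.

```python
import hashlib

data = b'how to use sha in python hashlib?'

sizes = {}
for name in ('md5', 'sha1', 'sha256', 'sha512'):
    h = hashlib.new(name, data)
    # digest_size is in bytes; the hex string is twice as long.
    sizes[name] = (h.digest_size * 8, len(h.hexdigest()))
    print(name, *sizes[name])
# md5 128 32
# sha1 160 40
# sha256 256 64
# sha512 512 128
```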

hmac

Through a hash algorithm, we can verify whether a piece of data is valid by comparing its hash value. For example, to determine whether a user's password is correct, we compare the result of md5(password) with the password_md5 stored in the database. If they match, the password entered by the user is correct.

To prevent attackers from using a rainbow table to infer the original password from the hash value, the hash must not be computed from the original input alone. A salt needs to be added so that the same input produces different hashes, which greatly increases the difficulty of cracking.

  • If the salt is randomly generated by ourselves, we usually compute md5(message + salt) when calculating MD5. But if we regard the salt as a "password", a salted hash means: when computing the hash of a message, a different hash is produced for each different password. To verify the hash value, the correct password must also be provided.

    • This is exactly the HMAC algorithm: Keyed-Hashing for Message Authentication. It uses a standard algorithm to mix the key into the computation while calculating the hash.

    • Unlike our custom salting scheme, the HMAC algorithm is generic across all hash algorithms, whether MD5 or SHA-1. Using HMAC in place of our own salting scheme makes the program more standardized and safer.
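
The custom salting scheme described above, md5(message + salt), can be sketched as follows. The function name salted_md5, the password, and the salt strings are all hypothetical; MD5 is used only because the text does.

```python
import hashlib

def salted_md5(password: str, salt: str) -> str:
    """Compute md5(password + salt) as a hexadecimal string."""
    return hashlib.md5((password + salt).encode('utf-8')).hexdigest()

# The same password with two different salts gives two different hashes.
hash_alice = salted_md5('123456', 'salt-for-alice')
hash_bob = salted_md5('123456', 'salt-for-bob')
print(hash_alice == hash_bob)  # False

# Verification repeats the computation with the stored salt.
print(salted_md5('123456', 'salt-for-alice') == hash_alice)  # True
```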

The hmac module that comes with Python implements the standard Hmac algorithm. Let's take a look at how to use hmac to implement hashing with keys.

import hmac
# original data
message = b'Hello, world!'
# secret key
key = b'secret'
h = hmac.new(key, message, digestmod='MD5')
# if the message is long, h.update(msg) can be called multiple times
print(h.hexdigest())
# fa4ee7d173f2d97ee79022d1a7355bcf
  • Note that both the key and the message passed in must be bytes; a str must first be encoded to bytes.
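
Verification works by recomputing the HMAC with the same key and comparing the results. The sketch below uses hmac.compare_digest() for the comparison; the helper sign and the altered inputs are illustrative.

```python
import hmac

message = b'Hello, world!'
key = b'secret'

def sign(key: bytes, message: bytes) -> str:
    """Return the HMAC-MD5 of message under key as a hex string."""
    return hmac.new(key, message, digestmod='MD5').hexdigest()

tag = sign(key, message)

# Same key and message: the tags match.
print(hmac.compare_digest(tag, sign(key, message)))           # True
# A different key or a modified message produces a different tag.
print(hmac.compare_digest(tag, sign(b'wrong-key', message)))  # False
print(hmac.compare_digest(tag, sign(key, b'Hello, World!')))  # False
```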

urllib

For details, see [Python] From Getting Started to Top: Application Scenarios of Network Request Modules urllib and requests (12)

XML

There are two ways to process XML: DOM and SAX.

  • DOM reads the entire XML into memory and parses it into a tree, so it uses a lot of memory and parses slowly. The advantage is that you can traverse the nodes of the tree arbitrarily.

  • SAX is a streaming mode that parses while reading. It uses little memory and parses quickly; the disadvantage is that we need to handle the events ourselves.

  • Under normal circumstances, SAX is given priority because DOM takes up too much memory.

Using SAX to parse XML in Python is very simple. Usually the events we care about are start_element, end_element, and char_data. Prepare these three functions and then parse the XML.

For example: When the SAX parser reads a node:

<a href="/">python</a>

3 events will be generated:

  • start_element event, when reading <a href="/">;

  • char_data event, when reading python;

  • end_element event, while reading </a>.

    from xml.parsers.expat import ParserCreate
    
    
    class DefaultSaxHandler(object):
        def start_element(self, name, attrs):
            print('sax:start_element: %s, attrs: %s' % (name, str(attrs)))
    
        def end_element(self, name):
            print('sax:end_element: %s' % name)
    
        def char_data(self, text):
            print('sax:char_data: %s' % text)
    
    
    xml = r'''<?xml version="1.0"?>
    <ol>
        <li><a href="/python">Python</a></li>
        <li><a href="/ruby">Ruby</a></li>
    </ol>
    '''
    
    handler = DefaultSaxHandler()
    parser = ParserCreate()
    # start_element event
    parser.StartElementHandler = handler.start_element
    # end_element event
    parser.EndElementHandler = handler.end_element
    # char_data event
    parser.CharacterDataHandler = handler.char_data
    # parse
    parser.Parse(xml)
    

    Output:

    sax:start_element: ol, attrs: {}
    sax:char_data: 
    
    sax:char_data:     
    sax:start_element: li, attrs: {}
    sax:start_element: a, attrs: {'href': '/python'}
    sax:char_data: Python
    sax:end_element: a
    sax:end_element: li
    sax:char_data: 
    
    sax:char_data:     
    sax:start_element: li, attrs: {}
    sax:start_element: a, attrs: {'href': '/ruby'}
    sax:char_data: Ruby
    sax:end_element: a
    sax:end_element: li
    sax:char_data: 
    
    sax:end_element: ol
    
    • Note that when reading a large string, CharacterDataHandler may be called multiple times, so you need to save the pieces yourself and merge them in EndElementHandler.
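
One way to follow this advice is to buffer the text fragments and join them when the element ends. The class TextCollector below is a hypothetical sketch; it resets its buffer at every start tag, which is enough for leaf elements like the ones in this example.

```python
from xml.parsers.expat import ParserCreate

class TextCollector:
    """Collect the text of each element, merging fragments at the end tag."""

    def __init__(self):
        self.buffer = []  # fragments seen since the last start or end tag
        self.texts = []   # (element name, merged text) pairs

    def start_element(self, name, attrs):
        self.buffer = []

    def char_data(self, text):
        # May be called several times for one piece of text.
        self.buffer.append(text)

    def end_element(self, name):
        text = ''.join(self.buffer).strip()
        if text:
            self.texts.append((name, text))
        self.buffer = []

xml = '<ol><li><a href="/python">Python</a></li><li><a href="/ruby">Ruby</a></li></ol>'

handler = TextCollector()
parser = ParserCreate()
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element
parser.CharacterDataHandler = handler.char_data
parser.Parse(xml, True)  # True marks this as the final chunk

print(handler.texts)  # [('a', 'Python'), ('a', 'Ruby')]
```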

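In addition to parsing XML, how do we generate XML?

  • In 99% of cases the XML structure that needs to be generated is very simple, so the simplest and most effective way to generate XML is to concatenate strings:

    from xml.sax.saxutils import escape

    def build_xml():
        L = []
        L.append(r'<?xml version="1.0"?>')
        L.append(r'<root>')
        # escape() stands in for the undefined encode() helper of the original snippet
        L.append(escape('some & data'))
        L.append(r'</root>')
        return ''.join(L)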
    

HTMLParser

If we want to write a search engine, the first step is to use a crawler to crawl the page of the target website. The second step is to parse the HTML page to see whether the content is news, pictures or videos.

  • Assuming that the first step has been completed, how should the second step parse HTML?

HTML resembles XML, but HTML's syntax is not as strict as XML's, so you cannot use a standard DOM or SAX parser to parse HTML.

Python provides HTMLParser, which makes it very convenient to parse HTML with just a few lines of code:

from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print('<%s>' % tag)

    def handle_endtag(self, tag):
        print('</%s>' % tag)

    def handle_startendtag(self, tag, attrs):
        print('<%s/>' % tag)

    def handle_data(self, data):
        print(data)

    def handle_comment(self, data):
        print('<!--', data, '-->')

    def handle_entityref(self, name):
        print('&%s;' % name)

    def handle_charref(self, name):
        print('&#%s;' % name)

parser = MyHTMLParser()
parser.feed('''<html>
<head></head>
<body>
<!-- test html parser -->
    <p>Some <a href=\"#\">html</a> HTML&nbsp;tutorial...<br>END</p>
</body></html>''')
  • The feed() method can be called multiple times; that is, the entire HTML string does not have to be fed in at once but can be fed in part by part.

  • There are two types of special characters: named references such as &nbsp; and numeric references such as &#1234;. Both can be parsed by the parser.
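
As a small practical sketch, the parser below collects link targets and text while the HTML is fed in two parts. The class name LinkCollector is hypothetical. Note that in current Python 3 versions HTMLParser is created with convert_charrefs=True by default, so character references such as &nbsp; are converted and delivered through handle_data() instead of handle_entityref() and handle_charref().

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values of <a> tags and all text content."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == 'a':
            self.links.extend(value for name, value in attrs if name == 'href')

    def handle_data(self, data):
        self.text.append(data)

parser = LinkCollector()
# Feed the document in two parts, split in the middle of a word.
parser.feed('<p>Some <a href="#">html</a> HTML&nbsp;tut')
parser.feed('orial...<br>END</p>')
parser.close()

print(parser.links)  # ['#']
joined = ''.join(parser.text)
print('\xa0' in joined)  # True: &nbsp; arrived as a non-breaking space
```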

random

The Python random module is mainly used to generate random numbers. It implements pseudo-random number generators for various distributions.

Common methods

random()	Generate a random float in [0.0, 1.0)
seed(seed)	Initialize the generator with the given random seed
randint(a, b)	Generate a random integer in [a, b]
uniform(a, b)	Generate a random float in [a, b]
choice(seq)	Randomly select one element from the sequence seq
shuffle(seq)	Randomly rearrange the elements of seq in place (returns None)

random.random()

import random
print(random.random())
#0.4784904215869241

random.seed(seed)

  • Initialize the generator with the given random seed.

  • The computer uses a deterministic algorithm to calculate a sequence of random numbers. Random numbers generated by computers are not truly random, but they have statistical properties similar to those of random numbers, such as uniformity and independence.

  • The computer generates a random number sequence from the random seed. If the seeds are the same, the same sequence is generated every time; if the seeds are different, the sequences generated are different.

    random.seed(10)
    a = random.randint(0, 100)
    print(a)
    a = random.randint(0, 100)
    print(a)
    a = random.randint(0, 100)
    print(a)
    # 73
    # 4
    # 54
    
    random.seed(10)
    a = random.randint(0, 100)
    print(a)
    a = random.randint(0, 100)
    print(a)
    a = random.randint(0, 100)
    print(a)
    # 73
    # 4
    # 54
    
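    • result
    The 1st random.seed(10) sets the seed to 10
    Generates the 1st random number 73
    Generates the 2nd random number 4
    Generates the 3rd random number 54
    The 2nd random.seed(10) sets the seed to 10
    Generates the 1st random number 73
    Generates the 2nd random number 4
    Generates the 3rd random number 54
    
    As can be seen, when the seed is the same, the generated random number sequence is the same.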
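
The same property holds for independent generator objects. In this sketch, two random.Random instances created with the same seed produce identical sequences without touching the module-level generator; the variable names are illustrative.

```python
import random

# Two independent generators with the same seed.
rng_a = random.Random(10)
rng_b = random.Random(10)

seq_a = [rng_a.randint(0, 100) for _ in range(3)]
seq_b = [rng_b.randint(0, 100) for _ in range(3)]
print(seq_a == seq_b)  # True

# Re-seeding the module-level generator replays its sequence as well.
random.seed(10)
first = random.random()
random.seed(10)
print(random.random() == first)  # True
```

Using separate Random instances keeps reproducible experiments from interfering with other code that also calls the module-level functions.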
    

random.randint(a, b)

  • Generate a random integer between [a, b], the example is as follows:

    a = random.randint(0, 2)
    print(a)
    a = random.randint(0, 2)
    print(a)
    a = random.randint(0, 2)
    print(a)
    # 1
    # 2
    # 0
    

random.uniform(a, b)

  • Generate a random float between [a, b]:
    import random
    random.uniform(0, 2)
    #0.20000054219225438
    random.uniform(0, 2)
    #1.4472780206791538
    random.uniform(0, 2)
    #0.5927807855738692
    

random.choice(seq)

  • Randomly select an element from the sequence seq

    import random
    seq = [1, 2, 3, 4]
    random.choice(seq)
    #3
    random.choice(seq)
    #1
    

random.shuffle(seq)

  • Randomly rearrange the elements of the sequence seq in place. shuffle() modifies seq itself and returns None.

    import random
    seq = [1, 2, 3, 4]
    random.shuffle(seq)
    print(seq)
    # e.g. [1, 3, 2, 4]
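
Because shuffle() works in place, assigning its return value is a common mistake. When the original order must be kept, one option is random.sample() with k equal to the length of the sequence, which returns a new shuffled list. A short sketch:

```python
import random

seq = [1, 2, 3, 4]
result = random.shuffle(seq)
print(result)       # None: shuffle() returns nothing
print(sorted(seq))  # [1, 2, 3, 4] (same elements, order was changed in place)

# sample() leaves the original untouched and returns a shuffled copy.
original = [1, 2, 3, 4]
shuffled_copy = random.sample(original, k=len(original))
print(original)               # [1, 2, 3, 4]
print(sorted(shuffled_copy))  # [1, 2, 3, 4]
```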
    

Summary

  • Using HTMLParser, you can parse out the text, images, etc. in a web page.

Origin blog.csdn.net/qq877728715/article/details/132846354