Third-party libraries used by crawlers (for self-examination)

Notes on the libraries used to build a crawler, organized by function, briefly described here and supplemented later from practice. For detailed descriptions you still need to read the official documentation.

  • Level 1 headings: the document is organized into 3 broad stages
  • Level 2 headings: each stage is organized by the libraries commonly used at that stage
    . One problem with this organization: for reference purposes it might be more effective to organize by function.

Sending requests

urllib library

There are 4 main modules

  • request: construct request headers and initiate requests; various handlers deal with authentication, proxies, and cookie settings.
  • parse: some parsing and encoding of url
  • error: handling of errors during the request process
  • robotparser: parsing of robots files

method

urllib.request.urlopen

  • Function: initiate an HTTP request
  • Return value: HTTPResponse object
  • Parameters:
    • url: the requested address
    • data: (optional) the data to send (POST body), bytes type
    • timeout: the send timeout in seconds; an exception is thrown on timeout
response = urllib.request.urlopen("http://www.baidu.com")
data = bytes(urllib.parse.urlencode({'name': 'fatw'}), encoding='utf-8')

urllib.parse.urlencode

  • Convert dictionary type parameters into string type GET request parameters

urllib.parse.parse_qs

  • Convert request parameters into a dictionary

urllib.parse.parse_qsl

  • Convert a query string into a list of tuples

urllib.parse.quote

  • Convert the content to URL encoding; plain ASCII letters seem to be left untranscoded

urllib.parse.unquote

  • URL decoding

urllib.parse.urlparse

  • Decompose the url

urllib.parse.urlunparse

  • Concatenate the provided information into a url
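A quick sketch of the parse helpers above (the example values are made up):

```python
from urllib import parse

# dictionary -> query string, and back again
qs = parse.urlencode({'name': 'fatw', 'age': 20})
print(qs)                      # name=fatw&age=20
print(parse.parse_qs(qs))      # {'name': ['fatw'], 'age': ['20']}
print(parse.parse_qsl(qs))     # [('name', 'fatw'), ('age', '20')]

# quote/unquote: non-ASCII characters are percent-encoded, plain letters are left alone
encoded = parse.quote('你好')
print(encoded)                 # %E4%BD%A0%E5%A5%BD
print(parse.unquote(encoded))  # 你好

# urlparse splits a URL into its components; urlunparse joins them back
parts = parse.urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(parts.scheme, parts.netloc, parts.path)
print(parse.urlunparse(parts))  # reproduces the original URL
```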

Involved classes

  • HTTPResponse class

    • Attributes: msg, status
    • method:
      • read: read web page content
      • getheaders
  • Request class

    • Function: Construction of request body
    • Attributes:
      • url
      • data: bytes type, if it is a dictionary type, you can use urlencode to convert it first
      • headers: dictionary, can also be added through add_header() later
  • HTTPBasicAuthHandler

    • Function: handle HTTP requests that require authentication
  • ProxyHandler

    • Function: handle HTTP requests that need a proxy
  • HTTPCookieProcessor

    • Function: builds a handler for processing cookies; cookies can also be read and written directly in code
  • HTTPError

    • Handles HTTP request errors, such as authentication failure
  • URLError

    • The parent class of HTTPError; its reason attribute returns the cause of the failure
  • RobotFileParser

    • Parses robots.txt files to determine which pages may be crawled and which may not
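A minimal sketch of the classes above. The URL and header values are placeholders, and nothing is actually sent over the network:

```python
import urllib.request
import urllib.parse
from http.cookiejar import CookieJar

# Request: build the request body and headers without sending anything yet
data = bytes(urllib.parse.urlencode({'name': 'fatw'}), encoding='utf-8')
req = urllib.request.Request('http://www.baidu.com', data=data,
                             headers={'User-Agent': 'Mozilla/5.0'})
req.add_header('Accept', 'text/html')   # headers can also be added afterwards
print(req.full_url, req.data)

# handlers are combined into an opener; HTTPCookieProcessor keeps cookies in a CookieJar
cookie_jar = CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar))
# response = opener.open(req)  # this line would actually send the request
```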

requests library

Makes web page authentication, proxy settings, and GET/POST requests less cumbersome than urllib

method

get

  • Function: initiate a GET request
  • Return value: a Response object
  • Parameters:
    • url
    • params: the GET request parameters, dictionary type
    • headers: dictionary type
    • verify: whether to verify the certificate
    • timeout: a single number sets the total of connect plus read time; a tuple sets the connect time and the read time separately
    • auth: for HTTPBasicAuth a tuple can be used as the parameter; for other authentication types pass in the corresponding class, such as OAuth (requires installing an extra library)
    • proxies: sets proxies; the parameter is a dictionary. If the proxy requires authentication, use the syntax
      http://user:password@host:port to set it. requests also supports SOCKS-protocol proxies
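To see how params is turned into the final query string without touching the network, a request can be built and prepared but not sent (the URL and values are made up):

```python
import requests

req = requests.Request('GET', 'https://httpbin.org/get',
                       params={'name': 'fatw', 'age': 20},
                       headers={'User-Agent': 'Mozilla/5.0'})
prepared = req.prepare()        # builds the final URL and headers
print(prepared.url)             # https://httpbin.org/get?name=fatw&age=20
print(prepared.headers['User-Agent'])
# requests.Session().send(prepared) would actually send it
```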

post

  • parameter:
    • files: dictionary type; the file content to upload via POST. The value of the dictionary is an open file object (the return value of open())

Classes

Response class

  • Attributes:
    • status_code: the returned status code
    • cookies: the returned cookies; items() converts them into a list of tuples
    • content: the content in bytes form
    • text: the content in text form. If it is a string in JSON format, it can be converted directly to JSON for further analysis
  • Note: requests.codes.ok is the library's built-in table of HTTP status codes (an attribute of requests itself, not of the response)

Session class

  • Function: maintain a session. Each standalone get or post request is equivalent to opening a new browser, so two such requests are not in the same session. Session exists so that identity can be kept consistent without manually setting cookies
  • method:
    • get
    • post
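Default headers and cookies set on a Session are carried by every later request, which is how the shared identity is kept. The values here are placeholders and nothing is sent:

```python
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # sent with every request
session.cookies.set('token', 'abc123')                 # kept across requests

print(session.headers['User-Agent'])
print(session.cookies.get('token'))
# response = session.get('https://example.com')  # would reuse headers + cookies
```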

httpx library, supports crawling HTTP/2.0 websites

However, httpx uses HTTP/1.1 by default; HTTP/2.0 must be declared explicitly.

Client object:

client = httpx.Client(http2=True)

The officially recommended usage is:

import httpx

with httpx.Client() as client:
    response = client.get(url)
    print(response)

You can also specify headers when initializing the Client object.

aiohttp library, supports asynchronous requests

Provides both a server (which can be used to build a server) and a client

tips:

For objects that require a call to the close method, the with ... as structure can be used. In an asynchronous function, adding async in front of with ... as declares a context manager that supports asynchronous use.

Client usage:

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text(), response.status


async def main():
    async with aiohttp.ClientSession() as session:
        html, status = await fetch(session, 'https://cuiqingcai.com')
        print(f'html: {html[:100]}...')
        print(f'status: {status}')

After obtaining a session through ClientSession, you can use the session to call various request methods.

Setting URL parameters

Set the params parameter in the get method; its type is a dictionary.

In the post method, the data goes into different parameters (such as data or json) depending on the data type; the type is likewise a dictionary.

When getting the response

If a coroutine object is returned, add await in front of it.

Timeout settings

With the help of ClientTimeout object, pass in the ClientTimeout object when getting the Session.

import aiohttp
import asyncio

async def main():
    timeout = aiohttp.ClientTimeout(total=1)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get('https://httpbin.org/get') as response:
            print('status:', response.status)

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())

Concurrency limit

This prevents too many simultaneous requests from overwhelming the website being crawled. The amount of concurrency can be controlled with asyncio's Semaphore.

How to use semaphore:

import asyncio
import aiohttp

CONCURRENCY = 5
URL = 'https://www.baidu.com'

semaphore = asyncio.Semaphore(CONCURRENCY)
session = None


async def scrape_api():
    async with semaphore:
        print('scraping', URL)
        async with session.get(URL) as response:
            await asyncio.sleep(1)
            return await response.text()


async def main():
    global session
    session = aiohttp.ClientSession()
    scrape_index_tasks = [asyncio.ensure_future(scrape_api()) for _ in range(10000)]
    await asyncio.gather(*scrape_index_tasks)
    # gather can take many positional arguments, but only one name is passed here;
    # the * unpacks scrape_index_tasks so its items become separate arguments


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())

That is, add the semaphore, using a with block, wherever the amount of parallelism needs to be controlled; here that is around the get call.
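The same pattern can be seen without aiohttp: a plain asyncio sketch that counts how many workers are inside the semaphore at once (the sleep stands in for a network request):

```python
import asyncio

CONCURRENCY = 3

async def worker(semaphore, state):
    async with semaphore:          # at most CONCURRENCY workers pass this point
        state['running'] += 1
        state['peak'] = max(state['peak'], state['running'])
        await asyncio.sleep(0.01)  # stand-in for session.get(...)
        state['running'] -= 1

async def main():
    semaphore = asyncio.Semaphore(CONCURRENCY)
    state = {'running': 0, 'peak': 0}
    await asyncio.gather(*(worker(semaphore, state) for _ in range(20)))
    print('peak concurrency:', state['peak'])  # never exceeds CONCURRENCY
    return state['peak']

asyncio.run(main())
```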

Web page data analysis and extraction

XPath (lxml library)

I feel like this is no better than regular expressions

method

etree.HTML(text)

  • Function: Parse text into html (with automatic repair function)

etree.parse(path, etree.HTMLParser())

  • Function: Parse files under path into html

Classes:

lxml.etree._Element class

  • method:
    • xpath(pattern): finds nodes according to the rules in pattern. The return value depends on the pattern; patterns usually start with //

Commonly used rules for XPath

| Expression | Description |
| --- | --- |
| nodename | Selects all child nodes of this node |
| / | Selects direct child nodes from the current node |
| // | Selects descendant nodes from the current node |
| . | Selects the current node |
| .. | Selects the parent node of the current node |
| @ | Selects attributes |

Example:

  • //li/a: selects the direct child a nodes under li nodes
  • //ul//a: selects all descendant a nodes under ul nodes
  • //a[@href="link4.html"]/../@class: selects the a node whose href attribute is link4.html, then gets its parent node, then gets that node's class attribute
  • //li[@class="item-0"]//text(): the text of all descendant nodes under li nodes whose class attribute is item-0 (that is, the text between tags; text inside a child node is not returned as the current node's own text)
  • //li[contains(@class, "li")]/a/text(): if the class attribute of li contains several values, such as li and li-first, contains can be used to match; the node matches as long as the class attribute contains li
  • //li[contains(@class, "li") and @name="item"]/a/text(): match on two attributes of the node
  • //li[position() < 3]/a/text(): select nodes by position
  • //li[1]/ancestor::*: all ancestor nodes of the first li node
  • //li[1]/attribute::*: returns all of its attribute values
  • child::*: direct child nodes
  • descendant::*: all descendant nodes
  • following::*: all nodes after the current one
  • following-sibling::*: all subsequent sibling nodes
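A small sketch of several of the rules above, assuming lxml is installed (the HTML snippet is made up):

```python
from lxml import etree

# etree.HTML auto-repairs the fragment and wraps it in html/body
html = etree.HTML('''
<ul>
  <li class="item-0"><a href="link1.html">first</a></li>
  <li class="item-1"><a href="link2.html">second</a></li>
</ul>
''')

print(html.xpath('//li[@class="item-0"]/a/text()'))     # text of matching a nodes
print(html.xpath('//li/a/@href'))                       # href attribute of every li/a
print(html.xpath('//a[@href="link2.html"]/../@class'))  # class attribute of the parent
```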

The brackets in the examples above support operators, listed in the table below. (I haven't managed to get the | operator working yet.)

| Operator | Description | Example |
| --- | --- | --- |
| or | logical or | |
| and | logical and | |
| mod | remainder of division | |
| \| | union of two node sets | //book \| //cd returns all node sets that have book and cd elements |
| + | addition | |
| - | subtraction | |
| * | multiplication | |
| div | division | |
| = | equal | |
| != | not equal | |
| > | greater than | |
| >= | greater than or equal | |
| < | less than | |
| <= | less than or equal | |

BeautifulSoup library

method:

BeautifulSoup(html, 'lxml')

  • Function: Parse text in html into html form

Classes

bs4.BeautifulSoup class:

  • Attributes:
    • Attach a node name to access it, for example soup.p, which returns a bs4.element.Tag object
  • method:
    • prettify(): Return the document in soup as a standard indented string
    • find_all(name, attrs, recursive, text, **kwargs): Find nodes that meet the parameters passed in.
      • name: node name
      • attrs: dictionary type, attributes of nodes
      • text: Regular expression type, matching the text of the node
    • find(): Returns the first node that meets the conditions
    • There are other similar functions, but with different scopes:
      • find_parent(),find_parents()
      • find_all_next(),find_next()
      • find_next_sibling(),find_next_siblings()
      • find_all_previous(),find_previous()
      • find_previous_sibling(),find_previous_siblings()
    • select(): The parameter is a css selector expression

bs4.element.Tag class:

  • Attributes:
    • You can chain further node names to get the Tag object of a child node it contains.
    • name: the name of the node
    • attrs: attributes of the node (a dictionary type)
    • string: Get the text content contained in the node element
    • contents: Return direct child nodes as a list (may also contain text)
    • children: Return direct child nodes in the form of a generator
    • descendants: Returns all descendant nodes in the form of a generator
    • parent: direct parent node
    • parents: Returns all parent nodes in the form of a generator
    • next_sibling: next sibling node
    • next_siblings: generator form
    • previous_sibling: previous sibling node
    • previous_siblings: generator form
  • method:
    • find_all: Same as the find_all method introduced in BeautifulSoup
    • select: Same as the select method introduced in BeautifulSoup
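A quick sketch of the attributes and methods above. The 'html.parser' backend (from the standard library) is used here so the example does not depend on lxml, and the HTML is made up:

```python
from bs4 import BeautifulSoup

html = '<div><p class="title">Hello</p><p>World</p></div>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.p                            # first p node, a bs4.element.Tag
print(tag.name, tag.attrs, tag.string)  # p {'class': ['title']} Hello

print(len(soup.find_all('p')))                          # 2
print(soup.find('p', attrs={'class': 'title'}).string)  # Hello
print(soup.select('div > p')[1].string)                 # World (CSS selector)
```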

pyquery

method

PyQuery(html)

  • effect:
    • Returns a pyquery object, where html can be a string, url, file name

Classes

PyQuery class:

  • method:
    • __call__(): called via object_name(); the argument is a CSS selector, and a PyQuery object is returned
    • find: filter among all descendants
    • children: filter among direct children
    • parent: the direct parent node
    • parents: all ancestor nodes; a CSS selector can also be passed in for filtering
    • siblings: sibling nodes
    • items: after a selection, a single PyQuery object is returned no matter how many nodes it contains. With multiple nodes, items returns a generator for getting each selected node
    • attr: pass in an attribute name to get the attribute value. When the PyQuery object contains multiple nodes, the attribute of the first node is returned. Pass a key-value pair to modify the attribute
    • text: returns the plain text inside the node, whether it belongs directly to this node or to its children. Pass in a parameter to modify it
    • html(): returns the HTML text. If the object contains multiple nodes, they have to be traversed to get each one. Pass in a parameter to modify it
    • addClass: add a class to the selected node
    • removeClass: remove a class from the selected node
    • remove: remove the node entirely; it no longer exists

parsel

method

Selector(text)

  • effect:
    • Return Selector object

Classes

Selector class

  • method:
    • xpath: Pass in the XPath expression, and the returned result is the SelectorList class, even if it contains text content
    • css: Pass in the CSS selector and the returned result is the SelectorList class
    • re: Match the content in the Selector to the regular expression in the form of text, and return the matched part or the part in parentheses
    • re_first: only returns the first result that matches the rule

SelectorList class

  • method:
    • get: Returns the content text of the first Selector in SelectorList
    • getall: returns all of the results as a list

Data storage

txt file storage

python file reading and writing

json file storage

json library

  • loads(): convert a string into a JSON object; the parameter is a string
  • load(): pass a file object as the parameter; otherwise similar to the above
  • dumps(): convert a JSON object into a string
    • the JSON variable
    • indent: controls how many characters the JSON is indented by
    • ensure_ascii: when False, Chinese can be stored as-is
  • dump(): similar to dumps above, but takes an additional file object
  • get(): with only a key as the parameter it reads the value; a second argument supplies the default returned when the key is missing.
    Strings in JSON must be wrapped in double quotes
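The four functions above differ only in whether they work on strings or file objects. A StringIO stands in for a real file here, and the sample data is made up:

```python
import io
import json

data = {'name': '小明', 'age': 20}

# dumps/loads work on strings
s = json.dumps(data, indent=2, ensure_ascii=False)  # ensure_ascii=False keeps Chinese readable
print(s)
print(json.loads(s) == data)   # True

# dump/load take an extra file object
buf = io.StringIO()
json.dump(data, buf)
buf.seek(0)
print(json.load(buf) == data)  # True
```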

CSV file storage

csv library
In a csv file each line is one record, with fields separated by commas or tabs. It is plain text, unlike xls files, which contain text, formatting, values, and formulas.

method:

  • writer(): initialize a writing object
    • a file object
    • delimiter: the separator between columns
  • DictWriter(): initialize a writing object
    • a file object
    • fieldnames: the keys of the dictionaries that are subsequently written as data
  • reader(): returns a reader object; an iterator can then be used to output the contents of the file
    • a file object

kind

The class returned by writer()

  • method
    • writerow(): The parameter is the data you want to write, in list form
    • writerows(): The parameter is the data you want to write, in the form of a two-dimensional list

Class returned by DictWriter()

  • method
    • writerow(): the parameter is the data to write, in dictionary form; the keys are the fieldnames given at initialization
    • writerows(): the parameter is the data to write, as a list of dictionaries
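A sketch of writer/DictWriter/reader round-tripping through an in-memory file (a StringIO stands in for an opened file, and the rows are made up):

```python
import csv
import io

# writer: rows as lists
buf = io.StringIO()
writer = csv.writer(buf, delimiter=',')
writer.writerow(['id', 'name'])
writer.writerows([['1', 'Alice'], ['2', 'Bob']])

buf.seek(0)
print(list(csv.reader(buf)))   # [['id', 'name'], ['1', 'Alice'], ['2', 'Bob']]

# DictWriter: rows as dictionaries keyed by fieldnames
buf2 = io.StringIO()
dict_writer = csv.DictWriter(buf2, fieldnames=['id', 'name'])
dict_writer.writeheader()
dict_writer.writerow({'id': '3', 'name': 'Carol'})

buf2.seek(0)
print(list(csv.DictReader(buf2)))  # [{'id': '3', 'name': 'Carol'}]
```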

mysql storage

pymysql library

method

  • pymysql.connect(): The parameters are the parameters required to connect to the database, and the database connection object is returned.
  • db.cursor(): db is the object returned by the above method. This method gets the cursor object.

Classes

connection class:

  • method:
    • cursor(): returns a cursor object
    • close(): close the connection
    • commit(): inserts, deletes, updates and similar operations must be committed before they actually take effect
    • rollback(): roll back the transaction

Tip: when inserting data, if a row's primary key already exists and the row needs to be updated instead of inserted, ON DUPLICATE KEY UPDATE can be appended to the SQL statement.

cursor class:

  • Attributes:
    • rowcount: the number of rows returned by or affected by the last executed statement
  • method:
    • execute(): execute a SQL statement
    • fetchone(): fetch one row of data and move the offset forward by one
    • fetchall(): fetch all data after the current offset. If the data volume is large, fetching row by row is recommended

This part still needs to be organized according to the operation of the data.

MongoDB storage

pymongo library

connect

client = pymongo.MongoClient(host='localhost', port=27017)
client = MongoClient('mongodb://localhost:27017/')

Specify database

db = client.test
# db = client['test']

Specify collection

Collections are equivalent to tables in relational databases

collection = db.students
# collection = db['students']

insert data

MongoDB inserts do not check whether the data already exists, and do not require the inserted documents to have exactly the same fields

# insert a single document
student = {
    'id': '20170101',
    'name': 'Jordan',
    'age': 20,
    'gender': 'male'
}
result = collection.insert_one(student)

# insert multiple documents
result = collection.insert_many([student1, student2])

Query

result = collection.find_one({'name': 'Mike'})

# query by ObjectId
from bson.objectid import ObjectId

result = collection.find_one({'_id': ObjectId('593278c115c2602667ec6bae')})

# query multiple documents
results = collection.find({'age': 22})
for result in results:
    print(result)

# more: the values in the filter act as the selection conditions
results = collection.find({'age': {'$gt': 20}})

Comparison operators

| Symbol | Meaning | Example |
| --- | --- | --- |
| $lt | less than | {'age': {'$lt': 20}} |
| $gt | greater than | {'age': {'$gt': 20}} |
| $lte | less than or equal | {'age': {'$lte': 20}} |
| $gte | greater than or equal | {'age': {'$gte': 20}} |
| $ne | not equal | {'age': {'$ne': 20}} |
| $in | in the given range | {'age': {'$in': [20, 30]}} |
| $nin | not in the given range | {'age': {'$nin': [20, 30]}} |

Functional operators

| Symbol | Meaning | Example | Example meaning |
| --- | --- | --- | --- |
| $regex | matches a regular expression | {'name': {'$regex': '^M.*'}} | name starts with M |
| $exists | whether the attribute exists | {'name': {'$exists': True}} | the name attribute exists |
| $type | type check | {'age': {'$type': 'int'}} | the type of age is int |
| $mod | modulo operation | {'age': {'$mod': [5, 0]}} | age mod 5 equals 0 |
| $text | text query | {'$text': {'$search': 'Mike'}} | text-indexed fields contain the string Mike |
| $where | advanced condition query | {'$where': 'obj.fans_count == obj.follows_count'} | the document's follower count equals its follow count |
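The operators above compose into ordinary Python dictionaries, so a filter can be built and inspected without a running MongoDB (the field names here are made up):

```python
# 20 < age < 30, and name starting with M, combined in one filter document
query = {
    'age': {'$gt': 20, '$lt': 30},
    'name': {'$regex': '^M.*'},
}
# the same dict would be passed straight to collection.find(query)
print(query['age'])    # {'$gt': 20, '$lt': 30}
print(query['name'])   # {'$regex': '^M.*'}
```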

Sorting

results = collection.find().sort('name', pymongo.ASCENDING)

Offset

# skip skips the first two results; limit takes only two
results = collection.find().sort('name', pymongo.ASCENDING).skip(2)
results = collection.find().sort('name', pymongo.ASCENDING).skip(2).limit(2)

Updating

# student is a dictionary holding the updated data; condition is the filter
condition = {'name': 'Mike'}
result = collection.update_one(condition, {'$set': student})

# update multiple documents; inc means increase
condition = {'age': {'$gt': 20}}
result = collection.update_many(condition, {'$inc': {'age': 1}})

Deleting

result = collection.delete_one({'name': 'Kevin'})

# delete multiple documents
result = collection.delete_many({'age': {'$lt': 25}})

Redis cache storage

Redis is an in-memory, efficient, key-value non-relational database.

Connecting to Redis

# a ConnectionPool can also be used to create the connection; internally, StrictRedis also creates a ConnectionPool to connect
redis = StrictRedis(host='localhost', port=6379, db=0, password='')

Key operations

| Method | Purpose | Parameters | Example | Result |
| --- | --- | --- | --- | --- |
| exists(name) | whether a key exists | name: key name | redis.exists('name') | True |
| delete(name) | delete a key | name: key name | redis.delete('name') | 1 |
| type(name) | the type of a key | name: key name | redis.type('name') | b'string' |
| keys(pattern) | all keys matching a pattern | pattern: matching rule | redis.keys('n*') | [b'name'] |
| randomkey() | get a random key | | randomkey() | b'name' |
| rename(src, dst) | rename a key | src: old name; dst: new name | redis.rename('name', 'nickname') | True |
| dbsize() | number of keys in the current database | | dbsize() | 100 |
| expire(name, time) | set a key's expiry | time: seconds | redis.expire('name', 2) | True |
| ttl(name) | get a key's remaining time to live | name: key name | redis.ttl('name') | -1 (-1 means it never expires) |
| move(name, db) | move a key to another database | db: target database number | redis.move('name', 2) | True |
| flushdb() | delete all keys in the current database | | flushdb() | True |
| flushall() | delete all keys in all databases | | flushall() | True |

String operations

| Method | Purpose | Parameters | Example | Result |
| --- | --- | --- | --- | --- |
| set(name, value) | set the string key name to value | name: key; value: value | redis.set('name', 'Bob') | True |
| get(name) | return the value of the string key name | name: key | redis.get('name') | b'Bob' |
| getset(name, value) | set a new value and return the previous one | name: key; value: new value | redis.getset('name', 'Mike') | b'Bob' |
| mget(keys, *args) | return the values of several keys | keys: list of keys | redis.mget(['name', 'nickname']) | [b'Mike', b'Miker'] |
| setnx(name, value) | set the value only if the key does not exist | name: key | redis.setnx('newname', 'James') | True on the first run, False on the second |
| setex(name, time, value) | set a string value with an expiry | name: key; time: lifetime in seconds; value: value | redis.setex('name', 1, 'James') | True |
| setrange(name, offset, value) | overwrite part of the value starting at offset | name: key; offset: offset; value: value | redis.set('name', 'Hello') then redis.setrange('name', 6, 'World') | 11, the string length after modification |
| mset(mapping) | set several keys at once | mapping: dictionary | redis.mset({'name1': 'Durant', 'name2': 'James'}) | True |
| msetnx(mapping) | set several keys only if none of them exist | mapping: dictionary | redis.msetnx({'name3': 'Smith', 'name4': 'Curry'}) | True |
| incr(name, amount=1) | increment the value by amount (default 1); a missing key is created and set to amount | name: key; amount: increment | redis.incr('age', 1) | 1, the value after modification |
| decr(name, amount=1) | decrement the value by amount (default 1); a missing key is created and set to -amount | name: key; amount: decrement | redis.decr('age', 1) | -1, the value after modification |
| append(key, value) | append value to the string key | key: key | redis.append('nickname', 'OK') | 13, the string length after modification |
| substr(name, start, end=-1) | return a substring of the string key | name: key; start: start index; end: end index, default -1 (to the end) | redis.substr('name', 1, 4) | b'ello' |
| getrange(key, start, end) | substring of the value from start to end | key: key; start: start index; end: end index | redis.getrange('name', 1, 4) | b'ello' |

List operations

| Method | Purpose | Parameters | Example | Result |
| --- | --- | --- | --- | --- |
| rpush(name, *values) | append one or more elements to the tail of the list | name: key; values: values | redis.rpush('list', 1, 2, 3) | 3, the list size |
| lpush(name, *values) | prepend one or more elements to the head of the list | name: key; values: values | redis.lpush('list', 0) | 4, the list size |
| llen(name) | length of the list | name: key | redis.llen('list') | 4 |
| lrange(name, start, end) | elements between start and end | name: key; start: start index; end: end index | redis.lrange('list', 1, 3) | [b'3', b'2', b'1'] |
| ltrim(name, start, end) | trim the list, keeping only indexes start to end | name: key; start: start index; end: end index | ltrim('list', 1, 3) | True |
| lindex(name, index) | element at position index | name: key; index: index | redis.lindex('list', 1) | b'2' |
| lset(name, index, value) | set the element at index (error if out of range) | name: key; index: index; value: value | redis.lset('list', 1, 5) | True |
| lrem(name, count, value) | remove count occurrences of value from the list | name: key; count: number to remove; value: value | redis.lrem('list', 2, 3) | 1, the number removed |
| lpop(name) | return and remove the head element | name: key | redis.lpop('list') | b'5' |
| rpop(name) | return and remove the tail element | name: key | redis.rpop('list') | b'2' |
| blpop(keys, timeout=0) | return and remove the head element of the first non-empty list in keys, blocking while empty | keys: list of keys; timeout: wait time, 0 = wait forever | redis.blpop('list') | [b'5'] |
| brpop(keys, timeout=0) | same, but for the tail element | keys: list of keys; timeout: wait time, 0 = wait forever | redis.brpop('list') | [b'2'] |
| rpoplpush(src, dst) | pop the tail element of src and push it onto the head of dst, returning it | src: source list key; dst: target list key | redis.rpoplpush('list', 'list2') | b'2' |

Set operations

| Method | Purpose | Parameters | Example | Result |
| --- | --- | --- | --- | --- |
| sadd(name, *values) | add elements to the set | name: key; values: one or more values | redis.sadd('tags', 'Book', 'Tea', 'Coffee') | 3, the number of elements inserted |
| srem(name, *values) | remove elements from the set | name: key; values: one or more values | redis.srem('tags', 'Book') | 1, the number of elements removed |
| spop(name) | return and remove a random element of the set | name: key | redis.spop('tags') | b'Tea' |
| smove(src, dst, value) | move an element from the set src to the set dst | src: source set; dst: target set; value: the element | redis.smove('tags', 'tags2', 'Coffee') | True |
| scard(name) | number of elements in the set | name: key | redis.scard('tags') | 3 |
| sismember(name, value) | whether value is a member of the set | name: key | redis.sismember('tags', 'Book') | True |
| sinter(keys, *args) | intersection of the given sets | keys: list of keys | redis.sinter(['tags', 'tags2']) | {b'Coffee'} |
| sinterstore(dest, keys, *args) | compute the intersection and store it in dest | dest: result set; keys: list of keys | redis.sinterstore('inttag', ['tags', 'tags2']) | 1 |
| sunion(keys, *args) | union of the given sets | keys: list of keys | redis.sunion(['tags', 'tags2']) | {b'Coffee', b'Book', b'Pen'} |
| sunionstore(dest, keys, *args) | compute the union and store it in dest | dest: result set; keys: list of keys | redis.sunionstore('inttag', ['tags', 'tags2']) | 3 |
| sdiff(keys, *args) | difference of the given sets | keys: list of keys | redis.sdiff(['tags', 'tags2']) | {b'Book', b'Pen'} |
| sdiffstore(dest, keys, *args) | compute the difference and store it in dest | dest: result set; keys: list of keys | redis.sdiffstore('inttag', ['tags', 'tags2']) | 3 |
| smembers(name) | all elements of the set | name: key | redis.smembers('tags') | {b'Pen', b'Book', b'Coffee'} |
| srandmember(name) | return a random element without removing it | name: key | redis.srandmember('tags') | |

Sorted set operations

| Method | Purpose | Parameters | Example | Result |
| --- | --- | --- | --- | --- |
| zadd(name, *args, **kwargs) | add elements with scores to the zset; score determines the order. If an element already exists, its order is updated | name: key; args: variable arguments | redis.zadd('grade', 100, 'Bob', 98, 'Mike') | 2, the number of elements added |
| zrem(name, *values) | remove elements from the zset | name: key; values: elements | redis.zrem('grade', 'Mike') | 1, the number removed |
| zincrby(name, value, amount=1) | if value already exists, increase its score by amount; otherwise add it with score amount | name: key; value: element; amount: score increment | redis.zincrby('grade', 'Bob', -2) | 98.0, the value after modification |
| zrank(name, value) | rank of the element, with scores sorted ascending | name: key; value: element | redis.zrank('grade', 'Amy') | 1 |
| zrevrank(name, value) | rank of the element, with scores sorted descending | name: key; value: element | redis.zrevrank('grade', 'Amy') | 2 |
| zrevrange(name, start, end, withscores=False) | elements with index from start to end, scores sorted descending | name: key; start: start index; end: end index; withscores: whether to include scores | redis.zrevrange('grade', 0, 3) | [b'Bob', b'Mike', b'Amy', b'James'] |
| zrangebyscore(name, min, max, start=None, num=None, withscores=False) | elements whose score lies in the given interval | name: key; min: lowest score; max: highest score; start: start index; num: count; withscores: whether to include scores | redis.zrangebyscore('grade', 80, 95) | [b'Bob', b'Mike', b'Amy', b'James'] |
| zcount(name, min, max) | number of elements whose score lies in the interval | name: key; min: lowest score; max: highest score | redis.zcount('grade', 80, 95) | 2 |
| zcard(name) | number of elements in the zset | name: key | redis.zcard('grade') | 3 |
| zremrangebyrank(name, min, max) | remove elements whose rank lies in the interval | name: key; min: lowest rank; max: highest rank | redis.zremrangebyrank('grade', 0, 0) | 1, the number removed |
| zremrangebyscore(name, min, max) | remove elements whose score lies in the interval | name: key; min: lowest score; max: highest score | redis.zremrangebyscore('grade', 80, 90) | 1, the number removed |

Hash operations

| Method | Purpose | Parameters | Example | Result |
| --- | --- | --- | --- | --- |
| hset(name, key, value) | add a field mapping to the hash | name: key; key: field name; value: field value | hset('price', 'cake', 5) | 1, the number of mappings added |
| hsetnx(name, key, value) | add the mapping only if the field does not exist | name: key; key: field name; value: field value | hsetnx('price', 'book', 6) | 1, the number of mappings added |
| hget(name, key) | value of the field key in the hash | name: key; key: field name | redis.hget('price', 'cake') | 5 |
| hmget(name, keys, *args) | values of several fields | name: key; keys: list of field names | redis.hmget('price', ['apple', 'orange']) | [b'3', b'7'] |
| hmset(name, mapping) | add several mappings to the hash at once | name: key; mapping: dictionary | redis.hmset('price', {'banana': 2, 'pear': 6}) | True |
| hincrby(name, key, amount=1) | increase the value of field key by amount | name: key; key: field name; amount: increment | redis.hincrby('price', 'apple', 3) | 6, the value after modification |
| hexists(name, key) | whether the field exists in the hash | name: key; key: field name | redis.hexists('price', 'banana') | True |
| hdel(name, *keys) | delete fields from the hash | name: key; keys: field names | redis.hdel('price', 'banana') | True |
| hlen(name) | number of mappings in the hash | name: key | redis.hlen('price') | 6 |
| hkeys(name) | all field names of the hash | name: key | redis.hkeys('price') | [b'cake', b'book', b'banana', b'pear'] |
| hvals(name) | all field values of the hash | name: key | redis.hvals('price') | [b'5', b'6', b'2', b'6'] |
| hgetall(name) | all field-value pairs of the hash | name: key | redis.hgetall('price') | {b'cake': b'5', b'book': b'6', b'orange': b'7', b'pear': b'6'} |

Elasticsearch search engine storage

Version 7.16.3 (to be fleshed out once actually used; I don't yet understand Elasticsearch's mechanisms well enough)

Storing data here makes it convenient to search and analyze

  • A distributed real-time document store in which every field can be indexed and searched
  • A distributed real-time analytics search engine
  • Capable of scaling to hundreds of service nodes, supporting PB-scale structured or unstructured data

Connecting

es = Elasticsearch(['https://localhost:9200'], verify_certs=True)

Creating an index

result = es.indices.create(index='news', ignore=400)
# ignore=400 means error 400 (index already exists) is ignored

Deleting an index

Updating data

Deleting data

Querying data

Using RabbitMQ

Mainly used for data-message communication; an inter-process communication mechanism. It is a message queue.


Origin blog.csdn.net/weixin_46287316/article/details/126186889