How to paginate large amounts of data

This article is shared from the Huawei Cloud Community article "Paging Processing of Large Data in Applications" by Ma Le.

Introduction

Displaying large amounts of data has long been a problem that must be solved. The classic approach is to fetch and display the data in batches, i.e. pages.

1 Processing of foreign keys during query

If a Django model uses foreign keys, the behavior on deletion of the referenced row is defined through the on_delete argument:

CASCADE: Cascade. If the referenced (foreign-key) row is deleted, this row is deleted as well.
PROTECT: Protected. As long as any row still references the foreign-key row, that row cannot be deleted; attempting a forced delete makes Django raise an error (ProtectedError).
SET_NULL: Set to NULL. If the referenced row is deleted, the foreign key on this row is set to NULL, provided the field allows NULL.
SET_DEFAULT: Set the default value. If the referenced row is deleted, the foreign key is set to the field's default, provided a default is defined.
SET(): If the referenced row is deleted, the foreign key is set to the value passed to SET(). SET() also accepts a callable, in which case the callable's return value is used.
DO_NOTHING: Django takes no action; everything depends on database-level behavior.

Database level constraints:

RESTRICT: The default. A parent-table record cannot be deleted while the child table still has related records.
NO ACTION: Same as above; the foreign key is checked first.
CASCADE: When a parent record is deleted or updated, the related child records are deleted or updated as well.
SET NULL: When a parent record is deleted or updated, the foreign-key field of related child records is set to NULL, so the column must not be declared NOT NULL when designing the child table.

These foreign-key tools help users handle queries and deletions that span related tables.
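The database-level ON DELETE CASCADE behavior can be demonstrated with Python's built-in sqlite3 module; the parent/child table names are invented for the demo (SQLite additionally requires foreign keys to be switched on per connection):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE child (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES parent(id) ON DELETE CASCADE
    )
""")
conn.execute("INSERT INTO parent (id) VALUES (1)")
conn.execute("INSERT INTO child (id, parent_id) VALUES (10, 1)")

# Deleting the parent cascades to the child row.
conn.execute("DELETE FROM parent WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]
print(remaining)  # 0
```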

1.1 How to paginate queries in Django

In applications with paginated queries, statements containing LIMIT and OFFSET are very common, and almost all of them also carry an ORDER BY clause.

If the sort can be satisfied by an index, performance benefits greatly; otherwise the server has to do a lot of filesort work.

A frequent problem is an overly large offset. A query like LIMIT 10000, 20 produces 10,020 rows and discards the first 10,000, which is very expensive.

select * from table order by id limit 10000, 20;

The statement fetches 10,000 + 20 records, discards the first 10,000, and returns the last 20.
The query does paginate, but the larger the offset (10000 here), the worse the performance, because MySQL still has to scan all 10,020 records.

Assuming all pages are accessed with the same frequency, such queries scan half of the table on average. To optimize, you can limit the maximum number of accessible pages in paginated views, or make deep-page queries more efficient.
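The first mitigation, capping how deep users can paginate, can be sketched as a small helper. The function name and the cap of 500 pages are illustrative, not from the original:

```python
def clamp_page(requested_page: int, max_pages: int = 500) -> int:
    """Cap deep pagination: anything past max_pages is served as the last
    allowed page, and non-positive page numbers fall back to page 1."""
    return max(1, min(requested_page, max_pages))

print(clamp_page(10_000))  # 500
print(clamp_page(0))       # 1
```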

When many rows in a table match the query conditions, we usually do not need to fetch them all at once; doing so strains query efficiency and server performance. Take the simplest online shop: suppose it holds 10,000 products, but the front end only shows one page at a time.

select * from table where xxx="xxx" limit 10;

This fetches the first 10 rows that match the condition.

select * from table where xxx="xxx" limit 10 offset 10;

This paginates: it fetches the 11th through 20th rows that match the condition.
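The LIMIT/OFFSET behavior can be reproduced end to end with Python's built-in sqlite3; the item table and its 30 rows are made up for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO item (id, name) VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 31)],  # 30 demo rows
)

# Page 2 with a page size of 10: skip the first 10 rows, take the next 10.
rows = conn.execute(
    "SELECT id FROM item ORDER BY id LIMIT 10 OFFSET 10"
).fetchall()
print([r[0] for r in rows])  # ids 11 through 20
```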

Alternatively, paginate by tracking the maximum id seen so far:

select * from table where id > #max_id# order by id limit n;

This query also returns n records, but unlike the first method it does not need to scan and discard the first m rows. Instead, each query must carry the maximum (or minimum) id from the previous page, which makes this the more commonly used approach.

Of course, this query only supports stepping through adjacent pages: if we are currently on page 3 and want to jump to page 5, the boundary id is unknown and the method does not apply.
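The max-id technique (often called keyset or "seek" pagination) can be sketched with sqlite3; the item table and page size are invented for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO item (id) VALUES (?)", [(i,) for i in range(1, 101)])

def next_page(conn, last_seen_id, page_size=20):
    # Resume right after the last id of the previous page, so no rows
    # have to be scanned and discarded.
    return [
        r[0]
        for r in conn.execute(
            "SELECT id FROM item WHERE id > ? ORDER BY id LIMIT ?",
            (last_seen_id, page_size),
        )
    ]

page1 = next_page(conn, 0)
page2 = next_page(conn, page1[-1])
print(page2[0], page2[-1])  # 21 40
```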

Or use a subquery: first skip past the first 10,000 rows to find the boundary id, then select the next 20 rows that meet the requirements.

select * from table where id > (select id from table order by id limit m, 1) limit n;

Here the subquery scans only the id column (typically via the primary-key index); no table join is needed, just a simple comparison. This is the recommended approach when the maximum id of the previous page is not known.
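The subquery approach can also be demonstrated with sqlite3 (which spells LIMIT m, 1 as LIMIT 1 OFFSET m); the item table is invented, and to fetch rows 41-60 the boundary is the id at offset 39, i.e. the 40th row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO item (id) VALUES (?)", [(i,) for i in range(1, 101)])

# The inner query finds the id of the 40th row (offset 39) using only the
# primary-key index; the outer query then takes the next 20 rows after it.
ids = [
    r[0]
    for r in conn.execute(
        """
        SELECT id FROM item
        WHERE id > (SELECT id FROM item ORDER BY id LIMIT 1 OFFSET 39)
        ORDER BY id LIMIT 20
        """
    )
]
print(ids[0], ids[-1])  # 41 60
```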

Join-based (left/right join) variants of this tend to perform worse.
There are also subquery and join techniques that use an index to locate the boundary tuple quickly and then read the rows, for example:

SELECT * FROM table WHERE id <= (SELECT id FROM table ORDER BY id DESC LIMIT (page-1)*pagesize, 1) ORDER BY id DESC LIMIT pagesize;

(Here (page-1)*pagesize and pagesize are placeholders: the application must compute the concrete numbers, since MySQL does not accept expressions inside LIMIT.)

rest_framework ships with a built-in pagination module. Let us apply it to a concrete view in employee/views.py:

from rest_framework.decorators import api_view, permission_classes
from rest_framework.pagination import PageNumberPagination
from rest_framework.response import Response

@api_view(['GET', 'POST'])
@permission_classes([CustomPermission])
def blog_api_view(request):
    """Return tasks one page at a time."""
    if request.method == "GET":
        paginator = PageNumberPagination()
        # page_size = 1 would display only 1 item per page; here we use 2.
        paginator.page_size = 2
        task_objects = EmployeeSign.objects.all()
        result = paginator.paginate_queryset(task_objects, request)
        serializer = TaskSerializer(result, many=True)
        return Response(serializer.data)

Without pagination, all records would be returned on a single page.

Access the paginated data. The default endpoint http://127.0.0.1:2001/api/tasks/ returns page 1; other pages are selected via the page query parameter:

http://127.0.0.1:2001/api/tasks/?page=1  # page=2, 3, 4, ...

2 Summary

Again: in applications with paginated queries, statements containing LIMIT and OFFSET are very common, and almost all of them carry an ORDER BY clause. If the sort can use an index, performance benefits greatly; otherwise the server must do a lot of filesort work.

A frequent problem is an overly large offset. A query like LIMIT 10000, 20 produces 10,020 rows and discards the first 10,000, which is very expensive.

Assuming that all pages are accessed with the same frequency, such a query will scan half of the data table on average.

To optimize, limit the maximum number of accessible pages in paginated views, or make deep-page queries more efficient.


Origin my.oschina.net/u/4526289/blog/11051586