PostgreSQL memory configuration problem

Environment

Operating system: CentOS Linux release 7.3.1611 (Core)
Database version: PostgreSQL 10.6

The environment is a master-slave streaming replication cluster, with corosync + pacemaker handling high-availability management. One slave uses synchronous replication to share the read load; the other slave uses asynchronous replication as a real-time backup.

Incident recap

The following monitoring alert arrived in the morning:

Stack: corosync
Current DC: sh01-oscar-cmp-pp-pg03 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Wed Oct 16 10:17:02 2019
Last change: Wed Oct 16 10:15:17 2019 by root via crm_attribute on sh01-oscar-cmp-pp-pg03

3 nodes configured
11 resources configured

Online: [ sh01-oscar-cmp-pp-pg01 sh01-oscar-cmp-pp-pg02 sh01-oscar-cmp-pp-pg03 ]

Full list of resources:

fence-sh01-oscar-cmp-pp-pg01	(ocf::heartbeat:fence_check):	Started sh01-oscar-cmp-pp-pg01
fence-sh01-oscar-cmp-pp-pg02	(ocf::heartbeat:fence_check):	Started sh01-oscar-cmp-pp-pg02
fence-sh01-oscar-cmp-pp-pg03	(ocf::heartbeat:fence_check):	Started sh01-oscar-cmp-pp-pg03
Resource Group: master-group
vip-master	(ocf::heartbeat:IPaddr2):	Started sh01-oscar-cmp-pp-pg03
Resource Group: slave-group
vip-slave	(ocf::heartbeat:IPaddr2):	Started sh01-oscar-cmp-pp-pg02
Master/Slave Set: msPostgresql [pgsql]
Masters: [ sh01-oscar-cmp-pp-pg03 ]
Slaves: [ sh01-oscar-cmp-pp-pg02 ]
Stopped: [ sh01-oscar-cmp-pp-pg01 ]
Clone Set: clnPingCheck [pingCheck]
Started: [ sh01-oscar-cmp-pp-pg01 sh01-oscar-cmp-pp-pg02 sh01-oscar-cmp-pp-pg03 ]

Failed Actions:
* pgsql_start_0 on sh01-oscar-cmp-pp-pg01 'unknown error' (1): call=84, status=complete, exitreason='My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
last-rc-change='Wed Oct 16 10:15:06 2019', queued=0ms, exec=167ms

The cluster status shows that node 01 is abnormal; at the same time, Prometheus raised an alert that the service port on node 01 was no longer reachable.

Problem analysis

Logging in to the node showed that PostgreSQL on node 01 had shut down. The database log reveals the problem:

2019-10-16 10:15:02.651 CST [55400] LOG:  server process (PID 16342) was terminated by signal 9: Killed

2019-10-16 10:15:02.651 CST [55400] LOG:  terminating any other active server processes
2019-10-16 10:15:02.651 CST [20414] WARNING:  terminating connection because of crash of another server process
2019-10-16 10:15:02.651 CST [20414] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

2019-10-16 10:15:02.681 CST [20523] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-10-16 10:15:02.694 CST [55400] LOG:  all server processes terminated; reinitializing
2019-10-16 10:15:02.785 CST [20551] LOG:  database system was interrupted; last known up at 2019-10-16 10:12:42 CST
2019-10-16 10:15:02.787 CST [20552] FATAL:  the database system is in recovery mode
2019-10-16 10:15:02.971 CST [20572] FATAL:  the database system is in recovery mode
2019-10-16 10:15:03.050 CST [20551] LOG:  database system was not properly shut down; automatic recovery in progress
2019-10-16 10:15:03.058 CST [20551] LOG:  redo starts at 0/65CBB4A0
2019-10-16 10:15:03.164 CST [20686] FATAL:  the database system is in recovery mode
2019-10-16 10:15:03.190 CST [20688] FATAL:  the database system is in recovery mode
2019-10-16 10:15:03.195 CST [20689] FATAL:  the database system is in recovery mode
2019-10-16 10:15:03.214 CST [20690] FATAL:  the database system is in recovery mode
2019-10-16 10:15:03.260 CST [20551] LOG:  invalid record length at 0/67D224D8: wanted 24, got 0
2019-10-16 10:15:03.260 CST [20551] LOG:  redo done at 0/67D224B0
2019-10-16 10:15:03.260 CST [20551] LOG:  last completed transaction was at log time 2019-10-16 10:15:02.434297+08
2019-10-16 10:15:03.391 CST [55400] LOG:  database system is ready to accept connections
2019-10-16 10:15:03.419 CST [55400] LOG:  received fast shutdown request
2019-10-16 10:15:03.421 CST [55400] LOG:  aborting any active transactions
2019-10-16 10:15:03.423 CST [55400] LOG:  worker process: logical replication launcher (PID 20811) exited with exit code 1
2019-10-16 10:15:03.425 CST [20804] LOG:  shutting down
2019-10-16 10:15:03.462 CST [20859] FATAL:  the database system is shutting down
2019-10-16 10:15:03.465 CST [20860] FATAL:  the database system is shutting down
2019-10-16 10:15:03.491 CST [55400] LOG:  database system is shut down

It can be observed that at 10:15:02 the PostgreSQL backend process was killed outright with signal 9 (SIGKILL).

Checking the system log (/var/log/messages):

Oct 16 10:15:02 sh01-oscar-cmp-pp-pg01 kernel: Out of memory: Kill process 16342 (postgres) score 843 or sacrifice child
Oct 16 10:15:02 sh01-oscar-cmp-pp-pg01 kernel: Killed process 16342 (postgres) total-vm:8494044kB, anon-rss:3399704kB, file-rss:400kB, shmem-rss:21080kB
Oct 16 10:15:02 sh01-oscar-cmp-pp-pg01 kernel: postgres: page allocation failure: order:0, mode:0x2015a

So memory had been exhausted, the OOM killer was triggered, and the operating system killed the postgres process that was consuming the most memory.

Problem resolution

Check the total memory of the machine:

[root@sh01-oscar-cmp-pp-pg01 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           3.7G        194M        2.6G        149M        971M        3.1G
Swap:          2.0G        290M        1.7G

The machine has less than 4 GB of RAM.

Yet shared_buffers in postgresql.conf was set to 3 GB; with the memory consumed by each session on top of that, system memory was exhausted. Lowering shared_buffers to 1 GB resolved the problem.
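As a sanity check, the failure arithmetic can be reproduced with a quick sketch. The 3 GB shared_buffers and 3.7 GB RAM figures are from this incident; the session count and per-session work_mem are assumed illustrative values:

```python
# Rough sketch of why a 3 GB shared_buffers on a 3.7 GB box is unsafe.
# sessions and work_mem are assumed illustrative values, not from this incident.
GB = 1024 ** 3
MB = 1024 ** 2

total_ram      = 3.7 * GB
shared_buffers = 3 * GB    # the original (bad) setting
work_mem       = 4 * MB    # PostgreSQL's default work_mem
sessions       = 50        # assumed number of concurrent backends

# Even ignoring the OS, page cache, and other daemons:
estimated_peak = shared_buffers + sessions * work_mem
print(estimated_peak / GB)               # ~3.2 GB, leaving < 0.5 GB for everything else
print(estimated_peak > 0.8 * total_ram)  # True -> OOM-killer territory
```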

The detailed memory calculation method is described in this post:
PostgreSQL memory consumption calculation method

It is reproduced below:

  • wal_buffers defaults to -1, in which case it is sized from shared_buffers: 1/32 of shared_buffers, clamped between 64 kB and the size of one WAL segment (16 MB by default).
  • autovacuum_work_mem defaults to -1, in which case the value of maintenance_work_mem is used instead.
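Those two -1 defaults can be made explicit with a small helper (a sketch; the 1/32 ratio and the 64 kB–16 MB clamp for wal_buffers follow the PostgreSQL documentation, and the function names are ours):

```python
GB = 1024 ** 3
MB = 1024 ** 2

def effective_wal_buffers(wal_buffers, shared_buffers, wal_segment=16 * MB):
    """wal_buffers = -1 means 1/32 of shared_buffers,
    clamped to [64 kB, one WAL segment]."""
    if wal_buffers == -1:
        return min(max(shared_buffers // 32, 64 * 1024), wal_segment)
    return wal_buffers

def effective_autovacuum_work_mem(autovacuum_work_mem, maintenance_work_mem):
    """autovacuum_work_mem = -1 means fall back to maintenance_work_mem."""
    return maintenance_work_mem if autovacuum_work_mem == -1 else autovacuum_work_mem

print(effective_wal_buffers(-1, 19 * GB) // MB)        # 16 (capped at one WAL segment)
print(effective_autovacuum_work_mem(-1, 1 * GB) // MB)  # 1024 (falls back to 1 GB)
```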

1 Neither wal_buffers nor autovacuum_work_mem set

The calculation formula is:

max_connections*work_mem 
+ max_connections*temp_buffers 
+ shared_buffers
+ (autovacuum_max_workers * maintenance_work_mem)

Assume that the configuration of PostgreSQL is as follows:

max_connections = 100
temp_buffers=32MB
work_mem=32MB
shared_buffers=19GB
autovacuum_max_workers = 3
maintenance_work_mem=1GB # default is 64MB

Then calculate the memory as:

select(
	(100*(32*1024*1024)::bigint)
	+ (100*(32*1024*1024)::bigint)
	+ (19*(1024*1024*1024)::bigint)
	+ (3 * (1024*1024*1024)::bigint )
)::float8 / 1024 / 1024 / 1024
--output
28.25

With this configuration, PostgreSQL's peak memory load is 28.25 GB. With 32 GB of physical memory, that leaves 3.75 GB for the operating system.

2 wal_buffers set, autovacuum_work_mem not set

The calculation formula is:

max_connections*work_mem 
+ max_connections*temp_buffers 
+ shared_buffers+wal_buffers
+ (autovacuum_max_workers * maintenance_work_mem)

Assume that the configuration of PostgreSQL is as follows:

max_connections = 100
temp_buffers=32MB
work_mem=32MB
shared_buffers=19GB	
wal_buffers=16MB # default size of one WAL segment (--with-wal-segsize)
autovacuum_max_workers = 3	
maintenance_work_mem=1GB

Then calculate the memory as:

select(
	(100*(32*1024*1024)::bigint)
	+ (100*(32*1024*1024)::bigint)
	+ (19*(1024*1024*1024)::bigint)
	+ (16*1024*1024)::bigint
	+ (3 * (1024*1024*1024)::bigint )
)::float8  / 1024 / 1024 / 1024
--output
28.27

With this configuration, PostgreSQL's peak memory load is about 28.27 GB. With 32 GB of physical memory, roughly 3.73 GB is left for the operating system.

3 Both wal_buffers and autovacuum_work_mem set [recommended]

The calculation formula is:

max_connections*work_mem 
+ max_connections*temp_buffers 
+ shared_buffers+wal_buffers
+ (autovacuum_max_workers * autovacuum_work_mem)
+ maintenance_work_mem

Assume that the configuration of PostgreSQL is as follows:

max_connections = 100
temp_buffers=32MB
work_mem=32MB
shared_buffers=19GB	
wal_buffers=262143kb
autovacuum_max_workers = 3
autovacuum_work_mem=256MB
maintenance_work_mem=2GB

Then calculate the memory as:

select(
    (100*(32*1024*1024)::bigint)
    + (100*(32*1024*1024)::bigint)
    + (19*(1024*1024*1024)::bigint)
    + (262143*1024)::bigint
    + (3 * (256*1024*1024)::bigint )
    + ( 2 * (1024*1024*1024)::bigint )
)::float8  / 1024 / 1024 / 1024
--output
28.25

With this configuration, PostgreSQL's peak memory load is about 28.25 GB. With 32 GB of physical memory, 3.75 GB remains for the operating system. Setting every memory parameter explicitly based on the hardware, as in this case, is the recommended approach.
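Once the defaults are resolved, the three cases above collapse into one formula. A sketch in Python (the function name pg_peak_memory is ours; parameter names mirror postgresql.conf, and all values are in bytes):

```python
GB, MB, KB = 1024 ** 3, 1024 ** 2, 1024

def pg_peak_memory(max_connections, work_mem, temp_buffers, shared_buffers,
                   wal_buffers, autovacuum_max_workers, autovacuum_work_mem,
                   maintenance_work_mem):
    """Upper-bound memory estimate matching case 3 above (all values in bytes)."""
    return (max_connections * work_mem
            + max_connections * temp_buffers
            + shared_buffers
            + wal_buffers
            + autovacuum_max_workers * autovacuum_work_mem
            + maintenance_work_mem)

# Case 3 settings from the text:
total = pg_peak_memory(
    max_connections=100, work_mem=32 * MB, temp_buffers=32 * MB,
    shared_buffers=19 * GB, wal_buffers=262143 * KB,
    autovacuum_max_workers=3, autovacuum_work_mem=256 * MB,
    maintenance_work_mem=2 * GB)
print(round(total / GB, 2))  # 28.25
```

Had this check been run against the incident machine's settings, the mismatch between the estimate and the < 4 GB of RAM would have been obvious before the OOM kill.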

Origin blog.csdn.net/sunbocong/article/details/102582504