PostgreSQL Physical Bad Blocks and File Corruption: Case Studies

About the Author

Wang Rui is a database architect at Ping An Good Doctor and has worked in PostgreSQL database operations and development for many years. He previously worked at China Civil Aviation Information (TravelSky) and Decathlon China, and has experience with several other database products.

Background

Recently I have seen many people run into bad blocks or corrupted data in PostgreSQL. There is relatively little material on this in Chinese online, so I have organized the errors I encountered and the various ways to resolve them.

Case I: Physical bad block

A logical backup (pg_dump) reports an error:

pg_dump: Dumping the contents of table "xxxx" failed: PQgetResult() failed.
pg_dump: Error message from server: ERROR: invalid memory alloc request size 18446744073709551613
pg_dump: The command was: COPY xxxxxx (id, active_flag, bkd, blk, go_show, grs, lss, lsv, lt, no_show, value, wl, inv_seg_cabin_id, ind) TO stdout;
pg_dump: [parallel archiver] a worker process died unexpectedly

Cause: a corrupted row in the database. Possible culprits include damaged hardware, a software bug (for example, PostgreSQL 9.2 had a bug in which a piece of memory could be overwritten with random data), or incorrect hardware configuration.

My first thought was PostgreSQL's own zero_damaged_pages parameter. I set it to on, but the error persisted. Checking the official documentation: this option does not physically change the files on disk; it only zeroes out damaged pages in memory, in the cache. If this method does clear the error, dump the table and restore it, or SELECT the surviving rows into another table, because the on-disk damage is still there.
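If zero_damaged_pages does clear the error, a minimal salvage sketch looks like this (xxxx stands for the damaged table, as in the error above; the _salvaged name is made up for illustration):

-- Requires superuser; affects this session only. Damaged pages are
-- zeroed in memory as they are read, so the rows on them are lost.
SET zero_damaged_pages = on;

-- Pull everything still readable into a fresh table.
CREATE TABLE xxxx_salvaged AS SELECT * FROM xxxx;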

Solution: delete the damaged row(s).

create extension hstore;  -- (installation steps omitted)
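The repair below relies on hstore(row), which expands a row into key/value pairs and therefore has to read (and detoast) every column; that is what makes the scan trip over the damaged bytes. A quick illustration of the expansion against an ordinary catalog table:

SELECT (each(hstore(t))).* FROM pg_am t LIMIT 3;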

1. Define the function:

-- Scans a table row by row, reading every column of every row via
-- hstore(), and reports the last readable ctid before an error occurs.
CREATE OR REPLACE FUNCTION
  find_bad_row(tableName TEXT)
  RETURNS tid
  AS $find_bad_row$
DECLARE
  result tid;
  curs REFCURSOR;
  row1 RECORD;
  row2 RECORD;
  tabName TEXT;
  count BIGINT := 0;
BEGIN
  -- Strip the schema qualifier: the hstore() call below needs the
  -- bare table name as the row alias.
  SELECT reverse(split_part(reverse($1), '.', 1)) INTO tabName;
  OPEN curs FOR EXECUTE 'SELECT ctid FROM ' || tableName;

  count := 1;
  FETCH curs INTO row1;
  WHILE row1.ctid IS NOT NULL LOOP
    -- Remember the ctid of the last row that was read successfully.
    result := row1.ctid;
    count := count + 1;
    FETCH curs INTO row1;
    -- Expanding the row through hstore() forces every column to be
    -- read; a damaged row raises an error here.
    EXECUTE 'SELECT (each(hstore(' || tabName || '))).* FROM '
         || tableName || ' WHERE ctid = $1' INTO row2
         USING row1.ctid;
    IF count % 100000 = 0 THEN
      RAISE NOTICE 'rows processed: %', count;
    END IF;
  END LOOP;

  CLOSE curs;
  RETURN row1.ctid;
EXCEPTION
  WHEN OTHERS THEN
    RAISE NOTICE 'LAST CTID: %', result;
    RAISE NOTICE '%: %', SQLSTATE, SQLERRM;
    RETURN result;
END
$find_bad_row$
LANGUAGE plpgsql;

2. Use the function to locate the problem row:

js1=# select find_bad_row('public.description');
NOTICE: LAST CTID: (78497,6)
NOTICE: XX000: invalid memory alloc request size 18446744073709551613
find_bad_row
--------------
(78497,6)
(1 row)

js1=# select * from xxxxxxx where ctid = '(78498,1)';
ERROR: invalid memory alloc request size 18446744073709551613
js1=# delete from xxxxxx where ctid = '(78498,1)';

Here xxxx stands for the table being repaired. Note that the function reports the last ctid it could read, (78497,6); the corrupted row is the next one in physical order, which is why the row at (78498,1) is queried and deleted.
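Before moving on, it is worth re-running the scan: if find_bad_row now runs to completion without raising the error NOTICE (and returns an empty result), every remaining row can be read in full.

select find_bad_row('public.xxxxxx');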

3. Then run the pg_dump command again.

A detailed analysis is available here: https://www.postgresql.org/message-id/54889986.3000308%40gmail.com

Case II: pg_clog file corruption caused by a power outage

pg_clog damage

Error message: Could not read from file "pg_clog/0646" at offset 243287

The server lost power abruptly. Since this was a test database, there was no backup and no standby. (For a DBA, backups are a matter of life and death: whether it is a test database or a production one, always take backups.)
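Some background on this fix: the clog (commit log) records the status of every transaction in 2 bits, so one byte covers 4 transactions and one 256 kB segment file covers 1,048,576 transaction IDs; segment file names are hexadecimal. As a rough orientation, the transaction range covered by the damaged segment 0646, and the approximate transaction at the failing offset, can be computed like this (a sketch assuming the default 256 kB segment size):

SELECT x'0646'::int                        AS segment_no,
       x'0646'::int * 1048576              AS first_xid_in_segment,
       x'0646'::int * 1048576 + 243287 * 4 AS approx_xid_at_failed_offset;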

  1. Take a full physical backup of the database first (insurance before the repair).
  2. Forge a replacement clog segment in which every transaction is marked as committed, then fix its ownership and permissions and move it into place, as shown below.
# One clog segment is 262144 bytes (256 kB). Octal \125 is 0x55, i.e.
# binary 01010101: each transaction gets 2 status bits in the clog and
# 01 means "committed", so this file marks every transaction committed.
for i in {1..262144}; do printf '\125'; done > committed

# Verify the size and contents:

$ ls -l committed
-rw-r--r-- 1 root root 262144 2009-06-25 11:01 committed

$ od -xv committed  | head
0000000 5555 5555 5555 5555 5555 5555 5555 5555
0000020 5555 5555 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555 5555 5555 5555 5555
0000060 5555 5555 5555 5555 5555 5555 5555 5555
0000100 5555 5555 5555 5555 5555 5555 5555 5555
0000120 5555 5555 5555 5555 5555 5555 5555 5555
0000140 5555 5555 5555 5555 5555 5555 5555 5555
0000160 5555 5555 5555 5555 5555 5555 5555 5555
0000200 5555 5555 5555 5555 5555 5555 5555 5555
0000220 5555 5555 5555 5555 5555 5555 5555 5555
$ od -xv committed  | tail
0777560 5555 5555 5555 5555 5555 5555 5555 5555
0777600 5555 5555 5555 5555 5555 5555 5555 5555
0777620 5555 5555 5555 5555 5555 5555 5555 5555
0777640 5555 5555 5555 5555 5555 5555 5555 5555
0777660 5555 5555 5555 5555 5555 5555 5555 5555
0777700 5555 5555 5555 5555 5555 5555 5555 5555
0777720 5555 5555 5555 5555 5555 5555 5555 5555
0777740 5555 5555 5555 5555 5555 5555 5555 5555
0777760 5555 5555 5555 5555 5555 5555 5555 5555
1000000

# Give the file the ownership and permissions PostgreSQL expects,
# then move it over the damaged segment.
chown postgres:postgres committed
chmod 600 committed
mv -i committed $PGDATA/pg_clog/0646

Note that this only clears the immediate error; it cannot repair the real damage in the underlying files. In particular, transactions that actually aborted are now marked as committed. If you have a backup, restoring from it is the better option.

Case III: TOAST table corruption

missing chunk number x for toast value x in pg_toast_x

The TOAST table associated with a particular table has corrupted data.

Solution adapted from: http://m.2cto.com/database/201802/720718.html

1. Identify which table the problem TOAST table belongs to:

select 2619::regclass;
   regclass
--------------
 pg_statistic
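The number in pg_toast_NNNN is the OID of the owning table, which is why the cast above resolves it. The same answer can also be read from the catalog (assuming, as in this error, that the TOAST table is pg_toast_2619):

select oid::regclass from pg_class where reltoastrelid = 'pg_toast.pg_toast_2619'::regclass;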

2. Having found the problem table, first try some simple repairs on it:

REINDEX table pg_toast.pg_toast_2619;
REINDEX table pg_statistic;
VACUUM ANALYZE pg_statistic;

3. If the error persists, locate the corrupted rows in the table:

DO $$
DECLARE
  v_rec record;
BEGIN
  -- ctid is a system column, so it must be selected explicitly.
  FOR v_rec IN SELECT ctid, * FROM pg_statistic LOOP
    RAISE NOTICE 'ctid is: %', v_rec.ctid;
    -- Printing the whole row forces the TOASTed values to be read,
    -- so the loop fails on the corrupted row: the last ctid printed
    -- is the one to delete.
    RAISE NOTICE 'row is: %', v_rec;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

4. Delete the records located in step 3:

delete from pg_statistic where ctid ='(50,3)';

5. Repeat steps 3 and 4 until all problem records have been cleared.

6. At this point the TOAST problem is resolved. Afterwards, carry out routine maintenance on the database, or rebuild all of its indexes.
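In this particular case the damaged table was pg_statistic, whose contents are regenerated by ANALYZE, so the deleted rows are not really lost. A closing maintenance pass might look like this (the database name is a placeholder):

-- Regenerate the statistics rows deleted above.
ANALYZE;
-- Rebuild every index; must be run while connected to that database.
REINDEX DATABASE mydb;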

In fact, generally speaking, PostgreSQL will not roll back committed transactions on its own based on the archive or WAL. In my environment, because archiving was missing, the garbled data could only be deleted manually.

Finally, I want to say that many of these problems arise precisely because there is no reliable backup. My advice to everyone: no matter the situation, take a backup first, and verify that backup. It is extremely important!

 

 

Forwarded from:

http://blog.sina.com.cn/s/blog_67d069a90102vibc.html
