Implementing a PostgreSQL custom aggregate function using hstore

Recently I ran into an unusual reporting query requirement at work. Simplified, the business requirement is as follows:

CREATE TABLE public.book (
  bookid INTEGER NOT NULL,
  bookname CHARACTER VARYING(255) NOT NULL,
  authors CHARACTER VARYING(255) NOT NULL,
  info CHARACTER VARYING(255) NOT NULL,
  comment CHARACTER VARYING(255) NOT NULL,
  year_publication DATE NOT NULL,
  publisher CHARACTER VARYING(10) -- publisher
);
COMMENT ON COLUMN public.book.publisher IS 'publisher';

The table records the books each publisher has published. bookname is the book title, authors is the author, year_publication is the publication date, and publisher is the publisher name; the info and comment fields can be ignored. The data is as follows:

INSERT INTO public.book (bookid, bookname, authors, info, comment, year_publication, publisher) VALUES (5, 'c++', 'lisi', ' ', ' ', '2016-12-07', 'pub_1');
INSERT INTO public.book (bookid, bookname, authors, info, comment, year_publication, publisher) VALUES (4, 'php', 'lisi', ' ', ' ', '2016-12-08', 'pub_1');
INSERT INTO public.book (bookid, bookname, authors, info, comment, year_publication, publisher) VALUES (2, 'c', 'zhangsan', ' ', ' ', '2016-12-06', 'pub_1');
INSERT INTO public.book (bookid, bookname, authors, info, comment, year_publication, publisher) VALUES (3, 'python', 'lisi', ' ', ' ', '2016-12-09', 'pub_1');
INSERT INTO public.book (bookid, bookname, authors, info, comment, year_publication, publisher) VALUES (1, 'java', 'zhangsan', ' ', ' ', '2016-12-07', 'pub_1');

The requirements are simplified as follows:

Query the title of the earliest published book of each author, and allow the result to be grouped by either the authors field or the publisher field.

For example, if the data in the table is aggregated according to authors, the result is as follows:

authors	bookname
zhangsan	c
lisi	c++

If the grouping is by the publisher field, the result should be a single record, as follows:

publisher	bookname
pub_1	c,c++

That is, when grouping by publisher, take the title of the earliest published book of each author in the group and return the titles joined into one string.

Implementing this in plain SQL is fairly complicated. My team lead asked for it to be implemented as an aggregate function, so that we could then see how it performs. The database is PostgreSQL.
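For comparison, here is a rough sketch of how the grouping by publisher might be written in plain SQL. It is not the approach used in this article, and it assumes that DISTINCT ON (a PostgreSQL-specific feature) is acceptable. If an author has two books with the same date, which of them is kept is arbitrary.

```sql
-- Inner query: keep one row per (publisher, authors), the one with the
-- earliest year_publication. Outer query: join the titles per publisher.
SELECT publisher,
       string_agg(bookname, ',' ORDER BY authors) AS bookname
FROM (
  SELECT DISTINCT ON (publisher, authors)
         publisher, authors, bookname
  FROM public.book
  ORDER BY publisher, authors, year_publication
) AS earliest
GROUP BY publisher;
```

Grouping by authors instead needs a differently shaped query, which is part of why a single reusable aggregate is attractive here.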

The custom aggregation function in postgresql is defined as follows:

Aggregates in PostgreSQL are defined in terms of state values and state transition functions. That is, an aggregate operates using a state value that is updated as each successive input row is processed. To define a new aggregate function, one selects a data type for the state value, an initial value for the state, and a state transition function. The state transition function takes the previous state value and the aggregate's input value(s) for the current row, and returns a new state value. A final function can also be specified, in case the desired result of the aggregate is different from the data that needs to be kept in the running state value. The final function takes the last state value and returns whatever is wanted as the aggregate result. In principle, the transition and final functions are just ordinary functions that could also be used outside the context of the aggregate. (In practice, it is often helpful for performance reasons to create specialized transition functions that can only work when called as part of an aggregate.)

Thus, in addition to the argument and result data types seen by a user of the aggregate, there is an internal state-value data type that might be different from both the argument and result types.

The above definition comes from the PostgreSQL manual (Chinese translation); the link is as follows:

http://postgres.cn/docs/9.4/xaggr.html

Here is an example of it:

CREATE AGGREGATE avg (float8)
(
    sfunc = float8_accum,
    stype = float8[],
    finalfunc = float8_avg,
    initcond = '{0,0,0}'
);

The principle of the aggregate function is as follows:

1. An aggregate consists of the aggregate itself (the outer wrapper), sfunc, and finalfunc, of which finalfunc is optional. The aggregate is defined with the AGGREGATE keyword and must specify sfunc and the type of the intermediate state variable (stype), plus finalfunc if one is used;

2. sfunc is the state transition function. It is called once for every record in a group, and the intermediate result is kept in the state variable whose type is given by stype. On the next call, the new record's values and the intermediate state are passed in automatically;

3. finalfunc is the final processing function. Its parameter is the final state value, that is, the result of sfunc having processed every record in the group, and it can process that value further;

4. If a finalfunc is defined, the aggregate returns the return value of finalfunc. Otherwise, the aggregate returns the final state value left by sfunc.
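The four points above can be illustrated with a deliberately small aggregate. This is only a sketch: the names demo_concat, demo_concat_sfunc and demo_concat_ffunc are made up for illustration and are not part of the solution described in this article.

```sql
-- State transition function: called once per row, appends the row's value
-- to the state array and returns the new state.
CREATE FUNCTION demo_concat_sfunc(state text[], item text)
  RETURNS text[]
  LANGUAGE sql
AS $$ SELECT array_append(state, item) $$;

-- Final function: called once per group, turns the final state into the result.
CREATE FUNCTION demo_concat_ffunc(state text[])
  RETURNS text
  LANGUAGE sql
AS $$ SELECT array_to_string(state, ',') $$;

-- The aggregate ties the pieces together; initcond is the starting state.
CREATE AGGREGATE demo_concat(text)
(
  sfunc = demo_concat_sfunc,
  stype = text[],
  finalfunc = demo_concat_ffunc,
  initcond = '{}'
);

-- One result row per publisher, with all titles of that group joined by commas.
SELECT publisher, demo_concat(bookname) FROM public.book GROUP BY publisher;
```

Here the state type (text[]) differs from both the argument type (text) and the result type (text), which is exactly the case where a finalfunc is needed.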

Now let's analyze the requirements:

When grouping by the publisher field, all 5 rows fall into one group, so sfunc is called 5 times and finalfunc once. Each call to sfunc has to keep track of the distinct authors and, for each author, compare dates and keep the earliest one. When finalfunc is called, it has to summarize that data by joining the earliest book titles of the different authors and returning the result. For example, after all calls to sfunc, the data structure should hold the following:

bookname	authors	year_publication
c	zhangsan	2016-12-06
c++	lisi	2016-12-07

The data structure is empty at the start, and the order in which sfunc receives the records is not guaranteed. For example, on the first call to sfunc the intermediate state is empty and the record being processed is ('zhangsan', 'java', '2016-12-07'), so it is stored in the data structure. On the second call the record ('zhangsan', 'c', '2016-12-06') is passed in; its date is compared with the stored one and, because it is earlier, the stored entry is replaced with the current record.

After all the sfunc calls for a group are done, finalfunc is called once for that group. It traverses the data structure, joins the bookname values, and returns the result. The return value of the whole aggregate function is whatever finalfunc returns.

For the data structure that holds the intermediate state, our old practice was to use a table, which can hold multiple records and does work. But my lead said that table storage is relatively slow and asked whether it could be done without tables.

After looking around, I found that PostgreSQL has an hstore extension. After some research, it turned out to be a key/value data structure similar to a hash map.

First, hstore needs to be installed. On Linux it is provided by the postgresql94-contrib-9.4.10-1PGDG.rhel6.x86_64.rpm package. The following is the complete list of packages installed for testing:

rpm -ivh postgresql94-libs-9.4.10-1PGDG.rhel6.x86_64.rpm
rpm -ivh postgresql94-9.4.10-1PGDG.rhel6.x86_64.rpm
rpm -ivh postgresql94-server-9.4.10-1PGDG.rhel6.x86_64.rpm
rpm -ivh postgresql94-contrib-9.4.10-1PGDG.rhel6.x86_64.rpm

Install in the order above.

Once PostgreSQL has been installed and started, add the hstore extension: switch to the default postgres account, start psql, and execute the following command:

create extension hstore;

CREATE EXTENSION only affects the database it is run in, so a database that already existed may still not have hstore. In that case, add it to that database with the following command:

psql database -c 'create extension hstore;'

Here database is your database name. For details, you can refer to this article:

http://clarkdave.net/2012/09/postgresql-error-type-hstore-does-not-exist/
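As a quick check, the following can be run while connected to the target database. It is only a suggested verification step, not something from the original setup: it looks up hstore in the pg_extension system catalog, and the IF NOT EXISTS form makes the creation safe to repeat.

```sql
-- Create the extension only if this database does not have it yet.
CREATE EXTENSION IF NOT EXISTS hstore;

-- Returns one row if hstore is available in the current database.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'hstore';
```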

This way you can use hstore in your database. For the usage of hstore, please refer to the following documents:

https://www.postgresql.org/docs/9.0/static/hstore.html
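The handful of hstore operations this article relies on can be tried out directly. The statements below are a small illustration using the sample authors and titles as literal values.

```sql
-- hstore(key, value) builds a single-entry map.
SELECT hstore('lisi', 'c++') AS one_entry;

-- || merges two maps; for a duplicate key the right-hand value wins,
-- so this can be used both to add and to overwrite an entry.
SELECT hstore('lisi', 'php') || hstore('lisi', 'c++') AS overwritten;
SELECT hstore('lisi', 'c++') || hstore('zhangsan', 'c') AS merged;

-- ? tests whether a key exists; -> fetches the value for a key as text.
SELECT hstore('lisi', 'c++') ? 'lisi' AS has_key;
SELECT hstore('lisi', '2016-12-07') -> 'lisi' AS pubdate_text;

-- each() expands a map into one (key, value) row per entry.
SELECT key, value
FROM each(hstore('lisi', 'c++') || hstore('zhangsan', 'c'));
```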

The type and functions created are as follows.

First, a custom type is defined using hstore:

CREATE TYPE rectime_value AS (
  pubdate hstore, --publish time
  bookname hstore --book name
);

This custom type has two hstore fields, mapping author => year_publication and author => bookname. Entries with the same key (author) in the two fields together form one record of bookname, author and year_publication.

The outer aggregate function is defined as follows. It references the two functions shown further down, so they must be created first:

CREATE AGGREGATE latest_book(bookname VARCHAR, author VARCHAR, publication DATE)
(
  SFUNC = latest_book_sfunc,
  stype = rectime_value,
  FINALFUNC = latest_book_ffunc
);

The intermediate state variable stype is a custom rectime_value type.

sfunc is defined as follows:

CREATE OR REPLACE FUNCTION latest_book_sfunc(last rectime_value, nb VARCHAR, na VARCHAR, np DATE)
  RETURNS rectime_value
  LANGUAGE plpgsql
AS
  $function$
  DECLARE
    temp rectime_value;
    old_date DATE;
  BEGIN
    -- Debug output: the incoming state and the current row.
    RAISE INFO '--ssssssssss last = %, nb = %, na = %, np = %', last, nb, na, np;
    -- First row of the group: there is no state yet, so start both maps with this row.
    IF last IS NULL THEN
      -- hstore(key, value) handles quoting, unlike casting a 'key=>value' string.
      temp.bookname := hstore(na, nb);
      temp.pubdate := hstore(na, np::text);
      RAISE INFO '********temp = %********', temp;
      RETURN temp;
    END IF;
    IF last.pubdate ? na THEN
      -- Author already recorded: replace the entry only if this book is earlier.
      old_date := (last.pubdate -> na)::date;
      IF np < old_date THEN
        last.bookname := last.bookname || hstore(na, nb);
        last.pubdate := last.pubdate || hstore(na, np::text);
      END IF;
    ELSE
      -- New author: add an entry to both maps.
      last.bookname := last.bookname || hstore(na, nb);
      last.pubdate := last.pubdate || hstore(na, np::text);
    END IF;
    RETURN last;
  END
  $function$;

The syntax of function creation is not covered in detail here. The basic idea is as follows:

On the first call to sfunc, the intermediate variable last is empty, so the first IF branch runs: the values of the current record are assigned to the temp variable, and temp is returned. On the second call, last is the temp returned by the previous call, so it already holds a value and the second IF is evaluated. That IF checks whether last already contains the current author. If not, the current record is added to the intermediate variable. If so, the stored date is taken out and compared with the date of the current record; if the stored date is later, the stored entry is replaced with the current record. In this way, after every call to sfunc, last holds the earliest published book of each author seen so far.

finalfunc is defined as follows:

CREATE OR REPLACE FUNCTION latest_book_ffunc(last rectime_value)
  RETURNS VARCHAR
  LANGUAGE plpgsql
AS
  $function$
  DECLARE
    _Cursor refcursor;
    authors VARCHAR;
    bookname VARCHAR;
    result VARCHAR DEFAULT '';
  BEGIN
    RAISE INFO '--ffffffffff-- last = %', last;
    -- each() expands the hstore into (key, value) rows: (author, book title).
    OPEN _Cursor FOR SELECT * FROM each(last.bookname);
    FETCH NEXT FROM _Cursor INTO authors, bookname;
    WHILE FOUND LOOP
      -- Add the separator only between titles, so the result has no leading comma.
      IF result = '' THEN
        result := bookname;
      ELSE
        result := result || ',' || bookname;
      END IF;
      FETCH NEXT FROM _Cursor INTO authors, bookname;
    END LOOP;
    CLOSE _Cursor;
    RETURN result;
  END
  $function$;

finalfunc summarizes the data in the intermediate variable last, which is of the custom type whose fields are hstore values. The each function of hstore expands an hstore into a set of key/value rows, which can be opened as a cursor and processed in a loop. More operations supported by hstore can be found in the link above.

At this point the custom aggregate function is complete and can be called like any built-in aggregate.
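The aggregate can be invoked along the following lines for the two groupings from the requirements. These queries are a sketch based on the definitions above. The order of the titles within the joined string follows hstore's internal key order, so it should not be relied on.

```sql
-- Earliest book per author: one row per author.
SELECT authors,
       latest_book(bookname, authors, year_publication) AS bookname
FROM public.book
GROUP BY authors;

-- Earliest book of every author, joined per publisher: one row per publisher.
SELECT publisher,
       latest_book(bookname, authors, year_publication) AS bookname
FROM public.book
GROUP BY publisher;
```

Because sfunc and finalfunc contain RAISE INFO statements, psql also prints one INFO line per processed row and per group, which makes the call sequence visible.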

The above is my understanding of PostgreSQL custom aggregate functions. If anything is wrong, please do not hesitate to point it out!
