2010-05-28

Merging and manipulating arrays in PostgreSQL

Though, strictly speaking, using arrays in relational databases is kind of not correct, still we use arrays, especially in stored procedures.

Actually PostgreSQL provides us with quite an number of nice array manipulation functions. That can be used at least starting from PostgreSQL 8.0. But what about set operations. How to sort an array in Postgres? How to merge arrays without repetitions? How to remove duplicate elements keeping the order of the appearance in the original array?

If you are working with the integer based arrays, there is quite an old and effective module intarray that allows us to perform some basic set operations on the integer based arrays.

But what to do with not integer based arrays? Or if we want to do some relatively complicated operation on integer bases array? SQL is a set manipulation language, so that means that actually, we can use internal SQL mechanics to manipulate, merge and sort arrays. But for that, first we have to be able to convert an array into kind of a table, so that we can use usual SQL manipulation on it's elements.

generate_series(), generate_subscripts() and unnest() are the most important functions in our case. Unfortunately generate_subscripts() and unnest() are only available starting from PostgreSQL 8.4, but one can easily create a small helper unnest() function using generate_series(), that will be not as efficient as the build in one, but still it is quite good trade off.

CREATE OR REPLACE FUNCTION unnest(a anyarray)
  RETURNS SETOF anyelement AS
$BODY$
/**
 *  Unnests given array into a table
 */
 /* -- testing
     select unnest(ARRAY[1,2,3,4,5]) as i
     select * from unnest(ARRAY[1,2,3,4,5]) as u(i)
  */
select ($1)[s.i] 
  from generate_series( array_lower($1, 1), array_upper($1, 1 ) ) as s(i);
$BODY$
  LANGUAGE 'sql' IMMUTABLE STRICT;

A good thing about unnest() written in SQL and not PL/pgSQL is that one can use such a set retuning function directly after SELECT not necessarily pushing it in the list of used tables after FROM clause. The reason was explained by Tom Lane some time back in one of the newsgroup threads and is related to the implementation drawbacks of the PL/pgSQL.

Ok, so now I will simply give several examples of how to use these functions.

Merge 2 or more text arrays excluding duplicates:
select ARRAY(
  select unnest(ARRAY[ 'a', 'b', 'c' ])
   union
  select unnest(ARRAY[ 'c', 'd', 'e' ])
)
If we use UNION ALL instead, we are doing just the same as array_cat(), but slower :)

Note, that UNION actually sorts the elements of the result set fist, to eliminate the duplicates, so the resulting array will be sorted usually. But one cannot relay on that behavior and if you really need a sorted array on the output, you have to use ORDER BY explicitly.

Merge 2 or more text arrays excluding duplicates and sorting elements:
select ARRAY(
  select unnest(ARRAY[ 'a', 'b', 'c' ]) as e
   union
  select unnest(ARRAY[ 'c', 'd', 'e' ]) as e
   order by e
)
Here we are forcing PostgreSQL to sort the elements alphabetically. If you check the execution plan for that query, you will see that it is exactly the same as for the previous query. But here we can actually reverse the order or sort by element length or first by element length and then alphabetically and so on. This makes this approach quite flexible not still the whole construct is very readable and understandable. But what if we want to trim text of each of the elements, lower the case or even rewrite each element using regular expression replace?

Trim or rewrite text array elements:
select ARRAY(
  select btrim(lower(regexp_replace(unnest(ARRAY['a a','B b','c  C  ']), 
                                    $$\s+$$, 
                                    ' ', 
                                    'g')
                     )
               ) as e
)
We can use this approach in all our previous queries as well, so we actually can merge several arrays reducing all repeated space characters into one, trimming and lowering each element before, removing duplicates from the resulting array and all that in one simple and readable SQL statement.

By now, with this syntax we cannot actually filter some elements out of the original array.

Filter elements from an array depending on element properties:
select ARRAY(
  select a.e 
    from unnest(ARRAY[0,4,2,5,4,1,6,9,8,7]) as a(e)
   where a.e % 2 = 0
)
This query actually rewrites an integer array so, that it drops all the add elements from it. We have to push unnest() to FROM list to be able to name the virtual table, that unnest() creates as well as name the element column so that we can reference it in the where clause. Again there are not guaranties here that our array will be contain elements in the same order that it was in the original one.

The problem is that all these queries do not know anything about the original position of the elements in the original array. To solve this problem we need element indexes to be visible inside the query. One can do it using a plain old generate_series() or not so old generate_subscripts() (one actually should prefer the later one if you are on the PostgreSQL 8.4+) to unnest our original array. You cannot use the unnest() directly as there is not way to determine the position or index other then using window functions like row_number(), but this is much much slower on practice.

Filtering elements from an array depending on element position (index):
select ARRAY(
  select (ARRAY[0,4,2,5,4,1,6,9,8,7])[s.i]
    from generate_series(array_lower(ARRAY[0,4,2,5,4,1,6,9,8,7], 1),
                         array_upper(ARRAY[0,4,2,5,4,1,6,9,8,7], 1) ) as s(i)
   where s.i % 2 = 0
   order by s.i
)
This query is dropping all the elements of the original array that are located on the odd positions and ensures that the elements are ordered in the resulting array the same way as they were in the original one. forcing ordering the elements here is probably not the best thing to do for performance reasons, but I want to demonstrate that it is possible and, strictly speaking, even a proper way to do.

But using generate_series() or generate_subscripts() shows a small problem actually. We have to push the original array at least 2 times in the query. Inside a PL/pgSQL code the arrays are usually replaced with the variable names, and the queries do not look so bulky there, but there is a possibility to create an additional helper function that will unnest an array returning not only a element itself, but also it's index in the original array:

CREATE OR REPLACE FUNCTION array_enumerate(IN a anyarray, 
  OUT i integer, OUT e anyelement)
  RETURNS SETOF record AS
$BODY$
/** 
 *  Unnests array into a table together with element indexes
 */
 /* -- testing
     select * from array_enumerate(ARRAY['a', 'b', 'c']) as u
  */
select s.i, ($1)[s.i] from generate_series( array_lower($1, 1), array_upper($1, 1 ) ) as s(i)
$BODY$
  LANGUAGE 'sql' IMMUTABLE STRICT;
on the PostgreSQL versions after 8.4 one can write it using generate_subscripts() like:
CREATE OR REPLACE FUNCTION array_enumerate(IN a anyarray, 
  OUT i integer, OUT e anyelement)
  RETURNS SETOF record AS
$BODY$
/** 
 *  Unnests array into a table together with element indexes
 */
 /* -- testing
     select * from array_enumerate(ARRAY['a', 'b', 'c']) as u
  */
select s.i, ($1)[s.i] from generate_subscripts( $1, 1 ) as s(i)
$BODY$
  LANGUAGE 'sql' IMMUTABLE STRICT;
Using generate_subscripts() should be preferable as one can create some crazy array with elements having noncontinuous indexes actually. And it should be also a little bit faster, then generate_series().

This actually gives us a possibility to rewrite previous query like this:

select ARRAY(
  select s.e
    from utils.array_enumerate(ARRAY[0,4,2,5,4,1,6,9,8,7]) as s(i,e)
   where s.i % 2 = 0
   order by s.i
)
Now we can even do something practical using this approach. for example

Remove duplicate elements from array keeping the element appearance order:
select ARRAY(
  select s.e
    from utils.array_enumerate(ARRAY[0,4,2,5,4,1,6,9,8,7]) as s(i,e)
   group by s.e
   order by min(s.i)
)
Actually this gives a possibility to process lists of words using string_to_array() and array_to_string() methods, or on newer versions of PostgreSQL regexp_split_to_array() to remove the repeating space or punctuation characters during splitting directly.

Practically everywhere in the queries, one could use the array_agg() to generate the resulting arrays or, in the coming 9.0, a very efficient string_agg() to generate strings directly, without creating preliminary arrays and then using array_to_string().

On my tests comparing the performance of the ARRAY(select) and array_agg() I could not find any significant performance difference.