Though, strictly speaking, using arrays in relational databases is kind of not correct, still we use arrays, especially in stored procedures.
Actually PostgreSQL provides us with quite an number of nice array manipulation functions. That can be used at least starting from PostgreSQL 8.0. But what about set operations. How to sort an array in Postgres? How to merge arrays without repetitions? How to remove duplicate elements keeping the order of the appearance in the original array?
If you are working with the integer based arrays, there is quite an old and effective module intarray
that allows us to perform some basic set operations on the integer based arrays.
But what to do with not integer based arrays? Or if we want to do some relatively complicated operation on integer bases array? SQL is a set manipulation language, so that means that actually, we can use internal SQL mechanics to manipulate, merge and sort arrays. But for that, first we have to be able to convert an array into kind of a table, so that we can use usual SQL manipulation on it's elements.
generate_series()
, generate_subscripts()
and unnest()
are the most important functions in our case. Unfortunately generate_subscripts()
and unnest()
are only available starting from PostgreSQL 8.4, but one can easily create a small helper unnest()
function using generate_series()
, that will be not as efficient as the build in one, but still it is quite good trade off.
CREATE OR REPLACE FUNCTION unnest(a anyarray) RETURNS SETOF anyelement AS $BODY$ /** * Unnests given array into a table */ /* -- testing select unnest(ARRAY[1,2,3,4,5]) as i select * from unnest(ARRAY[1,2,3,4,5]) as u(i) */ select ($1)[s.i] from generate_series( array_lower($1, 1), array_upper($1, 1 ) ) as s(i); $BODY$ LANGUAGE 'sql' IMMUTABLE STRICT;
A good thing about
unnest()
written in SQL and not PL/pgSQL is that one can use such a set retuning function directly after SELECT
not necessarily pushing it in the list of used tables after FROM
clause. The reason was explained by Tom Lane some time back in one of the newsgroup threads and is related to the implementation drawbacks of the PL/pgSQL. Ok, so now I will simply give several examples of how to use these functions.
Merge 2 or more text arrays excluding duplicates: |
---|
select ARRAY( select unnest(ARRAY[ 'a', 'b', 'c' ]) union select unnest(ARRAY[ 'c', 'd', 'e' ]) ) |
UNION ALL
instead, we are doing just the same as array_cat()
, but slower :)Note, that
UNION
actually sorts the elements of the result set fist, to eliminate the duplicates, so the resulting array will be sorted usually. But one cannot relay on that behavior and if you really need a sorted array on the output, you have to use ORDER BY
explicitly.Merge 2 or more text arrays excluding duplicates and sorting elements: |
---|
select ARRAY( select unnest(ARRAY[ 'a', 'b', 'c' ]) as e union select unnest(ARRAY[ 'c', 'd', 'e' ]) as e order by e ) |
Trim or rewrite text array elements: |
---|
select ARRAY( select btrim(lower(regexp_replace(unnest(ARRAY['a a','B b','c C ']), $$\s+$$, ' ', 'g') ) ) as e ) |
By now, with this syntax we cannot actually filter some elements out of the original array.
Filter elements from an array depending on element properties: |
---|
select ARRAY( select a.e from unnest(ARRAY[0,4,2,5,4,1,6,9,8,7]) as a(e) where a.e % 2 = 0 ) |
unnest()
to FROM
list to be able to name the virtual table, that unnest()
creates as well as name the element column so that we can reference it in the where clause. Again there are not guaranties here that our array will be contain elements in the same order that it was in the original one. The problem is that all these queries do not know anything about the original position of the elements in the original array. To solve this problem we need element indexes to be visible inside the query. One can do it using a plain old
generate_series()
or not so old generate_subscripts()
(one actually should prefer the later one if you are on the PostgreSQL 8.4+) to unnest our original array. You cannot use the unnest()
directly as there is not way to determine the position or index other then using window functions like row_number()
, but this is much much slower on practice. Filtering elements from an array depending on element position (index): |
---|
select ARRAY( select (ARRAY[0,4,2,5,4,1,6,9,8,7])[s.i] from generate_series(array_lower(ARRAY[0,4,2,5,4,1,6,9,8,7], 1), array_upper(ARRAY[0,4,2,5,4,1,6,9,8,7], 1) ) as s(i) where s.i % 2 = 0 order by s.i ) |
But using
generate_series()
or generate_subscripts()
shows a small problem actually. We have to push the original array at least 2 times in the query. Inside a PL/pgSQL code the arrays are usually replaced with the variable names, and the queries do not look so bulky there, but there is a possibility to create an additional helper function that will unnest an array returning not only a element itself, but also it's index in the original array:CREATE OR REPLACE FUNCTION array_enumerate(IN a anyarray, OUT i integer, OUT e anyelement) RETURNS SETOF record AS $BODY$ /** * Unnests array into a table together with element indexes */ /* -- testing select * from array_enumerate(ARRAY['a', 'b', 'c']) as u */ select s.i, ($1)[s.i] from generate_series( array_lower($1, 1), array_upper($1, 1 ) ) as s(i) $BODY$ LANGUAGE 'sql' IMMUTABLE STRICT;on the PostgreSQL versions after 8.4 one can write it using
generate_subscripts()
like:CREATE OR REPLACE FUNCTION array_enumerate(IN a anyarray, OUT i integer, OUT e anyelement) RETURNS SETOF record AS $BODY$ /** * Unnests array into a table together with element indexes */ /* -- testing select * from array_enumerate(ARRAY['a', 'b', 'c']) as u */ select s.i, ($1)[s.i] from generate_subscripts( $1, 1 ) as s(i) $BODY$ LANGUAGE 'sql' IMMUTABLE STRICT;Using
generate_subscripts()
should be preferable as one can create some crazy array with elements having noncontinuous indexes actually. And it should be also a little bit faster, then generate_series()
.This actually gives us a possibility to rewrite previous query like this:
select ARRAY( select s.e from utils.array_enumerate(ARRAY[0,4,2,5,4,1,6,9,8,7]) as s(i,e) where s.i % 2 = 0 order by s.i ) |
Remove duplicate elements from array keeping the element appearance order: |
---|
select ARRAY( select s.e from utils.array_enumerate(ARRAY[0,4,2,5,4,1,6,9,8,7]) as s(i,e) group by s.e order by min(s.i) ) |
string_to_array()
and array_to_string()
methods, or on newer versions of PostgreSQL regexp_split_to_array()
to remove the repeating space or punctuation characters during splitting directly.Practically everywhere in the queries, one could use the
array_agg()
to generate the resulting arrays or, in the coming 9.0, a very efficient string_agg()
to generate strings directly, without creating preliminary arrays and then using array_to_string()
.On my tests comparing the performance of the
ARRAY(select)
and array_agg()
I could not find any significant performance difference.
1 comment:
Your last example solved my problem - sorting array elements before comparing two arrays. Thank you!
Post a Comment