2012-01-22

Schema based versioning and deployment for PostgreSQL

I am one of the supporters of keeping as much business logic in the database itself. This reduces the access layer of the application to mostly dumb transport and data transformation logic that can be implemented using different technologies and frameworks, without the need to re-implement critical data consistency and data distribution logic in several places, and gives an easy possibility to control what you are doing with your data and how your applications are allowed to access this data or even change exchange the underlying data structures transparently from the upper level of the code and without the need of a downtime. It also gives a possibility to add an additional layer of security allowing access to the data only through stored procedures, that can change their security execution context as needed (SECURITY DEFINER feature of PostgreSQL).

This approach has some disadvantages of course. One of the biggest technical problems, that is very easily becoming an organizational problem if you have a relatively big teem of developers, a problem of how to rapidly rollout new features without touching old functioning stored procedures, so that old versions of your upper level applications can still access the previous versions of stored procedures, and newly rolled out nodes with new software stack on them, access new stored procedures doing something more, or less, or returning some other data sets compared to their previous versions. And of course hundreds of stored procedures that are there to access and manipulate data are enough to make any attempt to keep all new versions of them backwards compatible, a nightmare.

Classical way to do this, would be to keep all the changes backwards compatible and if it is not possible, then create a new version of a stored procedure with some version suffix like _v2, mark the previous version as deprecated and after all your software stack is rolled out to use that new function, just drop the previous version.  But if you are rolling out new version of the whole stack once of twice a week, the control of what is used and that is not becomes quite a challenge... and discipline of all the developers should be really good as well. Stored procedures are not the only objects, that are changing together with them.  The return or input types can change as well. Changing of a return type, that is used by more then 2 stored procedures in a backwards compatible fashion is a pure horror if you want to do it without creating a new version of such a type and new versions of all the stored procedures, that use it. Dependency control becomes another problem.

My solution to that problem was to introduce a schema based versioning of PostgreSQL stored procedures. It uses an idea of PostgreSQL schema and search_path for a session.

So all the stored procedures, that are exposed to the client software stack, are grouped in one API schema that contains only stored procedures and types needed by them.

Schema name contains a version in it, like proj_api_vX_Y_Z, where X_Y_Z is a version, that a software stack is targeted to. Software stack does SET search_path to proj_api_vX_Y_Z, public; immedeately after it gets a connection from the pool and all calls to the stored procedures are done without explicitly specifying a schema name for that API stored procedure and PostgreSQL finds the needed stored procedure from the specified schema.

So when a branch is stable and branch version is fixed, it is used as a property that will be used when setting the default search_path for the software, that is being deployed for that branch. For example in Java using BoneCP JDBC Pool, setting an initSQL property of all the pools used to access proj database.

We are storing the sources of all the stored procedures (and other database objects) in a special database directory structure that is checked in into a usual SCM system. All the files sorted in corresponding folders and are prefixed with a 2 digit numeric prefix to ensure the order of sorting (good old BASIC times :) ). Like
50_proj_api
00_create_schema.sql
20_types
20_simple_object_input_type.sql
30_stored_procedures
20_get_object.sql
20_set_object.sql

Here 00_create_schema.sql file is containing CREATE SCHEMA proj_api; statement, statements to set default security options for newly created stored procedures and a SET search_path TO proj_api, public; statement, that ensures, that all the objects, that are coming after that file are injected into the correct API schema. An example of 00_create_schema.sql file can look like:
RESET role;

CREATE SCHEMA proj_api AUTHORIZATION proj_api_owner;

ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA proj_api REVOKE EXECUTE ON FUNCTIONS FROM public;

GRANT USAGE ON SCHEMA proj_api TO proj_api_usage;

ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA proj_api GRANT EXECUTE ON FUNCTIONS TO proj_api_executor;
ALTER DEFAULT PRIVILEGES FOR ROLE proj_api_owner IN SCHEMA proj_api GRANT EXECUTE ON FUNCTIONS TO proj_api_executor;
ALTER DEFAULT PRIVILEGES IN SCHEMA proj_api GRANT EXECUTE ON FUNCTIONS TO proj_api_executor;

SET search_path to proj_api, public;

DO $SQL$
BEGIN
  IF CURRENT_DATABASE() ~ '^(prod|staging|integration)_proj_db$' THEN
    -- change default search_path for production, staging and integration databases
    EXECUTE 'ALTER DATABASE ' || CURRENT_DATABASE() || ' SET search_path to proj_api, public;';
  END IF;
END
$SQL$;

SET role TO proj_api_owner;

This kind of layout gives a possibility to bootstrap API schema objects into a needed database easily and that is very important, to keep track of all the database logic changes in SCM system that lets you review and compare the changes between releases.

Bootstrapping into a development database can be done by a very easy script like:
(
echo 'DROP SCHEMA proj_api CASCADE;'
find 50_proj -type f -name '*.sql' \
  | sort \
  | xargs cat \
) | psql dev_proj_db -1 -f -
In case of development database, we are actually bootstrapping all the objects including tables into a freshly prepared database instance, so that integration tests can run and modify data as they want.

Injecting into a production or staging database can be automated and implemented with different kind of additional checks, but at the end it is something like:
(
cat 50_proj/00_create_schema.sql | sed s/proj_api/proj_api_vX_Y_Z/g 
find 50_proj -type f -name '*.sql' ! -name '00_create_schema.sql' \
  | sort \
  | xargs cat \
) | psql prod_proj_db -1 -f - 
So after that, we have a fresh copy of the whole shiny API schema with all the dependencies rolled out to the production database. And this schema objects are only accessed by the software, that is supposed to do so, that is tested to run with this very combination and this versions of the stored procedures and depended types. And if we see any problems with the rollout, we can just rollback the software stack so it can still access our old stored procedures, located in a schema with previous version of out API.

This method does not solve the problem of versioning of tables in our data schema (we would keep all the tables, related objects and low level transformation stored procedures in proj_data schema) but for that, there is a very simple, but very nice, solution, http://www.depesz.com/index.php/2010/08/22/versioning/ suggested and implemented by Depesz. Of cause, changes in table structure should be still kept backwards compatible and nicely written database diff rollout and rollback files should be written for every such change.

I am not going into details about how to prepare Springs configuration of the JDBC pools for the java clients or how to configure the bootstrapping for integration testing in your Maven project configuration as this information will not add any real value to this blog post that became much longer then I expected from the beginning.

NOTE: Because of a bug in PostgreSQL JDBC driver the types that are used as input parameters for stored procedures cannot be located in different schemas (TYPE OIDs are being searched only by name only, without consideration of a schema and search_path). Patching of the driver is very easy and we did so, in my company to be able to use the schema based versioning in our Java projects. I reported the bug twice already (http://archives.postgresql.org/pgsql-jdbc/2011-03/msg00007.php, http://archives.postgresql.org/pgsql-jdbc/2011-12/msg00083.php), but unfortunately no response from anybody. Probably have to submit a patch myself sometime.