The sizes of indexes (and tables of cause) are influencing the look up speed directly (the smaller are the corresponding file system files, the faster one can scan it, not saying anything about the sizes, needed to be kept in memory caches)
So I did some simple experiment to demonstrate the effect of choosing different field types to be used as index fields to the size of the indexes and tables.
DROP DATABASE IF EXISTS eval; CREATE DATABASE eval; \c eval */ SET work_mem TO '128MB'; SET maintenance_work_mem TO '128MB'; DROP SCHEMA IF EXISTS eval_schema CASCADE; CREATE SCHEMA eval_schema; SET search_path to eval_schema; CREATE TABLE eval_config ( table_name text, id_field_type text, id_field_expression text, row_count integer DEFAULT 100000 ); INSERT INTO eval_config VALUES ( 'integer_short_id_table', 'integer', 's.i' ), ( 'integer_large_id_table', 'integer', 's.i * 123' ), ( 'bigint_short_id_table', 'bigint', 's.i' ), ( 'bigint_large_id_table', 'bigint', 's.i * 123' ), ( 'text_number_short_id_table', 'text', 's.i' ), ( 'text_number_large_id_table', 'text', 's.i * 123' ), ( 'numeric_short_id_table', 'numeric', 's.i' ), ( 'numeric_large_id_table', 'numeric', 's.i * 123' ), ( 'binary_md5_text_table', 'bytea', $$decode( md5( s.i::text || '-text-filler'), 'hex' ) $$ ), ( 'md5_text_table', 'text', $$md5( s.i::text || '-text-filler' )$$ ); CREATE VIEW eval_table_stats AS SELECT t.relname as "Table name", c.id_field_type as "Indexed field type", c.id_field_expression as "Expression", c.row_count as "Row count", s.stawidth as "Average id field width", pg_size_pretty(pg_table_size(t.oid)) as "Table size without index", pg_size_pretty(pg_indexes_size(t.oid)) as "Index size" /*, n_tup_ins, n_tup_upd, n_tup_del, n_tup_hot_upd, n_live_tup, n_dead_tup */ FROM eval_config as c JOIN pg_class as t ON c.table_name = t.relname AND t.relkind = 'r' JOIN pg_namespace as n ON relnamespace = n.oid AND n.nspname = 'eval_schema' LEFT JOIN pg_statistic as s ON s.starelid = t.oid AND s.staattnum = 1 LEFT JOIN pg_stat_user_tables as w ON w.relid = t.oid; DO $SQL$ DECLARE config record; BEGIN FOR config IN SELECT * FROM eval_config LOOP RAISE INFO 'Creating table %', quote_ident( config.table_name ); EXECUTE $$ CREATE TABLE $$ || quote_ident( config.table_name ) || $$ ( id $$ || config.id_field_type || $$, data text ); $$; RAISE INFO 'Filling table %', quote_ident( config.table_name ); EXECUTE $$ INSERT INTO $$ || quote_ident( config.table_name ) || $$ SELECT ( $$ || config.id_field_expression || $$ )::$$ || config.id_field_type || $$, 'some filling data' FROM generate_series(1, $$ || config.row_count || $$) as s(i); $$; RAISE INFO 'Building index on %', quote_ident( config.table_name ); EXECUTE $$ CREATE INDEX ON $$ || quote_ident( config.table_name ) || $$ ( id ); $$; RAISE INFO 'Analyzing table %', quote_ident( config.table_name ); EXECUTE $$ ANALYZE $$ || quote_ident( config.table_name ) || $$; $$; END LOOP; END; $SQL$; SELECT * FROM eval_table_stats; DO $SQL$ DECLARE config record; BEGIN FOR config IN SELECT * FROM eval_config LOOP RAISE INFO 'Bloating table % (phase 1)', quote_ident( config.table_name ); EXECUTE $$ UPDATE $$ || quote_ident( config.table_name ) || $$ SET data = data WHERE random() > 0.5; $$; RAISE INFO 'Bloating table % (phase 2)', quote_ident( config.table_name ); EXECUTE $$ UPDATE $$ || quote_ident( config.table_name ) || $$ SET data = data WHERE random() > 0.5; $$; RAISE INFO 'Analyzing table %', quote_ident( config.table_name ); EXECUTE $$ ANALYZE $$ || quote_ident( config.table_name ) || $$; $$; END LOOP; END; $SQL$; SELECT * FROM eval_table_stats;
As a result of execution of this script, we got several tables and some statistics on the table and index sizes.
I created the tables with
id
field with types: integer
, bigint
, text
and numeric
, additionally bytea
and text
for the md5 hash indexes. Actually the table size is including the filler text data, so it's size is not only the size of the fields being evaluated:Table sizes just after insertion of 100T rows
Table name | Indexed field type | Expression | Row count | Average id field width | Table size without index | Index size |
---|---|---|---|---|---|---|
integer_short_id_table | integer | s.i | 100000 | 4 | 5128 kB | 1768 kB |
integer_large_id_table | integer | s.i * 1234 | 100000 | 4 | 5128 kB | 1768 kB |
bigint_short_id_table | bigint | s.i | 100000 | 8 | 5552 kB | 2208 kB |
bigint_large_id_table | bigint | s.i * 1234 | 100000 | 8 | 5552 kB | 2208 kB |
text_number_short_id_table | text | s.i | 100000 | 5 | 5128 kB | 2200 kB |
text_number_large_id_table | text | s.i * 1234 | 100000 | 9 | 5552 kB | 2624 kB |
numeric_short_id_table | numeric | s.i | 100000 | 8 | 5552 kB | 2616 kB |
numeric_large_id_table | numeric | s.i * 1234 | 100000 | 9 | 5624 kB | 2656 kB |
binary_md5_text_table | bytea | decode( md5( s.i::text || '-text-filler'), 'hex' ) | 100000 | 17 | 6336 kB | 3552 kB |
md5_text_table | text | md5( s.i::text || '-text-filler' ) | 100000 | 33 | 7880 kB | 5328 kB |
Table sizes after random bloating (2 times random update of 50% of rows)
Table name | Indexed field type | Expression | Row count | Average id field width | Table size without index | Index size |
---|---|---|---|---|---|---|
integer_short_id_table | integer | s.i | 100000 | 4 | 10232 kB | 5288 kB |
integer_large_id_table | integer | s.i * 1234 | 100000 | 4 | 10232 kB | 5272 kB |
bigint_short_id_table | bigint | s.i | 100000 | 8 | 11 MB | 6592 kB |
bigint_large_id_table | bigint | s.i * 1234 | 100000 | 8 | 11 MB | 6560 kB |
text_number_short_id_table | text | s.i | 100000 | 5 | 10232 kB | 5896 kB |
text_number_large_id_table | text | s.i * 1234 | 100000 | 9 | 11 MB | 7096 kB |
numeric_short_id_table | numeric | s.i | 100000 | 8 | 11 MB | 7752 kB |
numeric_large_id_table | numeric | s.i * 1234 | 100000 | 9 | 11 MB | 7880 kB |
binary_md5_text_table | bytea | decode( md5( s.i::text || '-text-filler'), 'hex' ) | 100000 | 17 | 12 MB | 7336 kB |
md5_text_table | text | md5( s.i::text || '-text-filler' ) | 100000 | 33 | 15 MB | 11 MB |
Table sizes just after insertion of 5M rows
Table name | Indexed field type | Expression | Row count | Average id field width | Table size without index | Index size |
---|---|---|---|---|---|---|
integer_short_id_table | integer | s.i | 5000000 | 4 | 249 MB | 86 MB |
integer_large_id_table | integer | s.i * 123 | 5000000 | 4 | 249 MB | 86 MB |
bigint_short_id_table | bigint | s.i | 5000000 | 8 | 269 MB | 107 MB |
bigint_large_id_table | bigint | s.i * 123 | 5000000 | 8 | 269 MB | 107 MB |
text_number_short_id_table | text | s.i | 5000000 | 7 | 269 MB | 107 MB |
text_number_large_id_table | text | s.i * 123 | 5000000 | 9 | 269 MB | 128 MB |
numeric_short_id_table | numeric | s.i | 5000000 | 8 | 269 MB | 129 MB |
numeric_large_id_table | numeric | s.i * 123 | 5000000 | 10 | 284 MB | 129 MB |
binary_md5_text_table | bytea | decode( md5( s.i::text || '-text-filler'), 'hex' ) | 5000000 | 17 | 308 MB | 172 MB |
md5_text_table | text | md5( s.i::text || '-text-filler' ) | 5000000 | 33 | 383 MB | 259 MB |
Table sizes after random bloating (2 times random update of 50% of rows)
Table name | Indexed field type | Expression | Row count | Average id field width | Table size without index | Index size |
---|---|---|---|---|---|---|
integer_short_id_table | integer | s.i | 5000000 | 4 | 498 MB | 257 MB |
integer_large_id_table | integer | s.i * 123 | 5000000 | 4 | 498 MB | 257 MB |
bigint_short_id_table | bigint | s.i | 5000000 | 8 | 539 MB | 321 MB |
bigint_large_id_table | bigint | s.i * 123 | 5000000 | 8 | 539 MB | 321 MB |
text_number_short_id_table | text | s.i | 5000000 | 7 | 538 MB | 287 MB |
text_number_large_id_table | text | s.i * 123 | 5000000 | 9 | 539 MB | 340 MB |
numeric_short_id_table | numeric | s.i | 5000000 | 8 | 539 MB | 384 MB |
numeric_large_id_table | numeric | s.i * 123 | 5000000 | 10 | 569 MB | 384 MB |
binary_md5_text_table | bytea | decode( md5( s.i::text || '-text-filler'), 'hex' ) | 5000000 | 17 | 615 MB | 354 MB |
md5_text_table | text | md5( s.i::text || '-text-filler' ) | 5000000 | 33 | 766 MB | 545 MB |
Ok, we see here, that actually it is a field byte width, that is important and for small numbers, the difference is really not so significant, but still an index on
integer
(4 bytes wide) filed is little smaller then index on text representation of the same ids even for tables that keep 10000 records only (these results are not posted here).I specially was creating the tables in 2 variants,
id
as a result of 1..N
series, and multiplied by 1234 or 123 to make the numbers wider in text representation to see how the length of the text representation of the number is influencing the size of the index.Some more play with the
smallint
type, showed that it does save space in the table itself, but the index size is the same as for integer
. What was really surprising for me, that
numeric
type had bigger indexes then text indexes for the same numbers.And of cause again and again -- indexes on some long text fields can be built on the md5 hash of such a text and be much more performant, when you index
bytea
version of the md5 hash instead of text representation of the md5 hash itself. Though md5 hash collisions can theoretically happen, I have not experienced them yet myself in practice.