A common best practice for scalability is to avoid sequences and use a UUID for the primary key. However, there are good reasons to use SQL sequences: they are smaller, they guarantee uniqueness (a real mathematical guarantee, not just a high probability), and they cluster rows that are inserted together, which is good for B-Tree indexes and possibly for later queries. I explained in a previous post that sequences can be scalable thanks to caching. However, PostgreSQL caches sequence values per connection (unlike Oracle, which caches them in the instance's shared memory). Sequences can therefore become a bottleneck during a connection storm, when all connections start with a cold cache, and here is a workaround. Note that re-connecting in PostgreSQL is expensive for many other reasons, so it is better to fix that with a connection pool that is not too dynamic (if you need it to be dynamic, have a look at FlexiPool).
The PostgreSQL nextval() function takes a regclass as input. If a text literal is provided, it is cast to the regclass of the sequence with that name.
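For example, the following two calls are equivalent (assuming a sequence named seq0 exists):

```sql
select nextval('seq0');            -- the text literal is cast to regclass
select nextval('seq0'::regclass);  -- explicit cast to the sequence's regclass
```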
I'll extend this with a function that takes a name as text and looks for a specific suffix: % followed by a number. If there's no such suffix, the postgres function is called with name::regclass. If there is such a suffix, I replace the %[0-9]+ part with a random number modulo the value passed after the %.
For example, if I call it with seq%4, it will, at random, take the next value from seq0, seq1, seq2, or seq3. Of course, these sequences must be created beforehand.
where "seq%4" is not a sequence, it will read from any of the 4 sequences created, from seq0 to seq3.
As my goal is to distribute the reads across sequences when there's a connection storm (all connections having a cold cache), instead of a random number I use pg_backend_pid(), so that the same session always uses the same sequence but different sessions avoid the hotspot.
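In the sketch above, that means replacing the random pick with the backend PID (again an assumption about the exact code):

```sql
-- same dispatch, but stable per session: pg_backend_pid() % n instead of random()
return pg_catalog.nextval(
  (regexp_replace(seq_name, '%[0-9]+$', '') || (pg_backend_pid() % n))::regclass
);
```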
In this demo, I'll scale to 8 sequences: I create 8 sequences prefixed with seq and numbered from 0 to 7. That is what I need to call my function with select nextval('seq%8'), as it will replace the suffix with a value modulo 8.
Because the goal is to have unique numbers, each sequence must start from a different number, 0 to 7 when starting with no data, and will increment by 8. I use a cache of 100, so that the first read from seq0 will cache all multiples of 8 from 0 to 792, seq1 will have the numbers from 1 to 793, and so on.

```
yugabyte=# select format(
             'create sequence if not exists %I minvalue %s start with %s cache 100 increment by %s',
             'seq'||n, n, n, 1+max(n) over()
           ) as "CREATE SEQUENCE"
           from generate_series(0,7) n;

                                   CREATE SEQUENCE
--------------------------------------------------------------------------------------
 create sequence if not exists seq0 minvalue 0 start with 0 cache 100 increment by 8
 create sequence if not exists seq1 minvalue 1 start with 1 cache 100 increment by 8
 create sequence if not exists seq2 minvalue 2 start with 2 cache 100 increment by 8
 create sequence if not exists seq3 minvalue 3 start with 3 cache 100 increment by 8
 create sequence if not exists seq4 minvalue 4 start with 4 cache 100 increment by 8
 create sequence if not exists seq5 minvalue 5 start with 5 cache 100 increment by 8
 create sequence if not exists seq6 minvalue 6 start with 6 cache 100 increment by 8
 create sequence if not exists seq7 minvalue 7 start with 7 cache 100 increment by 8
(8 rows)

yugabyte=# \gexec
CREATE SEQUENCE
CREATE SEQUENCE
CREATE SEQUENCE
CREATE SEQUENCE
CREATE SEQUENCE
CREATE SEQUENCE
CREATE SEQUENCE
CREATE SEQUENCE
yugabyte=#
```
This is the normal behavior in PostgreSQL: the cache is per-session. If I reconnect, a new range of values is fetched to warm the cache. You can see that when querying the sequence itself: the first call to nextval() returned 1 (the start value) and updated the sequence to 793 (skipping cache * increment by values). The next calls within the same session get those 99 numbers without accessing the sequence. Another connection starts at the next value after 793 and updates the sequence to 1593.
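A quick way to observe this (a sketch; which sequence a given backend hits depends on its PID, so the exact numbers may differ):

```sql
select nextval('seq%8');      -- e.g. 1: the first value from seq1 warms the session cache
select last_value from seq1;  -- 793: the sequence itself was advanced by 99 * 8
select nextval('seq%8');      -- 9: served from the session cache, no sequence access
```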
Note that if I call the function with a regclass, it will directly call the postgres function. You can rely on that if you have a real sequence whose name actually ends with '%' followed by a number. Or you can change the marker character: I used '%' to look like a modulo.
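For example (assuming the wrapper is defined as nextval(text), as in the sketch above):

```sql
select nextval('seq%8');           -- text argument: resolved to the wrapper
select nextval('seq0'::regclass);  -- regclass argument: resolved to pg_catalog.nextval
```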
However, the whole point of the sequences created with increment by 8 is to call my function with seq%8.
A cached sequence is probably enough for scalability in a well-designed application (one that uses a connection pool). With a cache of 100, the sequence is read and updated only once every 100 calls. However, if your connection pool is too dynamic and starts hundreds of connections at the same time, they will all hit the sequence simultaneously.
Another reason to use this is when you don't use a cached sequence (defined with cache 1, which is the PostgreSQL default). But, again, this is probably not the best design. A cached sequence has another advantage: concurrent sessions generate IDs in different ranges, which may reduce hotspots on the B-Tree that supports the primary key.
In YugabyteDB, you don't have this hotspot problem, as you will probably use hash sharding on the ID. But reading from multiple sequences can be a nice addition to caching, especially in geo-distributed clusters where latency adds to the contention in case of connection storms.
Note that this is different from Oracle Database Scalable Sequences, which are more like partitioned sequences whose goal is to avoid B-Tree hotspots. With my function, the numbers are interleaved thanks to increment by, to keep the numbers low.