We will start with
this blog post by Matt Hudson
on the Crunchy Data blog. Matt gives a good introduction of how
IN
lists are different from
ANY
lists. Here’s the scenario: He has a query which does a look up on the
id
column and he's passing in six different values. He wants to use the SQL query in his application and he’s using a Java application with the Vert.x Postgres library.
He first tries to get this query into the application by using this
IN
list, as he had initially in the query, and by using a
$1
bind parameter. The goal here would be that you have one fixed query text and then as your user supplied data is changing, you're passing this as a bind parameter, avoiding any SQL injection problems.
This is generally a good idea.
In Matt’s scenario he tries this and this
fails
. He gets an exception about the wrong number of parameters, because he's passing in six values, but there's only a single bind parameter in this list.
Next, instead of sending in a list of values, he’s sending in an array. This
also fails
, but with a different error. In this case, Postgres is complaining that the
id
column is an
integer
, but the
$1
is an
array
- which is mismatching.
The
IN
list is essentially a
list of values
, which means that you could write something like
$1
,
$2
as separate bind parameters, and then it would actually work and it would match up with your bind parameters. But there's a better way.
PostgreSQL's ANY operator and bind parameters
Matt comes to the ANY operator, which looks very similar to what Matt had going on thus far. Again, you're passing in
$1
, but instead of
IN
we're saying
= ANY
. What that means is if any of the
$1
values in the
$1
array are matched, then this is true. Otherwise, it's not true. This works and the bind parameter can be used successfully.
The main argument that Matt makes is that if you want to pass parameters securely, then it's a good idea to use
= ANY
, because if you use
IN
, you would have to generate the query text and that could cause a
problem with potential SQL injections.
Additionally,
ANY
also has some other features. For example, you can do comparisons with
LIKE
or
ILIKE
, or
NOT
. So you don't just have
=
as an operator, you can be much more flexible.
There is a question though:
Should I always be using one or the other?
PostgreSQL performance: ANY operator vs. IN lists
Back in 2017, Lukas Eder, who maintains the JOOQ ORM, described how somebody reached out to him and asked why his ORM is using IN lists and not ANY. They essentially said "This is slower, you should be using the faster version".
Lukas
did a benchmark
and the comparison he did was essentially the same thing that we just looked at in Matt’s scenario - but rather looking at it from a performance perspective instead of from a security perspective.
Lukas has two statements, both pretty simple, and in one case he's using an
IN
list, in the other case he's using an
ANY
with an array. He runs it in a
PL/pgSQL function
, which is a great way to make sure that we're testing the performance on the server side, not any kind of client interactions.
The results showed that when he was using a very short IN list, that in his particular benchmark, it was actually
faster
to have an
IN
list, rather than an
ANY
with an array. But once you get to a lot of values, in his case 128, the array actually performs measurably better than the
IN
list.
Once you get to a lot of values, in his case 128, the array actually performs measurably better than the IN list.
Allow hash lookup for IN clauses with many constants
When I was looking at this, I remembered that there was actually
a change in Postgres 14
that improved this. In the release notes, if you read them very carefully, there's a change that says "allow hash look up for IN clauses with many constants". This is a patch by James Coleman that was
committed by David Rowley
. When you have a very long IN list it improves how efficient it is for Postgres to look up data in that
IN
list.
When James contributed this patch, two years ago, what he was showing was that when he's passing 1,000 random integers in an
IN
list, then, with this patch in place, it's about *
10x faster
because there was something really inefficient with how Postgres was looking things up.
I think an important conclusion here is:
if you're not yet on Postgres 14 and you're using
IN
lists, either out of choice or because your ORM generates them, then Postgres 14 will actually give you a nice performance boost for these long
IN
lists.
If you're not yet on Postgres 14 and you're using
IN
lists, either out of choice or because your ORM generates them, then Postgres 14 will actually give you a nice performance boost for these long
IN
lists.
Postgres index not used with = ANY() but is used with IN
There is one other case that I want to mention. Because I usually like to use
= ANY ARRAY
, and there's one hazard that you have to watch out for, which is if you're using
row type comparisons.
In
this example
, on the DBA stack exchange, somebody complained that their index was not being used when they were using
= ANY
, but it was being used with the
IN
operator. They were comparing
a
and
b
and were trying to match rows that had both of these equal to one of the reference values. In the case of
= ANY
, Postgres was doing a sequential scan. But in the case of
IN
, it was doing an index only scan!
Long story short, the reason that this happens is mostly for internal reasons, but you can
fix it by adding an explicit cast to the particular type in the index.