Although I know how this sounds (it reminds me of when I thought I found a bug in perl in 2000), I think I found a bug in Django… but before I go through the effort of filing a bug report, I thought I would ask here to see if perhaps this is not a bug…
I have a rather complex database and an advanced search interface that creates queries through numerous many-to-many tables and it’s been working great for over a year now.
I recently added a feature to count distinct related table record occurrences in the joined results but I ran into a bug of my own in my code where I wasn’t counting those M:M related table records completely. When I applied a fix to supply missing fields to
.distinct()
, I couldn’t get the test to execute without hitting an InvalidColumnReference error. And though I’m supplying the same expanded dict to
.order_by()
that I am to
.distinct()
, the error claims that
SELECT DISTINCT ON expressions must match initial ORDER BY expressions
…
When I print the SQL, the place where it notes a difference has a weird
T8
reference where the model name should be, e.g.
..., "DataRepo_peakgroup"."id" ASC, T8."name" ASC, "DataRepo_compoundsynonym"."name" ASC, ...
. Note the
**T8**."name"
.
The corresponding
DISTINCT ON
has “DataRepo_compoundsynonym” instead of
T8
.
Here’s the full SQL created:
SELECT DISTINCT ON ("DataRepo_peakgroup"."name", "DataRepo_peakgroup"."id", "DataRepo_compoundsynonym"."compound_id", "DataRepo_compoundsynonym"."name", "DataRepo_compoundsynonym"."name", "DataRepo_compound"."name", "DataRepo_peakgroup_compounds"."compound_id", "DataRepo_study"."name", "DataRepo_animal_studies"."study_id") "DataRepo_peakgroup"."id", "DataRepo_peakgroup"."name", "DataRepo_peakgroup"."formula", "DataRepo_peakgroup"."msrun_id", "DataRepo_peakgroup"."peak_group_set_id" FROM "DataRepo_peakgroup" INNER JOIN "DataRepo_msrun" ON ("DataRepo_peakgroup"."msrun_id" = "DataRepo_msrun"."id") INNER JOIN "DataRepo_sample" ON ("DataRepo_msrun"."sample_id" = "DataRepo_sample"."id") INNER JOIN "DataRepo_tissue" ON ("DataRepo_sample"."tissue_id" = "DataRepo_tissue"."id") LEFT OUTER JOIN "DataRepo_peakgroup_compounds" ON ("DataRepo_peakgroup"."id" = "DataRepo_peakgroup_compounds"."peakgroup_id") LEFT OUTER JOIN "DataRepo_compound" ON ("DataRepo_peakgroup_compounds"."compound_id" = "DataRepo_compound"."id") LEFT OUTER JOIN "DataRepo_compoundsynonym" ON ("DataRepo_compound"."id" = "DataRepo_compoundsynonym"."compound_id") LEFT OUTER JOIN "DataRepo_compound" T8 ON ("DataRepo_compoundsynonym"."compound_id" = T8."id") INNER JOIN "DataRepo_animal" ON ("DataRepo_sample"."animal_id" = "DataRepo_animal"."id") LEFT OUTER JOIN "DataRepo_animal_studies" ON ("DataRepo_animal"."id" = "DataRepo_animal_studies"."animal_id") LEFT OUTER JOIN "DataRepo_study" ON ("DataRepo_animal_studies"."study_id" = "DataRepo_study"."id") WHERE UPPER("DataRepo_tissue"."name"::text) = UPPER(Brain) ORDER BY "DataRepo_peakgroup"."name" ASC, "DataRepo_peakgroup"."id" ASC, T8."name" ASC, "DataRepo_compoundsynonym"."name" ASC, "DataRepo_compound"."name" ASC, "DataRepo_peakgroup_compounds"."compound_id" ASC, "DataRepo_study"."name" ASC, "DataRepo_animal_studies"."study_id" ASC
And here’s the code that generates it:
q_exp = Q(msrun__sample__tissue__name__iexact="brain")
fmt_distinct_fields = ['name', 'pk', 'compounds__synonyms__compound', 'compounds__synonyms__name', 'compounds__synonyms__pk', 'compounds__name', 'compounds__pk', 'msrun__sample__animal__studies__name', 'msrun__sample__animal__studies__pk']
all_fields = ['name', 'pk', 'compounds__synonyms__compound', 'compounds__synonyms__name', 'compounds__synonyms__pk', 'compounds__name', 'compounds__pk', 'msrun__sample__animal__studies__name', 'msrun__sample__animal__studies__pk', 'msrun__sample__animal__name', 'peak_data__labeled_element', 'compounds__name', 'msrun__sample__name', 'msrun__sample__tissue__name', 'msrun__sample__animal__tracer_compound__name', 'msrun__sample__animal__studies__name', 'msrun__sample__animal__feeding_status', 'msrun__sample__animal__tracer_infusion_rate', 'msrun__sample__animal__tracer_infusion_concentration']
PeakGroup.objects.filter(q_exp).order_by(*fmt_distinct_fields).distinct(*fmt_distinct_fields).values_list(*all_fields)
The above code works using smaller examples. Previously, I was only adding things to fmt_distinct_fields
in specific instances based on the display of the joined table in our view, but I realized this was leading to inaccurate counts in the displayed stats, so I basically decided I needed to include all M:M table fields in that list. And it’s when I include them all that I hit the error.
Toy Confirmed. In fact, I don’t need the .values_list()
at the end to make it happen…
In [1]: from DataRepo.models import PeakGroup
In [2]: from django.db.models import Q
In [3]: q_exp = Q(msrun__sample__tissue__name__iexact="brain")
In [4]: fmt_distinct_fields = ['name', 'pk', 'compounds__synonyms__compound', 'compounds__synonyms__name', 'compounds__synonyms__pk', 'compounds__name', 'compounds__pk', 'msrun__sample__animal__studies__name', 'msrun__sample__animal__studies__pk']
In [5]: all_fields = ['name', 'pk', 'compounds__synonyms__compound', 'compounds__synonyms__name', 'compounds__synonyms__pk', 'compounds__name', 'compounds__pk', 'msrun__sample__animal__studies__name', 'msrun__sample__animal__studies__pk', 'msrun__sample__animal__name', 'peak_data__labeled_element', 'compounds__na
...: me', 'msrun__sample__name', 'msrun__sample__tissue__name', 'msrun__sample__animal__tracer_compound__name', 'msrun__sample__animal__studies__name', 'msrun__sample__animal__feeding_status', 'msrun__sample__animal__tracer_infusion_rate', 'msrun__sample__animal__tracer_infusion_concentration']
In [6]: PeakGroup.objects.filter(q_exp).order_by(*fmt_distinct_fields).distinct(*fmt_distinct_fields)
Out[6]: ---------------------------------------------------------------------------
InvalidColumnReference Traceback (most recent call last)
~/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py in _execute(self, sql, params, *ignored_wrapper_args)
83 else:
---> 84 return self.cursor.execute(sql, params)
InvalidColumnReference: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: ..."."id", "DataRepo_compoundsynonym"."compound_id", "DataRepo_...
The above exception was the direct cause of the following exception:
ProgrammingError Traceback (most recent call last)
~/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/IPython/core/formatters.py in __call__(self, obj)
700 type_pprinters=self.type_printers,
701 deferred_pprinters=self.deferred_printers)
--> 702 printer.pretty(obj)
703 printer.flush()
704 return stream.getvalue()
~/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/IPython/lib/pretty.py in pretty(self, obj)
392 if cls is not object \
393 and callable(cls.__dict__.get('__repr__')):
--> 394 return _repr_pprint(obj, self, cycle)
396 return _default_pprint(obj, self, cycle)
~/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
698 """A pprint that just redirects to the normal repr function."""
699 # Find newlines and replace them with p.break_()
--> 700 output = repr(obj)
701 lines = output.splitlines()
702 with p.group():
~/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/query.py in __repr__(self)
255 def __repr__(self):
--> 256 data = list(self[:REPR_OUTPUT_SIZE + 1])
257 if len(data) > REPR_OUTPUT_SIZE:
258 data[-1] = "...(remaining elements truncated)..."
~/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/query.py in __len__(self)
261 def __len__(self):
--> 262 self._fetch_all()
263 return len(self._result_cache)
~/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/query.py in _fetch_all(self)
1322 def _fetch_all(self):
1323 if self._result_cache is None:
-> 1324 self._result_cache = list(self._iterable_class(self))
1325 if self._prefetch_related_lookups and not self._prefetch_done:
1326 self._prefetch_related_objects()
~/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/query.py in __iter__(self)
49 # Execute the query. This will also fill compiler.select, klass_info,
50 # and annotations.
---> 51 results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
52 select, klass_info, annotation_col_map = (compiler.select, compiler.klass_info,
53 compiler.annotation_col_map)
~/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py in execute_sql(self, result_type, chunked_fetch, chunk_size)
1173 cursor = self.connection.cursor()
1174 try:
-> 1175 cursor.execute(sql, params)
1176 except Exception:
1177 # Might fail for server-side cursors (e.g. connection closed)
~/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py in execute(self, sql, params)
96 def execute(self, sql, params=None):
97 with self.debug_sql(sql, params, use_last_executed_query=True):
---> 98 return super().execute(sql, params)
100 def executemany(self, sql, param_list):
~/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py in execute(self, sql, params)
65 def execute(self, sql, params=None):
---> 66 return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
68 def executemany(self, sql, param_list):
~/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py in _execute_with_wrappers(self, sql, params, many, executor)
73 for wrapper in reversed(self.db.execute_wrappers):
74 executor = functools.partial(wrapper, executor)
---> 75 return executor(sql, params, many, context)
77 def _execute(self, sql, params, *ignored_wrapper_args):
~/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py in _execute(self, sql, params, *ignored_wrapper_args)
82 return self.cursor.execute(sql)
83 else:
---> 84 return self.cursor.execute(sql, params)
86 def _executemany(self, sql, param_list, *ignored_wrapper_args):
~/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/utils.py in __exit__(self, exc_type, exc_value, traceback)
88 if dj_exc_type not in (DataError, IntegrityError):
89 self.wrapper.errors_occurred = True
---> 90 raise dj_exc_value.with_traceback(traceback) from exc_value
92 def __call__(self, func):
~/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py in _execute(self, sql, params, *ignored_wrapper_args)
82 return self.cursor.execute(sql)
83 else:
---> 84 return self.cursor.execute(sql, params)
86 def _executemany(self, sql, param_list, *ignored_wrapper_args):
ProgrammingError: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: ..."."id", "DataRepo_compoundsynonym"."compound_id", "DataRepo_...
I’d be interested in examining this further.
Do you have “toy models” that can be used with this query to demonstrate the issue?
What database engine are you using?
What version of Django are you using?
I do not have a set of toy models to play with and it’s a private github repo (for work).
We’re on Django 3.2 using Postgres.
I was going to post an update yesterday because I’ve learned a bit more and was able to implement a work-around. The problem still exists, but our current query needs and model structure is such that I could code around it.
My hunch is that it has to do with composing the SQL when the relationship “path” goes though multiple “many-related” models. I was able to remove an unnecessary related path from the joined query:
PeakGroup->Compound->CompoundSynonym
I removed CompoundSynonym fields from the query (because I didn’t really need them) and once I did that, the mismatch alias creation in the distinct clause versus the order by clause went away.
PeakGroup:Compound is a many-to-many relationship
Compound:CompoundSynonym is a 1-to-many relationship
One other “many-related” relationship is involved in the query:
Animal:Study (M:M)
Perhaps one complicating factor that could be related is that there are 2 tables that link to Compound in the query. Animal links to Compound. This is a “tracer compound” and is a M:1 relationship. The PeakGroup to Compound link is a “measured compound” (the M:M relationship). My work-around query still includes both links to compound.
I also implemented a catch for the issue, incase future codebase edits reintroduce the problem:
except ProgrammingError as pe:
if len(
all_distinct_fields
) > 1 and "SELECT DISTINCT ON expressions must match initial ORDER BY expressions" in str(
raise UnsupportedDistinctCombo(
all_distinct_fields, self.modeldata[fmt].__class__.__name__
else:
raise pe
class UnsupportedDistinctCombo(Exception):
def __init__(self, fields, fmt_cls_name):
message = (
f"Unsupported combination of distinct fields: [{', '.join(fields)}]. This probably arose from "
"FormatGroup.getQueryStats(), which compiles a list of distinct fields in many-to-many related tables "
f"using the fields found in the stats defined here: [{fmt_cls_name}.stats] and rows that are split in the "
f"views defined here: [{fmt_cls_name}.model_instances[*][manytomany][split_rows]]. To fix this error, "
f"you either need to remove fields from the [{fmt_cls_name}.stats], or make split_rows false that are "
"problematic. Look for tables in the stats that include multiple many-related tables in its field path "
"and remove thos field entries. The problem stems from a likely distinct SQL generation bug in Django "
"when dealing with transiting multiple M:M or 1:M tables int he same path."
super().__init__(message)
self.fields = fields
So there should be enough here to make a toy model set, but it would be trial and error to try and reproduce the issue.
I was just trying to strip down our code to create a set of toy models and I ran across something that is perhaps significant…
compounds__synonyms__name
and compounds__synonyms__pk
are the same field:
class CompoundSynonym(Model):
name = CharField(
primary_key=True,
max_length=256,
compound = ForeignKey(
Compound,
related_name="synonyms"
class Meta:
verbose_name = "synonym"
verbose_name_plural = "synonyms"
ordering = ["compound", "name"]
Could that have confused the aliasing?
I was just looking in my code that gathers the distinct fields to include in .distinct()
and .order_by()
and I noted this comment:
# Don't assume the ordering fields are populated/unique, so include the primary key. Duplicate fields
# should be OK (though I haven't tested it).
I initially add the order-by fields and the line below that comment tacks on the primary key field (just in case).
I will try checking first if the order-by field is the primary key and skip it if so. I’ll then disable my work-around and see if that more precisely identifies the problem or rules it out.
OK. I managed to reproduce the issue with a set of toy models:
toymodels.py:
from django.db.models import Model, CharField, AutoField, ForeignKey, ManyToManyField, CASCADE
class TestPeak(Model):
id = AutoField(primary_key=True)
name = CharField(max_length=10)
compounds = ManyToManyField(
to="TestCompound",
related_name="testpeaks",
class Meta:
verbose_name = "testpeak"
verbose_name_plural = "testpeaks"
ordering = ["name"]
class TestCompound(Model):
id = AutoField(primary_key=True)
name = CharField(max_length=10)
class Meta:
verbose_name = "testcompound"
verbose_name_plural = "testcompounds"
ordering = ["name"]
class TestSynonym(Model):
name = CharField(max_length=10, primary_key=True)
compound = ForeignKey(
TestCompound, related_name="testsynonyms", on_delete=CASCADE
class Meta:
verbose_name = "testsynonym"
verbose_name_plural = "testsynonyms"
ordering = ["compound", "name"]
test_bug.py:
from DataRepo.tests.tracebase_test_case import TracebaseTestCase
from DataRepo.models.toymodels import TestPeak, TestCompound, TestSynonym
from django.db.models import Q
class DjangoSQLBug(TracebaseTestCase):
maxDiff = None
@classmethod
def setUpTestData(cls):
TestCompound.objects.create(name="testcpd")
cpd = TestCompound.objects.get(id__exact=1)
TestSynonym.objects.create(name="testsyn",compound=cpd)
TestPeak.objects.create(name="testpk")
pk = TestPeak.objects.get(id__exact=1)
pk.compounds.add(cpd)
def test_mm_om_query(self):
q_exp = Q(name__iexact="testpk")
distinct_fields = ['name', 'pk', 'compounds__testsynonyms__compound', 'compounds__testsynonyms__name', 'compounds__testsynonyms__pk', 'compounds__name', 'compounds__pk']
qs = TestPeak.objects.filter(q_exp).order_by(*distinct_fields).distinct(*distinct_fields)
self.assertEqual(qs.count(), 1)
python manage.py test
output:
Creating test database for alias 'default'...
Creating test database for alias 'validation'...
System check identified no issues (0 silenced).
======================================================================
ERROR: test_mm_om_query (DataRepo.tests.sqlbugtest.test_bug.DjangoSQLBug)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
psycopg2.errors.InvalidColumnReference: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: ...peak"."id", "DataRepo_testsynonym"."compound_id", "DataRepo_...
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/tests/sqlbugtest/test_bug.py", line 21, in test_mm_om_query
self.assertEqual(qs.count(), 1)
File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/query.py", line 412, in count
return self.query.get_count(using=self.db)
File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/sql/query.py", line 519, in get_count
number = obj.get_aggregation(using, ['__count'])['__count']
File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/sql/query.py", line 504, in get_aggregation
result = compiler.execute_sql(SINGLE)
File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1175, in execute_sql
cursor.execute(sql, params)
File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
return executor(sql, params, many, context)
File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
django.db.utils.ProgrammingError: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: ...peak"."id", "DataRepo_testsynonym"."compound_id", "DataRepo_...
----------------------------------------------------------------------
Ran 1 test in 0.018s
FAILED (errors=1)
Destroying test database for alias 'default'...
Destroying test database for alias 'validation'...
gen-rl-macbookair[2022-05-05 12:34:44]:~/PROJECT-local/TRACEBASE/tracebase$
You certainly could, it wouldn’t be wrong to do so. (You’d want to include the complete testing code to allow anyone to look at / evaluate this.)
Now that you’ve provided a reproduceable test case, I am curious enough to want to look more closely at this myself - but I’m definitely not an ORM expert.
Can please share the definition of TracebaseTestCase as in
from DataRepo.tests.tracebase_test_case import TracebaseTestCase
I am not able figure this out.
Absolutely, but I don’t think it is going to be helpful. It’s very simple:
from django.test import TestCase
class TracebaseTestCase(TestCase):
This wrapper of TestCase makes the necessary/desirable settings for all test classes.
maxDiff = None
databases = "__all__"
class Meta:
abstract = True
Did you look at the bug ticket I filed (commented above)? They seem to have figured it out.