Use .distinct instead of .group(:id) in new TN relationship filters #…

…3094

I initially used .group(:id) because I noticed one of the queries seemed really slow. I now think it was just the "Subject or object of a relationship" case of the "Relationship subject/object" facet in Filter TNs - tested here on the UCD project, on my personal computer. The result set is 58,337 records.

If I benchmark taxon_names/index.json.jbuilder with:

```ruby
time = Benchmark.measure do
  json.array!(@taxon_names) do |taxon_name|
    json.partial! '/taxon_names/attributes', taxon_name: taxon_name
  end
end
puts time.real
```

and run the filter specified above I get:

~.15s with .group(:id)

~2.5s with .distinct :(

with the code prior to this commit. The scope being used is

```ruby
scope :with_taxon_name_relationships, -> {
  joins('LEFT OUTER JOIN taxon_name_relationships tnr1 ON taxon_names.id = tnr1.subject_taxon_name_id').
  joins('LEFT OUTER JOIN taxon_name_relationships tnr2 ON taxon_names.id = tnr2.object_taxon_name_id').
  where('tnr1.subject_taxon_name_id IS NOT NULL OR tnr2.object_taxon_name_id IS NOT NULL')
  # --> either .distinct or .group(:id) here
}
```

Using

```
referenced_klass_union([
  ::TaxonName.with_taxon_name_relationships_as_subject,
  ::TaxonName.with_taxon_name_relationships_as_object
])
```

instead gives ~0.9s

Using

```
::TaxonName.joins('join taxon_name_relationships ON ' \
  'taxon_names.id = taxon_name_relationships.subject_taxon_name_id OR ' \
  'taxon_names.id = taxon_name_relationships.object_taxon_name_id'
).distinct
```

instead gives ~0.7s (using `.group(:id)` brings it down to ~0.15-.02).

Leaving things there - using distinct - for now. (Preferring distinct for semantics and to match(?) the rest of the codebase.)

[

```ruby
::TaxonName.joins('join taxon_name_relationships ON ' \
  'taxon_names.id = taxon_name_relationships.subject_taxon_name_id OR ' \
  'taxon_names.id = taxon_name_relationships.object_taxon_name_id'
).select('DISTINCT ON (taxon_names.id) taxon_names.*')
```

gets you back down into the ~.15s range, but DISTINCT ON () is pg-specific.

I don't really understand why .distinct isn't being optimized to 'distinct on id' in this case since you're running distinct across rows of TaxonName, which are unique on id...]
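The closing aside contrasts deduplicating on whole rows with deduplicating on id alone. A minimal plain-Ruby sketch of that distinction follows; it is only an in-memory analogy, not a description of what PostgreSQL's planner does. The sample rows and the method names `joined_rows`, `distinct_by_row`, and `distinct_by_id` are hypothetical.

```ruby
# Hypothetical stand-in for the joined result set: a taxon name appears once
# per matching relationship, so id 1 shows up twice here.
def joined_rows
  [
    { id: 1, name: 'Aus', rank: 'genus' },
    { id: 1, name: 'Aus', rank: 'genus' },
    { id: 2, name: 'Bus', rank: 'genus' }
  ]
end

# Rough analogue of SELECT DISTINCT taxon_names.*:
# every column of every row takes part in the comparison.
def distinct_by_row(rows)
  rows.uniq
end

# Rough analogue of GROUP BY id or DISTINCT ON (id):
# only the key column is compared.
def distinct_by_id(rows)
  rows.uniq { |row| row[:id] }
end

puts distinct_by_row(joined_rows).map { |row| row[:id] }.inspect
puts distinct_by_id(joined_rows).map { |row| row[:id] }.inspect
```

Because rows that share an id are identical in every other column, both methods return the same rows; the id-based version simply compares less data per row, which is one plausible (unconfirmed) reading of the timing gap reported above.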
SpeciesFileGroup · Feb 12, 2025 · 19c68e1
1 parent 9c91b1e
commit 19c68e1
Showing 1 changed file with 10 additions and 5 deletions.
diff --git a/lib/queries/taxon_name/filter.rb b/lib/queries/taxon_name/filter.rb
@@ -588,13 +588,13 @@ def taxon_name_relationship_type_facet
         if taxon_name_relationship_type_subject.present?
           s = ::TaxonName.as_subject_with_taxon_name_relationship(
             taxon_name_relationship_type_subject
-          ).group(:id)
+          ).distinct
         end
 
         if taxon_name_relationship_type_object.present?
           o = ::TaxonName.as_object_with_taxon_name_relationship(
             taxon_name_relationship_type_object
-          ).group(:id)
+          ).distinct
         end
 
         if taxon_name_relationship_type_either.present?
@@ -635,11 +635,16 @@ def relation_to_relationship_facet
         return nil if relation_to_relationship.nil?
 
         if relation_to_relationship == 'subject'
-          ::TaxonName.with_taxon_name_relationships_as_subject.group(:id)
+          ::TaxonName.with_taxon_name_relationships_as_subject.distinct
         elsif relation_to_relationship == 'object'
-          ::TaxonName.with_taxon_name_relationships_as_object.group(:id)
+          ::TaxonName.with_taxon_name_relationships_as_object.distinct
         else
-          ::TaxonName.with_taxon_name_relationships.group(:id)
+          # 3-4x more time-performant than using
+          # :with_taxon_name_relationships.distinct
+          ::TaxonName.joins('join taxon_name_relationships ON ' \
+            'taxon_names.id = taxon_name_relationships.subject_taxon_name_id OR ' \
+            'taxon_names.id = taxon_name_relationships.object_taxon_name_id'
+          ).distinct
         end
       end
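To illustrate why the OR-join in the final branch still needs `.distinct`, here is a small plain-Ruby simulation of that join condition. It is a sketch under assumptions: the ids, the sample relationships, and the method names `sample_relationships` and `or_join` are hypothetical, and only the column names come from the diff.

```ruby
# Hypothetical relationships, keyed by the two columns used in the join.
def sample_relationships
  [
    { subject_taxon_name_id: 1, object_taxon_name_id: 2 },
    { subject_taxon_name_id: 1, object_taxon_name_id: 3 },
    { subject_taxon_name_id: 3, object_taxon_name_id: 1 }
  ]
end

# Simulates an inner join on "id = subject id OR id = object id":
# one output entry per (taxon name id, relationship) pair that matches.
def or_join(taxon_name_ids, relationships)
  taxon_name_ids.flat_map do |id|
    relationships
      .select { |r| r[:subject_taxon_name_id] == id || r[:object_taxon_name_id] == id }
      .map { id }
  end
end

joined = or_join([1, 2, 3, 4], sample_relationships)
puts joined.inspect       # id 1 matches three relationships, id 3 two, id 4 none
puts joined.uniq.inspect  # the deduplicated ids, as .distinct would yield
```

A taxon name that takes part in several relationships is repeated once per relationship, and a name with no relationships drops out of the inner join entirely, which is why no `IS NOT NULL` check is needed in this form.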