fix: add fix for slow full count in postgresql #2174
Conversation
This commit adds a fix for the slow performance of full counts in PostgreSQL (Issue #1969).

To achieve this, two new PostgreSQL provider settings have been added (postgresql_pseudo_count_enabled and postgresql_pseudo_count_start), allowing the use of pseudo counts to be configured individually for each use of the PostgreSQL provider.

The fix uses the PostgreSQL EXPLAIN command to estimate ("guess") the number of rows that will be returned by a given request. This does not affect all queries equally, because pseudo counts cannot be used for the following types of query:

- Requests with a result type of hits.
- Requests with a CQL filter.
- Requests with a BBOX filter.
- Requests with a temporal filter.

In addition, the postgresql_pseudo_count_start setting can be used to tell the system to run a full count when the row estimate is too small, i.e. small enough that a full count can complete in a reasonable time.

This commit also adds the required documentation and PostgreSQL provider test changes, including adding a building_type and datetime column to the dummy_data.sql file.
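For reference, here is a minimal sketch of the general technique (not the code in this PR): read the planner's row estimate via EXPLAIN (FORMAT JSON) and fall back to an exact COUNT(*) when the estimate is below the configured threshold. The function, its parameters and the use of psycopg2 are illustrative assumptions; only the two setting names come from this PR.

```python
import json

import psycopg2


def estimate_count(conn, query, params=None,
                   pseudo_count_enabled=True, pseudo_count_start=10000):
    """Return an estimated or exact row count for a SELECT query.

    pseudo_count_enabled / pseudo_count_start are illustrative stand-ins for
    the new provider settings postgresql_pseudo_count_enabled and
    postgresql_pseudo_count_start.
    """
    with conn.cursor() as cur:
        if pseudo_count_enabled:
            # Ask the planner for its row estimate instead of scanning the table.
            cur.execute(f"EXPLAIN (FORMAT JSON) {query}", params)
            raw = cur.fetchone()[0]
            # Depending on the driver, the plan may arrive parsed or as text.
            plan = raw if isinstance(raw, list) else json.loads(raw)
            estimate = int(plan[0]["Plan"]["Plan Rows"])
            if estimate >= pseudo_count_start:
                return estimate
        # Estimate is small (or pseudo counts are disabled): a full count is
        # cheap enough to run, and it is exact.
        cur.execute(f"SELECT COUNT(*) FROM ({query}) AS subquery", params)
        return cur.fetchone()[0]


# Hypothetical usage:
# conn = psycopg2.connect("dbname=example")
# n = estimate_count(conn,
#                    "SELECT * FROM public.big_table WHERE building_type = %s",
#                    ("house",))
```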
I wonder if a Postgres-specific addition to the provider block is the correct approach if we want to maintain a rigid pygeoapi config schema. This appears to port some of the logic implemented in a pseudocount-specific pygeoapi Postgres provider. I wonder if there is a configuration option / solution that could be used across all pygeoapi providers given the
Sharing my two cents: for our deployment, we decided that no count was the better option, so none of our PostgreSQL-backed endpoints have counts. We evaluated the same pseudocount implementation and decided that it didn't give enough benefits (both in performance and in allowing users to predict result sizes -- because the count isn't accurate, you'll likely still need to incrementally grow your result set) versus removing counts entirely.

Edit to add: Unlike Ben's comment above, I do prefer controlling this at the resource level rather than at the server level; we have other data providers that don't have the same trouble counting records (particularly for small tables), and it's nice to enable numberMatched there. We haven't had anyone comment on the inconsistency yet. We also enable resulttype=hits for most resources, and only disable it for particularly large tables.
@mdearos What is the use case for needing arbitrarily high counts that are not accurate?
@webb-ben Initially we were experiencing HTTP timeouts when accessing collections with high row counts. This was happening because querying the database for the whole collection and counting the rows returned was taking too long. Several approaches to handling this issue were considered, but the decision was made to go with our proposed solution because it fulfilled our requirements: consistent API responses (a total number of results is always returned), the ability to access any collection page without receiving an HTTP timeout, and a count of results that is as close to the true value as possible.

We have some very large collections (100 million+ rows), but customers will always filter down the data. So the predicted values provided for collections with high row counts are used mainly to prevent HTTP timeouts and give customers some idea of how much data is in the collection. Removing them completely would mean that when a customer has filtered the dataset, they would not receive a result count at all.

Also, while the predicted value is likely to be incorrect on collections with high row counts and databases with a high number of writes, once the vacuuming process has completed the result should be accurate even on collections with extremely high row counts. The accuracy would then degrade again over time as more data is written to the collection, but this would be corrected periodically by the vacuuming process.
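To illustrate the point about vacuuming and accuracy, a small hedged sketch (the connection string and the table big_table are hypothetical): the planner's estimate is driven by statistics such as pg_class.reltuples, which ANALYZE and autovacuum refresh, so the estimate converges on the true count after maintenance and drifts as new rows are written.

```python
import psycopg2

# Hypothetical connection and table names; the point is only to show where the
# planner's number comes from and how maintenance refreshes it.
conn = psycopg2.connect("dbname=example")
conn.autocommit = True

with conn.cursor() as cur:
    # The planner's row estimate is derived from statistics such as
    # pg_class.reltuples, which autovacuum / ANALYZE keep roughly up to date.
    cur.execute("SELECT reltuples::bigint FROM pg_class WHERE relname = %s",
                ("big_table",))
    print("estimated rows before ANALYZE:", cur.fetchone()[0])

    cur.execute("ANALYZE big_table")  # refresh the statistics
    cur.execute("SELECT reltuples::bigint FROM pg_class WHERE relname = %s",
                ("big_table",))
    print("estimated rows after ANALYZE:", cur.fetchone()[0])

    # The exact count is always available, but can be slow on very large tables.
    cur.execute("SELECT COUNT(*) FROM big_table")
    print("exact count:", cur.fetchone()[0])

conn.close()
```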
Additional information
Dependency policy (RFC2)
Updates to public demo
Contributions and licensing
(as per https://github.com/geopython/pygeoapi/blob/master/CONTRIBUTING.md#contributions-and-licensing)