fix: add fix for slow full count in postgresql #2174
Conversation
This commit adds a fix for the slow performance of full counts in PostgreSQL (Issue #1969).

To achieve this, two new PostgreSQL provider settings have been added (postgresql_pseudo_count_enabled and postgresql_pseudo_count_start), allowing the use of pseudo counts to be configured individually for each use of the PostgreSQL provider.

The fix uses the PostgreSQL EXPLAIN command to estimate ("guess") the number of rows that will be returned by a given request. This does not affect all queries equally, because pseudo counts cannot be used for the following types of query:

- Requests with a result type of hits.
- Requests with a CQL filter.
- Requests with a BBOX filter.
- Requests with a temporal filter.

In addition, the postgresql_pseudo_count_start setting can be used to tell the system to run a full count when the row estimate is too small, i.e. small enough that a full count can complete in a reasonable time.

This commit also adds the required documentation and PostgreSQL provider test changes, including adding a building_type and datetime column to the dummy_data.sql file.
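For reference, here is a minimal sketch of the general technique (not the code in this PR): read the planner's row estimate via EXPLAIN (FORMAT JSON) and fall back to an exact COUNT(*) when the estimate is below the configured threshold. The function, its parameters and the use of psycopg2 are illustrative assumptions; only the two setting names come from this PR.

```python
import json

import psycopg2


def estimate_count(conn, query, params=None,
                   pseudo_count_enabled=True, pseudo_count_start=10000):
    """Return an estimated or exact row count for a SELECT query.

    pseudo_count_enabled / pseudo_count_start are illustrative stand-ins for
    the new provider settings postgresql_pseudo_count_enabled and
    postgresql_pseudo_count_start.
    """
    with conn.cursor() as cur:
        if pseudo_count_enabled:
            # Ask the planner for its row estimate instead of scanning the table.
            cur.execute(f"EXPLAIN (FORMAT JSON) {query}", params)
            raw = cur.fetchone()[0]
            # Depending on the driver, the plan may arrive parsed or as text.
            plan = raw if isinstance(raw, list) else json.loads(raw)
            estimate = int(plan[0]["Plan"]["Plan Rows"])
            if estimate >= pseudo_count_start:
                return estimate
        # Estimate is small (or pseudo counts are disabled): a full count is
        # cheap enough to run, and it is exact.
        cur.execute(f"SELECT COUNT(*) FROM ({query}) AS subquery", params)
        return cur.fetchone()[0]


# Hypothetical usage:
# conn = psycopg2.connect("dbname=example")
# n = estimate_count(conn,
#                    "SELECT * FROM public.big_table WHERE building_type = %s",
#                    ("house",))
```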
I wonder if a Postgres-specific addition to the provider block is the correct approach if we want to maintain a rigid pygeoapi config schema. This appears to port some of the logic implemented in a pseudocount-specific pygeoapi Postgres provider. I wonder if there is a configuration option / solution that could be used across all pygeoapi providers given the
Sharing my two cents: for our deployment, we decided that no count was the better option, so none of our PostgreSQL-backed endpoints have counts. We evaluated the same pseudocount implementation and decided that it didn't give enough benefits (both in performance and in allowing users to predict result sizes -- because the count isn't accurate, you'll likely still need to incrementally grow your result set) versus removing counts entirely.

Edit to add: Unlike Ben's comment above, I do prefer controlling this at the resource level rather than at the server level; we have other data providers that don't have the same trouble counting records (particularly for small tables), and it's nice to enable numberMatched there. We haven't had anyone comment on the inconsistency yet. We also enable resulttype=hits for most resources, and only disable it for particularly large tables.
@mdearos What is the use case for needing arbitrarily high counts that are not accurate?
@webb-ben Initially we were experiencing HTTP timeouts when accessing collections with high row counts. This was happening because querying the database for the whole collection and counting the rows returned was taking too long. Several approaches to handling this issue were considered, but the decision was made to go with our proposed solution because it fulfilled our requirements: consistent API responses (a total number of results is always returned), the ability to access any collection page without receiving an HTTP timeout, and a count of results that is as close to the true value as possible.

We have some very large collections (100 million+ rows), but customers will always filter down the data. So the predicted values provided for collections with high row counts are used mainly to prevent HTTP timeouts and give customers some idea of how much data is in the collection. Removing them completely would mean that when a customer has filtered the dataset, they would not receive a result count at all.

Also, while the predicted value is likely to be incorrect on collections with high row counts and databases with a high number of writes, once the vacuuming process has completed the result should be accurate even on collections with extremely high row counts. The accuracy would then degrade again over time as more data is written to the collection, but this would be corrected periodically by the vacuuming process.
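To illustrate the point about vacuuming and accuracy, a small hedged sketch (the connection string and the table big_table are hypothetical): the planner's estimate is driven by statistics such as pg_class.reltuples, which ANALYZE and autovacuum refresh, so the estimate converges on the true count after maintenance and drifts as new rows are written.

```python
import psycopg2

# Hypothetical connection and table names; the point is only to show where the
# planner's number comes from and how maintenance refreshes it.
conn = psycopg2.connect("dbname=example")
conn.autocommit = True

with conn.cursor() as cur:
    # The planner's row estimate is derived from statistics such as
    # pg_class.reltuples, which autovacuum / ANALYZE keep roughly up to date.
    cur.execute("SELECT reltuples::bigint FROM pg_class WHERE relname = %s",
                ("big_table",))
    print("estimated rows before ANALYZE:", cur.fetchone()[0])

    cur.execute("ANALYZE big_table")  # refresh the statistics
    cur.execute("SELECT reltuples::bigint FROM pg_class WHERE relname = %s",
                ("big_table",))
    print("estimated rows after ANALYZE:", cur.fetchone()[0])

    # The exact count is always available, but can be slow on very large tables.
    cur.execute("SELECT COUNT(*) FROM big_table")
    print("exact count:", cur.fetchone()[0])

conn.close()
```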
Additional information
Dependency policy (RFC2)
Updates to public demo
Contributions and licensing
(as per https://github.com/geopython/pygeoapi/blob/master/CONTRIBUTING.md#contributions-and-licensing)