Server-side aggregation of matches for many pieces of content when testing rules #344
Replies: 3 comments
-
This seems like the right approach to me (handling server-side rather than client-side).
-
Is it possible this will create enough work for CAPI that it could affect its performance? Should we check in with that team? I ask because we'll probably want to check against a very large number of articles - probably more than most queries that go to CAPI - though on the other hand we don't create new rules that often, so we might not need to run large numbers of corpus checks.
-
We can, and probably should, do that! My assumption here is that, as Michael B. once said to me, 'CAPI is hardcore'. Taking a look at the status page, it's currently serving ~283 reqs/second across private and public accounts. Safe to say that we'll want to cache our results no matter what happens, for some duration. I suspect we could get away with a simple time-expired cache that keeps pages for some reasonable duration TBC.

One place that will be harder: the checker! We should consider the load there, as lots of checking may affect PROD Typerighter users. Having said that, a cache might be useful here too, as I suspect users will go backwards and forwards between the same pattern often, especially if they're working to understand the difference between pattern A and pattern B - and even if there's not really an impact on load, the speed benefit will help our users. There are standard, powerful cache implementations available as a part of Play, so this shouldn't be too much work.

We might also want to look at prioritising traffic within the rule management service. I think we should look at the real impact of checking rules on the PROD service before we take this step - the service routinely checks 5000-word pieces with ~13,000 rules, with a maximum p95 check duration of 500ms and a 1-200ms average. That routine check is 13,000 × 5,000 = 65M rule-word pairs, so 5,000,000 words with 1 rule (5M rule-word pairs) feels like it'll be within an order of magnitude. We'll find out.

I think there are some product questions to answer here. On my mind - do we have different sorts of checks? For example, a deterministic, 'standard' check with a large but predefined search, plus a CAPI search check to cover particular pieces? The 'standard' check is a good noise check, but may be inadequate when checking matches against neologisms etc.
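To make the time-expired cache idea concrete, here's a minimal sketch, assuming we key corpus-check results by something like (rule, page) and keep entries for a fixed TTL. The class and field names are illustrative, not from the codebase - in the real service we'd more likely reach for Play's built-in `AsyncCacheApi` with an expiry duration rather than roll our own:

```scala
import java.time.Instant
import scala.collection.concurrent.TrieMap

// A simple time-expired cache: entries live for ttlSeconds, after which
// the next read recomputes them. Thread-safe via TrieMap.
class TtlCache[K, V](ttlSeconds: Long) {
  private case class Entry(value: V, expiresAt: Instant)
  private val entries = TrieMap.empty[K, Entry]

  def getOrElseUpdate(key: K)(compute: => V): V = {
    val now = Instant.now()
    entries.get(key) match {
      case Some(e) if e.expiresAt.isAfter(now) => e.value
      case _ =>
        val v = compute
        entries.put(key, Entry(v, now.plusSeconds(ttlSeconds)))
        v
    }
  }
}
```

With Play's cache the equivalent is roughly `cache.getOrElseUpdate("rule-check:" + key, ttl) { … }`, which also handles async computation for us.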
-
We would like to be able to test a new rule against existing content in CAPI. The diagram below shows one way we might do this.
We must communicate with CAPI and the checker service to do this. I think we should prefer orchestrating on the server:
…n articles in CAPI.

```mermaid
sequenceDiagram
    participant RMC as Rule management client
    participant RMS as Rule management server
    participant CS as Checker server
    participant CAPI
    RMC ->> RMS: [GET] checkerRule, capiQuery?, matchCount
    Note over RMC, RMS: Query optional, defaults to latest content
    loop for each page of CAPI, until matchCount matches found, we run out of pages, or some limit of pages reached
        RMS -->> CAPI: [GET] capiQuery
        CAPI -->> RMS: content[]
        Note over RMS: Convert HTML content to block[]
        loop for each page of documents
            RMS ->> CS: [GET] checkerRule, document[]
            CS ->> RMS: [Chunked NDJSON] match[]
        end
    end
    Note over RMS: Stream matches back to client, filtering documents with no matches and reporting progress
    RMS ->> RMC: [Chunked NDJSON] progress, match[]
```
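The final step of the diagram - streaming NDJSON back to the client, dropping documents with no matches and interleaving progress updates - can be sketched as below. The `Doc` shape and the JSON field names (`progress`, `doc`, `matches`) are illustrative assumptions, not the checker's actual schema:

```scala
// A document paired with the matches the checker found in it.
case class Doc(id: String, matches: List[String])

// Consume pages of checked documents lazily, emitting one NDJSON line
// per progress update and one per document that actually has matches.
def streamResults(pages: Iterator[List[Doc]]): Iterator[String] =
  pages.zipWithIndex.flatMap { case (page, i) =>
    val withMatches = page.filter(_.matches.nonEmpty)
    Iterator(s"""{"progress":${i + 1}}""") ++
      withMatches.iterator.map(d => s"""{"doc":"${d.id}","matches":${d.matches.size}}""")
  }
```

Because everything here is an `Iterator`, nothing is buffered: each NDJSON line can be flushed to the client as soon as the corresponding page of checker results arrives, which is what lets the client show progress on a long corpus check.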