Optionally support a libsql connection URI which will be used to track jobs as they are processed by twine-writerd or twine-cli.
A job consists of:
- A UUID to identify it
- Optionally, a parent UUID
- A URI to identify it (which may simply be a `urn:uuid:` representation of the job UUID if nothing else is suitable; otherwise it will be the canonical source or target URI, depending upon the processing pipeline; workflow components may update it accordingly during processing)
- Timestamps for when the job was added and last updated
- A status: `WAITING`, `ACTIVE`, `ABORTED` (by the user), `COMPLETE`, `FAILED`, `ERRORS` (partial failure)
- A status annotation (free-text) which may be set to indicate the failure reason
- If active, the cluster/instance details of the node processing the job (preserved for diagnosis once set)
- "Processing item x of y" progress indicators (particularly for bulk ingests from filesystem sources)
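The fields above could be captured in a table along the following lines. This is only an illustrative sketch against an in-memory SQLite database (libsql is wire-compatible); none of the table or column names are part of any existing libtwine interface.

```python
import sqlite3
import uuid

# Hypothetical schema for the proposed jobs table.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    uuid        TEXT PRIMARY KEY,
    parent_uuid TEXT REFERENCES jobs(uuid),
    uri         TEXT NOT NULL,
    added       TEXT NOT NULL DEFAULT (datetime('now')),
    updated     TEXT NOT NULL DEFAULT (datetime('now')),
    status      TEXT NOT NULL DEFAULT 'WAITING'
                CHECK (status IN ('WAITING','ACTIVE','ABORTED',
                                  'COMPLETE','FAILED','ERRORS')),
    annotation  TEXT,            -- free-text failure reason, etc.
    node        TEXT,            -- cluster/instance details once ACTIVE
    progress    INTEGER,         -- item x ...
    total       INTEGER          -- ... of y
);
"""

def create_job(db, uri=None, parent=None):
    """Insert a new WAITING job; fall back to a urn:uuid: URI."""
    job_id = str(uuid.uuid4())
    db.execute(
        "INSERT INTO jobs (uuid, parent_uuid, uri) VALUES (?, ?, ?)",
        (job_id, parent, uri or "urn:uuid:" + job_id),
    )
    return job_id

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
job = create_job(db, "file:///data/example.nq")
```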
UUIDs should, where possible, be taken from the source if it incorporates one into its identification, or generated on the fly if not.
A job stack should be maintained internally to libtwine in order to track parent/child relationships, rather than requiring them to be made explicit.
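The job stack idea might look something like the following minimal sketch: entering a job pushes it, and any job created while another is on the stack implicitly receives that job as its parent. The class and method names are illustrative assumptions, not an existing libtwine API.

```python
# Minimal sketch of an internal job stack for implicit parentage.
class JobStack:
    def __init__(self):
        self._stack = []

    def push(self, job_id):
        """Enter a job: subsequently created jobs become its children."""
        self._stack.append(job_id)

    def pop(self):
        """Leave the current job."""
        return self._stack.pop()

    def current_parent(self):
        """Parent UUID to assign to a newly created job, if any."""
        return self._stack[-1] if self._stack else None

stack = JobStack()
stack.push("ingest-job")          # outer bulk-ingest job becomes current
parent = stack.current_parent()   # new child jobs inherit "ingest-job"
```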
As an example, an ingest of N-Quads from a file, processing with spindle-correlate might yield the following:
- A job is created in state `WAITING` with a newly-generated UUID and a `file:///` URI
- The N-Quads are parsed and the number of graphs determined; the job is updated to state `ACTIVE`, with progress set to 0 of number-of-graphs
- For each graph that is correlated by Spindle, progress is updated, and a new child job is created in state `WAITING`, using the Spindle-generated UUID and URI
- Once processing of the N-Quads is complete, the job status is updated to `COMPLETE`
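The ingest flow above can be walked through against a throwaway SQLite database; the schema and the UUIDs standing in for Spindle-generated ones are illustrative assumptions.

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    uuid TEXT PRIMARY KEY, parent_uuid TEXT, uri TEXT,
    status TEXT, progress INTEGER, total INTEGER)""")

# 1. Job created in WAITING with a fresh UUID and a file:/// URI.
ingest = str(uuid.uuid4())
db.execute("INSERT INTO jobs VALUES (?, NULL, ?, 'WAITING', NULL, NULL)",
           (ingest, "file:///data/dump.nq"))

# 2. N-Quads parsed, graph count known: WAITING -> ACTIVE, progress 0 of N.
num_graphs = 3
db.execute("UPDATE jobs SET status='ACTIVE', progress=0, total=? WHERE uuid=?",
           (num_graphs, ingest))

# 3. Each correlated graph bumps progress and spawns a WAITING child job.
for n in range(num_graphs):
    child = str(uuid.uuid4())  # stands in for the Spindle-generated UUID
    db.execute("INSERT INTO jobs VALUES (?, ?, ?, 'WAITING', NULL, NULL)",
               (child, ingest, "urn:uuid:" + child))
    db.execute("UPDATE jobs SET progress=? WHERE uuid=?", (n + 1, ingest))

# 4. All graphs processed: the ingest job moves to COMPLETE.
db.execute("UPDATE jobs SET status='COMPLETE' WHERE uuid=?", (ingest,))
```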
As spindle-generate later processes its queue of items, it performs the following:
- A job is created in state `WAITING` using the Spindle-generated UUID and URI; if it already exists, its parentage is preserved (thus, if the job originated from an ingest as described above, the proxy-generation step maintains the parent-child relationship, allowing for ready visualisation)
- As the proxy is generated, its status is updated accordingly
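The "create, but preserve parentage if it already exists" step maps naturally onto an SQLite/libsql upsert. A sketch, with illustrative table/column names and hard-coded UUIDs:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    uuid TEXT PRIMARY KEY, parent_uuid TEXT, uri TEXT, status TEXT)""")

# Child job already created by the earlier ingest, with its parent set:
db.execute("INSERT INTO jobs VALUES ('spindle-1', 'ingest-1', "
           "'urn:uuid:spindle-1', 'WAITING')")

# spindle-generate "creates" the same job; ON CONFLICT leaves the
# existing parent_uuid untouched while resetting the status.
db.execute("""
    INSERT INTO jobs (uuid, parent_uuid, uri, status)
    VALUES ('spindle-1', NULL, 'urn:uuid:spindle-1', 'WAITING')
    ON CONFLICT (uuid) DO UPDATE SET status = excluded.status
""")
parent = db.execute(
    "SELECT parent_uuid FROM jobs WHERE uuid='spindle-1'").fetchone()[0]
```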
With this arrangement, a small number of relatively simple SQL queries can result in progress tracking and volumetrics across a processing cluster.
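As one example of such a query, a per-status job count gives cluster-wide volumetrics in a single `GROUP BY`; the schema and sample data here are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (uuid TEXT PRIMARY KEY, status TEXT)")
db.executemany("INSERT INTO jobs VALUES (?, ?)", [
    ("a", "COMPLETE"), ("b", "COMPLETE"), ("c", "ACTIVE"), ("d", "FAILED"),
])

# Jobs per status across the cluster, as a single aggregate query.
volumetrics = dict(db.execute(
    "SELECT status, COUNT(*) FROM jobs GROUP BY status"))
# volumetrics -> {'ACTIVE': 1, 'COMPLETE': 2, 'FAILED': 1}
```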
Open question: how would Twine know when to preserve versus replace the parent of a job?
Perhaps it could be as simple as user action (i.e., twine-cli) taking precedence over an ongoing process: a queue-driven twine-writerd would only set the parent of a job if the job is newly created, whereas twine-cli would always override it. Both would create an overarching job for their processing runs, whether that's from a file or a queue.
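That precedence rule is small enough to state as a pure function; the function name and flag below are hypothetical, purely to make the proposed behaviour concrete.

```python
# Sketch of the floated precedence rule for a job's parent UUID.
def resolve_parent(existing_parent, new_parent, user_initiated):
    """Decide which parent UUID a job should end up with."""
    if user_initiated:            # twine-cli: user action takes precedence
        return new_parent
    if existing_parent is None:   # twine-writerd: only set if newly created
        return new_parent
    return existing_parent        # otherwise preserve existing parentage

# twine-writerd reprocessing an existing child keeps the old parent:
resolve_parent("ingest-1", "writerd-run-7", user_initiated=False)  # -> 'ingest-1'
```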
Tracked as RESDATA-1279