Notes on the Job queuing system architecture.
The data model consists of the following main components:
* The Job class, which represents a task of a given type with given parameters, to be executed in the background.
* The JobQueue class, which represents the queue of jobs of a single type.
* The JobQueueGroup class, which represents all job queues of a wiki. The JobQueueGroup service offers a convenience JobQueueGroup instance for the common case of dealing with jobs in the context of the local wiki.

Each job type has its own queue and is associated with a storage medium. One queue might save its jobs in Redis, while another would use a database.
Storage mediums are implemented by JobQueue subclasses. Before using one, you must define in $wgJobTypeConf a mapping from the job type to the desired JobQueue subclass.
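For illustration, a minimal LocalSettings.php sketch of such a mapping; routing the refreshLinks type to Redis and the server address are assumptions made for the example:

```php
// LocalSettings.php — a minimal sketch. The 'default' entry covers all
// job types without an explicit mapping; the Redis address is assumed.
$wgJobTypeConf['default'] = [
	'class' => JobQueueDB::class, // jobs stored in the database
	'order' => 'random',
	'claimTTL' => 3600,
];
$wgJobTypeConf['refreshLinks'] = [
	'class' => JobQueueRedis::class, // this one type goes to Redis instead
	'redisServer' => 'localhost:6379',
	'redisConfig' => [],
	'claimTTL' => 3600,
	'daemonized' => true, // JobQueueRedis requires a daemonized runner
];
```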
The following core queue classes are available:

* JobQueueDB (stores jobs in the `job` table in a database)
* JobQueueRedis (stores jobs in a Redis server)

All queue classes support some basic operations (though some may be no-ops):

* pushing (enqueueing) jobs
* popping (dequeueing) a single job
* acknowledging a job as completed
* checking the size of the queue and whether it is empty
* deleting all jobs in the queue
All queue implementations must offer at-least-once execution guarantees for enqueued jobs. The execution order of enqueued jobs may however vary depending on the implementation, so callers should not assume any particular execution order.
Since each job type has its own queue, and wiki-farms may have many wikis, there might be a large number of queues to keep track of. To avoid wasting large amounts of time polling empty queues, aggregators exist to keep track of which queues are ready.
The following queue aggregator classes are available:

* JobQueueAggregatorRedis (tracks ready queues in a Redis server)
* JobQueueAggregatorNull (performs no tracking; runners poll every queue)
Some aggregators cache data for a few minutes, while others may always be up to date. This can be an important factor for jobs that need a low pickup time (or latency).
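As a sketch, assuming the Redis-backed aggregator; the parameter shape shown here follows the historical $wgJobQueueAggregator configuration and should be verified against your MediaWiki version:

```php
// LocalSettings.php — a sketch, not a definitive configuration.
// The server address and the exact parameter names are assumptions.
$wgJobQueueAggregator = [
	'class' => JobQueueAggregatorRedis::class,
	'redisServers' => [ 'localhost:6379' ], // assumed address
	'redisConfig' => [],
];
```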
The high-level job execution flow for a queue consists of the following steps, sketched in code after this list:

1. A job runner pops a job off the queue (JobQueue::pop()), claiming it.
2. The runner executes the task by calling Job::run().
3. On success, the runner acknowledges completion (JobQueue::ack()), which removes the job from the queue.
4. On failure, the job may be retried, as described below.
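A minimal sketch of this pop/run/ack cycle for a single, hypothetical job type. Real deployments use the JobRunner service or maintenance/runJobs.php rather than hand-rolling this loop; retry and error bookkeeping are omitted:

```php
use MediaWiki\MediaWikiServices;

// Fetch the queue for one (hypothetical) job type.
$queue = MediaWikiServices::getInstance()
	->getJobQueueGroup()
	->get( 'exampleRefresh' );

while ( ( $job = $queue->pop() ) !== false ) {
	if ( $job->run() ) {
		// Success: acknowledge the job so it is removed from the queue.
		$queue->ack( $job );
	}
	// On failure the job remains claimed; depending on configuration it
	// becomes eligible for retry once its claim TTL expires.
}
```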
An exception thrown by Job::run(), or Job::run() returning false, will cause the job runner to retry the job up to the configured retry limit, unless Job::allowRetries() returns false. As of MediaWiki 1.43, no job runner implementation makes a distinction between transient errors (which are retry-safe) and non-transient errors (which are not retry-safe). A Job implementation that is expected to have both transient and non-transient error states should therefore catch and process non-transient errors internally and return true from Job::run() in such cases, to reduce the incidence of unwanted retries for such errors while still benefiting from the automated retry logic for transient errors.
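A sketch of that pattern; the 'exampleThumbnail' type, its parameters, and the renderThumbnail() helper are hypothetical:

```php
// Treat unrenderable input as non-transient (no retry), while letting
// transient failures reach the runner's retry logic.
class ExampleThumbnailJob extends Job {
	public function __construct( array $params ) {
		parent::__construct( 'exampleThumbnail', $params );
	}

	public function run() {
		try {
			$this->renderThumbnail( $this->params['file'] );
		} catch ( InvalidArgumentException $e ) {
			// Non-transient: a retry cannot succeed. Record the error for
			// logging, but return true so the runner does not retry.
			$this->setLastError( 'Unrenderable file: ' . $e->getMessage() );
			return true;
		}
		// Transient failures (e.g. timeouts) should instead throw or
		// return false, so the automated retry logic applies.
		return true;
	}

	private function renderThumbnail( $file ) {
		// ... actual work elided ...
	}
}
```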
Note that in a distributed job runner implementation, the above steps may be split between different infrastructure components, as is the case with the changeprop-based system used by the Wikimedia Foundation. This may require additional configuration beyond overriding Job::allowRetries() to ensure that other job runner components do not attempt to retry a job that is not retry-safe (T358939).
Since job runner implementations may vary in reliability, job classes should be idempotent, to maintain correctness even if the job happens to run more than once.
A Job subclass may override Job::getDeduplicationInfo() and Job::ignoreDuplicates() to allow jobs to be deduplicated if the job runner in use supports it.
If Job::ignoreDuplicates() returns true, the deduplication logic must consider the job to be a duplicate if a job of the same type with identical deduplication info has been executed later than the enqueue timestamp of the job.
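For instance, a sketch of a job opting in to deduplication; the job type and the 'requestId' parameter are invented for illustration:

```php
class ExampleRefreshJob extends Job {
	public function __construct( array $params ) {
		parent::__construct( 'exampleRefresh', $params );
	}

	public function ignoreDuplicates() {
		// Allow the queue to drop this job if an equivalent one is pending.
		return true;
	}

	public function getDeduplicationInfo() {
		$info = parent::getDeduplicationInfo();
		// Ignore parameters that do not affect the job's effect, so jobs
		// differing only in bookkeeping data deduplicate together.
		unset( $info['params']['requestId'] );
		return $info;
	}

	public function run() {
		// ... idempotent work ...
		return true;
	}
}
```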
Jobs that spawn many smaller jobs (so-called "root" and "leaf" jobs) may enable additional deduplication logic that makes in-flight leaf jobs no-ops when a newer root job with identical parameters gets enqueued. This is done by passing two special parameters, rootJobTimestamp and rootJobSignature, which hold the MediaWiki timestamp at which the root job was enqueued and an SHA-1 checksum uniquely identifying the root job, respectively. The Job::newRootJobParams() convenience method facilitates adding these parameters to a preexisting parameter set. When deduplicating leaf jobs, the job runner must consider a leaf job to be a duplicate if a root job with an identical signature has been executed by the runner later than the rootJobTimestamp of the leaf job.
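A sketch of enqueueing a root job with these parameters; the job type and the deduplication key are illustrative, not taken from core:

```php
use MediaWiki\MediaWikiServices;

// Derive 'rootJobTimestamp' and 'rootJobSignature' from a key that
// uniquely identifies this root job's work.
$params = Job::newRootJobParams( 'exampleRefresh:category:Foo' );

$job = new JobSpecification( 'exampleRefresh', $params + [ 'category' => 'Foo' ] );
MediaWikiServices::getInstance()->getJobQueueGroup()->push( $job );
```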
For enqueueing jobs, JobQueue and JobQueueGroup offer the push() and lazyPush() methods. The former synchronously enqueues the job and propagates a JobQueueError exception to the caller in case of failure, while the latter, when running in a web request context, defers enqueueing the job until after the response has been flushed to the client. Callers should prefer lazyPush() unless it is necessary to surface enqueue failures.
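A sketch of both enqueue styles; the job type and parameters are hypothetical:

```php
use MediaWiki\MediaWikiServices;

$jobQueueGroup = MediaWikiServices::getInstance()->getJobQueueGroup();
$job = new JobSpecification( 'exampleRefresh', [ 'pageId' => 42 ] );

// Preferred in web requests: enqueue after the response is flushed.
$jobQueueGroup->lazyPush( $job );

// Only when enqueue failures must be handled by the caller:
try {
	$jobQueueGroup->push( $job );
} catch ( JobQueueError $e ) {
	// surface or log the failure
}
```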