MediaWiki master
To abstract away the differences among types of storage media, MediaWiki provides an interface known as FileBackend. Any MediaWiki interaction with stored files should therefore go through a FileBackend object.
Different types of backing storage media are supported (ranging from local file system to distributed object stores). The types include:
Configuration documentation for each type of backend can be found in the inline documentation of its __construct() method.
File backends are registered in LocalSettings.php via the global variable $wgFileBackends. To access one of the defined backends, use FileBackendStore::get( <name> ), which returns a FileBackend object handle. The same handle is reused for any subsequent get() call (via a singleton). FileBackend objects cache the results of requests such as file stats and SHA1 lookups, and they also cache TCP connection handles.
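The handle reuse described above can be modelled in a few lines. This is a language-neutral sketch in Python, not MediaWiki code: `BackendRegistry`, `Backend`, the `stat_cache` attribute, and the backend name are hypothetical and chosen only for illustration.

```python
class Backend:
    """Stand-in for a backend handle; it carries its own per-handle cache."""

    def __init__(self, name, config):
        self.name = name
        self.config = config
        self.stat_cache = {}  # hypothetical cache for file stat results


class BackendRegistry:
    """Maps registered backend names to lazily created, reused handles."""

    def __init__(self, configs):
        self._configs = {c["name"]: c for c in configs}
        self._handles = {}

    def get(self, name):
        if name not in self._configs:
            raise KeyError(f"No backend registered as {name!r}")
        if name not in self._handles:
            # First request: build the handle once and remember it.
            self._handles[name] = Backend(name, self._configs[name])
        return self._handles[name]


registry = BackendRegistry([{"name": "local-backend"}])
first = registry.get("local-backend")
second = registry.get("local-backend")
```

Because both calls return the same object, anything the first caller caused to be cached on the handle is available to the second caller as well.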
Note: Some backends may require additional PHP extensions to be enabled or can rely on a MediaWiki extension. This is often the case when a FileBackend subclass makes use of an upstream client API for communicating with the backing store.
The MediaWiki FileBackend API supports various operations on either files or directories. See FileBackend.php for full documentation for each function.
The following basic operations are supported for reading from a backend:
On files:
On directories:
Note: Backend handles should return directory listings as iterators, although in some cases they may just be simple arrays (which can still be iterated over). Iterators allow callers to traverse a large number of file listings without consuming excessive RAM in the process. Either the memory consumed is bounded by a constant (if the iterator does paging) or it is proportional to the depth of the portion of the directory tree being traversed (if the iterator works via recursion).
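The paging case can be illustrated with a small generator. This is a Python sketch of the idea only, not the FileBackend implementation: `list_files`, the `fetch_page(after, limit)` callback, and the fake store are assumptions made for the example.

```python
def list_files(fetch_page, page_size=100):
    """Yield file names one at a time, fetching them page by page.

    fetch_page(after, limit) is a hypothetical callback that returns up to
    `limit` names sorting after `after` (None means start from the beginning).
    Only one page is held in memory at any time.
    """
    after = None
    while True:
        page = fetch_page(after, page_size)
        for name in page:
            yield name
        if len(page) < page_size:
            return  # a short (or empty) page means nothing is left
        after = page[-1]


# A fake store, used only to exercise the iterator.
ALL_NAMES = sorted(f"a/b/file{i:04d}" for i in range(250))


def fake_fetch_page(after, limit):
    start = 0 if after is None else ALL_NAMES.index(after) + 1
    return ALL_NAMES[start:start + limit]


names = list(list_files(fake_fetch_page, page_size=100))
```

The caller sees a single flat sequence, while memory use is bounded by the page size rather than by the total number of files.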
The following basic operations are supported for writing or changing in the backend:
On files:
The following operations are supported for writing directories in the backend:
Generally, callers should use doOperations() or doQuickOperations() when doing batches of changes, rather than making a series of single-operation calls. This makes the system tolerate high latency much better by pipelining operations when possible.
doOperations() should be used for working on important original data, i.e. when consistency is important. It will only pipeline operations that do not depend on each other. It is best if the operations that do not depend on each other occur in consecutive groups.
doQuickOperations() is more geared toward ephemeral items that can be easily regenerated from original data. It will always pipeline without checking for dependencies within the operation batch. One might use this function for creating and purging generated thumbnails of original files for example.
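To show why grouping independent operations consecutively helps, the following Python sketch splits a batch into runs that could be pipelined together. It is a model under stated assumptions, not the FileBackend algorithm: the source does not say how doOperations() detects dependencies, so the path-overlap rule, the helper names, and the simplified operation dictionaries here are all hypothetical.

```python
def paths_read(op):
    return {op["src"]} if "src" in op else set()


def paths_written(op):
    written = set()
    if "dst" in op:
        written.add(op["dst"])
    if op["op"] in ("move", "delete"):
        written.add(op["src"])  # the source path disappears
    return written


def pipeline_batches(ops):
    """Split ops into consecutive batches whose members are independent."""
    batches, current = [], []
    read, written = set(), set()
    for op in ops:
        r, w = paths_read(op), paths_written(op)
        # Assumed rule: an op conflicts if it touches a path already written
        # in this batch, or overwrites a path already read in this batch.
        conflict = bool((r & written) or (w & written) or (w & read))
        if current and conflict:
            batches.append(current)
            current, read, written = [], set(), set()
        current.append(op)
        read |= r
        written |= w
    if current:
        batches.append(current)
    return batches


grouped = [
    {"op": "create", "dst": "c/a.txt"},
    {"op": "create", "dst": "c/b.txt"},
    {"op": "copy", "src": "c/a.txt", "dst": "c/a2.txt"},
    {"op": "copy", "src": "c/b.txt", "dst": "c/b2.txt"},
]
interleaved = [grouped[0], grouped[2], grouped[1], grouped[3]]
```

Under this rule the `grouped` ordering needs two pipelined rounds, while the `interleaved` ordering of the same four operations needs three. A model of doQuickOperations() would skip the check entirely and treat the whole list as one batch.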
Not all backing stores are sequentially consistent by default. Various FileBackend functions offer a "latest" option that can be passed in to assure (or try to assure) that the latest version of the file is read. Some backing stores are consistent by default, but callers should always assume that, without this option, stale data may be read. This is in fact the case for stores that are eventually consistent.
Note that file listing functions have no "latest" flag, and thus some systems may return stale data. Callers should therefore avoid assuming that listings contain changes made by the current client or any other client a very short time ago. For example, creating a file under a directory and then immediately doing a file listing operation on that directory may result in a listing that does not include that file.
Locking is effective if and only if a proper lock manager is registered and is actually being used by the backend. Lock managers can be registered in LocalSettings.php using the $wgLockManagers global configuration variable.
For object stores, locking is not generally useful for avoiding partially written or read objects, since most stores use Multi Version Concurrency Control (MVCC) to avoid this. However, locking can be important when:
When locking, callers should use the latest available file data for reads. Also, one should always lock the file before reading it, not after. If stale data is used to determine a write, there will be some data corruption, even when reads of the original file finally start returning the updated data without needing the "latest" option (eventual consistency). The "scoped" lock functions are preferable, since they avoid the problem of forgetting to unlock after an early return or an exception.
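The lock-then-read ordering and the benefit of scoped locks can be sketched as follows. This is a Python model, not the MediaWiki lock API: `InMemoryLockManager`, `scoped_lock`, `LockFailed`, and `append_line` are hypothetical names, and a plain dictionary stands in for the backing store.

```python
from contextlib import contextmanager


class LockFailed(Exception):
    pass


class InMemoryLockManager:
    """Hypothetical non-blocking lock manager: acquiring fails, never waits."""

    def __init__(self):
        self._held = set()

    def try_lock(self, path):
        if path in self._held:
            return False
        self._held.add(path)
        return True

    def unlock(self, path):
        self._held.discard(path)

    def is_locked(self, path):
        return path in self._held


@contextmanager
def scoped_lock(manager, path):
    """Hold the lock for the duration of the with-block, then release it."""
    if not manager.try_lock(path):
        raise LockFailed(path)
    try:
        yield
    finally:
        manager.unlock(path)  # runs on normal exit, early return, or exception


def append_line(manager, store, path, line):
    """Lock first, then read, then write, so the write uses fresh data."""
    with scoped_lock(manager, path):
        current = store.get(path, "")  # the read happens under the lock
        if line in current.splitlines():
            return False  # early return; the lock is still released
        store[path] = current + line + "\n"
        return True
```

The caller never writes an explicit unlock, so no code path can leave the file locked, and a failed acquisition surfaces as an exception that the caller has to handle.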
Since acquiring locks can fail, and lock managers can be non-blocking, callers should:
MVCC is also a useful pattern to use on top of the backend interface, because operations are not atomic, even with doOperations(), so doing complex batch file changes or changing files and updating a database row can result in partially written "transactions". Thus one should avoid changing files once they have been stored, except perhaps with ephemeral data that are tolerant of some degree of inconsistency.
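One way to apply this pattern is to write every new version under a key that is never reused and to switch a single pointer afterwards. The Python sketch below illustrates that idea only: `publish`, `read_current`, and the content-hash key scheme are assumptions for the example, with one dictionary standing in for the file backend and another for the database row that names the current version.

```python
import hashlib


def publish(store, pointers, name, content):
    """Write content under a new, never-reused key, then switch the pointer."""
    digest = hashlib.sha1(content).hexdigest()
    key = f"{name}/{digest}"
    store[key] = content   # step 1: the new version lands beside the old one
    pointers[name] = key   # step 2: one small update makes it current
    return key


def read_current(store, pointers, name):
    """Readers follow the pointer, so they always see a complete version."""
    return store[pointers[name]]


store, pointers = {}, {}
old_key = publish(store, pointers, "report", b"version 1")
new_key = publish(store, pointers, "report", b"version 2")
```

If the process dies between the two steps, the pointer still names the old, complete version, and the orphaned new object can be cleaned up later. Stored files are never modified in place.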
Callers can use their own locking (e.g. SELECT FOR UPDATE) if it is more convenient, but note that all callers that change any of the files should then go through functions that acquire these locks. For example, if a caller just directly uses the file backend store() function, it will ignore any custom "FOR UPDATE" locks, which can cause problems.
Support for object stores (like Amazon S3/Swift) drives much of the API and design decisions of FileBackend, but any POSIX-compliant file system works fine. The system essentially stores "files" in "containers". For a mounted file system as a backing store, "files" will just be files under directories. For an object store as a backing store, the "files" will be objects stored in actual containers.
An advantage of object stores is reduced round-trip time (RTT). This is achieved by avoiding the need to create each parent directory before placing a file somewhere, a cost that on file systems grows with the depth of the directory hierarchy. Another advantage of object stores is that object listings tend to use databases, which scale better than the linked-list directories that file systems sometimes use. File systems like btrfs and xfs use tree structures, which scale better. For both object stores and file systems, using "/" in filenames will allow for the intuitive use of directory functions. For example, creating a file in Swift called "container/a/b/file1" will mean that:
This means that switching from an object store to a file system and vice versa using the FileBackend interface will generally be harmless. However, one must be aware of some important differences: