Wikibase
MediaWiki Wikibase extension
Loading...
Searching...
No Matches
Item & Property Terms

Secondary storage for Item and Property terms in SQL is needed for efficient and atomic lookup and query of the terms of multiple entities in multiple languages.

For example, when rendering an Item page the labels of all other entities being referred to need to be known. The alternative to secondary storage would be loading each of the full entities in order to lookup the terms needed.

The code for the storage lives in the Wikibase\Lib\Store\Sql\Terms namespace.

Writing to the secondary storage happens through a deferred update after each edit on entities. This is to make saving edits faster and more atomic which also means reducing the failure rate of saving edits. As the result, secondary storage might not be always completely in sync with the actual terms stored in the primary storage.

Briefly in code:

  • ItemTermStoreWriter and PropertyTermStoreWriter are the interfaces at the bottom of the term storage tree.
    • These interfaces are provided by the data-model-services vendor component
    • Implementations exist to write to the new and legacy storage.
  • EntityTermStoreWriter joins these stores in an interface that can generically save either Item or Property terms.

The storage system is currently decided using the tmpItemTermsMigrationStages and tmpPropertyTermsMigrationStage repo settings.

Legacy Secondary Storage

This currently the default storage mechanism when using Wikibase.

In the past (pre 2020) terms were stored in a single large database table called wb_terms. This table lacked clear design and eventually became too big to touch for wikidata.org. Between 2019 and 2020 a migration process was carried out (and is still being carried out) migrating the terms to a new schema (see below).

The "Epic" task for this was https://phabricator.wikimedia.org/T208425 - [EPIC] Kill the wb_terms table

New Secondary Storage

The storage is made up of multiple normalized tables, all prefixed with "wbt_".DatabaseTermInLangIdsAcquirer The tables were created by AddNormalizedTermsTablesDDL.sql which includes some documentation.

  • wbt_item_terms
  • wbt_property_terms
  • wbt_term_in_lang
  • wbt_text_in_lang
  • wbt_type
  • wbt_text

The relations are shown below:

dot_inline_dotgraph_3.png

The Normalization results in a more complex query and update pattern. See sections below for more details on how Reading and Updating work.

Read queries

Lookup terms of an entity

Lookup of the terms of an entity can be achieved by starting with the wbt_item_terms or wbt_property_terms tables where you will find integer representations of Item and Property identifiers.

The below query selects all terms in the tables for item Q123 and can be used as a starting point for data exploration:

SELECT
wbit_item_id as id,
wby_name as type,
wbxl_language as language,
wbx_text as text
FROM wbt_item_terms
LEFT JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id
LEFT JOIN wbt_type ON wbtl_type_id = wby_id
LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id
LEFT JOIN wbt_text ON wbxl_text_id = wbx_id
WHERE wbit_item_id = 123;

For properties you can do something like:

SELECT
wbpt_property_id as id,
wby_name as type,
wbxl_language as language,
wbx_text as text
FROM wbt_property_terms
LEFT JOIN wbt_term_in_lang ON wbpt_term_in_lang_id = wbtl_id
LEFT JOIN wbt_type ON wbtl_type_id = wby_id
LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id
LEFT JOIN wbt_text ON wbxl_text_id = wbx_id
WHERE wbpt_property_id = 10;

Lookup all entities that use a certain term

Lookup of entities from a term string can be achieved by starting with the wbt_text table which contains the text for all terms or all types for both Items and Properties.

SELECT
wbit_item_id as id,
wby_name as type,
wbxl_language as language,
wbx_text as text
FROM wbt_item_terms
LEFT JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id
LEFT JOIN wbt_type ON wbtl_type_id = wby_id
LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id
LEFT JOIN wbt_text ON wbxl_text_id = wbx_id
WHERE wby_name = 'label'
AND wbxl_language = 'en'
AND wbx_text = 'Berlin';

For properties you can do something like:

SELECT
wbpt_property_id as id,
wby_name as type,
wbxl_language as language,
wbx_text as text
FROM wbt_property_terms
LEFT JOIN wbt_term_in_lang ON wbpt_term_in_lang_id = wbtl_id
LEFT JOIN wbt_type ON wbtl_type_id = wby_id
LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id
LEFT JOIN wbt_text ON wbxl_text_id = wbx_id
WHERE wby_name = 'label'
AND wbxl_language = 'en'
AND wbx_text = 'instance of';

Updating

Process outline

  • Term secondary storage is currently written to after edits are saved in MediaWiki's "Secondary data updates". See ItemHandler::getSecondaryDataUpdates() and PropertyHandler::getSecondaryDataUpdates() implementations.
  • When term changes happen a series of ID acquisitions occur in DatabaseTermInLangIdsAcquirer. (Finding IDs that already exist in the storage that will be needed for future inserts)
  • When new terms are being introduced rows that do not appear in the acquisition will be inserted.
  • Actual insertion and deletion of the terms in the wbt_item_terms and wbt_property_terms tables is done in DatabaseItemTermStoreWriter and DatabasePropertyTermStoreWriter.
  • When term changes result in some terms potentially being no longer being used across the whole store a CleanTermsIfUnusedJob job will be scheduled to remove the rows.
  • The job removes data from other tables using a DatabaseInnerTermStoreCleaner.

Keeping the store clean

The tables in the store are cleaned up so that data that is totally removed from entities is also totally removed from the store. This is important for cases such as Wikidata that has publicly accessible database replicas of this information.