CirrusSearch
Elasticsearch-powered search for MediaWiki
CirrusSearch\BuildDocument\DocumentSizeLimiter Class Reference

An approximate, incomplete and rather dangerous algorithm to reduce the size of a CirrusSearch document.

Public Member Functions

 __construct (array $profile)
 
 resize (Document $document)
 Truncate some textual data from the input Document.
 

Static Public Member Functions

static estimateDataSize (Document $document)
 

Public Attributes

const MANDATORY_REDUCTION_BUCKET = "mandatory_reduction"
 
const OVERSIZE_REDUCTION_REDUCTION_BUCKET = "oversize_reduction"
 
const HINT_DOC_SIZE_LIMITER_STATS = 'DocumentSizeLimiter_stats'
 

Detailed Description

This class is meant to reduce the size of abnormally large documents. What counts as abnormally large is open to interpretation, but this class was designed with sizes around 1 MB considered extremely large. Do not expect this class to be byte-precise: there is no guarantee that the resulting size will be below the requested maximum. Possible reasons include:

  • fields other than the ones listed in the profile take up a lot of space
  • the requested size is so low that it does not even leave room for the JSON overhead

If the goal is to ensure that the resulting JSON representation is below a size S, account for some overhead and ask this class to reduce the document to something smaller than S (e.g. S*0.9).
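
The headroom advice can be sketched as a small calculation. The helper name `targetSizeFor` and the 1,000,000-byte limit are illustrative assumptions and not part of CirrusSearch; only the 0.9 factor comes from the text above.

```php
<?php
// Hypothetical helper (not a CirrusSearch function): derive the max_size to
// request from a hard limit S, leaving headroom because the limiter is only
// approximate.
function targetSizeFor( int $hardLimit, float $headroom = 0.9 ): int {
    return (int)floor( $hardLimit * $headroom );
}

// Assumed hard limit of 1,000,000 bytes for the serialized JSON.
$maxSize = targetSizeFor( 1000000 );
echo $maxSize, "\n";
```

The resulting value would then be used as `max_size` in the profile. A factor of 0.9 is only the example given above; a smaller factor may be appropriate when many fields fall outside the profile.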

Limiter heuristics are controlled by a profile that supports the following keys:

  • max_size (int): the target maximum size of the document (when serialized as JSON)
  • field_types (array<string, string>): field name as key, the type of the field (text or keyword) as value
  • max_field_size (array<string, int>): field name as key, max size as value; these fields are truncated to the given size
  • fields (array<string, int>): field name as key, min size as value; these fields are truncated down to this minimum size for as long as the document size is above max_size
  • markup_template (string): mark the document with this template if it was oversized.
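
A profile following the keys above might look like the sketch below. The field names, sizes and template name are assumptions chosen for illustration and are not CirrusSearch defaults; listing every field in `field_types` is also an assumption.

```php
<?php
// Illustrative profile only: all values below are made up for the example.
$profile = [
    // Target maximum size of the JSON-serialized document.
    'max_size' => 900000,
    // Type of each field the limiter may touch.
    'field_types' => [
        'text' => 'text',
        'opening_text' => 'text',
        'external_link' => 'keyword',
    ],
    // Fields truncated to the given size.
    'max_field_size' => [
        'opening_text' => 10000,
    ],
    // Fields truncated down to this minimum while the document is too large.
    'fields' => [
        'text' => 50000,
        'external_link' => 0,
    ],
    // Template used to mark documents that were oversized (hypothetical name).
    'markup_template' => 'Oversize document',
];

// With the extension loaded, the profile is passed to the constructor and a
// Document is handed to resize(), which returns an array of statistics:
// $limiter = new \CirrusSearch\BuildDocument\DocumentSizeLimiter( $profile );
// $stats = $limiter->resize( $document );
```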

Text fields are truncated using mb_strcut. If the string is part of an array and becomes empty after truncation, it is removed from the array. If the string is a "keyword" (non-tokenized field), it is not truncated but simply removed from its array.

If an array mixes string and non-string data, it is ignored.
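
The truncation rules above can be illustrated with a minimal sketch. This is not the class's actual code: `truncateFieldSketch` is a hypothetical helper, sharing one byte budget across array elements is an assumption, and so is leaving a scalar keyword string unchanged. Only the use of `mb_strcut`, the removal of emptied strings, the whole-value handling of keywords and the skipping of mixed arrays come from the description.

```php
<?php
// Hypothetical helper: shrink one field value to roughly $maxBytes bytes.
function truncateFieldSketch( $value, string $type, int $maxBytes ) {
    if ( is_string( $value ) ) {
        // Assumption: a scalar keyword is left alone; text is cut by bytes.
        // mb_strcut counts bytes but never splits a multibyte character.
        return $type === 'keyword'
            ? $value
            : mb_strcut( $value, 0, $maxBytes, 'UTF-8' );
    }
    if ( !is_array( $value ) ) {
        return $value;
    }
    foreach ( $value as $item ) {
        if ( !is_string( $item ) ) {
            // Arrays mixing string and non-string data are ignored.
            return $value;
        }
    }
    $kept = [];
    $remaining = $maxBytes;
    foreach ( $value as $item ) {
        if ( $type === 'keyword' ) {
            // Keywords are kept whole or removed, never cut.
            if ( strlen( $item ) <= $remaining ) {
                $kept[] = $item;
                $remaining -= strlen( $item );
            }
            continue;
        }
        $cut = mb_strcut( $item, 0, $remaining, 'UTF-8' );
        if ( $cut !== '' ) {
            // Strings that become empty are dropped from the array.
            $kept[] = $cut;
            $remaining -= strlen( $cut );
        }
    }
    return $kept;
}

$text = truncateFieldSketch( [ 'abc', 'def' ], 'text', 4 );
$keywords = truncateFieldSketch( [ 'abc', 'def' ], 'keyword', 4 );
$mixed = truncateFieldSketch( [ 'abc', 42 ], 'text', 1 );
```

With a 4-byte budget, the text array keeps `abc` and a one-byte remainder of `def`, the keyword array keeps only `abc`, and the mixed array is returned unchanged.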

Member Function Documentation

◆ resize()

CirrusSearch\BuildDocument\DocumentSizeLimiter::resize ( Document $document)

Truncate some textual data from the input Document.

Parameters
Document $document
Returns
array some statistics about the process.


The documentation for this class was generated from the following file: