CirrusSearch
Elasticsearch-powered search for MediaWiki
CirrusSearch\Maintenance\AnalysisConfigBuilder Class Reference

Builds Elasticsearch analysis config arrays.

Public Member Functions

 __construct ( $langCode, array $plugins, SearchConfig $config=null, CirrusSearchHookRunner $cirrusSearchHookRunner=null)
 
 shouldActivateIcuFolding ( $language)
 Determine if ICU folding should be used.
 
 shouldActivateIcuTokenization ( $language)
 Determine if the icu tokenizer can be enabled.
 
 buildConfig ( $language=null)
 Build the analysis config.
 
 buildSimilarityConfig ()
 
 enableICUTokenizer (array $config)
 Replace the standard tokenizer with icu_tokenizer.
 
 enableICUFolding (array $config, $language)
 Activate ICU folding instead of asciifolding.
 
 fixAsciiFolding (array $config)
 Workaround for https://issues.apache.org/jira/browse/LUCENE-7468: preserve_original duplicates tokens even if they are not modified, leading to more space used and wrong term frequencies.
 
 getDefaultTextAnalyzerType ( $language)
 Pick the appropriate default analyzer based on the language.
 
 buildLanguageConfigs (array &$config, array $languages, array $analyzers)
 Create per-language configs for specific analyzers, separating and namespacing the filters that differ between languages.
 
 isIcuAvailable ()
 
 enableHomoglyphPlugin (array $config, string $language)
 Update languages with the homoglyph plugin.
 

Public Attributes

const VERSION = '0.12'
 Version number for the core analysis.
 
 $homoglyphPluginDenyList = []
 
 $homoglyphIncompatibleFilters = [ 'aggressive_splitting' ]
 

Protected Member Functions

 getICUSetFilter ( $language)
 Return the list of chars to exclude from ICU folding.
 
 getICUNormSetFilter ( $language)
 Return the list of chars to exclude from ICU normalization.
 

Protected Attributes

 $config
 
 $defaultLanguage
 

Detailed Description

Builds Elasticsearch analysis config arrays.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. http://www.gnu.org/copyleft/gpl.html

Constructor & Destructor Documentation

◆ __construct()

CirrusSearch\Maintenance\AnalysisConfigBuilder::__construct ( $langCode,
array $plugins,
SearchConfig $config = null,
CirrusSearchHookRunner $cirrusSearchHookRunner = null )
Parameters
string $langCode   The language code to build config for
string[] $plugins   list of plugins installed in Elasticsearch
SearchConfig|null $config
CirrusSearchHookRunner|null $cirrusSearchHookRunner

Member Function Documentation

◆ buildConfig()

CirrusSearch\Maintenance\AnalysisConfigBuilder::buildConfig ( $language = null)

Build the analysis config.

Parameters
string|null $language   Config language
Returns
array the analysis config

Reimplemented in CirrusSearch\Maintenance\SuggesterAnalysisConfigBuilder.

◆ buildLanguageConfigs()

CirrusSearch\Maintenance\AnalysisConfigBuilder::buildLanguageConfigs ( array & $config,
array $languages,
array $analyzers )

Create per-language configs for specific analyzers, separating and namespacing the filters that differ between languages.

Parameters
array &$config   Existing config, will be modified
string[] $languages   List of languages to process
string[] $analyzers   List of analyzers to process
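
The namespacing idea can be illustrated with a minimal sketch. This is not the CirrusSearch implementation: `addLanguageAnalyzer()` is a hypothetical helper, and both the `{lang}_{name}` naming scheme and the layout of the per-language config array are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical sketch: merge one analyzer from a per-language analysis
 * config into a shared config. A filter whose definition differs from the
 * one already present under the same name is stored under a
 * language-prefixed name so the two definitions do not collide.
 */
function addLanguageAnalyzer( array &$config, string $lang, array $langConfig, string $analyzerName ): void {
	if ( !isset( $langConfig['analyzer'][$analyzerName] ) ) {
		return; // nothing to merge for this language
	}
	$analyzer = $langConfig['analyzer'][$analyzerName];
	foreach ( $analyzer['filter'] ?? [] as $i => $filterName ) {
		if ( !isset( $langConfig['filter'][$filterName] ) ) {
			continue; // built-in filter such as "lowercase": no definition to copy
		}
		$definition = $langConfig['filter'][$filterName];
		$target = $filterName;
		if ( isset( $config['filter'][$filterName] ) && $config['filter'][$filterName] != $definition ) {
			// Same name, different definition: namespace it by language.
			$target = $lang . '_' . $filterName;
		}
		$config['filter'][$target] = $definition;
		$analyzer['filter'][$i] = $target;
	}
	// The analyzer itself is always namespaced (assumed convention).
	$config['analyzer'][$lang . '_' . $analyzerName] = $analyzer;
}
```

Filters that are identical across languages keep their shared name in this sketch, so only the filters that actually differ are duplicated.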

◆ buildSimilarityConfig()

CirrusSearch\Maintenance\AnalysisConfigBuilder::buildSimilarityConfig ( )
Returns
array|null the similarity config

◆ enableHomoglyphPlugin()

CirrusSearch\Maintenance\AnalysisConfigBuilder::enableHomoglyphPlugin ( array $config,
string $language )

Update languages with the homoglyph plugin.

Parameters
mixed[] $config
string $language   language to add plugin to
Returns
mixed[] updated config

◆ enableICUFolding()

CirrusSearch\Maintenance\AnalysisConfigBuilder::enableICUFolding ( array $config,
$language )

Activate ICU folding instead of asciifolding.

Parameters
mixed[] $config
string $language   Config language
Returns
mixed[] updated config
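
As a rough illustration of swapping asciifolding for ICU folding, the sketch below rewrites filter definitions in a plain analysis config array. It is not the CirrusSearch implementation: `swapAsciiFoldingForIcu()` is a hypothetical helper, and it assumes the excluded characters are passed as a UnicodeSet pattern through the `unicode_set_filter` option of the `icu_folding` filter from the Elasticsearch analysis-icu plugin.

```php
<?php
/**
 * Hypothetical sketch: replace every filter of type "asciifolding" with an
 * "icu_folding" filter. $unicodeSet is an optional UnicodeSet pattern
 * (for example "[^åäöÅÄÖ]") limiting which characters are folded.
 * Options of the original filter, such as preserve_original, are dropped
 * in this simplified version.
 */
function swapAsciiFoldingForIcu( array $config, ?string $unicodeSet = null ): array {
	foreach ( $config['filter'] ?? [] as $name => $filter ) {
		if ( ( $filter['type'] ?? '' ) !== 'asciifolding' ) {
			continue;
		}
		$replacement = [ 'type' => 'icu_folding' ];
		if ( $unicodeSet !== null ) {
			$replacement['unicode_set_filter'] = $unicodeSet;
		}
		// Keep the filter name so analyzers referencing it need no change.
		$config['filter'][$name] = $replacement;
	}
	return $config;
}
```

Keeping the original filter name means the analyzers' filter chains stay untouched; only the definition behind the name changes.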

◆ enableICUTokenizer()

CirrusSearch\Maintenance\AnalysisConfigBuilder::enableICUTokenizer ( array $config)

Replace the standard tokenizer with icu_tokenizer.

Parameters
mixed[] $config
Returns
mixed[] updated config
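
A minimal sketch of the tokenizer swap on a plain analysis config array follows. `swapStandardTokenizer()` is a hypothetical name, and the assumption that only analyzers using exactly the `standard` tokenizer are touched is made for illustration; the real method may apply further conditions.

```php
<?php
/**
 * Hypothetical sketch: point every analyzer that uses the "standard"
 * tokenizer at "icu_tokenizer" instead. Analyzers with any other tokenizer
 * are left alone.
 */
function swapStandardTokenizer( array $config ): array {
	foreach ( $config['analyzer'] ?? [] as $name => $analyzer ) {
		if ( ( $analyzer['tokenizer'] ?? null ) === 'standard' ) {
			$config['analyzer'][$name]['tokenizer'] = 'icu_tokenizer';
		}
	}
	return $config;
}
```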

◆ fixAsciiFolding()

CirrusSearch\Maintenance\AnalysisConfigBuilder::fixAsciiFolding ( array $config)

Workaround for https://issues.apache.org/jira/browse/LUCENE-7468: preserve_original duplicates tokens even if they are not modified, leading to more space used and wrong term frequencies.

The workaround is to append a unique filter that removes the duplicates. (Made public for unit tests.)

Parameters
mixed[] $config
Returns
mixed[] updated config
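
The workaround described above can be sketched as follows. This is an illustration under stated assumptions, not the CirrusSearch code: `appendDedupFilter()` and the filter name `dedup_asciifolding` are hypothetical, and the sketch relies on the Elasticsearch `unique` token filter with its `only_on_same_position` option.

```php
<?php
/**
 * Hypothetical sketch: after every asciifolding filter that sets
 * preserve_original, insert a "unique" filter so that a token emitted twice
 * at the same position (original and unchanged folded form) is kept once.
 */
function appendDedupFilter( array $config ): array {
	$dedupName = 'dedup_asciifolding'; // assumed name, chosen for this example

	// Collect the names of asciifolding filters that preserve the original.
	$preserving = [];
	foreach ( $config['filter'] ?? [] as $name => $filter ) {
		if ( ( $filter['type'] ?? '' ) === 'asciifolding'
			&& ( $filter['preserve_original'] ?? false ) === true
		) {
			$preserving[$name] = true;
		}
	}

	$used = false;
	foreach ( $config['analyzer'] ?? [] as $analyzerName => $analyzer ) {
		if ( !isset( $analyzer['filter'] ) ) {
			continue;
		}
		$rewritten = [];
		foreach ( $analyzer['filter'] as $filterName ) {
			$rewritten[] = $filterName;
			if ( isset( $preserving[$filterName] ) ) {
				$rewritten[] = $dedupName; // dedup right after the folding step
				$used = true;
			}
		}
		$config['analyzer'][$analyzerName]['filter'] = $rewritten;
	}

	if ( $used ) {
		// Define the dedup filter only when some analyzer references it.
		$config['filter'][$dedupName] = [ 'type' => 'unique', 'only_on_same_position' => true ];
	}
	return $config;
}
```

Restricting deduplication to tokens at the same position leaves genuine repetitions elsewhere in the text untouched.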

◆ getDefaultTextAnalyzerType()

CirrusSearch\Maintenance\AnalysisConfigBuilder::getDefaultTextAnalyzerType ( $language)

Pick the appropriate default analyzer based on the language.

Rather than thinking of this as per-language customization, think of it as an effort to pick a reasonable default in case CirrusSearch isn't customized for the language.

Parameters
string $language   Config language
Returns
string the analyzer type

◆ getICUNormSetFilter()

CirrusSearch\Maintenance\AnalysisConfigBuilder::getICUNormSetFilter ( $language)
protected

Return the list of chars to exclude from ICU normalization.

Parameters
string $language   Config language
Returns
null|string

◆ getICUSetFilter()

CirrusSearch\Maintenance\AnalysisConfigBuilder::getICUSetFilter ( $language)
protected

Return the list of chars to exclude from ICU folding.

Parameters
string $language   Config language
Returns
null|string

◆ isIcuAvailable()

CirrusSearch\Maintenance\AnalysisConfigBuilder::isIcuAvailable ( )
Returns
bool true if the icu analyzer is available.

◆ shouldActivateIcuFolding()

CirrusSearch\Maintenance\AnalysisConfigBuilder::shouldActivateIcuFolding ( $language)

Determine if ICU folding should be used.

Parameters
string $language   Config language
Returns
bool true if icu folding should be enabled

◆ shouldActivateIcuTokenization()

CirrusSearch\Maintenance\AnalysisConfigBuilder::shouldActivateIcuTokenization ( $language)

Determine if the icu tokenizer can be enabled.

Parameters
string $language   Config language
Returns
bool

Member Data Documentation

◆ VERSION

const CirrusSearch\Maintenance\AnalysisConfigBuilder::VERSION = '0.12'

Version number for the core analysis.

Increment the major version when the analysis changes in an incompatible way, and increment the minor version when it changes in a compatible way.

You may also need to increment MetaStoreIndex::METASTORE_VERSION manually.
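
One way to read the major/minor convention is sketched below. `classifyAnalysisChange()` is a hypothetical helper, not part of CirrusSearch, and it assumes version strings of the form `major.minor`, such as '0.12', where both parts are compared as integers (so '0.12' is newer than '0.9').

```php
<?php
/**
 * Hypothetical sketch: classify the difference between a stored analysis
 * version and the current one.
 *
 * @return string 'same', 'compatible' (minor differs) or 'incompatible' (major differs)
 */
function classifyAnalysisChange( string $stored, string $current ): string {
	// Split into [major, minor]; a missing minor part is treated as 0.
	[ $storedMajor, $storedMinor ] = array_map( 'intval', array_pad( explode( '.', $stored, 2 ), 2, '0' ) );
	[ $currentMajor, $currentMinor ] = array_map( 'intval', array_pad( explode( '.', $current, 2 ), 2, '0' ) );

	if ( $storedMajor !== $currentMajor ) {
		return 'incompatible';
	}
	if ( $storedMinor !== $currentMinor ) {
		return 'compatible';
	}
	return 'same';
}
```

Comparing the parts as integers rather than comparing the whole string as a decimal avoids treating '0.12' as smaller than '0.9'.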

