CirrusSearch
Elasticsearch-powered search for MediaWiki
Builds Elasticsearch analysis config arrays.
Public Member Functions
__construct ( $langCode, array $plugins, SearchConfig $config=null, CirrusSearchHookRunner $cirrusSearchHookRunner=null)
shouldActivateIcuFolding ( $language)
Determine if ICU folding should be used in place of ASCII folding.
shouldActivateIcuTokenization ( $language)
Determine if the ICU tokenizer can be enabled.
buildConfig ( $language=null)
Build the analysis config.
buildSimilarityConfig ()
enableICUTokenizer (array $config)
Replace the standard tokenizer with icu_tokenizer.
enableICUFolding (array $config, $language)
Activate ICU folding instead of asciifolding.
fixAsciiFolding (array $config)
Workaround for https://issues.apache.org/jira/browse/LUCENE-7468: the preserve_original option duplicates tokens even when they are not modified, which uses more space and produces wrong term frequencies.
getDefaultTextAnalyzerType ( $language)
Pick the appropriate default analyzer based on the language.
buildLanguageConfigs (array &$config, array $languages, array $analyzers)
Create per-language configs for the given analyzers, separating and namespacing the filters that differ between languages.
isIcuAvailable ()
enableHomoglyphPlugin (array $config, string $language)
Update languages with the homoglyph plugin.
Public Attributes
const VERSION = '0.12'
Version number for the core analysis.
$homoglyphPluginDenyList = []
$homoglyphIncompatibleFilters = [ 'aggressive_splitting' ]
Protected Member Functions
getICUSetFilter ( $language)
Return the list of chars to exclude from ICU folding.
getICUNormSetFilter ( $language)
Return the list of chars to exclude from ICU normalization.
Protected Attributes
$config
$defaultLanguage
Builds Elasticsearch analysis config arrays.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. http://www.gnu.org/copyleft/gpl.html
CirrusSearch\Maintenance\AnalysisConfigBuilder::__construct ( $langCode, array $plugins, SearchConfig $config = null, CirrusSearchHookRunner $cirrusSearchHookRunner = null )
string $langCode: The language code to build config for
string[] $plugins: List of plugins installed in Elasticsearch
SearchConfig|null $config
CirrusSearchHookRunner|null $cirrusSearchHookRunner
CirrusSearch\Maintenance\AnalysisConfigBuilder::buildConfig ( $language = null )
Build the analysis config.
string|null $language: Config language
Reimplemented in CirrusSearch\Maintenance\SuggesterAnalysisConfigBuilder.
CirrusSearch\Maintenance\AnalysisConfigBuilder::buildLanguageConfigs ( array &$config, array $languages, array $analyzers )
Create per-language configs for the given analyzers, separating and namespacing the filters that differ between languages.
array &$config: Existing config, will be modified
string[] $languages: List of languages to process
string[] $analyzers: List of analyzers to process
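The namespacing idea can be illustrated with a small standalone sketch. It is not the CirrusSearch implementation: the function name `namespaceLanguageAnalyzers`, the `$perLanguage` input (a map of language code to an already built analysis config), and the `{lang}_{name}` naming scheme are all assumptions made for illustration. The real method takes a list of language codes and builds the per-language configs itself.

```php
<?php
/**
 * Hypothetical sketch: merge per-language analyzers into a shared config,
 * prefixing only those filters whose definition differs between languages.
 *
 * @param array &$config Shared config, modified in place
 * @param array $perLanguage Assumed input: language code => analysis config
 * @param string[] $analyzers Analyzer names to copy for each language
 */
function namespaceLanguageAnalyzers( array &$config, array $perLanguage, array $analyzers ): void {
    foreach ( $perLanguage as $lang => $langConfig ) {
        foreach ( $analyzers as $analyzerName ) {
            if ( !isset( $langConfig['analyzer'][$analyzerName] ) ) {
                continue;
            }
            $analyzer = $langConfig['analyzer'][$analyzerName];
            foreach ( $analyzer['filter'] ?? [] as $i => $filterName ) {
                $definition = $langConfig['filter'][$filterName] ?? null;
                if ( $definition === null ) {
                    // No custom definition: treat it as a built-in filter.
                    continue;
                }
                if ( !isset( $config['filter'][$filterName] ) ) {
                    // First definition seen under this name: share it as-is.
                    $config['filter'][$filterName] = $definition;
                } elseif ( $config['filter'][$filterName] !== $definition ) {
                    // Same name, different definition: namespace it.
                    $namespaced = "{$lang}_{$filterName}";
                    $config['filter'][$namespaced] = $definition;
                    $analyzer['filter'][$i] = $namespaced;
                }
            }
            // Analyzers are always stored under a language-prefixed name.
            $config['analyzer']["{$lang}_{$analyzerName}"] = $analyzer;
        }
    }
}
```

With this scheme, filters that are identical across languages are stored once, and only the conflicting ones get a language prefix.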
CirrusSearch\Maintenance\AnalysisConfigBuilder::buildSimilarityConfig ()
CirrusSearch\Maintenance\AnalysisConfigBuilder::enableHomoglyphPlugin ( array $config, string $language )
Update languages with the homoglyph plugin.
mixed[] $config
string $language: Language to add the plugin to
CirrusSearch\Maintenance\AnalysisConfigBuilder::enableICUFolding ( array $config, $language )
Activate ICU folding instead of asciifolding.
mixed[] $config
string $language: Config language
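As a rough illustration of what swapping asciifolding for ICU folding can look like on a config array, here is a minimal standalone sketch. It is not the CirrusSearch implementation: `swapAsciiFoldingForIcu` and its `$unicodeSetFilter` argument are hypothetical. The sketch only assumes the Elasticsearch `icu_folding` token filter from the analysis-icu plugin and its `unicode_set_filter` option, which plays the role of an exclusion list such as the one getICUSetFilter returns.

```php
<?php
/**
 * Hypothetical sketch: replace custom asciifolding filter definitions
 * with icu_folding definitions of the same name.
 *
 * @param array $config Analysis config
 * @param string|null $unicodeSetFilter Optional UnicodeSet limiting which
 *  characters are folded, e.g. '[^åäöÅÄÖ]' to leave those letters alone
 * @return array Updated config
 */
function swapAsciiFoldingForIcu( array $config, ?string $unicodeSetFilter = null ): array {
    foreach ( $config['filter'] ?? [] as $name => $filter ) {
        if ( ( $filter['type'] ?? '' ) !== 'asciifolding' ) {
            continue;
        }
        // icu_folding has no preserve_original option, so that setting
        // is dropped in this simplified version.
        $replacement = [ 'type' => 'icu_folding' ];
        if ( $unicodeSetFilter !== null ) {
            $replacement['unicode_set_filter'] = $unicodeSetFilter;
        }
        // Keeping the filter name means analyzers need no changes.
        $config['filter'][$name] = $replacement;
    }
    return $config;
}
```

Analyzers that reference the built-in `asciifolding` filter directly by name, rather than through a custom definition, are not touched by this sketch.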
CirrusSearch\Maintenance\AnalysisConfigBuilder::enableICUTokenizer ( array $config )
Replace the standard tokenizer with icu_tokenizer.
mixed[] $config
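The replacement can be pictured as a simple pass over the analyzers in the config. The sketch below is a standalone illustration under that assumption, not the actual method body. The function name `swapStandardForIcuTokenizer` is hypothetical, and the only external facts it relies on are the Elasticsearch tokenizer names `standard` and `icu_tokenizer`.

```php
<?php
/**
 * Hypothetical sketch: point every analyzer that uses the standard
 * tokenizer at icu_tokenizer instead.
 *
 * @param array $config Analysis config
 * @return array Updated config
 */
function swapStandardForIcuTokenizer( array $config ): array {
    foreach ( $config['analyzer'] ?? [] as $name => $analyzer ) {
        if ( ( $analyzer['tokenizer'] ?? null ) === 'standard' ) {
            $config['analyzer'][$name]['tokenizer'] = 'icu_tokenizer';
        }
    }
    return $config;
}
```

Analyzers with any other tokenizer are left unchanged.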
CirrusSearch\Maintenance\AnalysisConfigBuilder::fixAsciiFolding ( array $config )
Workaround for https://issues.apache.org/jira/browse/LUCENE-7468: the preserve_original option duplicates tokens even when they are not modified, which uses more space and produces wrong term frequencies.
The workaround is to append a unique filter to remove the duplicates. (Made public for unit tests.)
mixed[] $config
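A minimal standalone sketch of the workaround described above follows. It is an illustration, not the code CirrusSearch ships: the function name `appendUniqueAfterFolding` and the filter name `dedup_asciifolding` are hypothetical. It assumes the Elasticsearch `unique` token filter with `only_on_same_position` set, so that only the duplicate emitted at the same position is removed.

```php
<?php
/**
 * Hypothetical sketch: insert a unique filter right after every
 * asciifolding filter that has preserve_original enabled.
 *
 * @param array $config Analysis config
 * @return array Updated config
 */
function appendUniqueAfterFolding( array $config ): array {
    // Find custom asciifolding filters that keep the original token.
    $preserving = [];
    foreach ( $config['filter'] ?? [] as $name => $filter ) {
        if ( ( $filter['type'] ?? '' ) === 'asciifolding'
            && ( $filter['preserve_original'] ?? false )
        ) {
            $preserving[$name] = true;
        }
    }

    $used = false;
    foreach ( $config['analyzer'] ?? [] as $name => $analyzer ) {
        if ( !isset( $analyzer['filter'] ) ) {
            continue;
        }
        $chain = [];
        foreach ( $analyzer['filter'] as $filterName ) {
            $chain[] = $filterName;
            if ( isset( $preserving[$filterName] ) ) {
                // Deduplicate immediately after the folding step.
                $chain[] = 'dedup_asciifolding';
                $used = true;
            }
        }
        $config['analyzer'][$name]['filter'] = $chain;
    }

    if ( $used ) {
        $config['filter']['dedup_asciifolding'] = [
            'type' => 'unique',
            'only_on_same_position' => true,
        ];
    }
    return $config;
}
```

Placing the unique filter directly after the folding filter keeps later filters in the chain from processing the duplicate token.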
CirrusSearch\Maintenance\AnalysisConfigBuilder::getDefaultTextAnalyzerType ( $language )
Pick the appropriate default analyzer based on the language.
Rather than thinking of this as per-language customization, think of it as an effort to pick a reasonable default in case CirrusSearch isn't customized for the language.
string $language: Config language
protected CirrusSearch\Maintenance\AnalysisConfigBuilder::getICUNormSetFilter ( $language )
Return the list of chars to exclude from ICU normalization.
string $language: Config language
protected CirrusSearch\Maintenance\AnalysisConfigBuilder::getICUSetFilter ( $language )
Return the list of chars to exclude from ICU folding.
string $language: Config language
CirrusSearch\Maintenance\AnalysisConfigBuilder::isIcuAvailable ()
CirrusSearch\Maintenance\AnalysisConfigBuilder::shouldActivateIcuFolding ( $language )
Determine if ICU folding should be used in place of ASCII folding.
string $language: Config language
CirrusSearch\Maintenance\AnalysisConfigBuilder::shouldActivateIcuTokenization ( $language )
Determine if the ICU tokenizer can be enabled.
string $language: Config language
const CirrusSearch\Maintenance\AnalysisConfigBuilder::VERSION = '0.12'
Version number for the core analysis.
Increment the major version when the analysis changes in an incompatible way, and the minor version when it changes in a compatible way.
You may also need to increment MetaStoreIndex::METASTORE_VERSION manually.
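The major/minor convention implies a simple compatibility check: two analysis versions are compatible when their major parts match. The sketch below shows that reading of the rule. The function `isAnalysisCompatible` is hypothetical, and how CirrusSearch itself compares stored and current versions is not stated on this page.

```php
<?php
/**
 * Hypothetical sketch: treat two "major.minor" version strings as
 * compatible when their major components are equal.
 *
 * @param string $stored Version recorded when the index was built
 * @param string $current Version of the running code, e.g. '0.12'
 * @return bool
 */
function isAnalysisCompatible( string $stored, string $current ): bool {
    // Only the part before the first dot matters for compatibility.
    [ $storedMajor ] = explode( '.', $stored, 2 );
    [ $currentMajor ] = explode( '.', $current, 2 );
    return (int)$storedMajor === (int)$currentMajor;
}
```

Under this reading, an index built with '0.11' would remain usable with '0.12', while a move to '1.0' would signal an incompatible change.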