MediaWiki
1.23.0
Unicode normalization routines for working with UTF-8 strings.
Static Public Member Functions
static cleanUp ( $string )
    The ultimate convenience function! Clean up invalid UTF-8 sequences, and convert to normal form C, canonical composition.
static loadData ()
    Load the basic composition data if necessary.
static placebo ( $string )
    This is just used for the benchmark, comparing how long it takes to iterate through a string without really doing anything of substance.
static quickIsNFC ( $string )
    Returns true if the string is definitely in NFC.
static quickIsNFCVerify ( &$string )
    Returns true if the string is definitely in NFC.
static toNFC ( $string )
    Convert a UTF-8 string to normal form C, canonical composition.
static toNFD ( $string )
    Convert a UTF-8 string to normal form D, canonical decomposition.
static toNFKC ( $string )
    Convert a UTF-8 string to normal form KC, compatibility composition.
static toNFKD ( $string )
    Convert a UTF-8 string to normal form KD, compatibility decomposition.
Public Attributes
const UNORM_DEFAULT = self::UNORM_NFC
const UNORM_FCD = 6
const UNORM_NFC = 4
const UNORM_NFD = 2
const UNORM_NFKC = 5
const UNORM_NFKD = 3
const UNORM_NONE = 1
    For using the ICU wrapper.
Static Public Attributes
static $utfCanonicalComp = null
static $utfCanonicalDecomp = null
static $utfCheckNFC
static $utfCombiningClass = null
static $utfCompatibilityDecomp = null
Static Private Member Functions
static fastCombiningSort ( $string )
    Sorts combining characters into canonical order.
static fastCompose ( $string )
    Produces canonically composed sequences, i.e. normal form C or KC.
static fastDecompose ( $string, $map )
    Perform decomposition of a UTF-8 string into either D or KD form (depending on which decomposition map is passed to us).
static NFC ( $string )
static NFD ( $string )
static NFKC ( $string )
static NFKD ( $string )
static replaceForNativeNormalize ( $string )
    Function to replace some characters that we don't want but most of the native normalize functions keep.
Unicode normalization routines for working with UTF-8 strings.
Currently assumes that input strings are valid UTF-8!
Not as fast as I'd like, but should be usable for most purposes. UtfNormal::toNFC() will bail early if given ASCII text or text it can quickly determine is already normalized.
All functions can be called statically.
See description of forms at http://www.unicode.org/reports/tr15/
Definition at line 48 of file UtfNormal.php.
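The four normal forms named above can be compared side by side. The following is a minimal sketch that uses Python's standard `unicodedata` module as a stand-in, not UtfNormal itself; the helper name `normalize_all` and the sample string are illustrative assumptions.

```python
import unicodedata

def normalize_all(text):
    """Return the text in each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}

# 'e' + COMBINING ACUTE ACCENT (U+0301), followed by the 'fi' ligature (U+FB01).
sample = "e\u0301\ufb01"
forms = normalize_all(sample)

for name, value in forms.items():
    # Show each result as a list of code points so differences are visible.
    print(name, [f"U+{ord(ch):04X}" for ch in value])
```

The canonical forms (NFC, NFD) keep the ligature and only compose or decompose the accented letter. The compatibility forms (NFKC, NFKD) also replace the ligature with the two letters `f` and `i`.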
static UtfNormal::cleanUp ( $string )
The ultimate convenience function! Clean up invalid UTF-8 sequences, and convert to normal form C, canonical composition.
Fast return for pure ASCII strings; some lesser optimizations for strings containing only known-good characters. Not as fast as toNFC().
Parameters:
    string $string: a UTF-8 string
Definition at line 79 of file UtfNormal.php.
References NFC(), NORMALIZE_ICU, NORMALIZE_INTL, quickIsNFCVerify(), and replaceForNativeNormalize().
Referenced by MWDebug\debugMsg(), CleanUpTest\doTestBytes(), CleanUpTest\doTestDoubleBytes(), CleanUpTest\doTestTripleBytes(), FeedUtils\formatDiffRow(), Language\normalize(), MediaWikiSite\normalizePageName(), WebRequest\normalizeUnicode(), Preprocessor_DOM\preprocessToObj(), CleanUpTest\testAscii(), CleanUpTest\testBomRegression(), CleanUpTest\testChunkRegression(), CleanUpTest\testForbiddenRegression(), CleanUpTest\testHangulRegression(), CleanUpTest\testInterposeRegression(), CleanUpTest\testLatin(), CleanUpTest\testLatinNormal(), CleanUpTest\testNull(), CleanUpTest\testOverlongRegression(), CleanUpTest\testSurrogateRegression(), xmlsafe(), and CleanUpTest\XtestAllChars().
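The behaviour described for cleanUp() (ASCII fast path, repair of invalid sequences, then composition to NFC) can be approximated as follows. This is a sketch in Python under stated assumptions, not the PHP implementation: the function name `clean_up` is hypothetical, and the exact characters UtfNormal substitutes for invalid input may differ from what Python's `errors="replace"` decoder produces.

```python
import unicodedata

def clean_up(raw: bytes) -> str:
    """Decode possibly invalid UTF-8 and return it in normal form C."""
    # Fast path: pure ASCII is always valid UTF-8 and already in NFC.
    if raw.isascii():
        return raw.decode("ascii")
    # Assumption: invalid byte sequences are replaced with U+FFFD,
    # which is what Python's "replace" error handler does.
    text = raw.decode("utf-8", errors="replace")
    return unicodedata.normalize("NFC", text)

print(clean_up(b"plain ascii"))
print(repr(clean_up(b"a\xffb")))              # stray 0xFF byte is not valid UTF-8
print(repr(clean_up("e\u0301".encode("utf-8"))))
```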
static private UtfNormal::fastCombiningSort ( $string )
Sorts combining characters into canonical order.
This is the final step in creating decomposed normal forms D and KD.
Parameters:
    string $string: a valid, decomposed UTF-8 string. Input is not validated.
Definition at line 570 of file UtfNormal.php.
References $n, $out, array(), and loadData().
Referenced by NFD().
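Canonical ordering, as performed in this step, amounts to a stable sort of each run of combining marks by canonical combining class. The sketch below illustrates the idea in Python; `canonical_order` is a hypothetical name, and the combining classes come from Python's `unicodedata` tables rather than from the data UtfNormal loads.

```python
import unicodedata

def canonical_order(text):
    """Stable-sort each run of combining marks by canonical combining class."""
    out = []
    run = []  # pending combining marks (combining class > 0)
    for ch in text:
        if unicodedata.combining(ch) > 0:
            run.append(ch)
        else:
            # A starter (class 0) ends the current run of marks.
            run.sort(key=unicodedata.combining)  # list.sort is stable
            out.extend(run)
            run = []
            out.append(ch)
    run.sort(key=unicodedata.combining)
    out.extend(run)
    return "".join(out)

# U+0301 (acute, class 230) is typed before U+0323 (dot below, class 220).
reordered = canonical_order("a\u0301\u0323")
print([f"U+{ord(ch):04X}" for ch in reordered])
```

The sort must be stable because marks with equal combining class are not allowed to change their relative order.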
static private UtfNormal::fastCompose ( $string )
Produces canonically composed sequences, i.e. normal form C or KC.
Parameters:
    string $string: a valid UTF-8 string in sorted normal form D or KD. Input is not validated.
Definition at line 622 of file UtfNormal.php.
References $n, $out, loadData(), UNICODE_HANGUL_FIRST, UNICODE_HANGUL_TCOUNT, UNICODE_HANGUL_VCOUNT, UTF8_HANGUL_FIRST, UTF8_HANGUL_LAST, UTF8_HANGUL_LBASE, UTF8_HANGUL_LEND, UTF8_HANGUL_TBASE, UTF8_HANGUL_TEND, UTF8_HANGUL_VBASE, and UTF8_HANGUL_VEND.
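The Hangul constants referenced above suggest that syllables are composed arithmetically rather than by table lookup. The sketch below shows the Hangul composition arithmetic from the Unicode Standard in Python. The constant names follow the standard (`SBase`, `LBase`, and so on), and their correspondence to the `UNICODE_HANGUL_*` and `UTF8_HANGUL_*` constants is an assumption. The table-driven composition of other character pairs is not covered.

```python
# Constants from the Unicode Standard's Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT  # 11172 precomposed syllables

def compose_hangul(text):
    """Compose conjoining jamo (L+V, LV+T) into precomposed syllables."""
    out = []
    for ch in text:
        cp = ord(ch)
        if out:
            last = ord(out[-1])
            # Leading consonant + vowel -> LV syllable.
            if (L_BASE <= last < L_BASE + L_COUNT
                    and V_BASE <= cp < V_BASE + V_COUNT):
                l_index = last - L_BASE
                v_index = cp - V_BASE
                out[-1] = chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)
                continue
            # LV syllable + trailing consonant -> LVT syllable.
            s_index = last - S_BASE
            if (0 <= s_index < S_COUNT and s_index % T_COUNT == 0
                    and T_BASE < cp < T_BASE + T_COUNT):
                out[-1] = chr(last + (cp - T_BASE))
                continue
        out.append(ch)
    return "".join(out)

syllable = compose_hangul("\u1112\u1161\u11ab")
print(f"U+{ord(syllable):04X}")
```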
static private UtfNormal::fastDecompose ( $string, $map )
Perform decomposition of a UTF-8 string into either D or KD form (depending on which decomposition map is passed to us).
Input is assumed to be valid UTF-8. Invalid code will break.
Parameters:
    string $string: valid UTF-8 string
    array $map: hash of expanded decomposition map
Definition at line 510 of file UtfNormal.php.
References $n, $out, $t, loadData(), UNICODE_HANGUL_FIRST, UNICODE_HANGUL_NCOUNT, UNICODE_HANGUL_TCOUNT, UTF8_HANGUL_FIRST, and UTF8_HANGUL_LAST.
Referenced by NFD().
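A map-driven decomposition with arithmetic handling of Hangul can be sketched as below. This is Python, not the PHP source: `decompose` and `TOY_MAP` are hypothetical names, the one-entry map stands in for the real decomposition data, and the Hangul constants are taken from the Unicode Standard rather than from UtfNormal's own definitions. Because the map is described as expanded, a single lookup per character is assumed to be enough.

```python
# Constants from the Unicode Standard's Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
T_COUNT, N_COUNT, S_COUNT = 28, 588, 11172

def decompose(text, mapping):
    """Decompose using a fully expanded map; Hangul is handled arithmetically."""
    out = []
    for ch in text:
        s_index = ord(ch) - S_BASE
        if 0 <= s_index < S_COUNT:
            # Precomposed Hangul syllable: split into L, V and optional T jamo.
            out.append(chr(L_BASE + s_index // N_COUNT))
            out.append(chr(V_BASE + (s_index % N_COUNT) // T_COUNT))
            t_index = s_index % T_COUNT
            if t_index:
                out.append(chr(T_BASE + t_index))
        else:
            # Any other character: one lookup, falling back to the character itself.
            out.append(mapping.get(ch, ch))
    return "".join(out)

# Toy map for illustration only; the real data covers far more characters.
TOY_MAP = {"\u00e9": "e\u0301"}

result = decompose("\ud55c\u00e9", TOY_MAP)
print([f"U+{ord(ch):04X}" for ch in result])
```

Passing a canonical map would yield form D and a compatibility map form KD, once the result has also been put into canonical order.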
static UtfNormal::loadData ( )
Load the basic composition data if necessary.
Definition at line 191 of file UtfNormal.php.
Referenced by fastCombiningSort(), fastCompose(), fastDecompose(), NFD(), quickIsNFC(), and quickIsNFCVerify().
static private UtfNormal::NFC ( $string )
Parameters:
    string $string
Definition at line 462 of file UtfNormal.php.
References fastCompose(), and NFD().
Referenced by cleanUp(), CleanUpTest\doTestDoubleBytes(), CleanUpTest\doTestTripleBytes(), toNFC(), and CleanUpTest\XtestAllChars().
static private UtfNormal::NFD ( $string )
Parameters:
    string $string
Definition at line 471 of file UtfNormal.php.
References fastCombiningSort(), fastDecompose(), and loadData().
static private UtfNormal::NFKC ( $string )
Parameters:
    string $string
Definition at line 483 of file UtfNormal.php.
References fastCompose(), and NFKD().
Referenced by toNFKC().
static private UtfNormal::NFKD ( $string )
static UtfNormal::placebo ( $string )
This is just used for the benchmark, comparing how long it takes to iterate through a string without really doing anything of substance.
Parameters:
    string $string
Definition at line 753 of file UtfNormal.php.
References $out.
static UtfNormal::quickIsNFC ( $string )
Returns true if the string is definitely in NFC.
Returns false if not or uncertain.
Parameters:
    string $string: a valid UTF-8 string. Input is not validated.
Definition at line 203 of file UtfNormal.php.
References $n, and loadData().
Referenced by toNFC().
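The "definitely in NFC, otherwise uncertain" contract can be illustrated with a deliberately conservative check. The sketch below is in Python and is not the check UtfNormal performs, which consults its loaded data tables. It relies only on the fact that every code point below U+0300 is a starter that is unchanged by NFC. The names `quick_is_nfc` and `to_nfc` are hypothetical.

```python
import unicodedata

def quick_is_nfc(text):
    """Return True only when the text is certainly in NFC; False means unknown."""
    # Code points below U+0300 are never altered by NFC, so a string made
    # up entirely of them is already normalized.
    return all(ord(ch) < 0x300 for ch in text)

def to_nfc(text):
    """Normalize to NFC, skipping the work when the quick check succeeds."""
    if quick_is_nfc(text):
        return text
    return unicodedata.normalize("NFC", text)

print(quick_is_nfc("caf\u00e9"))   # precomposed e-acute, below U+0300
print(quick_is_nfc("e\u0301"))     # combining mark: needs the full pass
print(quick_is_nfc("\u4e2d"))      # already NFC, but this check cannot tell
```

A false result is therefore not an error; it only means the caller has to run the full normalization.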
static UtfNormal::quickIsNFCVerify ( &$string )
Returns true if the string is definitely in NFC.
Returns false if not or uncertain.
Parameters:
    string $string: a UTF-8 string, altered on output to be valid UTF-8 safe for XML.
Definition at line 243 of file UtfNormal.php.
References $matches, $n, array(), loadData(), UTF8_FFFE, UTF8_FFFF, UTF8_MAX, UTF8_OVERLONG_A, UTF8_OVERLONG_B, UTF8_OVERLONG_C, UTF8_REPLACEMENT, and UTF8_SURROGATE_FIRST.
Referenced by Exif\charCodeString(), cleanUp(), IPTC\convIPTCHelper(), GIFMetadataExtractor\getMetadata(), and JpegMetadataExtractor\segmentSplitter().
static private UtfNormal::replaceForNativeNormalize ( $string )
Function to replace some characters that we don't want but most of the native normalize functions keep.
Parameters:
    string $string: The string
Definition at line 768 of file UtfNormal.php.
References UTF8_FFFE, UTF8_FFFF, and UTF8_REPLACEMENT.
Referenced by cleanUp().
static UtfNormal::toNFC ( $string )
Convert a UTF-8 string to normal form C, canonical composition.
Fast return for pure ASCII strings; some lesser optimizations for strings containing only known-good characters.
Parameters:
    string $string: a valid UTF-8 string. Input is not validated.
Definition at line 120 of file UtfNormal.php.
References NFC(), NORMALIZE_ICU, NORMALIZE_INTL, and quickIsNFC().
Referenced by normalize_form_c(), and normalize_form_c_php().
static UtfNormal::toNFD ( $string )
Convert a UTF-8 string to normal form D, canonical decomposition.
Fast return for pure ASCII strings.
Parameters:
    string $string: a valid UTF-8 string. Input is not validated.
Definition at line 138 of file UtfNormal.php.
References NFD(), NORMALIZE_ICU, and NORMALIZE_INTL.
Referenced by normalize_form_d(), and normalize_form_d_php().
static UtfNormal::toNFKC ( $string )
Convert a UTF-8 string to normal form KC, compatibility composition.
This may cause irreversible information loss; use judiciously. Fast return for pure ASCII strings.
Parameters:
    string $string: a valid UTF-8 string. Input is not validated.
Definition at line 157 of file UtfNormal.php.
References NFKC(), NORMALIZE_ICU, and NORMALIZE_INTL.
Referenced by normalize_form_kc(), and normalize_form_kc_php().
static UtfNormal::toNFKD ( $string )
Convert a UTF-8 string to normal form KD, compatibility decomposition.
This may cause irreversible information loss; use judiciously. Fast return for pure ASCII strings.
Parameters:
    string $string: a valid UTF-8 string. Input is not validated.
Definition at line 176 of file UtfNormal.php.
References NFKD(), NORMALIZE_ICU, and NORMALIZE_INTL.
Referenced by normalize_form_kd(), and normalize_form_kd_php().
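The information loss mentioned for the compatibility forms can be made concrete. The sketch below uses Python's standard `unicodedata` module rather than UtfNormal; the helper name `loses_information` is a hypothetical choice. It flags strings whose compatibility form differs from their canonical form, meaning some formatting distinction has been folded away.

```python
import unicodedata

def loses_information(text):
    """True if compatibility normalization changes more than canonical does."""
    return (unicodedata.normalize("NFKC", text)
            != unicodedata.normalize("NFC", text))

# SUPERSCRIPT TWO (U+00B2) becomes a plain digit under NFKC, so "x squared"
# and "x2" can no longer be told apart afterwards.
print(unicodedata.normalize("NFKC", "x\u00b2"))
print(loses_information("x\u00b2"))
print(loses_information("caf\u00e9"))
```

A check of this kind is one way to decide whether the compatibility forms are safe for a given input before the original is discarded.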
static UtfNormal::$utfCanonicalComp = null
Definition at line 61 of file UtfNormal.php.
Referenced by CleanUpTest\XtestAllChars().
static UtfNormal::$utfCanonicalDecomp = null
Definition at line 62 of file UtfNormal.php.
Referenced by benchmarkForm(), and CleanUpTest\XtestAllChars().
static UtfNormal::$utfCheckNFC
Definition at line 67 of file UtfNormal.php.
static UtfNormal::$utfCombiningClass = null
Definition at line 60 of file UtfNormal.php.
static UtfNormal::$utfCompatibilityDecomp = null
Definition at line 65 of file UtfNormal.php.
const UtfNormal::UNORM_DEFAULT = self::UNORM_NFC
Definition at line 58 of file UtfNormal.php.
const UtfNormal::UNORM_FCD = 6
Definition at line 57 of file UtfNormal.php.
const UtfNormal::UNORM_NFC = 4
Definition at line 55 of file UtfNormal.php.
Referenced by donorm(), and Installer\envCheckLibicu().
const UtfNormal::UNORM_NFD = 2
Definition at line 53 of file UtfNormal.php.
const UtfNormal::UNORM_NFKC = 5
Definition at line 56 of file UtfNormal.php.
const UtfNormal::UNORM_NFKD = 3
Definition at line 54 of file UtfNormal.php.
const UtfNormal::UNORM_NONE = 1
For using the ICU wrapper.
Definition at line 52 of file UtfNormal.php.