MediaWiki 1.23.0
UtfNormal Class Reference

Unicode normalization routines for working with UTF-8 strings.

Static Public Member Functions

static cleanUp ( $string)
 The ultimate convenience function! Clean up invalid UTF-8 sequences, and convert to normal form C, canonical composition.
 
static loadData ()
 Load the basic composition data if necessary.
 
static placebo ( $string)
 This is used only for the benchmark, to measure how long it takes to iterate through a string without doing anything of substance.
 
static quickIsNFC ( $string)
 Returns true if the string is definitely in NFC.
 
static quickIsNFCVerify (&$string)
 Returns true if the string is definitely in NFC.
 
static toNFC ( $string)
 Convert a UTF-8 string to normal form C, canonical composition.
 
static toNFD ( $string)
 Convert a UTF-8 string to normal form D, canonical decomposition.
 
static toNFKC ( $string)
 Convert a UTF-8 string to normal form KC, compatibility composition.
 
static toNFKD ( $string)
 Convert a UTF-8 string to normal form KD, compatibility decomposition.
 

Public Attributes

const UNORM_DEFAULT = self::UNORM_NFC
 
const UNORM_FCD = 6
 
const UNORM_NFC = 4
 
const UNORM_NFD = 2
 
const UNORM_NFKC = 5
 
const UNORM_NFKD = 3
 
const UNORM_NONE = 1
 For using the ICU wrapper.
 

Static Public Attributes

static $utfCanonicalComp = null
 
static $utfCanonicalDecomp = null
 
static $utfCheckNFC
 
static $utfCombiningClass = null
 
static $utfCompatibilityDecomp = null
 

Static Private Member Functions

static fastCombiningSort ( $string)
 Sorts combining characters into canonical order.
 
static fastCompose ( $string)
 Produces canonically composed sequences, i.e. normal form C or KC.
 
static fastDecompose ( $string, $map)
 Perform decomposition of a UTF-8 string into either D or KD form (depending on which decomposition map is passed to us).
 
static NFC ( $string)
 
static NFD ( $string)
 
static NFKC ( $string)
 
static NFKD ( $string)
 
static replaceForNativeNormalize ( $string)
 Function to replace some characters that we don't want but most of the native normalize functions keep.
 

Detailed Description

Unicode normalization routines for working with UTF-8 strings.

Currently assumes that input strings are valid UTF-8!

Not as fast as I'd like, but should be usable for most purposes. UtfNormal::toNFC() will bail early if given ASCII text or text it can quickly determine is already normalized.

All functions can be called statically.

See description of forms at http://www.unicode.org/reports/tr15/
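As a quick illustration of the canonical forms described in that report, the sketch below uses Python's standard `unicodedata` module rather than UtfNormal itself. It only shows what NFC and NFD mean for one accented letter; the variable names are illustrative.

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# NFC composes the base letter and the mark into the precomposed code point.
nfc = unicodedata.normalize("NFC", decomposed)

# NFD splits the precomposed code point back into base letter + mark.
nfd = unicodedata.normalize("NFD", composed)

# The two spellings are canonically equivalent, so each form maps both
# spellings to the same result.
nfc_len = len(nfc)   # 1 code point
nfd_len = len(nfd)   # 2 code points
```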

Definition at line 48 of file UtfNormal.php.

Member Function Documentation

◆ cleanUp()

static UtfNormal::cleanUp (   $string)
static

◆ fastCombiningSort()

static UtfNormal::fastCombiningSort (   $string)
static private

Sorts combining characters into canonical order.

This is the final step in creating decomposed normal forms D and KD.

Parameters
string $string a valid, decomposed UTF-8 string. Input is not validated.
Returns
string a UTF-8 string with combining characters sorted in canonical order

Definition at line 570 of file UtfNormal.php.

References $n, $out, array(), and loadData().

Referenced by NFD().
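The canonical ordering step can be sketched as follows. This is a Python illustration built on `unicodedata.combining()`, not the PHP routine documented here; `canonical_order` is a hypothetical name. Each run of combining marks is stable-sorted by canonical combining class, and characters of class 0 act as barriers.

```python
import unicodedata

def canonical_order(text: str) -> str:
    """Stable-sort every run of combining marks by canonical combining class."""
    out = []
    run = []  # pending combining marks (combining class > 0)
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A starter ends the current run: flush the marks in sorted order.
            run.sort(key=unicodedata.combining)  # list.sort() is stable
            out.extend(run)
            run = []
            out.append(ch)
        else:
            run.append(ch)
    run.sort(key=unicodedata.combining)
    out.extend(run)
    return "".join(out)

# U+0301 (acute, class 230) is given before U+0323 (dot below, class 220),
# so the two marks swap places.
reordered = canonical_order("a\u0301\u0323")
```

As in the documented function, the input is expected to be already decomposed; the sketch does no decomposition of its own.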

◆ fastCompose()

static UtfNormal::fastCompose (   $string)
static private

Produces canonically composed sequences, i.e. normal form C or KC.

Parameters
string $string a valid UTF-8 string in sorted normal form D or KD. Input is not validated.
Returns
string a UTF-8 string with canonical precomposed characters used where possible

Definition at line 622 of file UtfNormal.php.

References $n, $out, loadData(), UNICODE_HANGUL_FIRST, UNICODE_HANGUL_TCOUNT, UNICODE_HANGUL_VCOUNT, UTF8_HANGUL_FIRST, UTF8_HANGUL_LAST, UTF8_HANGUL_LBASE, UTF8_HANGUL_LEND, UTF8_HANGUL_TBASE, UTF8_HANGUL_TEND, UTF8_HANGUL_VBASE, and UTF8_HANGUL_VEND.

Referenced by NFC(), and NFKC().
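The Hangul constants in the references point to the arithmetic composition of Hangul syllables. A minimal Python sketch of that arithmetic, using the constants from the Unicode Standard, is given below. `compose_hangul` and the constant names are illustrative and do not mirror the PHP code; table-driven composition of other characters is not shown.

```python
# Constants from the Unicode Standard's Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def compose_hangul(lead: str, vowel: str, trail: str = "") -> str:
    """Compose conjoining jamo (L, V, optional T) into one precomposed syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        raise ValueError("not a leading consonant + vowel pair")
    t_index = 0
    if trail:
        t_index = ord(trail) - T_BASE
        if not 1 <= t_index < T_COUNT:
            raise ValueError("not a trailing consonant")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

# U+1112 + U+1161 + U+11AB compose to U+D55C.
syllable = compose_hangul("\u1112", "\u1161", "\u11ab")
```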

◆ fastDecompose()

static UtfNormal::fastDecompose (   $string,
  $map 
)
static private

Perform decomposition of a UTF-8 string into either D or KD form (depending on which decomposition map is passed to us).

Input is assumed to be valid UTF-8; invalid input will break the routine.

Parameters
string $string valid UTF-8 string
array $map hash of expanded decomposition map
Returns
string a UTF-8 string decomposed, not yet normalized (needs sorting)

Definition at line 510 of file UtfNormal.php.

References $n, $out, $t, loadData(), UNICODE_HANGUL_FIRST, UNICODE_HANGUL_NCOUNT, UNICODE_HANGUL_TCOUNT, UTF8_HANGUL_FIRST, and UTF8_HANGUL_LAST.

Referenced by NFD().
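Hangul syllables are normally decomposed arithmetically rather than through a lookup map, which is presumably why the Hangul constants appear in the references. The Python sketch below shows that arithmetic with the constants from the Unicode Standard. `decompose_hangul` is a hypothetical name, and the map-driven part of decomposition is deliberately left out.

```python
# Constants from the Unicode Standard's Hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables in total

def decompose_hangul(ch: str) -> str:
    """Split one precomposed Hangul syllable into its conjoining jamo."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch  # not a precomposed syllable: leave unchanged
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trail = chr(T_BASE + t_index) if t_index else ""
    return lead + vowel + trail

jamo = decompose_hangul("\ud55c")   # U+1112, U+1161, U+11AB
```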

◆ loadData()

static UtfNormal::loadData ( )
static

Load the basic composition data if necessary.

Access: private

Definition at line 191 of file UtfNormal.php.

Referenced by fastCombiningSort(), fastCompose(), fastDecompose(), NFD(), quickIsNFC(), and quickIsNFCVerify().

◆ NFC()

static UtfNormal::NFC (   $string)
static private
Parameters
$string string
Returns
string

Definition at line 462 of file UtfNormal.php.

References fastCompose(), and NFD().

Referenced by cleanUp(), CleanUpTest\doTestDoubleBytes(), CleanUpTest\doTestTripleBytes(), toNFC(), and CleanUpTest\XtestAllChars().

◆ NFD()

static UtfNormal::NFD (   $string)
static private
Parameters
$string string
Returns
string

Definition at line 471 of file UtfNormal.php.

References fastCombiningSort(), fastDecompose(), and loadData().

Referenced by NFC(), and toNFD().

◆ NFKC()

static UtfNormal::NFKC (   $string)
static private
Parameters
$string string
Returns
string

Definition at line 483 of file UtfNormal.php.

References fastCompose(), and NFKD().

Referenced by toNFKC().

◆ NFKD()

static UtfNormal::NFKD (   $string)
static private
Parameters
$string string
Returns
string

Definition at line 492 of file UtfNormal.php.

Referenced by NFKC(), and toNFKD().

◆ placebo()

static UtfNormal::placebo (   $string)
static

This is used only for the benchmark, to measure how long it takes to iterate through a string without doing anything of substance.

Parameters
$string string
Returns
string

Definition at line 753 of file UtfNormal.php.

References $out.

◆ quickIsNFC()

static UtfNormal::quickIsNFC (   $string)
static

Returns true if the string is definitely in NFC.

Returns false if not or uncertain.

Parameters
string $string a valid UTF-8 string. Input is not validated.
Returns
bool

Definition at line 203 of file UtfNormal.php.

References $n, and loadData().

Referenced by toNFC().
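The "true means definitely, false means not or uncertain" contract can be illustrated with a deliberately conservative check. The Python sketch below is not the table-driven PHP routine. It relies on a property of the Unicode data: every code point below U+0300 has NFC_Quick_Check=Yes and combining class 0. `quick_is_nfc` and `FIRST_UNCERTAIN` are hypothetical names.

```python
import unicodedata

# Below U+0300 every code point is NFC_Quick_Check=Yes with combining class 0,
# so a string made only of such code points is already in NFC.
FIRST_UNCERTAIN = 0x0300

def quick_is_nfc(text: str) -> bool:
    """True: definitely NFC. False: not NFC, or this cheap check cannot tell."""
    return all(ord(ch) < FIRST_UNCERTAIN for ch in text)

definitely = quick_is_nfc("caf\u00e9")   # precomposed "é" is below U+0300
uncertain = quick_is_nfc("\ud55c")       # already NFC, but the check cannot tell
not_nfc = quick_is_nfc("e\u0301")        # contains a combining mark
```

A false result therefore only means the caller has to fall back to full normalization; it does not say the string is unnormalized.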

◆ quickIsNFCVerify()

static UtfNormal::quickIsNFCVerify ( &$string)
static

Returns true if the string is definitely in NFC.

Returns false if not or uncertain.

Parameters
string $string a UTF-8 string, altered on output to be valid UTF-8 safe for XML.
Returns
bool

Definition at line 243 of file UtfNormal.php.

References $matches, $n, array(), loadData(), UTF8_FFFE, UTF8_FFFF, UTF8_MAX, UTF8_OVERLONG_A, UTF8_OVERLONG_B, UTF8_OVERLONG_C, UTF8_REPLACEMENT, and UTF8_SURROGATE_FIRST.

Referenced by Exif\charCodeString(), cleanUp(), IPTC\convIPTCHelper(), GIFMetadataExtractor\getMetadata(), and JpegMetadataExtractor\segmentSplitter().
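The repair side of this function ("altered on output to be valid UTF-8 safe for XML") can be approximated in Python as below. This is a sketch under two assumptions: that invalid byte sequences become U+FFFD, and that "safe for XML" means the XML 1.0 `Char` production. `make_xml_safe` is a hypothetical name, and the exact replacement rules of the PHP code are not reproduced.

```python
import re

# Everything outside the XML 1.0 Char production (assumed meaning of "safe").
_NOT_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def make_xml_safe(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing anything invalid or XML-unsafe."""
    # Invalid, overlong, or surrogate byte sequences become U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    # Control characters and the noncharacters U+FFFE/U+FFFF also become U+FFFD.
    return _NOT_XML_CHAR.sub("\ufffd", text)

cleaned = make_xml_safe(b"ok\xff")   # stray 0xFF byte is replaced
```

Unlike the documented function, the sketch returns a new string instead of altering its argument, and it makes no statement about normalization.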

◆ replaceForNativeNormalize()

static UtfNormal::replaceForNativeNormalize (   $string)
static private

Function to replace some characters that we don't want but most of the native normalize functions keep.

Parameters
string $string The string
Returns
string String with the character codes replaced.

Definition at line 768 of file UtfNormal.php.

References UTF8_FFFE, UTF8_FFFF, and UTF8_REPLACEMENT.

Referenced by cleanUp().

◆ toNFC()

static UtfNormal::toNFC (   $string)
static

Convert a UTF-8 string to normal form C, canonical composition.

Fast return for pure ASCII strings; some lesser optimizations for strings containing only known-good characters.

Parameters
string $string a valid UTF-8 string. Input is not validated.
Returns
string a UTF-8 string in normal form C

Definition at line 120 of file UtfNormal.php.

References NFC(), NORMALIZE_ICU, NORMALIZE_INTL, and quickIsNFC().

Referenced by normalize_form_c(), and normalize_form_c_php().
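The ASCII fast return mentioned above works because pure ASCII text is unchanged by every normal form. A Python sketch of the idea follows; `to_nfc` is a hypothetical stand-in built on the standard `unicodedata` module, not the PHP method.

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Return text in NFC, skipping the normalizer for pure ASCII input."""
    if text.isascii():
        # Fast path: ASCII has no composable or decomposable characters,
        # so the same string object is handed back untouched.
        return text
    return unicodedata.normalize("NFC", text)

plain = "plain ascii"
same_object = to_nfc(plain) is plain    # fast path returns its argument
composed = to_nfc("e\u0301")            # slow path composes to U+00E9
```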

◆ toNFD()

static UtfNormal::toNFD (   $string)
static

Convert a UTF-8 string to normal form D, canonical decomposition.

Fast return for pure ASCII strings.

Parameters
string $string a valid UTF-8 string. Input is not validated.
Returns
string a UTF-8 string in normal form D

Definition at line 138 of file UtfNormal.php.

References NFD(), NORMALIZE_ICU, and NORMALIZE_INTL.

Referenced by normalize_form_d(), and normalize_form_d_php().

◆ toNFKC()

static UtfNormal::toNFKC (   $string)
static

Convert a UTF-8 string to normal form KC, compatibility composition.

This may cause irreversible information loss; use judiciously. Fast return for pure ASCII strings.

Parameters
string $string a valid UTF-8 string. Input is not validated.
Returns
string a UTF-8 string in normal form KC

Definition at line 157 of file UtfNormal.php.

References NFKC(), NORMALIZE_ICU, and NORMALIZE_INTL.

Referenced by normalize_form_kc(), and normalize_form_kc_php().
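To make the information-loss warning concrete, the Python sketch below (standard `unicodedata` module, not UtfNormal) folds a superscript digit and a ligature with NFKC. Once folded, nothing in the result records that they were ever special characters.

```python
import unicodedata

original = "x\u00b2 \ufb01t"   # "x²", a space, and "ﬁt" with the fi ligature

# Compatibility composition replaces "²" with "2" and the ligature with "fi".
folded = unicodedata.normalize("NFKC", original)

# Canonical composition leaves both characters alone.
canonical = unicodedata.normalize("NFC", original)

# Normalizing the folded text again cannot bring the originals back.
roundtrip = unicodedata.normalize("NFC", folded)
```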

◆ toNFKD()

static UtfNormal::toNFKD (   $string)
static

Convert a UTF-8 string to normal form KD, compatibility decomposition.

This may cause irreversible information loss; use judiciously. Fast return for pure ASCII strings.

Parameters
string $string a valid UTF-8 string. Input is not validated.
Returns
string a UTF-8 string in normal form KD

Definition at line 176 of file UtfNormal.php.

References NFKD(), NORMALIZE_ICU, and NORMALIZE_INTL.

Referenced by normalize_form_kd(), and normalize_form_kd_php().

Member Data Documentation

◆ $utfCanonicalComp

UtfNormal::$utfCanonicalComp = null
static

Definition at line 61 of file UtfNormal.php.

Referenced by CleanUpTest\XtestAllChars().

◆ $utfCanonicalDecomp

UtfNormal::$utfCanonicalDecomp = null
static

Definition at line 62 of file UtfNormal.php.

Referenced by benchmarkForm(), and CleanUpTest\XtestAllChars().

◆ $utfCheckNFC

UtfNormal::$utfCheckNFC
static

Definition at line 67 of file UtfNormal.php.

◆ $utfCombiningClass

UtfNormal::$utfCombiningClass = null
static

Definition at line 60 of file UtfNormal.php.

◆ $utfCompatibilityDecomp

UtfNormal::$utfCompatibilityDecomp = null
static

Definition at line 65 of file UtfNormal.php.

◆ UNORM_DEFAULT

const UtfNormal::UNORM_DEFAULT = self::UNORM_NFC

Definition at line 58 of file UtfNormal.php.

◆ UNORM_FCD

const UtfNormal::UNORM_FCD = 6

Definition at line 57 of file UtfNormal.php.

◆ UNORM_NFC

const UtfNormal::UNORM_NFC = 4

Definition at line 55 of file UtfNormal.php.

Referenced by donorm(), and Installer\envCheckLibicu().

◆ UNORM_NFD

const UtfNormal::UNORM_NFD = 2

Definition at line 53 of file UtfNormal.php.

◆ UNORM_NFKC

const UtfNormal::UNORM_NFKC = 5

Definition at line 56 of file UtfNormal.php.

◆ UNORM_NFKD

const UtfNormal::UNORM_NFKD = 3

Definition at line 54 of file UtfNormal.php.

◆ UNORM_NONE

const UtfNormal::UNORM_NONE = 1

For using the ICU wrapper.

Definition at line 52 of file UtfNormal.php.


The documentation for this class was generated from the following file: