This file contains general utilities for token transforms.

Static Public Member Functions
static	stripParsoidIdPrefix (string $aboutId)
	Strip Parsoid id prefix from aboutID.

static	stripNamespace (string $className)
	Strip PHP namespace from the fully qualified class name.

static	isParsoidObjectId (string $aboutId)
	Check for Parsoid id prefix in an aboutID string.

static	isVoidElement (string $name)
	Determine if the named tag is void (cannot have content).

static	clone ( $obj, $deepClone=true)
	Deep clones by default.

static	lastUniChar (string $str, ?int $idx=null)
	Extract the last unicode character of the string.

static	isUniWord (string $s)
	Return true if the first character in $s is a unicode word character.

static	phpURLEncode ( $txt)
	This should not be used.

static	decodeURI (string $s)
	Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone.

static	decodeURIComponent (string $s)
	Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone.

static	extractExtBody (Token $token)
	Extract extension source from the token.

static	isValidDSR (?DomSourceRange $dsr, bool $all=false)
	Check for valid DSR range(s). DSR = "DOM Source Range".

static	normalizeNamespaceName (string $name)
	Canonicalizes a namespace name.

static	decodeWtEntities (string $text)
	Decode HTML5 entities in wikitext.

static	escapeWtEntities (string $text)
	Entity-escape anything that would decode to a valid wikitext entity.

static	escapeHtml (string $s)
	Convert special characters to HTML entities.

static	entityEncodeAll (string $s)
	Encode all characters as entity references.

static	isProtocolValid ( $linkTarget, Env $env)
	Determine whether the protocol of a link is potentially valid.

static	getExtArgInfo (Token $extToken)
	Get argument information for an extension tag token.

static	parseMediaDimensions (string $str, bool $onlyOne=false)
	Parse media dimensions.

static	validateMediaParam (?int $num)
	Validate media parameters. More generally, this is defined by the media handler in core.

static	getStar ( $revision)
	FIXME: Is this needed??

static	magicMasqs ()
	FIXME: This feels broken.

static	isLinkTrail (string $text)
	Check whether some text is a valid link trail.

static	bcp47n ( $code)
	Convert a MediaWiki-format language code to a BCP47-compliant language code suitable for including in HTML.

Public Attributes
const	COMMENT_REGEXP_FRAGMENT = '<!--(?>[\s\S]*?-->)'
	Regular expression fragment for matching wikitext comments.

const	COMMENT_REGEXP = '/' . self::COMMENT_REGEXP_FRAGMENT . '/'
	Regular expression for matching a wikitext comment.

Static Public Attributes
static	$linkTrailRegex
	This regex was generated by running through all unicode characters and testing them against all regexes for linktrails in a default MW install.

Detailed Description

This file contains general utilities for token transforms.

Member Function Documentation

◆ bcp47n()

static Wikimedia\Parsoid\Utils\Utils::bcp47n ( $code )

static

Convert a MediaWiki-format language code to a BCP47-compliant language code suitable for including in HTML.

See GlobalFunctions.php::wfBCP47() in the MediaWiki sources.

Parameters

string $code MediaWiki language code.

Returns: string BCP47 language code.

◆ clone()

static Wikimedia\Parsoid\Utils\Utils::clone	(		$obj,
			$deepClone = true )

static

Deep clones by default.

FIXME, see T161647

Parameters

object|array	$obj	Any plain object (not tokens or DOM trees)
bool	$deepClone

Returns: object|array

◆ decodeURI()

static Wikimedia\Parsoid\Utils\Utils::decodeURI ( string $s )

static

Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone.

Distinct from decodeURIComponent in that certain escapes are not decoded, matching the behavior of JavaScript's decodeURI().

See also: https://www.ecma-international.org/ecma-262/6.0/#sec-decodeuri-encodeduri

Parameters

string $s URI to be decoded

Returns: string
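
The difference between the two decoders can be illustrated with a minimal standalone sketch. It is not the Parsoid implementation: the helper name sketchPercentDecode and its $keepEncoded parameter are hypothetical, and the reserved set ";/?:@&=+$,#" is the one JavaScript's decodeURI() leaves escaped. The sketch assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper (an assumption for illustration, not Parsoid code).
// Decodes percent-escapes that form valid UTF-8 characters and leaves all
// other escaped bytes untouched. Characters listed in $keepEncoded stay
// escaped, which mimics the decodeURI() flavour.
function sketchPercentDecode( string $s, string $keepEncoded = '' ): string {
	return preg_replace_callback(
		// Grab whole runs of escapes so multi-byte characters stay together.
		'/(?:%[0-9A-Fa-f]{2})+/',
		static function ( array $m ) use ( $keepEncoded ): string {
			$bytes = rawurldecode( $m[0] );
			$n = strlen( $bytes );
			$out = '';
			for ( $pos = 0; $pos < $n; ) {
				// Find the shortest valid UTF-8 character starting at $pos.
				$char = null;
				for ( $len = 1; $len <= 4 && $pos + $len <= $n; $len++ ) {
					$chunk = substr( $bytes, $pos, $len );
					if ( mb_check_encoding( $chunk, 'UTF-8' ) ) {
						$char = $chunk;
						break;
					}
				}
				if ( $char === null ) {
					// Not valid UTF-8: keep the original three-character escape.
					$out .= substr( $m[0], $pos * 3, 3 );
					$pos++;
				} elseif ( $keepEncoded !== '' && strpos( $keepEncoded, $char ) !== false ) {
					// Reserved character: keep its escape as written.
					$out .= substr( $m[0], $pos * 3, strlen( $char ) * 3 );
					$pos += strlen( $char );
				} else {
					$out .= $char;
					$pos += strlen( $char );
				}
			}
			return $out;
		},
		$s
	) ?? $s;
}

// decodeURI()-like: "%2F" (a reserved "/") stays escaped.
var_dump( sketchPercentDecode( '%E2%82%AC%2F%FF', ';/?:@&=+$,#' ) );
// decodeURIComponent()-like: "%2F" is decoded, the invalid byte "%FF" is not.
var_dump( sketchPercentDecode( '%E2%82%AC%2F%FF' ) );
```

In both calls the invalid byte %FF is left alone, which is the "leaving other encoded bytes alone" behaviour described above.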

◆ decodeURIComponent()

static Wikimedia\Parsoid\Utils\Utils::decodeURIComponent ( string $s )

static

Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone.

Parameters

string $s URI to be decoded

Returns: string

◆ decodeWtEntities()

static Wikimedia\Parsoid\Utils\Utils::decodeWtEntities ( string $text )

static

Decode HTML5 entities in wikitext.

NOTE that wikitext only allows semicolon-terminated entities, while HTML allows a number of "legacy" entities to be decoded without a terminating semicolon. This function deliberately does not decode these HTML-only entity forms.

Parameters

string $text

Returns: string
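
The semicolon rule can be sketched as follows. This is an illustrative approximation built on PHP's html_entity_decode(), not the Parsoid implementation; the function name sketchDecodeWtEntities is hypothetical.

```php
<?php
// Hypothetical helper (an assumption for illustration, not Parsoid code).
// Only entity references that end in a semicolon are handed to the decoder;
// semicolon-less forms such as "&para" never match the pattern.
function sketchDecodeWtEntities( string $text ): string {
	return preg_replace_callback(
		'/&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);/',
		static function ( array $m ): string {
			// Unknown names are returned unchanged by html_entity_decode().
			return html_entity_decode( $m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8' );
		},
		$text
	) ?? $text;
}

// The terminated entities are decoded; the trailing "&para" is left as-is.
var_dump( sketchDecodeWtEntities( '&amp; &para; &para' ) );
```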

◆ entityEncodeAll()

static Wikimedia\Parsoid\Utils\Utils::entityEncodeAll ( string $s )

static

Encode all characters as entity references.

This is done to make characters safe for wikitext (regardless of whether they are HTML-safe). Typically only called with single-codepoint strings.

Parameters

string $s

Returns: string
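
A rough sketch of the idea, assuming hexadecimal numeric references as the output form (the exact form Parsoid emits is not stated here, so that choice is an assumption). The function name sketchEntityEncodeAll is hypothetical and the code requires the mbstring extension.

```php
<?php
// Hypothetical helper (an assumption for illustration, not Parsoid code).
// Turns every code point of $s into a hexadecimal numeric character reference.
function sketchEntityEncodeAll( string $s ): string {
	$out = '';
	foreach ( mb_str_split( $s, 1, 'UTF-8' ) as $ch ) {
		$out .= sprintf( '&#x%X;', mb_ord( $ch, 'UTF-8' ) );
	}
	return $out;
}

// Typical use: a single character that would otherwise be wikitext syntax.
var_dump( sketchEntityEncodeAll( '[' ) );
```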

◆ escapeHtml()

static Wikimedia\Parsoid\Utils\Utils::escapeHtml ( string $s )

static

Convert special characters to HTML entities.

Parameters

string $s

Returns: string

◆ escapeWtEntities()

static Wikimedia\Parsoid\Utils\Utils::escapeWtEntities ( string $text )

static

Entity-escape anything that would decode to a valid wikitext entity.

Note that HTML5 allows certain "semicolon-less" entities, like "&para" without a trailing semicolon; these aren't allowed in wikitext and won't be escaped by this function.

Parameters

string $text

Returns: string

◆ extractExtBody()

static Wikimedia\Parsoid\Utils\Utils::extractExtBody ( Token $token )

static

Extract extension source from the token.

Parameters

Token $token token

Returns: string

◆ getExtArgInfo()

static Wikimedia\Parsoid\Utils\Utils::getExtArgInfo ( Token $extToken )

static

Get argument information for an extension tag token.

Parameters

Token $extToken

Returns: \stdClass

◆ getStar()

static Wikimedia\Parsoid\Utils\Utils::getStar ( $revision )

static

FIXME: Is this needed??

Extract content in a backwards-compatible way.

Parameters

object $revision

Returns: object

◆ isLinkTrail()

static Wikimedia\Parsoid\Utils\Utils::isLinkTrail ( string $text )

static

Check whether some text is a valid link trail.

Parameters

string $text

Returns: bool

◆ isParsoidObjectId()

static Wikimedia\Parsoid\Utils\Utils::isParsoidObjectId ( string $aboutId )

static

Check for Parsoid id prefix in an aboutID string.

Parameters

string $aboutId about ID string

Returns: bool

◆ isProtocolValid()

static Wikimedia\Parsoid\Utils\Utils::isProtocolValid	(		$linkTarget,
		Env	$env )

static

Determine whether the protocol of a link is potentially valid.

Use the environment's per-wiki config to do so.

Parameters

mixed	$linkTarget
Env	$env

Returns: bool

◆ isUniWord()

static Wikimedia\Parsoid\Utils\Utils::isUniWord ( string $s )

static

Return true if the first character in $s is a unicode word character.

Parameters

string $s

Returns: bool

◆ isValidDSR()

static Wikimedia\Parsoid\Utils\Utils::isValidDSR	(	?DomSourceRange	$dsr,
		bool	$all = false )

static

Check for valid DSR range(s). DSR = "DOM Source Range".

Parameters

?DomSourceRange	$dsr	DSR source range values
bool	$all	Also check the widths of the container tag

Returns: bool

◆ isVoidElement()

static Wikimedia\Parsoid\Utils\Utils::isVoidElement ( string $name )

static

Determine if the named tag is void (cannot have content).

Parameters

string $name tag name

Returns: bool

◆ lastUniChar()

static Wikimedia\Parsoid\Utils\Utils::lastUniChar	(	string	$str,
		?int	$idx = null )

static

Extract the last unicode character of the string.

This might be more than one byte, if the last character is non-ASCII.

Parameters

string	$str
?int	$idx	The index after the character to extract; defaults to the length of $str, which will extract the last character in $str.

Returns: string
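
The byte-level walk this implies can be sketched as below. The sketch assumes that $idx is a byte offset and that $str is valid UTF-8; neither is confirmed above, and the function name sketchLastUniChar is hypothetical.

```php
<?php
// Hypothetical helper (an assumption for illustration, not Parsoid code).
// Returns the UTF-8 character that ends just before byte offset $idx.
function sketchLastUniChar( string $str, ?int $idx = null ): string {
	$idx ??= strlen( $str );
	if ( $idx <= 0 || $idx > strlen( $str ) ) {
		return '';
	}
	$start = $idx - 1;
	// UTF-8 continuation bytes have the bit pattern 10xxxxxx;
	// step backwards over them until the lead byte is reached.
	while ( $start > 0 && ( ord( $str[$start] ) & 0xC0 ) === 0x80 ) {
		$start--;
	}
	return substr( $str, $start, $idx - $start );
}

// "ñ" occupies two bytes, so the result is a two-byte string.
var_dump( sketchLastUniChar( "a\u{F1}" ) );
```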

◆ magicMasqs()

static Wikimedia\Parsoid\Utils\Utils::magicMasqs ( )

static

FIXME: This feels broken.

Magic words masquerading as templates.

Returns: array

◆ normalizeNamespaceName()

static Wikimedia\Parsoid\Utils\Utils::normalizeNamespaceName ( string $name )

static

Canonicalizes a namespace name.

Parameters

string $name Non-normalized namespace name.

Returns: string

◆ parseMediaDimensions()

static Wikimedia\Parsoid\Utils\Utils::parseMediaDimensions	(	string	$str,
		bool	$onlyOne = false )

static

Parse media dimensions.

Parameters

string	$str	media dimension string to parse
bool	$onlyOne	If set, returns null if multiple dimensions are present

Returns: ?array{x:int,y?:int}
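
To make the return shape concrete, here is a minimal sketch. The accepted syntax (a width, an optional "x" plus height, and an optional "px" suffix) is an assumption for illustration, and the function name sketchParseMediaDimensions is hypothetical; the real method may accept other forms.

```php
<?php
// Hypothetical helper (an assumption for illustration, not Parsoid code).
// Parses strings such as "300px" or "300x200px" into [ 'x' => ..., 'y' => ... ].
function sketchParseMediaDimensions( string $str, bool $onlyOne = false ): ?array {
	if ( !preg_match( '/^(\d+)(?:x(\d+))?\s*(?:px)?\s*$/D', $str, $m ) ) {
		return null;
	}
	$dims = [ 'x' => (int)$m[1] ];
	if ( isset( $m[2] ) && $m[2] !== '' ) {
		if ( $onlyOne ) {
			// The caller asked for a single dimension but two were given.
			return null;
		}
		$dims['y'] = (int)$m[2];
	}
	return $dims;
}

var_dump( sketchParseMediaDimensions( '300x200px' ) );
var_dump( sketchParseMediaDimensions( '300x200px', true ) );
```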

◆ phpURLEncode()

static Wikimedia\Parsoid\Utils\Utils::phpURLEncode ( $txt )

static

This should not be used.

Parameters

string $txt URL to encode using PHP encoding

Returns: string

◆ stripNamespace()

static Wikimedia\Parsoid\Utils\Utils::stripNamespace ( string $className )

static

Strip PHP namespace from the fully qualified class name.

Parameters

string $className

Returns: string
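
One plausible way to do this, shown as a standalone sketch rather than the actual method body (the function name sketchStripNamespace is hypothetical), is to keep everything after the last namespace separator:

```php
<?php
// Hypothetical helper (an assumption for illustration, not Parsoid code).
// Returns the short class name of a fully qualified class name.
function sketchStripNamespace( string $className ): string {
	$pos = strrpos( $className, '\\' );
	return $pos === false ? $className : substr( $className, $pos + 1 );
}

var_dump( sketchStripNamespace( 'Wikimedia\\Parsoid\\Utils\\Utils' ) ); // string(5) "Utils"
```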

◆ stripParsoidIdPrefix()

static Wikimedia\Parsoid\Utils\Utils::stripParsoidIdPrefix ( string $aboutId )

static

Strip Parsoid id prefix from aboutID.

Parameters

string $aboutId about ID string

Returns: string

◆ validateMediaParam()

static Wikimedia\Parsoid\Utils\Utils::validateMediaParam ( ?int $num )

static

Validate media parameters. More generally, this is defined by the media handler in core.

Parameters

?int $num

Returns: bool

Member Data Documentation

◆ $linkTrailRegex

Wikimedia\Parsoid\Utils\Utils::$linkTrailRegex

static

Initial value:

=
        '/^[^\0-`{÷ĀĈ-ČĎĐĒĔĖĚĜĝĠ-ĪĬ-įĲĴ-ĹĻ-ĽĿŀŅņŉŊŌŎŏŒŔŖ-ŘŜŝŠŤŦŨŪ-ŬŮŲ-ŴŶŸ' .
        'ſ-ǤǦǨǪ-Ǯǰ-ȗȜ-ȞȠ-ɘɚ-ʑʓ-ʸʽ-̂̄-΅·΋΍΢Ϗ-ЯѐѝѠѢѤѦѨѪѬѮѰѲѴѶѸѺ-ѾҀ-҃҅-ҐҒҔҕҘҚҜ-ҠҤ-ҪҬҭҰҲ' .
        'Ҵ-ҶҸҹҼ-ҿӁ-ӗӚ-ӜӞӠ-ӢӤӦӪ-ӲӴӶ-ՠֈ-׏׫-ؠً-ٳٵ-ٽٿ-څڇ-ڗڙ-ڨڪ-ڬڮڰ-ڽڿ-ۅۈ-ۊۍ-۔ۖ-਀਄਋-਎਑਒' .
        '਩਱਴਷਺਻਽੃-੆੉੊੎-੘੝੟-੯ੴ-჏ჱ-ẼẾ-\x{200b}\x{200d}-‒—-‗‚‛”--\x{fffd}]+$/D'

This regex was generated by running through all unicode characters and testing them against all regexes for linktrails in a default MW install.

We had to adjust it slightly; here is what we changed:

A-Z, though allowed in Walloon, is disallowed.
'"', though allowed in Chuvash, is disallowed.
'-', though allowed in Icelandic (possibly due to a bug), is disallowed.
'1', though allowed in Lak (possibly due to a bug), is disallowed.

◆ COMMENT_REGEXP_FRAGMENT

const Wikimedia\Parsoid\Utils\Utils::COMMENT_REGEXP_FRAGMENT = '<!--(?>[\s\S]*?-->)'

Regular expression fragment for matching wikitext comments.

Meant for inclusion in other regular expressions.
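
The following sketch shows both uses: wrapping the fragment in delimiters, as COMMENT_REGEXP does, and embedding it in a larger pattern. It copies the fragment value listed under Public Attributes into a local constant so that it runs without Parsoid; the constant name COMMENT_FRAGMENT and the sample strings are assumptions made for illustration.

```php
<?php
// Local copy of the fragment value from the attribute summary, so the
// example does not depend on the Parsoid library being installed.
const COMMENT_FRAGMENT = '<!--(?>[\s\S]*?-->)';

// 1. Wrapped in delimiters, it matches a comment, including multi-line ones.
$wikitext = "foo <!-- hidden\nnote --> bar";
$stripped = preg_replace( '/' . COMMENT_FRAGMENT . '/', '', $wikitext );
var_dump( $stripped );

// 2. Embedded in a larger pattern: does a string consist only of
//    whitespace and comments?
$onlyComments = '/^(?:\s|' . COMMENT_FRAGMENT . ')*$/D';
var_dump( preg_match( $onlyComments, ' <!-- a --> <!-- b -->' ) );
var_dump( preg_match( $onlyComments, 'text <!-- a -->' ) );
```

Because the fragment carries no delimiters or anchors of its own, it can be concatenated into other patterns without further escaping, provided the enclosing pattern uses "/" as its delimiter.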

The documentation for this class was generated from the following file:

src/Utils/Utils.php
