Parsoid
A bidirectional parser between wikitext and HTML5
Wikimedia\Parsoid\Utils\Utils Class Reference

This file contains general utilities for token transforms.

Static Public Member Functions

static stripParsoidIdPrefix (string $aboutId)
 Strip Parsoid id prefix from aboutID.
 
static stripNamespace (string $className)
 Strip PHP namespace from the fully qualified class name.
 
static isParsoidObjectId (string $aboutId)
 Check for Parsoid id prefix in an aboutID string.
 
static isVoidElement (string $name)
 Determine if the named tag is void (cannot have content).
 
static clone ( $obj, $deepClone=true)
 Deep clones by default.
 
static lastUniChar (string $str, ?int $idx=null)
 Extract the last unicode character of the string.
 
static isUniWord (string $s)
 Return true if the first character in $s is a unicode word character.
 
static phpURLEncode ( $txt)
 This should not be used.
 
static decodeURI (string $s)
 Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone.
 
static decodeURIComponent (string $s)
 Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone.
 
static extractExtBody (Token $token)
 Extract extension source from the token.
 
static isValidDSR (?DomSourceRange $dsr, bool $all=false)
 Check for valid DSR range(s). DSR = "DOM Source Range".
 
static normalizeNamespaceName (string $name)
 Canonicalizes a namespace name.
 
static decodeWtEntities (string $text)
 Decode HTML5 entities in wikitext.
 
static escapeWtEntities (string $text)
 Entity-escape anything that would decode to a valid wikitext entity.
 
static escapeHtml (string $s)
 Convert special characters to HTML entities.
 
static entityEncodeAll (string $s)
 Encode all characters as entity references.
 
static isProtocolValid ( $linkTarget, Env $env)
 Determine whether the protocol of a link is potentially valid.
 
static getExtArgInfo (Token $extToken)
 Get argument information for an extension tag token.
 
static parseMediaDimensions (string $str, bool $onlyOne=false)
 Parse media dimensions.
 
static validateMediaParam (?int $num)
 Validate media parameters. More generally, this is defined by the media handler in core.
 
static getStar ( $revision)
 FIXME: Is this needed??
 
static magicMasqs ()
 FIXME: This feels broken.
 
static isLinkTrail (string $text)
 Check whether some text is a valid link trail.
 
static bcp47n ( $code)
 Convert MediaWiki-format language code to a BCP47-compliant language code suitable for including in HTML.
 

Public Attributes

const COMMENT_REGEXP_FRAGMENT = '<!--(?>[\s\S]*?-->)'
 Regular expression fragment for matching wikitext comments.
 
const COMMENT_REGEXP = '/' . self::COMMENT_REGEXP_FRAGMENT . '/'
 Regular expression for matching a wikitext comment.
 

Static Public Attributes

static $linkTrailRegex
 This regex was generated by running through all unicode characters and testing them against all regexes for linktrails in a default MW install.
 

Detailed Description

This file contains general utilities for token transforms.

Member Function Documentation

◆ bcp47n()

static Wikimedia\Parsoid\Utils\Utils::bcp47n ( $code)
static

Convert MediaWiki-format language code to a BCP47-compliant language code suitable for including in HTML.

See GlobalFunctions.php::wfBCP47() in MediaWiki sources.

Parameters
string $code MediaWiki language code.
Returns
string BCP47 language code.
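
As a rough illustration of the casing conventions a BCP 47 tag follows, here is a minimal sketch. `bcp47CaseSketch` is a hypothetical helper, not Parsoid's implementation: it only adjusts letter case per subtag, and the real method may also remap MediaWiki-specific codes, which this sketch does not attempt.

```php
<?php
// Hypothetical helper, not Parsoid's implementation: applies only the
// conventional BCP 47 casing to each hyphen-separated subtag.
function bcp47CaseSketch( string $code ): string {
	$segments = explode( '-', $code );
	$out = [];
	foreach ( $segments as $i => $seg ) {
		if ( $i > 0 && strtolower( $segments[$i - 1] ) === 'x' ) {
			// The subtag right after a private-use "x" stays lowercase.
			$out[] = strtolower( $seg );
		} elseif ( $i > 0 && strlen( $seg ) === 2 ) {
			// Two-letter region subtag: uppercase.
			$out[] = strtoupper( $seg );
		} elseif ( $i > 0 && strlen( $seg ) === 4 ) {
			// Four-letter script subtag: title case.
			$out[] = ucfirst( strtolower( $seg ) );
		} else {
			// Language subtag and everything else: lowercase.
			$out[] = strtolower( $seg );
		}
	}
	return implode( '-', $out );
}
```

Under these assumptions, `zh-hans-cn` becomes `zh-Hans-CN` and `EN-gb` becomes `en-GB`.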

◆ clone()

static Wikimedia\Parsoid\Utils\Utils::clone ( $obj,
$deepClone = true )
static

Deep clones by default.

FIXME: see T161647

Parameters
object|array $obj Any plain object; not tokens or DOM trees.
bool $deepClone
Returns
object|array

◆ decodeURI()

static Wikimedia\Parsoid\Utils\Utils::decodeURI ( string $s)
static

Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone.

Distinct from decodeURIComponent in that certain escapes are not decoded, matching the behavior of JavaScript's decodeURI().

See also
https://www.ecma-international.org/ecma-262/6.0/#sec-decodeuri-encodeduri
Parameters
string $s URI to be decoded
Returns
string
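
To make the difference from decodeURIComponent concrete, here is a minimal sketch of "reserved characters stay escaped". `decodeUriSketch` is a hypothetical helper, not Parsoid's code; the reserved set is the one JavaScript's decodeURI() preserves. As a simplification, it decodes byte by byte and does not check that the result is valid UTF-8, so it does not reproduce the "only valid UTF-8" guarantee described above.

```php
<?php
// Hypothetical helper, not Parsoid's code. Simplification: decodes byte by
// byte and does not verify that the decoded bytes form valid UTF-8.
function decodeUriSketch( string $s ): string {
	// Characters that JavaScript's decodeURI() leaves percent-encoded.
	$reserved = ';/?:@&=+$,#';
	return preg_replace_callback(
		'/%([0-9A-Fa-f]{2})/',
		static function ( array $m ) use ( $reserved ): string {
			$ch = chr( (int)hexdec( $m[1] ) );
			// Keep the original escape if it encodes a reserved character.
			return strpos( $reserved, $ch ) !== false ? $m[0] : $ch;
		},
		$s
	);
}

// rawurldecode() stands in for component-style decoding in this comparison.
echo decodeUriSketch( 'a%20b%2Fc' ), "\n"; // prints "a b%2Fc"
echo rawurldecode( 'a%20b%2Fc' ), "\n";    // prints "a b/c"
```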

◆ decodeURIComponent()

static Wikimedia\Parsoid\Utils\Utils::decodeURIComponent ( string $s)
static

Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone.

Parameters
string $s URI to be decoded
Returns
string

◆ decodeWtEntities()

static Wikimedia\Parsoid\Utils\Utils::decodeWtEntities ( string $text)
static

Decode HTML5 entities in wikitext.

NOTE that wikitext only allows semicolon-terminated entities, while HTML allows a number of "legacy" entities to be decoded without a terminating semicolon. This function deliberately does not decode these HTML-only entity forms.

Parameters
string $text
Returns
string
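
The semicolon rule can be illustrated with PHP's built-in decoder, which also requires the terminating semicolon. This is only a stand-in for the behavior described above: `decodeEntitiesSketch` is a hypothetical name, and the set of entities wikitext accepts may not match PHP's HTML5 table exactly.

```php
<?php
// Hypothetical stand-in, not Parsoid's decoder: html_entity_decode() only
// decodes entities that end with a semicolon, mirroring the rule above.
function decodeEntitiesSketch( string $text ): string {
	return html_entity_decode( $text, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
}
```

With this sketch, `&para;` decodes to the pilcrow sign, while the semicolon-less `&para` is left untouched.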

◆ entityEncodeAll()

static Wikimedia\Parsoid\Utils\Utils::entityEncodeAll ( string $s)
static

Encode all characters as entity references.

This is done to make characters safe for wikitext (regardless of whether they are HTML-safe). Typically only called with single-codepoint strings.

Parameters
string $s
Returns
string
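
One way to picture "encode all characters" is to emit a numeric character reference per code point. The sketch below assumes hexadecimal references and valid UTF-8 input; `entityEncodeAllSketch` is a hypothetical helper, and Parsoid's exact output format may differ.

```php
<?php
// Hypothetical helper: emit one hexadecimal character reference per code
// point. Assumes $s is valid UTF-8; Parsoid's exact output may differ.
function entityEncodeAllSketch( string $s ): string {
	return preg_replace_callback(
		'/./su',
		static function ( array $m ): string {
			$bytes = array_values( unpack( 'C*', $m[0] ) );
			// Payload bits of the lead byte depend on the sequence length.
			$leadMask = [ 1 => 0x7F, 2 => 0x1F, 3 => 0x0F, 4 => 0x07 ];
			$cp = $bytes[0] & $leadMask[count( $bytes )];
			foreach ( array_slice( $bytes, 1 ) as $b ) {
				// Each continuation byte contributes six more bits.
				$cp = ( $cp << 6 ) | ( $b & 0x3F );
			}
			return sprintf( '&#x%02X;', $cp );
		},
		$s
	);
}
```

For a single-codepoint string such as `|`, this yields `&#x7C;`, which wikitext will no longer read as a pipe.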

◆ escapeHtml()

static Wikimedia\Parsoid\Utils\Utils::escapeHtml ( string $s)
static

Convert special characters to HTML entities.

Parameters
string $s
Returns
string

◆ escapeWtEntities()

static Wikimedia\Parsoid\Utils\Utils::escapeWtEntities ( string $text)
static

Entity-escape anything that would decode to a valid wikitext entity.

Note that HTML5 allows certain "semicolon-less" entities, like "&para"; these aren't allowed in wikitext and won't be escaped by this function.

Parameters
string $text
Returns
string

◆ extractExtBody()

static Wikimedia\Parsoid\Utils\Utils::extractExtBody ( Token $token)
static

Extract extension source from the token.

Parameters
Token $token token
Returns
string

◆ getExtArgInfo()

static Wikimedia\Parsoid\Utils\Utils::getExtArgInfo ( Token $extToken)
static

Get argument information for an extension tag token.

Parameters
Token $extToken
Returns
\stdClass

◆ getStar()

static Wikimedia\Parsoid\Utils\Utils::getStar ( $revision)
static

FIXME: Is this needed??

Extract content in a backwards-compatible way.

Parameters
object $revision
Returns
object

◆ isLinkTrail()

static Wikimedia\Parsoid\Utils\Utils::isLinkTrail ( string $text)
static

Check whether some text is a valid link trail.

Parameters
string $text
Returns
bool

◆ isParsoidObjectId()

static Wikimedia\Parsoid\Utils\Utils::isParsoidObjectId ( string $aboutId)
static

Check for Parsoid id prefix in an aboutID string.

Parameters
string $aboutId about ID string
Returns
bool

◆ isProtocolValid()

static Wikimedia\Parsoid\Utils\Utils::isProtocolValid ( $linkTarget,
Env $env )
static

Determine whether the protocol of a link is potentially valid.

Use the environment's per-wiki config to do so.

Parameters
mixed $linkTarget
Env $env
Returns
bool

◆ isUniWord()

static Wikimedia\Parsoid\Utils\Utils::isUniWord ( string $s)
static

Return true if the first character in $s is a unicode word character.

Parameters
string $s
Returns
bool
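
A minimal sketch of such a check, assuming "word character" means a Unicode letter, a Unicode number, or an underscore. `isUniWordSketch` is a hypothetical helper; the character class Parsoid actually uses may be broader or narrower.

```php
<?php
// Hypothetical helper: treats Unicode letters, Unicode numbers and "_" as
// word characters. Only the first character of $s is examined.
function isUniWordSketch( string $s ): bool {
	return preg_match( '/^[\p{L}\p{N}_]/u', $s ) === 1;
}
```

Under this assumption, a string starting with an accented letter passes, while a string starting with a hyphen or an empty string does not.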

◆ isValidDSR()

static Wikimedia\Parsoid\Utils\Utils::isValidDSR ( ?DomSourceRange $dsr,
bool $all = false )
static

Check for valid DSR range(s). DSR = "DOM Source Range".

Parameters
?DomSourceRange $dsr DSR source range values
bool $all Also check the widths of the container tag
Returns
bool

◆ isVoidElement()

static Wikimedia\Parsoid\Utils\Utils::isVoidElement ( string $name)
static

Determine if the named tag is void (cannot have content).

Parameters
string $name tag name
Returns
bool
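
For orientation, a lookup of this kind can be sketched against the void elements named in the HTML Living Standard. `isVoidElementSketch` is a hypothetical helper, and Parsoid's own list may differ, for example by also covering obsolete tags.

```php
<?php
// Hypothetical helper: void elements as listed in the HTML Living Standard.
// Parsoid's own list may differ (for example by including obsolete tags).
function isVoidElementSketch( string $name ): bool {
	static $void = [
		'area' => true, 'base' => true, 'br' => true, 'col' => true,
		'embed' => true, 'hr' => true, 'img' => true, 'input' => true,
		'link' => true, 'meta' => true, 'source' => true, 'track' => true,
		'wbr' => true,
	];
	// Tag names are compared case-insensitively in this sketch.
	return isset( $void[strtolower( $name )] );
}
```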

◆ lastUniChar()

static Wikimedia\Parsoid\Utils\Utils::lastUniChar ( string $str,
?int $idx = null )
static

Extract the last unicode character of the string.

This might be more than one byte, if the last character is non-ASCII.

Parameters
string $str
?int $idx The index after the character to extract; defaults to the length of $str, which will extract the last character in $str.
Returns
string
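
The "more than one byte" case can be sketched by walking backwards over UTF-8 continuation bytes. This sketch assumes `$idx` is a byte offset and that `$str` is valid UTF-8; `lastUniCharSketch` is a hypothetical helper, and returning an empty string for an out-of-range index is a choice made here, not documented behavior.

```php
<?php
// Hypothetical helper. Assumes $idx is a byte offset into valid UTF-8.
function lastUniCharSketch( string $str, ?int $idx = null ): string {
	$idx = $idx ?? strlen( $str );
	if ( $idx <= 0 || $idx > strlen( $str ) ) {
		// Out-of-range index: nothing to extract (a choice of this sketch).
		return '';
	}
	// Walk back over continuation bytes (10xxxxxx) to reach the lead byte.
	$start = $idx - 1;
	while ( $start > 0 && ( ord( $str[$start] ) & 0xC0 ) === 0x80 ) {
		$start--;
	}
	return substr( $str, $start, $idx - $start );
}
```

With `"ab\u{e9}"` (four bytes), the default index returns the two-byte character, while an index of 2 returns `b`.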

◆ magicMasqs()

static Wikimedia\Parsoid\Utils\Utils::magicMasqs ( )
static

FIXME: This feels broken.

Magic words masquerading as templates.

Returns
array

◆ normalizeNamespaceName()

static Wikimedia\Parsoid\Utils\Utils::normalizeNamespaceName ( string $name)
static

Canonicalizes a namespace name.

Parameters
string $name Non-normalized namespace name.
Returns
string

◆ parseMediaDimensions()

static Wikimedia\Parsoid\Utils\Utils::parseMediaDimensions ( string $str,
bool $onlyOne = false )
static

Parse media dimensions.

Parameters
string $str media dimension string to parse
bool $onlyOne If set, returns null if multiple dimensions are present
Returns
?array{x:int,y?:int}
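
To illustrate the return shape and the effect of `$onlyOne`, here is a minimal sketch. It assumes the input looks like `<width>[x<height>][px]`, which is a guess at the syntax and not taken from this page; `parseMediaDimensionsSketch` is a hypothetical helper, not Parsoid's parser.

```php
<?php
// Hypothetical helper. Assumed syntax: "<width>[x<height>][px]".
function parseMediaDimensionsSketch( string $str, bool $onlyOne = false ): ?array {
	if ( !preg_match( '/^(\d+)(?:x(\d+))?\s*(?:px)?\s*$/D', $str, $m ) ) {
		// Not a dimension string under the assumed syntax.
		return null;
	}
	$dims = [ 'x' => (int)$m[1] ];
	if ( isset( $m[2] ) && $m[2] !== '' ) {
		if ( $onlyOne ) {
			// A second dimension is present but only one was allowed.
			return null;
		}
		$dims['y'] = (int)$m[2];
	}
	return $dims;
}
```

Under these assumptions, `200x100px` yields both `x` and `y`, `200px` yields only `x`, and `200x100px` with `$onlyOne` set yields null.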

◆ phpURLEncode()

static Wikimedia\Parsoid\Utils\Utils::phpURLEncode ( $txt)
static

This should not be used.

Parameters
string $txt URL to encode using PHP encoding
Returns
string

◆ stripNamespace()

static Wikimedia\Parsoid\Utils\Utils::stripNamespace ( string $className)
static

Strip PHP namespace from the fully qualified class name.

Parameters
string $className
Returns
string
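
The operation amounts to keeping whatever follows the last namespace separator. A minimal sketch, with `stripNamespaceSketch` as a hypothetical stand-in for the real method:

```php
<?php
// Hypothetical helper: keep only the part after the last backslash.
function stripNamespaceSketch( string $className ): string {
	$pos = strrpos( $className, '\\' );
	// A name without any namespace separator is returned unchanged.
	return $pos === false ? $className : substr( $className, $pos + 1 );
}
```

For example, `Wikimedia\Parsoid\Utils\Utils` reduces to `Utils`.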

◆ stripParsoidIdPrefix()

static Wikimedia\Parsoid\Utils\Utils::stripParsoidIdPrefix ( string $aboutId)
static

Strip Parsoid id prefix from aboutID.

Parameters
string $aboutId about ID string
Returns
string

◆ validateMediaParam()

static Wikimedia\Parsoid\Utils\Utils::validateMediaParam ( ?int $num)
static

Validate media parameters. More generally, this is defined by the media handler in core.

Parameters
?int $num
Returns
bool

Member Data Documentation

◆ $linkTrailRegex

Wikimedia\Parsoid\Utils\Utils::$linkTrailRegex
static
Initial value:
=
'/^[^\0-`{÷ĀĈ-ČĎĐĒĔĖĚĜĝĠ-ĪĬ-įIJĴ-ĹĻ-ĽĿŀŅņʼnŊŌŎŏŒŔŖ-ŘŜŝŠŤŦŨŪ-ŬŮŲ-ŴŶŸ' .
'ſ-ǤǦǨǪ-Ǯǰ-ȗȜ-ȞȠ-ɘɚ-ʑʓ-ʸʽ-̂̄-΅·΋΍΢Ϗ-ЯѐѝѠѢѤѦѨѪѬѮѰѲѴѶѸѺ-ѾҀ-҃҅-ҐҒҔҕҘҚҜ-ҠҤ-ҪҬҭҰҲ' .
'Ҵ-ҶҸҹҼ-ҿӁ-ӗӚ-ӜӞӠ-ӢӤӦӪ-ӲӴӶ-ՠֈ-׏׫-ؠً-ٳٵ-ٽٿ-څڇ-ڗڙ-ڨڪ-ڬڮڰ-ڽڿ-ۅۈ-ۊۍ-۔ۖ-਀਄਋-਎਑਒' .
'਩਱਴਷਺਻਽੃-੆੉੊੎-੘੝੟-੯ੴ-჏ჱ-ẼẾ-\x{200b}\x{200d}-‒—-‗‚‛”--\x{fffd}]+$/D'

This regex was generated by running through all unicode characters and testing them against all regexes for linktrails in a default MW install.

We had to adjust it slightly; here is what we changed:

  1. A-Z, though allowed in Walloon, is disallowed.
  2. '"', though allowed in Chuvash, is disallowed.
  3. '-', though allowed in Icelandic (possibly due to a bug), is disallowed.
  4. '1', though allowed in Lak (possibly due to a bug), is disallowed.

◆ COMMENT_REGEXP_FRAGMENT

const Wikimedia\Parsoid\Utils\Utils::COMMENT_REGEXP_FRAGMENT = '<!--(?>[\s\S]*?-->)'

Regular expression fragment for matching wikitext comments.

Meant for inclusion in other regular expressions.
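
As a sketch of what "inclusion in other regular expressions" can look like, the snippet below copies the fragment value shown above into a local constant so that it runs without Parsoid. The two helper functions are hypothetical and exist only for illustration.

```php
<?php
// Local copy of the fragment value shown above, so the sketch runs on its own.
const COMMENT_REGEXP_FRAGMENT = '<!--(?>[\s\S]*?-->)';

// Hypothetical helper: remove every complete comment from the text.
function stripCommentsSketch( string $wikitext ): string {
	return preg_replace( '/' . COMMENT_REGEXP_FRAGMENT . '/', '', $wikitext );
}

// Hypothetical helper: true if the line holds only whitespace and comments.
// The fragment is embedded as one alternative inside a larger pattern.
function isCommentOnlyLineSketch( string $line ): bool {
	return preg_match( '/^(?:\s|' . COMMENT_REGEXP_FRAGMENT . ')*$/D', $line ) === 1;
}
```

Because the fragment carries no delimiters or anchors of its own, it can be dropped into an alternation or a repetition like the one above.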

