Parsoid
A bidirectional parser between wikitext and HTML5
Wikimedia\Parsoid\Utils\Utils Class Reference

This file contains general utilities for token transforms. More...

Static Public Member Functions

static stripParsoidIdPrefix (string $aboutId)
 Strip Parsoid id prefix from aboutID.
 
static stripNamespace (string $className)
 Strip PHP namespace from the fully qualified class name.
 
static isParsoidObjectId (string $aboutId)
 Check for Parsoid id prefix in an aboutID string.
 
static isVoidElement (string $name)
 Determine if the named tag is void (cannot have content).
 
static clone ( $obj, $deepClone=true, $debug=false)
 Deep clones by default.
 
static lastUniChar (string $str, ?int $idx=null)
 Extract the last Unicode character of the string.
 
static isUniWord (string $s)
 Return true if the first character in $s is a Unicode word character.
 
static phpURLEncode ( $txt)
 This should not be used.
 
static decodeURI (string $s)
 Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone.
 
static decodeURIComponent (string $s)
 Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone.
 
static extractExtBody (Token $token)
 Extract extension source from the token.
 
static isValidDSR (?DomSourceRange $dsr, bool $all=false)
 Basic check if a DOM Source Range (DSR) is valid.
 
static normalizeNamespaceName (string $name)
 Canonicalizes a namespace name.
 
static decodeWtEntities (string $text)
 Decode HTML5 entities in wikitext.
 
static escapeWtEntities (string $text)
 Entity-escape anything that would decode to a valid wikitext entity.
 
static escapeHtml (string $s)
 Convert special characters to HTML entities.
 
static entityEncodeAll (string $s)
 Encode all characters as entity references.
 
static isProtocolValid ( $linkTarget, Env $env)
 Determine whether the protocol of a link is potentially valid.
 
static getExtArgInfo (Token $extToken)
 Get argument information for an extension tag token.
 
static parseMediaDimensions (string $str, bool $onlyOne=false)
 Parse media dimensions.
 
static validateMediaParam (?int $num)
 Validate media parameters. More generally, this is defined by the media handler in core.
 
static getStar ( $revision)
 FIXME: Is this needed??
 
static isLinkTrail (string $text)
 Check whether some text is a valid link trail.
 
static bcp47ToMwCode ( $code)
 Convert BCP-47-compliant language code to MediaWiki-internal code.
 
static mwCodeToBcp47 ( $code, bool $strict=false, ?LoggerInterface $warnLogger=null)
 Convert MediaWiki-internal language code to a BCP-47-compliant language code suitable for including in HTML.
 
static isBcp47CodeEqual (Bcp47Code $a, Bcp47Code $b)
 BCP 47 codes are case-insensitive, so this helper does a "proper" comparison of Bcp47Code objects.
 

Public Attributes

const COMMENT_REGEXP_FRAGMENT = '<!--(?>[\s\S]*?-->)'
 Regular expression fragment for matching wikitext comments.
 
const COMMENT_REGEXP = '/' . self::COMMENT_REGEXP_FRAGMENT . '/'
 Regular expression for matching a wikitext comment.
 

Static Public Attributes

static $linkTrailRegex
 This regex was generated by running through all Unicode characters and testing them against all regexes for linktrails in a default MW install.
 

Detailed Description

This file contains general utilities for token transforms.

Member Function Documentation

◆ bcp47ToMwCode()

static Wikimedia\Parsoid\Utils\Utils::bcp47ToMwCode ( $code)
static

Convert BCP-47-compliant language code to MediaWiki-internal code.

This is a temporary back-compatibility hack; Parsoid should be using BCP 47 strings or Bcp47Code objects in all its external APIs. Try to avoid using it, though: there's no guarantee that this mapping will remain in sync with upstream.

Parameters
string | Bcp47Code $code    BCP-47 language code
Returns
string MediaWiki-internal language code

◆ clone()

static Wikimedia\Parsoid\Utils\Utils::clone ( $obj,
$deepClone = true,
$debug = false )
static

Deep clones by default.

Parameters
object | array $obj    Arrays or plain objects. Tokens or DOM nodes shouldn't be passed in.

CAVEAT: It looks like debugging methods pass in arrays that can contain DOM nodes. So, for debugging purposes, we handle top-level DOM nodes or DOM nodes embedded in arrays. But this will fail miserably if an object embeds a DOM node.

bool $deepClone
bool $debug
Returns
object|array

◆ decodeURI()

static Wikimedia\Parsoid\Utils\Utils::decodeURI ( string $s)
static

Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone.

Distinct from decodeURIComponent in that certain escapes are not decoded, matching the behavior of JavaScript's decodeURI().

See also
https://www.ecma-international.org/ecma-262/6.0/#sec-decodeuri-encodeduri
Parameters
string $s    URI to be decoded
Returns
string

◆ decodeURIComponent()

static Wikimedia\Parsoid\Utils\Utils::decodeURIComponent ( string $s)
static

Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone.

Parameters
string $s    URI to be decoded
Returns
string
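
The difference between decodeURI() and decodeURIComponent() can be sketched roughly as follows. This is a simplified illustration, not Parsoid's implementation: the function name sketchPercentDecode and its $keepEncoded parameter are hypothetical, a whole run of escapes is either decoded or left untouched, and escapes that stay encoded are rewritten with uppercase hex digits. The reserved set passed in the usage note is the one JavaScript's decodeURI() leaves encoded.

```php
<?php
/**
 * Hypothetical sketch: decode runs of %XX escapes only when the decoded
 * bytes form valid UTF-8; otherwise leave the run exactly as it was.
 * Characters listed in $keepEncoded are re-escaped after decoding.
 */
function sketchPercentDecode(string $s, string $keepEncoded = ''): string
{
    return preg_replace_callback(
        '/(?:%[0-9A-Fa-f]{2})+/',
        static function (array $m) use ($keepEncoded): string {
            $decoded = rawurldecode($m[0]);
            // An empty pattern with the /u flag only matches valid UTF-8.
            if (preg_match('//u', $decoded) !== 1) {
                return $m[0];
            }
            if ($keepEncoded === '') {
                return $decoded;
            }
            // Put back the escapes that a decodeURI-like caller wants kept.
            return preg_replace_callback(
                '/[' . preg_quote($keepEncoded, '/') . ']/',
                static fn(array $c): string => '%' . strtoupper(bin2hex($c[0])),
                $decoded
            ) ?? $m[0];
        },
        $s
    ) ?? $s;
}
```

Under these assumptions, sketchPercentDecode($s) behaves like a component decoder, while sketchPercentDecode($s, ';/?:@&=+$,#') keeps reserved characters such as "/" percent-encoded.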

◆ decodeWtEntities()

static Wikimedia\Parsoid\Utils\Utils::decodeWtEntities ( string $text)
static

Decode HTML5 entities in wikitext.

NOTE that wikitext only allows semicolon-terminated entities, while HTML allows a number of "legacy" entities to be decoded without a terminating semicolon. This function deliberately does not decode these HTML-only entity forms.

Parameters
string $text
Returns
string
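
The semicolon rule described above can be illustrated with a minimal sketch. It is not Parsoid's implementation: sketchDecodeWtEntities is a hypothetical name, and it relies on PHP's built-in html_entity_decode(), whose entity table may not match Parsoid's exactly. The regular expression only matches references that end in a semicolon, so HTML-only forms without one are left alone.

```php
<?php
/**
 * Hypothetical sketch: decode only semicolon-terminated entity references.
 * Unknown names are returned unchanged by html_entity_decode().
 */
function sketchDecodeWtEntities(string $text): string
{
    return preg_replace_callback(
        // Decimal, hexadecimal, or named reference, always ending in ';'.
        '/&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);/',
        static fn(array $m): string =>
            html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8'),
        $text
    ) ?? $text;
}
```

With this sketch, "&para;" is decoded to a pilcrow sign while "&para" followed by a space stays as written.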

◆ entityEncodeAll()

static Wikimedia\Parsoid\Utils\Utils::entityEncodeAll ( string $s)
static

Encode all characters as entity references.

This is done to make characters safe for wikitext (regardless of whether they are HTML-safe). Typically only called with single-codepoint strings.

Parameters
string $s
Returns
string
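
As a rough illustration of encoding every character as an entity reference, the sketch below emits a hexadecimal numeric reference per code point. The function name sketchEntityEncodeAll and the choice of uppercase hexadecimal references are assumptions; the output format Parsoid actually produces is not stated here.

```php
<?php
/**
 * Hypothetical sketch: turn each Unicode code point of a UTF-8 string
 * into a hexadecimal numeric character reference.
 */
function sketchEntityEncodeAll(string $s): string
{
    // Split into code points; the /u flag rejects invalid UTF-8.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    $out = '';
    foreach ($chars as $ch) {
        $bytes = array_values(unpack('C*', $ch));
        $n = count($bytes);
        // Lead byte: keep all 7 bits for ASCII, else mask off the length marker.
        $cp = ($n === 1) ? $bytes[0] : ($bytes[0] & (0xFF >> ($n + 1)));
        // Each continuation byte contributes its low 6 bits.
        for ($i = 1; $i < $n; $i++) {
            $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
        }
        $out .= '&#x' . strtoupper(dechex($cp)) . ';';
    }
    return $out;
}
```

For a single-codepoint input such as "[", this sketch yields "&#x5B;", which can no longer be read as wikitext syntax.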

◆ escapeHtml()

static Wikimedia\Parsoid\Utils\Utils::escapeHtml ( string $s)
static

Convert special characters to HTML entities.

Parameters
string $s
Returns
string

◆ escapeWtEntities()

static Wikimedia\Parsoid\Utils\Utils::escapeWtEntities ( string $text)
static

Entity-escape anything that would decode to a valid wikitext entity.

Note that HTML5 allows certain "semicolon-less" entities, like "&para" with no terminating semicolon; these aren't allowed in wikitext and won't be escaped by this function.

Parameters
string $text
Returns
string

◆ extractExtBody()

static Wikimedia\Parsoid\Utils\Utils::extractExtBody ( Token $token)
static

Extract extension source from the token.

Parameters
Token $token    token
Returns
string

◆ getExtArgInfo()

static Wikimedia\Parsoid\Utils\Utils::getExtArgInfo ( Token $extToken)
static

Get argument information for an extension tag token.

Parameters
Token $extToken
Returns
DataMw

◆ getStar()

static Wikimedia\Parsoid\Utils\Utils::getStar ( $revision)
static

FIXME: Is this needed??

Extract content in a backwards-compatible way.

Parameters
object $revision
Returns
object

◆ isBcp47CodeEqual()

static Wikimedia\Parsoid\Utils\Utils::isBcp47CodeEqual ( Bcp47Code $a,
Bcp47Code $b )
static

BCP 47 codes are case-insensitive, so this helper does a "proper" comparison of Bcp47Code objects.

Parameters
Bcp47Code $a
Bcp47Code $b
Returns
bool true iff $a and $b represent the same language
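
A case-insensitive comparison of this kind might look like the sketch below. The SketchBcp47Code interface, the SketchLang class, and sketchIsBcp47CodeEqual are hypothetical stand-ins defined only so the example is self-contained; they are not the real Bcp47Code types.

```php
<?php
/** Hypothetical stand-in for an object that can report its BCP 47 code. */
interface SketchBcp47Code
{
    public function toBcp47Code(): string;
}

/** Minimal value object wrapping a language code string. */
final class SketchLang implements SketchBcp47Code
{
    private string $code;

    public function __construct(string $code)
    {
        $this->code = $code;
    }

    public function toBcp47Code(): string
    {
        return $this->code;
    }
}

/** Compare two codes while ignoring letter case. */
function sketchIsBcp47CodeEqual(SketchBcp47Code $a, SketchBcp47Code $b): bool
{
    return strcasecmp($a->toBcp47Code(), $b->toBcp47Code()) === 0;
}
```

A plain === comparison of the code strings would treat "zh-Hant" and "zh-hant" as different languages, which is the mistake this helper is meant to avoid.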

◆ isLinkTrail()

static Wikimedia\Parsoid\Utils\Utils::isLinkTrail ( string $text)
static

Check whether some text is a valid link trail.

Parameters
string $text
Returns
bool

◆ isParsoidObjectId()

static Wikimedia\Parsoid\Utils\Utils::isParsoidObjectId ( string $aboutId)
static

Check for Parsoid id prefix in an aboutID string.

Parameters
string $aboutId    about ID string
Returns
bool

◆ isProtocolValid()

static Wikimedia\Parsoid\Utils\Utils::isProtocolValid ( $linkTarget,
Env $env )
static

Determine whether the protocol of a link is potentially valid.

Use the environment's per-wiki config to do so.

Parameters
mixed $linkTarget
Env $env
Returns
bool

◆ isUniWord()

static Wikimedia\Parsoid\Utils\Utils::isUniWord ( string $s)
static

Return true if the first character in $s is a Unicode word character.

Parameters
string $s
Returns
bool

◆ isValidDSR()

static Wikimedia\Parsoid\Utils\Utils::isValidDSR ( ?DomSourceRange $dsr,
bool $all = false )
static

Basic check if a DOM Source Range (DSR) is valid.

Clarifications about the "basic validity checks":

  • Only checks for underflow, not for overflow.
  • Does not verify that start <= end.
  • Does not verify that openWidth + endWidth <= end - start (even so, the values might be invalid because of content).

These would be overkill for our purposes. Given how DSR computation works in this codebase, the scenarios we actually need to guard against are null or negative values, since those can occur.
Parameters
?DomSourceRange $dsr    DSR source range values
bool $all    Also check the widths of the container tag
Returns
bool
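
The checks described above can be sketched as follows. SketchDsr is a hypothetical stand-in for DomSourceRange, and its property names (start, end, openWidth, endWidth) are borrowed from the wording of the clarifications rather than from the real class.

```php
<?php
/** Hypothetical stand-in for a DOM source range. */
final class SketchDsr
{
    public ?int $start;
    public ?int $end;
    public ?int $openWidth;
    public ?int $endWidth;

    public function __construct(
        ?int $start,
        ?int $end,
        ?int $openWidth = null,
        ?int $endWidth = null
    ) {
        $this->start = $start;
        $this->end = $end;
        $this->openWidth = $openWidth;
        $this->endWidth = $endWidth;
    }
}

/**
 * Basic validity: every inspected value must be non-null and non-negative.
 * Deliberately no start <= end check and no width-sum check.
 */
function sketchIsValidDsr(?SketchDsr $dsr, bool $all = false): bool
{
    $ok = static fn(?int $n): bool => $n !== null && $n >= 0;

    return $dsr !== null
        && $ok($dsr->start)
        && $ok($dsr->end)
        && (!$all || ($ok($dsr->openWidth) && $ok($dsr->endWidth)));
}
```

Note that a range with start greater than end still passes this sketch, matching the second clarification in the list.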

◆ isVoidElement()

static Wikimedia\Parsoid\Utils\Utils::isVoidElement ( string $name)
static

Determine if the named tag is void (cannot have content).

Parameters
string $name    tag name
Returns
bool
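
For illustration, a void-element check can be written as a simple set lookup. The list below is the set of void elements in the HTML Standard; whether Parsoid's own list matches it exactly (for example, by including obsolete elements) and whether it lowercases its input are assumptions, as are the names SKETCH_VOID_ELEMENTS and sketchIsVoidElement.

```php
<?php
/** Void elements as listed in the HTML Standard (assumed list for this sketch). */
const SKETCH_VOID_ELEMENTS = [
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr',
];

/** Return true if the tag name denotes an element that cannot have content. */
function sketchIsVoidElement(string $name): bool
{
    // HTML tag names are ASCII case-insensitive, so normalize first.
    return in_array(strtolower($name), SKETCH_VOID_ELEMENTS, true);
}
```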

◆ lastUniChar()

static Wikimedia\Parsoid\Utils\Utils::lastUniChar ( string $str,
?int $idx = null )
static

Extract the last Unicode character of the string.

This might be more than one byte, if the last character is non-ASCII.

Parameters
string $str
?int $idx    The index after the character to extract; defaults to the length of $str, which will extract the last character in $str.
Returns
string
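
The role of $idx as a byte index just past the character can be illustrated with a minimal sketch. It assumes $str is valid UTF-8, and returning an empty string for an out-of-range index is a choice made for this example only; sketchLastUniChar is a hypothetical name.

```php
<?php
/**
 * Hypothetical sketch: return the UTF-8 character that ends just before
 * byte offset $idx (default: end of string).
 */
function sketchLastUniChar(string $str, ?int $idx = null): string
{
    $idx = $idx ?? strlen($str);
    if ($idx <= 0 || $idx > strlen($str)) {
        return '';
    }
    $start = $idx - 1;
    // Continuation bytes look like 10xxxxxx; step back to the lead byte.
    while ($start > 0 && (ord($str[$start]) & 0xC0) === 0x80) {
        $start--;
    }
    return substr($str, $start, $idx - $start);
}
```

Because the walk is over bytes, a non-ASCII final character comes back as a multi-byte string, as noted above.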

◆ mwCodeToBcp47()

static Wikimedia\Parsoid\Utils\Utils::mwCodeToBcp47 ( $code,
bool $strict = false,
?LoggerInterface $warnLogger = null )
static

Convert MediaWiki-internal language code to a BCP-47-compliant language code suitable for including in HTML.

This is a temporary back-compatibility hack, needed for compatibility when running in standalone mode with MediaWiki Action APIs which expose internal language codes. These APIs should eventually be improved so that they also expose BCP-47 compliant codes, which can then be used directly by Parsoid without conversion. But until that day comes, this function will paper over the differences.

Note that MediaWiki-internal Language objects implement Bcp47Code, so we can transition interfaces which currently take a string code to pass a Language object instead; that will make this method effectively a no-op and avoid the issue of upstream sync of the mapping table.

Parameters
string | Bcp47Code $code    MediaWiki-internal language code or object
bool $strict    If true, this code will log a deprecation message or fail if a MediaWiki-internal language code is passed.
?LoggerInterface $warnLogger    A deprecation warning will be emitted on $warnLogger if $strict is true and a string-valued MediaWiki-internal language code is passed; otherwise an exception will be thrown.
Returns
Bcp47Code BCP-47 language code.
See also
LanguageCode::bcp47()

◆ normalizeNamespaceName()

static Wikimedia\Parsoid\Utils\Utils::normalizeNamespaceName ( string $name)
static

Canonicalizes a namespace name.

Parameters
string $name    Non-normalized namespace name.
Returns
string

◆ parseMediaDimensions()

static Wikimedia\Parsoid\Utils\Utils::parseMediaDimensions ( string $str,
bool $onlyOne = false )
static

Parse media dimensions.

Parameters
string $str    media dimension string to parse
bool $onlyOne    If set, returns null if multiple dimensions are present
Returns
?array{x:int,y?:int,bogusPx:bool}

◆ phpURLEncode()

static Wikimedia\Parsoid\Utils\Utils::phpURLEncode ( $txt)
static

This should not be used.

Parameters
string $txt    URL to encode using PHP encoding
Returns
string

◆ stripNamespace()

static Wikimedia\Parsoid\Utils\Utils::stripNamespace ( string $className)
static

Strip PHP namespace from the fully qualified class name.

Parameters
string $className
Returns
string
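
Stripping a namespace amounts to keeping whatever follows the last backslash. The sketch below shows one way to do that; sketchStripNamespace is a hypothetical name, and returning an unqualified name unchanged is an assumption.

```php
<?php
/** Hypothetical sketch: drop everything up to and including the last "\". */
function sketchStripNamespace(string $className): string
{
    $pos = strrpos($className, '\\');
    return $pos === false ? $className : substr($className, $pos + 1);
}
```

For example, applying it to the fully qualified name of this class leaves just "Utils".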

◆ stripParsoidIdPrefix()

static Wikimedia\Parsoid\Utils\Utils::stripParsoidIdPrefix ( string $aboutId)
static

Strip Parsoid id prefix from aboutID.

Parameters
string $aboutId    about ID string
Returns
string

◆ validateMediaParam()

static Wikimedia\Parsoid\Utils\Utils::validateMediaParam ( ?int $num)
static

Validate media parameters. More generally, this is defined by the media handler in core.

Parameters
?int $num
Returns
bool

Member Data Documentation

◆ $linkTrailRegex

Wikimedia\Parsoid\Utils\Utils::$linkTrailRegex
static
Initial value:
=
'/^[^\0-`{÷ĀĈ-ČĎĐĒĔĖĚĜĝĠ-ĪĬ-įIJĴ-ĹĻ-ĽĿŀŅņʼnŊŌŎŏŒŔŖ-ŘŜŝŠŤŦŨŪ-ŬŮŲ-ŴŶŸ' .
'ſ-ǤǦǨǪ-Ǯǰ-ȗȜ-ȞȠ-ɘɚ-ʑʓ-ʸʽ-̂̄-΅·΋΍΢Ϗ-ЯѐѝѠѢѤѦѨѪѬѮѰѲѴѶѸѺ-ѾҀ-҃҅-ҐҒҔҕҘҚҜ-ҠҤ-ҪҬҭҰҲ' .
'Ҵ-ҶҸҹҼ-ҿӁ-ӗӚ-ӜӞӠ-ӢӤӦӪ-ӲӴӶ-ՠֈ-׏׫-ؠً-ٳٵ-ٽٿ-څڇ-ڗڙ-ڨڪ-ڬڮڰ-ڽڿ-ۅۈ-ۊۍ-۔ۖ-਀਄਋-਎਑਒' .
'਩਱਴਷਺਻਽੃-੆੉੊੎-੘੝੟-੯ੴ-჏ჱ-ẼẾ-\x{200b}\x{200d}-‒—-‗‚‛”--\x{fffd}]+$/D'

This regex was generated by running through all Unicode characters and testing them against all regexes for linktrails in a default MW install.

We had to adjust it slightly; here is what we changed:

  1. A-Z, though allowed in Walloon, is disallowed.
  2. '"', though allowed in Chuvash, is disallowed.
  3. '-', though allowed in Icelandic (possibly due to a bug), is disallowed.
  4. '1', though allowed in Lak (possibly due to a bug), is disallowed.

◆ COMMENT_REGEXP_FRAGMENT

const Wikimedia\Parsoid\Utils\Utils::COMMENT_REGEXP_FRAGMENT = '<!--(?>[\s\S]*?-->)'

Regular expression fragment for matching wikitext comments.

Meant for inclusion in other regular expressions.
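
As an illustration of embedding the fragment in a larger pattern, the sketch below copies the fragment's value into a local constant so it runs without Parsoid. The constant and function names are hypothetical.

```php
<?php
/** Local copy of the fragment shown above, so the example is self-contained. */
const SKETCH_COMMENT_FRAGMENT = '<!--(?>[\s\S]*?-->)';

/** Remove every complete comment from a piece of wikitext. */
function sketchStripComments(string $wt): string
{
    return preg_replace('/' . SKETCH_COMMENT_FRAGMENT . '/', '', $wt) ?? $wt;
}

/** True if the text consists only of whitespace and complete comments. */
function sketchIsOnlyCommentsAndSpace(string $wt): bool
{
    // The fragment is dropped into an alternation inside a larger regex.
    $re = '/^(?:\s|' . SKETCH_COMMENT_FRAGMENT . ')*$/D';
    return preg_match($re, $wt) === 1;
}
```

The fragment contains no "/" character, so it can be concatenated between "/" delimiters without extra quoting; with a different delimiter that assumption would need rechecking.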

