Parsoid
A bidirectional parser between wikitext and HTML5
This file contains general utilities for token transforms.
Static Public Member Functions | |
static | stripParsoidIdPrefix (string $aboutId) |
Strip Parsoid id prefix from aboutID. | |
static | stripNamespace (string $className) |
Strip PHP namespace from the fully qualified class name. | |
static | isParsoidObjectId (string $aboutId) |
Check for Parsoid id prefix in an aboutID string. | |
static | isVoidElement (string $name) |
Determine if the named tag is void (can not have content). | |
static | clone ( $obj, $deepClone=true, $debug=false) |
Deep clones by default. | |
static | lastUniChar (string $str, ?int $idx=null) |
Extract the last unicode character of the string. | |
static | isUniWord (string $s) |
Return true if the first character in $s is a unicode word character. | |
static | phpURLEncode ( $txt) |
This should not be used. | |
static | decodeURI (string $s) |
Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone. | |
static | decodeURIComponent (string $s) |
Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone. | |
static | extractExtBody (Token $token) |
Extract extension source from the token. | |
static | isValidDSR (?DomSourceRange $dsr, bool $all=false) |
Basic check if a DOM Source Range (DSR) is valid. | |
static | normalizeNamespaceName (string $name) |
Canonicalizes a namespace name. | |
static | decodeWtEntities (string $text) |
Decode HTML5 entities in wikitext. | |
static | escapeWtEntities (string $text) |
Entity-escape anything that would decode to a valid wikitext entity. | |
static | escapeHtml (string $s) |
Convert special characters to HTML entities. | |
static | entityEncodeAll (string $s) |
Encode all characters as entity references. | |
static | isProtocolValid ( $linkTarget, Env $env) |
Determine whether the protocol of a link is potentially valid. | |
static | getExtArgInfo (Token $extToken) |
Get argument information for an extension tag token. | |
static | parseMediaDimensions (string $str, bool $onlyOne=false) |
Parse media dimensions. | |
static | validateMediaParam (?int $num) |
Validate media parameters. More generally, this is defined by the media handler in core. | |
static | getStar ( $revision) |
FIXME: Is this needed?? | |
static | isLinkTrail (string $text) |
Check whether some text is a valid link trail. | |
static | bcp47ToMwCode ( $code) |
Convert BCP-47-compliant language code to MediaWiki-internal code. | |
static | mwCodeToBcp47 ( $code, bool $strict=false, ?LoggerInterface $warnLogger=null) |
Convert MediaWiki-internal language code to a BCP-47-compliant language code suitable for including in HTML. | |
static | isBcp47CodeEqual (Bcp47Code $a, Bcp47Code $b) |
BCP 47 codes are case-insensitive, so this helper does a "proper" comparison of Bcp47Code objects. | |
Public Attributes | |
const | COMMENT_REGEXP_FRAGMENT = '<!--(?>[\s\S]*?-->)' |
Regular expression fragment for matching wikitext comments. | |
const | COMMENT_REGEXP = '/' . self::COMMENT_REGEXP_FRAGMENT . '/' |
Regular expression for matching a wikitext comment. | |
Static Public Attributes | |
static | $linkTrailRegex |
This regex was generated by running through all unicode characters and testing them against all regexes for linktrails in a default MW install. | |
This file contains general utilities for token transforms.
static Wikimedia\Parsoid\Utils\Utils::bcp47ToMwCode ( $code)
Convert BCP-47-compliant language code to MediaWiki-internal code.
This is a temporary back-compatibility hack; Parsoid should be using BCP 47 strings or Bcp47Code objects in all its external APIs. Try to avoid using it, though: there's no guarantee that this mapping will remain in sync with upstream.
string|Bcp47Code | $code | BCP-47 language code |
static Wikimedia\Parsoid\Utils\Utils::clone ( $obj, $deepClone=true, $debug=false)
Deep clones by default.
object|array | $obj | Arrays or plain objects. Tokens or DOM nodes shouldn't be passed in. |
CAVEAT: It looks like debugging methods pass in arrays that can contain DOM nodes. So, for debugging purposes, we handle top-level DOM nodes and DOM nodes embedded in arrays. However, this will fail if an object embeds a DOM node.
bool | $deepClone | |
bool | $debug |
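The difference between a deep and a shallow clone of arrays and plain objects can be sketched as follows. This is a minimal illustration, not Parsoid's implementation: `deepCloneSketch` is a hypothetical name, it only walks public properties, and it does not handle the `$debug` case or DOM nodes mentioned in the caveat above.

```php
<?php
// Illustrative sketch only; the function name is hypothetical.
function deepCloneSketch($value, bool $deepClone = true) {
    if (is_array($value)) {
        // PHP arrays are value types, so a shallow copy is the array itself
        // (objects stored inside it are still shared).
        if (!$deepClone) {
            return $value;
        }
        $copy = [];
        foreach ($value as $key => $item) {
            $copy[$key] = deepCloneSketch($item, true);
        }
        return $copy;
    }
    if (is_object($value)) {
        // `clone` is shallow: nested objects are shared until copied below.
        $copy = clone $value;
        if ($deepClone) {
            // Only public properties are visible from this scope.
            foreach (get_object_vars($copy) as $name => $item) {
                $copy->$name = deepCloneSketch($item, true);
            }
        }
        return $copy;
    }
    // Scalars and null are copied by value.
    return $value;
}
```

With a deep clone, mutating a nested object in the copy leaves the original untouched; with `$deepClone = false`, nested objects remain shared between the two.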
static Wikimedia\Parsoid\Utils\Utils::decodeURI (string $s)
Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone.
Distinct from decodeURIComponent in that certain escapes are not decoded, matching the behavior of JavaScript's decodeURI().
string | $s | URI to be decoded |
static Wikimedia\Parsoid\Utils\Utils::decodeURIComponent (string $s)
Percent-decode only valid UTF-8 characters, leaving other encoded bytes alone.
string | $s | URI to be decoded |
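The contrast between the two decoders can be sketched with one helper that decodes only percent-escapes forming valid UTF-8, plus an optional set of characters whose escapes are left alone. This is an assumption-laden sketch rather than Parsoid's code: `percentDecodeUtf8Sketch` and `DECODE_URI_RESERVED` are hypothetical names, and the reserved set shown is the one JavaScript's decodeURI() leaves encoded.

```php
<?php
// Characters whose escapes JavaScript's decodeURI() does not decode.
const DECODE_URI_RESERVED = ';/?:@&=+$,#';

// Illustrative sketch only; the function name is hypothetical.
function percentDecodeUtf8Sketch(string $s, string $keepEncoded = ''): string {
    // One alternative per UTF-8 sequence length (1 to 4 bytes).
    $pattern = '/%[0-7][0-9A-F]'
        . '|%[CD][0-9A-F]%[89AB][0-9A-F]'
        . '|%E[0-9A-F](?:%[89AB][0-9A-F]){2}'
        . '|%F[0-7](?:%[89AB][0-9A-F]){3}/i';
    return preg_replace_callback(
        $pattern,
        static function (array $m) use ($keepEncoded): string {
            $decoded = rawurldecode($m[0]);
            // Reject overlong forms, surrogates, etc.: keep them encoded.
            if (preg_match('//u', $decoded) !== 1) {
                return $m[0];
            }
            // decodeURI-style behavior: leave selected ASCII escapes alone.
            if (strlen($decoded) === 1 && $keepEncoded !== ''
                && strpos($keepEncoded, $decoded) !== false) {
                return $m[0];
            }
            return $decoded;
        },
        $s
    );
}
```

Called without a second argument the helper behaves like the decodeURIComponent description; passing `DECODE_URI_RESERVED` approximates the decodeURI description, so `a%2Fb` stays encoded while `%20` is still decoded. A lone `%E9` is not valid UTF-8 on its own and is left untouched in both modes.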
static Wikimedia\Parsoid\Utils\Utils::decodeWtEntities (string $text)
Decode HTML5 entities in wikitext.
NOTE that wikitext only allows semicolon-terminated entities, while HTML allows a number of "legacy" entities to be decoded without a terminating semicolon. This function deliberately does not decode these HTML-only entity forms.
string | $text |
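The semicolon rule described above can be illustrated with a short sketch that only hands semicolon-terminated candidates to PHP's `html_entity_decode()`. The function name `decodeSemicolonEntitiesSketch` is hypothetical, and the entity table is PHP's built-in HTML5 table, which may not match Parsoid's exactly.

```php
<?php
// Illustrative sketch only; the function name is hypothetical.
function decodeSemicolonEntitiesSketch(string $text): string {
    // Only candidates ending in ';' are considered, so HTML-only legacy
    // forms such as "&para" (no semicolon) are never decoded.
    return preg_replace_callback(
        '/&[#0-9a-zA-Z]+;/',
        static function (array $m): string {
            // Unknown names are returned unchanged by html_entity_decode().
            return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        },
        $text
    );
}
```

Named, decimal, and hexadecimal references are all decoded as long as they carry the terminating semicolon; anything else passes through verbatim.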
static Wikimedia\Parsoid\Utils\Utils::entityEncodeAll (string $s)
Encode all characters as entity references.
This is done to make characters safe for wikitext (regardless of whether they are HTML-safe). Typically only called with single-codepoint strings.
string | $s |
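As a rough sketch of the idea, every code point of the input can be turned into a numeric character reference. The uppercase hexadecimal `&#x...;` output format and the name `entityEncodeAllSketch` are assumptions made for illustration; the input is assumed to be valid UTF-8.

```php
<?php
// Illustrative sketch only; the name and the output format are assumptions.
function entityEncodeAllSketch(string $s): string {
    $out = '';
    // Split into code points; assumes $s is valid UTF-8.
    foreach (preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $bytes = array_values(unpack('C*', $char));
        $n = count($bytes);
        // Strip the length marker bits from the lead byte.
        $cp = ($n === 1) ? $bytes[0] : ($bytes[0] & (0xFF >> ($n + 1)));
        // Append six payload bits from each continuation byte.
        for ($i = 1; $i < $n; $i++) {
            $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
        }
        $out .= '&#x' . strtoupper(dechex($cp)) . ';';
    }
    return $out;
}
```

Because even plain ASCII letters are encoded, the result contains no character that wikitext could interpret as markup, which is why the function is typically applied to a single problematic character rather than to whole strings.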
static Wikimedia\Parsoid\Utils\Utils::escapeHtml (string $s)
Convert special characters to HTML entities.
string | $s |
static Wikimedia\Parsoid\Utils\Utils::escapeWtEntities (string $text)
Entity-escape anything that would decode to a valid wikitext entity.
Note that HTML5 allows certain "semicolon-less" entities, like &para (written without the semicolon); these aren't allowed in wikitext and won't be escaped by this function.
string | $text |
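One way to picture this is to escape the leading ampersand of any semicolon-terminated sequence that would actually decode. The sketch below uses PHP's built-in HTML5 entity table as a stand-in for Parsoid's own; `escapeWtEntitiesSketch` is a hypothetical name.

```php
<?php
// Illustrative sketch only; the function name is hypothetical.
function escapeWtEntitiesSketch(string $text): string {
    return preg_replace_callback(
        '/&[#0-9a-zA-Z]+;/',
        static function (array $m): string {
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            // If nothing decodes, the text is not an entity: leave it alone.
            // Otherwise escape only the ampersand that starts the entity.
            return $decoded === $m[0] ? $m[0] : '&amp;' . substr($m[0], 1);
        },
        $text
    );
}
```

Text such as `&zzzz;` or a semicolon-less `&para` is left as-is, since neither would be decoded as a wikitext entity.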
static Wikimedia\Parsoid\Utils\Utils::extractExtBody (Token $token)
Extract extension source from the token.
Token | $token | token |
static Wikimedia\Parsoid\Utils\Utils::getExtArgInfo (Token $extToken)
Get argument information for an extension tag token.
Token | $extToken |
static Wikimedia\Parsoid\Utils\Utils::getStar ( $revision)
FIXME: Is this needed??
Extract content in a backwards compatible way
object | $revision |
static Wikimedia\Parsoid\Utils\Utils::isBcp47CodeEqual (Bcp47Code $a, Bcp47Code $b)
BCP 47 codes are case-insensitive, so this helper does a "proper" comparison of Bcp47Code objects.
Bcp47Code | $a | |
Bcp47Code | $b |
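A case-insensitive comparison of this kind might look like the sketch below. To keep it self-contained it declares a hypothetical `Bcp47CodeSketch` interface and `SimpleCodeSketch` class standing in for the real Bcp47Code type; the `toBcp47Code()` accessor is assumed to return the code as a string.

```php
<?php
// Hypothetical stand-ins for the real Bcp47Code type, for illustration only.
interface Bcp47CodeSketch {
    public function toBcp47Code(): string;
}

final class SimpleCodeSketch implements Bcp47CodeSketch {
    private string $code;

    public function __construct(string $code) {
        $this->code = $code;
    }

    public function toBcp47Code(): string {
        return $this->code;
    }
}

// BCP 47 tags are ASCII, so a byte-wise case-insensitive compare suffices.
function isBcp47CodeEqualSketch(Bcp47CodeSketch $a, Bcp47CodeSketch $b): bool {
    return strcasecmp($a->toBcp47Code(), $b->toBcp47Code()) === 0;
}
```

Comparing the objects with `==` or comparing the raw strings with `===` would treat `sr-Cyrl` and `sr-cyrl` as different, which is the pitfall the helper avoids.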
static Wikimedia\Parsoid\Utils\Utils::isLinkTrail (string $text)
Check whether some text is a valid link trail.
string | $text |
static Wikimedia\Parsoid\Utils\Utils::isParsoidObjectId (string $aboutId)
Check for Parsoid id prefix in an aboutID string.
string | $aboutId | about ID string |
static Wikimedia\Parsoid\Utils\Utils::isProtocolValid ( $linkTarget, Env $env)
Determine whether the protocol of a link is potentially valid.
Use the environment's per-wiki config to do so.
mixed | $linkTarget | |
Env | $env |
static Wikimedia\Parsoid\Utils\Utils::isUniWord (string $s)
Return true if the first character in $s is a unicode word character.
string | $s |
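A check like this can be approximated with a Unicode-aware regular expression. In the sketch below, "word character" is taken to mean a letter, a digit, or an underscore; that definition and the name `startsWithWordCharSketch` are assumptions, and the class Parsoid actually uses may differ.

```php
<?php
// Illustrative sketch only; the function name is hypothetical.
function startsWithWordCharSketch(string $s): bool {
    // \p{L} = any letter, \p{N} = any number; /u enables UTF-8 matching.
    return preg_match('/^[\p{L}\p{N}_]/u', $s) === 1;
}
```

Only the first character is inspected, so `"été"` and `"9 lives"` pass while `" word"` and the empty string do not.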
static Wikimedia\Parsoid\Utils\Utils::isValidDSR (?DomSourceRange $dsr, bool $all=false)
Basic check if a DOM Source Range (DSR) is valid.
Clarifications about the "basic validity checks":
?DomSourceRange | $dsr | DSR source range values |
bool | $all | Also check the widths of the container tag |
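A basic check of this shape might be sketched as follows, assuming that "valid" means each inspected offset is non-null and non-negative. `DsrSketch` is a hypothetical stand-in for DomSourceRange with assumed field names, and this page does not confirm what the real check covers.

```php
<?php
// Hypothetical stand-in for DomSourceRange; field names are assumptions.
final class DsrSketch {
    public ?int $start;
    public ?int $end;
    public ?int $openWidth;
    public ?int $closeWidth;

    public function __construct(?int $start, ?int $end, ?int $openWidth = null, ?int $closeWidth = null) {
        $this->start = $start;
        $this->end = $end;
        $this->openWidth = $openWidth;
        $this->closeWidth = $closeWidth;
    }
}

// Illustrative sketch only: "valid" here means non-null and non-negative.
function isValidDsrSketch(?DsrSketch $dsr, bool $all = false): bool {
    $isOffset = static fn (?int $n): bool => $n !== null && $n >= 0;
    if ($dsr === null || !$isOffset($dsr->start) || !$isOffset($dsr->end)) {
        return false;
    }
    // With $all set, the container tag widths must be usable as well.
    return !$all || ($isOffset($dsr->openWidth) && $isOffset($dsr->closeWidth));
}
```

A range with known start and end but unknown tag widths passes the default check and fails once `$all` is true.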
static Wikimedia\Parsoid\Utils\Utils::isVoidElement (string $name)
Determine if the named tag is void (can not have content).
string | $name | tag name |
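For illustration, a void-element test can be a simple set lookup. The list below is the void element set from the current HTML Living Standard; Parsoid's own table may also include legacy tags, so treat both the list and the name `isVoidElementSketch` as assumptions.

```php
<?php
// Illustrative sketch only; the function name is hypothetical.
function isVoidElementSketch(string $name): bool {
    // Void elements per the HTML Living Standard (no content, no end tag).
    static $void = [
        'area' => true, 'base' => true, 'br' => true, 'col' => true,
        'embed' => true, 'hr' => true, 'img' => true, 'input' => true,
        'link' => true, 'meta' => true, 'source' => true, 'track' => true,
        'wbr' => true,
    ];
    // HTML tag names are case-insensitive.
    return isset($void[strtolower($name)]);
}
```

Using array keys rather than `in_array()` keeps the lookup constant-time, which matters for a helper called once per tag.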
static Wikimedia\Parsoid\Utils\Utils::lastUniChar (string $str, ?int $idx=null)
Extract the last unicode character of the string.
This might be more than one byte, if the last character is non-ASCII.
string | $str | |
?int | $idx | The index after the character to extract; defaults to the length of $str, which will extract the last character in $str. |
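The byte-level idea can be sketched by stepping backwards from `$idx` over UTF-8 continuation bytes until a lead byte is found. `lastUniCharSketch` is a hypothetical name, `$idx` is treated as a byte offset, and returning an empty string for an out-of-range index is an assumption of this sketch.

```php
<?php
// Illustrative sketch only; the function name is hypothetical.
function lastUniCharSketch(string $str, ?int $idx = null): string {
    $idx = $idx ?? strlen($str);
    if ($idx <= 0 || $idx > strlen($str)) {
        return ''; // assumption: nothing to extract
    }
    $start = $idx - 1;
    // Continuation bytes look like 10xxxxxx; skip back to the lead byte.
    while ($start > 0 && (ord($str[$start]) & 0xC0) === 0x80) {
        $start--;
    }
    return substr($str, $start, $idx - $start);
}
```

For `"café"` the result is the two-byte sequence for `é`, whereas `substr($str, -1)` would return only its final byte.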
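static Wikimedia\Parsoid\Utils\Utils::mwCodeToBcp47 ( $code, bool $strict=false, ?LoggerInterface $warnLogger=null)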
Convert MediaWiki-internal language code to a BCP-47-compliant language code suitable for including in HTML.
This is a temporary back-compatibility hack, needed for compatibility when running in standalone mode with MediaWiki Action APIs which expose internal language codes. These APIs should eventually be improved so that they also expose BCP-47 compliant codes, which can then be used directly by Parsoid without conversion. But until that day comes, this function will paper over the differences.
Note that MediaWiki-internal Language objects implement Bcp47Code, so we can transition interfaces which currently take a string code to pass a Language object instead; that will make this method effectively a no-op and avoid the issue of upstream sync of the mapping table.
string|Bcp47Code | $code | MediaWiki-internal language code or object |
bool | $strict | If true, this code will log a deprecation message or fail if a MediaWiki-internal language code is passed. |
?LoggerInterface | $warnLogger | A deprecation warning will be emitted on $warnLogger if $strict is true and a string-valued MediaWiki-internal language code is passed; otherwise an exception will be thrown. |
static Wikimedia\Parsoid\Utils\Utils::normalizeNamespaceName (string $name)
Canonicalizes a namespace name.
string | $name | Non-normalized namespace name. |
static Wikimedia\Parsoid\Utils\Utils::parseMediaDimensions (string $str, bool $onlyOne=false)
Parse media dimensions.
string | $str | media dimension string to parse |
bool | $onlyOne | If set, returns null if multiple dimensions are present |
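To make the `$onlyOne` behavior concrete, here is a minimal sketch. It assumes dimension strings of the form `WIDTHpx`, `WIDTHxHEIGHTpx`, or `xHEIGHTpx`, and an array with optional `x` and `y` keys as the result; the accepted syntax, the return shape, and the name `parseMediaDimensionsSketch` are all assumptions, not the documented contract.

```php
<?php
// Illustrative sketch only; syntax and return shape are assumptions.
function parseMediaDimensionsSketch(string $str, bool $onlyOne = false): ?array {
    // Optional width, optional "x" + height, optional "px" suffix.
    if (!preg_match('/^(\d*)(?:x(\d+))?\s*(?:px\s*)?$/D', $str, $m)) {
        return null;
    }
    $dims = [];
    if ($m[1] !== '') {
        $dims['x'] = (int)$m[1];
    }
    if (isset($m[2]) && $m[2] !== '') {
        $dims['y'] = (int)$m[2];
    }
    if ($dims === []) {
        return null; // nothing parseable, e.g. an empty string
    }
    // $onlyOne rejects strings that carry both a width and a height.
    if ($onlyOne && count($dims) > 1) {
        return null;
    }
    return $dims;
}
```

A caller that accepts only a single dimension can therefore pass `true` and treat `null` as "not a usable size".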
static Wikimedia\Parsoid\Utils\Utils::phpURLEncode ( $txt)
This should not be used.
string | $txt | URL to encode using PHP encoding |
static Wikimedia\Parsoid\Utils\Utils::stripNamespace (string $className)
Strip PHP namespace from the fully qualified class name.
string | $className |
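Stripping the namespace amounts to keeping everything after the last backslash. A minimal sketch, with the hypothetical name `stripNamespaceSketch`:

```php
<?php
// Illustrative sketch only; the function name is hypothetical.
function stripNamespaceSketch(string $className): string {
    $pos = strrpos($className, '\\');
    // No backslash means the name is already unqualified.
    return $pos === false ? $className : substr($className, $pos + 1);
}
```

For example, `Wikimedia\Parsoid\Utils\Utils` becomes `Utils`, and an unqualified name is returned unchanged.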
static Wikimedia\Parsoid\Utils\Utils::stripParsoidIdPrefix (string $aboutId)
Strip Parsoid id prefix from aboutID.
string | $aboutId | about ID string |
static Wikimedia\Parsoid\Utils\Utils::validateMediaParam (?int $num)
Validate media parameters. More generally, this is defined by the media handler in core.
?int | $num |
static Wikimedia\Parsoid\Utils\Utils::$linkTrailRegex
This regex was generated by running through all unicode characters and testing them against all regexes for linktrails in a default MW install.
We had to treat it a little bit, here's what we changed:
const Wikimedia\Parsoid\Utils\Utils::COMMENT_REGEXP_FRAGMENT = '<!--(?>[\s\S]*?-->)'
Regular expression fragment for matching wikitext comments.
Meant for inclusion in other regular expressions.
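As an illustration of including the fragment in other patterns, the sketch below copies the fragment value shown above into a local constant and uses it twice: once wrapped in delimiters, mirroring COMMENT_REGEXP, and once embedded in a larger expression. The constant and function names are hypothetical.

```php
<?php
// Local copy of the documented fragment value, so the sketch is standalone.
const COMMENT_FRAGMENT_SKETCH = '<!--(?>[\s\S]*?-->)';

// Remove every complete comment; an unterminated "<!--" is left alone.
function stripCommentsSketch(string $wt): string {
    return preg_replace('/' . COMMENT_FRAGMENT_SKETCH . '/', '', $wt);
}

// True if the text consists solely of whitespace and comments.
function isOnlyCommentsOrSpaceSketch(string $wt): bool {
    $re = '/^(?:\s|' . COMMENT_FRAGMENT_SKETCH . ')*$/D';
    return preg_match($re, $wt) === 1;
}
```

The fragment contains no `/`, so it can be dropped between slash delimiters without further quoting. Its atomic group `(?>...)` means that once a comment has been matched the engine will not backtrack into it, which keeps composite patterns like the second one from doing excessive work on long inputs.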