Parsoid
A bidirectional parser between wikitext and HTML5
Wikimedia\Parsoid\Utils\TokenUtils Class Reference

This class contains general utilities for (a) querying token properties and token types, and (b) manipulating tokens, individually and as collections.

Static Public Member Functions

static isWikitextBlockTag (string $name)
 
static tagOpensBlockScope (string $name)
 In the legacy parser, these block tags open block-tag scope. See doBlockLevels in the PHP parser (includes/parser/Parser.php).
 
static tagClosesBlockScope (string $name)
 In the legacy parser, these block tags close block-tag scope. See doBlockLevels in the PHP parser (includes/parser/Parser.php).
 
static isTemplateToken ( $token)
 Is this a template token?
 
static isTemplateArgToken ( $token)
 Is this a template arg token?
 
static isExtensionToken ( $token)
 Is this an extension token?
 
static isHTMLTag ( $token)
 Determine whether the current token was an HTML tag in wikitext.
 
static hasDOMFragmentType (Token $token)
 Is the token a DOMFragment type value?
 
static isTableTag ( $token)
 Is the token a table tag?
 
static isSolTransparentLinkTag ( $token)
 Determine if token is a transparent link tag.
 
static isBehaviorSwitch (Env $env, $token)
 Does this token represent a behavior switch?
 
static isSolTransparent (Env $env, $token)
 This should come close to matching WTUtils::emitsSolTransparentSingleLineWT, without the single line caveat.
 
static isAnnotationMetaToken (Token $t)
 
static isAnnotationStartToken (Token $t)
 Checks whether the provided meta tag token is an annotation start token.
 
static isAnnotationEndToken (Token $t)
 Checks whether the provided meta tag token is an annotation end token.
 
static isTranslationUnitMarker (Env $env, CommentTk $token)
 HACK: Returns true if $token looks like a TU marker (<!--T:XXX-->) and if we could be in a translate-annotated page.
 
static matchTypeOf (Token $t, string $typeRe)
 Determine whether the token matches the given typeof attribute value.
 
static hasTypeOf (Token $t, string $type)
 Determine whether the token matches the given typeof attribute value.
 
static dedupeAboutIds (Env $env, array $maybeTokens)
 
static shiftTokenTSR (array $tokens, ?int $offset, ?Source $tsrSource=null)
 Shift TSR of a token by the requested $offset value and optionally, update its TSR source.
 
static resetSource (array $tokens, Source $tsrSource)
 
static stripEOFTkFromTokens (array &$tokens)
 Strip EOFTk token from token chunk.
 
static convertOffsets (string $s, string $from, string $to, array $offsets)
 Convert string offsets.
 
static convertTokenOffsets (string $s, string $from, string $to, array $tokens)
 Convert offsets in a token array.
 
static isEntitySpanToken ( $token)
 Tests whether token represents an HTML entity.
 
static newlinesToNlTks (string $str)
 Transform "\n" and "\r\n" in the input string to NlTk tokens.
 
static tokensToString ( $tokens, bool $strict=false, array $opts=[], ?int $max=null)
 Flatten/convert a token array into a string.
 
static kvToHash (array $kvs)
 Convert an array of key-value pairs into a hash of keys to values.
 
static tokenTrim ( $tokens)
 Trim space and newlines from leading and trailing text tokens.
 
static hasTemplateToken ( $tokens)
 Detect whether an array (or any iterable container) contains a template token.
 

Public Attributes

const SOL_TRANSPARENT_LINK_REGEX
 

Detailed Description

This class contains general utilities for (a) querying token properties and token types, and (b) manipulating tokens, individually and as collections.

Member Function Documentation

◆ convertOffsets()

static Wikimedia\Parsoid\Utils\TokenUtils::convertOffsets ( string $s,
string $from,
string $to,
array $offsets )
static

Convert string offsets.

Offset types are:

  • 'byte': Bytes (UTF-8 encoding), e.g. PHP substr() or strlen().
  • 'char': Unicode code points (encoding irrelevant), e.g. PHP mb_substr() or mb_strlen().
  • 'ucs2': 16-bit code units (UTF-16 encoding), e.g. JavaScript .substring() or .length.

Offsets that are mid-Unicode character are "rounded" up to the next full character, i.e. the output offset will always point to the start of a Unicode code point (or just past the end of the string). Offsets outside the string are "rounded" to 0 or just-past-the-end.

Note
When constructing the array of offsets to pass to this method, populate it with references as $offsets[] = &$var;.
Parameters
string $s Unicode string the offsets are offsets into, UTF-8 encoded.
('byte'|'ucs2'|'char') $from Offset type to convert from.
('byte'|'ucs2'|'char') $to Offset type to convert to.
int[] $offsets References to the offsets to convert.
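
To make the three offset types and the by-reference calling convention concrete, here is a minimal sketch. It is not Parsoid's implementation: `sketchByteToCharOffsets()` is a hypothetical helper that handles only the 'byte' to 'char' direction, and it assumes the mbstring extension is available.

```php
<?php
// Sketch only: illustrates the offset units, not Parsoid's actual algorithm.

/**
 * Hypothetical helper: convert UTF-8 byte offsets to code point offsets in place.
 * Out-of-range offsets are clamped; mid-character offsets round up.
 *
 * @param string $s UTF-8 encoded string
 * @param int[] $offsets Array whose elements are references to the offsets
 */
function sketchByteToCharOffsets( string $s, array $offsets ): void {
	$len = strlen( $s );
	foreach ( $offsets as &$off ) {
		$b = max( 0, min( $len, $off ) );
		// Skip UTF-8 continuation bytes (10xxxxxx) to reach a character start.
		while ( $b < $len && ( ord( $s[$b] ) & 0xC0 ) === 0x80 ) {
			$b++;
		}
		$off = mb_strlen( substr( $s, 0, $b ), 'UTF-8' );
	}
	unset( $off );
}

// "a", U+00E9 (2 bytes), U+1D11E (4 bytes, 2 UTF-16 units), "b"
$s = "a\u{E9}\u{1D11E}b";
$bytes = strlen( $s );                 // 'byte' length: 8
$chars = mb_strlen( $s, 'UTF-8' );     // 'char' length: 4
$ucs2 = intdiv( strlen( mb_convert_encoding( $s, 'UTF-16LE', 'UTF-8' ) ), 2 ); // 'ucs2' length: 5

$start = 1;   // start of U+00E9
$mid = 2;     // middle of U+00E9, rounds up to the next character
$end = 100;   // past the end, clamps to the string length
$offsets = [];
$offsets[] = &$start;
$offsets[] = &$mid;
$offsets[] = &$end;
sketchByteToCharOffsets( $s, $offsets );
// $start, $mid, $end now hold code point offsets.
```

Because the array elements are references, the helper can update the caller's variables even though the array itself is passed by value.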

◆ convertTokenOffsets()

static Wikimedia\Parsoid\Utils\TokenUtils::convertTokenOffsets ( string $s,
string $from,
string $to,
array $tokens )
static

Convert offsets in a token array.

See also
TokenUtils::convertOffsets()
Parameters
string $s The offset reference string
('byte'|'ucs2'|'char') $from Offset type to convert from
('byte'|'ucs2'|'char') $to Offset type to convert to
array<Token|string|array> $tokens

◆ dedupeAboutIds()

static Wikimedia\Parsoid\Utils\TokenUtils::dedupeAboutIds ( Env $env,
array $maybeTokens )
static
Parameters
Env $env
array<mixed> $maybeTokens Attribute arrays in tokens may be tokens or something else.

◆ hasDOMFragmentType()

static Wikimedia\Parsoid\Utils\TokenUtils::hasDOMFragmentType ( Token $token)
static

Is the token a DOMFragment type value?

Parameters
Token $token
Returns
bool

◆ hasTemplateToken()

static Wikimedia\Parsoid\Utils\TokenUtils::hasTemplateToken ( $tokens)
static

Detect whether an array (or any iterable container) contains a template token.

Parameters
null|array<string|Token> $tokens
Returns
bool

◆ hasTypeOf()

static Wikimedia\Parsoid\Utils\TokenUtils::hasTypeOf ( Token $t,
string $type )
static

Determine whether the token matches the given typeof attribute value.

Parameters
Token $t
string $type Expected value of "typeof" attribute, as a literal string.
Returns
bool True if the token matches.

◆ isAnnotationEndToken()

static Wikimedia\Parsoid\Utils\TokenUtils::isAnnotationEndToken ( Token $t)
static

Checks whether the provided meta tag token is an annotation end token.

Parameters
Token $t
Returns
bool

◆ isAnnotationMetaToken()

static Wikimedia\Parsoid\Utils\TokenUtils::isAnnotationMetaToken ( Token $t)
static
Parameters
Token $t
Returns
bool

◆ isAnnotationStartToken()

static Wikimedia\Parsoid\Utils\TokenUtils::isAnnotationStartToken ( Token $t)
static

Checks whether the provided meta tag token is an annotation start token.

Parameters
Token $t
Returns
bool

◆ isBehaviorSwitch()

static Wikimedia\Parsoid\Utils\TokenUtils::isBehaviorSwitch ( Env $env,
$token )
static

Does this token represent a behavior switch?

Parameters
Env $env
Token | string $token
Returns
bool

◆ isEntitySpanToken()

static Wikimedia\Parsoid\Utils\TokenUtils::isEntitySpanToken ( $token)
static

Tests whether token represents an HTML entity.

Think <span typeof="mw:Entity">.

Parameters
Token | string | null $token
Returns
bool

◆ isExtensionToken()

static Wikimedia\Parsoid\Utils\TokenUtils::isExtensionToken ( $token)
static

Is this an extension token?

Parameters
Token | string | null $token
Returns
bool

◆ isHTMLTag()

static Wikimedia\Parsoid\Utils\TokenUtils::isHTMLTag ( $token)
static

Determine whether the current token was an HTML tag in wikitext.

Parameters
Token | string | null $token
Returns
bool

◆ isSolTransparent()

static Wikimedia\Parsoid\Utils\TokenUtils::isSolTransparent ( Env $env,
$token )
static

This should come close to matching WTUtils::emitsSolTransparentSingleLineWT, without the single line caveat.

Parameters
Env $env
Token | string $token
Returns
bool

◆ isSolTransparentLinkTag()

static Wikimedia\Parsoid\Utils\TokenUtils::isSolTransparentLinkTag ( $token)
static

Determine if token is a transparent link tag.

Parameters
Token | string $token
Returns
bool

◆ isTableTag()

static Wikimedia\Parsoid\Utils\TokenUtils::isTableTag ( $token)
static

Is the token a table tag?

Parameters
Token | string $token
Returns
bool

◆ isTemplateArgToken()

static Wikimedia\Parsoid\Utils\TokenUtils::isTemplateArgToken ( $token)
static

Is this a template arg token?

Parameters
Token | string | null $token
Returns
bool

◆ isTemplateToken()

static Wikimedia\Parsoid\Utils\TokenUtils::isTemplateToken ( $token)
static

Is this a template token?

Parameters
Token | string | null $token
Returns
bool

◆ isTranslationUnitMarker()

static Wikimedia\Parsoid\Utils\TokenUtils::isTranslationUnitMarker ( Env $env,
CommentTk $token )
static

HACK: Returns true if $token looks like a TU marker (<!--T:XXX-->) and if we could be in a translate-annotated page.

Parameters
Env $env
CommentTk $token
Returns
bool

◆ isWikitextBlockTag()

static Wikimedia\Parsoid\Utils\TokenUtils::isWikitextBlockTag ( string $name)
static
Parameters
string $name
Returns
bool

◆ kvToHash()

static Wikimedia\Parsoid\Utils\TokenUtils::kvToHash ( array $kvs)
static

Convert an array of key-value pairs into a hash of keys to values.

For duplicate keys, the last entry wins.

Note
Numeric key values will be converted by PHP from string to int when they are used as array keys.
Parameters
array<KV> $kvs
Returns
array<string|int,array<Token|string>>|array<string|int,string>
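
The two documented behaviors (last entry wins, numeric string keys become ints) can be seen in a minimal sketch. `sketchKvToHash()` is a hypothetical stand-in that takes plain `[key, value]` pairs instead of Parsoid's KV objects, and it leaves out any key or value normalization the real method may perform.

```php
<?php
// Sketch only: uses [key, value] pairs in place of Parsoid's KV objects.

/**
 * Hypothetical stand-in for the pair-to-hash conversion.
 *
 * @param array<array{0:string,1:string}> $pairs
 * @return array<string|int,string>
 */
function sketchKvToHash( array $pairs ): array {
	$hash = [];
	foreach ( $pairs as [ $k, $v ] ) {
		// A later duplicate key overwrites the earlier value.
		$hash[$k] = $v;
	}
	return $hash;
}

$hash = sketchKvToHash( [
	[ 'class', 'a' ],
	[ '1', 'first' ],
	[ 'class', 'b' ],
] );
// 'class' now maps to 'b', and the key '1' is stored as the int 1.
```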

◆ matchTypeOf()

static Wikimedia\Parsoid\Utils\TokenUtils::matchTypeOf ( Token $t,
string $typeRe )
static

Determine whether the token matches the given typeof attribute value.

Parameters
Token $t The token to test
string $typeRe Regular expression matching the expected value of the typeof attribute.
Returns
?string The matching typeof value, or null if there is no match.
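
The difference between matchTypeOf() (regular expression, returns the matching value) and hasTypeOf() (literal string, returns bool) can be sketched as follows. Both functions are hypothetical and work on a raw attribute string rather than a Token; the sketch assumes the typeof attribute holds whitespace-separated values.

```php
<?php
// Sketch only: operates on a raw typeof string rather than a Parsoid Token.

/** Hypothetical: return the first typeof value matching $typeRe, or null. */
function sketchMatchTypeOf( ?string $typeof, string $typeRe ): ?string {
	if ( $typeof === null ) {
		return null;
	}
	foreach ( preg_split( '/\s+/', $typeof, -1, PREG_SPLIT_NO_EMPTY ) as $type ) {
		if ( preg_match( $typeRe, $type ) === 1 ) {
			return $type;
		}
	}
	return null;
}

/** Hypothetical: literal comparison, built on the regex variant. */
function sketchHasTypeOf( ?string $typeof, string $type ): bool {
	$re = '/^' . preg_quote( $type, '/' ) . '$/D';
	return sketchMatchTypeOf( $typeof, $re ) !== null;
}

$typeof = 'mw:Transclusion mw:Entity';
$matched = sketchMatchTypeOf( $typeof, '#^mw:Ent#' );
$hasEntity = sketchHasTypeOf( $typeof, 'mw:Entity' );
$hasPartial = sketchHasTypeOf( 'mw:EntityX', 'mw:Entity' );
```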

◆ newlinesToNlTks()

static Wikimedia\Parsoid\Utils\TokenUtils::newlinesToNlTks ( string $str)
static

Transform "\n" and "\r\n" in the input string to NlTk tokens.

Parameters
string $str
Returns
non-empty-list<NlTk|string> (interspersed string and NlTk tokens)
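
A minimal sketch of the interspersed return shape follows. `SketchNlTk` is a hypothetical stand-in for NlTk, and keeping empty strings between consecutive newlines is an assumption of this sketch rather than documented behavior.

```php
<?php
// Sketch only: SketchNlTk is a hypothetical stand-in for Parsoid's NlTk.
final class SketchNlTk {
}

/**
 * Hypothetical: split on "\n" or "\r\n" and put a newline token between pieces.
 *
 * @return array<SketchNlTk|string> Never empty; an empty input yields [ '' ]
 */
function sketchNewlinesToNlTks( string $str ): array {
	$parts = preg_split( '/\r?\n/', $str );
	$out = [ $parts[0] ];
	for ( $i = 1, $n = count( $parts ); $i < $n; $i++ ) {
		$out[] = new SketchNlTk();
		$out[] = $parts[$i];
	}
	return $out;
}

$out = sketchNewlinesToNlTks( "a\nb\r\nc" );
// $out is: 'a', newline token, 'b', newline token, 'c'
```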

◆ shiftTokenTSR()

static Wikimedia\Parsoid\Utils\TokenUtils::shiftTokenTSR ( array $tokens,
?int $offset,
?Source $tsrSource = null )
static

Shift TSR of a token by the requested $offset value and optionally, update its TSR source.

At a basic level, "f(wt) = tokens" should be memoizable within the parser pipeline (since the config, env, etc. are fixed for the request) no matter where "wt" originated from (top-level or templates). But, embedded state like tsr offsets, and additional nested state like source ranges interfere with that memoizability. This method attempts to migrate over such embedded state reliably.

NOTE about $offset

A null value of $offset resets TSR on all tokens since we cannot compute a reliable new value of $tsr and the old value of $tsr should not be used either.

NOTE about $tsrSource param

In memoization scenarios where tokens are reused across source frames, we also need to reset the source objects to the target frame. Doing so effectively marks all SourceRange objects as belonging to the target frame. Note that the SourceRange design allows more fine-grained tracking across nested templates. Parsoid doesn't support that yet, so the logic below is correct. But in a fine-grained tracking scenario, we'll need to either null offsets, disable cross-frame memoization, or do more complicated state migration.
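
The $offset rule can be sketched in a few lines. `SketchTok` and `sketchShiftTsr()` are hypothetical; the sketch covers only the shift and the null reset, not nested token state or the $tsrSource update.

```php
<?php
// Sketch only: SketchTok is a hypothetical token holding a [start, end] range.
final class SketchTok {
	/** @var ?array{0:int,1:int} Start and end source offsets, or null if unknown */
	public ?array $tsr;

	public function __construct( ?array $tsr ) {
		$this->tsr = $tsr;
	}
}

/**
 * Hypothetical: shift every known range by $offset, or reset it when $offset is null.
 *
 * @param array<SketchTok|string> $tokens
 */
function sketchShiftTsr( array $tokens, ?int $offset ): void {
	foreach ( $tokens as $t ) {
		if ( !( $t instanceof SketchTok ) || $t->tsr === null ) {
			continue;
		}
		// A null offset means no reliable new value exists, so drop the range.
		$t->tsr = $offset === null
			? null
			: [ $t->tsr[0] + $offset, $t->tsr[1] + $offset ];
	}
}

$shifted = new SketchTok( [ 2, 5 ] );
$reset = new SketchTok( [ 2, 5 ] );
sketchShiftTsr( [ 'text', $shifted ], 10 );
sketchShiftTsr( [ $reset ], null );
```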

◆ stripEOFTkFromTokens()

static Wikimedia\Parsoid\Utils\TokenUtils::stripEOFTkFromTokens ( array & $tokens)
static

Strip EOFTk token from token chunk.

The EOFTk is expected to be the last token of the chunk.

Parameters
array & $tokens
Returns
array The modified token array, so that this call can be chained.
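
The by-reference parameter combined with a return value can be sketched as below. `SketchEOFTk` and `sketchStripEof()` are hypothetical, and the sketch assumes the chunk is a list with consecutive integer keys.

```php
<?php
// Sketch only: SketchEOFTk is a hypothetical stand-in for Parsoid's EOFTk.
final class SketchEOFTk {
}

/**
 * Hypothetical: remove a trailing end-of-file token in place and return the array.
 *
 * @param array<object|string> &$tokens
 * @return array<object|string>
 */
function sketchStripEof( array &$tokens ): array {
	$n = count( $tokens );
	if ( $n > 0 && $tokens[$n - 1] instanceof SketchEOFTk ) {
		array_pop( $tokens );
	}
	return $tokens;
}

$chunk = [ 'a', 'b', new SketchEOFTk() ];
$returned = sketchStripEof( $chunk );
// Both $chunk and $returned now contain only 'a' and 'b'.
```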

◆ tagClosesBlockScope()

static Wikimedia\Parsoid\Utils\TokenUtils::tagClosesBlockScope ( string $name)
static

In the legacy parser, these block tags close block-tag scope. See doBlockLevels in the PHP parser (includes/parser/Parser.php).

Parameters
string $name
Returns
bool

◆ tagOpensBlockScope()

static Wikimedia\Parsoid\Utils\TokenUtils::tagOpensBlockScope ( string $name)
static

In the legacy parser, these block tags open block-tag scope. See doBlockLevels in the PHP parser (includes/parser/Parser.php).

Parameters
string $name
Returns
bool

◆ tokensToString()

static Wikimedia\Parsoid\Utils\TokenUtils::tokensToString ( $tokens,
bool $strict = false,
array $opts = [],
?int $max = null )
static

Flatten/convert a token array into a string.

Parameters
string|Token|array<Token|string> $tokens
bool $strict Whether to abort as soon as we find a token we can't stringify.
array<string,bool> $opts
?int $max Maximum tokens to process
Returns
string|list{string,array<Token|string>} The stringified tokens. If $strict is true, returns a two-element array containing the string prefix and the remainder of the tokens as soon as we encounter something we can't stringify.
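
The two return shapes can be illustrated with a simplified sketch. `sketchTokensToString()` and `SketchTagTk` are hypothetical; treating every non-string token as unstringifiable and skipping it in non-strict mode is a simplification, and $opts and $max are not modeled.

```php
<?php
// Sketch only: SketchTagTk is a hypothetical token with no string form.
final class SketchTagTk {
}

/**
 * Hypothetical: concatenate string tokens from a list.
 *
 * @param array<string|object> $tokens List of tokens
 * @return string|array{0:string,1:array<string|object>}
 */
function sketchTokensToString( array $tokens, bool $strict = false ) {
	$out = '';
	foreach ( $tokens as $i => $t ) {
		if ( is_string( $t ) ) {
			$out .= $t;
		} elseif ( $strict ) {
			// Stop at the first token that cannot be stringified and
			// hand back the prefix plus the unprocessed remainder.
			return [ $out, array_slice( $tokens, $i ) ];
		}
		// Non-strict: tokens without a string form are skipped.
	}
	return $out;
}

$tag = new SketchTagTk();
$loose = sketchTokensToString( [ 'a', $tag, 'b' ] );
$strictResult = sketchTokensToString( [ 'a', $tag, 'b' ], true );
```

Callers of the strict form need to check whether they received a string or a two-element array before using the result.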

◆ tokenTrim()

static Wikimedia\Parsoid\Utils\TokenUtils::tokenTrim ( $tokens)
static

Trim space and newlines from leading and trailing text tokens.

Parameters
string|Token|(Token|string)[] $tokens
Returns
string|Token|(Token|string)[]
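
A minimal sketch of trimming at both ends of a token list follows. `sketchTokenTrim()` is hypothetical and handles only the array case; removing emptied text tokens (rather than leaving empty strings) and the exact set of trimmed characters are choices of this sketch.

```php
<?php
// Sketch only: trims whitespace from text at the ends of a token list.

/**
 * Hypothetical: trim leading and trailing text tokens, stopping at the
 * first non-string token or non-blank text on each side.
 *
 * @param array<string|object> $tokens List of tokens
 * @return array<string|object>
 */
function sketchTokenTrim( array $tokens ): array {
	// Leading side.
	while ( $tokens !== [] && is_string( $tokens[0] ) ) {
		$tokens[0] = ltrim( $tokens[0], " \t\r\n" );
		if ( $tokens[0] !== '' ) {
			break;
		}
		array_shift( $tokens );
	}
	// Trailing side: same idea, working from the end.
	while ( $tokens !== [] && is_string( $tokens[count( $tokens ) - 1] ) ) {
		$last = count( $tokens ) - 1;
		$tokens[$last] = rtrim( $tokens[$last], " \t\r\n" );
		if ( $tokens[$last] !== '' ) {
			break;
		}
		array_pop( $tokens );
	}
	return $tokens;
}

$tagTk = new stdClass();
$trimmed = sketchTokenTrim( [ '  ', ' a ', $tagTk, "b\n", "\n" ] );
// Interior whitespace (the space after "a") is left alone.
```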

Member Data Documentation

◆ SOL_TRANSPARENT_LINK_REGEX

const Wikimedia\Parsoid\Utils\TokenUtils::SOL_TRANSPARENT_LINK_REGEX
Initial value:
= '/(?:^|\s)mw:PageProp\/(?:Category|redirect|Language)(?=$|\s)/D'
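
To show what the pattern accepts, the following sketch runs it against a few sample values. The pattern is copied from the initial value above; the sample strings are made up for illustration, and which token attribute Parsoid tests with it is not stated here.

```php
<?php
// The pattern is copied verbatim from the constant's initial value.
$re = '/(?:^|\s)mw:PageProp\/(?:Category|redirect|Language)(?=$|\s)/D';

// Hypothetical sample values mapped to whether a match is expected.
$samples = [
	'mw:PageProp/Category' => true,
	'mw:Extra mw:PageProp/redirect' => true,
	'mw:PageProp/Categoryx' => false,
	'mw:PageProp/toc' => false,
];

$results = [];
foreach ( $samples as $value => $expected ) {
	// The entry must be delimited by whitespace or the string boundaries.
	$results[$value] = preg_match( $re, $value ) === 1;
	echo $value, ' => ', $results[$value] ? 'match' : 'no match', "\n";
}
```

The lookahead `(?=$|\s)` together with the `D` modifier keeps a longer name such as `Categoryx` from matching.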

The documentation for this class was generated from the following file: