HTML 5 tokenizer.

Public Member Functions
	__construct (TokenHandler $listener, $text, $options=[])
	Constructor.

	setEnableCdataCallback ( $cb)

	execute ( $options=[])
	Run the tokenizer on the whole input stream.

string	getPreprocessedText ()
	Get the preprocessed input text.

	switchState ( $state, $appropriateEndTag)
	Change the state of the tokenizer during parsing.

	setFragmentContext ( $namespace, $tagName)
	Initialize the tokenizer for fragment parsing.

	beginStepping ()
	Notify the tokenizer that the document will be tokenized by repeated step() calls.

bool	step ()
	Tokenize a minimum amount of text from the input stream, and emit the resulting events.

Public Member Functions inherited from Wikimedia\RemexHtml\PropGuard
	__set ( $name, $value)

Public Attributes
const	STATE_START = 1

const	STATE_DATA = 2

const	STATE_RCDATA = 3

const	STATE_RAWTEXT = 4

const	STATE_SCRIPT_DATA = 5

const	STATE_PLAINTEXT = 6

const	STATE_EOF = 7

const	STATE_CURRENT = 8

const	CHARREF_REGEX

Protected Member Functions
	preprocess ()
	Preprocess the input text, if it hasn't been done already.

bool	executeInternal ( $loop)
	The main state machine, the common implementation of step() and execute().

	interpretCommentMatches ( $m)
	Interpret the data state match results for a detected comment, and emit events as appropriate.

	interpretDoctypeMatches ( $m)
	Interpret the data state match results for a detected DOCTYPE token, and emit events as appropriate.

string|null	interpretDoctypeQuoted ( $m, $dq, $sq, &$quirks)
	DOCTYPE helper which interprets a quoted string (or lack thereof)

string	handleNulls ( $text, $sourcePos)
	Generic helper for all those points in the spec where U+0000 needs to be replaced with U+FFFD with a parse error issued.

	handleAsciiErrors ( $mask, $text, $offset, $length, $sourcePos)
	Generic helper for points in the spec which say that an error should be issued when certain ASCII characters are seen, with no other action taken.

string	handleCharRefs ( $text, $sourcePos, $inAttr=false, $additionalAllowedChar='')
	Expand character references in some text, and emit errors as appropriate.

	emitDataRange ( $pos, $length, $isSimple=false, $hasSimpleRefs=false)
	Emit a range of the input text as a character token, and emit related errors, with validity rules as per the data state.

	emitCdataRange ( $innerPos, $innerLength, $outerPos, $outerLength)
	Emit a range of characters from the input text, with validity rules as per the CDATA section state.

	emitRawTextRange ( $ignoreCharRefs, $pos, $length)
	Emit a range of characters from the input text, either from RCDATA, RAWTEXT, script data or PLAINTEXT.

int	textElementState ( $ignoreCharRefs)
	The entry point for the RCDATA and RAWTEXT states.

Attributes	consumeAttribs ()
	Advance $this->pos, consuming all tag attributes found at the current position.

array	interpretAttribMatches ( $matches)
	Interpret the results of the attribute preg_match_all().

int	handleAttribsAndClose ( $state, $tagName, $isEndTag, $startPos)
	Consume attributes, and the closing bracket which follows attributes.

int	plaintextState ()
	Process input text in the PLAINTEXT state.

int	scriptDataState ()
	Process input text in the script data state.

	error ( $text, $pos=null)
	Emit a parse error event.

never	fatal ( $text)
	Throw an exception for a specified reason.

never	throwPregError ()
	Interpret preg_last_error() and throw a suitable exception.

Protected Attributes
const	REPLACEMENT_CHAR = "\xef\xbf\xbd"

const	BYTE_ORDER_MARK = "\xef\xbb\xbf"

	$ignoreErrors

	$ignoreCharRefs

	$ignoreNulls

	$skipPreprocess

	$scriptingFlag

	$appropriateEndTag

	$listener

	$state

	$preprocessed

	$text

	$pos

	$length

	$enableCdataCallback

	$fragmentNamespace

	$fragmentName

Additional Inherited Members
Static Public Attributes inherited from Wikimedia\RemexHtml\PropGuard
static	$armed = true

Detailed Description

HTML 5 tokenizer.

Based on the W3C recommendation as published 01 November 2016: https://www.w3.org/TR/2016/REC-html51-20161101/

Constructor & Destructor Documentation

◆ __construct()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::__construct	(	TokenHandler	$listener,
			$text,
			$options = [] )

Constructor.

Parameters

TokenHandler	$listener	The object which receives token events
string	$text	The text to tokenize
array	$options	Associative array of options, including:
ignoreErrors : True to improve performance by ignoring errors. The token stream should still be the same, except that error() won't be called.
ignoreCharRefs : True to ignore character references. Character tokens will contain the unexpanded character references, and no errors related to invalid character references will be raised. Performance will be improved. This is not compliant behaviour.
ignoreNulls : True to ignore NULL bytes in the input stream, instead of raising errors and converting them to U+FFFD as is usually required by the spec.
skipPreprocess : True to skip the "preprocessing the input stream" stage, which normalizes line endings and raises errors on certain control characters. Advisable if the input stream is already appropriately normalized.
scriptingFlag : True if the scripting flag is enabled. Default true. Setting this to false causes the contents of <noscript> elements to be processed as normal content. The scriptingFlag option in the TreeBuilder should be set to the same value.

Member Function Documentation

◆ beginStepping()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::beginStepping ( )

Notify the tokenizer that the document will be tokenized by repeated step() calls.

This must be called once only, before the first call to step().

◆ consumeAttribs()

Attributes Wikimedia\RemexHtml\Tokenizer\Tokenizer::consumeAttribs ( )

protected

Advance $this->pos, consuming all tag attributes found at the current position.

The new position will be at the end of the tag or at the end of the input string.

To improve performance of consumers which don't need to read the attribute array, interpretation of the PCRE match results is deferred.

Todo
: Make deferral configurable.

Returns: Attributes

◆ emitCdataRange()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::emitCdataRange	(	$innerPos,
		$innerLength,
		$outerPos,
		$outerLength )

protected

Emit a range of characters from the input text, with validity rules as per the CDATA section state.

Parameters

int	$innerPos	The position after the <![CDATA[
int	$innerLength	The length of the string not including the terminating ]]>
int	$outerPos	The position of the start of the <![CDATA[
int	$outerLength	The length of the whole input region being emitted
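As a rough illustration of how the four values relate, the sketch below locates a CDATA section in a string and derives them. It assumes the inner range excludes the `<![CDATA[` opener and the `]]>` terminator, while the outer range covers both. `cdataRanges()` is a hypothetical helper written for this example and is not part of the library.

```php
<?php
// Hypothetical helper (not part of RemexHtml): derive the four range values
// for a CDATA section whose opener starts at $outerPos in $text.
function cdataRanges(string $text, int $outerPos): ?array {
    $open = '<![CDATA[';
    $close = ']]>';
    if (substr($text, $outerPos, strlen($open)) !== $open) {
        return null; // no CDATA opener at this position
    }
    $innerPos = $outerPos + strlen($open);
    $closePos = strpos($text, $close, $innerPos);
    if ($closePos === false) {
        return null; // unterminated sections are out of scope for this sketch
    }
    return [
        'innerPos' => $innerPos,
        'innerLength' => $closePos - $innerPos,
        'outerPos' => $outerPos,
        'outerLength' => $closePos + strlen($close) - $outerPos,
    ];
}
```

For the input `x<![CDATA[abc]]>y` with the opener at offset 1, the inner range is the three bytes `abc` and the outer range spans the whole section including both delimiters.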

◆ emitDataRange()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::emitDataRange	(	$pos,
		$length,
		$isSimple = false,
		$hasSimpleRefs = false )

protected

Emit a range of the input text as a character token, and emit related errors, with validity rules as per the data state.

Parameters

int	$pos	Offset within the input text
int	$length	The length of the range
bool	$isSimple	True if you know that the data range does not contain <, \0 or &; false is safe if you're not sure
bool	$hasSimpleRefs	True if you know that any character references are semicolon terminated and in the list of $commonEntities; false is safe if you're not sure

◆ emitRawTextRange()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::emitRawTextRange	(	$ignoreCharRefs,
		$pos,
		$length )

protected

Emit a range of characters from the input text, either from RCDATA, RAWTEXT, script data or PLAINTEXT.

The only difference between these states is whether or not character references are expanded, so we take that as a parameter.

Parameters

bool	$ignoreCharRefs	True to emit the range without expanding character references
int	$pos	The input position
int	$length	The length of the range to be emitted

◆ error()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::error	(		$text,
			$pos = null )

protected

Emit a parse error event.

Parameters

string	$text	The error message
int|null	$pos	The error position, or null to use the current position

◆ execute()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::execute ( $options = [] )

Run the tokenizer on the whole input stream.

This is the normal entry point.

Parameters

array	$options	An associative array of options:

state : One of the STATE_* constants, a state in which to start.
appropriateEndTag : The "appropriate end tag", which needs to be set if entering one of the raw text states.
fragmentNamespace : The fragment namespace
fragmentName : The fragment tag name

◆ executeInternal()

bool Wikimedia\RemexHtml\Tokenizer\Tokenizer::executeInternal ( $loop )

protected

The main state machine, the common implementation of step() and execute().

Parameters

bool $loop Set to true to loop until finished, false to step once.

Returns: bool True if the input continues, false on EOF

◆ fatal()

never Wikimedia\RemexHtml\Tokenizer\Tokenizer::fatal ( $text )

protected

Throw an exception for a specified reason.

This is used for API errors and assertion-like checks.

Parameters

string $text The error message

Exceptions

TokenizerError

Returns: never

◆ getPreprocessedText()

string Wikimedia\RemexHtml\Tokenizer\Tokenizer::getPreprocessedText ( )

Get the preprocessed input text.

Source offsets in event parameters are relative to this string. If skipPreprocess was specified, this will be the same as the input string.

Returns: string
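The reason offsets refer to the preprocessed string can be seen in a small sketch. It assumes only the line-ending part of preprocessing (CRLF and lone CR both become LF) and uses a hypothetical `normalizeLineEndings()` stand-in, not the library's own preprocess() implementation.

```php
<?php
// Simplified stand-in for the line-ending part of preprocessing:
// CRLF and lone CR both become LF. Not the library's implementation.
function normalizeLineEndings(string $text): string {
    // The CRLF pair is replaced first so that its CR is not converted twice.
    return str_replace(["\r\n", "\r"], "\n", $text);
}

$raw = "a\r\nb\rc";
$preprocessed = normalizeLineEndings($raw);

// The same character sits at different byte offsets in the two strings,
// so an offset from a token event only makes sense against $preprocessed.
$rawOffset = strpos($raw, 'c');
$preprocessedOffset = strpos($preprocessed, 'c');
```

A consumer that wants to slice the source text using event offsets would therefore slice the string returned by getPreprocessedText() rather than the original input.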

◆ handleAsciiErrors()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::handleAsciiErrors	(	$mask,
		$text,
		$offset,
		$length,
		$sourcePos )

protected

Generic helper for points in the spec which say that an error should be issued when certain ASCII characters are seen, with no other action taken.

Parameters

string	$mask	Mask for strcspn
string	$text	The input text
int	$offset	The start of the range within $text to search
int	$length	The length of the range within $text to search
int	$sourcePos	The offset within the input text corresponding to $text, for error position reporting.
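The scan described here can be sketched with strcspn(), which returns the length of the leading run that contains none of the mask characters. The function below is an illustrative stand-in, not the library's code: it collects the reported positions in an array instead of emitting error events, and its name `findMaskedChars()` is hypothetical.

```php
<?php
// Illustrative stand-in: return the source positions of every byte in
// $text[$offset .. $offset + $length) that appears in $mask.
// Assumes the range lies within $text.
function findMaskedChars(string $mask, string $text, int $offset, int $length, int $sourcePos): array {
    $positions = [];
    $end = $offset + $length;
    while ($offset < $end) {
        // Skip the run of bytes that are not in the mask.
        $offset += strcspn($text, $mask, $offset, $end - $offset);
        if ($offset < $end) {
            // A masked byte was found: report it relative to the source text.
            $positions[] = $sourcePos + $offset;
            $offset++;
        }
    }
    return $positions;
}
```

Only positions are reported and the text itself is left unchanged, matching the "no other action taken" wording above.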

◆ handleAttribsAndClose()

int Wikimedia\RemexHtml\Tokenizer\Tokenizer::handleAttribsAndClose	(	$state,
		$tagName,
		$isEndTag,
		$startPos )

protected

Consume attributes, and the closing bracket which follows attributes.

Emit the appropriate tag event, or in the case of broken attributes in text states, emit characters.

Parameters

int	$state	The current state
string	$tagName	The normalized tag name
bool	$isEndTag	True if this is an end tag, false if it is a start tag
int	$startPos	The input position of the start of the current tag.

Returns: int The next state

◆ handleCharRefs()

string Wikimedia\RemexHtml\Tokenizer\Tokenizer::handleCharRefs	(	$text,
		$sourcePos,
		$inAttr = false,
		$additionalAllowedChar = '' )

protected

Expand character references in some text, and emit errors as appropriate.

Parameters

string	$text	The text to expand
int	$sourcePos	The input position of $text
bool	$inAttr	True if the text is within an attribute value
string	$additionalAllowedChar	An unused string which the spec inexplicably spends a lot of space telling you how to derive. It suppresses errors in a place where no errors are emitted anyway.

Returns: string The expanded text
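To give a feel for what expansion involves, here is a deliberately reduced sketch that handles only semicolon-terminated numeric references. Named entities, unterminated references, attribute-specific rules, the legacy remapping of some control-range code points, and error reporting are all left out. The function names are hypothetical and the pattern is a cut-down variant written for this example, not the CHARREF_REGEX constant.

```php
<?php
// Encode one Unicode code point as UTF-8 without needing the mbstring extension.
function codePointToUtf8(int $cp): string {
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}

// Reduced sketch: expand &#NNN; and &#xHHH; only.
function expandNumericRefs(string $text): string {
    $replacement = "\xef\xbf\xbd"; // U+FFFD in UTF-8
    $result = preg_replace_callback(
        '~&\#(?:0*(\d+)|[xX]0*([0-9A-Fa-f]+));~',
        static function (array $m) use ($replacement): string {
            $isHex = isset($m[2]) && $m[2] !== '';
            $digits = $isHex ? $m[2] : $m[1];
            if (strlen($digits) > 7) {
                return $replacement; // far beyond U+10FFFF, avoid overflow
            }
            $cp = $isHex ? hexdec($digits) : (int)$digits;
            // Zero, surrogates and out-of-range values become U+FFFD.
            $invalid = $cp === 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF);
            return $invalid ? $replacement : codePointToUtf8($cp);
        },
        $text
    );
    return $result ?? $text; // fall back to the input if PCRE reports an error
}
```

Text without a terminating semicolon, such as `&#65`, passes through this sketch untouched, which is one of the places where a compliant implementation does more work.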

◆ handleNulls()

string Wikimedia\RemexHtml\Tokenizer\Tokenizer::handleNulls	(		$text,
			$sourcePos )

protected

Generic helper for all those points in the spec where U+0000 needs to be replaced with U+FFFD with a parse error issued.

Parameters

string	$text	The text to be converted
int	$sourcePos	The input byte offset from which $text was extracted, for error position reporting.

Returns: string The converted text
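The replacement itself is simple enough to sketch in a few lines. The version below is an illustration under stated assumptions: it records error positions in an array passed by reference rather than emitting error events, and `replaceNulls()` is a hypothetical name, not a library function.

```php
<?php
// Illustrative stand-in: replace every U+0000 with U+FFFD and record the
// source position of each one.
function replaceNulls(string $text, int $sourcePos, array &$errorPositions): string {
    $offset = 0;
    while (($found = strpos($text, "\0", $offset)) !== false) {
        // Positions are reported relative to the source, not to $text.
        $errorPositions[] = $sourcePos + $found;
        $offset = $found + 1;
    }
    // "\xef\xbf\xbd" is the UTF-8 encoding of U+FFFD.
    return str_replace("\0", "\xef\xbf\xbd", $text);
}
```

Note that the converted text is longer than the input, since one byte is replaced by three, which is why error positions are computed against the original offsets before the replacement.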

◆ interpretAttribMatches()

array Wikimedia\RemexHtml\Tokenizer\Tokenizer::interpretAttribMatches ( $matches )

protected

Interpret the results of the attribute preg_match_all().

Emit errors as appropriate and return an associative array.

Parameters

array $matches

Returns: array

◆ interpretCommentMatches()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::interpretCommentMatches ( $m )

protected

Interpret the data state match results for a detected comment, and emit events as appropriate.

Parameters

array $m The match array

◆ interpretDoctypeMatches()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::interpretDoctypeMatches ( $m )

protected

Interpret the data state match results for a detected DOCTYPE token, and emit events as appropriate.

Parameters

array $m The match array

◆ interpretDoctypeQuoted()

string|null Wikimedia\RemexHtml\Tokenizer\Tokenizer::interpretDoctypeQuoted	(		$m,
			$dq,
			$sq,
		&	$quirks )

protected

DOCTYPE helper which interprets a quoted string (or lack thereof)

Parameters

array	$m
int	$dq
int	$sq
bool	&$quirks

Returns: string|null The quoted value, with nulls replaced.

◆ plaintextState()

int Wikimedia\RemexHtml\Tokenizer\Tokenizer::plaintextState ( )

protected

Process input text in the PLAINTEXT state.

Returns: int The next state index

◆ scriptDataState()

int Wikimedia\RemexHtml\Tokenizer\Tokenizer::scriptDataState ( )

protected

Process input text in the script data state.

Returns: int The next state index

◆ setFragmentContext()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::setFragmentContext	(		$namespace,
			$tagName )

Initialize the tokenizer for fragment parsing.

Parameters

string	$namespace	The namespace of the context element
string	$tagName	The name of the context element

◆ step()

bool Wikimedia\RemexHtml\Tokenizer\Tokenizer::step ( )

Tokenize a minimum amount of text from the input stream, and emit the resulting events.

Returns: bool True if the input continues and step() should be called again, false on EOF
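The calling pattern implied by beginStepping() and step() can be sketched as a driver loop. To keep the example runnable without the library, it uses a hypothetical `StandInStepper` class that only mimics the two documented method names and the documented return value; it performs no real tokenization.

```php
<?php
// Hypothetical stand-in exposing the same two method names as the tokenizer.
// Each step() "emits" one pre-split chunk instead of tokenizing anything.
final class StandInStepper {
    private array $chunks;
    private int $index = 0;
    private bool $stepping = false;
    public array $emitted = [];

    public function __construct(array $chunks) {
        $this->chunks = $chunks;
    }

    public function beginStepping(): void {
        if ($this->stepping) {
            throw new LogicException('beginStepping() must be called once only');
        }
        $this->stepping = true;
    }

    public function step(): bool {
        if (!$this->stepping) {
            throw new LogicException('call beginStepping() before step()');
        }
        if ($this->index < count($this->chunks)) {
            $this->emitted[] = $this->chunks[$this->index++];
        }
        // True while input remains, false once the end is reached.
        return $this->index < count($this->chunks);
    }
}

// Driver: announce stepping once, then call step() until it returns false.
function runToEnd(object $tokenizer): int {
    $tokenizer->beginStepping();
    $calls = 0;
    do {
        $calls++;
        $more = $tokenizer->step();
    } while ($more);
    return $calls;
}
```

A caller would typically do its own work between step() calls, for example inspecting or switching state, which is the main reason to prefer stepping over execute().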

◆ switchState()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::switchState	(		$state,
			$appropriateEndTag )

Change the state of the tokenizer during parsing.

This is for use by the tree builder to switch the tokenizer into one of the raw text states.

Parameters

int	$state	One of the STATE_* constants
string	$appropriateEndTag	The appropriate end tag

◆ textElementState()

int Wikimedia\RemexHtml\Tokenizer\Tokenizer::textElementState ( $ignoreCharRefs )

protected

The entry point for the RCDATA and RAWTEXT states.

Parameters

bool $ignoreCharRefs True to ignore character references regardless of configuration, false to respect the configuration.

Returns: int The next state index

◆ throwPregError()

never Wikimedia\RemexHtml\Tokenizer\Tokenizer::throwPregError ( )

protected

Interpret preg_last_error() and throw a suitable exception.

This is called when preg_match() or similar returns false.

Notes for users:

PCRE internal error: may be due to JIT stack space exhaustion (prior to PHP 7) caused by excessive recursion. Increase the stack space.
pcre.backtrack_limit exhausted: The backtrack limit should be at least double the input size; the defaults are far too small. Increase it in the PHP configuration.

Returns: never
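A minimal sketch of this kind of check, using only standard PHP functions, is shown below. The helper names, the message wording and the use of RuntimeException are assumptions made for this example; the library throws its own exception type. `raiseBacktrackLimit()` illustrates the "at least double the input size" advice from the notes above.

```php
<?php
// Map a preg_last_error() code to a readable message. The wording is this
// sketch's own, not the library's.
function describePregError(int $code): string {
    switch ($code) {
        case PREG_INTERNAL_ERROR:
            return 'PCRE internal error';
        case PREG_BACKTRACK_LIMIT_ERROR:
            return 'pcre.backtrack_limit exhausted';
        case PREG_RECURSION_LIMIT_ERROR:
            return 'pcre.recursion_limit exhausted';
        case PREG_BAD_UTF8_ERROR:
            return 'malformed UTF-8 in subject';
        case PREG_JIT_STACKLIMIT_ERROR:
            return 'PCRE JIT stack limit exhausted';
        default:
            return "unknown PCRE error $code";
    }
}

// Run preg_match() and turn a false return into an exception.
function matchOrThrow(string $pattern, string $subject): array {
    $matches = [];
    if (preg_match($pattern, $subject, $matches) === false) {
        throw new RuntimeException(describePregError(preg_last_error()));
    }
    return $matches;
}

// Raise pcre.backtrack_limit to at least twice the input length, if needed.
function raiseBacktrackLimit(int $inputLength): void {
    $wanted = 2 * $inputLength;
    if ((int)ini_get('pcre.backtrack_limit') < $wanted) {
        ini_set('pcre.backtrack_limit', (string)$wanted);
    }
}
```

Setting the limit at runtime as above is one option; setting it in php.ini achieves the same result for all requests.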

Member Data Documentation

◆ CHARREF_REGEX

const Wikimedia\RemexHtml\Tokenizer\Tokenizer::CHARREF_REGEX

Initial value:

= '~
                ( .*? )                      # 1. prefix
                &
                (?:
                    \# (?:
                        0*(\d+)           |  # 2. decimal
                        [xX]0*([0-9A-Fa-f]+) # 3. hexadecimal
                    )
                    ( ; ) ?                  # 4. semicolon
                    |
                    ( \# )                   # 5. bare hash
                    |
                    ({{NAMED_ENTITY_REGEX}}) # 6. known named
                    (?:
                        (?<! ; )             # Assert no semicolon prior
                        ( [=a-zA-Z0-9] )     # 7. attribute suffix
                    )?
                    |
                    ( [a-zA-Z0-9]+ ; )       # 8. invalid named
                )
 
 
                ~xAsS'

The documentation for this class was generated from the following file:

src/Tokenizer/Tokenizer.php
