RemexHtml
Fast HTML 5 parser
Wikimedia\RemexHtml\Tokenizer\Tokenizer Class Reference

HTML 5 tokenizer.

Public Member Functions

 __construct (TokenHandler $listener, $text, $options=[])
 Constructor.
 
 setEnableCdataCallback ( $cb)
 
 execute ( $options=[])
 Run the tokenizer on the whole input stream.
 
string getPreprocessedText ()
 Get the preprocessed input text.
 
 switchState ( $state, $appropriateEndTag)
 Change the state of the tokenizer during parsing.
 
 setFragmentContext ( $namespace, $tagName)
 Initialize the tokenizer for fragment parsing.
 
 beginStepping ()
 Notify the tokenizer that the document will be tokenized by repeated step() calls.
 
bool step ()
 Tokenize a minimum amount of text from the input stream, and emit the resulting events.
 
Public Member Functions inherited from Wikimedia\RemexHtml\PropGuard
 __set ( $name, $value)
 

Public Attributes

const STATE_START = 1
 
const STATE_DATA = 2
 
const STATE_RCDATA = 3
 
const STATE_RAWTEXT = 4
 
const STATE_SCRIPT_DATA = 5
 
const STATE_PLAINTEXT = 6
 
const STATE_EOF = 7
 
const STATE_CURRENT = 8
 
const CHARREF_REGEX
 

Protected Member Functions

 preprocess ()
 Preprocess the input text, if it hasn't been done already.
 
bool executeInternal ( $loop)
 The main state machine, the common implementation of step() and execute().
 
 interpretCommentMatches ( $m)
 Interpret the data state match results for a detected comment, and emit events as appropriate.
 
 interpretDoctypeMatches ( $m)
 Interpret the data state match results for a detected DOCTYPE token, and emit events as appropriate.
 
string|null interpretDoctypeQuoted ( $m, $dq, $sq, &$quirks)
 DOCTYPE helper which interprets a quoted string (or lack thereof)
 
string handleNulls ( $text, $sourcePos)
 Generic helper for all those points in the spec where U+0000 needs to be replaced with U+FFFD with a parse error issued.
 
 handleAsciiErrors ( $mask, $text, $offset, $length, $sourcePos)
 Generic helper for points in the spec which say that an error should be issued when certain ASCII characters are seen, with no other action taken.
 
string handleCharRefs ( $text, $sourcePos, $inAttr=false, $additionalAllowedChar='')
 Expand character references in some text, and emit errors as appropriate.
 
 emitDataRange ( $pos, $length, $isSimple=false, $hasSimpleRefs=false)
 Emit a range of the input text as a character token, and emit related errors, with validity rules as per the data state.
 
 emitCdataRange ( $innerPos, $innerLength, $outerPos, $outerLength)
 Emit a range of characters from the input text, with validity rules as per the CDATA section state.
 
 emitRawTextRange ( $ignoreCharRefs, $pos, $length)
 Emit a range of characters from the input text, either from RCDATA, RAWTEXT, script data or PLAINTEXT.
 
int textElementState ( $ignoreCharRefs)
 The entry point for the RCDATA and RAWTEXT states.
 
Attributes consumeAttribs ()
 Advance $this->pos, consuming all tag attributes found at the current position.
 
array interpretAttribMatches ( $matches)
 Interpret the results of the attribute preg_match_all().
 
int handleAttribsAndClose ( $state, $tagName, $isEndTag, $startPos)
 Consume attributes, and the closing bracket which follows attributes.
 
int plaintextState ()
 Process input text in the PLAINTEXT state.
 
int scriptDataState ()
 Process input text in the script data state.
 
 error ( $text, $pos=null)
 Emit a parse error event.
 
never fatal ( $text)
 Throw an exception for a specified reason.
 
never throwPregError ()
 Interpret preg_last_error() and throw a suitable exception.
 

Protected Attributes

const REPLACEMENT_CHAR = "\xef\xbf\xbd"
 
const BYTE_ORDER_MARK = "\xef\xbb\xbf"
 
 $ignoreErrors
 
 $ignoreCharRefs
 
 $ignoreNulls
 
 $skipPreprocess
 
 $scriptingFlag
 
 $appropriateEndTag
 
 $listener
 
 $state
 
 $preprocessed
 
 $text
 
 $pos
 
 $length
 
 $enableCdataCallback
 
 $fragmentNamespace
 
 $fragmentName
 

Additional Inherited Members

Static Public Attributes inherited from Wikimedia\RemexHtml\PropGuard
static $armed = true
 

Detailed Description

HTML 5 tokenizer.

Based on the W3C recommendation as published 01 November 2016: https://www.w3.org/TR/2016/REC-html51-20161101/

Constructor & Destructor Documentation

◆ __construct()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::__construct ( TokenHandler $listener,
$text,
$options = [] )

Constructor.

Parameters
TokenHandler $listener: The object which receives token events
string $text: The text to tokenize
array $options: Associative array of options, including:
  • ignoreErrors: True to improve performance by ignoring errors. The token stream should still be the same, except that error() won't be called.
  • ignoreCharRefs: True to ignore character references. Character tokens will contain the unexpanded character references, and no errors related to invalid character references will be raised. Performance will be improved. This is not compliant behaviour.
  • ignoreNulls: True to ignore NULL bytes in the input stream, instead of raising errors and converting them to U+FFFD as is usually required by the spec.
  • skipPreprocess: True to skip the "preprocessing the input stream" stage, which normalizes line endings and raises errors on certain control characters. Advisable if the input stream is already appropriately normalized.
  • scriptingFlag: True if the scripting flag is enabled. Default true. Setting this to false causes the contents of <noscript> elements to be processed as normal content. The scriptingFlag option in the TreeBuilder should be set to the same value.
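
As a rough illustration of how these options combine, the sketch below builds two option arrays: a strict one that keeps the compliant behaviour, and a fast one that trades compliance for speed. The helper name remexTokenizerOptions() and its $fast switch are hypothetical; only the array keys come from the list above.

```php
<?php
/**
 * Build an options array for the Tokenizer constructor.
 * Hypothetical helper: only the array keys are taken from the documentation.
 */
function remexTokenizerOptions(bool $fast, bool $scripting = true): array
{
    return [
        // Skip error() callbacks; the token stream should stay the same.
        'ignoreErrors' => $fast,
        // Leave character references unexpanded (not compliant).
        'ignoreCharRefs' => $fast,
        // Do not convert NULL bytes to U+FFFD.
        'ignoreNulls' => $fast,
        // Only advisable when the input is already normalized.
        'skipPreprocess' => $fast,
        // Should match the scriptingFlag given to the TreeBuilder.
        'scriptingFlag' => $scripting,
    ];
}

$strict = remexTokenizerOptions(false);
$fast = remexTokenizerOptions(true);

// With the library installed, either array would be passed as the third
// constructor argument, for example: new Tokenizer($listener, $text, $strict)
```

Keeping the choice in one place makes it harder to enable skipPreprocess by accident for input that has not been normalized.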

Member Function Documentation

◆ beginStepping()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::beginStepping ( )

Notify the tokenizer that the document will be tokenized by repeated step() calls.

This must be called once only, before the first call to step().

◆ consumeAttribs()

Attributes Wikimedia\RemexHtml\Tokenizer\Tokenizer::consumeAttribs ( )
protected

Advance $this->pos, consuming all tag attributes found at the current position.

The new position will be at the end of the tag or at the end of the input string.

To improve performance of consumers which don't need to read the attribute array, interpretation of the PCRE match results is deferred.

Todo: Make deferral configurable.
Returns
Attributes

◆ emitCdataRange()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::emitCdataRange ( $innerPos,
$innerLength,
$outerPos,
$outerLength )
protected

Emit a range of characters from the input text, with validity rules as per the CDATA section state.

Parameters
int $innerPos: The position after the <![CDATA[
int $innerLength: The length of the string not including the terminating ]]>
int $outerPos: The position of the start of the <![CDATA[
int $outerLength: The length of the whole input region being emitted

◆ emitDataRange()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::emitDataRange ( $pos,
$length,
$isSimple = false,
$hasSimpleRefs = false )
protected

Emit a range of the input text as a character token, and emit related errors, with validity rules as per the data state.

Parameters
int $pos: Offset within the input text
int $length: The length of the range
bool $isSimple: True if you know that the data range does not contain <, \0 or &; false is safe if you're not sure
bool $hasSimpleRefs: True if you know that any character references are semicolon terminated and in the list of $commonEntities; false is safe if you're not sure

◆ emitRawTextRange()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::emitRawTextRange ( $ignoreCharRefs,
$pos,
$length )
protected

Emit a range of characters from the input text, either from RCDATA, RAWTEXT, script data or PLAINTEXT.

The only difference between these states is whether or not character references are expanded, so we take that as a parameter.

Parameters
bool $ignoreCharRefs
int $pos: The input position
int $length: The length of the range to be emitted

◆ error()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::error ( $text,
$pos = null )
protected

Emit a parse error event.

Parameters
string $text: The error message
int|null $pos: The error position, or null to use the current position

◆ execute()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::execute ( $options = [])

Run the tokenizer on the whole input stream.

This is the normal entry point.

Parameters
array $options: An associative array of options:
  • state : One of the STATE_* constants, a state in which to start.
  • appropriateEndTag : The "appropriate end tag", which needs to be set if entering one of the raw text states.
  • fragmentNamespace : The fragment namespace
  • fragmentName : The fragment tag name
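
A minimal sketch of assembling these options when starting in a raw text state. The helper executeOptionsForState() is hypothetical, the constants are local copies of the STATE_* values listed for this class, and the set of states treated as needing an end tag is an assumption rather than something this page states.

```php
<?php
// Values copied from the STATE_* constants listed for this class.
const STATE_RCDATA = 3;
const STATE_RAWTEXT = 4;
const STATE_SCRIPT_DATA = 5;

/**
 * Build an options array for execute() that starts in a given state.
 * Hypothetical helper; the check encodes the rule that raw text states
 * need an appropriate end tag.
 */
function executeOptionsForState(int $state, ?string $appropriateEndTag = null): array
{
    // Assumption: these three are the states that end on a matching end tag.
    $needsEndTag = in_array($state, [STATE_RCDATA, STATE_RAWTEXT, STATE_SCRIPT_DATA], true);
    if ($needsEndTag && $appropriateEndTag === null) {
        throw new InvalidArgumentException('appropriateEndTag is required for this state');
    }
    $options = ['state' => $state];
    if ($appropriateEndTag !== null) {
        $options['appropriateEndTag'] = $appropriateEndTag;
    }
    return $options;
}

// Tokenize the contents of a <title> element: RCDATA, ended by </title>.
$options = executeOptionsForState(STATE_RCDATA, 'title');

// With the library installed: $tokenizer->execute($options);
```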

◆ executeInternal()

bool Wikimedia\RemexHtml\Tokenizer\Tokenizer::executeInternal ( $loop)
protected

The main state machine, the common implementation of step() and execute().

Parameters
bool $loop: Set to true to loop until finished, false to step once.
Returns
bool True if the input continues, false on EOF

◆ fatal()

never Wikimedia\RemexHtml\Tokenizer\Tokenizer::fatal ( $text)
protected

Throw an exception for a specified reason.

This is used for API errors and assertion-like checks.

Parameters
string $text: The error message
Exceptions
TokenizerError
Returns
never

◆ getPreprocessedText()

string Wikimedia\RemexHtml\Tokenizer\Tokenizer::getPreprocessedText ( )

Get the preprocessed input text.

Source offsets in event parameters are relative to this string. If skipPreprocess was specified, this will be the same as the input string.

Returns
string
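
To show why event offsets should be applied to the preprocessed text rather than to the raw input, here is a small sketch. normalizeLineEndings() is a simplified stand-in for the real preprocessing stage, sliceSource() is a hypothetical helper, and the offsets are made up for the example.

```php
<?php
/**
 * Simplified stand-in for the line-ending normalization done during
 * preprocessing. The real stage also reports certain control characters.
 */
function normalizeLineEndings(string $raw): string
{
    return str_replace(["\r\n", "\r"], "\n", $raw);
}

/** Cut the source text of one event out of the preprocessed string. */
function sliceSource(string $preprocessed, int $sourceStart, int $sourceLength): string
{
    return substr($preprocessed, $sourceStart, $sourceLength);
}

$raw = "<p>\r\n<b>x</b>";
$preprocessed = normalizeLineEndings($raw);

// Hypothetical offsets, as a start tag event might report them for <b>.
$fromPreprocessed = sliceSource($preprocessed, 4, 3); // "<b>"
$fromRaw = sliceSource($raw, 4, 3);                   // "\n<b", shifted by the removed "\r"
```

With the library, the string to slice would be the return value of getPreprocessedText().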

◆ handleAsciiErrors()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::handleAsciiErrors ( $mask,
$text,
$offset,
$length,
$sourcePos )
protected

Generic helper for points in the spec which say that an error should be issued when certain ASCII characters are seen, with no other action taken.

Parameters
string $mask: Mask for strcspn
string $text: The input text
int $offset: The start of the range within $text to search
int $length: The length of the range within $text to search
int $sourcePos: The offset within the input text corresponding to $text, for error position reporting.
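
The strcspn()-based scan can be sketched as follows. This is an illustration of the technique, not the class's code: findMaskedChars() is a hypothetical function that returns the offending characters and their reported positions instead of emitting error events.

```php
<?php
/**
 * Find every character of $mask inside $text[$offset, $offset + $length)
 * and return [character, reported position] pairs.
 */
function findMaskedChars(string $mask, string $text, int $offset, int $length, int $sourcePos): array
{
    $found = [];
    $end = $offset + $length;
    $pos = $offset;
    while ($pos < $end) {
        // Skip the run of characters that are not in the mask.
        $pos += strcspn($text, $mask, $pos, $end - $pos);
        if ($pos >= $end) {
            break;
        }
        // $text[$pos] is in the mask: record it and move past it.
        $found[] = [$text[$pos], $sourcePos + $pos];
        $pos++;
    }
    return $found;
}

// Quotes and "<" inside a chunk of text that starts at input offset 10.
$hits = findMaskedChars("\"'<", 'a"b<c', 0, 5, 10);
```

Because strcspn() runs in native code, the loop body executes once per offending character rather than once per input byte.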

◆ handleAttribsAndClose()

int Wikimedia\RemexHtml\Tokenizer\Tokenizer::handleAttribsAndClose ( $state,
$tagName,
$isEndTag,
$startPos )
protected

Consume attributes, and the closing bracket which follows attributes.

Emit the appropriate tag event, or in the case of broken attributes in text states, emit characters.

Parameters
int $state: The current state
string $tagName: The normalized tag name
bool $isEndTag: True if this is an end tag, false if it is a start tag
int $startPos: The input position of the start of the current tag.
Returns
int The next state

◆ handleCharRefs()

string Wikimedia\RemexHtml\Tokenizer\Tokenizer::handleCharRefs ( $text,
$sourcePos,
$inAttr = false,
$additionalAllowedChar = '' )
protected

Expand character references in some text, and emit errors as appropriate.

Parameters
string $text: The text to expand
int $sourcePos: The input position of $text
bool $inAttr: True if the text is within an attribute value
string $additionalAllowedChar: An unused string which the spec inexplicably spends a lot of space telling you how to derive. It suppresses errors in a place where no errors are emitted anyway.
Returns
string The expanded text

◆ handleNulls()

string Wikimedia\RemexHtml\Tokenizer\Tokenizer::handleNulls ( $text,
$sourcePos )
protected

Generic helper for all those points in the spec where U+0000 needs to be replaced with U+FFFD with a parse error issued.

Parameters
string $text: The text to be converted
int $sourcePos: The input byte offset from which $text was extracted, for error position reporting.
Returns
string The converted text
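
A minimal sketch of the replacement described above, assuming errors are delivered through a callback. replaceNulls() and the error message text are hypothetical; the replacement bytes are those of the REPLACEMENT_CHAR constant.

```php
<?php
/**
 * Replace each U+0000 in $text with U+FFFD (UTF-8 encoded) and report
 * the input position of every NULL through $onError.
 */
function replaceNulls(string $text, int $sourcePos, callable $onError): string
{
    $offset = 0;
    // Scan the original text so that reported positions are input offsets.
    while (($offset = strpos($text, "\0", $offset)) !== false) {
        $onError('replaced null character', $sourcePos + $offset);
        $offset++;
    }
    return str_replace("\0", "\xef\xbf\xbd", $text);
}

$positions = [];
$converted = replaceNulls("a\0b\0", 100, function ($message, $pos) use (&$positions) {
    $positions[] = $pos;
});
```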

◆ interpretAttribMatches()

array Wikimedia\RemexHtml\Tokenizer\Tokenizer::interpretAttribMatches ( $matches)
protected

Interpret the results of the attribute preg_match_all().

Emit errors as appropriate and return an associative array.

Parameters
array $matches
Returns
array

◆ interpretCommentMatches()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::interpretCommentMatches ( $m)
protected

Interpret the data state match results for a detected comment, and emit events as appropriate.

Parameters
array $m: The match array

◆ interpretDoctypeMatches()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::interpretDoctypeMatches ( $m)
protected

Interpret the data state match results for a detected DOCTYPE token, and emit events as appropriate.

Parameters
array $m: The match array

◆ interpretDoctypeQuoted()

string|null Wikimedia\RemexHtml\Tokenizer\Tokenizer::interpretDoctypeQuoted ( $m,
$dq,
$sq,
& $quirks )
protected

DOCTYPE helper which interprets a quoted string (or lack thereof)

Parameters
array $m
int $dq
int $sq
bool &$quirks
Returns
string|null The quoted value, with nulls replaced.

◆ plaintextState()

int Wikimedia\RemexHtml\Tokenizer\Tokenizer::plaintextState ( )
protected

Process input text in the PLAINTEXT state.

Returns
int The next state index

◆ scriptDataState()

int Wikimedia\RemexHtml\Tokenizer\Tokenizer::scriptDataState ( )
protected

Process input text in the script data state.

Returns
int The next state index

◆ setFragmentContext()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::setFragmentContext ( $namespace,
$tagName )

Initialize the tokenizer for fragment parsing.

Parameters
string $namespace: The namespace of the context element
string $tagName: The name of the context element

◆ step()

bool Wikimedia\RemexHtml\Tokenizer\Tokenizer::step ( )

Tokenize a minimum amount of text from the input stream, and emit the resulting events.

Returns
bool True if the input continues and step() should be called again, false on EOF
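
The stepping protocol (one beginStepping() call, then step() until it returns false) can be sketched as below. runSteps() and its step budget are hypothetical, and FakeSteppingTokenizer is a stand-in so that the example runs without the library; a real Tokenizer instance would be passed in its place.

```php
<?php
/**
 * Drive a tokenizer in stepping mode, stopping after $maxSteps steps.
 * $tokenizer is any object with beginStepping() and step(), as documented
 * for the Tokenizer class. Returns the number of step() calls made.
 */
function runSteps(object $tokenizer, int $maxSteps): int
{
    // Must be called exactly once, before the first step().
    $tokenizer->beginStepping();
    $steps = 0;
    while ($steps < $maxSteps) {
        $steps++;
        // step() returns false once EOF has been reached.
        if (!$tokenizer->step()) {
            break;
        }
    }
    return $steps;
}

/** Stand-in for illustration only: reports EOF on the third step. */
class FakeSteppingTokenizer
{
    public $began = 0;
    private $remaining = 3;

    public function beginStepping()
    {
        $this->began++;
    }

    public function step()
    {
        return --$this->remaining > 0;
    }
}

$fake = new FakeSteppingTokenizer();
$count = runSteps($fake, 100);
```

A step budget like this is one way to interleave tokenizing with other work; resuming after the budget runs out would need a variant that does not call beginStepping() again.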

◆ switchState()

Wikimedia\RemexHtml\Tokenizer\Tokenizer::switchState ( $state,
$appropriateEndTag )

Change the state of the tokenizer during parsing.

This is for use by the tree builder to switch the tokenizer into one of the raw text states.

Parameters
int $state: One of the STATE_* constants
string $appropriateEndTag: The appropriate end tag

◆ textElementState()

int Wikimedia\RemexHtml\Tokenizer\Tokenizer::textElementState ( $ignoreCharRefs)
protected

The entry point for the RCDATA and RAWTEXT states.

Parameters
bool $ignoreCharRefs: True to ignore character references regardless of configuration, false to respect the configuration.
Returns
int The next state index

◆ throwPregError()

never Wikimedia\RemexHtml\Tokenizer\Tokenizer::throwPregError ( )
protected

Interpret preg_last_error() and throw a suitable exception.

This is called when preg_match() or similar returns false.

Notes for users:

  • PCRE internal error: prior to PHP 7, this may be due to JIT stack space exhaustion caused by excessive recursion. Increase stack space.
  • pcre.backtrack_limit exhausted: The backtrack limit should be at least double the input size; the defaults are far too small. Increase it in configuration.
Returns
never
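
One possible way to apply the backtrack limit advice at runtime, assuming the setting may be changed with ini_set() in your environment. ensureBacktrackLimit() is a hypothetical helper; it only ever raises the limit.

```php
<?php
/**
 * Raise pcre.backtrack_limit to at least double the input size.
 * Returns the limit in effect afterwards.
 */
function ensureBacktrackLimit(string $input): int
{
    $needed = 2 * strlen($input);
    $current = (int)ini_get('pcre.backtrack_limit');
    if ($current < $needed) {
        // ini_set() takes the new value as a string.
        ini_set('pcre.backtrack_limit', (string)$needed);
    }
    return (int)ini_get('pcre.backtrack_limit');
}

// A 600000 byte document needs a limit of at least 1200000.
$limit = ensureBacktrackLimit(str_repeat('x', 600000));
```

Setting the value in php.ini instead avoids repeating the check for every document.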

Member Data Documentation

◆ CHARREF_REGEX

const Wikimedia\RemexHtml\Tokenizer\Tokenizer::CHARREF_REGEX
Initial value:
= '~
    ( .*? )                          # 1. prefix
    &
    (?:
        \# (?:
            0*(\d+) |                # 2. decimal
            [xX]0*([0-9A-Fa-f]+)     # 3. hexadecimal
        )
        ( ; ) ?                      # 4. semicolon
    |
        ( \# )                       # 5. bare hash
    |
        ({{NAMED_ENTITY_REGEX}})     # 6. known named
        (?:
            (?<! ; )                 # Assert no semicolon prior
            ( [=a-zA-Z0-9] )         # 7. attribute suffix
        )?
    |
        ( [a-zA-Z0-9]+ ; )           # 8. invalid named
    )
    ~xAsS'
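
To see how the capture groups behave, the sketch below uses a reduced pattern containing only groups 1 to 4 (numeric references). NUMERIC_CHARREF_REGEX and findNumericCharRefs() are hypothetical names, and the use of preg_match_all() with PREG_SET_ORDER is an assumption about how such a pattern would be applied.

```php
<?php
// Reduced pattern: groups 1 to 4 of CHARREF_REGEX, numeric references only.
const NUMERIC_CHARREF_REGEX = '~
    ( .*? )                      # 1. prefix
    &\# (?:
        0*(\d+) |                # 2. decimal
        [xX]0*([0-9A-Fa-f]+)     # 3. hexadecimal
    )
    ( ; ) ?                      # 4. semicolon
    ~xAs';

/**
 * List the numeric character references in $text as
 * [prefix, code point, has semicolon] triples.
 */
function findNumericCharRefs(string $text): array
{
    $found = [];
    if (preg_match_all(NUMERIC_CHARREF_REGEX, $text, $matches, PREG_SET_ORDER) === false) {
        throw new RuntimeException('preg_match_all failed: ' . preg_last_error());
    }
    foreach ($matches as $m) {
        // An unmatched group before a matched one is reported as ''.
        $codePoint = $m[2] !== '' ? (int)$m[2] : (int)hexdec($m[3]);
        // A trailing unmatched group is omitted from the match array.
        $found[] = [$m[1], $codePoint, isset($m[4])];
    }
    return $found;
}

$refs = findNumericCharRefs('a&#65;b&#x42 c');
```

Because of the A (anchored) modifier, each match has to begin where the previous one ended, so the prefix group carries the plain text between references and any text after the last reference is left unmatched.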
