RemexHtml
Fast HTML 5 parser
HTML 5 tokenizer.
Public Member Functions | |
__construct (TokenHandler $listener, $text, $options=[]) | |
Constructor. | |
setEnableCdataCallback ( $cb) | |
execute ( $options=[]) | |
Run the tokenizer on the whole input stream. | |
string | getPreprocessedText () |
Get the preprocessed input text. | |
switchState ( $state, $appropriateEndTag) | |
Change the state of the tokenizer during parsing. | |
setFragmentContext ( $namespace, $tagName) | |
Initialize the tokenizer for fragment parsing. | |
beginStepping () | |
Notify the tokenizer that the document will be tokenized by repeated step() calls. | |
bool | step () |
Tokenize a minimum amount of text from the input stream, and emit the resulting events. | |
Public Member Functions inherited from Wikimedia\RemexHtml\PropGuard | |
__set ( $name, $value) | |
Public Attributes | |
const | STATE_START = 1 |
const | STATE_DATA = 2 |
const | STATE_RCDATA = 3 |
const | STATE_RAWTEXT = 4 |
const | STATE_SCRIPT_DATA = 5 |
const | STATE_PLAINTEXT = 6 |
const | STATE_EOF = 7 |
const | STATE_CURRENT = 8 |
const | CHARREF_REGEX |
Protected Member Functions | |
preprocess () | |
Preprocess the input text, if it hasn't been done already. | |
bool | executeInternal ( $loop) |
The main state machine, the common implementation of step() and execute(). | |
interpretCommentMatches ( $m) | |
Interpret the data state match results for a detected comment, and emit events as appropriate. | |
interpretDoctypeMatches ( $m) | |
Interpret the data state match results for a detected DOCTYPE token, and emit events as appropriate. | |
string|null | interpretDoctypeQuoted ( $m, $dq, $sq, &$quirks) |
DOCTYPE helper which interprets a quoted string (or lack thereof). | |
string | handleNulls ( $text, $sourcePos) |
Generic helper for all those points in the spec where U+0000 needs to be replaced with U+FFFD with a parse error issued. | |
handleAsciiErrors ( $mask, $text, $offset, $length, $sourcePos) | |
Generic helper for points in the spec which say that an error should be issued when certain ASCII characters are seen, with no other action taken. | |
string | handleCharRefs ( $text, $sourcePos, $inAttr=false, $additionalAllowedChar='') |
Expand character references in some text, and emit errors as appropriate. | |
emitDataRange ( $pos, $length, $isSimple=false, $hasSimpleRefs=false) | |
Emit a range of the input text as a character token, and emit related errors, with validity rules as per the data state. | |
emitCdataRange ( $innerPos, $innerLength, $outerPos, $outerLength) | |
Emit a range of characters from the input text, with validity rules as per the CDATA section state. | |
emitRawTextRange ( $ignoreCharRefs, $pos, $length) | |
Emit a range of characters from the input text, either from RCDATA, RAWTEXT, script data or PLAINTEXT. | |
int | textElementState ( $ignoreCharRefs) |
The entry point for the RCDATA and RAWTEXT states. | |
Attributes | consumeAttribs () |
Advance $this->pos, consuming all tag attributes found at the current position. | |
array | interpretAttribMatches ( $matches) |
Interpret the results of the attribute preg_match_all(). | |
int | handleAttribsAndClose ( $state, $tagName, $isEndTag, $startPos) |
Consume attributes, and the closing bracket which follows attributes. | |
int | plaintextState () |
Process input text in the PLAINTEXT state. | |
int | scriptDataState () |
Process input text in the script data state. | |
error ( $text, $pos=null) | |
Emit a parse error event. | |
never | fatal ( $text) |
Throw an exception for a specified reason. | |
never | throwPregError () |
Interpret preg_last_error() and throw a suitable exception. | |
Additional Inherited Members | |
Static Public Attributes inherited from Wikimedia\RemexHtml\PropGuard | |
static | $armed = true |
HTML 5 tokenizer.
Based on the W3C recommendation as published 01 November 2016: https://www.w3.org/TR/2016/REC-html51-20161101/
Wikimedia\RemexHtml\Tokenizer\Tokenizer::__construct | ( | TokenHandler | $listener, |
$text, | |||
$options = [] ) |
Constructor.
TokenHandler | $listener | The object which receives token events |
string | $text | The text to tokenize |
array | $options | Associative array of options, including:
|
Wikimedia\RemexHtml\Tokenizer\Tokenizer::beginStepping | ( | ) |
Notify the tokenizer that the document will be tokenized by repeated step() calls.
This must be called once only, before the first call to step().
Attributes Wikimedia\RemexHtml\Tokenizer\Tokenizer::consumeAttribs | ( | ) | protected |
Advance $this->pos, consuming all tag attributes found at the current position.
The new position will be at the end of the tag or at the end of the input string.
To improve performance of consumers which don't need to read the attribute array, interpretation of the PCRE match results is deferred.
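Wikimedia\RemexHtml\Tokenizer\Tokenizer::emitCdataRange | ( | $innerPos, $innerLength, $outerPos, $outerLength | ) | protected |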
Emit a range of characters from the input text, with validity rules as per the CDATA section state.
int | $innerPos | The position after the opening <![CDATA[ |
int | $innerLength | The length of the string, not including the terminating ]]> |
int | $outerPos | The position of the start of the <![CDATA[ |
int | $outerLength | The length of the whole input region being emitted |
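Wikimedia\RemexHtml\Tokenizer\Tokenizer::emitDataRange | ( | $pos, $length, $isSimple = false, $hasSimpleRefs = false | ) | protected |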
Emit a range of the input text as a character token, and emit related errors, with validity rules as per the data state.
int | $pos | Offset within the input text |
int | $length | The length of the range |
bool | $isSimple | True if you know that the data range does not contain <, \0 or &; false is safe if you're not sure |
bool | $hasSimpleRefs | True if you know that any character references are semicolon terminated and in the list of $commonEntities; false is safe if you're not sure |
Wikimedia\RemexHtml\Tokenizer\Tokenizer::emitRawTextRange | ( | $ignoreCharRefs, $pos, $length | ) | protected |
Emit a range of characters from the input text, either from RCDATA, RAWTEXT, script data or PLAINTEXT.
The only difference between these states is whether or not character references are expanded, so we take that as a parameter.
bool | $ignoreCharRefs | |
int | $pos | The input position |
int | $length | The length of the range to be emitted |
Wikimedia\RemexHtml\Tokenizer\Tokenizer::error | ( | $text, $pos = null | ) | protected |
Emit a parse error event.
string | $text | The error message |
int|null | $pos | The error position, or null to use the current position |
Wikimedia\RemexHtml\Tokenizer\Tokenizer::execute | ( | $options = [] | ) |
Run the tokenizer on the whole input stream.
This is the normal entry point.
array | $options | An associative array of options:
|
bool Wikimedia\RemexHtml\Tokenizer\Tokenizer::executeInternal | ( | $loop | ) | protected |
The main state machine, the common implementation of step() and execute().
bool | $loop | Set to true to loop until finished, false to step once. |
never Wikimedia\RemexHtml\Tokenizer\Tokenizer::fatal | ( | $text | ) | protected |
Throw an exception for a specified reason.
This is used for API errors and assertion-like checks.
string | $text | The error message |
Exceptions: TokenizerError |
string Wikimedia\RemexHtml\Tokenizer\Tokenizer::getPreprocessedText | ( | ) |
Get the preprocessed input text.
Source offsets in event parameters are relative to this string. If skipPreprocess was specified, this will be the same as the input string.
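The following sketch illustrates why offsets must be read against the preprocessed text rather than the original input. It assumes that preprocessing includes the newline normalization described in the HTML specification (CR LF and lone CR become LF); whether Tokenizer::preprocess() does exactly this is an assumption, and no RemexHtml code is called here.

```php
<?php
// Raw input with a Windows-style line ending before the tag.
$input = "a\r\nb<p>";

// Assumed preprocessing step: normalize CR LF and lone CR to LF.
// strtr() with an array tries the longest key first, so "\r\n" is
// replaced as a unit rather than as two separate characters.
$preprocessed = strtr($input, ["\r\n" => "\n", "\r" => "\n"]);

// The same "<" sits at different byte offsets in the two strings.
$posInInput = strpos($input, '<');
$posInPreprocessed = strpos($preprocessed, '<');
```

Under this assumption an event reporting the tag start would refer to $posInPreprocessed, so a consumer slicing the source should slice the string returned by getPreprocessedText().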
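Wikimedia\RemexHtml\Tokenizer\Tokenizer::handleAsciiErrors | ( | $mask, $text, $offset, $length, $sourcePos | ) | protected |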
Generic helper for points in the spec which say that an error should be issued when certain ASCII characters are seen, with no other action taken.
string | $mask | Mask for strcspn |
string | $text | The input text |
int | $offset | The start of the range within $text to search |
int | $length | The length of the range within $text to search |
int | $sourcePos | The offset within the input text corresponding to $text, for error position reporting. |
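A minimal sketch of the strcspn-based scan that the parameters above suggest. The function name findAsciiErrors, the error message format, and the choice to return a list instead of emitting error events are all hypothetical; the real method is protected and reports through the token handler.

```php
<?php
/**
 * Hypothetical stand-in: find every byte within
 * $text[$offset .. $offset + $length) that appears in $mask.
 * Returns a list of [message, position] pairs.
 */
function findAsciiErrors(string $mask, string $text, int $offset, int $length, int $sourcePos): array {
    $errors = [];
    $end = $offset + $length;
    while ($offset < $end) {
        // strcspn() gives the length of the run that contains no mask bytes.
        $offset += strcspn($text, $mask, $offset, $end - $offset);
        if ($offset >= $end) {
            break;
        }
        // Assumption: $sourcePos is the input offset of the start of $text.
        $errors[] = ["unexpected \"{$text[$offset]}\"", $sourcePos + $offset];
        $offset++;
    }
    return $errors;
}

// Scan for quote and "<" characters in text that starts at input offset 100.
$errors = findAsciiErrors("\"'<", 'a"b<c', 0, 5, 100);
```

The text itself is left unchanged, matching the "no other action taken" wording; only positions are collected.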
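int Wikimedia\RemexHtml\Tokenizer\Tokenizer::handleAttribsAndClose | ( | $state, $tagName, $isEndTag, $startPos | ) | protected |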
Consume attributes, and the closing bracket which follows attributes.
Emit the appropriate tag event, or in the case of broken attributes in text states, emit characters.
int | $state | The current state |
string | $tagName | The normalized tag name |
bool | $isEndTag | True if this is an end tag, false if it is a start tag |
int | $startPos | The input position of the start of the current tag. |
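string Wikimedia\RemexHtml\Tokenizer\Tokenizer::handleCharRefs | ( | $text, $sourcePos, $inAttr = false, $additionalAllowedChar = '' | ) | protected |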
Expand character references in some text, and emit errors as appropriate.
string | $text | The text to expand |
int | $sourcePos | The input position of $text |
bool | $inAttr | True if the text is within an attribute value |
string | $additionalAllowedChar | An unused string which the spec inexplicably spends a lot of space telling you how to derive. It suppresses errors in a place where no errors are emitted anyway. |
string Wikimedia\RemexHtml\Tokenizer\Tokenizer::handleNulls | ( | $text, $sourcePos | ) | protected |
Generic helper for all those points in the spec where U+0000 needs to be replaced with U+FFFD with a parse error issued.
string | $text | The text to be converted |
int | $sourcePos | The input byte offset from which $text was extracted, for error position reporting. |
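The replacement described here can be sketched as follows. The function name replaceNulls and the returned error list are hypothetical; the real helper is protected and reports parse errors through the tokenizer's error mechanism, not through a return value.

```php
<?php
/**
 * Hypothetical stand-in: replace each U+0000 with U+FFFD and collect
 * the input position of every replaced byte.
 * Returns [converted text, list of error positions].
 */
function replaceNulls(string $text, int $sourcePos): array {
    $errors = [];
    $offset = 0;
    // Record one error position per NUL byte found.
    while (($pos = strpos($text, "\0", $offset)) !== false) {
        $errors[] = $sourcePos + $pos;
        $offset = $pos + 1;
    }
    // "\u{FFFD}" is the UTF-8 encoding of the replacement character.
    $clean = str_replace("\0", "\u{FFFD}", $text);
    return [$clean, $errors];
}

// Text extracted from input byte offset 10, containing two NUL bytes.
[$clean, $errors] = replaceNulls("a\0b\0", 10);
```

Positions are computed against the original bytes before replacement, since U+FFFD occupies three bytes in UTF-8 and would otherwise shift later offsets.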
array Wikimedia\RemexHtml\Tokenizer\Tokenizer::interpretAttribMatches | ( | $matches | ) | protected |
Interpret the results of the attribute preg_match_all().
Emit errors as appropriate and return an associative array.
array | $matches |
Wikimedia\RemexHtml\Tokenizer\Tokenizer::interpretCommentMatches | ( | $m | ) | protected |
Interpret the data state match results for a detected comment, and emit events as appropriate.
array | $m | The match array |
Wikimedia\RemexHtml\Tokenizer\Tokenizer::interpretDoctypeMatches | ( | $m | ) | protected |
Interpret the data state match results for a detected DOCTYPE token, and emit events as appropriate.
array | $m | The match array |
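string|null Wikimedia\RemexHtml\Tokenizer\Tokenizer::interpretDoctypeQuoted | ( | $m, $dq, $sq, &$quirks | ) | protected |
DOCTYPE helper which interprets a quoted string (or lack thereof).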
array | $m | |
int | $dq | |
int | $sq | |
bool | &$quirks |
int Wikimedia\RemexHtml\Tokenizer\Tokenizer::plaintextState | ( | ) | protected |
Process input text in the PLAINTEXT state.
int Wikimedia\RemexHtml\Tokenizer\Tokenizer::scriptDataState | ( | ) | protected |
Process input text in the script data state.
Wikimedia\RemexHtml\Tokenizer\Tokenizer::setFragmentContext | ( | $namespace, | |
$tagName ) |
Initialize the tokenizer for fragment parsing.
string | $namespace | The namespace of the context element |
string | $tagName | The name of the context element |
bool Wikimedia\RemexHtml\Tokenizer\Tokenizer::step | ( | ) |
Tokenize a minimum amount of text from the input stream, and emit the resulting events.
Wikimedia\RemexHtml\Tokenizer\Tokenizer::switchState | ( | $state, | |
$appropriateEndTag ) |
Change the state of the tokenizer during parsing.
This is for use by the tree builder to switch the tokenizer into one of the raw text states.
int | $state | One of the STATE_* constants |
string | $appropriateEndTag | The appropriate end tag |
int Wikimedia\RemexHtml\Tokenizer\Tokenizer::textElementState | ( | $ignoreCharRefs | ) | protected |
The entry point for the RCDATA and RAWTEXT states.
bool | $ignoreCharRefs | True to ignore character references regardless of configuration, false to respect the configuration. |
never Wikimedia\RemexHtml\Tokenizer\Tokenizer::throwPregError | ( | ) | protected |
Interpret preg_last_error() and throw a suitable exception.
This is called when preg_match() or similar returns false.
Notes for users:
const Wikimedia\RemexHtml\Tokenizer\Tokenizer::CHARREF_REGEX |