MediaWiki REL1_33
HTML sanitizer for MediaWiki.
Static Public Member Functions | |
static | armorFrenchSpaces ( $text, $space='&#160;') |
Armor French spaces with a replacement character. | |
static | attributeWhitelist ( $element) |
Fetch the whitelist of acceptable attributes for a given element name. | |
static | checkCss ( $value) |
Pick apart some CSS and check it for forbidden or unsafe structures. | |
static | cleanUrl ( $url) |
static | cleanUrlCallback ( $matches) |
static | cssDecodeCallback ( $matches) |
static | decCharReference ( $codepoint) |
static | decodeChar ( $codepoint) |
Return UTF-8 string for a codepoint if that is a valid character reference, otherwise U+FFFD REPLACEMENT CHARACTER. | |
static | decodeCharReferences ( $text) |
Decode any character references, numeric or named entities, in the text and return a UTF-8 string. | |
static | decodeCharReferencesAndNormalize ( $text) |
Decode any character references, numeric or named entities, in the text and normalize the resulting string. | |
static | decodeCharReferencesCallback ( $matches) |
static | decodeEntity ( $name) |
If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD, return the UTF-8 encoding of that character. | |
static | decodeTagAttributes ( $text) |
Return an associative array of attribute names and values from a partial tag string. | |
static | encodeAttribute ( $text) |
Encode an attribute value for HTML output. | |
static | escapeClass ( $class) |
Given a value, escape it so that it can be used as a CSS class and return it. | |
static | escapeHtmlAllowEntities ( $html) |
Given HTML input, escape with htmlspecialchars but un-escape entities. | |
static | escapeId ( $id, $options=[]) |
Given a value, escape it so that it can be used in an id attribute and return it. | |
static | escapeIdForAttribute ( $id, $mode=self::ID_PRIMARY) |
Given a section name or other user-generated or otherwise unsafe string, escapes it to be a valid HTML id attribute. | |
static | escapeIdForExternalInterwiki ( $id) |
Given a section name or other user-generated or otherwise unsafe string, escapes it to be a valid URL fragment for external interwikis. | |
static | escapeIdForLink ( $id) |
Given a section name or other user-generated or otherwise unsafe string, escapes it to be a valid URL fragment. | |
static | escapeIdReferenceList ( $referenceString) |
Given a string containing a space delimited list of ids, escape each id to match ids escaped by the escapeIdForAttribute() function. | |
static | fixTagAttributes ( $text, $element, $sorted=false) |
Take a tag soup fragment listing an HTML element's attributes and normalize it to well-formed XML, discarding unwanted attributes. | |
static | getAttribNameRegex () |
Used in Sanitizer::decodeTagAttributes to filter attributes. | |
static | getAttribsRegex () |
Regular expression to match HTML/XML attribute pairs within a tag. | |
static | getRecognizedTagData ( $extratags=[], $removetags=[]) |
Return the various lists of recognized tags. | |
static | hackDocType () |
Hack up a private DOCTYPE with HTML's standard entity declarations. | |
static | hexCharReference ( $codepoint) |
static | isReservedDataAttribute ( $attr) |
Given an attribute name, checks whether it is a reserved data attribute (such as data-mw-foo) which is unavailable to user-generated HTML so MediaWiki core and extension code can safely use it to communicate with frontend code. | |
static | mergeAttributes ( $a, $b) |
Merge two sets of HTML attributes. | |
static | normalizeCharReferences ( $text) |
Ensure that any entities and character references are legal for XML and XHTML specifically. | |
static | normalizeCharReferencesCallback ( $matches) |
static | normalizeCss ( $value) |
Normalize CSS into a format we can easily search for hostile input. | |
static | normalizeEntity ( $name) |
If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD, return the equivalent numeric entity reference (except for the core &lt; &gt; &amp; &quot;). | |
static | normalizeSectionNameWhitespace ( $section) |
Normalizes whitespace in a section name, such as might be returned by Parser::stripSectionName(), for use in the id's that are used for section links. | |
static | removeHTMLcomments ( $text) |
Remove '<!--', '-->', and everything between. | |
static | removeHTMLtags ( $text, $processCallback=null, $args=[], $extratags=[], $removetags=[], $warnCallback=null) |
Cleans up HTML, removes dangerous tags and attributes, and removes HTML comments. | |
static | safeEncodeAttribute ( $text) |
Encode an attribute value for HTML tags, with extra armoring against further wiki processing. | |
static | safeEncodeTagAttributes ( $assoc_array) |
Build a partial tag string from an associative array of attribute names and values as returned by decodeTagAttributes. | |
static | setupAttributeWhitelist () |
Foreach array key (an allowed HTML element), return an array of allowed attributes. | |
static | stripAllTags ( $html) |
Take a fragment of (potentially invalid) HTML and return a version with any tags removed, encoded as plain text. | |
static | validateAttributes ( $attribs, $whitelist) |
Take an array of attribute names and values and normalize or discard illegal values for the given whitelist. | |
static | validateEmail ( $addr) |
Does a string look like an e-mail address? | |
static | validateTag ( $params, $element) |
Takes attribute names and values for a tag and the tag name and validates that the tag is allowed to be present. | |
static | validateTagAttributes ( $attribs, $element) |
Take an array of attribute names and values and normalize or discard illegal values for the given element type. | |
Public Attributes | |
const | CHAR_REFS_REGEX |
Regular expression to match various types of character references in Sanitizer::normalizeCharReferences and Sanitizer::decodeCharReferences. | |
const | ELEMENT_BITS_REGEX = '!^(/?)([A-Za-z][^\t\n\v />\0]*+)([^>]*?)(/?>)([^<]*)$!' |
Acceptable tag name charset from HTML5 parsing spec https://www.w3.org/TR/html5/syntax.html#tag-open-state. | |
const | EVIL_URI_PATTERN = '!(^|\s|\*/\s*)(javascript|vbscript)([^\w]|$)!i' |
Blacklist for evil URIs such as javascript: URIs. WARNING: DO NOT use this in any place that actually requires blacklisting for security reasons. | |
const | ID_FALLBACK = 1 |
Tells escapeIdForAttribute() to encode the ID using the fallback encoding, or return false if no fallback is configured. | |
const | ID_PRIMARY = 0 |
Tells escapeIdForAttribute() to encode the ID using the wiki's primary encoding. | |
const | XMLNS_ATTRIBUTE_PATTERN = "/^xmlns:[:A-Z_a-z-.0-9]+$/" |
Static Private Member Functions | |
static | escapeIdInternal ( $id, $mode) |
Helper for escapeIdFor*() functions. | |
static | getTagAttributeCallback ( $set) |
Pick the appropriate attribute value from a match set from the attribs regex matches. | |
static | normalizeWhitespace ( $text) |
static | validateCodepoint ( $codepoint) |
Returns true if a given Unicode codepoint is a valid character in both HTML5 and XML. | |
Static Private Attributes | |
static | $attribNameRegex |
Lazy-initialised attribute name regex, see getAttribNameRegex() | |
static | $attribsRegex |
Lazy-initialised attributes regex, see getAttribsRegex() | |
static | $htmlEntities |
List of all named character entities defined in HTML 4.01 (https://www.w3.org/TR/html4/sgml/entities.html), as well as &apos;, which is only defined starting in XHTML1. | |
static | $htmlEntityAliases |
Character entity aliases accepted by MediaWiki. | |
HTML sanitizer for MediaWiki.
Definition at line 33 of file Sanitizer.php.
static Sanitizer::armorFrenchSpaces ( $text, $space='&#160;') |
Armor French spaces with a replacement character.
string | $text | Text to armor |
string | $space | Space character for the French spaces, defaults to '&#160;' |
Definition at line 1172 of file Sanitizer.php.
static Sanitizer::attributeWhitelist ( $element) |
Fetch the whitelist of acceptable attributes for a given element name.
string | $element |
Definition at line 1751 of file Sanitizer.php.
static Sanitizer::checkCss ( $value) |
Pick apart some CSS and check it for forbidden or unsafe structures.
Returns a sanitized string. This sanitized string will have character references and escape sequences decoded and comments stripped (unless it is itself one valid comment, in which case the value will be passed through). If the input is just too evil, only a comment complaining about evilness will be returned.
Currently, URL references, 'expression', and 'tps' are forbidden.
NOTE: Despite the fact that character references are decoded, the returned string may contain character references given certain clever input strings. These character references must be escaped before the return value is embedded in HTML.
string | $value |
Definition at line 1058 of file Sanitizer.php.
References $value.
static Sanitizer::cleanUrl ( $url) |
static Sanitizer::cleanUrlCallback ( $matches) |
array | $matches |
Definition at line 2085 of file Sanitizer.php.
References $matches.
static Sanitizer::cssDecodeCallback ( $matches) |
array | $matches |
Definition at line 1087 of file Sanitizer.php.
References $matches.
static Sanitizer::decCharReference ( $codepoint) |
static Sanitizer::decodeChar ( $codepoint) |
Return UTF-8 string for a codepoint if that is a valid character reference, otherwise U+FFFD REPLACEMENT CHARACTER.
int | $codepoint |
Definition at line 1718 of file Sanitizer.php.
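The fallback behaviour described above can be illustrated with a small stand-alone sketch. It is not the Sanitizer code: the function names are hypothetical, and the validity ranges are an assumption (tab, line feed, printable ASCII, and non-control, non-surrogate code points up to U+10FFFF) chosen to approximate "valid in both HTML5 and XML".

```php
<?php
// Hypothetical helpers for illustration only; not part of MediaWiki.

/** Assumed validity check: rejects most control characters and surrogates. */
function isValidCodepointSketch(int $cp): bool {
    return $cp === 0x09 || $cp === 0x0A
        || ($cp >= 0x20 && $cp <= 0x7E)
        || ($cp >= 0xA0 && $cp <= 0xD7FF)
        || ($cp >= 0xE000 && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
}

/** Encode a single code point as UTF-8 by hand, so no extension is needed. */
function codepointToUtf8Sketch(int $cp): string {
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

/** Return the UTF-8 character, or U+FFFD when the code point is rejected. */
function decodeCharSketch(int $cp): string {
    return isValidCodepointSketch($cp) ? codepointToUtf8Sketch($cp) : "\u{FFFD}";
}
```

Under these assumptions, a form feed (U+000C) or a lone surrogate (U+D800) yields the replacement character, while U+00E9 yields the two-byte UTF-8 sequence for "é".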
static Sanitizer::decodeCharReferences ( $text) |
Decode any character references, numeric or named entities, in the text and return a UTF-8 string.
string | $text |
Definition at line 1662 of file Sanitizer.php.
static Sanitizer::decodeCharReferencesAndNormalize ( $text) |
Decode any character references, numeric or named entities, in the text and normalize the resulting string.
(T16952)
This is useful for page titles, not for text to be displayed; MediaWiki allows HTML entities to escape normalization as a feature.
string | $text | Already normalized, containing entities |
Definition at line 1679 of file Sanitizer.php.
static Sanitizer::decodeCharReferencesCallback ( $matches) |
string | $matches |
Definition at line 1699 of file Sanitizer.php.
References $matches.
static Sanitizer::decodeEntity ( $name) |
If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD, return the UTF-8 encoding of that character.
Otherwise, returns the pseudo-entity source (e.g. "&foo;").
string | $name |
Definition at line 1734 of file Sanitizer.php.
References $name.
static Sanitizer::decodeTagAttributes ( $text) |
Return an associative array of attribute names and values from a partial tag string.
Attribute names are forced to lowercase, character references are decoded to UTF-8 text.
string | $text |
Definition at line 1443 of file Sanitizer.php.
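As a rough illustration of the behaviour described (lowercased names, decoded character references), here is a simplified stand-alone sketch. The function name and the regular expression are assumptions made for this example; the real method relies on getAttribsRegex(), and this sketch ignores valueless attributes.

```php
<?php
// Hypothetical, simplified attribute parser; not the Sanitizer implementation.
function decodeTagAttributesSketch(string $text): array {
    $attribs = [];
    // name = "double-quoted" | 'single-quoted' | unquoted
    $pattern = '/([A-Za-z_:][\w:.-]*)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))/';
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
    foreach ($matches as $set) {
        // Only one of the three value groups participates in a match;
        // trailing unmatched groups are absent from $set.
        $value = $set[4] ?? $set[3] ?? $set[2];
        $name = strtolower($set[1]);
        $attribs[$name] = html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    }
    return $attribs;
}

$parsed = decodeTagAttributesSketch(' CLASS="a&amp;b" id=\'x\' title=t');
// $parsed is ['class' => 'a&b', 'id' => 'x', 'title' => 't']
```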
static Sanitizer::encodeAttribute ( $text) |
Encode an attribute value for HTML output.
string | $text |
Definition at line 1149 of file Sanitizer.php.
static Sanitizer::escapeClass ( $class) |
Given a value, escape it so that it can be used as a CSS class and return it.
string | $class |
Definition at line 1411 of file Sanitizer.php.
static Sanitizer::escapeHtmlAllowEntities ( $html) |
Given HTML input, escape with htmlspecialchars but un-escape entities.
This allows (generally harmless) entities like &#160; to survive.
string | $html | HTML to escape |
Definition at line 1426 of file Sanitizer.php.
References $html.
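A comparable effect can be obtained with plain PHP by disabling double encoding in htmlspecialchars(). This is only a sketch of the idea with a hypothetical function name; it is not claimed to be how Sanitizer implements the method.

```php
<?php
// Hypothetical helper: escape markup but leave existing entities untouched.
function escapeKeepingEntitiesSketch(string $html): string {
    // The fourth argument (double_encode = false) keeps "&#160;" as-is
    // instead of turning it into "&amp;#160;".
    return htmlspecialchars($html, ENT_QUOTES, 'UTF-8', false);
}

$escaped = escapeKeepingEntitiesSketch('<b>a&#160;b</b> & "c"');
// $escaped is '&lt;b&gt;a&#160;b&lt;/b&gt; &amp; &quot;c&quot;'
```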
static Sanitizer::escapeId ( $id, $options=[]) |
Given a value, escape it so that it can be used in an id attribute and return it.
This will use HTML5 validation, allowing anything but ASCII whitespace.
To ensure we don't have to bother escaping anything, we also strip ', ". TODO: Is this the best tactic?
We also strip # because it upsets IE, and % because it could be ambiguous if it's part of something that looks like a percent escape (which don't work reliably in fragments cross-browser).
string | $id | Id to escape |
string | array | $options | String or array of strings (default is array()): 'noninitial': This is a non-initial fragment of an id, not a full id, so don't pay attention if the first character isn't valid at the beginning of an id. |
Definition at line 1254 of file Sanitizer.php.
static Sanitizer::escapeIdForAttribute ( $id, $mode=self::ID_PRIMARY) |
Given a section name or other user-generated or otherwise unsafe string, escapes it to be a valid HTML id attribute.
WARNING: unlike escapeId(), the output of this function is not guaranteed to be HTML safe, be sure to use proper escaping.
string | $id | String to escape |
int | $mode | One of ID_* constants, specifying whether the primary or fallback encoding should be used. |
Definition at line 1288 of file Sanitizer.php.
References $wgFragmentMode.
static Sanitizer::escapeIdForExternalInterwiki ( $id) |
Given a section name or other user-generated or otherwise unsafe string, escapes it to be a valid URL fragment for external interwikis.
string | $id | String to escape |
Definition at line 1338 of file Sanitizer.php.
References $wgExternalInterwikiFragmentMode.
static Sanitizer::escapeIdForLink ( $id) |
Given a section name or other user-generated or otherwise unsafe string, escapes it to be a valid URL fragment.
WARNING: unlike escapeId(), the output of this function is not guaranteed to be HTML safe, be sure to use proper escaping.
string | $id | String to escape |
Definition at line 1315 of file Sanitizer.php.
References $wgFragmentMode.
static private Sanitizer::escapeIdInternal ( $id, $mode) |
Helper for escapeIdFor*() functions.
Performs most of the actual escaping.
string | $id | String to escape |
string | $mode | One of modes from $wgFragmentMode |
Definition at line 1353 of file Sanitizer.php.
static Sanitizer::escapeIdReferenceList ( $referenceString) |
Given a string containing a space delimited list of ids, escape each id to match ids escaped by the escapeIdForAttribute() function.
string | $referenceString | Space delimited list of ids |
Definition at line 1384 of file Sanitizer.php.
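The split, escape, and rejoin pattern described above might look like the following sketch. The function name is hypothetical, and the per-id escaper is passed in as a callable because the real method depends on escapeIdForAttribute() and wiki configuration.

```php
<?php
// Hypothetical helper: apply an escaper to each id in a whitespace-separated list.
function escapeIdListSketch(string $referenceString, callable $escapeOne): string {
    // Split on runs of whitespace and drop empty pieces.
    $ids = preg_split('/\s+/', $referenceString, -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_map($escapeOne, $ids));
}

// Stand-in escaper for demonstration; a real caller would pass an id escaper.
$result = escapeIdListSketch('foo  bar baz', function (string $id): string {
    return 'x-' . $id;
});
// $result is 'x-foo x-bar x-baz'
```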
static Sanitizer::fixTagAttributes ( $text, $element, $sorted=false) |
Take a tag soup fragment listing an HTML element's attributes and normalize it to well-formed XML, discarding unwanted attributes.
Output is safe for further wikitext processing, with escaping of values that could trigger problems.
string | $text | |
string | $element | |
bool | $sorted | Whether to sort the attributes (default: false) |
Definition at line 1129 of file Sanitizer.php.
static Sanitizer::getAttribNameRegex () |
Used in Sanitizer::decodeTagAttributes to filter attributes.
Definition at line 385 of file Sanitizer.php.
static Sanitizer::getAttribsRegex () |
Regular expression to match HTML/XML attribute pairs within a tag.
Based on https://www.w3.org/TR/html5/syntax.html#before-attribute-name-state
Used in Sanitizer::decodeTagAttributes.
Definition at line 356 of file Sanitizer.php.
static Sanitizer::getRecognizedTagData ( $extratags=[], $removetags=[]) |
Return the various lists of recognized tags.
array | $extratags | For any extra tags to include |
array | $removetags | For any tags (default or extra) to exclude |
Definition at line 400 of file Sanitizer.php.
References $vars, and $wgAllowImageTag.
static private Sanitizer::getTagAttributeCallback ( $set) |
Pick the appropriate attribute value from a match set from the attribs regex matches.
array | $set |
MWException | When tag conditions are not met. |
Definition at line 1504 of file Sanitizer.php.
static Sanitizer::hackDocType () |
Hack up a private DOCTYPE with HTML's standard entity declarations.
PHP 4 seemed to know these if you gave it an HTML doctype, but PHP 5.1 doesn't.
Use for passing XHTML fragments to PHP's XML parsing functions.
Definition at line 2018 of file Sanitizer.php.
static Sanitizer::hexCharReference ( $codepoint) |
static Sanitizer::isReservedDataAttribute ( $attr) |
Given an attribute name, checks whether it is a reserved data attribute (such as data-mw-foo) which is unavailable to user-generated HTML so MediaWiki core and extension code can safely use it to communicate with frontend code.
string | $attr | Attribute name. |
Definition at line 908 of file Sanitizer.php.
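A prefix check of this kind can be sketched as below. The function name is hypothetical, and the pattern covers only the data-mw prefix mentioned in the description; the full set of reserved prefixes is not given here, so treat the pattern as an assumption.

```php
<?php
// Hypothetical check covering only the "data-mw" prefix named in the text above.
function looksReservedDataAttributeSketch(string $attr): bool {
    // Attribute names are case-insensitive in HTML, hence the /i flag.
    return preg_match('/^data-mw/i', $attr) === 1;
}

$reserved = looksReservedDataAttributeSketch('data-mw-foo'); // true
$userData = looksReservedDataAttributeSketch('data-foo');    // false
```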
static Sanitizer::mergeAttributes ( $a, $b) |
Merge two sets of HTML attributes.
Conflicting items in the second set will override those in the first, except for 'class' attributes which will be combined (if they're both strings).
array | $a | |
array | $b |
Definition at line 929 of file Sanitizer.php.
References $out.
Referenced by MediaWiki\EditPage\TextboxBuilder\mergeClassesIntoAttributes().
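The merge rule described above (second set wins, except that 'class' values are combined) can be sketched in stand-alone PHP. The function name is hypothetical, and details such as de-duplicating class names are assumptions made for this example.

```php
<?php
// Hypothetical illustration of the merge rule; not the Sanitizer code itself.
function mergeAttributesSketch(array $a, array $b): array {
    // For string keys, array_merge lets values from $b override those in $a.
    $out = array_merge($a, $b);
    if (isset($a['class'], $b['class'])
        && is_string($a['class']) && is_string($b['class'])
    ) {
        // Combine both class lists, dropping empty pieces and duplicates.
        $classes = preg_split(
            '/\s+/',
            $a['class'] . ' ' . $b['class'],
            -1,
            PREG_SPLIT_NO_EMPTY
        );
        $out['class'] = implode(' ', array_unique($classes));
    }
    return $out;
}

$merged = mergeAttributesSketch(
    ['class' => 'a b', 'id' => 'x'],
    ['class' => 'b c', 'id' => 'y']
);
// $merged is ['class' => 'a b c', 'id' => 'y']
```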
static Sanitizer::normalizeCharReferences ( $text) |
Ensure that any entities and character references are legal for XML and XHTML specifically.
Any stray bits will be &-escaped to result in a valid text fragment.
a. Named char refs can only be &lt; &gt; &amp; &quot;; others are numericized (this way we're well-formed even without a DTD).
b. Any numeric char refs must be legal chars, not invalid or forbidden.
c. Use lower-cased "&#x", not "&#X".
d. Fix or reject non-valid attributes.
string | $text |
Definition at line 1562 of file Sanitizer.php.
static Sanitizer::normalizeCharReferencesCallback ( $matches) |
static Sanitizer::normalizeCss ( $value) |
Normalize CSS into a format we can easily search for hostile input.
string | $value | the css string |
Definition at line 951 of file Sanitizer.php.
References $matches, $value, and StringUtils\delimiterReplace().
static Sanitizer::normalizeEntity ( $name) |
If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD, return the equivalent numeric entity reference (except for the core &lt; &gt; &amp; &quot;).
If the entity is a MediaWiki-specific alias, returns the HTML equivalent. Otherwise, returns HTML-escaped text of pseudo-entity source (e.g. &amp;foo;).
string | $name |
Definition at line 1599 of file Sanitizer.php.
References $name.
static Sanitizer::normalizeSectionNameWhitespace ( $section) |
Normalizes whitespace in a section name, such as might be returned by Parser::stripSectionName(), for use in the id's that are used for section links.
string | $section |
Definition at line 1543 of file Sanitizer.php.
References $section.
static private Sanitizer::normalizeWhitespace ( $text) |
static Sanitizer::removeHTMLcomments ( $text) |
Remove '<!--', '-->', and everything between.
To avoid leaving blank lines, when a comment is both preceded and followed by a newline (ignoring spaces), trim leading and trailing spaces and one of the newlines.
string | $text |
Definition at line 709 of file Sanitizer.php.
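The blank-line rule above can be approximated with two regular-expression passes, as in this sketch. It is a simplified illustration with a hypothetical function name, not the method's actual algorithm, and it leaves unterminated comments untouched.

```php
<?php
// Hypothetical regex-based approximation of the comment-stripping rule.
function removeCommentsSketch(string $text): string {
    // One complete comment; the lookahead stops the body at the first "-->".
    $comment = '<!--(?:(?!-->).)*-->';
    // Pass 1: a comment alone between two newlines. Remove the leading
    // newline, surrounding spaces, and the comment; keep the trailing newline.
    $text = preg_replace('/\n *' . $comment . ' *(?=\n)/s', '', $text);
    // Pass 2: remove any remaining (inline) comments.
    return preg_replace('/' . $comment . '/s', '', $text);
}

$block = removeCommentsSketch("a\n <!-- c --> \nb"); // "a\nb"
$inline = removeCommentsSketch('x <!-- c --> y');    // "x  y"
```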
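static Sanitizer::removeHTMLtags ( $text, $processCallback=null, $args=[], $extratags=[], $removetags=[], $warnCallback=null) |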
Cleans up HTML, removes dangerous tags and attributes, and removes HTML comments.
string | $text | |
callable | null | $processCallback | Callback to do any variable or parameter replacements in HTML attribute values |
array | bool | $args | Arguments for the processing callback |
array | $extratags | For any extra tags to include |
array | $removetags | For any tags (default or extra) to exclude |
callable | null | $warnCallback | (Deprecated) Callback allowing the addition of a tracking category when bad input is encountered. DO NOT ADD NEW PARAMETERS AFTER $warnCallback, since it will be removed shortly. |
Definition at line 497 of file Sanitizer.php.
References $args, $params, $t, MWTidy\isEnabled(), and wfDeprecated().
static Sanitizer::safeEncodeAttribute ( $text) |
Encode an attribute value for HTML tags, with extra armoring against further wiki processing.
string | $text |
Definition at line 1192 of file Sanitizer.php.
References $matches, and wfUrlProtocols().
static Sanitizer::safeEncodeTagAttributes ( $assoc_array) |
Build a partial tag string from an associative array of attribute names and values as returned by decodeTagAttributes.
array | $assoc_array |
Definition at line 1485 of file Sanitizer.php.
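Building a partial tag string from an associative array might look like the sketch below. The function name is hypothetical, and it uses plain htmlspecialchars() rather than the extra armoring of safeEncodeAttribute(); the leading space in the result is also an assumption.

```php
<?php
// Hypothetical helper: serialize name => value pairs as HTML attributes.
function encodeTagAttributesSketch(array $attribs): string {
    $parts = [];
    foreach ($attribs as $name => $value) {
        $encoded = htmlspecialchars((string)$value, ENT_QUOTES, 'UTF-8');
        $parts[] = $name . '="' . $encoded . '"';
    }
    // A leading space lets the result be appended directly after a tag name.
    return $parts ? ' ' . implode(' ', $parts) : '';
}

$tag = '<span' . encodeTagAttributesSketch(['class' => 'a b', 'title' => '1 < 2']) . '>';
// $tag is '<span class="a b" title="1 &lt; 2">'
```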
static Sanitizer::setupAttributeWhitelist () |
Foreach array key (an allowed HTML element), return an array of allowed attributes.
Definition at line 1761 of file Sanitizer.php.
static Sanitizer::stripAllTags ( $html) |
Take a fragment of (potentially invalid) HTML and return a version with any tags removed, encoded as plain text.
Warning: this return value must be further escaped for literal inclusion in HTML output as of 1.10!
string | $html | HTML fragment |
Definition at line 1993 of file Sanitizer.php.
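The warning above is easy to overlook, so here is a sketch of why the extra escaping matters. It uses PHP's built-in strip_tags() and html_entity_decode() as stand-ins with a hypothetical function name; it does not reproduce how the method handles invalid HTML.

```php
<?php
// Hypothetical stand-in: strip tags and decode entities to plain text.
function stripAllTagsSketch(string $html): string {
    return html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$plain = stripAllTagsSketch('<p>1 &lt; 2 <b>ok</b></p>');
// $plain is '1 < 2 ok': plain text containing a raw "<", so it is not HTML-safe.

// Escape again before placing the text back into HTML output.
$safe = htmlspecialchars($plain, ENT_QUOTES, 'UTF-8');
// $safe is '1 &lt; 2 ok'
```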
static Sanitizer::validateAttributes ( $attribs, $whitelist) |
Take an array of attribute names and values and normalize or discard illegal values for the given whitelist.
array | $attribs | |
array | $whitelist | List of allowed attribute names |
Todo: Check for legal values where the DTD limits things.
Todo: Check for unique id attribute :P
Definition at line 811 of file Sanitizer.php.
References $attribs, $out, $value, and wfUrlProtocols().
static private Sanitizer::validateCodepoint ( $codepoint) |
Returns true if a given Unicode codepoint is a valid character in both HTML5 and XML.
int | $codepoint |
Definition at line 1643 of file Sanitizer.php.
static Sanitizer::validateEmail ( $addr) |
Does a string look like an e-mail address?
This validates an email address using an HTML5 specification found at http://www.whatwg.org/html/states-of-the-type-attribute.html#valid-e-mail-address, which as of 2011-01-24 says:
A valid e-mail address is a string that matches the ABNF production 1*( atext / "." ) "@" ldh-str *( "." ldh-str ) where atext is defined in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section 3.5.
This function is an implementation of the specification as requested in T24449.
Client-side forms will use the same standard validation rules via JS or HTML 5 validation; additional restrictions can be enforced server-side by extensions via the 'isValidEmailAddr' hook.
Note that this validation doesn't 100% match RFC 2822, but is believed to be liberal enough for wide use. Some invalid addresses will still pass validation here.
string | $addr | E-mail address |
Definition at line 2117 of file Sanitizer.php.
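The ABNF production quoted above translates fairly directly into a regular expression. The following is a sketch with a hypothetical function name, assuming atext means the RFC 5322 printable characters and ldh-str means letters, digits, and hyphens; it omits the 'isValidEmailAddr' hook and is not the method's own code.

```php
<?php
// Hypothetical regex rendering of: 1*( atext / "." ) "@" ldh-str *( "." ldh-str )
function looksLikeEmailSketch(string $addr): bool {
    // atext per RFC 5322 section 3.2.3, plus "." as the production allows.
    $local = '[A-Za-z0-9!#$%&\'*+\/=?^_`{|}~.-]+';
    // ldh-str per RFC 1034 section 3.5: letters, digits, hyphens.
    $label = '[A-Za-z0-9-]+';
    // The D modifier makes "$" match only at the very end of the string.
    $pattern = '/^' . $local . '@' . $label . '(?:\.' . $label . ')*$/D';
    return preg_match($pattern, $addr) === 1;
}

$ok = looksLikeEmailSketch('user.name+tag@sub.example.org'); // true
$bad = looksLikeEmailSketch('no-at-sign');                    // false
```

As the note above says, such a pattern is deliberately liberal: an address like `a@b` passes even though it is unlikely to be deliverable.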
static Sanitizer::validateTag ( $params, $element) |
Takes attribute names and values for a tag and the tag name and validates that the tag is allowed to be present.
This DOES NOT validate the attributes, nor does it validate the tags themselves. This method only handles the special circumstances where we may want to allow a tag within content but ONLY when it has specific attributes set.
string | $params | |
string | $element |
Definition at line 755 of file Sanitizer.php.
References $params.
static Sanitizer::validateTagAttributes ( $attribs, $element) |
Take an array of attribute names and values and normalize or discard illegal values for the given element type.
array | $attribs | |
string | $element |
Todo: Check for legal values where the DTD limits things.
Todo: Check for unique id attribute :P
Definition at line 791 of file Sanitizer.php.
References $attribs.
static private Sanitizer::$attribNameRegex |
Lazy-initialised attribute name regex, see getAttribNameRegex()
Definition at line 379 of file Sanitizer.php.
static private Sanitizer::$attribsRegex |
Lazy-initialised attributes regex, see getAttribsRegex()
Definition at line 348 of file Sanitizer.php.
static private Sanitizer::$htmlEntities |
List of all named character entities defined in HTML 4.01 (https://www.w3.org/TR/html4/sgml/entities.html), as well as &apos;, which is only defined starting in XHTML1.
Definition at line 81 of file Sanitizer.php.
static private Sanitizer::$htmlEntityAliases |
Character entity aliases accepted by MediaWiki.
Definition at line 340 of file Sanitizer.php.
const Sanitizer::CHAR_REFS_REGEX |
Regular expression to match various types of character references in Sanitizer::normalizeCharReferences and Sanitizer::decodeCharReferences.
Definition at line 38 of file Sanitizer.php.
const Sanitizer::ELEMENT_BITS_REGEX = '!^(/?)([A-Za-z][^\t\n\v />\0]*+)([^>]*?)(/?>)([^<]*)$!' |
Acceptable tag name charset from HTML5 parsing spec https://www.w3.org/TR/html5/syntax.html#tag-open-state.
Definition at line 48 of file Sanitizer.php.
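To see what the five capture groups hold, the pattern can be applied to the text that follows a "<" character. That calling convention is an assumption made for this sketch; the pattern itself is copied verbatim from the constant above.

```php
<?php
// Pattern copied from the ELEMENT_BITS_REGEX constant shown above.
$elementBits = '!^(/?)([A-Za-z][^\t\n\v />\0]*+)([^>]*?)(/?>)([^<]*)$!';

// Assumed input: the text after a "<", up to (not including) the next "<".
preg_match($elementBits, 'div class="x">hello', $open);
// $open[1] = ''            (no leading slash, so an opening tag)
// $open[2] = 'div'         (tag name)
// $open[3] = ' class="x"'  (raw attribute text)
// $open[4] = '>'           (tag close, '/>' for self-closing tags)
// $open[5] = 'hello'       (text after the tag)

preg_match($elementBits, '/div>tail', $close);
// $close[1] = '/', $close[2] = 'div', $close[5] = 'tail'
```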
const Sanitizer::EVIL_URI_PATTERN = '!(^|\s|\*/\s*)(javascript|vbscript)([^\w]|$)!i' |
Blacklist for evil URIs such as javascript: URIs.
WARNING: DO NOT use this in any place that actually requires blacklisting for security reasons.
There are NUMEROUS ways to bypass blacklisting; the only way to be secure from javascript: URI-based XSS vectors is to whitelist things that you know are safe and deny everything else.
Definition at line 58 of file Sanitizer.php.
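The warning can be made concrete with a short experiment. The pattern below is copied from the constant above; the sample inputs are made up for illustration, and the entity-obfuscated one shows a single bypass among many possible ones.

```php
<?php
// Pattern copied from the EVIL_URI_PATTERN constant shown above.
$evilUri = '!(^|\s|\*/\s*)(javascript|vbscript)([^\w]|$)!i';

// Obvious cases are caught, regardless of letter case.
$plainHit = preg_match($evilUri, 'javascript:alert(1)'); // 1
$mixedHit = preg_match($evilUri, 'JaVaScRiPt :x');       // 1

// Harmless URLs do not match.
$httpsHit = preg_match($evilUri, 'https://example.org/'); // 0

// An entity-encoded "s" slips past the pattern; if something later decodes
// the entity, the value becomes a javascript: URI after the check has run.
$obfuscatedHit = preg_match($evilUri, 'java&#115;cript:alert(1)'); // 0
```

This is why the text above recommends whitelisting known-safe schemes instead of relying on this pattern.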
const Sanitizer::ID_FALLBACK = 1 |
Tells escapeIdForAttribute() to encode the ID using the fallback encoding, or return false if no fallback is configured.
Definition at line 74 of file Sanitizer.php.
const Sanitizer::ID_PRIMARY = 0 |
Tells escapeIdForAttribute() to encode the ID using the wiki's primary encoding.
Definition at line 66 of file Sanitizer.php.
const Sanitizer::XMLNS_ATTRIBUTE_PATTERN = "/^xmlns:[:A-Z_a-z-.0-9]+$/" |
Definition at line 59 of file Sanitizer.php.