textlib — Changing Wikitext
Functions for manipulating wiki-text.
Unless otherwise noted, all functions take a unicode string as the argument and return a unicode string.
- class textlib.TimeStripper(site=None)[source]#
Bases: object
Find timestamp in page and return it as pywikibot.Timestamp object.
- static fix_digits(line)[source]#
Convert non-Latin digits (e.g. Persian digits) to Latin digits so they can be parsed.
Deprecated since version 7.0: Use to_latin_digits() instead.
- timestripper(line)[source]#
Find timestamp in line and convert it to time zone aware datetime.
All the following items must be matched, otherwise None is returned: year, month, hour, time, day, minute, tzinfo.
Changed in version 7.6: HTML parts are removed from line
- Returns:
A timestamp found on the given line
- Parameters:
line (str) –
- Return type:
Optional[Timestamp]
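A minimal usage sketch for timestripper(); the page line, user name and timestamp below are made up, and the timestamp format that is matched depends on the site's language:

import pywikibot
from pywikibot.textlib import TimeStripper

site = pywikibot.Site('en', 'wikipedia')
ts = TimeStripper(site)
# hypothetical talk page line ending with an English signature timestamp
line = 'Some comment. [[User:Example|Example]] 10:12, 15 June 2022 (UTC)'
when = ts.timestripper(line)  # pywikibot.Timestamp, or None if nothing matches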
- textlib.add_text(text, add, *, site=None)[source]#
Add text to a page content above categories and interwiki.
New in version 6.4.
- Parameters:
text (str) – The page content to add text to.
add (str) – Text to add.
site (pywikibot.Site) – The site that the text is coming from. Required to reorder categories and interlanguage links. The default site is used otherwise.
- Return type:
str
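A short usage sketch for add_text(); the stub template is only a placeholder:

import pywikibot
from pywikibot import textlib

site = pywikibot.Site('en', 'wikipedia')
text = 'Some article text.\n\n[[Category:Examples]]\n[[de:Beispiel]]'
# the added text is placed above the category and interlanguage links,
# not at the very end of the page
new_text = textlib.add_text(text, '{{Stub}}', site=site)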
- textlib.case_escape(case, string)[source]#
Return an escaped regex pattern which depends on ‘first-letter’ case.
New in version 7.0.
- Parameters:
case (str) – if case is ‘first-letter’ the regex contains an upper/lower case set for the first letter
string (str) –
- Return type:
str
- textlib.categoryFormat(categories, insite=None)[source]#
Return a string containing links to all categories in a list.
- Parameters:
categories (iterable) – A list of Category or Page objects or strings which can be either the raw name, [[Category:..]] or [[cat_localised_ns:…]].
insite (pywikibot.Site) – Used to localise the category namespace.
- Returns:
String of categories
- Return type:
str
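A usage sketch for categoryFormat(); the category names are made up and the exact separator between the links may differ:

import pywikibot
from pywikibot import textlib

site = pywikibot.Site('en', 'wikipedia')
cats = ['Examples', '[[Category:Test pages]]']  # raw names and full links both work
wikitext = textlib.categoryFormat(cats, insite=site)
# expected to be something like '[[Category:Examples]]\n[[Category:Test pages]]'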
- textlib.compileLinkR(withoutBracketed=False, onlyBracketed=False)[source]#
Return a regex that matches external links.
- Parameters:
withoutBracketed (bool) –
onlyBracketed (bool) –
- textlib.does_text_contain_section(pagetext, section)[source]#
Determine whether the page text contains the given section title.
Spaces and underscores in the section string are treated as equivalent; either will match.
If a section parameter contains an internal link, it will match the section with or without a preceding colon which is required for a text link e.g. for categories and files.
- Parameters:
pagetext (str) – The wikitext of a page
section (str) – a section of a page including wikitext markups
- Return type:
bool
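An illustrative sketch of does_text_contain_section() with made-up page text:

from pywikibot import textlib

pagetext = 'Intro.\n\n== External links ==\n* [https://example.org Example]\n'
textlib.does_text_contain_section(pagetext, 'External links')  # True
textlib.does_text_contain_section(pagetext, 'External_links')  # also True, underscores match spaces
textlib.does_text_contain_section(pagetext, 'References')      # False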
- textlib.expandmarker(text, marker='', separator='')[source]#
Return the marker expanded with the separator and whitespace found directly before it.
It searches for the first occurrence of the marker and gets the combination of the separator and whitespace directly before it.
- Parameters:
text (str) – the text which will be searched.
marker (str) – the marker to be searched.
separator (str) – the separator string allowed before the marker. If empty, preceding whitespace is not included either.
- Returns:
the marker with the separator and whitespace from the text in front of it. It’ll be just the marker if the separator is empty.
- Return type:
str
- textlib.extract_sections(text, site=None)[source]#
Return section headings and contents found in text.
- Returns:
The returned namedtuple contains the text parsed into header, contents and footer parts: the header part is a string containing the text above the first heading; the footer part is also a string, containing the text after the last section; the sections part is a list of tuples, each tuple containing a string with the section heading and a string with the section content. Example article:

'''A''' is a thing.

== History of A ==
Some history...

== Usage of A ==
Some usage...

[[Category:Things starting with A]]

…is parsed into the following namedtuple:

result = extract_sections(text, site)
result.header = "'''A''' is a thing."
result.sections = [('== History of A ==', 'Some history...'),
                   ('== Usage of A ==', 'Some usage...')]
result.footer = '[[Category:Things starting with A]]'
- Parameters:
text (str) –
- Return type:
_Content
New in version 3.0.
- textlib.extract_templates_and_params(text, remove_disabled_parts=False, strip=False)[source]#
Return a list of templates found in text.
Return value is a list of tuples. There is one tuple for each use of a template in the page, with the template title as the first entry and a dict of parameters as the second entry. Parameters are indexed by strings; as in MediaWiki, an unnamed parameter is given a parameter name with an integer value corresponding to its position among the unnamed parameters, and if this results in multiple parameters with the same name, only the last value provided will be returned.
This uses the package mwparserfromhell or wikitextparser as MediaWiki markup parser. It is mandatory that one of them is installed. There are minor differences between the two implementations.
The parser packages preserve whitespace in parameter names and values.
If there are multiple numbered parameters in the wikitext for the same position, MediaWiki will only use the last parameter value, e.g. {{a| foo | 2 <!-- --> = bar | baz }} is {{a|1=foo|2=baz}}. To replicate that behaviour, enable both the remove_disabled_parts and strip parameters.
- Parameters:
text (str) – The wikitext from which templates are extracted
remove_disabled_parts (bool) – If enabled, remove disabled wikitext such as comments and pre.
strip (bool) – If enabled, strip arguments and values of templates.
- Returns:
list of template name and params
- Return type:
List[Tuple[str, OrderedDict[str, str]]]
Changed in version 6.1: wikitextparser package is supported; either wikitextparser or mwparserfromhell is strictly recommended.
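A small sketch of extract_templates_and_params(); the template and values are invented, and the exact whitespace in the result depends on the strip parameter and on which parser package is installed:

from pywikibot import textlib

wikitext = '{{Infobox person|name=Ada Lovelace|1815}}'
templates = textlib.extract_templates_and_params(wikitext, strip=True)
# roughly: [('Infobox person', OrderedDict([('name', 'Ada Lovelace'), ('1', '1815')]))]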
- textlib.extract_templates_and_params_regex_simple(text)[source]#
Extract top-level templates with params using only a simple regex.
This function uses only a single regex, and returns an entry for each template called at the top-level of the wikitext. Nested templates are included in the argument values of the top-level template.
This method will incorrectly split arguments when an argument value contains a ‘|’, such as {{template|a={{b|c}} }}.
- Parameters:
text (str) – The wikitext from which templates are extracted
- Returns:
list of template name and params
- Return type:
list of tuple of name and OrderedDict
- textlib.findmarker(text, startwith='@@', append=None)[source]#
Find a string which is not part of text.
- Parameters:
text (str) –
startwith (str) –
append (Optional[str]) –
- Return type:
str
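A brief sketch of findmarker(); the second result is only indicative (the marker is extended with the append string until it no longer occurs in the text):

from pywikibot import textlib

textlib.findmarker('plain text')          # '@@', the default marker is unused in the text
textlib.findmarker('text containing @@')  # a longer marker such as '@@@'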
- textlib.getCategoryLinks(text, site=None, include=None, expand_text=False)[source]#
Return a list of category links found in text.
- Parameters:
include (Optional[List[str]]) – list of tags which should not be removed by removeDisabledParts() and where CategoryLinks can be searched.
text (str) –
expand_text (bool) –
- Returns:
all category links found
- Return type:
List[Category]
- textlib.getLanguageLinks(text, insite=None, template_subpage=False)[source]#
Return a dict of inter-language links found in text.
The returned dict uses Site objects as keys and Page objects as values. It does not contain its own site.
Do not call this routine directly, use Page.interwiki() method instead.
- Parameters:
text (str) –
template_subpage (bool) –
- Return type:
Dict
- textlib.glue_template_and_params(template_and_params)[source]#
Return wiki text of template glued from params.
You can use items from extract_templates_and_params here to get an equivalent template wiki text (it may happen that the order of the params changes).
- Return type:
str
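A usage sketch of glue_template_and_params() with invented parameters; the exact layout of the produced wikitext (newlines between parameters) is only indicative:

from collections import OrderedDict
from pywikibot import textlib

params = OrderedDict([('url', 'https://example.org'), ('title', 'Example')])
textlib.glue_template_and_params(('cite web', params))
# expected to be along the lines of '{{cite web\n|url=https://example.org\n|title=Example\n}}'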
- textlib.ignore_case(string)[source]#
Return a case-insensitive pattern for the string.
Changed in version 7.2: _ignore_case becomes a public method
- Parameters:
string (str) –
- Return type:
str
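An indicative example of ignore_case(); characters without a case distinction are kept as they are:

from pywikibot import textlib

textlib.ignore_case('page')   # expected: '[Pp][Aa][Gg][Ee]'
textlib.ignore_case('3rfc1')  # digits stay unchanged, e.g. '3[Rr][Ff][Cc]1'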
- textlib.interwikiFormat(links, insite=None)[source]#
Convert interwiki link dict into a wikitext string.
- Parameters:
links (dict) – interwiki links to be formatted, given as a dict with Site objects as keys and Page or Link objects as values
insite (BaseSite) – site the interwiki links will be formatted for (defaulting to the current site).
- Returns:
string including wiki links formatted for inclusion in insite
- Return type:
str
- textlib.interwikiSort(sites, insite=None)[source]#
Sort sites according to local interwiki sort logic.
- textlib.isDisabled(text, index, tags=None)[source]#
Return True if text[index] is disabled, e.g. by a comment or nowiki tags.
For the tags parameter, see removeDisabledParts().
- Parameters:
text (str) –
index (int) –
- Return type:
bool
- textlib.reformat_ISBNs(text, match_func)[source]#
Reformat ISBNs.
- Parameters:
text (str) – text containing ISBNs
match_func (callable) – function to reformat matched ISBNs
- Returns:
reformatted text
- Return type:
str
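A sketch of reformat_ISBNs() with a hypothetical match_func that removes hyphens; match_func receives a regular-expression match for the ISBN and returns the replacement string:

from pywikibot import textlib

def strip_hyphens(match):
    # the whole matched ISBN, e.g. '0-306-40615-2'
    return match.group().replace('-', '')

textlib.reformat_ISBNs('See ISBN 0-306-40615-2.', strip_hyphens)
# expected: 'See ISBN 0306406152.'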
- textlib.removeCategoryLinks(text, site=None, marker='')[source]#
Return text with all category links removed.
- Parameters:
text (str) – The text that needs to be modified.
site (pywikibot.Site) – The site that the text is coming from.
marker (str) – If defined, marker is placed after the last category link, or at the end of text if there are no category links.
- Returns:
The modified text.
- Return type:
str
- textlib.removeCategoryLinksAndSeparator(text, site=None, marker='', separator='')[source]#
Return text with category links and preceding separators removed.
- Parameters:
text (str) – The text that needs to be modified.
site (pywikibot.Site) – The site that the text is coming from.
marker (str) – If defined, marker is placed after the last category link, or at the end of text if there are no category links.
separator (str) – The separator string that will be removed if followed by the category links.
- Returns:
The modified text
- Return type:
str
- textlib.removeDisabledParts(text, tags=None, include=None, site=None)[source]#
Return text without portions where wiki markup is disabled.
Parts that will be removed by default are:
- HTML comments
- nowiki tags
- pre tags
- includeonly tags
- source and syntaxhighlight tags
Changed in version 7.0: the order of removals will correspond to the tags argument if provided as an ordered collection (list, tuple)
- Parameters:
tags (Optional[Iterable]) – The exact set of parts which should be removed using keywords from textlib._get_regexes().
include (Optional[Container]) – Alternatively, the default parts that shall not be removed.
site (Optional[BaseSite]) – Site to be used for site-dependent regexes. Default disabled parts listed above do not need it.
text (str) –
- Returns:
text stripped from disabled parts.
- Return type:
str
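An indicative example of removeDisabledParts() with made-up text (the exact remaining whitespace is not shown):

from pywikibot import textlib

text = 'kept <!-- a comment --> kept <nowiki>[[not a link]]</nowiki> kept'
cleaned = textlib.removeDisabledParts(text)
# the comment and the nowiki part are stripped; pre, includeonly,
# source and syntaxhighlight parts would be removed as well by default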
- textlib.removeHTMLParts(text, keeptags=None)[source]#
Return text without portions where HTML markup is disabled.
Parts that can/will be removed are: HTML and all wiki tags.
The exact set of parts which should NOT be removed can be passed as the ‘keeptags’ parameter, which defaults to [‘tt’, ‘nowiki’, ‘small’, ‘sup’].
- Parameters:
text (str) –
keeptags (Optional[List[str]]) –
- Return type:
str
- textlib.removeLanguageLinks(text, site=None, marker='')[source]#
Return text with all inter-language links removed.
If a link to an unknown language is encountered, a warning is printed.
- Parameters:
text (str) – The text that needs to be modified.
site (pywikibot.Site) – The site that the text is coming from.
marker (str) – If defined, marker is placed after the last language link, or at the end of text if there are no language links.
- Returns:
The modified text.
- Return type:
str
- textlib.removeLanguageLinksAndSeparator(text, site=None, marker='', separator='')[source]#
Return text with inter-language links and preceding separators removed.
If a link to an unknown language is encountered, a warning is printed.
- Parameters:
text (str) – The text that needs to be modified.
site (pywikibot.Site) – The site that the text is coming from.
marker (str) – If defined, marker is placed after the last language link, or at the end of text if there are no language links.
separator (str) – The separator string that will be removed if followed by the language links.
- Returns:
The modified text
- Return type:
str
- textlib.replaceCategoryInPlace(oldtext, oldcat, newcat, site=None, add_only=False)[source]#
Replace old category with new one and return the modified text.
- Parameters:
oldtext – Content of the old category
oldcat – pywikibot.Category object of the old category
newcat – pywikibot.Category object of the new category
add_only (bool) – If add_only is True, the old category won’t be replaced and the category given will be added after it.
- Returns:
the modified text
- Return type:
str
- textlib.replaceCategoryLinks(oldtext, new, site=None, addOnly=False)[source]#
Replace all existing category links with new category links.
- Parameters:
oldtext (str) – The text that needs to be replaced.
new (iterable) – Should be a list of Category objects or strings which can be either the raw name or [[Category:..]].
site (pywikibot.Site) – The site that the text is from.
addOnly (bool) – If addOnly is True, the old category won’t be deleted and the category(s) given will be added (and they won’t replace anything).
- Returns:
The modified text.
- Return type:
str
- textlib.replaceExcept(text, old, new, exceptions, caseInsensitive=False, allowoverlap=False, marker='', site=None, count=0)[source]#
Return text with ‘old’ replaced by ‘new’, ignoring specified types of text.
Skips occurrences of ‘old’ within exceptions; e.g., within nowiki tags or HTML comments. If caseInsensitive is true, then use case insensitive regex matching. If allowoverlap is true, overlapping occurrences are all replaced (watch out when using this, it might lead to infinite loops!).
- Parameters:
text (str) – text to be modified
old – a compiled or uncompiled regular expression
new – a unicode string (which can contain regular expression references), or a function which takes a match object as parameter. See parameter repl of re.sub().
exceptions (list) – a list of strings or already compiled regex objects which signal what to leave out. Strings might be like [‘math’, ‘table’, ‘template’] for example.
marker (str) – a string that will be added to the last replacement; if nothing is changed, it is added at the end
count (int) – how many replacements to do at most. See parameter count of re.sub().
caseInsensitive (bool) –
allowoverlap (bool) –
- Return type:
str
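A usage sketch of replaceExcept() on made-up text; only occurrences outside the listed exceptions are replaced:

import pywikibot
from pywikibot import textlib

site = pywikibot.Site('en', 'wikipedia')
text = 'The color red. <!-- keep color --> <nowiki>color</nowiki>'
new_text = textlib.replaceExcept(
    text, r'\bcolor\b', 'colour',
    exceptions=['comment', 'nowiki'],
    site=site)
# only the first occurrence becomes 'colour'; the ones inside the
# comment and the nowiki tag are left untouched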
- textlib.replaceLanguageLinks(oldtext, new, site=None, addOnly=False, template=False, template_subpage=False)[source]#
Replace inter-language links in the text with a new set of links.
- Parameters:
oldtext (str) – The text that needs to be modified.
new (dict) – A dict with the Site objects as keys, and Page or Link objects as values (i.e., just like the dict returned by getLanguageLinks function).
site (pywikibot.Site) – The site that the text is from.
addOnly (bool) – If True, do not remove old language links, only add new ones.
template (bool) – Indicates if text belongs to a template page or not.
template_subpage (bool) – Indicates if text belongs to a template sub-page or not.
- Returns:
The modified text.
- Return type:
str
- textlib.replace_links(text, replace, site)[source]#
Replace wikilinks selectively.
The text is searched for a link and on each link it replaces the text depending on the result for that link. If the result is just None it skips that link. When it’s False it unlinks it and just inserts the label. When it is a Link instance it’ll use the target, section and label from that Link instance. If it’s a Page instance it’ll use just the target from the replacement and the section and label from the original link.
If it’s a string and the replacement was a sequence it converts it into a Page instance. If the replacement is done via a callable it’ll use it like unlinking and directly replace the link with the text itself. It only supports unicode when used by the callable and bytes are not allowed.
If either the section or the label should be used, the replacement can be a function which returns a Link instance and copies the value which should remain.
Changed in version 7.0: the site parameter is mandatory
- Parameters:
text (str) – the text in which to replace links
replace (sequence of pywikibot.Page/pywikibot.Link/str or callable) – either a callable which reacts like described above. The callable must accept four parameters link, text, groups, rng and allows for user interaction. The groups are a dict containing ‘title’, ‘section’, ‘label’ and ‘linktrail’ and the rng are the start and end position of the link. The ‘label’ in groups contains everything after the first pipe which might contain additional data which is used in File namespace for example. Alternatively it can be a sequence containing two items where the first must be a Link or Page and the second has almost the same meaning as the result by the callable. It’ll convert that into a callable where the first item (the Link or Page) has to be equal to the found link and in that case it will apply the second value from the sequence.
site (BaseSite) – a Site object to use. It should match the origin or target site of the text
- Raises:
TypeError – missing positional argument ‘site’
ValueError – Wrong site type
ValueError – Wrong replacement number
ValueError – Wrong replacement types
- Return type:
str
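A sketch of replace_links() with a hypothetical callable that unlinks every link to a page called 'Sandbox' and leaves all other links alone:

import pywikibot
from pywikibot import textlib

def unlink_sandbox(link, text, groups, rng):
    # returning False unlinks (only the label is kept),
    # returning None leaves the link unchanged
    return False if link.title == 'Sandbox' else None

site = pywikibot.Site('en', 'wikipedia')
text = 'See [[Sandbox|the sandbox]] and [[Help:Contents]].'
new_text = textlib.replace_links(text, unlink_sandbox, site)
# roughly: 'See the sandbox and [[Help:Contents]].'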
- textlib.to_latin_digits(phrase, langs=None)[source]#
Change non-latin digits to latin digits.
New in version 7.0.
- Parameters:
phrase (str) – The phrase to convert to Latin numerals.
langs (Optional[Union[Sequence[str], str]]) – Language codes. If langs parameter is None, use all known languages to convert.
- Returns:
The string with latin digits
- Return type:
str
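An indicative example of to_latin_digits() using Persian digits:

from pywikibot import textlib

textlib.to_latin_digits('۱۲۳')              # expected: '123'
textlib.to_latin_digits('۱۲۳', langs='fa')  # restrict the conversion to one language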
- textlib.to_local_digits(phrase, lang)[source]#
Change Latin digits based on language to localized version.
Be aware that this function only works for several languages, and that it returns an unchanged string if an unsupported language is given.
Changed in version 7.5: always return a string even if phrase is an int.
- Parameters:
phrase (Union[str, int]) – The phrase to convert to localized numerals
lang (str) – language code
- Returns:
The localized version
- Return type:
str
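An indicative example of to_local_digits(); languages without localized digits return the input unchanged:

from pywikibot import textlib

textlib.to_local_digits(123, 'fa')     # expected: '۱۲۳'
textlib.to_local_digits('2022', 'en')  # no localized digits for English, so '2022'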