proofreadpage
— ProofreadPage Extension#
Objects used with ProofreadPage Extension.
OCR support of page scans via:
Wikimedia OCR, see: https://www.mediawiki.org/wiki/Help:Extension:Wikisource/Wikimedia_OCR
https://ocr.wmcloud.org/, inspired by https://wikisource.org/wiki/MediaWiki:GoogleOCR.js
- class proofreadpage.FullHeader(text=None)[source]#
Bases:
object
Header of a ProofreadPage object.
- Parameters:
text (str | None)
- TEMPLATE_V1 = '<pagequality level="{0.ql}" user="{0.user}" /><div class="pagetext">{0.header}\n\n\n'#
- TEMPLATE_V2 = '<pagequality level="{0.ql}" user="{0.user}" />{0.header}'#
- p_header = re.compile('<pagequality level="(?P<ql>\\d)" user="(?P<user>.*?)" />(?P<has_div><div class="pagetext">)?(?P<header>.*)', re.DOTALL)#
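As an illustration of how the two templates and p_header relate, the sketch below formats a header with each template and parses it back. The patterns are copied from the attributes above; the `fields` object and its values are made up for the example, and this is not necessarily how FullHeader uses them internally.

```python
import re
from types import SimpleNamespace

# Copied from the class attributes documented above.
TEMPLATE_V1 = ('<pagequality level="{0.ql}" user="{0.user}" />'
               '<div class="pagetext">{0.header}\n\n\n')
TEMPLATE_V2 = '<pagequality level="{0.ql}" user="{0.user}" />{0.header}'
p_header = re.compile(
    r'<pagequality level="(?P<ql>\d)" user="(?P<user>.*?)" />'
    r'(?P<has_div><div class="pagetext">)?(?P<header>.*)',
    re.DOTALL)

# Hypothetical field values; any object with ql, user and header works,
# because the templates use attribute access ({0.ql} and so on).
fields = SimpleNamespace(ql=3, user='Alice', header='{{rh|1}}')

text_v2 = TEMPLATE_V2.format(fields)
match_v2 = p_header.search(text_v2)

text_v1 = TEMPLATE_V1.format(fields)
match_v1 = p_header.search(text_v1)

# has_div is only set for the older V1 layout with the pagetext div.
print(match_v2.group('ql'), match_v2.group('user'), match_v2.group('has_div'))
print(repr(match_v1.group('has_div')))
```

Note that ql is captured as a string, so a caller would need to convert it to int before comparing it with the quality level constants.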
- class proofreadpage.IndexPage(source, title='')[source]#
Bases:
Page
Index page used in the MediaWiki ProofreadPage extension.
Instantiate an IndexPage object.
In this class:
- page number is either the number in the page title in the Page namespace, if the Wikisource site adopts this convention (e.g. page_number is 12 for Page:Popular Science Monthly Volume 1.djvu/12), or the sequential number of the page among those linked from the index section of the Index page, if the index is built via transclusion of a list of pages (e.g. as on de wikisource).
- page label is the label associated with a page in the Index page.
This class provides methods to get the pages contained in an Index page, together with their page numbers and labels, by means of several helper functions.
It also provides a generator over the pages contained in an Index page, with the option to restrict the range and to filter by quality level and page existence.
- Raises:
UnknownExtensionError – source Site has no ProofreadPage Extension.
ImportError – bs4 is not installed.
- Parameters:
source (PageSourceType)
title (str)
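The first page-number convention described above (the number in the page title) can be sketched with plain string handling. The helper name `page_number_from_title` is hypothetical and not part of this module; it only illustrates the convention and does not cover indexes built via transclusion.

```python
def page_number_from_title(title: str) -> int:
    """Return the number after the last '/' in a Page namespace title.

    Hypothetical helper for illustration only.
    """
    _, sep, number = title.rpartition('/')
    if not sep or not number.isdigit():
        raise ValueError(f'no page number found in title {title!r}')
    return int(number)


number = page_number_from_title(
    'Page:Popular Science Monthly Volume 1.djvu/12')
print(number)
```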
- INDEX_TEMPLATE = ':MediaWiki:Proofreadpage_index_template'#
- _get_prp_index_pagelist()[source]#
Get all pages in an IndexPage page list.
Note
This method is called by the initializer and should not be called directly.
- get_label_from_page(*args, **kwargs)[source]#
- Parameters:
self (IndexPage)
args (Any)
kwargs (Any)
- Return type:
Any
- get_label_from_page_number(*args, **kwargs)[source]#
- Parameters:
self (IndexPage)
args (Any)
kwargs (Any)
- Return type:
Any
- get_number(*args, **kwargs)[source]#
- Parameters:
self (IndexPage)
args (Any)
kwargs (Any)
- Return type:
Any
- get_page(*args, **kwargs)[source]#
- Parameters:
self (IndexPage)
args (Any)
kwargs (Any)
- Return type:
Any
- get_page_from_label(*args, **kwargs)[source]#
- Parameters:
self (IndexPage)
args (Any)
kwargs (Any)
- Return type:
Any
- get_page_number_from_label(*args, **kwargs)[source]#
- Parameters:
self (IndexPage)
args (Any)
kwargs (Any)
- Return type:
Any
- has_valid_content()[source]#
Test that the page contains only a single call to the index template.
- Return type:
bool
- property num_pages: Any#
- page_gen(start=1, end=None, filter_ql=None, only_existing=False)[source]#
Return a page generator which yields pages contained in Index page.
Range is [start … end], extremes included.
Changed in version 9.0: The content parameter was removed
- Parameters:
start (int) – first page, defaults to 1
end (int | None) – last page; num_pages is used if end is None
filter_ql (Sequence[int] | None) – quality levels to include; if None, all levels except ‘Without Text’ are included.
only_existing (bool) – if True, yield only existing pages.
- Return type:
Iterable[pywikibot.page.Page]
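The range and filter semantics documented for page_gen can be modelled without a wiki connection. The function below is a stand-alone sketch, not the library's implementation: `select_page_numbers` and its `quality_levels` argument (a list where position i holds the quality level of page i + 1) are assumptions made for the example, and only_existing is not modelled.

```python
from typing import Iterator, Optional, Sequence

WITHOUT_TEXT = 0
ALL_LEVELS = [0, 1, 2, 3, 4]


def select_page_numbers(quality_levels: Sequence[int],
                        start: int = 1,
                        end: Optional[int] = None,
                        filter_ql: Optional[Sequence[int]] = None
                        ) -> Iterator[int]:
    """Yield page numbers in [start, end] whose level is in filter_ql."""
    if end is None:
        # Mirrors "num_pages if end is None".
        end = len(quality_levels)
    if filter_ql is None:
        # Mirrors the default: all levels but 'Without Text'.
        filter_ql = [ql for ql in ALL_LEVELS if ql != WITHOUT_TEXT]
    # The range is inclusive at both ends.
    for number in range(start, end + 1):
        if quality_levels[number - 1] in filter_ql:
            yield number


levels = [0, 1, 3, 4, 2]  # hypothetical levels for pages 1..5
print(list(select_page_numbers(levels)))
print(list(select_page_numbers(levels, start=2, end=4, filter_ql=[3, 4])))
```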
- pages(*args, **kwargs)[source]#
- Parameters:
self (IndexPage)
args (Any)
kwargs (Any)
- Return type:
Any
- class proofreadpage.PagesTagParser(text='<pages />')[source]#
Bases:
Container
Parser for tag <pages />.
Parse text and extract the first <pages ... /> tag. Individual attributes will be accessible with dot notation.
>>> tp = PagesTagParser('<pages />')
>>> tp
PagesTagParser('<pages />')
>>> tp = PagesTagParser(
...     'Text: <pages index="Index.pdf" from="first" to="last" />')
>>> tp
PagesTagParser('<pages index="Index.pdf" from="first" to="last" />')
Attributes can be modified via dot notation. If an attribute is a number, it is converted to int.
Note
from is represented as ffrom due to conflict with keyword.
>>> tp.ffrom = 1; tp.to = '"3"'
>>> tp.ffrom
1
>>> tp.to
3
Quotes are stripped in the value and added back in the str representation.
Note
Quotes are not mandatory.
>>> tp
PagesTagParser('<pages index="Index.pdf" from=1 to="3" />')
Attributes can be added via dot notation. Order is fixed (same order as attribute definition in the class).
>>> tp.fromsection = '"A"'
>>> tp.fromsection
'A'
>>> tp
PagesTagParser('<pages index="Index.pdf" from=1 to="3" fromsection="A" />')
Attributes can be deleted.
>>> del tp.fromsection
>>> tp
PagesTagParser('<pages index="Index.pdf" from=1 to="3" />')
Attribute presence can be checked.
>>> 'to' in tp
True
>>> 'step' in tp
False
Added in version 8.0.
Changed in version 8.1: text parameter is defaulted to '<pages />'.
- exclude#
A descriptor tag.
Added in version 8.0.
- ffrom#
A descriptor tag.
Added in version 8.0.
- fromsection#
A descriptor tag.
Added in version 8.0.
- header#
A descriptor tag.
Added in version 8.0.
- include#
A descriptor tag.
Added in version 8.0.
- index#
A descriptor tag.
Added in version 8.0.
- onlysection#
A descriptor tag.
Added in version 8.0.
- pat_attr = re.compile('(index=|from=|to=|include=|exclude=|step=|header=|fromsection=|tosection=|onlysection=)')#
- pat_tag = re.compile('<pages (?P<attrs>[^/]*?)/>')#
- step#
A descriptor tag.
Added in version 8.0.
- to#
A descriptor tag.
Added in version 8.0.
- tokens = ('index', 'from', 'to', 'include', 'exclude', 'step', 'header', 'fromsection', 'tosection', 'onlysection')#
- tosection#
A descriptor tag.
Added in version 8.0.
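To show what pat_tag and pat_attr match, the sketch below applies both patterns (copied from the attributes above) to a sample string and collects the raw attribute values. The helper `split_pages_attrs` is hypothetical; PagesTagParser may combine the patterns differently and additionally handles quotes and int conversion.

```python
import re
from typing import Dict

# Copied from the class attributes documented above.
pat_attr = re.compile(
    '(index=|from=|to=|include=|exclude=|step=|header=|'
    'fromsection=|tosection=|onlysection=)')
pat_tag = re.compile('<pages (?P<attrs>[^/]*?)/>')


def split_pages_attrs(text: str) -> Dict[str, str]:
    """Return raw attribute values of the first <pages ... /> tag."""
    match = pat_tag.search(text)
    if match is None:
        return {}
    # Splitting on a capturing group keeps the attribute names:
    # ['', 'index=', '"Index.pdf" ', 'from=', '"first" ', ...]
    parts = pat_attr.split(match.group('attrs'))
    names = parts[1::2]
    values = parts[2::2]
    return {name.rstrip('='): value.strip()
            for name, value in zip(names, values)}


attrs = split_pages_attrs(
    'Text: <pages index="Index.pdf" from="first" to="last" />')
print(attrs)
```

The values are returned with their quotes intact, which matches the note above that quotes are stripped only when a value is read through the parser.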
- class proofreadpage.ProofreadPage(source, title='')[source]#
Bases:
Page
ProofreadPage page used in MediaWiki ProofreadPage extension.
Instantiate a ProofreadPage object.
- Raises:
UnknownExtensionError – source Site has no ProofreadPage Extension.
- Parameters:
source (PageSourceType)
title (str)
- NOT_PROOFREAD = 1#
- PROBLEMATIC = 2#
- PROOFREAD = 3#
- PROOFREAD_LEVELS = [0, 1, 2, 3, 4]#
- VALIDATED = 4#
- WITHOUT_TEXT = 0#
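For readers who want symbolic names when logging or filtering, the quality level constants can be mirrored in an enumeration. `QualityLevel` is a hypothetical helper and is not defined by this module; the values are taken from the constants listed above.

```python
from enum import IntEnum


class QualityLevel(IntEnum):
    """Hypothetical mirror of the ProofreadPage quality constants."""

    WITHOUT_TEXT = 0
    NOT_PROOFREAD = 1
    PROBLEMATIC = 2
    PROOFREAD = 3
    VALIDATED = 4


# Members compare equal to the plain integers used by the class.
print(QualityLevel(3).name)
print([int(level) for level in QualityLevel])
```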
- property body: Any#
- close_tag = '</noinclude>'#
- property header: Any#
- property index: IndexPage | None#
Get the Index page which contains ProofreadPage.
If many Index pages link to this ProofreadPage and the ProofreadPage is titled Page:<index title>/<page number>, the Index page with the same title is returned. Otherwise, when multiple Index pages are linked, None is returned.
To force a reload, delete index and access it again.
- Returns:
the Index page for this ProofreadPage
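The selection rule described for index can be sketched on plain title strings. This is a simplified model under stated assumptions, not the library's code: `choose_index_title` is a hypothetical name, titles are assumed to carry a single namespace prefix such as Page: or Index:, and the behaviour for zero or exactly one linked Index page (return None, or that page) is an assumption.

```python
from typing import Optional, Sequence


def choose_index_title(page_title: str,
                       index_titles: Sequence[str]) -> Optional[str]:
    """Pick the Index title matching Page:<index title>/<page number>."""
    if not index_titles:
        return None  # assumption: nothing linked, nothing to return
    if len(index_titles) == 1:
        return index_titles[0]  # assumption: a single link is unambiguous
    # Drop the namespace prefix and the trailing '/<page number>'.
    _, _, name = page_title.partition(':')
    base, sep, _ = name.rpartition('/')
    if not sep:
        return None
    for title in index_titles:
        if title.partition(':')[2] == base:
            return title
    return None


chosen = choose_index_title(
    'Page:Foo.djvu/12', ['Index:Bar.djvu', 'Index:Foo.djvu'])
print(chosen)
```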
- ocr(ocr_tool=None)[source]#
Do OCR of ProofreadPage scan.
The text returned by this function shall be assigned to body, otherwise the ProofreadPage format will not be maintained.
Warning
It is the user’s responsibility to reset quality level accordingly.
Changed in version 9.2: default for ocr_tool is wmfOCR.
Removed in version 9.2: phetools support is not available anymore.
- Parameters:
ocr_tool (str | None) – ‘wmfOCR’ or ‘googleOCR’; default is ‘wmfOCR’
- Returns:
OCR text for the page.
- Raises:
TypeError – wrong ocr_tool keyword arg.
ValueError – something went wrong with OCR process.
- Return type:
str
- open_tag = '<noinclude>'#
- p_close = re.compile('(</div>|\\n\\n\\n)?</noinclude>')#
- p_close_no_div = re.compile('</noinclude>')#
- p_open = re.compile('<noinclude>')#
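The open and close patterns suggest how a page in the Page namespace is laid out: a header and a footer, each wrapped in noinclude tags, around the body. The sketch below uses p_open and p_close (copied from above) to split a sample text; `split_page_text` and the sample wikitext are assumptions for illustration and may differ from how ProofreadPage decomposes text internally.

```python
import re
from typing import Tuple

# Copied from the class attributes documented above.
p_open = re.compile(r'<noinclude>')
p_close = re.compile(r'(</div>|\n\n\n)?</noinclude>')


def split_page_text(text: str) -> Tuple[str, str, str]:
    """Split text into (header, body, footer) around noinclude sections."""
    opens = list(p_open.finditer(text))
    closes = list(p_close.finditer(text))
    if len(opens) < 2 or len(closes) < 2:
        raise ValueError('expected a header and a footer noinclude section')
    # First section is the header, last section is the footer.
    header = text[opens[0].end():closes[0].start()]
    body = text[closes[0].end():opens[-1].start()]
    footer = text[opens[-1].end():closes[-1].start()]
    return header, body, footer


sample = ('<noinclude><pagequality level="3" user="Alice" /></noinclude>'
          'Body text'
          '<noinclude><references/></noinclude>')
header, body, footer = split_page_text(sample)
print(header, body, footer, sep=' | ')
```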
- property pre_summary: str#
Return trailing part of edit summary.
The edit summary shall be appended to pre_summary to highlight Status in the edit summary on wiki.
- property ql: Any#
- property quality_level: int#
Return the quality level of this page when it is retrieved from API.
This is only applicable if contentmodel equals ‘proofread-page’. None is returned otherwise.
This property is read-only and is applicable only when page is loaded. If quality level is overwritten during page processing, this property is no longer necessarily aligned with the new value.
In this way, no text parsing is necessary to check quality level when fetching a page.
- save(*args, **kwargs)[source]#
Save page content after recomposing the page.
- Parameters:
args (Any)
kwargs (Any)
- Return type:
None
- property status: Any#
- property text: str#
Override text property.
Preload text returned by EditFormPreloadText to preload non-existing pages.
- property url_image: str#
Get the file url of the scan of ProofreadPage.
- Returns:
file url of the scan of ProofreadPage or None.
For MW version < 1.40:
- Raises:
Exception – in case of http errors.
ImportError – if bs4 is not installed, _bs4_soup() will raise.
ValueError – in case of no prp_page_image src found for scan.
- property user: Any#
- class proofreadpage.PurgeRequest(**kwargs)[source]#
Bases:
Request
Subclass of Request which skips the check on write rights.
Workaround for T128994.
Monkeypatch action in Request initializer.
- Parameters:
kwargs (Any)
- class proofreadpage.TagAttr(attr, value)[source]#
Bases:
object
Tag attribute of <pages />.
Represent a single attribute. It is used internally in PagesTagParser and shall not be used stand-alone.
It manages string formatting output and conversion str <-> int and quotes. Input value can only be str or int and shall have quotes or nothing.
>>> a = TagAttr('to', 3.0)
Traceback (most recent call last):
  ...
TypeError: value=3.0 must be str or int.
>>> a = TagAttr('to', 'A123"')
Traceback (most recent call last):
  ...
ValueError: value=A123" has wrong quotes.
>>> a = TagAttr('to', 3)
>>> a
TagAttr('to', 3)
>>> str(a)
'to=3'
>>> a.attr
'to'
>>> a.value
3
>>> a = TagAttr('to', '3')
>>> a
TagAttr('to', '3')
>>> str(a)
'to=3'
>>> a.attr
'to'
>>> a.value
3
>>> a = TagAttr('to', '"3"')
>>> a
TagAttr('to', '"3"')
>>> str(a)
'to="3"'
>>> a.value
3
>>> a = TagAttr('to', "'3'")
>>> a
TagAttr('to', "'3'")
>>> str(a)
"to='3'"
>>> a.value
3
>>> a = TagAttr('to', 'A123')
>>> a
TagAttr('to', 'A123')
>>> str(a)
'to=A123'
>>> a.value
'A123'
Added in version 8.0.
- property value#
Attribute value.
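The value rules shown in the TagAttr examples (type check, matching quotes, quote stripping, int conversion) can be summarised in one function. `normalize_attr_value` is a hypothetical re-implementation written to agree with the documented examples; it is not the class's actual code and may differ in edge cases.

```python
from typing import Union


def normalize_attr_value(value: Union[str, int]) -> Union[str, int]:
    """Return value without quotes, converted to int when numeric."""
    if not isinstance(value, (str, int)):
        raise TypeError(f'value={value} must be str or int.')
    if isinstance(value, int):
        return value
    quotes = ('"', "'")
    opening = value[:1] if value[:1] in quotes else ''
    closing = value[-1:] if value[-1:] in quotes else ''
    # Quotes must be present on both sides (and identical) or absent.
    if opening != closing:
        raise ValueError(f'value={value} has wrong quotes.')
    stripped = value[len(opening):len(value) - len(closing)]
    return int(stripped) if stripped.isdigit() else stripped


print(normalize_attr_value('"3"'), normalize_attr_value('A123'))
```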