proofreadpage - ProofreadPage Extension
Objects used with ProofreadPage Extension.
This module includes objects:
ProofreadPage(Page)
FullHeader
IndexPage(Page)
OCR support of page scans via:
- phetools: https://phetools.toolforge.org/hocr_cgi.py and https://phetools.toolforge.org/ocr.php (inspired by https://en.wikisource.org/wiki/MediaWiki:Gadget-ocr.js)
- Wikimedia OCR (see: https://www.mediawiki.org/wiki/Help:Extension:Wikisource/Wikimedia_OCR)
- Google OCR (inspired by https://wikisource.org/wiki/MediaWiki:GoogleOCR.js)
- class proofreadpage.FullHeader(text=None)
Bases:
object
Header of a ProofreadPage object.
- Parameters:
text (Optional[str]) –
- TEMPLATE_V1 = '<pagequality level="{0.ql}" user="{0.user}" /><div class="pagetext">{0.header}\n\n\n'
- TEMPLATE_V2 = '<pagequality level="{0.ql}" user="{0.user}" />{0.header}'
- p_header = re.compile('<pagequality level="(?P<ql>\\d)" user="(?P<user>.*?)" />(?P<has_div><div class="pagetext">)?(?P<header>.*)', re.DOTALL)
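The header pattern and template listed above can be tried out in isolation. The following is a minimal sketch, not the class's actual implementation: the pattern and `TEMPLATE_V2` are copied from the attribute listing, while `parse_header`, `render_header`, and the sample text are hypothetical names introduced only for illustration.

```python
import re
from types import SimpleNamespace

# Pattern and template copied from the FullHeader attribute listing.
P_HEADER = re.compile(
    r'<pagequality level="(?P<ql>\d)" user="(?P<user>.*?)" />'
    r'(?P<has_div><div class="pagetext">)?(?P<header>.*)',
    re.DOTALL)
TEMPLATE_V2 = '<pagequality level="{0.ql}" user="{0.user}" />{0.header}'


def parse_header(text):
    """Split a header string into its fields (hypothetical helper)."""
    match = P_HEADER.search(text)
    if match is None:
        raise ValueError('no <pagequality> tag found in header text')
    return SimpleNamespace(
        ql=int(match.group('ql')),
        user=match.group('user'),
        has_div=bool(match.group('has_div')),
        header=match.group('header'),
    )


def render_header(parsed):
    """Recompose the header in the V2 layout (hypothetical helper)."""
    return TEMPLATE_V2.format(parsed)


# Made-up sample header in the V2 layout (no <div class="pagetext">).
sample = '<pagequality level="3" user="Alice" />{{rh|12|TITLE|}}'
parsed = parse_header(sample)
print(parsed.ql, parsed.user, parsed.has_div)
print(render_header(parsed) == sample)
```

The sketch only round-trips the V2 layout; how the class treats the legacy V1 `<div>` wrapper is not described in this reference.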
- class proofreadpage.IndexPage(source, title='')
Bases:
Page
Index page used in the MediaWiki ProofreadPage extension.
Instantiate an IndexPage object.
In this class:
- page number is the number in the page title in the Page namespace, if the Wikisource site adopts this convention (e.g. page_number is 12 for Page:Popular Science Monthly Volume 1.djvu/12), or the sequential number of the pages linked from the index section of the Index page, if the index is built via transclusion of a list of pages (e.g. as on de wikisource).
- page label is the label associated with a page in the Index page.
This class provides methods to get the pages contained in an Index page, along with the corresponding page numbers and labels, by means of several helper functions.
It also provides a generator over the pages contained in an Index page, with the option to define a range and to filter by quality level and page existence.
- Raises:
UnknownExtensionError – source Site has no ProofreadPage Extension.
ImportError – bs4 is not installed.
- INDEX_TEMPLATE = ':MediaWiki:Proofreadpage_index_template'
- _get_prp_index_pagelist()
Get all pages in an IndexPage page list.
Note
This method is called by the initializer and should not be called directly.
- get_label_from_page(*args, **kwargs)
- Parameters:
self (IndexPage) –
args (Any) –
kwargs (Any) –
- Return type:
Any
- get_label_from_page_number(*args, **kwargs)
- Parameters:
self (IndexPage) –
args (Any) –
kwargs (Any) –
- Return type:
Any
- get_number(*args, **kwargs)
- Parameters:
self (IndexPage) –
args (Any) –
kwargs (Any) –
- Return type:
Any
- get_page(*args, **kwargs)
- Parameters:
self (IndexPage) –
args (Any) –
kwargs (Any) –
- Return type:
Any
- get_page_from_label(*args, **kwargs)
- Parameters:
self (IndexPage) –
args (Any) –
kwargs (Any) –
- Return type:
Any
- get_page_number_from_label(*args, **kwargs)
- Parameters:
self (IndexPage) –
args (Any) –
kwargs (Any) –
- Return type:
Any
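The distinction between page numbers and page labels that the helper methods above rely on can be pictured as a pair of lookups. The sketch below is a toy model under stated assumptions, not the library's data structure: the page list is invented, `build_numbers_by_label` is a hypothetical helper, and it assumes that one label may be shared by several page numbers.

```python
from collections import defaultdict

# Hypothetical page list of an Index page: page number -> page label.
labels_by_number = {1: 'Cover', 2: 'i', 3: 'ii', 4: '1', 5: '2', 6: '2'}


def build_numbers_by_label(labels_by_number):
    """Invert the mapping; a label may map to several page numbers."""
    numbers_by_label = defaultdict(set)
    for number, label in labels_by_number.items():
        numbers_by_label[label].add(number)
    return dict(numbers_by_label)


numbers_by_label = build_numbers_by_label(labels_by_number)

# Number -> label is a plain lookup; label -> numbers yields a set.
print(labels_by_number[2])
print(sorted(numbers_by_label['2']))
```

Keeping both directions precomputed makes each lookup a single dictionary access, at the cost of building the inverse mapping once.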
- has_valid_content()
Test whether the page contains only a single call to the index template.
- Return type:
bool
- property num_pages: Any
- page_gen(start=1, end=None, filter_ql=None, only_existing=False, content=True)
Return a page generator which yields pages contained in Index page.
The range is [start … end], endpoints included.
- Parameters:
start (int) – first page, defaults to 1
end (Optional[int]) – last page; defaults to num_pages if None
filter_ql (Optional[Sequence[int]]) – quality levels to include; if None, all levels except ‘Without Text’.
only_existing (bool) – yields only existing pages.
content (bool) – preload content.
- Return type:
Iterable[Page]
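The selection rules documented for page_gen (inclusive range, default quality filter, optional existence check) can be modelled without a wiki connection. This is a simplified sketch of those rules only, not the method's implementation: `ToyPage` and `toy_page_gen` are hypothetical, the quality-level values are the ProofreadPage constants from this reference, and the `content` preloading parameter is left out.

```python
from dataclasses import dataclass

# Quality levels as listed for ProofreadPage (0 to 4).
WITHOUT_TEXT, NOT_PROOFREAD, PROBLEMATIC, PROOFREAD, VALIDATED = range(5)


@dataclass
class ToyPage:
    """Hypothetical stand-in for a page of an Index."""
    number: int
    ql: int
    exists: bool = True


def toy_page_gen(pages, start=1, end=None, filter_ql=None,
                 only_existing=False):
    """Yield pages following the documented page_gen selection rules."""
    if end is None:
        end = len(pages)  # stands in for num_pages
    if filter_ql is None:
        # Documented default: every level except 'Without Text'.
        filter_ql = [NOT_PROOFREAD, PROBLEMATIC, PROOFREAD, VALIDATED]
    for page in pages:
        if not start <= page.number <= end:  # both endpoints included
            continue
        if page.ql not in filter_ql:
            continue
        if only_existing and not page.exists:
            continue
        yield page


pages = [
    ToyPage(1, WITHOUT_TEXT),
    ToyPage(2, NOT_PROOFREAD, exists=False),
    ToyPage(3, PROOFREAD),
    ToyPage(4, VALIDATED),
    ToyPage(5, PROOFREAD),
]
print([p.number for p in toy_page_gen(pages)])
print([p.number for p in toy_page_gen(pages, start=2, end=4,
                                      only_existing=True)])
```

In the toy model every page carries a quality level, including non-existing ones; how the real generator treats missing pages when a quality filter is active is not stated in this reference.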
- pages(*args, **kwargs)
- Parameters:
self (IndexPage) –
args (Any) –
kwargs (Any) –
- Return type:
Any
- class proofreadpage.ProofreadPage(source, title='')
Bases:
Page
ProofreadPage page used in the MediaWiki ProofreadPage extension.
Instantiate a ProofreadPage object.
- Raises:
UnknownExtensionError – source Site has no ProofreadPage Extension.
- NOT_PROOFREAD = 1
- PROBLEMATIC = 2
- PROOFREAD = 3
- PROOFREAD_LEVELS = [0, 1, 2, 3, 4]
- VALIDATED = 4
- WITHOUT_TEXT = 0
- property body: Any
- close_tag = '</noinclude>'
- property header: Any
- property index: Optional[IndexPage]
Get the Index page which contains ProofreadPage.
If several Index pages link to this ProofreadPage, and the ProofreadPage is titled Page:<index title>/<page number>, the Index page with the same title is returned. Otherwise, None is returned when multiple Index pages link to it.
To force a reload, delete the index attribute and access it again.
- Returns:
the Index page for this ProofreadPage
- ocr(ocr_tool=None)
Run OCR on the ProofreadPage scan.
The text returned by this function shall be assigned to self.body, otherwise the ProofreadPage format will not be maintained.
It is the user’s responsibility to reset quality level accordingly.
- Parameters:
ocr_tool (Optional[str]) – ‘phetools’, ‘wmfOCR’ or ‘googleOCR’; default is ‘phetools’
- Returns:
OCR text for the page.
- Raises:
TypeError – wrong ocr_tool keyword arg.
ValueError – something went wrong with OCR process.
- Return type:
str
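The workflow described for ocr() (assign the returned text to body, then reset the quality level yourself) might look like the sketch below. It is illustrative only: `fill_body_from_ocr` and `StubPage` are hypothetical, the stub replaces a real ProofreadPage so the example runs without a wiki connection, and it assumes the `ql` property is writable, which this reference does not state.

```python
NOT_PROOFREAD = 1  # value of ProofreadPage.NOT_PROOFREAD in this reference


def fill_body_from_ocr(page, ocr_tool='phetools', new_ql=NOT_PROOFREAD):
    """Run OCR, store the result in the body, and reset the quality level.

    `page` is expected to offer ocr(), body and ql as documented for
    ProofreadPage; assigning to ql is an assumption of this sketch.
    """
    text = page.ocr(ocr_tool=ocr_tool)  # may raise TypeError or ValueError
    page.body = text      # keeps the ProofreadPage format intact
    page.ql = new_ql      # the caller is responsible for the quality level
    return text


class StubPage:
    """Hypothetical stand-in for a ProofreadPage, for demonstration only."""

    def __init__(self):
        self.body = ''
        self.ql = 0

    def ocr(self, ocr_tool=None):
        return f'text recognised by {ocr_tool}'


stub = StubPage()
fill_body_from_ocr(stub, ocr_tool='wmfOCR')
print(stub.body, stub.ql)
```

With a real page, a call to save() would still be needed afterwards to write the change to the wiki.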
- open_tag = '<noinclude>'
- p_close = re.compile('(</div>|\\n\\n\\n)?</noinclude>')
- p_close_no_div = re.compile('</noinclude>')
- p_open = re.compile('<noinclude>')
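The open and close patterns above suggest how page text can be taken apart into a header, a body, and a footer. The following is a rough sketch under assumptions, not the class's own decomposition logic: the patterns are copied from the listing, `split_page` is a hypothetical helper, and the layout (a leading and a trailing `<noinclude>` section around the body) is assumed from the tags and sample text used here.

```python
import re

# Patterns copied from the ProofreadPage attribute listing.
P_OPEN = re.compile(r'<noinclude>')
P_CLOSE = re.compile(r'(</div>|\n\n\n)?</noinclude>')


def split_page(text):
    """Return (header, body, footer) of a page text (hypothetical helper)."""
    opens = list(P_OPEN.finditer(text))
    closes = list(P_CLOSE.finditer(text))
    if len(opens) < 2 or len(closes) < 2:
        raise ValueError('expected a header and a footer <noinclude> section')
    # First section is the header, last section is the footer.
    header = text[opens[0].end():closes[0].start()]
    body = text[closes[0].end():opens[-1].start()]
    footer = text[opens[-1].end():closes[-1].start()]
    return header, body, footer


# Made-up sample page text.
sample = ('<noinclude><pagequality level="1" user="Bob" />head</noinclude>'
          'Body text.'
          '<noinclude>foot</noinclude>')
header, body, footer = split_page(sample)
print(repr(header))
print(repr(body))
print(repr(footer))
```

Because the close pattern optionally consumes a preceding `</div>` or three newlines, those trailing markers are kept out of the extracted header and footer.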
- property pre_summary: str
Return trailing part of edit summary.
The edit summary shall be appended to pre_summary to highlight Status in the edit summary on wiki.
- property ql: Any
- property quality_level: int
Return the quality level of this page when it is retrieved from API.
This is only applicable if contentmodel equals ‘proofread-page’. None is returned otherwise.
This property is read-only and is applicable only when page is loaded. If quality level is overwritten during page processing, this property is no longer necessarily aligned with the new value.
In this way, no text parsing is necessary to check quality level when fetching a page.
- save(*args, **kwargs)
Save page content after recomposing the page.
- Parameters:
args (Any) –
kwargs (Any) –
- Return type:
None
- property status: Any
- property text: str
Override text property.
For non-existing pages, the text returned by EditFormPreloadText is preloaded.
- property url_image: str
Get the file URL of the scan of the ProofreadPage.
- Returns:
file URL of the scan of the ProofreadPage, or None.
- Raises:
Exception – in case of http errors
ImportError – if bs4 is not installed (raised by _bs4_soup())
ValueError – in case of no prp_page_image src found for scan
- property user: Any
- class proofreadpage.PurgeRequest(**kwargs)
Bases:
Request
Subclass of Request which skips the check on write rights.
Workaround for T128994.
Monkeypatch action in Request initializer.
- Parameters:
kwargs (Any) –