mwparserfromhell Package¶
mwparserfromhell (the MediaWiki Parser from Hell) is a Python package that provides an easy-to-use and outrageously powerful parser for MediaWiki wikicode.
compat
Module¶
Implements support for both Python 2 and Python 3 by defining common types in terms of their Python 2/3 variants. For example, str is set to unicode on Python 2 but str on Python 3; likewise, bytes is str on Python 2 but bytes on Python 3. These types are meant to be imported directly from within the parser’s modules.
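As a rough illustration of the aliasing idea described above, the sketch below picks a text type and a binary type based on the running interpreter. The names PY3, text_type, binary_type, and is_text are hypothetical and chosen for this example; the module itself is described as rebinding str and bytes rather than introducing new names.

```python
import sys

# Hypothetical alias names; the module described above rebinds `str` and
# `bytes` instead of introducing new names like these.
PY3 = sys.version_info[0] >= 3

if PY3:
    text_type = str      # Unicode text
    binary_type = bytes  # raw bytes
else:
    text_type = unicode  # noqa: F821 -- this name exists only on Python 2
    binary_type = str


def is_text(value):
    """Return True if value is Unicode text on the running interpreter."""
    return isinstance(value, text_type)
```

Code that imports such aliases can test isinstance(value, text_type) without branching on the Python version at every call site.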
definitions
Module¶
Contains data about certain markup, like HTML tags and external links.
When updating this file, please also update the C tokenizer version: mwparserfromhell/parser/ctokenizer/definitions.c and mwparserfromhell/parser/ctokenizer/definitions.h
-
mwparserfromhell.definitions.
get_html_tag
(markup)[source]¶ Return the HTML tag associated with the given wiki-markup.
-
mwparserfromhell.definitions.
is_parsable
(tag)[source]¶ Return whether the given tag’s contents should be passed to the parser.
-
mwparserfromhell.definitions.
is_visible
(tag)[source]¶ Return whether or not the given tag contains visible text.
-
mwparserfromhell.definitions.
is_single
(tag)[source]¶ Return whether or not the given tag can exist without a close tag.
smart_list
Module¶
This module contains the SmartList type, as well as its _ListProxy child, which together implement a list whose sublists reflect changes made to the main list, and vice-versa.
-
class
mwparserfromhell.smart_list.
SmartList
(iterable=None)[source]¶ Bases:
mwparserfromhell.smart_list._SliceNormalizerMixIn
,list
Implements the list interface with special handling of sublists.
When a sublist is created (by list[i:j]), any changes made to this list (such as the addition, removal, or replacement of elements) will be reflected in the sublist, or vice-versa, to the greatest degree possible. This is implemented by having sublists (instances of the _ListProxy type) dynamically determine their elements by storing their slice info and retrieving that slice from the parent. Methods that change the size of the list also change the slice info. For example:
    >>> parent = SmartList([0, 1, 2, 3])
    >>> parent
    [0, 1, 2, 3]
    >>> child = parent[2:]
    >>> child
    [2, 3]
    >>> child.append(4)
    >>> child
    [2, 3, 4]
    >>> parent
    [0, 1, 2, 3, 4]
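The slice-tracking idea can be sketched with a much smaller stand-in. SliceView below is a hypothetical class written for this illustration, not the library's _ListProxy: it stores only a start and stop offset, reads its elements from the parent on demand, and adjusts its stop offset when it grows. It does not handle changes made directly to the parent, which the real types are described as supporting.

```python
class SliceView:
    """Hypothetical, simplified view onto parent[start:stop]."""

    def __init__(self, parent, start, stop):
        self._parent = parent
        self._start = start
        self._stop = stop

    def render(self):
        # Elements are never stored here; they are read from the parent.
        return self._parent[self._start:self._stop]

    def append(self, item):
        # Insert at the end of the viewed range, then widen the range.
        self._parent.insert(self._stop, item)
        self._stop += 1

    def __len__(self):
        return self._stop - self._start

    def __repr__(self):
        return repr(self.render())


parent = [0, 1, 2, 3]
child = SliceView(parent, 2, len(parent))
child.append(4)
```

After the append, both the view and the parent list contain the new element, because the view only ever wrote to the parent.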
-
pop
([index]) → item -- remove and return item at index (default last).[source]¶ Raises IndexError if list is empty or index is out of range.
-
-
class
mwparserfromhell.smart_list.
_ListProxy
(parent, sliceinfo)[source]¶ Bases:
mwparserfromhell.smart_list._SliceNormalizerMixIn
,list
Implement the list interface by getting elements from a parent.
This is created by a SmartList object when slicing. It does not actually store the list at any time; instead, whenever the list is needed, it builds it dynamically using the _render() method.
-
index
(value[, start[, stop]]) → integer -- return first index of value.[source]¶ Raises ValueError if the value is not present.
-
pop
([index]) → item -- remove and return item at index (default last).[source]¶ Raises IndexError if list is empty or index is out of range.
-
string_mixin
Module¶
This module contains the StringMixIn type, which implements the interface for the unicode type (str on py3k) in a dynamic manner.
-
class
mwparserfromhell.string_mixin.
StringMixIn
[source]¶ Implement the interface for unicode/str in a dynamic manner.
To use this class, inherit from it and override the __unicode__() method (same on py3k) to return the string representation of the object. The various string methods will operate on the value of __unicode__() instead of the immutable self like the regular str type.
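A minimal Python 3 sketch of the same pattern follows. DynamicString and Greeting are hypothetical names for this example, and the sketch forwards string methods through __getattr__ and __str__(), which is an assumption made for brevity rather than a description of how StringMixIn is written.

```python
class DynamicString:
    """Hypothetical mixin: string behaviour computed from __str__() on demand."""

    def __str__(self):
        raise NotImplementedError("subclasses must return their text")

    def __len__(self):
        return len(str(self))

    def __contains__(self, item):
        return item in str(self)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so names such as
        # upper() or startswith() are forwarded to the current string value.
        return getattr(str(self), name)


class Greeting(DynamicString):
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return "Hello, " + self.name


greeting = Greeting("Ada")
shouted = greeting.upper()  # computed from the value at call time
greeting.name = "Grace"     # later string methods see the new value
```

Because every string operation re-evaluates __str__(), the object can change its contents while still answering string method calls.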
utils
Module¶
This module contains accessory functions for other parts of the library. Users of the parser generally won’t need anything from here.
-
mwparserfromhell.utils.
parse_anything
(value, context=0, skip_style_tags=False)[source]¶ Return a Wikicode for value, allowing multiple types.
This differs from Parser.parse() in that we accept more than just a string to be parsed. Unicode objects (strings in py3k), strings (bytes in py3k), integers (converted to strings), None, existing Node or Wikicode objects, as well as an iterable of these types, are supported. This is used to parse input on-the-fly by various methods of Wikicode and others like Template, such as wikicode.insert() or setting template.name.
Additional arguments are passed directly to Parser.parse().
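The type dispatch described above can be illustrated with a simplified, string-only stand-in. coerce_to_text is a hypothetical helper for this sketch: it flattens the accepted input types into a single string instead of building a Wikicode object, and its error handling is an assumption.

```python
def coerce_to_text(value, encoding="utf8"):
    """Hypothetical helper: flatten several input types into one string."""
    if value is None:
        return ""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, int):
        return str(value)
    try:
        # Anything else is treated as an iterable of supported values.
        return "".join(coerce_to_text(item, encoding) for item in value)
    except TypeError:
        raise ValueError(
            "cannot coerce {!r} to text".format(type(value).__name__)
        ) from None
```

For instance, coerce_to_text(["{{foo|", 42, b"}}", None]) joins a string, an integer, a bytes object, and None into one piece of text.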
wikicode
Module¶
-
class
mwparserfromhell.wikicode.
Wikicode
(nodes)[source]¶ Bases:
mwparserfromhell.string_mixin.StringMixIn
A Wikicode is a container for nodes that operates like a string.
Additionally, it contains methods that can be used to extract data from or modify the nodes, implemented in an interface similar to a list. For example, index() can get the index of a node in the list, and insert() can add a new node at that index. The filter() series of functions is very useful for extracting and iterating over, for example, all of the templates in the object.
-
RECURSE_OTHERS
= 2¶
-
append
(value)[source]¶ Insert value at the end of the list of nodes.
value can be anything parsable by parse_anything().
-
filter
(*args, **kwargs)[source]¶ Return a list of nodes within our list matching certain conditions.
This is equivalent to calling list() on ifilter().
-
filter_arguments
(*a, **kw)¶ Iterate over arguments.
This is equivalent to filter() with forcetype set to Argument.
-
filter_comments
(*a, **kw)¶ Iterate over comments.
This is equivalent to filter() with forcetype set to Comment.
-
filter_external_links
(*a, **kw)¶ Iterate over external links.
This is equivalent to filter() with forcetype set to ExternalLink.
-
filter_headings
(*a, **kw)¶ Iterate over headings.
This is equivalent to filter() with forcetype set to Heading.
-
filter_html_entities
(*a, **kw)¶ Iterate over HTML entities.
This is equivalent to filter() with forcetype set to HTMLEntity.
-
filter_templates
(*a, **kw)¶ Iterate over templates.
This is equivalent to filter() with forcetype set to Template.
-
filter_text
(*a, **kw)¶ Iterate over text.
-
filter_wikilinks
(*a, **kw)¶ Iterate over wikilinks.
This is equivalent to filter() with forcetype set to Wikilink.
-
get_sections
(levels=None, matches=None, flags=50, flat=False, include_lead=None, include_headings=True)[source]¶ Return a list of sections within the page.
Sections are returned as Wikicode objects with a shared node list (implemented using SmartList) so that changes to sections are reflected in the parent Wikicode object.
Each section contains all of its subsections, unless flat is True. If levels is given, it should be an iterable of integers; only sections whose heading levels are within it will be returned. If matches is given, it should be either a function or a regex; only sections whose headings match it (without the surrounding equal signs) will be included. flags can be used to override the default regex flags (see ifilter()) if a regex matches is used.
If include_lead is True, the first, lead section (without a heading) will be included in the list; False will not include it; the default will include it only if no specific levels were given. If include_headings is True, the section’s beginning Heading object will be included; otherwise, this is skipped.
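To make the interaction between levels and include_lead concrete, here is a small sketch that works on plain strings rather than on Wikicode objects. The flat_sections function and its HEADING regex are hypothetical simplifications written for this example: they only imitate the flat behaviour, ignore matches and include_headings, and do not parse real wiki markup.

```python
import re

# Simplified heading pattern: 1-6 equal signs, a title, the same signs again.
HEADING = re.compile(r"^(={1,6})[ \t]*(.*?)[ \t]*\1[ \t]*$", re.MULTILINE)


def flat_sections(text, levels=None, include_lead=None):
    """Hypothetical flat splitter: one string per heading, no nesting."""
    if include_lead is None:
        # Mirror the documented default: keep the lead only when no
        # specific levels were requested.
        include_lead = levels is None
    headings = list(HEADING.finditer(text))
    sections = []
    first = headings[0].start() if headings else len(text)
    if include_lead and text[:first]:
        sections.append(text[:first])
    for i, match in enumerate(headings):
        end = headings[i + 1].start() if i + 1 < len(headings) else len(text)
        if levels is None or len(match.group(1)) in levels:
            sections.append(text[match.start():end])
    return sections


page = "Lead\n== A ==\nfoo\n=== B ===\nbar\n== C ==\nbaz\n"
everything = flat_sections(page)
level_two = flat_sections(page, levels=[2])
```

With no arguments the lead text is kept; once levels=[2] is passed, the lead is dropped unless include_lead=True is given explicitly.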
-
get_tree
()[source]¶ Return a hierarchical tree representation of the object.
The representation is a string that makes the most sense when printed. It is built by calling _get_tree() on the Wikicode object and its children recursively. The end result may look something like the following:
    >>> text = "Lorem ipsum {{foo|bar|{{baz}}|spam=eggs}}"
    >>> print(mwparserfromhell.parse(text).get_tree())
    Lorem ipsum
    {{
          foo
        | 1
        = bar
        | 2
        = {{
                baz
          }}
        | spam
        = eggs
    }}
-
ifilter
(recursive=True, matches=None, flags=50, forcetype=None)[source]¶ Iterate over nodes in our list matching certain conditions.
If forcetype is given, only nodes that are instances of this type (or tuple of types) are yielded. Setting recursive to True will iterate over all children and their descendants. RECURSE_OTHERS will only iterate over children that are not instances of forcetype. False will only iterate over immediate children.
RECURSE_OTHERS can be used to iterate over all un-nested templates, even if they are inside of HTML tags, like so:
    >>> code = mwparserfromhell.parse("{{foo}}<b>{{foo|{{bar}}}}</b>")
    >>> code.filter_templates(code.RECURSE_OTHERS)
    ["{{foo}}", "{{foo|{{bar}}}}"]
matches can be used to further restrict the nodes, either as a function (taking a single Node and returning a boolean) or a regular expression (matched against the node’s string representation with re.search()). If matches is a regex, the flags passed to re.search() are re.IGNORECASE, re.DOTALL, and re.UNICODE, but custom flags can be specified by passing flags.
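The default flags=50 in the signature is the integer value of the three flags named above combined. The sketch below checks that and shows one way a matches argument could be turned into a predicate. make_matcher is a hypothetical helper for illustration that accepts a function or a regex string; it is not the library's internal code.

```python
import re

# 2 (IGNORECASE) + 16 (DOTALL) + 32 (UNICODE) == 50
DEFAULT_FLAGS = re.IGNORECASE | re.DOTALL | re.UNICODE


def make_matcher(matches, flags=DEFAULT_FLAGS):
    """Hypothetical helper: build a predicate from a function or regex string."""
    if matches is None:
        return lambda node: True
    if callable(matches):
        return matches
    pattern = re.compile(matches, flags)
    # The regex is searched against the node's string representation.
    return lambda node: bool(pattern.search(str(node)))


is_foo = make_matcher(r"^\{\{foo")
```

Because re.IGNORECASE is part of the default, is_foo accepts both "{{foo|bar}}" and "{{Foo|bar}}"; passing flags=0 makes the same pattern case-sensitive.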
-
ifilter_arguments
(*a, **kw)¶ Iterate over arguments.
This is equivalent to ifilter() with forcetype set to Argument.
-
ifilter_comments
(*a, **kw)¶ Iterate over comments.
This is equivalent to ifilter() with forcetype set to Comment.
-
ifilter_external_links
(*a, **kw)¶ Iterate over external links.
This is equivalent to ifilter() with forcetype set to ExternalLink.
-
ifilter_headings
(*a, **kw)¶ Iterate over headings.
This is equivalent to ifilter() with forcetype set to Heading.
-
ifilter_html_entities
(*a, **kw)¶ Iterate over HTML entities.
This is equivalent to ifilter() with forcetype set to HTMLEntity.
-
ifilter_templates
(*a, **kw)¶ Iterate over templates.
This is equivalent to ifilter() with forcetype set to Template.
-
ifilter_text
(*a, **kw)¶ Iterate over text.
-
ifilter_wikilinks
(*a, **kw)¶ Iterate over wikilinks.
This is equivalent to ifilter() with forcetype set to Wikilink.
-
index
(obj, recursive=False)[source]¶ Return the index of obj in the list of nodes.
Raises ValueError if obj is not found. If recursive is True, we will look in all nodes of ours and their descendants, and return the index of our direct descendant node within our list of nodes. Otherwise, the lookup is done only on direct descendants.
-
insert
(index, value)[source]¶ Insert value at index in the list of nodes.
value can be anything parsable by parse_anything(), which includes strings or other Wikicode or Node objects.
-
insert_after
(obj, value, recursive=True)[source]¶ Insert value immediately after obj.
obj can be either a string, a Node, or another Wikicode object (as created by get_sections(), for example). If obj is a string, we will operate on all instances of that string within the code; otherwise, only on the specific instance given. value can be anything parsable by parse_anything(). If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not found, ValueError is raised.
-
insert_before
(obj, value, recursive=True)[source]¶ Insert value immediately before obj.
obj can be either a string, a Node, or another Wikicode object (as created by get_sections(), for example). If obj is a string, we will operate on all instances of that string within the code; otherwise, only on the specific instance given. value can be anything parsable by parse_anything(). If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not found, ValueError is raised.
-
matches
(other)[source]¶ Do a loose equivalency test suitable for comparing page names.
other can be any string-like object, including Wikicode, or a tuple of these. This operation is symmetric; both sides are adjusted. Specifically, whitespace and markup are stripped and the first letter’s case is normalized. Typical usage is if template.name.matches("stub"): ...
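The loose comparison can be approximated on plain strings as follows. normalize_title and loosely_matches are hypothetical helpers for this sketch; they cover whitespace stripping and first-letter case only, and leave out the markup stripping mentioned above.

```python
def normalize_title(title):
    """Strip surrounding whitespace and upper-case only the first letter."""
    title = title.strip()
    return title[:1].upper() + title[1:]


def loosely_matches(name, other):
    """Hypothetical comparison: other may be one string or a tuple of them."""
    candidates = other if isinstance(other, tuple) else (other,)
    this = normalize_title(name)
    return any(this == normalize_title(candidate) for candidate in candidates)
```

Both sides go through the same normalization, so the comparison is symmetric: " stub " matches "Stub", while "stub" does not match "STUB" because only the first letter's case is adjusted.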
-
remove
(obj, recursive=True)[source]¶ Remove obj from the list of nodes.
obj can be either a string, a Node, or another Wikicode object (as created by get_sections(), for example). If obj is a string, we will operate on all instances of that string within the code; otherwise, only on the specific instance given. If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not found, ValueError is raised.
-
replace
(obj, value, recursive=True)[source]¶ Replace obj with value.
obj can be either a string, a Node, or another Wikicode object (as created by get_sections(), for example). If obj is a string, we will operate on all instances of that string within the code; otherwise, only on the specific instance given. value can be anything parsable by parse_anything(). If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not found, ValueError is raised.
-
set
(index, value)[source]¶ Set the Node at index to value.
Raises IndexError if index is out of range, or ValueError if value cannot be coerced into one Node. To insert multiple nodes at an index, use get() with either remove() and insert() or replace().
-
strip_code
(normalize=True, collapse=True)[source]¶ Return a rendered string without unprintable code such as templates.
The way a node is stripped is handled by the __strip__() method of Node objects, which generally return a subset of their nodes or None. For example, templates and tags are removed completely, links are stripped to just their display part, and headings are stripped to just their title. If normalize is True, various things may be done to strip code further, such as converting HTML entities like &Sigma;, &#931;, and &#x3a3; to Σ. If collapse is True, we will try to remove excess whitespace as well (three or more newlines are converted to two, for example).
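The two post-processing steps can be sketched on plain text with the standard library. normalize_and_collapse is a hypothetical function for this illustration; it uses html.unescape and a single regex, and is not taken from the library's implementation.

```python
import html
import re


def normalize_and_collapse(text, normalize=True, collapse=True):
    """Hypothetical post-processing of already-stripped text."""
    if normalize:
        # Named, decimal, and hexadecimal entities all become the character.
        text = html.unescape(text)
    if collapse:
        # Three or more consecutive newlines are reduced to two.
        text = re.sub(r"\n{3,}", "\n\n", text)
    return text
```

Here the named, decimal, and hexadecimal forms of the same entity all decode to Σ, and runs of blank lines shrink to a single blank line.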
-