mwparserfromhell Package

mwparserfromhell (the MediaWiki Parser from Hell) is a Python package that provides an easy-to-use and outrageously powerful parser for MediaWiki wikicode.

mwparserfromhell.__init__.parse(text)

Short for Parser.parse().

compat Module

Implements support for both Python 2 and Python 3 by defining common types in terms of their Python 2/3 variants. For example, str is set to unicode on Python 2 but str on Python 3; likewise, bytes is str on 2 but bytes on 3. These types are meant to be imported directly from within the parser’s modules.
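The aliasing described above can be sketched roughly as follows. This is an illustration only, not the module's actual source: the names py3k, text_type and bytes_type are assumptions chosen so the built-ins are not shadowed, whereas the module itself is described as exporting the aliases under the names str and bytes.

```python
import sys

# True when running under Python 3; the flag name is illustrative.
py3k = sys.version_info[0] == 3

if py3k:
    text_type = str      # text is str on Python 3
    bytes_type = bytes   # raw data is bytes on Python 3
else:
    text_type = unicode  # noqa: F821 (only defined on Python 2)
    bytes_type = str     # raw data is str on Python 2

# Code written against the aliases behaves the same on both versions.
label = text_type("wikicode")
raw = label.encode("utf8")
```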

smart_list Module

This module contains the SmartList type, as well as its _ListProxy child, which together implement a list whose sublists reflect changes made to the main list, and vice-versa.
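For contrast, slicing a built-in list produces an independent copy, so later changes to the slice never reach the original. The short snippet below uses only built-in lists and shows the behavior that SmartList is meant to replace.

```python
parent = [0, 1, 2, 3]
child = parent[2:]   # a built-in slice is a new, independent list
child.append(4)      # only the copy grows

# parent is still [0, 1, 2, 3]; child is now [2, 3, 4]
```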

class mwparserfromhell.smart_list.SmartList(iterable=None)[source]

Bases: list

Implements the list interface with special handling of sublists.

When a sublist is created (by list[i:j]), any changes made to this list (such as the addition, removal, or replacement of elements) will be reflected in the sublist, or vice-versa, to the greatest degree possible. This is implemented by having sublists - instances of the _ListProxy type - dynamically determine their elements by storing their slice info and retrieving that slice from the parent. Methods that change the size of the list also change the slice info. For example:

>>> parent = SmartList([0, 1, 2, 3])
>>> parent
[0, 1, 2, 3]
>>> child = parent[2:]
>>> child
[2, 3]
>>> child.append(4)
>>> child
[2, 3, 4]
>>> parent
[0, 1, 2, 3, 4]

append(item)[source]

L.append(object) – append object to end

extend(item)[source]

L.extend(iterable) – extend list by appending elements from the iterable

insert(index, item)[source]

L.insert(index, object) – insert object before index

pop([index]) → item -- remove and return item at index (default last).[source]

Raises IndexError if list is empty or index is out of range.

remove(item)[source]

L.remove(value) – remove first occurrence of value. Raises ValueError if the value is not present.

reverse()[source]

L.reverse() – reverse IN PLACE

sort(cmp=None, key=None, reverse=None)[source]

L.sort(cmp=None, key=None, reverse=False) – stable sort IN PLACE; cmp(x, y) -> -1, 0, 1

class mwparserfromhell.smart_list._ListProxy(parent, sliceinfo)[source]

Bases: list

Implement the list interface by getting elements from a parent.

This is created by a SmartList object when slicing. It does not actually store the list at any time; instead, whenever the list is needed, it builds it dynamically using the _render() method.
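The idea of storing only slice information and rendering elements on demand can be sketched with a toy class. ToyProxy and its attributes are hypothetical and far simpler than _ListProxy: the parent here is a plain list, and the toy does not track size changes made directly on the parent.

```python
class ToyProxy(object):
    """Hypothetical slice view; not the library's _ListProxy."""

    def __init__(self, parent, start, stop):
        self._parent = parent  # plain list that owns the data
        self._start = start    # slice info is stored, not the elements
        self._stop = stop

    def _render(self):
        # Rebuild the visible elements from the parent on every access.
        return self._parent[self._start:self._stop]

    def __len__(self):
        return self._stop - self._start

    def __iter__(self):
        return iter(self._render())

    def append(self, item):
        # Insert at the end of our window, then widen the window by one.
        self._parent.insert(self._stop, item)
        self._stop += 1


parent = [0, 1, 2, 3]
child = ToyProxy(parent, 2, len(parent))
child.append(4)   # the write goes through to the parent
parent[2] = 9     # an in-place change in the parent shows up in the view
```

Because the view never caches its elements, it cannot go stale after an in-place change to the parent; the cost is a fresh slice on every read.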

append(item)[source]

L.append(object) – append object to end

count(value) → integer -- return number of occurrences of value[source]

extend(item)[source]

L.extend(iterable) – extend list by appending elements from the iterable

index(value[, start[, stop]]) → integer -- return first index of value.[source]

Raises ValueError if the value is not present.

insert(index, item)[source]

L.insert(index, object) – insert object before index

pop([index]) → item -- remove and return item at index (default last).[source]

Raises IndexError if list is empty or index is out of range.

remove(item)[source]

L.remove(value) – remove first occurrence of value. Raises ValueError if the value is not present.

reverse()[source]

L.reverse() – reverse IN PLACE

sort(cmp=None, key=None, reverse=None)[source]

L.sort(cmp=None, key=None, reverse=False) – stable sort IN PLACE; cmp(x, y) -> -1, 0, 1

string_mixin Module

This module contains the StringMixIn type, which implements the interface for the unicode type (str on py3k) in a dynamic manner.

class mwparserfromhell.string_mixin.StringMixIn[source]

Implement the interface for unicode/str in a dynamic manner.

To use this class, inherit from it and override the __unicode__() method (same on py3k) to return the string representation of the object. The various string methods will operate on the value of __unicode__() instead of the immutable self like the regular str type.
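A rough sketch of the delegation idea, written for Python 3 where the method to override is __str__(). ToyStringMixIn and Greeting are hypothetical names, and forwarding through __getattr__() is a shortcut used only for this illustration; it is not a claim about how StringMixIn itself is implemented.

```python
class ToyStringMixIn(object):
    """Hypothetical mix-in: string methods act on the current text."""

    def __str__(self):
        raise NotImplementedError("subclasses must return their text")

    def __getattr__(self, name):
        # Reached only when normal attribute lookup fails, so methods
        # such as upper() are looked up on the freshly rendered string.
        return getattr(str(self), name)


class Greeting(ToyStringMixIn):
    def __init__(self, who):
        self.who = who

    def __str__(self):
        return "hello " + self.who


g = Greeting("world")
before = g.upper()   # computed from "hello world"
g.who = "wiki"       # mutate the object
after = g.upper()    # recomputed from "hello wiki"
```

The point of the exercise is that the object stays mutable while still answering string methods with up-to-date results.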

capitalize() → unicode[source]

Return a capitalized version of S, i.e. make the first character have upper case and the rest lower case.

center(width[, fillchar]) → unicode[source]

Return S centered in a Unicode string of length width. Padding is done using the specified fill character (default is a space).

count(sub[, start[, end]]) → int[source]

Return the number of non-overlapping occurrences of substring sub in Unicode string S[start:end]. Optional arguments start and end are interpreted as in slice notation.

decode([encoding[, errors]]) → string or unicode[source]

Decodes S using the codec registered for encoding. encoding defaults to the default encoding. errors may be given to set a different error handling scheme. Default is ‘strict’ meaning that encoding errors raise a UnicodeDecodeError. Other possible values are ‘ignore’ and ‘replace’ as well as any other name registered with codecs.register_error that is able to handle UnicodeDecodeErrors.

encode([encoding[, errors]]) → string or unicode[source]

Encodes S using the codec registered for encoding. encoding defaults to the default encoding. errors may be given to set a different error handling scheme. Default is ‘strict’ meaning that encoding errors raise a UnicodeEncodeError. Other possible values are ‘ignore’, ‘replace’ and ‘xmlcharrefreplace’ as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.

endswith(suffix[, start[, end]]) → bool[source]

Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. suffix can also be a tuple of strings to try.

expandtabs([tabsize]) → unicode[source]

Return a copy of S where all tab characters are expanded using spaces. If tabsize is not given, a tab size of 8 characters is assumed.

find(sub[, start[, end]]) → int[source]

Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

format(*args, **kwargs) → unicode[source]

Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).

index(sub[, start[, end]]) → int[source]

Like S.find() but raise ValueError when the substring is not found.

isalnum() → bool[source]

Return True if all characters in S are alphanumeric and there is at least one character in S, False otherwise.

isalpha() → bool[source]

Return True if all characters in S are alphabetic and there is at least one character in S, False otherwise.

isdecimal() → bool[source]

Return True if there are only decimal characters in S, False otherwise.

isdigit() → bool[source]

Return True if all characters in S are digits and there is at least one character in S, False otherwise.

islower() → bool[source]

Return True if all cased characters in S are lowercase and there is at least one cased character in S, False otherwise.

isnumeric() → bool[source]

Return True if there are only numeric characters in S, False otherwise.

isspace() → bool[source]

Return True if all characters in S are whitespace and there is at least one character in S, False otherwise.

istitle() → bool[source]

Return True if S is a titlecased string and there is at least one character in S, i.e. upper- and titlecase characters may only follow uncased characters and lowercase characters only cased ones. Return False otherwise.

isupper() → bool[source]

Return True if all cased characters in S are uppercase and there is at least one cased character in S, False otherwise.

join(iterable) → unicode[source]

Return a string which is the concatenation of the strings in the iterable. The separator between elements is S.

ljust(width[, fillchar]) → unicode[source]

Return S left-justified in a Unicode string of length width. Padding is done using the specified fill character (default is a space).

lower() → unicode[source]

Return a copy of the string S converted to lowercase.

lstrip([chars]) → unicode[source]

Return a copy of the string S with leading whitespace removed. If chars is given and not None, remove characters in chars instead. If chars is a str, it will be converted to unicode before stripping.

partition(sep) -> (head, sep, tail)[source]

Search for the separator sep in S, and return the part before it, the separator itself, and the part after it. If the separator is not found, return S and two empty strings.

replace(old, new[, count]) → unicode[source]

Return a copy of S with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.

rfind(sub[, start[, end]]) → int[source]

Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

rindex(sub[, start[, end]]) → int[source]

Like S.rfind() but raise ValueError when the substring is not found.

rjust(width[, fillchar]) → unicode[source]

Return S right-justified in a Unicode string of length width. Padding is done using the specified fill character (default is a space).

rpartition(sep) -> (head, sep, tail)[source]

Search for the separator sep in S, starting at the end of S, and return the part before it, the separator itself, and the part after it. If the separator is not found, return two empty strings and S.

rsplit([sep[, maxsplit]]) → list of strings[source]

Return a list of the words in S, using sep as the delimiter string, starting at the end of the string and working to the front. If maxsplit is given, at most maxsplit splits are done. If sep is not specified, any whitespace string is a separator.

rstrip([chars]) → unicode[source]

Return a copy of the string S with trailing whitespace removed. If chars is given and not None, remove characters in chars instead. If chars is a str, it will be converted to unicode before stripping.

split([sep[, maxsplit]]) → list of strings[source]

Return a list of the words in S, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done. If sep is not specified or is None, any whitespace string is a separator and empty strings are removed from the result.

splitlines([keepends]) → list of strings[source]

Return a list of the lines in S, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true.

startswith(prefix[, start[, end]]) → bool[source]

Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. prefix can also be a tuple of strings to try.

strip([chars]) → unicode[source]

Return a copy of the string S with leading and trailing whitespace removed. If chars is given and not None, remove characters in chars instead. If chars is a str, it will be converted to unicode before stripping.

swapcase() → unicode[source]

Return a copy of S with uppercase characters converted to lowercase and vice versa.

title() → unicode[source]

Return a titlecased version of S, i.e. words start with title case characters, all remaining cased characters have lower case.

translate(table) → unicode[source]

Return a copy of the string S, where all characters have been mapped through the given translation table, which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None. Unmapped characters are left untouched. Characters mapped to None are deleted.

upper() → unicode[source]

Return a copy of S converted to uppercase.

zfill(width) → unicode[source]

Pad a numeric string S with zeros on the left, to fill a field of the specified width. The string S is never truncated.

utils Module

This module contains accessory functions that wrap around existing ones to provide additional functionality.

mwparserfromhell.utils.parse_anything(value)[source]

Return a Wikicode for value, allowing multiple types.

This differs from mwparserfromhell.parse() in that we accept more than just a string to be parsed. Unicode objects (strings in py3k), strings (bytes in py3k), integers (converted to strings), None, existing Node or Wikicode objects, as well as an iterable of these types, are supported. This is used to parse input on-the-fly by various methods of Wikicode and others like Template, such as wikicode.insert() or setting template.name.
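The range of accepted inputs can be pictured with a toy coercion function. coerce_to_text is hypothetical: it returns plain text rather than a Wikicode, ignores Node and Wikicode inputs, and assumes UTF-8 for byte strings and ValueError for unsupported types, none of which is confirmed by the description above. It is written for Python 3.

```python
def coerce_to_text(value):
    """Hypothetical helper: flatten several input types into one string."""
    if value is None:
        return ""
    if isinstance(value, bytes):
        # Assumption: byte strings are decoded as UTF-8.
        return value.decode("utf8")
    if isinstance(value, str):
        return value
    if isinstance(value, int):
        return str(value)
    try:
        items = iter(value)
    except TypeError:
        raise ValueError("unsupported type: " + type(value).__name__)
    # Iterables are handled by coercing each element in turn.
    return "".join(coerce_to_text(item) for item in items)


mixed = coerce_to_text(["{{foo}}", 1, None, b"bar"])
```

Strings and byte strings are tested before the generic iterable branch because both are themselves iterable.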

wikicode Module

class mwparserfromhell.wikicode.Wikicode(nodes)[source]

Bases: mwparserfromhell.string_mixin.StringMixIn

A Wikicode is a container for nodes that operates like a string.

Additionally, it contains methods that can be used to extract data from or modify the nodes, implemented in an interface similar to a list. For example, index() can get the index of a node in the list, and insert() can add a new node at that index. The filter() series of functions is very useful for extracting and iterating over, for example, all of the templates in the object.

append(value)[source]

Insert value at the end of the list of nodes.

value can be anything parsable by parse_anything().

filter(recursive=False, matches=None, flags=50, forcetype=None)[source]

Return a list of nodes within our list matching certain conditions.

This is equivalent to calling list() on ifilter().

filter_links(recursive=False, matches=None, flags=50)[source]

Return a list of wikilink nodes.

This is equivalent to calling list() on ifilter_links().

filter_tags(recursive=False, matches=None, flags=50)[source]

Return a list of tag nodes.

This is equivalent to calling list() on ifilter_tags().

filter_templates(recursive=False, matches=None, flags=50)[source]

Return a list of template nodes.

This is equivalent to calling list() on ifilter_templates().

filter_text(recursive=False, matches=None, flags=50)[source]

Return a list of text nodes.

This is equivalent to calling list() on ifilter_text().

get(index)[source]

Return the node at position index within the list of nodes.

get_sections(flat=True, matches=None, levels=None, flags=50, include_headings=True)[source]

Return a list of sections within the page.

Sections are returned as Wikicode objects with a shared node list (implemented using SmartList) so that changes to sections are reflected in the parent Wikicode object.

With flat as True, each returned section contains all of its subsections within the Wikicode; otherwise, the returned sections contain only the section up to the next heading, regardless of its size. If matches is given, it should be a regex to be matched against the titles of section headings; only sections whose headings match the regex will be included. If levels is given, it should be a list of integers; only sections whose heading levels are within the list will be returned. If include_headings is True, the section’s literal Heading object will be included in returned Wikicode objects; otherwise, this is skipped.
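The difference between the two ways of delimiting a section can be illustrated on plain text with a regular expression. toy_sections and its include_subsections parameter are hypothetical and operate on strings, not Wikicode objects; they only show where a section ends in each mode and make no claim about how get_sections() is implemented.

```python
import re

# A heading line such as "== Title ==" with matching runs of "=".
HEADING = re.compile(r"^(=+)\s*(.+?)\s*\1$", re.MULTILINE)


def toy_sections(text, include_subsections=True):
    """Hypothetical helper: split text into heading-led sections."""
    heads = [(m.start(), len(m.group(1))) for m in HEADING.finditer(text)]
    sections = []
    for i, (start, level) in enumerate(heads):
        end = len(text)
        for next_start, next_level in heads[i + 1:]:
            # Stop at any heading, or only at one of equal or higher rank.
            if not include_subsections or next_level <= level:
                end = next_start
                break
        sections.append(text[start:end])
    return sections


page = "intro\n== A ==\na\n=== A1 ===\na1\n== B ==\nb\n"
with_subs = toy_sections(page)
up_to_next = toy_sections(page, include_subsections=False)
```

In the first mode the section for A also carries A1; in the second, each section stops at the very next heading.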

get_tree()[source]

Return a hierarchical tree representation of the object.

The representation is a string that makes the most sense when printed. It is built by calling _get_tree() on the Wikicode object and its children recursively. The end result may look something like the following:

>>> text = "Lorem ipsum {{foo|bar|{{baz}}|spam=eggs}}"
>>> print(mwparserfromhell.parse(text).get_tree())
Lorem ipsum
{{
      foo
    | 1
    = bar
    | 2
    = {{
            baz
      }}
    | spam
    = eggs
}}

ifilter(recursive=False, matches=None, flags=50, forcetype=None)[source]

Iterate over nodes in our list matching certain conditions.

If recursive is True, we will iterate over our children and all descendants of our children, otherwise just our immediate children. If matches is given, we will only yield the nodes that match the given regular expression (with re.search()). The default flags used are re.IGNORECASE, re.DOTALL, and re.UNICODE, but custom flags can be specified by passing flags. If forcetype is given, only nodes that are instances of this type are yielded.
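The default flags=50 in the signatures is simply the three named flags combined, which can be checked with the standard re module. The generator below is a hypothetical, non-recursive stand-in that filters arbitrary Python objects by type and by a regex applied to their string form; it is not the method's real implementation.

```python
import re

# 2 | 16 | 32 == 50, the default shown in the signatures.
FLAGS = re.IGNORECASE | re.DOTALL | re.UNICODE


def toy_ifilter(nodes, matches=None, flags=FLAGS, forcetype=None):
    """Hypothetical stand-in: yield items passing type and regex tests."""
    for node in nodes:
        if forcetype is not None and not isinstance(node, forcetype):
            continue
        if matches is None or re.search(matches, str(node), flags):
            yield node


items = ["{{Foo}}", "bar", 3]
by_regex = list(toy_ifilter(items, matches="foo"))   # case-insensitive
by_type = list(toy_ifilter(items, forcetype=int))
# DOTALL lets "." cross a newline; IGNORECASE ignores letter case.
spans_lines = bool(re.search("a.b", "A\nB", FLAGS))
```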

ifilter_links(recursive=False, matches=None, flags=50)[source]

Iterate over wikilink nodes.

This is equivalent to ifilter() with forcetype set to Wikilink.

ifilter_tags(recursive=False, matches=None, flags=50)[source]

Iterate over tag nodes.

This is equivalent to ifilter() with forcetype set to Tag.

ifilter_templates(recursive=False, matches=None, flags=50)[source]

Iterate over template nodes.

This is equivalent to ifilter() with forcetype set to Template.

ifilter_text(recursive=False, matches=None, flags=50)[source]

Iterate over text nodes.

This is equivalent to ifilter() with forcetype set to Text.

index(obj, recursive=False)[source]

Return the index of obj in the list of nodes.

Raises ValueError if obj is not found. If recursive is True, we will look in all nodes of ours and their descendants, and return the index of our direct descendant node within our list of nodes. Otherwise, the lookup is done only on direct descendants.

insert(index, value)[source]

Insert value at index in the list of nodes.

value can be anything parsable by parse_anything(), which includes strings or other Wikicode or Node objects.

insert_after(obj, value, recursive=True)[source]

Insert value immediately after obj in the list of nodes.

obj can be either a string or a Node. value can be anything parsable by parse_anything(). If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not in the node list, ValueError is raised.

insert_before(obj, value, recursive=True)[source]

Insert value immediately before obj in the list of nodes.

obj can be either a string or a Node. value can be anything parsable by parse_anything(). If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not in the node list, ValueError is raised.

nodes[source]

A list of Node objects.

This is the internal data actually stored within a Wikicode object.

remove(obj, recursive=True)[source]

Remove obj from the list of nodes.

obj can be either a string or a Node. If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not in the node list, ValueError is raised.

replace(obj, value, recursive=True)[source]

Replace obj with value in the list of nodes.

obj can be either a string or a Node. value can be anything parsable by parse_anything(). If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not in the node list, ValueError is raised.

set(index, value)[source]

Set the Node at index to value.

Raises IndexError if index is out of range, or ValueError if value cannot be coerced into one Node. To insert multiple nodes at an index, use get() with either remove() and insert() or replace().

strip_code(normalize=True, collapse=True)[source]

Return a rendered string without unprintable code such as templates.

The way a node is stripped is handled by the __strip__() method of Node objects, which generally return a subset of their nodes or None. For example, templates and tags are removed completely, links are stripped to just their display part, and headings are stripped to just their title. If normalize is True, various things may be done to strip code further, such as converting HTML entities like &Sigma;, &#931;, and &#x3a3; to Σ. If collapse is True, we will try to remove excess whitespace as well (three or more newlines are converted to two, for example).
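The two post-processing options can be approximated on plain strings with the Python 3 standard library. toy_cleanup is a hypothetical helper covering only entity conversion and newline collapsing; it does not remove templates, tags or any other markup, and the real method may normalize more than this.

```python
import html
import re


def toy_cleanup(text, normalize=True, collapse=True):
    """Hypothetical helper: mimic the normalize and collapse options."""
    if normalize:
        # Named, decimal and hexadecimal entities become characters.
        text = html.unescape(text)
    if collapse:
        # Three or more consecutive newlines are reduced to two.
        text = re.sub(r"\n{3,}", "\n\n", text)
    return text


cleaned = toy_cleanup("&Sigma; &#931; &#x3a3;\n\n\n\nend")
untouched = toy_cleanup("&Sigma;\n\n\nend", normalize=False, collapse=False)
```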