mwparserfromhell Package

mwparserfromhell (the MediaWiki Parser from Hell) is a Python package that provides an easy-to-use and outrageously powerful parser for MediaWiki wikicode.

compat Module

Implements support for both Python 2 and Python 3 by defining common types in terms of their Python 2/3 variants. For example, str is set to unicode on Python 2 but str on Python 3; likewise, bytes is str on 2 but bytes on 3. These types are meant to be imported directly from within the parser’s modules.
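A shim of this kind can be sketched in a few lines. The names py3k, text_type, and binary_type below are illustrative assumptions, not the module's actual exports (the module itself rebinds str and bytes, as described above).

```python
import sys

# Assumption: detect the major version once and branch on it.
py3k = sys.version_info[0] == 3

if py3k:
    text_type = str      # text is str on Python 3
    binary_type = bytes  # raw bytes are bytes on Python 3
else:
    text_type = unicode  # noqa: F821 (this name only exists on Python 2)
    binary_type = str    # raw bytes are str on Python 2
```

Code that imports these names can then call isinstance(value, text_type) without checking the interpreter version itself.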

smart_list Module

This module contains the SmartList type, as well as its _ListProxy child, which together implement a list whose sublists reflect changes made to the main list, and vice-versa.

class mwparserfromhell.smart_list.SmartList(iterable=None)[source]

Bases: mwparserfromhell.smart_list._SliceNormalizerMixIn, list

Implements the list interface with special handling of sublists.

When a sublist is created (by list[i:j]), any changes made to this list (such as the addition, removal, or replacement of elements) will be reflected in the sublist, or vice-versa, to the greatest degree possible. This is implemented by having sublists - instances of the _ListProxy type - dynamically determine their elements by storing their slice info and retrieving that slice from the parent. Methods that change the size of the list also change the slice info. For example:

>>> parent = SmartList([0, 1, 2, 3])
>>> parent
[0, 1, 2, 3]
>>> child = parent[2:]
>>> child
[2, 3]
>>> child.append(4)
>>> child
[2, 3, 4]
>>> parent
[0, 1, 2, 3, 4]

The parent needs to keep a list of its children in order to update them, which prevents them from being garbage-collected. If you are keeping the parent around for a while but creating many children, it is advisable to call destroy() when you’re finished with them.
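The slice-tracking idea can be illustrated with a much smaller stand-in. SliceView below is a hypothetical class written for this sketch; it is not _ListProxy and omits most of the list interface, but it shows how a child that stores only slice bounds stays in sync with its parent.

```python
class SliceView(object):
    """Hypothetical stand-in for a sublist proxy; not the library's class."""

    def __init__(self, parent, start, stop):
        self.parent = parent
        self.start = start
        self.stop = stop

    def render(self):
        # The view holds no elements of its own; it re-reads the parent.
        return self.parent[self.start:self.stop]

    def append(self, item):
        # Insert at the end of the view's range, then widen the range.
        self.parent.insert(self.stop, item)
        self.stop += 1


parent = [0, 1, 2, 3]
child = SliceView(parent, 2, 4)
child.append(4)
```

After the append, both child.render() and the tail of parent contain the new item, because the child never held a copy of the data.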

append(item)[source]

L.append(object) – append object to end

extend(item)[source]

L.extend(iterable) – extend list by appending elements from the iterable

insert(index, item)[source]

L.insert(index, object) – insert object before index

pop([index]) → item -- remove and return item at index (default last).[source]

Raises IndexError if list is empty or index is out of range.

remove(item)[source]

L.remove(value) – remove first occurrence of value. Raises ValueError if the value is not present.

reverse()[source]

L.reverse() – reverse IN PLACE

sort(cmp=None, key=None, reverse=None)[source]

L.sort(cmp=None, key=None, reverse=False) – stable sort IN PLACE; cmp(x, y) -> -1, 0, 1

class mwparserfromhell.smart_list._ListProxy(parent, sliceinfo)[source]

Bases: mwparserfromhell.smart_list._SliceNormalizerMixIn, list

Implement the list interface by getting elements from a parent.

This is created by a SmartList object when slicing. It does not actually store the list at any time; instead, whenever the list is needed, it builds it dynamically using the _render() method.

append(item)[source]

L.append(object) – append object to end

count(value) → integer -- return number of occurrences of value[source]

destroy()[source]

Make the parent forget this child. The child will no longer work.

extend(item)[source]

L.extend(iterable) – extend list by appending elements from the iterable

index(value[, start[, stop]]) → integer -- return first index of value.[source]

Raises ValueError if the value is not present.

insert(index, item)[source]

L.insert(index, object) – insert object before index

pop([index]) → item -- remove and return item at index (default last).[source]

Raises IndexError if list is empty or index is out of range.

remove(item)[source]

L.remove(value) – remove first occurrence of value. Raises ValueError if the value is not present.

reverse()[source]

L.reverse() – reverse IN PLACE

sort(cmp=None, key=None, reverse=None)[source]

L.sort(cmp=None, key=None, reverse=False) – stable sort IN PLACE; cmp(x, y) -> -1, 0, 1

string_mixin Module

This module contains the StringMixIn type, which implements the interface for the unicode type (str on py3k) in a dynamic manner.

class mwparserfromhell.string_mixin.StringMixIn[source]

Implement the interface for unicode/str in a dynamic manner.

To use this class, inherit from it and override the __unicode__() method (same on py3k) to return the string representation of the object. The various string methods will operate on the value of __unicode__() instead of the immutable self like the regular str type.
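The following is a minimal sketch of the same idea on Python 3, where the method to override is assumed to be __str__(). TextLike and Greeting are hypothetical names, and the __getattr__ shortcut is a simplification chosen for brevity rather than the way StringMixIn is implemented.

```python
class TextLike(object):
    """Hypothetical mixin: string methods act on the current str(self)."""

    def __getattr__(self, name):
        # Only called when normal lookup fails, so str methods such as
        # upper() are resolved against a freshly rendered string.
        return getattr(str(self), name)


class Greeting(TextLike):
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return "Hello, " + self.name


g = Greeting("world")
upper_before = g.upper()
g.name = "wiki"          # mutate the object...
upper_after = g.upper()  # ...and the string methods see the new value
```

Because the string is rendered on every call, the object can change while still behaving like text.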

definitions Module

Contains data about certain markup, like HTML tags and external links.

mwparserfromhell.definitions.get_html_tag(markup)[source]

Return the HTML tag associated with the given wiki-markup.

mwparserfromhell.definitions.is_parsable(tag)[source]

Return whether the given tag’s contents should be passed to the parser.

mwparserfromhell.definitions.is_visible(tag)[source]

Return whether or not the given tag contains visible text.

mwparserfromhell.definitions.is_single(tag)[source]

Return whether or not the given tag can exist without a close tag.

mwparserfromhell.definitions.is_single_only(tag)[source]

Return whether or not the given tag must exist without a close tag.

mwparserfromhell.definitions.is_scheme(scheme, slashes=True, reverse=False)[source]

Return whether scheme is valid for external links.

utils Module

This module contains accessory functions for other parts of the library. Parser users generally won’t need stuff from here.

mwparserfromhell.utils.parse_anything(value, context=0)[source]

Return a Wikicode for value, allowing multiple types.

This differs from Parser.parse() in that we accept more than just a string to be parsed. Unicode objects (strings in py3k), strings (bytes in py3k), integers (converted to strings), None, existing Node or Wikicode objects, as well as an iterable of these types, are supported. This is used to parse input on-the-fly by various methods of Wikicode and others like Template, such as wikicode.insert() or setting template.name.

If given, context will be passed as a starting context to the parser. This is helpful when this function is used inside node attribute setters. For example, ExternalLink’s url setter sets context to contexts.EXT_LINK_URI to prevent the URL itself from becoming an ExternalLink.
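The type dispatch described above can be sketched without the parser. coerce_to_text is a hypothetical helper for Python 3 that only flattens the accepted input types into one string; the real function returns a Wikicode object, also accepts Node and Wikicode instances, and its error behavior is not documented here, so the ValueError is an assumption.

```python
def coerce_to_text(value):
    """Hypothetical helper: flatten accepted input types into one string."""
    if value is None:
        return ""
    if isinstance(value, bytes):
        return value.decode("utf8")
    if isinstance(value, str):
        return value
    if isinstance(value, int):
        return str(value)
    try:
        # Anything else is treated as an iterable of the types above.
        return "".join(coerce_to_text(item) for item in value)
    except TypeError:
        # Assumption: unsupported input is reported as a ValueError.
        raise ValueError("cannot parse value of type " + type(value).__name__)
```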

wikicode Module

class mwparserfromhell.wikicode.Wikicode(nodes)[source]

Bases: mwparserfromhell.string_mixin.StringMixIn

A Wikicode is a container for nodes that operates like a string.

Additionally, it contains methods that can be used to extract data from or modify the nodes, implemented in an interface similar to a list. For example, index() can get the index of a node in the list, and insert() can add a new node at that index. The filter() series of functions is very useful for extracting and iterating over, for example, all of the templates in the object.

append(value)[source]

Insert value at the end of the list of nodes.

value can be anything parsable by parse_anything().

filter(recursive=True, matches=None, flags=50, forcetype=None)[source]

Return a list of nodes within our list matching certain conditions.

This is equivalent to calling list() on ifilter().

filter_arguments(**kw)

Iterate over arguments.

This is equivalent to filter() with forcetype set to Argument.

filter_comments(**kw)

Iterate over comments.

This is equivalent to filter() with forcetype set to Comment.

filter_external_links(**kw)

Iterate over external links.

This is equivalent to filter() with forcetype set to ExternalLink.

filter_headings(**kw)

Iterate over headings.

This is equivalent to filter() with forcetype set to Heading.

filter_html_entities(**kw)

Iterate over HTML entities.

This is equivalent to filter() with forcetype set to HTMLEntity.

filter_tags(**kw)

Iterate over tags.

This is equivalent to filter() with forcetype set to Tag.

filter_templates(**kw)

Iterate over templates.

This is equivalent to filter() with forcetype set to Template.

filter_text(**kw)

Iterate over text.

This is equivalent to filter() with forcetype set to Text.

filter_wikilinks(**kw)

Iterate over wikilinks.

This is equivalent to filter() with forcetype set to Wikilink.

get(index)[source]

Return the indexth node within the list of nodes.

get_sections(levels=None, matches=None, flags=50, flat=False, include_lead=None, include_headings=True)[source]

Return a list of sections within the page.

Sections are returned as Wikicode objects with a shared node list (implemented using SmartList) so that changes to sections are reflected in the parent Wikicode object.

Each section contains all of its subsections, unless flat is True. If levels is given, it should be an iterable of integers; only sections whose heading levels are within it will be returned. If matches is given, it should be either a function or a regex; only sections whose headings match it (without the surrounding equal signs) will be included. flags can be used to override the default regex flags (see ifilter()) if a regex matches is used.

If include_lead is True, the first, lead section (without a heading) will be included in the list; False will not include it; the default will include it only if no specific levels were given. If include_headings is True, the section’s beginning Heading object will be included; otherwise, this is skipped.
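The interaction between levels and include_lead can be illustrated on plain text. flat_sections and its HEADING pattern are hypothetical and only approximate the flat=True case: the real method works on parsed nodes and returns Wikicode objects that share a node list, which this string-based sketch does not attempt.

```python
import re

# Assumption: a heading is a line wrapped in one to six equal signs.
HEADING = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def flat_sections(text, levels=None, include_lead=None):
    """Hypothetical helper: split text into non-nested sections."""
    if include_lead is None:
        # Default: keep the lead only when no specific levels were requested.
        include_lead = levels is None
    heads = list(HEADING.finditer(text))
    sections = []
    if include_lead:
        first = heads[0].start() if heads else len(text)
        sections.append(text[:first])
    for i, head in enumerate(heads):
        end = heads[i + 1].start() if i + 1 < len(heads) else len(text)
        if levels is None or len(head.group(1)) in levels:
            sections.append(text[head.start():end])
    return sections


page = "Lead\n== A ==\na\n=== B ===\nb\n== C ==\nc\n"
```

Calling flat_sections(page) keeps the lead section, while flat_sections(page, levels=[2]) drops both the lead and the level-3 section.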

get_tree()[source]

Return a hierarchical tree representation of the object.

The representation is a string that makes the most sense when printed. It is built by calling _get_tree() on the Wikicode object and its children recursively. The end result may look something like the following:

>>> import mwparserfromhell
>>> text = "Lorem ipsum {{foo|bar|{{baz}}|spam=eggs}}"
>>> print(mwparserfromhell.parse(text).get_tree())
Lorem ipsum
{{
      foo
    | 1
    = bar
    | 2
    = {{
            baz
      }}
    | spam
    = eggs
}}

ifilter(recursive=True, matches=None, flags=50, forcetype=None)[source]

Iterate over nodes in our list matching certain conditions.

If recursive is True, we will iterate over our children and all of their descendants, otherwise just our immediate children. If forcetype is given, only nodes that are instances of this type are yielded. matches can be used to further restrict the nodes, either as a function (taking a single Node and returning a boolean) or a regular expression (matched against the node’s string representation with re.search()). If matches is a regex, the flags passed to re.search() are re.IGNORECASE, re.DOTALL, and re.UNICODE, but custom flags can be specified by passing flags.
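The default flags=50 in the signature is the bitwise OR of the three flags named above (2 | 16 | 32). The sketch below shows the filtering rules on a flat list of arbitrary objects; ifilter_sketch is a hypothetical function that ignores recursion and is not the method's actual implementation.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL | re.UNICODE  # numerically 50


def ifilter_sketch(nodes, matches=None, flags=FLAGS, forcetype=None):
    """Hypothetical, non-recursive version of the filtering rules."""
    for node in nodes:
        if forcetype is not None and not isinstance(node, forcetype):
            continue
        if matches is None:
            yield node
        elif callable(matches):
            # A function receives the node and returns a boolean.
            if matches(node):
                yield node
        elif re.search(matches, str(node), flags):
            # A regex is searched in the node's string form.
            yield node


items = ["{{Foo}}", "bar", 3]
```

With the default flags, the pattern "foo" matches "{{Foo}}" regardless of case; passing flags=0 makes the same search case-sensitive.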

ifilter_arguments(**kw)

Iterate over arguments.

This is equivalent to ifilter() with forcetype set to Argument.

ifilter_comments(**kw)

Iterate over comments.

This is equivalent to ifilter() with forcetype set to Comment.

ifilter_external_links(**kw)

Iterate over external links.

This is equivalent to ifilter() with forcetype set to ExternalLink.

ifilter_headings(**kw)

Iterate over headings.

This is equivalent to ifilter() with forcetype set to Heading.

ifilter_html_entities(**kw)

Iterate over HTML entities.

This is equivalent to ifilter() with forcetype set to HTMLEntity.

ifilter_tags(**kw)

Iterate over tags.

This is equivalent to ifilter() with forcetype set to Tag.

ifilter_templates(**kw)

Iterate over templates.

This is equivalent to ifilter() with forcetype set to Template.

ifilter_text(**kw)

Iterate over text.

This is equivalent to ifilter() with forcetype set to Text.

ifilter_wikilinks(**kw)

Iterate over wikilinks.

This is equivalent to ifilter() with forcetype set to Wikilink.

index(obj, recursive=False)[source]

Return the index of obj in the list of nodes.

Raises ValueError if obj is not found. If recursive is True, we will look in all nodes of ours and their descendants, and return the index of our direct descendant node within our list of nodes. Otherwise, the lookup is done only on direct descendants.
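The recursive rule, which returns the index of the top-level node that contains the match, can be shown with nested lists standing in for nodes. contains and index_of are hypothetical helpers written for this illustration; real nodes are not lists.

```python
def contains(node, obj):
    """Return True if obj is node itself or nested anywhere inside it."""
    if node == obj:
        return True
    if isinstance(node, list):
        return any(contains(child, obj) for child in node)
    return False


def index_of(nodes, obj, recursive=False):
    """Return the top-level index whose entry is (or holds) obj."""
    for i, node in enumerate(nodes):
        if node == obj or (recursive and contains(node, obj)):
            return i
    raise ValueError(obj)


# Nested lists play the role of nodes that have children.
nodes = ["text", ["template", ["inner"]], "more"]
```

Here index_of(nodes, "inner", recursive=True) reports the position of the outer list, not a position inside it, and the non-recursive call raises ValueError.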

insert(index, value)[source]

Insert value at index in the list of nodes.

value can be anything parsable by parse_anything(), which includes strings or other Wikicode or Node objects.

insert_after(obj, value, recursive=True)[source]

Insert value immediately after obj.

obj can be either a string, a Node, or another Wikicode object (as created by get_sections(), for example). If obj is a string, we will operate on all instances of that string within the code, otherwise only on the specific instance given. value can be anything parsable by parse_anything(). If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not found, ValueError is raised.

insert_before(obj, value, recursive=True)[source]

Insert value immediately before obj.

obj can be either a string, a Node, or another Wikicode object (as created by get_sections(), for example). If obj is a string, we will operate on all instances of that string within the code, otherwise only on the specific instance given. value can be anything parsable by parse_anything(). If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not found, ValueError is raised.

matches(other)[source]

Do a loose equivalency test suitable for comparing page names.

other can be any string-like object, including Wikicode, or a tuple of these. This operation is symmetric; both sides are adjusted. Specifically, whitespace and markup are stripped and the first letter’s case is normalized. Typical usage is if template.name.matches("stub"): ....
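The normalization can be approximated on plain strings. normalize_title and loosely_matches are hypothetical helpers that handle only whitespace and the first letter's case; the markup stripping mentioned above is left out.

```python
def normalize_title(title):
    """Trim whitespace and uppercase only the first character."""
    title = title.strip()
    return title[:1].upper() + title[1:]


def loosely_matches(this, other):
    """Compare against one candidate or a tuple of candidates."""
    others = other if isinstance(other, tuple) else (other,)
    this = normalize_title(this)
    return any(this == normalize_title(item) for item in others)
```

Only the first letter is case-insensitive in this sketch, so "stub" and "Stub" compare equal while "stub" and "STUB" do not.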

nodes[source]

A list of Node objects.

This is the internal data actually stored within a Wikicode object.

remove(obj, recursive=True)[source]

Remove obj from the list of nodes.

obj can be either a string, a Node, or another Wikicode object (as created by get_sections(), for example). If obj is a string, we will operate on all instances of that string within the code, otherwise only on the specific instance given. If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not found, ValueError is raised.

replace(obj, value, recursive=True)[source]

Replace obj with value.

obj can be either a string, a Node, or another Wikicode object (as created by get_sections(), for example). If obj is a string, we will operate on all instances of that string within the code, otherwise only on the specific instance given. value can be anything parsable by parse_anything(). If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not found, ValueError is raised.

set(index, value)[source]

Set the Node at index to value.

Raises IndexError if index is out of range, or ValueError if value cannot be coerced into one Node. To insert multiple nodes at an index, use get() with either remove() and insert() or replace().

strip_code(normalize=True, collapse=True)[source]

Return a rendered string without unprintable code such as templates.

The way a node is stripped is handled by the __strip__() method of Node objects, which generally returns a subset of their nodes or None. For example, templates and tags are removed completely, links are stripped to just their display part, and headings are stripped to just their title. If normalize is True, various things may be done to strip code further, such as converting HTML entities like &Sigma;, &#931;, and &#x3a3; to Σ. If collapse is True, we will try to remove excess whitespace as well (three or more newlines are converted to two, for example).
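The normalize and collapse steps can be approximated on an already stripped string with the standard library. tidy is a hypothetical Python 3 helper; it performs only entity decoding and newline collapsing, not the node-by-node stripping described above.

```python
import html
import re


def tidy(text, normalize=True, collapse=True):
    """Hypothetical post-processing of text that has no markup left."""
    if normalize:
        # Named, decimal, and hexadecimal entities all decode the same way.
        text = html.unescape(text)
    if collapse:
        # Three or more consecutive newlines become exactly two.
        text = re.sub(r"\n{3,}", "\n\n", text)
    return text
```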
