Public Member Functions
def __get_current_state__
def __handle_body_tag__
def __init__
def __update_state_machine_end__
def __update_state_machine_start__
def get_breadcrumbs
def get_doc
def get_links
def get_title
def handle_charref
def handle_data
def handle_decl
def handle_endtag
def handle_entityref
def handle_starttag
Public Attributes
breadcrumbs
current_state
div_bookmark
div_level
div_state_map
links
out_doc
page_title
state
toc
WikidotParser is used to clean a page from www.wikidot.com, keeping only the interesting content.
def wikidot.parser.WikidotParser.__init__(self)
def wikidot.parser.WikidotParser.__handle_body_tag__(self, tag, attrs)
def wikidot.parser.WikidotParser.__update_state_machine_end__(self, tag)
def wikidot.parser.WikidotParser.__update_state_machine_start__(self, tag, attrs)
def wikidot.parser.WikidotParser.get_breadcrumbs(self)
def wikidot.parser.WikidotParser.get_doc(self)
def wikidot.parser.WikidotParser.get_links(self)
def wikidot.parser.WikidotParser.get_title(self)
def wikidot.parser.WikidotParser.handle_charref(self, name)
def wikidot.parser.WikidotParser.handle_data(self, data)
def wikidot.parser.WikidotParser.handle_decl(self, decl)
def wikidot.parser.WikidotParser.handle_endtag(self, tag)
def wikidot.parser.WikidotParser.handle_entityref(self, name)
def wikidot.parser.WikidotParser.handle_starttag(self, tag, attrs)
Overridden. Called when a start tag is parsed. The heart of this function is the state machine: when a <div> tag is detected, its attributes are compared against a map of the form (name, value) -> state. If a match occurs, the matching state is pushed onto the top of the stack. Depending on the current state, the start tag is either queued for output or discarded.
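
The state machine described for handle_starttag can be illustrated with a minimal sketch. This is not the actual WikidotParser implementation, which is not shown here. It assumes Python 3's html.parser.HTMLParser as the base class, and the class name DivStateSketch, the helper _top_state, the function clean, the two state names, and the entries in DIV_STATE_MAP are all hypothetical. The attribute names state, div_level, div_bookmark, and out_doc are borrowed from the attribute list above, but their exact roles here are an assumption.

```python
from html.parser import HTMLParser

# Hypothetical states; the real parser's states are not documented here.
STATE_SKIP = "skip"
STATE_CONTENT = "content"

# Hypothetical (name, value) -> state map; the real entries are unknown.
DIV_STATE_MAP = {
    ("id", "page-content"): STATE_CONTENT,
    ("id", "side-bar"): STATE_SKIP,
}


class DivStateSketch(HTMLParser):
    """Keeps only markup found inside <div>s mapped to STATE_CONTENT."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.state = [STATE_SKIP]  # state stack; the bottom entry is the default
        self.div_level = 0         # current <div> nesting depth
        self.div_bookmark = []     # depth at which each pushed state was entered
        self.out_doc = []          # fragments queued for output

    def _top_state(self):
        return self.state[-1]

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_level += 1
            # attrs is a list of (name, value) pairs, matching the map's keys.
            for pair in attrs:
                if pair in DIV_STATE_MAP:
                    self.state.append(DIV_STATE_MAP[pair])
                    self.div_bookmark.append(self.div_level)
                    break
        # Queue the start tag only if the (possibly updated) state allows it.
        if self._top_state() == STATE_CONTENT:
            rendered = "".join(
                ' %s="%s"' % (k, v) for k, v in attrs if v is not None
            )
            self.out_doc.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        if self._top_state() == STATE_CONTENT:
            self.out_doc.append("</%s>" % tag)
        if tag == "div":
            # Pop the state when the <div> that pushed it is closed.
            if self.div_bookmark and self.div_bookmark[-1] == self.div_level:
                self.div_bookmark.pop()
                self.state.pop()
            self.div_level -= 1

    def handle_data(self, data):
        if self._top_state() == STATE_CONTENT:
            self.out_doc.append(data)


def clean(html):
    """Feed html through the sketch parser and return the kept markup."""
    parser = DivStateSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.out_doc)
```

In this sketch a nested <div> that maps to the skip state suppresses its own content, and output resumes once that <div> closes, because the previous state is restored from the stack. Recording the nesting depth alongside each pushed state is one simple way to know which closing </div> should pop it; the real parser may track this differently.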