
Class BeautifulSoup

PageElement --+        
              |        
            Tag --+    
                  |    
 BeautifulStoneSoup --+
                      |
                     BeautifulSoup
Known Subclasses:

This parser knows the following facts about HTML:

* Some tags have no closing tag and should be interpreted as being
  closed as soon as they are encountered.

* The text inside some tags (e.g. 'script') may contain tags which
  are not really part of the document and which should be parsed
  as text, not tags. If you want to parse the text as tags, you can
  always fetch it and parse it explicitly.

* Tag nesting rules:

  Most tags can't be nested at all. For instance, the occurrence of
  a <p> tag should implicitly close the previous <p> tag.

   <p>Para1<p>Para2
    should be transformed into:
   <p>Para1</p><p>Para2

  Some tags can be nested arbitrarily. For instance, the occurrence
  of a <blockquote> tag should _not_ implicitly close the previous
  <blockquote> tag.

   Alice said: <blockquote>Bob said: <blockquote>Blah
    should NOT be transformed into:
   Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

  Some tags can be nested, but the nesting is reset by the
  interposition of other tags. For instance, a <tr> tag should
  implicitly close the previous <tr> tag within the same <table>,
  but not close a <tr> tag in another table.

   <table><tr>Blah<tr>Blah
    should be transformed into:
   <table><tr>Blah</tr><tr>Blah
    but,
   <tr>Blah<table><tr>Blah
    should NOT be transformed into:
   <tr>Blah<table></tr><tr>Blah

Differing assumptions about tag nesting rules are a major source
of problems with the BeautifulSoup class. If BeautifulSoup does not
treat a tag as nestable but your page author does, try
ICantBelieveItsBeautifulSoup, MinimalSoup, or BeautifulStoneSoup
before writing your own subclass.
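
The nesting rules above can be modelled with a small stand-alone sketch. This is not the BeautifulSoup implementation: it is a simplified toy built on the Python 3 standard library HTMLParser, the names ImplicitCloser and close_implicitly are hypothetical, and the rule table is only a subset of the class variables listed on this page. It drops attributes and only inserts the implicit closing tags.

```python
from html.parser import HTMLParser

# Subset of the nesting table shown on this page: each key is a nestable tag,
# and its list names the enclosing tags that reset the nesting.
NESTABLE_TAGS = {
    "blockquote": [],
    "table": [],
    "tr": ["table", "tbody", "tfoot", "thead"],
}


class ImplicitCloser(HTMLParser):
    """Toy model that re-emits markup with implicit closing tags added."""

    def __init__(self):
        super().__init__()
        self.stack = []  # names of the currently open tags, outermost first
        self.out = []    # output fragments

    def _close_down_to(self, depth):
        # Close open tags, innermost first, until only `depth` remain open.
        while len(self.stack) > depth:
            self.out.append("</%s>" % self.stack.pop())

    def handle_starttag(self, tag, attrs):
        triggers = NESTABLE_TAGS.get(tag)  # None means "cannot be nested"
        for i in range(len(self.stack) - 1, -1, -1):
            open_tag = self.stack[i]
            if triggers is None and open_tag == tag:
                self._close_down_to(i)      # also closes the earlier same-name tag
                break
            if triggers and open_tag in triggers:
                self._close_down_to(i + 1)  # the enclosing trigger stays open
                break
        self.stack.append(tag)
        self.out.append("<%s>" % tag)       # attributes are dropped in this sketch

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Close down to and including the innermost open tag of this name.
            index = len(self.stack) - 1 - self.stack[::-1].index(tag)
            self._close_down_to(index)

    def handle_data(self, data):
        self.out.append(data)


def close_implicitly(markup):
    parser = ImplicitCloser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(close_implicitly("<p>Para1<p>Para2"))         # <p>Para1</p><p>Para2
print(close_implicitly("<table><tr>Blah<tr>Blah"))  # <table><tr>Blah</tr><tr>Blah
```

A tag with an empty trigger list (blockquote here) never closes anything, a tag missing from the table closes its previous occurrence, and a tag with triggers closes only what was opened inside the nearest trigger.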

Instance Methods
 
__init__(self, *args, **kwargs)
The Soup object is initialized as the 'root tag', and the provided markup (which can be a string or a file-like object) is fed into the underlying parser.
 
extractCharsetFromMeta(self, attrs)
Beautiful Soup can detect a charset included in a META tag, try to convert the document to that charset, and re-parse the document from the beginning.
 
__call__(self, *args, **kwargs)
Calling a tag like a function is the same as calling its findAll() method. (Inherited from rosdeb.BeautifulSoup.Tag)
 
__contains__(self, x) (Inherited from rosdeb.BeautifulSoup.Tag)
 
__delitem__(self, key)
Deleting tag[key] deletes all 'key' attributes for the tag. (Inherited from rosdeb.BeautifulSoup.Tag)
 
__eq__(self, other)
Returns true iff this tag has the same name, the same attributes, and the same contents (recursively) as the given tag. (Inherited from rosdeb.BeautifulSoup.Tag)
 
__getattr__(self, tag) (Inherited from rosdeb.BeautifulSoup.Tag)
 
__getitem__(self, key)
tag[key] returns the value of the 'key' attribute for the tag, and throws an exception if it's not there. (Inherited from rosdeb.BeautifulSoup.Tag)
 
__iter__(self)
Iterating over a tag iterates over its contents. (Inherited from rosdeb.BeautifulSoup.Tag)
 
__len__(self)
The length of a tag is the length of its list of contents. (Inherited from rosdeb.BeautifulSoup.Tag)
 
__ne__(self, other)
Returns true iff this tag is not identical to the other tag, as defined in __eq__. (Inherited from rosdeb.BeautifulSoup.Tag)
 
__nonzero__(self)
A tag is non-None even if it has no contents. (Inherited from rosdeb.BeautifulSoup.Tag)
 
__repr__(self, encoding='utf-8')
Renders this tag as a string. (Inherited from rosdeb.BeautifulSoup.Tag)
 
__setitem__(self, key, value)
Setting tag[key] sets the value of the 'key' attribute for the tag. (Inherited from rosdeb.BeautifulSoup.Tag)
 
__str__(self) (Inherited from rosdeb.BeautifulSoup.Tag)
 
__unicode__(self) (Inherited from rosdeb.BeautifulSoup.Tag)
 
append(self, tag)
Appends the given tag to the contents of this tag. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
childGenerator(self) (Inherited from rosdeb.BeautifulSoup.Tag)
 
decode(self, prettyPrint=False, indentLevel=0, eventualEncoding='utf-8')
Returns a string or Unicode representation of this tag and its contents. (Inherited from rosdeb.BeautifulSoup.Tag)
 
decodeContents(self, prettyPrint=False, indentLevel=0, eventualEncoding='utf-8')
Renders the contents of this tag as a string in the given encoding. (Inherited from rosdeb.BeautifulSoup.Tag)
 
decompose(self)
Recursively destroys the contents of this tree. (Inherited from rosdeb.BeautifulSoup.Tag)
 
encode(self, encoding='utf-8', prettyPrint=False, indentLevel=0) (Inherited from rosdeb.BeautifulSoup.Tag)
 
encodeContents(self, encoding='utf-8', prettyPrint=False, indentLevel=0) (Inherited from rosdeb.BeautifulSoup.Tag)
 
endData(self, containerClass=<class 'rosdeb.BeautifulSoup.NavigableString'>) (Inherited from rosdeb.BeautifulSoup.BeautifulStoneSoup)
 
extract(self)
Destructively rips this element out of the tree. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
fetch(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
Extracts a list of Tag objects that match the given criteria. (Inherited from rosdeb.BeautifulSoup.Tag)
 
fetchNextSiblings(self, name=None, attrs={}, text=None, limit=None, **kwargs)
Returns the siblings of this Tag that match the given criteria and appear after this Tag in the document. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
fetchParents(self, name=None, attrs={}, limit=None, **kwargs)
Returns the parents of this Tag that match the given criteria. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
fetchPrevious(self, name=None, attrs={}, text=None, limit=None, **kwargs)
Returns all items that match the given criteria and appear before this Tag in the document. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
fetchPreviousSiblings(self, name=None, attrs={}, text=None, limit=None, **kwargs)
Returns the siblings of this Tag that match the given criteria and appear before this Tag in the document. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
fetchText(self, text=None, recursive=True, limit=None) (Inherited from rosdeb.BeautifulSoup.Tag)
 
find(self, name=None, attrs={}, recursive=True, text=None, **kwargs)
Return only the first child of this Tag matching the given criteria. (Inherited from rosdeb.BeautifulSoup.Tag)
 
findAll(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
Extracts a list of Tag objects that match the given criteria. (Inherited from rosdeb.BeautifulSoup.Tag)
 
findAllNext(self, name=None, attrs={}, text=None, limit=None, **kwargs)
Returns all items that match the given criteria and appear after this Tag in the document. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
findAllPrevious(self, name=None, attrs={}, text=None, limit=None, **kwargs)
Returns all items that match the given criteria and appear before this Tag in the document. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
findChild(self, name=None, attrs={}, recursive=True, text=None, **kwargs)
Return only the first child of this Tag matching the given criteria. (Inherited from rosdeb.BeautifulSoup.Tag)
 
findChildren(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
Extracts a list of Tag objects that match the given criteria. (Inherited from rosdeb.BeautifulSoup.Tag)
 
findNext(self, name=None, attrs={}, text=None, **kwargs)
Returns the first item that matches the given criteria and appears after this Tag in the document. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
findNextSibling(self, name=None, attrs={}, text=None, **kwargs)
Returns the closest sibling to this Tag that matches the given criteria and appears after this Tag in the document. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
findNextSiblings(self, name=None, attrs={}, text=None, limit=None, **kwargs)
Returns the siblings of this Tag that match the given criteria and appear after this Tag in the document. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
findParent(self, name=None, attrs={}, **kwargs)
Returns the closest parent of this Tag that matches the given criteria. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
findParents(self, name=None, attrs={}, limit=None, **kwargs)
Returns the parents of this Tag that match the given criteria. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
findPrevious(self, name=None, attrs={}, text=None, **kwargs)
Returns the first item that matches the given criteria and appears before this Tag in the document. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs)
Returns the closest sibling to this Tag that matches the given criteria and appears before this Tag in the document. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
findPreviousSiblings(self, name=None, attrs={}, text=None, limit=None, **kwargs)
Returns the siblings of this Tag that match the given criteria and appear before this Tag in the document. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
first(self, name=None, attrs={}, recursive=True, text=None, **kwargs)
Return only the first child of this Tag matching the given criteria. (Inherited from rosdeb.BeautifulSoup.Tag)
 
firstText(self, text=None, recursive=True) (Inherited from rosdeb.BeautifulSoup.Tag)
 
get(self, key, default=None)
Returns the value of the 'key' attribute for the tag, or the value given for 'default' if it doesn't have that attribute. (Inherited from rosdeb.BeautifulSoup.Tag)
 
handle_data(self, data) (Inherited from rosdeb.BeautifulSoup.BeautifulStoneSoup)
 
has_key(self, key) (Inherited from rosdeb.BeautifulSoup.Tag)
 
insert(self, position, newChild) (Inherited from rosdeb.BeautifulSoup.PageElement)
 
isSelfClosingTag(self, name)
Returns true iff the given string is the name of a self-closing tag according to this parser. (Inherited from rosdeb.BeautifulSoup.BeautifulStoneSoup)
 
nextGenerator(self) (Inherited from rosdeb.BeautifulSoup.PageElement)
 
nextSiblingGenerator(self) (Inherited from rosdeb.BeautifulSoup.PageElement)
 
parentGenerator(self) (Inherited from rosdeb.BeautifulSoup.PageElement)
 
popTag(self) (Inherited from rosdeb.BeautifulSoup.BeautifulStoneSoup)
 
prettify(self, encoding='utf-8') (Inherited from rosdeb.BeautifulSoup.Tag)
 
previousGenerator(self) (Inherited from rosdeb.BeautifulSoup.PageElement)
 
previousSiblingGenerator(self) (Inherited from rosdeb.BeautifulSoup.PageElement)
 
pushTag(self, tag) (Inherited from rosdeb.BeautifulSoup.BeautifulStoneSoup)
 
recursiveChildGenerator(self) (Inherited from rosdeb.BeautifulSoup.Tag)
 
renderContents(self, encoding='utf-8', prettyPrint=False, indentLevel=0) (Inherited from rosdeb.BeautifulSoup.Tag)
 
replaceWith(self, replaceWith) (Inherited from rosdeb.BeautifulSoup.PageElement)
 
reset(self) (Inherited from rosdeb.BeautifulSoup.BeautifulStoneSoup)
 
setup(self, parent=None, previous=None)
Sets up the initial relations between this element and other elements. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
substituteEncoding(self, str, encoding=None) (Inherited from rosdeb.BeautifulSoup.PageElement)
 
toEncoding(self, s, encoding=None)
Encodes an object to a string in some encoding, or to Unicode. (Inherited from rosdeb.BeautifulSoup.PageElement)
 
unknown_endtag(self, name) (Inherited from rosdeb.BeautifulSoup.BeautifulStoneSoup)
 
unknown_starttag(self, name, attrs, selfClosing=0) (Inherited from rosdeb.BeautifulSoup.BeautifulStoneSoup)
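
Several of the inherited methods above describe container-style protocols (tag[key], get, del tag[key], iteration, len, calling a tag). The sketch below illustrates those behaviours with a hypothetical MiniTag class written for Python 3. It is a stand-in for illustration only, not the Tag class from this module; its constructor, its attrs list of (key, value) pairs, and its findAll that matches by name only are all assumptions.

```python
class MiniTag:
    """Hypothetical stand-in that mimics the protocols described above."""

    def __init__(self, name, attrs=(), contents=()):
        self.name = name
        self.attrs = list(attrs)        # list of (key, value) pairs
        self.contents = list(contents)  # child MiniTag objects or strings

    def get(self, key, default=None):
        # Value of the 'key' attribute, or `default` if it is absent.
        for k, v in self.attrs:
            if k == key:
                return v
        return default

    def __getitem__(self, key):
        # tag[key] raises an exception when the attribute is missing.
        for k, v in self.attrs:
            if k == key:
                return v
        raise KeyError(key)

    def __setitem__(self, key, value):
        for i, (k, _) in enumerate(self.attrs):
            if k == key:
                self.attrs[i] = (key, value)
                return
        self.attrs.append((key, value))

    def __delitem__(self, key):
        # Deleting tag[key] removes every attribute named 'key'.
        self.attrs = [(k, v) for k, v in self.attrs if k != key]

    def __iter__(self):
        return iter(self.contents)

    def __len__(self):
        return len(self.contents)

    def __bool__(self):
        # Python 3 spelling of __nonzero__: a tag is truthy even when empty.
        return True

    def findAll(self, name=None):
        # Simplified: matches descendants by tag name only.
        found = []
        for child in self.contents:
            if isinstance(child, MiniTag):
                if name is None or child.name == name:
                    found.append(child)
                found.extend(child.findAll(name))
        return found

    def __call__(self, *args, **kwargs):
        # Calling a tag is the same as calling its findAll() method.
        return self.findAll(*args, **kwargs)


link = MiniTag("a", [("href", "x"), ("href", "y")])
para = MiniTag("p", [("id", "top")], ["text", link])
print(para["id"], para.get("class", "none"), len(para), para("a") == [link])
```

Defining the truthiness method matters because __len__ would otherwise make an empty tag evaluate as false.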
Class Variables
  SELF_CLOSING_TAGS = {'base': None, 'br': None, 'frame': None, ...
  PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
  QUOTE_TAGS = {'script': None, 'textarea': None}
  NESTABLE_INLINE_TAGS = ['span', 'font', 'q', 'object', 'bdo', ...
  NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins',...
  NESTABLE_LIST_TAGS = {'dd': ['dl'], 'dl': [], 'dt': ['dl'], 'l...
  NESTABLE_TABLE_TAGS = {'table': [], 'tbody': ['table'], 'td': ...
  NON_NESTABLE_BLOCK_TAGS = ['address', 'form', 'p', 'pre']
  RESET_NESTING_TAGS = {'address': None, 'blockquote': None, 'dd...
  NESTABLE_TAGS = {'bdo': [], 'blockquote': [], 'center': [], 'd...
  CHARSET_RE = re.compile(r'(?m)((^|;)\s*charset=)([^;]*)')
  ALL_ENTITIES = 'xhtml' (Inherited from rosdeb.BeautifulSoup.BeautifulStoneSoup)
  BARE_AMPERSAND_OR_BRACKET = re.compile(r'([<>]|&(?!#\d+;|#x[0-... (Inherited from rosdeb.BeautifulSoup.Tag)
  HTML_ENTITIES = 'html' (Inherited from rosdeb.BeautifulSoup.BeautifulStoneSoup)
  MARKUP_MASSAGE = [(re.compile('(<[^<>]*)/>'), lambda x: x.grou... (Inherited from rosdeb.BeautifulSoup.BeautifulStoneSoup)
  ROOT_TAG_NAME = u'[document]' (Inherited from rosdeb.BeautifulSoup.BeautifulStoneSoup)
  STRIP_ASCII_SPACES = {9: None, 10: None, 12: None, 13: None, 3... (Inherited from rosdeb.BeautifulSoup.BeautifulStoneSoup)
  XHTML_ENTITIES = 'xhtml' (Inherited from rosdeb.BeautifulSoup.BeautifulStoneSoup)
  XML_ENTITIES = 'xml' (Inherited from rosdeb.BeautifulSoup.BeautifulStoneSoup)
  XML_ENTITIES_TO_SPECIAL_CHARS = {'amp': '&', 'apos': '\'', 'gt... (Inherited from rosdeb.BeautifulSoup.Tag)
  XML_SPECIAL_CHARS_TO_ENTITIES = {'"': 'quot', '&': 'amp', '\''... (Inherited from rosdeb.BeautifulSoup.Tag)
Method Details

__init__(self, *args, **kwargs)
(Constructor)

The Soup object is initialized as the 'root tag', and the
provided markup (which can be a string or a file-like object)
is fed into the underlying parser.

HTMLParser will process most bad HTML, and the BeautifulSoup
class has some tricks for dealing with some HTML that kills
HTMLParser, but Beautiful Soup can nonetheless choke or lose data
if your data uses self-closing tags or declarations
incorrectly.

By default, Beautiful Soup uses regexes to sanitize input,
avoiding the vast majority of these problems. If the problems
don't apply to you, pass in False for markupMassage, and
you'll get better performance.

The default parser massage techniques fix the two most common
instances of invalid HTML that choke HTMLParser:

 <br/> (No space between name of closing tag and tag close)
 <! --Comment--> (Extraneous whitespace in declaration)

You can pass in a custom list of (RE object, replace method)
tuples to get Beautiful Soup to scrub your input the way you
want.
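
As a rough illustration of the massage step, the sketch below applies a list of (RE object, replace method) tuples to a string before parsing. The first pattern is the one visible in the truncated MARKUP_MASSAGE value on this page; the second pattern, both replacement functions, and the names EXAMPLE_MASSAGE and massage are assumptions made for this example, not values copied from the library.

```python
import re

# Hypothetical (pattern, replacement function) pairs modelled on the two fixes
# described above; not copied from the library's MARKUP_MASSAGE value.
EXAMPLE_MASSAGE = [
    # "<br/>" -> "<br />": put a space before the closing "/>"
    (re.compile(r"(<[^<>]*)/>"), lambda m: m.group(1) + " />"),
    # "<! --Comment-->" -> "<!--Comment-->": drop whitespace after "<!"
    (re.compile(r"<!\s+([^<>]*)>"), lambda m: "<!" + m.group(1) + ">"),
]


def massage(markup, fixes=EXAMPLE_MASSAGE):
    """Apply each (pattern, replacement) pair to the markup in order."""
    for pattern, replace in fixes:
        markup = pattern.sub(replace, markup)
    return markup


print(massage("<br/>"))            # <br />
print(massage("<! --Comment-->"))  # <!--Comment-->
```

A custom list in the same shape is what the markupMassage argument is described as accepting.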

Overrides: Tag.__init__
(inherited documentation)

extractCharsetFromMeta(self, attrs)

Beautiful Soup can detect a charset included in a META tag, try to convert the document to that charset, and re-parse the document from the beginning.

Overrides: BeautifulStoneSoup.extractCharsetFromMeta
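
The detection step can be sketched with the CHARSET_RE pattern listed under the class variables. The helper name charset_from_meta and the assumption that attrs arrives as a list of (name, value) pairs are illustrative; the sketch only finds the declared charset and does not convert or re-parse anything.

```python
import re

# Pattern as shown for CHARSET_RE in the class variables on this page.
CHARSET_RE = re.compile(r'(?m)((^|;)\s*charset=)([^;]*)')


def charset_from_meta(attrs):
    """Return the charset declared by a META tag's attributes, or None.

    `attrs` is assumed to be a list of (name, value) pairs.
    """
    lookup = dict((name.lower(), value or "") for name, value in attrs)
    if lookup.get("http-equiv", "").lower() != "content-type":
        return None
    match = CHARSET_RE.search(lookup.get("content", ""))
    return match.group(3).strip() if match else None


meta = [("http-equiv", "Content-Type"),
        ("content", "text/html; charset=ISO-8859-1")]
print(charset_from_meta(meta))  # ISO-8859-1
```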

Class Variable Details

SELF_CLOSING_TAGS

Value:
{'base': None,
 'br': None,
 'frame': None,
 'hr': None,
 'img': None,
 'input': None,
 'link': None,
 'meta': None,
...

NESTABLE_INLINE_TAGS

Value:
['span', 'font', 'q', 'object', 'bdo', 'sub', 'sup', 'center']

NESTABLE_BLOCK_TAGS

Value:
['blockquote', 'div', 'fieldset', 'ins', 'del']

NESTABLE_LIST_TAGS

Value:
{'dd': ['dl'],
 'dl': [],
 'dt': ['dl'],
 'li': ['ul', 'ol'],
 'ol': [],
 'ul': []}

NESTABLE_TABLE_TAGS

Value:
{'table': [],
 'tbody': ['table'],
 'td': ['tr'],
 'tfoot': ['table'],
 'th': ['tr'],
 'thead': ['table'],
 'tr': ['table', 'tbody', 'tfoot', 'thead']}

RESET_NESTING_TAGS

Value:
{'address': None,
 'blockquote': None,
 'dd': ['dl'],
 'del': None,
 'div': None,
 'dl': [],
 'dt': ['dl'],
 'fieldset': None,
...

NESTABLE_TAGS

Value:
{'bdo': [],
 'blockquote': [],
 'center': [],
 'dd': ['dl'],
 'del': [],
 'div': [],
 'dl': [],
 'dt': ['dl'],
...
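
The visible entries of NESTABLE_TAGS agree with a merge of the four NESTABLE_* variables listed above, with list entries given an empty reset list. The sketch below performs that merge. The helper name merge_nestable is hypothetical, and only the first eight keys can be checked against the truncated listing; the rest is the sketch's own output, not a confirmed value of the class variable.

```python
NESTABLE_INLINE_TAGS = ['span', 'font', 'q', 'object', 'bdo', 'sub', 'sup', 'center']
NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
NESTABLE_LIST_TAGS = {'dd': ['dl'], 'dl': [], 'dt': ['dl'],
                      'li': ['ul', 'ol'], 'ol': [], 'ul': []}
NESTABLE_TABLE_TAGS = {'table': [], 'tbody': ['table'], 'td': ['tr'],
                       'tfoot': ['table'], 'th': ['tr'], 'thead': ['table'],
                       'tr': ['table', 'tbody', 'tfoot', 'thead']}


def merge_nestable(*sources):
    """Merge lists and dicts of tag names into one {tag: reset triggers} dict."""
    merged = {}
    for source in sources:
        if isinstance(source, dict):
            for name, triggers in source.items():
                merged[name] = list(triggers)
        else:
            for name in source:
                merged[name] = []  # nestable with no reset triggers
    return merged


nestable = merge_nestable(NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
                          NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)
for name in sorted(nestable)[:8]:
    print(name, nestable[name])
```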