Recipe 12.11. Using MSHTML to Parse XML or HTML
Credit: Bill Bell
Problem
Your Python application, running on Windows,
needs to use the Microsoft MSHTML COM component, which is also the
parser that Microsoft Internet Explorer uses to parse HTML and XML
web pages.
Solution
As usual, PyWin32 lets our Python code access COM quite simply:
from win32com.client import Dispatch
html = Dispatch('htmlfile')  # the disguise for MSHTML as a COM server
html.writeln( "<html><head><title>A title</title>"
              "<meta name='a name' content='page description'></head>"
              "<body>This is some of it. <span>And this is the rest.</span>"
              "</body></html>" )
print "Title: %s" % (html.title,)
print "Bag of words from body of the page: %s" % (html.body.innerText,)
print "URL associated with the page: %s" % (html.url,)
print "Display of name:content pairs from the metatags: "
metas = html.getElementsByTagName("meta")
for m in xrange(metas.length):
    print "\t%s: %s" % (metas[m].name, metas[m].content,)
Discussion
While Python offers many ways to parse HTML or XML, as long as
you're running your programs only on Windows, MSHTML
is very speedy and simple to use. As the recipe shows, you can simply
use the writeln method of the COM object to feed
the page into MSHTML, and then use the methods and properties
of that object to examine many aspects of the
page's DOM. Of course, you can get the string of
markup and text to feed into MSHTML in any way that suits your
application, such as by using the Python Standard Library module
urllib if you're getting a page
from some URL.

Since the structure of the DOM that MSHTML makes available
is quite rich and complicated, I suggest you experiment with it in
the PythonWin interactive environment that comes with PyWin32. The
strength of PythonWin for such exploratory tasks is that it displays
all of the properties and methods made available by each interface.
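As a rough sketch of the suggestion above to fetch the markup with urllib before handing it to MSHTML, the steps might be arranged as follows. The function names (fetch_markup, meta_pairs, parse_with_mshtml, show_page) are hypothetical helpers introduced only for illustration, the UTF-8 default is an assumption about the page's encoding, and the imports are written so that the module loads under either Python 2 or Python 3. Only parse_with_mshtml touches COM, so it alone requires Windows and PyWin32.

```python
from __future__ import print_function

try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib import urlopen           # Python 2


def fetch_markup(url, encoding="utf-8"):
    """Return the page at url as text (the encoding is an assumption)."""
    response = urlopen(url)
    try:
        return response.read().decode(encoding, "replace")
    finally:
        response.close()


def parse_with_mshtml(markup):
    """Feed markup to MSHTML and return the resulting document object."""
    # Imported here so the rest of the module loads on any platform.
    from win32com.client import Dispatch
    document = Dispatch("htmlfile")
    document.writeln(markup)
    return document


def meta_pairs(document):
    """Collect name/content pairs from the document's meta elements."""
    metas = document.getElementsByTagName("meta")
    return dict((metas[i].name, metas[i].content)
                for i in range(metas.length))


def show_page(url):
    """Fetch url, parse it with MSHTML, and print a short summary."""
    document = parse_with_mshtml(fetch_markup(url))
    print("Title: %s" % (document.title,))
    for name, content in sorted(meta_pairs(document).items()):
        print("\t%s: %s" % (name, content))
```

On a Windows machine with PyWin32 installed, a call such as show_page("http://www.example.com/") (a placeholder address) should print the page's title followed by its meta name and content pairs. Keeping the fetching and the COM access in separate functions means the markup could just as well come from a file or a string.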
See Also
A detailed reference to MSHTML, albeit oriented to Visual Basic and
C# users, can be found at http://www.xaml.net/articles/type.asp?o=MSHTML.