Python Cookbook 2nd Edition Jun 1002005 [Electronic resources] (text version)

David Ascher, Alex Martelli, Anna Ravenscroft

Recipe 12.11. Using MSHTML to Parse XML or HTML


Credit: Bill Bell


Problem







Your Python application, running on Windows,
needs to use the Microsoft MSHTML COM component, which is also the
parser that Microsoft Internet Explorer uses to parse HTML and XML
web pages.


Solution


As usual, PyWin32 lets our Python code access COM quite simply:

from win32com.client import Dispatch
html = Dispatch('htmlfile')  # the disguise for MSHTML as a COM server
# feed the whole page to the parser as one string
html.writeln("<html><head><title>A title</title>"
             "<meta name='a name' content='page description'></head>"
             "<body>This is some of it. <span>And this is the rest.</span>"
             "</body></html>")
print("Title: %s" % (html.title,))
print("Bag of words from body of the page: %s" % (html.body.innerText,))
print("URL associated with the page: %s" % (html.url,))
print("Display of name:content pairs from the metatags: ")
metas = html.getElementsByTagName("meta")
# the COM collection exposes its size as .length and supports indexing
for m in range(metas.length):
    print("\t%s: %s" % (metas[m].name, metas[m].content))


Discussion


Python offers many ways to parse HTML or XML, but as long as
your programs run only on Windows, MSHTML
is fast and simple to use. As the recipe shows, you use the
writeln method of the COM object to feed
the page into MSHTML, and then use the methods and properties
of the resulting document object to get at many aspects of the
page's DOM. Of course, you can obtain the string of
markup and text to feed into MSHTML in any way that suits your
application, such as by using the Python Standard Library module
urllib if you're getting a page
from some URL.
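
As a rough sketch of that last point, a page could be fetched with the standard library and handed to MSHTML along the following lines. It assumes Python 3, where the fetching function lives in urllib.request rather than directly in urllib. The helper names fetch_markup and feed_mshtml, the UTF-8 default, and the optional dispatch parameter (present only so the function can be exercised without COM) are illustrative and not part of the recipe.

```python
import urllib.request


def fetch_markup(url, encoding="utf-8"):
    """Return the page at `url` as text (the utf-8 default is an assumption)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")


def feed_mshtml(markup, dispatch=None):
    """Write `markup` into a fresh MSHTML document and return that document.

    `dispatch` is a hypothetical hook for substituting a stand-in factory;
    by default the real Dispatch from PyWin32 is used (Windows only).
    """
    if dispatch is None:
        from win32com.client import Dispatch as dispatch
    html = dispatch('htmlfile')  # same ProgID as in the Solution
    html.writeln(markup)
    return html
```

On a Windows machine with PyWin32 installed, feed_mshtml(fetch_markup(some_url)) would then return a document whose title, body, and other properties can be read as in the Solution.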

Since the DOM that MSHTML makes available
is quite rich and complicated, I suggest you experiment with it in
the PythonWin interactive environment that comes with PyWin32. The
strength of PythonWin for such exploratory tasks is that it displays
all of the properties and methods made available by each interface.
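
Outside PythonWin, a small helper can give a first overview of a parsed page. The sketch below lists the tag name of every element the document exposes. It relies on the all collection of the MSHTML document and the tagName property of each element, and it assumes that collection supports the same length-and-index access as the one used in the recipe's meta-tag loop. The function name element_tags is made up for illustration.

```python
def element_tags(document):
    """Return the tag names of all elements in an MSHTML document, in order.

    Assumes `document.all` is a collection with a `length` property and
    integer indexing, like the result of getElementsByTagName in the recipe.
    """
    elements = document.all
    return [elements[i].tagName for i in range(elements.length)]
```

Calling element_tags(html) on the document from the Solution gives a quick picture of how MSHTML structured the page before you dig into individual elements.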


See Also


A detailed reference to MSHTML, albeit oriented to Visual Basic and
C# users, can be found at http://www.xaml.net/articles/type.asp?o=MSHTML.

