添加链接
link管理
链接快照平台
  • 输入网页链接,自动生成快照
  • 标签化管理网页链接

为了从 HTML 文档中提取数据,将 SGMLParser 类进行子类化,然后对想要捕捉的标记或实体定义方法。

HTML 文档中提取数据的第一步是得到某个 HTML 文件。如果在您的硬盘里存放着 HTML 文件,您可以使用 file 函数 将它读出来,但是真正有意思的是从实际的网页得到 HTML

例 8.5. urllib 介绍

>>> import urllib
1 >>> sock = urllib.urlopen( "http://diveintopython.org/" ) 2 >>> htmlSource = sock.read() 3 >>> sock.close() 4 >>> print htmlSource 5 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head> <meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'> <title>Dive Into Python</title> <link rel='stylesheet' href='diveintopython.css' type='text/css'> <link rev='made' href='mailto: [email protected] '> <meta name='keywords' content='Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free'> <meta name='description' content='a free Python tutorial for experienced programmers'> </head> <body bgcolor='white' text='black' link='#0000FF' vlink='#840084' alink='#0000FF'> <table cellpadding='0' cellspacing='0' border='0' width='100%'> <tr><td class='header' width='1%' valign='top'>diveintopython.org</td> <td width='99%' align='right'><hr size='1' noshade></td></tr> <tr><td class='tagline' colspan='2'>Python&nbsp;for&nbsp;experienced&nbsp;programmers</td></tr> [...略...] >>> import urllib, urllister >>> usock = urllib.urlopen( "http://diveintopython.org/" ) >>> parser = urllister.URLLister() >>> parser.feed(usock.read()) 1 >>> usock.close() 2 >>> parser.close() 3 >>> for url in parser.urls: print url 4 toc/index.html #download #languages toc/index.html appendix/history.html download/diveintopython-html-5.0.zip download/diveintopython-pdf-5.0.zip download/diveintopython-word-5.0.zip download/diveintopython-text-5.0.zip download/diveintopython-html-flat-5.0.zip download/diveintopython-xml-5.0.zip download/diveintopython-common-5.0.zip ...略...