Web scraping with urlopen in Python

Personally, I would write:

    # Python 2.7
    import urllib

    url = 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'
    sock = urllib.urlopen(url)   # open the page
    content = sock.read()        # read the whole response body
    sock.close()
    print content

And if you speak French: bonjour, and welcome to stackoverflow.com!
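For readers on Python 3, the same fetch can be sketched with `urllib.request`, which replaces the Python 2 `urllib.urlopen` call. This is only a minimal sketch: the helper name `fetch_with_urllib` and the 30-second default timeout are assumptions made for illustration, not part of the original answer.

```python
from urllib.request import urlopen


def fetch_with_urllib(url, timeout=30):
    """Return the raw body (bytes) of the resource at `url`.

    Hypothetical helper: the name and the default timeout are
    illustrative assumptions.
    """
    # The response object is a context manager, so it is closed
    # automatically even if read() raises.
    with urlopen(url, timeout=timeout) as sock:
        return sock.read()


# Usage against the URL from the snippet above (requires network access):
# content = fetch_with_urllib(
#     'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS')
# print(content[:200])
```

Note that in Python 3 `read()` returns `bytes`, not `str`, which matters once the content is passed to a regular expression.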

Update 1

In fact, I now prefer the following code, because it is faster.

    # Python 2.7
    import httplib

    conn = httplib.HTTPConnection(host='www.boursorama.com', timeout=30)
    req = '/includes/cours/last_transactions.phtml?symbole=1xEURUS'
    try:
        conn.request('GET', req)
    except Exception:
        print 'echec de connexion'   # French for "connection failed"
        raise                        # do not go on to read a response that never arrived
    content = conn.getresponse().read()
    print content

Changing httplib to http.client in this code is almost enough to adapt it to Python 3; the print statements must also become print() calls.
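A minimal sketch of what that Python 3 port could look like is shown here. The function name `fetch_with_http_client` and its `port` parameter are hypothetical additions for illustration; only the host, path and 30-second timeout come from the snippet above.

```python
import http.client


def fetch_with_http_client(host, path, port=80, timeout=30):
    """Send a GET request and return the response body as bytes.

    Hypothetical helper: the name and the `port` parameter are
    assumptions added so the function can be pointed at any server.
    """
    conn = http.client.HTTPConnection(host, port=port, timeout=timeout)
    try:
        conn.request('GET', path)
        return conn.getresponse().read()
    finally:
        # Release the socket whether or not the request succeeded.
        conn.close()


# Usage with the host and path from the snippet above (requires network access):
# content = fetch_with_http_client(
#     'www.boursorama.com',
#     '/includes/cours/last_transactions.phtml?symbole=1xEURUS')
# print(content[:200])
```

Letting the exception propagate instead of printing a message keeps the caller from reading a response that was never received.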

I can confirm that either of these two snippets returns the page source containing the data you are interested in:

            <td  width="33%" align="center">11:57:44</td>
            <td  width="33%" align="center">1.4486</td>
            <td  width="33%" align="center">0</td>
        </tr>
        <tr>
            <td  width="33%" align="center">11:57:43</td>
            <td  width="33%" align="center">1.4486</td>
            <td  width="33%" align="center">0</td>
        </tr>

Update 2

Adding the following snippet to the code above extracts the data in question:

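    # Show every line of the page with its index, so the exact whitespace is visible
    for i,line in enumerate(content.splitlines(True)):
        print str(i)+' '+repr(line)

    print '\n\n'

    import re

    # Three consecutive cells: time, then a decimal number, then an integer
    regx = re.compile('\t\t\t\t\t\t<td  width="33%" align="center">(\d\d:\d\d:\d\d)</td>\r\n'
                      '\t\t\t\t\t\t<td  width="33%" align="center">([\d.]+)</td>\r\n'
                      '\t\t\t\t\t\t<td  width="33%" align="center">(\d+)</td>\r\n')

    print regx.findall(content)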

Result (only the end is shown)

    ........................................
    98 'window.config.graphics = {};\n'
    99 'window.config.accordions = {};\n'
    100 '\n'
    101 "window.addEvent('domready', function(){\n"
    102 '});\n'
    103 '</script>\n'
    104 '<script type="text/javascript">\n'
    105 '\t\t\t\tsas_tmstp = Math.round(Math.random()*10000000000);\n'
    106 '\t\t\t\tsas_pageid = "177/(includes/cours/last_transactions)"; // Page : boursorama.com/smartad_test\n'
    107 '\t\t\t\tvar sas_formatids = "8968";\n'
    108 '\t\t\t\tsas_target = "symb=1xEURUS#"; // TargetingArray\n'
    109 '\t\t\t\tdocument.write("<scr"+"ipt src=\\"http://ads.boursorama.com/call2/pubjall/" + sas_pageid + "/" + sas_formatids + "/" + sas_tmstp + "/" + escape(sas_target) + "?\\"></scr"+"ipt>");\t\t\t\t\n'
    110 '\t\t\t</script><div id="_smart1"><script language="javascript">sas_script(1,8968);</script></div><script type="text/javascript">\r\n'
    111 "\twindow.addEvent('domready', function(){\r\n"
    112 'sas_move(1,8968);\t});\r\n'
    113 '</script>\n'
    114 '<script type="text/javascript">\n'
    115 'var _gaq = _gaq || [];\n'
    116 "_gaq.push(['_setAccount', 'UA-1623710-1']);\n"
    117 "_gaq.push(['_setDomainName', 'www.boursorama.com']);\n"
    118 "_gaq.push(['_setCustomVar', 1, 'segment', 'WEB-VISITOR']);\n"
    119 "_gaq.push(['_setCustomVar', 4, 'version', '18']);\n"
    120 "_gaq.push(['_trackPageLoadTime']);\n"
    121 "_gaq.push(['_trackPageview']);\n"
    122 '(function() {\n'
    123 "var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;\n"
    124 "ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';\n"
    125 "var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);\n"
    126 '})();\n'
    127 '</script>\n'
    128 '</body>\n'
    129 '</html>'



    [('12:25:36', '1.4478', '0'), ('12:25:33', '1.4478', '0'), ('12:25:31', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:29', '1.4478', '0')]
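The pattern above depends on the exact tabs and `\r\n` line endings the page happened to use, so it stops matching as soon as the indentation changes. As a hedged alternative (a swapped-in variant, not the pattern from the answer), the sketch below matches the same three cells but tolerates any whitespace between them. It is written for Python 3 and runs on a hypothetical `SAMPLE` string modelled on the rows shown earlier; the names `CELL`, `ROW` and `extract_rows` are illustrative assumptions.

```python
import re

# Hypothetical sample modelled on the rows shown earlier. The second row
# deliberately uses different indentation and line endings.
SAMPLE = (
    '<tr>\r\n'
    '\t\t\t\t\t\t<td  width="33%" align="center">11:57:44</td>\r\n'
    '\t\t\t\t\t\t<td  width="33%" align="center">1.4486</td>\r\n'
    '\t\t\t\t\t\t<td  width="33%" align="center">0</td>\r\n'
    '</tr>\r\n'
    '<tr>\r\n'
    '  <td width="33%" align="center">11:57:43</td>\n'
    '  <td width="33%" align="center">1.4486</td>\n'
    '  <td width="33%" align="center">0</td>\n'
    '</tr>\n'
)

# One table cell; {} is filled with the capture group for its content.
# \s+ and \s* absorb whatever spaces, tabs or newlines the page uses.
CELL = r'<td\s+width="33%"\s+align="center">{}</td>\s*'

ROW = re.compile(
    CELL.format(r'(\d\d:\d\d:\d\d)')   # a time such as 11:57:44
    + CELL.format(r'([\d.]+)')         # a decimal number such as 1.4486
    + CELL.format(r'(\d+)')            # an integer such as 0
)


def extract_rows(html):
    """Return one (time, number, integer) tuple of strings per table row."""
    return ROW.findall(html)


rows = extract_rows(SAMPLE)
print(rows)
```

For anything beyond a quick one-off, an HTML parser is usually more robust than a regular expression, but this keeps to the approach used in the answer.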

I hope you do not intend to "play" at Forex trading: it is one of the best ways to lose money quickly.

Update 3

Sorry! I forgot that you are using Python 3. So I think you have to define the regex like this:

    regx = re.compile(b'\t\t\t\t\t......)

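That is, put b in front of the string. In Python 3, read() returns bytes, and applying a str pattern to bytes raises a TypeError, the kind of error described in this question.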
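The difference can be seen in a short Python 3 sketch. The `raw` value here is a hypothetical one-line stand-in for the downloaded page, and decoding it as UTF-8 is an assumption, since the page's real encoding is not stated in the answer.

```python
import re

# Hypothetical stand-in for the bytes returned by read() in Python 3.
raw = b'\t\t\t\t\t\t<td  width="33%" align="center">12:25:36</td>\r\n'

str_pattern = re.compile(r'<td  width="33%" align="center">(\d\d:\d\d:\d\d)</td>')
bytes_pattern = re.compile(rb'<td  width="33%" align="center">(\d\d:\d\d:\d\d)</td>')

# A str pattern cannot be applied to bytes: Python 3 raises TypeError.
try:
    str_pattern.findall(raw)
    mismatch_error = None
except TypeError as exc:
    mismatch_error = exc

# Option 1: use a bytes pattern; the matches are bytes as well.
bytes_hits = bytes_pattern.findall(raw)

# Option 2: decode first (UTF-8 is an assumption about the page's
# encoding), then use the ordinary str pattern.
text_hits = str_pattern.findall(raw.decode('utf-8'))

print(type(mismatch_error).__name__, bytes_hits, text_hits)
```

Decoding first is often the more convenient choice when the extracted values are going to be printed or converted to numbers, since the matches come back as ordinary strings.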


Original article: http://outofmemory.cn/zaji/5664109.html
