Hello,

Great service! I've just signed up today and have a 'Hacker' paid account.

I am trying to automate a task to check a website to see if anything has changed.

It seems like, for whatever reason, when I try to access the website using either lxml or requests, PA is 'hanging' / timing out.

I tried a half dozen other sites and was able to get a response of some kind from each... see below:

import requests

r = requests.get("http://www.pythonanywhere.com/terms")
print 'REQUESTS HTTP response code for the url ', 'python anywhere', ' => ', r.status_code
r = requests.get("http://www.cnn.com")
print 'REQUESTS HTTP response code for the url ', 'cnn', ' => ', r.status_code
page = requests.get("http://www.politico.com")
print 'REQUESTS HTTP response code for the url ', 'Politico', ' => ', page.status_code
page = requests.get("http://www.reddit.com")
print 'REQUESTS HTTP response code for the url ', 'Reddit', ' => ', page.status_code
page = requests.get("http://www.ticketmaster.com")
print 'REQUESTS HTTP response code for the url ', 'Ticket Master', ' => ', page.status_code
page = requests.get("http://www.apeconcerts.com")
print 'REQUESTS HTTP response code for the url ', 'APE Concerts', ' => ', page.status_code

results in...

01:15 ~ $ python ticktateScraperv02.py
Completed Yesterday Module
REQUESTS HTTP response code for the url  python anywhere  =>  200
REQUESTS HTTP response code for the url  cnn  =>  200
REQUESTS HTTP response code for the url  Politico  =>  200
REQUESTS HTTP response code for the url  Reddit  =>  200
REQUESTS HTTP response code for the url  Ticket Master  =>  403

...and then nothing until I do a KeyboardInterrupt:

^CTraceback (most recent call last):
  File "ticktateScraperv02.py", line 64, in <module>
    page = requests.get( "http://www.apeconcerts.com" )
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 55, in get
    return request('get', url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 383, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 486, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 334, in send
    timeout=timeout
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py", line 480, in urlopen
    body=body, headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py", line 313, in _make_request
    httplib_response = conn.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1051, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 415, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
KeyboardInterrupt
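For what it's worth, `requests.get()` with no `timeout=` argument will wait indefinitely on a server that accepts the connection but never answers, which matches the traceback above (stuck in `self._sock.recv`). A minimal Python 3 stdlib sketch of that failure mode, simulated with a local socket rather than the real site:

```python
import socket

# Hedged sketch (not the original script): a server that accepts a TCP
# connection but never sends a response looks exactly like a hang unless
# the client sets a timeout. We simulate one with a local listening socket
# that is never accept()ed -- the TCP handshake still completes because
# the connection sits in the listen backlog.
server = socket.socket()
server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
server.listen(1)
host, port = server.getsockname()

timed_out = False
try:
    with socket.create_connection((host, port), timeout=0.5) as client:
        client.recv(1)          # blocks: the "server" never answers
except socket.timeout:
    timed_out = True

server.close()
print(timed_out)  # → True
```

Passing e.g. `requests.get(url, timeout=10)` turns the silent hang into a `requests.exceptions.Timeout` that can be caught and retried.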

The code works fine from my local machine, by the way.

Thoughts?

Thanks!

Sincerely, Michael

Harry,

Thanks for the response. Yeah, it is the same sites (or at least, this one). APE doesn't have any public terms & conditions about scraping (it's a regional concert promotion company), but that could be the case, I suppose. I left the code running in a console over night and it's still hung.

I actually switched over to lxml and it is working to access the site, but now I'm running into an error using XPath and count(). Again, this works fine locally, but it throws an error on PA.

import lxml.etree

url = "http://www.apeconcerts.com"
parser = lxml.etree.HTMLParser(encoding='utf-8')
tree = lxml.etree.parse(url, parser)
print "got the tree"
print tree
count = tree.xpath("count(//*[@id='main']/div[2]/div)")

This results in the error:

15:42 ~ $ python ticktateScraperv03.py
Completed Yesterday Module
got the tree
<lxml.etree._ElementTree object at 0x7f0efd512fc8>
Traceback (most recent call last):
  File "ticktateScraperv03.py", line 81, in <module>
    count = tree.xpath("count(//*[@id='main']/div[2]/div)")
  File "lxml.etree.pyx", line 2111, in lxml.etree._ElementTree.xpath (src/lxml/lxml.etree.c:57604)
  File "lxml.etree.pyx", line 1780, in lxml.etree._ElementTree._assertHasRoot (src/lxml/lxml.etree.c:54277)
AssertionError: ElementTree not initialized, missing root

My current hypothesis is that there is ambiguity in what the root is (which, as I understand it, would be the 'html' tag). So I think I either need to write out the full XPath or somehow indicate that the html tag is the root?
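To make the intent of that count() concrete, here is a Python 3 stdlib sketch against made-up sample markup (the real page isn't reachable here, so the structure is an assumption). ElementTree supports positional predicates like `div[2]` but not the XPath `count()` function, so `len()` stands in for it:

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for the real page's markup.
SAMPLE = """<html><body><main id="main">
  <div>header</div>
  <div>
    <div>Show A</div>
    <div>Show B</div>
    <div>Show C</div>
  </div>
</main></body></html>"""

root = ET.fromstring(SAMPLE)

# Equivalent of count(//*[@id='main']/div[2]/div): the div children of
# the second div under the element with id="main".
events = root.findall(".//*[@id='main']/div[2]/div")
print(len(events))  # → 3
```

Note the root element here really is `html`; if the document parsed at all, the `//` in the expression already searches from it, so spelling out the full path shouldn't be necessary.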

I am planning to put some time against this later this evening.

Any thoughts prior to that would be very much appreciated!

Sincerely, Michael

PS - If not clear, I'm a Python beginner (esp. web) and a PA total beginner, so I appreciate the help.

No probs. My guess is that the "missing root" is lxml's unhelpful way of telling you it couldn't actually load anything from that URL... you could do a print(tree) or something similar to confirm if that's the case.

In the meantime, I think we might be blocked from scraping that particular site. You could contact the site administrators and ask them if they do sometimes block scraping requests?
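One thing worth testing (an assumption, not a diagnosis: it depends on how the site filters) is whether the block keys off the default client User-Agent, since strings like `python-requests/x.y` and `Python-urllib/2.7` are commonly filtered. A stdlib sketch of attaching a browser-like header, shown with urllib so the request can be inspected without touching the network:

```python
import urllib.request

# Assumption: the site may be rejecting the default script User-Agent.
# A browser-like value sometimes gets past such filters; whether that is
# appropriate depends on the site's terms of use.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"

def build_request(url):
    """Build a GET request carrying a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})

req = build_request("http://www.apeconcerts.com")
# urllib normalizes header names to "Capitalized" form internally.
print(req.get_header("User-agent"))  # → Mozilla/5.0 (X11; Linux x86_64)
```

With requests the equivalent is `requests.get(url, headers={'User-Agent': BROWSER_UA}, timeout=10)`; calling `urllib.request.urlopen(req)` would perform the actual fetch.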

Harry - thanks. Yes, that looks like what is happening.

It does return an ElementTree object, but when I run:

url = "http://www.apeconcerts.com"
parser = lxml.etree.HTMLParser(encoding='utf-8')
tree = lxml.etree.parse(url, parser)
print "got the tree"
print tree
treeString = lxml.etree.tostring(tree)
print treeString

results in...

got the tree
<lxml.etree._ElementTree object at 0x7f22e5467a70>

And then the file continues to run until it hits the next xpath() call:

Traceback (most recent call last):
  File "ticktateScraperv03.py", line 67, in <module>
    testpath = tree.xpath("//html/body/main/div/div/div[1]/h1/a/text()")
  File "lxml.etree.pyx", line 2111, in lxml.etree._ElementTree.xpath (src/lxml/lxml.etree.c:57604)
  File "lxml.etree.pyx", line 1780, in lxml.etree._ElementTree._assertHasRoot (src/lxml/lxml.etree.c:54277)
AssertionError: ElementTree not initialized, missing root
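That AssertionError pattern suggests guarding before calling xpath() at all. With lxml the guard would be checking whether `tree.getroot()` is None after `lxml.etree.parse()`; a Python 3 stdlib ElementTree sketch of the same fail-loudly idea:

```python
import xml.etree.ElementTree as ET

def parse_or_fail(markup):
    """Parse markup, failing loudly when the fetch evidently returned nothing.

    (With lxml, the analogous check is tree.getroot() is None, done before
    any xpath() call.)
    """
    if not markup or not markup.strip():
        raise ValueError("empty document: the fetch probably failed or was blocked")
    return ET.fromstring(markup)

root = parse_or_fail("<html><body><h1>ok</h1></body></html>")
print(root.find(".//h1").text)  # → ok

try:
    parse_or_fail("")
except ValueError as exc:
    print(exc)  # → empty document: the fetch probably failed or was blocked
```

That way the script reports "the fetch failed" at the point of failure, instead of a confusing "missing root" two calls later.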

Thanks for the help!
