The code works fine from my local machine, by the way.
Thoughts?
Thanks!
Sincerely,
Michael
Harry,
Thanks for the response. Yeah, it is the same sites (or at least, this one). APE doesn't have any public terms & conditions about scraping (it's a regional concert promotion company), but that could be the case, I suppose. I left the code running in a console overnight and it's still hung.
I actually switched over to lxml and it is working to access the site, but now I'm running into an error using XPath and count. Again, this works fine locally, but it is throwing an error on PA.
import lxml
import lxml.etree
import lxml.html
from lxml import html
url = "http://www.apeconcerts.com"
parser = lxml.etree.HTMLParser(encoding='utf-8')
tree = lxml.etree.parse(url, parser)
print "got the tree"
count = tree.xpath("count(//*[@id='main']/div[2]/div)")
My current hypothesis is that there is ambiguity in what the root is (which, as I understand it, would be the 'html' tag). So I think I either need to write out the full XPath or somehow indicate that the html tag is the root?
I am planning to put some time against this later this evening.
Any thoughts prior to that would be very much appreciated!
Sincerely,
Michael
PS - If not clear, I'm a Python beginner (esp. web) and a PA total beginner, so I appreciate the help.
No probs. My guess is that the "missing root" is lxml's unhelpful way of telling you it couldn't actually load anything from that URL... you could do a print(tree) or something similar to confirm if that's the case.
In the meantime, I think we might be blocked from scraping that particular site. You could contact the site administrators and ask them if they do sometimes block scraping requests?
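One way to confirm whether anything was loaded is to check the tree's root before running any XPath. This is only a minimal sketch: has_root and safe_xpath are hypothetical helper names, not part of lxml, and the only lxml calls they rely on are getroot() and xpath() on the tree object returned by lxml.etree.parse.

```python
def has_root(tree):
    """Return True if the parsed tree actually contains a root element."""
    # An ElementTree with nothing in it reports None as its root.
    return tree.getroot() is not None


def safe_xpath(tree, expression):
    """Run an XPath query, or return None when the tree is empty."""
    if not has_root(tree):
        # Nothing was parsed, so querying would fail with "missing root".
        return None
    return tree.xpath(expression)
```

With the code from the post above, safe_xpath(tree, "count(//*[@id='main']/div[2]/div)") would return None instead of raising if nothing came back from the URL.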
Harry - thanks. Yes, that looks like what is happening.
It does return an ElementTree object, but this code:
url = "http://www.apeconcerts.com"
parser = lxml.etree.HTMLParser(encoding='utf-8')
tree = lxml.etree.parse(url, parser)
print "got the tree"
print tree
treeString = lxml.etree.tostring(tree)
print treeString
results in...
got the tree
<lxml.etree._ElementTree object at 0x7f22e5467a70>
And then the file continues to run and when it hits xpath((count("//...
Traceback (most recent call last):
  File "ticktateScraperv03.py", line 67, in <module>
    testpath = tree.xpath("//html/body/main/div/div/div[1]/h1/a/text()")
  File "lxml.etree.pyx", line 2111, in lxml.etree._ElementTree.xpath (src/lxml/lxml.etree.c:57604)
  File "lxml.etree.pyx", line 1780, in lxml.etree._ElementTree._assertHasRoot (src/lxml/lxml.etree.c:54277)
AssertionError: ElementTree not initialized, missing root
Thanks for the help!
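For anyone hitting the same thing, a possible way to make the failure visible earlier is to download the page separately, with a timeout, and only parse it if something actually came back. This is a rough sketch under some assumptions: it uses Python 3's urllib.request (the code in this thread is Python 2), fetch_body is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import urllib.request


def fetch_body(url, timeout=10):
    """Download url and return the raw bytes, failing fast instead of hanging."""
    # timeout is in seconds; a connection that stalls for longer than this
    # raises an exception rather than leaving the script stuck.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read()
    if not body:
        # An empty body would otherwise surface later as "missing root".
        raise ValueError("empty response from %s" % url)
    return body
```

The returned bytes could then be handed to lxml.html.fromstring(body), so a blocked or empty response shows up as an exception at fetch time rather than as an AssertionError at the first xpath call.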