Can't retrieve URLs from pythonanywhere.com with python "requests" command : Forums : PythonAnywhere

Can't retrieve URLs from pythonanywhere.com with python "requests" command

Hi there, this site was so unbeliveable cool I just had to sign up for an account and then try to find some tasks that justifies my (work) time spent here...

Actually, I was looking for a quick way of logging response times (and down times) for some new web mapping services our agency has established. Without going into much detail let's just say our IT department doesn't exactly embrace neither open source nor free data service policies. Our WMS geoserver is firmly locked behind a firewall (which is good) and then exposed through some XML gateway (which I'm highly sceptical about, both in terms of overhead (response time) as well as stability). So, I want to log the uptime, response codes (such as any occurence of 401 unauthorized) and response times from somewhere outside our corporate network. Pythonanywhere seems like a cool place to do that,

I've been reading and tinkering, and through these forums I've learned that the "requests" module is there to save me from the utter hell of using urllib2. Cool.

Anyway, I'm a puzzled about why r = requests.get( "pythonanywhere.com/terms" ) returns a 501 (service unavailable) http response code. Urllib2, wget and curl has no problems retrieving that. Changing the URL to somewhere else on the whitelist (such as wikipedia.org) produces the opposite effekt: Both curl and wget produce a "403 (forbidden)" response from the proxy server. Urllib also fails, I guess for the same reason (I have not bothered investigating the arcane API of urllib2 to get the actual response codes). However, the requests module retrieves www.wikipedia.com with shiny colours....

I guess this has something to do with proxy setup and behaviour for non-paying clients. While I scrutinize my backup harddisk in my attic for my paypal details (It's been a while and a computer change since last time I used paypal...) it would be nice if someone could shed a light on this.

Code example:

import requests
import urllib2
import subprocess

mylink = 'https://www.pythonanywhere.com/terms/'
# mylink = 'http://wikipedia.org/'

r = requests.get( mylink )

print 'REQUESTS HTTP response code for the url ', mylink, ' => ', r.status_code

# Why does URLLIB2 FAIL if it encounters a http 401 or similar???? 
try:
    response = urllib2.urlopen(mylink)
    print 'URLLIB2 HTTP Response code for the url ', mylink, response.getcode()
except:
    print 'Could not retrieve url ', mylink, ' using URLLIB2'

print 'Testing using wget..'
subprocess.call(["wget", '-O', 'wgetresult', '-S', mylink])

print 'Testing using curl...'
subprocess.call(["curl", '--head', mylink])

jansimple | 24 posts | Aug. 2, 2012, 6:29 a.m. | permalink

I ran your test code and got this result.

07:12 ~ $ python reqtest.py
REQUESTS HTTP response code for the url  https://www.pythonanywhere.com/terms/  =>  200
URLLIB2 HTTP Response code for the url  https://www.pythonanywhere.com/terms/ 200
Testing using wget..
--2012-08-02 07:12:25--  https://www.pythonanywhere.com/terms/
Resolving www.pythonanywhere.com... 50.19.109.98
Connecting to www.pythonanywhere.com|50.19.109.98|:443... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Thu, 02 Aug 2012 07:12:26 GMT
  Server: Apache/2.2.16 (Debian)
  Vary: Cookie,Accept-Encoding
  Connection: close
  Content-Type: text/html; charset=utf-8
Length: unspecified [text/html]
Saving to: “wgetresult”

    [ <=>                                   ] 31,245      --.-K/s   in 0.009s

2012-08-02 07:12:25 (3.20 MB/s) - “wgetresult” saved [31245]

Testing using curl...
HTTP/1.1 200 OK
Date: Thu, 02 Aug 2012 07:12:26 GMT
Server: Apache/2.2.16 (Debian)
Vary: Cookie,Accept-Encoding
Content-Type: text/html; charset=utf-8

07:12 ~ $

a2j | 684 posts | Aug. 2, 2012, 7:16 a.m. | permalink

Ah, but you have full access to internet services from your account. I am going to sign up for a paid account, which will grant me the same privileges. (I confess to being a penny-pincher, but I'll shell out for this service. I do, however, have to dig through my backup harddisk to get my paypal-details).

I guess the root of the behaviour I find puzzling is how the proxy server is setup to restrict acces to the internet for penny-pinchers like myself. Why this should produce these rather inconsistent results (such as the 501 respose code for "results", 200 for all other methods) is something I can't understand, but it might be worth looking into for the fine folks at PA.

Note that both links I've tested with (wikipedia and PA "terms of service" page) both are on the whitelist for free acounts like mine.

Thanks for confirming that everything works perfectly fine once you've paid for this outstanding service :)

jansimple | 24 posts | Aug. 2, 2012, 8:03 a.m. | permalink

If you used Flask or Django, you could easily and quickly knock up a really neat web app that summarised the data.

rcs1000 | 311 posts | Aug. 2, 2012, 8:55 a.m. | permalink

Hi jansimple, is it possible that you are running into this bug in requests/urllib3? . a2j wouldn't see it because paid accounts are not proxied.

hansel | 459 posts | PythonAnywhere staff | Aug. 2, 2012, 10:37 a.m. | permalink

Yep, that's the plan. Not that I actually need the killer web app, but it'll be great fun and I'll learn tons of new stuff, which is why I'm here anyway.

For now I'm more than happy just to be able to start logging response times and http return codes into the database. Having full python and mysql at my fingertips makes this a really trivial task. I should add it's a very pleasant learning experience as well... :)

jansimple | 24 posts | Aug. 2, 2012, 10:40 a.m. | permalink

Hi Hansel,

The problems using requests to connect to anything on the PA domain seems somewhat consistent with this bug (although I get a 501 (not implemented), not a 502 (bad request). As described, anyting on the PA domain seems works fine with the urllib2, wget and curl but i do get a 501 error using requests.

The problems using urrlib2, wget and curl to connect to anything whitelisted seems more bizarre. This works fine with requests, but I get a "403 forbidden" using urllib2, wget and curl.

Connecting to anything not in the white list produces a "403 (forbidden)" for all methods, as expected.

jansimple | 24 posts | Aug. 2, 2012, 10:50 a.m. | permalink

Just to clarify, you're not seeing any of these problems now that you've upgraded to Premium? You shoudln't because you are no longer going over a proxy.

hansel | 459 posts | PythonAnywhere staff | Aug. 3, 2012, 11:43 a.m. | permalink

No problems what so ever, only 200 OK all that I can see.

jansimple | 24 posts | Aug. 5, 2012, 8:51 p.m. | permalink

Its weird but works. You need to upgrade to premium for wget to work.

deleted-user-71088 | 1 post | April 6, 2013, 1:56 a.m. | permalink

@sandyinside: Welcome to the PA community! This place is great.

a2j | 684 posts | April 6, 2013, 3:25 p.m. | permalink