Issues with Selenium and PyVirtualDisplay : Forums : PythonAnywhere

Issues with Selenium and PyVirtualDisplay

I'm working on a script that scrapes a list from two nearly identical websites, and I'm attempting to do it with Selenium. The problem is that the site loads content through JavaScript after the initial page has been fetched by the browser, and Selenium/PyVirtualDisplay doesn't seem to be letting that JavaScript run to build the rest of the page for me to parse.

from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from lxml import html
from sys import exit
from time import sleep

disneyland_url = "http://disneyland.disney.go.com/maps/service-details/18583410%3bentitytype%3dguest-service"
disneyworld_url = "http://disneyworld.disney.go.com/maps/service-details/18579731%3bentitytype%3dguest-service"

names_xpath = "//div[@class='textContainer']/div/text()"
coords_xpath = "//div[@class='textContainer']/parent::div/@data-id"

dlString = ""
dwString = ""

def scrape(url):
    htmlString = ""
    display = Display(visible=0, size=(800, 600))
    display.start()
    print "Displaying"
    for retry in range(3):
        try:
            browser = webdriver.Firefox()
            break
        except:
            sleep(3)
    try:
        browser.get("http://disneyland.disney.go.com/maps/service-details/18583410%3bentitytype%3dguest-service")
        print browser.title
        WebDriverWait(browser, 50).until(EC.presence_of_element_located((By.CLASS_NAME, "textContainer")))
        innerHTML = browser.execute_script("return document.body.innerHTML")
        htmlString = innerHTML
    except:
        print "Website didn't load in time"
        browser.quit()
        display.stop()
        exit("Website Error")
    finally:
        browser.quit()
        display.stop()

    tree = html.fromstring(htmlString)
    print(htmlString)
    names = tree.xpath(names_xpath)
    print(names)

scrape(disneyland_url)

Does anyone know what I'm doing wrong in regards to this?

deleted-user-1628581 | 3 posts | June 14, 2017, 7:51 p.m. | permalink

I know the url parameter isn't used in the scrape function right now, but I sincerely doubt that's the issue here. I left it that way for testing.

deleted-user-1628581 | 3 posts | June 14, 2017, 9:47 p.m. | permalink

We've often found that the best thing to do with Selenium and stuff that's loaded by JavaScript is to write our own wait code. So, for example:

import time


def wait_for(condition_fn, error_message, timeout_seconds):
    # Poll once a second until condition_fn() is truthy or the timeout expires.
    start = time.time()
    while time.time() - start < timeout_seconds:
        if condition_fn():
            return
        time.sleep(1)
    raise AssertionError(error_message)


def is_data_loaded():
    # ...check the page elements and see if the data is there.
    raise NotImplementedError("replace with a check for your page's data")


wait_for(is_data_loaded, "Page didn't load data", 30)
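
As one way to fill in the condition for the page in the first post, here is a minimal sketch. It assumes the data has arrived once at least one element matching div.textContainer exists (the class the original WebDriverWait call targets). make_is_data_loaded and the poll_seconds parameter are hypothetical additions, and the string "css selector" is the value behind Selenium's By.CSS_SELECTOR, used here so the snippet does not have to import Selenium itself.

```python
import time


def wait_for(condition_fn, error_message, timeout_seconds, poll_seconds=1.0):
    """Poll condition_fn until it is truthy or timeout_seconds have passed."""
    start = time.time()
    while time.time() - start < timeout_seconds:
        if condition_fn():
            return
        time.sleep(poll_seconds)
    raise AssertionError(error_message)


def make_is_data_loaded(browser, css_selector="div.textContainer"):
    """Build a condition that is true once a matching element exists.

    Assumption: `browser` behaves like a Selenium WebDriver, i.e. it has
    find_elements(by, value) returning a list of elements.
    """
    def is_data_loaded():
        # "css selector" is the string behind selenium's By.CSS_SELECTOR.
        return len(browser.find_elements("css selector", css_selector)) > 0
    return is_data_loaded
```

With a real driver this would be called after browser.get(...) as wait_for(make_is_data_loaded(browser), "Page didn't load data", 30).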

giles | 12671 posts | PythonAnywhere staff | June 15, 2017, 1:43 p.m. | permalink

No change, it just times out.

Ok, I'm still getting a

urllib2.URLError: <urlopen error [Errno 111] Connection refused>

whenever it runs, but only after calling browser.close().

deleted-user-1628581 | 3 posts | June 15, 2017, 8:54 p.m. | permalink

It's usually useful to dump out something related to your condition function to check your assumptions about what the page is showing.
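
A rough sketch of what such a dump might look like is below. describe_page and logged are hypothetical helper names, and the div.textContainer selector is only an assumption taken from the script in the first post; title, current_url, page_source and find_elements are the usual WebDriver attributes.

```python
def describe_page(browser, css_selector="div.textContainer"):
    """Summarise what the browser currently shows (hypothetical helper)."""
    # "css selector" is the string behind selenium's By.CSS_SELECTOR.
    matches = browser.find_elements("css selector", css_selector)
    return "title=%r url=%r matches=%d html_length=%d" % (
        browser.title,
        browser.current_url,
        len(matches),
        len(browser.page_source),
    )


def logged(condition_fn, describe_fn, log):
    """Wrap condition_fn so that every poll also logs a page snapshot."""
    def wrapper():
        result = condition_fn()
        log("condition=%r %s" % (bool(result), describe_fn()))
        return result
    return wrapper
```

Passing logged(is_data_loaded, lambda: describe_page(browser), some_log_function) to the wait code would then show on each poll whether the title, URL and number of matching elements are what you expected.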

If you've closed the browser, there's nothing left to make the request. The connection refused error is probably coming from Selenium when it tries to connect to the now non-existent browser to make it do the thing you asked.
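
In the script from the first post, both the except branch and the finally block call browser.quit(), so after a failure the second call is aimed at a browser that has already gone. A minimal sketch of doing the cleanup in one place only is below. fetch_body_html and its wait_until_loaded parameter are hypothetical names, and browser and display are assumed to be an already started WebDriver and PyVirtualDisplay Display.

```python
def fetch_body_html(browser, display, url, wait_until_loaded):
    """Load url, wait for the data, and return the rendered body HTML.

    Cleanup happens exactly once, in the finally block, whether or not
    the page loaded.
    """
    try:
        browser.get(url)
        wait_until_loaded()  # e.g. a call into your own wait code
        return browser.execute_script("return document.body.innerHTML")
    finally:
        # Nothing else should call quit(): this runs on success and failure.
        try:
            browser.quit()
        finally:
            # Stop the virtual display even if quitting the browser fails.
            display.stop()
```

Any exception from the wait still propagates to the caller, which can then decide whether to print a message or exit.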

glenn | 10043 posts | PythonAnywhere staff | June 16, 2017, 11:11 a.m. | permalink