
Selenium Problem with ArcGIS Dashboard

I'm having trouble getting a script to work on PythonAnywhere even though it works for me locally. I tried some of the solutions I found in the forum from people with similar issues, but it's still not working for me...

Background:

  • I'm using selenium to scrape values from an ArcGIS dashboard.
  • I have confirmed via screenshots and the HTML output that the local version of my script loads the dashboard page correctly, but the PythonAnywhere version only loads the dashboard text labels, not the numeric values that I need. The HTML headers look identical in both cases. I can't figure out why the page doesn't fully load from my PythonAnywhere-based script...

My setup:

  • I have the haggis system image.
  • I'm running the default version of Python (3.10).
  • I have the latest version of selenium (4.26.1, via pip install --user --upgrade selenium).
  • I have a paid account.
  • I have tried time.sleep() values up to 120 seconds to allow plenty of time for the page to fully load, and this doesn't fix the problem.

Code snippet:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Standard options for running headless Chrome on a server
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
browser = webdriver.Chrome(options=chrome_options)

url = "https://townofsuperior.maps.arcgis.com/apps/dashboards/b6f5d5b7137142e7a75d452bff134ebb"
browser.get(url)
time.sleep(30)  # crude fixed wait for the dashboard's JavaScript to render
html = browser.page_source
labels = browser.find_elements(By.CSS_SELECTOR, 'g.responsive-text-label')
for label in labels:
    print(label.text)
print(html)
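(Aside: the fixed sleep above can also be written as an explicit wait; a minimal sketch with a hypothetical 60-second timeout is below, dropping in where the time.sleep(30) line sits. Since the labels do load on PythonAnywhere, this is just a tidier way to wait and doesn't by itself make the missing numbers appear.)

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait (up to a hypothetical 60 seconds) until at least one label element is
# present, instead of sleeping for a fixed interval.
WebDriverWait(browser, 60).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'g.responsive-text-label'))
)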

Output when I run it locally:

Commercial Properties
4
Public Property
1
Residential Properties
393
Permits Issued
307
Certificates of Occupancy
270
Destroyed Properties
398
Debris Removal
Complete

Output when I run it on PythonAnywhere:

Commercial Properties
Public Property
Residential Properties
Permits Issued
Certificates of Occupancy
Destroyed Properties
Debris Removal
Complete

Can anyone help me figure out what the problem might be? Thank you.

P.S. My script worked on PythonAnywhere for about six months and then started failing in June (specifically, on June 27th). Also, I'm having the same problem on all three of the ArcGIS dashboards that I'm trying to scrape:

url = "https://louisvillecogov.maps.arcgis.com/apps/dashboards/a5c177edbb024f509735cd313b11baac"
url = "https://townofsuperior.maps.arcgis.com/apps/dashboards/b6f5d5b7137142e7a75d452bff134ebb"
url = "https://www.arcgis.com/apps/dashboards/0988758f629e49478dc3e83bfbb0e378"

Snapshot of local script output

Snapshot of PythonAnywhere output

That "Data source error sounds like a problem there" Are you able to inspect it? Maybe some service is blocking PythonAnywhere.

With 100% consistency, the ArcGIS dashboard page renders correctly from my local script but not when I run from PythonAnywhere — and I can't figure out what's different.

This ESRI page suggests the "Data Source Error" might happen when a display container is too small – but the dashboard containers are normally sized in my PA screenshots and their dimensions are the same in the raw HTML from both my local output and the PA output.
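(For what it's worth, the rendered geometry can also be read straight from the headless browser rather than out of the raw HTML. The sketch below reuses browser and By from the first snippet; the container class name is a placeholder, not a confirmed ArcGIS selector, so substitute whatever container element shows up in your own HTML dump.)

# 'div.dashboard-panel' is a hypothetical placeholder selector -- swap in the
# real container class from your HTML output.
panels = browser.find_elements(By.CSS_SELECTOR, 'div.dashboard-panel')
for panel in panels:
    print(panel.rect)  # rendered x, y, width and height as the browser lays them out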

I've tried lots of options —

user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--window-size=1920,1080')
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--allow-running-insecure-content')
chrome_options.add_argument('--disable-extensions')
chrome_options.add_argument(f'user-agent={user_agent}')

— but nothing fixes it.

There aren't any error messages apart from the "Data Source Error" in the screenshot; the page is rendering, but it's missing all of the calculated numeric elements (they're just <!----> placeholders in the raw HTML where I see the actual numbers in the output from my local script).

Question: per your suggestion, how can I inspect or debug a headless browser? (With Chrome on my desktop I would use Ctrl-Shift-C to open the devtools...but how can I do something similar with a headless browser on PA?)

Thank you...

The page may have some anti-bot measures in place, or it may be blocking PythonAnywhere specifically.

You can't really inspect the page the same way you would in a headed browser, so you need to make do with screenshots or inspecting the HTML. You could try to get extra information like network activity using performance logging: https://developer.chrome.com/docs/chromedriver/logging/performance-log
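For example, something along these lines should dump the network responses the headless browser sees (a sketch based on the ChromeDriver docs linked above, not tested against the dashboard; the URL is one of the ones posted earlier in the thread):

import json
import time

from selenium import webdriver

# Ask ChromeDriver to record the DevTools performance log (includes network events).
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
browser = webdriver.Chrome(options=chrome_options)

browser.get('https://townofsuperior.maps.arcgis.com/apps/dashboards/b6f5d5b7137142e7a75d452bff134ebb')
time.sleep(30)

# Each log entry wraps a DevTools event as a JSON string; print the status and URL
# of every network response so you can spot requests that fail or are blocked.
for entry in browser.get_log('performance'):
    event = json.loads(entry['message'])['message']
    if event['method'] == 'Network.responseReceived':
        response = event['params']['response']
        print(response['status'], response['url'])

browser.quit()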

The page isn't blocked entirely -- just some elements of it (the ones I need!).

Oh well -- I gave up on trying to make it work on PA the same way it works for me locally and came up with this solution instead: rather than accessing the dashboard directly, I created a (free) account on the LambdaTest browser-testing platform, load the page remotely over there from my PA script, and then scrape that.

Note that this approach requires a paid PA account since lambdatest.com isn't in the PA whitelist.

# References
# https://www.lambdatest.com/learning-hub/webdriver
# https://automation.lambdatest.com/configure?framework=python&lang=python

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# LambdaTest configuration
username = 'username'  # replace with your LambdaTest username
access_key = 'access_key_abc123xyz'  # replace with your LambdaTest access key
remote_url = f'https://{username}:{access_key}@hub.lambdatest.com/wd/hub'
capabilities = {
    'platform': 'Windows 10',
    'browserName': 'chrome',
    'version': 'latest'}

# Initialize the remote webdriver with LambdaTest capabilities
options = webdriver.ChromeOptions()
options.set_capability('cloud:options', capabilities)
driver = webdriver.Remote(remote_url, options=options)

# ArcGIS dashboard to scrape
url = 'https://audubon.maps.arcgis.com/apps/dashboards/96896859c42c4301a8032609493a9e00'

driver.get(url)
time.sleep(20)  # simple fixed delay for the dashboard to render (not a Selenium implicit wait)
html = driver.page_source
labels = driver.find_elements(By.CSS_SELECTOR, 'g.responsive-text-label')
for label in labels:
    print(label.text)
driver.save_screenshot('test_screenshot.png')
# print(html)

# Close the browser
driver.quit()

That's a clever workaround -- glad to hear you were able to get it working!

Argh! Not such a clever workaround after all -- I just realized the free LambdaTest account has a 100-minute limit, and I've already hit that limit with my daily process.

Sorry to hear that!