On the Databricks community, I see repeated problems with installing Selenium on the Databricks driver. Installing Selenium on Databricks may seem surprising, but sometimes we need to grab datasets that sit behind fancy authentication, and Selenium is the most accessible tool for that. Of course, always check the simplest alternatives first. For example, if we need to download an HTML file, we can use SparkContext.addFile() or just the requests library. If we need to parse HTML without simulating user actions or downloading complicated pages, we can use BeautifulSoup. Please remember that Selenium runs on the driver only (workers are not utilized), so for the Selenium part alone, a single-node cluster is the preferred setting.
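To illustrate the "simpler alternatives first" point, here is a minimal sketch of pulling links out of static HTML without a browser. It uses only Python's standard library so it runs without extra packages; BeautifulSoup offers a friendlier API for the same job. The names LinkCollector and extract_links and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a static HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# In a notebook the HTML would typically come from requests.get(url).text;
# a literal string (hypothetical content) keeps this sketch runnable offline.
sample = '<p><a href="/data/a.csv">A</a> <a href="/data/b.csv">B</a></p>'
print(extract_links(sample))
```

If the page only renders its content after JavaScript runs or after a login flow, this approach will not see it, and that is where Selenium becomes worth the setup.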
Installation
The easiest solution is to use apt-get to install Ubuntu packages, but the version in the Ubuntu repo is often outdated. Recently that solution stopped working for me, so I decided to take a different approach and get the driver and binaries from chromium-browser-snapshots:
https://commondatastorage.googleapis.com/chromium-browser-snapshots/index.html
The install script downloads the newest version of the browser binaries and the driver. Everything is saved to the /tmp/chrome directory. We must also set the Chrome home directory to /tmp/chrome/chrome-user-data-dir. Sometimes Chromium complains about missing libraries, which is why we also install libgbm-dev. The script is written out as a bash file that implements these steps.
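The script itself does not appear in this thread, so what follows is only a sketch of what such an installer could look like, not the original. The snapshot download URLs, the LAST_CHANGE lookup, and the helper names build_install_script and write_script are all assumptions; the /tmp/chrome layout and the "latest" link are inferred from the paths used in the webdriver setup.

```python
import os


def build_install_script():
    """Return the text of a bash installer (a sketch; URLs are assumptions)."""
    return """#!/bin/bash
set -e
mkdir -p /tmp/chrome/chrome-user-data-dir
cd /tmp/chrome
# Assumed endpoint of the chromium-browser-snapshots bucket.
BASE="https://www.googleapis.com/download/storage/v1/b/chromium-browser-snapshots/o"
# Assumed: LAST_CHANGE holds the newest snapshot revision number.
REV=$(curl -s "$BASE/Linux_x64%2FLAST_CHANGE?alt=media")
mkdir -p "/tmp/chrome/$REV"
curl -s -o chrome.zip "$BASE/Linux_x64%2F$REV%2Fchrome-linux.zip?alt=media"
curl -s -o driver.zip "$BASE/Linux_x64%2F$REV%2Fchromedriver_linux64.zip?alt=media"
unzip -q -o chrome.zip -d "/tmp/chrome/$REV"
unzip -q -o driver.zip -d "/tmp/chrome/$REV"
# Point /tmp/chrome/latest at the revision that was just unpacked.
ln -sfn "/tmp/chrome/$REV" /tmp/chrome/latest
# Chromium sometimes complains about missing libraries.
apt-get update -y
apt-get install -y libgbm-dev
"""


def write_script(path, text=None):
    """Write the installer to `path`, creating parent directories if needed."""
    text = build_install_script() if text is None else text
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text)
    return path


# On a Databricks driver this would be called as, for example:
# write_script("/dbfs/databricks/scripts/selenium-install.sh")
```

Building the text in a function keeps it easy to inspect or tweak (for example, to pin a specific revision) before it is written to DBFS.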
The script was saved to DBFS storage as /dbfs/databricks/scripts/selenium-install.sh. We can set it as an init script for the cluster. Click your cluster in "compute" -> click "Edit" -> "configuration" tab -> scroll down to "Advanced options" -> click "Init Scripts" -> select "DBFS" and set "Init script path" as "/dbfs/databricks/scripts/selenium-install.sh" -> click "add".
If you haven't set the init script, run the command below instead.
%sh
/dbfs/databricks/scripts/selenium-install.sh
Now we can install Selenium. Click your cluster in "compute" -> click "Libraries" -> click "Install new" -> click "PyPI" -> set "Package" as "selenium" -> click "install".
Alternatively (though less conveniently), you can install it each time in your notebook by running the command below.
%pip install selenium
Now let's start the webdriver. Note that Service and binary_location point to the driver and the browser binaries that were downloaded and unpacked by our script.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
s = Service('/tmp/chrome/latest/chromedriver_linux64/chromedriver')
options = webdriver.ChromeOptions()
options.binary_location = "/tmp/chrome/latest/chrome-linux/chrome"
options.add_argument('--headless')
options.add_argument('--disable-infobars')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--no-sandbox')
options.add_argument('--remote-debugging-port=9222')
options.add_argument('--homedir=/tmp/chrome/chrome-user-data-dir')
options.add_argument('--user-data-dir=/tmp/chrome/chrome-user-data-dir')
prefs = {"download.default_directory": "/tmp/chrome/chrome-user-data-dir",
         "download.prompt_for_download": False}
options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(service=s, options=options)
Let's test the webdriver. We will take the latest posts from the Databricks community and convert them to a dataframe.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver.get('https://community.databricks.com/s/discussions?page=1&filter=All')
date = [elem.text for elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "lightning-formatted-date-time")))]
title = [elem.text for elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "p[class='Sub-heaading1']")))]
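The last step, converting the two lists to a dataframe, could be sketched as follows. The helper name pair_posts and the sample values are made up for illustration; in the notebook the inputs would be the date and title lists collected above, and spark is the session that Databricks predefines.

```python
def pair_posts(dates, titles):
    """Pair each date with its title; zip stops at the shorter list."""
    return list(zip(dates, titles))


# Hypothetical sample values standing in for the scraped lists.
rows = pair_posts(["date one", "date two"], ["First post", "Second post"])
print(rows)

# On a Databricks cluster, where `spark` is predefined, the rows can then
# be turned into a DataFrame with named columns:
# df = spark.createDataFrame(pair_posts(date, title), ["date", "title"])
# display(df)
```

Because zip stops at the shorter list, a page where the two selectors return different numbers of elements will silently drop the unmatched tail, so it is worth comparing len(date) and len(title) first.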
I followed these instructions in an AWS backed Databricks platform and can't get past this error every time I run the below code:
Partial Error:
Could not connect to security.ubuntu.com:80
Code:
%sh
/dbfs/databricks/scripts/selenium-install.sh
I have provided the full error at the bottom of this post. Is there anything that I am doing wrong? I looked at the Network ACLs and Security Groups defaulted in the AWS account, and it looks like I should have access to inbound/outbound HTTP (80) ports, but I am not an AWS expert. I added a new Security Group for outbound port 80 access to try to troubleshoot, but it didn't work and is probably redundant. I could use some help troubleshooting.
I tried running the below as the full error suggested and am getting similar error messages:
I was able to get this fixed by working with our IT department. Port 80 is required for the %sh command, and our firewall configuration was blocking port 80 on that particular cloud platform.
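For anyone hitting the same "Could not connect" message, a quick way to confirm whether outbound port 80 is reachable from the driver is a plain TCP connection test. This is a minimal sketch using only the standard library; the helper name can_connect is made up for illustration.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # The host from the error message; False here points at a network
    # or firewall block rather than a problem with the script itself.
    print(can_connect("security.ubuntu.com", 80))
```

A False result only says that the connection could not be opened from this machine; it does not tell you whether the block sits in a security group, a network ACL, or a corporate firewall.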
I have a new issue though. When I run the first command after the pip install selenium command, I get this error:
WebDriverException: Message: unknown error: unable to discover open pages
@Hubert Dudek Any ideas?
@Hubert-Dudek
Hi, thanks for the detailed tutorial. With slight tweaks to the init script I was able to make Selenium work on a single-node cluster. However, I haven't had much luck with shared clusters in DB Runtime 14.0. Btw, I'm using Volumes to store both the chrome 114 debian package & chromebinary executable.
See attached for the previous steps.
Hi Hubert-Dudek,
Are there any updates to your article? I have been struggling to get Databricks to recognise a SeleniumBase driver. I think the error might actually be a permissions problem, as the error is:
WebDriverException:
Message: Can not connect to the Service /local_disk0/.ephemeral_nfs/envs/pythonEnv-0000-xxx.../lib/python3.11/site-packages/seleniumbase/drivers/uc_driver
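One way to test the permissions guess is to check whether the driver file exists and is executable by the current user. This is only a diagnostic sketch; the helper name describe_binary is made up, and the path to pass in is whatever appears in your own error message.

```python
import os


def describe_binary(path):
    """Classify a driver binary as 'missing', 'not executable', or 'ok'."""
    if not os.path.exists(path):
        return "missing"
    if not os.access(path, os.X_OK):
        return "not executable"
    return "ok"


# Example call with a hypothetical placeholder path; substitute the
# uc_driver path reported in the WebDriverException.
print(describe_binary("/path/to/uc_driver"))
```

If the result is "ok" and the error persists, the cause is probably something other than file permissions, for example the environment not allowing the process to be launched from that location.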