Lately, I've found myself checking my router fairly often. Occasionally, I need to see which devices are connected to my network, their MAC addresses, how much time is left on their DHCP leases, and so on. I got fed up with my router's web interface and decided I wanted to automate some of these tasks. Unfortunately, there's no easy way to get custom firmware onto the router.
So I decided to "hack" it, clean the data, and have the information sent to me via email whenever there's an important update. Below is a tutorial on how to do something similar. While this post covers a very specific use case of Selenium, it's my hope that you'll still learn plenty and avoid some common mistakes on your own projects.
Part 1: Scraping Connected Device Information
This isn't exactly "big data" (or even hacking, for that matter); nevertheless, some of the core principles used in data mining will help us here. The first task is to figure out where your data is and how you can extract it.
Sometimes, a website or service exposes a RESTful API, which greatly simplifies this task. In this instance, there's no public REST API to use. The only data we can get is whatever the router's HTTP server renders as HTML.
That's no problem, though! We have Selenium. Selenium and its drivers let you operate a browser in real time. This is really cool because it allows us to write code that is essentially human as far as the website is concerned. When we tell Selenium to click an element on the screen, it does exactly that. Unless you're using a headless browser (more on that later), Selenium will open up your browser of choice and, depending on the code you wrote, perform each action you've told it to perform.
A lot of people use Selenium for writing functional tests for a code base; it excels at automating user stories. There are plenty of great articles out there for that, so today we're just going to use it as a web scraping tool.
Create a Selenium Docker Container
Docker - so hot right now.
Install Docker. Installation is relatively simple on Linux. On macOS or Windows 10, Docker runs inside a Linux virtual machine. There seem to be some issues with the Windows flavor of the Docker beta, so you may want to consider the older but more stable Docker Toolbox.
Docker should now be running in the background with the proper environment variables set so it can be used from your CLI of choice. Let's get the necessary image from Docker Hub. Docker Hub is basically GitHub for ultra-portable, lightweight services, APIs, and even operating systems.
We're going to use the official Selenium "standalone-firefox" Docker image. You're free to use others as well; just note that each web driver behaves slightly differently, so your mileage may vary.
Open up a terminal and type
docker pull selenium/standalone-firefox
and start it
docker run -d -P selenium/standalone-firefox
The output of this command is the id of your container. A small but important detail: because we're running an instance of a Docker image, we now refer to it as a Docker container. Your container is now accessible on a new port on your local machine.
docker ps
This will list your running Docker containers and the ports they can be accessed through. Make note of the port; you'll need it later.
Wire the Selenium Container to Python
I'm using Python 3.5.3 for this tutorial. As always, it's best to stick with a Python virtualenv so as not to clutter your global Python with project-specific packages. If you don't have virtualenv, or don't know what it is, I recommend reading the documentation. I'll also be making use of virtualenvwrapper.
Create and name your virtual env. Let's call it scrape_env.
which python3  # First, find out where your python3 binary is located.
mkvirtualenv -p /path/to/python3 scrape_env
You should already be in your virtualenv after running this. To confirm this, run
which python
The output should return the location of a python binary inside of your virtual env. If it doesn't, try
workon scrape_env
Now that we're in our virtualenv, let's install the Python selenium code bindings.
pip install selenium
Alright, we should now be prepared to start wiring up our running Selenium instance with a bit of python code. Open up a new file inside of a new project directory with your editor of choice.
mkdir scrape
cd scrape
vim scrape_router.py
This is where the wiring happens. Selenium provides a RemoteDriver for server-level browser automation, which also happens to be very useful for connecting to a Docker container.
scrape_router.py
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

caps = DesiredCapabilities.FIREFOX
# "marionette" is a web driver capability that Firefox 48 and higher use by default.
# Since our container runs Firefox 47, let's disable it.
caps["marionette"] = False

# Wire it all up. Be sure to use the port for _your_ container, found earlier with docker ps.
browser = webdriver.Remote(command_executor='http://127.0.0.1:32768/wd/hub',
                           desired_capabilities=caps)
Note the browser object created at the end. It's a handle to a running Firefox browser, with full access to the Document Object Model and several operations useful for manipulating the browser's state. First things first: let's get to the router's administration page. This is typically located at http://192.168.1.1.
scrape_router.py
# ...
browser = webdriver.Remote(command_executor='http://127.0.0.1:32768/wd/hub',
                           desired_capabilities=caps)

url = 'http://192.168.1.1/'
browser.get(url)
The structure of your specific admin page will be different, but if you can navigate it yourself in a browser, so can Selenium. Try to follow this basic algorithm -
1.) Open the website in your browser and find the fastest route to the data you need
In my case, I'm interested in the devices connected to the router. After some tinkering, I found an execution path to my data in just a few clicks.
2.) Use developer tools to identify the elements needed in your execution path
Okay, so now we need to tell Selenium to run Firefox in that exact same manner. We'll call the series of clicks, typing, and waiting required to get to the data you want to scrape the browser execution path.
The question is, how will Selenium know what to click? Being able to use the developer tools in your browser is crucial for telling Selenium what to click on, type in, and so on.
Go back through your execution path. At each step, inspect the relevant element with your dev tools. From there, we can determine -
- The element's id attribute
- The element's xpath
- The CSS selectors which accurately specify the element
You're free to pick from any of these options. An id is often the cleanest method. XPath can be very specific and (in testing) can enforce parent-child relationships. CSS selectors are best suited to testing the styling behavior of your website. I digress.
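To make that concrete, here's a minimal sketch of locating an element each of the three ways, using the browser object we created earlier. The id, XPath, and CSS selector below are made-up examples; yours will depend on your router's markup.

# Three ways to locate the same element; the names here are hypothetical.
el = browser.find_element_by_id('some-button')
el = browser.find_element_by_xpath('//div[@class="nav"]/button[1]')
el = browser.find_element_by_css_selector('div.nav > button')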
The first step of the process is to click on the password input box to get its focus. Then we type the password in. We need a way to identify that input. Inspecting the element shows us that the input has a unique id - "adminPass".
3.) Write selenium code one step at a time
Before you get all giddy and start hammering out a bunch of code, keep in mind that each step can require some tricky debugging. Browsers are fast[citation needed], so we don't necessarily see everything the browser may be doing in the ~100ms that occur after you open the page. Unfortunately for us, programs are even faster, so this time period becomes very important.
Let's start with an intuitive approach to see how quickly problems can arise. Add this to your code:
scrape_router.py
from selenium.webdriver.common.keys import Keys

'''
...imports and other code
'''

browser.get(url)

# The webpage should be open now. Let's click the input box.
password_input = browser.find_element_by_id('adminPass')

# We should have access to the input now. Let's type the password.
password_input.clear()
password_input.send_keys('password')
password_input.send_keys(Keys.RETURN)
Looks good to me. Let's run it.
python scrape_router.py

Traceback (most recent call last):
  # Stack Trace
  ...
selenium.common.exceptions.ElementNotVisibleException: Message: Element is not currently visible and so may not be interacted with
What's this? Selenium isn't seeing our element even though it should be there, right?
The problem is that once browser.get(url) runs, the web page starts loading, and the next line of Python runs immediately, whether or not the page has finished. By the time we reach password_input = browser.find_element_by_id('adminPass') in our code, there's still no guarantee that the page has finished loading. So, how do we make sure the page has loaded before running our code to find the password input element?
How you solve this is a point of contention in the Selenium community. Harry J.W. Percival, author of Obey the Testing Goat, provides several solutions, including one particularly robust one, in this article.
The answer comes in the form of selenium's explicit waits. Explicit waits allow you to halt execution until certain conditions are met. More specifically, we can tell selenium to wait until our input box has loaded before trying to find it. Let's refactor.
scrape_router.py
# ...
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# ...

browser.get(url)

wait = WebDriverWait(browser, 10)
password_input = wait.until(EC.element_to_be_clickable((By.ID, "adminPass")))
password_input.clear()
password_input.send_keys('password')
password_input.send_keys(Keys.RETURN)
We now have an explicit wait defined on the condition that the password input be clickable. Until that condition is true, no further code will be run. Perfect.
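One detail worth knowing: if the condition never becomes true within the timeout (10 seconds here), WebDriverWait raises a TimeoutException. Here's a minimal sketch of handling that, reusing the wait and browser objects from above:

from selenium.common.exceptions import TimeoutException

try:
    password_input = wait.until(EC.element_to_be_clickable((By.ID, "adminPass")))
except TimeoutException:
    # The input never became clickable -- clean up the remote session before re-raising.
    browser.quit()
    raise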
Let's run it and check out some actual results in our Docker container's web interface.
python scrape_router.py
This time, everything should've gone smoothly without any exceptions being thrown. Let's check the website itself, though, to see what our Firefox Selenium container is actually doing for us. Head to http://127.0.0.1:32768/wd/hub/ (substituting your own container's port).
Awesome! The hub provided by the docker container gives an option to take a screenshot of the instance of Firefox that it's running. Once we do, we see that we've successfully logged in, meaning the code worked as intended.
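If you'd rather not click through the hub's UI, you can grab the same evidence from code. A quick sketch using the browser object we already have; the filename is arbitrary:

# Ask the remote Firefox instance for a screenshot of whatever it's currently rendering.
browser.save_screenshot('after_login.png')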
4.) Repeat until you've reached the final step in your shortest possible execution path.
The previous step was a little longer, but for good reason. You never know what sort of strange behavior a website will have implemented. It's up to you to determine the issues at each step and create an explicit wait that solves them.
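If you find yourself writing the same waits over and over, a small helper can keep things readable. This is just an optional sketch built on the same imports used above, not part of the original script:

def wait_for_clickable(browser, element_id, timeout=10):
    """Block until the element with the given id is clickable, then return it."""
    return WebDriverWait(browser, timeout).until(
        EC.element_to_be_clickable((By.ID, element_id)))

def wait_for_invisible(browser, element_id, timeout=10):
    """Block until the element with the given id is no longer visible."""
    return WebDriverWait(browser, timeout).until(
        EC.invisibility_of_element_located((By.ID, element_id)))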
Let's have a look at the final code.
scrape_router.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.action_chains import ActionChains

caps = DesiredCapabilities.FIREFOX
caps["marionette"] = False
browser = webdriver.Remote(command_executor='http://127.0.0.1:32768/wd/hub',
                           desired_capabilities=caps)

url = 'http://192.168.1.1/'
browser.get(url)

# Step 1 - Type the password and press enter.
wait = WebDriverWait(browser, 10)
password_input = wait.until(EC.element_to_be_clickable((By.ID, "adminPass")))
password_input.clear()
password_input.send_keys('password')
password_input.send_keys(Keys.RETURN)

# Step 2 - Wait for the router's dashboard to appear and click the "Troubleshooting"
# navigation item in the sidebar.
troubleshooting = wait.until(EC.element_to_be_clickable((By.ID, "iconTroubleshooting")))
troubleshooting.click()

# Step 3 - Wait for an (annoying and pointless) modal overlay to disappear, then wait
# for the "DHCP Client Table" button to appear and click it. An invisible overlay with
# id "dialog-overlay" was rendered on top of the button, blocking the click until it went away.
wait.until(EC.invisibility_of_element_located((By.ID, "dialog-overlay")))
open_status = wait.until(EC.element_to_be_clickable((By.ID, "open-status")))

# An ActionChain scrolls the button into view and then clicks it, which solved
# another issue that arose during this final step.
ActionChains(browser).move_to_element(open_status).click().perform()
As you can see, at each step along the way I encountered some specific behavior that needed to be solved with an explicit wait. In step 3, a few problems arose all at once.
Initially, I was getting an exception telling me that the element was hidden underneath another element with id "dialog-overlay", and thus couldn't be clicked. Apparently my router likes to randomly render invisible overlays before finishing a page load. I solved this with an explicit wait conditional on that overlay becoming invisible.
It also turns out that Selenium's drivers can't interact with every element on the page right away. Just as in a regular browser, some elements only come into view once you scroll to them, and the driver clicks elements by their on-screen location. Because the "DHCP Client Table" button wasn't in view, Selenium had no idea what location to click on. The solution was a Selenium ActionChain. I defined a series of two actions for the browser to perform: first, move to the "DHCP Client Table" button, scrolling it into view; second, click it. Problem solved!
Thoughts
So, Selenium doesn't exactly get you there for free. It's a framework that requires a very human touch, and this simple code took far longer to debug than it did to write. If you make effective use of the algorithm described above, you'll find the debugging process far less complex: you should only ever have to focus on one clearly defined set of problems at a time when automating a browser.
It doesn't stop there, though. In Part 2 (coming soon!) of the tutorial, I'll go into more advanced scraping with Python. We'll make use of Django's ORM to save our scraped data to a database, and create an automated Celery task that updates us via email every time a new device has logged on to the router.
This blog post was originally posted on codeneurotic. Its copyright was licensed for fair use on hirelofty.com by Clay Mullis on October 4, 2016.