Using RSelenium for task automation and web scraping

Fabrício Barbacena
16 min readJun 28, 2021

--

image by unDraw

1. Introduction

This article aims to present the RSelenium package’s basic functionalities and show how it can help you to perform many useful daily tasks automatically in your internet browser. RSelenium also offers a great opportunity to observe object-oriented programming (OOP) fully implemented in R.

In order to use RSelenium, first you need to install it with the install.packages("RSelenium") command. Be carefull to write the first two letter as capital ones. Besides, the Java software needs to be installed in your machine too, or later you will face the following error: error in java_check () : PATH to JAVA not found. A very few functions from the tidyverse will also appear in our code, so install it with install.packages("tidyverse") if you haven’t yet. Also, the package getPass might be useful if we want to send sensitive login information in our code.

In our examples of browser manipulation, we used the version 91.0.4472.77 of Google Chrome in a Windows environment and R version 4.0.5. However, RSelenium can be used with other combinations of internet browser and operational system. Check out the RSelenium official documentation page for more information about these topics.

The following images in this article are all mines, except the few times when the webpages https://quotes.toscrape.com/ and http://httpbin.org/forms/post (two popular sites for webscraping exercises) or their HTML code are reproduced here. I used an R Jupyter Notebook during development but the code can be run in RStudio for sure too. You will find both a Jupyter Notebook and an R script with the complete code on my Github repository.

For people who might be interested in learning Selenium for Python, there is another notebook on the same Github repository with the code translated to this programming language.

2. Importing packages and writing a helper function

Let’s start by importing the libraries to be used in our code.

I also created a customized function called check_object that will print the class and typeof functions info, which will help us to analyze the new RSelenium objects we will find on our way.

3. First RSelenium objects: rsClientServer and remoteDriver

The next step is to create an RSelenium rsClientServer, which we will save in the computational variable called client_server. The code below will automatically open a new Chrome window and give us more information about the object we have just created.

Note that I passed a specific Chrome version as a string in the chromever argument. So, you will need to adjust it according to your reality. Some arguments in RSelenium::rsDriver() use "latest" as default value, which can cause errors if an installed version from Selenium or the chosen browser is different from the latest available version, even if it is a beta one. I experienced that issue with Chrome while finishing this article and I want you to be prepared if you wind up facing a similar obstacle.

Now it is time to save the RSelenium remoteDriver into the computational variable called driver. This object is by far the most important one in our code. As we will discover later, we will spend most of the time calling methods from the remoteDriver object to manipulate the browser and get the information we need.

4. Navigate to a new webpage

We will start by telling the remote driver to open the site https://quotes.toscrape.com/. One can do that by calling the navigate() method from the remoteDriver object and using the URL as the method argument:

After running this code, the Chrome window controlled by RSelenium will open the chosen site.

Whenever you need your code to return the current webpage URL, you can call driver$getCurrentUrl(). And if you want to work with a maximized browser window, just call driver$maxWindowSize().

Here I would like to make a very important observation: the browser window controlled by RSelenium will close automatically if there is no new activity for a while. If that happens, you will probably have to run the code below in order to free the port you last used in your RSelenium program. I found this line of code on StackOverflow after facing this problem for the first fime and now I use it whenever I run into such an issue.

5. The driver$findElement() method

The first element to be manipulated will be the login link, located in the webpage upper right corner. We will do that later by calling the driver$findElement() method with its two arguments (using and value).

However, since this method is so important, let’s take a quick look at all options available to the using parameter. Running driver$findElement, without the parentheses, will return information about this method. Please note that only the first lines from the output are reproduced below.

As we can observe from the last reproduced output, the using parameter accepts one of the following strings:

  • "xpath" => find the element in the page by using the Xpath query language;
  • "css selector" => use a selector from the CSS style sheet language to find the element;
  • "id" | "name" | "class name" => these three options use respectively the id attribute, the name attribute, and the class attribute from the HTML tags to find an element in the webpage;
  • "link text" | "partial link text" => when you deal with an HTML anchor tag (example: <a>this is the link text</a>), you can find them by searching for either the partial or the complete link text.

6. Working with an RSelenium webElement

So, let’s use the Xpath to find the login link element. One can do that on Chrome by right-clicking on the login link and then choosing the inspect option. This will open the DOM panel and show its HTML code.

Now, in your browser, right-click on the HTML <a> login element (the line highlighted in blue in the image above) and then choose “Copy” and “Copy full Xpath” in the auxiliary menu. Paste the Xpath information into a string and use it inside driver$findElement(), as follow:

As one can observe, the method driver$findElement() returns an RSelenium webElement object. When saved in a computational variable like login_link, this webElement can be used to call other methods too, as shown below:

You can also click on a webElement. In the current example, this will send us to the login page.

7. Make your code take a little break with Sys.sleep()

We will manipulate this login page later. By now, let’s run the following code so that we can go back to the main page and wait for 2.5 seconds. Then we return to the login page and wait for another 2 seconds:

When we work with RSelenium, asking the program to wait a couple of seconds before moving to a new page is an excellent strategy. Actually, this additional time before moving on is almost a mandatory strategy in order to avoid errors that can crash the code.

For example, if we go to a new page and ask RSelenium to find an HTML element that has not been loaded yet, we will get a NoSuchElement Error. And if that happens during a long loop and a lot of data has already been scraped, all the effort done by then might get lost. Thus, waiting a few seconds is a more preferable approach. You could also use try() and tryCatch() statements so that your code can be prepared for some of the most common errors before they occur.

8. Write a function to go back to the first page

If we were to click on the Quotes to Scrape link, the browser goes back to the first page. The image below shows this link HTML code:

Let’s find this link now with RSelenium by using the following CSS selector:

In CSS, class attributes are represented by a dot, and id attributes are represented by a hashtag symbol. The css_selector string we reproduced above can be read as follow: “find the first div HTML tag that has both the header-box and row classes, then find a div with the col-md-8 class inside it. Next, find a h1 heading, and then return the first link element inside it”.

We can get some info from this new element and click on it too so that the browser goes to the first page.

Now we are ready to write our first customized function using RSelenium code, which will be helpful whenever we want to go back to the first page in this website. Note here the phenomenon of method-chaining (a method called right after another method), which is a very common practice when working with OOP. This new function of ours will not return a webElement object, since the last action it performs is to click on the link. So, the function will return NULL because that is the value returned by the clickElement() method.

9. Gather all links in a webpage with driver$findElements()

Our next action will be to scrape all links in the first page. This can be done by the findElements() method (note the additional letter s in the end). While findElement() (singular) returns the first webElement object in the DOM to match the search criteria, findElements() (plural) returns a list with ALL webElements objects with given criteria. findElements() will also demand the same using and value parameters we talked about earlier. If no webElement is found with the arguments passed to findElements(), an empty list will be returned by it.

If we perform a search using only the <a> tag name, we will find 55 links in the first page.

We can choose to print to the console the inner text and URL from all the link elements in the first page. In order to do that, let’s first create another function.

This new function must receive an RSelenium webElement with an HTML <a> tag, since we will look for its inner text and href attibute. So, if the print_output parameter is set to TRUE, the code inside the if statement is executed and the element inner text and URL will be printed to the console gradually, since flush.console() is called there. A vector with text and urlis returned by the function.

The code below with lapply will print all the information from the all_links_page_1 list. The computational variable saved_list allows to access this info in a list of lists too, if necessary. Only part of the output is reproduced below:

If we prefer to save the links information into a data frame, that can be done by a for loop, with each iteration calling show_links_info() and appending this info to a new computational variable called links_dataframe.

10. Go through all pages in a website

Another very common task we might need to perform is to visit all pages in a website by following a next link, for example. And in each page, we can perform an action, like scraping some information.

Since we cannot be sure of how many pages a site has beforehand, we need code that keeps repeating an action until a condition becomes true and the code stops executing. That is right, this is a perfect scenario to use a repeat loop with a break inside an if statement. At first, this loop of ours will only go through every page and then print the page number to the console.

Note that the next link is located by using driver$findElements() in line 11. So, this code will return either a list with one webElement or a list with no element at all. This is a nice way of avoiding the NoSuchElement Error that would have been raised in the last page if we had used the driver$findElement() method instead. So, when next_link_as_list has a length of zero, the code will enter the conditional and break the loop, and we will have accomplished our goal of going through all pages in the website.

11. Writing code to open a specific page in the website

After the last code execution, we discovered that the site has only 10 pages with quotes. Besides, all pages from 2 to 10 share the following url pattern: https://quotes.toscrape.com/page/ plus the page number added to its end. This kind of situation is a very desirable one because it allows you to move to new pages by just sending the specific url pattern directly to the driver$navigate() method. So, if you know all the pages range beforehand, this approach will be much easier than finding link elements and clicking on them to change pages.

Let’s now create a function that will send us directly to a chosen page in our website. We will add some boolean conditions in it so that the function can check the argument class and raise an exception if it is a character one. The function will also truncate decimal numbers and raise an error if the number passed to it is not between 1 to 10.

Now, we can go to page 3 and page 7 directly by running this code:

If we ask to go to page pi by mistake instead of page 3, the function will let us know that it used a truncated version of pi, leading us then to page 3.

And any of the following function calls will return an error:

And what if we wanted to retrieve an information from the current URL and save it in a computational variable, like the current page number, for example? The following code will transform the current URL in a string, split it by the / characters and save this info in a list of strings. Then it will get the last element of this list, which will be the page number for any page but the first one.

12. Saving the quotes info as strings

Now, it is time for us to tackle the main information in the website: the quotes. Before we can save them all in a nice data frame (R programmers do like saving info in data frames, don’t they?), I will show you a way of accessing all the quotes text information at once. Below we reproduce the HTML structure from one of these quotes (the vertical blue line marks all children tags inside a quote box):

When we use the getElementText() method from a webElement object, we not only access its own inner text but also the inner texts from all its children tags. So, if each quote info is inside a div tag with a quote class, represented in CSS selector by div.quote, we can print all quotes information from page 1 by running the following code (output is only parcially reproduced below):

13. Saving the quotes info as a data frame

In our effort to save the quotes information into a data frame, let’s first create the following auxiliary functions:

If we run the code below, we can get all quotes info in the first page and save them to computational variables:

These three functions we created will be used to get the quotes information from each page (like we did for page 1) and save it in vectors that will eventually be appended to the all_quotes data frame. Notice that we will also loop over the pages using the repeat structure we built before, even though we use a while (TRUE) loop here instead.

If we want, we can save the all_quotes data frame as a CSV file, check the data frame dimensions and look at its first rows.

Pretty amazing, right?

14. Saving the authors’ biographical info as a data frame

However, there are still some interesting information in this website waiting to be scraped by us. If you navigate to the first link in the statistical variable all_quotes$author_biography_link, you will find biographical information about Albert Einstein. Every author quoted in the website has a page like that. The good news is that we already gathered all these biographical links, we only need to loop over them and save the scraped information into another data frame called authors_info. The following code accomplishes that task.

After we loop over all the 50 biographical pages, we will have a new nice data frame with all this information and we can saved it in a new CSV file, for example. Notice that this program logic can be reused and applied to many web scraping tasks. Find the data you need, make sure to follow the website robots.txt instructions and be happy!

15. Manipulate text input boxes in the login page

Now it is time for us to start manipulating form elements, which is a very important skill to add to our task automation kit. Let’s go back to the login page and save the two text input boxes into computational variables.

When handling text input HTML elements, two methods are definitely very useful: clearElement() and sendKeysToElement(). It is a good idea to start by calling the former so that any previous content might get cleared before you start sending new information to the text box. As for the second method, it receives a list with a string as its sole element.

In order to submit the form and login in the site, we still need to press the login button. We can do that by using the submitElement(), available for some HTML elements that have the ability of submiting information to the website.

After we do that, nothing extraordinary will happen: we will be redirected to the first page and the login text will be changed to Logout. But we did perform some fancy activity: we wrote code to login automatically in a website. If we desire, we could do the same operation in many other sites.

Just be careful about letting important information, like passwords, hard-coded in your program. Don’t make the life of bad people who can steal your personal info too easy, ok? A very good option to avoid such problems is to use the getPass package to send sensitive input data.

16. Working with other types of HTML input tags

Even though they are found on the internet very often, textboxes are not the only HTML input we can manipulate with RSelenium code in our browser. We will move now to a new website (http://httpbin.org/forms/post) and automate the boring task of ordering pizza online. Because eating pizza is a lot of fun but ordering it… well, not that much. We can let our R program to do the ordering for us so that we can focus only on the eating, fun part.

As we can observe, besides three text boxes, this form uses radio buttons for the pizza size, checkboxes for the toppings, and a time input for the delivery time. These four elements are all input HTML tags, but they differ in their type attribute value. And depending on the chosen input type, they can have very specific HTML attributes that are not present in the others ones. We also have a textarea tag used in the delivery instructions, and a submit button to the form.

So, I want a large pizza with bacon, extra cheese and mushrooms to be delivered around 19:45 at my home. I just need to let my code to do the ordering for me.

Below you will find my answer to this challenge. However, before you check it out, see if you can imagine how this code would look like. And to make it more interesting, use a one liner to order the three toppings at once with the help of the base::lapply() and the click_element() functions (this last one was created by us earlier).

Another quick tip: radioboxes and checkboxes can be manipulated using the $clickElement() method.

My code to perform this task goes below:

17. Finishing your work with RSelenium

Finally, we can ask RSelenium to close the browser window and shut down the server. We should also release the port we were using.

18. Last words

Now, we could create code to order 20 diferent pizzas, for example, if we desired to make a good surprise for our great work colleages. Then, whenever you want to ask another 20 pizzas (or 15, or 1000!), you just need to hit the run button, and RSelenium makes all the work for you.

That’s the beauty of task automation, and with this tutorial, we only scratched the surface of what RSelenium can do for us. However, the information published here is more than enough to allow you to continue your learning path in this area and create fantastic scripts that will save you and your team a lot of time.

And this saved time can be applied later in other important activities, like spending quality time with family or friends, reading, learning new skills, birdwatching, or simply relaxing. You definitely earned it after writing your automation script!

Good luck and happy automation!

P.S.: You will find more info about my work on LinkedIn, Medium, and Github:

https://www.linkedin.com/in/fabriciobrasil

https://fabriciusbr.medium.com/

https://github.com/fabricius1

--

--

Fabrício Barbacena
Fabrício Barbacena

Written by Fabrício Barbacena

Python and Django Developer • Data Analyst • BI Consultant • Data Science • Data Engineering • https://linktr.ee/fabriciobarbacena

No responses yet