Selenium For Web Scraping Python

What is web scraping? Web scraping, also known as web harvesting or web data extraction, is data scraping used to extract data from websites. A scraping script may access a URL directly using HTTP requests or by simulating a web browser. The second approach is exactly how Selenium works: it simulates a web browser. Web scraping mostly focuses on transforming unstructured data on the web (HTML) into structured data (a database or spreadsheet).

When a web page is opened in a browser, the browser automatically executes JavaScript and generates dynamic HTML content. It is common to make HTTP requests to retrieve web pages; however, if a web page is generated dynamically by JavaScript, an HTTP request only retrieves the page's source code. Many websites use Ajax to send information to and retrieve data from the server without reloading the page. To scrape Ajax-enabled web pages without losing any data, one solution is to execute the JavaScript using Python packages and scrape the fully loaded web page. Selenium is a powerful tool that automates browsers and loads web pages with the ability to execute JavaScript.

1. Start Selenium with a WebDriver

Selenium does not contain a web browser itself. It calls an API on a WebDriver, which opens a browser. Both Firefox and Chrome have their own WebDrivers that interact with Selenium. If you do not need a browser UI, PhantomJS is an option that loads web pages and executes JavaScript in the background, although its development has been suspended. In the following examples, I will use the Chrome WebDriver.

Before starting Selenium with a WebDriver, install Selenium (`pip install selenium`) and download the Chrome WebDriver.

Start Selenium with a WebDriver. By running the following code, a Chrome browser pops up.

2. Dynamic HTML

Let's take this web page as an example: https://www.u-optic.com/plano-convex-spherical-lens/en.html. The page makes Ajax requests to retrieve data and then generates its content dynamically. Suppose we are interested in the data listed in the HTML table. The records are not present in the original HTML source code, so a simple HTTP request will retrieve only the page source without the data.
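A quick way to verify this, assuming the `requests` package is installed: fetch the raw HTML and look for the `text-bold` class that the rendered table cells carry (the class name comes from the notes later in this article).

```python
import requests

url = 'https://www.u-optic.com/plano-convex-spherical-lens/en.html'
try:
    resp = requests.get(url, timeout=10)
    # The Ajax-rendered table cells carry the class 'text-bold';
    # checking for it shows whether the data is in the static source.
    print('text-bold' in resp.text)
except requests.RequestException as exc:
    print(f'request failed: {exc}')
```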


A closer look in the browser at the table generated by JavaScript shows the records rendered in the DOM.

3. Start scraping

There are two ways to scrape dynamic HTML. The more straightforward way is to load the page in the Selenium WebDriver. The WebDriver automatically executes the Ajax requests and then generates the full web page. Once the page is loaded completely, we can use Selenium to acquire the page source, in which the data is present.


However, on the example web page the table is paginated and shows only 10 records at a time, so multiple Ajax requests would have to be made to retrieve all the records.

Inspecting the web page under the Network tab, we find two Ajax requests from which the page loads the data used to construct the tables.

By copying and pasting the URLs into a browser, or by making HTTP requests using the Python Requests library, we retrieve 10 records in JSON.

The returned JSON data indicates that there are 1564 records in total. A closer look at the Ajax URL reveals that the number of records to retrieve is specified by the 'length' parameter in the URL.

There are 62 items in the first table and 1564 in the second, so we change the value of the 'length' parameter in each URL accordingly.
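With the Requests library, the adjusted URL can be built by passing 'length' as a query parameter. The endpoint path below is the one this article identifies in the performance log; any other query parameters of the real URL are unknown here and omitted, so treat this as an illustrative sketch.

```python
import requests

# '/api/diy/get_product_by_type' is the endpoint path found in the
# WebDriver logs; other query parameters of the real URL are omitted.
base = 'https://www.u-optic.com/api/diy/get_product_by_type'

# Build the URL with 'length' set to the total record count.
url = requests.Request('GET', base, params={'length': 1564}).prepare().url
print(url)  # → https://www.u-optic.com/api/diy/get_product_by_type?length=1564

# To actually fetch all records:
# records = requests.get(base, params={'length': 1564}, timeout=10).json()
```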

Making requests for the data directly is much more convenient than parsing the data from web pages using Xpath or CSS selector.


4. Search for Ajax request URLs in WebDriver logs

The Ajax request URLs are hidden inside the JavaScript code. We can search the WebDriver's performance log, which records events including Ajax requests. To retrieve performance logs from the WebDriver, we must enable the corresponding capability when creating the WebDriver object:


The performance log records network activities that the WebDriver performed when loading the web page.

The value of the key 'message' is a JSON string. Parsing the string with Python's json module, we find that the Ajax requests that retrieve the data are sent under the method 'Network.requestWillBeSent'. The URL has the path '/api/diy/get_product_by_type'.


We can use a regular expression to find these URLs.
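Putting the two steps together, a sketch of parsing the log entries and filtering the request URLs. The sample entry below is fabricated for illustration; real entries come from `driver.get_log('performance')`.

```python
import json
import re

# Each performance-log entry stores a DevTools event as a JSON string
# under the key 'message'. This sample mimics that structure.
sample_entries = [{
    'message': json.dumps({
        'message': {
            'method': 'Network.requestWillBeSent',
            'params': {'request': {
                'url': 'https://www.u-optic.com/api/diy/get_product_by_type?length=10'
            }},
        }
    })
}]

def extract_ajax_urls(entries):
    """Collect request URLs matching the data endpoint from performance-log entries."""
    urls = []
    for entry in entries:
        event = json.loads(entry['message'])['message']
        if event.get('method') != 'Network.requestWillBeSent':
            continue
        url = event.get('params', {}).get('request', {}).get('url', '')
        if re.search(r'/api/diy/get_product_by_type', url):
            urls.append(url)
    return urls

print(extract_ajax_urls(sample_entries))
# → ['https://www.u-optic.com/api/diy/get_product_by_type?length=10']
```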

Additional notes:

When the WebDriver loads the web page, it may take a few seconds to make the Ajax requests and generate the page content. It is therefore recommended to configure the WebDriver to wait until the section we intend to scrape has loaded completely. In this example, we want to scrape the data in the table. The data is placed under the class 'text-bold', so we set the WebDriver to wait up to 5 seconds for the class 'text-bold' to load. If the section does not load within 5 seconds, a TimeoutException is thrown.

5. Conclusion

Dynamically generated web pages differ from their source code, so we cannot scrape them with plain HTTP requests. Executing JavaScript with Selenium is one solution that scrapes such pages without losing any data. Furthermore, if the data to be scraped is retrieved via Ajax requests, we can search for the request URLs in the WebDriver's performance logs and then retrieve the data directly with HTTP requests.