Web Scraping Using Python

Instead of looking at the job site every day, you can use Python to help automate the repetitive parts of your job search. Automated web scraping can be a solution to speed up the data collection process. You write your code once and it will get the information you want many times and from many pages. There are different ways of scraping web pages using python. In my previous article, I gave an introduction to web scraping by using the libraries:requests and BeautifulSoup. However, many web pages are dynamic and use JavaScript to load their content. These websites often require a different approach to gather the data.

  1. Python Web Scraper
  2. Web Scraping Using Python And Selenium
  3. Web Scraping Python Example
  4. Web Scraping Using Python Login
  5. Web Scraping Using Python Code
  6. Web Scraping Using Python Tutorial

Sometimes we need to extract information from websites. We can extract data from websites by using there available API’s. But there are websites where API’s are not available.

Here, Web scraping comes into play!

Python is widely being used in web scraping, for the ease it provides in writing the core logic. Whether you are a data scientist, developer, engineer or someone who works with large amounts of data, web scraping with Python is of great help.

Python Web Scraper

Without a direct way to download the data, you are left with web scraping in Python as it can extract massive quantities of data without any hassle and within a short period of time.

In this tutorial , we shall be looking into scraping using some very powerful Python based libraries like BeautifulSoup and Selenium.

BeautifulSoup and urllib

BeautifulSoup is a Python library for pulling data out of HTML and XML files. But it does not get data directly from a webpage. So here we will use urllib library to extract webpage.

First we need to install Python web scraping BeautifulSoup4 plugin in our system using following command :

$ sudo pip install BeatifulSoup4

$ pip install lxml

OR

$ sudo apt-get install python3-bs4

$ sudo apt-get install python-lxml

So here I am going to extract homepage from a website https://www.botreetechnologies.com

from urllib.request import urlopen

from bs4 import BeautifulSoup

We import our package that we are going to use in our program. Now we will extract our webpage using following.

response = urlopen('https://www.botreetechnologies.com/case-studies')

Beautiful Soup does not get data directly from content we just extract. So we need to parse it in html/XML data.

data = BeautifulSoup(response.read(),'lxml')

Here we parsed our webpage html content into XML using lxml parser.

As you can see in our web page there are many case studies available. I just want to read all the case studies available here.

There is a title of case studies at the top and then some details related to that case. I want to extract all that information.

We can extract an element based on tag , class, id , Xpath etc.

You can get class of an element by simply right click on that element and select inspect element.

case_studies = data.find('div', { 'class' : 'content-section' })

In case of multiple elements of this class in our page, it will return only first. So if you want to get all the elements having this class use findAll() method.

Web scraping using python code

case_studies = data.find('div', { 'class' : 'content-section' })

Now we have div having class ‘content-section’ containing its child elements. We will get all <h2> tags to get our ‘TITLE’ and <ul> tag to get all children, the <li> elements.

case_stud.find('h2').find('a').text

case_stud_details = case_stud.find(‘ul’).findAll(‘li’)

Now we got the list of all children of ul element.

To get first element from the children list simply write:

case_stud_details[0]

Web Scraping Using Python And Selenium

We can extract all attribute of a element . i.e we can get text for this element by using:

case_stud_details[2].text

But here I want to click on the ‘TITLE’ of any case study and open details page to get all information.

Since we want to interact with the website to get the dynamic content, we need to imitate the normal user interaction. Such behaviour cannot be achieved using BeautifulSoup or urllib, hence we need a webdriver to do this.

Webdriver basically creates a new browser window which we can control pragmatically. It also let us capture the user events like click and scroll.

Selenium is one such webdriver.

Selenium Webdriver

Selenium webdriver accepts cthe ommand and sends them to ba rowser and retrieves results.

You can install selenium in your system using fthe ollowing simple command:

$ sudo pip install selenium

In order to use we need to import selenium in our Python script.

from selenium import webdriver

I am using Firefox webdriver in this tutorial. Now we are ready to extract our webpage and we can do this by using fthe ollowing:

self.url = 'https://www.botreetechnologies.com/'

self.browser = webdriver.Firefox()

Now we need to click on ‘CASE-STUDIES’ to open that page.

We can click on a selenium element by using following piece of code:

self.browser.find_element_by_xpath('//div[contains(@id,'navbar')]/ul[2]/li[1]').click()

Now we are transferred to case-studies page and here all the case studies are listed with some information.

Here, I want to click on each case study and open details page to extract all available information.

So, I created a list of links for all case studies and load them one after the other.

To load previous page you can use following piece of code:

self.browser.execute_script('window.history.go(-1)')

Final script for using Selenium will looks as under:

And we are done, Now you can extract static webpages or interact with webpages using the above script.

Conclusion: Web Scraping Python is an essential Skill to have

Today, more than ever, companies are working with huge amounts of data. Learning how to scrape data in Python web scraping projects will take you a long way. In this tutorial, you learn Python web scraping with beautiful soup.

Along with that, Python web scraping with selenium is also a useful skill. Companies need data engineers who can extract data and deliver it to them for gathering useful insights. You have a high chance of success in data extraction if you are working on Python web scraping projects.

If you want to hire Python developers for web scraping, then contact BoTree Technologies. We have a team of engineers who are experts in web scraping. Give us a call today.

Consulting is free – let us help you grow!

In the last tutorial we learned how to leverage the Scrapy framework to solve common web scraping problems.Today we are going to take a look at Selenium (with Python ❤️ ) in a step-by-step tutorial.

Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.

The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either localy or remotely.

At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser, end-to-end testing (acceptance tests).

Now it is still used for testing, but it is also used as a general browser automation platform. And of course, it us used for web scraping!

Selenium is useful when you have to perform an action on a website such as:

  • Clicking on buttons
  • Filling forms
  • Scrolling
  • Taking a screenshot

It is also useful for executing Javascript code. Let's say that you want to scrape a Single Page Application. Plus you haven't found an easy way to directly call the underlying APIs. In this case, Selenium might be what you need.

Installation

We will use Chrome in our example, so make sure you have it installed on your local machine:

  • selenium package

To install the Selenium package, as always, I recommend that you create a virtual environment (for example using virtualenv) and then:

Quickstart

Once you have downloaded both Chrome and Chromedriver and installed the Selenium package, you should be ready to start the browser:

This will launch Chrome in headfull mode (like regular Chrome, which is controlled by your Python code).You should see a message stating that the browser is controlled by automated software.

To run Chrome in headless mode (without any graphical user interface), you can run it on a server. See the following example:

The driver.page_source will return the full page HTML code.

Here are two other interesting WebDriver properties:

  • driver.title gets the page's title
  • driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)

Locating Elements

Web Scraping Python Example

Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract data and save it for further analysis (web scraping).

There are many methods available in the Selenium API to select elements on the page. You can use:

  • Tag name
  • Class name
  • IDs
  • XPath
  • CSS selectors

We recently published an article explaining XPath. Don't hesitate to take a look if you aren't familiar with XPath.

As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need.A cool shortcut for this is to highlight the element you want with your mouse and then press Ctrl + Shift + C or on macOS Cmd + Shift + C instead of having to right click + inspect each time:


find_element

There are many ways to locate an element in selenium.Let's say that we want to locate the h1 tag in this HTML:

All these methods also have find_elements (note the plural) to return a list of elements.

For example, to get all anchors on a page, use the following:

Some elements aren't easily accessible with an ID or a simple class, and that's when you need an XPath expression. You also might have multiple elements with the same class (the ID is supposed to be unique).

XPath is my favorite way of locating elements on a web page. It's a powerful way to extract any element on a page, based on it's absolute position on the DOM, or relative to another element.

WebElement

A WebElement is a Selenium object representing an HTML element.

There are many actions that you can perform on those HTML elements, here are the most useful:

  • Accessing the text of the element with the property element.text
  • Clicking on the element with element.click()
  • Accessing an attribute with element.get_attribute('class')
  • Sending text to an input with: element.send_keys('mypassword')

There are some other interesting methods like is_displayed(). This returns True if an element is visible to the user.

It can be interesting to avoid honeypots (like filling hidden inputs).

Honeypots are mechanisms used by website owners to detect bots. For example, if an HTML input has the attribute type=hidden like this:

This input value is supposed to be blank. If a bot is visiting a page and fills all of the inputs on a form with random value, it will also fill the hidden input. A legitimate user would never fill the hidden input value, because it is not rendered by the browser.

That's a classic honeypot.

Full example

Here is a full example using Selenium API methods we just covered.

We are going to log into Hacker News:


In our example, authenticating to Hacker News is not really useful on its own. However, you could imagine creating a bot to automatically post a link to your latest blog post.

In order to authenticate we need to:

  • Go to the login page using driver.get()
  • Select the username input using driver.find_element_by_* and then element.send_keys() to send text to the input
  • Follow the same process with the password input
  • Click on the login button using element.click()

Should be easy right? Let's see the code:

Easy, right? Now there is one important thing that is missing here. How do we know if we are logged in?

We could try a couple of things:

  • Check for an error message (like “Wrong password”)
  • Check for one element on the page that is only displayed once logged in.

So, we're going to check for the logout button. The logout button has the ID “logout” (easy)!

We can't just check if the element is None because all of the find_element_by_* raise an exception if the element is not found in the DOM.So we have to use a try/except block and catch the NoSuchElementException exception:

Taking a screenshot

We could easily take a screenshot using:


Web Scraping Using Python Login

Note that a lot of things can go wrong when you take a screenshot with Selenium. First, you have to make sure that the window size is set correctly.Then, you need to make sure that every asynchronous HTTP call made by the frontend Javascript code has finished, and that the page is fully rendered.

In our Hacker News case it's simple and we don't have to worry about these issues.

Waiting for an element to be present

Web Scraping Using Python Code

Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites are using frameworks like Angular, React and Vue.js for their front-end. These front-end frameworks are complicated to deal with because they fire a lot of AJAX calls.

If we had to worry about an asynchronous HTTP call (or many) to an API, there are two ways to solve this:

  • Use a time.sleep(ARBITRARY_TIME) before taking the screenshot.
  • Use a WebDriverWait object.

If you use a time.sleep() you will probably use an arbitrary value. The problem is, you're either waiting for too long or not enough.Also the website can load slowly on your local wifi internet connection, but will be 10 times faster on your cloud server.With the WebDriverWait method you will wait the exact amount of time necessary for your element/data to be loaded.

This will wait five seconds for an element located by the ID “mySuperId” to be loaded.There are many other interesting expected conditions like:

  • element_to_be_clickable
  • text_to_be_present_in_element
  • element_to_be_clickable

Web Scraping Using Python Tutorial

Python

You can find more information about this in the Selenium documentation

Executing Javascript

Sometimes, you may need to execute some Javascript on the page. For example, let's say you want to take a screenshot of some information, but you first need to scroll a bit to see it.You can easily do this with Selenium:

Conclusion

I hope you enjoyed this blog post! You should now have a good understanding of how the Selenium API works in Python. If you want to know more about how to scrape the web with Python don't hesitate to take a look at our general Python web scraping guide.

Selenium is often necessary to extract data from websites using lots of Javascript. The problem is that running lots of Selenium/Headless Chrome instances at scale is hard. This is one of the things we solve with ScrapingBee, our web scraping API

Selenium is also an excellent tool to automate almost anything on the web.

If you perform repetitive tasks like filling forms or checking information behind a login form where the website doesn't have an API, it's maybe* a good idea to automate it with Selenium,just don't forget this xkcd: