Jupyter Notebook Web Scraping

“Literate Programming” with Jupyter Notebook. Another IDE that comes into play when talking about Python is Jupyter Notebook, formerly known as IPython Notebook. Jupyter Notebook is especially important in giving shape to what Donald Knuth, a computer scientist from Stanford, famously called “literate programming”.

This content is from the fall 2016 version of this course.

If you have not already done so, you will need to properly install an Anaconda distribution of Python, following the installation instructions from the first week.

I would also recommend installing a friendly text editor for editing scripts, such as Atom. Once installed, you can start a new script from bash by typing atom name_of_your_new_script, and edit an existing script with atom name_of_script. Sublime Text works similarly to Atom. Alternatively, you may use a terminal editor such as Vim, though it has a steeper learning curve.

Installing New Python Packages

One way to install new packages not already included in Anaconda is conda install <package>. While packages in Anaconda are curated, they are not always the most up-to-date versions. Furthermore, not all packages are available through conda install.

To resolve this issue, use the Python package manager pip, which should be installed by default. To begin, update pip.

On Mac or Linux
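The standard upgrade command (assuming pip is already on your path) is:

```shell
pip install --upgrade pip
```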

On Windows
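On Windows, the recommended form invokes pip through the Python launcher:

```shell
python -m pip install --upgrade pip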

In contrast to querying APIs with Python, web-scraping relies on targeting the observed structure of a website itself to download specified content. A good conceptual model for web-scraping is the following example:

Suppose you would like to collect all the speeches and remarks of President Obama during his presidency. You could begin by going to the White House Speeches and Remarks website, finding a speech, copying the text, pasting it in a text file, and saving it. You would then repeat this process for every speech: navigate to its URL, copy the text, and save it. Obviously, this would be an onerous task to do manually. Web-scraping offers a programmatic solution to automate this process.

We will return to a simplified example of scraping presidential speeches later in the tutorial. Before we get deeper into this, let’s review the key ideas in web-scraping.

Essential Processes of Webscraping

There are two essential processes in web-scraping:

(1) Finding URLs to Download

Finding URLs to download is often one of the most challenging parts of web-scraping and is highly dependent on the website layout, URL construction, and formatting. Your basic goal here is to generate a list or CSV of URLs to download. How do I know what the URLs should be?

Some sites have very clean URLs:

In such a case, the URLs themselves follow a formula. Generating the list of URLs is a simple matter of using string manipulation to create URLs for all possible datetimes that exist in the range you desire.
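As a sketch of this approach, suppose (hypothetically) that a site publishes one page per day at a date-stamped path. The full URL list can then be generated with simple string manipulation:

```python
from datetime import date, timedelta

# Hypothetical URL scheme: one page per day at /briefing-room/YYYY-MM-DD
BASE_URL = "https://www.example.gov/briefing-room/"

def daily_urls(start, end):
    """Generate one URL per day between start and end (inclusive)."""
    urls = []
    day = start
    while day <= end:
        urls.append(BASE_URL + day.strftime("%Y-%m-%d"))
        day += timedelta(days=1)
    return urls

urls = daily_urls(date(2016, 1, 1), date(2016, 1, 3))
```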

This is rarely the case. Often URLs will have no programmable format, and you must instead collect them. Here the strategy is to begin at the base URL for your desired site content. From there, one or more links (which may themselves lead to further links) lead to the final URL. The task is to write a series of loops that recursively follow each possible path to reach the final set of URLs. This is site-dependent.

Some sites have pager functions: Page 1 of X. Others have JavaScript that dynamically renders site content when the user scrolls to the end of the page. Pager functions and other file paths can be handled using BeautifulSoup and urllib. Dynamic pages are best handled using a package like Selenium.

These topics are beyond the current scope. Today, we will be emphasizing the most basic of web-scraping tasks, downloading content for a specific URL.

(2) Downloading URLs

Once you have a URL, downloading it can be achieved in various ways, depending on the format. If your file exists as plain text or a PDF, you can simply download it from the command line using curl or wget.

Alternatively, for standard websites, you will often want to use an HTML parser such as BeautifulSoup.

Note that you could use wget on most websites. For standard websites, however, wget downloads the raw HTML (the markup that structures the page). This is not very readable and much longer than the desired content. Below, I demonstrate a simple implementation of BeautifulSoup to extract select speech content.
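As a minimal sketch, the snippet below parses an inline HTML fragment standing in for a speech page (a real scraper would fetch the page with urllib or requests first); the div id and headings here are invented for illustration:

```python
from bs4 import BeautifulSoup

# Stand-in for the raw HTML of a speech page
html = """
<html><body>
  <div id="content">
    <h1>Remarks by the President</h1>
    <p>Thank you. Thank you very much.</p>
    <p>It is an honor to be here today.</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Pull the speech title from the heading
title = soup.find("h1").get_text()
# Join the paragraph text, dropping the surrounding markup
speech = "\n".join(
    p.get_text() for p in soup.find("div", id="content").find_all("p")
)
```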

Before turning to this example, please note the following web-scraping libraries in Python:

Essential Tools for Webscraping

  • BeautifulSoup (HTML parsing)
  • urllib (fetching URLs)

Advanced Webscraping Tools

  • Selenium (For Advanced Web-scraping)

The code for this example is available on Github. Please clone the following:

To launch Jupyter, go to your Shell and type:

jupyter notebook

This will launch your web browser and Jupyter from the location where you run the above command. It is recommended that you do this immediately after the git clone above so that Jupyter opens in the correct location. You will have the option to open or navigate to the tutorial notebook, or to start a new one.

Open the folder Webscraping_and_APIs_in_Python and open the notebook Elementary_Web_Scraping.ipynb.

If you cannot launch the notebook, you can view the HTML version here.

This work is licensed under the CC BY-NC 4.0 Creative Commons License.

Learning Outcomes

  • Understand the benefits of using async + await compared to web scraping with the requests library alone.
  • Learn how to create an asynchronous web scraper from scratch in pure Python using asyncio and aiohttp.
  • Practice downloading multiple webpages using aiohttp + asyncio and parsing the HTML content of each URL with BeautifulSoup.

The following Python installations are for a Jupyter Notebook; if you are working from the command line, simply exclude the ! symbol.
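The packages used in the rest of this tutorial can be installed as follows (in a notebook, prefix the line with !):

```shell
pip install aiohttp nest-asyncio beautifulsoup4
```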

Note: The only reason we use nest_asyncio is that this tutorial is written in a Jupyter notebook. If you wanted to write the same web scraper code in a Python file, you wouldn’t need to install or run the following code block:
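The block in question is the standard two-line nest_asyncio setup, which patches the notebook’s already-running event loop so that asyncio.run() can be called inside it:

```python
import nest_asyncio

# Patch the running Jupyter event loop so asyncio.run() works inside a cell
nest_asyncio.apply()
```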

Why Use Asynchronous Web Scraping?

Writing synchronous web scrapers is easier and the code is less complex, but they are incredibly slow.

This is because each request must wait for the previous one to finish. Only one request can be running at a given time.

In contrast, asynchronous web requests can execute without waiting on previous requests in a queue or for loop. Asynchronous requests happen concurrently.

How Is Asynchronous Web Scraping Different From Using Python Requests?


Instead of thinking about creating a for loop with n requests, you need to think about creating an event loop. For example, the NodeJS environment, by design, executes in a single-threaded event loop.


However for Python, we will manually create an event loop with asyncio.

Inside your event loop, you can set a number of tasks to be completed, and every task will be created and executed asynchronously.

How To Web Scrape A Single Web Page Using Aiohttp

Firstly we define a client session with aiohttp:

Then with our session, we execute a get response on a single URL:

Thirdly, notice how we use the await keyword in front of response.text() like this:

Also, note that every asynchronous function starts with:

Finally, we run asyncio.run(main()); this creates an event loop and executes all tasks within it.

After all of the tasks have been completed then the event loop is automatically destroyed.
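Assembled into one script, the steps above look like the following sketch, with example.com standing in for a real target URL:

```python
import asyncio
import aiohttp

async def main():
    # (1) Define a client session with aiohttp
    async with aiohttp.ClientSession() as session:
        # (2) Execute a GET request on a single URL
        async with session.get("https://example.com") as response:
            # (3) await the body; control is yielded while it downloads
            return await response.text()

# (4) asyncio.run() creates an event loop, runs main(), then destroys the loop
html = asyncio.run(main())
```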


How To Web Scrape Multiple Pages Using Aiohttp

When scraping multiple pages with asyncio and aiohttp, we’ll use the following pattern to create multiple tasks that will be simultaneously executed within an asyncio event loop:

To start with, we create an empty list; then, for every URL, we append an uncalled/uninvoked coroutine function, together with an aiohttp session and the URL, to the list.

The asyncio.gather(*tasks) call tells asyncio to keep running the event loop until all of the functions within the list have completed. It returns a list of results, one per task, in the same order the tasks were passed.

Now that we know how to create and execute multiple tasks, let’s see this in action:
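A sketch of this pattern, using example.com domains as stand-in URLs:

```python
import asyncio
import aiohttp

# Stand-in URLs; replace these with the pages you want to scrape
urls = ["https://example.com", "https://example.org", "https://example.net"]

async def fetch(session, url):
    """Download the HTML body of one URL."""
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        # Calling fetch() creates coroutine objects; nothing runs yet
        tasks = [fetch(session, url) for url in urls]
        # gather() schedules every task and waits until all have finished
        return await asyncio.gather(*tasks)

pages = asyncio.run(main())
```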

Adding HTML Parsing Logic To The Aiohttp Web Scraper


In addition to collecting the HTML response from multiple webpages, parsing each page can be useful for SEO and HTML content analysis.

Therefore, let’s create a second function that parses the HTML page and extracts the title tag.
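One way to sketch this (again with stand-in URLs, and BeautifulSoup’s soup.title for the extraction) is to have each task fetch and parse its own page:

```python
import asyncio
import aiohttp
from bs4 import BeautifulSoup

urls = ["https://example.com", "https://example.org"]

async def fetch_and_parse(session, url):
    """Download one page, then extract its <title> tag with BeautifulSoup."""
    async with session.get(url) as response:
        html = await response.text()
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text() if soup.title else ""
    return {"url": url, "title": title}

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_and_parse(session, url) for url in urls]
        return await asyncio.gather(*tasks)

results = asyncio.run(main())
```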

Conclusion

Asynchronous web scraping is more suitable when you have a larger number of URLs that need to be processed quickly.


Also, notice how easy it is to add an HTML parsing function with BeautifulSoup, allowing you to extract specific elements on a per-URL basis.