Story: This series of articles assumes you work in the IT Department of Mason Books. The Bossman asks you to scrape the website of a competitor. He would like this information to gain insight into his inventory and pricing structure.
💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS and have reviewed our articles on How to Scrape HTML tables.
Part 1 focused on:
- Reviewing the website to scrape.
- Understanding HTTP Status Codes.
- Connecting to the Books to Scrape website using the `requests` library.
- Retrieving the total pages to scrape.
- Closing the open connection.
Part 2 focuses on:
- Configuring a page URL for scraping
- Setting a delay with `time.sleep()` to pause between page scrapes.
- Looping through two (2) pages for testing purposes.
💡 Note: This article assumes you have completed the steps in Part 1.
Getting Started
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
```
- The `pandas` library enables access to/from a DataFrame.
- The `requests` library provides access to HTTP requests in Python.
- The `Beautiful Soup` library enables data extraction from HTML and XML files.
💡 Note: The `time` library is built-in and does not require installation. However, be respectful to the website you are scraping. You can do this by pausing between page scrapes. This code is added below.
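As a minimal standalone sketch of this polite-delay idea (the URLs below are just sample values), `time.sleep()` pauses the program between requests:

```python
import time

# Sample page URLs to visit politely (placeholders for illustration).
page_urls = [
    "https://books.toscrape.com/catalogue/page-1.html",
    "https://books.toscrape.com/catalogue/page-2.html",
]

for url in page_urls:
    print(f"Would scrape: {url}")
    time.sleep(2)  # pause two (2) seconds before the next request
```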
To install these libraries, navigate to an IDE terminal and execute the commands below. The terminal used in this example shows a dollar sign (`$`) as the command prompt; your terminal prompt may be different.
$ pip install pandas
$ pip install requests
$ pip install beautifulsoup4
Hit the <Enter> key after each command to start the installation process.
Configure Page URL
The next step is to determine how to properly navigate from page to page while performing the scrape operation.
When you first navigate to the Books to Scrape site, the URL in the address bar is the following:
https://books.toscrape.com/index.html
Let’s see what happens when we click the `next` button in the footer area.

We are forwarded to page 2 of the website, and the URL in the address bar changes to the following:
https://books.toscrape.com/catalogue/page-2.html
Now, let’s navigate to the footer area and click the `previous` button.
We are forwarded to page 1 of the website, and the URL in the address bar changes to:
https://books.toscrape.com/catalogue/page-1.html
Notice how the original URL format has changed.
The following is appended to the original URL:
- a sub-directory: `/catalogue/`
- a page file: `page-x.html`, where `x` is the page you are currently on.
💡 Note: Click next and previous in the footer area to confirm this.
We can work with this!
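As a quick sanity check of the pattern (using a hypothetical three-page range for illustration), an f-string can generate each page URL:

```python
web_url = "https://books.toscrape.com"

# Build the first three page URLs from the pattern observed above.
urls = [f"{web_url}/catalogue/page-{page}.html" for page in range(1, 4)]
for url in urls:
    print(url)
```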
Let’s move to an IDE and write some Python code using this URL configuration to loop through all pages of the website.
💡 Note: Most of the code below has been brought forward from Part 1. The newly added lines are marked with `# [N]` comments and described in the list that follows.
```python
web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1                                                   # [1]

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    total_pgs = int(soup.find('li', class_='current')
                    .text.strip().split(' ')[3])

    while cur_page <= total_pgs:                               # [2]
        pg_url = f"{web_url}/catalogue/page-{cur_page}.html"   # [3]
        print(f"Scraping: {pg_url}")                           # [4]
        cur_page += 1                                          # [5]
        time.sleep(2)                                          # [6]
    res.close()                                                # [7]
else:
    print(f"The following error occurred: {res}")
```
- Line [1] creates a new variable `cur_page` to keep track of the page we are currently on. This variable is initially set to a value of one (1).
- Line [2] initiates a While Loop which repeats until `cur_page` exceeds `total_pgs`.
- Line [3] creates a new variable `pg_url` by combining the variable `web_url` with the `cur_page` variable.
- Line [4] outputs the value of `pg_url` to the terminal for each loop.
- Line [5] increases the value of `cur_page` by one (1).
- Line [6] pauses the code for two (2) seconds between pages using `time.sleep()`.
- Line [7] closes the open connection.
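For reference, the `total_pgs` value carried over from Part 1 is parsed from the site's pager text, which reads like `Page 1 of 50`. A minimal standalone sketch, using a sample string in place of the live page:

```python
# Sample pager text; on the live site this comes from
# soup.find('li', class_='current').text
pager_text = "  Page 1 of 50  "

# strip() removes the surrounding whitespace; split(' ')[3] grabs "50".
total_pgs = int(pager_text.strip().split(' ')[3])
print(total_pgs)  # 50
```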
Before running this code, we recommend you do not loop through all 50 pages of the website. Instead, let’s change the While Loop to the following:
```python
web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    total_pgs = int(soup.find('li', class_='current')
                    .text.strip().split(' ')[3])

    while cur_page <= 2:  # total_pgs:
        pg_url = f"{web_url}/catalogue/page-{cur_page}.html"
        print(f"Scraping: {pg_url}")
        cur_page += 1
        time.sleep(2)
    res.close()
else:
    print(f"The following error occurred: {res}")
```
💡 Note: To comment out code in Python, use the `#` character. This prevents everything after it on the line from executing.
The While Loop condition has been modified so the loop executes only twice, as depicted by the output below:
Output
Scraping: https://books.toscrape.com/catalogue/page-1.html
Scraping: https://books.toscrape.com/catalogue/page-2.html
Summary
In this article, you learned how to:
- Configure a page URL for scraping
- Set a delay with `time.sleep()` to pause between page scrapes.
- Loop through two (2) pages for testing purposes.
What’s Next
In Part 3 of this series, you will learn to identify additional elements/tags inside HTML code.

At university, I found my love of writing and coding, both of which I have used throughout my career.
During the past 15 years, I have held a number of positions such as:
- In-house Corporate Technical Writer for various software programs such as Navision and Microsoft CRM
- Corporate Trainer (staff of 30+)
- Programming Instructor
- Implementation Specialist for Navision and Microsoft CRM
- Senior PHP Coder