Story: This series of articles assumes you work in the IT Department of Mason Books. The Bossman asks you to scrape the website of a competitor. He would like this information to gain insight into his inventory and pricing structure.
💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS and have reviewed our articles on How to Scrape HTML tables.
This article focuses on:
- Reviewing the website to scrape.
- Understanding HTTP Status Codes.
- Connecting to the Books to Scrape website using the `requests` library.
- Retrieving the Total Pages to Scrape.
- Closing the Open Connection.
Getting Started
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
```
- The `pandas` library enables access to/from a DataFrame.
- The `requests` library provides access to HTTP requests in Python.
- The `Beautiful Soup` library enables data extraction from HTML and XML files.
💡 Note: The `time` library is built-in and does not require installation. However, be respectful to the website you are scraping. You can do this by pausing between page scrapes. This code is added in Part 2.
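For illustration only, the pause mentioned above amounts to a call such as the one below. This is a minimal sketch; the two-second delay is a placeholder value, and Part 2 shows where the pause actually belongs in the scraping loop.

```python
import time

# Placeholder pause between page requests; Part 2 covers the value actually used.
time.sleep(2)  # wait two seconds before requesting the next page
```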
To install these libraries, navigate to an IDE terminal. At the command prompt (`$`), execute the code below. For the terminal used in this example, the command prompt is a dollar sign (`$`). Your terminal prompt may be different.
$ pip install pandas
Hit the `<Enter>` key on the keyboard to start the installation process.
$ pip install requests
Hit the `<Enter>` key on the keyboard to start the installation process.
$ pip install beautifulsoup4
Hit the `<Enter>` key on the keyboard to start the installation process.
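💡 Note: If you prefer, all three libraries can be installed with a single command:

$ pip install pandas requests beautifulsoup4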
Website Review
Let’s navigate to Books to Scrape and review the format.
At first glance, you will notice:
- Book categories display on the left-hand side.
- There are, in total, 1,000 books listed on the website.
- Each web page displays 20 books.
- The price of each item displays in £ (in this instance, the UK pound).
- Each book displays minimal details.
- To view complete details for a book, click on the image or the `Book Title` hyperlink. This hyperlink forwards to a page containing additional book details for the selected item.
- The total number of website pages displays in the footer (`Page 1 of 50`).

The Bossman would like additional details beyond those displayed on the main pages. Our code will need to access each book’s sub-page to gather/scrape the appropriate information.

💡 Note: This series of articles uses the Google Chrome browser.
HTTP Response Codes
When you attempt to connect from your Python code to any URL, an HTTP Response Code is returned, indicating the connection status.
This code can be any one of the following:
| Status Code Range | Category |
| --- | --- |
| 100–199 | Informational responses |
| 200–299 | Successful responses |
| 300–399 | Redirection messages |
| 400–499 | Client error responses |
| 500–599 | Server error responses |
💡 Note: To view a detailed list of HTTP Status Codes, click here.
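The next section connects to the website properly from a script. As a quick preview, here is a minimal sketch (not part of the article’s final code) showing where this numeric code lives on a `requests` response:

```python
import requests

res = requests.get("https://books.toscrape.com")

# status_code holds the numeric HTTP Response Code (e.g., 200, 404, 500).
print(res.status_code)

# ok is True when the code is below 400 (i.e., not a client or server error).
print(res.ok)

res.close()
```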
Connect to Website
Before any scraping can occur, we need to determine if we can successfully connect to this website. We do this using the `requests` library. If successful, an HTTP Status Code of 200 returns.
Let’s try running this code by performing the following steps:
- Open an IDE terminal.
- Create a new Python file (example: `books.py`).
- Copy and paste the code below into this file.
- Save and run this file.
web_url = "https://books.toscrape.com" res = requests.get(web_url) if res: print(f"{res}") res.close() else: print(f"The following error occured: {res}")
- Line [1] assigns the Books to Scrape URL to the `web_url` variable.
- Line [2] attempts to connect to this website using the `requests.get()` method. An HTTP Status Code returns and saves to the `res` variable.
- Line [3] initiates an `if` statement. If the `res` variable contains the value of 200 (success), the code inside this statement executes.
  - Line [4] outputs the HTTP Status Code contained in the `res` variable to the terminal.
  - Line [5] closes the open connection.
- Lines [6-7] execute if the `res` variable returns a value other than 200 (success).
Output
<Response [200]>
Great news! The connection to the Books to Scrape website works!
💡 Note: If successful, a connection is made from the Python code to the Books to Scrape website. Remember to close a connection when not in use.
💡 Note: You may want to remove Line [4] before continuing.
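💡 Note: If you would rather not call `res.close()` yourself, one alternative (a sketch only, not the approach used in the rest of this series) is to let a `with` block close the response automatically:

```python
import requests

web_url = "https://books.toscrape.com"

# The with block closes the response when it ends, even if an error occurs inside it.
with requests.get(web_url) as res:
    if res:
        print(f"{res}")  # <Response [200]> on success
    else:
        print(f"The following error occurred: {res}")
```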
Retrieve Total Pages
Our goal in this section is to retrieve the total pages to scrape. This value is saved in our Python code to use later.
As indicated in the footer, this value is 50.

To locate the HTML code relating to this value, perform the following steps:
- Navigate to the Books to Scrape website.
- Scroll down to the footer area.
- With your mouse, hover over the text `Page 1 of 50`.
- Right-mouse click to display a pop-up menu.
- Click to select `Inspect`. This option opens the HTML code window to the right of the browser window.

The HTML code relating to the selected text now contains a highlight.

Upon reviewing the HTML, we notice that the text we need (`Page 1 of 50`) is wrapped inside an `<li>` element/tag. We can reference this specific `<li>` by passing `class_='current'` to Beautiful Soup’s `find()` method.

In the example below, we have added a few lines inside the `if` statement to retrieve and display this information Pythonically.
web_url = "https://books.toscrape.com" res = requests.get(web_url) if res: soup = BeautifulSoup(res.content, 'html.parser') total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3]) print(total_pgs) res.close() else: print(f"The following error occured: {res}")
- Line [1] initiates an `if` statement. If the `res` variable contains the value of 200 (success), the code inside this statement executes.
  - Line [2] retrieves the HTML code from the Books to Scrape website. This HTML code saves to the `soup` variable.
  - Line [3] searches inside the HTML code contained in the `soup` variable for an element/tag (in this case an `<li>`) where `class_='current'`. If found, the following occurs:
    - The text of the `<li class="current">` tag is retrieved. This tag contains the string `Page 1 of 50`.
    - All leading and trailing spaces are removed from the string using the `strip()` method.
    - The `split()` method splits the string on the space (' ') character. This results in the following list: `['Page', '1', 'of', '50']`.
    - The last element (element 3) is accessed with `[3]`.
    - The output converts to an integer and saves to `total_pgs`.
  - Line [4] outputs `total_pgs` to the terminal.
  - Line [5] closes the open connection.
Output
50
💡 Note: You may want to remove Line [4] before continuing.
💡 Note: Each website places the total number of pages in a different location. You will need to determine how to retrieve this information as required on a per-website basis.
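As one way to make this step less brittle, the sketch below (an alternative, not the code used in this series) takes the last word of the pager text instead of the hard-coded position 3, and falls back to 1 if the `<li class="current">` tag is missing:

```python
import requests
from bs4 import BeautifulSoup

web_url = "https://books.toscrape.com"
res = requests.get(web_url)

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    pager = soup.find('li', class_='current')

    # Take the last word of 'Page 1 of 50' rather than position 3,
    # and default to a single page if the pager element is absent.
    total_pgs = int(pager.text.split()[-1]) if pager else 1
    print(total_pgs)
    res.close()
else:
    print(f"The following error occurred: {res}")
```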
Summary
In this article, you learned how to:
- Review the Books to Scrape website.
- Understand HTTP Status Codes.
- Connect to the Books to Scrape website using the `requests` library.
- Locate and Retrieve Total Pages using a Web Browser and HTML code.
- Close the open connection.
What’s Next
In Part 2 of this series, you will learn to configure a URL for scraping and set a delay.
