Story: This series of articles assumes you work in the IT Department of Mason Books. The Bossman asks you to scrape the website of a competitor. He would like this information to gain insight into his inventory and pricing structures.
💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS and have reviewed our articles on How to Scrape HTML tables.
Part 1 focused on:
- Reviewing the website to scrape.
- Understanding HTTP Status Codes.
- Connecting to the Books to Scrape website using the requests library.
- Retrieving Total Pages to Scrape.
- Closing the Open Connection.
Part 2 focused on:
- Configuring a page URL for scraping.
- Setting a delay: time.sleep() to pause between page scrapes.
- Looping through two (2) pages for testing purposes.
Part 3 focuses on:
- Locating Book details.
- Writing code to retrieve this information for all Books.
- Saving Book details to a list.
💡 Note: This article assumes you have completed the steps in Part 1 and Part 2.
Getting Started
Required Starter Code
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
Before any data manipulation can occur, three (3) new libraries will require installation.
- The pandas library enables access to/from a DataFrame.
- The requests library lets Python send HTTP requests.
- The Beautiful Soup library enables data extraction from HTML and XML files.
💡 Note: The time library is built-in and does not require installation. This library contains time.sleep(), which is used to set a delay between page scrapes. It appears in the code later in this article.
To install these libraries, navigate to an IDE terminal. At the command prompt ($), execute the commands below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.
$ pip install pandas
Hit the <Enter> key on the keyboard to start the installation process.
$ pip install requests
Hit the <Enter> key on the keyboard to start the installation process.
$ pip install beautifulsoup4
Hit the <Enter> key on the keyboard to start the installation process.
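Once the installations finish, a quick check can confirm that Python is able to locate each library. The sketch below is one possible way to do this with the built-in importlib module; the helper name missing_modules and the REQUIRED list are illustrative assumptions, not part of the series code.

```python
from importlib.util import find_spec

# Import names for the libraries used in this article.
# Note: beautifulsoup4 is installed with pip but imported as "bs4".
REQUIRED = ["pandas", "requests", "bs4", "time"]


def missing_modules(names):
    """Return the names from `names` that Python cannot locate."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print(f"Not installed: {', '.join(missing)}")
    else:
        print("All libraries are available.")
```

If a name is reported as not installed, re-run the matching pip command above before continuing.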
Overview
Each Book on the top-level pages of the site contains a:
- Thumbnail image.
- Book Title hyperlink.
- Price.
- In stock reference.
- Add to basket button.
In this section, we will scrape these top-level pages.
💡 Note: The Finxter Challenge is to write additional code to scrape each Book’s sub-page.
Locate Book Details
Navigating through the site shows us that the setup for each Book is identical across all pages.

To view the HTML code associated with each Book, perform the following steps:
- Open a browser and navigate to the Books to Scrape website.
- With the mouse, hover over any thumbnail.
- Right-click to display a pop-up menu.
- Click to select the Inspect menu item. This option opens the HTML code window to the right of the browser window.

Upon reviewing the HTML code, we notice that the highlighted <img> tag is wrapped inside <article class="product_pod"></article> tags.

Let’s confirm this by using our mouse to hover over the <article class="product_pod"> tag in the HTML code. If correct, the selected Book section on the left turns blue.

Great! We can work with this!
Let’s move back to an IDE and write some Python Code!
💡 Note: The code below has been brought forward from Part 2. The new lines are marked with the comments [1] through [10].
web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1
all_books = []  # [1]

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= 2:  # total_pgs:
        pg_url = f"{web_url}/catalogue/page-{str(cur_page)}.html"
        print(f"Scraping: {pg_url}")

        # soup was parsed from web_url above; pg_url is printed but not requested here
        all_articles = soup.find_all('article')  # [2]
        for article in all_articles:  # [3]
            b_href = article.find('a')['href']  # [4]
            b_src = article.find('img')['src']  # [5]
            b_title = article.find('img')['alt']  # [6]
            b_rtg = article.find("p", class_="star-rating").attrs.get("class")[1]  # [7]
            b_price = article.find('p', class_='price_color').text  # [8]
            all_books.append([b_href, b_src, b_title, b_rtg, b_price])  # [9]
        cur_page += 1
        time.sleep(2)
    res.close()
else:
    print(f"The following error occurred: {res}")
print(all_books)  # [10]
- Line [1] declares the list variable all_books.
- Line [2] locates all <article> tags on the current web page. This output saves to all_articles.
- Line [3] initiates a for loop to traverse through each <article></article> tag on the current page.
  - Line [4] retrieves and saves the href value to the b_href variable.
  - Line [5] retrieves and saves the image source to the b_src variable.
  - Line [6] retrieves and saves the title to the b_title variable.
  - Line [7] retrieves and saves the rating to the b_rtg variable.
  - Line [8] retrieves and saves the price to the b_price variable.
  - Line [9] appends this information to the all_books list created earlier.
- Line [10] outputs the contents of all_books to the terminal.
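The extraction steps above can also be tried offline, without contacting the website. The sketch below wraps the same find() calls in a function and runs them against a small hand-written HTML string. The function name parse_books and the sample markup are assumptions made for illustration; the markup only imitates the tags and class names the code looks for.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not copied from the site).
SAMPLE_HTML = """
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/sample-book_1/index.html">
      <img src="media/cache/sample.jpg" alt="Sample Book" class="thumbnail">
    </a>
  </div>
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/sample-book_1/index.html" title="Sample Book">Sample Book</a></h3>
  <div class="product_price">
    <p class="price_color">£10.00</p>
  </div>
</article>
"""


def parse_books(html):
    """Return one [href, src, title, rating, price] list per <article> tag."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for article in soup.find_all("article"):
        b_href = article.find("a")["href"]
        b_src = article.find("img")["src"]
        b_title = article.find("img")["alt"]
        # class is multi-valued, e.g. ['star-rating', 'Three']; index 1 is the rating word
        b_rtg = article.find("p", class_="star-rating").attrs.get("class")[1]
        b_price = article.find("p", class_="price_color").text
        books.append([b_href, b_src, b_title, b_rtg, b_price])
    return books


print(parse_books(SAMPLE_HTML))
```

Because parse_books() takes an HTML string, it could be called once for each page downloaded inside the while loop, should you want each pass to parse that page's own content.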
Output (Snippet)
[['catalogue/a-light-in-the-attic_1000/index.html', 'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg', 'A Light in the Attic', 'Three', '£51.77'], ['catalogue/tipping-the-velvet_999/index.html', 'media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg', 'Tipping the Velvet', 'One', '£53.74'], .....]
💡 Note: You may want to remove Line [10] before continuing.
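Each row in the output holds relative paths and plain text values. As a minimal sketch of how one row might be tidied up later, the code below builds full URLs with urljoin() and converts the rating and price to numbers. The helper clean_row, the RATINGS mapping (which assumes ratings run from One to Five) and the base URL with a trailing slash are assumptions for illustration, not the approach taken in Part 4.

```python
from decimal import Decimal
from urllib.parse import urljoin

# Assumption: the rows came from the landing page, so relative paths resolve against it.
BASE_URL = "https://books.toscrape.com/"
# Assumption: the rating words run from One to Five.
RATINGS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def clean_row(row, base_url=BASE_URL):
    """Convert one [href, src, title, rating, price] row into a dict."""
    href, src, title, rating, price = row
    return {
        "url": urljoin(base_url, href),
        "image": urljoin(base_url, src),
        "title": title,
        "rating": RATINGS[rating],
        "price": Decimal(price.lstrip("£")),
    }


row = ['catalogue/a-light-in-the-attic_1000/index.html',
       'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
       'A Light in the Attic', 'Three', '£51.77']
book = clean_row(row)
print(book["url"])
print(book["rating"], book["price"])
```

Decimal is used here rather than float so that prices keep their exact two-digit value.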
Summary
In this article, you learned how to:
- Locate Book details.
- Write code to retrieve this information.
- Save Book details to a List.
What’s Next
In Part 4 of this series, we will clean up the code and save the results to a Database.
