Scraping a Bookstore – Part 3 – Finxter



Story: This series of articles assumes you work in the IT Department of Mason Books. The Bossman asks you to scrape the website of a competitor. He would like this information to gain insight into his inventory and pricing structures.

💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS and have reviewed our articles on How to Scrape HTML tables.


Part 1 focused on:

  • Reviewing the website to scrape.
  • Understanding HTTP Status Codes.
  • Connecting to the Books to Scrape website using the requests library.
  • Retrieving Total Pages to Scrape.
  • Closing the Open Connection.

Part 2 focused on:

  • Configuring a page URL for scraping
  • Setting a delay: time.sleep() to pause between page scrapes.
  • Looping through two (2) pages for testing purposes.

Part 3 focuses on:

  • Locating Book details.
  • Writing code to retrieve this information for all Books.
  • Saving Book details to a list.

💡 Note: This article assumes you have completed the steps in Part 1 and Part 2.


Getting Started

Remember to add the Required Starter Code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

Required Starter Code

import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

Before any data manipulation can occur, three (3) new libraries will require installation.

  • The pandas library provides the DataFrame, a structure for reading, writing, and manipulating tabular data.
  • The requests library sends HTTP requests from Python.
  • The Beautiful Soup library enables data extraction from HTML and XML files.

💡 Note: The time library is built-in and does not require installation.
This library contains time.sleep() which is used to set a delay between page scrapes. This code is below.

To install these libraries, navigate to an IDE terminal and execute each command below at the command prompt. In this example, the prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install requests

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install beautifulsoup4

Hit the <Enter> key on the keyboard to start the installation process.
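
To confirm the installation worked before moving on, a quick check like the one below can help. This is only a minimal sketch: the helper name missing_packages is hypothetical, and it relies on the fact that the beautifulsoup4 package is imported under the name bs4.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Import names for the three libraries installed above.
# Note: beautifulsoup4 is imported as "bs4".
required = ["pandas", "requests", "bs4"]

missing = missing_packages(required)
if missing:
    print(f"Not installed: {missing}")
else:
    print("All required libraries are installed.")
```

If any name is reported as not installed, re-run the matching pip install command in the same environment your IDE uses.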


Overview

Each Book on the top-level pages of the site contains a:

  • Thumbnail image.
  • Book Title hyperlink.
  • Price.
  • In stock reference.
  • Add to basket button.

In this section, we will scrape these top-level pages.

💡 Note: The Finxter Challenge is to write additional code to scrape each Book’s sub-page.


Locate Book Details

Navigating through the site shows us that the setup for each Book is identical across all pages.

To view the HTML code associated with each Book, perform the following steps:

  • Open a browser and navigate to the Books to Scrape website.
  • With the mouse, hover over any thumbnail.
  • Right-click to display a pop-up menu.
  • Click to select the Inspect menu item. This option opens the HTML code window to the right of the browser window.

Upon reviewing the HTML code, we notice that the highlighted <img> tag is wrapped inside <article class="product_pod"></article> tags.

Let’s confirm this by using our mouse to hover over the <article class="product_pod"> tag in the HTML code.

If correct, the selected Book section on the left turns blue.

Great! We can work with this!
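
Before touching the live site, the same idea can be tried offline. The sketch below is illustrative only: sample_html is hypothetical, trimmed-down markup modelled on the structure seen in the Inspect window, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two Book entries, each wrapped in an <article> tag.
sample_html = """
<ol class="row">
  <li><article class="product_pod">
    <a href="book-one_1/index.html"><img src="../media/one.jpg" alt="Book One"></a>
  </article></li>
  <li><article class="product_pod">
    <a href="book-two_2/index.html"><img src="../media/two.jpg" alt="Book Two"></a>
  </article></li>
</ol>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Each <article> tag is the container for exactly one Book.
articles = soup.find_all("article")
titles = [article.find("img")["alt"] for article in articles]

print(len(articles))
print(titles)
```

Because every Book sits inside its own <article> tag, one find_all() call returns one container per Book, and each detail can then be looked up relative to that container.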


Let’s move back to an IDE and write some Python Code!

💡 Note: The code below has been brought forward from Part 2. The new lines are marked with a trailing # [n] comment, which matches the Line [n] references in the list that follows the code.

web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1
all_books = []  # [1]

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= 2:  # total_pgs:
        pg_url = f"{web_url}/catalogue/page-{cur_page}.html"
        print(f"Scraping: {pg_url}")

        # Request and parse the current page so that each pass
        # scrapes pg_url rather than the home page again.
        pg_res = requests.get(pg_url)
        soup = BeautifulSoup(pg_res.content, 'html.parser')
        pg_res.close()

        all_articles = soup.find_all('article')  # [2]
        for article in all_articles:  # [3]
            b_href  = article.find('a')['href']  # [4]
            b_src   = article.find('img')['src']  # [5]
            b_title = article.find('img')['alt']  # [6]
            b_rtg   = article.find("p", class_="star-rating").attrs.get("class")[1]  # [7]
            b_price = article.find('p', class_='price_color').text  # [8]
            all_books.append([b_href, b_src, b_title, b_rtg, b_price])  # [9]
        cur_page += 1
        time.sleep(2)
    res.close()
else:
    print(f"The following error occurred: {res}")

print(all_books)  # [10]

  • Line [1] declares the list variable all_books.
  • Line [2] locates all <article> tags on the current web page. This output saves to all_articles.
  • Line [3] initiates a for loop to traverse through each <article></article> tag on the current page.
    • Line [4] retrieves and saves the href value to the b_href variable.
    • Line [5] retrieves and saves the image source to the b_src variable.
    • Line [6] retrieves and saves the title to the b_title variable.
    • Line [7] retrieves and saves the rating to the b_rtg variable.
    • Line [8] retrieves and saves the price to the b_price variable.
    • Line [9] appends this information to the all_books list created earlier.
  • Line [10] outputs the contents of all_books to the terminal.

Output (Snippet)

The href and src values are relative to the catalogue page they were scraped from.

[['a-light-in-the-attic_1000/index.html', '../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg', 'A Light in the Attic', 'Three', '£51.77'], ['tipping-the-velvet_999/index.html', '../media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg', 'Tipping the Velvet', 'One', '£53.74'], .....]
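
The rating lookup is the least obvious of the five: the rating is not text on the page but the second class name on a <p> tag. The sketch below runs the same five lookups against a single hypothetical article (sample_html is made-up markup for illustration, not copied from the site) so each step can be inspected offline.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on a single Book entry.
sample_html = """
<article class="product_pod">
  <div class="image_container">
    <a href="sample-book_1/index.html">
      <img src="../media/sample.jpg" alt="Sample Book" class="thumbnail">
    </a>
  </div>
  <p class="star-rating Three"></p>
  <h3><a href="sample-book_1/index.html" title="Sample Book">Sample Book</a></h3>
  <div class="product_price">
    <p class="price_color">£10.00</p>
  </div>
</article>
"""

article = BeautifulSoup(sample_html, "html.parser").find("article")

# Beautiful Soup returns the class attribute as a list of class names,
# so index 1 holds the rating word that follows "star-rating".
rating_tag = article.find("p", class_="star-rating")
rating_classes = rating_tag.attrs.get("class")

book = [
    article.find("a")["href"],                     # first link in the article
    article.find("img")["src"],                    # thumbnail path
    article.find("img")["alt"],                    # title stored as alt text
    rating_classes[1],                             # rating word
    article.find("p", class_="price_color").text,  # price as displayed
]

print(rating_classes)
print(book)
```

Note that find() returns only the first match, which is why article.find('a') yields the thumbnail link rather than the title link; in this markup both point to the same href.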

💡 Note: You may want to remove Line [10] before continuing.
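
The href and src values saved in all_books are relative paths, so they are only usable together with the URL of the page they came from. One possible way to turn them into full URLs is urljoin() from the standard library. This is a sketch, not part of the series code: the helper name absolute_urls and the choice of page URL are assumptions for illustration.

```python
from urllib.parse import urljoin


def absolute_urls(page_url, book):
    """Return a copy of a book row with href and src resolved against page_url."""
    b_href, b_src, b_title, b_rtg, b_price = book
    return [urljoin(page_url, b_href), urljoin(page_url, b_src), b_title, b_rtg, b_price]


# Assumed example: a row scraped from the first catalogue page.
page_url = "https://books.toscrape.com/catalogue/page-1.html"
row = [
    "a-light-in-the-attic_1000/index.html",
    "../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg",
    "A Light in the Attic",
    "Three",
    "£51.77",
]

book_url, img_url, *_ = absolute_urls(page_url, row)
print(book_url)
print(img_url)
```

urljoin() resolves "../" segments the same way a browser does, so the approach works whether a row was scraped from the home page or from a catalogue page, as long as the matching page URL is passed in.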


Summary

In this article, you learned how to:

  • Locate Book details.
  • Write code to retrieve this information.
  • Save Book details to a List.

What’s Next

In Part 4 of this series, we will clean up the code and save the results to a Database.


