Scraping a Bookstore – Part 2 – Finxter

Story: This series of articles assumes you work in the IT Department of Mason Books. The Bossman asks you to scrape the website of a competitor. He wants this information to gain insight into his own inventory and pricing structure.

💡 Note: Before continuing, we recommend you possess, at minimum, a basic knowledge of HTML and CSS and have reviewed our articles on How to Scrape HTML tables.


Part 1 focused on:

  • Reviewing the website to scrape.
  • Understanding HTTP Status Codes.
  • Connecting to the Books to Scrape website using the requests library.
  • Retrieving Total Pages to Scrape
  • Closing the Open Connection.

Part 2 focuses on:

  • Configuring a page URL for scraping
  • Setting a delay: time.sleep() to pause between page scrapes.
  • Looping through two (2) pages for testing purposes.

💡 Note: This article assumes you have completed the steps in Part 1.


Getting Started

Remember to add the Required Starter Code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

Required Starter Code

import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

Before any data manipulation can occur, three (3) new libraries will require installation.

  • The pandas library enables access to/from a DataFrame.
  • The requests library provides access to the HTTP requests in Python.
  • The Beautiful Soup library enables data extraction from HTML and XML files.

💡 Note: The time library is built-in and does not require installation. However, be respectful to the website you are scraping. You can do this by pausing between page scrapes. This code is added below.
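As a quick illustration, time.sleep() blocks execution for the given number of seconds, which is how the pause between page scrapes works later in this article:

```python
import time

start = time.monotonic()
time.sleep(2)  # pause for two (2) seconds, as between page scrapes
elapsed = time.monotonic() - start
print(f"Paused for roughly {elapsed:.1f} seconds")
```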

To install these libraries, navigate to an IDE terminal and execute the commands below. The command prompt in this example is a dollar sign ($); your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install requests

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install beautifulsoup4

Hit the <Enter> key on the keyboard to start the installation process.
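To confirm the installations succeeded, a quick sanity check is to import each library and print its version; an ImportError here means the corresponding install did not complete:

```python
import pandas
import requests
import bs4

# A successful import confirms each library is installed.
print(pandas.__version__)
print(requests.__version__)
print(bs4.__version__)
```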


Configure Page URL

The next step is to determine how to properly navigate from page to page while performing the scrape operation.

When you first navigate to the Books to Scrape site, the URL in the address bar is the following:

https://books.toscrape.com/index.html

Let’s see what happens when we click the next button in the footer area.

We are forwarded to page 2 of the website and the URL format in the address bar changes to the following:

https://books.toscrape.com/catalogue/page-2.html

Now, let’s navigate to the footer area and click the previous button.

We are forwarded to page 1 of the website and the URL format in the address bar changes to:

https://books.toscrape.com/catalogue/page-1.html

Notice how the original URL format has changed.

The following is appended to the original URL:

  • a sub-directory: /catalogue/
  • a file name page-x.html, where x is the number of the page you are currently on.

💡 Note: Click next and previous in the footer area to confirm this.

We can work with this!
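Based on this pattern, any page URL can be assembled from the base URL, the /catalogue/ sub-directory, and the page number. A minimal sketch (the helper name page_url is our own, not part of the article's final code):

```python
web_url = "https://books.toscrape.com"

def page_url(page_num):
    # Combine the base URL, the /catalogue/ sub-directory,
    # and the page number into a full page URL.
    return f"{web_url}/catalogue/page-{page_num}.html"

print(page_url(1))  # https://books.toscrape.com/catalogue/page-1.html
print(page_url(2))  # https://books.toscrape.com/catalogue/page-2.html
```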

Let’s move to an IDE and write some Python code using this URL configuration to loop through all pages of the website.

💡 Note: Most of the code below has been brought forward from Part 1. The new additions are the cur_page variable, the While Loop, and the time.sleep() call.

web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= total_pgs:
        pg_url = f"{web_url}/catalogue/page-{cur_page}.html"
        print(f"Scraping: {pg_url}")
        cur_page += 1
        time.sleep(2)
    res.close()
else:
    print(f"The following error occurred: {res}")
  • Line [1] creates a new variable cur_page to keep track of the page we are currently on. This variable is initially set to a value of one (1).
  • Line [2] initiates a While Loop, which repeats while cur_page is less than or equal to total_pgs.
    • Line [3] creates a new variable pg_url by combining the variable web_url with the cur_page variable.
    • Line [4] outputs the value of the pg_url to the terminal for each loop.
    • Line [5] increases the value of cur_page by one (1).
    • Line [6] pauses the code for two (2) seconds between pages using time.sleep().
  • Line [7] closes the open connection.
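For reference, the total_pgs line carried over from Part 1 works by splitting the text of the site's <li class="current"> element, which reads "Page 1 of 50". A minimal sketch of the same split logic, using a hard-coded sample string instead of a live request:

```python
# Sample text as it appears in the <li class="current"> element.
current_text = "Page 1 of 50"

# Splitting on spaces yields ['Page', '1', 'of', '50'];
# index [3] holds the total page count.
total_pgs = int(current_text.strip().split(' ')[3])
print(total_pgs)  # 50
```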

Before running this code, we recommend you do not loop through all 50 pages of the website. Instead, let’s change the While Loop to the following:

web_url = "https://books.toscrape.com"
res = requests.get(web_url)
cur_page = 1

if res:
    soup = BeautifulSoup(res.content, 'html.parser')
    total_pgs = int(soup.find('li', class_='current').text.strip().split(' ')[3])

    while cur_page <= 2:   #total_pgs:
        pg_url = f"{web_url}/catalogue/page-{cur_page}.html"
        print(f"Scraping: {pg_url}")
        cur_page += 1
        time.sleep(2)
    res.close()
else:
    print(f"The following error occurred: {res}")

💡 Note: To comment out code in Python, use the # character. This prevents everything else on the line from executing.

The While Loop has been modified. The code now executes only twice, as depicted by the output below:

Output

Scraping: https://books.toscrape.com/catalogue/page-1.html
Scraping: https://books.toscrape.com/catalogue/page-2.html

Summary

In this article, you learned how to:

  • Configure a page URL for scraping
  • Set a delay: time.sleep() to pause between page scrapes.
  • Loop through two (2) pages for testing purposes.

What’s Next

In Part 3 of this series, you will learn to identify additional elements/tags inside HTML code.


