添加链接
link管理
链接快照平台
  • 输入网页链接,自动生成快照
  • 标签化管理网页链接

// Tutorial //

How To Scrape Web Pages with Beautiful Soup and Python 3

Published on July 20, 2017 · Updated on March 20, 2019
English
How To Scrape Web Pages with Beautiful Soup and Python 3

Introduction

Beautiful Soup , an allusion to the Mock Turtle’s song found in Chapter 10 of Lewis Carroll’s Alice’s Adventures in Wonderland , is a Python library that allows for quick turnaround on web scraping projects. Currently available as Beautiful Soup 4 and compatible with both Python 2.7 and Python 3, Beautiful Soup creates a parse tree from parsed HTML and XML documents (including documents with non-closed tags or tag soup and other malformed markup).

In this tutorial, we will collect and parse a web page in order to grab textual data and write the information we have gathered to a CSV file.

Prerequisites

local or server-based Python programming environment set up on your machine.

You should have the Requests and Beautiful Soup modules installed , which you can achieve by following our tutorial “ How To Work with Web Data Using Requests and Beautiful Soup with Python 3 .” It would also be useful to have a working familiarity with these modules.

Additionally, since we will be working with data scraped from the web, you should be comfortable with HTML structure and tagging.

Understanding the Data

National Gallery of Art in the United States. The National Gallery is an art museum located on the National Mall in Washington, D.C. It holds over 120,000 pieces dated from the Renaissance to the present day done by more than 13,000 artists.

We would like to search the Index of Artists, which, at the time of updating this tutorial, is available via the Internet Archive ’s Wayback Machine at the following URL:

https://web.archive.org/web/20170131230332/https://www.nga.gov/collection/an.shtm

Note : The long URL above is due to this website having been archived by the Internet Archive.

The Internet Archive is a non-profit digital library that provides free access to internet sites and other digital media. This organization takes snapshots of websites to preserve sites’ histories, and we can currently access an older version of the National Gallery’s site that was available when this tutorial was first written. The Internet Archive is a good tool to keep in mind when doing any kind of historical data scraping, including comparing across iterations of the same site and available data.

Beneath the Internet Archive’s header, you’ll see a page that looks like this:

Since we’ll be doing this project in order to learn about web scraping with Beautiful Soup, we don’t need to pull too much data from the site, so let’s limit the scope of the artist data we are looking to scrape. Let’s therefore choose one letter — in our example we’ll choose the letter Z — and we’ll see a page that looks like this:

In the page above, we see that the first artist listed at the time of writing is Zabaglia, Niccola , which is a good thing to note for when we start pulling data. We’ll start by working with this first page, with the following URL for the letter Z :

https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/an Z 1.htm

It is important to note for later how many pages total there are for the letter you are choosing to list, which you can discover by clicking through to the last page of artists. In this case, there are 4 pages total, and the last artist listed at the time of writing is Zykmund, Václav . The last page of Z artists has the following URL:

https://web.archive.org/web/20121010201041/http://www.nga.gov/collection/an Z 4.htm

However , you can also access the above page by using the same Internet Archive numeric string of the first page:

https://web.archive.org/web/ 20121007172955 /http://www.nga.gov/collection/an Z 4.htm

This is important to note because we’ll be iterating through these pages later in this tutorial.

To begin to familiarize yourself with how this web page is set up, you can take a look at its DOM , which will help you understand how the HTML is structured. In order to inspect the DOM, you can open your browser’s Developer Tools .

Importing the Libraries

    1. Requests and Beautiful Soup.

      The Requests library allows you to make use of HTTP within your Python programs in a human readable way, and the Beautiful Soup module is designed to get web scraping done quickly.

      We will import both Requests and Beautiful Soup with the import statement. For Beautiful Soup, we’ll be importing it from bs4, the package in which Beautiful Soup 4 is found.

      nga_z_artists.py
      # Import libraries
      import requests
      from bs4 import BeautifulSoup
      

      With both the Requests and Beautiful Soup modules imported, we can move on to working to first collect a page and then parse it.

      Collecting and Parsing a Web Page

      variable page by using the method requests.get().

      nga_z_artists.py
      import requests
      from bs4 import BeautifulSoup
      # Collect first page of artists’ list
      page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')
      

      <$>[note] Note: Because the URL is lengthy, the code above and throughout this tutorial will not pass PEP 8 E501 which flags lines longer than 79 characters. You may want to assign the URL to a variable to make the code more readable in final versions. The code in this tutorial is for demonstration purposes and will allow you to swap out shorter URLs as part of your own projects.<$>

      We’ll now create a BeautifulSoup object, or a parse tree. This object takes as its arguments the page.text document from Requests (the content of the server’s response) and then parses it from Python’s built-in html.parser.

      nga_z_artists.py
      import requests
      from bs4 import BeautifulSoup
      page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')
      # Create a BeautifulSoup object
      soup = BeautifulSoup(page.text, 'html.parser')
      

      With our page collected, parsed, and set up as a BeautifulSoup object, we can move on to collecting the data that we would like.

      Pulling Text From a Web Page

      for loop in order to iterate over all the artist names that we just put into the artist_name_list_items variable.

      We’ll print these names out with the prettify() method in order to turn the Beautiful Soup parse tree into a nicely formatted Unicode string.

      nga_z_artists.py
      ...
      artist_name_list = soup.find(class_='BodyText')
      artist_name_list_items = artist_name_list.find_all('a')
      # Create for loop to print out all artists' names
      for artist_name in artist_name_list_items:
          print(artist_name.prettify())
      

      Let’s run the program as we have it so far:

      1. python nga_z_artists.py

        Once we do so, we’ll receive the following output:

        Output
        <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630"> Zabaglia, Niccola <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3427"> Zao Wou-Ki <a href="/web/20121007172955/https://www.nga.gov/collection/anZ2.htm"> Zas-Zie <a href="/web/20121007172955/https://www.nga.gov/collection/anZ3.htm"> Zie-Zor <a href="/web/20121007172955/https://www.nga.gov/collection/anZ4.htm"> <strong> </strong>

        What we see in the output at this point is the full text and tags related to all of the artists’ names within the <a> tags found in the <div class="BodyText"> tag on the first page, as well as some additional link text at the bottom. Since we don’t want this extra information, let’s work on removing this in the next section.

        Removing Superfluous Data

        Pulling the Contents from a Tag

        list data type.

        Let’s revise the for loop so that instead of printing the entire link and its tag, we’ll print the list of children (i.e. the artists’ full names):

        nga_z_artists.py
        import requests
        from bs4 import BeautifulSoup
        page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')
        soup = BeautifulSoup(page.text, 'html.parser')
        last_links = soup.find(class_='AlphaNav')
        last_links.decompose()
        artist_name_list = soup.find(class_='BodyText')
        artist_name_list_items = artist_name_list.find_all('a')
        # Use .contents to pull out the <a> tag’s children
        for artist_name in artist_name_list_items:
            names = artist_name.contents[0]
            print(names)
        

        Note that we are iterating over the list above by calling on the index number of each item.

        We can run the program with the python command to view the following output:

        Output
        Zabaglia, Niccola Zaccone, Fabian Zadkine, Ossip Zanini-Viola, Giuseppe Zanotti, Giampietro Zao Wou-Ki

        We have received back a list of all the artists’ names available on the first page of the letter Z.

        However, what if we want to also capture the URLs associated with those artists? We can extract URLs found within a page’s <a> tags by using Beautiful Soup’s get('href') method.

        From the output of the links above, we know that the entire URL is not being captured, so we will concatenate the link string with the front of the URL string (in this case https://web.archive.org/).

        These lines we’ll also add to the for loop:

        nga_z_artists.py
        ...
        for artist_name in artist_name_list_items:
            names = artist_name.contents[0]
            links = 'https://web.archive.org' + artist_name.get('href')
            print(names)
            print(links)
        

        When we run the program above, we’ll receive both the artists’ names and the URLs to the links that tell us more about the artists:

        Output
        Zabaglia, Niccola https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630 Zaccone, Fabian https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202 Zanotti, Giampietro https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11631 Zao Wou-Ki https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3427

        Although we are now getting information from the website, it is currently just printing to our terminal window. Let’s instead capture this data so that we can use it elsewhere by writing it to a file.

        Writing the Data to a CSV File

        how to handle plain text files in Python.

        First, we need to import Python’s built-in csv module along with the other modules at the top of the Python programming file:

        import csv
        

        Next, we’ll create and open a file called z-artist-names.csv for us to write to (we’ll use the variable f for file here) by using the 'w' mode. We’ll also write the top row headings: Name and Link which we’ll pass to the writerow() method as a list:

        f = csv.writer(open('z-artist-names.csv', 'w'))
        f.writerow(['Name', 'Link'])
        

        Finally, within our for loop, we’ll write each row with the artists’ names and their associated links:

        f.writerow([names, links])
        

        You can see the lines for each of these tasks in the file below:

        nga_z_artists.py
        import requests
        import csv
        from bs4 import BeautifulSoup
        page = requests.get('https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ1.htm')
        soup = BeautifulSoup(page.text, 'html.parser')
        last_links = soup.find(class_='AlphaNav')
        last_links.decompose()
        # Create a file to write to, add headers row
        f = csv.writer(open('z-artist-names.csv', 'w'))
        f.writerow(['Name', 'Link'])
        artist_name_list = soup.find(class_='BodyText')
        artist_name_list_items = artist_name_list.find_all('a')
        for artist_name in artist_name_list_items:
            names = artist_name.contents[0]
            links = 'https://web.archive.org' + artist_name.get('href')
            # Add each artist’s name and associated link to a row
            f.writerow([names, links])
        

        When you run the program now with the python command, no output will be returned to your terminal window. Instead, a file will be created in the directory you are working in called z-artist-names.csv.

        Depending on what you use to open it, it may look something like this:

        z-artist-names.csv
        Name,Link
        "Zabaglia, Niccola",https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=11630
        "Zaccone, Fabian",https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=34202
        "Zadkine, Ossip",https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=3475w
        

        Or, it may look more like a spreadsheet:

        In either case, you can now use this file to work with the data in more meaningful ways since the information you have collected is now stored in your computer’s memory.

        Earlier in this tutorial, we noted that we should pay attention to the total number of pages there are that contain artists’ names starting with the letter Z (or whatever letter we’re using). Since there are 4 pages for the letter Z, we constructed the for loop above with a range of 1 to 5 so that it will iterate through each of the 4 pages.

        For this specific web site, the URLs begin with the string https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ and then are followed with a number for the page (which will be the integer i from the for loop that we convert to a string) and end with .htm. We will concatenate these strings together and then append the result to the pages list.

        In addition to this loop, we’ll have a second loop that will go through each of the pages above. The code in this for loop will look similar to the code we have created so far, as it is doing the task we completed for the first page of the letter Z artists for each of the 4 pages total. Note that because we have put the original program into the second for loop, we now have the original loop as a nested for loop contained in it.

        The two for loops will look like this:

        pages = []
        for i in range(1, 5):
            url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ' + str(i) + '.htm'
            pages.append(url)
        for item in
        
        
        
        
            
         pages:
            page = requests.get(item)
            soup = BeautifulSoup(page.text, 'html.parser')
            last_links = soup.find(class_='AlphaNav')
            last_links.decompose()
            artist_name_list = soup.find(class_='BodyText')
            artist_name_list_items = artist_name_list.find_all('a')
            for artist_name in artist_name_list_items:
                names = artist_name.contents[0]
                links = 'https://web.archive.org' + artist_name.get('href')
                f.writerow([names, links])
        

        In the code above, you should see that the first for loop is iterating over the pages and the second for loop is scraping data from each of those pages and then is adding the artists’ names and links line by line through each row of each page.

        These two for loops come below the import statements, the CSV file creation and writer (with the line for writing the headers of the file), and the initialization of the pages variable (assigned to a list).

        Within the greater context of the programming file, the complete code looks like this:

        nga_z_artists.py
        import requests
        import csv
        from bs4 import BeautifulSoup
        f = csv.writer(open('z-artist-names.csv', 'w'))
        f.writerow(['Name', 'Link'])
        pages = []
        for i in range(1, 5):
            url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ' + str(i) + '.htm'
            pages.append(url)
        for item in pages:
            page = requests.get(item)
            soup = BeautifulSoup(page.text, 'html.parser')
            last_links = soup.find(class_='AlphaNav')
            last_links.decompose()
            artist_name_list = soup.find(class_='BodyText')
            artist_name_list_items = artist_name_list.find_all('a')
            for artist_name in artist_name_list_items:
                names = artist_name.contents[0]
                links = 'https://web.archive.org' + artist_name.get('href')
                f.writerow([names, links])
        

        Since this program is doing a bit of work, it will take a little while to create the CSV file. Once it is done, the output will be complete, showing the artists’ names and their associated links from Zabaglia, Niccola to Zykmund, Václav.

        Being Considerate

        Conclusion

        How To Crawl A Web Page with Scrapy and Python 3.”

Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.

Learn more about us


About the authors
Default avatar
Lisa Tagliaferri

author

Still looking for an answer?

Ask a question Search for more help

Was this helpful?
10 Comments


This textbox defaults to using Markdown to format your answer.

You can type !ref in this text area to quickly search our full set of tutorials, documentation & marketplace offerings and insert the link!

I only signed up to leave this comment.

Thank you so much, very good, on-point tutorial. This is really one of the best sources on the internet on this topic and I believe it will be more than helpful to people like me who just started web scraping with python.

Great tutorial, it’s very useful thank you!

Thanks for this terrific tutorial, Lisa! The part about iterating over a series of result pages was especially helpful to me.

finestjava July 22, 2017

The first part of the tutorial was fine - but then it falls apart. This is a very common problem - The author knows the material so well they forget we have never seen most of the topic or task. So just getting the Z names and printing them to the terminal and CSV files worked just fine.

When it came to adding all the pages and writing the code to the CSV files - I started getting the dreaded Unicode encoding errors. I have worked with beautiful soup before and I really liked how you started out.

File “nga_z_artists.py”, line 29, in <module> f.writerow([names, links]) UnicodeEncodeError: ‘ascii’ codec can’t encode character u’\xe4’ in position 16: ordinal not in range(128)

File “nga_z_artists.py”, line 3 f.writerow([names, links])nd associated link to a row’)m’) SyntaxError: invalid syntax

Thanks for the tutorial! I’ve been using the requests and BeautifulSoup libraries for a little while, but always struggle matching the regex. Previously I hadn’t found much good documentation out there, and I always spend hours doing trial and error. This is pretty detailed and helpful. Thank you. I was able to populate this Python/Django website SeekingBeer using these libraries, but I spent forever messing around with the code.

Again, very amazing tutorial. Thank you.

I ran this tutorial on an iPad with Pythonista. The data prints to the console just fine, however the resulting CSV is blank.

Hello the tutorial is best. But how can I run my scrapping script 24*7 on Digital Ocean ? Any tutorial ?

``I have a small doubt? while grabbing the details more than 10 pages i got an error of Traceback (most recent call last):

File “<ipython-input-3-3844d097c07c>”, line 1, in <module> runfile(‘C:/Users/user/Test/BS4/test_webscraping_2.py’, wdir=‘C:/Users/user/Test/BS4’)

File “C:\Users\user\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py”, line 668, in runfile execfile(filename, namespace)

File “C:\Users\user\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py”, line 108, in execfile exec(compile(f.read(), filename, ‘exec’), namespace)

File “C:/Users/user/Test/BS4/test_webscraping_2.py”, line 35, in <module> actual_price = act_prices[0].text

IndexError: list index out of range

this is my from bs4 import BeautifulSoup as BS import requests import csv

pages = []

file = ‘LedTv_List.csv’ f = csv.writer(open(file,‘w’)) f.writerow([‘Brand Names’])

for i in range(1, 5): url = ‘ https://www.flipkart.com/search?q=led+tv&sid=ckf%2Cczl&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_0_3&otracker1=AS_QueryStore_OrganicAutoSuggest_0_3&as-pos=0&as-type=RECENT&as-searchtext=led+&page= ’ + str(i) pages.append(url)

for page in pages: web = requests.get(page) #web = requests.get(“ https://www.flipkart.com/search?q=mobile&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&“+i+” ”) soup = BS(web.content,“html.parser”) item = soup.findAll(“div”,{“class”:“_1-2Iqu row”}) #items = item[0] for items in item: brand = items.find_all(“div”,{“class”:“_3wU53n”}) brand_name = brand[0].text #print (brand_name) act_prices = items.find_all(“div”,{“class”:“_3auQ3N _2GcJzG”}) actual_price = act_prices[0].text offers = items.find_all(“div”,{“class”:“VGWI6T”}) discount = offers[0].text prices = items.find_all(“div”,{“class”:“_1vC4OE _2rQ-NK”}) price = prices[0].text

    #print (actual_price)
    #print (discount)
    #print (price)
    f.writerow([brand_name])