Selenium + Beautiful Soup
Christopher G. Healey

Introduction

A common need is web scraping, the ability to copy content from an online web page. In data analytics, this is often used to extract data from a web page into a format amenable to follow-on analysis in languages like Python, R, SAS, or SQL.

At its inception, the web was a simple combination of the Hypertext Markup Language (HTML) to describe the content of a page, the Hypertext Transfer Protocol (HTTP) to transfer HTML documents over the Internet, and web browsers like Mosaic to convert HTML into a rendered presentation. HTML supported document structuring: paragraphs, tables, lists, and so on; and text styling: boldface, italics, and other types of visual modification of text.

Modern web pages are very different from their ancestors. Now, pages commonly contain complex styling and programming that a web browser must interpret and execute prior to displaying a page. Two common examples are Cascading Style Sheets (CSS) to control the appearance of a page and JavaScript programs to control its functionality.

In order to web scrape in this new reality, two steps are needed. First, the web page must be read and interpreted into its final format. Selenium is used to do this, since it has the ability to mimic a web browser by reading raw HTML, then performing the execution necessary to convert the HTML into its final format. Although Selenium is designed to perform web page testing, it can also deliver the HTML for a fully rendered page. Once that HTML is available, it needs to be read and parsed. In Python, we use Beautiful Soup to do this.

Even with Selenium and Beautiful Soup, web scraping is non-trivial. For example, web pages often have interactive controls that need to be invoked in a specific order to arrive at the page of interest. Selenium is fully capable of doing this, but the raw HTML must be examined to determine how to uniquely identify the web page controls to manipulate. This information is needed by Selenium to locate and modify widgets on the web page. Once the target page is scraped, the HTML must again be examined to determine how to tell Beautiful Soup what we want to scrape. Well-written HTML will have easy-to-locate identifiers for all the main elements on a page. Poorly written HTML will not. Both can be parsed, but the effort required for poorly written pages is greater. Regardless, for both Selenium and Beautiful Soup, understanding how HTML works is a prerequisite for scraping most pages. If you need a quick introduction to HTML, refer back to our discussion of HTML5 in the plotly | Dash lessons.

HTML id & class

Extending our understanding of HTML, the most common way to identify a particular section of HTML is through its id or class, two attributes that can be attached to most HTML markup tags. As an example, consider the following simple HTML code.

<p>This is a paragraph.</p>
<p id="id-tag">This is a paragraph with an id attribute of id-tag.</p>
<p class="class-tag">This is a paragraph with a class attribute of class-tag.</p>
<p id="id-tag" class="class-tag">This is a paragraph with an id attribute of id-tag and a class attribute of class-tag.</p>

In HTML, the id attribute is used to identify a specific HTML structure. The class attribute is used to assign a pre-defined class to the structure, usually to style the structure in some common way throughout the document. Both Selenium and Beautiful Soup allow us to select HTML structures based on their id and class attributes, or combinations thereof. This is the most common way of identifying the target structure we wish to extract from a web page.
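
As a brief preview of how these attributes are used, the following is a minimal sketch using Beautiful Soup (introduced later in these notes) to select the paragraphs above by id, by class, and by tag name.

from bs4 import BeautifulSoup

html = '''
<p>This is a paragraph.</p>
<p id="id-tag">This is a paragraph with an id attribute of id-tag.</p>
<p class="class-tag">This is a paragraph with a class attribute of class-tag.</p>
'''

tree = BeautifulSoup( html, 'html.parser' )

# Find the paragraph with a given id, the paragraph with a given class,
# and every paragraph in the document
print( tree.find( 'p', id='id-tag' ).get_text() )
print( tree.find( 'p', class_='class-tag' ).get_text() )
print( len( tree.find_all( 'p' ) ) )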

Selenium

Selenium can be installed as a package in Python from the Anaconda Prompt by typing the following.

conda install -c conda-forge selenium

You may need to run the prompt in Administrator mode to allow conda to update your Anaconda installation.

Once Selenium is installed, you will also need to provide a webdriver. The webdriver allows you to programmatically drive a web page, exactly as though you were a user. You can ask Selenium to load a page, click elements on the page, fill in text fields, scroll the page, and do any of the other things a real user could do if they were viewing the page in their own web browser.

The webdriver itself is a program that runs on your computer and mimics one of the common web browsers. Currently, you can download webdrivers for Chrome, Firefox, Edge, IE, Safari, and Opera from this web page. Unless you have a need for specific browser compatibility, the driver you choose isn't particularly important, since all browsers support a nearly-identical set of operations. Once you've downloaded a webdriver for a specific browser and operating system (Windows, macOS, or Linux), you will have an executable like chromedriver (for Chrome) or geckodriver (for Firefox). This executable must be placed in a location where Selenium can find it, so it can be run when Selenium starts its processing. The simplest location is the same directory as the Python program using Selenium. The documentation on the downloads page also explains how you can add the location of Selenium webdrivers to your PATH variable, since Selenium will check PATH locations whenever a webdriver is requested.
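
A minimal sketch of starting a driver, assuming chromedriver has been downloaded and is either on your PATH or at a path you supply (the path shown in the comment is a placeholder, not a real location):

from selenium import webdriver

# If chromedriver is in the current directory or on your PATH, Selenium
# can locate it automatically
driver = webdriver.Chrome()

# In newer versions of Selenium, an explicit location can instead be
# supplied through a Service object (placeholder path):
#
#   from selenium.webdriver.chrome.service import Service
#   driver = webdriver.Chrome( service=Service( '/path/to/chromedriver' ) )

driver.get( 'https://www.ncsu.edu' )
driver.quit()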

At this point, you have everything you need to load Selenium in Python, invoke a controllable version of one of the common web browsers, then use Selenium to load a page and manipulate its contents to navigate to the location where you want to scrape data. At that point, the page's HTML can be retrieved and passed to Beautiful Soup for further processing. Given this, the high-level order of processing for web scraping with Selenium and Beautiful Soup is as follows (a minimal code skeleton appears after the list).

  1. Ask Selenium to invoke a webdriver, creating a browser instance that we can control through Selenium.
  2. Load the initial web page in the browser instance.
  3. Use Selenium to navigate from the initial page to the target page exactly as a user would, by clicking buttons, selecting from lists, entering text in text fields, and so on.
  4. Once the target page is loaded, ask Selenium to return the HTML used to represent the target page.
  5. Pass the page's HTML to Beautiful Soup to parse its contents into a searchable parse tree.
  6. Use Beautiful Soup to retrieve information from the web page's parse tree.
  7. Save information scraped from the web page into Python variables for follow-on analysis.
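
As a rough roadmap, here is a minimal sketch of this flow; the class name and tag used in steps 3 and 6 are placeholders for illustration only, and each step is expanded in detail in the remainder of these notes.

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

# 1. Invoke a webdriver, creating a controllable browser instance
driver = webdriver.Chrome()

# 2. Load the initial web page
driver.get( 'https://www.ncsu.edu' )

# 3. Navigate to the target page (placeholder class name)
driver.find_element( By.CLASS_NAME, 'some-button' ).click()

# 4. Retrieve the HTML for the fully rendered target page
html = driver.page_source

# 5. Parse the HTML into a searchable parse tree
tree = BeautifulSoup( html, 'html.parser' )

# 6. and 7. Retrieve information from the parse tree and save it
headings = [ h.get_text() for h in tree.find_all( 'h2' ) ]

driver.close()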

Exploring Web Pages

As noted above, one of the fundamental requirements for Selenium or Beautiful Soup is properly identifying the HTML structures in a web page you want to manipulate or scrape. The easiest way to do this is to load a web page into your favourite web browser, then use the developer tools every browser provides to examine the underlying HTML source code in detail. The discussion in these notes will use Chrome as an example, since it provides a robust set of examination tools. The same functionality can be performed in Firefox, Safari, or any other browser, however, using whatever commands they make available for this type of exploration.

To begin, run Chrome and load NC State's homepage at https://www.ncsu.edu. Next, click on the three vertical dots in the top-right corner of the browser window to reveal Chrome's menu. Choose More tools → Developer tools to bring up Chrome's developer tools (you can also use the keyboard shortcut Ctrl+Shift+I to do the same thing.) If this is the first time you've used the developer tools, they will appear in a dock inside the browser window to the left or right. To force the developer tools into their own, separate window, click the three vertical dots in the upper-right corner of the tools dock, and click on the Dock side option that shows two overlapping windows. This will pull the developer tools into a window separate from the browser window.

The developer tools are designed for a variety of tasks, including examining a web page's source code, debugging JavaScript code, and confirming resources for the page loaded properly over the network. Since we're interested in examining HTML source code, choose the tab labelled Elements at the top of the developer tools window. This shows an overview of the code that makes up NC State's homepage, with exposure triangles to allow you to show and hide more detailed information contained in the page. Move your mouse over the different lines in the source code list. You should see different parts of the main browser window highlight. This is showing you which parts of the web page correspond to which parts of the code you are moving over.

Example: Identifying the RESOURCES Link

Notice that if we click on the RESOURCES button at the top of the page, a panel slides down with additional options to select. If we wanted to do this with Selenium, we would need to determine how to uniquely identify the RESOURCES button. To do this, we would start moving our mouse over the source code in the Elements panel, watching to see when RESOURCES was highlighted, and continuing to descend into the code in more detail until we find the exact line of code that represents the RESOURCES button. When I load the NC State homepage and examine the source code, this is what I need to do to find the RESOURCES button.

  1. When I highlight the line that begins <div id="header-bar"… the entire navigation bar containing the magnifying glass highlights, so I click the triangle to expose the code within the div.
  2. Next, I expose the line <div class="ncstate-utility-bar-container"… since it highlights the magnifying glass.
    Hovering over code in the Developer Tools Elements tab to identify which code corresponds to the magnifying glass button in the NC State homepage header
  3. Finally, when I hover over the line <a id="ncstate-utility-bar"… I see the magnifying glass (which is now identified as a clickable link) highlight. I've now identified a clickable element with the class ncstate-utility-bar that represents the magnifying glass button. If I invoke a chromedriver in Selenium and ask it to click the element with class ncstate-utility-bar, the options panel will slide down, just as it did when I clicked it explicitly.

This shows how you can use Chrome's developer tools to walk in and out of the code to find the specific elements you want to manipulate, and what unique identifiers can be used to allow Selenium to manipulate them.

Coding Selenium

Now that we know how to identify the magnifying glass button, how would we use Selenium to automatically select it in a chromedriver? The following Python code snippet will create a chromedriver, load the NC State homepage, click the magnifying glass link, wait for 5 seconds, then terminate. Remember, this code will only run if chromedriver is available at the path passed to webdriver.Chrome, in the current working directory, or in a directory included in your PATH.

>>> import time
>>> from selenium import webdriver
>>> from selenium.webdriver.common.by import By
>>>
>>> driver = webdriver.Chrome( '/Users/healey/Downloads/chromedriver' )
>>> driver.get( 'https://www.ncsu.edu' )
>>>
>>> link = driver.find_element( By.CLASS_NAME, 'ncstate-utility-bar' )
>>> link.click()
>>>
>>> time.sleep( 5 )
>>> driver.close()

Moving further into our example, suppose I want to select the Campus Directory link at the top of the panel. Some exploring identifies its id as ncstate-utility-bar-first-link; it is also the third list item in the unordered list whose id contains dropdown-resource-link-list, which is how the code below locates it. The following code duplicates our previous operation to click the Campus Directory link.

>>> import time
>>> from selenium import webdriver
>>> from selenium.webdriver.common.by import By
>>>
>>> driver = webdriver.Chrome( '/Users/healey/Downloads/chromedriver' )
>>> driver.get( 'https://www.ncsu.edu' )
>>>
>>> link = driver.find_element( By.CLASS_NAME, 'ncstate-utility-bar' )
>>> link.click()
>>>
>>> link = driver.find_element( By.XPATH, '//ul[contains(@id, "dropdown-resource-link-list")]/li[3]' )
>>> link.click()
>>>
>>> time.sleep( 5 )
>>> driver.close()

If you run this code, it may fail with an error stating that the Campus Directory link cannot be interacted with. Why did this happen? There are actually two reasons this is not working.

First, if you examine the web page source before clicking the magnifying glass link, you will see that ncstate-utility-bar exists, even though the panel is not visible. This is because the web designers have defined it as part of the original page, and only reveal it when the magnifying glass link is clicked. You can see this by searching the web page source for ncstate-utility-bar, which will be present. Then, click on the magnifying glass link. This causes the div containing ncstate-utility-dropdown-pane to slide down, revealing the list of options the magnifying glass (search) exposes.

Unfortunately, this is only one of the problems we are encountering. You might wonder, "If the link is always available, why can't we just locate it and click it, without having to reveal the search panel?" This is because clicking the magnifying glass, in addition to exposing the panel, also makes the link available for interaction. Trying to click the link programmatically before we reveal the panel tells us the element is not yet available for interaction (an ElementNotInteractableException error). But, our code above first clicked the magnifying glass link to make the panel visible, then found and clicked the Campus Directory link. Even doing this, we may still encounter an interaction error. Why would this happen?

When the web browser executes code for complicated web pages like the NC State homepage, it takes time for the operations to complete and the web page to update. Our code is running too quickly, so it asks for a reference to the Campus Directory link before chromedriver has processed our first click and rendered the drop-down panel. This is a very common occurrence during web scraping of dynamic pages.

How can we solve this second error? An obvious way would be to create an infinite loop that locates the Campus Directory link and, if it isn't available for interaction, sleeps for a short period of time, then tries again. This is strongly discouraged, however, since it is inefficient, and it also blocks the Python interpreter from performing any actions while the sleep command runs. Selenium provides two possible methods for dealing with this issue: implicit waits and explicit waits. An implicit wait will wait a certain amount of time to locate an element before it gives up and returns an error. An explicit wait will wait a certain amount of time for a specific condition to evaluate to True based on the web page's contents before it gives up and returns an error. It is also fairly easy to write our own function to wait a set number of attempts for a target element to become available on the web page before giving up and deciding something has gone wrong.
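
As a sketch of the two built-in approaches (the ten-second timeouts and the element id below are placeholders, not values tied to the NC State page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get( 'https://www.ncsu.edu' )

# Implicit wait: every subsequent find_element call polls for up to 10
# seconds before giving up if the element cannot be located
driver.implicitly_wait( 10 )

# Explicit wait: wait up to 10 seconds for a specific condition, here that
# an element becomes clickable ('some-id' is a placeholder id)
try:
    elem = WebDriverWait( driver, 10 ).until(
        EC.element_to_be_clickable( ( By.ID, 'some-id' ) )
    )
except TimeoutException:
    print( 'Element never became clickable' )

driver.close()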

In our situation, an implicit wait will not work, since the Campus Directory link is always present whether it is visible (and clickable) or not. This means we will need an explicit wait, with a specific condition we are waiting on. Selenium's expected_conditions module provides numerous ways to wait for specific conditions, including waiting until an element is clickable. We will set an explicit wait of up to five seconds on the Campus Directory link, waiting until it becomes clickable.

>>> import time
>>> import sys
>>>
>>> from selenium import webdriver
>>> from selenium.webdriver.common.by import By
>>> from selenium.webdriver.support.ui import WebDriverWait
>>> from selenium.webdriver.support import expected_conditions as EC
>>>
>>> driver = webdriver.Chrome( '/Users/healey/Downloads/chromedriver' )
>>> driver.get( 'https://www.ncsu.edu' )
>>>
>>> link = driver.find_element( By.CLASS_NAME, 'ncstate-utility-bar')
>>> link.click()
>>>
>>> try:
...     link = WebDriverWait( driver, 5 ).until(
...       EC.element_to_be_clickable(
...         ( By.XPATH, '//ul[contains(@id, "dropdown-resource-link-list")]/li[3]' )
...       )
...     )
... except:
...     print( 'mainline(), could not expose RESOURCES panel' )
...     driver.close()
...     sys.exit( 0 )
>>>
>>> link.click()
>>>
>>> time.sleep( 5 )
>>> driver.close()

Now, the second click registers as expected and we move to the Campus Directory web page, ready to search for information on campus members.

Programmatically navigating to NC State's Campus Directory page.

Querying a User

To finish our example, we will enter a last and first name into the appropriate fields on the web form, then click the Search button to retrieve information about the given individual. At this point, we will have arrived at our target page, and we are ready to extract the user's Email address from the resulting information. Fields are populated using the send_keys function, in the following way.

>>> import time
>>> import sys
>>>
>>> from selenium import webdriver
>>> from selenium.webdriver.common.by import By
>>> from selenium.webdriver.support.ui import WebDriverWait
>>> from selenium.webdriver.support import expected_conditions as EC
>>>
>>> driver = webdriver.Chrome( '/Users/healey/Downloads/chromedriver' )
>>> driver.get( 'https://www.ncsu.edu' )
>>>
>>> link = driver.find_element( By.CLASS_NAME, 'ncstate-utility-bar')
>>> link.click()
>>>
>>> try:
...     link = WebDriverWait( driver, 5 ).until(
...       EC.element_to_be_clickable(
...         ( By.XPATH, '//ul[contains(@id, "dropdown-resource-link-list")]/li[3]' )
...       )
...     )
... except:
...     print( 'mainline(), could not expose RESOURCES panel' )
...     driver.close()
...     sys.exit( 0 )
>>>
>>> link.click()
>>>
>>> try:
...     btn = WebDriverWait( driver, 5 ).until(
...       EC.element_to_be_clickable( ( By.CLASS_NAME, 'btn-primary' ) )
...     )
... except:
...     print( 'mainline(), could not find Campus Directory Search button' )
...     driver.close()
...     sys.exit( 0 )
>>>
>>> form = driver.find_element( By.ID, 'lastname' )
>>> form.send_keys( 'Healey' )
>>> form = driver.find_element( By.ID, 'firstname' )
>>> form.send_keys( 'Christopher' )
>>> btn.click()
>>>
>>> time.sleep( 5 )
>>> driver.close()
Campus Directory information for Christopher Healey

At this point, if we want to retrieve the value attached to the Email field, we have two options. First, we can do this directly in Selenium. Second, we can ask Selenium to return the HTML for the current page, parse that HTML with Beautiful Soup, then retrieve the value attached to the Email field using Beautiful Soup's parse tree.

Selenium

>>> …
>>> try:
...     div = WebDriverWait( driver, 5 ).until(
...       EC.visibility_of_element_located(
...         ( By.CLASS_NAME, 'person__right' )
...       )
...     )
... except:
...     print( 'mainline(), could not find user phone/fax/email' )
...     driver.close()
...     sys.exit( 0 )
>>>
>>> tok = div.text.split()
>>> email = 'unknown'
>>>
>>> for i,str in enumerate( tok ):
...     if 'Email' in str:
...         email = tok[ i + 1 ]
...         break
>>>
>>> print( 'Email address: ' + email )
>>>
>>> driver.close()

Beautiful Soup

>>> …
>>> from bs4 import BeautifulSoup
>>>
>>> tree = BeautifulSoup( driver.page_source, 'html.parser' )
>>> div = tree.find( 'div', class_='person__right' )
>>>
>>> tok = div.text.split()
>>> email = 'unknown'
>>>
>>> for i,str in enumerate( tok ):
...     if 'Email' in str:
...         email = tok[ i + 1 ]
...         break
>>>
>>> print( 'Email address: ' + email )
>>>
>>> driver.close()

In both cases, the program returns healey@ncsu.edu, which is the correct NC State email for employee Christopher Healey. You might wonder, "This is a lot of work to get someone's email address. Why would we go through all this effort for that result?" In fact, we probably would not. However, suppose we had a list of 1,000 NC State employee first and last names, and we needed an email address for each of them. Doing this manually through the NC State web page would take a significant amount of time. With only a slight modification to the end of our program, however, we could query an email, go back one page, refill the fields to query a new email, and so on until we had all 1,000 emails. Not only would it be fully automated, it would also be much faster than a manual approach. This is the power of Selenium and Beautiful Soup: the ability to automate tedious or laborious manual tasks, even when they involve many dynamic interactions with a web page.
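
As a sketch of that modification, continuing from the program above with the browser on the Campus Directory search page, and assuming a hypothetical list of (last, first) name pairs and the ids and classes identified earlier (in practice, each step would be wrapped in the explicit waits shown above):

from bs4 import BeautifulSoup

# Hypothetical list of names to look up; 'Doe, Jane' is a placeholder
names = [ ( 'Healey', 'Christopher' ), ( 'Doe', 'Jane' ) ]
emails = { }

for last, first in names:
    # Re-locate the form fields on each pass, since returning to the
    # search page invalidates previously located elements
    form = driver.find_element( By.ID, 'lastname' )
    form.clear()
    form.send_keys( last )

    form = driver.find_element( By.ID, 'firstname' )
    form.clear()
    form.send_keys( first )

    driver.find_element( By.CLASS_NAME, 'btn-primary' ).click()

    # Parse the results page and extract the email, exactly as before
    tree = BeautifulSoup( driver.page_source, 'html.parser' )
    div = tree.find( 'div', class_='person__right' )
    tok = div.text.split() if div is not None else [ ]

    email = 'unknown'
    for i, s in enumerate( tok ):
        if 'Email' in s:
            email = tok[ i + 1 ]
            break
    emails[ ( last, first ) ] = email

    # Return to the search form for the next name
    driver.back()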

XPATH

To date, we've seen how to use Selenium to search for specific HTML properties like id and class. Through the By class, Selenium provides locators for the following properties: element id (By.ID), name attribute (By.NAME), class name (By.CLASS_NAME), tag name (By.TAG_NAME), link text (By.LINK_TEXT), partial link text (By.PARTIAL_LINK_TEXT), CSS selector (By.CSS_SELECTOR), and XPath expression (By.XPATH).

What happens if you need a more sophisticated way of locating elements in the HTML? This often happens when the HTML is poorly written. Although there are various ways to do this, Selenium's proposed solution is to use By.XPATH. But, what's an XPATH? XPath is short for XML path, the path used to navigate through the structure of an HTML document. By.XPATH allows you to locate elements using XML path expressions. XPaths come in two flavours: absolute and relative. An absolute XPath defines the exact path from the beginning of the HTML page to the page element you want to locate.

/html/body/div[2]/div[1]/h4[1]

This absolute XPath says start at the root node (/), then find the HTML element (which is the entire HTML for the page), then the body element, the second div in the body, the first div inside the body's second div, and finally the first h4 section heading within that div. Although this allows very explicit selection, it is also laborious, and if the format of the web page changes, the absolute XPath will break. The much more common alternative is a relative XPath, which allows searching within the web page for target elements. The basic format of a relative XPath is:

//tagname[ @attribute = 'value' ]

where // searches the entire document for matching elements, tagname is the HTML tag to match, @attribute is the attribute to examine, and value is the attribute value the element must have.

As an example, to search for the clickable link with an id of ncstate-utility-bar-toggle-link using a relative XPath, we would use:

driver.find_element( By.XPATH, "//a[@id='ncstate-utility-bar-toggle-link']" )

XPATH contains

If you want to locate an element based on a partial text match, the contains function can be used to do this. Rather than an XPath of //tagname[ @attribute='value' ], we can use //tagname[ contains( @attribute, 'partial-value' ) ] to locate an element of type tagname with an attribute whose value contains partial-value, for example:

driver.find_element( By.XPATH, "//a[contains( @href, 'coronavirus' )]" )

This selects the first anchor whose href contains the text coronavirus. This corresponds to the COVID-19 UPDATES link on the NC State homepage banner. Clicking it will direct the browser to NC State's COVID-19 information page.

NC State's COVID-19 "Protecting the Pack" information page

In addition to contains, the function starts-with can be used to identify elements whose attribute value starts with a specific string.
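
For example, to locate the first anchor whose href begins with a given prefix (the prefix below is purely illustrative):

driver.find_element( By.XPATH, "//a[starts-with( @href, 'https://www.ncsu.edu' )]" )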

Mixing Relative and Absolute XPaths

Finally, it is also possible to mix relative and absolute XPaths. The common scenario for this is to find a location in the HTML with a relative XPath search, then follow that location with a set of absolute path links to obtain a specific element that follows the relative location. For example, suppose I wanted to click the Admissions link on the NC State home page. By examining the code, I can see that the link is part of an unordered list ul with an id of menu-menu. The list elements li that follow are a link to coronavirus updates, a link to an About page, and a link to Admissions. Now that I know the list element I want is the third element, I can specify it in an absolute fashion.

driver.find_element( By.XPATH, "//ul[@id='menu-menu']/li[3]" )

Notice that XPath indices start at 1, not at 0 as they do in Python. The above code locates the unordered list with the id of menu-menu, then from that point in the HTML, searches for the third list element. If we click that element, we will be taken to the Admissions page.

You may also notice that if we hover over the Admissions link, rather than clicking it, a drop-down menu appears with Apply to NC State, Undergraduate, Graduate, and International options. Suppose we actually wanted to click the Graduate option in the drop-down menu. Again, further examination of the HTML reveals a div and another ul follow the Admissions list element. Within the second ul are the options in the drop-down menu. The Graduate option is the third in this list.

There is a problem, however. We cannot simply build an absolute XPath to the Graduate link. Since the drop-down is not visible, we cannot immediately select the Graduate option. We must first simulate a hover over the Admissions text to reveal the drop-down menu, then select the third item in the list and click it to proceed to the Graduate Students section of the Admissions page.

from selenium.webdriver.common.action_chains import ActionChains

elem = driver.find_element( By.XPATH, "//ul[@id='menu-menu']/li[3]" )
hover = ActionChains( driver ).move_to_element( elem )
hover.perform()
elem = driver.find_element( By.XPATH, "//ul[@id='menu-menu']/li[3]/div/ul/li[3]" )
elem.click()

If you're curious, ActionChains are used to automate lower-level interactions such as mouse movements, mouse button actions, key presses, and context menu interactions. Our hover operation can be considered the beginning of a context menu interaction. To perform the operation, an action chain is first built by creating a queue of actions. Next, the action chain's perform function is called to execute the queued actions one after another. In our case, we have only one queued action: moving the mouse over the Admissions text to reveal its drop-down menu. Once this is done, we have access to the Graduate menu option.

Beautiful Soup

Beautiful Soup does not have as extensive a list of capabilities as Selenium; in particular, it cannot interactively manipulate a web page. If it does what you need, however, it has the advantage of not requiring webdrivers or additional programming before you can start scraping data from a web page. The typical advice is to only use Selenium if Beautiful Soup cannot do what you need.

If you are not using Selenium, you will need to read the HTML source from a web URL using Python's requests library. This is easy to do.

import requests
import sys
from bs4 import BeautifulSoup

page = requests.get( 'https://www.ncsu.edu' )
if page.status_code != 200:
   print( 'mainline(), could not retrieve HTML for www.ncsu.edu' )
   sys.exit( 0 )
else:
   tree = BeautifulSoup( page.content, 'html.parser' )

This attempts to read the HTML source from a web URL. If the status_code is 200, the request was successful, otherwise a problem occurred and the source was not returned. For example, if the page does not exist, a status_code of 404 would be returned. Status codes of the form 2xx indicate success; errors are of the form 4xx or 5xx. The requests library has its own set of routines for validating and responding to different status codes.
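
If you prefer exceptions over manual status checks, requests can raise an error for any 4xx or 5xx response; a brief sketch of the same request using raise_for_status:

import requests
from bs4 import BeautifulSoup

page = requests.get( 'https://www.ncsu.edu' )

try:
    # Raises requests.HTTPError for any 4xx or 5xx status code
    page.raise_for_status()
except requests.HTTPError as err:
    print( 'mainline(), could not retrieve HTML for www.ncsu.edu:', err )
else:
    tree = BeautifulSoup( page.content, 'html.parser' )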

Once the web page is read and parsed by Beautiful Soup, you can begin to issue commands to search for information and extract it from the web page's source. Just as with Selenium, you will need to understand the source code in order to specify what you want to extract. This can be done in the same ways, for example, by using Chrome's developer tools to explore the page through the Elements tab. We assume you've read through this material, so we will focus on how to use Beautiful Soup's parse tree to retrieve information.

Finding Elements

The vast majority of what we do with Beautiful Soup is search for target elements, then examine the elements' attributes. There are three useful functions to do this: find(), find_all(), and select(). The find function finds the first occurrence of an element. find_all finds all occurrences of an element and returns them as a list. select allows you to specify CSS selectors to locate elements.

As an example, using the tree element built from parsing www.ncsu.edu, we could find the first paragraph element as follows. Once we have that element, we can use the get_text() function to retrieve the text of the paragraph.

>>> txt = tree.find( 'p' ).get_text()
>>> print( txt )

      Hear from the alumna and record-setting NASA astronaut on Dec. 4.

To retrieve all the paragraphs on the page, we would use find_all.

>>> para = tree.find_all( 'p' )
>>> print( 'Total paragraphs:', len( para ) )
>>> print( para[ 1 ].get_text() )
Total paragraphs: 31

      Thousands of NC State employees have strived to keep campus running and care for our students during COVID-19.

Similar to Selenium, both find and find_all allow you to specify attributes in addition to HTML tags. The two most common attributes are id and class_ (note the trailing underscore). Beautiful Soup's documentation provides a long set of examples for how to search for targets within the parse tree.

>>> div = tree.find( 'div', id='main-content' )
>>> section = tree.find( 'section', class_='news' )

If an element has an attribute that is non-standard, you can still search for the element using the attrs argument. attrs is a dictionary that defines one or more attributes as keys and corresponding attribute values as values. Any element that matches the given key–value pair(s) will be returned. Incidentally, there's no limitation on only using non-standard attributes with attrs. Any attribute, including ones like id, class, or name, can be included in the attrs dictionary.

>>> tree = BeautifulSoup( '<div special-attr="value">A special div</div>', 'html.parser' )
>>> tree.find( attrs={ 'special-attr': 'value' } )

CSS Selectors

CSS selectors are cascading style sheet components. Specifically, with CSS you can define a style for elements within your web page. For example, the following CSS rule makes the text for all paragraphs (defined using <p> … </p>) red.

<style>
p {
   color: red;
}
</style>

The selector is the first part of the CSS rule, in this case, the p tag. It defines which elements in the web page should be selected to have the given CSS property applied. Selectors can also refer to classes or ids. To refer to an element block assigned to a specific class, preface the class's name with a period (.). To refer to an element block assigned to a specific id, preface the id's name with a hash (#).

<style>
.center-align {
   text-align: center;
}

#blue-bg {
   background-color: blue;
   color: white;
}
</style>

CSS selectors provide Beautiful Soup with a flexible way to locate target elements or element blocks. For example, to find all paragraphs we could specify a selection for p, but we can string together selectors to define parent–child relationships. To find all anchors inside a paragraph tag, we could specify a selection for p a. To find all paragraphs with an id of first, we could specify p#first.

para = tree.select( "p" )
para_anchor = tree.select( "p a" )
para_first = tree.select( "p#first" )

In this sense, select can perform operations similar to find_all, but with the ability to be both more general and more specific about what elements to locate.
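
For example, using the hypothetical center-align class and blue-bg id from the CSS rules above, and assuming a parsed page that actually contains them, select can locate the corresponding element blocks directly.

# All elements assigned the center-align class
centered = tree.select( '.center-align' )

# The element with an id of blue-bg
blue_bg = tree.select( '#blue-bg' )

# Only paragraphs assigned the center-align class
centered_para = tree.select( 'p.center-align' )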

National Weather Service Example

As a practical example, we will use Beautiful Soup to scrape and print the extended forecast for Raleigh reported by the National Weather Service (NWS) web site. Since we have already discussed exploring a web page to identify target elements, we will limit our example to using Beautiful Soup to extract the elements we need. The URL for Raleigh's weather forecast was obtained by entering Raleigh, NC into the NWS homepage, producing https://forecast.weather.gov/MapClick.php?lat=35.7855&lon=-78.6427.

National Weather Service extended forecast for Raleigh, NC

To start, we request the HTML for Raleigh's forecast page and confirm it was returned properly.

>>> import requests
>>> import sys
>>> from bs4 import BeautifulSoup
>>>
>>> page = requests.get(
...     'https://forecast.weather.gov/MapClick.php?lat=35.7855&lon=-78.6427' )
>>>
>>> if page.status_code != 200:
...     print( "mainline(), could not retrieve HTML for Raleigh's weather" )
...     sys.exit( 0 )
>>>
>>> tree = BeautifulSoup( page.content, 'html.parser' )

The individual extended weather entries are contained in an unordered list with an id of seven-day-forecast-list. Within each list item is a div with a class of tombstone-container. Within this div are four separate paragraphs: (1) the period name (e.g., This Afternoon); (2) an image whose alt tag contains the detailed forecast (this is the same text as in the detailed forecast list, but it appears as a tooltip when you hover over the image); (3) a short description (e.g., Sunny); and (4) a temperature (e.g., High: 48°F). At this point, we have two options. We can combine the first, third, and fourth text items to produce a short summary of the extended weather. Or, we can extract the alt text of the img in the second paragraph. Below is code for both options.

>>> list = tree.find( 'ul', id='seven-day-forecast-list' )
>>> item = list.find_all( 'li', class_='forecast-tombstone' )
>>>
>>> for li in item:
...     txt = ''
...     para = li.find_all( 'p' )
...     for html in para:
...         html = str( html ).replace( '<br/>', ' ' )
...         soup = BeautifulSoup( html, 'html.parser' )
...         soup_txt = soup.get_text().strip()
...         txt += ( soup_txt + '. ' if len( soup_txt ) > 0 else '' )
...     print( txt )

Here, we extract the HTML for each list item, convert it to a string, and replace any line break <br/> with a space. Then, we re-use BeautifulSoup to parse the HTML, and ask for the text it contains. Joining these together produces a final set of extended forecast summary lines.

Tonight. Clear. Low: 27 °F.
Wednesday. Sunny. High: 50 °F.
Wednesday Night. Clear. Low: 29 °F.
Thursday. Sunny. High: 56 °F.
Thursday Night. Partly Cloudy. Low: 36 °F.
Friday. Chance Showers. High: 59 °F.
Friday Night. Chance Showers. Low: 42 °F.
Saturday. Mostly Sunny. High: 57 °F.
Saturday Night. Partly Cloudy. Low: 36 °F.

If we instead wanted to use the alt text for the image embedded in the second paragraph, the following code would extract the alt text.

>>> list = tree.find( 'ul', id='seven-day-forecast-list' )
>>> item = list.find_all( 'li', class_='forecast-tombstone' )
>>>
>>> for li in item:
...     para = li.find_all( 'p' )
...     txt = para[ 1 ].find( 'img' )[ 'alt' ]
...     print( txt )

This produces a result similar to the first code block, although with slightly more detail and in a grammatically correct format.

Tonight: Clear, with a low around 27. West wind 7 to 9 mph, with gusts as high as 18 mph.
Wednesday: Sunny, with a high near 50. West wind 6 to 9 mph.
Wednesday Night: Clear, with a low around 29. Light west wind.
Thursday: Sunny, with a high near 56. Calm wind.
Thursday Night: Partly cloudy, with a low around 36. Calm wind.
Friday: A chance of showers, mainly after 1pm. Mostly cloudy, with a high near 59. Chance of precipitation is 30%. New precipitation amounts of less than a tenth of an inch possible.
Friday Night: A chance of showers before 1am. Mostly cloudy, with a low around 42. Chance of precipitation is 30%.
Saturday: Mostly sunny, with a high near 57.
Saturday Night: Partly cloudy, with a low around 36.