A common need is web scraping, the ability to copy content from an online web page. In data analytics, this is often used to extract data from a web page into a format amenable to follow-on analysis in languages like Python, R, SAS, or SQL.
At their inception, web pages were a simple combination of Hypertext Markup Language (HTML) to structure and style the content of a page, the Hypertext Transfer Protocol (HTTP) to transfer HTML documents over the Internet, and web browsers like Mosaic to convert HTML into a rendered presentation. HTML supported document structuring: paragraphs, tables, lists, and so on; and text styling: boldface, italics, and other types of visual modification of text.
Modern web pages are very different from their ancestors. Now, pages commonly contain complex styling and programming that a web browser must interpret and execute prior to displaying a page. Two common examples are Cascading Style Sheets (CSS) and JavaScript programs that control both the appearance and the functionality of a web page.
In order to web scrape in this new reality, two steps are needed. First, the web page must be read and interpreted into its final format. Selenium is used to do this, since it has the ability to mimic a web browser by reading raw HTML, then performing the execution necessary to convert the HTML into its final format. Although Selenium is designed to perform web page testing, it can also deliver the HTML for a fully rendered page. Second, once that HTML is available, it needs to be read and parsed. In Python, we use Beautiful Soup to do this.
Even with Selenium and Beautiful Soup, web scraping is non-trivial. For example, web pages often have interactive controls that need to be invoked in a specific order to arrive at the page of interest. Selenium is fully capable of doing this, but the raw HTML must be examined to determine how to uniquely identify the web page controls to manipulate. This information is needed by Selenium to locate and modify widgets on the web page. Once the target page is scraped, the HTML must again be examined to determine how to tell Beautiful Soup what we want to scrape. Well-written HTML will have easy-to-locate identifiers for all the main elements on a page. Poorly written HTML will not. Both can be parsed, but the effort required for poorly written pages is greater. Regardless, for both Selenium and Beautiful Soup, understanding how HTML works is a prerequisite for scraping most pages. If you need a quick introduction to HTML, refer back to our discussion of HTML5 in the plotly | Dash lessons.
id & class

Extending our understanding of HTML, the most common way to identify a particular section of HTML is through its id or class, two attributes that can be attached to most HTML markup tags. As an example, consider the following simple HTML code.
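A minimal example might look like the following, with a div identified by an id and two paragraphs sharing a class (the specific names are illustrative):

```html
<div id="main-content">
  <p class="highlight">This paragraph belongs to the highlight class.</p>
  <p class="highlight">So does this one.</p>
  <p>This paragraph has no class.</p>
</div>
```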
In HTML, the id attribute is used to identify a specific HTML structure. The class attribute is used to assign a pre-defined class to the structure, usually to style the structure in some common way throughout the document. Both Selenium and Beautiful Soup allow us to select HTML structures based on their id and class attributes, or combinations thereof. This is the most common way of identifying the target structure we wish to extract from a web page.
Selenium can be installed as a package in Python from the Anaconda Prompt by typing the following.
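The exact command may vary with your setup, but with a standard Anaconda installation it should be something like:

```
conda install selenium
```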
You may need to run the prompt in Administrator mode to allow conda to update your Anaconda installation.
Once Selenium is installed, you will also need to provide a webdriver. The webdriver allows you to programmatically drive a web page, exactly as though you were a user. You can ask Selenium to load a page, click elements on the page, fill in text fields, scroll the page, and do any of the other things a real user could do if they were viewing the page in their own web browser.
The webdriver itself is a program that runs on your computer and mimics one of the common web browsers. Currently, you can download webdrivers for Chrome, Firefox, Edge, IE, Safari, and Opera from this web page. Unless you have a need for specific browser compatibility, the driver you choose isn't particularly important, since all browsers support a nearly identical set of operations. Once you've downloaded a webdriver for a specific browser and operating system (Windows, macOS, or Linux), you will have an executable like chromedriver (for Chrome) or geckodriver (for Firefox). This executable must be placed in a location where Selenium can find it, so it can be run when Selenium starts its processing. The simplest location is the same directory as the Python program using Selenium. The documentation on the downloads page also explains how you can add the location of Selenium webdrivers to your PATH variable, since Selenium will check PATH locations whenever a webdriver is requested.
At this point, you have everything you need to load Selenium in Python, invoke a controllable version of one of the common web browsers, then use Selenium to load a page and manipulate its contents to navigate to the location where you want to scrape data. At that point, the page's HTML can be retrieved and passed to Beautiful Soup for further processing. Given this, the high-level order of processing for web scraping with Selenium and Beautiful Soup is as follows.
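A compact sketch of this workflow, with each step marked in the comments; the URL and locator below are placeholders only.

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                      # 1. start a webdriver-controlled browser
driver.get('https://www.example.com')            # 2. load the page of interest (placeholder URL)
driver.find_element(By.ID, 'some-id').click()    # 3. interact with the page to reach the target content
html = driver.page_source                        # 4. retrieve the fully rendered HTML
driver.quit()

tree = BeautifulSoup(html, 'html.parser')        # 5. parse the HTML with Beautiful Soup
# 6. search the parse tree to extract the data of interest
```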
As noted above, one of the fundamental requirements for Selenium or Beautiful Soup is properly identifying the HTML structures in a web page you want to manipulate or scrape. The easiest way to do this is to load a web page into your favourite web browser, then use the developer tools every browser provides to examine the underlying HTML source code in detail. The discussion in these notes will use Chrome as an example, since it provides a robust set of examination tools. The same functionality can be performed in Firefox, Safari, or any other browser, however, using whatever commands they make available for this type of exploration.
To begin, run Chrome and load NC State's homepage at https://www.ncsu.edu. Next, click on the three vertical dots in the top-right corner of the browser window to reveal Chrome's menu. Choose More tools → Developer tools to bring up Chrome's developer tools (you can also use the keyboard shortcut Ctrl+Shift+I to do the same thing).
If this is the first time you've used the developer tools, they will
appear in a dock inside the browser window to the left or
right. To force the developer tools into their own, separate window,
click the three vertical dots in the upper-right corner of the tools
dock, and click on the Dock side option that shows two overlapping
windows. This will pull the developer tools into a window separate
from the browser window.
The developer tools are designed for a variety of tasks, including examining a web page's source code, debugging JavaScript code, and confirming resources for the page loaded properly over the network. Since we're interested in examining HTML source code, choose the tab labelled Elements at the top of the page. This shows an overview of the code that makes up NC State's homepage, with exposure triangles to allow you to show and hide more detailed information contained in the page. Move your mouse over the different lines in the source code list. You should see different parts of the main browser window highlight. This is showing you which parts of the web page correspond to which parts of the code you are moving over.
RESOURCES Link
Notice that if we click on the RESOURCES button at the top of the page, a panel slides down with additional options to select from. If we wanted to do this with Selenium, we would need to determine how to uniquely identify the RESOURCES button. To do this, we would start moving our mouse over the source code in the Elements panel, watching to see when RESOURCES was highlighted, and continuing to descend into the code in more detail until we find the exact line of code that represents the RESOURCES button. When I load the NC State homepage and examine the source code, this is what I need to do to find the RESOURCES button.
1. Hovering over <div id="header-bar"…, the entire navigation bar containing the magnifying glass highlights, so I click the triangle to expose the code within the div.
2. Next, I descend into <div class="ncstate-utility-bar-container"…, since it highlights the magnifying glass.
3. Finally, hovering over <a id="ncstate-utility-bar"…, I see the magnifying glass (which is now identified as a clickable link) highlight.

I've now identified a clickable link (<a> … </a>) with the class property ncstate-utility-bar that represents the magnifying glass button. If I invoke a chromedriver in Selenium and ask to click the link with class ncstate-utility-bar, the options panel will slide down, just like it did when I clicked it explicitly.
This shows how you can use Chrome's developer tools to walk in and out of the code to find the specific elements you want to manipulate, and what unique identifiers can be used to allow Selenium to manipulate them.
Now that we know how to identify the magnifying glass button, how would we use Selenium to automatically select it in a chromedriver? The following Python code snippet will create a chromedriver, load the NC State homepage, click the magnifying glass link, wait for 5 seconds, then terminate. Remember, this code will only run if you have chromedriver in the current working directory, or in a directory included in your PATH.
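A minimal version of this snippet might look like the following, assuming the link is located by the class name identified above.

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()          # create a chromedriver-controlled browser
driver.get('https://www.ncsu.edu')   # load the NC State homepage

# Click the magnifying glass link, located by its class name
driver.find_element(By.CLASS_NAME, 'ncstate-utility-bar').click()

time.sleep(5)                        # wait 5 seconds so the panel is visible
driver.quit()                        # terminate the browser
```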
Moving further into our example, suppose I want to select the Campus Directory link at the top of the panel. Some exploring identifies its id as ncstate-utility-bar-first-link. The following code duplicates our previous operation to click the Campus Directory link.
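A sketch of that operation, using the id given above:

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.ncsu.edu')

# Click the magnifying glass link, then the Campus Directory link by its id
driver.find_element(By.CLASS_NAME, 'ncstate-utility-bar').click()
driver.find_element(By.ID, 'ncstate-utility-bar-first-link').click()

time.sleep(5)
driver.quit()
```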
If you run this code, it may fail with an error stating that the Campus Directory link cannot be interacted with. Why did this happen? There are actually two reasons this is not working.
First, if you examine the web page source before clicking the magnifying glass link, you will see that ncstate-utility-bar exists, even though the panel is not visible. This is because the web designers have defined it as part of the original page, and only reveal it when the magnifying glass link is clicked. You can see this by searching the web page source for ncstate-utility-bar, which will be present. Then, click on the magnifying glass link. This causes the div that contains ncstate-utility-dropdown-pane to slide down, revealing the list of available options the magnifying glass (search) provides.
Unfortunately, this is only one of the problems we are encountering. You might wonder, "If the link is always available, why can't we just locate it and click it, without having to reveal the search panel?" This is because clicking the magnifying glass, in addition to exposing the panel, also makes the link available for interaction. Trying to click the link programmatically before we reveal the panel tells us the element is not yet available for interaction (an ElementNotInteractableException error). But, our code above first clicked the magnifying glass link to make the panel visible, then found and clicked the Campus Directory link. Even doing this, we may still encounter an interaction error. Why would this happen?
When the web browser executes code for complicated web pages like the NC State homepage, it takes time for the operations to complete and the web page to update. Our code is running too quickly, so it asks for a reference to the Campus Directory link before chromedriver has processed our first click and rendered the drop-down panel. This is a very common occurrence during web scraping of dynamic pages.
How can we solve this second error? An obvious way would be to create an infinite loop that located the Campus Directory link and, if it wasn't available for interaction, slept for a short period of time, then tried again. This is strongly discouraged, however, since it is inefficient, and it also blocks the Python interpreter from performing any actions while the sleep command runs. Selenium provides two possible methods for dealing with this issue: implicit waits and explicit waits. An implicit wait will wait a certain amount of time to locate an element before it gives up and returns an error. An explicit wait will wait a certain amount of time for a specific condition to evaluate to True based on the web page's contents before it gives up and returns an error. It is also fairly easy to write our own function that waits a set number of attempts for a target element to become available on the web page before giving up and deciding something has gone wrong.
In our situation, an implicit wait will not work, since the Campus Directory link is always present whether it is visible (and clickable) or not. This means we will need an explicit wait, with a specific condition we are waiting on. Selenium's expected_conditions class provides numerous ways to wait for specific conditions, including waiting until an element is clickable. We will set an explicit wait by element id for five seconds on ncstate-utility-bar-first-link until it becomes clickable.
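A sketch of this explicit wait, assuming the driver has already loaded the homepage and clicked the magnifying glass link:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 5 seconds for the Campus Directory link to become clickable, then click it
wait = WebDriverWait(driver, 5)
link = wait.until(EC.element_to_be_clickable((By.ID, 'ncstate-utility-bar-first-link')))
link.click()
```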
Now, the second click registers as expected and we move to the Campus Directory web page, ready to search for information on campus members.
To finish our example, we will enter a last and first name into the appropriate fields on the web form, then click the Search button to retrieve information about the given individual. At this point, we will have arrived at our target page, and we are ready to extract the user's Email address from the resulting information. Fields are populated using the send_keys function, in the following way.
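A sketch of filling in the form; the field and button locators used here are hypothetical placeholders, since the actual identifiers must be read from the directory page's HTML.

```python
from selenium.webdriver.common.by import By

# Locate the form fields (the ids used here are hypothetical)
last_field  = driver.find_element(By.ID, 'last-name')
first_field = driver.find_element(By.ID, 'first-name')

last_field.send_keys('Healey')         # type the last name into the field
first_field.send_keys('Christopher')   # type the first name into the field

# Click the Search button (again, a hypothetical locator)
driver.find_element(By.ID, 'search-button').click()
```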
At this point, if we want to retrieve the value attached to the Email field, we have two options. First, we can do this directly in Selenium. Second, we can ask Selenium to return the HTML for the current page, parse that HTML with Beautiful Soup, then retrieve the value attached to the Email field using Beautiful Soup's parse tree.
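A sketch of both options; the Email field locator used here is hypothetical, since the real identifier must be read from the results page's HTML.

```python
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

# Option 1: read the Email value directly in Selenium (hypothetical id)
email = driver.find_element(By.ID, 'email').text

# Option 2: hand the rendered HTML to Beautiful Soup and search its parse tree
soup = BeautifulSoup(driver.page_source, 'html.parser')
email = soup.find(id='email').get_text()   # hypothetical id

print(email)
```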
In both cases, the program returns healey@ncsu.edu, which is the correct NC State email for employee Christopher Healey. You might wonder, "This is a lot of work to get someone's email address. Why would we go through all this effort for that result?" In fact, we probably would not. However, suppose we had a list of 1,000 NC State employee first and last names, and we needed an email address for each of them. Doing this manually through the NC State web page would take a significant amount of time. With only a slight modification to the end of our program, however, we could query an email, go back one page, refill the fields to query a new email, and so on until we had all 1,000 emails. Not only would it be fully automated, it would also be much faster than a manual approach. This is the power of Selenium and Beautiful Soup: the ability to automate tedious or manually labourious tasks, even when they involve many dynamic interactions with a web page.
To date, we've seen how to use Selenium to search for specific HTML properties like id and class. Selenium provides functions to search the following properties.
- id: find_element( By.ID, '…' ), or with helper function find_element_by_id( '…' )
- name: find_element( By.NAME, '…' ), or with helper function find_element_by_name( '…' )
- class name: find_element( By.CLASS_NAME, '…' ), or with helper function find_element_by_class_name( '…' )
- tag name: find_element( By.TAG_NAME, '…' ), or with helper function find_element_by_tag_name( '…' )
- link text: find_element( By.LINK_TEXT, '…' ), or with helper function find_element_by_link_text( '…' )
- partial link text: find_element( By.PARTIAL_LINK_TEXT, '…' ), or with helper function find_element_by_partial_link_text( '…' )
- CSS selector: find_element( By.CSS_SELECTOR, '…' ), or with helper function find_element_by_css_selector( '…' )
What happens if you need a more sophisticated way of locating elements in the HTML? This often happens when the HTML is poorly written. Although there are various ways to do this, Selenium's proposed solution is to use By.XPATH. But, what's an XPATH? XPath is short for XML path, the path used to navigate through the structure of an HTML document. By.XPATH allows you to locate elements using XML path expressions. XPaths come in two flavours: absolute and relative. An absolute XPath defines the exact path from the beginning of the HTML page to the page element you want to locate.
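For example, an absolute XPath might look like the following; this particular path is illustrative, matching the description below.

```
/html/body/div[2]/div[1]/h4[1]
```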
This absolute XPath says start at the root node (/), then find the html element (which is the entire HTML for the page), then the body element, the second div in the body, the first div inside the body's second div, and finally the first h4 section heading within that div. Although this allows very explicit selection, it is also labourious, and if the format of the web page changes, the absolute XPath will break. The much more common alternative is a relative XPath, which allows searching within the web page for target elements. The basic format of a relative XPath is:
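```
//tagname[ @attribute='value' ]
```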
where:

- //: select the current node
- tagname: tag name of the target node to be found (div, img, a, and so on)
- @: select an attribute
- attribute: attribute name of the target node to be found
- value: attribute value of the target node to be found
As an example, to search for the clickable link with an id of ncstate-utility-bar-toggle-link using a relative XPath, we would use:
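A sketch of this search, using find_element with By.XPATH:

```python
# Locate the link by a relative XPath on its id attribute
link = driver.find_element(By.XPATH, "//a[ @id='ncstate-utility-bar-toggle-link' ]")
```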
contains
If you want to locate an element based on a partial text match, the contains function can be used to do this. Rather than an XPath of //tagname[ @attribute='value' ], we can use //tagname[ contains( @attribute, 'partial-value' ) ] to locate an element of type tagname with an attribute whose text contains partial-value, for example:
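```python
# Locate the anchor whose href contains the text 'coronavirus' (the COVID-19 UPDATES link)
link = driver.find_element(By.XPATH, "//a[ contains( @href, 'coronavirus' ) ]")
link.click()
```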
This will select all anchors whose href contains the text coronavirus. This corresponds to the COVID-19 UPDATES link on the NC State homepage banner. Clicking it will direct the browser to NC State's COVID-19 information page.
In addition to contains, the starts-with function can be used to identify elements whose text starts with a specific string.
Finally, it is also possible to mix relative and absolute XPaths. The common scenario for this is to find a location in the HTML with a relative XPath search, then follow that location with a set of absolute path links to obtain a specific element that follows the relative location. For example, suppose I wanted to click the Admissions link on the NC State home page. By examining the code, I can see that the link is part of an unordered list ul with an id of menu-menu. The list elements li that follow are a link to coronavirus updates, a link to an About page, and a link to Admissions. Now that I know the list element I want is the third element, I can specify it in an absolute fashion.
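A sketch of this mixed relative and absolute XPath:

```python
# Find the unordered list with id menu-menu, then take its third list item (Admissions)
admissions = driver.find_element(By.XPATH, "//ul[ @id='menu-menu' ]/li[3]")
admissions.click()
```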
Notice that XPaths index starting at 1, not at 0 like indexing in Python.
The above code locates the unordered list with the id of menu-menu, then from that point in the HTML, searches for the third list element. If we click that element, we will be taken to the Admissions page.
You may also notice that if we hover over the Admissions link, rather than clicking it, a drop-down menu appears with Apply to NC State, Undergraduate, Graduate, and International options. Suppose we actually wanted to click the Graduate option in the drop-down menu. Again, further examination of the HTML reveals a div and another ul follow the Admissions list element. Within the second ul are the options in the drop-down menu. The Graduate option is the third in this list.
There is a problem, however. We cannot simply build an absolute XPath to the Graduate link. Since the drop-down is not visible, we cannot immediately select the Graduate option. We must first simulate a hover over the Admissions text to reveal the drop-down menu, then select the third item in the list and click it to proceed to the Graduate Students section of the Admissions page.
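A sketch of this interaction; the XPath used for the Graduate option is an assumption based on the structure described above.

```python
from selenium.webdriver.common.action_chains import ActionChains

# Hover over the Admissions list item to reveal its drop-down menu
admissions = driver.find_element(By.XPATH, "//ul[ @id='menu-menu' ]/li[3]")
ActionChains(driver).move_to_element(admissions).perform()

# With the drop-down visible, click the Graduate option (third item in the second ul;
# this path is an assumption about the page's structure)
graduate = driver.find_element(By.XPATH, "//ul[ @id='menu-menu' ]/li[3]//ul/li[3]/a")
graduate.click()
```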
If you're curious, ActionChains are used to automate lower-level interactions such as mouse movements, mouse button actions, key presses, and context menu interactions. Our hover operation can be considered the beginning of a context menu interaction. To perform the operation, first an action chain is built by creating a queue of actions. Next, the action chain's perform function is called to execute the actions in the queue one after another. In our case, we have only one queued action: moving the mouse over the Admissions text to reveal its drop-down menu. Once this is done, we have access to the Graduate menu option.
Although Beautiful Soup does not have as extensive a set of capabilities as Selenium (in particular, it cannot interactively manipulate a web page), if it does what you need, it has the advantage of not requiring webdrivers or additional programming to reach a point where you can start scraping data from a web page. The typical advice is to use Selenium only if Beautiful Soup cannot do what you need.
If you are not using Selenium, you will need to read the HTML source from a web URL using Python's requests library. This is easy to do.
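A minimal sketch of this step:

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.ncsu.edu')
print(resp.status_code)                          # 200 indicates the request succeeded

# Parse the returned HTML source into a Beautiful Soup parse tree
tree = BeautifulSoup(resp.text, 'html.parser')
```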
This attempts to read the HTML source from a web URL. If the status_code is 200, the request was successful, otherwise a problem occurred and the source was not returned. For example, if the page does not exist, a status_code of 404 would be returned. Status codes in the form of 2xx indicate success. Errors are in the form of 4xx or 5xx. The requests library has its own set of routines for validating and responding to different status codes.
Once the web page is read and parsed by Beautiful Soup, you can begin to issue commands to search for information and extract it from the web page's source. Identical to Selenium, you will need to understand the source code in order to specify what you want to extract. This can be done in ways identical to Selenium, for example, by using Chrome's developer tools to explore using the Elements tab. We assume you've read through this material, so we will focus on how to use Beautiful Soup's parse tree to retrieve information.
The vast majority of what we do with Beautiful Soup is search for target elements, then examine the elements' attributes. There are three useful functions to do this: find(), find_all(), and select(). The find function finds the first occurrence of an element. find_all finds all occurrences of an element and returns them as a list. select allows you to specify CSS selectors to locate elements.
As an example, using the tree element built from parsing www.ncsu.edu, we could find the first paragraph element as follows. Once we have that element, we can use the get_text() function to retrieve the text of the paragraph.
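A sketch of this:

```python
p = tree.find('p')      # first paragraph element in the parse tree
print(p.get_text())     # text contained in the paragraph
```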
To retrieve all the paragraphs on the page, we would use find_all.
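For example:

```python
for p in tree.find_all('p'):    # every paragraph element on the page
    print(p.get_text())
```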
Similar to Selenium, both find and find_all allow you to specify attributes in addition to HTML tags. The two most common attributes are id and class_ (note the trailing underscore). Beautiful Soup's documentation provides a long set of examples for how to search for targets within the parse tree.
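For instance, using the header-bar id seen earlier on the NC State homepage, plus a hypothetical class name:

```python
header = tree.find('div', id='header-bar')     # search by id
intros = tree.find_all('p', class_='intro')    # search by class; 'intro' is a hypothetical class name
```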
If an element has an attribute that is non-standard, you can still search for the element using the attrs argument. attrs is a dictionary that defines one or more attributes as keys and corresponding attribute values as values. Any element that matches the given key–value pair(s) will be returned. Incidentally, there's no limitation on only using non-standard attributes with attrs. Any attribute, including ones like id, class, or name, can be included in the attrs dictionary.
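A sketch, using a hypothetical data-role attribute and value:

```python
# Find all elements whose data-role attribute is 'navigation' (hypothetical attribute and value)
navs = tree.find_all(attrs={ 'data-role': 'navigation' })
```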
CSS selectors are cascading style sheet components. Specifically, with CSS you can define a style for elements within your web page. For example, the following CSS rule makes the text for all paragraphs (defined using <p> … </p>) red.
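```css
p {
  color: red;
}
```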
The selector is the first part of the CSS rule, in this case, the p tag. It defines which elements in the web page should be selected to have the given CSS property applied. Selectors can also refer to classes or ids. To refer to an element block assigned to a specific class, preface the class's name with a period (.). To refer to an element block assigned to a specific id, preface the id's name with a hashtag (#).
CSS selectors provide Beautiful Soup with a flexible way to locate target elements or element blocks. For example, to find all paragraphs we could specify a selection for p, but we can string together selectors to define parent–child relationships. To find all anchors inside a paragraph tag, we could specify a selection for p a. To find all paragraphs with an id of first, we could specify p#first.
In this sense, select can perform operations similar to find_all, but with the ability to be both more general and more specific about what elements to locate.
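For example:

```python
paragraphs = tree.select('p')         # all paragraphs
anchors    = tree.select('p a')       # all anchors inside a paragraph
first_para = tree.select('p#first')   # all paragraphs with an id of first
```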
As a practical example, we will use Beautiful Soup to scrape and print the extended forecast for Raleigh reported by the National Weather Service (NWS) web site. Since we have already discussed exploring a web page to identify target elements, we will limit our example to using Beautiful Soup to extract the elements we need. We also provide the URL for Raleigh's weather forecast, obtained by entering Raleigh, NC into the NWS homepage, producing a URL of https://forecast.weather.gov/MapClick.php?lat=35.7855&lon=-78.6427.
To start, we scrape the HTML for Raleigh's web site and confirm it was returned properly.
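A sketch of this step:

```python
import requests
from bs4 import BeautifulSoup

url = 'https://forecast.weather.gov/MapClick.php?lat=35.7855&lon=-78.6427'
resp = requests.get(url)

if resp.status_code != 200:                       # confirm the request succeeded
    raise RuntimeError(f'Request failed: {resp.status_code}')

tree = BeautifulSoup(resp.text, 'html.parser')    # parse the forecast page
```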
The individual extended weather entries are contained in an unordered list with an id of seven-day-forecast-list. Within each list item is a div with a class of tombstone-container. Within this div are four separate paragraphs: (1) the period name (e.g., This Afternoon); (2) an image whose alt tag contains the detailed forecast (this is the same text as in the detailed forecast list, but it appears as a tooltip when you hover over the image); (3) a short description (e.g., Sunny); and (4) a temperature (e.g., High: 48°F). At this point, we have two options. We can combine the first, third, and fourth text items to produce a short summary of the extended weather. Or, we can extract the alt text of the img in the second paragraph. Below is code for both options.
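A sketch of the first option, following the approach described below:

```python
forecast = tree.find('ul', id='seven-day-forecast-list')

summary = []
for li in forecast.find_all('li'):
    html = str(li).replace('<br/>', ' ')                   # flatten line breaks into spaces
    text = BeautifulSoup(html, 'html.parser').get_text()   # re-parse and extract the text
    summary.append(text.strip())

print('\n'.join(summary))
```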
Here, we extract the HTML for each list item, convert it to a string, and replace any line break <br/> with a space. Then, we re-use Beautiful Soup to parse the HTML, and ask for the text it contains. Joining these together produces a final set of extended forecast summary lines.
If we instead wanted to use the alt text for the image embedded in the second paragraph, the following code would extract the alt text.
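A sketch of this second option:

```python
forecast = tree.find('ul', id='seven-day-forecast-list')

detailed = []
for li in forecast.find_all('li'):
    img = li.find('img')                  # forecast image inside the tombstone container
    if img is not None and img.has_attr('alt'):
        detailed.append(img['alt'])       # the alt text holds the detailed forecast

print('\n'.join(detailed))
```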
This produces a result similar to the first code block, although with slightly more detail and in a grammatically correct format.