Python Tutorial
Christopher G. Healey

Introduction

A computer program is a sequence of instructions, written in a programming language, that combine to solve a problem or produce a desired result. Computer programs range from very simple examples with only a few lines of source code to very complicated systems with millions of lines. For example, Windows 7 is estimated to contain around 50 million lines of code.

In general, programming involves the following steps:

  1. A program is written in some programming language (e.g., C++, Python, R, or Javascript), usually stored in one or more source code files.
  2. For compiled languages like C or C++, the source code is converted into machine language by a compiler, then combined into an executable file by a linker. The executable file is run on a target machine and operating system.
  3. For interpreted languages like Python and R, the source code is processed by an interpreter. Individual lines are converted to machine language and executed one by one, as they are encountered within the source code.

This tutorial will provide an introduction to programming using Python, an interpreted programming language. Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica (National Research Institute for Mathematics and Computer Science) in the Netherlands. Python 1.0 was released in 1994. The most recent version is 3.4.2, released in October 2014. To maintain backwards compatibility, Python versions 3.0 and 3.1 coincided with versions 2.6 and 2.7. We will be using Python version 2.7.

Python is an interpreted language, which means individual lines of code are converted to machine language and executed as they are encountered (versus a compiled language, which converts an entire program to machine code in an explicit compile–link stage). One advantage of an interpreted language is the ability to enter individual commands on a command line prompt, and immediately see their results.

>>> print 2 + 5
7
>>> list = [ 1, 2, 3 ]
>>> print list
[1, 2, 3]
>>> print list[ 0 ]
1
>>> email = { 'healey': 'healey@ncsu.edu', 'rappa': 'mrappa@ncsu.edu' }
>>> print email
{'rappa': 'mrappa@ncsu.edu', 'healey': 'healey@ncsu.edu'}
>>> print email[ 'healey' ]
healey@ncsu.edu
>>> basket = [ 'orange', 'apple', 'pear', 'apple', 'durian', 'orange' ]
>>> fruit = set( basket )
>>> print fruit
set(['orange', 'pear', 'apple', 'durian'])
>>> print 'durian' in fruit
True
>>> print 'pomegranate' in fruit
False

Unless we're only issuing a few commands once or twice, we normally store the commands in source code files. This allows us to load and execute the commands as often as we want. It also makes it easier to modify a program, or correct it when we discover errors. A slightly modified version of the above code (to force output to appear when it's run) is available in the source code file tut-01-intro.py.

A source code file in Python is called a module. Every Python program includes a main module. When the program starts running, it begins interpreting code in the main module. You can also split your code across additional modules to organize it, or use libraries to access functionality provided by other programmers.

Assignment

This year's assignment asks you to use Weather Underground's API (Application Programming Interface) to query forecasted temperature data. You will write your results to a CSV file. The assignment web page provides full details on what is required.

The assignment will be completed as individuals, and not as a homework team. Grading for the assignment will be on a standard 0–100% scale. The assignment is due by 5:00pm EST on Friday, September 2. You will submit the Python code for your completed assignment through Moodle. Look for "Python Assignment: Weather Underground Tempature API" in the Programming & Visualization section.

Running Python

For this class, we'll be using the Anaconda Scientific Python distribution for Windows. If you haven't done so already, there are comprehensive instructions on how to download and install Anaconda. Anaconda includes the NumPy (numerical Python), pandas (Python data analysis), urllib2 (URL management), and BeautifulSoup libraries that we'll be using in this tutorial.

Once Anaconda is installed, choose the IPython Qt Console program to bring up an interactive Python console.

On Mac OS X or Linux, Python is normally installed by default. If it is, you can type python at a command line prompt to bring up a Python shell.

In addition to typing individual statements, you can load and execute source code from the command line. For example, to run the statements shown in the Introduction above, download and save the source code file tut-01-intro.py, then use execfile to run it:

>>> execfile( 'c:/Users/healey/Downloads/tut-01-intro.py' )

Note that the path to your copy of the file will depend on where you've saved it.

Why Python?

Python is a powerful programming language that can perform many functions. For this class, our interest is in Python's data management capabilities. Python offers efficient ways to read data from one or more input files, then modify, correct, convert, or extend that data and write the results to an output file.

For example, suppose we had the following comma-separated value (CSV) file:

Name,Height
Jim,181
Betty,167
Frank,154
...
Annabelle,201

We want to classify each individual based on his or her height h_i versus the mean and standard deviation of height, μ and σ, as:

producing a new CSV output file with contents:

Name,Height,Class
Jim,181,2
Betty,167,1
Frank,154,3
...
Annabelle,201,4

Python isn't the only way to produce this result (e.g., we could probably do it in Excel, or SAS, or R), but it's often easier and faster to do it with a Python program. If you're curious, this source code file classifies a comma-delimited height file.

Variables

Every programming language provides a way to maintain values as a program runs. Usually, this is done by creating a named variable, then assigning a value to the variable. In Python, variables are created interactively by specifying their name and assigning them an initial value.

>>> name = 'Abraham Lincoln'
>>> height = 6.33
>>> age = 56
>>> birthplace = 'Hodgenville KY'
>>> born = 'February 12 1809'
>>> deceased = 'April 15 1865'

Unlike languages like C++, Python does not require you to specify a variable's type. This is inferred from the value it maintains. In the above example the variables name, birthplace, born, and deceased are inferred to be strings, height is inferred to be floating point, and age is inferred to be integer. One advantage of Python's dynamically typed variables is that you can change them to hold different types of values whenever you want. You can also ask Python what type of value a variable contains with the type() function.

>>> name = 'Abraham Lincoln'
>>> print type( name )
<type 'str'>
>>> name = 25
>>> print type( name )
<type 'int'>
>>> name = 6.3
>>> print type( name )
<type 'float'>
>>> name = 305127925769258727938193819283L
>>> print type( name )
<type 'long'>
>>> name = False
>>> print type( name )
<type 'bool'>

Here is a quick list of some of Python's basic variable types. More complicated types will be discussed later in the tutorial.

Variable Practice Problem

Write a set of Python statements that assign the names and associated phone numbers: Christopher Healey, 9195138112, Michael Rappa, 9195130480, and Aric LaBarr, 9195132957 to Python variables, then print three lines listing each person's name and corresponding phone number.

I recommend you write your program using a text editor, save it as a source code file, and use execfile() to test it, rather than writing the program directly in the Python shell. This will let you write your code, run it to see what it does, edit it to fix problems, and run it again, without having to re-type the entire program at the command line.

Variable Assignment Solution

>>> name_0 = 'Christopher Healey'
>>> phone_0 = '9195138112'
>>> name_1 = 'Michael Rappa'
>>> phone_1 = '9195130480'
>>> name_2 = 'Aric LaBarr'
>>> phone_2 = '9195132957'
>>>
>>> print name_0, '@', phone_0
Christopher Healey @ 9195138112
>>>
>>> print name_1, '@', phone_1
Michael Rappa @ 9195130480
>>>
>>> print name_2, '@', phone_2
Aric LaBarr @ 9195132957

You can download the solution file and run it on your machine, if you want.

Your choice of variable names is probably different than ours, and you might have printed the name and phone number with slightly different formatting. Regardless, the basic idea is to use six separate variables to store the names and phone numbers, then print the contents of these variables in combinations that produce the correct output.

You might think, "This works, but it doesn't seem very efficient." That's true. Once you've learned more about Python, it's unlikely you'd write this code to solve the problem. Here's a more elegant and flexible solution. When you've finished the tutorial, you'll be able to understand, and to implement, this type of code.

>>> db = { }
>>> db[ '9195138112' ] = 'Christopher Healey'
>>> db[ '9195130480' ] = 'Michael Rappa'
>>> db[ '9195132957' ] = 'Aric LaBarr'
>>> for phone in db.keys():
...     print db[ phone ], '@', phone
...
Michael Rappa @ 9195130480
Christopher Healey @ 9195138112
Aric LaBarr @ 9195132957

Operators

Python provides a set of built-in functions or operators to perform simple operations such as addition, subtraction, comparison, and boolean logic. An expression is a combination of variables, constants, and operators. Every expression has a result. Operators in Python have precedence associated with them. This means expressions using operators are not evaluated strictly left to right. Results from the operators with the highest precedence are computed first. Consider the following simple Python expression:

>>> 6 + 3 * 4 / 2 + 2
14

If this were evaluated left to right, the result would be 20. However, since multiplication and division have a higher precedence than addition in Python, the result returned is 14, computed as follows: 3 * 4 = 12, then 12 / 2 = 6, then 6 + 6 = 12, and finally 12 + 2 = 14.

Of course, we can use parentheses to force a result of 20, if that’s what we wanted, with the following expression:

>>> ((( 6 + 3 ) * 4 ) / 2 ) + 2
20

Below is a list of the common operators in Python, along with an explanation of what they do. The operators are grouped according to precedence, from highest to lowest.

Operator                 Description
( )                      parentheses define the order in which groups of operators should be evaluated
**                       exponential
+x, -x                   make positive, make negative
*, /, //, %              multiplication, division, floor division, remainder
+, -                     addition, subtraction
<, <=, >, >=, !=, ==     less, less or equal, greater, greater or equal, not equal, equal
and                      boolean AND
or                       boolean OR
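
As a small sketch of how these precedence levels interact, the expressions below each mix operators from two different rows of the list. The variable names are made up for illustration, and print( ... ) is written with parentheses so the code runs under both Python 2.7 and Python 3.

```python
# Each line mixes operators from different precedence levels.
power = 2 ** 3 ** 2               # ** groups right to left: 2 ** ( 3 ** 2 )
negated = -2 ** 2                 # ** binds tighter than unary minus: -( 2 ** 2 )
floor_div = 17 // 5               # floor division discards the remainder
remainder = 17 % 5                # what is left over after floor division
compare = 3 + 4 > 2 * 3           # arithmetic runs before the comparison: 7 > 6
logic = True or False and False   # and binds tighter than or

print( power )       # 512
print( negated )     # -4
print( floor_div )   # 3
print( remainder )   # 2
print( compare )     # True
print( logic )       # True
```

When an expression like these is hard to read at a glance, adding parentheses that match the default grouping costs nothing and makes the intent explicit.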

Advanced Data Types

In addition to boolean and numeric variables, Python provides a number of more complex types, including strings (str), lists (list), dictionaries (dict), and tuples (tuple). Using these types effectively will make you a much more efficient programmer.

str

String variables (str) are a sequence of zero or more characters. String values are denoted by single quotes, s = 'Abraham Lincoln', or double quotes, s = "Abraham Lincoln". Because strings are a more complex data type, they support more complex operations. Here are some common operations you can perform on strings.

Here are some examples of string operations executed in a Python shell.

>>> s = 'Hello world!'
>>> print len( s )
12
>>> print s[ 6 ]
w
>>> print s[ 2: 8 ]
llo wo
>>> print s[ -2 ]
d
>>> print s[ -3: 12 ]
ld!
>>> t = 'Must.. try.. harder..'
>>> print s + ' ' + t
Hello world! Must.. try.. harder..

There are many additional operations you can perform on strings, for example, s.capitalize() to capitalize a string, or s.find( t ) to find the first occurrence of substring t in s. The Python documentation describes all the available string methods, explaining how to use them and what they do.
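
The two methods named above can be tried on a short string. This is only a sketch: the sample strings are invented, and print( ... ) is written with parentheses so the code runs under both Python 2.7 and Python 3.

```python
s = 'hello world!'
t = 'world'

capitalized = s.capitalize()   # upper-cases the first character, lower-cases the rest
position = s.find( t )         # index in s where t first appears
missing = s.find( 'xyz' )      # find returns -1 when the substring is absent

print( capitalized )   # Hello world!
print( position )      # 6
print( missing )       # -1
```

Note that neither method changes s itself. Each returns a new value, so the result has to be stored or printed to be of any use.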

list

List variables are ordered sequences of values. Most data types can be stored in a list, for example, a list of int's, a list of str's, or even a list of list's. List values are denoted by square brackets, l = [ 1, 2, 3, 4 ].

You might notice that strings look suspiciously like a list of characters. Indeed, both list and str are known as "sequence types" in Python, so lists support the same len, concatenation, indexing, and slicing operations as strings.

>>> l = [ 1, 2, 3, 4, 5 ]
>>> print len( l )
5
>>> print l[ 2 ]
3
>>> print l[ 1: 3 ]
[2, 3]
>>> print l[ -2 ]
4
>>> print l[ -3: 5 ]
[3, 4, 5]
>>> m = [ -2, -3, -4 ]
>>> print l + m
[1, 2, 3, 4, 5, -2, -3, -4]

As with strings, there are many additional operations you can perform on lists, for example, l.insert( i, x ) to add item x at position i, l.remove( x ) to remove the first item equal to x, or l.sort() to reorder items into sorted order. The Python documentation describes the available list methods, explaining how to use them and what they do.
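
Here is a brief sketch of the insert, remove, and sort methods working on one list. The starting values are arbitrary, and print( ... ) is written with parentheses so the code runs under both Python 2.7 and Python 3.

```python
l = [ 4, 1, 3 ]

l.insert( 1, 2 )   # place the value 2 at index 1, giving [4, 2, 1, 3]
l.remove( 4 )      # delete the first item equal to 4, giving [2, 1, 3]
l.sort()           # reorder the list in place, giving [1, 2, 3]

print( l )   # [1, 2, 3]
```

Unlike the string methods, all three of these modify the list directly rather than returning a new one.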

dict

Dictionary variables are a collection of key–value pairs. This is meant to be analogous to a real dictionary, where the key is a word, and the associated value is the word's definition. dict variables are designed to support efficient searching for elements in the dictionary based on key. Dictionary values are denoted by braces, d = { key: value }.

By design, dictionaries have one important requirement: every value you store in a dictionary must have its own, unique key. For example, we could not store a person's address using their last name as a key, because if two different people had the same last name, only one of their addresses could be saved in the dictionary.

Suppose instead we wanted to find a person's name based on their phone number. To do this, we could create a dictionary with phone number as a key and name as a value.

>>> d = { '9195138112': 'Christopher Healey' }
>>> d[ '9195130480' ] = 'Michael Rappa'
>>> d[ '9195152858' ] = 'Dept of Computer Science'
>>> print d
{'9195130480': 'Michael Rappa', '9195138112': 'Christopher Healey', '9195152858': 'Dept of Computer Science'}
>>> print d[ '9195138112' ]
Christopher Healey

The first statement creates a dictionary variable named d and assigns a single key–value pair made up of a string phone number key 9195138112 and a string name value Christopher Healey. The next two lines add two new key–value pairs to d by specifying a key as an index (a string phone number inside the square brackets) and assigning a name as the key's value. Printing d lists all its key–value pairs. The value attached to a specific key can be queried by indexing d with the target key.

Key Types

Why did we choose to make our key a string variable and not a numeric variable? The dictionary would work if we used d = { 9195138112: 'Christopher Healey' }, with a long integer for the key rather than a string. Our choice was semantic: we view a phone number as a sequence of (numeric) characters, and not as a generic numeric value. It doesn't make sense to add or subtract phone numbers, for example, so phone numbers don't really act like numbers.

Since the dictionary works identically either way, does it really matter? In terms of functionality, probably not. In terms of understandability, that depends. We try to find the best match between the context of a variable and its type. Here, the point is subtle, so it doesn't make a big difference. In other cases, though, a proper choice (rather than simply the first choice that works) can improve a program's effectiveness, and perhaps more importantly, make it easier to understand.

One interesting difference between a dictionary and a list is that dictionaries do not maintain order. The order that items are stored in a dictionary will not necessarily match the order that you added them to the dictionary. You can see this in the above example, where phone numbers were added in the order 9195138112, 9195130480, 9195152858, but they were stored in d in the order 9195130480, 9195138112, then 9195152858.

Dictionaries are a very powerful data structure. If you need to perform efficient search, if ordering the elements isn't critical, and if you can define a key for each of the elements you're storing, a dict might be a good candidate. There are many additional operations you can perform on dictionaries, for example, d.keys() to return a list of keys in the dictionary, d.values() to return a list of its values, or d.pop( k ) to remove an entry with key k (and return its value, if you want it). The Python documentation describes the available dictionary methods, explaining how to use them and what they do.
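
The following sketch exercises keys, values, and pop on a two-entry phone dictionary. Wrapping the results in sorted() is an addition made here so the printed order is predictable, since the dictionary itself does not promise one. print( ... ) is written with parentheses so the code runs under both Python 2.7 and Python 3.

```python
d = { '9195138112': 'Christopher Healey', '9195130480': 'Michael Rappa' }

keys = sorted( d.keys() )         # all keys, sorted for a predictable order
values = sorted( d.values() )     # all values, sorted the same way
removed = d.pop( '9195130480' )   # removes the entry and returns its value

print( keys )      # ['9195130480', '9195138112']
print( values )    # ['Christopher Healey', 'Michael Rappa']
print( removed )   # Michael Rappa
print( d )         # {'9195138112': 'Christopher Healey'}
```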

tuple

Tuple variables are ordered sequences of values, where the position of a value within the tuple often has a semantic meaning. Tuple values are denoted by parentheses, t = ( 2013, 10, 28, 14, 15, 0 ).

You might notice that tuples look identical to lists. Again, both are "sequence types" in Python, so tuples support the len, concatenation, and slicing operations, as well as querying by index.

>>> t = ( 2013, 10, 28, 14, 15, 0 )
>>> print len( t )
6
>>> print t[ 0 ]
2013
>>> print t[ 3: 6 ]
(14, 15, 0)
>>> print t[ -3 ]
14
>>> print t[ -6: 3 ]
(2013, 10, 28)
>>> u = ( 1, 232, 1 )
>>> print t + u
(2013, 10, 28, 14, 15, 0, 1, 232, 1)

There are important differences between lists and tuples. Although they are subtle, understanding them will help you decide when to use a list, and when to use a tuple.

Tuples support methods that lists support, as long as the method does not modify the tuple's values. So, for example, tuples support len() and +, but not remove() or sort(), since those change the tuple's values or the order of its values.
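
A short sketch makes the read-only behavior concrete. It uses try and except, which this tutorial has not introduced, purely to catch the error Python raises when code attempts to change a tuple. The variable names are illustrative, and print( ... ) is written with parentheses so the code runs under both Python 2.7 and Python 3.

```python
t = ( 2013, 10, 28 )

# Operations that only read the tuple work as they do for lists.
length = len( t )
longer = t + ( 14, 15, 0 )        # builds a new tuple; t itself is unchanged

# Operations that would modify the tuple are not available.
has_sort = hasattr( t, 'sort' )   # False: tuples have no sort() method
try:
    t[ 0 ] = 2014                 # assigning to a position raises TypeError
    assigned = True
except TypeError:
    assigned = False

print( length )     # 3
print( longer )     # (2013, 10, 28, 14, 15, 0)
print( has_sort )   # False
print( assigned )   # False
print( t )          # (2013, 10, 28)
```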

Conditionals

We've already seen that a Python program runs by executing the first statement in the main module, and continuing with each successive statement until it reaches the end of the program. This doesn't allow for very complicated programs. What if we want to control the flow of execution, that is, what if we want one part of the program to be executed in some cases, but another part to be executed in different cases?

Conditional statements allow you to control how your program executes. For example, a conditional statement could apply a comparison operator to a variable, then execute a different block of statements depending on the result of the comparison. Or, it could cause a block of statements to be executed repeatedly until some condition is met.

Understanding conditional statements is necessary for writing even moderately complicated programs. We discuss some common Python conditional operators below, and give details on how to structure your code within a conditional statement.

if-then-else

To start, we'll discuss the if-then-else conditional. Described in simple terms, this is used in a program to say, "If some condition is true, then do this, else do that."

As an example, suppose we have a variable grade that holds a student's numeric grade on the range 0–100. We want to define a new variable passed that's set to True if the student's grade is 50 or higher, or False if the grade is less than 50. The following Python conditional will do this.

>>> grade = 75
>>> if grade >= 50:
...     passed = True
... else:
...     passed = False
...
>>> print passed
True

Although this statement appears simple, there are a number of important details to discuss.

Interestingly, the else part of the conditional is optional. The following code will produce the same result as the first example.

>>> passed = False
>>> grade = 75
>>> if grade >= 50:
...     passed = True
...
>>> print passed
True

Suppose we wanted to not only define pass or fail, but also assign a letter grade for the student. We could use a series of if-then statements, one for each possible letter grade. A better way is to use elif, which defines else-if code blocks. Now, we're telling a program, "If some condition is true, then do this, else if some other condition is true, then do this, else do that." You can include as many else-if statements as you want in an if-then-else conditional.

>>> grade = 75
>>> if grade >= 90:
...     passed = True
...     letter = 'A'
... elif grade >= 80:
...     passed = True
...     letter = 'B'
... elif grade >= 65:
...     passed = True
...     letter = 'C'
... elif grade >= 50:
...     passed = True
...     letter = 'D'
... else:
...     passed = False
...     letter = 'F'
...
>>> print passed
True
>>> print letter
C

while

Another common situation is the need to execute a code block until some condition is met. This is done with a while conditional. Here, we're telling the program "While some condition is true, do this." For example, suppose we wanted to print the square roots of values on the range 1–15.

>>> import math
>>> i = 1
>>> while i <= 15:
...     print 'The square root of', i, 'is', math.sqrt( i )
...     i = i + 1
...
The square root of 1 is 1.0
The square root of 2 is 1.41421356237

The square root of 15 is 3.87298334621

(The import math statement is needed to give us access to mathematical functions like math.sqrt).

Notice that the variable that's compared in the while conditional normally must be updated in the conditional's code block. If you don't update the conditional variable, a comparison that initially evaluates to True will never evaluate to False, which means the while loop will execute forever. For example, consider the following code block.

>>> import math
>>> i = 1
>>> while i <= 15:
...     print 'The square root of', i, 'is', math.sqrt( i )
...
The square root of 1 is 1.0
The square root of 1 is 1.0
The square root of 1 is 1.0
The square root of 1 is 1.0
The square root of 1 is 1.0
The square root of 1 is 1.0
The square root of 1 is 1.0

Without the i = i + 1 statement to update i in the conditional's code block, the while conditional never fails, giving us the same output over and over. You can use Ctrl+C to halt your program if it's caught in an infinite loop like this.

for

A final conditional that is very common is a for loop. Here, we're telling a program "Execute this code block for some list of values." for can work on any list of values, but it's often applied to a numeric range. The range command is used to create a list containing a sequence of integers.

>>> print range( 5, 10 )
[5, 6, 7, 8, 9]
>>> print range( 10 )
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> print range( -15, -10 )
[-15, -14, -13, -12, -11]

Giving two values to range like range( 2, 5 ) defines a starting value of 2 and an ending value of 5. range generates an integer list from the starting value, up to but not including the ending value: [2, 3, 4]. If you only give an ending value to range like range( 5 ), range assumes a starting value of 0, producing the list [0, 1, 2, 3, 4].
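
The start and end rules can be checked directly. In this sketch each call is wrapped in list(), an addition made here so the same code runs under Python 3, where range returns a range object rather than a list. Under Python 2.7 the wrapper simply copies the list that range already produced.

```python
two_args = list( range( 2, 5 ) )   # starts at 2, stops before 5
one_arg = list( range( 5 ) )       # starting value defaults to 0
empty = list( range( 5, 2 ) )      # start is already past the end

print( two_args )   # [2, 3, 4]
print( one_arg )    # [0, 1, 2, 3, 4]
print( empty )      # []
```

The last case is worth remembering: a start value at or beyond the end value is not an error, it just yields an empty list.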

Once a list is produced with range, each value in the list is given to the for conditional's code block, in order. For example, suppose we wanted to print the same set of square roots from 1–15 using a for loop.

>>> import math
>>> for i in range( 1, 16 ):
...     print 'The square root of', i, 'is', math.sqrt( i )
...
The square root of 1 is 1.0
The square root of 2 is 1.41421356237

The square root of 15 is 3.87298334621

The for statement defines a variable to hold the "current" list value. In our case, this variable is called i. range( 1, 16 ) generates the list [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. The for conditional walks through this list and executes the code block 15 times, first with i set to 1, then with i set to 2, and so on up to the final value of 15. The statement inside the code block uses i to track the current list value, printing square roots from 1 to 15.

We don't need to use range to execute a for conditional. Any list can be used in a for loop.

>>> name = [ "Healey", "Rappa", "LaBarr" ]
>>> for nm in name:
...     print nm, '(', len( nm ), ')'
...
Healey ( 6 )
Rappa ( 5 )
LaBarr ( 6 )

break and continue

break

Sometimes we need to exit a for or while loop before its condition evaluates to False. The break statement allows us to do this. For example, suppose we wanted to print the elements of a list of strings, but terminate examining the list if we see the string stop.


>>> l = [ 'Healey', 'Rappa', 'LaBarr' ]
>>> for i in range( len( l ) ):
...     if l[ i ] == 'stop':
...         break
...     print l[ i ]
...
Healey
Rappa
LaBarr
>>> l.insert( 1, 'stop' )
>>> print l
['Healey', 'stop', 'Rappa', 'LaBarr']
>>> for i in range( len( l ) ):
...     if l[ i ] == 'stop':
...         break
...     print l[ i ]
...
Healey

continue

Other times, we want to stop executing a loop's code block, and instead return to check its condition. The continue statement allows us to do this. For example, suppose we wanted to print only the odd numbers from 1 to 10.

>>> for i in range( 1, 11 ):
...     if i % 2 == 0:
...         continue
...     print i, 'is odd'
...
1 is odd
3 is odd
5 is odd
7 is odd
9 is odd

Loop Practice Problem

Write a set of Python statements to compute the average of the following list of numbers: 6, 12, -7, 29, 14, 38, 11, 7.

I recommend you write your program using a text editor, save it as a source code file, and use execfile() to test it, rather than writing the program directly in the Python shell. This will let you write your code, run it to see what it does, edit it to fix problems, and run it again, without having to re-type the entire program at the command line.

List Average Solution

for loop

>>> num = [ 6, 12, -7, 29, 14, 38, 11, 7 ]
>>> sum = 0
>>> for n in num:
...     sum = sum + n
...
>>> print float( sum ) / len( num )
13.75

while loop

>>> num = [ 6, 12, -7, 29, 14, 38, 11, 7 ]
>>> i = 0
>>> sum = 0
>>> while i < len( num ):
...     sum = sum + num[ i ]
...     i = i + 1
...
>>> print float( sum ) / len( num )
13.75

Notice that we have to convert the sum to a floating point value (in our case, by casting it with float()) to get the proper average of 13.75. If we had used the statement print sum / len( num ) instead, Python would have returned an integer result of 13.
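
The difference between the two kinds of division can be seen side by side. This sketch uses // for the integer result because in Python 2.7 a / between two integers behaves like //, while in Python 3 a / always produces a floating point value. Writing // and float() explicitly gives the same answers under both versions. The name total is used instead of sum only to avoid hiding Python's built-in sum function.

```python
num = [ 6, 12, -7, 29, 14, 38, 11, 7 ]

total = 0
for n in num:
    total = total + n

floored = total // len( num )           # integer (floor) division
average = float( total ) / len( num )   # floating point division

print( total )     # 110
print( floored )   # 13
print( average )   # 13.75
```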

You can download the solution file and run it on your machine, if you want.

Debugging

Inevitably, you'll write some Python code that either doesn't do what you expect it to do, or that generates an error message when you try to execute it. When that happens, you'll need to debug the program to locate and correct the error. Consider the following code.

>>> l = [ '10', '20', '30' ]
>>> sum = 0
>>> for val in l:
...     sum = sum + val
...

If you hit Return to close the for loop, Python would respond with an error message similar to this.

TypeError   Traceback (most recent call last)
<ipython-input-11-00773625d4c5> in <module>()
      1 for val in l:
----> 2     sum = sum + val
      3

TypeError: unsupported operand type(s) for +: 'int' and 'str'

So, that didn't work. The first few lines of the error message give some details about how the error's being reported, and where the error occurred (the arrow points at line 2 of the code entered at the console, which IPython labels "<ipython-input-...>"). The most important part of the error is the last line, which tries to explain the problem Python encountered. This explanation suggests that Python doesn't know how to add (+) an int and a str.

If you look at where the error was reported (line 2 of the for loop), it attempted to execute sum = sum + val. Python is claiming the first variable in the add operation, sum, is an int, but the second variable val is a str. val is a value from the list variable l. And, when we look at l, we see that it contains three string values: '10', '20', and '30'. This is the problem that Python encountered.

There are various ways to fix this problem. One simple solution is to put integers in the list, l = [ 10, 20, 30 ]. If you wanted l to contain strings for some reason, you could cast val to be an integer in the add operation.

>>> l = [ '10', '20', '30' ]
>>> sum = 0
>>> for val in l:
...     sum = sum + int( val )
...
>>> print sum
60

Now, Python accepts the for loop's body because it understands how to add two int variables. The resulting sum is printed after the loop finishes.
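
One way to track down this kind of error, offered here as a suggestion rather than a prescribed method, is to print each value together with its type before using it. The sketch below does that for the same list. The name total is used instead of sum only to avoid hiding Python's built-in sum function, and print( ... ) is written with parentheses so the code runs under both Python 2.7 and Python 3.

```python
l = [ '10', '20', '30' ]

total = 0
for val in l:
    # Show the type name and the exact value before it is used
    print( type( val ).__name__ + ': ' + repr( val ) )
    total = total + int( val )

print( total )   # 60
```

Each line of output begins with str, which confirms that the list holds strings and explains why the int() cast is needed.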

Computer "Bugs"

Why are errors in computer programs called bugs? Historically, the term "bug" was used in engineering to describe mechanical malfunctions.


In 1947 computer engineers were designing the Harvard Mark II computer. An error in the machine was traced to a moth that had become trapped in one of the machine's relays. The moth was removed and taped to the engineers' log book, where they referred to it as "The first actual case of bug being found." This incident seems to have contributed to the widespread use of the term in Computer Science.

Files

One of the main reasons we're using Python is to read and write data to and from external files. Python has an extensive set of file input/output (file IO) operations to support this. The basic structure of modifying a file often follows this simple pattern.

  1. Open one or more input files to read the original data.
  2. Open an output file to write the new or modified data.
  3. Read data from the input file, usually line by line.
  4. Examine each input line, modifying it or generating new data based on its contents.
  5. Write the modified line and/or data to the output file.
  6. Close the input and output files when processing is completed.
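
The six steps above can be sketched as a short program. Everything specific here is an assumption made for illustration: the file names, the scratch directory created with tempfile, the sample data, and the choice to upper-case each name as the "modification". Files are opened in text mode ('r' and 'w') so the code runs under both Python 2.7 and Python 3.

```python
import os
import shutil
import tempfile

# Hypothetical file names in a scratch directory, so nothing real is overwritten
folder = tempfile.mkdtemp()
in_name = os.path.join( folder, 'input.txt' )
out_name = os.path.join( folder, 'output.txt' )

# Create a small input file to work with
setup = open( in_name, 'w' )
setup.write( 'Jim,181\nBetty,167\nFrank,154\n' )
setup.close()

inp = open( in_name, 'r' )                  # Step 1: open the input file
out = open( out_name, 'w' )                 # Step 2: open the output file
for line in inp.readlines():                # Step 3: read line by line
    field = line.strip().split( ',' )       # Step 4: examine the line...
    field[ 0 ] = field[ 0 ].upper()         # ...and modify the name
    out.write( ','.join( field ) + '\n' )   # Step 5: write the modified line
inp.close()                                 # Step 6: close both files
out.close()

# Read the output back to confirm what was written, then clean up
check = open( out_name, 'r' )
result = check.read()
check.close()
shutil.rmtree( folder )

print( result )
```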

Reading Files

Here are some operations you can use to open files and read from them.

As a file is being read, Python maintains a current position in the file (the file pointer). This is how Python finds things like the "next" line: it starts from its current position in the file, reads until it sees a newline (\n, which on Windows is preceded by a return to give \r\n) or the end of the file, then returns what it read as a string.

Maintaining a current position means that Python won't automatically "back up" for you if you want to go back and re-read some data. For example, consider the following code snippet.

>>> inp = open( 'input.txt', 'rb' )
>>> line = inp.readlines()
>>> print len( line )
180
>>> line = inp.readlines()
>>> print len( line )
0
>>> inp.close()

The first time we read input.txt and asked how many lines it contained, Python told us it had 180 lines. But the next time we read the file, Python said it had 0 lines. How is that possible?

Remember, after the first line = inp.readlines() statement, Python reads everything in the file, returning a list with 180 strings representing the file's 180 lines. Critically, the current position is now at the end of the file. When we re-issue the same line = inp.readlines() statement, Python starts from its current position at the end of the file, realizes there's nothing else left to read, and returns an empty list to tell us that. So, the length of that second list is 0, exactly as we saw.

How could we re-read the entire file? The easiest way to do this is to close the file, then re-open it. Doing this resets the current position back to the start of the file.

>>> inp = open( 'input.txt', 'rb' )
>>> line = inp.readlines()
>>> print len( line )
180
>>> inp.close()
>>> inp = open( 'input.txt', 'rb' )
>>> line = inp.readlines()
>>> print len( line )
180
>>> inp.close()

seek and tell

seek

It's possible to change the current position in a file without closing and re-opening it. seek( pos ) is used to set a file's current position to pos. For example, we could re-read input.txt as follows.

>>> inp = open( 'input.txt', 'rb' )
>>> line = inp.readlines()
>>> print len( line )
180
>>> inp.seek( 0 )
>>> line = inp.readlines()
>>> print len( line )
180
>>> inp.close()

The command inp.seek( 0 ) sets the current position to 0 bytes from the start of the file (i.e., to the start of the file). If you need to seek from the end of the file, you can specify a negative offset and a second argument of 2 to seek. For example, inp.seek( -10, 2 ) seeks to 10 bytes before the end of input.txt. You can also seek from the current position by specifying an offset and a second argument of 1, for example, inp.seek( 20, 1 ) to seek 20 bytes forwards from the current position, or inp.seek( -5, 1 ) to seek 5 bytes backwards from the current position.

tell

The tell() command will return the current position in a file. For example, to determine the size of a file, you could do the following.

>>> inp = open( 'input.txt', 'rb' )
>>> inp.seek( 0, 2 )
>>> print inp.tell()
4044
>>> inp.close()
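
To try the three seek modes and tell() together without depending on input.txt, here is a small sketch. The scratch file and its 20-byte contents are made up for illustration, and the code is written so it should run under both Python 2.7 and Python 3.

```python
import os
import tempfile

# Build a small scratch file (name and contents are assumptions made
# for this example, not part of the tutorial's input.txt).
fd, path = tempfile.mkstemp()
os.close( fd )
out = open( path, 'wb' )
out.write( b'0123456789abcdefghij' )    # exactly 20 bytes
out.close()

inp = open( path, 'rb' )

inp.seek( 0, 2 )        # second argument 2: offset is from the end of the file
size = inp.tell()       # position at the end of the file == size in bytes

inp.seek( -10, 2 )      # 10 bytes backwards from the end
tail = inp.read()       # reads the final 10 bytes

inp.seek( 5 )           # 5 bytes from the start of the file
inp.seek( 3, 1 )        # second argument 1: 3 bytes forwards from here
pos = inp.tell()

inp.seek( -5, 1 )       # 5 bytes backwards from the current position
back = inp.tell()

inp.close()
os.remove( path )

print( size )
print( pos )
print( back )
```

Opening the file in binary mode ('rb') matters here: byte offsets relative to the current position or the end of the file are only reliable when the file is read as raw bytes.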

Writing Files

Python provides similar operations to open files and write to them.

Writing data is simple: put the data you want to write into a string variable (or convert a variable's value to a string value), then use write to write the data to an output file.

>>> out = open( 'output.txt', 'wb' )
>>> num = 50
>>> list = [ 'Healey', 'Rappa', 'Mostek' ]
>>> for elem in list:
...     out.write( elem + '\r\n' )
...
>>> out.write( str( num ) )
>>> out.close()

This code snippet creates an output file output.txt and writes four lines containing Healey, Rappa, Mostek, and 50 to the file. Notice that if we need a newline after each value, we need to add it explicitly by appending '\r\n' to the string as it's being written.

newlines

Newlines are used to mark the end of each line in a file. In Windows (or DOS) a newline is actually two characters: a return and a newline, denoted \r\n. This is unique to Windows. On other operating systems like Mac OS or Linux, only the newline \n is used.

Newlines also matter when you're reading data. For example, suppose you used readlines() to read all the lines in a file.

>>> inp = open( 'input.txt', 'rb' )
>>> line = inp.readlines()
>>> print line[ 0: 2 ]
['This is the first line\r\n', 'This is the second line\r\n']

The return and newline are included at the end of each line that's read. In some cases you want these "separators" removed when the file is read and parsed. One easy way to do this is to read the entire file using read(), then split the result using the string operator split().

>>> inp = open( 'input.txt', 'rb' )
>>> content = inp.read()
>>> line = content.split( '\r\n' )
>>> print line[ 0: 2 ]
['This is the first line', 'This is the second line']

split() divides a string into a list of substrings based on a delimiter, throwing away the delimiter after each split. Splitting the file's contents on the delimiter \r\n gives us what we want: the individual lines from the file without a return and newline at the end of each line.
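
One detail worth checking for yourself: if the file ends with a return and newline, split( '\r\n' ) leaves an empty string as the last list entry. The sketch below compares split() with two alternatives, the string operations splitlines() and rstrip(). The sample text is made up to match the example above.

```python
# Sample text with Windows-style line endings (contents assumed for illustration).
content = 'This is the first line\r\nThis is the second line\r\n'

# Splitting on the delimiter leaves an empty string after the final '\r\n'.
by_split = content.split( '\r\n' )

# splitlines() treats both '\r\n' and '\n' as line boundaries and does
# not add a trailing empty entry.
by_splitlines = content.splitlines()

# The same call also handles text that uses only '\n'.
unix_lines = 'first\nsecond\n'.splitlines()

# Lines returned by readlines() can have their endings stripped one by one;
# rstrip( '\r\n' ) removes any trailing '\r' or '\n' characters.
raw = [ 'first\r\n', 'second\n' ]
stripped = [ s.rstrip( '\r\n' ) for s in raw ]

print( by_split )
print( by_splitlines )
```

Which approach fits best depends on whether you know in advance which operating system produced the file.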

CSV Files

A file type you are likely to encounter often is a comma-separated value (CSV) file. CSV files are text files containing a table of values. Each line represents one row in the table, with individual column values in the row identified by a separator character. The separator is often a comma (,), although it can be any character that's guaranteed not to appear in any column value. For example, the US Census Bureau maintains a file of estimated 2011 city and town populations.

SUMLEV,STATE,COUNTY,PLACE,COUSUB,CONCIT,NAME,STNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011
040,01,000,00000,00000,00000,Alabama,Alabama,4779736,4779735,4785401,4802740
162,01,000,00124,00000,00000,Abbeville city,Alabama,2688,2688,2689,2704
162,01,000,00460,00000,00000,Adamsville city,Alabama,4522,4522,4522,4525
162,01,000,00484,00000,00000,Addison town,Alabama,758,758,754,754
162,01,000,00676,00000,00000,Akron town,Alabama,356,356,354,348
162,01,000,00820,00000,00000,Alabaster city,Alabama,30352,30352,30473,30799
162,01,000,00988,00000,00000,Albertville city,Alabama,21160,21160,21202,21421
162,01,000,01132,00000,00000,Alexander City city,Alabama,14875,14875,14846,14876
162,01,000,01228,00000,00000,Aliceville city,Alabama,2486,2486,2483,2438

157,56,045,79125,00000,00000,Upton town,Wyoming,1100,1100,1096,1084
157,56,045,99990,00000,00000,Balance of Weston County,Wyoming,2576,2576,2564,2539

We could use file and string operations to read and parse CSV files, but Python provides a csv module to help us with this. Modules are pre-written collections of operators, usually designed for a specific purpose or task. To use a module, you must first import it. To invoke one of its operations, you precede the operator's name with the name of the module, followed by a period.

>>> import csv
>>> csv.list_dialects()
['excel-tab', 'excel']

To read data from a CSV file, we normally perform the following steps.

  1. Open the CSV file to read with open(), exactly like any other input file.
  2. Attach a CSV reader to the CSV file.
  3. Use next() to read and parse any header line(s) in the CSV file.
  4. Use a for loop to read and parse the rows in the CSV file. Each row is returned as a list of column values found in the row's line.
  5. Close the CSV file.

For example, this code would read and parse the Census population file

>>> import csv
>>>
>>> inp = open( 'pop.csv', 'rb' )
>>> reader = csv.reader( inp )
>>> header = reader.next()
>>>
>>> for row in reader:
...     if row[ 6 ] == row[ 7 ]:
...         print 'The state of', row[ 7 ], 'has population', row[ 11 ]
...     else:
...         print 'City', row[ 6 ], 'in state', row[ 7 ], 'has population', row[ 11 ]
...
The state of Alabama has population 4802740
City Abbeville city in state Alabama has population 2704
City Adamsville city in state Alabama has population 4525

City Upton town in state Wyoming has population 1084
City Balance of Weston County in state Wyoming has population 2539
>>>
>>> inp.close()
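
If you don't have pop.csv handy, the same steps can be tried on a few lines typed in by hand. This is a sketch under two assumptions: a list of strings stands in for the open file (csv.reader accepts any iterable that yields one line at a time), and the column positions are looked up from the header by name rather than hard-coded. It uses next( reader ), which should work in both Python 2.7 and Python 3.

```python
import csv

# Three lines copied from the Census excerpt above; a list of strings
# stands in for the open file.
lines = [
    'SUMLEV,STATE,COUNTY,PLACE,COUSUB,CONCIT,NAME,STNAME,'
    'CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011',
    '040,01,000,00000,00000,00000,Alabama,Alabama,4779736,4779735,4785401,4802740',
    '162,01,000,00124,00000,00000,Abbeville city,Alabama,2688,2688,2689,2704',
]

reader = csv.reader( lines )
header = next( reader )        # the first row holds the column names

# Look up column positions by name instead of hard-coding 6, 7, and 11.
name_col = header.index( 'NAME' )
state_col = header.index( 'STNAME' )
pop_col = header.index( 'POPESTIMATE2011' )

messages = []
for row in reader:
    if row[ name_col ] == row[ state_col ]:
        # State-level row: the place name repeats the state name.
        messages.append( 'The state of %s has population %s'
                         % ( row[ state_col ], row[ pop_col ] ) )
    else:
        messages.append( 'City %s in state %s has population %s'
                         % ( row[ name_col ], row[ state_col ], row[ pop_col ] ) )

for msg in messages:
    print( msg )
```

Looking columns up by name costs a few extra lines, but the code keeps working if the file's columns are ever reordered.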

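Notice that the csv module helps us read and parse a CSV file, but it doesn't tell us anything about what the rows and columns in the file represent. We need to provide that context based on our understanding of the file. For example, the code above relies on knowing that row[ 6 ] is the NAME column, row[ 7 ] is the STNAME column, and row[ 11 ] is the POPESTIMATE2011 column, and that a row whose NAME matches its STNAME holds a state total rather than a city.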

It's also possible to write data out in CSV format. This is useful, since CSV files can be easily imported into programs like Excel or SAS. A very similar sequence of steps is used to write a CSV file.

  1. Open the CSV file to write with open(), exactly like any other output file.
  2. Attach a CSV writer to the CSV file.
  3. For each row you want to write to the CSV file, store the row's column values in a list.
  4. Use writerow() to write the list's values as a comma-separated line in the CSV file.
  5. After all the rows are written, close the CSV file.

For example, the Census population file has a lot of columns we might not care about. Suppose we wanted to reduce the file to only include city name, state name, and 2011 population estimate. The following code would do this.

>>> import csv
>>>
>>> inp = open( 'pop.csv', 'rb' )
>>> reader = csv.reader( inp )
>>> header = reader.next()
>>>
>>> out = open( 'pop-summary.csv', 'wb' )
>>> writer = csv.writer( out )
>>> writer.writerow( [ 'City', 'State', 'Population' ] )
>>>
>>> for row in reader:
...     writer.writerow( [ row[ 6 ], row[ 7 ], row[ 11 ] ] )
...
>>> inp.close()
>>> out.close()

This will produce an output file pop-summary.csv with the following data.

City,State,Population
Alabama,Alabama,4802740
Abbeville city,Alabama,2704
Adamsville city,Alabama,4525
Addison town,Alabama,754
Akron town,Alabama,348
Alabaster city,Alabama,30799
Albertville city,Alabama,21421
Alexander City city,Alabama,14876
Aliceville city,Alabama,2438

Upton town,Wyoming,1084
Balance of Weston County,Wyoming,2539

CSV Practice Problem

Write a Python program that finds the city with the largest population in pop.csv, then prints this city's name, the name of its state, and its population.

Hint. When you read data with a CSV reader, the column values it returns are all strings. You'll want to convert the population value from a string to an integer. To do this, you cast the string using the int() operation.

>>> for row in reader:
...     pop = int( row[ 11 ] )

I recommend you write your program using a text editor, save it as a source code file, and use execfile() to test it, rather than writing the program directly in the Python shell. This will let you write your code, run it to see what it does, edit it to fix problems, and run it again, without having to re-type the entire program at the command line.

Maximum Population Solution

>>> import csv
>>> inp = open( 'pop.csv', 'rb' )
>>> reader = csv.reader( inp )
>>> header = reader.next()
>>> max_pop = 0
>>> max_city_nm = "Unknown"
>>> max_st_nm = "Unknown"
>>> for row in reader:
...     pop = int( row[ 11 ] )
...     if row[ 6 ] != row[ 7 ]:
...         if pop > max_pop:
...             max_pop = pop
...             max_city_nm = row[ 6 ]
...             max_st_nm = row[ 7 ]
...
>>> inp.close()
>>> print max_city_nm, 'in state', max_st_nm, 'has population', max_pop
Los Angeles County in state California has population 9889056

You can download the solution file and run it on your machine, if you want.
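
As an alternative sketch (not the solution file itself), Python's built-in max() can do the comparison for us when given a key function. The example below runs on a handful of rows copied from the Census excerpt, so its answer is only the largest city in that small sample, not in the full pop.csv.

```python
import csv

# A few rows from the Census excerpt above; a list of strings stands in
# for pop.csv.
lines = [
    'SUMLEV,STATE,COUNTY,PLACE,COUSUB,CONCIT,NAME,STNAME,'
    'CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011',
    '040,01,000,00000,00000,00000,Alabama,Alabama,4779736,4779735,4785401,4802740',
    '162,01,000,00124,00000,00000,Abbeville city,Alabama,2688,2688,2689,2704',
    '162,01,000,00820,00000,00000,Alabaster city,Alabama,30352,30352,30473,30799',
    '162,01,000,00988,00000,00000,Albertville city,Alabama,21160,21160,21202,21421',
]

reader = csv.reader( lines )
header = next( reader )

# Skip state-level rows, where the place name equals the state name.
cities = [ row for row in reader if row[ 6 ] != row[ 7 ] ]

# max() with a key function compares rows by their integer population.
largest = max( cities, key = lambda row: int( row[ 11 ] ) )

print( largest[ 6 ] + ' in state ' + largest[ 7 ] + ' has population ' + largest[ 11 ] )
```

This version holds all the city rows in memory at once, which is fine for a file of this size. The loop in the solution above only ever keeps the current best row.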

Functions

It's possible to write a program as a single, long sequence of statements in the main module. Even for small programs, however, this isn't efficient. First, writing a program this way makes it difficult to examine and understand. Second, if you're performing a common operation on different variables, you need to duplicate the code every time you perform that operation.

For example, suppose we wanted to report the average of two numeric lists l and m. One obvious way to do it is to write two for loops.

>>> l = [ 1, 2, 3 ]
>>> m = [ 7, 8, 14 ]
>>> sum = 0
>>> for elem in l:
...     sum = sum + elem
...
>>> print float( sum ) / 3
2.0
>>> sum = 0
>>> for elem in m:
...     sum = sum + elem
...
>>> print float( sum ) / 3
9.66666666667

This has a number of problems, however. What if we had more than just two lists we wanted to average? We'd need to duplicate the for loop once for each list. What if we wanted to do something more complicated than calculating the average (e.g., what if we wanted population standard deviation instead)? The amount of code we'd need to duplicate would be much longer.

What we really want to do is to have some sort of avg() operation that we can call whenever we want to calculate the average of a numeric list.

>>> l = [ 1, 2, 3 ]
>>> m = [ 7, 8, 14 ]
>>> print avg( l )
2.0
>>> print avg( m )
9.66666666667

In Python we can define a function to create new operations like avg(). A function is defined using the keyword def, followed by the function's name, followed by an argument list in parentheses, and then a colon. The function's code block defines what the function does when it's called.

>>> def avg( num ):
...     sum = 0
...     for elem in num:
...         sum = sum + elem
...     return float( sum ) / len( num )
...

Functions can take zero or more arguments. A function with no arguments still needs open and close parentheses, def func():. A function with multiple arguments separates them with commas, def func( a, b ):.

Once a function is defined, it can be used anywhere, including in other functions. Suppose we now wanted to write a function stdev() to compute the population standard deviation of a numeric list. We can use our avg() function to help to do this.

>>> import math
>>>
>>> def stdev( num ):
...     sum = 0
...     num_avg = avg( num )
...     for elem in num:
...         sum = sum + ( ( elem - num_avg ) ** 2.0 )
...     return math.sqrt( sum / len( num ) )
...

What if we wanted to allow a user to decide whether to calculate population standard deviation or sample standard deviation? We could write two separate functions to do this, but an easier way is to add an argument to the stdev() function to specify which standard deviation to calculate.

>>> import math
>>>
>>> def stdev( num, pop = True ):
...     sum = 0
...     num_avg = avg( num )
...     for elem in num:
...         sum = sum + ( ( elem - num_avg ) ** 2.0 )
...     if pop == True:
...         return math.sqrt( sum / len( num ) )
...     else:
...         return math.sqrt( sum / ( len( num ) - 1 ) )
...

Now, stdev() takes a second argument pop. If pop is True, we return population standard deviation. Otherwise, we return sample standard deviation. Notice that in the function header we defined pop = True. This specifies a default value for pop. If the user doesn't specify a second argument, we return population standard deviation by default.

>>> l = [ 7, 8, 14 ]
>>> print 'Pop stdev of', l, 'is', stdev( l )
Pop stdev of [7, 8, 14] is 3.09120616517
>>> print 'Sample stdev of', l, 'is', stdev( l, False )
Sample stdev of [7, 8, 14] is 3.7859388972
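
An argument with a default value can also be passed by name, which often makes a call easier to read. The sketch below repeats avg() and stdev() so it can be run on its own. The local name total is used in place of sum so that Python's built-in sum() isn't hidden; that renaming is a choice made for this example, not part of the code above.

```python
import math

def avg( num ):
    # Average of a numeric list.
    total = 0
    for elem in num:
        total = total + elem
    return float( total ) / len( num )

def stdev( num, pop = True ):
    # Population standard deviation by default, sample standard
    # deviation when pop is False.
    num_avg = avg( num )
    total = 0
    for elem in num:
        total = total + ( elem - num_avg ) ** 2.0
    if pop:
        return math.sqrt( total / len( num ) )
    return math.sqrt( total / ( len( num ) - 1 ) )

l = [ 7, 8, 14 ]
pop_sd = stdev( l )                   # second argument omitted, so pop is True
sample_sd = stdev( l, pop = False )   # keyword form of stdev( l, False )

print( pop_sd )
print( sample_sd )
```

Note that the sample version divides by len( num ) - 1, so it will fail with a division by zero if the list has only one element.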

It's even possible for functions to call themselves. This is known as recursion. Consider the Fibonacci sequence, defined as:

  F(0) = 0
  F(1) = 1
  F(n) = F(n-1) + F(n-2), for n ≥ 2

Since Fibonacci numbers for n ≥ 2 are defined based on lower-order versions of themselves, they are a common candidate to demonstrate a recursive function.

>>> def fib( n ):
...     if n <= 0:
...         return 0
...     elif n == 1:
...         return 1
...     else:
...         return fib( n - 1 ) + fib( n - 2 )
...
>>> fib( 0 )
0
>>> fib( 2 )
1
>>> fib( 20 )
6765
>>> fib( 40 )
102334155

If you look at fib(), you should see intuitively that it's very expensive to execute. If you tried to calculate fib( 100 ), for example, you'd be waiting a long time for it to finish. That's because each number in a Fibonacci sequence requires two recursive calls, which themselves each require two recursive calls, and so on.
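
One common way to keep the recursive structure while avoiding the repeated work is to remember results that have already been computed, a technique usually called memoization. This is a sketch rather than part of the tutorial's code: the function name fib_memo and its cache argument are made up for illustration.

```python
def fib_memo( n, cache = None ):
    # cache maps n to F(n) for values that have already been computed.
    if cache is None:
        cache = {}
    if n <= 0:
        return 0
    if n == 1:
        return 1
    if n not in cache:
        # Each F(n) is computed once, then looked up on later requests.
        cache[ n ] = fib_memo( n - 1, cache ) + fib_memo( n - 2, cache )
    return cache[ n ]

print( fib_memo( 20 ) )
print( fib_memo( 40 ) )
print( fib_memo( 100 ) )
```

With the cache, the number of recursive calls grows roughly in proportion to n instead of doubling at every level, so fib_memo( 100 ) returns immediately.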

Fibonacci, Rabbits, and Efficiency

Although the origins of Fibonacci numbers are credited to Indian mathematics, the number series is named after Leonardo of Pisa, also known as Fibonacci. In 1202 Fibonacci posed a question about an idealized rabbit population.

Fibonacci wondered, "How many pairs of rabbits would you have after n months passed?"

The number of rabbit pairs at the end of each month forms a numeric sequence that we now call the Fibonacci number series.

If we need to compute large Fibonacci numbers, the recursive formula is too inefficient. Instead, we use Binet's Formula to calculate F(n) directly.

>>> import math
>>>
>>> def fib_e( n ):
...     if n <= 0:
...         return 0
...     elif n == 1:
...         return 1
...     else:
...         sum = ( 1.0 + math.sqrt( 5 ) ) ** n
...         sum = sum - ( 1.0 - math.sqrt( 5 ) ) ** n
...         sum = sum / ( 2.0 ** n * math.sqrt( 5 ) )
...         return round( sum )
...

If you're curious, F(100) = 3.542248481792631e+20. It would take about 4 years for rabbits to outnumber humans based on Fibonacci's scenario.
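
Because fib_e() works with floating point numbers, its result for large n is an approximation. If you need every digit, a simple loop using Python's integers gives the exact value. The sketch below compares the two; fib_iter is a name made up for this example, and fib_e is repeated (with sum renamed to total) so the code runs on its own.

```python
import math

def fib_e( n ):
    # Binet's Formula, evaluated with floating point arithmetic.
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    root5 = math.sqrt( 5 )
    total = ( 1.0 + root5 ) ** n - ( 1.0 - root5 ) ** n
    return round( total / ( 2.0 ** n * root5 ) )

def fib_iter( n ):
    # Step through the sequence with exact integer arithmetic.
    a, b = 0, 1
    for _ in range( n ):
        a, b = b, a + b
    return a

print( fib_e( 10 ) )
print( fib_iter( 10 ) )
print( fib_iter( 100 ) )

# The floating point result is close to, but not exactly, F(100).
exact = fib_iter( 100 )
approx = fib_e( 100 )
rel_err = abs( float( approx ) - float( exact ) ) / float( exact )
```

The exact value of F(100) is 354224848179261915075, which has more digits than a floating point number can hold, so fib_e( 100 ) can only agree with it in the leading digits.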

NumPy

NumPy (numerical Python, hereafter numpy, pronounced num-pie) is a library that provides advanced mathematical operations involving statistics and linear algebra. Our interest is mainly in numpy's statistical capabilities, since this will allow us to calculate things like mean, variance, minimum, maximum, correlation, and covariance on lists of numbers.

numpy's standard data type is an array: a sequence of numbers, all of the same type. Arrays can be created in numerous ways. Common examples include:

>>> import numpy
>>>
>>> arr = numpy.array( [ 2, 3, 4 ] )
>>> print arr
[2 3 4]
>>> print type( arr )
<type 'numpy.ndarray'>
>>> arr = numpy.zeros( 5 )
>>> print arr
[0. 0. 0. 0. 0.]
>>> arr = numpy.arange( 10, 30, 5 )
>>> print arr
[10 15 20 25]

numpy also supports multidimensional arrays. For example, a table is a 2-dimensional (2D) array with rows and columns. A data cube is a 3-dimensional array with rows, columns, and slices. We restrict ourselves to 1D and 2D arrays in this tutorial. The easiest way to define a 2D array in numpy is to provide a list of equal-length sublists to the array() operator, one sublist for each row in the array.

>>> import numpy
>>>
>>> arr = numpy.array( [ [ 1, 2 ], [ 4, 5 ], [ 7, 8 ] ] )
>>> print arr
[[1 2]
 [4 5]
 [7 8]]
>>> print arr[ 1 ]
[4 5]
>>> print arr[ 1, 1 ]
5
>>> print arr[ -1 ]
[7 8]
>>> print arr[ 1: 3 ]
[[4 5]
 [7 8]]
>>> print arr[ 0: 2, 1 ]
[2 5]
>>>
>>> print arr.shape
(3, 2)
>>> arr = arr.reshape( 2, 3 )
>>> print arr
[[1 2 4]
 [5 7 8]]
>>> print arr.shape
(2, 3)

numpy provides access to elements of an array using the standard indexing operator [ ]. Negative indices and slicing can be used, similar to Python lists. It's also possible to ask for the shape of an array using its shape attribute, which is a tuple: a single value (the number of elements) for a 1D array, or the number of rows and columns for a 2D array. It's even possible to reshape an array using reshape(), rearranging the array's values into a new (row, column) configuration.

As mentioned above, one of the main advantages of using numpy is access to a number of statistical operations. A few common examples are listed below. A full list of numpy's statistical operators is available online.

>>> import numpy
>>>
>>> arr = numpy.arange( 0, 500, 2 )
>>> print arr.size
250
>>> hist = numpy.histogram( arr, 10 )
>>> print hist
(array([25, 25, 25, 25, 25, 25, 25, 25, 25, 25]), array([ 0. , 49.8, 99.6, 149.4, 199.2, 249. , 298.8, 348.6, 398.4, 448.2, 498. ]))
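
To make the histogram output above more concrete, here is a plain Python sketch of equal-width binning. It is not how numpy implements histogram(), and the helper name equal_width_counts is made up; it is only meant to show where the counts and bin edges in this example come from. It assumes the values are not all identical, so the bin width is not zero.

```python
def equal_width_counts( values, bins ):
    # Split the range [min, max] into equal-width bins and count
    # how many values fall into each one.
    lo = min( values )
    hi = max( values )
    width = float( hi - lo ) / bins
    counts = [ 0 ] * bins
    for v in values:
        # A value equal to the maximum is placed in the last bin.
        i = min( int( ( v - lo ) / width ), bins - 1 )
        counts[ i ] += 1
    # bins + 1 edges mark the start and end of every bin.
    edges = [ lo + k * width for k in range( bins + 1 ) ]
    return counts, edges

values = list( range( 0, 500, 2 ) )    # same values as numpy.arange( 0, 500, 2 )
counts, edges = equal_width_counts( values, 10 )

print( counts )
print( edges )
```

The first list corresponds to the first array numpy returns (values per bin), and the second corresponds to the array of bin edges.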

(Another use for numpy is to perform linear algebra operations. Although less common in the analytics program, numpy's ability to compute eigenvectors, invert matrices, or solve systems of equations is very powerful.)

pandas

The pandas (Python Data Analysis) library builds on numpy, offering an extended set of data manipulation and analysis tools. pandas is built on a few basic data types (or data structures, as they're called in pandas), together with operations on data stored using these types.

Series

One of pandas's fundamental data types is a Series, a 1D labelled array. You can think of this as a numpy array with an explicit label attached to each data value. The collection of labels is called the data's index.

A Series can be created in numerous ways: from Python lists, from a numpy array, or even from a Python dictionary.

>>> import numpy
>>> import pandas
>>>
>>> s = pandas.Series( [ 1, 2, 3 ], [ 'a', 'b', 'c' ] )
>>> print s
a   1
b   2
c   3
dtype: int64
>>>
>>> d = { 'a': 3.14, 'b': 6.29, 'x': -1.34 }
>>> t = pandas.Series( d )
>>> print t
a    3.14
x   -1.34
b    6.29
dtype: float64
>>>
>>> a = numpy.array( [ 4.5, 5.5, 6.5 ] )
>>> u = pandas.Series( a )
>>> print u
0   4.5
1   5.5
2   6.5
dtype: float64

Data in a Series can be queried using numeric indices and slicing, just like with Python lists and numpy arrays. It can also be accessed using index labels, similar to a Python dictionary.

>>> import numpy
>>> import pandas
>>>
>>> s = pandas.Series( [ 1, 2, 3 ], [ 'a', 'b', 'c' ] )
>>> print s[ 1 ]
2
>>> print s[ 'c' ]
3
>>> print s[ 1: 3 ]
b   2
c   3
dtype: int64

More importantly, we can index by applying a conditional operation to every data element in a Series. This returns a new boolean Series with the result of applying the conditional (True or False) at each element position. The boolean Series is then used to select only those data elements that passed the conditional. For example, suppose we wanted to select the elements in a Series whose values were greater than 2, but less than 5.

>>> import numpy
>>> import pandas
>>>
>>> s = pandas.Series( [ 1, 2, 3, 4, 5 ] )
>>> idx = ( s > 2 ) & ( s < 5 )
>>> print idx
0   False
1   False
2    True
3    True
4   False
dtype: bool
>>> s_sub = s[ idx ]
>>> print s_sub
2   3
3   4
dtype: int64

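This is how pandas interprets these commands.

  1. s > 2 compares every element of s to 2, producing a boolean Series.
  2. s < 5 does the same for the upper bound.
  3. The & operator combines the two boolean Series element by element, so idx is True only where both conditions hold.
  4. s[ idx ] returns a new Series containing only the elements of s where idx is True, keeping their original index labels.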
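
One way to picture the steps is to mimic them with ordinary Python lists. This is not pandas code: two plain lists stand in for the Series and its index labels, purely to illustrate the idea of building a boolean mask and then selecting with it.

```python
values = [ 1, 2, 3, 4, 5 ]
labels = [ 0, 1, 2, 3, 4 ]    # stand-in for the default integer index

# Apply each condition to every element, giving two lists of booleans.
gt2 = [ v > 2 for v in values ]
lt5 = [ v < 5 for v in values ]

# Combine the two masks position by position.
idx = [ a and b for a, b in zip( gt2, lt5 ) ]

# Keep only the (label, value) pairs whose mask entry is True.
s_sub = [ ( lab, v ) for lab, v, keep in zip( labels, values, idx ) if keep ]

print( idx )
print( s_sub )
```

Notice that the selected values keep their original labels (2 and 3), just as s_sub does in the pandas example.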

Because pandas data have labels, we can perform operations that use data alignment. pandas will look at the variables involved in an operation, and automatically "match up" data elements with common labels.

>>> import numpy
>>> import pandas
>>>
>>> s = pandas.Series( [ 1, 2, 3 ], [ 'a', 'b', 'c' ] )
>>> t = pandas.Series( { 'a': 3.14, 'b': 6.29, 'x': -1.34 } )
>>> print s + t
a    4.14
b    8.29
c     NaN
x     NaN
dtype: float64

When we apply s + t, pandas sees data with labels a and b in both variables, so it knows to add those entries together. If a label appears in only one of the Series (c and x here), the result for that label is undefined, and is set to NaN (not a number).
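
The alignment rule can be sketched with plain dictionaries. This is not pandas code, and the helper add_aligned is made up for illustration; dictionaries stand in for the two Series, with keys playing the role of index labels.

```python
def add_aligned( left, right ):
    # Add values that share a label; labels found on only one side get NaN.
    result = {}
    for label in sorted( set( left ) | set( right ) ):
        if label in left and label in right:
            result[ label ] = left[ label ] + right[ label ]
        else:
            result[ label ] = float( 'nan' )
    return result

s = { 'a': 1, 'b': 2, 'c': 3 }
t = { 'a': 3.14, 'b': 6.29, 'x': -1.34 }
total = add_aligned( s, t )

for label in sorted( total ):
    print( label + ' ' + str( total[ label ] ) )
```

The result has one entry for every label that appears in either input, which matches the four rows pandas printed for s + t.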

DataFrame

A DataFrame is the second fundamental data type in pandas. A DataFrame is a 2D table whose columns and rows are both labelled. You can think of a DataFrame as a table of Series arrays, one for each column in the DataFrame. Row labels are still referred to as the index, and column labels are simply called the columns.

>>> import numpy
>>> import pandas
>>>
>>> d = [ [ 1, 2 ], [ 4, 5 ], [ 7, 8 ] ]
>>> df = pandas.DataFrame( d, index = [ 'a', 'b', 'c' ], columns = [ 'C1', 'C2' ] )
>>> print df
   C1  C2
a   1   2
b   4   5
c   7   8

Indexing is more complicated with DataFrames, since there are two separate dimensions: the columns and the rows. The following operators are used to index a DataFrame variable.

>>> import numpy
>>> import pandas
>>>
>>> d = [ [ 1, 2 ], [ 4, 5 ], [ 7, 8 ] ]
>>> df = pandas.DataFrame( d, index = [ 'a', 'b', 'c' ], columns = [ 'C1', 'C2' ] )
>>> print df[ 'C2' ]
a   2
b   5
c   8
Name: C2, dtype: int64
>>> print df[ 'C1' ][ 'b' ]
4
>>> print df.loc[ 'b' ]
C1   4
C2   5
Name: b, dtype: int64
>>> print df.iloc[ 2 ]
C1   7
C2   8
Name: c, dtype: int64

Slicing DataFrames

It's possible to use slicing to return multiple rows (and/or columns), rather than a single result. This must be done in a particular way to properly handle and differentiate between rows and columns, however, making it more complex than slicing lists or arrays.

>>> import numpy
>>> import pandas
>>>
>>> d = [ [ 1, 2, -1 ], [ 4, 5, -2 ], [ 7, 8, -3 ] ]
>>> df = pandas.DataFrame( d, index = [ 'a', 'b', 'c' ], columns = [ 'C1', 'C2', 'C3' ] )
>>> print df[ 'b': 'c' ]
   C1  C2  C3
b   4   5  -2
c   7   8  -3
>>> print df.loc[ 'a': 'b' ]
   C1  C2  C3
a   1   2  -1
b   4   5  -2
>>> print df.loc[ :, 'C2': 'C3' ]
   C2  C3
a   2  -1
b   5  -2
c   8  -3
>>> print df.loc[ 'a': 'b', 'C1': 'C2' ]
   C1  C2
a   1   2
b   4   5

You can insert and delete rows and columns from a DataFrame, or change values at specific index positions, using the same indexing operations.

>>> import numpy
>>> import pandas
>>>
>>> d = [ [ 1, 2 ], [ 4, 5 ], [ 7, 8 ] ]
>>> df = pandas.DataFrame( d, index = [ 'a', 'b', 'c' ], columns = [ 'C1', 'C2' ] )
>>>
>>> s = pandas.Series( { 'C1': 9, 'C2': 11 } )
>>> s.name = 'd'
>>> df = df.append( s )
>>> print df
  C1  C2
a  1   2
b  4   5
c  7   8
d  9  11
>>>
>>> df[ 'C3' ] = df[ 'C1' ] + df[ 'C2' ]
>>> df[ 'C1' ][ 'a' ] = 99
>>>
>>> print df
   C1  C2  C3
a  99   2   3
b   4   5   9
c   7   8  15
d   9  11  20

Maximum Population Revisited

To better exemplify the power of pandas, recall the previous problem of identifying the city with the maximum estimated 2011 population from a Census Bureau CSV file. Using pandas, we could identify the city with the largest population using only a few lines of code.

>>> import numpy
>>> import pandas
>>>
>>> pop = pandas.read_csv( 'pop.csv' )
>>> pop.describe()
            SUMLEV        STATE  …  POPESTIMATE2011
count 81746.000000 81746.000000  …     81746.000000
mean    114.380936    30.592849  …     15929.926565
std      47.934468    13.416473  …    244346.744241
min      40.000000     1.000000  …         0.000000
25%      61.000000    19.000000  …       339.000000
50%     157.000000    29.000000  …      1209.000000
75%     157.000000    41.000000  …      5041.000000
max     172.000000    56.000000  …  37691912.000000
>>>
>>> idx = pop[ 'NAME' ] != pop[ 'STNAME' ]
>>> city_pop = pop[ idx ]
>>> i = city_pop[ 'POPESTIMATE2011' ].idxmax()
>>>
>>> nm = pop[ 'NAME' ][ i ]
>>> state = pop[ 'STNAME' ][ i ]
>>> max_pop = pop[ 'POPESTIMATE2011' ][ i ]
>>> print nm, 'in state', state, 'has population', max_pop
Los Angeles County in state California has population 9889056

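This is how pandas interprets these commands.

  1. read_csv() reads pop.csv into a DataFrame, using the file's header line as the column labels.
  2. describe() reports summary statistics for each numeric column.
  3. pop[ 'NAME' ] != pop[ 'STNAME' ] builds a boolean Series that is False for state-level rows, and pop[ idx ] keeps only the rows where it is True.
  4. idxmax() returns the index label of the row with the largest POPESTIMATE2011 value.
  5. That index label is then used to look up the city's name, state, and population.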

Web Parsing

The ability to read and parse web page content is often very useful, since data is commonly available on a web page, but in a format that's not convenient to simply copy and paste.

Two Python libraries are used to simplify web scraping: urllib2 and BeautifulSoup 4. urllib2 supports posting HTTP requests to a web server. In our case, this is normally a request for HTML content of a target web page. BeautifulSoup parses HTML into a parse tree, then provides operations that allow you to find specific information within the tree, for example, all the HTML links, or text attached to an HTML ID in the document.

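These are the steps you would follow to read and parse a web page.

  1. Use urllib2.urlopen() to open the web page's URL.
  2. Use read() to retrieve the page's HTML.
  3. Pass the HTML to BeautifulSoup() to build a parse tree.
  4. Search the parse tree for the elements you're interested in.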

This code will read the HTML for this tutorial, then return all the HTML links in the document.

>>> import urllib2
>>> from bs4 import BeautifulSoup
>>>
>>> url = urllib2.urlopen( 'http://www.csc.ncsu.edu/faculty/healey/msa-17/python/index.html' )
>>> doc = url.read()
>>> tree = BeautifulSoup( doc )
>>> links = tree.find_all( 'a' )
>>> for l in links:
...     print l.get( 'href' )
...
http://www.csc.ncsu.edu/faculty/healey
https://www.ncsu.edu
http://www.python.org

http://www.crummy.com/software/BeautifulSoup

Once you have a parse tree, most of your work will involve navigating the tree to find the things you're looking for. BeautifulSoup provides numerous operations to search within the tree. This can be done by HTML tag, by ID, by string value, and so on. This requires work on your part, however.

For example, suppose you parsed this HTML page using BeautifulSoup into a variable called tree. You could search for the div with a specific class, then for the span with a specific id inside that div, and retrieve its text as follows.

>>> import urllib2
>>> from bs4 import BeautifulSoup
>>>
>>> url = urllib2.urlopen( 'http://www.csc.ncsu.edu/faculty/healey/msa-17/python/index.html' )
>>> doc = url.read()
>>> tree = BeautifulSoup( doc )
>>>
>>> div = tree.find( 'div', { 'class': 'footer' } )
>>> div
<div class="footer">Updated <span id="mod-date">01-Jan-01</span></div>
>>> span = div.find( 'span', { 'id': 'mod-date' } )
>>> span
<span id="mod-date">01-Jan-01</span>
>>> span.getText()
u'01-Jan-01'

Understanding the structure of a web page can't be done for you. It's something you'll need to complete as a first step for extracting data from the web page. Perhaps more importantly, if a web page doesn't have any useful structure, then it might be difficult (or impossible) to isolate the data you're interested in. Usually, however, web pages are built with a logical design. In these cases, web scraping can save you significant time and effort over manually copying or re-typing individual data values.
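
To experiment with the idea of using a page's structure without a network connection or any extra libraries, here is a sketch that uses the HTMLParser class from Python's standard library instead of BeautifulSoup. The HTML string is made up to resemble the footer and links shown above, and the class name PageScanner is hypothetical. The import is written so it should work under both Python 2.7 and Python 3.

```python
try:
    from html.parser import HTMLParser    # Python 3
except ImportError:
    from HTMLParser import HTMLParser     # Python 2.7

class PageScanner( HTMLParser ):
    # Collects every link target and the text of the span with id "mod-date".
    def __init__( self ):
        HTMLParser.__init__( self )
        self.links = []
        self.mod_date = None
        self.in_mod_date = False

    def handle_starttag( self, tag, attrs ):
        attrs = dict( attrs )             # attrs arrives as (name, value) pairs
        if tag == 'a' and 'href' in attrs:
            self.links.append( attrs[ 'href' ] )
        elif tag == 'span' and attrs.get( 'id' ) == 'mod-date':
            self.in_mod_date = True

    def handle_endtag( self, tag ):
        if tag == 'span':
            self.in_mod_date = False

    def handle_data( self, data ):
        # Text is only kept while we are inside the mod-date span.
        if self.in_mod_date:
            self.mod_date = data

# Sample HTML assumed for illustration, not fetched from a real page.
html = ( '<p><a href="http://www.python.org">Python</a> and '
         '<a href="http://www.crummy.com/software/BeautifulSoup">BeautifulSoup</a></p>'
         '<div class="footer">Updated <span id="mod-date">01-Jan-01</span></div>' )

scanner = PageScanner()
scanner.feed( html )
scanner.close()

for link in scanner.links:
    print( link )
print( scanner.mod_date )
```

The parser has to be told exactly which tags and attributes to look for, which is the same "work on your part" described above. BeautifulSoup's find() and find_all() cover this kind of search with far less code.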