Skip to article frontmatterSkip to article content

Chapter 6.1 - Pandas introduction

Pandas is a package that can read in, analyze, process, and modify tabular data (e.g., csv files)

Pandas examples

Reading in csv files:

Much like numpy and matplotlib, we just need to import pandas to access its methods. The typical practice is to import it as the alias pd:

import pandas as pd

Download csv file

!wget -nc "https://arcus.niu.edu/btr_month_tmean.csv"
File ‘btr_month_tmean.csv’ already there; not retrieving.

Chapter 6.2.1 - Reading a file

The most basic method provided by pandas is the read_csv method which takes a filename as an argument.

In the file explorer tab, you will see a file named btr_month_tmean.csv after you run the download code above. We will use this file for the examples in this notebook.

First, to read in the file, you use the following code. The filename is either a relative path (the file is in this folder or a subfolder) or an absolute path (fake example: C:/Users/Test/Example/Etc./etc./file).

import pandas as pd


df = pd.read_csv("btr_month_tmean.csv")

df # just typing df will print out the data
Loading...

Chapter 6.1.2 - pandas DataFrame

There are multiple ways to access specific data in a DataFrame. In the case of these data, you may want to access an entire year of data. This can be done by providing the index to a pandas method called ```loc``. What is an index?

We can print the index using the following code:

df.index.values
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90])

As you can see, the indexes are just a list of numbers between 0 and 90. By default, pandas will automatically assign these ordered indexes when a csv file is read in. Each index will be unique, and the maximum value corresponds to the number of rows minus 1 (since we start at 0).

If we want to access the row at index 0, we can just use the ```loc`` method (short for ‘location’):

df.loc[0]
YEAR 1930.00 JAN 50.84 FEB 60.14 MAR 58.53 APR 70.93 MAY NaN JUN 81.12 JUL 82.24 AUG 81.89 SEP 78.52 OCT 67.93 NOV 59.67 DEC 49.60 Name: 0, dtype: float64

As you can see, this prints out all of the monthly data from the first row only (which happens to correspond with all monthly values in 1930, since 1930 was the first row in the csv file)

This approach is somewhat useful, but it is more useful to explicitly set our indexes in a way that makes sense in for our domain, problem, etc.

In this case, the YEAR column makes more sense to use as an index. We can tell pandas to use it as the index by adding an argument to the read_csv method called index_col. This argument takes a string that corresponds to the name of the column you want to use as the index:

import pandas as pd

df = pd.read_csv("btr_month_tmean.csv", index_col='YEAR')

df.loc[1930]
JAN 50.84 FEB 60.14 MAR 58.53 APR 70.93 MAY NaN JUN 81.12 JUL 82.24 AUG 81.89 SEP 78.52 OCT 67.93 NOV 59.67 DEC 49.60 Name: 1930, dtype: float64

Now, we can access the data in a more understandable way (much like we did with dictionaries). In the case above, we simply used 1930 as our desired key, and the DataFrame (which we specified should use YEAR as its index) returns the row associated with that key.

Chapter 6.1.3 - Practice with DataFrames

Answer the following questions using the following pandas DataFrame and only the ```loc`` method.

import pandas as pd
import numpy as np

df = pd.read_csv("btr_month_tmean.csv", index_col='YEAR')


### What was the temperature in June of 1992?


### What was the temperature in January of 2001?


### Which August was warmer? 2015 or 2018?


### What was the temperature for the year and month you were born?

Chapter 6.1.4 - Accessing column values

We access specific columns in a pandas DataFrame in a way that is very similar to dictionaries.

One you read in your csv file and the DataFrame is created, you can access a column using square brackets and the column name:

import pandas as pd

df = pd.read_csv("btr_month_tmean.csv", index_col='YEAR')

df['AUG']
YEAR 1930 81.89 1931 79.55 1932 82.40 1933 81.47 1934 81.52 ... 2016 83.29 2017 81.65 2018 82.60 2019 84.26 2020 83.71 Name: AUG, Length: 91, dtype: float64

This result is the temperature for every August from 1930 to 2020.

Just for fun, lets try to plot this!

import matplotlib.pyplot as plt

plt.plot(df['AUG'])

plt.show()
<Figure size 640x480 with 1 Axes>

Chapter 6.1.5 - pandas plotting

Pandas DataFrame objects work very well with matplotlib without having to think too much about it! The column that you accessed had index information in the form of a YEAR which was automatically interpreted by matplotlib to be the x-axis value. You can be more explicit about it by using the values method, which converts pandas information into a numpy array.

import matplotlib.pyplot as plt

x = df.index.values

y = df['AUG'].values

plt.plot(x, y)

plt.show()

print("X values=", x)
print("Y values=", y)
<Figure size 640x480 with 1 Axes>
X values= [1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943
 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957
 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971
 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985
 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
 2014 2015 2016 2017 2018 2019 2020]
Y values= [81.89 79.55 82.4  81.47 81.52 82.58 81.82 81.61 81.93 81.6  80.32 82.98
 81.97 83.73 82.03 81.36 81.56 83.13 81.31 80.39 80.63 85.21 82.85 81.6
 83.11 81.89 81.68 81.81 81.79 82.1  80.31 78.03 84.35 82.45 82.02 79.58
 79.97 80.05 81.65 80.9  81.94 81.66 82.81 80.85 80.42 80.97 80.89 81.26
 81.68 80.85 82.16 82.58 82.42 82.13 79.81 81.92 82.03 83.06 82.94 82.47
 82.9  82.1  79.82 82.79 80.55 83.53 79.97 81.16 83.44 85.05 84.27 81.03
 81.66 82.76 80.92 83.87 83.85 85.66 82.6  81.94 84.29 86.32 82.16 81.81
 82.4  83.21 83.29 81.65 82.6  84.26 83.71]

Chapter 6.1.6 - Indexing practice

Combining the index with the column names to access specific values

Get into groups and answer the following questions more specifically than before:

import pandas as pd
import numpy as np

df = pd.read_csv("btr_month_tmean.csv", index_col='YEAR')

### What was the temperature in June of 1992?
month = 'June'
year = 1992

result = np.nan 

print(f"The temperature in {month} of {year} was {result:.2f}")

### What was the temperature in January of 2001?
month = 'January'
year = 2001

result = np.nan 

print(f"The temperature in {month} of {year} was {result:.2f}")

### Which August was warmer? 2015 or 2018?
month = 'August'
year1 = 2015
year2 = 2018

result1 = np.nan
result2 = np.nan

print(f"The temperature in {month} of {year1} was {result1:.2f}")

print(f"The temperature in {month} of {year2} was {result2:.2f}")

### What was the temperature for the year and month you were born?

birth_month = np.nan
birth_year = np.nan

result = np.nan

print(f"The temperature in {birth_month} of {birth_year} was {result:.2f}")
The temperature in June of 1992 was nan
The temperature in January of 2001 was nan
The temperature in August of 2015 was nan
The temperature in August of 2018 was nan
The temperature in nan of nan was nan

Chapter 6.1.7 - Data summaries using pandas

We can use the describe method to examine summary statistics of data within a DataFrame

import pandas as pd

df = pd.read_csv("btr_month_tmean.csv", index_col='YEAR')

df['AUG'].describe()
count 91.000000 mean 82.060989 std 1.415806 min 78.030000 25% 81.285000 50% 81.940000 75% 82.830000 max 86.320000 Name: AUG, dtype: float64

This is equivalent to using numpy, but much easier!

import pandas as pd

df = pd.read_csv("btr_month_tmean.csv", index_col='YEAR')

print("Numpy version")

print("count", len(df['AUG']))

print("mean", np.mean(df['AUG']))
print("std", np.std(df['AUG']))
print("min", np.min(df['AUG']))
print("25%", np.percentile(df['AUG'], 25))
print("50%", np.percentile(df['AUG'], 50))
print("75%", np.percentile(df['AUG'], 75))
print("max", np.max(df['AUG']))
Numpy version
count 91
mean 82.06098901098902
std 1.4080054125029076
min 78.03
25% 81.285
50% 81.94
75% 82.83
max 86.32

Chatper 6.1.8 - Climatology with pandas

Get into groups and answer the following climatology questions. Provide code that proves your point.

How does the June temperature in 1992 compare to a typical June temperature?

# your code here

Was the January temperature in 2001 above the 75th percentile?

# your code here

How many standard deviations was August 2015 and August 2018 from the mean? Specify whether or not they were above or below the mean.

# your code here

How does the temperature during the month and year you were born compare to a typical temperature during that month?

# your code here

Create a scatterplot that examines the correlation between June and July temperatures

import matplotlib.pyplot as plt
import numpy as np

june = np.nan
july = np.nan

### use june as the x value and july as the y value for the scatterplot.