Pandas is a package that can read in, analyze, process, and modify tabular data (e.g., csv files)
Reading in csv files:¶
Much like numpy and matplotlib, we just need to import pandas to access its methods. The typical practice is to import it as the alias pd:
import pandas as pdDownload csv file¶
!wget -nc "https://arcus.niu.edu/btr_month_tmean.csv"File ‘btr_month_tmean.csv’ already there; not retrieving.
Chapter 6.2.1 - Reading a file¶
The most basic method provided by pandas is the read_csv method which takes a filename as an argument.
In the file explorer tab, you will see a file named btr_month_tmean.csv after you run the download code above. We will use this file for the examples in this notebook.
First, to read in the file, you use the following code. The filename is either a relative path (the file is in this folder or a subfolder) or an absolute path (fake example: C:/Users/Test/Example/Etc./etc./file).
import pandas as pd
df = pd.read_csv("btr_month_tmean.csv")
df # just typing df will print out the dataChapter 6.1.2 - pandas DataFrame¶
There are multiple ways to access specific data in a DataFrame. In the case of these data, you may want to access an entire year of data. This can be done by providing the index to a pandas method called ```loc``. What is an index?
We can print the index using the following code:
df.index.valuesarray([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
85, 86, 87, 88, 89, 90])As you can see, the indexes are just a list of numbers between 0 and 90. By default, pandas will automatically assign these ordered indexes when a csv file is read in. Each index will be unique, and the maximum value corresponds to the number of rows minus 1 (since we start at 0).
If we want to access the row at index 0, we can just use the ```loc`` method (short for ‘location’):
df.loc[0]YEAR 1930.00
JAN 50.84
FEB 60.14
MAR 58.53
APR 70.93
MAY NaN
JUN 81.12
JUL 82.24
AUG 81.89
SEP 78.52
OCT 67.93
NOV 59.67
DEC 49.60
Name: 0, dtype: float64As you can see, this prints out all of the monthly data from the first row only (which happens to correspond with all monthly values in 1930, since 1930 was the first row in the csv file)
This approach is somewhat useful, but it is more useful to explicitly set our indexes in a way that makes sense in for our domain, problem, etc.
In this case, the YEAR column makes more sense to use as an index. We can tell pandas to use it as the index by adding an argument to the read_csv method called index_col. This argument takes a string that corresponds to the name of the column you want to use as the index:
import pandas as pd
df = pd.read_csv("btr_month_tmean.csv", index_col='YEAR')
df.loc[1930]JAN 50.84
FEB 60.14
MAR 58.53
APR 70.93
MAY NaN
JUN 81.12
JUL 82.24
AUG 81.89
SEP 78.52
OCT 67.93
NOV 59.67
DEC 49.60
Name: 1930, dtype: float64Now, we can access the data in a more understandable way (much like we did with dictionaries). In the case above, we simply used 1930 as our desired key, and the DataFrame (which we specified should use YEAR as its index) returns the row associated with that key.
Chapter 6.1.3 - Practice with DataFrames¶
Answer the following questions using the following pandas DataFrame and only the ```loc`` method.
import pandas as pd
import numpy as np
df = pd.read_csv("btr_month_tmean.csv", index_col='YEAR')
### What was the temperature in June of 1992?
### What was the temperature in January of 2001?
### Which August was warmer? 2015 or 2018?
### What was the temperature for the year and month you were born?
Chapter 6.1.4 - Accessing column values¶
We access specific columns in a pandas DataFrame in a way that is very similar to dictionaries.
One you read in your csv file and the DataFrame is created, you can access a column using square brackets and the column name:
import pandas as pd
df = pd.read_csv("btr_month_tmean.csv", index_col='YEAR')
df['AUG']YEAR
1930 81.89
1931 79.55
1932 82.40
1933 81.47
1934 81.52
...
2016 83.29
2017 81.65
2018 82.60
2019 84.26
2020 83.71
Name: AUG, Length: 91, dtype: float64This result is the temperature for every August from 1930 to 2020.
Just for fun, lets try to plot this!
import matplotlib.pyplot as plt
plt.plot(df['AUG'])
plt.show()
Chapter 6.1.5 - pandas plotting¶
Pandas DataFrame objects work very well with matplotlib without having to think too much about it! The column that you accessed had index information in the form of a YEAR which was automatically interpreted by matplotlib to be the x-axis value. You can be more explicit about it by using the values method, which converts pandas information into a numpy array.
import matplotlib.pyplot as plt
x = df.index.values
y = df['AUG'].values
plt.plot(x, y)
plt.show()
print("X values=", x)
print("Y values=", y)
X values= [1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943
1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957
1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971
1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985
1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
2014 2015 2016 2017 2018 2019 2020]
Y values= [81.89 79.55 82.4 81.47 81.52 82.58 81.82 81.61 81.93 81.6 80.32 82.98
81.97 83.73 82.03 81.36 81.56 83.13 81.31 80.39 80.63 85.21 82.85 81.6
83.11 81.89 81.68 81.81 81.79 82.1 80.31 78.03 84.35 82.45 82.02 79.58
79.97 80.05 81.65 80.9 81.94 81.66 82.81 80.85 80.42 80.97 80.89 81.26
81.68 80.85 82.16 82.58 82.42 82.13 79.81 81.92 82.03 83.06 82.94 82.47
82.9 82.1 79.82 82.79 80.55 83.53 79.97 81.16 83.44 85.05 84.27 81.03
81.66 82.76 80.92 83.87 83.85 85.66 82.6 81.94 84.29 86.32 82.16 81.81
82.4 83.21 83.29 81.65 82.6 84.26 83.71]
Chapter 6.1.6 - Indexing practice¶
Combining the index with the column names to access specific values
Get into groups and answer the following questions more specifically than before:
import pandas as pd
import numpy as np
df = pd.read_csv("btr_month_tmean.csv", index_col='YEAR')
### What was the temperature in June of 1992?
month = 'June'
year = 1992
result = np.nan
print(f"The temperature in {month} of {year} was {result:.2f}")
### What was the temperature in January of 2001?
month = 'January'
year = 2001
result = np.nan
print(f"The temperature in {month} of {year} was {result:.2f}")
### Which August was warmer? 2015 or 2018?
month = 'August'
year1 = 2015
year2 = 2018
result1 = np.nan
result2 = np.nan
print(f"The temperature in {month} of {year1} was {result1:.2f}")
print(f"The temperature in {month} of {year2} was {result2:.2f}")
### What was the temperature for the year and month you were born?
birth_month = np.nan
birth_year = np.nan
result = np.nan
print(f"The temperature in {birth_month} of {birth_year} was {result:.2f}")The temperature in June of 1992 was nan
The temperature in January of 2001 was nan
The temperature in August of 2015 was nan
The temperature in August of 2018 was nan
The temperature in nan of nan was nan
Chapter 6.1.7 - Data summaries using pandas¶
We can use the describe method to examine summary statistics of data within a DataFrame
import pandas as pd
df = pd.read_csv("btr_month_tmean.csv", index_col='YEAR')
df['AUG'].describe()count 91.000000
mean 82.060989
std 1.415806
min 78.030000
25% 81.285000
50% 81.940000
75% 82.830000
max 86.320000
Name: AUG, dtype: float64This is equivalent to using numpy, but much easier!
import pandas as pd
df = pd.read_csv("btr_month_tmean.csv", index_col='YEAR')
print("Numpy version")
print("count", len(df['AUG']))
print("mean", np.mean(df['AUG']))
print("std", np.std(df['AUG']))
print("min", np.min(df['AUG']))
print("25%", np.percentile(df['AUG'], 25))
print("50%", np.percentile(df['AUG'], 50))
print("75%", np.percentile(df['AUG'], 75))
print("max", np.max(df['AUG']))Numpy version
count 91
mean 82.06098901098902
std 1.4080054125029076
min 78.03
25% 81.285
50% 81.94
75% 82.83
max 86.32
Chatper 6.1.8 - Climatology with pandas¶
Get into groups and answer the following climatology questions. Provide code that proves your point.
How does the June temperature in 1992 compare to a typical June temperature?
# your code hereWas the January temperature in 2001 above the 75th percentile?
# your code hereHow many standard deviations was August 2015 and August 2018 from the mean? Specify whether or not they were above or below the mean.
# your code hereHow does the temperature during the month and year you were born compare to a typical temperature during that month?
# your code hereCreate a scatterplot that examines the correlation between June and July temperatures
import matplotlib.pyplot as plt
import numpy as np
june = np.nan
july = np.nan
### use june as the x value and july as the y value for the scatterplot.