Data Frames

A data frame is essentially a table that has rows and columns. Each column is a Pandas Series and represents a variable, and each row is an observation, which represents an entry. By default, data frames are indexed with numbers (starting at 0).

Remember, first you have to import Pandas!

import pandas as pd

Creating a Data Frame – from Scratch

There are many ways to create a data frame. In the real world, most commonly your data set will come from a .csv, or .xlsx file. These will be demonstrated on this page. With Pandas, you can import many other files such as SAS, SQL, STATA, JSON, HTML, and directly from your clipboard. To see all file types Pandas can import, look at the official documentation page.

Before showing how to import a file, first let’s see how to create a blank data frame. You use the assignment statement with the data frame name on the left of the “=” sign, and with the pd.DataFrame() method on the right.

df = pd.DataFrame()

Importing a “.csv” or “.xlsx” File as a Data Frame

The data set used on this page was downloaded from Kaggle.com from the user Miroslav Sabo. To download, go to our GitHub page (https://github.com/Opensourcefordatascience/Data-sets), or get it from Kaggle (https://www.kaggle.com/miroslavsabo/young-people-survey). You will need to go to the Kaggle site to view the codebook for this data set. If unfamiliar with the term, a codebook contains information on the data set. This includes: background information, how the data was collected, the scales, the scales meaning, and notes about the data set.

Note: The file from our GitHub page is modified from the original .csv file. In our version, a “Participant Number” column has been added. This column is arbitrarily assigned.

resp_df = pd.read_csv("responses.csv")

resp_df_ex = pd.read_excel("responses.xlsx")

If the data in the “.csv” or “.xlsx” file doesn’t start on the first row, use the skiprows = (number of rows to skip) parameter in either of the methods.

resp_df_ex = pd.read_excel("responses.xlsx", skiprows = 3)

Data Frame Preview

If you haven’t already previewed the data set you are working with before importing it into Pandas, now is the time! It is important to know what the data looks like that you are working with.

There are two possible methods, .head() and/or .tail(). The .head() method shows the first 5 rows and all of the columns (or some of the columns depending on which IDE you are using), whereas the .tail() method shows the last 5 rows. If you want to view more, or fewer, than 5 rows, pass a number in the .head(#) or .tail(#) methods. Let’s see what we are working with. Only the first 10 columns will be displayed below.

resp_df.head()

Participant
Number
Music Slow songs
or fast songs
Dance Folk Country Classical
music
Musical Pop Rock
0 1 5.0 3.0 2.0 1.0 2.0 2.0 1.0 5.0 5.0
1 2 4.0 4.0 2.0 1.0 1.0 1.0 2.0 3.0 5.0
2 3 5.0 5.0 2.0 2.0 3.0 4.0 5.0 3.0 5.0
3 4 5.0 3.0 2.0 1.0 1.0 1.0 1.0 2.0 2.0
4 5 5.0 3.0 4.0 3.0 2.0 4.0 3.0 5.0 3.0

Indexing and Slicing Data Frames

There are a few ways you can select which rows and columns you are interested in when working with Data Frames. They are presented below in a table, then more in-depth examples are provided of each.

Selection Method Description
DataFrame[‘column_label_desired‘] Selects the column(s) desired and all of the rows.
Pass a list if more than one column is desired
DataFrame.loc[rows_desired, ‘column_label_desired‘] Selects the desired number of rows, and column label(s) desired.
Pass a list if more than one column is desired
DataFrame.iloc[rows_desired, column_position_desired] Selects the desired rows, and column position desired.
Pass a list of column positions if they are not next to each other,
otherwise use a slice to select the range of the positions.

The difference between using .loc[] and .iloc[] is that .loc[] selects the columns by column label (column name), whereas .iloc[] selects the columns by position. The column position starts at 0, just like the row indexes.

For both of these methods, if you want all rows for the desired columns you pass a colon, “:”, in the rows_desired parameter. Let’s see an example of each selection method in the above table.

DataFrame[‘column_label_desired‘]

Selecting a column or columns this way will by default include data from every row as well. Most often this is what is desired. Selecting data this way is often used for sub-setting a data set. If unfamiliar with what sub-setting is, it is the act of creating a smaller data set that contains only the observations that are of interest from the initial data set. Let’s see it in action.

Although selecting data this way includes all the rows for the columns selected, only the first 5 rows will be displayed below.

resp_df['Folk']

Folk
0 1.0
1 1.0
2 2.0
3 1.0
4 3.0

If you want to select more than 1 column, pass a list.

resp_df[['Folk', 'Country']]

Folk Country
0 1.0 2.0
1 1.0 1.0
2 2.0 3.0
3 1.0 1.0
4 3.0 2.0

Sub-setting section sneak peak: To create a subset, use the assignment statement. The subset names on the left side of the “=” and the data frame selection method on the right side. This will create a new Python object that contains all the data in the column(s) you specify. Check out the sub-setting page for more on this.

DataFrame.loc[rows_desired, ‘column_label_desired‘]

This data selection method is similar to the previous selection method. Except you can state the specific rows you are interested in. You still use the column label(s) to identify which column(s) you are interested in.

To select the rows you are interested in, you can pass a list of values or use a slice range if they are next to each other.

Note: Using a slice range in the .loc[] method, you will receive the starting and ending row index locations stated. This is different from when using with the .iloc[] method which follows the traditional slicing rule of end position – 1.

resp_df.loc[[0,2,3], 'Music']

Music
0 5.0
2 5.0
3 5.0

You can also pass a list of column labels (column names) if you want more than 1 column.

resp_df.loc[2:7, ['Music', 'Rock']]

Music Rock
2 5.0 5.0
3 5.0 2.0
4 5.0 3.0
5 5.0 5.0
6 5.0 3.0
7 5.0 5.0

DataFrame.iloc[rows_desired, column_position_desired]

This selection method is similar to the previous method, .loc[], in that all the same rules apply for selecting the rows. The difference is that with the .iloc[] method, you identify the desired columns with their position location instead in column label. The columns position starts at 0. The code below will return the same results as the .loc[] method.

resp_df.iloc[[0,2,3], 1]

Music
0 5.0
2 5.0
3 5.0

resp_df.iloc[2:7, [1, 9]]

Music Rock
2 5.0 5.0
3 5.0 2.0
4 5.0 3.0
5 5.0 5.0
6 5.0 3.0

The advantage of using this method is that you can use a range slice to select multiple columns that are next to each other. This saves time from typing each column name. However, you have to know the column(s) position. Remember that using a slice range with the .iloc[] method returns the ending number – 1, which is the traditional slicing behavior with Python. Whereas the .loc[] method returns the stated ending number.

resp_df.iloc[2:7, 1:9]

Music Slow songs
or fast songs
Dance Folk Country Classical
music
Musical Pop
2 5.0 5.0 2.0 2.0 3.0 4.0 5.0 3.0
3 5.0 3.0 2.0 1.0 1.0 1.0 1.0 2.0
4 5.0 3.0 4.0 3.0 2.0 4.0 3.0 5.0
5 5.0 3.0 2.0 3.0 2.0 3.0 3.0 2.0
6 5.0 5.0 5.0 3.0 1.0 2.0 2.0 5.0

Creating a New Variable using other Variables within a Data Frame

This is a fairly common task and luckily it’s simple. To create a new variable within a data frame you need to use an assignment statement with the data frame and new variable name on the left side of the “=” and the data frame operations on the right side. This will be completed element-wise, i.e. between columns within the same row.

For our example, we want to calculate a new variable that is the difference in enjoyment between “Rock” and “Folk”. We will subtract “Folk” from “Rock” so that positive values will mean a greater enjoyment in rock music, whereas negative values will mean a greater enjoyment in folk music.

resp_df['Rock Folk Diff'] = resp_df["Rock"] - resp_df["Folk"]

resp_df['Rock Folk Diff'].head()

Rock Folk Diff
0 4.0
1 4.0
2 3.0
3 1.0
4 0.0

Common Methods and Operations with Data Frames

Below is a table of common methods and operations conducted on Data Frames. For the full list of attributes and methods available to be used with data frames, see the official Pandas documentation which can be found here.

Method/Function Description
.sum() Returns the sum value
.mean() Returns the average value
.std() Returns the standard deviation
.min() Returns the minimum value
.max() Returns the maximum value
len() A function that returns the length which is the total number
of observations. This includes missing values.
.count() Returns the number of responses that
are not missing
.abs() Returns the absolute value
.corr() Returns the a correlation matrix with r values
.value_counts() Returns the frequencies of unique values
.groupby(“your_series“) Groups data frame by unique series values
pandas.get_dummies() Converts categorical variables into dummy variables
.describe() Returns the sample size, mean, standard deviation,
minimum value, 25th percentile value, 50th percentile value,
75th percentile value, and the maximum value
.apply() Applies a function to the data
.concat() Combines data frame by stacking
.merge() Combines data frame by merging
on a key (a column with unique values)
.head() Shows first 5 rows. Pass a number between “()”
for more or fewer rows.
.tail() Shows last 5 rows. Pass a number between “()”
for more or fewer rows.

To use a method on a Data Frame, or a Data Frame column (which is truly a Series) or columns, you place the method after the Python object; i.e. your_dataframe.method(), your_dateframe[‘column_label’].method(), or your_dateframe[[‘column_label1’, ‘column_label2’, ‘column_label3’]].method(). Let’s see a few examples below.

First example, let’s get the total number of observations (len()), the number of actual responses (.count()), and the mean (.mean()) rating for “Rock” column.

len(resp_df['Rock'])
1010

resp_df['Rock'].count()
1004

resp_df['Rock'].mean()
3.7619521912350598

From these analyses, we know that there are a total of 1,010 observations for the “Rock” variable. Of the 1,010 participants in this study, 1,004 answered this question. The average score for the “Rock” variable is 3.76. Given the context of the scale (found on the original data set’s page that is linked above), the people in this survey have a slight enjoyment of rock music.

Next example will be using the .groupby() method. There are variables within the data set that contains categorical data that we can group the data by and run analyses on the different groups to see differences between them.

The variable name that we will use is “Village – town”. Let’s see if there is a difference in the enjoyment of rock music between individuals that live in a village or a city.

When using multiple methods together, you need to stack them in the correct order. In this case, we need to group the data frame by the categorical data, state the variable we are interested in, and then get the means.

resp_df.groupby("Village - town")['Rock'].mean()

Village – town
city 3.761024
village 3.764310

There appears to be no difference between the enjoyment of rock music scores. However, this still should be tested for significance.

Leave a Reply