What is a Pandas DataFrame?
A Pandas DataFrame is a 2-dimensional data structure consisting of rows and columns. It lets you store and query data much as you would in Excel or SQL.
Each DataFrame column represents a variable/feature/predictor, such as age, gender, square footage, and so on. Similarly, each row represents one record/observation, such as housing information, or employee data. Take a look at the following figure for a visual explanation:
Pandas allows you to store pretty much any data type inside a DataFrame, including numerical, categorical, and textual. You can even store more complex data types, such as JSON, but that’s a story for another time. Today is all about the basics.
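To make the rows-and-columns idea concrete, here is a minimal sketch of a DataFrame that mixes numerical, categorical, and textual columns. The homes variable, its column names, and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical housing records; names and values are illustrative only
homes = pd.DataFrame({
    "sqft": [850, 1200, 1540],                             # numerical
    "type": pd.Categorical(["flat", "house", "house"]),    # categorical
    "notes": ["needs paint", "new roof", "corner lot"],    # textual
})

print(homes.shape)   # (number of rows, number of columns)
print(homes.dtypes)  # one data type per column
```

Each column keeps its own data type, while every row holds one complete record.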
How to Create a Pandas DataFrame (pd DataFrame)?
Working with DataFrames in Python is straightforward thanks to the Pandas library. This section walks you through the basics of creating Pandas DataFrames with the pd.DataFrame constructor and several different Python data types.
Here are the arguments you can pass in when creating a Pandas DataFrame:
Argument | Description |
---|---|
data | The data itself: a NumPy array, an iterable, a dictionary, a Series, or another DataFrame |
index | Row labels to use for the DataFrame; defaults to a range index (integers from 0 to N - 1, where N is the number of rows) |
columns | Column labels to use for the DataFrame; defaults to a range index when the data carries no column labels |
dtype | A single data type to force on the entire DataFrame; if not supplied, Pandas infers data types automatically |
copy | Boolean; whether Pandas should copy the data from the input |
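As a rough sketch of how these arguments fit together, the example below passes data, index, columns, and dtype in a single call. The scores variable and its labels are hypothetical:

```python
import pandas as pd

# Hypothetical values, chosen only to show each argument's effect
scores = pd.DataFrame(
    data=[[1, 2], [3, 4], [5, 6]],   # three rows, two values each
    index=["a", "b", "c"],           # custom row labels instead of 0, 1, 2
    columns=["x", "y"],              # custom column labels instead of 0, 1
    dtype="float64",                 # force one data type on every column
)

print(scores)
print(scores.dtypes)
```

Leaving out index and columns would give integer labels on both axes, and leaving out dtype would let Pandas infer integer columns here.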
Let’s now go over some hands-on use cases for creating Pandas DataFrames. These are the library imports you’ll need at the beginning of your notebook or script:
import numpy as np
import pandas as pd
from datetime import datetime
Convert List to Pandas DataFrame
A common data type that often gets converted to a DataFrame is a Python list. You can represent each DataFrame row as a single list, which is convenient if you have a small number of features, but gets messy as the dataset grows in width.
Take a look at the following code. It creates four lists, one for each employee, and then passes a list of employees (a list of lists) to the data argument of the pd.DataFrame() constructor:
emp1 = ["Bob", "Doe", "bobdoe@company.com", datetime(2023, 2, 15)]
emp2 = ["Mark", "Doe", "markdoe@company.com", datetime(2023, 3, 10)]
emp3 = ["Jane", "Doe", "janedoe@company.com", datetime(2023, 3, 12)]
emp4 = ["Patrick", "Doe", "patrickdoe@company.com", datetime(2023, 3, 18)]
data = pd.DataFrame(data=[emp1, emp2, emp3, emp4])
data
This is what the resulting Pandas DataFrame looks like:
As you can see, the column names are missing, so Pandas falls back to the integers 0 through 3. You can supply your own by passing them to the columns argument of the pd.DataFrame() constructor:
data = pd.DataFrame(
    data=[emp1, emp2, emp3, emp4],
    columns=["First Name", "Last Name", "Email", "Created At"]
)
data
Our DataFrame has column names now:
Let’s see how we can do the same with dictionaries.
Convert Dictionary to Pandas DataFrame
Python dictionaries are a powerful data structure, especially when working with Pandas. You can build a Pandas DataFrame from a list of dictionaries, each one representing a single row of data.
Because dictionaries are key-value pairs, we essentially specify the column names and respective values at once. Take a look at the following snippet:
employees = [
    {"First Name": "Bob", "Last Name": "Doe", "Email": "bobdoe@company.com", "Created At": datetime(2023, 2, 15)},
    {"First Name": "Mark", "Last Name": "Doe", "Email": "markdoe@company.com", "Created At": datetime(2023, 3, 10)},
    {"First Name": "Jane", "Last Name": "Doe", "Email": "janedoe@company.com", "Created At": datetime(2023, 3, 12)},
    {"First Name": "Patrick", "Last Name": "Doe", "Email": "patrickdoe@company.com", "Created At": datetime(2023, 3, 18)}
]
data = pd.DataFrame(employees)
data
We get the same DataFrame without specifying the columns explicitly:
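The list-of-dictionaries layout above is row-oriented. An alternative not shown in the original text is a single dictionary of lists, where each key is a column name and each list holds that column's values. The sketch below builds the same small table both ways; the by_row and by_column names are made up for illustration:

```python
import pandas as pd

# Row-oriented: one dictionary per record
by_row = pd.DataFrame([
    {"First Name": "Bob", "Last Name": "Doe", "Email": "bobdoe@company.com"},
    {"First Name": "Mark", "Last Name": "Doe", "Email": "markdoe@company.com"},
])

# Column-oriented: one list per column, all lists the same length
by_column = pd.DataFrame({
    "First Name": ["Bob", "Mark"],
    "Last Name": ["Doe", "Doe"],
    "Email": ["bobdoe@company.com", "markdoe@company.com"],
})

print(by_column)
```

The column-oriented form tends to be shorter when you already have the values grouped by feature, while the row-oriented form reads more naturally when records arrive one at a time.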
Up next, let’s see how to do the same with NumPy arrays.
Convert NumPy Array to Pandas DataFrame
NumPy and Pandas go hand in hand. Both libraries are used together in most data science projects, so it makes sense to build Pandas DataFrames from NumPy arrays. The following code snippet shows you how to convert NumPy arrays to a Pandas DataFrame. It works the same way as with Python lists:
emp1 = np.array(["Bob", "Doe", "bobdoe@company.com", datetime(2023, 2, 15)])
emp2 = np.array(["Mark", "Doe", "markdoe@company.com", datetime(2023, 3, 10)])
emp3 = np.array(["Jane", "Doe", "janedoe@company.com", datetime(2023, 3, 12)])
emp4 = np.array(["Patrick", "Doe", "patrickdoe@company.com", datetime(2023, 3, 18)])
data = pd.DataFrame(
    data=[emp1, emp2, emp3, emp4],
    columns=["First Name", "Last Name", "Email", "Created At"]
)
data
The resulting DataFrame looks familiar:
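A NumPy array holds one data type for all of its elements, so purely numeric data is where arrays fit most naturally. As a small sketch with hypothetical values, a 2-dimensional array maps directly onto rows and columns:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data: 3 rows (people) by 2 columns (age, salary)
matrix = np.array([
    [25, 50000.0],
    [32, 64000.0],
    [41, 58000.0],
])

numeric = pd.DataFrame(matrix, columns=["Age", "Salary"])

print(numeric)
print(numeric.dtypes)  # both columns share the array's float64 type
```

Because the array mixes integers and floats, NumPy stores everything as floats, and the DataFrame columns inherit that type.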
And that’s how you can construct Pandas DataFrames from scratch, but how can you expand them? In other words, how can you add rows and columns? That’s what we’ll answer next.
How to Add Rows and Columns to a Pandas DataFrame
This section will explain some basic ways to add rows and columns to Pandas DataFrames. There will be dedicated articles that cover the same in much more depth, so make sure to stay tuned to Practical Pandas.
Let’s start with columns.
Add Column to Pandas DataFrame
Data science and data analytics often require you to make derived columns. Put simply, these columns represent data in a new way that is probably easier to understand, or easier to plug into a machine learning model.
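For a sense of what a derived column looks like, here is a minimal sketch that computes two new columns from existing ones. The people variable and the Full Name and Created Month columns are hypothetical and not part of the dataset used in this article:

```python
import pandas as pd
from datetime import datetime

# Small stand-in dataset with made-up records
people = pd.DataFrame({
    "First Name": ["Bob", "Mark"],
    "Last Name": ["Doe", "Doe"],
    "Created At": [datetime(2023, 2, 15), datetime(2023, 3, 10)],
})

# Combine two text columns into one
people["Full Name"] = people["First Name"] + " " + people["Last Name"]

# Extract the month number from a datetime column
people["Created Month"] = people["Created At"].dt.month

print(people)
```

Both new columns are calculated for every row at once, with no explicit loop.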
We’ll keep things simple today and only show you how to append a column to a DataFrame. For example, imagine we want to add a Date of Birth attribute to our dataset. This is one way to do it:
dobs = [datetime(1985, 1, 15), datetime(1990, 5, 14), datetime(1997, 7, 9), datetime(1960, 5, 5)]
data["Date of Birth"] = dobs
data
The new attribute gets appended to the end of the DataFrame:
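But what if you want it at a certain location, perhaps after the Last Name column? You can use the insert() method to specify the index location (keep in mind that indexes in Python and Pandas start at 0). Note that insert() raises a ValueError if the column already exists, so the snippet first drops the column added in the previous step:
dobs = [datetime(1985, 1, 15), datetime(1990, 5, 14), datetime(1997, 7, 9), datetime(1960, 5, 5)]
# insert() refuses to add a column that already exists, so drop the earlier copy first
data = data.drop(columns="Date of Birth")
data.insert(loc=2, column="Date of Birth", value=dobs)
data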
This approach allows for more control, as you can see from the image below:
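Hard-coding the position works, but it breaks silently if the column order changes. One possible refinement, sketched here on a hypothetical staff DataFrame, is to look up the position of the neighboring column with columns.get_loc() and insert right after it:

```python
import pandas as pd
from datetime import datetime

# Small stand-in dataset with made-up records
staff = pd.DataFrame({
    "First Name": ["Bob", "Mark"],
    "Last Name": ["Doe", "Doe"],
    "Email": ["bobdoe@company.com", "markdoe@company.com"],
})

# Find where "Last Name" sits and target the slot right after it
position = staff.columns.get_loc("Last Name") + 1

staff.insert(
    loc=position,
    column="Date of Birth",
    value=[datetime(1985, 1, 15), datetime(1990, 5, 14)],
)

print(list(staff.columns))
```

Note that insert() modifies the DataFrame in place and returns nothing, so there is no need to reassign the result.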
Next, let’s dive into adding rows to the DataFrame.
Add Row to Pandas DataFrame
We’ll now add a couple more employees to our DataFrame. For a long time, the usual way to add rows was the append() method, but it was deprecated in Pandas 1.4 and removed in Pandas 2.0. Instead, you should opt for the concat() function.
It expects two or more Pandas DataFrames, so each new row has to be converted first. Luckily, you already know how to do that!
Here’s an example of adding one row:
data = pd.concat([
    data,
    pd.DataFrame(data=[
        {"First Name": "John", "Last Name": "Doe", "Email": "johndoe@company.com", "Created At": datetime(2023, 3, 21)}
    ])
], ignore_index=True)
data
The dataset is now a bit longer:
But what if you want to add more rows? Do you have to call pd.concat() twice? Absolutely not. We’re already constructing the second DataFrame from a list of dictionaries, so simply add another dictionary for the second row:
data = pd.concat([
    data,
    pd.DataFrame(data=[
        {"First Name": "Linda", "Last Name": "Doe", "Email": "lindadoe@company.com", "Created At": datetime(2023, 3, 23)},
        {"First Name": "Kelly", "Last Name": "Doe", "Email": "kellydoe@company.com", "Created At": datetime(2023, 3, 25)}
    ])
], ignore_index=True)
data
Here’s the resulting DataFrame:
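Both calls above pass ignore_index=True. To see why, the sketch below concatenates two small, hypothetical DataFrames with and without it and compares the resulting row labels:

```python
import pandas as pd

# Two made-up DataFrames, each with its own default index starting at 0
first = pd.DataFrame({"First Name": ["Bob", "Mark"]})
second = pd.DataFrame({"First Name": ["Jane"]})

# Without ignore_index, each piece keeps its original row labels
kept = pd.concat([first, second])
print(list(kept.index))        # labels repeat: 0, 1, 0

# With ignore_index=True, rows are renumbered from 0
renumbered = pd.concat([first, second], ignore_index=True)
print(list(renumbered.index))  # labels are unique: 0, 1, 2
```

Duplicate row labels make label-based lookups return more than one row, which is rarely what you want after appending records.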
And that pretty much concludes the introduction to Pandas DataFrames. We’ll dive much, much deeper shortly, but this is enough for now.
Summing up
Pandas DataFrames are the data structure where all the magic happens in Python and Pandas. You saw how easy it is to create DataFrames from plain Python objects, such as lists and dictionaries, and also from NumPy arrays. You’ve also learned how to add rows and columns, which are basic data manipulation techniques you’ll use daily.
Up next, we’ll dive much deeper into each of the subtopics discussed today. Make sure to stay tuned to Practical Pandas, and we’ll make sure to publish the next piece shortly.