Effortlessly Sort Pandas DataFrame by Column

pandas Sort: Your Guide to Sorting Data in Python

Learning pandas sort methods is a great way to start with or practice doing basic data analysis using Python. Most commonly, data analysis is done with spreadsheets, SQL, or pandas. One of the great things about using pandas is that it can handle a large amount of data and offers highly performant data manipulation capabilities.

By the end of this tutorial, you’ll know how to:

  • Sort a pandas DataFrame by the values of one or more columns
  • Use the ascending parameter to change the sort order
  • Sort a DataFrame by its index using .sort_index()
  • Organize missing data while sorting values
  • Sort a DataFrame in place by setting inplace to True

To follow along with this tutorial, you’ll need a basic understanding of pandas DataFrames and some familiarity with reading in data from files.

Getting Started With Pandas Sort Methods

As a quick reminder, a DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a table in SQL or Excel, where the data is organized in a tabular form. Throughout this tutorial, we will be using pandas to perform various sorting operations on DataFrame objects.

Preparing the Dataset

Before we dive into the sorting methods, let’s first prepare our dataset. We’ll use a sample dataset of employee records, which includes columns such as “Name,” “Age,” “Salary,” and “Department.” This dataset will provide us with a variety of sorting scenarios.

To start, import the pandas library and read the dataset from a CSV file:

import pandas as pd
df = pd.read_csv('employee_records.csv')

Make sure to replace ‘employee_records.csv’ with the path to your own dataset.
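If you don't have a CSV file at hand, a small in-memory DataFrame works just as well for following along. The sketch below builds one with the four columns described above; the names and numbers are invented for illustration and are not taken from any real dataset.

```python
import pandas as pd

# Hypothetical employee records, made up for illustration only.
df = pd.DataFrame(
    {
        "Name": ["Ava", "Ben", "Cara", "Dan", "Eve"],
        "Age": [34, 28, 45, 28, 39],
        "Salary": [72000, 58000, 91000, 61000, 72000],
        "Department": ["Sales", "IT", "IT", "Sales", "HR"],
    }
)

# Check the column types before sorting: numeric columns sort numerically,
# while columns stored as strings sort lexicographically.
print(df.dtypes)
print(df.head())
```

Checking the dtypes first is worthwhile because a numeric column that was read in as text would sort "100" before "20".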

Getting Familiar With .sort_values()

The .sort_values() method is used to sort a DataFrame by the values of one or more columns. By default, the sorting is done in ascending order, meaning the smallest values will appear first. Let’s see how this works with an example.

Suppose we want to sort our employee records by the “Salary” column. We can do this by calling .sort_values() and passing the column name as the argument:

sorted_df = df.sort_values('Salary')

The sorted_df DataFrame will now contain the employee records sorted by salary in ascending order.

If we want to sort the records in descending order, we can set the ascending parameter to False:

sorted_df = df.sort_values('Salary', ascending=False)

This will give us the employee records sorted by salary in descending order.
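As a minimal sketch of both calls, the example below uses a tiny made-up DataFrame (the names and salaries are assumptions, not data from the tutorial's CSV file) and also shows that the original DataFrame is left untouched.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {"Name": ["Ava", "Ben", "Cara"], "Salary": [72000, 58000, 91000]}
)

ascending_df = df.sort_values("Salary")
descending_df = df.sort_values("Salary", ascending=False)

print(ascending_df["Name"].tolist())   # ['Ben', 'Ava', 'Cara']
print(descending_df["Name"].tolist())  # ['Cara', 'Ava', 'Ben']

# .sort_values() returns a new DataFrame; df keeps its original row order.
print(df["Name"].tolist())             # ['Ava', 'Ben', 'Cara']
```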

Getting Familiar With .sort_index()

The .sort_index() method is used to sort a DataFrame by its index. The index holds the row labels of the DataFrame. These labels usually identify each row, although pandas does not require them to be unique. By default, the sorting is done in ascending order based on the index values.

To sort a DataFrame by its index, simply call .sort_index():

sorted_df = df.sort_index()

The sorted_df DataFrame will now contain the employee records sorted by index in ascending order.

If we want to sort the records in descending order based on the index, we can set the ascending parameter to False:

sorted_df = df.sort_index(ascending=False)

This will give us the employee records sorted by index in descending order.
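A common use for .sort_index() is restoring the original row order after sorting by values, because the row labels travel with the rows. The sketch below assumes a small made-up DataFrame with the default integer index.

```python
import pandas as pd

# Hypothetical data for illustration; the index is the default 0, 1, 2.
df = pd.DataFrame(
    {"Name": ["Ava", "Ben", "Cara"], "Salary": [72000, 58000, 91000]}
)

# Sorting by values shuffles the rows, and the index labels move with them.
by_salary = df.sort_values("Salary")
print(by_salary.index.tolist())    # [1, 0, 2]

# Sorting by index puts the rows back in label order.
restored = by_salary.sort_index()
print(restored.index.tolist())     # [0, 1, 2]

# Descending index order.
reversed_df = by_salary.sort_index(ascending=False)
print(reversed_df.index.tolist())  # [2, 1, 0]
```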

Sorting Your DataFrame on a Single Column

Now that we’re familiar with the .sort_values() and .sort_index() methods, let’s dive deeper into sorting DataFrames. Specifically, we’ll focus on sorting DataFrames based on a single column.

Sorting by a Column in Ascending Order

To sort a DataFrame by a single column in ascending order, we can use the .sort_values() method and pass the column name as the argument:

sorted_df = df.sort_values('Age')

This will sort the employee records based on the “Age” column in ascending order.

Changing the Sort Order

If we want to change the sort order to descending, we can add the ascending=False parameter:

sorted_df = df.sort_values('Age', ascending=False)

Now the employee records will be sorted based on the “Age” column in descending order.

Choosing a Sorting Algorithm

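By default, the .sort_values() method uses the “quicksort” algorithm, which is not stable: rows with equal values are not guaranteed to keep their original relative order. You can choose “mergesort” or “heapsort” instead by specifying the kind parameter; of these, mergesort is stable. For a DataFrame, kind is only applied when sorting on a single column or label: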

sorted_df = df.sort_values('Age', kind='mergesort')

This will sort the DataFrame based on the “Age” column using the mergesort algorithm.
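The choice of algorithm matters most when the sort column contains ties. The sketch below uses made-up data with duplicate ages and relies on mergesort being a stable algorithm, so tied rows keep the order they had before sorting.

```python
import pandas as pd

# Hypothetical data: Ava and Cara share an age, as do Ben and Dan.
df = pd.DataFrame(
    {"Name": ["Ava", "Ben", "Cara", "Dan"], "Age": [34, 28, 34, 28]}
)

# A stable sort keeps tied rows in their original relative order:
# Ben stays ahead of Dan, and Ava stays ahead of Cara.
stable_df = df.sort_values("Age", kind="mergesort")
print(stable_df["Name"].tolist())  # ['Ben', 'Dan', 'Ava', 'Cara']
```

Stability is useful when you sort in several passes, for example sorting by name first and then by age, and want the earlier order preserved among equal ages.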

Sorting Your DataFrame on Multiple Columns

In some cases, you may need to sort your DataFrame based on multiple columns simultaneously. This can be achieved by passing a list of column names to the .sort_values() method.

Sorting by Multiple Columns in Ascending Order

To sort a DataFrame based on multiple columns in ascending order, we can pass a list of column names to the .sort_values() method:

sorted_df = df.sort_values(['Department', 'Salary'])

This will sort the employee records first by the “Department” column, and then by the “Salary” column within each department, in ascending order.
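To make the "within each department" behavior concrete, here is a minimal sketch on made-up records. The first column in the list is the primary sort key, and later columns only break ties.

```python
import pandas as pd

# Hypothetical employee records for illustration.
df = pd.DataFrame(
    {
        "Name": ["Ava", "Ben", "Cara", "Dan", "Eve"],
        "Department": ["Sales", "IT", "IT", "Sales", "HR"],
        "Salary": [72000, 58000, 91000, 61000, 72000],
    }
)

# Primary key: Department (A to Z). Tie-breaker: Salary (low to high).
sorted_df = df.sort_values(["Department", "Salary"])
print(sorted_df["Name"].tolist())
# ['Eve', 'Ben', 'Cara', 'Dan', 'Ava']
```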

Changing the Column Sort Order

If we want to change the sort order of specific columns, we can pass a list of booleans to the ascending parameter. The list must have the same length as the list of column names, and each entry applies to the column in the same position (True for ascending, False for descending):

sorted_df = df.sort_values(['Department', 'Salary'], ascending=[True, False])

This will sort the employee records first by the “Department” column in ascending order, and then by the “Salary” column in descending order within each department.

Sorting by Multiple Columns in Descending Order

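If we want to sort the DataFrame based on multiple columns in descending order, we can set the ascending parameter to False for every column. A single ascending=False applies to all columns; the list form below spells out the order for each column explicitly: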

sorted_df = df.sort_values(['Department', 'Salary'], ascending=[False, False])

This will sort the employee records first by the “Department” column in descending order, and then by the “Salary” column in descending order.

Sorting by Multiple Columns With Different Sort Orders

In some cases, you may want to sort a DataFrame based on multiple columns, where each column has a different sort order. To achieve this, pass the column names as a list and give ascending a list of booleans of the same length, one per column:

sorted_df = df.sort_values(['Department', 'Salary'], ascending=[False, True])

This will sort the employee records first by the “Department” column in descending order, and then by the “Salary” column in ascending order within each department.
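A pattern that works for per-column sort orders is to give ascending a list of booleans that lines up position by position with the list of column names. The sketch below shows both mixed combinations on made-up records.

```python
import pandas as pd

# Hypothetical employee records for illustration.
df = pd.DataFrame(
    {
        "Name": ["Ava", "Ben", "Cara", "Dan", "Eve"],
        "Department": ["Sales", "IT", "IT", "Sales", "HR"],
        "Salary": [72000, 58000, 91000, 61000, 72000],
    }
)

# Department Z to A, then Salary low to high within each department.
dept_desc = df.sort_values(["Department", "Salary"], ascending=[False, True])
print(dept_desc["Name"].tolist())  # ['Dan', 'Ava', 'Ben', 'Cara', 'Eve']

# Department A to Z, then Salary high to low within each department.
dept_asc = df.sort_values(["Department", "Salary"], ascending=[True, False])
print(dept_asc["Name"].tolist())   # ['Eve', 'Cara', 'Ben', 'Ava', 'Dan']
```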

Sorting Your DataFrame on Its Index

Apart from sorting based on column values, you can also sort a DataFrame based on its index values.

Sorting by Index in Ascending Order

To sort a DataFrame based on its index in ascending order, you can use the .sort_index() method:

sorted_df = df.sort_index()

This will sort the DataFrame based on its index in ascending order.

Sorting by Index in Descending Order

If you want to sort the DataFrame based on its index in descending order, you can set the ascending parameter to False:

sorted_df = df.sort_index(ascending=False)

Now the DataFrame will be sorted based on its index values in descending order.

Exploring Advanced Index-Sorting Concepts

Sorting a DataFrame based on its index can be more complex when dealing with hierarchical or multi-level indexes. In such cases, you can use additional parameters, such as level and sort_remaining, to control the sorting behavior. For more information on this topic, you can refer to the pandas documentation.
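As a rough sketch of how level and sort_remaining interact, the example below builds a two-level index from made-up department and name labels. With the default sort_remaining=True, the remaining levels are used as tie-breakers after the requested level.

```python
import pandas as pd

# Hypothetical two-level index: (Department, Name).
index = pd.MultiIndex.from_tuples(
    [("Sales", "Dan"), ("IT", "Cara"), ("Sales", "Ava"), ("IT", "Ben")],
    names=["Department", "Name"],
)
df = pd.DataFrame({"Salary": [61000, 91000, 72000, 58000]}, index=index)

# Sort by Department, then by the remaining level (Name) as a tie-breaker.
full = df.sort_index(level="Department")
print(full.index.tolist())
# [('IT', 'Ben'), ('IT', 'Cara'), ('Sales', 'Ava'), ('Sales', 'Dan')]

# Sort by Department only; Name is not used as an additional sort key.
partial = df.sort_index(level="Department", sort_remaining=False)
print(partial.index.get_level_values("Department").tolist())
# ['IT', 'IT', 'Sales', 'Sales']
```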

Sorting the Columns of Your DataFrame

So far, we have been sorting the rows of our DataFrame based on various criteria. But what if you want to sort the columns instead?

Working With the DataFrame Axis

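Both .sort_values() and .sort_index() accept an axis parameter. By default, axis is set to 0, which reorders the rows. Setting axis to 1 reorders the columns instead. With .sort_values() and axis=1, the first argument must be a row label rather than a column name, and the values in that row must be comparable with each other, so it is safest to select columns of a single type first. Assuming the default integer index created by read_csv(), the following orders the numeric columns by their values in the row labeled 0:

sorted_df = df[['Age', 'Salary']].sort_values(by=0, axis=1)

This will order the “Age” and “Salary” columns by the values they hold in the first row.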

Using Column Labels to Sort

The .sort_values() method always sorts by data values. To order the columns by their labels instead, use the .sort_index() method with axis=1:

sorted_df = df.sort_index(axis=1)

This will sort the columns of the DataFrame alphabetically by their labels.
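The sketch below shows two ways of reordering columns on a small made-up DataFrame: by label with .sort_index(axis=1), and by the values in one row with .sort_values(axis=1). The second call is restricted to numeric columns and assumes the default integer index, so that a row labeled 0 exists.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {
        "Name": ["Ava", "Ben"],
        "Salary": [72000, 58000],
        "Age": [34, 28],
        "Department": ["Sales", "IT"],
    }
)

# Reorder the columns alphabetically by label.
by_label = df.sort_index(axis=1)
print(by_label.columns.tolist())
# ['Age', 'Department', 'Name', 'Salary']

# Reverse alphabetical order.
by_label_desc = df.sort_index(axis=1, ascending=False)
print(by_label_desc.columns.tolist())
# ['Salary', 'Name', 'Department', 'Age']

# Reorder numeric columns by their values in the row labeled 0.
# Mixing strings and numbers here would fail, hence the column selection.
by_row = df[["Salary", "Age"]].sort_values(by=0, axis=1)
print(by_row.columns.tolist())
# ['Age', 'Salary']
```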

Working With Missing Data When Sorting in Pandas

When sorting a DataFrame, it’s important to consider how missing data is handled. By default, missing values are placed at the end of the sorted result, whether the sort is ascending or descending.

Understanding the na_position Parameter in .sort_values()

The .sort_values() method has a na_position parameter that allows you to control the position of missing values in the sorted DataFrame. By default, na_position is set to ‘last’, which means missing values will be placed at the end of the DataFrame.

If you want to place missing values at the beginning of the DataFrame, you can set na_position to ‘first’:

sorted_df = df.sort_values('Salary', na_position='first')

This will place rows with missing salary values at the beginning of the sorted DataFrame.
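The following sketch uses a made-up DataFrame with one missing salary to show where that row lands under each setting. Note that switching to descending order does not move the missing value; only na_position does.

```python
import pandas as pd

# Hypothetical data: Ben's salary is missing.
df = pd.DataFrame(
    {"Name": ["Ava", "Ben", "Cara"], "Salary": [72000, None, 58000]}
)

# Default: missing values go last.
last_df = df.sort_values("Salary")
print(last_df["Name"].tolist())        # ['Cara', 'Ava', 'Ben']

# Descending order still keeps the missing value last.
desc_df = df.sort_values("Salary", ascending=False)
print(desc_df["Name"].tolist())        # ['Ava', 'Cara', 'Ben']

# na_position='first' moves the missing value to the top.
first_df = df.sort_values("Salary", na_position="first")
print(first_df["Name"].tolist())       # ['Ben', 'Cara', 'Ava']
```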

Understanding the na_position Parameter in .sort_index()

The .sort_index() method also has a na_position parameter that works in the same way as the na_position parameter in .sort_values(). It allows you to control the position of missing values in the sorted DataFrame based on the index values.

To place missing values at the beginning of the sorted DataFrame, set na_position to ‘first’:

sorted_df = df.sort_index(na_position='first')
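Missing index labels are less common than missing column values, but they can appear, for example after setting a column with gaps as the index. The sketch below builds such an index by hand from made-up values to show the effect of na_position.

```python
import pandas as pd

# Hypothetical data with one missing index label.
df = pd.DataFrame(
    {"Name": ["Ava", "Ben", "Cara"]},
    index=[2.0, float("nan"), 1.0],
)

# Default: the row with the missing label goes last.
last_df = df.sort_index()
print(last_df["Name"].tolist())   # ['Cara', 'Ava', 'Ben']

# na_position='first' moves it to the top.
first_df = df.sort_index(na_position="first")
print(first_df["Name"].tolist())  # ['Ben', 'Cara', 'Ava']
```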

Using Sort Methods to Modify Your DataFrame

So far, we have been creating new DataFrames with the sorted results. However, you can also modify your existing DataFrame in place using the inplace parameter.

Using .sort_values() In Place

To sort a DataFrame in place using .sort_values(), set the inplace parameter to True:

df.sort_values('Salary', inplace=True)

This will sort the DataFrame by the “Salary” column in ascending order and modify the original DataFrame. With inplace=True, the method returns None, so don’t assign its result back to df.

Using .sort_index() In Place

Similarly, to sort a DataFrame by its index in place, set the inplace parameter to True when calling .sort_index():

df.sort_index(inplace=True)

This will sort the DataFrame by its index in ascending order and modify the original DataFrame.
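One pitfall with inplace=True is that the method returns None rather than the sorted DataFrame. The sketch below demonstrates this on made-up data; writing df = df.sort_values(..., inplace=True) would therefore replace df with None.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {"Name": ["Ava", "Ben", "Cara"], "Salary": [72000, 58000, 91000]}
)

# The in-place call mutates df and returns None.
result = df.sort_values("Salary", inplace=True)
print(result)               # None
print(df["Name"].tolist())  # ['Ben', 'Ava', 'Cara']

# Sorting by index in place restores the original label order.
df.sort_index(inplace=True)
print(df["Name"].tolist())  # ['Ava', 'Ben', 'Cara']
```

If you prefer to keep the original data available, the reassignment form df = df.sort_values('Salary') is an equivalent alternative that avoids this pitfall.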

Conclusion

Sorting data is a fundamental operation in data analysis, and pandas provides efficient methods for sorting DataFrames. In this tutorial, we explored various sorting techniques using the .sort_values() and .sort_index() methods. We learned how to sort DataFrames based on single and multiple columns, as well as the index. We also discussed how to handle missing data while sorting and how to modify DataFrames in place.

By mastering the sorting methods in pandas, you will be equipped with essential skills for manipulating and analyzing data effectively in Python.