


pandas: How to Read and Write Files

by Mirko Stojiljković

Table of Contents

  1. Installing pandas
  2. Preparing Data
  3. Using the pandas read_csv() and .to_csv() Functions
  4. Using pandas to Write and Read Excel Files
  5. Understanding the pandas IO API
  6. Working With Different File Types
    • CSV Files
    • JSON Files
    • HTML Files
    • Excel Files
    • SQL Files
    • Pickle Files
  7. Working With Big Data
  8. Conclusion

pandas is a powerful and flexible Python package that allows you to work with labeled and time series data. It also provides statistics methods, enables plotting, and more. One crucial feature of pandas is its ability to write and read Excel, CSV, and many other types of files. Functions like read_csv() enable you to work with files effectively. You can use them to save the data and labels from pandas objects to a file and load them later as pandas Series or DataFrame instances.

In this tutorial, you’ll learn:

  • What the pandas IO tools API is
  • How to read and write data to and from files
  • How to work with various file formats
  • How to work with big data efficiently

Let’s start reading and writing files!

Installing pandas

The code in this tutorial was executed with CPython 3.7.4 and pandas 0.25.1. To follow along smoothly, make sure you have recent versions of Python and pandas on your machine. You might also want to create a new virtual environment and install the dependencies for this tutorial there.

First, you’ll need the pandas library. You may already have it installed. If you don’t, then you can install it with pip:

$ pip install pandas

Once the installation process completes, you should have pandas installed and ready.

Anaconda is an excellent Python distribution that comes with Python, many useful packages like pandas, and a package and environment manager called Conda. If you don’t have pandas in your virtual environment, then you can install it with Conda:

$ conda install pandas

Conda is powerful as it manages the dependencies and their versions.

Preparing Data

Before we dive into reading and writing files, let’s prepare some data that we can work with. In this tutorial, we’ll focus on CSV, Excel, JSON, HTML, SQL, and Pickle file formats.

To get started, let’s create a simple CSV file called “data.csv” with the following content:

Name,Age,City
John,25,New York
Lisa,29,Chicago
Bob,31,Los Angeles

Save this file in the same directory where your Python script or Jupyter Notebook is located.

Using the pandas read_csv() and .to_csv() Functions

The pandas read_csv() function is a powerful tool for reading CSV files into a pandas DataFrame. It allows you to specify various parameters, such as the file path, column names, delimiter, and more.

To read the “data.csv” file we created earlier, use the following code:

import pandas as pd
df = pd.read_csv("data.csv")
print(df)

In the code above, we import the pandas library and use the read_csv() function to read the CSV file into a DataFrame. We then print the DataFrame to confirm that the data was successfully loaded.
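Beyond the defaults, read_csv() accepts many optional parameters. Here's a hedged sketch using the sample file from above; the particular parameter choices (usecols, index_col) are illustrative, not prescriptive:

```python
import pandas as pd

# Recreate the sample file from earlier so this snippet is self-contained.
with open("data.csv", "w") as f:
    f.write("Name,Age,City\nJohn,25,New York\nLisa,29,Chicago\nBob,31,Los Angeles\n")

df = pd.read_csv(
    "data.csv",
    sep=",",                   # delimiter; comma is the default
    usecols=["Name", "Age"],   # load only the columns you need
    index_col="Name",          # use Name as the row labels
)
print(df.loc["Lisa", "Age"])  # 29
```

Note that when you pass index_col, the index column must also appear in usecols.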

The pandas to_csv() function allows you to write a DataFrame to a CSV file. Here’s an example:

df.to_csv("new_data.csv", index=False)

This code will write the DataFrame to a new CSV file called “new_data.csv” in the current directory. The index=False parameter ensures that the DataFrame index is not included in the CSV file.
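To see concretely what index=False changes, compare the header lines produced by two writes (a small illustrative sketch; the file names are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Lisa"], "Age": [25, 29]})

df.to_csv("with_index.csv")                  # default: index included
df.to_csv("without_index.csv", index=False)  # index omitted

with open("with_index.csv") as f:
    print(f.readline().strip())   # ,Name,Age  (leading comma is the index column)
with open("without_index.csv") as f:
    print(f.readline().strip())   # Name,Age
```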

Using pandas to Write and Read Excel Files

As with CSV files, pandas provides functions to write and read Excel files. The to_excel() function is used to write a DataFrame to an Excel file, while the read_excel() function is used to read an Excel file into a DataFrame.

Here’s an example of how to write a DataFrame to an Excel file:

df.to_excel("data.xlsx", sheet_name="Sheet1", index=False)

This code writes the DataFrame to an Excel file named “data.xlsx” with a sheet named “Sheet1”. The index=False parameter ensures that the DataFrame index is not included in the Excel file.

To read the Excel file back into a DataFrame, use the following code:

df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
print(df)

Understanding the pandas IO API

The pandas IO API provides a unified interface for writing and reading various file formats. It includes functions for writing files, such as to_csv(), to_excel(), and to_sql(), as well as functions for reading files, such as read_csv(), read_excel(), and read_sql().

By using the pandas IO API, you can easily switch between different file formats without needing to change much of your code.
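A quick sketch of that symmetry: the same DataFrame written and read back in two formats, where the only change is the function name (the file names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Lisa"], "Age": [25, 29]})

# Each writer has a matching reader, so swapping formats is mostly
# a matter of swapping the function pair.
df.to_csv("demo.csv", index=False)
df.to_json("demo.json", orient="records")

from_csv = pd.read_csv("demo.csv")
from_json = pd.read_json("demo.json")
print(from_csv["Age"].tolist() == from_json["Age"].tolist())  # True
```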

Working With Different File Types

In this tutorial, we’ll cover several file types that pandas can handle, including CSV, JSON, HTML, Excel, SQL, and Pickle files.

CSV Files

CSV files are one of the most commonly used file formats for storing tabular data. Pandas provides excellent support for working with CSV files, as we’ve already seen in the previous examples.

To write a DataFrame to a CSV file, you can use the to_csv() function:

df.to_csv("data.csv", index=False)

To read a CSV file into a DataFrame, you can use the read_csv() function:

df = pd.read_csv("data.csv")

JSON Files

JSON (JavaScript Object Notation) is a lightweight data format that is commonly used for storing and exchanging data on the web. Pandas includes functions for working with JSON files as well.

To write a DataFrame to a JSON file, you can use the to_json() function:

df.to_json("data.json", orient="records")

The orient="records" parameter serializes the DataFrame as a JSON array of records, with one JSON object per row.

To read a JSON file into a DataFrame, you can use the read_json() function:

df = pd.read_json("data.json")
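To make the records layout concrete, here's a round trip of the sample data through JSON (a self-contained sketch; the file name matches the example above):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Lisa", "Bob"],
    "Age": [25, 29, 31],
    "City": ["New York", "Chicago", "Los Angeles"],
})

df.to_json("data.json", orient="records")
# data.json now holds a JSON array like:
# [{"Name":"John","Age":25,"City":"New York"}, ...]

restored = pd.read_json("data.json")
print(restored["Age"].tolist())  # [25, 29, 31]
```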

HTML Files

HTML (Hypertext Markup Language) is the standard markup language for creating web pages. Pandas provides functions for working with HTML files, allowing you to read data from HTML tables and write DataFrames to HTML files.

To read data from an HTML table into a DataFrame, you can use the read_html() function:

df_list = pd.read_html("data.html")
df = df_list[0]

The read_html() function returns a list of DataFrames, so you’ll need to select the appropriate DataFrame from the list.

To write a DataFrame to an HTML file, you can use the to_html() function:

df.to_html("data.html", index=False)

Excel Files

As mentioned earlier, pandas provides functions for working with Excel files. You can use the to_excel() function to write a DataFrame to an Excel file and the read_excel() function to read an Excel file into a DataFrame.

Here’s an example of reading an Excel file with multiple sheets:

dfs = pd.read_excel("data.xlsx", sheet_name=None)

The sheet_name=None parameter reads all sheets in the Excel file and returns a dictionary of DataFrames, with the sheet names as keys.
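A self-contained sketch of working with that dictionary (the workbook contents here are illustrative; reading and writing .xlsx files requires an Excel engine such as openpyxl to be installed):

```python
import pandas as pd

# Write a two-sheet workbook first so the example has something to read.
with pd.ExcelWriter("data.xlsx") as writer:
    pd.DataFrame({"A": [1, 2]}).to_excel(writer, sheet_name="Sheet1", index=False)
    pd.DataFrame({"B": [3, 4]}).to_excel(writer, sheet_name="Sheet2", index=False)

dfs = pd.read_excel("data.xlsx", sheet_name=None)

# dfs maps each sheet name to its own DataFrame.
for name, sheet_df in dfs.items():
    print(name, sheet_df.shape)
```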

SQL Files

Pandas allows you to read and write DataFrames directly to and from SQL databases. You can use the read_sql() function to query an SQL database and load the results into a DataFrame, and the to_sql() function to write a DataFrame to an SQL table.

Here’s an example of reading data from an SQL database:

import sqlite3

conn = sqlite3.connect("data.db")
# Note: "table" is a reserved word in SQL, so the query names an actual
# table; "people" here is just an illustrative placeholder.
df = pd.read_sql("SELECT * FROM people", conn)
conn.close()

To write a DataFrame to an SQL table, you can use the to_sql() function:

conn = sqlite3.connect("data.db")
df.to_sql("people", conn, if_exists="replace")  # "people" is a placeholder table name
conn.close()

The if_exists="replace" parameter ensures that the table is replaced if it already exists.

Pickle Files

Pickle is a Python-specific binary format for serializing and deserializing Python objects. Pandas provides functions for working with Pickle files as well.

To write a DataFrame to a Pickle file, you can use the to_pickle() function:

df.to_pickle("data.pkl")

To read a Pickle file into a DataFrame, you can use the read_pickle() function:

df = pd.read_pickle("data.pkl")
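Unlike CSV, pickle preserves dtypes and the index exactly, so a round trip gives back an identical DataFrame. A quick self-contained check (file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Lisa"], "Age": [25, 29]})
df.to_pickle("data.pkl")

restored = pd.read_pickle("data.pkl")
print(restored.equals(df))  # True
```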

Working With Big Data

When working with big data, it’s important to optimize the way you read and write files to ensure efficient performance. Pandas provides several techniques to work with large datasets:

  • Compress and Decompress Files: Use compression algorithms like gzip or zip to reduce the file size and improve read and write performance.

  • Choose Columns: Specify the columns you need to read from a file instead of reading the entire dataset. This can significantly speed up the process.

  • Omit Rows: Use filters or conditions to omit unnecessary rows from the dataset, reducing the size of the data being processed.

  • Force Less Precise Data Types: Specify less precise data types for numeric or date columns to reduce memory usage.

  • Use Chunks to Iterate Through Files: Instead of loading the entire dataset into memory, you can process it in smaller chunks using the chunksize parameter of the read_csv() function.
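Several of these techniques combine naturally in a single read_csv() call. The sketch below is self-contained (it generates its own gzipped sample file; the file name and sizes are illustrative) and applies compression, column selection, a smaller dtype, and chunked iteration at once:

```python
import pandas as pd

# Build a moderately sized compressed sample file to read back.
pd.DataFrame({
    "Name": ["John", "Lisa", "Bob"] * 1000,
    "Age": [25, 29, 31] * 1000,
    "City": ["New York", "Chicago", "Los Angeles"] * 1000,
}).to_csv("big_data.csv.gz", index=False, compression="gzip")

total_rows = 0
for chunk in pd.read_csv(
    "big_data.csv.gz",
    compression="gzip",          # decompress on the fly
    usecols=["Name", "Age"],     # choose columns: skip City entirely
    dtype={"Age": "int8"},       # force a less precise data type
    chunksize=500,               # iterate through the file 500 rows at a time
):
    total_rows += len(chunk)     # each chunk is an ordinary DataFrame

print(total_rows)  # 3000
```

Because only one chunk is in memory at a time, this pattern scales to files far larger than RAM.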

Conclusion

In this tutorial, you’ve learned how to use pandas to read and write files in various formats, including CSV, Excel, JSON, HTML, SQL, and Pickle. You’ve seen how to write and read data from these files using pandas functions like read_csv(), to_csv(), read_excel(), to_excel(), and more. You’ve also learned some techniques for working with big data efficiently.

Now you have the knowledge and tools to handle different file formats and work with large datasets using pandas. Keep exploring the pandas documentation to discover more advanced features and become proficient in data manipulation and analysis.