The Pandas library offers powerful I/O tools (API) for data import and export, enabling seamless handling of various file formats like CSV, Excel, JSON, and many more. This API includes top-level reader functions like, pd.read_csv(), read_clipboard() and corresponding writer methods like, to_csv(), to_clipboard() for easy data handling.
In this tutorial, we will learn about the overview of the Pandas I/O tools and learn how to use them effectively.
The Pandas I/O API supports a wide variety of data formats. Here is a summary of supported formats and their corresponding reader and writer functions −
| Format | Reader Function | Writer Function |
|---|---|---|
| Tabular Data | read_table() | NA |
| CSV | read_csv() | to_csv() |
| Fixed-Width Text File | read_fwf() | NA |
| Clipboard | read_clipboard() | to_clipboard() |
| Pickling | read_pickle() | to_pickle() |
| Excel | read_excel() | to_excel() |
| JSON | read_json() | to_json() |
| HTML | read_html() | to_html() |
| XML | read_xml() | to_xml() |
| LaTeX | NA | to_latex() |
| HDF5 Format | read_hdf() | to_hdf() |
| Feather | read_feather() | to_feather() |
| Parquet | read_parquet() | to_parquet() |
| ORC | read_orc() | to_orc() |
| SQL | read_sql() | to_sql() |
| Stata | read_stata() | to_stata() |
Among these, the most frequently used functions for handling text files are read_csv() and read_table(). Both convert flat files into DataFrame objects.
This example shows reading the CSV data using the pandas read_csv() function. In this example we are using the StringIO to load the CSV string into a Pandas DataFrame object.
import pandas as pd
from io import StringIO
data = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900"""
obj = StringIO(data)
df = pd.read_csv(obj)
print(df)
Output:
S.No Name Age City Salary 0 1 Tom 28 Toronto 20000 1 2 Lee 32 HongKong 3000 2 3 Steven 43 Bay Area 8300 3 4 Ram 38 Hyderabad 3900
Pandas allows several customization options when parsing data. You can modify how the data is parsed using parameters like −
You can customize the row labels or index of the Pandas object by using index_col parameter. Setting index_col=False forces Pandas to not use the first column as the index, which can be helpful when handling malformed files with extra delimiters.
import pandas as pd
from io import StringIO
data = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900"""
obj = StringIO(data)
df = pd.read_csv(obj, index_col=['S.No'])
print(df)
Output:
S.No Name Age City Salary 1 Tom 28 Toronto 20000 2 Lee 32 HongKong 3000 3 Steven 43 Bay Area 8300 4 Ram 38 Hyderabad 3900
Pandas also provides the ability to specify the data type for columns using the dtype parameter.
import pandas as pd
from io import StringIO
import numpy as np
data = """[
{"Name": "Braund", "Gender": "Male", "Age": 30},
{"Name": "Cumings", "Gender": "Female", "Age": 25},
{"Name": "Heikkinen", "Gender": "Female", "Age": 35}
]"""
obj = StringIO(data)
df = pd.read_json(obj, dtype={'Age': np.float64})
print(df.dtypes)
Output:
Name object Gender object Age float64 dtype: object
Thus, the data looks like float −
Name Gender Age 0 Braund Male 30.0 1 Cumings Female 25.0 2 Heikkinen Female 35.0
When reading data files, Pandas assumes the first row as the header. However, you can customize this using the names Parameter.
import pandas as pd
from io import StringIO
xml = """
Everyday Italian
Giada De Laurentiis
2005
30.00
Harry Potter
J K. Rowling
2005
29.99
Learning XML
Erik T. Ray
2003
39.95
"""
df = pd.read_xml(StringIO(xml), names=['a', 'b', 'c','d','e'])
print('Output DataFrame from XML:')
print(df)
Output:
a b c d e 0 cooking Everyday Italian Giada De Laurentiis 2005 30.00 1 children Harry Potter J K. Rowling 2005 29.99 2 web Learning XML Erik T. Ray 2003 39.95
