When you specify the path to the CSV file as the first argument to read_csv(), the file is loaded as a DataFrame.
df = pd.read_csv('data/src/sample_header.csv')
print(df)
# a b c d
# 0 11 12 13 14
# 1 21 22 23 24
# 2 31 32 33 34
The path can be either absolute or relative. For how to check and change the current directory, see the following article.
Get and change the current working directory in Python
By default, the first row of the CSV file is treated as the column names (columns). You can configure various settings, such as how headers are handled, through the arguments of read_csv().
The following sections describe the main arguments frequently used with read_csv(). For comprehensive details of all arguments, refer to the official documentation.
pandas.read_csv — pandas 2.0.3 documentation
Use the following CSV file without a header as an example.
11,12,13,14
21,22,23,24
31,32,33,34
By default, the first row is recognized as a header and assigned to the column names (columns).
df = pd.read_csv('data/src/sample.csv')
print(df)
# 11 12 13 14
# 0 21 22 23 24
# 1 31 32 33 34
print(df.columns)
# Index(['11', '12', '13', '14'], dtype='object')
If you set the header argument to None, sequential numbers starting from zero are used as columns.
df_none = pd.read_csv('data/src/sample.csv', header=None)
print(df_none)
# 0 1 2 3
# 0 11 12 13 14
# 1 21 22 23 24
# 2 31 32 33 34
To set other values as column names, specify a list or tuple for the names argument.
df_names = pd.read_csv('data/src/sample.csv', names=['A', 'B', 'C', 'D'])
print(df_names)
# A B C D
# 0 11 12 13 14
# 1 21 22 23 24
# 2 31 32 33 34
Use the following CSV file with a header as an example.
a,b,c,d
11,12,13,14
21,22,23,24
31,32,33,34
As mentioned above, by default, the first row is recognized as a header and assigned to the column names (columns).
df = pd.read_csv('data/src/sample_header.csv')
print(df)
# a b c d
# 0 11 12 13 14
# 1 21 22 23 24
# 2 31 32 33 34
To set values different from the first row as columns, specify a list for the names argument while also setting header=0. Note that if header=0 is not set, the first row remains as data.
df_names = pd.read_csv('data/src/sample_header.csv', names=['A', 'B', 'C', 'D'])
print(df_names)
# A B C D
# 0 a b c d
# 1 11 12 13 14
# 2 21 22 23 24
# 3 31 32 33 34
df_names_0 = pd.read_csv('data/src/sample_header.csv',
header=0, names=['A', 'B', 'C', 'D'])
print(df_names_0)
# A B C D
# 0 11 12 13 14
# 1 21 22 23 24
# 2 31 32 33 34
You can also specify a line number, starting from zero, to use as the header; for example, header=2. Lines above the specified line are ignored.
df_header_2 = pd.read_csv('data/src/sample_header.csv', header=2)
print(df_header_2)
# 21 22 23 24
# 0 31 32 33 34
Read CSV with an index: index_col
Use the following CSV file with both a header and an index as an example.
,a,b,c,d
ONE,11,12,13,14
TWO,21,22,23,24
THREE,31,32,33,34
By default, the row names (index) are sequential numbers starting from zero. The first column is not treated as an index.
df = pd.read_csv('data/src/sample_header_index.csv')
print(df)
# Unnamed: 0 a b c d
# 0 ONE 11 12 13 14
# 1 TWO 21 22 23 24
# 2 THREE 31 32 33 34
print(df.index)
# RangeIndex(start=0, stop=3, step=1)
You can specify the column number, starting from zero, to use as the index with the index_col argument.
df_index_col = pd.read_csv('data/src/sample_header_index.csv', index_col=0)
print(df_index_col)
# a b c d
# ONE 11 12 13 14
# TWO 21 22 23 24
# THREE 31 32 33 34
print(df_index_col.index)
# Index(['ONE', 'TWO', 'THREE'], dtype='object')
Select columns to read: usecols
To read only specific columns, use the usecols argument. You can specify the column numbers or column names to read as a list.
df_usecols = pd.read_csv('data/src/sample_header.csv', usecols=[1, 3])
print(df_usecols)
# b d
# 0 12 14
# 1 22 24
# 2 32 34
df_usecols = pd.read_csv('data/src/sample_header.csv', usecols=['a', 'c'])
print(df_usecols)
# a c
# 0 11 13
# 1 21 23
# 2 31 33
You can also specify a callable, such as a lambda expression. Each column name is passed to the callable, and only the columns for which it returns True are read.
Lambda expressions in Python
df_usecols = pd.read_csv('data/src/sample_header.csv',
usecols=lambda x: x != 'b')
print(df_usecols)
# a c d
# 0 11 13 14
# 1 21 23 24
# 2 31 33 34
df_usecols = pd.read_csv('data/src/sample_header.csv',
usecols=lambda x: x not in ['a', 'c'])
print(df_usecols)
# b d
# 0 12 14
# 1 22 24
# 2 32 34
Skip rows to read
Skip the first n rows or specified row numbers: skiprows
To skip (exclude) specific rows, use the skiprows argument. If you specify an integer, the first n rows of the file are skipped.
df = pd.read_csv('data/src/sample.csv', header=None)
print(df)
# 0 1 2 3
# 0 11 12 13 14
# 1 21 22 23 24
# 2 31 32 33 34
df_skiprows = pd.read_csv('data/src/sample.csv', header=None, skiprows=2)
print(df_skiprows)
# 0 1 2 3
# 0 31 32 33 34
You can also specify the row numbers to skip as a list. Unlike usecols, you specify the rows to skip, not the rows to read.
df_skiprows = pd.read_csv('data/src/sample.csv', header=None, skiprows=[0, 2])
print(df_skiprows)
# 0 1 2 3
# 0 21 22 23 24
You can also specify a callable, such as a lambda expression. Each row number is passed to the callable, and any rows for which it returns True are skipped.
Lambda expressions in Python
df_skiprows = pd.read_csv('data/src/sample.csv', header=None,
skiprows=lambda x: x not in [0, 2])
print(df_skiprows)
# 0 1 2 3
# 0 11 12 13 14
# 1 31 32 33 34
For a file with a header, note that the header row is considered the 0th row.
df_skiprows = pd.read_csv('data/src/sample_header.csv', skiprows=2)
print(df_skiprows)
# 21 22 23 24
# 0 31 32 33 34
df_skiprows = pd.read_csv('data/src/sample_header.csv', skiprows=[1, 3])
print(df_skiprows)
# a b c d
# 0 21 22 23 24
Note that, even if you specify a column as the index using index_col, you cannot use skiprows with row names.
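To illustrate this point, here is a minimal sketch using inline CSV data via io.StringIO (an assumption, so it runs without the sample files): skiprows counts physical file rows, with the header as row 0, regardless of index_col.

```python
import io

import pandas as pd

# Inline CSV equivalent to sample_header_index.csv (assumed content,
# so this example is self-contained).
csv_text = ',a,b,c,d\nONE,11,12,13,14\nTWO,21,22,23,24\nTHREE,31,32,33,34\n'

# skiprows counts physical file rows: the header line is row 0,
# so [2] skips the 'TWO' line, not the row labeled 2.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, skiprows=[2])
print(df)
#         a   b   c   d
# ONE    11  12  13  14
# THREE  31  32  33  34
```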
To skip rows from the end of the file, use the skipfooter argument. Specify the number of rows to skip from the end as an integer.
df_skipfooter = pd.read_csv('data/src/sample.csv', header=None,
skipfooter=1, engine='python')
print(df_skipfooter)
# 0 1 2 3
# 0 11 12 13 14
# 1 21 22 23 24
To avoid the warning that may appear in some environments, specify engine='python'.
ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.
Read only the first n rows: nrows
The nrows argument allows you to read only the first n rows. This is useful when you want to preview part of a large file.
df_nrows = pd.read_csv('data/src/sample.csv', header=None, nrows=2)
print(df_nrows)
# 0 1 2 3
# 0 11 12 13 14
# 1 21 22 23 24
In the case of a file with a header, the header row is not counted.
df_nrows = pd.read_csv('data/src/sample_header.csv', nrows=2)
print(df_nrows)
# a b c d
# 0 11 12 13 14
# 1 21 22 23 24
Read CSV by specifying the data type (dtype)
pandas.DataFrame has a data type (dtype) for each column. For more information, including the relationship between strings and the object type, see the following article.
pandas: How to use astype() to cast dtype of DataFrame
While read_csv() automatically selects each column's data type based on the values, you can explicitly specify the data type using the dtype argument.
Use the following CSV file as an example.
,a,b,c,d
ONE,1,"001",100,x
TWO,2,"020",,y
THREE,3,"300",300,z
For example, a column of numbers with a leading 0 is treated as numeric by default, even if the values are enclosed in quotes, and the leading 0 is dropped.
df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
print(df)
# a b c d
# ONE 1 1 100.0 x
# TWO 2 20 NaN y
# THREE 3 300 300.0 z
print(df.dtypes)
# a int64
# b int64
# c float64
# d object
# dtype: object
If you want to treat such values as strings, preserving the leading 0, specify the dtype argument of read_csv().
Specify the same data type (dtype) for all columns
If you specify a single data type for the dtype argument, all columns, including those specified by index_col, are converted to that type. Be careful: an error occurs if any value cannot be converted to the specified type.
# pd.read_csv('data/src/sample_header_index_dtype.csv',
# index_col=0, dtype=float)
# ValueError: could not convert string to float: 'ONE'
If you set dtype=str, all columns are converted to strings. However, in this case, missing values (NaN) are still of type float.
df_str = pd.read_csv('data/src/sample_header_index_dtype.csv',
                     index_col=0, dtype=str)
print(df_str)
# a b c d
# ONE 1 001 100 x
# TWO 2 020 NaN y
# THREE 3 300 300 z
print(df_str.dtypes)
# a object
# b object
# c object
# d object
# dtype: object
print(df_str.applymap(type))
# a b c d
# ONE <class 'str'> <class 'str'> <class 'str'> <class 'str'>
# TWO <class 'str'> <class 'str'> <class 'float'> <class 'str'>
# THREE <class 'str'> <class 'str'> <class 'str'> <class 'str'>
If you read the file without specifying dtype and then cast it to str with astype(), NaN values are also converted to the string 'nan'.
df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
print(df.astype(str))
# a b c d
# ONE 1 1 100.0 x
# TWO 2 20 nan y
# THREE 3 300 300.0 z
print(df.astype(str).applymap(type))
# a b c d
# ONE <class 'str'> <class 'str'> <class 'str'> <class 'str'>
# TWO <class 'str'> <class 'str'> <class 'str'> <class 'str'>
# THREE <class 'str'> <class 'str'> <class 'str'> <class 'str'>
Specify data type (dtype) for each column
You can also specify the type of each column individually by passing a dictionary (dict) to the dtype argument. The types of columns not included in the dictionary are selected automatically.
df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
index_col=0, dtype={'a': float, 'b': str})
print(df_col)
# a b c d
# ONE 1.0 001 100.0 x
# TWO 2.0 020 NaN y
# THREE 3.0 300 300.0 z
print(df_col.dtypes)
# a float64
# b object
# c float64
# d object
# dtype: object
The dictionary keys can also be column numbers. Be careful: if you specify an index column with index_col, the column numbers must be counted including the index column.
df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
index_col=0, dtype={1: float, 2: str})
print(df_col)
# a b c d
# ONE 1.0 001 100.0 x
# TWO 2.0 020 NaN y
# THREE 3.0 300 300.0 z
print(df_col.dtypes)
# a float64
# b object
# c float64
# d object
# dtype: object
Read CSV with missing values (NaN)
Default values treated as missing
By default, read_csv() interprets certain values as NaN. For example, strings such as '', 'NaN', 'nan', and 'null' are treated as NaN.
By default the following values are interpreted as NaN: "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None", "n/a", "nan", "null".
pandas.read_csv — pandas 2.0.3 documentation
Use the following CSV file as an example.
ONE,,NaN
TWO,-,nan
THREE,null,N/A
If you read the file with default settings and check with the isnull() method, you can see that all values except '-' are treated as NaN.
pandas: Detect and count NaN (missing values) with isnull(), isna()
df_nan = pd.read_csv('data/src/sample_header_index_nan.csv', index_col=0)
print(df_nan)
# a b
# ONE NaN NaN
# TWO - NaN
# THREE NaN NaN
print(df_nan.isnull())
# a b
# ONE True True
# TWO False True
# THREE True True
For handling missing values after reading into a pandas.DataFrame, refer to the following articles.
pandas: Remove NaN (missing values) with dropna()
pandas: Replace NaN (missing values) with fillna()
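As a minimal sketch of those two operations on a freshly read DataFrame (using an inline copy of the sample CSV above via io.StringIO, an assumption so the example runs on its own):

```python
import io

import pandas as pd

# Inline copy of the NaN sample CSV above (assumed content, so the
# sketch runs without the sample file).
csv_text = ',a,b\nONE,,NaN\nTWO,-,nan\nTHREE,null,N/A\n'

df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Drop rows in which every value is NaN.
print(df.dropna(how='all'))
#      a   b
# TWO  -  NaN

# Replace NaN with a fixed value.
print(df.fillna(0))
#        a  b
# ONE    0  0
# TWO    -  0
# THREE  0  0
```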
Specify values to be treated as NaN: na_values, keep_default_na
To specify additional values to be treated as NaN, use the na_values argument. In addition to the default values mentioned above, the values specified in na_values are treated as NaN. You can specify multiple values as a list.
df_nan_set_na = pd.read_csv('data/src/sample_header_index_nan.csv',
index_col=0, na_values='-')
print(df_nan_set_na)
# a b
# ONE NaN NaN
# TWO NaN NaN
# THREE NaN NaN
print(df_nan_set_na.isnull())
# a b
# ONE True True
# TWO True True
# THREE True True
If you set the keep_default_na argument to False, the default values mentioned above will not be treated as NaN. If you also specify the na_values argument, only the values specified in na_values will be treated as NaN.
df_nan_no_keep = pd.read_csv('data/src/sample_header_index_nan.csv',
index_col=0, na_values=['-', 'NaN', 'null'],
keep_default_na=False)
print(df_nan_no_keep)
# a b
# ONE NaN
# TWO NaN nan
# THREE NaN N/A
print(df_nan_no_keep.isnull())
# a b
# ONE False True
# TWO True False
# THREE True False
Treat no values as NaN: na_filter
If the na_filter argument is set to False, all values are read as strings as-is and nothing is treated as NaN, regardless of the na_values and keep_default_na arguments.
df_nan_no_filter = pd.read_csv('data/src/sample_header_index_nan.csv',
index_col=0, na_filter=False)
print(df_nan_no_filter)
# a b
# ONE NaN
# TWO - nan
# THREE null N/A
print(df_nan_no_filter.isnull())
# a b
# ONE False False
# TWO False False
# THREE False False
Read CSV by specifying encoding: encoding, encoding_errors
Use the following CSV file encoded in Shift-JIS as an example.
sample_header_shift_jis.csv
You can specify the encoding with the encoding argument. The default is utf-8, and a UnicodeDecodeError occurs for files that are not UTF-8 encoded.
# df_sjis = pd.read_csv('data/src/sample_header_shift_jis.csv')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 8: invalid start byte
You need to specify the correct encoding.
df_sjis = pd.read_csv('data/src/sample_header_shift_jis.csv',
encoding='shift_jis')
print(df_sjis)
# a b c d
# 0 あ 12 13 14
# 1 い 22 23 24
# 2 う 32 33 34
The encoding_errors argument allows you to specify how encoding errors are handled. The default is 'strict', which raises the error shown above. You can also specify 'ignore', 'replace', and so on.
codecs - Error Handlers — Codec registry and base classes — Python 3.11.4 documentation
df_ignore = pd.read_csv('data/src/sample_header_shift_jis.csv',
encoding_errors='ignore')
print(df_ignore)
# a b c d
# 0 NaN 12 13 14
# 1 NaN 22 23 24
# 2 NaN 32 33 34
df_replace = pd.read_csv('data/src/sample_header_shift_jis.csv',
encoding_errors='replace')
print(df_replace)
# a b c d
# 0 �� 12 13 14
# 1 �� 22 23 24
# 2 �� 32 33 34
df_backslash = pd.read_csv('data/src/sample_header_shift_jis.csv',
encoding_errors='backslashreplace')
print(df_backslash)
# a b c d
# 0 \x82\xa0 12 13 14
# 1 \x82\xa2 22 23 24
# 2 \x82\xa4 32 33 34
Read compressed CSV files
read_csv() can read CSV files that have been compressed with ZIP or other methods.
df_zip = pd.read_csv('data/src/sample_header.csv.zip')
print(df_zip)
# a b c d
# 0 11 12 13 14
# 1 21 22 23 24
# 2 31 32 33 34
If the extension is .gz, .bz2, .zip, .xz, .zst, .tar, .tar.gz, .tar.xz, or .tar.bz2, the compression is automatically detected and the file is decompressed. If the extension is different, explicitly specify a string such as 'zip' (without the period) in the compression argument.
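As a sketch of the explicit compression argument (the archive and its non-standard file name are created on the fly here, since no such sample file exists in the article's data):

```python
import os
import tempfile
import zipfile

import pandas as pd

# Create a ZIP archive under a hypothetical non-standard extension,
# so read_csv() cannot auto-detect the compression from the file name.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'sample_header.dat')
with zipfile.ZipFile(path, 'w') as zf:
    zf.writestr('sample_header.csv', 'a,b,c,d\n11,12,13,14\n21,22,23,24\n')

# Explicitly specify the compression method (string without the period).
df = pd.read_csv(path, compression='zip')
print(df)
#     a   b   c   d
# 0  11  12  13  14
# 1  21  22  23  24
```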
Note that this works only for a single compressed CSV file. An archive containing multiple files will cause an error.
Read CSV files from the web
You can also specify a URL as the first argument to read_csv() to read files directly from the web, even if they are ZIP compressed.
Let's take a CSV file on GitHub as an example. Be careful to use the URL that points to the CSV file itself (Raw data).
df_web = pd.read_csv('https://raw.githubusercontent.com/nkmk/python-snippets/master/notebook/data/src/sample_header.csv')
print(df_web)
# a b c d
# 0 11 12 13 14
# 1 21 22 23 24
# 2 31 32 33 34
Note that in many cases, adjusting the argument settings may be required to successfully read the file or exclude unnecessary data. Particularly for large files, it is often more practical to download them locally first.
Read TSV files: sep
To read a TSV file, specify the tab character (\t) in the sep argument of read_csv().
a b c d
ONE 11 12 13 14
TWO 21 22 23 24
THREE 31 32 33 34
df_tsv_sep = pd.read_csv('data/src/sample_header_index.tsv', index_col=0, sep='\t')
print(df_tsv_sep)
# a b c d
# ONE 11 12 13 14
# TWO 21 22 23 24
# THREE 31 32 33 34
read_table() uses the tab character (\t) as the default separator; otherwise, its usage is the same as that of read_csv().
pandas.read_table — pandas 2.0.3 documentation
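For instance, a minimal sketch with inline TSV data via io.StringIO (an assumption, so the example is self-contained) shows that read_table() needs no sep argument:

```python
import io

import pandas as pd

# Inline TSV data (assumed content, for a self-contained example).
tsv_text = 'a\tb\tc\n11\t12\t13\n21\t22\t23\n'

# No sep argument needed: read_table() defaults to sep='\t'.
df = pd.read_table(io.StringIO(tsv_text))
print(df)
#     a   b   c
# 0  11  12  13
# 1  21  22  23
```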