
In pandas, pandas.read_csv() allows you to read CSV or TSV files into pandas.DataFrame objects.

  • pandas.read_csv — pandas 2.0.3 documentation
  • IO tools (text, CSV, HDF5, …) - CSV & text files — pandas 2.0.3 documentation
  • Contents
    • Basic Usage of pandas.read_csv()
    • Read CSV without a header: header, names
    • Read CSV with a header: header, names
    • Read CSV with an index: index_col
    • Select columns to read: usecols
    • Skip rows to read
      • Skip the first n rows or specified row numbers: skiprows
      • Skip the last n rows: skipfooter
      • Read only the first n rows: nrows
    • Read CSV by specifying the data type (dtype)
      • Specify the same data type (dtype) for all columns
      • Specify data type (dtype) for each column
    • Read CSV with missing values (NaN)
      • Default values treated as missing
      • Specify values to be treated as NaN: na_values, keep_default_na
      • Treat no values as NaN: na_filter
    • Read CSV by specifying encoding: encoding, encoding_errors
    • Read compressed CSV files
    • Read CSV files from the web
    • Read TSV files: sep

            For information on writing or appending DataFrame objects to CSV files, refer to the following article.

          • pandas: Write DataFrame to CSV with to_csv()

            The sample code in this article uses pandas version 2.0.3.

            import pandas as pd
            print(pd.__version__)
            # 2.0.3
            
            source: pandas_read_csv.py

            Basic Usage of pandas.read_csv()

            When you specify the path to the CSV file as the first argument of read_csv(), the file is loaded as a DataFrame.

            df = pd.read_csv('data/src/sample_header.csv')
            print(df)
            #     a   b   c   d
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            
            

            The path can be either absolute or relative. For how to check and change the current directory, see the following article.

          • Get and change the current working directory in Python
            By default, the first row of the CSV file is treated as the column names (columns). How headers are handled, along with various other settings, can be configured through the arguments of read_csv().

            The following sections describe the main arguments frequently used with read_csv(). For comprehensive details of all arguments, please refer to the official documentation.

          • pandas.read_csv — pandas 2.0.3 documentation

            Read CSV without a header: header, names

            Use the following CSV file without a header as an example.

            11,12,13,14
            21,22,23,24
            31,32,33,34
            
            source: sample.csv

            By default, the first row is recognized as a header and assigned to the column names (columns).

            df = pd.read_csv('data/src/sample.csv')
            print(df)
            #    11  12  13  14
            # 0  21  22  23  24
            # 1  31  32  33  34
            print(df.columns)
            # Index(['11', '12', '13', '14'], dtype='object')
            
            

            If you set the header argument to None, sequential numbers starting from zero will be used as columns.

            df_none = pd.read_csv('data/src/sample.csv', header=None)
            print(df_none)
            #     0   1   2   3
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            
            

            To set other values as column names, specify a list or tuple for the names argument.

            df_names = pd.read_csv('data/src/sample.csv', names=['A', 'B', 'C', 'D'])
            print(df_names)
            #     A   B   C   D
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            
            

            Read CSV with a header: header, names

            Use the following CSV file with a header as an example.

            a,b,c,d
            11,12,13,14
            21,22,23,24
            31,32,33,34
            
            

            As mentioned above, by default, the first row is recognized as a header and assigned to the column names (columns).

            df = pd.read_csv('data/src/sample_header.csv')
            print(df)
            #     a   b   c   d
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            
            

            To set values different from the first row as columns, specify a list for the names argument while setting header=0. Note that if header=0 is not set, the first row remains as data.

            df_names = pd.read_csv('data/src/sample_header.csv', names=['A', 'B', 'C', 'D'])
            print(df_names)
            #     A   B   C   D
            # 0   a   b   c   d
            # 1  11  12  13  14
            # 2  21  22  23  24
            # 3  31  32  33  34
            df_names_0 = pd.read_csv('data/src/sample_header.csv',
                                     header=0, names=['A', 'B', 'C', 'D'])
            print(df_names_0)
            #     A   B   C   D
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            
            

            You can also specify which line to use as the header by its line number, counting from zero; for example, header=2 uses the third line as the header. Lines above the specified line are ignored.

            df_header_2 = pd.read_csv('data/src/sample_header.csv', header=2)
            print(df_header_2)
            #    21  22  23  24
            # 0  31  32  33  34
            
            

            Read CSV with an index: index_col

            Use the following CSV file with both a header and an index as an example.

            ,a,b,c,d
            ONE,11,12,13,14
            TWO,21,22,23,24
            THREE,31,32,33,34
            
            

            By default, the row names (index) are sequentially numbered starting from zero. The first column is not treated as an index.

            df = pd.read_csv('data/src/sample_header_index.csv')
            print(df)
            #   Unnamed: 0   a   b   c   d
            # 0        ONE  11  12  13  14
            # 1        TWO  21  22  23  24
            # 2      THREE  31  32  33  34
            print(df.index)
            # RangeIndex(start=0, stop=3, step=1)
            
            

            You can specify the column number, starting from zero, to use as index by setting the argument index_col.

            df_index_col = pd.read_csv('data/src/sample_header_index.csv', index_col=0)
            print(df_index_col)
            #         a   b   c   d
            # ONE    11  12  13  14
            # TWO    21  22  23  24
            # THREE  31  32  33  34
            print(df_index_col.index)
            # Index(['ONE', 'TWO', 'THREE'], dtype='object')
            
            source: pandas_read_csv.py

            Select columns to read: usecols

            To read only specific columns, use the usecols argument. You can specify the column numbers or column names to be read using a list.

            df_usecols = pd.read_csv('data/src/sample_header.csv', usecols=[1, 3])
            print(df_usecols)
            #     b   d
            # 0  12  14
            # 1  22  24
            # 2  32  34
            df_usecols = pd.read_csv('data/src/sample_header.csv', usecols=['a', 'c'])
            print(df_usecols)
            #     a   c
            # 0  11  13
            # 1  21  23
            # 2  31  33
            
            

            You can also specify callable objects, such as lambda expressions. The column name is passed to the callable, and only the columns evaluated as True are extracted.

          • Lambda expressions in Python

            df_usecols = pd.read_csv('data/src/sample_header.csv',
                                     usecols=lambda x: x != 'b')
            print(df_usecols)
            #     a   c   d
            # 0  11  13  14
            # 1  21  23  24
            # 2  31  33  34
            df_usecols = pd.read_csv('data/src/sample_header.csv',
                                     usecols=lambda x: x not in ['a', 'c'])
            print(df_usecols)
            #     b   d
            # 0  12  14
            # 1  22  24
            # 2  32  34
            
            

            Skip rows to read

            Skip the first n rows or specified row numbers: skiprows

            To skip (exclude) specific rows to read, use the skiprows argument.

            If you specify an integer, it skips the first n rows from the file to read.

            df = pd.read_csv('data/src/sample.csv', header=None)
            print(df)
            #     0   1   2   3
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            df_skiprows = pd.read_csv('data/src/sample.csv', header=None, skiprows=2)
            print(df_skiprows)
            #     0   1   2   3
            # 0  31  32  33  34
            
            

            You can also specify the row numbers to skip in a list. Unlike usecols, you specify the rows to skip, not the rows to read.

            df_skiprows = pd.read_csv('data/src/sample.csv', header=None, skiprows=[0, 2])
            print(df_skiprows)
            #     0   1   2   3
            # 0  21  22  23  24
            
            

            You can also specify callable objects, such as lambda expressions. The row number is passed to the callable, and any rows evaluated as True are skipped.

          • Lambda expressions in Python

            df_skiprows = pd.read_csv('data/src/sample.csv', header=None,
                                      skiprows=lambda x: x not in [0, 2])
            print(df_skiprows)
            #     0   1   2   3
            # 0  11  12  13  14
            # 1  31  32  33  34
            
            

            For a file with a header, note that the header row is considered the 0th row.

            df_skiprows = pd.read_csv('data/src/sample_header.csv', skiprows=2)
            print(df_skiprows)
            #    21  22  23  24
            # 0  31  32  33  34
            df_skiprows = pd.read_csv('data/src/sample_header.csv', skiprows=[1, 3])
            print(df_skiprows)
            #     a   b   c   d
            # 0  21  22  23  24
            
            

            Note that even if you specify a column as the index using index_col, skiprows cannot be given row names (index labels); it always refers to line numbers in the file.
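            To illustrate, here is a minimal sketch using the same sample data supplied inline via io.StringIO (so no file on disk is assumed). skiprows counts physical line numbers even when index_col is set.

```python
import io

import pandas as pd

# Same content as sample_header_index.csv, supplied inline via StringIO.
csv_text = ',a,b,c,d\nONE,11,12,13,14\nTWO,21,22,23,24\nTHREE,31,32,33,34\n'

# skiprows=[1] skips the physical line 'ONE,...' (the header is line 0);
# passing an index label such as 'ONE' would not work.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, skiprows=[1])
print(df)
#         a   b   c   d
# TWO    21  22  23  24
# THREE  31  32  33  34
```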

            Skip the last n rows: skipfooter

            To skip from the end of the file, use the skipfooter argument. Specify the number of rows to skip from the end as an integer.

            df_skipfooter = pd.read_csv('data/src/sample.csv', header=None,
                                        skipfooter=1, engine='python')
            print(df_skipfooter)
            #     0   1   2   3
            # 0  11  12  13  14
            # 1  21  22  23  24
            
            

            Specify engine='python' to avoid the following warning, which may appear depending on the environment because the default C engine does not support skipfooter.

            ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.
            

            Read only the first n rows: nrows

            The nrows argument allows you to read only the first n rows. This can be useful when you want to preview some data from a large file.

            df_nrows = pd.read_csv('data/src/sample.csv', header=None, nrows=2)
            print(df_nrows)
            #     0   1   2   3
            # 0  11  12  13  14
            # 1  21  22  23  24
            
            

            In the case of a file with a header, the header row is not counted.

            df_nrows = pd.read_csv('data/src/sample_header.csv', nrows=2)
            print(df_nrows)
            #     a   b   c   d
            # 0  11  12  13  14
            # 1  21  22  23  24
            
            

            Read CSV by specifying the data type (dtype)

            pandas.DataFrame has a data type (dtype) for each column. For more information, including the relationship between strings and the object type, see the following article.

          • pandas: How to use astype() to cast dtype of DataFrame
          • While read_csv() automatically selects each column's data type based on the values, you can explicitly specify the data type using the dtype argument.

            Use the following CSV file as an example.

            ,a,b,c,d
            ONE,1,"001",100,x
            TWO,2,"020",,y
            THREE,3,"300",300,z
            
            

            For example, a column of zero-padded numbers is parsed as numeric by default, even if the values are enclosed in quotes, so the leading zeros are dropped.

            df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
            print(df)
            #        a    b      c  d
            # ONE    1    1  100.0  x
            # TWO    2   20    NaN  y
            # THREE  3  300  300.0  z
            print(df.dtypes)
            # a      int64
            # b      int64
            # c    float64
            # d     object
            # dtype: object
            
            

            To preserve such values as strings, leading zeros included, specify the dtype argument of read_csv().

            Specify the same data type (dtype) for all columns

            If you specify any data type for the dtype argument, all columns, including those specified by index_col, are converted to that type.

            Be careful, as it will cause an error if there are values that cannot be converted to the specified type.

            # pd.read_csv('data/src/sample_header_index_dtype.csv',
            #             index_col=0, dtype=float)
            # ValueError: could not convert string to float: 'ONE'
            
            

            If you set dtype=str, all columns are converted to strings. However, in this case, the missing value (NaN) will still be of type float.

            df_str = pd.read_csv('data/src/sample_header_index_dtype.csv',
                                 index_col=0, dtype=str)
            print(df_str)
            #        a    b    c  d
            # ONE    1  001  100  x
            # TWO    2  020  NaN  y
            # THREE  3  300  300  z
            print(df_str.dtypes)
            # a    object
            # b    object
            # c    object
            # d    object
            # dtype: object
            print(df_str.applymap(type))
            #                    a              b                c              d
            # ONE    <class 'str'>  <class 'str'>    <class 'str'>  <class 'str'>
            # TWO    <class 'str'>  <class 'str'>  <class 'float'>  <class 'str'>
            # THREE  <class 'str'>  <class 'str'>    <class 'str'>  <class 'str'>
            
            source: pandas_read_csv_dtype.py

            If you read the file without specifying dtype and then cast it to str with astype(), NaN values are also converted to the string 'nan'.

            df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
            print(df.astype(str))
            #        a    b      c  d
            # ONE    1    1  100.0  x
            # TWO    2   20    nan  y
            # THREE  3  300  300.0  z
            print(df.astype(str).applymap(type))
            #                    a              b              c              d
            # ONE    <class 'str'>  <class 'str'>  <class 'str'>  <class 'str'>
            # TWO    <class 'str'>  <class 'str'>  <class 'str'>  <class 'str'>
            # THREE  <class 'str'>  <class 'str'>  <class 'str'>  <class 'str'>
            
            

            Specify data type (dtype) for each column

            You can also specify the type of each column individually by passing a dictionary (dict) to the dtype argument. The types for columns not specified in the dictionary will be selected automatically.

            df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
                                 index_col=0, dtype={'a': float, 'b': str})
            print(df_col)
            #          a    b      c  d
            # ONE    1.0  001  100.0  x
            # TWO    2.0  020    NaN  y
            # THREE  3.0  300  300.0  z
            print(df_col.dtypes)
            # a    float64
            # b     object
            # c    float64
            # d     object
            # dtype: object
            
            

            The dictionary keys can also be column numbers. Note that when index_col is used, the column numbers are counted including the index column.

            df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
                                 index_col=0, dtype={1: float, 2: str})
            print(df_col)
            #          a    b      c  d
            # ONE    1.0  001  100.0  x
            # TWO    2.0  020    NaN  y
            # THREE  3.0  300  300.0  z
            print(df_col.dtypes)
            # a    float64
            # b     object
            # c    float64
            # d     object
            # dtype: object
            
            

            Read CSV with missing values (NaN)

            Default values treated as missing

            By default, read_csv() interprets some values as NaN. For example, strings such as '', 'NaN', 'nan', 'null', and others are treated as NaN.

            By default the following values are interpreted as NaN: "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None", "n/a", "nan", "null". pandas.read_csv — pandas 2.0.3 documentation

            Use the following CSV file as an example.

            ,a,b
            ONE,,NaN
            TWO,-,nan
            THREE,null,N/A

            If you read the file with default settings and check with the isnull() method, you can see that all values except '-' are treated as NaN.

          • pandas: Detect and count NaN (missing values) with isnull(), isna()

            df_nan = pd.read_csv('data/src/sample_header_index_nan.csv', index_col=0)
            print(df_nan)
            #          a   b
            # ONE    NaN NaN
            # TWO      - NaN
            # THREE  NaN NaN
            print(df_nan.isnull())
            #            a     b
            # ONE     True  True
            # TWO    False  True
            # THREE   True  True
            
            

            For handling missing values after reading into a pandas.DataFrame, refer to the following article.

          • pandas: Remove NaN (missing values) with dropna()
          • pandas: Replace NaN (missing values) with fillna()

            Specify values to be treated as NaN: na_values, keep_default_na

            To specify values to be treated as NaN, use the na_values argument. In addition to the default values mentioned above, the values specified in na_values are also treated as NaN. You can specify multiple values by providing a list.

            df_nan_set_na = pd.read_csv('data/src/sample_header_index_nan.csv',
                                        index_col=0, na_values='-')
            print(df_nan_set_na)
            #         a   b
            # ONE   NaN NaN
            # TWO   NaN NaN
            # THREE NaN NaN
            print(df_nan_set_na.isnull())
            #           a     b
            # ONE    True  True
            # TWO    True  True
            # THREE  True  True
            
            

            If you set the keep_default_na argument to False, the default values mentioned above will not be treated as NaN. If you specify the na_values argument in this case, only the values specified in na_values will be treated as NaN.

            df_nan_no_keep = pd.read_csv('data/src/sample_header_index_nan.csv',
                                         index_col=0, na_values=['-', 'NaN', 'null'],
                                         keep_default_na=False)
            print(df_nan_no_keep)
            #          a    b
            # ONE         NaN
            # TWO    NaN  nan
            # THREE  NaN  N/A
            print(df_nan_no_keep.isnull())
            #            a      b
            # ONE    False   True
            # TWO     True  False
            # THREE   True  False
            
            

            Treat no values as NaN: na_filter

            If the na_filter argument is set to False, all values are read as-is as strings and nothing is treated as NaN, regardless of the na_values and keep_default_na arguments.

            df_nan_no_filter = pd.read_csv('data/src/sample_header_index_nan.csv',
                                           index_col=0, na_filter=False)
            print(df_nan_no_filter)
            #           a    b
            # ONE          NaN
            # TWO       -  nan
            # THREE  null  N/A
            print(df_nan_no_filter.isnull())
            #            a      b
            # ONE    False  False
            # TWO    False  False
            # THREE  False  False
            
            

            Read CSV by specifying encoding: encoding, encoding_errors

            Use the following CSV file encoded in Shift-JIS as an example.

          • sample_header_shift_jis.csv
            You can specify the encoding with the encoding argument. The default is utf-8, so a UnicodeDecodeError occurs for files that are not UTF-8.

            # df_sjis = pd.read_csv('data/src/sample_header_shift_jis.csv')
            # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 8: invalid start byte
            
            

            You need to specify the correct encoding.

            df_sjis = pd.read_csv('data/src/sample_header_shift_jis.csv',
                                  encoding='shift_jis')
            print(df_sjis)
            #    a   b   c   d
            # 0  あ  12  13  14
            # 1  い  22  23  24
            # 2  う  32  33  34
            
            

            The encoding_errors argument allows you to specify error handling.

            The default is 'strict', which results in an error as mentioned above. You can specify 'ignore', 'replace', etc.

          • codecs - Error Handlers — Codec registry and base classes — Python 3.11.4 documentation

            df_ignore = pd.read_csv('data/src/sample_header_shift_jis.csv',
                                    encoding_errors='ignore')
            print(df_ignore)
            #     a   b   c   d
            # 0 NaN  12  13  14
            # 1 NaN  22  23  24
            # 2 NaN  32  33  34
            df_replace = pd.read_csv('data/src/sample_header_shift_jis.csv',
                                     encoding_errors='replace')
            print(df_replace)
            #     a   b   c   d
            # 0  ��  12  13  14
            # 1  ��  22  23  24
            # 2  ��  32  33  34
            df_backslash = pd.read_csv('data/src/sample_header_shift_jis.csv',
                                       encoding_errors='backslashreplace')
            print(df_backslash)
            #           a   b   c   d
            # 0  \x82\xa0  12  13  14
            # 1  \x82\xa2  22  23  24
            # 2  \x82\xa4  32  33  34
            
            source: pandas_read_csv_encoding.py

            Read compressed CSV files

            read_csv() can read CSV files that have been compressed with ZIP or other methods.

            df_zip = pd.read_csv('data/src/sample_header.csv.zip')
            print(df_zip)
            #     a   b   c   d
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            
            

            If the extension is .gz, .bz2, .zip, .xz, .zst, .tar, .tar.gz, .tar.xz, or .tar.bz2, the compression format is detected automatically and the file is decompressed. For other extensions, explicitly specify a string such as 'zip' (without the period) in the compression argument.

            Note that this works only for archives containing a single CSV file; an archive containing multiple files raises an error.
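            As a sketch of the compression argument (the file names below are hypothetical, created on the fly in a temporary directory), a gzip-compressed CSV is detected automatically by its extension; with an unrecognized extension, the format must be named explicitly.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a small gzip-compressed CSV into a temporary directory.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, 'sample.csv.gz')
with gzip.open(gz_path, 'wt') as f:
    f.write('a,b,c,d\n11,12,13,14\n21,22,23,24\n')

# The .gz extension is detected automatically.
df_auto = pd.read_csv(gz_path)

# With an unrecognized extension, pass compression explicitly.
dat_path = os.path.join(tmp_dir, 'sample.dat')
os.rename(gz_path, dat_path)
df_explicit = pd.read_csv(dat_path, compression='gzip')

print(df_auto.equals(df_explicit))  # True
```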

            Read CSV files from the web

            You can also specify a URL as the first argument to read_csv(). You can directly read files from the web, even if they are ZIP compressed.

            Let's take a CSV file on GitHub as an example. Be careful to use the URL that points to the CSV file itself (Raw data).

            df_web = pd.read_csv(
                'https://raw.githubusercontent.com/nkmk/python-snippets/master/notebook/data/src/sample_header.csv'
            )
            print(df_web)
            #     a   b   c   d
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            
            

            Note that in many cases you will need to adjust the arguments to read the file correctly or to exclude unneeded data. For large files in particular, it is often more practical to download them locally first.

            Read TSV files: sep

            To read a TSV file, specify the tab character (\t) in the sep argument of read_csv().
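            A minimal sketch, with the tab-separated data supplied inline via io.StringIO rather than a file:

```python
import io

import pandas as pd

# Tab-separated version of the sample data.
tsv_text = 'a\tb\tc\td\n11\t12\t13\t14\n21\t22\t23\t24\n31\t32\t33\t34\n'

# sep='\t' tells read_csv() to split on tabs instead of commas.
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(df_tsv)
#     a   b   c   d
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34
```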
