
In pandas, pandas.read_csv() allows you to read CSV or TSV files into pandas.DataFrame objects.

  • pandas.read_csv — pandas 2.0.3 documentation
  • IO tools (text, CSV, HDF5, …) - CSV & text files — pandas 2.0.3 documentation
  • Contents
    • Basic Usage of pandas.read_csv()
    • Read CSV without a header: header, names
    • Read CSV with a header: header, names
    • Read CSV with an index: index_col
    • Select columns to read: usecols
    • Skip rows to read
      • Skip the first n rows or specified row numbers: skiprows
      • Skip the last n rows: skipfooter
      • Read only the first n rows: nrows
    • Read CSV by specifying the data type (dtype)
      • Specify the same data type (dtype) for all columns
      • Specify data type (dtype) for each column
    • Read CSV with missing values (NaN)
      • Default values treated as missing
      • Specify values to be treated as NaN: na_values, keep_default_na
      • Treat no values as NaN: na_filter
    • Read CSV by specifying encoding: encoding, encoding_errors
    • Read compressed CSV files
    • Read CSV files from the web
    • Read TSV files: sep

            For information on writing or appending DataFrame objects to CSV files, refer to the following article.

          • pandas: Write DataFrame to CSV with to_csv()

            The sample code in this article uses pandas version 2.0.3.

            import pandas as pd
            print(pd.__version__)
            # 2.0.3
            
            source: pandas_read_csv.py

            Basic Usage of pandas.read_csv()

            When you specify the path to the CSV file as the first argument of read_csv(), the file is loaded as a DataFrame.

            df = pd.read_csv('data/src/sample_header.csv')
            print(df)
            #     a   b   c   d
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            
            

            The path can be either absolute or relative. For how to check and change the current directory, see the following article.

          • Get and change the current working directory in Python
            By default, the first row of the CSV file is treated as the column names (columns). How headers are handled, along with various other settings, can be configured through the arguments of read_csv().

            The following sections describe the main arguments frequently used with read_csv(). For comprehensive details of all arguments, please refer to the official documentation.

          • pandas.read_csv — pandas 2.0.3 documentation

            Read CSV without a header: header, names

            Use the following CSV file without a header as an example.

            11,12,13,14
            21,22,23,24
            31,32,33,34
            
            source: sample.csv

            By default, the first row is recognized as a header and assigned to the column names (columns).

            df = pd.read_csv('data/src/sample.csv')
            print(df)
            #    11  12  13  14
            # 0  21  22  23  24
            # 1  31  32  33  34
            print(df.columns)
            # Index(['11', '12', '13', '14'], dtype='object')
            
            

            If you set the header argument to None, sequential numbers starting from zero will be used as columns.

            df_none = pd.read_csv('data/src/sample.csv', header=None)
            print(df_none)
            #     0   1   2   3
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            
            

            To set other values as column names, specify a list or tuple for the names argument.

            df_names = pd.read_csv('data/src/sample.csv', names=['A', 'B', 'C', 'D'])
            print(df_names)
            #     A   B   C   D
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            
            

            Read CSV with a header: header, names

            Use the following CSV file with a header as an example.

            a,b,c,d
            11,12,13,14
            21,22,23,24
            31,32,33,34
            
            

            As mentioned above, by default, the first row is recognized as a header and assigned to the column names (columns).

            df = pd.read_csv('data/src/sample_header.csv')
            print(df)
            #     a   b   c   d
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            
            

            To set values different from the first row as columns, specify a list for the names argument while setting header=0. Note that if header=0 is not set, the first row remains as data.

            df_names = pd.read_csv('data/src/sample_header.csv', names=['A', 'B', 'C', 'D'])
            print(df_names)
            #     A   B   C   D
            # 0   a   b   c   d
            # 1  11  12  13  14
            # 2  21  22  23  24
            # 3  31  32  33  34
            df_names_0 = pd.read_csv('data/src/sample_header.csv',
                                     header=0, names=['A', 'B', 'C', 'D'])
            print(df_names_0)
            #     A   B   C   D
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            
            

            You can also specify which line to use as the header by its line number, counting from zero; for example, header=2 uses the third line as the header. Lines above the specified line are ignored.

            df_header_2 = pd.read_csv('data/src/sample_header.csv', header=2)
            print(df_header_2)
            #    21  22  23  24
            # 0  31  32  33  34
            
            

            Read CSV with an index: index_col

            Use the following CSV file with both a header and an index as an example.

            ,a,b,c,d
            ONE,11,12,13,14
            TWO,21,22,23,24
            THREE,31,32,33,34
            
            

            By default, the row names (index) are sequentially numbered starting from zero. The first column is not treated as an index.

            df = pd.read_csv('data/src/sample_header_index.csv')
            print(df)
            #   Unnamed: 0   a   b   c   d
            # 0        ONE  11  12  13  14
            # 1        TWO  21  22  23  24
            # 2      THREE  31  32  33  34
            print(df.index)
            # RangeIndex(start=0, stop=3, step=1)
            
            

            You can specify the column number, starting from zero, to use as index by setting the argument index_col.

            df_index_col = pd.read_csv('data/src/sample_header_index.csv', index_col=0)
            print(df_index_col)
            #         a   b   c   d
            # ONE    11  12  13  14
            # TWO    21  22  23  24
            # THREE  31  32  33  34
            print(df_index_col.index)
            # Index(['ONE', 'TWO', 'THREE'], dtype='object')
            
            source: pandas_read_csv.py

            Select columns to read: usecols

            To read only specific columns, use the usecols argument. You can specify the column numbers or column names to be read using a list.

            df_usecols = pd.read_csv('data/src/sample_header.csv', usecols=[1, 3])
            print(df_usecols)
            #     b   d
            # 0  12  14
            # 1  22  24
            # 2  32  34
            df_usecols = pd.read_csv('data/src/sample_header.csv', usecols=['a', 'c'])
            print(df_usecols)
            #     a   c
            # 0  11  13
            # 1  21  23
            # 2  31  33
            
            

            You can also specify callable objects, such as lambda expressions. The column name is passed to the callable, and only the columns evaluated as True are extracted.

          • Lambda expressions in Python

            df_usecols = pd.read_csv('data/src/sample_header.csv',
                                     usecols=lambda x: x != 'b')
            print(df_usecols)
            #     a   c   d
            # 0  11  13  14
            # 1  21  23  24
            # 2  31  33  34
            df_usecols = pd.read_csv('data/src/sample_header.csv',
                                     usecols=lambda x: x not in ['a', 'c'])
            print(df_usecols)
            #     b   d
            # 0  12  14
            # 1  22  24
            # 2  32  34
            
            

            Skip rows to read

            Skip the first n rows or specified row numbers: skiprows

            To skip (exclude) specific rows to read, use the skiprows argument.

            If you specify an integer, it skips the first n rows from the file to read.

            df = pd.read_csv('data/src/sample.csv', header=None)
            print(df)
            #     0   1   2   3
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            df_skiprows = pd.read_csv('data/src/sample.csv', header=None, skiprows=2)
            print(df_skiprows)
            #     0   1   2   3
            # 0  31  32  33  34
            
            

            You can also specify the row numbers to skip in a list. Unlike usecols, you specify the rows to skip, not the rows to read.

            df_skiprows = pd.read_csv('data/src/sample.csv', header=None, skiprows=[0, 2])
            print(df_skiprows)
            #     0   1   2   3
            # 0  21  22  23  24
            
            

            You can also specify callable objects, such as lambda expressions. The row number is passed to the callable, and any rows evaluated as True are skipped.

          • Lambda expressions in Python

            df_skiprows = pd.read_csv('data/src/sample.csv', header=None,
                                      skiprows=lambda x: x not in [0, 2])
            print(df_skiprows)
            #     0   1   2   3
            # 0  11  12  13  14
            # 1  31  32  33  34
            
            

            For a file with a header, note that the header row is considered the 0th row.

            df_skiprows = pd.read_csv('data/src/sample_header.csv', skiprows=2)
            print(df_skiprows)
            #    21  22  23  24
            # 0  31  32  33  34
            df_skiprows = pd.read_csv('data/src/sample_header.csv', skiprows=[1, 3])
            print(df_skiprows)
            #     a   b   c   d
            # 0  21  22  23  24
            
            

            Note that even if you specify a column as the index using index_col, skiprows cannot be given row names (index labels); it always refers to line numbers in the file.
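            To illustrate, here is a minimal sketch using the same sample data supplied inline via io.StringIO (so no file on disk is assumed). skiprows counts physical line numbers even when index_col is set.

```python
import io

import pandas as pd

# Same content as sample_header_index.csv, supplied inline via StringIO.
csv_text = ',a,b,c,d\nONE,11,12,13,14\nTWO,21,22,23,24\nTHREE,31,32,33,34\n'

# skiprows=[1] skips the physical line 'ONE,...' (the header is line 0);
# passing an index label such as 'ONE' would not work.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, skiprows=[1])
print(df)
#         a   b   c   d
# TWO    21  22  23  24
# THREE  31  32  33  34
```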

            Skip the last n rows: skipfooter

            To skip from the end of the file, use the skipfooter argument. Specify the number of rows to skip from the end as an integer.

            df_skipfooter = pd.read_csv('data/src/sample.csv', header=None,
                                        skipfooter=1, engine='python')
            print(df_skipfooter)
            #     0   1   2   3
            # 0  11  12  13  14
            # 1  21  22  23  24
            
            

            Specify engine='python' to avoid the following warning, which may appear depending on the environment because the default C engine does not support skipfooter.

            ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.
            

            Read only the first n rows: nrows

            The nrows argument allows you to read only the first n rows. This can be useful when you want to preview some data from a large file.

            df_nrows = pd.read_csv('data/src/sample.csv', header=None, nrows=2)
            print(df_nrows)
            #     0   1   2   3
            # 0  11  12  13  14
            # 1  21  22  23  24
            
            

            In the case of a file with a header, the header row is not counted.

            df_nrows = pd.read_csv('data/src/sample_header.csv', nrows=2)
            print(df_nrows)
            #     a   b   c   d
            # 0  11  12  13  14
            # 1  21  22  23  24
            
            

            Read CSV by specifying the data type (dtype)

            pandas.DataFrame has a data type (dtype) for each column. For more information, including the relationship between strings and the object type, see the following article.

          • pandas: How to use astype() to cast dtype of DataFrame
          • While read_csv() automatically selects each column's data type based on the values, you can explicitly specify the data type using the dtype argument.

            Use the following CSV file as an example.

            ,a,b,c,d
            ONE,1,"001",100,x
            TWO,2,"020",,y
            THREE,3,"300",300,z
            
            

            For example, a column of zero-padded numbers is parsed as numeric by default, even if the values are enclosed in quotes, so the leading zeros are dropped.

            df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
            print(df)
            #        a    b      c  d
            # ONE    1    1  100.0  x
            # TWO    2   20    NaN  y
            # THREE  3  300  300.0  z
            print(df.dtypes)
            # a      int64
            # b      int64
            # c    float64
            # d     object
            # dtype: object
            
            

            To preserve such values as strings, leading zeros included, specify the dtype argument of read_csv().

            Specify the same data type (dtype) for all columns

            If you specify any data type for the dtype argument, all columns, including those specified by index_col, are converted to that type.

            Be careful, as it will cause an error if there are values that cannot be converted to the specified type.

            # pd.read_csv('data/src/sample_header_index_dtype.csv',
            #             index_col=0, dtype=float)
            # ValueError: could not convert string to float: 'ONE'
            
            

            If you set dtype=str, all columns are converted to strings. However, in this case, the missing value (NaN) will still be of type float.

            df_str = pd.read_csv('data/src/sample_header_index_dtype.csv',
                                 index_col=0, dtype=str)
            print(df_str)
            #        a    b    c  d
            # ONE    1  001  100  x
            # TWO    2  020  NaN  y
            # THREE  3  300  300  z
            print(df_str.dtypes)
            # a    object
            # b    object
            # c    object
            # d    object
            # dtype: object
            print(df_str.applymap(type))
            #                    a              b                c              d
            # ONE    <class 'str'>  <class 'str'>    <class 'str'>  <class 'str'>
            # TWO    <class 'str'>  <class 'str'>  <class 'float'>  <class 'str'>
            # THREE  <class 'str'>  <class 'str'>    <class 'str'>  <class 'str'>
            
            source: pandas_read_csv_dtype.py

            If you read the file without specifying dtype and then cast it to str with astype(), NaN values are also converted to the string 'nan'.

            df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
            print(df.astype(str))
            #        a    b      c  d
            # ONE    1    1  100.0  x
            # TWO    2   20    nan  y
            # THREE  3  300  300.0  z
            print(df.astype(str).applymap(type))
            #                    a              b              c              d
            # ONE    <class 'str'>  <class 'str'>  <class 'str'>  <class 'str'>
            # TWO    <class 'str'>  <class 'str'>  <class 'str'>  <class 'str'>
            # THREE  <class 'str'>  <class 'str'>  <class 'str'>  <class 'str'>
            
            

            Specify data type (dtype) for each column

            You can also specify the type of each column individually by passing a dictionary (dict) to the dtype argument. The types for columns not specified in the dictionary will be selected automatically.

            df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
                                 index_col=0, dtype={'a': float, 'b': str})
            print(df_col)
            #          a    b      c  d
            # ONE    1.0  001  100.0  x
            # TWO    2.0  020    NaN  y
            # THREE  3.0  300  300.0  z
            print(df_col.dtypes)
            # a    float64
            # b     object
            # c    float64
            # d     object
            # dtype: object
            
            

            The dictionary keys can also be column numbers. Note that when index_col is used, the column numbers are counted including the index column.

            df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
                                 index_col=0, dtype={1: float, 2: str})
            print(df_col)
            #          a    b      c  d
            # ONE    1.0  001  100.0  x
            # TWO    2.0  020    NaN  y
            # THREE  3.0  300  300.0  z
            print(df_col.dtypes)
            # a    float64
            # b     object
            # c    float64
            # d     object
            # dtype: object
            
            

            Read CSV with missing values (NaN)

            Default values treated as missing

            By default, read_csv() interprets some values as NaN. For example, strings such as '', 'NaN', 'nan', 'null', and others are treated as NaN.

            By default the following values are interpreted as NaN: "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None", "n/a", "nan", "null". pandas.read_csv — pandas 2.0.3 documentation

            Use the following CSV file as an example.

            ,a,b
            ONE,,NaN
            TWO,-,nan
            THREE,null,N/A

            If you read the file with default settings and check with the isnull() method, you can see that all values except '-' are treated as NaN.

          • pandas: Detect and count NaN (missing values) with isnull(), isna()

            df_nan = pd.read_csv('data/src/sample_header_index_nan.csv', index_col=0)
            print(df_nan)
            #          a   b
            # ONE    NaN NaN
            # TWO      - NaN
            # THREE  NaN NaN
            print(df_nan.isnull())
            #            a     b
            # ONE     True  True
            # TWO    False  True
            # THREE   True  True
            
            

            For handling missing values after reading into a pandas.DataFrame, refer to the following article.

          • pandas: Remove NaN (missing values) with dropna()
          • pandas: Replace NaN (missing values) with fillna()

            Specify values to be treated as NaN: na_values, keep_default_na

            To specify values to be treated as NaN, use the na_values argument. In addition to the default values mentioned above, the values specified in na_values are also treated as NaN. You can specify multiple values by providing a list.

            df_nan_set_na = pd.read_csv('data/src/sample_header_index_nan.csv',
                                        index_col=0, na_values='-')
            print(df_nan_set_na)
            #         a   b
            # ONE   NaN NaN
            # TWO   NaN NaN
            # THREE NaN NaN
            print(df_nan_set_na.isnull())
            #           a     b
            # ONE    True  True
            # TWO    True  True
            # THREE  True  True
            
            

            If you set the keep_default_na argument to False, the default values mentioned above will not be treated as NaN. If you specify the na_values argument in this case, only the values specified in na_values will be treated as NaN.

            df_nan_no_keep = pd.read_csv('data/src/sample_header_index_nan.csv',
                                         index_col=0, na_values=['-', 'NaN', 'null'],
                                         keep_default_na=False)
            print(df_nan_no_keep)
            #          a    b
            # ONE         NaN
            # TWO    NaN  nan
            # THREE  NaN  N/A
            print(df_nan_no_keep.isnull())
            #            a      b
            # ONE    False   True
            # TWO     True  False
            # THREE   True  False
            
            

            Treat no values as NaN: na_filter

            If the na_filter argument is set to False, all values are read as-is as strings and nothing is treated as NaN, regardless of the na_values and keep_default_na arguments.

            df_nan_no_filter = pd.read_csv('data/src/sample_header_index_nan.csv',
                                           index_col=0, na_filter=False)
            print(df_nan_no_filter)
            #           a    b
            # ONE          NaN
            # TWO       -  nan
            # THREE  null  N/A
            print(df_nan_no_filter.isnull())
            #            a      b
            # ONE    False  False
            # TWO    False  False
            # THREE  False  False
            
            

            Read CSV by specifying encoding: encoding, encoding_errors

            Use the following CSV file encoded in Shift-JIS as an example.

          • sample_header_shift_jis.csv
            You can specify the encoding with the encoding argument. The default is utf-8, so a UnicodeDecodeError occurs for files that are not UTF-8.

            # df_sjis = pd.read_csv('data/src/sample_header_shift_jis.csv')
            # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 8: invalid start byte
            
            

            You need to specify the correct encoding.

            df_sjis = pd.read_csv('data/src/sample_header_shift_jis.csv',
                                  encoding='shift_jis')
            print(df_sjis)
            #    a   b   c   d
            # 0  あ  12  13  14
            # 1  い  22  23  24
            # 2  う  32  33  34
            
            

            The encoding_errors argument allows you to specify error handling.

            The default is 'strict', which results in an error as mentioned above. You can specify 'ignore', 'replace', etc.

          • codecs - Error Handlers — Codec registry and base classes — Python 3.11.4 documentation

            df_ignore = pd.read_csv('data/src/sample_header_shift_jis.csv',
                                    encoding_errors='ignore')
            print(df_ignore)
            #     a   b   c   d
            # 0 NaN  12  13  14
            # 1 NaN  22  23  24
            # 2 NaN  32  33  34
            df_replace = pd.read_csv('data/src/sample_header_shift_jis.csv',
                                     encoding_errors='replace')
            print(df_replace)
            #     a   b   c   d
            # 0  ��  12  13  14
            # 1  ��  22  23  24
            # 2  ��  32  33  34
            df_backslash = pd.read_csv('data/src/sample_header_shift_jis.csv',
                                       encoding_errors='backslashreplace')
            print(df_backslash)
            #           a   b   c   d
            # 0  \x82\xa0  12  13  14
            # 1  \x82\xa2  22  23  24
            # 2  \x82\xa4  32  33  34
            
            source: pandas_read_csv_encoding.py

            Read compressed CSV files

            read_csv() can read CSV files that have been compressed with ZIP or other methods.

            df_zip = pd.read_csv('data/src/sample_header.csv.zip')
            print(df_zip)
            #     a   b   c   d
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            
            

            If the extension is .gz, .bz2, .zip, .xz, .zst, .tar, .tar.gz, .tar.xz, or .tar.bz2, the compression format is detected automatically and the file is decompressed. For other extensions, explicitly specify a string such as 'zip' (without the period) in the compression argument.

            Note that this works only for archives containing a single CSV file; an archive containing multiple files raises an error.
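            As a sketch of the compression argument (the file names below are hypothetical, created on the fly in a temporary directory), a gzip-compressed CSV is detected automatically by its extension; with an unrecognized extension, the format must be named explicitly.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a small gzip-compressed CSV into a temporary directory.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, 'sample.csv.gz')
with gzip.open(gz_path, 'wt') as f:
    f.write('a,b,c,d\n11,12,13,14\n21,22,23,24\n')

# The .gz extension is detected automatically.
df_auto = pd.read_csv(gz_path)

# With an unrecognized extension, pass compression explicitly.
dat_path = os.path.join(tmp_dir, 'sample.dat')
os.rename(gz_path, dat_path)
df_explicit = pd.read_csv(dat_path, compression='gzip')

print(df_auto.equals(df_explicit))  # True
```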

            Read CSV files from the web

            You can also specify a URL as the first argument to read_csv(). You can directly read files from the web, even if they are ZIP compressed.

            Let's take a CSV file on GitHub as an example. Be careful to use the URL that points to the CSV file itself (Raw data).

            df_web = pd.read_csv(
                'https://raw.githubusercontent.com/nkmk/python-snippets/master/notebook/data/src/sample_header.csv'
            )
            print(df_web)
            #     a   b   c   d
            # 0  11  12  13  14
            # 1  21  22  23  24
            # 2  31  32  33  34
            
            

            Note that in many cases you will need to adjust the arguments to read the file correctly or to exclude unneeded data. For large files in particular, it is often more practical to download them locally first.

            Read TSV files: sep

            To read a TSV file, specify the tab character (\t) in the sep argument of read_csv().
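            A minimal sketch, with the tab-separated data supplied inline via io.StringIO rather than a file:

```python
import io

import pandas as pd

# Tab-separated version of the sample data.
tsv_text = 'a\tb\tc\td\n11\t12\t13\t14\n21\t22\t23\t24\n31\t32\t33\t34\n'

# sep='\t' tells read_csv() to split on tabs instead of commas.
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(df_tsv)
#     a   b   c   d
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34
```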
