pandas: Read CSV into DataFrame with read_csv()

In pandas, pandas.read_csv() allows you to read CSV or TSV files into pandas.DataFrame objects.

For information on writing or appending DataFrame objects to CSV files, refer to the following article.
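
As a minimal sketch (the linked article covers the details), DataFrame.to_csv() writes a DataFrame to a CSV file, and mode='a' appends to an existing one; the output path below is a hypothetical example.

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Hypothetical output path, used only for illustration
df.to_csv('data/dst/sample_out.csv')                          # write (overwrite)
df.to_csv('data/dst/sample_out.csv', mode='a', header=False)  # append without a header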

The sample code in this article uses pandas version 2.0.3.

import pandas as pd

print(pd.__version__)
# 2.0.3

Basic Usage of pandas.read_csv()

When you specify the path to the CSV file as the first argument in read_csv(), the file is loaded as a DataFrame.

df = pd.read_csv('data/src/sample_header.csv')
print(df)
#     a   b   c   d
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34

The path can be either absolute or relative. For how to check and change the current directory, see the following article.
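
As a brief aside, here is a minimal sketch using the standard library (the printed path is a placeholder): check the current directory with os.getcwd() and change it with os.chdir().

import os

print(os.getcwd())
# /path/to/current/directory

# os.chdir('data/src')  # change the current directory if necessary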

By default, the first row of the CSV file is treated as the column names (columns). You can configure various settings, such as how headers are handled, by the arguments to read_csv().

The following sections describe the main arguments frequently used with read_csv(). For comprehensive details of all arguments, please refer to the official documentation.

Read CSV without a header: header, names

Use the following CSV file without a header as an example.

11,12,13,14
21,22,23,24
31,32,33,34
source: sample.csv

By default, the first row is recognized as a header and assigned to the column names (columns).

df = pd.read_csv('data/src/sample.csv')
print(df)
#    11  12  13  14
# 0  21  22  23  24
# 1  31  32  33  34

print(df.columns)
# Index(['11', '12', '13', '14'], dtype='object')

If you set the header argument to None, sequential numbers starting from zero will be used as columns.

df_none = pd.read_csv('data/src/sample.csv', header=None)
print(df_none)
#     0   1   2   3
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34

To set other values as column names, specify a list or tuple for the names argument.

df_names = pd.read_csv('data/src/sample.csv', names=['A', 'B', 'C', 'D'])
print(df_names)
#     A   B   C   D
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34

Read CSV with a header: header, names

Use the following CSV file with a header as an example.

a,b,c,d
11,12,13,14
21,22,23,24
31,32,33,34

As mentioned above, by default, the first row is recognized as a header and assigned to the column names (columns).

df = pd.read_csv('data/src/sample_header.csv')
print(df)
#     a   b   c   d
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34

To set values different from the first row as columns, specify a list for the names argument while setting header=0. Note that if header=0 is not set, the first row remains as data.

df_names = pd.read_csv('data/src/sample_header.csv', names=['A', 'B', 'C', 'D'])
print(df_names)
#     A   B   C   D
# 0   a   b   c   d
# 1  11  12  13  14
# 2  21  22  23  24
# 3  31  32  33  34

df_names_0 = pd.read_csv('data/src/sample_header.csv',
                         header=0, names=['A', 'B', 'C', 'D'])
print(df_names_0)
#     A   B   C   D
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34

You can also specify a line number, starting from zero, to use as a header. For example, you can set header=2. Lines above the specified line are ignored.

df_header_2 = pd.read_csv('data/src/sample_header.csv', header=2)
print(df_header_2)
#    21  22  23  24
# 0  31  32  33  34

Read CSV with an index: index_col

Use the following CSV file with both a header and an index as an example.

,a,b,c,d
ONE,11,12,13,14
TWO,21,22,23,24
THREE,31,32,33,34

By default, the row names (index) are sequentially numbered starting from zero. The first column is not treated as an index.

df = pd.read_csv('data/src/sample_header_index.csv')
print(df)
#   Unnamed: 0   a   b   c   d
# 0        ONE  11  12  13  14
# 1        TWO  21  22  23  24
# 2      THREE  31  32  33  34

print(df.index)
# RangeIndex(start=0, stop=3, step=1)

You can specify the column number, starting from zero, to use as the index by setting the index_col argument.

df_index_col = pd.read_csv('data/src/sample_header_index.csv', index_col=0)
print(df_index_col)
#         a   b   c   d
# ONE    11  12  13  14
# TWO    21  22  23  24
# THREE  31  32  33  34

print(df_index_col.index)
# Index(['ONE', 'TWO', 'THREE'], dtype='object')

Select columns to read: usecols

To read only specific columns, use the usecols argument. You can specify the column numbers or column names to be read using a list.

df_usecols = pd.read_csv('data/src/sample_header.csv', usecols=[1, 3])
print(df_usecols)
#     b   d
# 0  12  14
# 1  22  24
# 2  32  34

df_usecols = pd.read_csv('data/src/sample_header.csv', usecols=['a', 'c'])
print(df_usecols)
#     a   c
# 0  11  13
# 1  21  23
# 2  31  33

You can also specify callable objects, such as lambda expressions. The column name is passed to the callable, and only the columns evaluated as True are extracted.

df_usecols = pd.read_csv('data/src/sample_header.csv',
                         usecols=lambda x: x != 'b')
print(df_usecols)
#     a   c   d
# 0  11  13  14
# 1  21  23  24
# 2  31  33  34

df_usecols = pd.read_csv('data/src/sample_header.csv',
                         usecols=lambda x: x not in ['a', 'c'])
print(df_usecols)
#     b   d
# 0  12  14
# 1  22  24
# 2  32  34

Skip rows to read

Skip the first n rows or specified row numbers: skiprows

To skip (exclude) specific rows to read, use the skiprows argument.

If you specify an integer, the first n rows of the file are skipped.

df = pd.read_csv('data/src/sample.csv', header=None)
print(df)
#     0   1   2   3
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34

df_skiprows = pd.read_csv('data/src/sample.csv', header=None, skiprows=2)
print(df_skiprows)
#     0   1   2   3
# 0  31  32  33  34

You can also specify the row numbers to skip in a list. Unlike usecols, you specify the rows to skip, not the rows to read.

df_skiprows = pd.read_csv('data/src/sample.csv', header=None, skiprows=[0, 2])
print(df_skiprows)
#     0   1   2   3
# 0  21  22  23  24

You can also specify callable objects, such as lambda expressions. The row number is passed to the callable, and any rows evaluated as True are skipped.

df_skiprows = pd.read_csv('data/src/sample.csv', header=None,
                          skiprows=lambda x: x not in [0, 2])
print(df_skiprows)
#     0   1   2   3
# 0  11  12  13  14
# 1  31  32  33  34

For a file with a header, note that the header row is considered the 0th row.

df_skiprows = pd.read_csv('data/src/sample_header.csv', skiprows=2)
print(df_skiprows)
#    21  22  23  24
# 0  31  32  33  34

df_skiprows = pd.read_csv('data/src/sample_header.csv', skiprows=[1, 3])
print(df_skiprows)
#     a   b   c   d
# 0  21  22  23  24

Note that even if you specify a column as the index with index_col, skiprows works only with row numbers; you cannot specify the rows to skip by their index labels (row names).
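
For example, with the indexed CSV file shown above, skiprows=[1] skips the second line of the file (the row labeled ONE) by its position, not by its label.

df_skiprows_idx = pd.read_csv('data/src/sample_header_index.csv',
                              index_col=0, skiprows=[1])
print(df_skiprows_idx)
#         a   b   c   d
# TWO    21  22  23  24
# THREE  31  32  33  34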

Skip the last n rows: skipfooter

To skip from the end of the file, use the skipfooter argument. Specify the number of rows to skip from the end as an integer.

df_skipfooter = pd.read_csv('data/src/sample.csv', header=None,
                            skipfooter=1, engine='python')
print(df_skipfooter)
#     0   1   2   3
# 0  11  12  13  14
# 1  21  22  23  24

Specify engine='python' to avoid the following warning, which may be raised because the default C parser does not support skipfooter.

ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.

Read only the first n rows: nrows

The nrows argument allows you to read only the first n rows. This can be useful when you want to preview some data from a large file.

df_nrows = pd.read_csv('data/src/sample.csv', header=None, nrows=2)
print(df_nrows)
#     0   1   2   3
# 0  11  12  13  14
# 1  21  22  23  24

In the case of a file with a header, the header row is not counted.

df_nrows = pd.read_csv('data/src/sample_header.csv', nrows=2)
print(df_nrows)
#     a   b   c   d
# 0  11  12  13  14
# 1  21  22  23  24

Read CSV by specifying the data type (dtype)

pandas.DataFrame has a data type (dtype) for each column. For more information, including the relationship between strings and the object type, see the following article.

While read_csv() automatically selects each column's data type based on the values, you can explicitly specify the data type using the dtype argument.

Use the following CSV file as an example.

,a,b,c,d
ONE,1,"001",100,x
TWO,2,"020",,y
THREE,3,"300",300,z

For example, a column of numbers with leading zeros is treated as numeric by default, even if the values are enclosed in quotes, so the leading zeros are dropped.

df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
print(df)
#        a    b      c  d
# ONE    1    1  100.0  x
# TWO    2   20    NaN  y
# THREE  3  300  300.0  z

print(df.dtypes)
# a      int64
# b      int64
# c    float64
# d     object
# dtype: object

If you want to keep the leading zeros by treating the values as strings, specify the dtype argument of read_csv().

Specify the same data type (dtype) for all columns

If you specify any data type for the dtype argument, all columns, including those specified by index_col, are converted to that type.

Be careful, as it will cause an error if there are values that cannot be converted to the specified type.

# pd.read_csv('data/src/sample_header_index_dtype.csv',
#             index_col=0, dtype=float)
# ValueError: could not convert string to float: 'ONE'

If you set dtype=str, all columns are converted to strings. However, in this case, the missing value (NaN) will still be of type float.

df_str = pd.read_csv('data/src/sample_header_index_dtype.csv',
                     index_col=0, dtype=str)
print(df_str)
#        a    b    c  d
# ONE    1  001  100  x
# TWO    2  020  NaN  y
# THREE  3  300  300  z

print(df_str.dtypes)
# a    object
# b    object
# c    object
# d    object
# dtype: object

print(df_str.applymap(type))
#                    a              b                c              d
# ONE    <class 'str'>  <class 'str'>    <class 'str'>  <class 'str'>
# TWO    <class 'str'>  <class 'str'>  <class 'float'>  <class 'str'>
# THREE  <class 'str'>  <class 'str'>    <class 'str'>  <class 'str'>

If you read the file without specifying dtype and then cast it to str with astype(), NaN values are also converted to the string 'nan'.

df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
print(df.astype(str))
#        a    b      c  d
# ONE    1    1  100.0  x
# TWO    2   20    nan  y
# THREE  3  300  300.0  z

print(df.astype(str).applymap(type))
#                    a              b              c              d
# ONE    <class 'str'>  <class 'str'>  <class 'str'>  <class 'str'>
# TWO    <class 'str'>  <class 'str'>  <class 'str'>  <class 'str'>
# THREE  <class 'str'>  <class 'str'>  <class 'str'>  <class 'str'>

Specify data type (dtype) for each column

You can also specify the type of each column individually by passing a dictionary (dict) to the dtype argument. The types for columns not specified in the dictionary will be selected automatically.

df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
                     index_col=0, dtype={'a': float, 'b': str})
print(df_col)
#          a    b      c  d
# ONE    1.0  001  100.0  x
# TWO    2.0  020    NaN  y
# THREE  3.0  300  300.0  z

print(df_col.dtypes)
# a    float64
# b     object
# c    float64
# d     object
# dtype: object

The dictionary keys can also be column numbers. Note that these numbers are positions in the original file, so when index_col is used, the index column is included in the count.

df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
                     index_col=0, dtype={1: float, 2: str})
print(df_col)
#          a    b      c  d
# ONE    1.0  001  100.0  x
# TWO    2.0  020    NaN  y
# THREE  3.0  300  300.0  z

print(df_col.dtypes)
# a    float64
# b     object
# c    float64
# d     object
# dtype: object

Read CSV with missing values (NaN)

Default values treated as missing

By default, read_csv() interprets some values as NaN. For example, strings such as '', 'NaN', 'nan', 'null', and others are treated as NaN.

By default the following values are interpreted as NaN: “”, “#N/A”, “#N/A N/A”, “#NA”, “-1.#IND”, “-1.#QNAN”, “-NaN”, “-nan”, “1.#IND”, “1.#QNAN”, “<NA>”, “N/A”, “NA”, “NULL”, “NaN”, “None”, “n/a”, “nan”, “null”.
pandas.read_csv — pandas 2.0.3 documentation

Use the following CSV file as an example.

,a,b
ONE,,NaN
TWO,-,nan
THREE,null,N/A

If you read the file with default settings and check with the isnull() method, you can see that all values except '-' are treated as NaN.

df_nan = pd.read_csv('data/src/sample_header_index_nan.csv', index_col=0)
print(df_nan)
#          a   b
# ONE    NaN NaN
# TWO      - NaN
# THREE  NaN NaN

print(df_nan.isnull())
#            a     b
# ONE     True  True
# TWO    False  True
# THREE   True  True

For handling missing values after reading into a pandas.DataFrame, refer to the following article.
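
As a minimal sketch (the linked article covers these in detail), dropna() and fillna() are typical starting points.

# Drop rows in which all values are NaN
df_dropped = df_nan.dropna(how='all')

# Replace NaN with a fixed value
df_filled = df_nan.fillna(0)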

Specify values to be treated as NaN: na_values, keep_default_na

To specify values to be treated as NaN, use the na_values argument. In addition to the default values mentioned above, the values specified in na_values are also treated as NaN. You can specify multiple values by providing a list.

df_nan_set_na = pd.read_csv('data/src/sample_header_index_nan.csv',
                            index_col=0, na_values='-')
print(df_nan_set_na)
#         a   b
# ONE   NaN NaN
# TWO   NaN NaN
# THREE NaN NaN

print(df_nan_set_na.isnull())
#           a     b
# ONE    True  True
# TWO    True  True
# THREE  True  True

If you set the keep_default_na argument to False, the default values mentioned above will not be treated as NaN. If you specify the na_values argument in this case, only the values specified in na_values will be treated as NaN.

df_nan_no_keep = pd.read_csv('data/src/sample_header_index_nan.csv',
                             index_col=0, na_values=['-', 'NaN', 'null'],
                             keep_default_na=False)
print(df_nan_no_keep)
#          a    b
# ONE         NaN
# TWO    NaN  nan
# THREE  NaN  N/A

print(df_nan_no_keep.isnull())
#            a      b
# ONE    False   True
# TWO     True  False
# THREE   True  False

Treat no values as NaN: na_filter

If the na_filter argument is set to False, all values are read as strings as they are and are not treated as NaN, regardless of the setting of the na_values and keep_default_na arguments.

df_nan_no_filter = pd.read_csv('data/src/sample_header_index_nan.csv',
                               index_col=0, na_filter=False)
print(df_nan_no_filter)
#           a    b
# ONE          NaN
# TWO       -  nan
# THREE  null  N/A

print(df_nan_no_filter.isnull())
#            a      b
# ONE    False  False
# TWO    False  False
# THREE  False  False

Read CSV by specifying encoding: encoding, encoding_errors

Use the following CSV file encoded in Shift-JIS as an example.

a,b,c,d
あ,12,13,14
い,22,23,24
う,32,33,34

You can specify the encoding with the encoding argument. The default is 'utf-8', so a UnicodeDecodeError will occur for files in other encodings.

# df_sjis = pd.read_csv('data/src/sample_header_shift_jis.csv')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 8: invalid start byte

You need to specify the correct encoding.

df_sjis = pd.read_csv('data/src/sample_header_shift_jis.csv',
                      encoding='shift_jis')
print(df_sjis)
#    a   b   c   d
# 0  あ  12  13  14
# 1  い  22  23  24
# 2  う  32  33  34

The encoding_errors argument allows you to specify error handling.

The default is 'strict', which results in an error as mentioned above. You can specify 'ignore', 'replace', etc.

df_ignore = pd.read_csv('data/src/sample_header_shift_jis.csv',
                        encoding_errors='ignore')
print(df_ignore)
#     a   b   c   d
# 0 NaN  12  13  14
# 1 NaN  22  23  24
# 2 NaN  32  33  34

df_replace = pd.read_csv('data/src/sample_header_shift_jis.csv',
                         encoding_errors='replace')
print(df_replace)
#     a   b   c   d
# 0  ��  12  13  14
# 1  ��  22  23  24
# 2  ��  32  33  34

df_backslash = pd.read_csv('data/src/sample_header_shift_jis.csv',
                           encoding_errors='backslashreplace')
print(df_backslash)
#           a   b   c   d
# 0  \x82\xa0  12  13  14
# 1  \x82\xa2  22  23  24
# 2  \x82\xa4  32  33  34

Read compressed CSV files

read_csv() can read CSV files that have been compressed with ZIP or other methods.

df_zip = pd.read_csv('data/src/sample_header.csv.zip')
print(df_zip)
#     a   b   c   d
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34

If the extension is .gz, .bz2, .zip, .xz, .zst, .tar, .tar.gz, .tar.xz, or .tar.bz2, the compression format is detected automatically and the file is decompressed. If the extension is different, explicitly specify a string such as 'zip' (without the period) for the compression argument.
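
For example, a minimal sketch with the compression argument (the filename below, with its non-standard extension, is a hypothetical assumption):

# Hypothetical ZIP-compressed CSV whose extension is not .zip
df_comp = pd.read_csv('data/src/sample_header_csv.bin', compression='zip')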

Note that this only works when the archive contains a single CSV file; if it contains multiple files, an error occurs.
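
If an archive does contain multiple files, one workaround (a sketch using the standard library zipfile module; the archive and member names are assumptions) is to open the target member yourself and pass the file object to read_csv().

import zipfile

# Hypothetical archive containing several files; read only one member
with zipfile.ZipFile('data/src/multiple_files.zip') as zf:
    with zf.open('sample_header.csv') as f:
        df_member = pd.read_csv(f)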

Read CSV files from the web

You can also specify a URL as the first argument to read_csv(). You can directly read files from the web, even if they are ZIP compressed.

Let's take a CSV file on GitHub as an example. Be careful to use the URL that points to the CSV file itself (Raw data).

df_web = pd.read_csv(
    'https://raw.githubusercontent.com/nkmk/python-snippets/master/notebook/data/src/sample_header.csv'
)
print(df_web)
#     a   b   c   d
# 0  11  12  13  14
# 1  21  22  23  24
# 2  31  32  33  34

Note that in many cases, adjusting the argument settings may be required to successfully read the file or exclude unnecessary data. Particularly for large files, it is often more practical to download them locally first.
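
For example, a sketch of downloading the file first with the standard library (the local filename is an assumption) and then reading the local copy:

import urllib.request

url = ('https://raw.githubusercontent.com/nkmk/python-snippets/'
       'master/notebook/data/src/sample_header.csv')

# Save the file locally, then read it with read_csv()
urllib.request.urlretrieve(url, 'sample_header_downloaded.csv')
df_local = pd.read_csv('sample_header_downloaded.csv')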

Read TSV files: sep

To read a TSV file, specify the tab character (\t) in the sep argument of read_csv().

    a   b   c   d
ONE 11  12  13  14
TWO 21  22  23  24
THREE   31  32  33  34

df_tsv_sep = pd.read_csv('data/src/sample_header_index.tsv',
                         index_col=0, sep='\t')
print(df_tsv_sep)
#         a   b   c   d
# ONE    11  12  13  14
# TWO    21  22  23  24
# THREE  31  32  33  34

read_table() uses the tab character (\t) as the default separator; its usage is otherwise the same as read_csv().

df_tsv = pd.read_table('data/src/sample_header_index.tsv', index_col=0)
print(df_tsv)
#         a   b   c   d
# ONE    11  12  13  14
# TWO    21  22  23  24
# THREE  31  32  33  34
