df = pd.read_csv(file_path)
Problem description
Currently, the default CSV writing behaviour is to write the index column. The default reading behaviour, however, is to assume there is no index column in the file. This is not intuitive when writing and reading files.
The expected behaviour is that a file written without any index option can be read back without any index option. However, the default writing and reading behaviour results in an extra unnamed column.
One fix is to change the default index_col for read_csv, another option is to change the default index boolean for to_csv. The former is probably preferable as it preserves information.
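A minimal sketch of the round-trip described above, using an in-memory buffer instead of a file. The frame contents and variable names are made up for illustration. It also shows the two workarounds that exist today: passing index=False when writing, or index_col=0 when reading.

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Default write includes the index; default read treats it as ordinary data.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
round_tripped = pd.read_csv(buf)
print(list(round_tripped.columns))  # the old index comes back as "Unnamed: 0"

# Workaround 1: do not write the index at all.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
no_index = pd.read_csv(buf)

# Workaround 2: write the index and tell read_csv which column holds it.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf, index_col=0)

print(list(no_index.columns), list(with_index.columns))
```

Either workaround gives back only the original columns; the second one also preserves the index values, which is the "preserves information" argument made above.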
At this point, both of those defaults are unlikely to be changed.
pd.DataFrame.from_csv exists for easier round-tripping, though it is deprecated; see e.g. #10163
Because it is schema-less, csv is never a particularly safe format for round-tripping; consider using something binary like parquet or HDF5 instead
This has been rejected before in #12627 with the following reply:
Has been this way almost since the beginning.
The idea is that .to_csv and .from_csv are inverses
Essentially impossible to change at this point. But to be honest, it's actually a sensible default. Indexes are more and more important. If you are not using them, you should.
However, .from_csv has been deprecated in favor of .read_csv, which would only be the inverse of .to_csv if this default were changed.
I can only interpret the other argument (similar to @chris-b1's reply) to be "because legacy". I personally consider this to be a poor argument in any case. Defaults have been changed before; is there a specific reason this one shouldn't be?
Because it is schema-less, csv is never a particularly safe format for round-tripping; consider using something binary like parquet or HDF5 instead
Agreed. However, while CSV may not be the best data format for round-tripping, in practice this is one of the most common use-cases. Many new users are confused as to why unnamed columns appear in their files seemingly at random. After years of using Pandas, I still regularly forget to set the right arguments to allow round-tripping. In my opinion, the inconsistency of the current defaults only adds unnecessary cognitive load.
I think this is a good issue for v0.25.
but that has its own set of problems
Agreed. But the point is well taken that we should pick a suitable (and consistent) default for both to avoid the confusions described above. That being said, it would be good to get some more opinions on what people generally use in the wild (i.e. with an index, without) before settling on one.
Defaults have been changed before, is there a specific reason this one shouldn't?
We aren't saying that it shouldn't, but asking for us to do this in 0.25.0 is rushing things IMO.
asking for us to do this in 0.25.0 is rushing things IMO.
Sorry, I didn't mean to rush it. I'm not very familiar with the Pandas release cycle. As long as it's not pushed back to 1.0.0 - I don't think it's that breaking.
just personally, I'd be more sympathetic to changing the default on to_csv to index=False, but that has its own set of problems
I'm leaning towards this, too, if nothing else because it would match what most users I know already do.
I tend to agree with @JeroenDelcour. Not only have I seen multiple users get confused by the appearance of unnamed columns, I myself often forget to set the index arguments.
It seems to me we should be prioritizing usability and intuitiveness. Can't we schedule this for release and add a FutureWarning?
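For what it's worth, here is a rough sketch of what such a FutureWarning could look like from the user's side. The wrapper name to_csv_with_warning, the index=None sentinel, and the warning text are all hypothetical; none of this is existing pandas code, only an illustration of the deprecation pattern.

```python
import io
import warnings

import pandas as pd


def to_csv_with_warning(df, path_or_buf, index=None, **kwargs):
    """Hypothetical wrapper (not part of pandas) around DataFrame.to_csv."""
    if index is None:
        # The caller relied on the default, so flag the possible change.
        warnings.warn(
            "The default value of `index` may change in a future version; "
            "pass index=True or index=False explicitly.",
            FutureWarning,
            stacklevel=2,
        )
        index = True  # keep the current default behaviour for now
    return df.to_csv(path_or_buf, index=index, **kwargs)


df = pd.DataFrame({"a": [1, 2]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    to_csv_with_warning(df, io.StringIO())               # warns: index not given
    to_csv_with_warning(df, io.StringIO(), index=False)  # silent: explicit choice

future_warnings = [w for w in caught if issubclass(w.category, FutureWarning)]
print(len(future_warnings))
```

Users who already pass index explicitly would see nothing, so only code relying on the default gets nudged before any change lands.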