I stumbled across os.scandir just recently while refactoring a code base from using os.path to pathlib. After doing a bit of research I’m still not sure whether an equivalent of it exists in the pathlib module. My question could, as an alternative, be phrased: “Is there a solution with pathlib that is as fast as os.scandir()?”
There is no mention of os.scandir() in the table of correspondence in pathlib’s documentation, and Path.walk() is listed as being equivalent to os.walk(), hence traditionally slow. The only mention of scandir is in the Path.walk section, stating: “By default, errors from os.scandir() are ignored.”
Taking a peek at pathlib’s implementation confirms that os.scandir() is used by Path.walk() under the hood. Does that mean that Path.walk is as efficient as os.scandir, and one can simply disregard scandir and use pathlib’s walk instead?

Peter Bittner:
Does that mean that Path.walk is as efficient as os.scandir, and one can simply disregard scandir and use pathlib’s walk instead?
My recommendation is to start by trying it and seeing whether it performs well enough for your purposes, or whether you can discern a performance difference when implementing your overall code each way.
I can’t say how the performance compares. But I would say Path.iterdir is the canonical pathlib equivalent to os.scandir.
pathlib — Object-oriented filesystem paths — Python 3.12.2 documentation
If you don’t need literally everything, but only results matching a known pattern, e.g. on file names or extensions, globbing is very handy too:
pathlib — Object-oriented filesystem paths — Python 3.12.2 documentation
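For example, a small sketch of globbing (the file names here are made up):

```python
# Sketch: Path.glob() yields only entries matching a pattern,
# instead of everything in the directory.
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.py", "b.py", "notes.txt"):
        (root / name).touch()

    py_files = sorted(p.name for p in root.glob("*.py"))

print(py_files)  # ['a.py', 'b.py']
```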
Thanks for mentioning Path.iterdir. When I look at the implementation it turns out that it’s a generator over calls to os.listdir. Hence, it must be just as slow as its os equivalent.
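Observable from the outside, that equivalence looks like this small sketch: Path.iterdir yields the same names os.listdir returns, just wrapped in Path objects:

```python
# Sketch: Path.iterdir() wraps the names from os.listdir() in Path objects.
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("x", "y"):
        (root / name).touch()

    from_listdir = sorted(os.listdir(tmp))
    from_iterdir = sorted(p.name for p in root.iterdir())

print(from_listdir == from_iterdir)  # True
```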
Note that the entire argument is about os.scandir being performant, while os.listdir, os.walk, and likely Path.walk, are not. I’d like to avoid that I, and other people, incur a speed penalty only because we want to be purist and blindly switch to what pathlib offers.
For everyone’s convenience, the motivation and history of os.scandir are well explained at, e.g.:
Contributing os.scandir() to Python
PEP 471 – os.scandir() function – a better and faster directory iterator | peps.python.org
That’s an excellent point, and I agree. According to my understanding, os.scandir is what you want when you want all the directory information now. That’s why Guido refused to let it return Path instances and have them hold cached file system information. If you expect to inspect Path objects later, you’re better off using an alternative to os.scandir.
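To illustrate the point about cached information, here is a minimal sketch of consuming os.scandir’s DirEntry objects; on most platforms entry.is_dir() below is answered from data gathered during the scan itself, without an extra stat call per entry:

```python
# Sketch: DirEntry objects carry directory information collected
# during the scan (e.g. whether an entry is a directory).
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "data.txt"), "w").close()
    os.mkdir(os.path.join(tmp, "subdir"))

    with os.scandir(tmp) as entries:
        kinds = {entry.name: entry.is_dir() for entry in entries}

print(sorted(kinds.items()))  # [('data.txt', False), ('subdir', True)]
```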
If that’s the entire wisdom, I’d like to have that captured in the Python pathlib docs. A hint similar to the “See also” box in the os.listdir documentation might do it. Any opinions on that?
Fair enough - good point. I just thought it and glob were both conspicuous by their absence in your summary.
I naively assumed that, given its name, Path.iterdir returns a pure iterator. Would a simple tweak to use os.scandir instead gain much in terms of performance for pathlib?

James Parrott:
Would a simple tweak to use os.scandir instead gain much in terms of performance for pathlib?
That’s a good thought experiment! I would guess it certainly would. However, os.scandir would never return Path objects but os.DirEntry objects instead, which have file system information cached, as opposed to Path objects. We gain nothing compared to using os.scandir directly.
In addition, the discussion about making os.scandir use pathlib’s Path interface has already taken place, as documented by the author in PEP 471. So, scandir will probably never be integrated with pathlib.
Sure. I’m not suggesting changing os.scandir (or any breaking changes at all). The required helper methods are even already there (originally added for Path.walk). I haven’t tested this (it’s so straightforward, I suspect Barney and Brett et al considered it already and ruled it out), but the diff could be as simple as changing Path.iterdir to:
```python
return (self._make_child_direntry(entry) for entry in self._scandir())
```

instead of the current:

```python
return (self._make_child_relpath(name) for name in os.listdir(self))
```

For reference, the relevant part of the current implementation:

```python
def iterdir(self):
    """Yield path objects of the directory contents.

    The children are yielded in arbitrary order, and the
    special entries '.' and '..' are not included.
    """
    return (self._make_child_relpath(name) for name in os.listdir(self))

def _scandir(self):
    return os.scandir(self)
```
And does that have any performance benefit? We are throwing away the extra information contained in the DirEntry objects, so what do we gain? I would imagine that without the extra information being used, os.listdir is faster than os.scandir. But I would suggest you measure that.
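A minimal timeit sketch along those lines (directory size and repetition count are arbitrary choices, not a rigorous benchmark; each call is wrapped in list() so lazy iterators do their full work):

```python
# Sketch: time os.listdir, os.scandir, and Path.iterdir against the
# same throwaway directory, consuming each iterator fully.
import os
import tempfile
import timeit
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    for i in range(1000):
        open(os.path.join(tmp, f"f{i}"), "w").close()

    t_listdir = timeit.timeit(lambda: list(os.listdir(tmp)), number=20)
    t_scandir = timeit.timeit(lambda: list(os.scandir(tmp)), number=20)
    t_iterdir = timeit.timeit(lambda: list(Path(tmp).iterdir()), number=20)

    print(f"os.listdir:   {t_listdir:.4f}s")
    print(f"os.scandir:   {t_scandir:.4f}s")
    print(f"Path.iterdir: {t_iterdir:.4f}s")
```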
Changing Path objects to contain this extra information is probably not a good idea, because people don’t expect Path objects to “expire” like DirEntry objects will.
And does that have any performance benefit?
Indeed. I’m assuming os.scandir is a true iterator without a cache. The main advantage is making Path.iterdir return a cache-less iterator, instead of a trivial iterator based on a hidden cache from os.listdir.
If so, then it doesn’t store the whole directory contents in memory until needed, so I imagine it would be more efficient to iterate over. Certainly on directories containing large enough numbers of items. But I agree, this remains to be shown. Currently for discussion only. I suggest before serious work is started, we wait to hear from those that worked on Pathlib, to make sure revamping Path.iterdir wasn’t already considered and ruled out, when Path.walk was last worked on. Maybe they just thought users who want more performance can simply use Path.walk over Path.iterdir (and so Peter is right).
We are throwing away the extra information contained in the DirEntry objects, so what do we gain?
The ability to iterate over a directory’s contents only when needed, without storing them all in memory first.
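A small sketch of that difference, with a few throwaway files:

```python
# Sketch: os.listdir builds the full list of names up front,
# while os.scandir hands back an iterator consumed on demand.
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        open(os.path.join(tmp, f"f{i}"), "w").close()

    all_names = os.listdir(tmp)          # whole listing in memory at once
    with os.scandir(tmp) as entries:
        first = next(entries)            # only one entry fetched so far

print(len(all_names), type(first).__name__)  # 3 DirEntry
```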
Changing Path objects to contain this extra information is probably not a good idea, because people don’t expect Path objects to “expire” like DirEntry will.
I’m sorry, I know Guido intervened on this, but I haven’t really grasped why expiration of DirEntry objects is important. The suggested new Path.iterdir does nothing different to create the yielded Path instances from DirEntry objects than Path.walk already does (it reuses Path.walk’s helper methods).
Creating a Path doesn’t add any sort of lock at the OS level, does it? So if something else outside the Python process deletes a file, the value returned by Path(files_path).is_file() will change. So don’t users of Path instances handle expiration already?
James Parrott:
Creating a Path doesn’t add any sort of lock at the OS level, does it? So if something else outside the Python process deletes a file, the value returned by Path(files_path).is_file() will change. So don’t users of Path instances handle expiration already?
Yes, but DirEntry.is_file() doesn’t change (at least not necessarily; it’s system dependent), at least that is my reading of the documentation. That is why this class still exists. By “expire” I mean that the object will tell you wrong things, not that race conditions are impossible.
Thanks. I see.
So was the reason it was chosen to make os.scandir not return Paths that there is legacy code out there that historically already checked, or allowed for, DirEntry possibly returning the wrong thing (from an out-of-date cached value), and that would become less efficient if DirEntry were replaced by a Path, as the same checks (sys calls?) would be done twice, even though a Path returns the correct value?
Not legacy code. Code that cares about speed. Note that race conditions aren’t impossible with Path.is_file(). The point is that Path.is_file() will be slow in contrast to DirEntry.is_file(). And if you can reasonably expect that the file system doesn’t change at the point where you actually look at the DirEntry objects, then there is little danger. But in general there is the expectation that Path objects can be stored for a long time and their methods will still return correct values, whereas DirEntry objects should be consumed soon after generation. They serve different purposes.
I don’t think pathlib needs a replacement for os.scandir. If you need the speed gains from os.scandir, just call it.
It sounds like os.scandir is also creating a cache, just not in Python. In that case the suggestion would merely avoid the sys calls of os.listdir, not create a true iterator.
It should be checked of course, but from a brief read of the code, the Path objects returned by the suggested Path.iterdir are not caching values from DirEntry. _make_child_direntry only uses DirEntry.name and DirEntry.path. If the user wants to know if Path.is_file(), that method will still make a sys call when it is itself called.
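A quick sketch confirming that behaviour: a Path re-queries the file system on every call, so its answer tracks changes rather than a cached snapshot:

```python
# Sketch: Path.is_file() performs a fresh system call on every
# invocation; nothing is cached on the Path object.
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "victim.txt"
    target.touch()

    before = target.is_file()   # True: the file exists right now
    target.unlink()
    after = target.is_file()    # False: the same Path re-checks the disk

print(before, after)  # True False
```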
I don’t think pathlib needs a replacement for os.scandir.
Me neither. I don’t even need this myself! I just wondered if this might be a quick easy win, that could help someone else.
James Parrott:
I haven’t tested this (it’s so straightforward, I suspect Barney and Brett et al considered it already and ruled it out), but the diff could be as simple as changing Path.iterdir to:

```python
return (self._make_child_direntry(entry) for entry in self._scandir())
```

instead of:

```python
return (self._make_child_relpath(name) for name in os.listdir(self))
```
Let’s see what our friendly neighborhood pathlib superhero himself has to say, shall we, @barneygale?
I think I did try that at one point, but I can’t remember why I didn’t pursue it. Maybe the _make_child_direntry(entry) method didn’t exist at that point, and calling _make_child_relpath(entry.name) wasn’t any faster. If it provides a performance improvement, feel free to open an issue and a PR and I’ll review!
Thanks Barney - I really appreciate that 
Unfortunately, @CAM-Gerlach and @MegaIng, I’ve taken you all on a complete wild goose chase. I apologise. In summary, my suggestion could well make the performance between slightly and significantly worse (except on macOS, on which the performance is even worse regardless of my suggestion).
Cornelius - you were absolutely right, this needed to be measured. I have done so.
Ubuntu 22.04:

```
Testing reps=20 of listing a directory of: 50000 files
Time using Path.iterdir: 2.142239902
Time using ScanDirPath.iterdir: 2.313271893999996
Time using os.listdir: 0.00019698799999900984
Time using os.scandir: 0.0001934010000042008
```

Windows Server 2022:

```
Testing reps=20 of listing a directory of: 50000 files
Time using Path.iterdir: 1.841156299999966
Time using ScanDirPath.iterdir: 2.198301700000002
Time using os.listdir: 0.000457399999959307
Time using os.scandir: 0.00043940000000475266
```

macOS 13:

```
Testing reps=20 of listing a directory of: 50000 files
Time using Path.iterdir: 4.364073147000454
Time using ScanDirPath.iterdir: 4.452478466000684
Time using os.listdir: 0.00039130699951783754
Time using os.scandir: 0.00034591399980854476
```
There’s still the very real possibility I’ve done something silly, in particular something that means these tests are unfair. My choice of 50,000 files and 20 repetitions is influenced as much by my patience in waiting for tests to finish, as by my idea of a realistic usage scenario, in which any difference could be important. Be ever wary of isolated benchmarks, etc.
If these tests are not flawed, then you were completely right too Cornelius about using the os module, where performance is needed.
If nothing else, I believe I now have a definitive answer to Peter’s original (rephrased) question:
“Is there a solution with pathlib that is as fast as os.scandir()?”
No. Not even one as fast as os.listdir.
Pathlib is superb, but its primary benefit is code readability (and writeability). Not raw performance on unmanageably extreme numbers of files.
I am impressed that scandir is consistently faster than listdir. Doesn’t the former have to do strictly more work than the latter? Specifically, it has to allocate and construct all the DirEntry objects. There might be improvements to be made in the os.listdir implementation (or your testing is flawed for some non-obvious reason).
scandir is only faster for directories containing a certain number of files; for smaller numbers, listdir is faster. My understanding was that yes, it does do more work, but only more work within Python, whereas listdir makes more sys calls. Or is the latter no longer the case?
Anyway, this was the result on my laptop in Python 3.12:
```
C:\...\py Path_iterdir_scandir_test.py 10000 100
Time using Path.iterdir: 4.629436599996552
This test relies on implementation details of Python 3.13's pathlib, unavailable in earlier Pythons.
Time using os.listdir: 0.013278999998874497
Time using os.scandir: 0.015009400001872564
```
Even so, I was surprised the difference was so negligible for a single directory, after the anecdotes in the blog (but as that mentions, the difference is noticeable for other network file systems). I wondered when scandir was faster too, so tested a directory of a million files. The highest improvement from scandir over listdir was on Windows:
```
Testing reps=5 of listing a directory of: 1000000 files
Time using Path.iterdir: 9.390752900000052
Time using ScanDirPath.iterdir: 11.222257000000127
Time using os.listdir: 0.0002230000000054133
Time using os.scandir: 0.00013970000009067007
```