I'm trying to speedup directory enumeration in C++, where I'm recursing into subdirectories. I currently have an app which spends 95% of it's time in FindFirst/FindNextFile APIs, and it takes several minutes to enumerate all the files on a given volume. I know it's possible to do this faster because there is an app that does: Everything. It enumerates my entire drive in seconds.
How might I accomplish something like this?
6 Answers
Answers 1
I realize this is an old post, but there is a project on source forge that does exactly what you are asking and the source code is available.
You can find the project here: NTFS-Search
Answers 2
"Everything" accesses directory information at a lower level than the Win32 FindFirst/FindNext APIs.
I believe it reads and interprets the NTFS MFT structures directly, and that this is one of the main reasons for its performance. It's also why it requires admin privileges and why "Everything" only indexes local or removable NTFS volumes (not network drives, for example).
A couple other utilities that do the similar things are:
- FindOnClick by 2Brightsparks
- Search GT
A little reverse engineering with a debugger on these tools might give you some insight on the techniques they use.
Answers 3
"Everything" builds an index in the background, so queries are against the index not the file system itself.
There are a few improvements to be made - at least over the straight-forward algorrithm:
First, breadth search over depth search. That is, enumerate and process all files in a single folder before recursing into the sub folders you found. This improves locality - usually a lot.
On Windows 7 / W2K8R2, you can use FindFirstFileEx
with FindExInfoBasic
, the main speedup being omitting the short file name on NTFS file systems where this is enabled.
Separate threads help if you enumerate different physical disks (not just drives). For the same disk it only helps if it's an SSD ("zero seek time"), or you spend significant time processing a file name (compared to the time spent on disk access).
[edit] Wikipedia actually has some comments - Basically, they are skipping the file system abstraction layer, and access NTFS directly. This way, they can batch calls and skip expensive services of the file system - such as checking ACL's.
A good starting point would be the NTFS Technical Reference on MSDN.
Answers 4
If you are doing this on NTFS, here's a lib for low level access: NTFSLib.
You can enumerate through all file records in $MFT, each representing a real file on disk. You can get all file attributes from the record, including $DATA.
This may be the fastest way to enumerate all files/directories on NTFS volumes, 200k~300k files per minute as I tested.
Answers 5
Don't recurse immediately, save a list of directories you find and dive into them when finished. You want to do linear access to each directory, to take advantage of locality of reference and any caching the OS is doing.
Answers 6
If you're already doing the best you can to get the maximum speed from the API, the next step is to do low-level disk accesses and bypass Windows altogether. You might get some guidance from the NTFS drivers for Linux, or perhaps you can use one directly.
0 comments:
Post a Comment