I've read some answers here on SO (for example) saying that parallelism is not going to increase performance (except maybe for read IO).
But I've created a few tests which show that WRITE operations are also much faster in parallel.
— READ TEST:
I've created 6000 random files with dummy data:
Let's try to read them with and without parallelism:
var files = Directory.GetFiles("c:\\temp\\2\\", "*.*", SearchOption.TopDirectoryOnly).Take(1000).ToList();
var sw = Stopwatch.StartNew();
files.ForEach(f => File.ReadAllBytes(f).GetHashCode());
sw.ElapsedMilliseconds.Dump("Run READ- Serial");
sw.Stop();

sw.Restart();
files.AsParallel().ForAll(f => File.ReadAllBytes(f).GetHashCode());
sw.ElapsedMilliseconds.Dump("Run READ- Parallel");
sw.Stop();
Result1:
Run READ- Serial 595
Run READ- Parallel 193
Result2:
Run READ- Serial 316
Run READ- Parallel 192
— WRITE TEST:
I'm going to create 1000 random files, each 300K in size. (I've emptied the directory from the previous test.)
var bytes = new byte[300000];
Random r = new Random();
r.NextBytes(bytes);
var list = Enumerable.Range(1, 1000).ToList();

sw.Restart();
list.ForEach(f => File.WriteAllBytes(@"c:\temp\2\" + Path.GetRandomFileName(), bytes));
sw.ElapsedMilliseconds.Dump("Run WRITE serial");
sw.Stop();

sw.Restart();
list.AsParallel().ForAll(f => File.WriteAllBytes(@"c:\temp\2\" + Path.GetRandomFileName(), bytes));
sw.ElapsedMilliseconds.Dump("Run WRITE Parallel");
sw.Stop();
Result 1:
Run WRITE serial 2028
Run WRITE Parallel 368
Result 2:
Run WRITE serial 784
Run WRITE Parallel 426
Question:
The results surprised me. Against all expectations (especially for WRITE operations), performance is better with parallelism, even though these are IO operations.
How/why does parallelism give better results? It seems that the SSD can work with multiple threads, and that there is little or no bottleneck when running more than one job at a time against the IO device.
NB: I didn't test this with an HDD. (I'd be happy if someone with an HDD would run the tests.)
4 Answers
Answer 1
Benchmarking is a tricky art; you are just not measuring what you think you are. That it is not actually I/O overhead is somewhat obvious from the test results: why is the single-threaded code faster the second time you run it?
What you are not counting on is the behavior of the file system cache. It keeps a copy of the disk content in RAM. This has a particularly big impact on the multi-threaded code measurement, which is not using any I/O at all. In a nutshell:
Reads come from RAM if the file system cache has a copy of the data. This operates at memory bus speeds, typically around 35 gigabytes/second. If it does not have a copy, then the read is delayed until the disk supplies the data. The disk does not just read the requested cluster but an entire cylinder's worth of data.
Writes go straight to RAM and complete very quickly. That data is written to the disk lazily in the background while the program keeps executing, optimized to minimize write-head movement in cylinder order. Only if no more RAM is available will a write ever stall.
Actual cache size depends on the installed amount of RAM and the need for RAM imposed by running processes. A very rough guideline is that you can count on 1GB on a machine with 4GB of RAM, 3GB on a machine with 8GB of RAM. It is visible in Resource Monitor, Memory tab, displayed as the "Cached" value. Keep in mind that it is highly variable.
So, enough to make sense of what you see: the Parallel test benefits greatly from the Serial test having already read all the data. If you had written the test so that the Parallel test ran first, you'd have gotten very different results. Only with a cold cache could you see the loss of perf due to threading. You'd have to restart your machine to ensure that condition, or read another very large file first, large enough to evict useful data from the cache.
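To check this with the code from the question, it should be enough to swap the order of the two passes so the parallel pass can no longer ride on data the serial pass pulled into the cache (a sketch reusing the `files` list and the LINQPad-style `Dump` from the question; run it on a freshly booted machine for a truly cold cache):

```csharp
// Run the parallel pass FIRST, so it cannot benefit from data
// the serial pass already pulled into the file system cache.
var sw = Stopwatch.StartNew();
files.AsParallel().ForAll(f => File.ReadAllBytes(f).GetHashCode());
sw.ElapsedMilliseconds.Dump("Run READ- Parallel (cold-ish cache)");
sw.Stop();

sw.Restart();
files.ForEach(f => File.ReadAllBytes(f).GetHashCode());
sw.ElapsedMilliseconds.Dump("Run READ- Serial (warm cache)");
sw.Stop();
```

With this ordering, any remaining parallel advantage has to come from the disk itself rather than from RAM.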
Only if you have a priori knowledge that your program only ever reads data that was just written can you safely use threads without risking a perf loss. That guarantee is normally pretty hard to come by. It does exist; a good example is Visual Studio building your project. The compiler writes the build result to the obj\Debug directory, then MSBuild copies it to bin\Debug. That looks very wasteful, but it is not: the copy will always complete very quickly, since the file is hot in the cache. The cache also explains the difference between a cold and a hot start of a .NET program, and why using NGen is not always best.
Answer 2
The reason for this behaviour is called file caching, a Windows feature to improve the performance of file operations. Let's have a look at a short explanation on the Windows Dev Center:
By default, Windows caches file data that is read from disks and written to disks. This implies that read operations read file data from an area in system memory known as the system file cache, rather than from the physical disk.
This means the hard disk is (typically) never used during your tests.
We can avoid this behaviour by creating a FileStream using the FILE_FLAG_NO_BUFFERING flag, documented on MSDN. Let's have a look at our new ReadUnbuffered function using this flag:
private static object ReadUnbuffered(string f)
{
    // Unbuffered read and write operations can only be performed
    // with blocks whose size is a multiple of the hard drive sector size
    byte[] buffer = new byte[4096 * 10];
    const ulong FILE_FLAG_NO_BUFFERING = 0x20000000;
    using (FileStream fs = new FileStream(
        f, FileMode.Open, FileAccess.Read, FileShare.None,
        8, (FileOptions)FILE_FLAG_NO_BUFFERING))
    {
        return fs.Read(buffer, 0, buffer.Length);
    }
}
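The comparison can then be repeated with this reader in place of `File.ReadAllBytes` (a sketch, reusing the `files` list and `sw` stopwatch from the question):

```csharp
// Same serial-vs-parallel comparison, but with the cache bypassed,
// so every read has to go to the physical disk.
sw.Restart();
files.ForEach(f => ReadUnbuffered(f));
sw.ElapsedMilliseconds.Dump("Run READ- Serial (unbuffered)");
sw.Stop();

sw.Restart();
files.AsParallel().ForAll(f => ReadUnbuffered(f));
sw.ElapsedMilliseconds.Dump("Run READ- Parallel (unbuffered)");
sw.Stop();
```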
The result: Reading serial is much faster. In my case even nearly twice as fast.
Reading a file using the standard Windows cache only has to perform the CPU and RAM operations needed to manage the caching of files, deal with the FileStream, and so on, because the files are already cached. Sure, it's not very CPU-intensive, but it's not negligible. Since the files are already in the system cache, the parallel approach (without disabling the cache) mostly measures the time of these overhead operations.
This behaviour also applies to write operations.
Answer 3
It is a very interesting topic! I'm sorry that I can't explain the technical details, but there are some concerns that need to be raised. It is a bit long, so I can't fit it into a comment. Please forgive me for posting it as an "answer".
I think you need to test with both large and small files; also, the test must be run a few times and the times averaged to make sure the result is verifiable. A general guideline is to run it 25 times, as a paper in evolutionary computing suggests.
Another concern is system caching. You only created one bytes buffer and always write the same content. I don't know how the system handles that buffer, but to minimise the difference, I would suggest creating a different buffer for each file.
(Update: maybe the GC also affects performance, so I revised the test again to keep the GC out of the way as much as I could.)
I luckily have both an SSD and an HDD in my computer, so I revised the test code. I executed it with different configurations and got the following results. Hope I can inspire someone to give a better explanation.
1KB, 256 Files
Avg Write Parallel SSD: 46.88
Avg Write Serial SSD: 94.32
Avg Read Parallel SSD: 4.28
Avg Read Serial SSD: 15.48
Avg Write Parallel HDD: 35.4
Avg Write Serial HDD: 71.52
Avg Read Parallel HDD: 4.52
Avg Read Serial HDD: 14.68
512KB, 256 Files
Avg Write Parallel SSD: 86.84
Avg Write Serial SSD: 210.84
Avg Read Parallel SSD: 65.64
Avg Read Serial SSD: 80.84
Avg Write Parallel HDD: 85.52
Avg Write Serial HDD: 186.76
Avg Read Parallel HDD: 63.24
Avg Read Serial HDD: 82.12
// Note: the GC still seems to have kicked in during the parallel reads in this test
My machine is: i7-6820HQ / 32G / Windows 7 Enterprise x64 / VS2017 Professional / Target .NET 4.6 / Running in debug mode.
The two harddrives are:
C drive: IDE\Crucial_CT275MX300SSD4___________________M0CR021
D drive: IDE\ST2000LM003_HN-M201RAD__________________2BE10001
The revised code is as follows:
Stopwatch sw = new Stopwatch();
string path;
int fileSize = 1024 * 1024 * 1024;
int numFiles = 2;
byte[] bytes = new byte[fileSize];
Random r = new Random(DateTimeOffset.UtcNow.Millisecond);
List<int> list = Enumerable.Range(0, numFiles).ToList();
List<List<byte>> allBytes = new List<List<byte>>(numFiles);
List<string> files;
int numTests = 1;
List<long> wss = new List<long>(numTests);
List<long> wps = new List<long>(numTests);
List<long> rss = new List<long>(numTests);
List<long> rps = new List<long>(numTests);
List<long> wsh = new List<long>(numTests);
List<long> wph = new List<long>(numTests);
List<long> rsh = new List<long>(numTests);
List<long> rph = new List<long>(numTests);

Enumerable.Range(1, numTests).ToList().ForEach((i) =>
{
    path = @"C:\SeqParTest\";

    // Write parallel SSD
    allBytes.Clear();
    GC.Collect();
    GC.WaitForFullGCComplete();
    list.ForEach((x) => { r.NextBytes(bytes); allBytes.Add(new List<byte>(bytes)); });
    try { GC.TryStartNoGCRegion(0, true); } catch (Exception) { }
    sw.Restart();
    list.AsParallel().ForAll((x) => File.WriteAllBytes(path + Path.GetRandomFileName(), allBytes[x].ToArray()));
    wps.Add(sw.ElapsedMilliseconds);
    sw.Stop();
    try { GC.EndNoGCRegion(); } catch (Exception) { }
    Debug.Print($"Write parallel SSD #{i}: {wps[i - 1]}");

    // Write serial SSD
    allBytes.Clear();
    GC.Collect();
    GC.WaitForFullGCComplete();
    list.ForEach((x) => { r.NextBytes(bytes); allBytes.Add(new List<byte>(bytes)); });
    try { GC.TryStartNoGCRegion(0, true); } catch (Exception) { }
    sw.Restart();
    list.ForEach((x) => File.WriteAllBytes(path + Path.GetRandomFileName(), allBytes[x].ToArray()));
    wss.Add(sw.ElapsedMilliseconds);
    sw.Stop();
    try { GC.EndNoGCRegion(); } catch (Exception) { }
    Debug.Print($"Write serial SSD #{i}: {wss[i - 1]}");

    // Read parallel SSD
    files = Directory.GetFiles(path, "*.*", SearchOption.TopDirectoryOnly).Take(numFiles).ToList();
    try { GC.TryStartNoGCRegion(0, true); } catch (Exception) { }
    sw.Restart();
    files.AsParallel().ForAll(f => File.ReadAllBytes(f).GetHashCode());
    rps.Add(sw.ElapsedMilliseconds);
    sw.Stop();
    try { GC.EndNoGCRegion(); } catch (Exception) { }
    files.ForEach(f => File.Delete(f));
    Debug.Print($"Read parallel SSD #{i}: {rps[i - 1]}");
    GC.Collect();
    GC.WaitForFullGCComplete();

    // Read serial SSD
    files = Directory.GetFiles(path, "*.*", SearchOption.TopDirectoryOnly).Take(numFiles).ToList();
    try { GC.TryStartNoGCRegion(0, true); } catch (Exception) { }
    sw.Restart();
    files.ForEach(f => File.ReadAllBytes(f).GetHashCode());
    rss.Add(sw.ElapsedMilliseconds);
    sw.Stop();
    try { GC.EndNoGCRegion(); } catch (Exception) { }
    files.ForEach(f => File.Delete(f));
    Debug.Print($"Read serial SSD #{i}: {rss[i - 1]}");
    GC.Collect();
    GC.WaitForFullGCComplete();

    path = @"D:\SeqParTest\";

    // Write parallel HDD
    allBytes.Clear();
    GC.Collect();
    GC.WaitForFullGCComplete();
    list.ForEach((x) => { r.NextBytes(bytes); allBytes.Add(new List<byte>(bytes)); });
    try { GC.TryStartNoGCRegion(0, true); } catch (Exception) { }
    sw.Restart();
    list.AsParallel().ForAll((x) => File.WriteAllBytes(path + Path.GetRandomFileName(), allBytes[x].ToArray()));
    wph.Add(sw.ElapsedMilliseconds);
    sw.Stop();
    try { GC.EndNoGCRegion(); } catch (Exception) { }
    Debug.Print($"Write parallel HDD #{i}: {wph[i - 1]}");

    // Write serial HDD
    allBytes.Clear();
    GC.Collect();
    GC.WaitForFullGCComplete();
    list.ForEach((x) => { r.NextBytes(bytes); allBytes.Add(new List<byte>(bytes)); });
    try { GC.TryStartNoGCRegion(0, true); } catch (Exception) { }
    sw.Restart();
    list.ForEach((x) => File.WriteAllBytes(path + Path.GetRandomFileName(), allBytes[x].ToArray()));
    wsh.Add(sw.ElapsedMilliseconds);
    sw.Stop();
    try { GC.EndNoGCRegion(); } catch (Exception) { }
    Debug.Print($"Write serial HDD #{i}: {wsh[i - 1]}");

    // Read parallel HDD
    files = Directory.GetFiles(path, "*.*", SearchOption.TopDirectoryOnly).Take(numFiles).ToList();
    try { GC.TryStartNoGCRegion(0, true); } catch (Exception) { }
    sw.Restart();
    files.AsParallel().ForAll(f => File.ReadAllBytes(f).GetHashCode());
    rph.Add(sw.ElapsedMilliseconds);
    sw.Stop();
    try { GC.EndNoGCRegion(); } catch (Exception) { }
    files.ForEach(f => File.Delete(f));
    Debug.Print($"Read parallel HDD #{i}: {rph[i - 1]}");
    GC.Collect();
    GC.WaitForFullGCComplete();

    // Read serial HDD
    files = Directory.GetFiles(path, "*.*", SearchOption.TopDirectoryOnly).Take(numFiles).ToList();
    try { GC.TryStartNoGCRegion(0, true); } catch (Exception) { }
    sw.Restart();
    files.ForEach(f => File.ReadAllBytes(f).GetHashCode());
    rsh.Add(sw.ElapsedMilliseconds);
    sw.Stop();
    try { GC.EndNoGCRegion(); } catch (Exception) { }
    files.ForEach(f => File.Delete(f));
    Debug.Print($"Read serial HDD #{i}: {rsh[i - 1]}");
    GC.Collect();
    GC.WaitForFullGCComplete();
});

Debug.Print($"Avg Write Parallel SSD: {wps.Average()}");
Debug.Print($"Avg Write Serial SSD: {wss.Average()}");
Debug.Print($"Avg Read Parallel SSD: {rps.Average()}");
Debug.Print($"Avg Read Serial SSD: {rss.Average()}");
Debug.Print($"Avg Write Parallel HDD: {wph.Average()}");
Debug.Print($"Avg Write Serial HDD: {wsh.Average()}");
Debug.Print($"Avg Read Parallel HDD: {rph.Average()}");
Debug.Print($"Avg Read Serial HDD: {rsh.Average()}");
Well, I haven't fully tested the code, so it may be buggy. I realised it sometimes stops during the parallel read; I assume that's because the deletion of files from the sequential read completes AFTER the list of existing files is read in the next step, so it complains with a file-not-found error.
Another problem is that I used the newly created files for the read test. Theoretically it's better not to (and even to restart the computer / fill the empty space on the SSD to avoid caching), but I didn't bother, because the intended comparison is between sequential and parallel performance.
Update:
I don't know how to explain the reason, but I think it may be because the IO resource is fairly idle? I'll try two things next:
- Large files (1GB) in serial / parallel
- With other background activities using disk IO
Answer 4
First, the test needs to exclude any CPU/RAM operations (GetHashCode), since the serial code may be waiting on the CPU before issuing the next disk operation.
Internally, an SSD always tries to parallelize operations across its different internal chips. Its ability to do so depends on the model, how much (TRIMmed) free space it has, etc. Until some time ago, this should have behaved the same in parallel and serial, because the queue between the OS and the SSD is serial anyway... unless the SSD supports NCQ (Native Command Queuing), which lets the SSD select which operation from the queue to do next in order to maximize the usage of all its chips. So what you are seeing could be the benefits of NCQ. (Note that NCQ also works for hard disk drives.)
Due to the differences between SSDs (controller strategy, number of internal chips, free space, etc.), the benefits of parallelization will probably vary a lot.