The fastest way to compare the contents of two files in .NET Core

A recent project needed to check whether two files of any size have identical content. The requirements were:

  1. The project targets .NET Core, so the comparison method is written in C#
  2. The files can be of any size, so the whole file cannot be read into memory for comparison (put more technically, the comparison has to work in a non-cached, streaming fashion)
  3. No dependency on third-party libraries
  4. The faster the better

To pick the best solution, I built a simple command-line project and prepared two 912 MB files with identical contents. The Main method of the project is shown at the end of this article.

Let's try the various comparison methods and choose the best one:

To check whether two files are identical, the first idea that comes to mind is to compute a hash of each file with a hash algorithm (such as MD5 or SHA) and then compare the hashes.

Without further ado, let's roll up our sleeves and write an MD5 comparison method:

/// <summary>
/// Compare by MD5 hash
/// </summary>
/// <param name="file1"></param>
/// <param name="file2"></param>
/// <returns></returns>
private static bool CompareByMD5(string file1, string file2)
{
    // Use the MD5 implementation built into .NET (namespace System.Security.Cryptography)
    using (var md5 = MD5.Create())
    {
        byte[] one, two;
        using (var fs1 = File.Open(file1, FileMode.Open))
        {
            // Read the file through a FileStream and compute its hash
            one = md5.ComputeHash(fs1);
        }
        using (var fs2 = File.Open(file2, FileMode.Open))
        {
            // Read the file through a FileStream and compute its hash
            two = md5.ComputeHash(fs2);
        }
        // Convert the MD5 results (byte arrays) to strings and compare them
        return BitConverter.ToString(one) == BitConverter.ToString(two);
    }
}

Result:

Method: CompareByMD5, Identical: True. Elapsed: 00:00:05.7933178

It takes 5.79 seconds, which feels pretty good. But is this the best solution?

If we think about it carefully, the answer is no.

Any hash algorithm essentially performs some computation over every byte, and that computation takes time.

Many download sites publish hash values for their files because the source file does not change: the hash only has to be computed once and can then be handed to every user for verification.

Our requirement is different. Neither file is fixed, so a hash would have to be computed for both files on every comparison, which is not a good fit.

So the hash comparison approach is ruled out.

For finding the best algorithm for a problem like this, my past experience is: go search Stack Overflow :)

After some digging, I found a very relevant question: How to compare 2 files fast using .NET?

I took the most upvoted answer and adapted its code for the project:

/// <summary>
/// https://stackoverflow.com/a/1359947
/// </summary>
/// <param name="file1"></param>
/// <param name="file2"></param>
/// <returns></returns>
private static bool CompareByToInt64(string file1, string file2)
{
    const int BYTES_TO_READ = sizeof(Int64); // read 8 bytes at a time
    // Number of reads; kept as a long so that very large files do not overflow an int
    long iterations = (new FileInfo(file1).Length + BYTES_TO_READ - 1) / BYTES_TO_READ;

    using (FileStream fs1 = File.Open(file1, FileMode.Open))
    using (FileStream fs2 = File.Open(file2, FileMode.Open))
    {
        byte[] one = new byte[BYTES_TO_READ];
        byte[] two = new byte[BYTES_TO_READ];

        for (long i = 0; i < iterations; i++)
        {
            // Read the next chunk of each file into its byte array
            fs1.Read(one, 0, BYTES_TO_READ);
            fs2.Read(two, 0, BYTES_TO_READ);

            // Convert to Int64 and compare numerically
            if (BitConverter.ToInt64(one, 0) != BitConverter.ToInt64(two, 0))
                return false;
        }
    }

    return true;
}

The basic idea of this method is to read both files in a loop, 8 bytes at a time, convert each chunk to an Int64, and compare the numbers. So how efficient is it?

Method: CompareByToInt64, Identical: True. Elapsed: 00:00:08.0918099

What? 8 seconds! Even slower than MD5? Isn't this the most upvoted answer on SO? How can that be?

Thinking it through, the reason is not hard to find: because only 8 bytes are read at a time, the program performs read operations extremely often, and performance suffers. It seems answers on SO should not be trusted blindly either!

So the direction for optimization is to reduce the overhead caused by IO operations.

Since 8 bytes per read is too little, we define a larger byte array, for example 1024 bytes (the code below uses 10 KB), read that much into the array each time, and then compare the byte arrays.

But this raises a new question: how do we quickly check whether two byte arrays are equal?

My first thought was the approach already used in the MD5 method: convert the byte arrays to strings and compare them:
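To get a feel for how much the chunk size matters, here is a minimal sketch that only counts how many Read calls each chunk size needs for a file of this size. The class and method names are made up for illustration, the 912 MB file is taken to be exactly 912 * 1024 * 1024 bytes, and every call is assumed to fill its buffer. FileStream also keeps an internal buffer, so these are method calls rather than disk accesses:

```csharp
using System;

public static class ReadCountSketch
{
    // Number of Read calls needed to consume fileLength bytes in chunks of chunkSize bytes,
    // assuming every call fills the chunk completely (except possibly the last one).
    public static long ReadCalls(long fileLength, int chunkSize)
    {
        return (fileLength + chunkSize - 1) / chunkSize; // integer ceiling division
    }

    public static void Main()
    {
        long fileLength = 912L * 1024 * 1024; // assumed exact size of the test file

        Console.WriteLine(ReadCalls(fileLength, sizeof(long))); // 8-byte chunks: 119537664
        Console.WriteLine(ReadCalls(fileLength, 1024));         // 1 KB chunks:   933888
        Console.WriteLine(ReadCalls(fileLength, 1024 * 10));    // 10 KB chunks:  93389
    }
}
```

Going from 8-byte chunks to 10 KB chunks cuts the number of calls per file from roughly 120 million to under 100 thousand.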

/// <summary>
/// Read into byte arrays and compare (converted to strings)
/// </summary>
/// <param name="file1"></param>
/// <param name="file2"></param>
/// <returns></returns>
private static bool CompareByString(string file1, string file2)
{
    const int BYTES_TO_READ = 1024 * 10;

    using (FileStream fs1 = File.Open(file1, FileMode.Open))
    using (FileStream fs2 = File.Open(file2, FileMode.Open))
    {
        byte[] one = new byte[BYTES_TO_READ];
        byte[] two = new byte[BYTES_TO_READ];
        while (true)
        {
            int len1 = fs1.Read(one, 0, BYTES_TO_READ);
            int len2 = fs2.Read(two, 0, BYTES_TO_READ);
            // Convert both buffers to strings and compare them
            if (BitConverter.ToString(one) != BitConverter.ToString(two)) return false;
            if (len1 == 0 || len2 == 0) break; // stop once either file has reached its end
        }
    }

    return true;
}

Result:

Method: CompareByString, Identical: True. Elapsed: 00:00:07.8088732

It took close to 8 seconds, not much better than the previous method.

The reason is that converting the arrays to strings in every iteration is a very time-consuming operation. Is there a way to compare byte arrays without a type conversion?

I remembered that LINQ has a sequence comparison method, SequenceEqual, so let's try it:

/// <summary>
/// Read into byte arrays and compare (using LINQ's SequenceEqual)
/// </summary>
/// <param name="file1"></param>
/// <param name="file2"></param>
/// <returns></returns>
private static bool CompareBySequenceEqual(string file1, string file2)
{
    const int BYTES_TO_READ = 1024 * 10;

    using (FileStream fs1 = File.Open(file1, FileMode.Open))
    using (FileStream fs2 = File.Open(file2, FileMode.Open))
    {
        byte[] one = new byte[BYTES_TO_READ];
        byte[] two = new byte[BYTES_TO_READ];
        while (true)
        {
            int len1 = fs1.Read(one, 0, BYTES_TO_READ);
            int len2 = fs2.Read(two, 0, BYTES_TO_READ);
            // Element-by-element comparison through LINQ (requires using System.Linq)
            if (!one.SequenceEqual(two)) return false;
            if (len1 == 0 || len2 == 0) break; // stop once either file has reached its end
        }
    }

    return true;
}

Result:

Method: CompareBySequenceEqual, Identical: True. Elapsed: 00:00:08.2174360

It turned out to be even slower than the previous two (in fact, it is the slowest of all the approaches). LINQ's SequenceEqual was evidently not built for efficiency.

So let's drop the fancy features and go back to basics: what about simply comparing the byte arrays with a plain while loop?

/// <summary>
/// Read into byte arrays and compare (byte by byte in a while loop)
/// </summary>
/// <param name="file1"></param>
/// <param name="file2"></param>
/// <returns></returns>
private static bool CompareByByteArry(string file1, string file2)
{
    const int BYTES_TO_READ = 1024 * 10;

    using (FileStream fs1 = File.Open(file1, FileMode.Open))
    using (FileStream fs2 = File.Open(file2, FileMode.Open))
    {
        byte[] one = new byte[BYTES_TO_READ];
        byte[] two = new byte[BYTES_TO_READ];
        while (true)
        {
            int len1 = fs1.Read(one, 0, BYTES_TO_READ);
            int len2 = fs2.Read(two, 0, BYTES_TO_READ);
            // Compare the bytes that were read from both files, one at a time
            int index = 0;
            while (index < len1 && index < len2)
            {
                if (one[index] != two[index]) return false;
                index++;
            }
            if (len1 == 0 || len2 == 0) break; // stop once either file has reached its end
        }
    }

    return true;
}

The result is...

Method: CompareByByteArry, Identical: True. Elapsed: 00:00:01.5356821

1.53 seconds! A breakthrough! Sometimes the seemingly clumsy method works better!

At this point, comparing two 900 MB files takes about 1.5 seconds. Are you satisfied with that?

No! I am not satisfied! I believe that with some effort we can find a faster way!

.NET Core itself is also being continuously optimized to make it easier to write high-performance code.

So how do we continue to optimize our code?

I suddenly remembered a new value type introduced alongside C# 7.2: Span<T>. It represents a contiguous region of memory and provides a set of methods for operating on that region.

For our needs, since we never change the values in the arrays, we can use the read-only counterpart, ReadOnlySpan<T>, in pursuit of greater efficiency.

Modify the code to use ReadOnlySpan<T>:

/// <summary>
/// Read into byte arrays and compare (ReadOnlySpan)
/// </summary>
/// <param name="file1"></param>
/// <param name="file2"></param>
/// <returns></returns>
private static bool CompareByReadOnlySpan(string file1, string file2)
{
    const int BYTES_TO_READ = 1024 * 10;

    using (FileStream fs1 = File.Open(file1, FileMode.Open))
    using (FileStream fs2 = File.Open(file2, FileMode.Open))
    {
        byte[] one = new byte[BYTES_TO_READ];
        byte[] two = new byte[BYTES_TO_READ];
        while (true)
        {
            int len1 = fs1.Read(one, 0, BYTES_TO_READ);
            int len2 = fs2.Read(two, 0, BYTES_TO_READ);
            // A byte array can be converted directly to a ReadOnlySpan
            if (!((ReadOnlySpan<byte>)one).SequenceEqual((ReadOnlySpan<byte>)two)) return false;
            if (len1 == 0 || len2 == 0) break; // stop once either file has reached its end
        }
    }

    return true;
}

The comparison itself is done by SequenceEqual, which here is an extension method on ReadOnlySpan<T>. Note that it only shares its name with the LINQ method; the implementation is completely different.
So how does this method perform?
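As a small illustration of the difference, the sketch below calls both methods side by side. The class name and the PrefixEqual helper are hypothetical and not part of the project; the span version also shows that a span can be sliced, so only part of a buffer is compared:

```csharp
using System;
using System.Linq;

public static class SpanCompareSketch
{
    // Compares only the first `count` bytes of each buffer with the span-based
    // SequenceEqual (System.MemoryExtensions), ignoring the rest of the buffers.
    public static bool PrefixEqual(byte[] a, byte[] b, int count)
    {
        ReadOnlySpan<byte> left = a.AsSpan(0, count);
        ReadOnlySpan<byte> right = b.AsSpan(0, count);
        return left.SequenceEqual(right);
    }

    public static void Main()
    {
        byte[] one = { 1, 2, 3, 4 };
        byte[] two = { 1, 2, 3, 9 };

        Console.WriteLine(PrefixEqual(one, two, 3));           // True: the first three bytes match
        Console.WriteLine(PrefixEqual(one, two, 4));           // False: the last byte differs
        Console.WriteLine(Enumerable.SequenceEqual(one, two)); // False: the LINQ method, walking both sequences
    }
}
```

Slicing with AsSpan does not copy any bytes, so comparing only the part of a buffer that was actually filled costs nothing extra.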

Method: CompareByReadOnlySpan, Identical: True. Elapsed: 00:00:00.9287703

Less than a second!

That is a considerable improvement over the previous best result: almost 40% faster!

I am personally very satisfied with this result. If you know a faster way, please let me know; it would be very welcome!

Readers who are interested in the Span<T> struct can read this article, which gives a very detailed introduction.

Postscript

  • The code in this article is for experimentation only. In a real application the details can be optimized further, for example:

    1. If the two files differ in size, return false immediately
    2. If the two file paths are identical, return true immediately
    3. ...
  • Source of the Main method of the test project:

    static void Main(string[] args)
    {
        string file1 = @"C:\Users\WAKU\Desktop\file1.ISO";
        string file2 = @"C:\Users\WAKU\Desktop\file2.ISO";
        var methods = new Func<string, string, bool>[] { CompareByMD5, CompareByToInt64, CompareByByteArry, CompareByReadOnlySpan };
        foreach (var method in methods)
        {
            var sw = Stopwatch.StartNew();
            bool identical = method(file1, file2);
            Console.WriteLine("Method: {0}, Identical: {1}. Elapsed: {2}", method.Method.Name, identical, sw.Elapsed);
        }
    }
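The two optimizations listed in the postscript could be combined with the ReadOnlySpan comparison roughly as follows. This is only a sketch, not the code that was benchmarked above: the names FileCompareSketch, AreIdentical and ReadFull are made up, the path check is a plain ordinal comparison of the full paths (an assumption; it ignores case-insensitive file systems and links), and the ReadFull helper is added because Stream.Read is allowed to return fewer bytes than requested.

```csharp
using System;
using System.IO;

public static class FileCompareSketch
{
    // Reads until the buffer is full or the stream ends; returns the number of bytes read.
    private static int ReadFull(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = stream.Read(buffer, total, buffer.Length - total);
            if (n == 0) break; // end of stream
            total += n;
        }
        return total;
    }

    public static bool AreIdentical(string file1, string file2)
    {
        // Same path: nothing to compare (ordinal comparison of full paths, an assumption)
        if (string.Equals(Path.GetFullPath(file1), Path.GetFullPath(file2), StringComparison.Ordinal))
            return true;

        // Different sizes: the contents cannot be identical
        if (new FileInfo(file1).Length != new FileInfo(file2).Length)
            return false;

        byte[] one = new byte[1024 * 10];
        byte[] two = new byte[1024 * 10];

        using (FileStream fs1 = File.OpenRead(file1))
        using (FileStream fs2 = File.OpenRead(file2))
        {
            while (true)
            {
                int len1 = ReadFull(fs1, one);
                int len2 = ReadFull(fs2, two);

                // Compare only the bytes actually read; spans of different length are never equal
                ReadOnlySpan<byte> left = one.AsSpan(0, len1);
                ReadOnlySpan<byte> right = two.AsSpan(0, len2);
                if (!left.SequenceEqual(right)) return false;

                if (len1 == 0) return true; // both files are exhausted
            }
        }
    }

    public static void Main()
    {
        string a = Path.GetTempFileName();
        string b = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(a, new byte[] { 1, 2, 3 });
            File.WriteAllBytes(b, new byte[] { 1, 2, 3 });
            Console.WriteLine(AreIdentical(a, b)); // True: same content

            File.WriteAllBytes(b, new byte[] { 1, 2, 4 });
            Console.WriteLine(AreIdentical(a, b)); // False: same size, different content
        }
        finally
        {
            File.Delete(a);
            File.Delete(b);
        }
    }
}
```

The size check costs almost nothing and rejects most non-matching pairs before a single byte is read, which is why it is worth doing first.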

Finish.


Origin www.cnblogs.com/Leo_wl/p/11072104.html