Thoughts on 609. Find Duplicate File in System

Find duplicate files (files whose directories and names differ but whose contents are identical).

https://leetcode.com/problems/find-duplicate-file-in-system/description/

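The approach is fairly simple: build a map keyed by file content, and collect the paths of all files with the same content under the same key.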

import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;

public static List<List<String>> findDuplicate(String[] paths) {
        List<List<String>> ans=new LinkedList<List<String>>();
        HashMap<String,LinkedList<String>> map=new HashMap<String,LinkedList<String>>();
        
        for(int i=0;i<paths.length;i++)
        {
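            String[] files=paths[i].split(" ");// split on spaces: the first token is the directory, the rest are "name(content)" entries
            String directory=files[0];
            for(int k=1;k<files.length;k++)// iterate through the last entry; a bound of files.length-1 would skip it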
            {
                String[] fc=files[k].split("\\(");
                if(map.containsKey(fc[1]))
                {
                    map.get(fc[1]).add(directory+"/"+fc[0]);
                }
                else
                {
                    LinkedList<String> tmp=new LinkedList<String>();
                    tmp.add(directory+"/"+fc[0]);
                    map.put(fc[1],tmp);
                }
                    
            }
        }
        
        for(String key:map.keySet())
            if(map.get(key).size()>1)// only groups with at least two files are duplicates
                ans.add(map.get(key));
        return ans;
    }

However, the problem statement also poses some follow-up questions:

Imagine you are given a real file system, how will you search files? DFS or BFS?
If the file content is very large (GB level), how will you modify your solution?
If you can only read the file by 1kb each time, how will you modify your solution?
What is the time complexity of your modified solution? What is the most time-consuming part and memory consuming part of it? How to optimize?
How to make sure the duplicated files you find are not false positive?

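1. In a real file system, DFS is the usual choice for walking a directory and reading the files under it. DFS typically only has to remember the directories along the current path, whereas BFS has to hold a whole level of the tree at once.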

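2. When file contents are very large, the solution above no longer works, because it keeps each file's entire content in memory as a map key. Instead, first hash (group) the files by size; large files usually also carry metadata, and differences in that metadata can be compared too. Only files of the same size then need a byte-by-byte comparison of their contents.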

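3. Byte-by-byte comparison still works here: read both files 1 KB at a time, compare the chunks, and stop at the first chunk that differs.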

4. O(M*N), where M is the number of files and N is the length of a file's content.

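5. A false positive is a non-duplicate file that is mistakenly reported as a duplicate. It can be avoided by passing three checks in turn: file size, file metadata, and file content (byte-by-byte comparison).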
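
For point 1, here is a minimal sketch of a depth-first walk over a real directory tree, using an explicit stack instead of recursion. The class name DfsFileLister and the method listFiles are hypothetical names chosen for illustration, and the sketch assumes symbolic links to directories should not be followed (to avoid cycles).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper: collects every regular file under a root directory.
public class DfsFileLister {
    public static List<Path> listFiles(Path root) {
        List<Path> result = new ArrayList<>();
        Deque<Path> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            // Popping the most recently pushed entry makes the walk depth-first.
            Path current = stack.pop();
            // Assumption: do not follow symlinked directories, so cycles cannot occur.
            if (Files.isDirectory(current, LinkOption.NOFOLLOW_LINKS)) {
                try (DirectoryStream<Path> children = Files.newDirectoryStream(current)) {
                    for (Path child : children) {
                        stack.push(child);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            } else if (Files.isRegularFile(current)) {
                result.add(current);
            }
        }
        return result;
    }
}
```

Swapping the stack for a queue (adding at one end and removing from the other) would turn the same loop into BFS.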
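
For point 2, the size-based pre-filter might look like the sketch below. SizeGrouper and candidateGroups are hypothetical names; the sketch only covers the grouping by size and leaves out the metadata comparison. A file whose size is unique cannot have a duplicate, so it is dropped without its content ever being read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: narrows the search to files that share a size.
public class SizeGrouper {
    public static List<List<Path>> candidateGroups(List<Path> files) {
        // Maps file size in bytes -> all files of that size.
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path file : files) {
            long size;
            try {
                size = Files.size(file);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            bySize.computeIfAbsent(size, s -> new ArrayList<>()).add(file);
        }
        // Keep only groups with at least two files; these still need a content check.
        List<List<Path>> candidates = new ArrayList<>();
        for (List<Path> group : bySize.values()) {
            if (group.size() > 1) {
                candidates.add(group);
            }
        }
        return candidates;
    }
}
```

The groups returned are only candidates: files of equal size can still differ in content, so each group has to go through the content comparison before being reported.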
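
For points 3 and 5, the comparison under the 1 KB read limit could be sketched as follows. ChunkedCompare, sameContent, and fill are hypothetical names, and the 1024-byte buffer is an assumption standing in for the "1kb each time" constraint. Because every byte is compared, a true result cannot be a false positive.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: compares two files without loading either one fully.
public class ChunkedCompare {
    private static final int CHUNK = 1024; // assumed read limit of 1 KB

    public static boolean sameContent(Path a, Path b) throws IOException {
        // Cheap check first: files of different sizes cannot be identical.
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        try (InputStream inA = Files.newInputStream(a);
             InputStream inB = Files.newInputStream(b)) {
            byte[] bufA = new byte[CHUNK];
            byte[] bufB = new byte[CHUNK];
            while (true) {
                int nA = fill(inA, bufA);
                int nB = fill(inB, bufB);
                if (nA != nB) {
                    return false;
                }
                if (nA == 0) {
                    return true; // both streams ended with no difference found
                }
                for (int i = 0; i < nA; i++) {
                    if (bufA[i] != bufB[i]) {
                        return false; // stop at the first differing byte
                    }
                }
            }
        }
    }

    // Reads until the buffer is full or the stream ends; returns the bytes read.
    private static int fill(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return total;
    }
}
```

Only two 1 KB buffers are held in memory at any time, regardless of how large the files are.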
