Find the 100 largest numbers across all the files in different folders

john :

I recently had an interview where I was asked the question below. It sounded pretty easy at first, but it became tricky toward the end.

There are a lot of files spread across folders and their subfolders. Each file has many lines of numbers. Given a root folder, I need to find the 100 largest numbers across all those files. I came up with the solution below:

  • Read all the files line by line.
  • Store each number in an array list.
  • Sort it in descending order.
  • Now get the first k numbers from the list.

But then the interviewer asked what the time complexity of this would be. I said that since we are sorting, it is going to be O(n log n). Then he asked how the program could be improved: since everything is stored in memory and then sorted, what happens if it doesn't all fit in memory?

I was confused at that point and couldn't figure out whether there was a better, more efficient way to solve this problem. He wanted me to write the efficient code. Is there a better way to accomplish this?

Below is the original code I came up with:

  private static final List<Integer> numbers = new ArrayList<>();

  public static void main(String[] args) {
    int k = 100;
    List<Integer> numbers = findKLargest("/home/david");

    // sort in descending order
    Collections.sort(numbers, Collections.reverseOrder());
    List<Integer> kLargest = new ArrayList<>();
    int j = 0;
    // now iterate all the numbers and get the first k numbers from the list
    for (Integer num : numbers) {
      j++;
      kLargest.add(num);
      if (j == k) {
        break;
      }
    }
    // print the first k numbers
    System.out.println(kLargest);
  }

  /**
   * Read all the numbers from all the files and load it in array list
   * @param rootDirectory
   * @return
   */
  private static List<Integer> findKLargest(String rootDirectory) {
    if (rootDirectory == null || rootDirectory.isEmpty()) {
      return new ArrayList<>();
    }

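    File file = new File(rootDirectory);
    File[] entries = file.listFiles();
    if (entries == null) {
      // not a directory, or it could not be read
      return numbers;
    }
    for (File entry : entries) {
      if (entry.isDirectory()) {
        // recurse with the full path (getName() drops the parent folders);
        // results accumulate in the shared list, so no addAll is needed
        findKLargest(entry.getPath());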
      } else {
        try (BufferedReader br = new BufferedReader(new FileReader(entry))) {
          String line;
          while ((line = br.readLine()) != null) {
            numbers.add(Integer.parseInt(line));
          }
        } catch (NumberFormatException | IOException e) {
          e.printStackTrace();
        }
      }
    }
    return numbers;
  }

MBo :

Instead of storing all N values (N being the overall count of numbers in all files) and sorting them, you can store only 100 values: the largest ones seen so far at any moment.

A convenient and fast data structure for this task is a priority queue (usually based on a binary heap). Create a min-heap from the first 100 values, then for every new value check whether it is larger than the heap top. If it is, remove the top and insert the new item.

Space complexity is O(K) and time complexity is O(N log K). Here K = 100 is a constant, so these can be treated as O(1) and O(N) (dropping the constant factor).

Python example to show how it works:

import heapq, random

# upper bound widened to 30 so that the values in the sample output are possible
pq = [random.randint(0, 30) for _ in range(5)]  # initial values
print(pq)
heapq.heapify(pq)                               # initial values ordered in heap
print(pq)
for i in range(5):
    r = random.randint(0, 30)    # add 5 more values
    if r > pq[0]:                # larger than the smallest value kept so far
        heapq.heappop(pq)
        heapq.heappush(pq, r)
    print(r, pq)

Sample output:

[17, 22, 10, 1, 15]   //initial values
[1, 15, 10, 22, 17]   //heapified, smallest is on the left
29 [10, 15, 17, 22, 29]     //29 replaces 1
25 [15, 22, 17, 29, 25]     //25 replaces 10
14 [15, 22, 17, 29, 25]      //14 is too small
8 [15, 22, 17, 29, 25]       //8 is too small
21 [17, 21, 25, 29, 22]     //21 is in the club now
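
Applied to the original question, the same bounded min-heap can be fed straight from the files as they are read, so only k numbers are held in memory at any time. This is a minimal sketch, not a definitive implementation. It assumes the files are plain text containing whitespace-separated integers. The function name k_largest_in_tree is made up for illustration, and tokens that do not parse as integers, as well as unreadable files, are simply skipped:

```python
import heapq
import os


def k_largest_in_tree(root, k=100):
    """Return the k largest integers found in files under root, largest first."""
    if k <= 0:
        return []
    heap = []  # min-heap of at most k values; heap[0] is the smallest one kept
    for dirpath, _dirnames, filenames in os.walk(root):  # visits subfolders too
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="ignore") as fh:
                    for line in fh:  # streamed line by line, never the whole file
                        for token in line.split():
                            try:
                                value = int(token)
                            except ValueError:
                                continue  # assumption: ignore non-integer tokens
                            if len(heap) < k:
                                heapq.heappush(heap, value)
                            elif value > heap[0]:
                                # pop the smallest and push the new value in one step
                                heapq.heapreplace(heap, value)
            except OSError:
                continue  # assumption: skip files that cannot be opened or read
    return sorted(heap, reverse=True)
```

Calling k_largest_in_tree("/home/david", 100) would return up to 100 values in descending order. If the numbers are supplied as a generator, the standard library's heapq.nlargest(k, iterable) could replace the manual heap handling.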
