Filtering list under limitations

xingbin :

Description

The input is List<Item> sorted by score, Item looks like:

class Item {
    double score;
    String category; 
    String author;    
    String poi;   
}

Now I need to select 10 elements which have the highest scores from the array, under these limitations:

  • They should have different poi
  • They should have different author
  • There are at most 3 items from the same category. And the length of any subsequence from the same category should not be longer than 2.

If there is no subsequence which satisfies above rules, just return the first 10 elements.

What I have tried

Now, I directly iterate over the List, and use three HashMap<String, Integer> to store the appearences of each cagetory/poi/author. And I use List<Item> selected to store the result.

  • If there is already a selected element that has this poi, then the new element will be discarded.
  • If there is already a selected element that has this author, then the new element will be discarded.
  • If there are already three selected elements that have this category, then the new element will be discarded.
  • If there are already two elements in the tail of selected that have this category, then the new element will be discarded.

Problem

It works when the input is large, but when the input is relatively small, it does not work. For example, when the input is

  1. Item1(Category A, Author 1)
  2. Item2(Category A, Author 2)
  3. Item3(Category A, Author 3)
  4. Item4(Category B, Author 2)
  5. Item5(Category C, Author 5)
  6. Item6(Category D, Author 6)
  7. Item7(Category E, Author 7)
  8. Item8(Category F, Author 8)
  9. Item9(Category G, Author 9)
  10. Item10(Category H, Author 10)
  11. Item11(Category I, Author 11)

Then my solution will be

  • Item3 discarded, because it has the same category as Item1 and Item2
  • Item4 discarded, because it has the same author as Item2
  • the 9 other elements remain.

And this does not satisfy the select top 10 elements. The correct solution is discarding Item2 and only 10 elements should remain.

Question

I think my solution is just in the wrong direction. So I'm looking for other solutions to deal with this problem. Any solution produces the desired output is appreciated.

Assafs :

The original algorithm you used will always tend to minimize the number of results, because in any mutual-exclusive choice between items the highest-score item wins. This way the algorithm operates like a sieve, eliminating many lower-score items.

In order to support choosing a set of at least size X (10 in this case) from an original set of items length Y (11 in your example), you will need to collect a list of decision sets rather than eliminating items by score alone. A decision set(m,n) is a set of m items from which you must choose to keep n items and eliminate the rest. Since most of the rules in your system are single item of attribute x, most decision sets in your list will be set(m,1) - choose 1 of the m items, eliminate the rest.

The first pass on the full items set will populate the decision set list, and the second pass will go over that list and choose from each decision set the items to eliminate from the original set. Once a decision is made and the item/s are eliminated from the original set, the decision set is removed from the list (resolved). Once the list of decision sets has been cleared, your original set is legal.

The goal is to have the decision set list cleared in at most Y-X eliminations. Since an item can appear in multiple decision sets, you can also add for each item a "survival score". The survival score suggests the maximum number of item which will have to be eliminated if this item is kept. It is calculated per item by going over each decision set (m,n) and adding to each item contained m-n to its accumulated score.

Let's look at your example, and build its decision sets:

  • Item1(Category A, Author 1)
  • Item2(Category A, Author 2)
  • Item3(Category A, Author 3)
  • Item4(Category B, Author 2)
  • Item5(Category C, Author 5)
  • Item6(Category D, Author 6)
  • Item7(Category E, Author 7)
  • Item8(Category F, Author 8)
  • Item9(Category G, Author 9)
  • Item10(Category H, Author 10)
  • Item11(Category I, Author 11)

The decision sets we compile are (note the survival score in parenthesis):

  • Author decision set (2,1) = {item 2 (2), item 4 (1)}
  • Category decision set (3,2) = {item 1 (1), item 2 (2), item 3 (1)}

Our goal is to resolve the decision set list in at most 1 elimination. You can see that all the items are with survival score 1 (meaning that keeping them will result in at most 1 other item eliminated) except for item 2 which has survival score of 2 (keeping it will eliminate at most 2 items). We cannot afford 2 items, and therefore we cannot afford to keep item 2 regardless of its score. eliminating it will resolve both decision sets and is the only option.

The more general algorithm can be more complex: in every iteration you eliminate the items with survival score you cannot afford, and if you're not near that limit use a combination of score and survival-score to decide which one must go.

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=36674&siteId=1