A Set Aggregation Algorithm

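There was a requirement: while designing a Solr index optimization scheme, the business side presented a scenario involving a large number of business tables. We needed a plan to address the high latency caused by complex SQL queries against the database and, along the way, to support pinyin search on some fields through Solr.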

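Solr's inverted index is very effective for complex database queries. In addition, several business tables from the database can be flattened into a wide table in Solr ahead of time, which improves query efficiency further. Building wide tables is a common optimization technique, but the number of child tables must not be too large; otherwise updating and maintaining the index becomes very troublesome (if the index never needs updating, any number of child tables can be aggregated without problems). In my experience, a wide table should be built from no more than 5 child tables.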

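During requirements analysis, we asked the business side for a list of the complex SQL statements they run against the database. Most of them use left outer join, and they join more than one table. By my count, 14 tables are involved. These 14 tables obviously cannot all be aggregated into one wide table; we need to divide and conquer, splitting this pile of tables into relatively small wide tables (one Solr collection per wide table), which makes maintenance easier.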

First, we obtained the following list of query join relationships between tables:

"kind_menu_addition
kind_menu
menu"
"kind_menu_addition
kind_menu"
"menu_make
make
menu"
"kind_menu
menu"
"menu
kind_menu
menu_spec_detail
menu_make"
"menu
kind_menu
menu_prop"
"menu_make
make"
"menu_spec_detail
spec_detail"
"menu_time_price
menu
kind_menu"
"menu_spec_detail
spec_detail
menu"
"suit_menu_detail
menu"
"suit_menu_change
suit_menu_detail
menu
spec_detail"
"suit_menu_detail
menu"
"kind_menu
menu_kind_taste"
"menu_kind_taste
kind_taste"
"menu_kind_taste
kind_taste
taste"
"menu_kind_taste
taste"

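As shown above, each pair of quotation marks encloses one group of tables that are joined together. We need an algorithm that takes these relationship descriptions and computes the clusters of related tables. For example, if A is related to B and A is also related to C, the result should be that A, B, and C form one cluster.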
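The A, B, C example can be viewed as a connected-components problem. The following is a minimal sketch using union-find, which is a swapped-in standard technique and not the algorithm given later in this article. The class name TableClusterSketch and the method cluster are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class TableClusterSketch {
    // Parent pointer of each table name; a root points to itself.
    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative of x's cluster, compressing the path on the way.
    String find(String x) {
        parent.putIfAbsent(x, x);
        String root = x;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        String cur = x;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    // Merges the clusters that contain a and b.
    void union(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (!ra.equals(rb)) {
            parent.put(ra, rb);
        }
    }

    // Every table in a group is merged with the first table of that group.
    public static List<Set<String>> cluster(List<List<String>> groups) {
        TableClusterSketch uf = new TableClusterSketch();
        for (List<String> group : groups) {
            if (group.isEmpty()) {
                continue;
            }
            String first = group.get(0);
            for (String table : group) {
                uf.union(first, table);
            }
        }
        // Collect the members of each cluster under its root.
        Map<String, Set<String>> byRoot = new TreeMap<>();
        for (String table : new ArrayList<>(uf.parent.keySet())) {
            byRoot.computeIfAbsent(uf.find(table), k -> new TreeSet<>()).add(table);
        }
        return new ArrayList<>(byRoot.values());
    }

    public static void main(String[] args) {
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("A", "B"),
                Arrays.asList("A", "C"),
                Arrays.asList("D", "E"));
        // A-B and A-C collapse into one cluster; D-E stays separate.
        System.out.println(cluster(groups));
    }
}
```

With this input, A, B, and C end up in one cluster and D and E in another, which is the behavior the example asks for.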

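It should also be noted that the two tables menu and kind_menu appear very frequently in the relationships above. In fact, menu is the main table of the ER model, so high-frequency main tables like these need to be filtered out during the computation; otherwise they would link almost every table into a single cluster.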
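In this article the main tables are simply hardcoded. As an illustration only, high-frequency tables could also be spotted by counting how many relationship groups each table appears in. The sketch below assumes a hypothetical helper frequentTables and a caller-chosen threshold minGroups; neither comes from the original code, and the right threshold depends on the data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class MainTableDetector {
    // Returns the tables that appear in at least minGroups of the given groups.
    // minGroups is an assumed tuning knob, not a value from the article.
    public static Set<String> frequentTables(List<List<String>> groups, int minGroups) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> group : groups) {
            // Count each table at most once per group.
            for (String table : new HashSet<>(group)) {
                counts.merge(table, 1, Integer::sum);
            }
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minGroups) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A small subset of the relationship groups, for demonstration only.
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("kind_menu_addition", "kind_menu", "menu"),
                Arrays.asList("menu_make", "make", "menu"),
                Arrays.asList("kind_menu", "menu"),
                Arrays.asList("menu_make", "make"));
        System.out.println(frequentTables(groups, 3)); // prints [menu]
    }
}
```

A candidate list produced this way would still need a manual check against the ER model before any table is treated as a main table.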

The concrete algorithm is given below:

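import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;
import org.apache.commons.lang.StringUtils;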

public class TabClusterStatis {
	private static Set<String> mainTables = new HashSet<String>();
	static {
		// Main tables
		mainTables.add("menu");
		mainTables.add("kind_menu");
	}

	/**
	 * @param args
	 */
	public static void main(String[] args) throws Exception {

		// For each table, records the tables it is referenced together with
		Map<String, TableLink> tabcluster = new HashMap<String, TableLink>();
		LineIterator it = FileUtils.lineIterator(new File("D:\\tmp\\tab.txt"));

		// Compute the reference relationships between tables
		String tab = null;
		Set<String> tables = null;
		TableLink tlink = null;
		while (it.hasNext()) {
			tab = it.nextLine();
			if (StringUtils.startsWith(tab, "\"")) {
				tables = new HashSet<String>();
				tab = StringUtils.substringAfter(tab, "\"");
			}
			// True when this line closes the current quoted group
			boolean endsWithQuote = StringUtils.endsWith(tab, "\"");
			if (endsWithQuote) {
				tab = StringUtils.substringBefore(tab, "\"");
			}

			tables.add(tab);

			if (endsWithQuote) {
				for (String t : tables) {
					tlink = tabcluster.get(t);
					if (tlink == null) {
						tlink = new TableLink(t);
						tabcluster.put(t, tlink);
					}
					tlink.refs.addAll(tables);
				}
			}
		}

		// Remove the main tables so that they do not take part in the computation
		Iterator<String> nameIt = tabcluster.keySet().iterator();
		while (nameIt.hasNext()) {
			if (mainTables.contains(nameIt.next())) {
				nameIt.remove();
			}
		}

		List<TableLink> tablelist = new ArrayList<TableLink>(
				tabcluster.values());

		TableLink tref1 = null;
		TableLink tref2 = null;
		Set<String> subsetTable = null;
		List<Set<String>> subsetList = new ArrayList<Set<String>>();
		for (int i = 0; i < tablelist.size(); i++) {

			tref1 = tablelist.get(i);
			if (tref1.hasSetsubWithOther) {
				continue;
			}
			subsetTable = new HashSet<String>();
			// subsetTable.add(String.valueOf(i));
			subsetTable.addAll(tref1.refs);
			subsetList.add(subsetTable);

			for (int j = (i + 1); j < tablelist.size(); j++) {
				tref2 = tablelist.get(j);
				if (tref2.hasSetsubWithOther) {
					continue;
				}
				if (tref2.hasSubSet(subsetTable)) {
					// subsetTable.add("j:" + String.valueOf(j));
					subsetTable.addAll(tref2.refs);
					tref2.hasSetsubWithOther = true;
				}
			}

			for (int jj = tablelist.size() - 1; jj >= (i + 1); jj--) {
				tref2 = tablelist.get(jj);
				if (tref2.hasSetsubWithOther) {
					continue;
				}
				if (tref2.hasSubSet(subsetTable)) {
					// subsetTable.add("jj:" + String.valueOf(jj));
					subsetTable.addAll(tref2.refs);
					tref2.hasSetsubWithOther = true;
				}
			}
		}

		List<String> sortTables = null;
		for (Set<String> c : subsetList) {
			sortTables = new ArrayList<>(c);
			Collections.sort(sortTables);

			for (String t : sortTables) {
				// if (mainTables.contains(t)) {
				// continue;
				// }
				System.out.print(t + ",");
			}
			System.out.println();
		}
	}
	private static class TableLink {
		private final String name;
		private final Set<String> refs = new HashSet<String>();

		boolean hasSetsubWithOther = false;

		public TableLink(String name) {
			super();
			this.name = name;
		}

		boolean hasSubSet(TableLink refs) {
			return hasSubSet(refs.refs);
		}

		public boolean hasSubSet(Set<String> refs) {
			for (String tab : refs) {
				if (mainTables.contains(tab)) {
					continue;
				}
				if (this.refs.contains(tab)) {
					return true;
				}
			}

			return false;
		}

	}

}

Running it prints the following:

kind_menu,kind_taste,menu_kind_taste,taste,
kind_menu,menu,menu_time_price,
kind_menu,make,menu,menu_make,menu_spec_detail,spec_detail,suit_menu_change,suit_menu_detail,
kind_menu,kind_menu_addition,menu,
kind_menu,menu,menu_prop,

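Good: from the pairwise relationship descriptions, 5 sub-table clusters were computed automatically, and we can now build 5 collections on the Solr cluster.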
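As a cross-check on the count of 5, the grouping can be recomputed with a plain depth-first traversal over the non-main tables. This is a sketch under stated assumptions: the class ClusterCrossCheck is hypothetical, the relationship groups are inlined as space-separated strings instead of being read from the file, and menu and kind_menu are skipped so that they do not act as links.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class ClusterCrossCheck {
    // The 17 relationship groups from the listing, one group per string.
    static final String[] GROUPS = {
        "kind_menu_addition kind_menu menu",
        "kind_menu_addition kind_menu",
        "menu_make make menu",
        "kind_menu menu",
        "menu kind_menu menu_spec_detail menu_make",
        "menu kind_menu menu_prop",
        "menu_make make",
        "menu_spec_detail spec_detail",
        "menu_time_price menu kind_menu",
        "menu_spec_detail spec_detail menu",
        "suit_menu_detail menu",
        "suit_menu_change suit_menu_detail menu spec_detail",
        "suit_menu_detail menu",
        "kind_menu menu_kind_taste",
        "menu_kind_taste kind_taste",
        "menu_kind_taste kind_taste taste",
        "menu_kind_taste taste"
    };

    // Main tables are excluded so they cannot connect unrelated clusters.
    static final Set<String> MAIN = new HashSet<>(Arrays.asList("menu", "kind_menu"));

    public static List<Set<String>> clusters() {
        // Adjacency among non-main tables that share a group.
        Map<String, Set<String>> adj = new TreeMap<>();
        for (String line : GROUPS) {
            List<String> tabs = new ArrayList<>();
            for (String t : line.split(" ")) {
                if (!MAIN.contains(t)) {
                    tabs.add(t);
                }
            }
            for (String a : tabs) {
                adj.computeIfAbsent(a, k -> new TreeSet<>()).addAll(tabs);
            }
        }
        // Depth-first traversal collects one connected component per unseen table.
        List<Set<String>> result = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String start : adj.keySet()) {
            if (!seen.add(start)) {
                continue;
            }
            Set<String> component = new TreeSet<>();
            Deque<String> stack = new ArrayDeque<>();
            stack.push(start);
            while (!stack.isEmpty()) {
                String cur = stack.pop();
                component.add(cur);
                for (String next : adj.get(cur)) {
                    if (seen.add(next)) {
                        stack.push(next);
                    }
                }
            }
            result.add(component);
        }
        return result;
    }

    public static void main(String[] args) {
        for (Set<String> c : clusters()) {
            System.out.println(c);
        }
    }
}
```

Unlike the program above, this sketch leaves the main tables out of the printed clusters; the remaining 12 tables fall into the same 5 groups.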

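The benefit of this approach is obvious: it is fast and not error-prone, whereas before we relied on guesswork and gut feeling. By extension, the algorithm can be applied in other scenarios as well.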
