Schema Design

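Table-per-day schema: one table is created for each day. For user login logs, for example, the tables would be login_20180101, login_20180102, and so on.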

In Hive you can partition a single table by day instead. Queries are efficient, and the layout is clearer than keeping one table per day.

hive> create table loginfo(userid int,logintime timestamp) partitioned by (dateid int);
OK
Time taken: 0.292 seconds

hive> alter table loginfo add partition(dateid=20180101);
OK
Time taken: 0.585 seconds
hive> alter table loginfo add partition(dateid=20180102);
OK
Time taken: 0.464 seconds
hive> alter table loginfo add partition(dateid=20180103);
OK
Time taken: 0.557 seconds
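
As a rough sketch of why day partitions help, the queries below filter on the partition column dateid, so Hive only has to read the matching partition directories rather than the whole table. They use the loginfo table created above; the result depends on whatever data has been loaded, so no output is shown.

```sql
-- List the partitions registered for the table.
SHOW PARTITIONS loginfo;

-- Count logins for one day. The filter on the partition column
-- restricts the scan to the dateid=20180102 directory.
SELECT count(*)
FROM loginfo
WHERE dateid = 20180102;

-- Distinct users per day over a range of partitions.
SELECT dateid, count(DISTINCT userid) AS users
FROM loginfo
WHERE dateid BETWEEN 20180101 AND 20180103
GROUP BY dateid;
```

With one table per day, the last query would instead need a UNION ALL over login_20180101, login_20180102, and so on.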

[root@host ~]# hdfs dfs -ls /user/hive/warehouse/gamedw.db/loginfo
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/hive/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Found 3 items
drwx-wx-wx   - root supergroup          0 2018-09-18 13:46 /user/hive/warehouse/gamedw.db/loginfo/dateid=20180101
drwx-wx-wx   - root supergroup          0 2018-09-18 13:46 /user/hive/warehouse/gamedw.db/loginfo/dateid=20180102
drwx-wx-wx   - root supergroup          0 2018-09-18 13:47 /user/hive/warehouse/gamedw.db/loginfo/dateid=20180103

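HDFS is designed to store millions of large files, not hundreds of millions of small ones. The NameNode keeps the metadata for every file and directory in memory, so too many files will exceed what it can handle.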

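An ideal partitioning scheme keeps the files under each directory large, several times the file system block size, rather than producing too many files and directories. Choose a time granularity at which the amount of data per partition stays roughly even as time passes and each partition is large enough. This improves throughput for typical queries.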
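
If daily partitions turn out to be far smaller than an HDFS block, one option is a coarser granularity. The following is only a sketch under that assumption: the table loginfo_monthly and the column monthid are hypothetical names, and whether monthly partitions are appropriate depends on the actual data volume.

```sql
-- Hypothetical monthly table with the same columns as loginfo.
CREATE TABLE loginfo_monthly (userid INT, logintime TIMESTAMP)
PARTITIONED BY (monthid INT);

-- Let Hive create the target partitions from the query result.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Copy the daily data into monthly partitions.
-- 20180102 / 100 = 201801.02, which the cast truncates to 201801.
INSERT OVERWRITE TABLE loginfo_monthly PARTITION (monthid)
SELECT userid, logintime, CAST(dateid / 100 AS INT) AS monthid
FROM loginfo;
```

The trade-off is that a query for a single day now scans the whole month, so this only pays off when the daily partitions are very small.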

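Another strategy is two-level partitioning on different dimensions, for example time plus region: the first level partitions by time and the second by region. Because regions can hold very different amounts of data, the partitions may end up unevenly sized, and processing the regions with more data takes longer.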
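
A minimal sketch of such a two-level scheme follows. The table name loginfo_region, the column region, and the values 'north' and 'south' are made up for illustration. Each partition becomes a nested directory such as dateid=20180101/region=north under the table directory.

```sql
-- Hypothetical table partitioned first by day, then by region.
CREATE TABLE loginfo_region (userid INT, logintime TIMESTAMP)
PARTITIONED BY (dateid INT, region STRING);

ALTER TABLE loginfo_region ADD PARTITION (dateid=20180101, region='north');
ALTER TABLE loginfo_region ADD PARTITION (dateid=20180101, region='south');

-- Row counts per partition, largest first, to spot skew between regions.
SELECT dateid, region, count(*) AS cnt
FROM loginfo_region
GROUP BY dateid, region
ORDER BY cnt DESC;
```

If one region dominates the counts, that region's partitions are the ones that will take longest to process.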

So choose the partitioning scheme sensibly, based on the actual data.
