Show the top N Hive table sizes

It took two and a half days to complete the whole process, from acquiring the data, through cleaning it, to finally displaying it.

Requirement: Hive has a lot of tables, they take up a lot of storage, and disk space is tight. To see clearly which 10 tables in the Hive warehouse are the largest, we need to build a display for them.

Organizing the approach:

How to get the data: run hadoop fs -du -s -h against the table's absolute path.

To simplify later development, keep the acquired data as concise as possible.

Sort the acquired data. In our actual situation the top 10 tables are bound to be at the T or G level, so the M- and K-level entries can be filtered out at acquisition time.

hadoop fs -du -h -s /data/cc_ads/* | grep T | sort -rn | head -5

hadoop fs -du -h -s /data/cc_ads/* | grep G | sort -rn | head -5

Explanation: grep T and grep G keep only the entries whose unit is T or G, and sort -rn | head -5 sorts in descending order and takes the five largest. We only take five per library because, after looking into each library, only a handful of tables are really big, so the top five basically cover it; taking more adds nothing, though you can of course adjust this to your own situation. After development was finished we also found a better way:

couldn't we just output the size directly in bytes and convert everything to G in one unified step later on?
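A minimal sketch of that idea, purely for illustration (it is not what the job below actually does): if the script ran hadoop fs -du -s without -h, every line of hdfs.log would simply be "<bytes>  <path>", the per-unit grep would no longer be needed because raw byte counts sort directly with sort -rn, and the conversion to G could happen in one place while parsing. The byte count in the example is invented to match a roughly 2.3 T table.

// Illustrative only: parse a "<bytes>  <path>" line (du output without -h)
// and convert straight to G, so no unit handling is needed at all.
def bytesLineToGb(line: String): (Double, String, String) = {
  val arr = line.split("\\s+")                     // Array("<bytes>", "<path>")
  val gb  = arr(0).toDouble / 1024 / 1024 / 1024   // bytes -> G
  (gb, "G", arr(1))
}

// e.g. bytesLineToGb("2528876743885  /data/cclog/t_neu_car")
//      ≈ (2355.2, "G", "/data/cclog/t_neu_car"), i.e. about a 2.3 T table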

The format of the acquired data:

2.3 T  /data/cclog/t_neu_car

Put the commands (i.e. hadoop fs -du -h -s /data/cc_ads/* | grep G | sort -rn | head -5 and the grep T version) into a single script, test.sh, and redirect its output with > so that it overwrites an hdfs.log file:

sh test.sh > hdfs.log

The data in hdfs.log is then exactly the data we need.

After the data is acquired comes data cleansing. Our goal is to write the data into MySQL and display it in Grafana.

With Spark we can read the file into an RDD, convert it to a DataFrame, and then write the DataFrame to MySQL. It sounds easy, and it basically is, but in the actual implementation all sorts of strange situations have to be handled.

The first and most common is an array index going out of bounds. If your logic is fine, go and check your data: the problem is almost certainly caused by dirty data.

My out-of-bounds errors came mainly from the grep T and grep G steps: some table names also contain these two letters, so two tables whose directories hold no data slipped through with a size of zero but no unit, like this:

0 /data/cc_ods/mysql/zkdagh
0 /data/cc_ods/mysql/umtll

That is what caused the array out-of-bounds problems when I split the lines.
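To make the out-of-bounds error concrete, here is roughly what the split produces for a normal line versus one of those zero-size lines (the exact spacing is inferred from the sample output above and from the array indices used in the code below):

// A normal line: the unit sits at index 1 and the path at index 3,
// because the single-space split keeps an empty token between the unit and the path.
"2.3 T  /data/cclog/t_neu_car".split(" ")
// -> Array("2.3", "T", "", "/data/cclog/t_neu_car")

// A zero-size line has no unit, so the array only has three elements;
// reading arr(3) on such a line is what throws the ArrayIndexOutOfBoundsException.
"0  /data/cc_ods/mysql/zkdagh".split(" ")
// -> Array("0", "", "/data/cc_ods/mysql/zkdagh")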

Here is the code

package Caocao_project

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

//import scala.sys.process.processInternal.IOException
object TableSizeApp {

  def main(args: Array[String]): Unit = {
    val num = args(0)
    val spark = SparkSession.builder()
      .master("yarn")
      // .master("local[*]")
      .appName("TableSizeApp")
      .getOrCreate()
    val sc = spark.sparkContext

    // val tableSize = sc.textFile("file:///C:\\Users\\...\\Desktop\\flume-ng-core-1.7.0\\create_table_7.txt")  // local test file
    val tableSize = sc.textFile("hdfs://spark01:9000/tmp/admin/mysql/hdfs.log")
      .map(line => {
        val arr  = line.split(" ")
        val arr1 = arr(0).toDouble
        // The if below mainly handles the dirty data; a try/catch would have been
        // the proper way, but I was not familiar enough with it, so I used an if.
        if (arr(1) == "T") {
          (arr1 * 1024, "G", arr(3))   // convert T to G; path is at index 3
        } else if (arr(1) == "G") {
          (arr1, arr(1), arr(3))
        } else {
          (arr1, "G", arr(2))          // zero-size lines have no unit, so the path is at index 2
        }
      })
      .map(p => Row(p._1, p._2, p._3))

    // The schema is configured with the StructType and StructField classes; the three
    // StructField parameters are (field name, type, whether the field may be null).
    val schema = StructType(Array(
      StructField("size", DoubleType, true),
      StructField("unit", StringType, true),
      StructField("pathtable", StringType, true)))

    // Step 3: create the DataFrame from the Row RDD and the schema
    val sizeDF = spark.createDataFrame(tableSize, schema)
    // sizeDF.registerTempTable("peopleTable")
    sizeDF.createGlobalTempView("Sizetable")

    val result = spark.sql(s"select * from global_temp.Sizetable order by size desc limit $num")
    // val result = spark.sql(s"select * from global_temp.Sizetable order by size desc limit 10")
    // result.show(40)
    result.write.mode("overwrite").format("jdbc")
      .option("url", "jdbc:mysql://172.16.150.89:15361/airflow")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "Sizetable")
      .option("user", "admin")
      .option("password", "9a9F839N4q2maLVC")
      .save()
    spark.stop()
  }
}
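The comment in the code notes that a try/catch would have been the proper way to guard against the dirty lines, rather than the if chain. Purely as a sketch of that alternative (not what this job actually runs), the same parsing could be written with pattern matching plus scala.util.Try, so that any line that does not fit one of the expected shapes is dropped instead of crashing the job:

import scala.util.Try

// Sketch only: the same parsing as above, but malformed lines are dropped via flatMap
// instead of being special-cased with if/else.
val parsed = sc.textFile("hdfs://spark01:9000/tmp/admin/mysql/hdfs.log")
  .flatMap { line =>
    Try {
      line.split(" ") match {
        case Array(size, "T", _, path) => (size.toDouble * 1024, "G", path) // T -> G
        case Array(size, "G", _, path) => (size.toDouble, "G", path)
        case Array(size, _, path)      => (size.toDouble, "G", path)        // unitless zero-size lines
      }
    }.toOption   // a MatchError or NumberFormatException simply turns the line into None
  }

The rest of the pipeline (Row, schema, createDataFrame and the JDBC write) would stay exactly the same.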

Grafana display

(Grafana dashboard screenshots)
