Druid Configuration -- Ingestion Spec

Ingestion Spec

An Apache Druid ingestion spec consists of three parts:

{
  "dataSchema" : {...},
  "ioConfig" : {...},
  "tuningConfig" : {...}
}
Field | Type | Description | Required
dataSchema | JSON Object | Specifies the schema of the incoming data. All ingestion specs can share the same dataSchema. | Yes
ioConfig | JSON Object | Specifies the source and destination of the data. This object varies with the ingestion method. | Yes
tuningConfig | JSON Object | Specifies various tuning parameters for ingestion. This object varies with the ingestion method. | No
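
These three parts do not stand on their own: they are wrapped inside a task (for batch ingestion) or a supervisor spec (for streaming ingestion). As a minimal sketch, assuming the native batch index task type, the wrapping looks like this:

{
  "type" : "index",
  "spec" : {
    "dataSchema" : {...},
    "ioConfig" : {...},
    "tuningConfig" : {...}
  }
}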

DataSchema

The following is an example of a dataSchema:

"dataSchema" : {
  "dataSource" : "wikipedia",
  "parser" : {
    "type" : "string",
    "parseSpec" : {
      "format" : "json",
      "timestampSpec" : {
        "column" : "timestamp",
        "format" : "auto"
      },
      "dimensionsSpec" : {
        "dimensions": [
          "page",
          "language",
          "user",
          "unpatrolled",
          "newPage",
          "robot",
          "anonymous",
          "namespace",
          "continent",
          "country",
          "region",
          "city",
          {
            "type": "long",
            "name": "countryNum"
          },
          {
            "type": "float",
            "name": "userLatitude"
          },
          {
            "type": "float",
            "name": "userLongitude"
          }
        ],
        "dimensionExclusions" : [],
        "spatialDimensions" : []
      }
    }
  },
  "metricsSpec" : [{
    "type" : "count",
    "name" : "count"
  }, {
    "type" : "doubleSum",
    "name" : "added",
    "fieldName" : "added"
  }, {
    "type" : "doubleSum",
    "name" : "deleted",
    "fieldName" : "deleted"
  }, {
    "type" : "doubleSum",
    "name" : "delta",
    "fieldName" : "delta"
  }],
  "granularitySpec" : {
    "segmentGranularity" : "DAY",
    "queryGranularity" : "NONE",
    "intervals" : [ "2013-08-31/2013-09-01" ]
  },
  "transformSpec" : null
}
Field | Type | Description | Required
dataSource | String | The name of the data source being ingested. A data source can be thought of as a table. | Yes
parser | JSON Object | Specifies how the ingested data is parsed. | Yes
metricsSpec | JSON Object array | A list of aggregators. | Yes
granularitySpec | JSON Object | Specifies how segments are created and how data is rolled up. | Yes
transformSpec | JSON Object | Specifies how the input data is transformed and filtered. See transform specs. | No

Parser

If type is not included, the parser defaults to string. For additional data formats, please see our list of extensions.

String Parser

Field | Type | Description | Required
type | String | This should generally be string, or hadoopyString when used in a Hadoop indexing job. | No
parseSpec | JSON Object | Specifies the format, timestamp, and dimensions of the data. | Yes

ParseSpec

ParseSpecs serve two purposes:

  • The String Parser uses them to determine the format (i.e., JSON, CSV, TSV) of incoming rows.
  • All parsers use them to determine the timestamp and dimensions of incoming rows.

If format is not included, the parseSpec defaults to tsv.

JSON ParseSpec

Use this with the String Parser to load JSON.

Field | Type | Description | Required
format | String | This should be json. | No
timestampSpec | JSON Object | Specifies the column and format of the timestamp. | Yes
dimensionsSpec | JSON Object | Specifies the dimensions of the data. | Yes
flattenSpec | JSON Object | Specifies flattening configuration for nested JSON data. See Flattening JSON for more info. | No
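
As a sketch of how these fields fit together, assuming illustrative column names and a hypothetical $.user.city path, a JSON parseSpec with flattening might look like:

"parseSpec" : {
  "format" : "json",
  "timestampSpec" : {
    "column" : "timestamp",
    "format" : "auto"
  },
  "dimensionsSpec" : {
    "dimensions" : [ "page", "language", "userCity" ]
  },
  "flattenSpec" : {
    "useFieldDiscovery" : true,
    "fields" : [
      {
        "type" : "path",
        "name" : "userCity",
        "expr" : "$.user.city"
      }
    ]
  }
}

Here the flattenSpec pulls the nested user.city field up to a top-level column named userCity, so it can be listed as an ordinary dimension.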

JSON Lowercase ParseSpec

The jsonLowercase parser is deprecated and may be removed in a future version of Druid.

This is a special variant of the JSON parseSpec that lowercases all the column names in incoming JSON data. This parseSpec is required if you are updating from Druid 0.6.x to Druid 0.7.x, are ingesting JSON with mixed-case column names directly, do not have any ETL in place to lowercase those column names, and want queries that include data created with both 0.6.x and 0.7.x.

Field | Type | Description | Required
format | String | This should be jsonLowercase. | Yes
timestampSpec | JSON Object | Specifies the column and format of the timestamp. | Yes
dimensionsSpec | JSON Object | Specifies the dimensions of the data. | Yes

CSV ParseSpec

Use this with the String Parser to load CSV. Strings are parsed using the com.opencsv library.

Field | Type | Description | Required
format | String | This should be csv. | Yes
timestampSpec | JSON Object | Specifies the column and format of the timestamp. | Yes
dimensionsSpec | JSON Object | Specifies the dimensions of the data. | Yes
listDelimiter | String | A custom delimiter for multi-value dimensions. | No (default == ctrl+A)
columns | JSON array | Specifies the columns of the data. | Yes
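
For illustration, assuming the same wikipedia-style columns as the earlier example, a CSV parseSpec might look like the following. Note that columns must list every column of the input in order, while dimensionsSpec picks out the subset to ingest as dimensions:

"parseSpec" : {
  "format" : "csv",
  "timestampSpec" : {
    "column" : "timestamp",
    "format" : "auto"
  },
  "columns" : [ "timestamp", "page", "language", "added", "deleted" ],
  "dimensionsSpec" : {
    "dimensions" : [ "page", "language" ]
  }
}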

TimestampSpec

Field | Type | Description | Required
column | String | The column of the timestamp. | Yes
format | String | iso, posix, millis, micro, nano, auto, or any Joda time format. | No (default == 'auto')
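
For example, assuming a column named ts holding values such as 2013-08-31 01:02:33, a timestampSpec with an explicit Joda pattern would look like:

"timestampSpec" : {
  "column" : "ts",
  "format" : "yyyy-MM-dd HH:mm:ss"
}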

DimensionsSpec

Field | Type | Description | Required
dimensions | JSON array | A list of dimension schema objects or dimension names. Providing a name is equivalent to providing a String-typed dimension schema with the given name. If this is an empty array, Druid will treat all non-timestamp, non-metric columns that do not appear in dimensionExclusions as String-typed dimension columns. | Yes
dimensionExclusions | JSON String array | The names of dimensions to exclude from ingestion. | No (default == [])
spatialDimensions | JSON Object array | An array of spatial dimensions. | No (default == [])

Dimension Schema

A dimension schema specifies the type and name of a dimension to be ingested.

For string columns, the dimension schema can also be used to enable or disable bitmap indexing by setting the createBitmapIndex boolean. By default, bitmap indexes are enabled for all string columns. Only string columns can have bitmap indexes; numeric columns do not support them.

For example, the following dimensionsSpec section of a dataSchema ingests one column as Long (countryNum), two columns as Float (userLatitude, userLongitude), the other columns as Strings, and disables the bitmap index for the comment column.

"dimensionsSpec" : {
  "dimensions": [
    "page",
    "language",
    "user",
    "unpatrolled",
    "newPage",
    "robot",
    "anonymous",
    "namespace",
    "continent",
    "country",
    "region",
    "city",
    {
      "type": "string",
      "name": "comment",
      "createBitmapIndex": false
    },
    {
      "type": "long",
      "name": "countryNum"
    },
    {
      "type": "float",
      "name": "userLatitude"
    },
    {
      "type": "float",
      "name": "userLongitude"
    }
  ],
  "dimensionExclusions" : [],
  "spatialDimensions" : []
}

metricsSpec

The metricsSpec is a list of aggregators. If rollup is false in the granularity spec, the metricsSpec should be an empty list and all columns should be defined in the dimensionsSpec instead (without rollup, there is no real distinction between dimensions and metrics at ingestion time). This is optional, however.
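
As a sketch of that rollup-disabled case, the metricsSpec stays empty, every column of interest is declared in the parser's dimensionsSpec instead, and rollup is switched off in the granularitySpec:

"metricsSpec" : [],
"granularitySpec" : {
  "segmentGranularity" : "DAY",
  "queryGranularity" : "NONE",
  "rollup" : false
}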

GranularitySpec

The GranularitySpec defines how a data source is partitioned into time chunks. The default granularitySpec is uniform; this can be changed by setting the type field. Currently, the uniform and arbitrary types are supported.

Uniform Granularity Spec

This spec is used to generate segments with uniform intervals.

Field | Type | Description | Required
segmentGranularity | string | The granularity at which to create time chunks. Multiple segments can be created per time chunk. For example, with 'DAY' segmentGranularity, events of the same day fall into the same time chunk, which can optionally be further partitioned into multiple segments based on other configurations and input size. See Granularity for supported granularities. | No (default == 'DAY')
queryGranularity | string | The minimum granularity at which results can be queried, and the granularity of the data within each segment. For example, a value of 'minute' means that data is aggregated at minutely granularity: if there are collisions in the tuple (minute(timestamp), dimensions), the aggregators combine the values instead of storing individual rows. A granularity of 'NONE' means millisecond granularity. See Granularity for supported granularities. | No (default == 'NONE')
rollup | boolean | Whether to roll up data or not. | No (default == true)
intervals | JSON string array | A list of intervals for the raw data being ingested. Ignored for real-time ingestion. | No. If specified, Hadoop and native non-parallel batch ingestion tasks may skip the determine-partitions phase, which results in faster ingestion; the native parallel ingestion task can request all of its locks up front instead of one by one. Batch ingestion will throw away any data not inside the specified intervals.

Arbitrary Granularity Spec

This spec is used to generate segments with arbitrary intervals (it tries to create evenly sized segments). This spec is not supported for real-time processing.

Field | Type | Description | Required
queryGranularity | string | The minimum granularity at which results can be queried, and the granularity of the data within each segment. For example, a value of 'minute' means that data is aggregated at minutely granularity: if there are collisions in the tuple (minute(timestamp), dimensions), the aggregators combine the values instead of storing individual rows. A granularity of 'NONE' means millisecond granularity. See Granularity for supported granularities. | No (default == 'NONE')
rollup | boolean | Whether to roll up data or not. | No (default == true)
intervals | JSON string array | A list of intervals for the raw data being ingested. Ignored for real-time ingestion. | No. If specified, Hadoop and native non-parallel batch ingestion tasks may skip the determine-partitions phase, which results in faster ingestion; the native parallel ingestion task can request all of its locks up front instead of one by one. Batch ingestion will throw away any data not inside the specified intervals.
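
Reusing the interval from the earlier dataSchema example, an arbitrary granularitySpec would look like:

"granularitySpec" : {
  "type" : "arbitrary",
  "queryGranularity" : "NONE",
  "intervals" : [ "2013-08-31/2013-09-01" ]
}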

Transform Spec

The transform spec allows Druid to transform and filter input data during ingestion. See Transform specs.
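
As a brief sketch, assuming the wikipedia columns from the earlier example and a hypothetical derived column named countryUpper, a transformSpec that adds an expression transform and filters out robot edits might look like:

"transformSpec" : {
  "transforms" : [
    {
      "type" : "expression",
      "name" : "countryUpper",
      "expression" : "upper(country)"
    }
  ],
  "filter" : {
    "type" : "selector",
    "dimension" : "robot",
    "value" : "false"
  }
}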

IO Config

The IOConfig spec differs based on the ingestion task type.
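
As one illustrative sketch, assuming the native batch index task reading local files (the baseDir and filter values are placeholders, not taken from the examples above), an ioConfig might look like:

"ioConfig" : {
  "type" : "index",
  "firehose" : {
    "type" : "local",
    "baseDir" : "quickstart/",
    "filter" : "wikipedia-2013-08-31.json"
  }
}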

Tuning Config

The TuningConfig spec differs based on the ingestion task type.
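
Again as an illustrative sketch for the native batch index task (the values shown are assumptions, not recommendations), a tuningConfig might look like:

"tuningConfig" : {
  "type" : "index",
  "maxRowsInMemory" : 1000000,
  "maxRowsPerSegment" : 5000000
}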

Evaluating Timestamp, Dimensions and Metrics

Druid will interpret dimensions, dimension exclusions, and metrics in the following order:

  • Any column listed in the list of dimensions is treated as a dimension.
  • Any column listed in the list of dimension exclusions is excluded as a dimension.
  • The timestamp column and the columns/fieldNames required by metrics are excluded by default.
  • If a metric is also listed as a dimension, the metric must have a different name than the dimension name (see the sketch below).
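
The last rule is worth a sketch: reusing the added column from the earlier example, if you want to ingest added raw as a dimension while also summing it, the metric must use a different name:

"dimensionsSpec" : {
  "dimensions" : [ "page", "added" ]
},
"metricsSpec" : [{
  "type" : "doubleSum",
  "name" : "addedSum",
  "fieldName" : "added"
}]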

Original documentation

http://druid.apache.org/docs/latest/ingestion/ingestion-spec.html#dataschema

Source: blog.csdn.net/u010993514/article/details/94740217