Ingestion Spec
An Apache Druid ingestion spec consists of three parts:
```json
{
  "dataSchema" : {...},
  "ioConfig" : {...},
  "tuningConfig" : {...}
}
```
Field | Type | Description | Required |
---|---|---|---|
dataSchema | JSON Object | Specifies the schema of the incoming data. All ingestion methods can share the same dataSchema. | Yes |
ioConfig | JSON Object | Specifies the source and destination of the data. This object varies with the ingestion method. | Yes |
tuningConfig | JSON Object | Specifies how to tune various ingestion parameters. This object varies with the ingestion method. | No |
DataSchema
The following is an example of a dataSchema:
```json
"dataSchema" : {
  "dataSource" : "wikipedia",
  "parser" : {
    "type" : "string",
    "parseSpec" : {
      "format" : "json",
      "timestampSpec" : {
        "column" : "timestamp",
        "format" : "auto"
      },
      "dimensionsSpec" : {
        "dimensions": [
          "page",
          "language",
          "user",
          "unpatrolled",
          "newPage",
          "robot",
          "anonymous",
          "namespace",
          "continent",
          "country",
          "region",
          "city",
          {
            "type": "long",
            "name": "countryNum"
          },
          {
            "type": "float",
            "name": "userLatitude"
          },
          {
            "type": "float",
            "name": "userLongitude"
          }
        ],
        "dimensionExclusions" : [],
        "spatialDimensions" : []
      }
    }
  },
  "metricsSpec" : [{
    "type" : "count",
    "name" : "count"
  }, {
    "type" : "doubleSum",
    "name" : "added",
    "fieldName" : "added"
  }, {
    "type" : "doubleSum",
    "name" : "deleted",
    "fieldName" : "deleted"
  }, {
    "type" : "doubleSum",
    "name" : "delta",
    "fieldName" : "delta"
  }],
  "granularitySpec" : {
    "segmentGranularity" : "DAY",
    "queryGranularity" : "NONE",
    "intervals" : [ "2013-08-31/2013-09-01" ]
  },
  "transformSpec" : null
}
```
Field | Type | Description | Required |
---|---|---|---|
dataSource | String | The name of the ingested datasource. Datasources can be thought of as tables. | Yes |
parser | JSON Object | Specifies how the ingested data is parsed. | Yes |
metricsSpec | JSON Object array | A list of aggregators. | Yes |
granularitySpec | JSON Object | Specifies how to create segments and roll up data. | Yes |
transformSpec | JSON Object | Specifies how the input data should be transformed and filtered. See transform specs. | No |
Parser
If `type` is not included, the parser defaults to `string`. For additional data formats, please see our list of extensions.
String Parser
Field | Type | Description | Required |
---|---|---|---|
type | String | This should say `string` in general, or `hadoopyString` when used in a Hadoop indexing job. | No |
parseSpec | JSON Object | Specifies the format, timestamp, and dimensions of the data. | Yes |
ParseSpec
ParseSpecs serve two purposes:
- The String Parser uses them to determine the format (i.e., JSON, CSV, TSV) of incoming rows.
- All parsers use them to determine the timestamp and dimensions of incoming rows.
If `format` is not included, the parseSpec defaults to `tsv`.
JSON ParseSpec
Use this with the String Parser to load JSON.
Field | Type | Description | Required |
---|---|---|---|
format | String | This should say `json`. | No |
timestampSpec | JSON Object | Specifies the column and format of the timestamp. | Yes |
dimensionsSpec | JSON Object | Specifies the dimensions of the data. | Yes |
flattenSpec | JSON Object | Specifies flattening configuration for nested JSON data. See Flattening JSON for more info. | No |
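For instance, a JSON parseSpec with a flattenSpec might look like the following sketch, which pulls a nested field up to a top-level dimension (the field name `userId` and the path expression are illustrative, not from the example data above):

```json
"parseSpec" : {
  "format" : "json",
  "flattenSpec" : {
    "useFieldDiscovery" : true,
    "fields" : [
      { "type" : "path", "name" : "userId", "expr" : "$.user.id" }
    ]
  },
  "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
  "dimensionsSpec" : { "dimensions" : [ "userId", "page" ] }
}
```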
JSON Lowercase ParseSpec
The _jsonLowercase_ parser is deprecated and may be removed in a future version of Druid.
This is a special variant of the JSON ParseSpec that lowercases all the column names in the incoming JSON data. This parseSpec is required if you are updating from Druid 0.6.x to Druid 0.7.x, are directly ingesting JSON with mixed-case column names, do not have any ETL in place to lowercase those column names, and would like to make queries that include the data you created using 0.6.x and 0.7.x.
Field | Type | Description | Required |
---|---|---|---|
format | String | This should say `jsonLowercase`. | Yes |
timestampSpec | JSON Object | Specifies the column and format of the timestamp. | Yes |
dimensionsSpec | JSON Object | Specifies the dimensions of the data. | Yes |
CSV ParseSpec
Use this with the String Parser to load CSV. Strings are parsed using the com.opencsv library.
Field | Type | Description | Required |
---|---|---|---|
format | String | This should say `csv`. | Yes |
timestampSpec | JSON Object | Specifies the column and format of the timestamp. | Yes |
dimensionsSpec | JSON Object | Specifies the dimensions of the data. | Yes |
listDelimiter | String | A custom delimiter for multi-value dimensions. | No (default == ctrl+A) |
columns | JSON array | Specifies the columns of the data. | Yes |
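Because CSV rows carry no field names, a CSV parseSpec must list every input column in order; a minimal sketch (the column names are illustrative):

```json
"parseSpec" : {
  "format" : "csv",
  "columns" : [ "timestamp", "page", "user", "added" ],
  "listDelimiter" : "|",
  "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
  "dimensionsSpec" : { "dimensions" : [ "page", "user" ] }
}
```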
TimestampSpec
Field | Type | Description | Required |
---|---|---|---|
column | String | The column of the timestamp. | Yes |
format | String | iso, posix, millis, micro, nano, auto, or any Joda-Time format. | No (default == 'auto') |
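For timestamps that `auto` cannot detect, an explicit Joda-Time pattern can be supplied; for example, for values like `2013-08-31 01:02:33`:

```json
"timestampSpec" : {
  "column" : "timestamp",
  "format" : "yyyy-MM-dd HH:mm:ss"
}
```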
DimensionsSpec
Field | Type | Description | Required |
---|---|---|---|
dimensions | JSON array | A list of dimension schema objects or dimension names. Providing a name is equivalent to providing a String-typed dimension schema with the given name. If this is an empty array, Druid will treat every non-timestamp, non-metric column that does not appear in dimensionExclusions as a String-typed dimension column. | Yes |
dimensionExclusions | JSON String array | The names of dimensions to exclude from ingestion. | No (default == []) |
spatialDimensions | JSON Object array | An array of spatial dimensions. | No (default == []) |
Dimension Schema
A dimension schema specifies the type and name of a dimension to be ingested.
For string columns, the dimension schema can also be used to enable or disable bitmap indexing by setting the `createBitmapIndex` boolean. By default, bitmap indexes are enabled for all string columns. Only string columns can have bitmap indexes; numeric columns do not support them.
For example, the following `dimensionsSpec` section of a `dataSchema` ingests one column as Long (`countryNum`), two columns as Float (`userLatitude`, `userLongitude`), the other columns as Strings, and disables the bitmap index for the `comment` column.
```json
"dimensionsSpec" : {
  "dimensions": [
    "page",
    "language",
    "user",
    "unpatrolled",
    "newPage",
    "robot",
    "anonymous",
    "namespace",
    "continent",
    "country",
    "region",
    "city",
    {
      "type": "string",
      "name": "comment",
      "createBitmapIndex": false
    },
    {
      "type": "long",
      "name": "countryNum"
    },
    {
      "type": "float",
      "name": "userLatitude"
    },
    {
      "type": "float",
      "name": "userLongitude"
    }
  ],
  "dimensionExclusions" : [],
  "spatialDimensions" : []
}
```
metricsSpec
The `metricsSpec` is a list of aggregators. If `rollup` is false in the granularity spec, the metricsSpec should be an empty list and all columns should be defined in the `dimensionsSpec` instead (without rollup, there is no real distinction between dimensions and metrics at ingestion time). This is optional, however.
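With rollup disabled, a schema might therefore move every field into the dimensions list and leave the aggregator list empty; a minimal sketch under that assumption (the column names are illustrative):

```json
"metricsSpec" : [],
"dimensionsSpec" : {
  "dimensions" : [ "page", "user", { "type" : "long", "name" : "added" } ]
},
"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "DAY",
  "queryGranularity" : "NONE",
  "rollup" : false
}
```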
GranularitySpec
The GranularitySpec defines how a datasource is partitioned into time chunks. The default granularitySpec is `uniform`, and it can be changed by setting the `type` field. Currently, the `uniform` and `arbitrary` types are supported.
Uniform Granularity Spec
This spec is used to generate segments with uniform intervals.
Field | Type | Description | Required |
---|---|---|---|
segmentGranularity | string | The granularity at which to create time chunks. Multiple segments can be created per time chunk. For example, with 'DAY' segmentGranularity, the events of the same day fall into the same time chunk, which can optionally be further partitioned into multiple segments based on other configurations and input size. See Granularity for supported granularities. | No (default == 'DAY') |
queryGranularity | string | The minimum granularity at which results can be queried, and the granularity of the data inside the segment. For example, a value of "minute" means data is aggregated at minutely granularity. That is, if there are collisions in the tuple (minute(timestamp), dimensions), it will aggregate the values together using the aggregators instead of storing individual rows. A granularity of 'NONE' means millisecond granularity. See Granularity for supported granularities. | No (default == 'NONE') |
rollup | boolean | Whether to roll up or not. | No (default == true) |
intervals | JSON string array | A list of intervals for the raw data being ingested. Ignored for real-time ingestion. | No. If specified, Hadoop and native non-parallel batch ingestion tasks may skip the partition-determining phase, which results in faster ingestion, and native parallel ingestion tasks can request all their locks up front instead of one by one. Batch ingestion will throw away any data not inside the specified intervals. |
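Putting the fields in the table together, a uniform granularitySpec might look like:

```json
"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "DAY",
  "queryGranularity" : "NONE",
  "rollup" : true,
  "intervals" : [ "2013-08-31/2013-09-01" ]
}
```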
Arbitrary Granularity Spec
This spec is used to generate segments with arbitrary intervals (it tries to create evenly sized segments). This spec is not supported for real-time processing.
Field | Type | Description | Required |
---|---|---|---|
queryGranularity | string | The minimum granularity at which results can be queried, and the granularity of the data inside the segment. For example, a value of "minute" means data is aggregated at minutely granularity. That is, if there are collisions in the tuple (minute(timestamp), dimensions), it will aggregate the values together using the aggregators instead of storing individual rows. A granularity of 'NONE' means millisecond granularity. See Granularity for supported granularities. | No (default == 'NONE') |
rollup | boolean | Whether to roll up or not. | No (default == true) |
intervals | JSON string array | A list of intervals for the raw data being ingested. Ignored for real-time ingestion. | No. If specified, Hadoop and native non-parallel batch ingestion tasks may skip the partition-determining phase, which results in faster ingestion, and native parallel ingestion tasks can request all their locks up front instead of one by one. Batch ingestion will throw away any data not inside the specified intervals. |
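Since there is no segmentGranularity field for this type, an arbitrary spec is a sketch like the following:

```json
"granularitySpec" : {
  "type" : "arbitrary",
  "queryGranularity" : "NONE",
  "rollup" : true,
  "intervals" : [ "2013-08-31/2013-09-01" ]
}
```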
Transform Spec
Transform specs allow Druid to transform and filter input data during ingestion. See Transform specs.
IO Config
The IOConfig spec differs based on the ingestion task type.
- Native batch ingestion: See Native Batch IOConfig
- Hadoop batch ingestion: See Hadoop Batch IOConfig
- Kafka Indexing Service: See Kafka Supervisor IOConfig
- Stream push ingestion: Stream push with Tranquility does not require an IO Config
- Stream pull ingestion (deprecated): See Stream pull ingestion
Tuning Config
The TuningConfig spec differs based on the ingestion task type.
- Native batch ingestion: See Native Batch TuningConfig
- Hadoop batch ingestion: See Hadoop Batch TuningConfig
- Kafka Indexing Service: See Kafka Supervisor TuningConfig
- Stream push ingestion (Tranquility): See Tranquility TuningConfig
- Stream pull ingestion (deprecated): See Stream pull ingestion
Evaluating Timestamp, Dimensions and Metrics
Druid will interpret dimensions, dimension exclusions, and metrics in the following order:
- Any column listed in the list of dimensions is treated as a dimension.
- Any column listed in the list of dimension exclusions is excluded as a dimension.
- The timestamp column and the columns/fieldNames required by metrics are excluded by default.
- If a metric is also listed as a dimension, the metric must have a different name than the dimension name.
Original article
http://druid.apache.org/docs/latest/ingestion/ingestion-spec.html#dataschema