Add Scalar function to Databend | Function Development Series 1

Databend implements two kinds of functions: scalar functions and aggregate functions.

Scalar function: Returns a single value based on input values. Common Scalar functions include now, round, etc.

Aggregate function: Operates on the values of a column and returns a single value. Common aggregate functions include sum, count, avg, etc.

https://github.com/datafuselabs/databend/tree/main/src/query/functions/src

This series consists of two articles. This one explains how a scalar function runs in Databend, from registration to execution.

Function registration

Function registration is handled by FunctionRegistry.

#[derive(Default)]
pub struct FunctionRegistry {
    pub funcs: HashMap<&'static str, Vec<Arc<Function>>>,
    #[allow(clippy::type_complexity)]
    pub factories: HashMap<
        &'static str,
        Vec<Box<dyn Fn(&[usize], &[DataType]) -> Option<Arc<Function>> + 'static>>,
    >,
    pub aliases: HashMap<&'static str, &'static str>,
}

All three fields are HashMaps.

Both funcs and factories store registered functions. The difference is that funcs holds functions with a fixed number of parameters (currently from 0 to 5), registered through register_0_arg, register_1_arg, and so on, while factories holds functions with a variable number of parameters (such as concat), registered through register_function_factory.

Since a function may have multiple aliases (for example, the aliases of minus include subtract and neg), the registry also keeps aliases. Each key is an alias, and each value is the name of an already registered function. Aliases are added by calling register_aliases.

In addition, different levels of registration API are provided to suit different functional requirements.
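The relationship between the three maps can be pictured with a minimal, self-contained sketch. Everything here is a simplified assumption for illustration: SimpleFunction, SimpleRegistry, register_fixed, register_factory, register_alias, lookup, and the variadic add_all function are hypothetical names, arguments are plain i64 values, and the lookup order (alias first, then fixed-arity functions, then factories) is only one plausible strategy, not necessarily what Databend does.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-in for Databend's `Function`: a name, an arity, and an eval closure.
pub struct SimpleFunction {
    pub name: &'static str,
    pub arity: usize,
    pub eval: Box<dyn Fn(&[i64]) -> i64>,
}

// A factory builds a function on demand from the number of arguments.
type Factory = Box<dyn Fn(usize) -> Option<Arc<SimpleFunction>>>;

#[derive(Default)]
pub struct SimpleRegistry {
    funcs: HashMap<&'static str, Vec<Arc<SimpleFunction>>>,
    factories: HashMap<&'static str, Vec<Factory>>,
    aliases: HashMap<&'static str, &'static str>,
}

impl SimpleRegistry {
    // Register a function with a fixed number of parameters.
    pub fn register_fixed(
        &mut self,
        name: &'static str,
        arity: usize,
        eval: impl Fn(&[i64]) -> i64 + 'static,
    ) {
        let func = SimpleFunction { name, arity, eval: Box::new(eval) };
        self.funcs.entry(name).or_default().push(Arc::new(func));
    }

    // Register a factory for a function with a variable number of parameters.
    pub fn register_factory(
        &mut self,
        name: &'static str,
        factory: impl Fn(usize) -> Option<Arc<SimpleFunction>> + 'static,
    ) {
        self.factories.entry(name).or_default().push(Box::new(factory));
    }

    // Key is the alias, value is the name of an existing function.
    pub fn register_alias(&mut self, alias: &'static str, target: &'static str) {
        self.aliases.insert(alias, target);
    }

    // Resolve the alias, then try fixed-arity functions, then factories.
    pub fn lookup(&self, name: &str, n_args: usize) -> Option<Arc<SimpleFunction>> {
        let name = self.aliases.get(name).copied().unwrap_or(name);
        let fixed = self
            .funcs
            .get(name)
            .and_then(|candidates| candidates.iter().find(|f| f.arity == n_args));
        if let Some(found) = fixed {
            return Some(Arc::clone(found));
        }
        self.factories.get(name)?.iter().find_map(|factory| factory(n_args))
    }
}

// Build a small registry: one fixed-arity function, one alias, one factory.
pub fn demo_registry() -> SimpleRegistry {
    let mut registry = SimpleRegistry::default();
    registry.register_fixed("minus", 2, |args| args[0] - args[1]);
    registry.register_alias("subtract", "minus");
    registry.register_factory("add_all", |n_args| {
        Some(Arc::new(SimpleFunction {
            name: "add_all",
            arity: n_args,
            eval: Box::new(|args: &[i64]| args.iter().sum::<i64>()),
        }))
    });
    registry
}
```

With this registry, looking up subtract with two arguments resolves to minus, while add_all is built by its factory for whatever argument count is requested.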

Function composition

Knowing that the value of funcs is the body of the function, let's take a look at how Function is constructed in Databend.

pub struct Function {
    pub signature: FunctionSignature,
    #[allow(clippy::type_complexity)]
    pub calc_domain: Box<dyn Fn(&[Domain]) -> Option<Domain>>,
    #[allow(clippy::type_complexity)]
    pub eval: Box<dyn Fn(&[ValueRef<AnyType>], FunctionContext) -> Result<Value<AnyType>, String>>,
}

Here, signature includes the function name, parameter types, return type, and function properties (no function uses the properties yet; they are only reserved). Note that the function name must be lowercase when registering. Some tokens are converted through src/query/ast/src/parser/token.rs.

#[allow(non_camel_case_types)]
#[derive(Logos, Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum TokenKind {
    ...
    #[token("+")]
    Plus,
    ...
}

Take the addition in `select 1+2` as an example: `+` is converted to Plus, and since the function name must be lowercase, we register the function under the name `plus`.

with_number_mapped_type!(|NUM_TYPE| match left {
    NumberDataType::NUM_TYPE => {
        registry.register_1_arg::<NumberType<NUM_TYPE>, NumberType<NUM_TYPE>, _, _>(
            "plus",
            FunctionProperty::default(),
            |lhs| Some(lhs.clone()),
            |a, _| a,
        );
    }
});
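The naming rule can be illustrated with a small sketch. It assumes, purely for illustration, that the registered name is the lowercase form of the token's variant name; OpToken, token_from_symbol, and function_name are hypothetical and much smaller than Databend's real TokenKind.

```rust
// Hypothetical, trimmed-down operator tokens; the real TokenKind is much larger.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OpToken {
    Plus,
    Minus,
    Multiply,
}

// Map an operator symbol to its token, as a lexer would.
pub fn token_from_symbol(symbol: &str) -> Option<OpToken> {
    match symbol {
        "+" => Some(OpToken::Plus),
        "-" => Some(OpToken::Minus),
        "*" => Some(OpToken::Multiply),
        _ => None,
    }
}

// Assumption: the registry key is the lowercase variant name.
pub fn function_name(token: OpToken) -> String {
    format!("{:?}", token).to_lowercase()
}
```

Under this assumption, the symbol `+` ends up as the registry key `plus`, which is why the registration uses that lowercase name.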

calc_domain computes the set of possible output values from the sets of possible input values. In mathematical terms, for `y = f(x)`, the domain is the set of x values that can be passed to f to produce y. This lets us easily filter out values that are not in the domain when indexing data, which greatly improves response efficiency.
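To make the idea concrete, here is a minimal sketch of domain calculation for integer addition. It assumes a domain is just a closed min/max range and that None means "no information"; IntDomain, plus_domain, and may_contain are hypothetical names, and Databend's actual Domain type is richer than this.

```rust
// Hypothetical domain: every value lies in the closed range [min, max].
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct IntDomain {
    pub min: i64,
    pub max: i64,
}

// Output domain of `a + b`. Returns None when the bounds could overflow,
// which is treated here as "nothing is known about the output".
pub fn plus_domain(a: &IntDomain, b: &IntDomain) -> Option<IntDomain> {
    Some(IntDomain {
        min: a.min.checked_add(b.min)?,
        max: a.max.checked_add(b.max)?,
    })
}

// True when a block whose values lie in `domain` may contain `target`.
// With no domain information the block can never be skipped.
pub fn may_contain(domain: Option<IntDomain>, target: i64) -> bool {
    match domain {
        Some(d) => d.min <= target && target <= d.max,
        None => true,
    }
}
```

For example, if one column lies in [1, 10] and another in [100, 200], their sum lies in [101, 210], so a filter looking for the value 500 can skip the block without evaluating the function row by row.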

eval is the concrete implementation of the function. In essence, it accepts some values, such as strings or numbers, parses them into expressions, and converts them into another set of values.

Example

The functions currently implemented in function-v2 fall into these categories: arithmetic, array, boolean, control, comparison, datetime, math, string, string_mult_args, and variant.

Take the implementation of length as an example:

length accepts a String argument and returns a Number. The function name is length, and the domain is unrestricted (because any string has a length). The last parameter is a closure, which serves as the eval implementation of length.

registry.register_1_arg::<StringType, NumberType<u64>, _, _>(
    "length",
    FunctionProperty::default(),
    |_| None,
    |val, _| val.len() as u64,
);

In the implementation of register_1_arg, we can see that it calls register_passthrough_nullable_1_arg, whose name contains "nullable", and that eval is wrapped by vectorize_1_arg.

Note: Do not manually modify [src/query/expression/src/register.rs](https://github.com/datafuselabs/databend/blob/2aec38605eebb7f0e1717f7f54ec52ae0f2e530b/src/query/expression/src/register.rs), the file where register_1_arg is located, because it is generated by [src/query/codegen/src/writes/register.rs](https://github.com/datafuselabs/databend/blob/2aec38605eebb7f0e1717f7f54ec52ae0f2e530b/src/query/codegen/src/writes/register.rs).

pub fn register_1_arg<I1: ArgType, O: ArgType, F, G>(
    &mut self,
    name: &'static str,
    property: FunctionProperty,
    calc_domain: F,
    func: G,
) where
    F: Fn(&I1::Domain) -> Option<O::Domain> + 'static + Clone + Copy,
    G: Fn(I1::ScalarRef<'_>, FunctionContext) -> O::Scalar + 'static + Clone + Copy,
{
    self.register_passthrough_nullable_1_arg::<I1, O, _, _>(
        name,
        property,
        calc_domain,
        vectorize_1_arg(func),
    )
}

This is because, in real scenarios, eval receives not only strings or numbers but also null and various other types, and null is the most special case. The argument may also be either a column or a single value. For example:

select length(null);
+--------------+
| length(null) |
+--------------+
|         NULL |
+--------------+
select length(id) from t;
+------------+
| length(id) |
+------------+
|          2 |
|          3 |
+------------+

Based on this, if the function does not need special handling for null values, use register_x_arg directly. If it does need special handling for null, refer to try_to_timestamp.
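The two wrappers can be pictured with a simplified sketch. It is an assumption-laden model, not Databend's code: Val, vectorize_1, passthrough_nullable_1, and length_fn are hypothetical names, a column is a plain Vec, and null is modeled as Option::None rather than a validity bitmap.

```rust
// Hypothetical value model: an argument is either a single value or a column.
#[derive(Clone, Debug, PartialEq)]
pub enum Val<T> {
    Scalar(T),
    Column(Vec<T>),
}

// Lift a per-row function so that it works on both scalars and columns.
pub fn vectorize_1<I, O>(func: impl Fn(&I) -> O) -> impl Fn(&Val<I>) -> Val<O> {
    move |input| match input {
        Val::Scalar(v) => Val::Scalar(func(v)),
        Val::Column(col) => Val::Column(col.iter().map(|v| func(v)).collect()),
    }
}

// Wrap a vectorized function so that null inputs pass through as null outputs.
pub fn passthrough_nullable_1<I: Clone, O>(
    func: impl Fn(&Val<I>) -> Val<O>,
) -> impl Fn(&Val<Option<I>>) -> Val<Option<O>> {
    move |input| match input {
        Val::Scalar(None) => Val::Scalar(None),
        Val::Scalar(Some(v)) => match func(&Val::Scalar(v.clone())) {
            Val::Scalar(out) => Val::Scalar(Some(out)),
            Val::Column(_) => unreachable!("scalar input yields scalar output"),
        },
        Val::Column(col) => {
            // Evaluate only the non-null rows, then put the nulls back in place.
            let valid: Vec<I> = col.iter().flatten().cloned().collect();
            let mut outputs = match func(&Val::Column(valid)) {
                Val::Column(out) => out.into_iter(),
                Val::Scalar(_) => unreachable!("column input yields column output"),
            };
            Val::Column(
                col.iter()
                    .map(|slot| slot.as_ref().and_then(|_| outputs.next()))
                    .collect(),
            )
        }
    }
}

// `length` written only for the non-null, single-row case, then lifted twice.
pub fn length_fn() -> impl Fn(&Val<Option<String>>) -> Val<Option<u64>> {
    passthrough_nullable_1(vectorize_1(|s: &String| s.len() as u64))
}
```

The author of length only writes the per-row closure; the wrappers take care of columns and of null, which mirrors the two queries shown above.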

For functions that need specialized vectorization, call register_passthrough_nullable_x_arg and supply a vectorized implementation tailored to the function.

For example, take the comparison function regexp: it receives two String values and returns a Bool value. To avoid compiling the same regular expression repeatedly during vectorized execution, a HashMap is introduced to cache compiled patterns. That is why `vectorize_regexp` is implemented separately.

registry.register_passthrough_nullable_2_arg::<StringType, StringType, BooleanType, _, _>(
    "regexp",
    FunctionProperty::default(),
    |_, _| None,
    vectorize_regexp(|str, pat, map, _| {
        let pattern = if let Some(pattern) = map.get(pat) {
            pattern
        } else {
            let re = regexp::build_regexp_from_pattern("regexp", pat, None)?;
            map.insert(pat.to_vec(), re);
            map.get(pat).unwrap()
        };
        Ok(pattern.is_match(str))
    }),
);
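The benefit of the cache can be seen in a self-contained sketch. To stay within the standard library, it swaps the real regular expression engine for plain substring search; CompiledPattern and match_column are hypothetical names, and the returned counter exists only to show how often a pattern is "compiled".

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a compiled regular expression: substring search.
pub struct CompiledPattern {
    needle: Vec<u8>,
}

impl CompiledPattern {
    pub fn is_match(&self, haystack: &[u8]) -> bool {
        // An empty needle matches everything; `windows(0)` would panic.
        self.needle.is_empty()
            || haystack
                .windows(self.needle.len())
                .any(|window| window == self.needle.as_slice())
    }
}

// Match each string against the pattern in the same row.
// Returns the per-row results and the number of patterns compiled.
pub fn match_column(strs: &[&[u8]], pats: &[&[u8]]) -> (Vec<bool>, usize) {
    let mut cache: HashMap<Vec<u8>, CompiledPattern> = HashMap::new();
    let mut compiled = 0usize;
    let mut results = Vec::with_capacity(strs.len());
    for (s, p) in strs.iter().zip(pats.iter()) {
        // Compile only when this pattern has not been seen in the column yet.
        if !cache.contains_key(*p) {
            compiled += 1;
            cache.insert(p.to_vec(), CompiledPattern { needle: p.to_vec() });
        }
        results.push(cache[*p].is_match(s));
    }
    (results, compiled)
}
```

When the pattern column repeats the same value for many rows, which is the common case for a constant pattern, the expensive compile step runs once per distinct pattern instead of once per row.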


Function test

Unit Test

Function-related unit tests are in the scalars directory.

Logic Test

The logic tests related to Functions are in the 02_function directory.

About Databend

Databend is an open source, flexible, low-cost, new data warehouse that can also perform real-time analysis based on object storage. Looking forward to your attention, let's explore cloud-native data warehouse solutions together to create a new generation of open source Data Cloud.


Origin my.oschina.net/u/5489811/blog/7554715