User-defined functions allow you to extend a database by adding new functions that can be evaluated in SQL. The SQL standard distinguishes between scalar, aggregate, and table functions. A scalar function takes a set of values and returns a single value (or NULL), an aggregate function takes a column of values and returns a single value, and a table function returns a (relational) table comprising zero or more rows, each with one or more columns.

SDF allows you to define almost any imaginable user-defined function. The following sections describe how to author user-defined function signatures.

Scalar functions

SDF distinguishes scalar functions along several dimensions:

  • They can have fixed or varying arity, i.e., number of parameters.
  • They can have one definition or be overloaded, i.e., have many definitions.
  • They can be monomorphic, i.e., defined for one type only, or polymorphic, i.e., defined to operate on all types.
  • They can be first-order, i.e., take only data as parameters, or higher-order, i.e., also take functions as parameters.

We explain each of these concepts, and the authoring of the corresponding function signatures, in turn.

Fixed Arity

The function substr is a typical scalar function. It takes two arguments of the types varchar and int, respectively. We say substr has fixed arity two. The expression substr(text, start) returns a varchar.

In SDF we write substr’s function signature as follows:

function:
  name: substr
  parameters: 
    - datatype: varchar
    - datatype: int
  returns: 
    datatype: varchar
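
For example, in Trino, where string positions are 1-based, a call matching this signature looks as follows:

SELECT substr('hello world', 7);
-- returns 'world'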

By default, function definitions are for scalar functions. If we want to make the function kind explicit, we add the function-kind field.

function:
  name: substr
  function-kind: scalar
  ...

Alternative function kinds are aggregate and table, which we will introduce later.

Overloading

The function substr is overloaded. You can call it with two parameters, as we saw above, or with three parameters, e.g., substr(string, start, length). To allow for a second definition of substr, you have to declare one of them to be an overload.

function:
  name: substr
  overload: true
  parameters: 
    - datatype: varchar
    - datatype: int
    - datatype: int
  volatility: pure
  returns: 
    datatype: varchar

The default for overload is false.
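
With both definitions in place, calls of either arity typecheck. For example, using the three-parameter overload in Trino:

SELECT substr('hello world', 1, 5);
-- returns 'hello'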

Uniform Variadic Arity

Many functions in SQL can take an arbitrary number of arguments. For instance, concat concatenates an arbitrary number of strings. We say concat is uniform variadic, that is, the function can have one or more parameters, but all must have the same type. In SDF we write:

function:
  name: concat
  variadic: uniform
  parameters: 
    - datatype: varchar
  returns: 
    datatype: varchar
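
A call then typechecks with any positive number of varchar arguments. For example:

SELECT concat('foo', 'bar', 'baz');
-- returns 'foobarbaz'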

Generic types and functions

Modern SQL has not only primitive types like int and varchar, but also composite types like array, map, and struct. These composite types are generic, so we can build arrays of ints, arrays of varchars, arrays of arrays of ints, etc.
SDF denotes the type of

  • arrays of type T as array\<T\> where T is either a concrete type, like varchar, or a type variable, which we denote here by $1, $2, etc.
  • generic maps of key type T to value type S by map\<T,S\>, where T and S can either be concrete types or type variables.
  • named structs of n fields using the syntax struct\<x1 T1, …, xn Tn\>, where each Ti can either be a concrete type or a type variable. Some SQL variants, like Spark, allow dropping the names xi.

With this notation in hand, we can now define an overloaded version of concat that allows concatenating arrays of an arbitrary but fixed element type:

function:
  name: concat
  variadic: uniform
  overload: true
  parameters: 
    - datatype: array\<$1\>
  returns: 
    datatype: array\<$1\>
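
The type variable $1 is instantiated afresh at every call site. For example, in Trino:

SELECT concat(array[1, 2], array[3], array[4, 5]);
-- $1 is bound to int; returns array[1, 2, 3, 4, 5]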

Higher-order functions

Trino and Spark support higher-order function applications. As an example, Trino’s transform function applies a function pointwise to an array: the expression transform(array[1,2,3], x -> x*x) returns the value array[1,4,9].

In SDF we capture the signature of such a function via the lambda\<A,R\> type constructor, where A stands for the argument type and R for the return type. The function transform thus has the following type signature:

function:
  name: transform
  parameters: 
    - datatype: array\<$1\>
    - datatype: lambda\<$1, $2\>
  volatility: pure
  function-kind: scalar
  returns: 
    datatype: array\<$2\>
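
Note that the input and output element types may differ: $1 is the element type of the input array, $2 that of the result. For example, in Trino:

SELECT transform(array[1, 2, 3], x -> cast(x AS varchar));
-- $1 is bound to int, $2 to varchar; returns array['1', '2', '3']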

Lambda functions can also take tuples as arguments. The array_sort function, for instance, takes an array as its first parameter and a comparison function as its second:

function:
  name: array_sort
  parameters: 
    - datatype: array\<$1\>
    - datatype: lambda\<tuple\<$1,$1\>,int\>
  returns: 
    datatype: array\<$1\>
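
In Trino, for example, the comparator returns a negative, zero, or positive integer, just as this signature prescribes:

SELECT array_sort(array[3, 1, 2], (x, y) -> IF(x < y, -1, IF(x = y, 0, 1)));
-- returns array[1, 2, 3]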

Bounded types

Sometimes functions don’t work over all types, but can only be applied to a limited set of types. We call this set a type parameter bound. The abs function, for instance, can only be applied to number types and intervals. So we define abs as a type-parameter-bounded function.

function:
  name: abs
  parameters: 
    - datatype: $1
  returns: 
    datatype: $1
  with:
    - bound: $1
      datatypes:
        - number
        - interval
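
Under this bound, $1 only unifies with number types and intervals; any other argument type is rejected. For example:

SELECT abs(-3);       -- ok: int satisfies the number bound; returns 3
SELECT abs('hello');  -- type error: varchar is neither a number nor an interval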

Aggregate and window functions

Aggregate functions take a column of values and return a scalar value. For instance, select sum(x) from (values (1), (2), (3)) as t(x) computes the sum of the x column, here 6.

While the implementation might be non-trivial, the signature of an aggregate function is easy: we just specify the function kind as aggregate, and the type of the aggregate function’s parameter is the type of its column. Thus for sum we get the simple signature:

function:
  name: sum
  function-kind: aggregate
  parameters: 
    - datatype: $1
  returns: 
    datatype: $1
  with:
    - bound: $1
      datatypes:
        - number
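
Spelled out as a complete query (here in Trino syntax), the example from above reads:

SELECT sum(x) FROM (VALUES (1), (2), (3)) AS t(x);
-- $1 is bound to int; returns 6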

Further features

Dialect

All of SDF’s function definitions are dialect-specific. We specify the dialect as part of the function signature. So to define substr as available in Trino, we write:

function:
  name: substr
  dialect: trino
  parameters: 
    - datatype: varchar
    - datatype: int
  returns: 
    datatype: varchar

Volatility

With respect to volatility, SDF distinguishes three kinds of functions:

  • Pure functions, like substr(..), return the same output for the same input.
  • Stable functions, like now(), return the same output within one query execution.
  • Non-deterministic functions, like first(), might return different outputs for the same input.

We specify the volatility of each function using the volatility field, e.g.

function:
  name: now
  volatility: stable
  parameters: 
  returns: 
    datatype: datetime

The default volatility is pure. Knowing the volatility is key for applying optimizations like common subexpression elimination.
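
For example, because now() is stable, an optimizer may compute it once and reuse the value throughout the query. In Trino, for instance:

SELECT now() = now();
-- returns true: both calls yield the same value within one query execution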

Non-uniform variadic functions

Spark offers several functions that are variadic but not uniform. For instance, map(1, 'a', 2, 'b', ...) builds a map\<int,varchar\>. On closer inspection we observe that every even position has one type and every odd position another. In SDF we specify map as follows:

function:
  name: map
  variadic: even-odd
  parameters: 
    - datatype: $1
    - datatype: $2
  returns: 
    datatype: map\<$1,$2\>
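
For example, in Spark a call with four arguments instantiates $1 as int and $2 as varchar:

SELECT map(1, 'a', 2, 'b');
-- returns a map\<int,varchar\> with entries 1 -> 'a' and 2 -> 'b'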

Three-Part Function Names

SDF supports multi-part function names by specifying the catalog and schema these functions belong to in the name. The result is a fully qualified name for the function.

function:
  name: functions_db.functions_schema.my_custom_udf
  parameters: 
    - datatype: varchar
    - datatype: int
  returns: 
    datatype: varchar

This function can then be invoked in SQL like so:

SELECT functions_db.functions_schema.my_custom_udf('hello', 1);

This pattern is often seen in Snowflake, as functions are materialized to a database and schema.

Outlook

SDF function blocks allow you to specify almost all of SQL’s function signatures. Going forward, we also want to enable you to write implementations in SQL, Python, and Rust. Stay tuned for the continuation of this story!