User Defined Functions
SDF allows you to author user-defined functions by writing function signatures in `sdf.yml` blocks. These signatures are used to statically analyze SQL and to provide label propagation and lineage computation.
User-defined functions allow you to extend a database by adding a new function that can be evaluated in SQL. The SQL standard distinguishes between scalar, aggregate, and table functions. A scalar function takes a set of values and returns a single value (or NULL), an aggregate function takes a column of values and returns a single value, whereas a table function returns a (relational) table comprising zero or more rows, each row with one or more columns.

SDF allows you to define almost any imaginable user-defined function. The following sections describe how to author user-defined function signatures.
Scalar functions
SDF distinguishes scalar functions along several dimensions:
- They can have fixed or varying arity, i.e., number of parameters.
- They can have one definition or be overloaded, i.e., have many definitions.
- They can be monomorphic, i.e., defined for one type only, or polymorphic, i.e., defined to operate on all types.
- They can be first-order, i.e., take only data as parameters, or higher-order, i.e., also take functions as parameters.

We explain each of these concepts and the authoring of the corresponding function signatures in turn.
Fixed Arity
The function `substr` is a typical scalar function. It takes two arguments of the types `varchar` and `int`, respectively. We say `substr` has fixed arity two. The expression `substr(text, start)` returns a `varchar`. In SDF we write `substr`'s function signature as follows:
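A sketch of such a signature block; the `parameters`, `name`, `datatype`, and `returns` field names are illustrative assumptions about the schema:

```yaml
function:
  name: substr
  parameters:
    - name: text
      datatype: varchar
    - name: start
      datatype: int
  returns:
    datatype: varchar
```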
By default, function definitions are for scalar functions. If we want to make a function's kind explicit, we add the `function-kind` field. Alternative function kinds are `aggregate` and `table` functions, which we will introduce later.
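Making the default kind explicit might look like this (field names other than `function-kind` are illustrative):

```yaml
function:
  name: substr
  function-kind: scalar   # the default; may also be aggregate or table
  parameters:
    - datatype: varchar
    - datatype: int
  returns:
    datatype: varchar
```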
Overloading
The function `substr` is overloaded. You can call it with two parameters, as we saw above, or with three parameters, e.g. `substr(string, start, length)`. To allow for a second definition of `substr`, you have to declare one of them to be an `overload`. The default for `overload` is `false`.
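The three-parameter definition might then be sketched as follows (field names other than `overload` are illustrative):

```yaml
function:
  name: substr
  overload: true
  parameters:
    - datatype: varchar   # string
    - datatype: int       # start
    - datatype: int       # length
  returns:
    datatype: varchar
```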
Uniform Variadic Arity
Many functions in SQL can take an arbitrary number of arguments. For instance, `concat` allows concatenating an arbitrary number of strings. We say `concat` is uniform variadic, that is, the function can have one or more parameters, but all must have the same type. In SDF we write:
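A sketch of such a signature; the `variadic: uniform` marker is an assumption about how SDF expresses one-or-more parameters of the same type:

```yaml
function:
  name: concat
  # assumption: marker for one-or-more parameters, all of the same type
  variadic: uniform
  parameters:
    - datatype: varchar
  returns:
    datatype: varchar
```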
Generic types and functions
Modern SQL has not only primitive types like `int` and `varchar`, but also composite types like `array`, `map`, and `struct`. These composite types are generic, so that we can build arrays of ints, arrays of varchars, arrays of arrays of ints, etc.
SDF denotes the type of:
- arrays of type T as `array<T>`, where T is either a concrete type, like `varchar`, or a type variable, which we denote here by `$1`, `$2`, etc.
- generic maps of key type T to value type S as `map<T, S>`, where T and S can either be concrete types or type variables.
- named structs of n fields using the syntax `struct<x1 T1, ..., xn Tn>`, where each Ti can either be a concrete type or a type variable. Some SQL variants, like Spark, allow dropping the names xi.
With this notation in hand, we can now define an overloaded version of `concat` that allows concatenating arrays of arbitrary but fixed type:
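A sketch of this overload, using the type variable `$1` for the arbitrary but fixed element type (the `overload` and `variadic` markers and other field names are illustrative):

```yaml
function:
  name: concat
  overload: true
  variadic: uniform   # all arguments must be arrays of the same element type
  parameters:
    - datatype: array<$1>
  returns:
    datatype: array<$1>
```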
Higher-order functions
Trino and Spark support higher-order function applications. As an example, Trino's `transform` function applies a function pointwise to an array; e.g., the expression `transform(array[1,2,3], x -> x*x)` returns in Trino the value `array[1,4,9]`. In SDF we capture the signature of such a function via the `lambda<A, R>` type constructor, where A stands for the argument type and R for the return type. The function `transform` thus has the following type signature:
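A sketch of the signature, mapping an `array<$1>` through a function from `$1` to `$2` (field names are illustrative):

```yaml
function:
  name: transform
  dialect: trino
  parameters:
    - datatype: array<$1>        # the input array
    - datatype: lambda<$1, $2>   # a function from elements to results
  returns:
    datatype: array<$2>
```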
Lambda functions can also take tuples as arguments. The `array_sort` function, for instance, takes an array as its first parameter and a comparison function as its second.
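This might be sketched as follows; the parenthesized tuple notation for a two-argument lambda is an assumption, as are the field names:

```yaml
function:
  name: array_sort
  dialect: trino
  parameters:
    - datatype: array<$1>
    # assumption: a tuple argument is written as a parenthesized list
    - datatype: lambda<($1, $1), int>
  returns:
    datatype: array<$1>
```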
Bounded types
Sometimes functions don't work over all types, but can only be applied to a limited set of types. We call this set the type parameter's bounds. The `abs` function, for instance, can only be applied to number types and to intervals. So we define `abs` as a type-parameter-bounded function.
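A sketch of such a definition; the `bounds` field and its syntax are assumptions about how SDF lists the admissible types for a type variable:

```yaml
function:
  name: abs
  # assumption: $1 is bounded to numeric and interval types
  bounds:
    $1: [tinyint, smallint, int, bigint, real, double, decimal, interval]
  parameters:
    - datatype: $1
  returns:
    datatype: $1
```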
Aggregate and window functions
Aggregate functions take a column of values and return a scalar value. For instance, `select sum(x) from (values (1),(2),(3)) as t(x)` computes the sum of the `x` column, here `6`.

While the implementation might be non-trivial, the signature of an aggregate function is easy: we just specify the function's kind as `aggregate`; the type of the aggregate function is its column type. Thus for `sum` we get the trivial signature:
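A sketch of the signature; only `function-kind` is named by the text, the other fields are illustrative:

```yaml
function:
  name: sum
  function-kind: aggregate
  parameters:
    - datatype: $1   # the column type
  returns:
    datatype: $1
```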
Further features
Dialect
All of SDF's function definitions are dialect-specific. We specify the dialect as part of the function signature. So to define `substr` as available for Trino we write:
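A sketch with the `dialect` field added (other field names are illustrative):

```yaml
function:
  name: substr
  dialect: trino
  parameters:
    - datatype: varchar
    - datatype: int
  returns:
    datatype: varchar
```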
Volatility
SDF distinguishes three kinds of functions:
- Pure functions, like `substr(..)`, return the same output for the same input.
- Stable functions, like `now()`, return the same output within one query execution.
- Non-deterministic functions, like `first()`, might return a different output for the same input.

We specify the volatility of each function using the `volatility` field, e.g.:
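For instance, `now`'s signature might carry the field like this (field names other than `volatility` are illustrative):

```yaml
function:
  name: now
  volatility: stable   # same output within one query execution
  returns:
    datatype: timestamp
```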
The default volatility is `pure`. Knowing the volatility is key for applying optimizations, like common subexpression elimination.
Non-uniform variadic functions
Spark offers several functions that are variadic but not uniform. For instance, `map(1,'a',2,'b', ...)` builds a `map<int,varchar>`. On closer inspection we observe that every even position has one type, every odd position another. In SDF we specify `map` to be of the type:
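One way such a signature might be sketched; the `variadic: non-uniform` marker for a repeating parameter group is an assumption, as are the other field names:

```yaml
function:
  name: map
  dialect: spark
  # assumption: marker for a repeating, non-uniform parameter group
  variadic: non-uniform
  parameters:
    - datatype: $1   # key positions
    - datatype: $2   # value positions
  returns:
    datatype: map<$1, $2>
```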
Three-Part Function Names
SDF supports multi-part function names by specifying the catalog and schema these functions belong to in the name. The result is a fully qualified name for the function.
This function can then be invoked in SQL using its fully qualified three-part name.
This pattern is often seen in Snowflake as functions are materialized to a database and schema.
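A sketch of such a definition; the name `my_db.my_schema.my_udf` is hypothetical, and the field names are illustrative:

```yaml
function:
  # hypothetical fully qualified name: catalog.schema.function
  name: my_db.my_schema.my_udf
  dialect: snowflake
  parameters:
    - datatype: int
  returns:
    datatype: int
```

It could then be invoked as, e.g., `select my_db.my_schema.my_udf(42)`.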
Outlook
SDF function blocks allow you to specify almost all of SQL's function signatures. Going forward, we also want to enable you to write implementations, in SQL, Python, and Rust. Stay tuned for the continuation of this story!