High-Level Array API
Finch tensors also support many of the basic array operations one might expect, including indexing, slicing, elementwise maps, broadcasts, and reductions. For example:
julia> A = fsparse(([1, 1, 2], [2, 4, 5]), [1.0, 2.0, 3.0], (3, 6))
3×6 Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{Vector{Int64}, Vector{Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}:
0.0 1.0 0.0 2.0 0.0 0.0
0.0 0.0 0.0 0.0 3.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
julia> A + 0
3×6 Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}:
0.0 1.0 0.0 2.0 0.0 0.0
0.0 0.0 0.0 0.0 3.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
julia> A + 1
3×6 Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{1.0, Float64, Int64, Vector{Float64}}}}}:
1.0 2.0 1.0 3.0 1.0 1.0
1.0 1.0 1.0 1.0 4.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0
julia> B = A .* 2
3×6 Tensor{SparseDictLevel{Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}, Vector{Int64}, SparseDictLevel{Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}, Vector{Int64}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}:
0.0 2.0 0.0 4.0 0.0 0.0
0.0 0.0 0.0 0.0 6.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
julia> B[1:2, 1:2]
2×2 Tensor{SparseDictLevel{Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}, Vector{Int64}, SparseDictLevel{Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}, Vector{Int64}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}:
0.0 2.0
0.0 0.0
julia> map(x -> x^2, B)
3×6 Tensor{SparseDictLevel{Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}, Vector{Int64}, SparseDictLevel{Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}, Vector{Int64}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}:
0.0 4.0 0.0 16.0 0.0 0.0
0.0 0.0 0.0 0.0 36.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
Einsum
Finch also supports a highly general @einsum macro, which can compute any reduction over any simple pointwise array expression.

Finch.@einsum — Macro

@einsum tns[idxs...] <<op>>= ex...
Construct an einsum expression that computes the result of applying op to the tensor tns with the indices idxs and the tensors in the expression ex. The result is stored in the variable tns.

ex may be any pointwise expression consisting of function calls and tensor references of the form tns[idxs...], where tns and idxs are symbols.

The <<op>> operator can be any binary operator that is defined on the element type of the expression ex.

The einsum will evaluate the pointwise expression tns[idxs...] <<op>>= ex... over all combinations of index values in tns and the tensors in ex.
Here are a few examples:
@einsum C[i, j] += A[i, k] * B[k, j]
@einsum C[i, j, k] += A[i, j] * B[j, k]
@einsum D[i, k] += X[i, j] * Y[j, k]
@einsum J[i, j] = H[i, j] * I[i, j]
@einsum N[i, j] = K[i, k] * L[k, j] - M[i, j]
@einsum R[i, j] <<max>>= P[i, k] + Q[k, j]
@einsum x[i] = A[i, j] * x[j]
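For instance, a sparse-sparse matrix multiply follows the first pattern above. This is a minimal sketch (the names P, Q, R and the dimensions and densities are illustrative); as in the examples above, @einsum allocates and returns the output tensor itself:

using Finch
P = fsprand(100, 200, 0.1)
Q = fsprand(200, 50, 0.1)
@einsum R[i, j] += P[i, k] * Q[k, j]  # R becomes the 100×50 matrix product, reducing over k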
Array Fusion
Finch supports array fusion, which allows you to compose multiple array operations into a single kernel. This can be a significant performance optimization, as it allows the compiler to optimize the entire operation at once. The two functions the user needs to know about are lazy and compute. You can use lazy to mark an array as an input to a fused operation, and call compute to execute the entire operation at once. For example:
julia> C = lazy(A);
julia> D = lazy(B);
julia> E = (C .+ D) / 2;
julia> compute(E)
3×6 Tensor{SparseDictLevel{Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}, Vector{Int64}, SparseDictLevel{Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}, Vector{Int64}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}:
0.0 1.5 0.0 3.0 0.0 0.0
0.0 0.0 0.0 0.0 4.5 0.0
0.0 0.0 0.0 0.0 0.0 0.0
In the above example, E is a fused operation that adds C and D together and then divides the result by 2. The compute function examines the entire operation and decides how to execute it in the most efficient way possible. In this case, it would likely generate a single kernel that adds the elements of A and B together and divides each result by 2, without materializing an intermediate.
Finch.lazy — Function

lazy(arg)

Create a lazy tensor from an argument. All operations on lazy tensors are lazy, and will not be executed until compute is called on their result.

For example,
x = lazy(rand(10))
y = lazy(rand(10))
z = x + y
z = z + 1
z = compute(z)
will not actually compute z until compute(z) is called, so the execution of x + y is fused with the execution of z + 1.
Finch.compute — Function

compute(args...; ctx=default_scheduler(), kwargs...) -> Any

Compute the value of a lazy tensor. The result is the argument itself, or a tuple of arguments if multiple arguments are passed. Some keyword arguments can be passed to control the execution of the program:

- verbose=false: Print the generated code before execution.
- tag=:global: A tag to distinguish between different classes of inputs for the same program.
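For example, one can print the kernel that compute generates for a small fused expression (a sketch; the printed code is omitted here, and the names are illustrative):

a = lazy(fsprand(10, 10, 0.5))
b = lazy(fsprand(10, 10, 0.5))
c = compute(a .+ b, verbose=true)  # prints the generated code before executing it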
Finch.fused — Function

fused(f, args...; kwargs...)

This function decorator modifies f to fuse the contained array operations and optimize the resulting program. The function must return a single array or tuple of arrays. Some keyword arguments can be passed to control the execution of the program:

- verbose=false: Print the generated code before execution.
- tag=:global: A tag to distinguish between different classes of inputs for the same program.
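For example, the pipeline from the beginning of this section could be written with fused instead of explicit lazy and compute calls (a sketch, assuming A and B are the tensors defined earlier):

E = fused(A, B) do A, B
    (A .+ B) ./ 2
end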
The lazy and compute functions allow the compiler to fuse operations together, resulting in asymptotically more efficient code.
julia> using BenchmarkTools
julia> A = fsprand(1000, 1000, 100);
B = Tensor(rand(1000, 1000));
C = Tensor(rand(1000, 1000));
julia> @btime A .* (B * C);
145.940 ms (859 allocations: 7.69 MiB)
julia> @btime compute(lazy(A) .* (lazy(B) * lazy(C)));
694.666 μs (712 allocations: 60.86 KiB)
Optimizers
Different optimizers can be used with compute, such as the state-of-the-art Galley optimizer, which can adapt to the sparsity patterns of the inputs. The optimizer can be passed as the ctx argument to the compute function, or selected with set_scheduler! or with_scheduler.
Finch.set_scheduler! — Function

set_scheduler!(scheduler)

Set the current scheduler to scheduler. The scheduler is used by compute to execute lazy tensor programs.
Finch.with_scheduler — Function

with_scheduler(f, scheduler)

Execute f with the current scheduler set to scheduler.
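For example, to run one computation under the Galley optimizer (documented below) without changing the global default, here is a sketch reusing A, B, and C from the benchmark above:

with_scheduler(galley_scheduler()) do
    compute(lazy(A) .* (lazy(B) * lazy(C)))  # runs under Galley only inside this block
end

Calling set_scheduler!(galley_scheduler()) instead would switch every subsequent compute call over to Galley.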
Finch.default_scheduler — Function

default_scheduler(;verbose=false)

The default scheduler used by compute to execute lazy tensor programs. Fuses all pointwise expressions into reductions. Only fuses reductions into pointwise expressions when they are the only usage of the reduction.
The Galley Optimizer
Galley is a cost-based optimizer for Finch's lazy evaluation interface, based on techniques from database query optimization. To use Galley, pass ctx=galley_scheduler() to the compute function. While the default optimizer (ctx=default_scheduler()) makes decisions entirely based on the types of the inputs, Galley gathers statistics on their sparsity to make cost-based optimization decisions.
Finch.Galley.galley_scheduler — Function

galley_scheduler(verbose = false, estimator=DCStats)

The Galley scheduler uses the sparsity patterns of the inputs to optimize the computation. The first set of inputs given to Galley is used to optimize, and the estimator is used to estimate the sparsity of intermediate computations during optimization.
julia> A = fsprand(1000, 1000, 0.1);
B = fsprand(1000, 1000, 0.1);
C = fsprand(1000, 1000, 0.0001);
julia> A = lazy(A);
B = lazy(B);
C = lazy(C);
julia> @btime compute(sum(A * B * C));
282.503 ms (1018 allocations: 184.43 MiB)
julia> @btime compute(sum(A * B * C), ctx=galley_scheduler());
152.792 μs (672 allocations: 28.81 KiB)
By taking advantage of the fact that C is extremely sparse, Galley can better structure the computation. In the matrix chain multiplication, it starts with the B*C product before multiplying by A. In the summation, it uses distributivity to push the reduction down to the inputs: it first sums over A and C, then multiplies the resulting vectors with B.
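The algebraic identity behind the pushed-down summation can be checked with small dense arrays in plain Julia (a sketch, independent of Finch's internals):

A = rand(4, 4); B = rand(4, 4); C = rand(4, 4)
# sum(A * B * C) equals (1'A) * B * (C * 1): reduce A over rows and C over columns first
sum(A * B * C) ≈ vec(sum(A, dims=1))' * B * vec(sum(C, dims=2))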
Because Galley adapts to the sparsity patterns of the first set of inputs it is given, it can be useful to distinguish between different uses of the same function with the tag keyword argument to compute or fused. For example, we may wish to distinguish one SDDMM from another, as follows:
julia> A = rand(1000, 1000);
B = rand(1000, 1000);
C = fsprand(1000, 1000, 0.0001);
julia> fused((A, B, C) -> C .* (A * B), A, B, C; tag=:very_sparse_sddmm);
julia> C = fsprand(1000, 1000, 0.9);
julia> fused((A, B, C) -> C .* (A * B), A, B, C; tag=:very_dense_sddmm);
By distinguishing between the two uses of the same function, Galley can make better decisions about how to optimize each computation separately.