There is one generic array container object, while the underlying data may be one of various basic types:
float
cuComplex
double
cuDoubleComplex
bool
int
unsigned
You can generate matrices directly on the device. The default underlying datatype is f32 (float) unless otherwise specified. Some examples:
zeros(3)          // 3-by-1 column of zeros of single precision (f32 default)
ones(3, 2, f64)   // 3-by-2 matrix of ones of double precision
randu(1, 8)       // row vector (1x8) of random values (uniformly distributed)
randn(2, 2)       // square matrix (2x2) of random values (normally distributed)
identity(3, 3)    // 3-by-3 identity (ones along diagonal, zeros elsewhere)
randu(5, 7, c32)  // complex values, real and imaginary components from a uniform distribution
You can also initialize values from a host array:
float hA[] = {0, 1, 2, 3, 4, 5};
array A(hA, 2, 3);  // 2x3 matrix of single precision
print(A);           // A = | 0 2 4 |
                    //     | 1 3 5 |
You can print the contents of an array or expression:
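For instance, a minimal sketch (the array and expression here are illustrative):

array a = randu(2, 2);
print(a);        // contents of the array a
print(a + 2*a);  // contents of an expression, evaluated on the fly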
There are hundreds of functions for element-wise arithmetic:
array R = randu(3, 3);
array C = ones(3, 3) + complex(sin(R));  // C is c32

// rescale complex values to unit circle
array a = randn(5, c32);
print(a / abs(a));

// calculate L2 norm of every column
array X = randn(20, 30);
print(sqrt(sum(pow(X, 2))));     // norm of every column vector
print(sqrt(sum(pow(X, 2), 0)));  // same as above
print(sqrt(sum(pow(X, 2), 1)));  // norm of every row vector
By default, A*B implements matrix multiplication to favor linear algebra in v1.0; however, you can toggle this to be element-wise multiplication.
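For example, under the default behavior the inner dimensions must agree, as in standard linear algebra (a small sketch):

array A = randu(3, 4);
array B = randu(4, 5);
array C = A * B;  // matrix product, size 3x5 under the default behavior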
You can initialize a matrix from either a host or device pointer:
float host_ptr[] = {0, 1, 2, 3, 4, 5};
array a(host_ptr, 2, 3);  // f32 matrix of size 2-by-3 from host data

float *device_ptr;
cudaMalloc((void**)&device_ptr, 6*sizeof(float));
cudaMemcpy(device_ptr, host_ptr, 6*sizeof(float), cudaMemcpyHostToDevice);
array b(device_ptr, 2, 3, afDevicePointer);
// do not call cudaFree(device_ptr) -- it is freed when b is destructed
You can get both device- and host-side pointers to the underlying data with device() and host().
array a = randu(3, f32);

float *host_a = a.host<float>();        // must call hostFree() later
printf("host_a[2] = %g\n", host_a[2]);  // last element
array::hostFree(host_a);

float *device_a = a.device<float>();    // no need to free this
float value;
cudaMemcpy(&value, device_a + 2, sizeof(float), cudaMemcpyDeviceToHost);
printf("device_a[2] = %g\n", value);
You can pull the scalar value from the first element of an array back to the CPU with scalar().
array a = randu(3);
float val = a.scalar<float>();
printf("scalar value: %g\n", val);
You can access the dimensions of a matrix using a dim4 object or directly via dims() and ndims().
array a = randu(4, 5, 2);
printf("ndims(a)  %d\n", a.ndims());  // 3

dim4 dims = a.dims();
printf("dims = [%d %d]\n", dims[0], dims[1]);      // 4,5
printf("dims = [%d %d]\n", a.dims(0), a.dims(1));  // 4,5
Integer support includes bitwise operations as well as the standard functions: sort(), min/max, indexing, and more.
int h_A[] = {1, 2, 4, -1, 2, 0, 4, 2, 3};
int h_B[] = {2, 3, -5, 6, 0, 10, -12, 0, 1};
array A = array(3, 3, h_A), B = array(3, 3, h_B);

print(A & B);
print(A | B);
print(A ^ B);
Several platform-independent constants are available for reference: pi, nan, inf, i. When these variable names conflict with macros in the standard header files or variables in scope, reference them with their full namespace, e.g. af::nan.
array A = randu(5, 5);
A(A > .5) = af::nan;

array x = randu(20e6), y = randu(20e6);
double pi_est = 4 * sum<float>(hypot(x, y) < 1) / 20e6;
printf("estimation error: %g\n", fabs(pi - pi_est));
Many different kinds of matrix manipulation routines are available:
tile() allows you to repeat a matrix along specified dimensions, effectively 'tiling' the matrix. Please note that the dimensions passed in indicate the number of times to replicate the matrix in each dimension, not the final dimensions of the matrix.
float h[] = {1, 2, 3, 4};
array small = array(2, 2, h, afHostPointer);  // 2x2 matrix
array large = tile(small, 4, 6);              // produces 8x12 matrix: (2*4)x(2*6)
join() allows you to join two matrices together. Matrix dimensions must match along every dimension except the dimension of joining (dimensions are 0-indexed). For example, a 2x3 matrix can be joined with a 2x4 matrix along dimension 1, but not along dimension 0, since {3,4} do not match.
float hA[] = {1, 2, 3, 4, 5, 6};
float hB[] = {10, 20, 30, 40, 50, 60, 70, 80, 90};
array A = array(3, 2, hA, afHostPointer);
array B = array(3, 3, hB, afHostPointer);

print(join(A, B, 1));  // 3x5 matrix
// array result = join(A, B, 0);  // fail: dimension mismatch
grid() can be used to construct a regular mesh grid from vectors x and y. For example, a mesh grid of the vectors {1,2,3,4} and {5,6} would result in two matrices:
float hx[] = {1, 2, 3, 4};
float hy[] = {5, 6};
array x = array(4, hx, afHostPointer);
array y = array(2, hy, afHostPointer);

array u, v;
grid(u, v, x, y);
// produces:
// u = |1 2 3 4|   v = |5 5 5 5|
//     |1 2 3 4|       |6 6 6 6|
newdims() can be used to create a (shallow) copy of a matrix with different dimensions. The number of elements must remain the same as the original array.
int hA[] = {1, 2, 3, 4, 5, 6};
array A = array(3, 2, hA);

print(newdims(A, 2, 3));  // 2x3 matrix
print(newdims(A, 6, 1));  // 6x1 column vector
// print(newdims(A, 2, 2));  // fail: wrong number of elements
// print(newdims(A, 8, 8));  // fail: wrong number of elements
The T() and H() methods can be used to form the matrix or vector transpose.
array x = randu(4, 4, f64);
array y = x.T();  // transpose

array c = randu(4, 4, c64);
array c_trans = c.T();  // transpose
array c_conj = c.H();   // Hermitian (conjugate) transpose
There are several ways of referencing values. ArrayFire uses parentheses for subscripted referencing instead of the traditional square bracket notation. Indexing is zero-based, i.e. the first element is at index zero (A(0)). Indexing can be done with mixtures of: integer scalars, seq() sequences, end (the last element along a dimension), and span (the entire dimension).
See Subscripted array indexing for the full listing.
array A = randu(3, 3);
array a1 = A(0);     // first element
array a2 = A(0, 1);  // first row, second column

A(end);    // last element
A(-1);     // also last element
A(end-1);  // second-to-last element

A(1, span);      // second row
A.row(end);      // last row
A.cols(1, end);  // all but first column

float b_host[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
array b(b_host, 10, dim4(1, 10));
b(seq(3));         // {0,1,2}
b(seq(1, 7));      // {1,2,3,4,5,6,7}
b(seq(1, 2, 7));   // {1,3,5,7}
b(seq(0, 2, end)); // {0,2,4,6,8}
You can set values in an array:
// setting entries to a constant
A(span) = 4;         // fill entire array
A.row(0) = -1;       // first row
A(seq(3)) = 3.1415;  // first three elements

// copy in another matrix
array B = ones(4, 4, f64);
B.row(0) = randu(1, 4, f32);  // set a row to random values (also upcast)
Use one array to reference into another.
float h_inds[] = {0, 4, 2, 1};  // zero-based indexing
array inds(h_inds, 1, 4);
array B = randu(1, 5);

array c = B(inds);      // get
B(inds) = -1;           // set to scalar
B(inds) = randu(4, 1);  // set to random values
Matrix decompositions are available: lu, qr, svd, eigen, cholesky, and more.
Matrix operations: inv, mpow, det, solve, hessenberg, and more.
The decompositions have the general forms shown below. Here is an example to get packed output, or just the first output:
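A minimal sketch, assuming the single-output form of lu() returns the packed factors:

array in = randu(5, 5);
array packed = lu(in);  // assumed overload: L and U packed into one matrix
print(packed);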
To get separated lower and upper outputs:
array in = randu(5, 5);
array l, u, p;
lu(l, u, p, in);

// verify outputs
print(l);
print(u);
print(p);
Other examples:
array in = randu(5, 5);
array out_inv = inv(in);         // inverse of input
array out_pow = mpow(in, 3);     // out_pow = in * in * in; not element-wise
float out_det = det<float>(in);  // determinant of the input
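solve() can be used similarly; a brief sketch, assuming the solve(A, b) form for the linear system A*x = b:

array A = randu(5, 5);
array b = randu(5, 1);
array x = solve(A, b);  // assumed form: solves A*x = b
print(A * x - b);       // residual should be near zero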
convolve() is the single entry point for all image and signal convolution:
convolve() with two inputs performs N-dimensional convolution, where N is the highest input dimension:
array image = randu(10, 10);
array kernel = ones(3, 3) / 9;   // average within 3x3 window
print(convolve(image, kernel));  // 10x10 blurred image
However, if the kernel is small and resides on the host, it is faster to use it directly from the host pointer instead of pushing it to the device first:
array signal = randu(5000, 1);
float host_filter[] = {1, 0, -1};
unsigned filter_dims[] = {3};

convolve(signal,
         1,             // number of filter dimensions
         filter_dims,   // filter dimensions
         host_filter);  // filter inside host memory
In some cases, a 2D filter kernel is considered "separable", meaning it can be decomposed into the outer product of two vectors. Convolving with those individual vectors is almost always faster.
// 5x5 derivative with separable kernels
float h_dx[] = {1.f/12, -8.f/12, 0, 8.f/12, -1.f/12};  // five-point stencil
float h_spread[] = {1.f/5, 1.f/5, 1.f/5, 1.f/5, 1.f/5};
array dx = array(5, 1, h_dx);
array spread = array(1, 5, h_spread);
array kernel = dx * spread;  // 5x5 derivative kernel (outer product)

array image = randu(640, 480);
convolve(image, kernel, afConvSame);  // derivative of image going down columns

// equivalent and faster version:
convolve(dx, spread, image, afConvSame);

// also supports passing host pointers:
convolve(5, h_dx, 5, h_spread, image, afConvSame);
Running the convolve.cpp example shows nearly a 3x difference between the separable and non-separable cases:
arrayfire/examples/misc $ ./convolve
full 2D convolution:        0.00156023
separable, device pointers: 0.000595222
separable, host pointers:   0.000590385
You can also produce different parts of the convolution with the shape parameter (afConvSame, afConvFull, or afConvValid):
convolve(randu(3,1), randu(5,1), afConvSame);   // 3x1 output
convolve(randu(5,1), randu(3,1), afConvSame);   // 5x1 output
convolve(randu(3,1), randu(5,1), afConvFull);   // 7x1 output
convolve(randu(6,1), randu(5,1), afConvValid);  // 2x1 output
convolve(randu(5,1), randu(6,1), afConvValid);  // empty output since kernel is bigger than image
ArrayFire can be used in projects that involve writing and compiling CUDA kernels. The ArrayFire examples directory contains examples/pi/pi_cuda.cu, which estimates pi by launching a CUDA kernel.
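The general structure of such a program, as a minimal sketch (the kernel, sample count, and launch configuration here are illustrative, not the actual contents of pi_cuda.cu):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <arrayfire.h>
using namespace af;

// mark each (x,y) sample that falls inside the unit circle
__global__ void in_circle(const float *x, const float *y, int *hit, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        hit[i] = (x[i]*x[i] + y[i]*y[i] <= 1.0f);
}

int main()
{
    const int n = 1 << 20;
    array x = randu(n), y = randu(n);  // samples generated on the device

    float *dx = x.device<float>();     // raw device pointers; no need to free
    float *dy = y.device<float>();

    int *d_hit;
    cudaMalloc((void**)&d_hit, n * sizeof(int));
    in_circle<<<(n + 255) / 256, 256>>>(dx, dy, d_hit, n);

    int *h_hit = (int*)malloc(n * sizeof(int));
    cudaMemcpy(h_hit, d_hit, n * sizeof(int), cudaMemcpyDeviceToHost);

    long hits = 0;
    for (int i = 0; i < n; ++i) hits += h_hit[i];
    printf("pi is approximately %g\n", 4.0 * hits / n);

    free(h_hit);
    cudaFree(d_hit);
    return 0;
}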
Make sure you have the CUDA toolkit installed; compiling CUDA kernels requires the toolkit's NVCC compiler.
The ArrayFire examples/pi directory contains solution files for both Visual Studio 2008 and Visual Studio 2010.
To compile pi_cuda in VS 2008 (or VS 2010), open the pi_cuda_vs2008 (or pi_cuda_vs2010) solution file, choose the configuration you want to build (Win32/x64, Debug/Release), and you should be able to build the example successfully.
When building with the provided Makefile, double-check that the CUDA path is set to the CUDA toolkit installation directory. You should then be able to build the example successfully.