Global Matrix: assembly

In the Finite Element method a dense element matrix A_el of dimensions [ndofel X ndofel] is computed for every element (see Element Matrices: computation). Since neighboring elements share nodes, the individual element contributions need to be assembled (summed up) in the global matrix A. During the assembly every element matrix A_el is added to the global matrix A in the row/column positions determined by the node (dofs) numbers stored in the ELEMS (DOFS) arrays.

Unlike the dense element matrices the global matrix is sparse, i.e., most of its entries are zero. There is a characteristic average number of non-zero entries per row depending on the connectivity structure of the mesh. The exact number of entries per row depends on the element type and the number of degrees of freedom per node. Table 1 below gives an average number of row entries for various types of elements.

tri3tri6tri7quad4quad9tet4tet10tet15brick8brick27
nnz(row/col)711.51291614.628.431.426.763
Table 1. Average number of non-zero entries in every row/column of assembled system matrix for various element types.

Since the number of entries per row of A is bounded from above by a constant, the total number of non-zero entries constitute a small (and decreasing with increasing the system size) fraction of all the matrix entries (nnz(A)~C*n vs. numel(A)=n2). Storing the zero entries and explicitly using them in computations is inefficient from the point of view of the memory and computational resources. For sparse matrices, dedicated storage formats have been developed that allow only the non-zero entries to be kept in the memory and used in computations. This is commonly done by keeping additional row and column index information along with every non-zero matrix entry.

Below we show how to use the sparse storage and how to assemble the global sparse matrix A in MATLAB.

How not to do it in MATLAB

Consider the most straightforward implementation of assembling local matrices A_el into the global matrix A for a FEM problem with one dof per node.

A = sparse(nnod,nnod);
for iel=1:nel
    indx = ELEMS(:,iel);
    ...
    %compute element stiffness matrix A_el
    ...
    A(indx,indx) = A(indx,indx) + A_el;
end

In the code fragment above the sparse matrix A initially contains only zero entries, i.e., very little memory is used by the data structure. During the execution of the element loop the sparsity pattern (the positions of the non-zero entries) of the global matrix A changes as the contributions from the elements are added. The internal data structure that holds the matrix entries and the auxiliary row/column index arrays need to be constantly reallocated and updated. In practice, in the above approach the time spent on the assembly largely exceeds the time needed to compute the element matrices.

Since the sparsity pattern is determined by the mesh, the final data structure needed to hold the sparse matrix A can be
allocated before the element loop. In this approach A explicitly stores zero entries in places (row/column pairs), which can contain non-zeros after the assembly. The matrix entries can be updated in the element loop. This requires an additional work of finding the correct memory location in the sparse data structure, but the large overhead connected with constant rebuilding of the sparse storage structure is avoided. Unfortunately, this approach is not a viable option in MATLAB for two reasons:

1. It is not possible to explicitly store zero entries in sparse matrices in MATLAB, since those are automatically removed. Thus, it is not possible to initialize an empty sparse matrix with a predetermined sparsity pattern.

2. In our experience updating a small set of entries of a sparse matrix in MATLAB also involves a substantial overhead. This could be caused by the additional time MATLAB spends on verifying that a sparse matrix only contains non-zero entries, which requires reading all the matrix entries from the memory – an overhead that grows with the system size.

Overcoming the two deficiencies would require a major modification of the MATLAB sparse class.

Triplet sparse storage

A workaround solution is to compute and store all the element matrices A_el first and assemble them in one call to MATLAB sparse function, as shown in the code fragment below.

A_all = zeros(nnodel*nnodel,nel);
for iel=1:nel    
    ...
    %compute element stiffness matrix A_el
    ...
    A_all(:,iel) = A_all(:,iel) + A_el(:);
end
 
indx_j = repmat(1:nnodel,nnodel,1); 
indx_i = indx_j';
Ai = ELEMS(indx_i(:),:);
Aj = ELEMS(indx_j(:),:);
 
A = sparse(Ai,Aj,A_all);

In this approach all the element matrices are first stored in A_all array. Next, the row and column indices for every entry of all the element matrices are computed (Ai, Aj). This method of representing a sparse matrix is called the triplet format: for every matrix entry the corresponding row/column information is stored explicitly. Duplicate Ai, Aj pairs exist in the triplet list. The global matrix assembly is performed in one call to a dedicated MATLAB function sparse , which adds the duplicate entries.

This implementation is much faster than the first described approach. The disadvantages are that all the element matrices need to be stored in the A_all array prior to the assembly. The memory requirements are additionally increased due to the need to store the arrays containing row/column indices. Due to the duplicates the number of entries in the triplet storage can vary between 6.3 to 1.4 times the number of non-zero entries in the assembled global matrix A. The exact amount of memory overhead depends on the element type (see the Table 2). If the matrix assembly is not followed by the much more memory intensive Cholesky factorization, the elevated memory requirements related to the (Ai, Aj, A_all) arrays might restrict the problem size that can be solved.

element typetri3tri6tri7quad4quad9tet4tet10tet15brick8brick27
ratio (triplet/crs)2.61.61.41.81.36.32.61.62.41.4
Table 2. Ratio of the number of entries in the triplet to assembled sparse matrix format.

In the triplet sparse format every non-zero matrix entry requires three data values: A_all – the value of the entry, and Ai/Aj – row and column index of the entry. The MATLAB function sparse only takes as arguments A_all/Ai/Aj of type double. This means that in this storage format every triplet requires 3*8=24 bytes of memory. sparse removes the duplicate triplets and converts the matrix into MATLAB native sparse format CCS (Compressed Column Storage). In CCS sparse matrix A is stored column by column and every non-zero entry only requires its row index in addition to the value. In MATLAB Both row indices and data values use 8 bytes of memory per non-zero entry. Hence, the memory requirements for the assembled matrix A are 1.5 times the duplicates ratio lower than for the triplet storage.

The algorithmic complexity of converting the triplet sparse format to the native MATLAB sparse format (CCS – Compressed Column Storage) is O(n), i.e., it is linear with the system size n. It means that this approach is scalable with the number of elements and nodes in the mesh – doubling the mesh size requires twice the number of operations and double the amount of memory to assemble the global matrix A.

sparse vs. sparse2

The efficiency of the triplet approach can be further improved by using the sparse2 function provided by the SuiteSparse package from Tim Davis [1]. The entire procedure of global matrix assembly is essentially the same as described above, with minor differences that result in significant execution time improvement

1. Ai and Aj index arrays must be of type double in the case of MATLAB sparse routine. For sparse2 they can be of type int32 . Since doubles use 8 bytes and integers use 4 bytes, less memory is required.

2. Since accessing memory takes time, using integers for the Ai/Aj index arrays speeds up the code.

3. sparse2 supports creation of logical sparse matrices, i.e., matrices that contain 1 where there is a non-zero entry. In this case the matrix entries are also stored as integers instead of doubles, which further speeds up the code. Logical matrices can be used e.g., to compute the reordering.

sparse_create

In the triplet approach shown in the code fragment above the index arrays Ai/Aj are explicitly constructed based on the element dof information by generating all-to-all connections for all the dofs. However, explicit construction of Ai/Aj index arrays is not necessary and the index pairs for every element matrix stored in A_all can be generated on the fly from ELEMS by a sparse-like routine.

sparse_create is a MEX function distributed with MILAMIN that takes as the parameters the ELEMS and A_all arrays and assembles the global system matrix A. It can create symmetric (lower triangular part) and general matrices

A = sparse_create(ELEMS, A_all, 0);   % general sparse matrix
A = sparse_create(ELEMS, A_all, 1);   % symmetric sparse matrix, lower triangular part

The memory requirements of sparse_create are significantly smaller than of the approach using triplet storage. Except for the A_all, it does not require any space to be allocated for the duplicate entries. This results in significant memory savings for the ‘lighter’ elements like tri3 and tet4 (see Table 2). Thanks to much smaller memory requirements it is usually as fast as or faster than sparse2. In addition, it does not require the Ai/Aj index arrays to be created, which saves even more time.

Instead of A_all one can supply no arguments or ‘true’, in which case a symbolic sparse matrix is created containing 1 in all the non-zero entries. This syntax is useful when the matrix is only needed to compute reordering of the nodes.

A = sparse_create(ELEMS);

Algorithmic complexity of sparse_create is also O(n).

Efficiency tests

Figure 1. Time required to perform global sparse matrix assembly for 2D meshes of 3-node triangular elements using different methods. Symmetric sparse matrix is created. Computer: valinor.

Figure 1 shows the time needed to assemble the lower triangular part of the global sparse matrix A given the element matrices A_all using the described methods. The 3-node triangular element meshes are considered. For this element type sparse_create is clearly better than the approach based on the triplet storage, especially if the time required to compute the Ai/Aj index arrays is included.

tri3tri6tri7quad4quad9tet4tet10tet15brick8brick27
sparse_create2.980.630.463.300.532.570.240.080.860.04
sparse22.00.390.281.350.280.960.170.070.370.03
sparse0.520.160.120.410.070.340.070.030.120.01
Table 3. Efficiency in millions of elements per second of the different methods to assemble the global matrix A Time needed to compute Ai/Aj index arrays is included in results of sparse2 and sparse. Computer: valinor.

The performance gain of sparse_create in Figure 1 is due to a relatively high number of duplicate entries for this type of mesh, as shown in Table 2. For other element types the advantage of sparse_create in terms of execution time might be lower, as shown in the Table 3.

The memory requirements of the different implementations are shown in Table 4.

sparse_createsparse2sparse
A_elem1.61.61.6
ELEMS0.40.40.4
NODES0.30.30.3
A1.21.21.2
Ai + Aj01.63.2
assembly workspace including Ai/Aj3.19.510.9
maximum amount of memory allocated at any time5.411.813.2
assembly time (s)102556
Table 4. Memory usage (in GB) of different global matrix assembly implementations. 3-node triangular element, 33 million elements, 16.5 million nodes.