Abstract

The FEM solver in this work is designed for shared memory architectures using an object-oriented paradigm. The solver is divided into five major stages: reading the mesh data, creating an object for each finite element, assembling the linear system of equations, applying the boundary conditions, and finally solving the system of linear equations. This work has focused on the study of multiple parallel implementations of each stage of the FEM solver. All stages are implemented in C++, and the parallel regions of each stage are implemented using OpenMP. The data structures used in this work are C++ objects built from STL containers; for example, the sparse matrix is constructed as two-level nested binary trees, where each inner tree represents one sparse row of the global matrix. These structures are used in every stage and are never converted into more conventional matrix formats such as CSR. Finally, this work reports the contribution of each stage to the total time and the speedup achieved for each of them. Using 16 threads, the solver attained a speedup of 5.21 for Poisson’s equation on a mesh with 2099201 unknowns. Two linear solvers, Jacobi and Conjugate Gradient (CG), are used, which consume about 95% of the total solver time. The novelty of our work lies in implementing a Finite Element Application (FEA) using object-oriented programming on a shared memory architecture with OpenMP, without resorting to conventional speedup techniques or specialized data structures such as CSR.

1. Introduction

Software development has changed drastically since Intel discontinued its single-core Pentium IV processors. Until then, the performance of an application depended on upcoming faster memory, faster processors, and memory access patterns. Due to hardware limitations, primarily the heat dissipation issue, single-core Pentium IV processors were discontinued and multicore processors were introduced in 2006, initially as dual-core processors. Multicore processors opened a new paradigm for developers, as applications now have to be written to utilize more than one low-power core: software engineers have to distribute their computations among the available cores using either threads or processes.

The Finite Element Method (FEM) is a numerical method for calculating approximate solutions of ordinary and partial differential equations when provided with initial and boundary conditions. Over the last few decades, due to its solid theoretical foundation, it has become the most commonly used method for the approximation of differential equations. FEM applications are mostly divided into four steps: discretization of the domain, construction of the system of equations, application of constraints, and finally computation of an approximate solution of the system of equations. The approximation capability depends on how finely the domain is discretized. In this work, the main focus is to implement an FEM application for shared memory architectures using the OOP paradigm and the OpenMP library. OpenMP is considered a de facto standard for multithreaded applications. The application developed during this research is divided into the following stages.

1.1. Read Mesh File

A mesh is an approximation of the target domain for which calculations are performed. It is an approximate model of a physical object, which can be anything from a high-rise building to human cells. The meshes used in this work are of square shape, Ω2D = (0, 1)², and were generated using third-party software called GMSH [1]. A mesh is composed of nodes and elements; all nodes whose x or y coordinate is 0 lie on the Dirichlet boundary ΓD, where the prescribed value is 0. The provided mesh depends on the FE solver user and has to be read every time the FE solver starts. For higher accuracy, finer meshes are usually used, which means a greater number of nodes and elements. The mesh represents a digital approximation of some real-world structure, which varies from domain to domain, such as a building, an ocean, a human bone, or a cell structure. For this reason, mesh density varies with the problem as well as with the acceptable error, and the mesh has to change frequently.

1.2. Construction of Mesh Elements

In the second stage of the solver, C++ objects are created for every node and element in the mesh. Node objects store coordinates, while element objects store node ids, equation constants, the applied force, material properties, etc. There is an associative relation between nodes and elements, which enables sharing of node objects among element objects.

1.3. Construction of Linear System of Equations

Each element in Ω acts as a subdomain Ωk on which problem equations such as Poisson, heat, and convection-diffusion are approximated. The error in the approximate solution is directly proportional to the size of Ωk. Reducing the Ωk size results in a larger system of equations, as more nodes are needed to construct smaller Ωk. It is always desirable to reduce the element size so that the error falls below the desired level. The global stiffness matrix A is constructed from the element contributions Ak, and likewise the right-hand side vector b from the bk.

1.4. Application of Dirichlet Constraints

The physical problem definition also includes different boundary conditions, such as Dirichlet and Neumann conditions. In this work, Dirichlet constraints are applied by the FE solver application, while Neumann constraints are implemented at the Ωk level.

1.5. Solve the Linear System of Equations Using Iterative Solvers

In the last stage of this work, two iterative linear algebra solvers, Jacobi and Conjugate Gradient, are implemented to solve the linear system of equations.

2. Background

2.1. Object-Oriented Finite Element Method

Finite element methods were invented to solve complex structural and elasticity problems in engineering and have been used frequently since the 1960s [2]. Initially, FEM solvers were implemented for supercomputers using procedural programming, which was cumbersome to maintain and reuse. In the 1980s, the object-oriented paradigm was the latest trend in programming practice. One of the earliest attempts to design an object-oriented FEM was presented by Forde [3], in which classes were introduced for elements, materials, nodes, and boundary conditions. This work was improved by Lu et al. [4] by adding a solver class to isolate the solution algorithms from the finite element formulations. Mackie in 1992 [5] extended the objects for dynamic problems to use complex geometry elements. In 1999, Mackie [6] introduced the term data modeling, which enabled compatibility with other software types such as CAD packages. In this view, a class must represent the data structure of basic objects, for example, the degrees of freedom belonging to nodes, and classes must provide access to the necessary data and services.

2.2. Parallel Finite Element Methods

With the advancement of finite element methods and their acceptance in the commercial sector, larger problems were designed to achieve better approximations [7]. The feasible means to solve such computationally intensive problems was to use parallel computers effectively [8]. Many researchers presented distributed memory designs for finite element software. One of the earliest works was presented by K. H. Law in 1986 [9]. He proposed a solution for MIMD machines that works directly at the element level, without assembling the global system of equations, and argued that iterative solution methods are suitable for parallel computing. Farhat and Roux in 1991 [10] presented a domain decomposition approach for the parallel finite element solution of equilibrium equations. The domain decomposition method was already widely used on supercomputers: the finite element domain is divided into subdomains, and each subdomain is allocated to an individual processor. Their proposed approach used less interprocessor communication while still offering the same amount of parallelism. Each subdomain is assigned to an individual processor, and the assembly of the stiffness matrix and forcing vector, the factorization, and the computation of rigid-body modes are performed locally. A preconditioned Conjugate Gradient method is then applied to solve the resulting system of equations.

2.3. Shared Memory Finite Element Methods

In shared memory processors, the global memory (RAM) is accessed by multiple processing cores simultaneously. Since all processors have the same view of the data, they are easier to program than distributed memory processors, and communication between the processing cores is much faster because they access the same memory [11]. However, the same bus is used to access main memory, which limits the size of such systems to only tens of processors. Sharma and Wang in 1992 [12] presented finite element programs for shared memory multiprocessors. A structure named a flow node was formed, which enables the programmer to make a list of blocks of computational statements and attach properties to each relevant statement. The flow nodes are used for two purposes: an independent analysis of the code and the derivation of numeric code for the target architecture. They generated the actual Sequent Balance code with modifications [13]. Nistor et al. in 2008 [14] presented another technique of local parallelization, applied to the CPU-intensive subroutines inside the explicit integration main loop of the program. Only the internal force vector and stable time-step computations are parallelized using OpenMP, which generates more efficient code without using the domain decomposition method. They presented four different techniques for parallelization; the use of dynamic load balancing improved the efficiency of the code by reducing the wait time. Turcksin et al. in 2016 [15] presented a design pattern for the multicore-enabled finite element method. In their opinion, in modern FEM codes assembling the system of linear equations is more expensive than solving it. In this regard, the authors focused on the tasks specific to finite element toolkits. Their pattern, named WorkStream, consists of a group of elements on which local contributions are computed, followed by a reduction operation. They presented three implementation schemes, the first of which is a parallel pipeline. Their pipeline has four stages: the first creates an item, the second computes the local contribution (and can run in parallel), the third is the reduction, and the last simply marks the scratch and local contribution objects as usable again. They did not work on one element at a time but instead passed whole batches of elements through the pipeline. The second scheme uses thread-local scratch objects. In the previous algorithm, scratch objects were taken round-robin from a global pool; since every processor has its own cache, this almost certainly leads to a large number of cache misses, because a thread computing a local contribution rarely gets the same scratch objects it had while working on previous cells. So they made the scratch objects thread-local, while the local contribution objects are not thread-local. In the third implementation, they addressed another inefficiency: serial reduction operations cannot scale to large numbers of processors if the reductions take a non-negligible fraction of the runtime of the local contribution computation. Reduction operations are sequential because they write into a global shared object, and these memory writes can conflict if they happen at the same time. However, in matrix assembly not all writes conflict with others: for continuous elements, the reduction operations of two cells conflict only if those cells share degrees of freedom.
So they colorize the set of cells in such a way that cells whose reduction operations conflict are assigned different colors. With this implementation, there is no need for the create-item stage because, for a given color, the order in which the elements are processed does not matter. They also presented some strategies for distributed systems preceding the implementations for shared memory systems. The first strategy is duplicating the entire data structure; this works as long as the output data structure is relatively small and the final reduction is cheap. The second approach is splitting the data structures into chunks, which saves memory compared to the previous approach but has the downside of load imbalance. They used four 16-core AMD Opteron 6378 processors running at 2.4 GHz, for a total of 64 cores, to evaluate the performance and achieved sizeable speedups between 7 and 44.

3. FE Solver

Figure 1 shows the class diagram of the implemented Finite Element Poisson 2D solver. Poisson_2D_Solver is the main class and has four public methods, one for each stage of the finite element solution. The generateElements method reads the mesh file and creates objects for nodes and elements. The Node class stores each node's id and coordinates (x, y). An abstract Element class defines a common interface for all kinds of subdomains. The Poisson 2D element classes Edge and Triangle inherit from the Element class, and their objects are accessed through an Element pointer using polymorphism. The Gmsh_Reader class reads node and element data one by one using the getNode and getElement methods, respectively.
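As a rough illustration of this structure, the sketch below mirrors the class names from Figure 1; the member signatures are assumptions made for readability, not the authors' exact interfaces.

class Sparse_Matrix;   // forward declarations of the solver's containers
class Vector;

struct Node {                       // stores a node id and its (x, y) coordinates
    int id;
    double x, y;
};

class Element {                     // abstract interface shared by all subdomain types
public:
    virtual ~Element() = default;
    // each element writes its local contribution directly into the global system
    virtual void addContribution(Sparse_Matrix& A, Vector& b) const = 0;
};

class Triangle : public Element { /* 2D Poisson element over three nodes */ };
class Edge     : public Element { /* boundary element over two nodes */ };

class Gmsh_Reader {                 // reads the GMSH mesh file item by item
public:
    Node getNode();
    Element* getElement();
};

class Poisson_2D_Solver {           // one public method per solver stage
public:
    void generateElements(const char* meshFile);
    void generateEquation();
    void applyDirichlet();
    void getSolution();             // Jacobi or Conjugate Gradient
};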

The Poisson_2D_Solver::generateEquation method constructs the system of equations using the mesh element objects. In this function, the element matrices Ak and vectors bk are calculated and accumulated into A and b, where Ak and bk are the Ωk contributions to the global system matrix and load vector, while the connectivity matrix Lk is used to map local node ids to global node ids.

Each element is composed of a unique subset of nodes, and this information is private to every element. There are many compressed sparse matrix formats, such as CSR (Compressed Sparse Row), COO (Coordinate Format), and BSR (Block Sparse Row), for efficient sparse matrix-vector multiplication. These formats need to know in advance the number of nonzero entries in each row and their column indexes, but this mapping among nodes is not known at this interface. A dynamic sparse matrix data structure is therefore required, with low-cost insertion, searching, and deletion. The Sparse_Matrix class is implemented to store A, with O(log n) complexity per access, and the Vector class is implemented to store b, with O(1) complexity.
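A minimal sketch of such a container, assuming the two-level nested std::map (binary trees) that is also used later for the parallel assembly; the method names are illustrative, not the paper's exact interface.

#include <map>

// Dynamic sparse matrix: an outer map from row id to an inner map that holds
// the nonzero columns of that row. Each std::map operation costs O(log n).
class Sparse_Matrix {
    std::map<int, std::map<int, double>> rows_;
public:
    void add(int i, int j, double v) { rows_[i][j] += v; }   // insert or accumulate
    double get(int i, int j) const {
        auto r = rows_.find(i);
        if (r == rows_.end()) return 0.0;
        auto c = r->second.find(j);
        return (c == r->second.end()) ? 0.0 : c->second;
    }
    void removeRow(int i) { rows_.erase(i); }                // used by applyDirichlet
};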

The Poisson_2D_Solver::applyDirichlet method applies the Dirichlet constraints to A and b. The list of Dirichlet nodes is collected during the element generation stage. The user has to specify which boundary of Ω is a Dirichlet boundary and the amount of force applied on it. In this work, all nodes whose x or y coordinate (or both) is zero lie on the Dirichlet boundary, and zero force is applied on the Dirichlet boundaries.

The Poisson_2D_Solver::getSolution method computes the solution of the linear system of equations; for this purpose, two iterative linear solvers, Jacobi and Conjugate Gradient, are implemented.

4. Parallel FE Solver

All stages mentioned above are implemented in parallel for shared memory architectures using OpenMP. OpenMP was selected because it is a directive-based interface that is very convenient to adopt and because the software market has adopted it as a de facto standard: it is implemented by all major C++ compilers (Microsoft, Intel, GNU, etc.). OpenMP hides the details of creating, managing, and deleting threads. A developer only indicates where to create a parallel region and which variables are shared or private. The developer also has to identify the critical sections and the mechanisms used to protect them.
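The following minimal example illustrates this division of responsibilities (variable names are illustrative): the directive marks the parallel region, the clauses declare what is shared or private, and the critical section protects the only concurrent update.

#include <omp.h>
#include <cstdio>

int main() {
    double global_sum = 0.0;                         // shared accumulator
    #pragma omp parallel default(none) shared(global_sum)
    {
        double local = 1.0 + omp_get_thread_num();   // private to each thread
        #pragma omp critical                         // only one thread updates at a time
        global_sum += local;
    }
    std::printf("sum = %f\n", global_sum);
    return 0;
}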

4.1. Read Mesh

Multiple algorithms were considered for the parallel generateElements method.

4.1.1. Single File Handler Shared among Threads

The first approach considered for parallel mesh reading is to share one file handler among multiple threads. Initially, the file was read inside a critical section, where it was observed that the mesh read time increases with the number of threads because all threads except one are waiting at the critical section. Next, the shared file handler was used in each thread without a critical section. This resulted in a performance gain, but some nodes and elements were read multiple times and their objects were created more than once.

4.1.2. Each Thread Has Its Own File Handler

The second approach provides a private file handler to each thread. All threads initially calculate how many nodes and elements each thread will read and what their starting and ending ids should be. Figure 2 shows how nodes and elements are mapped to threads using block distribution and cyclic distribution, respectively. Block distribution allocates consecutive rows to each thread, as shown in Figure 2(a), while cyclic distribution allocates every p-th row to each thread, where p is the number of threads, as shown in Figure 2(b). Initial experiments showed that block distribution outperformed cyclic distribution, as can be seen in Tables 1 and 2. In both distributions, an almost equal number of nodes and elements is read by each thread, and each node and element is read exactly once in the parallel region. For 16 threads, cyclic distribution took 9.84 seconds while block distribution took 1.24 seconds. The index computation for both distributions is sketched below.
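The sketch shows how each thread can derive its own portion of the node (or element) list for the two distributions; the variable names are illustrative and the actual reading of a node is omitted.

#include <omp.h>

void read_nodes(int total_nodes) {
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        int p = omp_get_num_threads();

        // Block distribution: a consecutive chunk per thread (Figure 2(a)).
        int chunk = total_nodes / p;
        int rem   = total_nodes % p;
        int start = t * chunk + (t < rem ? t : rem);
        int end   = start + chunk + (t < rem ? 1 : 0);   // range [start, end)
        for (int i = start; i < end; ++i) {
            // seek to node i in the thread's private file handler and read it
        }

        // Cyclic distribution: every p-th node per thread (Figure 2(b)).
        for (int i = t; i < total_nodes; i += p) {
            // read node i
        }
    }
}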

4.2. Construction of Equations

Constructing the system of equations from a mesh requires a contribution from each element of the mesh. The system of equations is composed of the stiffness matrix A, the solution vector x, and the load vector b. For each element Ωk, a local stiffness matrix Ak and load vector bk are computed and added to A and b, respectively:

A = Σ_k Lk^T Ak Lk    (1)
b = Σ_k Lk^T bk    (2)

In (1) and (2), Lk is the connectivity matrix used to map the local element system of equations into the global system of equations.

In the parallel version, multiple approaches were adopted, which are discussed below with their advantages and disadvantages. The application was designed to use a small memory footprint as well as an efficient method of calculation. The Element class interface, which is inherited by all element classes (as can be seen in Figure 1), provides each element with a reference to the global structures, allowing each element to store its local data directly into the global data structures. In the parallel implementation, multiple elements add their contributions to the global structures simultaneously, which makes this piece of code a source of error. This critical section was addressed using the following methods.

4.2.1. Thread-Safe Data Structures

In this approach, thread-safe data structures for A and b are considered. In the first structure of A, shown in Figure 3(a), a single lock is added to the sparse_matrix class, and each time any thread performs an insertion, modification, or deletion, the lock must be acquired. The acquisition and release of the lock are added to the data structure's insertion and deletion methods. This approach produced correct matrix data, but the time taken was similar to the sequential time, as shown in Tables 3 and 4. In both cases, the maximum speedup was attained with three threads and decreased as threads were added, since only one thread has access to the critical section while the other threads are waiting.

It was observed that each element is constructed from a unique subset of nodes, which means that not all elements update the same rows at any instant. The sparse_matrix is composed of two-level nested binary trees built with the std::map implementation provided by the Standard Template Library (STL). The sparse_matrix class was modified so that a lock is created for each row, as shown in Figure 3(b), so only those threads which modify the same row have to wait. Tables 5 and 6 show the timing of the row-level locking mechanism. They show a continuous rise in speedup with an increase in threads, as in most cases threads write to different rows and very few threads are waiting at any time.
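A minimal sketch of this row-level locking, assuming the nested std::map container and one OpenMP lock per row; the interface is illustrative.

#include <map>
#include <vector>
#include <omp.h>

class Row_Locked_Sparse_Matrix {
    std::map<int, std::map<int, double>> rows_;
    std::vector<omp_lock_t> row_locks_;              // one lock per row (Figure 3(b))
public:
    explicit Row_Locked_Sparse_Matrix(int n_rows) : row_locks_(n_rows) {
        for (auto& l : row_locks_) omp_init_lock(&l);
        for (int i = 0; i < n_rows; ++i) rows_[i];   // pre-create rows so the outer map is never modified concurrently
    }
    ~Row_Locked_Sparse_Matrix() { for (auto& l : row_locks_) omp_destroy_lock(&l); }

    void add(int i, int j, double v) {
        omp_set_lock(&row_locks_[i]);                // only writers of row i are serialised
        rows_.find(i)->second[j] += v;
        omp_unset_lock(&row_locks_[i]);
    }
};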

4.2.2. Use OpenMP Critical Section to Gather Equation Data

The second approach was to implement a critical section at the application level, in which non-thread-safe data structures are used for A and b. The parallel region was developed using #pragma omp for, where each element updates the global data structure inside a critical section. The timing of this implementation is shown in Table 7. This method showed no speedup at all, as most of the time all threads except one are waiting. Even though the number of tasks per thread decreases with the number of threads, the wait time also increases.
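For comparison, a sketch of this critical-section variant is shown below; the local computation is a hypothetical stand-in for the element's contribution, and the container types are illustrative.

#include <map>
#include <tuple>
#include <vector>
#include <omp.h>

using SparseMatrix = std::map<int, std::map<int, double>>;
using LocalEntries = std::vector<std::tuple<int, int, double>>;   // (row, col, value)

// Hypothetical stand-in for an element's local stiffness computation.
LocalEntries compute_local(int e) { return { {e, e, 1.0} }; }

void assemble_with_critical(int n_elements, SparseMatrix& A) {
    #pragma omp parallel for
    for (int e = 0; e < n_elements; ++e) {
        LocalEntries local = compute_local(e);       // independent, runs in parallel
        #pragma omp critical                         // the whole assembly of A is serialised
        for (const auto& [i, j, v] : local)
            A[i][j] += v;
    }
}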

4.2.3. Use Threads’ Local Data Structures and Gather Using Butterfly Communication

This approach aims to avoid critical sections by using a local sparse matrix in each thread. As shown in Figure 4, the algorithm is composed of two sections: in the earlier stage each thread collects data from each of its designated elements into local data structures At and bt, while the latter stage iteratively adds these thread-local data structures to A and b using butterfly communication. The butterfly communication is an iterative process in which the number of iterations is equal to log2(total threads); Figure 4 shows only log2(8) = 3 synchronizations, indicated by orange horizontal lines. Tables 8 and 9 show the timing, where each table has three rows. The first row indicates the time to construct At by summing all the allocated elements' stiffness matrices. The second row indicates the time for the construction of the global matrix A from the At, while the last row indicates the speedup achieved by multiple threads. Each thread has to construct its At sequentially, and the population of A requires data from the At via butterfly communication. As the number of threads increases, more matrices have to be added and more stages are added to the butterfly communication. During the experiments, elements were read in the same order as given in the file, and a nonoverlapping subset of consecutive elements was allocated to each thread. Finally, the increase in the number of threads reduces the time for constructing At but increases the time for gathering A, so the overall time remains roughly constant and the speedup always remained below 2. A sketch of this two-stage scheme is given below.
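The sketch assumes that a simple map keyed by (row, column) stands in for the paper's sparse_matrix; the pairwise combination uses log2(p) synchronised steps, after which thread 0 holds the global matrix.

#include <map>
#include <utility>
#include <vector>
#include <omp.h>

using SparseLocal = std::map<std::pair<int, int>, double>;

void assemble_with_reduction(int n_elements, SparseLocal& A_global) {
    int p = omp_get_max_threads();
    std::vector<SparseLocal> local(p);               // one A_t per thread

    #pragma omp parallel num_threads(p)
    {
        int t = omp_get_thread_num();

        // Stage 1: each thread accumulates its designated elements locally, no locks.
        #pragma omp for
        for (int e = 0; e < n_elements; ++e) {
            // local[t][{i, j}] += element e's contribution (omitted)
        }

        // Stage 2: pairwise combination in log2(p) steps, one barrier per step.
        for (int step = 1; step < p; step *= 2) {
            #pragma omp barrier
            if (t % (2 * step) == 0 && t + step < p)
                for (const auto& [ij, v] : local[t + step])
                    local[t][ij] += v;
        }
    }
    A_global = std::move(local[0]);                  // thread 0 now holds the full sum
}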

4.3. Application of Constraints

In this work, only Dirichlet constraints are considered. The parallel algorithm for the Dirichlet constraints has two parts, as shown in Figure 5. The first part removes all Dirichlet rows from A while setting each diagonal entry to 1. In this experiment, a two-level nested std::map is used to store the sparse matrix, and std::map is not a thread-safe data structure. Removing a row means removing an inner std::map object from the outer std::map object, where each inner map represents a row. The second part is more time-consuming, as the Dirichlet column entries have to be searched for and removed in every non-Dirichlet row. Figure 6 shows a 7 × 7 stiffness matrix before and after applying the Dirichlet constraints, where nodes 4 and 6 lie on the Dirichlet boundary. The removal of the Dirichlet rows was performed sequentially because of the non-thread-safe data structure, while the non-Dirichlet rows were distributed among the threads, with each thread removing the Dirichlet columns from its rows. Table 10 shows the timing and speedup for the Dirichlet algorithm. The second part of the algorithm is embarrassingly parallel and achieved an almost linear speedup of 14.65 for 16 threads. A sketch of the two parts follows.
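The sketch assumes the nested std::map container and a zero prescribed value on the Dirichlet boundary, as in this work; names are illustrative.

#include <map>
#include <vector>
#include <omp.h>

using SparseMatrix = std::map<int, std::map<int, double>>;

void apply_dirichlet(SparseMatrix& A, std::vector<double>& b,
                     const std::vector<int>& dirichlet_nodes) {
    // Part 1 (sequential): replace every Dirichlet row by an identity row.
    for (int d : dirichlet_nodes) {
        A.erase(d);
        A[d][d] = 1.0;
        b[d]    = 0.0;                               // zero prescribed value in this work
    }

    // Part 2 (parallel): strip Dirichlet columns from every remaining row.
    std::vector<int> row_ids;
    for (const auto& [i, row] : A) row_ids.push_back(i);

    #pragma omp parallel for
    for (long k = 0; k < static_cast<long>(row_ids.size()); ++k) {
        auto& row = A.find(row_ids[k])->second;      // each thread touches distinct rows only
        for (int d : dirichlet_nodes)
            if (row_ids[k] != d) row.erase(d);
    }
}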

4.4. Linear Algebra Solvers

Two of the most commonly used iterative solvers, Jacobi and Conjugate Gradient, were considered in this work. Both are implemented in parallel using OpenMP, and their algorithms and implementation details are given below. In the algorithm listings, all critical sections are colored blue while synchronization points are colored red.
(1) A Jacobi solver iteration can be segregated into the following parts:
(i) calculation of the residual r = b − Ax;
(ii) computation of the new approximation of x;
(iii) calculation of the error threshold.
(2) Conjugate Gradient solver.

Input: A, b, x, N, x_norm, r_norm
Output: x
(1) fork (A, b, x, N, x_norm, r_norm)
(2) Th_id, Tt_th ← (thread id and total threads from the OS)
(3) St_RowID ← Ed_RowID ← 0 {start and end ids}
(4) if Th_id < N mod Tt_th then
(5)  St_RowID ← Th_id ∗ (N/Tt_th) + Th_id
(6)  Ed_RowID ← St_RowID + N/Tt_th + 1
(7) else
(8)  St_RowID ← Th_id ∗ (N/Tt_th) + (N mod Tt_th)
(9)  Ed_RowID ← St_RowID + N/Tt_th
(10) end if
(11) r {local residual vector for the allocated nodes}
(12) error ← 1
(13) while error > Threshold do
(14)  r_tmp ← x_tmp ← 0 {local norms for this thread}
(15)  for each i ∈ [St_RowID, Ed_RowID) do
(16)   r[i − St_RowID] ← b[i]
(17)   for each stored entry j of row i do r[i − St_RowID] ← r[i − St_RowID] − A[i, j] ∗ x[j]
(18)   end for
(19)   r_tmp ← r_tmp + r[i − St_RowID]²
(20)  end for
(21)  x_norm ← r_norm ← 0 {global norms}
(22)  barrier
(23)  for each i ∈ [St_RowID, Ed_RowID) do
(24)   x[i] ← x[i] + r[i − St_RowID]/A[i, i]
(25)   x_tmp ← x_tmp + x[i]²
(26)  end for
(27)  lock
(28)   r_norm ← r_norm + r_tmp
(29)   x_norm ← x_norm + x_tmp
(30)  unlock
(31)  barrier
(32)  error ← sqrt(r_norm/x_norm)
(33) end while
(34) join

The second implementation for Jacobi creates a single parallel region that covers all three parts of the Jacobi iteration, as shown in Algorithm 1. There is only one parallel region in this implementation, starting at line 1 and collapsing at the join at line 34. All three stages are implemented in this single parallel region, and the different stages have to be separated using explicit barriers. The first explicit barrier, at line 22, is necessary because x is used in the first region to calculate r, while in the second region x is updated using r. The second explicit barrier, at line 31, is required for the calculation of the error, which needs the global norms r_norm and x_norm. Before the second barrier, a critical section from lines 27 to 30 is required to accumulate r_norm and x_norm. Table 11 shows timing data for Algorithm 1, calculated using a mesh of 131585 nodes and 263168 elements. Jacobi has a slow convergence rate, taking a large number of iterations to converge, and it is the most time-consuming part of the FE solver; for this problem, it took about 165009 iterations. The parallel implementation reduced the Jacobi solver's sequential time of 3780 sec to a parallel time of 605 sec using 16 threads, achieving a speedup of 6.246. During execution, the stopping criterion adopted was either an error less than 10^-8 or a total number of iterations greater than 10^6. A condensed OpenMP sketch of this single-region structure is given below.
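The following condensed C++/OpenMP sketch mirrors the single-region structure of Algorithm 1 (one region, two global synchronizations per iteration, and a lock-protected accumulation of the norms). It is illustrative only: a dense matrix stands in for the paper's sparse container, and the error measure is assumed to be the ratio of the two accumulated norms.

#include <cmath>
#include <vector>
#include <omp.h>

void jacobi(const std::vector<std::vector<double>>& A, const std::vector<double>& b,
            std::vector<double>& x, double threshold) {
    const int n = static_cast<int>(b.size());
    std::vector<double> r(n, 0.0);
    double r_norm = 0.0, x_norm = 0.0, error = 1.0;

    #pragma omp parallel shared(A, b, x, r, r_norm, x_norm, error)
    {
        while (error > threshold) {
            double r_tmp = 0.0, x_tmp = 0.0;         // thread-local norms

            #pragma omp for                          // residual r = b - A*x for the owned rows
            for (int i = 0; i < n; ++i) {
                double ri = b[i];
                for (int j = 0; j < n; ++j) ri -= A[i][j] * x[j];
                r[i] = ri;
                r_tmp += ri * ri;
            }                                        // implicit barrier: x is no longer read

            #pragma omp single
            { r_norm = 0.0; x_norm = 0.0; }          // reset global norms (implicit barrier)

            #pragma omp for                          // Jacobi update x <- x + r / diag(A)
            for (int i = 0; i < n; ++i) {
                x[i] += r[i] / A[i][i];
                x_tmp += x[i] * x[i];
            }

            #pragma omp critical                     // accumulate the thread-local norms
            { r_norm += r_tmp; x_norm += x_tmp; }

            #pragma omp barrier                      // all partial norms are in
            #pragma omp single
            error = std::sqrt(r_norm / x_norm);      // one thread updates the shared error
            // the implicit barrier of 'single' makes the loop test consistent for all threads
        }
    }
}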

Algorithm 2 shows the parallel algorithm for the Conjugate Gradient method. While designing this algorithm, the main aim was to keep global synchronization to a minimum. In this algorithm, global synchronization is represented by the barrier function at lines 17, 25, and 34. All the code displayed in black is code that each thread performs individually, without communicating with other threads. All the code in blue is a critical section that has to be executed by all threads, but one at a time; in these sections, the global reduction variables rtr2 and ptAp are computed. In each iteration, two synchronization points are required to update the shared variables ptAp and rtr2, respectively.

Input: A, b, x, N, rtr1, rtr2, ptAp
Output: x
(1) fork (A, b, x, N, rtr1, rtr2, ptAp) {shared variables}
(2) Th_id, Tt_th ← (thread id and total threads from the OS)
(3) St_RowID ← Ed_RowID ← 0 {start and end ids}
(4) if Th_id < N mod Tt_th then
(5)  St_RowID ← Th_id ∗ (N/Tt_th) + Th_id
(6)  Ed_RowID ← St_RowID + N/Tt_th + 1
(7) else
(8)  St_RowID ← Th_id ∗ (N/Tt_th) + (N mod Tt_th)
(9)  Ed_RowID ← St_RowID + N/Tt_th
(10) end if
(11) rtr1 ← rtr2 ← t_rtr ← 0
(12) for each i ∈ [St_RowID, Ed_RowID) do
(13)  p[i] ← r[i] ← b[i] − Σ_j A[i, j] ∗ x[j]
(14)  t_rtr ← t_rtr + r[i] ∗ r[i]
(15) end for
(16) lock rtr2 ← rtr2 + t_rtr unlock
(17) barrier
(18) while rtr2 > Threshold do
(19)  rtr1 ← rtr2
(20)  rtr2 ← t_ptAp ← 0
(21)  for each i ∈ [St_RowID, Ed_RowID) do
(22)   t_ptAp ← t_ptAp + p[i] ∗ Σ_j A[i, j] ∗ p[j]
(23)  end for
(24)  lock ptAp ← ptAp + t_ptAp unlock
(25)  barrier
(26)  t_rtr ← 0
(27)  for each i ∈ [St_RowID, Ed_RowID) do
(28)   x[i] ← x[i] + (rtr1/ptAp) ∗ p[i]
(29)   r[i] ← r[i] − (rtr1/ptAp) ∗ Σ_j A[i, j] ∗ p[j]
(30)   t_rtr ← t_rtr + r[i] ∗ r[i]
(31)  end for
(32)  ptAp ← 0
(33)  lock rtr2 ← rtr2 + t_rtr unlock
(34)  barrier
(35)  for each i ∈ [St_RowID, Ed_RowID) do
(36)   p[i] ← r[i] + (rtr2/rtr1) ∗ p[i]
(37)  end for
(38) end while
(39) join

Tables 12 and 13 show the timing and speedup data of Algorithm 2 for two meshes. The system with 131585 unknowns took 1019 iterations, while the system with 2099201 unknowns took 3987 iterations. Algorithm 2 has three critical sections (all colored in blue): the first, at line 16, is executed only once as it is before the while loop, while the other two, at lines 24 and 33, are inside the while loop and have to be executed in every iteration. These sections update the global variables ptAp and rtr2 with the thread-local contributions. The two variables cannot be updated in the same critical section because ptAp must be fully accumulated before it is used in the update of x and r, which in turn produces rtr2. Therefore, every time a thread passes through a critical section, it has to wait for the other threads to execute it before further processing; this is enforced by global communication. For a fixed problem size, as the number of threads in the parallel region increases, the time required for the thread-independent tasks (colored in black) decreases, while the critical section and communication time increase. A condensed OpenMP sketch mirroring this structure is given below.
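The sketch is illustrative only: a dense matrix stands in for the paper's sparse container, ||r||² is used directly as the stopping quantity, and the extra barrier at the top of the loop simply guarantees that every thread has evaluated the loop test before the shared scalars are reset.

#include <vector>
#include <omp.h>

void conjugate_gradient(const std::vector<std::vector<double>>& A,
                        const std::vector<double>& b, std::vector<double>& x,
                        double threshold) {
    const int n = static_cast<int>(b.size());
    std::vector<double> r(n), p(n), Ap(n);
    double rtr1 = 0.0, rtr2 = 0.0, ptAp = 0.0;

    #pragma omp parallel
    {
        double t = 0.0;
        #pragma omp for                              // r = p = b - A*x, accumulate r'r
        for (int i = 0; i < n; ++i) {
            double ri = b[i];
            for (int j = 0; j < n; ++j) ri -= A[i][j] * x[j];
            r[i] = p[i] = ri;
            t += ri * ri;
        }
        #pragma omp critical
        rtr2 += t;
        #pragma omp barrier                          // rtr2 complete before the loop test

        while (rtr2 > threshold) {
            #pragma omp barrier                      // every thread has passed the test
            #pragma omp single
            { rtr1 = rtr2; rtr2 = 0.0; ptAp = 0.0; } // reset shared scalars (implicit barrier)

            t = 0.0;
            #pragma omp for                          // Ap = A*p, accumulate p'Ap
            for (int i = 0; i < n; ++i) {
                double s = 0.0;
                for (int j = 0; j < n; ++j) s += A[i][j] * p[j];
                Ap[i] = s;
                t += p[i] * s;
            }
            #pragma omp critical
            ptAp += t;
            #pragma omp barrier                      // ptAp complete before it is used

            t = 0.0;
            #pragma omp for                          // x += alpha*p, r -= alpha*Ap
            for (int i = 0; i < n; ++i) {
                x[i] += (rtr1 / ptAp) * p[i];
                r[i] -= (rtr1 / ptAp) * Ap[i];
                t += r[i] * r[i];
            }
            #pragma omp critical
            rtr2 += t;
            #pragma omp barrier                      // rtr2 complete before the p update

            #pragma omp for                          // p = r + beta*p
            for (int i = 0; i < n; ++i)
                p[i] = r[i] + (rtr2 / rtr1) * p[i];
        }                                            // implicit barrier of the last 'for' precedes the loop test
    }
}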

5. Parallel FE Solver Performance

The overall application performance obtained during the experiments is given in Tables 14 and 15, which show the time consumed to solve finite element problems with 131585 and 2099201 unknowns, respectively.

Figure 8 shows the speedup for three problems (smaller to larger from left to right) using 1 to 16 threads with the Conjugate Gradient solver. The data structure used to collect nodes and elements is not thread-safe. It can be seen that, for the smaller mesh, the speedup starts to decline with more threads, while as the number of unknowns increases, the speedup keeps increasing for higher thread counts as well. The speedups of the construction of equations and of the Dirichlet stage increase almost linearly with the number of threads; in both cases, the choice of data container for A has a significant impact, as discussed previously. The last stage of the FE solver takes the maximum time and achieves the least speedup of all. The iterative solvers (Jacobi and Conjugate Gradient) have a major impact on the complete solver time, so the increase in overall solver speedup follows the iterative solver speedup.

6. Conclusion

This work has focused on the implementation of a Finite Element Application (FEA) using object-oriented programming for a shared memory architecture using OpenMP, without resorting to conventional speedup techniques or specialized data structures such as CSR. The FEA has been segregated into four stages: reading the mesh, construction of the equations, application of boundary conditions, and computation of the solution of the linear system of equations. The models available for solving FEM-associated problems work very similarly to each other; the key difference among them lies in the fact that every model uses a different approach to handle complicated geometries (and boundaries) with relative ease. All timings were measured on a computer with a Xeon E5-2650 processor (octa-core, with hyperthreading) and 32 GB RAM. In the initial stage, the mesh is read; in the most efficient variant, blocks of nodes and elements (block partitioning) are read in each thread using a local file descriptor and stored in a shared data container. In the second stage, the system of equations is constructed from the nodes and elements. The Standard Template Library std::map class is used as the sparse matrix container, which is not a thread-safe data structure. The sparse matrix container is implemented as a two-level std::map, where an individual std::map object is used for each row and an individual lock is associated with each row. If all threads update different rows, no thread has to wait, as each row has a separate object and lock; at the other extreme, if all threads write to the same row, only one has access to the row object while the others wait. This stage achieved a maximum speedup of 12.67. In the next stage, the Dirichlet condition is applied in parallel, achieving a maximum speedup of 14.65. In the last stage, two linear algebra solvers (Jacobi and Conjugate Gradient) are implemented to compute the solution. Jacobi has two synchronization points per iteration while Conjugate Gradient has three. Conjugate Gradient has a higher per-iteration complexity but a much faster convergence rate than Jacobi. Parallel Jacobi achieved a speedup of about 6 over the sequential Jacobi implementation, and parallel Conjugate Gradient likewise achieved a speedup of about 6 over the sequential Conjugate Gradient. The last stage is the most time-consuming part of the FEA, and the overall speedup directly depends on it; during the experiments, it was observed that about 85% to 90% of the total time was consumed by the solvers. To the best of our knowledge, this is pioneering work on designing a multithreaded, object-oriented FEM solver for shared memory architectures, and it still has potential for improvement.

Data Availability

The data used to support the findings of this study are available from the corresponding authors upon request.

Conflicts of Interest

The authors declare no potential conflicts of interest.