GPU-Accelerated Simulation

*Author of this section: Zhaofeng Luo, Carnegie Mellon University & Peking University*

We now rewrite the 2D mass-spring simulator to leverage GPU acceleration. Instead of writing CUDA directly, we use MUDA, a lightweight library that provides a simple interface for GPU-accelerated computation.

The architecture of the GPU-accelerated simulator is similar to that of the Python version, and all function and variable names are kept consistent with the NumPy version. The implementation details differ, however, due to the GPU architecture and programming model. Before delving into the details, let's first get a sense of the speedup the GPU can bring from the animation in Figure 4.6.1.

Figure 4.6.1. An illustration of the simulation speed of the NumPy CPU (left) and the MUDA GPU (right) versions.

Key Considerations for GPU Programming

To maximize resource utilization on the GPU, there are two important aspects to consider:

  • Minimizing Data Transfer. In most modern architectures, CPU and GPU have separate memory spaces. Transferring data between these spaces can be expensive. Therefore, it is essential to minimize data transfers between CPU and GPU.
  • Exploiting Parallelism. GPUs excel at parallel computations. However, care must be taken to avoid read-write conflicts that arise when multiple threads attempt to update the same memory location simultaneously; the sketch after this list illustrates the two write patterns used throughout this section.
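
As a warm-up, here is a minimal sketch of those two patterns using MUDA's DeviceBuffer and ParallelFor (both introduced below): a thread either writes to a slot it exclusively owns, which needs no synchronization, or accumulates into a slot shared with other threads, which requires atomicAdd.

// Minimal sketch of the two write patterns used throughout this section.
// device_out has one slot per thread (conflict-free writes), while device_accum
// has fewer slots than threads, so concurrent updates must go through atomicAdd.
const int N = 1024, M = 16;
DeviceBuffer<float> device_out(N);
DeviceBuffer<float> device_accum(M);
device_accum.fill(0);
ParallelFor(256).apply(N, [device_out = device_out.viewer(), device_accum = device_accum.viewer(), M] __device__(int i) mutable
                       {
                           device_out(i) = 0.5f * i;              // each thread owns slot i: no conflict
                           atomicAdd(&device_accum(i % M), 1.0f); // shared slot: atomic update required
                       })
    .wait();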

Minimizing Data Transfer

To reduce data transfer between the CPU and GPU, we store the main energy values and their derivatives on the GPU. Computations are then performed directly on the GPU, and only the necessary position information is transferred back to the CPU for control and rendering. A more efficient implementation could render directly on the GPU, eliminating even this data transfer, but for simplicity and readability, we have not implemented that here.
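
Concretely, the only recurring device-to-host copy is a per-frame readback of the positions for display. A minimal sketch of such a readback is shown below; copy_to is assumed to behave like MUDA's DeviceBuffer host-transfer helpers, and draw and n_nodes stand in for a hypothetical CPU-side rendering routine and the node count.

// Hypothetical per-frame readback: only the positions needed for rendering are
// copied back to the CPU. copy_to is assumed to mirror MUDA's host-transfer API.
std::vector<T> host_x(n_nodes * dim);
device_x.copy_to(host_x); // single device-to-host transfer per frame
draw(host_x);             // hypothetical CPU-side rendering call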

To make the code more readable, variables beginning with device_ are stored in GPU memory, and variables beginning with host_ are stored in CPU memory.

Implementation 4.6.1 (Data structure, MassSpringEnergy.cu).

template <typename T, int dim>
struct MassSpringEnergy<T, dim>::Impl
{
	DeviceBuffer<T> device_x;
	DeviceBuffer<T> device_l2, device_k;
	DeviceBuffer<int> device_e;
	int N;
	DeviceBuffer<T> device_grad;
	DeviceTripletMatrix<T, 1> device_hess;
};

As shown in the code above, the energy values and their derivatives, as well as all the necessary parameters, are stored in DeviceBuffer objects. DeviceBuffer is a wrapper of CUDA device memory provided by the MUDA library. This allows us to perform computations directly on the GPU without transferring data back and forth between the CPU and GPU.
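
For completeness, these buffers only need to be filled once at setup time. The following is a hedged sketch of such an upload, assuming DeviceBuffer::copy_from accepts host-side std::vector data and resizes the destination, as MUDA's transfer helpers do; upload_spring_data is an illustrative helper, not the tutorial's actual setup code.

// Hypothetical one-time upload of the spring data to the GPU. copy_from is
// assumed to accept host std::vector data and resize the destination buffer.
template <typename T>
void upload_spring_data(DeviceBuffer<T> &device_x, DeviceBuffer<int> &device_e,
                        DeviceBuffer<T> &device_l2, DeviceBuffer<T> &device_k,
                        DeviceBuffer<T> &device_grad,
                        const std::vector<T> &x, const std::vector<int> &e,
                        const std::vector<T> &l2, const std::vector<T> &k)
{
    device_x.copy_from(x);        // node positions
    device_e.copy_from(e);        // edge (spring) index pairs
    device_l2.copy_from(l2);      // squared rest lengths
    device_k.copy_from(k);        // spring stiffnesses
    device_grad.resize(x.size()); // one gradient entry per coordinate
    device_grad.fill(0);
}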

Newton's Method

The iterations of Newton's method form an inherently serial process and cannot be parallelized. Therefore, we implement this part on the CPU:

Implementation 4.6.2 (Newton's method, simulator.cu).

template <typename T, int dim>
void MassSpringSimulator<T, dim>::Impl::step_forward()
{
    update_x_tilde(add_vector<T>(device_x, device_v, 1, h));
    DeviceBuffer<T> device_x_n = device_x; // Copy current positions to device_x_n
    int iter = 0;
    T E_last = IP_val();
    DeviceBuffer<T> device_p = search_direction();
    T residual = max_vector(device_p) / h;
    while (residual > tol)
    {
        std::cout << "Iteration " << iter << " residual " << residual << "E_last" << E_last << "\n";
        // Line search
        T alpha = 1;
        DeviceBuffer<T> device_x0 = device_x;
        update_x(add_vector<T>(device_x0, device_p, 1.0, alpha));
        while (IP_val() > E_last)
        {
            alpha /= 2;
            update_x(add_vector<T>(device_x0, device_p, 1.0, alpha));
        }
        std::cout << "step size = " << alpha << "\n";
        E_last = IP_val();
        device_p = search_direction();
        residual = max_vector(device_p) / h;
        iter += 1;
    }
    update_v(add_vector<T>(device_x, device_x_n, 1 / h, -1 / h));
}

The step_forward function implements the projected Newton method with line search, performing the heavy computations on the GPU while controlling the iteration on the CPU. Every variable beginning with device_ here is a DeviceBuffer object residing on the GPU. To inspect the values in a DeviceBuffer for debugging purposes, the common practice is to transfer the data back to the CPU, or to call the display_vec function (which calls printf in parallel on the GPU) implemented in uti.cu.
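
For reference, such a GPU-side printing helper can be as simple as launching one printf per entry; the sketch below illustrates the idea and is not necessarily the exact uti.cu implementation.

// A minimal sketch of a GPU-side debug printer in the spirit of display_vec:
// each thread prints one entry with printf (output order is not deterministic).
template <typename T>
void display_vec(const DeviceBuffer<T> &vec)
{
    ParallelFor(256).apply(static_cast<int>(vec.size()), [vec = vec.cviewer()] __device__(int i) mutable
                           { printf("v(%d) = %f\n", i, static_cast<double>(vec(i))); })
        .wait();
}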

The update_x function passes the updated nodal positions to all energy classes and transfers them back to the CPU for rendering:

Implementation 4.6.3 (Update positions, simulator.cu).

template <typename T, int dim>
void MassSpringSimulator<T, dim>::Impl::update_x(const DeviceBuffer<T> &new_x)
{
    inertialenergy.update_x(new_x);
    massspringenergy.update_x(new_x);
    device_x = new_x;
}

Since the energy classes have already received the updated positions, the IP_val function no longer needs any parameters, avoiding unnecessary data transfer. It simply calls the val function of each energy class and sums the results:

Implementation 4.6.4 (Computing IP, simulator.cu).

template <typename T, int dim>
T MassSpringSimulator<T, dim>::Impl::IP_val()
{
    return inertialenergy.val() + massspringenergy.val() * h * h;
}
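
This is exactly the Incremental Potential from the previous chapters,

$$E(x) = \frac{1}{2}\|x - \tilde{x}\|_M^2 + h^2 P(x),$$

where inertialenergy.val() evaluates the inertia term and massspringenergy.val() evaluates the elastic potential $P(x)$, which is why the latter is scaled by h * h.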

Similarly for the IP_grad and IP_hess functions:

Implementation 4.6.5 (Computing IP gradient and Hessian, simulator.cu).

template <typename T, int dim>
DeviceBuffer<T> MassSpringSimulator<T, dim>::Impl::IP_grad()
{
    return add_vector<T>(inertialenergy.grad(), massspringenergy.grad(), 1.0, h * h);
}

template <typename T, int dim>
DeviceTripletMatrix<T, 1> MassSpringSimulator<T, dim>::Impl::IP_hess()
{
    DeviceTripletMatrix<T, 1> inertial_hess = inertialenergy.hess();
    DeviceTripletMatrix<T, 1> massspring_hess = massspringenergy.hess();
    DeviceTripletMatrix<T, 1> hess = add_triplet<T>(inertial_hess, massspring_hess, 1.0, h * h);
    return hess;
}

Notice that they rely on the parallel GPU operations add_vector and add_triplet (implemented in uti.cu) to perform the weighted sums of the gradients and Hessians.
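
For intuition, such a weighted sum is a single element-wise ParallelFor pass. The sketch below shows what add_vector might look like; the actual uti.cu implementation may differ in details such as argument order or output allocation.

// Sketch of a weighted element-wise sum, out = a * x + b * y, in the spirit of
// add_vector. Not necessarily the exact uti.cu implementation.
template <typename T>
DeviceBuffer<T> add_vector(const DeviceBuffer<T> &x, const DeviceBuffer<T> &y, const T a, const T b)
{
    int n = static_cast<int>(x.size());
    DeviceBuffer<T> out(n);
    ParallelFor(256).apply(n, [out = out.viewer(), x = x.cviewer(), y = y.cviewer(), a, b] __device__(int i) mutable
                           { out(i) = a * x(i) + b * y(i); })
        .wait();
    return out;
}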

Parallel Computations

In our implementation, parallel computation is primarily employed in the computation of energy and its derivatives, as well as vector addition and subtraction. Let's take the MassSpringEnergy computation as an example.

Energy Computation

Implementation 4.6.6 (Computing energy, MassSpringEnergy.cu).

template <typename T, int dim>
T MassSpringEnergy<T, dim>::val()
{
	auto &device_x = pimpl_->device_x;
	auto &device_e = pimpl_->device_e;
	auto &device_l2 = pimpl_->device_l2;
	auto &device_k = pimpl_->device_k;
	int N = device_e.size() / 2;
	DeviceBuffer<T> device_val(N);
	ParallelFor(256).apply(N, [device_val = device_val.viewer(), device_x = device_x.cviewer(), device_e = device_e.cviewer(), device_l2 = device_l2.cviewer(), device_k = device_k.cviewer()] __device__(int i) mutable
						   {
		int idx1 = device_e(2 * i); // First node index
		int idx2 = device_e(2 * i + 1); // Second node index
		T diff = 0;
		for (int d = 0; d < dim; d++){
			T diffi = device_x(dim * idx1 + d) - device_x(dim * idx2 + d);
			diff += diffi * diffi;
		}
		device_val(i) = 0.5 * device_l2(i) * device_k(i) * (diff / device_l2(i) - 1) * (diff / device_l2(i) - 1); })
		.wait();

	return devicesum(device_val);
} // Calculate the energy

The ParallelFor function distributes the computation across multiple GPU threads. The captured variables in the lambda function allow access to the necessary data structures within each thread.
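
The per-spring energies in device_val are then reduced to a single scalar by devicesum. One possible realization of such a reduction is sketched below using Thrust, assuming DeviceBuffer exposes its raw device pointer via data(); the actual uti.cu implementation may use a different reduction primitive.

#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

// One possible devicesum: a parallel reduction over the buffer's device memory.
// Assumes DeviceBuffer exposes data() and size(); not necessarily the uti.cu code.
template <typename T>
T devicesum(const DeviceBuffer<T> &buf)
{
    return thrust::reduce(thrust::device, buf.data(), buf.data() + buf.size(), T(0));
}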

Gradient Computation

Implementation 4.6.7 (Computing gradients, MassSpringEnergy.cu).

template <typename T, int dim>
const DeviceBuffer<T> &MassSpringEnergy<T, dim>::grad()
{
	auto &device_x = pimpl_->device_x;
	auto &device_e = pimpl_->device_e;
	auto &device_l2 = pimpl_->device_l2;
	auto &device_k = pimpl_->device_k;
	auto N = pimpl_->device_e.size() / 2;
	auto &device_grad = pimpl_->device_grad;
	device_grad.fill(0);
	ParallelFor(256).apply(N, [device_x = device_x.cviewer(), device_e = device_e.cviewer(), device_l2 = device_l2.cviewer(), device_k = device_k.cviewer(), device_grad = device_grad.viewer()] __device__(int i) mutable
						   {
		int idx1 = device_e(2 * i); // First node index
		int idx2 = device_e(2 * i + 1); // Second node index
		T diff = 0;
		T diffi[dim];
		for (int d = 0; d < dim; d++){
			diffi[d] = device_x(dim * idx1 + d) - device_x(dim * idx2 + d);
			diff += diffi[d] * diffi[d];
		}
		T factor = 2 * device_k(i) * (diff / device_l2(i) - 1);
		for (int d = 0; d < dim; d++){
		   atomicAdd(&device_grad(dim * idx1 + d), factor * diffi[d]);
		   atomicAdd(&device_grad(dim * idx2 + d), -factor * diffi[d]);	  
		} })
		.wait();
	// display_vec(device_grad);
	return device_grad;
}

The atomicAdd function is crucial in the gradient computation to ensure safe concurrent updates to shared data (different edges can update the gradient of the same node), thus preventing race conditions.

Hessian Computation

We use a sparse (triplet) matrix data structure to store the Hessian. The computation is parallelized across threads, with each thread computing the local Hessian block of one spring and writing its entries into preallocated slots. The size of the sparse matrix is determined at the start of the simulation, allocating just enough memory for the non-zero entries. The main consideration during simulation is then computing the correct triplet index for each entry:
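
Concretely, each spring contributes a dense (2·dim) × (2·dim) block, i.e. 4·dim² triplet entries. The following sketch summarizes the preallocation and the flattened index used in the kernel below; the reshape and resize_triplets calls are assumptions about the setup code rather than the tutorial's exact API.

// Sketch of the triplet bookkeeping (setup-time, not per step). The reshape /
// resize_triplets calls are assumed API; the key point is the entry count and
// the flat index layout used by the kernel below.
int n_dof = n_nodes * dim;                      // hypothetical total DOF count
device_hess.reshape(n_dof, n_dof);              // assumed call: global matrix shape
device_hess.resize_triplets(N * 4 * dim * dim); // one slot per local Hessian entry
// Entry (d1, d2) of sub-block (ni, nj) of spring i is written at the flat index
//   i * 4*dim*dim + (ni * 2 + nj) * dim*dim + d1 * dim + d2,
// which is exactly indStart + d1 * dim + d2 in the kernel.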

Implementation 4.6.8 (Computing Hessians, MassSpringEnergy.cu).

template <typename T, int dim>
const DeviceTripletMatrix<T, 1> &MassSpringEnergy<T, dim>::hess()
{
	auto &device_x = pimpl_->device_x;
	auto &device_e = pimpl_->device_e;
	auto &device_l2 = pimpl_->device_l2;
	auto &device_k = pimpl_->device_k;
	auto N = device_e.size() / 2;
	auto &device_hess = pimpl_->device_hess;
	auto device_hess_row_idx = device_hess.row_indices();
	auto device_hess_col_idx = device_hess.col_indices();
	auto device_hess_val = device_hess.values();
	device_hess_val.fill(0);
	ParallelFor(256).apply(N, [device_x = device_x.cviewer(), device_e = device_e.cviewer(), device_l2 = device_l2.cviewer(), device_k = device_k.cviewer(), device_hess_val = device_hess_val.viewer(), device_hess_row_idx = device_hess_row_idx.viewer(), device_hess_col_idx = device_hess_col_idx.viewer(), N] __device__(int i) mutable
						   {
		int idx[2] = {device_e(2 * i), device_e(2 * i + 1)}; // Indices of the two nodes of this spring
		T diff = 0;
		T diffi[dim];
		for (int d = 0; d < dim; d++)
		{
			diffi[d] = device_x(dim * idx[0] + d) - device_x(dim * idx[1] + d);
			diff += diffi[d] * diffi[d];
		}
		Eigen::Matrix<T, dim, 1> diff_vec(diffi);
		Eigen::Matrix<T, dim, dim> diff_outer = diff_vec * diff_vec.transpose();
		T scalar = 2 * device_k(i) / device_l2(i);
		Eigen::Matrix<T, dim, dim> H_diff = scalar * (2 * diff_outer + (diff_vec.dot(diff_vec) - device_l2(i)) * Eigen::Matrix<T, dim, dim>::Identity());
		Eigen::Matrix<T, dim * 2, dim * 2> H_block, H_local;
		H_block << H_diff, -H_diff,
			-H_diff, H_diff;
		make_PSD(H_block, H_local);
		// add to global matrix
		for (int ni = 0; ni < 2; ni++)
			for (int nj = 0; nj < 2; nj++)
			{
				int indStart = i * 4 * dim * dim + (ni * 2 + nj) * dim * dim;
				for (int d1 = 0; d1 < dim; d1++)
					for (int d2 = 0; d2 < dim; d2++){
						device_hess_row_idx(indStart + d1 * dim + d2)= idx[ni] * dim + d1;
						device_hess_col_idx(indStart + d1 * dim + d2)= idx[nj] * dim + d2;
						device_hess_val(indStart + d1 * dim + d2) = H_local(ni * dim + d1, nj * dim + d2);
					}
			} })
		.wait();
	return device_hess;
} // Calculate the Hessian of the energy