15-418 Final Project Final Report
This is the final report for the 15-418 final project by Amos Yuen and Andrew Lau.
We implemented an interactive smoothed-particle hydrodynamics (SPH) application. We first implemented SPH on the CPU. Then we extended it to a parallel CUDA implementation on the GPU. Finally, we used MPI to run both the CPU and GPU implementations on multiple GHC machines.
SPH is a computational method for simulating fluid flow. It divides the fluid into discrete elements called particles, each of which represents a small amount of fluid mass. Each particle has an effective radius h and contributes to the fluid density everywhere within that radius; the contribution falls off with distance, so positions closer to the particle receive more density. The method consists of four main steps: neighbor finding, density calculation, force calculation, and integration.

First, for each particle we must find all neighboring particles within the effective radius. In principle every particle could contribute to every other particle, but to save computation, the algorithm only considers the particles close enough to make a significant contribution. To find the neighbors, our implementation hashes all particles into a virtual grid of blocks based on their positions. We then sort all particles by their virtual block index using a fast radix sort, so that all particles belonging to a block occupy a contiguous range in memory. We use pre-computed lookup tables to find the neighboring blocks. Initially, we used a z-order curve (as suggested by Goswami et al.) for indexing the virtual blocks because it is supposed to improve locality. However, we found that it did not actually improve performance, so we switched to an x-major indexing scheme that made it easier to determine which particles to share in the MPI version.
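The x-major block indexing can be sketched as follows. This is a minimal illustration with hypothetical names (`GridParams`, `cellIndex`), not our exact code; the real implementation also radix-sorts the particles by this index.

```cpp
#include <cmath>

// Hypothetical grid description: cubic cells of side cellSize (typically the
// effective radius h), in an nx * ny * nz grid.
struct GridParams {
    float cellSize;
    int nx, ny, nz;
};

// Map a position to its flattened x-major cell index. With x varying fastest,
// cells adjacent along x are adjacent in memory, and one machine's slab of the
// domain (a range of x) maps to a contiguous range of indices.
int cellIndex(const GridParams& g, float x, float y, float z) {
    int cx = (int)std::floor(x / g.cellSize);
    int cy = (int)std::floor(y / g.cellSize);
    int cz = (int)std::floor(z / g.cellSize);
    return cx + g.nx * (cy + g.ny * cz);
}
```

The contiguous-per-slab property is what made the MPI partitioning by x coordinate convenient.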

Next, we calculate the density of each particle. The density is the sum of a kernel function evaluated between the particle and each of its neighbors within the effective radius h. The kernel function determines how much a neighboring particle contributes based on its distance.
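As a sketch, the density sum might look like the following. The report does not name a specific kernel, so the poly6 kernel (common in interactive SPH, from Müller et al.) is an assumption here, as are the function and variable names.

```cpp
#include <cmath>
#include <vector>

const float kPi = 3.14159265358979f;

struct Vec3 { float x, y, z; };

// Poly6 kernel: nonzero only within the effective radius h, and larger for
// smaller distances (takes the squared distance r2 to avoid a sqrt).
float poly6(float r2, float h) {
    float h2 = h * h;
    if (r2 > h2) return 0.0f;  // outside the effective radius: no contribution
    float d = h2 - r2;
    return 315.0f / (64.0f * kPi * powf(h, 9)) * d * d * d;
}

// Density of particle i: kernel-weighted sum over its neighbor list
// (which includes i itself, since a particle contributes to its own density).
float density(const std::vector<Vec3>& pos, const std::vector<int>& neighbors,
              int i, float mass, float h) {
    float rho = 0.0f;
    for (int j : neighbors) {
        float dx = pos[i].x - pos[j].x;
        float dy = pos[i].y - pos[j].y;
        float dz = pos[i].z - pos[j].z;
        rho += mass * poly6(dx*dx + dy*dy + dz*dz, h);
    }
    return rho;
}
```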

Then, we compute the forces and the resulting acceleration for each particle. The pressure force evens out the density, the viscosity force makes particles move with similar velocities, and the gravity force pulls particles downward. The pressure and viscosity forces are both functions of the neighboring particles' densities, which is why the densities had to be calculated in a separate step.
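A hedged sketch of the force step for one particle follows, using the common spiky-gradient pressure term and Laplacian viscosity term (Müller et al.); the kernel choices and constants are assumptions, not taken from the report. Note how both terms read the neighbor densities computed in the previous pass.

```cpp
#include <cmath>
#include <vector>

const float kPi = 3.14159265358979f;

struct Vec3 { float x, y, z; };

// Spiky kernel gradient magnitude and viscosity Laplacian (valid for r <= h).
float spikyGrad(float r, float h) {
    float d = h - r;
    return -45.0f / (kPi * powf(h, 6)) * d * d;
}
float viscLap(float r, float h) {
    return 45.0f / (kPi * powf(h, 6)) * (h - r);
}

// Acceleration on particle i from pressure, viscosity, and gravity.
// rho and p are the per-particle densities and pressures from the density pass.
Vec3 acceleration(int i, const std::vector<Vec3>& pos, const std::vector<Vec3>& vel,
                  const std::vector<float>& rho, const std::vector<float>& p,
                  const std::vector<int>& nbrs, float mass, float h, float mu) {
    Vec3 a{0.0f, 0.0f, -9.8f};  // gravity
    for (int j : nbrs) {
        if (j == i) continue;
        float dx = pos[i].x - pos[j].x;
        float dy = pos[i].y - pos[j].y;
        float dz = pos[i].z - pos[j].z;
        float r = sqrtf(dx*dx + dy*dy + dz*dz);
        if (r >= h || r == 0.0f) continue;
        // Symmetrized pressure term; dividing by rho[i] converts force to
        // acceleration. Positive pressure pushes i away from j.
        float fp = -mass * (p[i] + p[j]) / (2.0f * rho[j]) * spikyGrad(r, h) / rho[i];
        a.x += fp * dx / r; a.y += fp * dy / r; a.z += fp * dz / r;
        // Viscosity drags i's velocity toward its neighbors' velocities.
        float fv = mu * mass * viscLap(r, h) / (rho[j] * rho[i]);
        a.x += fv * (vel[j].x - vel[i].x);
        a.y += fv * (vel[j].y - vel[i].y);
        a.z += fv * (vel[j].z - vel[i].z);
    }
    return a;
}
```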

We then integrate each particle's acceleration and velocity at each simulation step to find its new velocity and position. We initially used a basic Euler step for the integration, but when we found that the integration step took very little time, we switched to the more accurate Runge-Kutta 4 (RK4) method so that we could use a larger time step for the simulation.
Our CPU implementation is written in C++, the GPU version uses CUDA, and the multi-machine version uses MPI to run the CPU and GPU versions on multiple machines. All versions are rendered using GLUT/OpenGL. All work targeted the GHC 5205 cluster machines, which have NVIDIA GTX 480 GPUs and Intel Xeon CPUs.

In the CPU implementation, we serially perform all calculations on all particles each iteration. For the CUDA implementation, we launch separate kernels for sorting particles, calculating densities, calculating forces, and integration. In each of these, we map one CUDA thread to one particle in the system. Some papers suggest parallelizing across volume blocks for the density and force calculation steps for locality reasons. However, on average there were only about 4 particles per block, with a maximum of 10 particles per block. Moreover, the majority of blocks are empty, so we found it more efficient to parallelize across particles rather than blocks.
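With one thread per particle, each kernel launch needs enough blocks to cover all particles, i.e. the ceiling of N over the threads per block (256 here is an assumed value, not necessarily the one we used):

```cpp
// Ceiling division: number of CUDA blocks needed so that
// numBlocks * threadsPerBlock >= numParticles. Each thread then checks
// its particle index against numParticles before doing any work.
int numBlocks(int numParticles, int threadsPerBlock) {
    return (numParticles + threadsPerBlock - 1) / threadsPerBlock;
}
```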

For MPI, we divide the particles evenly between the available machines based on their x coordinates. Originally, we used MPI_Allgather to share all the results from the calculations between all the machines after each step of the algorithm. This was very expensive bandwidth-wise and resulted in very poor performance. We then calculated exactly which particles need to be shared between neighboring partitions and sent only those.
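The selection of particles to exchange can be sketched as below: only particles within the effective radius h of a partition boundary can affect the neighboring machine, so only those need to be sent. The names here (`haloIndices`) are illustrative, not from our code.

```cpp
#include <vector>

// Indices of particles that must be sent to the neighbor on the +x side:
// those whose x coordinate lies within h of the partition boundary.
std::vector<int> haloIndices(const std::vector<float>& xs,
                             float boundaryX, float h) {
    std::vector<int> out;
    for (int i = 0; i < (int)xs.size(); ++i) {
        if (xs[i] > boundaryX - h && xs[i] <= boundaryX)
            out.push_back(i);
    }
    return out;
}
```

A symmetric selection handles the -x neighbor; everything further from the boundary stays local.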

For the boundary, we tried a few different techniques, such as adding density based on a particle's distance from the wall, and applying a force based on that distance. But we found that these techniques did not make a noticeable difference compared to simply bouncing particles off the walls.

At first we rendered each particle using glutSolidSphere. While this looked pretty nice, we found that the rendering step was taking more time than any other step, so we did several things to reduce the rendering time. First, we capped our render rate at 30 FPS while allowing the physics simulation to run at full speed. Second, we changed our rendering system to use GL_POINTS instead. And lastly, in the MPI implementation we transferred all position and velocity data to the root process and calculated colors for visualization only when needed for rendering.
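Decoupling the render rate from the simulation rate can be as simple as the following sketch (an illustration, not our exact code): the physics loop runs every iteration, and a frame is drawn only when at least 1/30 of a second has passed since the last one.

```cpp
#include <chrono>

// Returns true (and resets the timer) when at least one 30 FPS frame
// interval has elapsed since the last render.
bool shouldRender(std::chrono::steady_clock::time_point& lastRender) {
    auto now = std::chrono::steady_clock::now();
    if (now - lastRender >= std::chrono::duration<double>(1.0 / 30.0)) {
        lastRender = now;
        return true;
    }
    return false;
}
```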

We implemented several visualization schemes. One colors the particles using HSV hue values based on their velocities; in the CUDA version this calculation is done in parallel by a separate kernel. A second, similar scheme varies the color from blue to white based on velocity to give the appearance of water. The final scheme colors each particle with a hue based on which machine is responsible for its calculations, so we could visualize the work division between machines in the MPI version.
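The hue-based scheme can be sketched as below. The slow-blue-to-fast-red mapping and the `maxSpeed` normalization are assumptions for illustration; the same per-particle math is what the separate CUDA kernel would run.

```cpp
#include <algorithm>
#include <cmath>

struct Color { float r, g, b; };

// Standard HSV-to-RGB conversion; h in degrees [0, 360), s and v in [0, 1].
Color hsvToRgb(float h, float s, float v) {
    float c = v * s;
    float x = c * (1.0f - std::fabs(std::fmod(h / 60.0f, 2.0f) - 1.0f));
    float m = v - c;
    float r = 0, g = 0, b = 0;
    if      (h <  60) { r = c; g = x; }
    else if (h < 120) { r = x; g = c; }
    else if (h < 180) { g = c; b = x; }
    else if (h < 240) { g = x; b = c; }
    else if (h < 300) { r = x; b = c; }
    else              { r = c; b = x; }
    return {r + m, g + m, b + m};
}

// Map speed to hue: 240 degrees (blue) when still, 0 degrees (red) at maxSpeed.
Color speedToColor(float speed, float maxSpeed) {
    float t = std::min(speed / maxSpeed, 1.0f);
    return hsvToRgb(240.0f * (1.0f - t), 1.0f, 1.0f);
}
```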

Interactivity was implemented by injecting velocities into the particles close to where the mouse was clicked. Left clicking and dragging allows the user to 'push' particles around, while right clicking creates a 'fountain' where the user clicked. The 3D click location is determined from the 2D mouse position by 'unprojecting' into the scene using the depth value looked up in the depth buffer. The CUDA version uses a kernel to test whether each particle is within range of the click and update the velocity buffer accordingly.
For the CUDA MPI version, we found an almost linear slowdown from running on multiple computers. This happens because the GPUs are bandwidth bound: any time data is shared there are 3 copies (GPU 1 to computer 1, computer 1 to computer 2, and computer 2 to GPU 2). For the CPU version, we achieved a sublinear speedup, because the CPU version is computation bound. In our search for multi-GPU SPH implementations, we found one that hid latency by first computing the forces on the particles that need to be shared, then transferring that data while computing the rest. We did not try this method, and their implementation appeared to target multiple GPUs connected to the same computer.

Our timings were taken using the CycleTimer from Assignment 2 to measure the time spent in each phase of each iteration. It uses clock tick counts to calculate elapsed time. All speedups are relative to the serial CPU implementation.

Breakdown of the time spent per iteration on each phase of the computation. Horizontal axis is the time in milliseconds, and vertical axis is the number of particles. Traditionally, neighbor finding is one of the more expensive parts of grid-based particle simulations, but we found that the force calculation took the most time by far, with the density calculation taking the second most time.
Speedup of the CUDA implementation over the single CPU implementation. Horizontal axis is the number of particles, and vertical axis is the speedup.
MPI speedup of the CPU implementation. Horizontal axis is the number of particles, and vertical axis is the speedup.

100k Particle simulation with hue-based coloring. Download
100k Particle simulation with 4 computers colored based on computer. Download
100k Particle simulation with white-blue coloring. Shows interactive aspects such as user injecting velocity into particles and moving the camera. Download
Interactive SPH Simulation and Rendering on the GPU (Goswami et al)
GPU-based Real-Time Snow Avalanche Simulations (Krog)
Work done by each student
Both students were involved with every part of the project for both programming and reports. We continuously checked our thought processes and algorithms with each other and reviewed each other’s code.