Due: Friday, April 28th by 11:59 PM

Update 4/25: Extended the due date; mentioned incorrect comment regarding cudaMalloc

Update 4/25: Mention the possibility of a one-dimensional grid

Getting Started

Download CS365_Assign04.zip. Unzip it on the cluster head node (or your local machine, if you are running Linux and have CUDA and libui installed).

If you are developing on the cluster, you will need to logged onto bambleweeny:

ssh -X bambleweeny

To compile the sequential and parallel versions of the programs, run the command

make

To run the programs:

./nbody_seq

./nbody_par

Note that the nbody_par program can take command line arguments:

The -p argument specifies the number of bodies (“particles”) to create
The -f argument specifies the number of animation frames to display

For example, to run with 2000 bodies, displaying 200 animation frames, the command would be

./nbody_par -p 2000 -f 200

If you want to compile the program using Linux on your own computer, you will to build libui. You will also need to have CUDA installed.

Your task

Your task is similar to the one in Assignment 3: you will parallelize an N-Body simulation.

The difference is that in this version, you will use CUDA to offload the computation onto the GPU.

You will make your code changes in sim_par.cu.

Hints and specifications

Look at the TODO comments in sim_par.cu. They indicate where you will need to add or modify code to execute the simulation on the GPU.

Note that the data representation has changed. Rather than the main data structure being an array of Particle objects, the main data structure is now arrays called x, y, dx, dy, mass, and color. A single particle is represented by the elements at a common index in these arrays. See particle.cu and sim_seq.cu to see how the sequential computation works.

The SimulationData struct type is used to group pointers to all of the arrays into a single object. The parallel computation (in sim_par.cu) will have two of these: one for the host (CPU), and one for the device (GPU). Note that you must use cudaMalloc to allocate device buffers.

The cudaMalloc function is called as

cudaMalloc((void**) &ptr, numBytes);

where ptr is a pointer variable where the address of the allocated device buffer should be stored, and numBytes is the number of bytes to allocate. In the context of the assignment, one of your calls might look something like

cudaMalloc((void**) &sim->pd_dev.x, sizeof(float) * sim->num_particles);

Note that the skeleton code included the following incorrect comment:

// TODO: allocate device buffers, e.g.
//    sim->pd_dev.x = cudaMalloc(...)

Make sure you call cudaMalloc as described above, not as described by this comment.

You will need to use the cudaMemcpy function to copy data between the host and device buffers. Specifically:

Data used by the kernel function must be copied from the host buffers to the device buffers before the kernel call
Data produced by the kernel function must be copied from the device buffers to the host buffers after the kernel call

Your kernel function should take pointers to the 5 arrays x, y, dx, dy, and mass. It will also need to know how many particles (bodies) are being simulated. Your kernel function call should look something like the following:

kernel<<<grid, THREADS_PER_BLOCK>>>(sim->pd_dev.x,
                                    sim->pd_dev.y,
                                    sim->pd_dev.dx,
                                    sim->pd_dev.dy,
                                    sim->pd_dev.mass,
                                    sim->num_particles);

Note that the grid parameter is a variable of type dim3. It is possible to have a one-dimensional grid, e.g.

dim3 grid(num_particles);

If you use a one-dimensional grid, only threadIdx.x (and not threadIdx.y) will vary in the calls to your kernel function.

Note that the functions particle_dist, particle_force, and particle_compute_attraction are defined to be available on both the CPU and GPU. So, your kernel and device functions can call these functions as required to do the computation.

Deliverables

There are two deliverables: a report and the code.

The report should be a text file that indicates how many particles (bodies) you were able to simulate without missing significant numbers of animation frames. (Consider “significant” to mean 5 or more frames.) The report should also estimate how much greater the throughput of your CUDA program is than the threaded program you implemented in Assignment 3. Note that the complexity of the computation grows with the square of the number of bodies being simulated.

Include the report in a text file called report.txt in your assignment submission.

Grading

The grading is broken down as follows:

sim_create (allocate device buffers, etc.): 10%
sim_destroy (deallocate device buffers): 10%
sim_tick data transfer: 10%
sim_tick kernel function call: 10%
Kernel function, computation: 50%
Report: 10%

Submitting

Run the command

make submit

Type your Marmoset username and password when prompted.