79 changes: 73 additions & 6 deletions README.md
@@ -5,17 +5,84 @@ CUDA Rasterizer

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Yichen Shou
* [LinkedIn](https://www.linkedin.com/in/yichen-shou-68023455/), [personal website](http://www.yichenshou.com/)
* Tested on: Windows 10, i7-2600KU @ 3.40GHz 16GB RAM, NVIDIA GeForce GTX 660Ti 8GB (Personal Desktop)

### (TODO: Your README)
## Project Overview

*DO NOT* leave the README to the last minute! It is a crucial part of the
project, and we will not be able to grade you without a good README.
This project implements a rasterizer on the GPU using OpenGL and CUDA. Various stages of the graphics pipeline, such as vertex assembly, primitive assembly, rasterization, and fragment shading, run on the GPU to speed up rendering.

Diffuse Cow | Cow with Normals
:-------------------------:|:-------------------------:
![](renders/CowRegular.PNG) | ![](renders/CowNormal.PNG)

Diffuse Cube | Cube with Normals
:-------------------------:|:-------------------------:
![](renders/CubeRegular.PNG) | ![](renders/CubeNormal.PNG)

[Video Demo](https://youtu.be/kjadBVjjtRc)

## Tile-based Rendering

This project also implements tile-based rendering to further reduce computation time. Tile-based rendering divides the viewport into smaller "tiles" that are rendered individually and then merged back into the full image. Since each tile is relatively small, the GPU can take full advantage of cache locality and on-chip memory to drastically reduce memory access time and thus speed up rendering. This technique is widely used in GPUs, and it is especially important for mobile GPUs, which don't have as much global memory or speed to throw around as desktop GPUs.

For example, here's what the underlying tile system looks like with 16x16 pixel tiles:
![](renders/tiles16x16.PNG)
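
As a minimal sketch of the tile arithmetic (not the project's actual code; `TileGrid` and its members are hypothetical names chosen for illustration), the viewport can be split into square tiles and each pixel mapped to a row-major tile index like this:

```cpp
// Hypothetical helper: describes how a viewport is split into square tiles.
struct TileGrid {
    int width;     // viewport width in pixels
    int height;    // viewport height in pixels
    int tileSize;  // tile edge length in pixels

    // Number of tile columns and rows, rounding up so that partial tiles
    // along the right and bottom edges are still counted.
    int tilesX() const { return (width + tileSize - 1) / tileSize; }
    int tilesY() const { return (height + tileSize - 1) / tileSize; }
    int tileCount() const { return tilesX() * tilesY(); }

    // Row-major index of the tile that contains pixel (x, y).
    int tileIndex(int x, int y) const {
        return (y / tileSize) * tilesX() + (x / tileSize);
    }
};
```

With the 800x800 window from `main.hpp` and 16x16 tiles, this gives a 50x50 grid of 2500 tiles.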

There are many ways of implementing tile-based rendering on the GPU; I implemented two. The first parallelizes over tiles: a single kernel is launched, and each thread handles one tile, looping through every triangle in that tile. The second parallelizes over triangles: a kernel is launched once per tile, and each thread handles one triangle in that tile. The second method uses shared memory on the GPU to speed up computation.
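
The difference between the two strategies comes down to how work is mapped to threads. The CPU-side sketch below illustrates that mapping under some assumptions: triangles are assigned to tiles by their screen-space bounding boxes (the actual binning in this project may differ), and `Box`, `binTriangles`, `threadsTileParallel`, and `threadsPerLaunchTriangleParallel` are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical screen-space bounding box of a triangle, in pixels (inclusive).
struct Box { int minX, minY, maxX, maxY; };

// Assign each triangle to every tile its bounding box overlaps.
// Returns one list of triangle indices per tile, in row-major tile order.
std::vector<std::vector<int>> binTriangles(const std::vector<Box>& boxes,
                                           int width, int height, int tileSize) {
    const int tilesX = (width + tileSize - 1) / tileSize;
    const int tilesY = (height + tileSize - 1) / tileSize;
    std::vector<std::vector<int>> bins(tilesX * tilesY);
    for (int t = 0; t < static_cast<int>(boxes.size()); ++t) {
        const Box& b = boxes[t];
        // Skip triangles that lie entirely outside the viewport.
        if (b.maxX < 0 || b.maxY < 0 || b.minX >= width || b.minY >= height) continue;
        // Clamp to the viewport, then convert pixel bounds to tile bounds.
        const int x0 = std::max(b.minX, 0) / tileSize;
        const int y0 = std::max(b.minY, 0) / tileSize;
        const int x1 = std::min(b.maxX, width - 1) / tileSize;
        const int y1 = std::min(b.maxY, height - 1) / tileSize;
        for (int ty = y0; ty <= y1; ++ty)
            for (int tx = x0; tx <= x1; ++tx)
                bins[ty * tilesX + tx].push_back(t);
    }
    return bins;
}

// Tile-parallel: a single launch with one thread per tile;
// each thread would loop over the triangles in its own bin.
int threadsTileParallel(const std::vector<std::vector<int>>& bins) {
    return static_cast<int>(bins.size());
}

// Triangle-parallel: one launch per tile, with one thread per triangle in
// that tile's bin. Returns the thread count of each launch.
std::vector<int> threadsPerLaunchTriangleParallel(
        const std::vector<std::vector<int>>& bins) {
    std::vector<int> counts;
    counts.reserve(bins.size());
    for (const auto& bin : bins) counts.push_back(static_cast<int>(bin.size()));
    return counts;
}
```

Note that in the triangle-parallel mapping, a tile touched by only one triangle results in a launch with a single thread, so the number of launches grows with the tile count even when little work is available per launch.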

## Performance Test

First I compared the FPS of all 3 methods (regular, tile-parallelized, triangle-parallelized) on a single triangle, rendered at two different distances from the camera.

### Single Triangle

Far Triangle | Close Triangle
:-------------------------:|:-------------------------:
![](renders/TriangleFar.PNG) | ![](renders/TriangleClose.PNG)

![](renders/TriangleChart.PNG)

While all 3 methods render the far-away triangle well, only the tile-parallelized method did not suffer a huge slowdown when the triangle is close. This is likely because the regular and triangle-parallelized methods parallelize over triangles, while the tile-parallelized method parallelizes over tiles. Even though the triangle takes up a lot of screen space, the regular method still uses only 1 thread to process everything, while the tile-parallelized method uses as many threads as there are tiles overlapped by the triangle. I'm unclear why the triangle-parallelized method is just as slow as the regular method, though, since it parallelizes over the triangles in each tile and so should be just as fast as the tile-parallelized method, if not faster.

### Cube

Far Cube | Close Cube
:-------------------------:|:-------------------------:
![](renders/CubeFar.PNG) | ![](renders/CubeClose.PNG)

![](renders/CubeChart.PNG)

Next I performed the same test on a simple six-sided cube (12 triangles). The results are pretty much the same as in the last test. When individual triangles take up a lot of screen space, the tile-parallelized method always triumphs.

### Cow

Far Cow | Close Cow
:-----------------------:|:-----------------------:
![](renders/CowFar.PNG) | ![](renders/CowClose.PNG)

![](renders/CowChart.PNG)

The final test is done on the cow model, and here the tile-parallelized method starts to show its weakness. When the model is far and triangles are squished into a small number of tiles, the tile-parallelized method is significantly slower than the regular method, which launches a thread for every triangle. The performance is a little better when the cow is closer, but still not enough.

The tests clearly reveal that regular rendering is great for scenes with a large number of triangles taking up screen space. When models are up close and a small number of triangles take up the whole screen, tile-based rendering is better. Perhaps a heuristic could be used at the beginning of every frame to determine which rendering method would be better.
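
One possible shape for such a heuristic is sketched below. This is purely an illustration and not something implemented or measured in this project: it assumes per-triangle screen-space bounding boxes are available at the start of the frame, and the names `TriBounds` and `chooseRenderMode` as well as the 5% coverage threshold are arbitrary assumptions that would need tuning. The return values mirror `renderMode` in `main.hpp` (0 for regular, 1 for tile-parallelized).

```cpp
#include <vector>

// Hypothetical screen-space bounding box of a triangle, in pixels (inclusive).
struct TriBounds { int minX, minY, maxX, maxY; };

// Picks a render mode from the mean fraction of the screen that each
// triangle's bounding box covers. Returns 0 (regular) when triangles are
// small on average, and 1 (tile-parallelized) when they are large.
// The default threshold is an untested guess, not a measured value.
int chooseRenderMode(const std::vector<TriBounds>& tris, int width, int height,
                     double coverageThreshold = 0.05) {
    if (tris.empty()) return 0;
    const double screenArea = static_cast<double>(width) * height;
    double totalFraction = 0.0;
    for (const TriBounds& b : tris) {
        const double w = b.maxX - b.minX + 1;
        const double h = b.maxY - b.minY + 1;
        totalFraction += (w * h) / screenArea;
    }
    const double meanFraction = totalFraction / static_cast<double>(tris.size());
    return meanFraction > coverageThreshold ? 1 : 0;
}
```

Bounding-box area overestimates the true coverage of thin triangles, so a real heuristic would likely need a better cost estimate than this one.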

The triangle-parallelized method is supposed to be the best of both worlds, seeing how it launches a thread per triangle per tile, but it performs rather poorly in all cases. I think there might be something wrong with my implementation.

### Tile Size

![](renders/TileSizeChart.PNG)

Lastly I compared the render time per frame (averaged over 100 frames) for different tile sizes on the close cube render. Lower is better in the graph. Without a doubt, the smaller tiles won, because smaller tiles mean more tiles, which means more threads. I do think that smaller tile sizes would eventually run into trouble when memory or overhead is more limited, but it wasn't a problem at all with my 16 GB of GPU RAM.
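
The 100-frame averaging can be sketched in isolation as below. This is a simplified stand-in for the timing code in `main.cpp`, with hypothetical names (`FrameTimeAverager`, `timeMs`). One caveat when timing from the host: CUDA kernel launches are asynchronous, so a host-side clock only measures GPU work if the timed call synchronizes with the device before returning.

```cpp
#include <chrono>

// Accumulates per-frame times and reports the mean once per window.
class FrameTimeAverager {
public:
    explicit FrameTimeAverager(int window) : window_(window) {}

    // Adds one frame time in milliseconds. Once `window` samples have been
    // collected, writes their mean to avgOut, resets, and returns true.
    bool add(double ms, double& avgOut) {
        sum_ += ms;
        if (++count_ < window_) return false;
        avgOut = sum_ / window_;
        sum_ = 0.0;
        count_ = 0;
        return true;
    }

private:
    int window_;
    int count_ = 0;
    double sum_ = 0.0;
};

// Measures the wall-clock time of a callable in milliseconds.
// steady_clock is used because it never jumps backwards.
template <typename F>
double timeMs(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

In a render loop, each frame would call something like `averager.add(timeMs(renderFrame), avg)` and print `avg` whenever it returns true, where `renderFrame` stands for whatever callable wraps the rasterizer call.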

### Credits

* [tinygltfloader](https://github.com/syoyo/tinygltfloader) by [@syoyo](https://github.com/syoyo)
* [glTF Sample Models](https://github.com/KhronosGroup/glTF/blob/master/sampleModels/README.md)
* CIS 565 class slides

### Bloopers

When you mess up near/far planes and accidentally summon the cow of nightmares
![](renders/nightmareCow.gif)
Binary file added renders/CowChart.PNG
Binary file added renders/CowClose.PNG
Binary file added renders/CowFar.PNG
Binary file added renders/CowNormal.PNG
Binary file added renders/CowRegular.PNG
Binary file added renders/CubeChart.PNG
Binary file added renders/CubeClose.PNG
Binary file added renders/CubeNormal.PNG
Binary file added renders/CubeRegular.PNG
Binary file added renders/TileSizeChart.PNG
Binary file added renders/TriangleChart.PNG
Binary file added renders/TriangleClose.PNG
Binary file added renders/TriangleFar.PNG
Binary file added renders/cubeFar.PNG
Binary file added renders/nightmareCow.gif
Binary file added renders/tiles.PNG
Binary file added renders/tiles16x16.PNG
2 changes: 1 addition & 1 deletion src/CMakeLists.txt
@@ -6,5 +6,5 @@ set(SOURCE_FILES

cuda_add_library(src
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_30
)
30 changes: 24 additions & 6 deletions src/main.cpp
@@ -9,6 +9,7 @@


#include "main.hpp"
#include <chrono>

#define STB_IMAGE_IMPLEMENTATION
#define TINYGLTF_LOADER_IMPLEMENTATION
@@ -50,10 +51,11 @@ int main(int argc, char **argv) {
}


frame = 0;
frame = 1;
seconds = time(NULL);
fpstracker = 0;


// Launch CUDA/GL
if (init(scene)) {
// GLFW main loop
@@ -97,7 +99,7 @@ void mainLoop() {
//---------RUNTIME STUFF---------
//-------------------------------
float scale = 1.0f;
float x_trans = 0.0f, y_trans = 0.0f, z_trans = -10.0f;
float x_trans = 0.0f, y_trans = 0.0f, z_trans = -1.0f;
float x_angle = 0.0f, y_angle = 0.0f;
void runCuda() {
// Map OpenGL buffer object for writing from CUDA on a single GPU
@@ -120,8 +122,22 @@ void runCuda() {
glm::mat4 MVP = P * MV;

cudaGLMapBufferObject((void **)&dptr, pbo);
rasterize(dptr, MVP, MV, MV_normal);
cudaGLUnmapBufferObject(pbo);

std::chrono::high_resolution_clock::time_point startTime = std::chrono::high_resolution_clock::now();

rasterize(dptr, MVP, MV, MV_normal, renderMode);

std::chrono::high_resolution_clock::time_point endTime = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> duro = endTime - startTime;
float elapsedTime = static_cast<decltype(elapsedTime)>(duro.count());

avgFrameTime += elapsedTime;
if (frame % 100 == 0) {
printf("100 frames took avg %f milliseconds\n", avgFrameTime / 100);
avgFrameTime = 0;
}

cudaGLUnmapBufferObject(pbo);

frame++;
fpstracker++;
@@ -183,6 +199,8 @@ bool init(const tinygltf::Scene & scene) {

rasterizeSetBuffers(scene);

rasterizeSetTileBuffers();

GLuint passthroughProgram;
passthroughProgram = initShader();

@@ -214,7 +232,7 @@ void initCuda() {
// Use device with highest Gflops/s
cudaGLSetGLDevice(0);

rasterizeInit(width, height);
rasterizeInit(width, height, tilePixelSize);

// Clean up on program exit
atexit(cleanupCuda);
@@ -395,6 +413,6 @@ void mouseMotionCallback(GLFWwindow* window, double xpos, double ypos)

void mouseWheelCallback(GLFWwindow* window, double xoffset, double yoffset)
{
const double s = 1.0; // sensitivity
const double s = 0.1; // sensitivity
z_trans += (float)(s * yoffset);
}
5 changes: 5 additions & 0 deletions src/main.hpp
@@ -50,6 +50,11 @@ GLFWwindow *window;
int width = 800;
int height = 800;

int renderMode = 1; // 0 for regular, 1 for tile-based rendering parallelizing over tiles, 2 for tile-based rendering parallelizing over primitives
int tilePixelSize = 8;

float avgFrameTime;

//-------------------------------
//-------------MAIN--------------
//-------------------------------