diff --git a/README.md b/README.md
index 41b91f0..6485c52 100644
--- a/README.md
+++ b/README.md
@@ -1,18 +1,112 @@
 CUDA Rasterizer
 ===============
 
-[CLICK ME FOR INSTRUCTION OF THIS PROJECT](./INSTRUCTION.md)
-
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-	* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Jie Meng
+  * [LinkedIn](https://www.linkedin.com/in/jie-meng/), [YouTube](https://www.youtube.com/channel/UC7G8fUcQrrI_1YnXY5sQM6A).
+* Tested on: Windows 10, i7-7700HQ @ 2.80GHz, 16GB, GTX 1050 4GB (Personal Laptop)
+
+Demo renders
+================
+
+![](renders/11.gif)
+
+
+Rasterization
+==================
+_Rasterization_ [[wiki]](https://en.wikipedia.org/wiki/Rasterisation) is a render method in computer graphics, in essence, the process of showing 3D scenes containing objects are:
+
+ - Project 3d objects into 2D camera plane
+ - Given the 2D plane (with fixed resolution) and the projections of 3D shapes, fill the pixels that are inside those projected shapes.
+
+ The process can be visualized as the following images:
+
+Project | Fill pixels
+------------|---------------
+![](renders/ras1.jpg) | ![](renders/ras2.png)
+
+With the power of GPU and CUDA, we could achieve great performance by multi-threading.
+
+
+Features
+==================
+
+### Graphics Features
+#### Basic rasterization pipeline: Vertex transformation, Primitive Assembly, Rasterization, Fragment shading
+
+![](renders/truck.gif)
+
+ - Multiple Primitive types: Triangles, Wire frame, Points
+
+ wire frame | points
+------------|---------------
+![](renders/cow1.gif) | ![](renders/cow2.gif)
+
+* The way I did multiple primitive types is a simple way: instead of creating new **PrimitiveAssembly** kernels, I still treat primitives as triangles, but only **fill** the pixels on **edges** or **vertices**
+
+
+#### UV texture mapping with bilinear texture filtering and perspective correct texture coordinates
+
+No bilinear filtering | Bilinear filtering
+-------------------------- | ----------------------------
+![](renders/no_bilinear_filter.jpg) | ![](renders/bilinear_filter.jpg)
+
+
+* Bilinear texture mapping [[wiki]](https://en.wikipedia.org/wiki/Bilinear_filtering) is a way to avoid distortion when doing texture mapping. Given a uv coordinate, we use the nearest four texels to interpolate its color value.
+
+#### Automatic Rotation: a dumb way to use shared memory to store a uniform rotation matrix for all vertices, such that the model is rotating by itself.
+
+#### Post-processing: implemented bloom effect
+
+ Without bloom | With bloom
+------------|---------------
+![](renders/duck.gif) | ![](renders/duckbloom.gif)
+
+ * bloom effect [[wiki]](https://en.wikipedia.org/wiki/Bloom_(shader_effect)) is a widely-used post processing method to present high-brightness areas. Shared memory is used to store pixel values that a block needed for perform gaussian blur.
+
+### GUI controls
+- Number keys to switch primitive types: triangles(1) / wireframe(2) / points(3)
+- Key B to turn on/off bloom effect
+
+### Macros
+- `#define TEXTURE 1/0` : turn on(1)/off(0) to use diffuse texture
+- `#define PERSPECTIVE_INTERP 1/0` : turn on(1)/off(0) to perform perspective correct interpolation
+- `#define TEXTURE_BILINEAR 1/0` : turn on(1)/off(0) to use bilinear UV coordinates interpolation
+- `#define DOWNSCALERATIO 3/(1-10)` : change down scale ration in gaussian blur, can be any reasonable integer (1-10 recommended)
+
+Analysis
+=================
+
+### Basic render pipeline
+
+* Configuration: x64 Release mode, no bilinear texture, no bloom effect
+
+#### Absolute running time
+
+![](renders/perform1.png)
+
+From the chart: 
+- reastrization kernel and render kernel consumes the most part of running time
+- Initialize depth is basically one instruction per thread, so its running time can be viewed as a baseline
+- Vertex assembly is looping over vertices, so essentially less threads are launched comparing to depth initialization, resulting less running time. 
+- Send image to PBO stage is the same for both models, so the time spent is roughly the same
+- Since Cow has more primitives than duck model, its primitive assembly, rasterization and render stage all take relatively more time
+
+#### Running time percentages
+
+![](renders/perform2.png)
+
+The above chart shows the percentages of each stage's running time w.r.t. whole pipeline.
+- Not supurisingly, rasterization takes the most time
+- For small model like duck and cow, rasterization only takes a bit more time that render stage
+- For relatively more complex model like truck, rasterization stage can comsume most time of the entire pipeline
+
+#### Bloom effect
+
+In practice, it turns out that bloom effect has almost no effect on performances: no FPS dropping is happening. I believe it because bloom effect (or generally, post-processing) is performed simply on framebuffers, which should not take much time.
 
-### (TODO: Your README)
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
 
 
 ### Credits
diff --git a/renders/11.gif b/renders/11.gif
new file mode 100644
index 0000000..1ba7b57
Binary files /dev/null and b/renders/11.gif differ
diff --git a/renders/bilinear_filter.jpg b/renders/bilinear_filter.jpg
new file mode 100644
index 0000000..d2dd2e0
Binary files /dev/null and b/renders/bilinear_filter.jpg differ
diff --git a/renders/bloom.gif b/renders/bloom.gif
new file mode 100644
index 0000000..639cc24
Binary files /dev/null and b/renders/bloom.gif differ
diff --git a/renders/cow1.gif b/renders/cow1.gif
new file mode 100644
index 0000000..e24ab7b
Binary files /dev/null and b/renders/cow1.gif differ
diff --git a/renders/cow2.gif b/renders/cow2.gif
new file mode 100644
index 0000000..34c4a9c
Binary files /dev/null and b/renders/cow2.gif differ
diff --git a/renders/duck.gif b/renders/duck.gif
new file mode 100644
index 0000000..d6ff20c
Binary files /dev/null and b/renders/duck.gif differ
diff --git a/renders/duckbloom.gif b/renders/duckbloom.gif
new file mode 100644
index 0000000..cdc9680
Binary files /dev/null and b/renders/duckbloom.gif differ
diff --git a/renders/no_bilinear_filter.jpg b/renders/no_bilinear_filter.jpg
new file mode 100644
index 0000000..10eb6f8
Binary files /dev/null and b/renders/no_bilinear_filter.jpg differ
diff --git a/renders/perform1.png b/renders/perform1.png
new file mode 100644
index 0000000..9f2e503
Binary files /dev/null and b/renders/perform1.png differ
diff --git a/renders/perform2.png b/renders/perform2.png
new file mode 100644
index 0000000..8385d1f
Binary files /dev/null and b/renders/perform2.png differ
diff --git a/renders/ras1.jpg b/renders/ras1.jpg
new file mode 100644
index 0000000..17f8046
Binary files /dev/null and b/renders/ras1.jpg differ
diff --git a/renders/ras2.png b/renders/ras2.png
new file mode 100644
index 0000000..07cfcfd
Binary files /dev/null and b/renders/ras2.png differ
diff --git a/renders/truck.gif b/renders/truck.gif
new file mode 100644
index 0000000..c48d174
Binary files /dev/null and b/renders/truck.gif differ
diff --git a/src/main.cpp b/src/main.cpp
index 7986959..10771e9 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -52,6 +52,10 @@ int main(int argc, char **argv) {
 
     frame = 0;
     seconds = time(NULL);
+	
+	//to measure time performance
+	shaderTimer = time(NULL);
+
     fpstracker = 0;
 
     // Launch CUDA/GL
@@ -99,6 +103,16 @@ void mainLoop() {
 float scale = 1.0f;
 float x_trans = 0.0f, y_trans = 0.0f, z_trans = -10.0f;
 float x_angle = 0.0f, y_angle = 0.0f;
+
+//render mode toggle
+int renderMode = RenderMode::Triangle;
+// enable blomm effect
+bool bloomEffect = true;
+
+//for auto rotation
+float autoRotateAngle = 0.0f;
+float autoRotateSpeed = 0.3f;
+
 void runCuda() {
     // Map OpenGL buffer object for writing from CUDA on a single GPU
     // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer
@@ -119,8 +133,22 @@ void runCuda() {
 	glm::mat4 MV = V * M;
 	glm::mat4 MVP = P * MV;
 
+	double shaderTimerNew = glfwGetTime();
+	double deltaTime = shaderTimerNew - shaderTimer;
+	shaderTimer = shaderTimerNew;
+
+	//rotate the object
+	autoRotateAngle += autoRotateSpeed * deltaTime;
+	if (autoRotateAngle >= 360.0f)
+	{
+		autoRotateAngle = 0.0f;
+	}
+
+	//glm::mat4 autoRotateMat = glm::rotate(autoRotateAngle, glm::vec3(0.0f, 1.0f, 0.0f));
+	glm::mat4 autoRotateMat = glm::rotate(0.f, glm::vec3(0.0f, 1.0f, 0.0f));
+
     cudaGLMapBufferObject((void **)&dptr, pbo);
-	rasterize(dptr, MVP, MV, MV_normal);
+	rasterize(dptr, MVP, MV, MV_normal, renderMode, autoRotateMat, bloomEffect);
     cudaGLUnmapBufferObject(pbo);
 
     frame++;
@@ -189,6 +217,8 @@ bool init(const tinygltf::Scene & scene) {
     glUseProgram(passthroughProgram);
     glActiveTexture(GL_TEXTURE0);
 
+	shaderTimer = glfwGetTime();
+
     return true;
 }
 
@@ -325,9 +355,28 @@ void errorCallback(int error, const char *description) {
 }
 
 void keyCallback(GLFWwindow *window, int key, int scancode, int action, int mods) {
-    if (key == GLFW_KEY_ESCAPE && action == GLFW_PRESS) {
-        glfwSetWindowShouldClose(window, GL_TRUE);
+    if (action == GLFW_PRESS) {
+		switch (key)
+		{
+		case GLFW_KEY_ESCAPE:
+			glfwSetWindowShouldClose(window, GL_TRUE);
+			break;
+		case GLFW_KEY_1:
+			renderMode = RenderMode::Triangle;
+			break;
+		case GLFW_KEY_2:
+			renderMode = RenderMode::Wireframe;
+			break;
+		case GLFW_KEY_3:
+			renderMode = RenderMode::Points;
+			break;
+		case GLFW_KEY_B:
+			bloomEffect = !bloomEffect;
+			break;
+		}
+		
     }
+
 }
 
 //----------------------------
@@ -396,5 +445,9 @@ void mouseMotionCallback(GLFWwindow* window, double xpos, double ypos)
 void mouseWheelCallback(GLFWwindow* window, double xoffset, double yoffset)
 {
 	const double s = 1.0;	// sensitivity
+
+	if (yoffset > 0 && z_trans > -1.5f) {
+		return;
+	}
 	z_trans += (float)(s * yoffset);
 }
diff --git a/src/main.hpp b/src/main.hpp
index 4816fa1..b8fcdad 100644
--- a/src/main.hpp
+++ b/src/main.hpp
@@ -33,6 +33,13 @@ using namespace std;
 int frame;
 int fpstracker;
 double seconds;
+
+//TO measure time performance
+double shaderTimer;
+
+//render Mode
+enum RenderMode{Triangle, Wireframe, Points};
+
 int fps = 0;
 GLuint positionLocation = 0;
 GLuint texcoordsLocation = 1;
diff --git a/src/rasterize.cu b/src/rasterize.cu
index 1262a09..8581506 100644
--- a/src/rasterize.cu
+++ b/src/rasterize.cu
@@ -18,6 +18,12 @@
 #include <glm/gtc/quaternion.hpp>
 #include <glm/gtc/matrix_transform.hpp>
 
+#define TEXTURE 1
+#define PERSPECTIVE_INTERP 1 
+#define TEXTURE_BILINEAR 0
+
+#define DOWNSCALERATIO 3
+
 namespace {
 
 	typedef unsigned short VertexIndex;
@@ -34,6 +40,9 @@ namespace {
 		Triangle = 3
 	};
 
+	//render Mode
+	enum RenderMode { Triangles, Wireframe, Points };
+
 	struct VertexOut {
 		glm::vec4 pos;
 
@@ -43,11 +52,14 @@ namespace {
 
 		 glm::vec3 eyePos;	// eye space position used for shading
 		 glm::vec3 eyeNor;	// eye space normal used for shading, cuz normal will go wrong after perspective transformation
-		// glm::vec3 col;
+		
+		 glm::vec3 col;
+
 		 glm::vec2 texcoord0;
 		 TextureData* dev_diffuseTex = NULL;
 		// int texWidth, texHeight;
 		// ...
+		 int diffuseTexWidth, diffuseTexHeight;
 	};
 
 	struct Primitive {
@@ -62,10 +74,14 @@ namespace {
 		// The attributes listed below might be useful, 
 		// but always feel free to modify on your own
 
-		// glm::vec3 eyePos;	// eye space position used for shading
-		// glm::vec3 eyeNor;
+		 glm::vec3 eyePos;	// eye space position used for shading
+		 glm::vec3 eyeNor;
 		// VertexAttributeTexcoord texcoord0;
-		// TextureData* dev_diffuseTex;
+		TextureData* dev_diffuseTex;
+
+		int uvStart;
+		glm::vec2 uvBilinear;
+		int texWidth, texHeight;
 		// ...
 	};
 
@@ -111,20 +127,47 @@ static glm::vec3 *dev_framebuffer = NULL;
 
 static int * dev_depth = NULL;	// you might need this buffer when doing depth test
 
+//used for atomic depth writing
+static int * dev_mutex = NULL;
+//used for bloom effect post processing
+static glm::vec3 * dev_framebufferAux = NULL;
+static glm::vec3 * dev_framebufferDownScaled = NULL;
+static glm::vec3 * dev_framebufferDownScaledAux = NULL;
+
 /**
  * Kernel that writes the image to the OpenGL PBO directly.
  */
 __global__ 
-void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) {
+void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image,
+	//used for bloomEffect
+	int framebufferEdgeOffset,
+	int downScaleW,
+	int downScaleRatio
+)
+{
     int x = (blockIdx.x * blockDim.x) + threadIdx.x;
     int y = (blockIdx.y * blockDim.y) + threadIdx.y;
     int index = x + (y * w);
 
     if (x < w && y < h) {
+		//frame buffer id
+		int fid; 
+		if (downScaleRatio == 1)
+		{
+			fid = x + (y * w);
+		}
+		else
+		{
+			fid = (x / downScaleRatio) + framebufferEdgeOffset
+				+ ((y / downScaleRatio) + framebufferEdgeOffset) * (downScaleW + 2 * framebufferEdgeOffset);
+		}
         glm::vec3 color;
-        color.x = glm::clamp(image[index].x, 0.0f, 1.0f) * 255.0;
-        color.y = glm::clamp(image[index].y, 0.0f, 1.0f) * 255.0;
-        color.z = glm::clamp(image[index].z, 0.0f, 1.0f) * 255.0;
+        //color.x = glm::clamp(image[index].x, 0.0f, 1.0f) * 255.0;
+        //color.y = glm::clamp(image[index].y, 0.0f, 1.0f) * 255.0;
+        //color.z = glm::clamp(image[index].z, 0.0f, 1.0f) * 255.0;
+		color.x = glm::clamp(image[fid].x, 0.0f, 1.0f) * 255.0;
+		color.y = glm::clamp(image[fid].y, 0.0f, 1.0f) * 255.0;
+		color.z = glm::clamp(image[fid].z, 0.0f, 1.0f) * 255.0;
         // Each thread writes one pixel location in the texture (textel)
         pbo[index].w = 0;
         pbo[index].x = color.x;
@@ -133,23 +176,289 @@ void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) {
     }
 }
 
+
+__device__ glm::vec3 sampleFromTexture(TextureData* dev_diffuseTex, int uvStart) {
+	return glm::vec3(dev_diffuseTex[uvStart + 0],
+		dev_diffuseTex[uvStart + 1],
+		dev_diffuseTex[uvStart + 2]) / 255.f;
+}
+
+__device__ glm::vec3 sampleFromTextureBilinear(TextureData* dev_diffuseTex, glm::vec2 uv, int width, int height) {
+	// top right corner of grid cell
+	int u = uv.x;
+	int v = uv.y;
+
+	// bottom right corner of grid cell
+	int u1 = glm::clamp(u + 1, 0, width - 1);
+	int v1 = glm::clamp(v + 1, 0, height - 1);
+
+	// sample all 4 colors using indices to sample from texture
+	int id00 = (u + v * width) * 3;
+	int id01 = (u + v1 * width) * 3;
+	int id10 = (u1 + v * width) * 3;
+	int id11 = (u1 + v1 * width) * 3;
+	glm::vec3 c00 = sampleFromTexture(dev_diffuseTex, id00);
+	glm::vec3 c01 = sampleFromTexture(dev_diffuseTex, id01);
+	glm::vec3 c10 = sampleFromTexture(dev_diffuseTex, id10);
+	glm::vec3 c11 = sampleFromTexture(dev_diffuseTex, id11);
+
+	// lerp horizontally using x fraction
+	glm::vec3 lerp1 = glm::mix(c00, c01, (float)uv.x - (float)u);
+	glm::vec3 lerp2 = glm::mix(c10, c11, (float)uv.x - (float)u);
+
+	// return lerped color vertically using y fraction
+	return  glm::mix(lerp1, lerp2, (float)uv.y - (float)v);
+}
+
 /** 
 * Writes fragment colors to the framebuffer
 */
 __global__
-void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) {
+void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer, 
+	glm::vec3 lightPos, //for lighting calc
+	int renderMode, //switch rendermode
+	int framebufferEdgeOffset //bloom Effect
+	)
+{
     int x = (blockIdx.x * blockDim.x) + threadIdx.x;
     int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-    int index = x + (y * w);
 
-    if (x < w && y < h) {
-        framebuffer[index] = fragmentBuffer[index].color;
+	if (x < w && y < h) {
+		int index = x + (y * w);
 
-		// TODO: add your fragment shader code here
+		//framebuffer[index] = fragmentBuffer[index].color;
 
-    }
+		////// TODO: add your fragment shader code here
+
+		Fragment f = fragmentBuffer[index];
+		if (renderMode == RenderMode::Triangles)
+		{
+
+#if TEXTURE
+			if (f.dev_diffuseTex) {
+#if TEXTURE_BILINEAR
+				f.color = sampleFromTextureBilinear(f.dev_diffuseTex, f.uvBilinear, f.texWidth, f.texHeight);
+#else
+				f.color = sampleFromTexture(f.dev_diffuseTex, f.uvStart);
+#endif
+			}
+			else {
+				f.color = glm::vec3(0.f);
+			}
+#endif	
+			
+			glm::vec3 lightDir = glm::normalize(glm::vec3(0.5f, 0.2f, 0.7f) - f.eyePos);
+			//glm::vec3 lightDir = glm::normalize(lightPos - f.eyePos);
+			float lambert = glm::dot(lightDir, f.eyeNor);
+			float ambient = 0.3f;
+			float light_exp = 1.0f;
+			lambert = lambert * light_exp + ambient;
+			framebuffer[index] = f.color * lambert;
+
+		}
+		//for wireframe and points render mode, pass the color
+		else if (renderMode == RenderMode::Wireframe || renderMode == RenderMode::Points)
+		{
+			framebuffer[index] = f.color;
+		}
+	}
 }
 
+// bloom effect : horizontal Gaussian blur
+__global__
+void gaussianBlurHorizon(int w, int h, glm::vec3 * framebufferIn,
+	glm::vec3 * framebufferOut, int framebufferEdgeOffset)
+{
+	//using shared memory to store shared frame buffer
+	__shared__ glm::vec3 framebufferIn_shared[144]; //144 = 8 * (5 + 8 + 5)
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	if (x < w && y < h)
+	{
+		//framebuffer index
+		int fid = (x + framebufferEdgeOffset) + ((y + framebufferEdgeOffset) * (w + 2 * framebufferEdgeOffset));
+		framebufferOut[fid] = glm::vec3(0.0f);
+		//read framebufferIn into shared memory
+		//18 = 5 * 2 + 8
+		int index = threadIdx.y * 18 + threadIdx.x + 5;
+		if (threadIdx.x == 0)
+		{
+			framebufferIn_shared[index - 5] = framebufferIn[fid - 5];
+			framebufferIn_shared[index - 4] = framebufferIn[fid - 4];
+			framebufferIn_shared[index - 3] = framebufferIn[fid - 3];
+			framebufferIn_shared[index - 2] = framebufferIn[fid - 2];
+			framebufferIn_shared[index - 1] = framebufferIn[fid - 1];
+		}
+		if (threadIdx.x == blockDim.x - 1)
+		{
+			framebufferIn_shared[index + 5] = framebufferIn[fid + 5];
+			framebufferIn_shared[index + 4] = framebufferIn[fid + 4];
+			framebufferIn_shared[index + 3] = framebufferIn[fid + 3];
+			framebufferIn_shared[index + 2] = framebufferIn[fid + 2];
+			framebufferIn_shared[index + 1] = framebufferIn[fid + 1];
+		}
+		framebufferIn_shared[index] = framebufferIn[fid];
+
+		__syncthreads();
+
+		//apply horizontal gaussian blur and write to output
+		framebufferOut[fid] += framebufferIn_shared[index - 5] * 0.0093f;
+		framebufferOut[fid] += framebufferIn_shared[index - 5] * 0.0093f;
+		framebufferOut[fid] += framebufferIn_shared[index - 4] * 0.028002f;
+		framebufferOut[fid] += framebufferIn_shared[index - 3] * 0.065984f;
+		framebufferOut[fid] += framebufferIn_shared[index - 2] * 0.121703f;
+		framebufferOut[fid] += framebufferIn_shared[index - 1] * 0.175713f;
+		framebufferOut[fid] += framebufferIn_shared[index] * 0.198596f;
+		framebufferOut[fid] += framebufferIn_shared[index + 1] * 0.175713f;
+		framebufferOut[fid] += framebufferIn_shared[index + 2] * 0.121703f;
+		framebufferOut[fid] += framebufferIn_shared[index + 3] * 0.065984f;
+		framebufferOut[fid] += framebufferIn_shared[index + 4] * 0.028002f;
+		framebufferOut[fid] += framebufferIn_shared[index + 5] * 0.0093f;
+	}
+}
+
+//bloom effect : vertical gussian blur
+__global__
+void gaussianBlurVert(int w, int h, glm::vec3 * framebufferIn,
+	glm::vec3 * framebufferOut, int framebufferEdgeOffset)
+{
+	//Same as horizontal
+
+    //array size should be blocksize.x * (framebufferEdgeOffset + blocksize.y + framebufferEdgeOffset)
+    // 8 * (5 + 8 + 5) -> 144
+	__shared__ glm::vec3 framebuffer_in_shared[144];
+
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	if (x < w && y < h) {
+		//framebuffer index
+		int fid = (x + framebufferEdgeOffset) + ((y + framebufferEdgeOffset) * (w + 2 * framebufferEdgeOffset));
+		framebufferOut[fid] = glm::vec3(0.f);
+
+		int numOfelementsOneRow = w + 2 * framebufferEdgeOffset;
+
+		// blocksize.x
+		// 8
+		int index = (threadIdx.y + 5) * 8 + threadIdx.x;
+		//write into shared memory
+		if (threadIdx.y == 0) {
+			// 40, 32, 24... -> 5 * blocksize.x          
+			framebuffer_in_shared[index - 40] = framebufferIn[fid - 5 * numOfelementsOneRow];
+			framebuffer_in_shared[index - 32] = framebufferIn[fid - 4 * numOfelementsOneRow];
+			framebuffer_in_shared[index - 24] = framebufferIn[fid - 3 * numOfelementsOneRow];
+			framebuffer_in_shared[index - 16] = framebufferIn[fid - 2 * numOfelementsOneRow];
+			framebuffer_in_shared[index - 8] = framebufferIn[fid - 1 * numOfelementsOneRow];
+		}
+		if (threadIdx.y == blockDim.y - 1) {
+			framebuffer_in_shared[index + 8] = framebufferIn[fid + 1 * numOfelementsOneRow];
+			framebuffer_in_shared[index + 16] = framebufferIn[fid + 2 * numOfelementsOneRow];
+			framebuffer_in_shared[index + 24] = framebufferIn[fid + 3 * numOfelementsOneRow];
+			framebuffer_in_shared[index + 32] = framebufferIn[fid + 4 * numOfelementsOneRow];
+			framebuffer_in_shared[index + 40] = framebufferIn[fid + 5 * numOfelementsOneRow];
+		}
+		framebuffer_in_shared[index] = framebufferIn[fid];
+
+		__syncthreads();
+
+		//apply vertical gaussian blur, write into output frames
+		framebufferOut[fid] += framebuffer_in_shared[index - 40] * 0.0093f;
+		framebufferOut[fid] += framebuffer_in_shared[index - 32] * 0.028002f;
+		framebufferOut[fid] += framebuffer_in_shared[index - 24] * 0.065984f;
+		framebufferOut[fid] += framebuffer_in_shared[index - 16] * 0.121703f;
+		framebufferOut[fid] += framebuffer_in_shared[index - 8] * 0.175713f;
+		framebufferOut[fid] += framebuffer_in_shared[index] * 0.198596f;
+		framebufferOut[fid] += framebuffer_in_shared[index + 8] * 0.175713f;
+		framebufferOut[fid] += framebuffer_in_shared[index + 16] * 0.121703f;
+		framebufferOut[fid] += framebuffer_in_shared[index + 24] * 0.065984f;
+		framebufferOut[fid] += framebuffer_in_shared[index + 32] * 0.028002f;
+		framebufferOut[fid] += framebuffer_in_shared[index + 40] * 0.0093f;
+	}
+}
+
+//downSample kernel
+__global__
+void downScaleSample(int downScaleW, int downScaleH, int downScaleRatio,
+	int w, int h,
+	glm::vec3 * downScale_framebuffer, glm::vec3* framebuffer,
+	int framebufferEdgeOffset)
+{
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+
+	if (x < downScaleW && y < downScaleH)
+	{
+		int index = (x + framebufferEdgeOffset) + ((y + framebufferEdgeOffset) * (downScaleW + 2 * framebufferEdgeOffset));
+		glm::vec3& f_col = downScale_framebuffer[index];
+		f_col = glm::vec3(0.0f);
+
+		float sampleCount = (float)downScaleRatio * (float)downScaleRatio;
+		int oriFramebufferX, oriFramebufferY;
+		int oriFramebufferInd;
+		//down sample and calculate average
+		for (int i = 0; i < downScaleRatio; i++)
+		{
+			for (int j = 0; j < downScaleRatio; j++)
+			{
+				oriFramebufferX = x * downScaleRatio + i;
+				oriFramebufferY = y * downScaleRatio + j;
+				oriFramebufferX = glm::clamp(oriFramebufferX, 0, w - 1);
+				oriFramebufferY = glm::clamp(oriFramebufferY, 0, h - 1);
+				oriFramebufferInd = oriFramebufferX + (oriFramebufferY * w);
+				f_col += framebuffer[oriFramebufferInd];
+			}
+		}
+		f_col /= sampleCount;
+	}
+}
+
+//bloom effect: brightness filter
+__global__
+void brightnessFilter(int w, int h, glm::vec3 * framebufferIn, glm::vec3 * framebufferOut)
+{
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	if (x < w && y < h)
+	{
+		int index = x + y * w;
+		// input frame buffer at index, not reference
+		glm::vec3 f_in = framebufferIn[index];
+		//calculate brightness of this frame buffer
+		float brightness = f_in.r * 0.2126f + f_in.g * 0.7152f + f_in.b * 0.0722f;
+		//framebufferOut[index] = brightness * f_in;
+		//brightness /= 2.0f;
+		//framebufferOut[index] = glm::vec3(brightness);
+		framebufferOut[index] = brightness * f_in;
+	}
+}
+
+//bloom effect: final stage, combine buffers
+__global__
+void bloomCombineFramebuffers(int w, int h,
+	glm::vec3 * mainFramebuffer, glm::vec3 * sideFramebuffer,
+	glm::vec3 * framebufferOut,
+	int sideFramebuffer_downScaleW,
+	int sideFramebuffer_downScaleRatio, int framebufferEdgeOffset)
+{
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	if (x < w && y < h)
+	{
+		int mainIdx = x + (y * w);
+		glm::vec3 main_col = mainFramebuffer[mainIdx];
+		int sideIdx = ((x / sideFramebuffer_downScaleRatio) + framebufferEdgeOffset) +
+			(((y / sideFramebuffer_downScaleRatio) + framebufferEdgeOffset) *
+			(sideFramebuffer_downScaleW + 2 * framebufferEdgeOffset));
+		glm::vec3 side_col = sideFramebuffer[sideIdx];
+
+		//combination factor
+		float k = 1.2f;
+		//comine color and write to output
+		framebufferOut[mainIdx] = main_col + k * side_col;
+	}
+}
+
+//edge width in gaussian blur
+int bloomGBedge = 5;
 /**
  * Called once at the beginning of the program to allocate memory.
  */
@@ -162,10 +471,32 @@ void rasterizeInit(int w, int h) {
     cudaFree(dev_framebuffer);
     cudaMalloc(&dev_framebuffer,   width * height * sizeof(glm::vec3));
     cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
-    
+    //initialize auxilliary frame buffers for bloom effect
+	cudaFree(dev_framebufferAux);
+	cudaMalloc(&dev_framebufferAux, width * height * sizeof(glm::vec3));
+	cudaMemset(dev_framebufferAux, 0, width * height * sizeof(glm::vec3));
+
+	int downScaleRatio = DOWNSCALERATIO;
+	cudaFree(dev_framebufferDownScaled);
+	cudaMalloc(&dev_framebufferDownScaled,
+		((width / downScaleRatio) + 2 * bloomGBedge) * ((height / downScaleRatio) + 2 * bloomGBedge) * sizeof(glm::vec3));
+	cudaMemset(dev_framebufferDownScaled, 0,
+		((width / downScaleRatio) + 2 * bloomGBedge) * ((height / downScaleRatio) + 2 * bloomGBedge) * sizeof(glm::vec3));
+	
+	cudaFree(dev_framebufferDownScaledAux);
+	cudaMalloc(&dev_framebufferDownScaledAux,
+		((width / downScaleRatio) + 2 * bloomGBedge) * ((height / downScaleRatio) + 2 * bloomGBedge) * sizeof(glm::vec3));
+	cudaMemset(dev_framebufferDownScaledAux, 0,
+		((width / downScaleRatio) + 2 * bloomGBedge) * ((height / downScaleRatio) + 2 * bloomGBedge) * sizeof(glm::vec3));
+
+
+
 	cudaFree(dev_depth);
 	cudaMalloc(&dev_depth, width * height * sizeof(int));
 
+	cudaFree(dev_mutex);
+	cudaMalloc(&dev_mutex, width * height * sizeof(Fragment));
+	cudaMemset(dev_mutex, 0, width * height * sizeof(Fragment));
 	checkCUDAError("rasterizeInit");
 }
 
@@ -628,7 +959,10 @@ void _vertexTransformAndAssembly(
 	int numVertices, 
 	PrimitiveDevBufPointers primitive, 
 	glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal, 
-	int width, int height) {
+	int width, int height,
+	glm::mat4 autoRotateMat4 //to enable auto rotation of the object
+	)
+{
 
 	// vertex id
 	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
@@ -639,9 +973,75 @@ void _vertexTransformAndAssembly(
 		// Then divide the pos by its w element to transform into NDC space
 		// Finally transform x and y to viewport space
 
+		//use shared memory to apply auto rotation
+		__shared__ glm::mat4 _autoRotateMat4;
+		if (threadIdx.x == 0)
+		{
+			_autoRotateMat4 = autoRotateMat4;
+		}
+		__syncthreads();
+
 		// TODO: Apply vertex assembly here
 		// Assemble all attribute arraies into the primitive array
 		
+		VertexOut& thisDevVertexOut = primitive.dev_verticesOut[vid];
+		////multiply model-view-projective matrix
+		//glm::vec4 worldSpacePos = MVP * _autoRotateMat4* glm::vec4(primitive.dev_position[vid], 1.0f);
+		////Projective divide
+		//glm::vec4 NDCpos = worldSpacePos * (1.0f / worldSpacePos.w);
+		////transform into pixels
+		//glm::vec4 PixelPos = glm::vec4(
+		//	(NDCpos.x + 1.0f) * (float)width / 2.0f,
+		//	(1.0f - NDCpos.y) * (float)height / 2.0f,
+		//	NDCpos.z,
+		//	NDCpos.w);
+		////write into vertexout struct
+		//thisDevVertexOut.pos = PixelPos;
+		////Eye space pos
+		//glm::vec3 eyeSpacePos = glm::vec3(MV * _autoRotateMat4 *glm::vec4(primitive.dev_position[vid], 1.0f));
+		//thisDevVertexOut.eyePos = eyeSpacePos;
+
+
+		int index = vid;
+		glm::vec4 vPos = glm::vec4(primitive.dev_position[index], 1.0f);
+		glm::vec4 vEyePos = MV * vPos;
+		thisDevVertexOut.eyePos = glm::vec3(vEyePos);
+
+		vPos = MVP * vPos;
+		vPos /= vPos.w;
+		vPos.x = 0.5f * (float)width * (vPos.x + 1.0f);
+		vPos.y = 0.5f * (float)height * (1.0f - vPos.y);
+		vPos.z = -vPos.z;
+
+		thisDevVertexOut.pos = vPos;
+
+
+		//Eye space normal
+		glm::vec3 eyeSpaceNormal = glm::normalize(MV_normal * glm::mat3(_autoRotateMat4) *primitive.dev_normal[vid]);
+		thisDevVertexOut.eyeNor = eyeSpaceNormal;
+		//texture coords
+		if (primitive.dev_texcoord0 != NULL)
+		{
+			thisDevVertexOut.texcoord0 = primitive.dev_texcoord0[vid];
+		}
+		else
+		{
+			thisDevVertexOut.texcoord0 = glm::vec2(0.0f, 0.0f);
+		}
+		//diffuse texture
+		if (primitive.dev_diffuseTex != NULL)
+		{
+			thisDevVertexOut.dev_diffuseTex = primitive.dev_diffuseTex;
+			thisDevVertexOut.diffuseTexHeight = primitive.diffuseTexHeight;
+			thisDevVertexOut.diffuseTexWidth = primitive.diffuseTexWidth;
+		}
+		//else assign sequential RGB color
+		else
+		{
+			thisDevVertexOut.col = glm::vec3(0.f);
+			thisDevVertexOut.col[vid % 3] += 0.6f;
+		}
+		
 	}
 }
 
@@ -660,12 +1060,12 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
 		// TODO: uncomment the following code for a start
 		// This is primitive assembly for triangles
 
-		//int pid;	// id for cur primitives vector
-		//if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
-		//	pid = iid / (int)primitive.primitiveType;
-		//	dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
-		//		= primitive.dev_verticesOut[primitive.dev_indices[iid]];
-		//}
+		int pid;	// id for cur primitives vector
+		if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
+			pid = iid / (int)primitive.primitiveType;
+			dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
+				= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+		}
 
 
 		// TODO: other primitive types (point, line)
@@ -673,12 +1073,381 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
 	
 }
 
+////helper function: fill specific fragment buffer
+//__device__
+//void fillFragmentBufferRoutine(Fragment& thisFragment,
+//	glm::vec3 p, glm::vec3 p1, glm::vec3 p2, glm::vec3 p3,
+//	VertexOut v1, VertexOut v2, VertexOut v3)
+//{
+//	int diffuseTexWidth = v1.diffuseTexWidth;
+//	int diffuseTexHeight = v1.diffuseTexHeight;
+//	TextureData* textureData = v1.dev_diffuseTex;
+//
+//	//positions
+//	glm::vec3 eyeSpacePos_interpolated = vec3AttriInterpolate(
+//		p, p1, p2, p3,
+//		v1.eyePos, v2.eyePos, v3.eyePos
+//	);
+//	thisFragment.eyePos = eyeSpacePos_interpolated;
+//
+//	//normals
+//	glm::vec3 eyeSpaceNor_interpolated = vec3AttriInterpolate(
+//		p, p1, p2, p3,
+//		v1.eyeNor, v2.eyeNor, v3.eyeNor
+//	);
+//	thisFragment.eyeNor = glm::normalize(eyeSpaceNor_interpolated);
+//
+//	//uv coordinates
+//	glm::vec2 uv_interpolated = vec2AttriInterpolate(
+//		p, p1, p2, p3,
+//		v1.texcoord0, v2.texcoord0, v3.texcoord0
+//	);
+//	//get texture and color the fragment
+//	if (textureData != NULL)
+//	{
+//		//read texture and color the fragment with it
+//		glm::ivec2 texSpaceCoord = glm::ivec2(diffuseTexWidth * uv_interpolated.x,
+//			diffuseTexHeight * uv_interpolated.y);
+//		int texIdx = texSpaceCoord.x + diffuseTexWidth * texSpaceCoord.y;
+//
+//		//read from color channels
+//		int textChannelsNum = 3;
+//		TextureData r = textureData[texIdx * textChannelsNum];
+//		TextureData g = textureData[texIdx * textChannelsNum + 1];
+//		TextureData b = textureData[texIdx * textChannelsNum + 2];
+//
+//		thisFragment.color = glm::vec3((float)r / 255.0f,
+//			(float)g / 255.0f,
+//			(float)b / 255.0f);
+//		//thisFragment.color = thisFragment.eyeNor;
+//	}
+//	else
+//	{
+//		thisFragment.color = glm::vec3(1.0f, 1.0f, 1.0f);
+//	}
+//
+//}
+
+
+
+/**
+ * fill in wireframe mode
+ */
+__device__
+void rasterizerFillWireFrame(
+	Fragment* fragmentBuffer, int* depth,
+	glm::vec3 p1, glm::vec3 p2, glm::vec3 p3,
+	int w, int h)
+{
+
+	//2D Bounding Box
+	float xMin = fminf(p1.x, fminf(p2.x, p3.x));
+	float xMax = fmaxf(p1.x, fmaxf(p2.x, p3.x));
+	float yMin = fminf(p1.y, fminf(p2.y, p3.y));
+	float yMax = fmaxf(p1.y, fmaxf(p2.y, p3.y));
+
+	//check bounding box with screen size
+	//get start and end indices in pixel from bounding box
+	float xStart = xMin < 0 ? 0 : (int)glm::floor(xMin);
+	float xEnd = xMax > w ? w : (int)glm::ceil(xMax);
+
+	float yStart = yMin < 0 ? 0 : (int)glm::floor(yMin);
+	float yEnd = yMax > h ? h : (int)glm::ceil(yMax);
+
+	glm::vec3 tris[3];
+	tris[0] = p1;
+	tris[1] = p2;
+	tris[2] = p3;
+
+	int fragmentIdx;
+
+	float lineThickness = 0.08f;
+	glm::vec3 wireColor = glm::vec3(0.85f, 0.85f, 0.35f);
+
+
+	//scan pixel lines to fill the current primitive(triangle)
+	for (int i = yStart; i <= yEnd; i++)
+	{
+		for (int j = xStart; j <= xEnd; j++)
+		{
+			glm::vec3 barycentricCoord = calculateBarycentricCoordinate(tris, glm::vec2(j, i));
+			if (glm::abs(barycentricCoord.x) < lineThickness)
+			{
+				if (barycentricCoord.y >= 0.0f && barycentricCoord.y <= 1.0f
+					&&barycentricCoord.z >= 0.0f && barycentricCoord.z <= 1.0f)
+				{
+					fragmentIdx = j + (i * w);
+					fragmentBuffer[fragmentIdx].color = wireColor;
+				}
+			}
+			if (glm::abs(barycentricCoord.y) < lineThickness)
+			{
+				if (barycentricCoord.x >= 0.0f && barycentricCoord.x <= 1.0f
+					&&barycentricCoord.z >= 0.0f && barycentricCoord.z <= 1.0f)
+				{
+					fragmentIdx = j + (i * w);
+					fragmentBuffer[fragmentIdx].color = wireColor;
+				}
+			}
+			if (glm::abs(barycentricCoord.z) < lineThickness)
+			{
+				if (barycentricCoord.y >= 0.0f && barycentricCoord.y <= 1.0f
+					&&barycentricCoord.x >= 0.0f && barycentricCoord.x <= 1.0f)
+				{
+					fragmentIdx = j + (i * w);
+					fragmentBuffer[fragmentIdx].color = wireColor;
+				}
+			}
+			
+		}
+
+	}
+}
+
+/**
+ * fill in point mode
+ */
+__device__
+void rasterizerFillPoints(
+	Fragment* fragmentBuffer, int* depth,
+	glm::vec3 p1, glm::vec3 p2, glm::vec3 p3,
+	int w, int h)
+{
+
+	//2D Bounding Box
+	float xMin = fminf(p1.x, fminf(p2.x, p3.x));
+	float xMax = fmaxf(p1.x, fmaxf(p2.x, p3.x));
+	float yMin = fminf(p1.y, fminf(p2.y, p3.y));
+	float yMax = fmaxf(p1.y, fmaxf(p2.y, p3.y));
+
+	//check bounding box with screen size
+	//get start and end indices in pixel from bounding box
+	float xStart = xMin < 0 ? 0 : (int)glm::floor(xMin);
+	float xEnd = xMax > w ? w : (int)glm::ceil(xMax);
+
+	float yStart = yMin < 0 ? 0 : (int)glm::floor(yMin);
+	float yEnd = yMax > h ? h : (int)glm::ceil(yMax);
+
+	glm::vec3 tris[3];
+	tris[0] = p1;
+	tris[1] = p2;
+	tris[2] = p3;
+
+	int fragmentIdx;
+
+	float pointThickness = 0.08f;
+	glm::vec3 pointColor = glm::vec3(0.35f, 0.85f, 0.85f);
+
+
+	//scan pixel lines to fill the current primitive(triangle)
+	for (int i = yStart; i <= yEnd; i++)
+	{
+		for (int j = xStart; j <= xEnd; j++)
+		{
+			glm::vec3 barycentricCoord = calculateBarycentricCoordinate(tris, glm::vec2(j, i));
+			
+			if (glm::abs(barycentricCoord.x - 1.0f) < pointThickness)
+			{
+				if (glm::abs(barycentricCoord.y) < pointThickness 
+					&&glm::abs(barycentricCoord.z) < pointThickness)
+				{
+					fragmentIdx = j + (i * w);
+					fragmentBuffer[fragmentIdx].color = pointColor;
+				}
+			}
+			if (glm::abs(barycentricCoord.y - 1.0f) < pointThickness)
+			{
+				if (glm::abs(barycentricCoord.x) < pointThickness
+					&&glm::abs(barycentricCoord.z) < pointThickness)
+				{
+					fragmentIdx = j + (i * w);
+					fragmentBuffer[fragmentIdx].color = pointColor;
+				}
+			}
+			if (glm::abs(barycentricCoord.z - 1.0f) < pointThickness)
+			{
+				if (glm::abs(barycentricCoord.y) < pointThickness
+					&&glm::abs(barycentricCoord.x) < pointThickness)
+				{
+					fragmentIdx = j + (i * w);
+					fragmentBuffer[fragmentIdx].color = pointColor;
+				}
+			}
+
+		}
+
+	}
+}
+
+/**
+* fill fragment method, actually control over different renderModes
+* call this function in rasterize function
+*/
+__global__
+void rasterizeFill(int numPrimitives, int curPrimitiveBeginId, Primitive* primitives,
+	Fragment* fragmentBuffer, int * depth, int w, int h, int renderMode, int* dev_mutex)
+{
+	int primitiveIdx = (blockIdx.x * blockDim.x) + threadIdx.x;
 
+	if (primitiveIdx < numPrimitives)
+	{
+		Primitive& thisPrimitive = primitives[primitiveIdx + curPrimitiveBeginId];
+		glm::vec3 p1(thisPrimitive.v[0].pos[0], thisPrimitive.v[0].pos[1], thisPrimitive.v[0].pos[2]);
+		glm::vec3 p2(thisPrimitive.v[1].pos[0], thisPrimitive.v[1].pos[1], thisPrimitive.v[1].pos[2]);
+		glm::vec3 p3(thisPrimitive.v[2].pos[0], thisPrimitive.v[2].pos[1], thisPrimitive.v[2].pos[2]);
+	
+		//Switch between different rendermodes
+		if (renderMode == RenderMode::Triangles)
+		{
+			//rasterizerFillTriangle(thisPrimitive,
+			//	fragmentBuffer, depth, p1, p2, p3, w, h,dev_mutex);
+
+			////2D Bounding Box
+			//float xMin = fminf(p1.x, fminf(p2.x, p3.x));
+			//float xMax = fmaxf(p1.x, fmaxf(p2.x, p3.x));
+			//float yMin = fminf(p1.y, fminf(p2.y, p3.y));
+			//float yMax = fmaxf(p1.y, fmaxf(p2.y, p3.y));
+
+			////check bounding box with screen size
+			////get start and end indices in pixel from bounding box
+			//float xStart = xMin < 0 ? 0 : (int)glm::floor(xMin);
+			//float xEnd = xMax > w ? w : (int)glm::ceil(xMax);
+
+			//float yStart = yMin < 0 ? 0 : (int)glm::floor(yMin);
+			//float yEnd = yMax > h ? h : (int)glm::ceil(yMax);
+
+			glm::vec3 tri[3] = { p1,p2,p3 };
+			AABB boundingBox = getAABBForTriangleWithClamp(tri, (float)w, (float)h);
+			float xStart = boundingBox.min.x;
+			float xEnd = boundingBox.max.x;
+
+			float yStart = boundingBox.min.y;
+			float yEnd = boundingBox.max.y;
+			//scan pixel lines to fill the current primitive(triangle)
+			for (int i = yStart; i <= yEnd; i++)
+			{
+				for (int j = xStart; j <= xEnd; j++)
+				{
+					// test if the pos is inside the current triangle
+					glm::vec3 baryCoords = calculateBarycentricCoordinate(tri, glm::vec2(j, i));
+					if (isBarycentricCoordInBounds(baryCoords))
+					{
+
+					
+					/*if (isPosInTriangle(glm::vec3(j, i, 0.f), p1, p2, p3))
+					{*/
+						//interpolate depth 
+						//depth value is in eye pace
+						//const float z_interpolated = depthInterpolate(glm::vec3(j, i, 0.f), p1, p2, p3);
+						const float z_interpolated = getZAtCoordinate(baryCoords, tri);
+						//CAUTIOUS(basic)
+						//We need to use atomic function to write depth value
+						//But it only supports integers, so we multiply it by a large number to get as many digits as we can
+						//Note: int32 is bwtween -2^31 and 2^31, which is 2147483648
+						const int z_rounded = z_interpolated * INT_MAX;
+
+						//atomic depth writing
+						int fragmentIdx = j + (i * w);
+
+						bool isSet;
+						do {
+							isSet = (atomicCAS(&dev_mutex[fragmentIdx], 0, 1) == 0);
+							if (isSet) {
+								if (z_rounded < depth[fragmentIdx])
+								{
+									depth[fragmentIdx] = z_rounded;
+									//The pos of current interested fragment
+									glm::vec3 p((float)j, (float)i, z_interpolated);
+
+									//fillFragmentBufferRoutine(fragmentBuffer[fragmentIdx],
+									//	p, p1, p2, p3,
+									//	thisPrimitive.v[0], thisPrimitive.v[1], thisPrimitive.v[2]);
+									Fragment f;
+
+									f.eyeNor = glm::normalize(
+										thisPrimitive.v[0].eyeNor * baryCoords.x +
+										thisPrimitive.v[1].eyeNor * baryCoords.y +
+										thisPrimitive.v[2].eyeNor * baryCoords.z);
+
+									f.eyePos = glm::normalize(
+										thisPrimitive.v[0].eyePos * baryCoords.x +
+										thisPrimitive.v[1].eyePos * baryCoords.y +
+										thisPrimitive.v[2].eyePos * baryCoords.z);
+#if PERSPECTIVE_INTERP
+									// get correct depth to use for interpolation
+									float newBaryDepth = 1.f / (1 / thisPrimitive.v[0].pos.z * baryCoords.x +
+										1 / thisPrimitive.v[1].pos.z * baryCoords.y +
+										1 / thisPrimitive.v[2].pos.z * baryCoords.z);
+
+									f.color = glm::normalize(newBaryDepth * (thisPrimitive.v[0].col * baryCoords.x / thisPrimitive.v[0].pos.z +
+										thisPrimitive.v[1].col * baryCoords.y / thisPrimitive.v[1].pos.z +
+										thisPrimitive.v[2].col * baryCoords.z / thisPrimitive.v[2].pos.z));
+#else
+									f.color = glm::normalize(thisPrimitive.v[0].col * baryCoords.x +
+										thisPrimitive.v[1].col * baryCoords.y +
+										thisPrimitive.v[2].col * baryCoords.z);
+#endif
+
+
+
+									/***** Texture handling ****/
+#if TEXTURE
+#if PERSPECTIVE_INTERP
+						// get the starting index of the color in the tex buffer
+									glm::vec2 uv = newBaryDepth * (
+										thisPrimitive.v[0].texcoord0 * baryCoords.x / thisPrimitive.v[0].pos.z +
+										thisPrimitive.v[1].texcoord0 * baryCoords.y / thisPrimitive.v[1].pos.z +
+										thisPrimitive.v[2].texcoord0 * baryCoords.z / thisPrimitive.v[2].pos.z);
+
+#else
+									glm::vec2 uv = thisPrimitive.v[0].texcoord0 * baryCoords.x +
+										thisPrimitive.v[1].texcoord0 * baryCoords.y +
+										thisPrimitive.v[2].texcoord0 * baryCoords.z;
+#endif
+									uv.x *= thisPrimitive.v[0].diffuseTexWidth;
+									uv.y *= thisPrimitive.v[0].diffuseTexHeight;
+#if TEXTURE_BILINEAR
+									f.uvBilinear = uv;
+									f.diffuseTexWidth = thisPrimitive.v[0].diffuseTexWidth;
+									f.diffuseTexHeight = thisPrimitive.v[0].diffuseTexHeight;
+#endif
+									// actual index into tex buffer is 1D, multip by 3 since every 3 floats (rgb) is a color
+									f.uvStart = ((int)uv.x + (int)uv.y * thisPrimitive.v[0].diffuseTexWidth) * 3;
+									f.dev_diffuseTex = thisPrimitive.v[0].dev_diffuseTex;
+#endif
+									fragmentBuffer[fragmentIdx] = f;
+								}
+							}
+							if (isSet)
+							{
+								dev_mutex[fragmentIdx] = 0;
+							}
+						} while (!isSet);
+
+					}
+				}
+
+			}
+
+		}
+		if (renderMode == RenderMode::Wireframe)
+		{
+			rasterizerFillWireFrame(fragmentBuffer,
+				depth, p1, p2, p3, w, h);
+		}
+		if (renderMode == RenderMode::Points)
+		{
+			rasterizerFillPoints(fragmentBuffer, depth, p1, p2, p3, w, h);
+		}
+	}
+}
 
 /**
  * Perform rasterization.
  */
-void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal) {
+void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal,
+	int renderMode, glm::mat4 autoRotateMat4,
+	bool bloomEffect) {
     int sideLength2d = 8;
     dim3 blockSize2d(sideLength2d, sideLength2d);
     dim3 blockCount2d((width  - 1) / blockSize2d.x + 1,
@@ -702,7 +1471,7 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 				dim3 numBlocksForVertices((p->numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 				dim3 numBlocksForIndices((p->numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 
-				_vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> >(p->numVertices, *p, MVP, MV, MV_normal, width, height);
+				_vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> >(p->numVertices, *p, MVP, MV, MV_normal, width, height, autoRotateMat4);
 				checkCUDAError("Vertex Processing");
 				cudaDeviceSynchronize();
 				_primitiveAssembly << < numBlocksForIndices, numThreadsPerBlock >> >
@@ -723,15 +1492,77 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 	initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
 	
 	// TODO: rasterize
+	
+	curPrimitiveBeginId = 0;
+	dim3 numThreadsPerBlock(128);
 
+	auto it = mesh2PrimitivesMap.begin();
+	auto itEnd = mesh2PrimitivesMap.end();
 
+	for (; it != itEnd; ++it)
+	{
+		auto p = (it->second).begin();
+		auto pEnd = (it->second).end();
+		for (; p != pEnd; ++p)
+		{
+			//launch kernel to fill fragments
+			dim3 numBlocksForPrimitives((p->numPrimitives + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
+			rasterizeFill<<<numBlocksForPrimitives, numThreadsPerBlock>>>(p->numPrimitives, curPrimitiveBeginId,dev_primitives, dev_fragmentBuffer,
+				dev_depth, width, height,renderMode, dev_mutex);
+			checkCUDAError("rasterizerFillPrimitive error!");
+			cudaDeviceSynchronize();
+			curPrimitiveBeginId += p->numPrimitives;
+		}
+	}
 
     // Copy depthbuffer colors into framebuffer
-	render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer);
+	// Modified: need more info in fragment shader such as light position
+	glm::vec3 singlePointLight(3.0f, 6.0f, -5.0f);
+	render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer, singlePointLight, renderMode, bloomGBedge);
 	checkCUDAError("fragment shader");
-    // Copy framebuffer into OpenGL buffer for OpenGL previewing
-    sendImageToPBO<<<blockCount2d, blockSize2d>>>(pbo, width, height, dev_framebuffer);
-    checkCUDAError("copy render result to pbo");
+    
+	if (bloomEffect)
+	{
+		//BLOOM EFFECT POST PROCESSING
+
+		//BRIGHTNESS FITLER
+		brightnessFilter<<<blockCount2d, blockSize2d>>>(width, height, dev_framebuffer, dev_framebufferAux);
+		//DOWN SCALE
+		int downScaleRatio = DOWNSCALERATIO;
+		dim3 blockCount2dDownScaled((width / downScaleRatio - 1) / blockSize2d.x + 1,
+			(height / downScaleRatio - 1) / blockSize2d.y + 1);
+		downScaleSample<<<blockCount2dDownScaled,blockSize2d>>>(width / downScaleRatio, height / downScaleRatio, downScaleRatio,
+			width, height,
+			dev_framebufferDownScaled, dev_framebufferAux,
+			bloomGBedge);
+		//GAUSSIAN BLUR
+		gaussianBlurHorizon << <blockCount2dDownScaled, blockSize2d >> > (width / downScaleRatio, height / downScaleRatio,
+			dev_framebufferDownScaled, dev_framebufferDownScaledAux,
+			bloomGBedge);
+		gaussianBlurVert << <blockCount2dDownScaled, blockSize2d >> > (width / downScaleRatio, height / downScaleRatio,
+			dev_framebufferDownScaledAux, dev_framebufferDownScaled,
+			bloomGBedge);
+		//FINAL COMBINE
+		bloomCombineFramebuffers << <blockCount2d, blockSize2d >> > (width, height,
+			dev_framebuffer, dev_framebufferDownScaled,
+			dev_framebufferAux,
+			width / downScaleRatio,
+			downScaleRatio,
+			bloomGBedge);
+		checkCUDAError("bloom effect");
+
+		// Copy post-processed framebuffer into OpenGL buffer for OpenGL previewing
+		sendImageToPBO << <blockCount2d, blockSize2d >> > (pbo, width, height, dev_framebufferAux, bloomGBedge, width,1);
+		checkCUDAError("copy render result to pbo");
+	}
+	
+	else
+	{
+		// Copy framebuffer into OpenGL buffer for OpenGL previewing
+		sendImageToPBO << <blockCount2d, blockSize2d >> > (pbo, width, height, dev_framebuffer,bloomGBedge, width, 1);
+		checkCUDAError("copy render result to pbo");
+	}
+	
 }
 
 /**
@@ -769,8 +1600,20 @@ void rasterizeFree() {
     cudaFree(dev_framebuffer);
     dev_framebuffer = NULL;
 
+	//free extra buffers
+	cudaFree(dev_framebufferAux);
+	dev_framebufferAux = NULL;
+
+	cudaFree(dev_framebufferDownScaled);
+	dev_framebufferDownScaled = NULL;
+
+	cudaFree(dev_framebufferDownScaledAux);
+	dev_framebufferDownScaled = NULL;
+
 	cudaFree(dev_depth);
 	dev_depth = NULL;
 
+
+
     checkCUDAError("rasterize Free");
 }
diff --git a/src/rasterize.h b/src/rasterize.h
index 560aae9..f9ba086 100644
--- a/src/rasterize.h
+++ b/src/rasterize.h
@@ -11,6 +11,7 @@
 #include <glm/glm.hpp>
 #include <glm/gtc/matrix_transform.hpp>
 #include <glm/gtx/transform.hpp>
+#include "device_launch_parameters.h"
 
 namespace tinygltf{
 	class Scene;
@@ -20,5 +21,6 @@ namespace tinygltf{
 void rasterizeInit(int width, int height);
 void rasterizeSetBuffers(const tinygltf::Scene & scene);
 
-void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal);
+void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal,
+	int renderMode, glm::mat4 autoRotateMat4, bool bloomEffect);
 void rasterizeFree();
diff --git a/src/rasterizeTools.h b/src/rasterizeTools.h
index 46c701e..17eecee 100644
--- a/src/rasterizeTools.h
+++ b/src/rasterizeTools.h
@@ -30,15 +30,15 @@ glm::vec3 multiplyMV(glm::mat4 m, glm::vec4 v) {
  * Finds the axis aligned bounding box for a given triangle.
  */
 __host__ __device__ static
-AABB getAABBForTriangle(const glm::vec3 tri[3]) {
+AABB getAABBForTriangleWithClamp(const glm::vec3 tri[3], float width, float height) {
     AABB aabb;
     aabb.min = glm::vec3(
-            min(min(tri[0].x, tri[1].x), tri[2].x),
-            min(min(tri[0].y, tri[1].y), tri[2].y),
+            max(min(min(tri[0].x, tri[1].x), tri[2].x),0.0f),
+            max(min(min(tri[0].y, tri[1].y), tri[2].y),0.0f),
             min(min(tri[0].z, tri[1].z), tri[2].z));
     aabb.max = glm::vec3(
-            max(max(tri[0].x, tri[1].x), tri[2].x),
-            max(max(tri[0].y, tri[1].y), tri[2].y),
+            min(max(max(tri[0].x, tri[1].x), tri[2].x),width),
+            min(max(max(tri[0].y, tri[1].y), tri[2].y),height),
             max(max(tri[0].z, tri[1].z), tri[2].z));
     return aabb;
 }
@@ -48,7 +48,7 @@ AABB getAABBForTriangle(const glm::vec3 tri[3]) {
  * Calculate the signed area of a given triangle.
  */
 __host__ __device__ static
-float calculateSignedArea(const glm::vec3 tri[3]) {
+float calculateSignedArea(glm::vec3 tri[3]) {
     return 0.5 * ((tri[2].x - tri[0].x) * (tri[1].y - tri[0].y) - (tri[1].x - tri[0].x) * (tri[2].y - tri[0].y));
 }
 
@@ -57,7 +57,7 @@ float calculateSignedArea(const glm::vec3 tri[3]) {
  * Helper function for calculating barycentric coordinates.
  */
 __host__ __device__ static
-float calculateBarycentricCoordinateValue(glm::vec2 a, glm::vec2 b, glm::vec2 c, const glm::vec3 tri[3]) {
+float calculateBarycentricCoordinateValue(glm::vec2 a, glm::vec2 b, glm::vec2 c,  glm::vec3 tri[3]) {
     glm::vec3 baryTri[3];
     baryTri[0] = glm::vec3(a, 0);
     baryTri[1] = glm::vec3(b, 0);
@@ -70,7 +70,7 @@ float calculateBarycentricCoordinateValue(glm::vec2 a, glm::vec2 b, glm::vec2 c,
  * Calculate barycentric coordinates.
  */
 __host__ __device__ static
-glm::vec3 calculateBarycentricCoordinate(const glm::vec3 tri[3], glm::vec2 point) {
+glm::vec3 calculateBarycentricCoordinate(glm::vec3 tri[3], glm::vec2 point) {
     float beta  = calculateBarycentricCoordinateValue(glm::vec2(tri[0].x, tri[0].y), point, glm::vec2(tri[2].x, tri[2].y), tri);
     float gamma = calculateBarycentricCoordinateValue(glm::vec2(tri[0].x, tri[0].y), glm::vec2(tri[1].x, tri[1].y), point, tri);
     float alpha = 1.0 - beta - gamma;
@@ -99,3 +99,112 @@ float getZAtCoordinate(const glm::vec3 barycentricCoord, const glm::vec3 tri[3])
            + barycentricCoord.y * tri[1].z
            + barycentricCoord.z * tri[2].z);
 }
+
+//HELPER FUNCTIONS
+
+/**
+ * Helper function, determine whether a point is inside a triangle
+ */
+__device__
+bool isPosInTriangle(glm::vec3 p, glm::vec3 pos1, glm::vec3 pos2, glm::vec3 pos3)
+{
+	glm::vec3 v(p[0], p[1], 0.f);
+	glm::vec3 v1(pos1[0], pos1[1], 0.f);
+	glm::vec3 v2(pos2[0], pos2[1], 0.f);
+	glm::vec3 v3(pos3[0], pos3[1], 0.f);
+
+	//compute areas for barycentric coordinates
+	float area = 0.5f * glm::length(glm::cross(v1 - v2, v3 - v2));
+	float area1 = 0.5f * glm::length(glm::cross(v - v2, v3 - v2));
+	float area2 = 0.5f * glm::length(glm::cross(v - v3, v1 - v3));
+	float area3 = 0.5f * glm::length(glm::cross(v - v1, v2 - v1));
+
+	//check the sum of three (signed) areas against the whole area
+	return glm::abs(area1 + area2 + area3 - area) < 0.001f;
+}
+
+/**
+ * Helper function, interpolate depth value
+ * Argument list: p is in screen space: (pixel.x, pixel.y, 0)
+ * pos1, pos2, pos3 are in combination of screen space and camera space: (pixel.x, pixel.y, eyeSpace.z)
+ */
+
+__device__
+float depthInterpolate(glm::vec3 p,
+	glm::vec3 pos1, glm::vec3 pos2, glm::vec3 pos3)
+{
+	glm::vec3 v1(pos1[0], pos1[1], 0.f);
+	glm::vec3 v2(pos2[0], pos2[1], 0.f);
+	glm::vec3 v3(pos3[0], pos3[1], 0.f);
+
+	//compute areas for barycentric coordinates
+	float area = 0.5f * glm::length(glm::cross(v1 - v2, v3 - v2));
+	float area1 = 0.5f * glm::length(glm::cross(p - v2, v3 - v2));
+	float area2 = 0.5f * glm::length(glm::cross(p - v3, v1 - v3));
+	float area3 = 0.5f * glm::length(glm::cross(p - v1, v2 - v1));
+
+	//calculate interpolated z:
+	float z_inverse_interpolated = area1 / (pos1[2] * area) + area2 / (pos2[2] * area) + area3 / (pos3[2] * area);
+	return 1.0f / z_inverse_interpolated;
+}
+
+/**
+ * Helper function, interpolate any vec2 vertex attributes, mainly for texture coords
+ * Argument list: p, pos1, pos2, pos3 are in combination of screen space and camera space: (pixel.x, pixel.y, eyeSpace.z)
+ * Attributes to interpolate are attri1, attri2 and attri3
+ */
+__device__
+glm::vec2 vec2AttriInterpolate(glm::vec3 p,
+	glm::vec3 pos1, glm::vec3 pos2, glm::vec3 pos3,
+	glm::vec2 attri1, glm::vec2 attri2, glm::vec2 attri3)
+{
+	glm::vec3 v(p[0], p[1], 0.f);
+	glm::vec3 v1(pos1[0], pos1[1], 0.f);
+	glm::vec3 v2(pos2[0], pos2[1], 0.f);
+	glm::vec3 v3(pos3[0], pos3[1], 0.f);
+
+	//compute areas for barycentric coordinates
+	float area = 0.5f * glm::length(glm::cross(v1 - v2, v3 - v2));
+	float area1 = 0.5f * glm::length(glm::cross(v - v2, v3 - v2));
+	float area2 = 0.5f * glm::length(glm::cross(v - v3, v1 - v3));
+	float area3 = 0.5f * glm::length(glm::cross(v - v1, v2 - v1));
+
+	//formula:
+	// A/Z = Sigma (Ai/Zi * Si/S)
+	glm::vec2 AttriIntepolate =
+		(attri1 / pos1[2]) * (area1 / area)
+		+ (attri2 / pos2[2]) * (area2 / area)
+		+ (attri3 / pos3[2]) * (area3 / area);
+	return AttriIntepolate * p[2];
+}
+
+/**
+ * Helper function, interpolate any vec3 vertex attributes
+ * Argument list: p, pos1, pos2, pos3 are in combination of screen space and camera space: (pixel.x, pixel.y, eyeSpace.z)
+ * Attributes to interpolate are attri1, attri2 and attri3
+ * basicall same as vec2 attribute interpolation, but with vec3
+ */
+__device__
+glm::vec3 vec3AttriInterpolate(glm::vec3 p,
+	glm::vec3 pos1, glm::vec3 pos2, glm::vec3 pos3,
+	glm::vec3 attri1, glm::vec3 attri2, glm::vec3 attri3)
+{
+	glm::vec3 v(p[0], p[1], 0.f);
+	glm::vec3 v1(pos1[0], pos1[1], 0.f);
+	glm::vec3 v2(pos2[0], pos2[1], 0.f);
+	glm::vec3 v3(pos3[0], pos3[1], 0.f);
+
+	//compute areas for barycentric coordinates
+	float area = 0.5f * glm::length(glm::cross(v1 - v2, v3 - v2));
+	float area1 = 0.5f * glm::length(glm::cross(v - v2, v3 - v2));
+	float area2 = 0.5f * glm::length(glm::cross(v - v3, v1 - v3));
+	float area3 = 0.5f * glm::length(glm::cross(v - v1, v2 - v1));
+
+	//formula:
+	// A/Z = Sigma (Ai/Zi * Si/S)
+	glm::vec3 AttriIntepolate =
+		(attri1 / pos1[2]) * (area1 / area)
+		+ (attri2 / pos2[2]) * (area2 / area)
+		+ (attri3 / pos3[2]) * (area3 / area);
+	return AttriIntepolate * p[2];
+}