diff --git a/README.md b/README.md
index d63a6a1..8498cb4 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,58 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Name: Bowen Yang
+  * [LinkedIn](https://www.linkedin.com/in/%E5%8D%9A%E6%96%87-%E6%9D%A8-83bba6148)
+  * [GitHub](https://github.com/Grillnov)
+  * [Facebook](https://www.facebook.com/yang.bowen.7399)
+  * [Steam](https://steamcommunity.com/id/grillnov)
+* Tested on: Windows 10 x64, i7-6800K @ 3.40GHz, 32GB RAM, GTX 1080 8GB (yep, I've got another personal computer at home)
-### (TODO: Your README)
+* **Additional modification made to the repo**
+  Changed the `-arch` option in ./src/CMakeLists.txt to `sm_60` to match the Pascal GPU used for testing.
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+* **Description of the algorithm**
+In this project we implement flocking behavior for boids by updating each boid's velocity according to three rules:
+  * Boids try to fly towards the centre of mass of neighbouring boids.
+  * Boids try to keep a small distance away from other objects (including other boids).
+  * Boids try to match velocity with nearby boids.
+
+Two kinds of implementation are included: the brute-force solution, in which every boid loops over every other boid to find its neighbors, and the uniform-grid implementation, in which a uniform spatial grid restricts the neighbor search to nearby cells and significantly boosts performance. By varying the tables we use to look up the corresponding particles, we can also try the coherent method, an optimization of the scattered uniform-grid method.
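The three rules can be sketched in plain host-side C++ before worrying about the CUDA details. This is a minimal sketch only: the distance thresholds and scale factors below are illustrative placeholders, not the constants defined in src/kernel.cu.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal CPU sketch of the three flocking rules for one boid.
// Thresholds and scales are illustrative placeholders.
struct Vec3 { float x, y, z; };

static Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 scale(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static float dist(Vec3 a, Vec3 b) {
  Vec3 d = sub(a, b);
  return std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
}

Vec3 velocityChange(int iSelf, const std::vector<Vec3>& pos,
                    const std::vector<Vec3>& vel) {
  const float rule1Distance = 5.0f, rule2Distance = 3.0f, rule3Distance = 5.0f;
  const float rule1Scale = 0.01f, rule2Scale = 0.1f, rule3Scale = 0.1f;
  Vec3 center{0, 0, 0}, separation{0, 0, 0}, avgVel{0, 0, 0};
  int n1 = 0, n3 = 0;
  for (std::size_t k = 0; k < pos.size(); ++k) {
    if (static_cast<int>(k) == iSelf) continue;  // a boid is not its own neighbor
    float d = dist(pos[k], pos[iSelf]);
    // Rule 1 (cohesion): accumulate the perceived center of mass
    if (d < rule1Distance) { center = add(center, pos[k]); ++n1; }
    // Rule 2 (separation): push away from boids that are too close
    if (d < rule2Distance) { separation = sub(separation, sub(pos[k], pos[iSelf])); }
    // Rule 3 (alignment): accumulate the perceived average velocity
    if (d < rule3Distance) { avgVel = add(avgVel, vel[k]); ++n3; }
  }
  Vec3 dv{0, 0, 0};
  if (n1 > 0) dv = add(dv, scale(sub(scale(center, 1.0f / n1), pos[iSelf]), rule1Scale));
  dv = add(dv, scale(separation, rule2Scale));
  if (n3 > 0) dv = add(dv, scale(scale(avgVel, 1.0f / n3), rule3Scale));
  return dv;
}
```

The brute-force CUDA kernel does exactly this per thread; the uniform-grid versions differ only in which candidate neighbors `k` get visited.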
(Actually I think we can further optimize the neighbor search using octrees or spatial hash tables, but these are left for the future...)
+
+* **What we achieved so far**
+Flocks and flocks of playful boids. The number shown at the top-left corner is the framerate reported by NVIDIA ShadowPlay.
+ ![](images/showcase.gif)
+
+* **Analysis: boids**
+A side-by-side comparison of the performance of the 3 methods. The y axis is the framerate (frames per second) and the x axis is the log2 of the total number of boids (i.e. x = 10 means there are 1024 boids in total), so this is a log plot.
+The block size is 128 for all test cases.
+ ![](images/sidebyside.png)
+
+ * Q: For each implementation, how does changing the number of boids affect performance? Why do you think this is?
+ * A: For the naive implementation the framerate clearly falls off sharply as the number of boids grows. This is easy to explain: the naive implementation is a 2-layer nested loop over all boids and thus has a time complexity of O(n^2).
+ * Q: For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?
+ * A: The scattered uniform-grid implementation performs significantly better than the naive one, since we are no longer looping over every boid for every boid; instead, we loop only over those that fall within the proximity of the current boid.
+ * The coherent implementation follows the same trend as the scattered uniform grid and outperforms it, since we no longer have the "middleman" in the way. At lower numbers of boids, however, the implementation with the "middleman" sometimes runs faster.
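The "middleman" here is the particleArrayIndices table: the scattered version reads `pos[particleArrayIndices[j]]` through that indirection on every neighbor lookup, while the coherent version pays once to copy the boid data into sorted order (the kernReorderPos kernel added in src/kernel.cu) and then indexes it directly. A small host-side sketch of that one-time reshuffle, with hypothetical names:

```cpp
#include <cassert>
#include <vector>

// Host-side sketch of the coherent-grid reshuffle: after sorting boids by
// grid-cell index, arrayIndices[j] names the original boid that is j-th in
// sorted order. Copying positions into that order once removes the
// indirection ("middleman") from every subsequent neighbor lookup.
std::vector<float> reorder(const std::vector<float>& pos,
                           const std::vector<int>& arrayIndices) {
  std::vector<float> reordered(pos.size());
  for (std::size_t j = 0; j < arrayIndices.size(); ++j) {
    reordered[j] = pos[arrayIndices[j]];  // scattered read, coherent write
  }
  return reordered;
}
```

On the GPU the win comes from neighbor loops then reading contiguous, cache-friendly ranges of the reordered arrays instead of gathering through the index table.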
The reshuffling overhead of the coherent implementation explains why: at small boid counts, the performance boost from coherent access is eaten up by the extra work of reordering.
+ * The anomaly around n = 2^13 to 2^14 for the two uniform-grid implementations: the complexity of the uniform grid is not deterministic and still has an upper bound of O(n^2), so when the distribution becomes dense, the computational cost won't necessarily increase proportionally.
+
+* **Analysis continued: blocks**
+ * Naive implementation: tested at 8192 particles. The y axis is the framerate (frames per second) and the x axis is the log2 of the block size (i.e. x = 10 means a block size of 1024). Performance drops drastically when the block size reaches 1024 (the maximum).
+
+ ![](images/naive.png)
+
+ * Scattered implementation: tested at 262144 particles. Performance is relatively insensitive to the block size, though it also drops as the block size increases.
+
+ ![](images/sca.png)
+
+ * Coherent implementation: tested at 262144 particles. Performance is relatively insensitive to the block size, though it also drops as the block size increases.
+
+ ![](images/coh.png)
+
+ * Q: For each implementation, how does changing the block count and block size affect performance? Why do you think this is?
+ * A: For the naive implementation, the performance drop when the block size reaches 1024 is probably caused by spatial locality: when the block size is small and each streaming multiprocessor runs the 2-layer loop for a relatively small number of boids, the patch of global memory accessed by each streaming multiprocessor is smaller. Locality is therefore better with a smaller block size, and cache misses are less likely to occur.
+
+ * Q: Did changing cell width and checking 27 vs 8 neighboring cells affect performance?
Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!
+ * A: I couldn't test different cell widths because the driver crashed when I tried. Checking 27 cells instead of 8 might actually boost performance when the distribution is extremely dense, though: when we check all 27 adjacent cells (including the boid's own cell), we can lower the resolution of the uniform grid, and the cost introduced by sorting is reduced as well.
+
+* **Feedback**
+The pseudo-code is somewhat misleading when it comes to averaging the center of mass or the perceived velocity.
diff --git a/images/coh.png b/images/coh.png
new file mode 100644
index 0000000..8711847
Binary files /dev/null and b/images/coh.png differ
diff --git a/images/naive.png b/images/naive.png
new file mode 100644
index 0000000..f49e345
Binary files /dev/null and b/images/naive.png differ
diff --git a/images/sca.png b/images/sca.png
new file mode 100644
index 0000000..24af69e
Binary files /dev/null and b/images/sca.png differ
diff --git a/images/showcase.gif b/images/showcase.gif
new file mode 100644
index 0000000..44ed89d
Binary files /dev/null and b/images/showcase.gif differ
diff --git a/images/sidebyside.png b/images/sidebyside.png
new file mode 100644
index 0000000..daeaf11
Binary files /dev/null and b/images/sidebyside.png differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index fdd636d..030ca66 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_60
     )
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..a5f552e 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -85,6 +85,15 @@ int *dev_gridCellEndIndices; // to this cell?
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3 *dev_pos_reorder;
+__global__ void kernReorderPos(int N, glm::vec3 *posReordered, glm::vec3 *pos, int *particleArrayIndices)
+{
+  int pid = threadIdx.x + blockIdx.x * blockDim.x;
+  if (pid < N)
+  {
+    posReordered[pid] = pos[particleArrayIndices[pid]];
+  }
+}

 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -169,6 +178,25 @@ void Boids::initSimulation(int N) {
   gridMinimum.z -= halfGridWidth;

   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+
+  cudaMalloc(reinterpret_cast<void**>(&dev_particleGridIndices), N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc(reinterpret_cast<void**>(&dev_particleArrayIndices), N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc(reinterpret_cast<void**>(&dev_gridCellStartIndices), gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc(reinterpret_cast<void**>(&dev_gridCellEndIndices), gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
+  cudaMalloc(reinterpret_cast<void**>(&dev_pos_reorder), N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_pos_reorder failed!");
+
   cudaDeviceSynchronize();
 }
@@ -236,15 +264,87 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
   return glm::vec3(0.0f, 0.0f, 0.0f);
 }

+__device__ glm::vec3 computeVelocityChangeBruteForce(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel)
+{
+  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+  // Rule 2: boids try to stay a distance d away from each other
+  // Rule 3: boids try to match the speed of surrounding boids
+
+  glm::vec3 massCenter = glm::vec3(0.0f);
+  glm::vec3 c = glm::vec3(0.0f);
+  glm::vec3 perceivedVelocity = glm::vec3(0.0f);
+  const glm::vec3& thisPos = pos[iSelf];
+  int neighborCounterRule1 = 0;
+  int neighborCounterRule3 = 0;
+  for (int boidK = 0; boidK < N; ++boidK)
+  {
+    if (boidK == iSelf)
+    {
+      continue;
+    }
+    const glm::vec3& boidPos = pos[boidK];
+    float distance = glm::distance(boidPos, thisPos);
+
+    if (distance < rule1Distance)
+    {
+      massCenter += boidPos;
+      ++neighborCounterRule1;
+    }
+    if (distance < rule2Distance)
+    {
+      c -= (boidPos - thisPos);
+    }
+    if (distance < rule3Distance)
+    {
+      perceivedVelocity += vel[boidK];
+      ++neighborCounterRule3;
+    }
+  }
+
+  glm::vec3 term1 = glm::vec3(0.0f);
+  glm::vec3 term2 = glm::vec3(0.0f);
+  glm::vec3 term3 = glm::vec3(0.0f);
+
+  if (neighborCounterRule1 >= 1)
+  {
+    float averageFactor = 1.0f / static_cast<float>(neighborCounterRule1);
+    massCenter = averageFactor * massCenter;
+    term1 = rule1Scale * (massCenter - thisPos);
+  }
+
+  term2 = rule2Scale * c;
+
+  if (neighborCounterRule3 >= 1)
+  {
+    float averageFactor = 1.0f / static_cast<float>(neighborCounterRule3);
+    term3 = rule3Scale * averageFactor * perceivedVelocity;
+  }
+
+  return term1 + term2 + term3;
+}
+
 /**
 * TODO-1.2 implement basic flocking
 * For each of the `N` bodies, update its position based on its current velocity.
 */
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
-  glm::vec3 *vel1, glm::vec3 *vel2) {
+  glm::vec3 *vel1, glm::vec3 *vel2)
+{
   // Compute a new velocity based on pos and vel1
   // Clamp the speed
   // Record the new velocity into vel2. Question: why NOT vel1?
+  // We write to vel2 so that other threads can still read this boid's old
+  // velocity from vel1 during this step (ping-pong buffering).
+  int pid = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (pid >= N)
+  {
+    return;
+  }
+  glm::vec3 updatedVelocity = computeVelocityChangeBruteForce(N, pid, pos, vel1);
+  glm::vec3 novelVelocity = vel1[pid] + updatedVelocity;
+  if (glm::length(novelVelocity) > maxSpeed)
+  {
+    novelVelocity = maxSpeed * glm::normalize(novelVelocity);
+  }
+  vel2[pid] = novelVelocity;
 }

 /**
@@ -282,6 +382,17 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
   return x + y * gridResolution + z * gridResolution * gridResolution;
 }

+__device__ int getGridIndex(int gridResolution, float inverseCellWidth, const glm::vec3 &gridMin,
+  const glm::vec3 &rawPosition)
+{
+  glm::vec3 rawIndex = inverseCellWidth * (rawPosition - gridMin);
+  int x = static_cast<int>(std::floor(rawIndex.x));
+  int y = static_cast<int>(std::floor(rawIndex.y));
+  int z = static_cast<int>(std::floor(rawIndex.z));
+
+  return gridIndex3Dto1D(x, y, z, gridResolution);
+}
+
 __global__ void kernComputeIndices(int N, int gridResolution,
   glm::vec3 gridMin, float inverseCellWidth,
   glm::vec3 *pos, int *indices, int *gridIndices) {
@@ -289,6 +400,13 @@ __global__ void kernComputeIndices(int N, int gridResolution,
   // - Label each boid with the index of its grid cell.
   // - Set up a parallel array of integer indices as pointers to the actual
   //   boid data in pos and vel1/vel2
+  int pid = threadIdx.x + blockIdx.x * blockDim.x;
+  if (pid < N)
+  {
+    int gridIndex = getGridIndex(gridResolution, inverseCellWidth, gridMin, pos[pid]);
+    indices[pid] = pid;
+    gridIndices[pid] = gridIndex;
+  }
 }

 // LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -306,6 +424,25 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+  int pid = threadIdx.x + blockIdx.x * blockDim.x;
+  if (pid < N)
+  {
+    int thisCellId = particleGridIndices[pid];
+
+    // The first boid always starts its cell; a boid whose cell differs from
+    // its predecessor's starts a new cell. Symmetrically for cell ends.
+    if (pid == 0 || thisCellId != particleGridIndices[pid - 1])
+    {
+      gridCellStartIndices[thisCellId] = pid;
+    }
+    if (pid == N - 1 || thisCellId != particleGridIndices[pid + 1])
+    {
+      gridCellEndIndices[thisCellId] = pid;
+    }
+  }
 }

 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -322,6 +459,115 @@ __global__ void kernUpdateVelNeighborSearchScattered(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+  int pid = threadIdx.x + blockIdx.x * blockDim.x;
+  if (pid < N)
+  {
+    int iSelf = particleArrayIndices[pid];
+    glm::vec3 thisPos = pos[iSelf];
+
+    enum cellNeighbors {SELF, POSX, POSY, POSZ, POSXPOSY, POSXPOSZ, POSYPOSZ, POSXPOSYPOSZ};
+    // Cell indices that should be probed for neighbors
+    int probeCells[8];
+    glm::vec3 iSelfIndexRaw = inverseCellWidth * (thisPos - gridMin);
+    int xSelf = static_cast<int>(std::floor(iSelfIndexRaw.x));
+    int ySelf = static_cast<int>(std::floor(iSelfIndexRaw.y));
+    int zSelf = static_cast<int>(std::floor(iSelfIndexRaw.z));
+    // Tell which neighbors of the cell should also be probed: the boid's
+    // position within its own cell decides the probing direction per axis
+    glm::vec3 iSelfIndexBottomLeftCorner = glm::vec3(xSelf, ySelf, zSelf);
+    glm::vec3 iSelfIntheBox = iSelfIndexRaw - iSelfIndexBottomLeftCorner;
+    int xNeighborOffset = -1;
+    int yNeighborOffset = -1;
+    int zNeighborOffset = -1;
+    if (iSelfIntheBox.x >= 0.5f)
+    {
+      xNeighborOffset = 1;
+    }
+    if (iSelfIntheBox.y >= 0.5f)
+    {
+      yNeighborOffset = 1;
+    }
+    if (iSelfIntheBox.z >= 0.5f)
+    {
+      zNeighborOffset = 1;
+    }
+
+    probeCells[SELF] = gridIndex3Dto1D(xSelf, ySelf, zSelf, gridResolution);
+    probeCells[POSX] = gridIndex3Dto1D(xSelf + xNeighborOffset, ySelf, zSelf, 
gridResolution);
+    probeCells[POSY] = gridIndex3Dto1D(xSelf, ySelf + yNeighborOffset, zSelf, gridResolution);
+    probeCells[POSZ] = gridIndex3Dto1D(xSelf, ySelf, zSelf + zNeighborOffset, gridResolution);
+    probeCells[POSXPOSY] = gridIndex3Dto1D(xSelf + xNeighborOffset, ySelf + yNeighborOffset, zSelf, gridResolution);
+    probeCells[POSXPOSZ] = gridIndex3Dto1D(xSelf + xNeighborOffset, ySelf, zSelf + zNeighborOffset, gridResolution);
+    probeCells[POSYPOSZ] = gridIndex3Dto1D(xSelf, ySelf + yNeighborOffset, zSelf + zNeighborOffset, gridResolution);
+    probeCells[POSXPOSYPOSZ] = gridIndex3Dto1D(xSelf + xNeighborOffset, ySelf + yNeighborOffset, zSelf + zNeighborOffset, gridResolution);
+
+    glm::vec3 massCenter(0.0f);
+    glm::vec3 c(0.0f);
+    glm::vec3 perceivedVelocity(0.0f);
+
+    int neighborCounterRule1 = 0;
+    int neighborCounterRule3 = 0;
+
+    int gridCellCount = gridResolution * gridResolution * gridResolution;
+    for (int cellI = 0; cellI < 8; ++cellI)
+    {
+      // Skip probe cells that fall outside the grid
+      if (probeCells[cellI] < 0 || probeCells[cellI] >= gridCellCount)
+      {
+        continue;
+      }
+
+      int startIndex = gridCellStartIndices[probeCells[cellI]];
+      int endIndex = gridCellEndIndices[probeCells[cellI]];
+      for (int boidJ = startIndex; boidJ <= endIndex; ++boidJ)
+      {
+        int neighborIndex = particleArrayIndices[boidJ];
+        if (neighborIndex == iSelf)
+        {
+          continue;
+        }
+        float distance = glm::distance(pos[neighborIndex], thisPos);
+        if (distance < rule1Distance)
+        {
+          massCenter += pos[neighborIndex];
+          neighborCounterRule1++;
+        }
+        if (distance < rule2Distance)
+        {
+          c -= pos[neighborIndex] - thisPos;
+        }
+        if (distance < rule3Distance)
+        {
+          perceivedVelocity += vel1[neighborIndex];
+          neighborCounterRule3++;
+        }
+      }
+    }
+
+    glm::vec3 term1 = glm::vec3(0.0f);
+    glm::vec3 term2 = glm::vec3(0.0f);
+    glm::vec3 term3 = glm::vec3(0.0f);
+    if (neighborCounterRule1 >= 1)
+    {
+      float averageFactor = 1.0f / static_cast<float>(neighborCounterRule1);
+      massCenter = averageFactor * massCenter;
+      term1 = 
rule1Scale * (massCenter - thisPos);
+    }
+    term2 = rule2Scale * c;
+    if (neighborCounterRule3 >= 1)
+    {
+      float averageFactor = 1.0f / static_cast<float>(neighborCounterRule3);
+      term3 = rule3Scale * averageFactor * perceivedVelocity;
+    }
+
+    glm::vec3 updatedVelocity = term1 + term2 + term3;
+    glm::vec3 novelVelocity = updatedVelocity + vel1[iSelf];
+    if (glm::length(novelVelocity) > maxSpeed)
+    {
+      novelVelocity = maxSpeed * glm::normalize(novelVelocity);
+    }
+
+    vel2[iSelf] = novelVelocity;
+  }
 }

 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -341,6 +587,115 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+  int pid = threadIdx.x + (blockIdx.x * blockDim.x);
+
+  if (pid < N)
+  {
+    int iSelf = pid;
+    glm::vec3 thisPos = pos[iSelf];
+
+    enum cellNeighbors { SELF, POSX, POSY, POSZ, POSXPOSY, POSXPOSZ, POSYPOSZ, POSXPOSYPOSZ };
+    // Cell indices that should be probed for neighbors
+    int probeCells[8];
+    glm::vec3 iSelfIndexRaw = inverseCellWidth * (thisPos - gridMin);
+    int xSelf = static_cast<int>(std::floor(iSelfIndexRaw.x));
+    int ySelf = static_cast<int>(std::floor(iSelfIndexRaw.y));
+    int zSelf = static_cast<int>(std::floor(iSelfIndexRaw.z));
+    // Tell which neighbors of the cell should also be probed
+    glm::vec3 iSelfIndexBottomLeftCorner = glm::vec3(xSelf, ySelf, zSelf);
+    glm::vec3 iSelfIntheBox = iSelfIndexRaw - iSelfIndexBottomLeftCorner;
+    int xNeighborOffset = -1;
+    int yNeighborOffset = -1;
+    int zNeighborOffset = -1;
+    if (iSelfIntheBox.x >= 0.5f)
+    {
+      xNeighborOffset = 1;
+    }
+    if (iSelfIntheBox.y >= 0.5f)
+    {
+      yNeighborOffset = 1;
+    }
+    if (iSelfIntheBox.z >= 0.5f)
+    {
+      zNeighborOffset = 1;
+    }
+
+    probeCells[SELF] = gridIndex3Dto1D(xSelf, ySelf, zSelf, gridResolution);
+    probeCells[POSX] = gridIndex3Dto1D(xSelf + xNeighborOffset, ySelf, 
zSelf, gridResolution);
+    probeCells[POSY] = gridIndex3Dto1D(xSelf, ySelf + yNeighborOffset, zSelf, gridResolution);
+    probeCells[POSZ] = gridIndex3Dto1D(xSelf, ySelf, zSelf + zNeighborOffset, gridResolution);
+    probeCells[POSXPOSY] = gridIndex3Dto1D(xSelf + xNeighborOffset, ySelf + yNeighborOffset, zSelf, gridResolution);
+    probeCells[POSXPOSZ] = gridIndex3Dto1D(xSelf + xNeighborOffset, ySelf, zSelf + zNeighborOffset, gridResolution);
+    probeCells[POSYPOSZ] = gridIndex3Dto1D(xSelf, ySelf + yNeighborOffset, zSelf + zNeighborOffset, gridResolution);
+    probeCells[POSXPOSYPOSZ] = gridIndex3Dto1D(xSelf + xNeighborOffset, ySelf + yNeighborOffset, zSelf + zNeighborOffset, gridResolution);
+
+    glm::vec3 massCenter(0.0f);
+    glm::vec3 c(0.0f);
+    glm::vec3 perceivedVelocity(0.0f);
+
+    int neighborCounterRule1 = 0;
+    int neighborCounterRule3 = 0;
+
+    int gridCellCount = gridResolution * gridResolution * gridResolution;
+    for (int cellI = 0; cellI < 8; ++cellI)
+    {
+      if (probeCells[cellI] < 0 || probeCells[cellI] >= gridCellCount)
+      {
+        continue;
+      }
+      int startIndex = gridCellStartIndices[probeCells[cellI]];
+      int endIndex = gridCellEndIndices[probeCells[cellI]];
+
+      // pos and vel1 are already reshuffled into cell order, so boidJ
+      // indexes the boid data directly: no "middleman" lookup needed
+      for (int boidJ = startIndex; boidJ <= endIndex; boidJ++)
+      {
+        if (boidJ == iSelf)
+        {
+          continue;
+        }
+        float distance = glm::distance(pos[boidJ], thisPos);
+        if (distance < rule1Distance)
+        {
+          massCenter += pos[boidJ];
+          neighborCounterRule1++;
+        }
+        if (distance < rule2Distance)
+        {
+          c -= pos[boidJ] - thisPos;
+        }
+        if (distance < rule3Distance)
+        {
+          perceivedVelocity += vel1[boidJ];
+          neighborCounterRule3++;
+        }
+      }
+    }
+
+    glm::vec3 term1 = glm::vec3(0.0f);
+    glm::vec3 term2 = glm::vec3(0.0f);
+    glm::vec3 term3 = glm::vec3(0.0f);
+    if (neighborCounterRule1 >= 1)
+    {
+      float averageFactor = 1.0f / static_cast<float>(neighborCounterRule1);
+      massCenter = averageFactor * massCenter;
+      term1 = rule1Scale * (massCenter - thisPos);
+    }
+    term2 = rule2Scale * c;
+    if (neighborCounterRule3 >= 1)
+    {
+      float 
averageFactor = 1.0f / static_cast<float>(neighborCounterRule3);
+      term3 = rule3Scale * averageFactor * perceivedVelocity;
+    }
+
+    glm::vec3 updatedVelocity = term1 + term2 + term3;
+    glm::vec3 novelVelocity = updatedVelocity + vel1[iSelf];
+    if (glm::length(novelVelocity) > maxSpeed)
+    {
+      novelVelocity = maxSpeed * glm::normalize(novelVelocity);
+    }
+
+    vel2[iSelf] = novelVelocity;
+  }
 }

 /**
@@ -349,6 +704,17 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
 void Boids::stepSimulationNaive(float dt) {
   // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
   // TODO-1.2 ping-pong the velocity buffers
+  dim3 blocksPerGrid(1 + numObjects / blockSize);
+  kernUpdatePos<<<blocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel1);
+  checkCUDAErrorWithLine("kernel update pos");
+
+  kernUpdateVelocityBruteForce<<<blocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("kernel update vel");
+
+  glm::vec3* temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
 }

 void Boids::stepSimulationScatteredGrid(float dt) {
@@ -364,6 +730,27 @@ void Boids::stepSimulationScatteredGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed
+  dim3 cellsPerBlock(1 + gridCellCount / blockSize);
+  // A start index of -1 (and end of -2) marks a cell containing no boids
+  kernResetIntBuffer<<<cellsPerBlock, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer<<<cellsPerBlock, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -2);
+
+  dim3 blocksPerGrid(1 + numObjects / blockSize);
+  kernComputeIndices<<<blocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+  kernIdentifyCellStartEnd<<<blocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices,
+    dev_gridCellStartIndices, dev_gridCellEndIndices);
+  kernUpdateVelNeighborSearchScattered<<<blocksPerGrid, blockSize>>>(numObjects, gridSideCount,
+    gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices,
+    dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+  kernUpdatePos<<<blocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+
+  glm::vec3* temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
 }

 void Boids::stepSimulationCoherentGrid(float dt) {
@@ -382,6 +769,29 @@ void Boids::stepSimulationCoherentGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+  dim3 cellsPerBlock((gridCellCount + blockSize - 1) / blockSize);
+  kernResetIntBuffer<<<cellsPerBlock, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer<<<cellsPerBlock, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -2);
+
+  dim3 blocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernComputeIndices<<<blocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+  // Reshuffle positions and velocities into cell order so the coherent kernel
+  // can index boid data directly (dev_vel2 holds the reordered vel1)
+  kernReorderPos<<<blocksPerGrid, blockSize>>>(numObjects, dev_pos_reorder, dev_pos, dev_particleArrayIndices);
+  kernReorderPos<<<blocksPerGrid, blockSize>>>(numObjects, dev_vel2, dev_vel1, dev_particleArrayIndices);
+
+  kernIdentifyCellStartEnd<<<blocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices,
+    dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+  kernUpdateVelNeighborSearchCoherent<<<blocksPerGrid, blockSize>>>(numObjects, gridSideCount,
+    gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices,
+    dev_pos_reorder, dev_vel2, dev_vel1);
+  kernUpdatePos<<<blocksPerGrid, blockSize>>>(numObjects, dt, dev_pos_reorder, dev_vel1);
+
+  glm::vec3 *temp = dev_pos;
+  dev_pos = dev_pos_reorder;
+  dev_pos_reorder = temp;
 }

 void Boids::endSimulation() {
@@ -390,6 +800,11 @@ void Boids::endSimulation() {
   cudaFree(dev_pos);

   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_pos_reorder);
 }

 void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..073e936 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,11 +14,11 @@
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1

 // LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 40000;
 const float DT = 0.2f;

 /**