diff --git a/README.md b/README.md index 0e38ddb..9c93f9b 100644 --- a/README.md +++ b/README.md @@ -1,14 +1,99 @@ -CUDA Stream Compaction -====================== +**University of Pennsylvania, CIS 565: GPU Programming and Architecture, +Project 2 - Stream-Compaction** -**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** +* Henry Zhu + * [Github](https://github.com/Maknee), [LinkedIn](https://www.linkedin.com/in/henry-zhu-347233121/), [personal website](https://maknee.github.io/), [twitter](https://twitter.com/maknees1), etc. +* Tested on: Windows 10 Home, Intel i7-4710HQ @ 2.50GHz 22GB, GTX 870M (Own computer) -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +## Scanning and stream-compaction + +![](scan_all_things.png) + +### What is scanning/stream-compaction + +Scanning (prefix sum) is summing all the values from previous indices into the current index. The image below from [GPU Gems](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html) depicts how scan works: + +![](scan.png) + +Stream compaction is the removal of a particular value from an array. The image below from [GPU Gems](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html) depicts how stream compaction works: + +![](stream-compaction.png) + +## Answer to Questions + +### Performance of scan speeds (Array sizes) + +![](scan_speeds.png) + +### To guess at what might be happening inside the Thrust implementation (e.g. allocation, memory copy), take a look at the Nsight timeline for its execution. Your analysis here doesn't have to be detailed, since you aren't even looking at the code for the implementation. + +The thrust implementation is much better optimized for larger instances. It does not, however, work well with small array sizes. +This is most likely due to its implementation. 
It takes into consideration a better block size for a larger number of elements to scan through, which is why it performs much better for larger inputs. +
801298 801317 ] + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.866112ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 4.9281ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.254016ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 3 1 0 1 1 2 1 0 2 0 2 3 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.12029ms (std::chrono Measured) + [ 3 1 1 1 2 1 2 2 3 1 1 3 1 ... 3 1 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.11988ms (std::chrono Measured) + [ 3 1 1 1 2 1 2 2 3 1 1 3 1 ... 1 1 ] + passed +==== cpu compact with scan ==== + elapsed time: 1.28747ms (std::chrono Measured) + [ 3 1 1 1 2 1 2 2 3 1 1 3 1 ... 3 1 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 1.34822ms (CUDA Measured) + [ 3 1 1 1 2 1 2 2 3 1 1 3 1 ... 3 1 ] + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 1.15571ms (CUDA Measured) + [ 3 1 1 1 2 1 2 2 3 1 1 3 1 ... 1 1 ] + passed +``` -### (TODO: Your README) -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) 
diff --git a/scan.png b/scan.png new file mode 100644 index 0000000..f19d9f6 Binary files /dev/null and b/scan.png differ diff --git a/scan_all_things.png b/scan_all_things.png new file mode 100644 index 0000000..4795ecd Binary files /dev/null and b/scan_all_things.png differ diff --git a/scan_speeds.png b/scan_speeds.png new file mode 100644 index 0000000..5d26ad2 Binary files /dev/null and b/scan_speeds.png differ diff --git a/src/main.cpp b/src/main.cpp index 1850161..b5f6eae 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,7 +13,7 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 13; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; @@ -44,14 +44,14 @@ int main(int argc, char* argv[]) { printDesc("cpu scan, non-power-of-two"); StreamCompaction::CPU::scan(NPOT, c, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(NPOT, b, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan @@ -71,7 +71,7 @@ int main(int argc, char* argv[]) { printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); @@ -137,14 +137,14 @@ int main(int argc, char* argv[]) { printDesc("work-efficient 
compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); system("pause"); // stop Win32 console from closing on exit diff --git a/stream-compaction.png b/stream-compaction.png new file mode 100644 index 0000000..8dce98f Binary files /dev/null and b/stream-compaction.png differ diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 8fc0211..2db138c 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -23,7 +23,11 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + const int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index > n) + return; + + bools[index] = idata[index] ? 1 : 0; } /** @@ -31,8 +35,18 @@ namespace StreamCompaction { * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]]. */ __global__ void kernScatter(int n, int *odata, - const int *idata, const int *bools, const int *indices) { - // TODO + const int *idata, const int *bools, const int *indices) { + const int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index > n) + return; + + //filter only elements that are not zero in the bool map. 
+ if (bools[index]) + { + //get the index of where the element is suppose to be in the in the final array + const int index_of_filtered_element = indices[index]; + odata[index_of_filtered_element] = idata[index]; + } } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 05ce667..15855b3 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,50 +1,150 @@ #include #include "cpu.h" -#include "common.h" - -namespace StreamCompaction { - namespace CPU { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; - } - - /** - * CPU scan (prefix sum). - * For performance analysis, this is supposed to be a simple for loop. - * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. - */ - void scan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - } - - /** - * CPU stream compaction without using the scan function. - * - * @returns the number of elements remaining after compaction. - */ - int compactWithoutScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; - } - - /** - * CPU stream compaction using scan and scatter, like the parallel version. - * - * @returns the number of elements remaining after compaction. - */ - int compactWithScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; - } - } +#include "common.h" +#include +#include + +namespace StreamCompaction +{ + namespace CPU + { + using StreamCompaction::Common::PerformanceTimer; + + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + + //actual implementation of scan + //because timer().startCpuTimer() is called inside + //scan(...) 
from scatter(...), causing an abort + void scan_impl(int n, int* odata, const int* idata) + { + /// super naive cpu implementation /// + // memset(odata, 0, n * sizeof(int)); + // + // for(int k = 1; k < n; k++) + // { + // odata[k] = odata[k - 1] + idata[k - 1]; + // } + + /// psuedo parallel implementation /// + + //make sure the data is set first before beginning + memcpy(odata, idata, sizeof(int) * n); + + for (int d = 1; static_cast(d) <= std::ceil(std::log2(n)); d++) + { + //make a copy, because naive can't be done in place + auto temp = std::make_unique(n); + memcpy(temp.get(), odata, n * sizeof(int)); + for (int k = 0; k < n; k++) + { + //follow the formula + if (k >= static_cast(std::pow(2, d - 1))) + { + odata[k] = temp[k - static_cast(std::pow(2, d - 1))] + temp[k]; + } + } + } + + //copy the data back + auto temp = std::make_unique(n); + memcpy(temp.get(), odata, n * sizeof(int)); + + //shift right by 1 + for (int i = 1; i < n; i++) + { + odata[i] = temp[i - 1]; + } + //set first element to 0 + odata[0] = 0; + } + + /** + * CPU scan (prefix sum). + * For performance analysis, this is supposed to be a simple for loop. + * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. + */ + void scan(int n, int* odata, const int* idata) + { + timer().startCpuTimer(); + + scan_impl(n, odata, idata); + + timer().endCpuTimer(); + } + + /** + * CPU stream compaction without using the scan function. + * + * @returns the number of elements remaining after compaction. + */ + int compactWithoutScan(int n, int* odata, const int* idata) + { + timer().startCpuTimer(); + + memset(odata, 0, n * sizeof(int)); + + int index = 0; + + //iterate through and count + for (int i = 0; i < n; i++) + { + if (idata[i]) + { + odata[index] = idata[i]; + index++; + } + } + + timer().endCpuTimer(); + return index; + } + + /** + * CPU stream compaction using scan and scatter, like the parallel version. 
+ * + * @returns the number of elements remaining after compaction. + */ + int compactWithScan(int n, int* odata, const int* idata) + { + timer().startCpuTimer(); + + auto counters = std::make_unique(n); + + int count = 0; + + //iterate through and count + for (int i = 0; i < n; i++) + { + counters[i] = idata[i] ? 1 : 0; + if (counters[i]) + { + count++; + } + } + + auto indicies = std::make_unique(n); + + memcpy(indicies.get(), counters.get(), n); + + //scan + scan_impl(n, indicies.get(), counters.get()); + + //now set the scanned result to the correct index + for (int i = 0; i < n; i++) + { + if (counters[i]) + { + odata[indicies[i]] = idata[i]; + } + } + + timer().endCpuTimer(); + return count; + } + } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 36c5ef2..40e76b2 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -2,25 +2,158 @@ #include #include "common.h" #include "efficient.h" +#include -namespace StreamCompaction { - namespace Efficient { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; +namespace StreamCompaction +{ + namespace Efficient + { + using StreamCompaction::Common::PerformanceTimer; + + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + + //taken from printArray in main + void printArray(int n, int* a, bool abridged = false) + { + printf(" [ "); + for (int i = 0; i < n; i++) + { + if (abridged && i + 2 == 15 && n > 16) + { + i = n - 2; + printf("... 
"); + } + printf("%3d ", a[i]); + } + printf("]\n"); + } + + //pow function + __device__ __host__ int kernPow2(int power_of) + { + int result = 1; + for (int i = 0; i < power_of; i++) + { + result <<= 1; + } + return result; + } + + //round to the nearest power of 2 (ceiling) + void round_to_nearest_pow(int& n) + { + //round n to the nearest pow of n + n = std::ceil(std::log2(n)); + n = kernPow2(n); + } + + //up sweep function + __global__ void kernUpSweep(int N, int* odata, int d) + { + int k = threadIdx.x + (blockIdx.x * blockDim.x); + + //k is multiplied by pow to index the correct location + k *= kernPow2(d + 1); + if (k > N) + return; + + //formula + odata[k + kernPow2(d + 1) - 1] += odata[k + kernPow2(d) - 1]; + } + + __global__ void kernDownSweep(int N, int* odata, int d) + { + int k = threadIdx.x + (blockIdx.x * blockDim.x); + k *= kernPow2(d + 1); + if (k > N) + return; + + //formula + int t = odata[k + kernPow2(d) - 1]; + odata[k + kernPow2(d) - 1] = odata[k + kernPow2(d + 1) - 1]; + odata[k + kernPow2(d + 1) - 1] += t; + } + +#define BLOCK_SIZE 128 + + //actual implementation of scan + //because timer().startGpuTimer() is called inside + //scan(...) 
from scatter(...), causing an abort + void scan_impl(int n, int* odata, const int* idata) + { + //round to nearest pow because n might not be a power of 2 + round_to_nearest_pow(n); + + dim3 kernelBlock((n + BLOCK_SIZE - 1) / BLOCK_SIZE); + + int* kern_odata; + + cudaMalloc(reinterpret_cast(&kern_odata), n * sizeof(int)); + + cudaMemcpy(kern_odata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + //call upsweep + for (int d = 0; static_cast(d) < std::ceil(std::log2(n)); d++) + { + kernUpSweep<<>>(n, kern_odata, d); + //cudaMemcpy(odata, kern_odata, n * sizeof(int), cudaMemcpyDeviceToHost); + //printArray(n, odata, true); + //printArray(n/2, odata + n/2, true); + } + + //printf("=====================\n"); + + //copy 0 to the end (from formula) + int zero = 0; + cudaMemcpy(kern_odata + n - 1, &zero, sizeof(int), cudaMemcpyHostToDevice); + + //down sweep + for (int d = static_cast(std::ceil(std::log2(n))) - 1; d >= 0; d--) + { + kernDownSweep<<>>(n, kern_odata, d); + //cudaMemcpy(odata, kern_odata, n * sizeof(int), cudaMemcpyDeviceToHost); + //printArray(n, odata, true); + //printArray(n/2, odata + n/2, true); + } + + //copy the result + cudaMemcpy(odata, kern_odata, n * sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(kern_odata); } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ - void scan(int n, int *odata, const int *idata) { + void scan(int n, int* odata, const int* idata) + { timer().startGpuTimer(); - // TODO + + scan_impl(n, odata, idata); + timer().endGpuTimer(); } + __global__ void kernScatter(int N, int* final_array, const int* bool_array, const int* scan_array, + const int* unfiltered_array) + { + const int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index > N) + return; + + //filter only elements that are not zero in the bool map. 
+ if (bool_array[index]) + { + //get the index of where the element is suppose to be in the in the final array + const int index_of_filtered_element = scan_array[index]; + final_array[index_of_filtered_element] = unfiltered_array[index]; + } + } + /** * Performs stream compaction on idata, storing the result into odata. * All zeroes are discarded. @@ -30,11 +163,76 @@ namespace StreamCompaction { * @param idata The array of elements to compact. * @returns The number of elements remaining after compaction. */ - int compact(int n, int *odata, const int *idata) { + int compact(int n, int* odata, const int* idata) + { timer().startGpuTimer(); - // TODO + + dim3 kernelBlock((n + BLOCK_SIZE - 1) / BLOCK_SIZE); + + //make another variable that is the power of n (we need this since counter can't iterate through pow n) + int rounded_n = n; + round_to_nearest_pow(rounded_n); + + auto counters = std::make_unique(rounded_n); + memset(counters.get(), 0, rounded_n * sizeof(int)); + + //idata + int* unfiltered_array; + cudaMalloc(reinterpret_cast(&unfiltered_array), n * sizeof(int)); + cudaMemcpy(unfiltered_array, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + //bool mapping (1 or 0) + int* bool_array; + cudaMalloc(reinterpret_cast(&bool_array), n * sizeof(int)); + + Common::kernMapToBoolean<<>>(n, bool_array, unfiltered_array); + + cudaMemcpy(counters.get(), bool_array, n * sizeof(int), cudaMemcpyDeviceToHost); + + int count = 0; + + //iterate through and count + for (int i = 0; i < n; i++) + { + if (counters[i]) + { + count++; + } + } + + //now round to nearest pow + round_to_nearest_pow(n); + + kernelBlock = dim3((n + BLOCK_SIZE - 1) / BLOCK_SIZE); + + auto scan_result = std::make_unique(n); + + memcpy(scan_result.get(), counters.get(), n); + + //scan + scan_impl(n, scan_result.get(), counters.get()); + + int* final_array; + int* scan_array; + + cudaMalloc(reinterpret_cast(&final_array), n * sizeof(int)); + cudaMalloc(reinterpret_cast(&scan_array), n * sizeof(int)); + + 
cudaMemcpy(scan_array, scan_result.get(), n * sizeof(int), cudaMemcpyHostToDevice); + + //do scatter + kernScatter<<>>(n, final_array, bool_array, scan_array, unfiltered_array); + + //copy the result back + cudaMemcpy(odata, final_array, count * sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(final_array); + cudaFree(bool_array); + cudaFree(scan_array); + cudaFree(unfiltered_array); + timer().endGpuTimer(); - return -1; + return count; } } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 9218f8e..37d5336 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -1,25 +1,108 @@ -#include -#include -#include "common.h" -#include "naive.h" - -namespace StreamCompaction { - namespace Naive { +#include +#include +#include "common.h" +#include "naive.h" +#include + +namespace StreamCompaction +{ + namespace Naive + { using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() { static PerformanceTimer timer; return timer; - } - // TODO: __global__ - - /** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ - void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - } - } -} + } + + //pow function + __device__ int kernPow2(int power_of) + { + int result = 1; + for (int i = 0; i < power_of; i++) + { + result <<= 1; + } + return result; + } + + __global__ void kernNaive(int N, int* odata, const int* idata, int d) + { + const int k = threadIdx.x + (blockIdx.x * blockDim.x); + if (k > N) + return; + + //follow the formula + if (k >= kernPow2(d - 1)) + { + odata[k] = idata[k - kernPow2(d - 1)] + idata[k]; + } + } + + //swap pointers + template ::value>::type> + void swap_pointers(T& a, T& b) + { + T c = a; + a = b; + b = c; + } + + +#define BLOCK_SIZE 128 + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
+ */ + void scan(int n, int* odata, const int* idata) + { + timer().startGpuTimer(); + + dim3 kernelBlock((n + BLOCK_SIZE - 1) / BLOCK_SIZE); + + int* kern_odata; + int* kern_odata2; + + //allocate memory + cudaMalloc(reinterpret_cast(&kern_odata), n * sizeof(int)); + cudaMalloc(reinterpret_cast(&kern_odata2), n * sizeof(int)); + + //copy the start data + cudaMemcpy(kern_odata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + for (int d = 1; static_cast(d) <= std::ceil(std::log2(n)); d++) + { + //make sure we copy over the ones that are not part of offset d + cudaMemcpy(kern_odata2, kern_odata, n * sizeof(int), cudaMemcpyHostToDevice); + + //call the naive impl of kernel + kernNaive<<>>(n, kern_odata2, kern_odata, d); + + //ping pong + swap_pointers(kern_odata, kern_odata2); + } + + //copy the result + cudaMemcpy(odata, kern_odata, n * sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(kern_odata); + cudaFree(kern_odata2); + + //shift right + auto temp = std::make_unique(n); + memcpy(temp.get(), odata, n * sizeof(int)); + + //shift right by 1 + for (int i = 1; i < n; i++) + { + odata[i] = temp[i - 1]; + } + + //set first element to 0 + odata[0] = 0; + + timer().endGpuTimer(); + } + } +} diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 36b732d..b2243e8 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -8,21 +8,38 @@ namespace StreamCompaction { namespace Thrust { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
*/ void scan(int n, int *odata, const int *idata) { + + thrust::host_vector hv_in(idata, idata + n); + thrust::host_vector hv_out(odata, odata + n); + thrust::device_vector dv_in = hv_in; + thrust::device_vector dv_out = hv_out; + timer().startGpuTimer(); - // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + + thrust::exclusive_scan(std::begin(dv_in), + std::end(dv_in), + std::begin(dv_out)); + // thrust::exclusive_scan(idata, + // idata + n, + // odata); + timer().endGpuTimer(); + + hv_out = dv_out; + memcpy(odata, &hv_out[0], n * sizeof(int)); + } } }