Commit 513da45

Adding commit for distributed processing
1 parent ca80811 commit 513da45

13 files changed, +4152 -1 lines changed

CreateDataSet.ipynb

+1,679
Large diffs are not rendered by default.

DirectFileShard.ipynb

+734
Large diffs are not rendered by default.

FolderShard.ipynb

+737
Large diffs are not rendered by default.

ManifestFileShard.ipynb

+965
Large diffs are not rendered by default.

README.md

+23 -1
@@ -1 +1,23 @@
-# sagemaker-aws
+# sagemaker-aws
+
+Create a bucket named sagemaker-crossaccnt-train, or change the DEF_BUCKET variable.
+Create a folder prefix of data/finance/distrib-multi, or change the PREFIX variable.
+
+
+
+#### Generate Data
+* Generate synthetic housing data, with no categorical columns for now.
+* SHARD is a random index, so every run produces a new set of shard indexes and files.
+* Generate JSON file: not included yet.
+* Generate Zip file: not included yet.
+
+#### Experiments run
+* Below are the Jupyter notebooks with the experiments that were run.
+* Folder Shard: the data is sharded by folder, and each folder has 1 CSV file to be processed.
+* TarBall Shard: the data is sharded by creating 1 tarball for each unique data set to be processed.
+* Manifest Shard: the data is sharded by a manifest file that lists the locations of the various files and data sets.
+    * For Manifest, 5 different experiments were run, which are documented.
+    * While creating the manifest file, the S3 bucket had to be hard-coded in that script. Please change it to your bucket.
+* Direct File Shard: place all files directly under 1 folder in S3 and try to shard by S3.
+
+
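
The "Generate Data" notes in the README describe assigning each synthetic record a random shard index and writing one file per shard. The notebook that does this is not rendered in this diff, so the following is only a minimal sketch of the idea using the standard library. The column names come from the CSV headers in this commit; the function names `make_row` and `write_shards`, the `seed` parameter, and all value ranges are hypothetical.

```python
import csv
import random
from pathlib import Path

# Header as it appears in data/test_*.csv (the first column is an unnamed index).
COLUMNS = ["", "SHARD_PREFIX", "YEAR_BUILT", "SQUARE_FEET", "NUM_BEDROOMS",
           "NUM_BATHROOMS", "LOT_ACRES", "GARAGE_SPACES", "FRONT_PORCH",
           "DECK", "PRICE"]


def make_row(index, shard, rng):
    """Return one synthetic record; the value ranges are illustrative guesses."""
    return [
        index,
        shard,
        rng.randint(1980, 2010),               # YEAR_BUILT
        rng.uniform(2000, 4000),               # SQUARE_FEET
        rng.randint(2, 6),                     # NUM_BEDROOMS
        rng.choice([1.0, 1.5, 2.0, 2.5, 3.0]), # NUM_BATHROOMS
        round(rng.uniform(0.4, 1.7), 2),       # LOT_ACRES
        rng.randint(0, 3),                     # GARAGE_SPACES
        rng.randint(0, 1),                     # FRONT_PORCH
        rng.randint(0, 1),                     # DECK
        rng.randint(300_000, 600_000),         # PRICE
    ]


def write_shards(num_rows, num_shards, out_dir, seed=None):
    """Assign each row a random shard and write one test_<shard>.csv per shard."""
    rng = random.Random(seed)
    shards = {shard: [] for shard in range(num_shards)}
    for index in range(num_rows):
        shard = rng.randrange(num_shards)  # random shard index, new on every unseeded run
        shards[shard].append(make_row(index, shard, rng))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for shard, rows in shards.items():
        path = out_dir / f"test_{shard}.csv"
        with path.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(COLUMNS)
            writer.writerows(rows)
        paths.append(path)
    return paths
```

Calling `write_shards(10, 4, "data")` would produce `test_0.csv` through `test_3.csv` with ten rows spread unevenly across them. Passing a fixed `seed` makes the shard assignment repeatable, which the README's random-index behaviour does not appear to do.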

data/test_0.csv

+3
@@ -0,0 +1,3 @@
+,SHARD_PREFIX,YEAR_BUILT,SQUARE_FEET,NUM_BEDROOMS,NUM_BATHROOMS,LOT_ACRES,GARAGE_SPACES,FRONT_PORCH,DECK,PRICE
+1,0,2004,3932.424542784298,2,2.0,1.3,1,1,1,599363
+5,0,1987,2806.1007143692805,3,1.5,0.69,1,0,0,338765

data/test_0.tar.gz

299 Bytes
Binary file not shown.
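
The README's TarBall Shard experiment creates one tarball per data set, and the `test_N.tar.gz` files in this commit sit next to CSVs of the same name. The archives are binary and not shown, so it is an assumption that each one simply wraps its matching CSV. Under that assumption, a sketch with the standard `tarfile` module might look like this (`make_tarball` is a hypothetical helper name):

```python
import tarfile
from pathlib import Path


def make_tarball(csv_path):
    """Package a single CSV as <stem>.tar.gz beside it and return the archive path."""
    csv_path = Path(csv_path)
    archive_path = csv_path.with_name(csv_path.stem + ".tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname drops the directory part so the archive holds just the file name.
        archive.add(csv_path, arcname=csv_path.name)
    return archive_path
```

For example, `make_tarball("data/test_0.csv")` would write `data/test_0.tar.gz` containing a single member named `test_0.csv`.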

data/test_1.csv

+4
@@ -0,0 +1,4 @@
+,SHARD_PREFIX,YEAR_BUILT,SQUARE_FEET,NUM_BEDROOMS,NUM_BATHROOMS,LOT_ACRES,GARAGE_SPACES,FRONT_PORCH,DECK,PRICE
+2,1,2002,2110.3580824176097,6,2.5,1.02,2,1,1,374353
+4,1,1995,2356.6152874384297,2,2.0,1.41,3,0,1,349642
+6,1,1986,3359.8186254695715,3,1.0,1.18,3,0,0,446672

data/test_1.tar.gz

325 Bytes
Binary file not shown.

data/test_2.csv

+4
@@ -0,0 +1,4 @@
+,SHARD_PREFIX,YEAR_BUILT,SQUARE_FEET,NUM_BEDROOMS,NUM_BATHROOMS,LOT_ACRES,GARAGE_SPACES,FRONT_PORCH,DECK,PRICE
+3,2,2009,2688.848561383724,2,2.0,0.84,2,1,1,445927
+7,2,2002,3482.570149229119,4,3.0,1.26,3,1,1,586285
+9,2,1995,2868.5843673782756,4,3.0,0.91,1,0,0,423937

data/test_2.tar.gz

326 Bytes
Binary file not shown.

data/test_3.csv

+3
@@ -0,0 +1,3 @@
+,SHARD_PREFIX,YEAR_BUILT,SQUARE_FEET,NUM_BEDROOMS,NUM_BATHROOMS,LOT_ACRES,GARAGE_SPACES,FRONT_PORCH,DECK,PRICE
+0,3,2002,2587.2456112525065,2,3.0,1.65,1,0,0,407836
+8,3,2000,2275.1292749126524,6,1.0,0.4,0,0,0,327269

data/test_3.tar.gz

298 Bytes
Binary file not shown.
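
The README's Manifest Shard experiment relies on a manifest file listing the locations of the data files, and notes that the bucket was hard-coded in the script that builds it. The manifest notebook is not rendered here, so this is only a sketch of how such a file could be built for shard files like the ones above. It uses the SageMaker manifest layout (a `prefix` entry followed by relative keys) and the `InputDataConfig` channel shape from the CreateTrainingJob API. The helper names, the `train` channel name, and the choice of `ShardedByS3Key` are assumptions, not taken from the commit.

```python
import json


def build_manifest(bucket, prefix, keys):
    """Return manifest JSON: one shared S3 prefix entry followed by relative keys."""
    entries = [{"prefix": f"s3://{bucket}/{prefix.strip('/')}/"}]
    entries.extend(keys)
    return json.dumps(entries, indent=2)


def manifest_channel(manifest_uri, channel_name="train"):
    """Return an InputDataConfig entry pointing a training channel at the manifest."""
    return {
        "ChannelName": channel_name,  # hypothetical channel name
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "ManifestFile",
                "S3Uri": manifest_uri,
                # Assumed here: split the listed files across training instances.
                "S3DataDistributionType": "ShardedByS3Key",
            }
        },
    }
```

Taking the bucket and prefix as parameters, for instance `build_manifest("sagemaker-crossaccnt-train", "data/finance/distrib-multi", ["test_0.csv", "test_1.csv"])`, avoids the hard-coded bucket the README warns about.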
