What's the meaning of the naming convention for TFRecord files? #86
These are sharded files, so '@1000' means the dataset is split into 1000 different files. You then also see this in the individual file names, '00000-of-01000' etc. The training examples are distributed to the 1000 different shards using a fingerprint hashing function designed to distribute them roughly evenly. There are about 9.1 billion total training examples across the 1000 shards, sampled from skeletons all over the h01 volume. You can see a roughly similar sharding scheme (but not using the same hashing function) in the CSV ZIP archives of the embedding outputs, which are demoed here using the simple reader library.
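For intuition, here is a minimal sketch of hash-based shard assignment; md5 is a stand-in, since the actual fingerprint function used by the pipeline isn't specified here, and the example key is hypothetical:

```python
import hashlib

NUM_SHARDS = 1000  # matches the '@1000' sharding of the training table

def shard_for_key(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map an example key to a shard index via a stable hash.

    md5 is only a placeholder; the real pipeline uses its own
    fingerprint function, but any hash that spreads keys roughly
    uniformly produces the same even distribution across shards.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % num_shards

# e.g. which shard file a (hypothetical) example key would land in:
shard = shard_for_key("segment_12345_pair_678")
print(f"...tfrecord-{shard:05d}-of-{NUM_SHARDS:05d}")
```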
The max200000 and skip50 in the name refer to hyperparameters chosen in the extraction of the example pairs. I'm pretty sure max200000 means that the maximum pair distance sampled is 200 um. I think skip50 means that in the sampling of nodes to form example pairs, skeletons were first subsampled to 1/50 of the total skeleton nodes. @sdorkenw could say more or correct me if I have misinterpreted the meanings.
Thank you very much for your response. I think I understand a bit now. So it also means that these 1000 files do not correspond to 1000 segments; rather, all the sampled pairs are combined and distributed across 1000 files. You mentioned that there are approximately 9.1 billion training examples in total; does this already cover all of the h01 embeddings? As for the code link you quoted, I have studied the sharding scheme and learned that each compressed file is named using a hash value, and the embeddings of different segment IDs are hashed and stored in the corresponding ZIP file. (Thank you very, very much again!) By the way, how can I create a TFRecord file like yours, i.e. how can I sample my skeleton and encode the data as TFRecords?
Thank you very much. As mentioned in the paper, the distances between pairs are roughly evenly distributed across the intervals [0, 10000, 30000, 100000, 150000], so max200000 means that the maximum distance between pairs is 200000 nm, which suddenly makes sense! And skip50 may be because the skeleton nodes are too dense, so down-sampling was performed. Once again, thank you for your answer!
This TFRecord table contains positive pair examples used for training SegCLR de novo. The precomputed output embeddings (~4 billion total for h01) are available separately in the sharded CSV ZIP archives from the other notebook.
Right, the precomputed embedding CSV ZIPs are sharded according to segment IDs with a known sharding function, to allow you to look up the right shard for a given ID (all handled by the simple reader library).
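To make that concrete, a sketch of ID-based shard lookup under the assumption of a simple hash-mod scheme; the authoritative sharding function (and the actual ZIP file naming) is whatever the reader library implements, and the segment ID below is just an example value:

```python
import hashlib

def zip_shard_for_segment(segment_id: int, num_shards: int) -> int:
    """Illustrative segment-ID -> shard index lookup.

    Assumption: a hash-mod scheme like this one; the SegCLR reader
    library implements and applies the real sharding function.
    """
    digest = hashlib.md5(str(segment_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Hypothetical usage: open only the one ZIP shard that can contain the ID.
shard = zip_shard_for_segment(864691135293126156, num_shards=1000)
path = f"embeddings-{shard:05d}-of-01000.zip"  # placeholder file name
```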
You can refer to the TensorFlow documentation here and use the format of the TFRecords in our demo table as a guide for how to structure your examples.
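A minimal sketch of serializing one training pair into a TFRecord, assuming a simple feature layout; the feature names here ("center_a", "center_b", "distance") are illustrative, so inspect a record from the demo table to match its actual schema:

```python
import tensorflow as tf

def make_pair_example(coord_a, coord_b, distance_nm):
    """Serialize one positive pair as a tf.train.Example.

    Feature names are placeholders; match them to the demo
    table's actual schema before training against it.
    """
    feature = {
        "center_a": tf.train.Feature(
            float_list=tf.train.FloatList(value=list(coord_a))),
        "center_b": tf.train.Feature(
            float_list=tf.train.FloatList(value=list(coord_b))),
        "distance": tf.train.Feature(
            float_list=tf.train.FloatList(value=[distance_nm])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Write a single-shard TFRecord file with one example pair.
with tf.io.TFRecordWriter("pairs.tfrecord-00000-of-00001") as writer:
    example = make_pair_example((0.0, 0.0, 0.0), (100.0, 50.0, 25.0), 114.6)
    writer.write(example.SerializeToString())
```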
Okay, great. Through step-by-step debugging and consulting the relevant documentation, I have roughly understood the data structure inside the TFRecord: it records the coordinates of two nodes. However, I still have a question about how those node pairs were determined in the first place.
We used the existing skeletonization of the h01 c3 segmentation. This was done via the improved TEASAR method implemented in the kimimaro package. Once you have skeletons, you can sample the nodes as appropriate for your dataset and then compute all the neighbors within 150 um and record their distances for bucketing purposes. Technically it's not necessary to start from skeletons; you could also sample directly from the segmentation masks, but we found the skeletons convenient and it may be that biasing to sampling from the centerline of the objects helps. |
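As a rough sketch of that neighbor step, assuming skeleton node coordinates as an (N, 3) array in nanometers; scipy's KD-tree is my choice here, not necessarily what the authors used, and the bucket edges follow the distance intervals discussed above:

```python
import numpy as np
from scipy.spatial import cKDTree

# Bucket edges in nm, per the distance intervals discussed above.
BUCKET_EDGES_NM = [0, 10_000, 30_000, 100_000, 150_000]
MAX_DIST_NM = 150_000

def pair_candidates(nodes_nm: np.ndarray):
    """Yield (i, j, distance, bucket) for all node pairs within 150 um.

    nodes_nm: (N, 3) array of skeleton node coordinates in nm,
    e.g. after subsampling the skeleton (cf. skip50).
    """
    tree = cKDTree(nodes_nm)
    for i, j in tree.query_pairs(r=MAX_DIST_NM):
        d = float(np.linalg.norm(nodes_nm[i] - nodes_nm[j]))
        bucket = int(np.searchsorted(BUCKET_EDGES_NM, d, side="right")) - 1
        yield i, j, d, bucket
```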
Hi, long time no see. I encountered some confusion while studying your cool work.
When I was running the Train a SegCLR embedding model notebook from the SegCLR wiki, I ran the code that loads the training samples from the Google Cloud h01-release bucket.
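For anyone following along, a minimal sketch of loading such sharded TFRecords with tf.data; the path pattern below is a placeholder, not the verified location within gs://h01-release:

```python
import tensorflow as tf

# Placeholder pattern: substitute the actual path inside gs://h01-release.
PATTERN = "gs://h01-release/.../goog14c3_max200000_skip50.tfrecord-*-of-01000"

files = tf.data.Dataset.list_files(PATTERN, shuffle=True)
dataset = tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)

# Inspect one raw record to see its feature schema.
for raw in dataset.take(1):
    print(tf.train.Example.FromString(raw.numpy()))
```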
Perhaps because I am a beginner in TensorFlow 2, I don't quite understand the meaning of the sample file name goog14c3_max200000_skip50.tfrecord-00000-of-01000 in Google Cloud Storage. For example, what does max200000 mean, and what does skip50 signify? It seems that 00000-of-01000 indicates that this is the 0th file out of 1000, because there are a total of 1000 TFRecord files in that directory. I was surprised to find that they all seem to be around 1.7 GB in size. Does that mean that each TFRecord file represents randomly sampled pair information from a single segment? So in SegCLR, a total of 1000 segments were collected from H01, and the number of pairs sampled from each segment was the same, resulting in the TFRecords all being roughly the same size? There may be some misunderstandings here; please help me identify them.