Comments about dj.Manual and dj.Computed tables #247

magland · 2022-03-10T15:44:30Z

magland
Mar 10, 2022

I've been thinking about the dj.Manual and dj.Computed tables in nwb_datajoint, especially in the context of our CuratedSpikeSorting discussion.

For the Manual entry tables: I believe the main purpose is to be able to refer to the manually-inserted rows from other tables, especially Computed tables. I think that the primary keys for the Manual tables should therefore always be meaningful names that are specified by the user at the time of insertion. That way they can be referred to by those names in downstream tables.

Let's look at a bunch of examples of Manual tables in nwb_datajoint:

class LabMember(dj.Manual):
    definition = """
    lab_member_name: varchar(80)
    ---
    first_name: varchar(200)
    last_nam

Yes, that conforms. Similarly, LabTeam, Institution, and Lab all conform.

class Subject(dj.Manual):
    definition = """
    subject_id: varchar(80)
    ---
    age = NULL: varchar(200)
    description = NULL: varchar(2000)
    genotype = NULL: varchar(2000)
    sex = "U": enum("M", "F", "U")
    species = NULL: varchar(200)
    """

This also conforms (subject_id serves as the name).

class Task(dj.Manual):
    definition = """
     task_name: varchar(80)
     ---
     task_description = NULL: varchar(2000)    # description of this task
     task_type = NULL: varchar(2000)           # type of task
     task_subtype = NULL: varchar(2000)        # subtype of task
     """

This conforms.

class IntervalList(dj.Manual):
    definition = """
    # Time intervals used for analysis
    -> Session
    interval_list_name: varchar(200)  # descriptive name of this interval list
    ---
    valid_times: longblob  # numpy array with start and end times for each interval
    """

This pretty much conforms, noting that Session is referenced via nwb_file_name. The same goes for SortInterval.

class WaveformParameters(dj.Manual):
    definition = """
    waveform_params_name: varchar(80) # name of waveform extraction parameters
    ---
    waveform_params: blob # a dict of waveform extraction parameters
    """

Yep

class SpikeSorterParameters(dj.Manual):
    definition = """
    sorter: varchar(200)
    sorter_params_name: varchar(200)
    ---
    sorter_params: blob
    """

This pretty much conforms. We can refer to a row by {'sorter': 'mountainsort4', 'sorter_params_name': 'default'}.

class ArtifactRemovedIntervalList(dj.Manual):
    definition = """
    # Stores intervals without detected artifacts.
    # Note that entries can come from either ArtifactDetection() or alternative artifact removal analyses.
    -> Session
    artifact_removed_interval_list_name: varchar(200)
    ---
    artifact_removed_valid_times: longblob
    artifact_times: longblob # np array of artifact intervals
    """

Yes, this pretty much conforms.

Now we come to the exceptions

class SpikeSortingRecordingSelection(dj.Manual):
    definition = """
    # Defines recordings to be sorted
    -> SortGroup
    -> SortInterval
    -> SpikeSortingPreprocessingParameters
    -> LabTeam
    ---
    -> IntervalList
    """

This is different... and it makes the downstream tables cumbersome.

class SpikeSortingSelection(dj.Manual):
    definition = """
    # Table for holding selection of recording and parameters for each spike sorting run
    -> SpikeSortingRecording
    -> SpikeSorterParameters
    -> ArtifactRemovedIntervalList
    ---
    import_path = "": varchar(200)  # optional path to previous curated sorting output
    """

Also different and difficult to work with.

I understand the decision to have separate tables for SpikeSortingRecordingSelection and SpikeSortingRecording, one being Manual and the other being Computed. Similarly for SpikeSortingSelection and SpikeSorting. I think that you want SpikeSortingRecording and SpikeSorting to be dj.Computed tables, and you want to be able to call .populate() on them. The populate() function works well when you want to automatically compute all possible rows (pairings or tuples between multiple dependent tables). What you are doing here is populating only a sparse subset of the table of all possible rows. One possibility is to use the restrictions argument of populate() and do away with the *Selection tables altogether (never calling populate() without a narrow restriction). That could simplify things. But you would still have the problem of easily referring to these rows for further downstream processing... curation, etc.
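To make the "sparse subset" point concrete, here is a minimal pure-Python sketch. It does not use DataJoint: it assumes, as a simplification, that the keys an unrestricted populate() would visit are the Cartesian product of the parent keys within one session, and it models a restriction as a dict that each candidate key must match. All parent values are invented for illustration.

```python
from itertools import product

# Hypothetical parent keys for a single session (values are invented).
sort_group_ids = [0, 1, 2, 3]
sort_interval_names = ["run1", "run2", "sleep1"]
preproc_params_names = ["default", "alternative"]


def candidate_keys():
    """Yield every combination an unrestricted populate would consider in this toy model."""
    for group, interval, params in product(
        sort_group_ids, sort_interval_names, preproc_params_names
    ):
        yield {
            "sort_group_id": group,
            "sort_interval_name": interval,
            "preproc_params_name": params,
        }


def restrict(keys, restriction):
    """Keep only the keys that match every attribute in the restriction dict."""
    return [k for k in keys if all(k[a] == v for a, v in restriction.items())]


all_keys = list(candidate_keys())
narrow = restrict(
    all_keys,
    {"sort_group_id": 1, "sort_interval_name": "run1", "preproc_params_name": "default"},
)
print(len(all_keys), len(narrow))
```

Even with these small parent tables the unrestricted set has 24 candidate keys, while the fully specified restriction selects exactly one.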

What I propose is that you rework these tables as follows:

class SpikeSortingRecordingSelection(dj.Manual):
    definition = """
    # Defines recordings to be sorted
    -> Session
    spike_sorting_recording_name: varchar(200) # Only needs to be unique up to the session
    ---
    -> SortGroup
    -> SortInterval
    -> SpikeSortingPreprocessingParameters
    -> LabTeam
    -> IntervalList
    """

class SpikeSortingRecording(dj.Computed):
    definition = """
    -> SpikeSortingRecordingSelection
    ---
    # Note: you might want to add these in as non-primary keys for convenience
    -> SortGroup
    -> SortInterval
    -> SpikeSortingPreprocessingParameters
    -> LabTeam
    -> IntervalList

    recording_path: varchar(1000)
    # Not sure what this is for:
    -> IntervalList.proj(sort_interval_list_name='interval_list_name')
    """

class SpikeSortingSelection(dj.Manual):
    definition = """
    # Table for holding selection of recording and parameters for each spike sorting run
    -> SpikeSortingRecording
    spike_sorting_name: varchar(200) # Only needs to be unique up to the SpikeSortingRecording
    ---
    -> SpikeSorterParameters
    -> ArtifactRemovedIntervalList
    # I would leave this out:
    import_path = "": varchar(200)  # optional path to previous curated sorting output
    """

class SpikeSorting(dj.Computed):
    definition = """
    -> SpikeSortingSelection
    ---
    # Note: you might want to add these in as non-primary keys for convenience
    -> SpikeSortingRecording
    -> SpikeSorterParameters
    -> ArtifactRemovedIntervalList

    sorting_path: varchar(1000)
    time_of_sort: int   # in Unix time, to the nearest second
    -> AnalysisNwbfile
    units_object_id: varchar(40)   # Object ID for the units in NWB file
    """

I would then recommend having a Manual table for DerivedSpikeSorting (or something), which could hold derived spike sortings obtained by automatic or manual curations. This table would also contain one row for each SpikeSorting row (uncurated). It would look something like this:

class DerivedSpikeSorting(dj.Manual):
    definition = """
    -> SpikeSorting
    derived_spike_sorting_name: varchar(200) # Only needs to be unique up to the SpikeSorting
    ---
    -> SpikeSorting
    # ... other info, including a pointer to the spike sorting written to disk
    """

Rather than memorizing all the names, the user can query the database to see which SpikeSortingRecording, SpikeSorting, and DerivedSpikeSorting entries are available.

lfrank · 2022-03-10T16:55:03Z

lfrank
Mar 10, 2022
Maintainer

Hi Jeremy. Thanks for that thoughtful post.

I fully agree that consistency would be useful, but I don't think unique names can work well from a sociological perspective. The challenge is that any given individual will have hundreds to perhaps a few thousand sortings. Having them be responsible for coming up with unique names will (and I say this from experience) result in incomprehensible labels that even they can't keep track of.

From my perspective one of the main goals of this system is to make things more reproducible and less error prone, and thus even though complex primary keys are a pain, they prevent errors in a way that names do not.

If we had a way to generate unique, meaningful names, I'd be fine with that, or if the names were never used themselves (as would, I think, be the case with the Sortings table) but instead would always be retrieved based on the key, that could also work.

In any case, we're having a group spikesorting meeting today, and I'll make sure we talk about this and get users' perspectives.


magland · 2022-03-10T17:40:26Z

magland
Mar 10, 2022
Author

@lfrank I can see what you are saying regarding the challenge of creating and managing unique names. I will point out that sorting names do not need to be globally unique - just unique up to a given SpikeSortingRecording. I think I must be missing some information about how a user would interact with these tables.

Generally, I don't see the purpose of a manual table if you can't refer to a row by a proper subset of its attributes. In other words, if you need to specify all the data in the row in order to retrieve the row, why do you need the row to be stored in a table? :)

If you do decide to stick with the composite keys as they are now, I would suggest eliminating the SpikeSortingRecordingSelection and SpikeSortingSelection Manual tables and instead always use the restrictions argument when calling populate() on SpikeSortingRecording and SpikeSorting. This would eliminate the Manual tables and you would only have the Computed ones.

The DerivedSpikeSorting (or whatever it's called) table would need to be Manual, and I still think you should consider having that be jointly keyed off of SpikeSorting and a user-specified name. Note that the name only needs to be unique up to the SpikeSorting.


lfrank · 2022-03-10T17:45:51Z

lfrank
Mar 10, 2022
Maintainer

@magland. Kyu and I will be meeting shortly and we’ll discuss this, but one quick response: As far as I understand things, we can’t use the restrictions because we need to be able to flexibly take different combinations of the recordings, sorter parameters, and artifact lists, and the restriction operation assumes that you can code up that combination at the time of the make. And I’m okay with the idea of a name as an additional primary key, but in practice I think this would be used infrequently. That said, if you think it would be useful for you, then adding it seems like a good idea to me unless we come up with some other problem it would introduce.


magland · 2022-03-10T17:52:08Z

magland
Mar 10, 2022
Author

@magland. Kyu and I will be meeting shortly and we’ll discuss this, but one quick response: As far as I understand things, we can’t use the restrictions because we need to be able to flexibly take different combinations of the recordings, sorter parameters, and artifact lists, and the restriction operation assumes that you can code up that combination at the time of the make.

Would the following not work?

SpikeSortingRecording.populate({
    'nwb_file_name': ...,
    'sort_interval_name': ...,
    'sort_group_id': ...,
    'preproc_params_name': ...
})


lfrank · 2022-03-10T18:07:57Z

lfrank
Mar 10, 2022
Maintainer

It would, but if (as is almost guaranteed to happen) an inexperienced user calls SpikeSortingRecording.populate(), then it will automatically generate all possible combinations, which would be hard to clean up. I really want us to avoid those sorts of failure modes if at all possible.


magland · 2022-03-10T18:49:02Z

magland
Mar 10, 2022
Author

It would, but if (as is almost guaranteed to happen) an inexperienced user calls SpikeSortingRecording.populate(), then it will automatically generate all possible combinations, which would be hard to clean up. I really want us to avoid those sorts of failure modes if at all possible.

Yeah, that makes sense.


jsoules · 2022-03-10T19:35:35Z

jsoules
Mar 10, 2022

I have only three thoughts I would add to this discussion.

I notice the SpikeSortingRecordingSelection and SpikeSortingSelection tables both use a collection of several foreign keys for their (composite) primary key. This is fine, but if you envision any other tables that might have facts about individual SpikeSortingRecordingSelection or SpikeSortingSelection entries, the joins are much easier if you use a surrogate/synthetic key instead, and just use a unique constraint to enforce uniqueness on the set of FKs. I built a system once which had an important table keyed by (DocumentId, DefendantId) and I was kicking myself for years because I had to include both keys on every other table that stored facts about what a particular document said about a particular defendant. (This got especially rough because of the ORM involved, but... I'll spare you the full details.)
For class LabMember, currently keyed by lab member name as PK, sooner or later you may have a problem when multiple lab members have the same name (or when one lab member has two names, or needs to change names).
Class SpikeSorterParameters uses a composite natural key on sorter and sorter_params_name. If it were me, I'd probably make sorters their own table and have a SpikeSorterParameters reference that by key, to avoid data entry errors (or capitalization/character set wonkiness, etc.) that lead to having different parameter sets for MountainSort4 and Mountainsort4.
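The first suggestion (a surrogate key plus a unique constraint over the foreign keys) can be sketched with Python's built-in sqlite3 module. This is a generic SQL illustration, not DataJoint code and not the nwb_datajoint schema: the table and column names (sorting_selection, selection_id, selection_note, recording_id) are hypothetical stand-ins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE sorting_selection (
        selection_id INTEGER PRIMARY KEY,  -- surrogate key assigned by the database
        recording_id INTEGER NOT NULL,
        sorter TEXT NOT NULL,
        sorter_params_name TEXT NOT NULL,
        UNIQUE (recording_id, sorter, sorter_params_name)  -- natural key stays unique
    );
    CREATE TABLE selection_note (
        selection_id INTEGER NOT NULL REFERENCES sorting_selection (selection_id),
        note TEXT NOT NULL
    );
    """
)

insert_sql = (
    "INSERT INTO sorting_selection (recording_id, sorter, sorter_params_name) "
    "VALUES (?, ?, ?)"
)
conn.execute(insert_sql, (1, "mountainsort4", "default"))
selection_id = conn.execute("SELECT selection_id FROM sorting_selection").fetchone()[0]

# A downstream table needs only the single surrogate column to refer to the row.
conn.execute("INSERT INTO selection_note VALUES (?, ?)", (selection_id, "pilot analysis"))

# The unique constraint still rejects a second row with the same combination.
try:
    conn.execute(insert_sql, (1, "mountainsort4", "default"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(selection_id, duplicate_rejected)
```

The trade-off is that the surrogate value itself carries no meaning, so a reader has to join back to the selection table to see which combination a note refers to.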


khl02007 · 2022-03-11T02:10:31Z

khl02007
Mar 11, 2022
Maintainer

@magland

Would the following not work?

SpikeSortingRecording.populate({
    'nwb_file_name': ...,
    'sort_interval_name': ...,
    'sort_group_id': ...,
    'preproc_params_name': ...
})

But if you are going to write out all the attributes, then it's not that much more efficient than first inserting the dict in the Selection table and then calling populate from the downstream dj.Computed table. You also lose the flexibility of running populate with various options (e.g. reserve_jobs).
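The two-step workflow mentioned here (insert a key into a Selection table, then call populate on the Computed table) can be modelled in plain Python. This sketch is only an analogy and does not use DataJoint: insert_selection, populate, and make_recording are hypothetical helpers, and the key values are invented.

```python
selection = []  # stands in for a Manual *Selection table (a list of key dicts)
computed = {}   # stands in for the Computed table, indexed by a hashable key


def insert_selection(key):
    """Record a key to be processed; in this toy model a repeated insert is a no-op."""
    if key not in selection:
        selection.append(key)


def populate(make):
    """Run make() for every selected key that has no result yet; return how many ran."""
    newly_computed = 0
    for key in selection:
        index = tuple(sorted(key.items()))
        if index not in computed:
            computed[index] = make(key)
            newly_computed += 1
    return newly_computed


def make_recording(key):
    # Placeholder for the real computation; returns a made-up label.
    return "recording_{sort_group_id}_{sort_interval_name}".format(**key)


insert_selection(
    {
        "nwb_file_name": "a.nwb",
        "sort_interval_name": "run1",
        "sort_group_id": 1,
        "preproc_params_name": "default",
    }
)
first_pass = populate(make_recording)
second_pass = populate(make_recording)
print(first_pass, second_pass)
```

In this model an argument-free populate is safe by construction, because it can only ever touch keys that someone explicitly inserted.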

I didn't get your point about using restrictions to call populate without Selection tables. Could you provide an example of an argument to SpikeSortingRecording.populate to generate an entry in it? How would you bring together the information in Session, SortInterval, SortGroup etc?


khl02007 · 2022-03-11T02:37:36Z

khl02007
Mar 11, 2022
Maintainer

@jsoules

I notice the SpikeSortingRecordingSelection and SpikeSortingSelection tables both use a collection of several foreign keys for their (composite) primary key. This is fine, but if you envision any other tables that might have facts about individual SpikeSortingRecordingSelection or SpikeSortingSelection entries, the joins are much easier if you use a surrogate/synthetic key instead, and just use a unique constraint to enforce uniqueness on the set of FKs. I built a system once which had an important table keyed by (DocumentId, DefendantId) and I was kicking myself for years because I had to include both keys on every other table that stored facts about what a particular document said about a particular defendant. (This got especially rough because of the ORM involved, but... I'll spare you the full details.)

I see. I'm not too concerned about the joins being difficult yet because I'm not sure how often we will use that operation on SpikeSortingRecordingSelection or SpikeSortingSelection - we would mostly try to keep the different entries separate rather than generate new combinations of entries. Then again maybe we will do things that I haven't envisioned yet...

For class LabMember, currently keyed by lab member name as PK, sooner or later you may have a problem when multiple lab members have the same name (or when one lab member has two names, or needs to change names).

Yes, but the thing is, this table is populated directly from an NWB file when it is first ingested into the database (i.e. even though this is a dj.Manual table, entries are not inserted manually but via a function call). The NWB file has the name of the experimenter and it doesn't provide unique IDs. Do you have any suggestions?

Class SpikeSorterParameters uses a composite natural key on sorter and sorter_params_name. If it were me, I'd probably make sorters their own table and have a SpikeSorterParameters reference that by key, to avoid data entry errors (or capitalization/character set wonkiness, etc.) that lead to having different parameter sets for MountainSort4 and Mountainsort4.

That used to be the case (i.e. @lfrank originally made two separate tables: SpikeSorter and SpikeSorterParameters) and I had merged them because it felt unnecessary to have a SpikeSorter table that is just a single column and whose rows were names of different spike sorters. But your point about data entry error had not occurred to me - clearly, @lfrank had the foresight! I will separate them again.
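As an illustration of how a separate sorter table guards against near-duplicate names, here is a small sketch using Python's built-in sqlite3 module. It is generic SQL rather than DataJoint, and the table names spike_sorter and spike_sorter_parameters are hypothetical stand-ins for the tables discussed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE spike_sorter (
        sorter TEXT PRIMARY KEY
    );
    CREATE TABLE spike_sorter_parameters (
        sorter TEXT NOT NULL REFERENCES spike_sorter (sorter),
        sorter_params_name TEXT NOT NULL,
        PRIMARY KEY (sorter, sorter_params_name)
    );
    """
)

conn.execute("INSERT INTO spike_sorter VALUES ('mountainsort4')")
conn.execute("INSERT INTO spike_sorter_parameters VALUES ('mountainsort4', 'default')")

# A differently capitalized sorter name is not in spike_sorter, so it is rejected.
try:
    conn.execute("INSERT INTO spike_sorter_parameters VALUES ('Mountainsort4', 'default')")
    typo_rejected = False
except sqlite3.IntegrityError:
    typo_rejected = True
print(typo_rejected)
```

The single-column lookup table looks redundant on its own, but it turns a silent duplicate parameter set into an immediate insert error.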


lfrank · 2022-03-11T02:37:51Z

lfrank
Mar 11, 2022
Maintainer

@jsoules Good points. A few responses

We don't expect to have any other tables referring to the Selection entries, so hopefully that will be okay.
You're right about the LabMember uniqueness, of course, but in practice that's unlikely to be a problem given the sizes of our labs. It may also be a bit late for us to change it given other dependencies, but I'm not certain about that.
Right now the sorters and the parameters are all inserted via a call to spikeinterface, so as long as that returns a single set of values, we're okay. That said, if they change the names of a sorter, it will be added as a new entry, but at that point I think the parameters may have changed as well, so perhaps that's okay.


khl02007 · 2022-03-11T03:03:53Z

khl02007
Mar 11, 2022
Maintainer

@magland

For the Manual entry tables: I believe the main purpose is to be able to refer to the manually-inserted rows from other tables, especially Computed tables. I think that the primary keys for the Manual tables should therefore always be meaningful names that are specified by the user at the time of insertion. That way they can be referred to by those names in downstream tables.

I'm not sure that is the main purpose. dj.Manual tables are just tables that hold information that comes from outside of the database. In fact we don't even need to enter information manually - using a function call is a totally valid use case. See https://docs.datajoint.org/python/definition/12-Example.html?highlight=manual

As such, I don't think we have to impose such discipline upon ourselves as to always make the primary keys of dj.Manual tables meaningful names. As far as I can see, they just need to be valid primary keys. Of course, that doesn't mean that we shouldn't have such a rule: maybe it is a good idea.

I think you prefer meaningful names because the user has to refer to them downstream. But I actually think that a name is more confusing to a user than a set of composite keys. As @lfrank points out, I'm most definitely going to forget what I meant by good_sorting1 in 6 months. To make sense of it, I will have to inspect all the other keys. I will then realize that it refers to a particular spike sorting on a particular NWB file, using a particular set of parameters. In that case, isn't it more natural to just use those attributes as the primary key? One useful rule of thumb might be to imagine that a new graduate student comes into the lab and decides to pick up where someone left off in the analysis pipeline of our datajoint database. If that person will be confused by something, that shouldn't be used to uniquely define an entry.

That means the user will basically have to carry around a dict containing all the attributes that she needs to interact with the database (chances are, the user will first make a general query about a particular NWB file, and then use intersection to gradually narrow it down). That's not as easy as just using a name. But I think clarity and completeness should trump convenience for referencing in this case (or at least that was the impression that I got from @lfrank).

As for what attributes to include in the CuratedSpikeSorting table (yes, that's the name we're going to use!), @lfrank and I had a discussion today to improve the way one can use and refer to it without being forced to come up with a unique name. I will post a comment in #152.


magland · 2022-03-11T13:00:19Z

magland
Mar 11, 2022
Author

@khl02007 okay that sounds good.

Just one clarification -- I was proposing that the sorting name is only unique up to the SpikeSortingRecording. So if there was a sorting with name 'good_sorting1' you wouldn't have to wonder what nwb file it came from because the sorting name is not the entire key. It is just part of the key that distinguishes it from other sortings performed on the same SpikeSortingRecording. Without the name, you would need to just query to get a list of all the sortings performed and then inspect all the parameters by eye to see which is the one you wanted.
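A small pure-Python sketch of what "unique up to the SpikeSortingRecording" means in practice. The rows and the restrict helper are invented for illustration and do not use DataJoint; the attribute names follow the proposal earlier in this thread.

```python
# Toy rows standing in for a SpikeSorting-like table (all values are invented).
sortings = [
    {"nwb_file_name": "a.nwb", "spike_sorting_recording_name": "tetrode1_run1",
     "spike_sorting_name": "good_sorting1", "sorter_params_name": "default"},
    {"nwb_file_name": "a.nwb", "spike_sorting_recording_name": "tetrode1_run1",
     "spike_sorting_name": "strict_thresholds", "sorter_params_name": "strict"},
    {"nwb_file_name": "b.nwb", "spike_sorting_recording_name": "tetrode1_run1",
     "spike_sorting_name": "good_sorting1", "sorter_params_name": "default"},
]


def restrict(rows, restriction):
    """Return the rows that match every attribute in the restriction dict."""
    return [r for r in rows if all(r[k] == v for k, v in restriction.items())]


recording_key = {
    "nwb_file_name": "a.nwb",
    "spike_sorting_recording_name": "tetrode1_run1",
}

# The name alone is ambiguous across recordings...
by_name_only = restrict(sortings, {"spike_sorting_name": "good_sorting1"})
# ...the recording key alone returns every sorting of that recording...
by_recording = restrict(sortings, recording_key)
# ...and recording key plus name identifies exactly one row.
by_full_key = restrict(sortings, {**recording_key, "spike_sorting_name": "good_sorting1"})
print(len(by_name_only), len(by_recording), len(by_full_key))
```

The name never has to answer "which file is this from"; it only has to distinguish siblings under the same recording.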


jsoules · 2022-03-11T15:09:24Z

jsoules
Mar 11, 2022

Regarding the selection tables--it probably won't impact you if you don't intend to add additional tables that have foreign keys to them. Just want to make sure you're aware that the composite primary key sort of compounds--if you look at the __spike_sorting_recording table (the computed one), its PK is now a tuple of (nwb_file_name, sort_group_id, sort_interval_name, preproc_params_name, recording_id) with the first 4 fields automatically added because of the FK to SpikeSortingRecordingSelection. If you have tables describing recordings (like, say, a table that tracks which recordings are published; or annotations of recordings like "used in XYZ study", or "features significant drift," etc.) those tables will in turn pull in all five fields with a single FK to SpikeSortingRecording. You get the idea.
I mention only because DataJoint is effective at hiding this stuff, so it might not be obvious that it's happening. Though by the same token, you may not need to care, since DJ is taking care of it.
For LabMember, I imagine the more likely scenario is that someone's name changes, or they're referenced under different names (as someone who goes by my middle name, I'm particularly sensitive to this). But if you're importing from a file anyway, the file import won't know that different names refer to the same person, so you'd still wind up with duplicate records (which you then couldn't correct, since DataJoint doesn't believe in updates). So yes, no reason to mess with this now!
Just a note for @khl02007, I'd imagined the table of sorter names would have a surrogate key (so two columns). But it sounds like everybody's cool with how it works now, so it's a moot point anyway :)
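The way composite primary keys accumulate through foreign keys can be shown with a short pure-Python model. This is a simplified sketch, not DataJoint's actual declaration logic: each table lists the parents referenced in its primary key plus its own primary attributes, using the attribute names quoted in the comment above. PublishedRecording is a hypothetical downstream table.

```python
# Simplified model: "parents" are tables referenced inside the primary key,
# "own" are primary key attributes the table declares itself.
tables = {
    "Session": {"parents": [], "own": ["nwb_file_name"]},
    "SortGroup": {"parents": ["Session"], "own": ["sort_group_id"]},
    "SortInterval": {"parents": ["Session"], "own": ["sort_interval_name"]},
    "SpikeSortingPreprocessingParameters": {"parents": [], "own": ["preproc_params_name"]},
    "SpikeSortingRecordingSelection": {
        "parents": ["SortGroup", "SortInterval", "SpikeSortingPreprocessingParameters"],
        "own": [],
    },
    "SpikeSortingRecording": {
        "parents": ["SpikeSortingRecordingSelection"],
        "own": ["recording_id"],
    },
    # Hypothetical table that only wants to flag a recording as published.
    "PublishedRecording": {"parents": ["SpikeSortingRecording"], "own": []},
}


def primary_key(name):
    """Collect inherited and own primary key attributes, without duplicates."""
    attrs = []
    for parent in tables[name]["parents"]:
        for attr in primary_key(parent):
            if attr not in attrs:
                attrs.append(attr)
    for attr in tables[name]["own"]:
        if attr not in attrs:
            attrs.append(attr)
    return attrs


published_pk = primary_key("PublishedRecording")
print(published_pk)
```

A table that conceptually stores one fact about one recording ends up with a five-column key in this model, which is the compounding effect described above.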


khl02007 · 2022-03-14T05:58:39Z

khl02007
Mar 14, 2022
Maintainer

@magland

Just one clarification -- I was proposing that the sorting name is only unique up to the SpikeSortingRecording. So if there was a sorting with name 'good_sorting1' you wouldn't have to wonder what nwb file it came from because the sorting name is not the entire key. It is just part of the key that distinguishes it from other sortings performed on the same SpikeSortingRecording. Without the name, you would need to just query to get a list of all the sortings performed and then inspect all the parameters by eye to see which is the one you wanted.

Oh I see, sorry I missed this important point (which I now see that you have made several times in this discussion - clearly I should read more carefully!). In that case I have absolutely no problem with it. I think in some cases (like your example of SpikeSortingRecording) we definitely need an extra attribute to identify a sorting, and the primary key could be FK from an upstream table + a unique name as you suggested.

I spent some time reading the Datajoint documentation about what they recommend for primary keys, and found this (quoting relevant parts below, full link here):

All integer types, dates, timestamps, and short character strings make good primary key attributes. Character strings are somewhat less suitable because they can be long and because they may have invisible trailing spaces... The primary key may be composite, i.e. comprising several attributes. In DataJoint, hierarchical designs often produce tables whose primary keys comprise many attributes...
A primary key comprising real-world attributes is a good choice when such real-world attributes are already properly and permanently assigned...
Your lab must maintain a system for uniquely identifying important entities. For example, experiment subjects and experiment protocols must have unique IDs. Use these as the primary keys in the corresponding tables in your DataJoint databases...

Then it goes on to discuss using hashes and smallint with auto_increment (which @jsoules has suggested previously) as PK.

Given this, I'm inclined to use as the PK for CuratedSpikeSorting table (#152) the FK from SpikeSorting and an integer ID that auto increments (sortings definitely count as 'important entities'). The ID obviously is not as descriptive as a name, but it conveys the order in which the entries for a given sorting have been added (it is unique up to a sorting as you say); e.g. the one with max ID would be the one that was added most recently, so the user can pick that out for downstream analysis. I will also add a column (maybe called notes) that the user can use to provide some more description about the sorting. How does that sound?
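One way to picture an ID that counts up separately within each sorting is the following pure-Python sketch. It assigns the ID in application code rather than relying on a database auto_increment feature, and the names curation_id, add_curation, and latest are hypothetical; only the notes column comes from the proposal above.

```python
curated = []  # toy stand-in for the CuratedSpikeSorting table


def add_curation(sorting_key, notes=""):
    """Append a row whose curation_id is one more than the largest for this sorting."""
    existing = [r["curation_id"] for r in curated if r["sorting_key"] == sorting_key]
    row = {
        "sorting_key": sorting_key,
        "curation_id": max(existing, default=-1) + 1,
        "notes": notes,
    }
    curated.append(row)
    return row["curation_id"]


def latest(sorting_key):
    """Return the most recently added row for a sorting (the one with the max ID)."""
    rows = [r for r in curated if r["sorting_key"] == sorting_key]
    return max(rows, key=lambda r: r["curation_id"])


sorting_a = ("a.nwb", "sorting_1")
sorting_b = ("a.nwb", "sorting_2")
ids = [
    add_curation(sorting_a, "uncurated"),
    add_curation(sorting_a, "merged units 3 and 7"),
    add_curation(sorting_b, "uncurated"),
]
print(ids, latest(sorting_a)["notes"])
```

Each sorting gets its own sequence starting at 0, so the pair (sorting key, ID) is unique even though the ID alone is not.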


jsoules · 2022-03-14T13:35:36Z

jsoules
Mar 14, 2022

@khl02007 One snag; I believe you can only use auto_increment in a non-composite primary key (the bottom of https://docs.datajoint.org/python/v0.11/definition/07-Primary-Key.html is the reference there but I'm not sure if I'm looking at the most recent version). If you wish to have (system-managed) auto-increment behavior, you probably need to take the FK to SpikeSorting out of the PK section and make it part of the rest of the record. (The rest of the logic is unchanged.)

Otherwise I think this is a great solution.

