forked from cliftonm/HOPE
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathhope database design thoughts.txt
372 lines (232 loc) · 17.2 KB
/
hope database design thoughts.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
RSSFeedName Table, because it's the ST is identified as unique/normalized in RSSFeedItem
Text-Value
Url Table
Value
RSSFeedItem Table, because it's an ST
FK_RSSFeedName Normalized, m:1, therefore the column is an FK directly to RSSFeedName
RSSFeedTitle no normalized/unique flag
FK_Url Unique, 1:1, therefore the column is an FK the the table Url
RSSFeedDescription no normalized/unique flag
RSSFeedPubDate no normalized/unique flag
Questions:
What about these sub-tree components like:
Text.Value
Url.Value
vs. items like:
RSSFeedPubDate.Value
What about the fact that RSSFeedUrl is a specific semantic type of Url, which itself is a semantic type having a value?
Do we create a table called RSSFeedUrl with an FK to a table called Url? That seems silly, but should it actually
be a universal rule?
That would mean we would tables like this:
RSSFeedItem
FK_RSSFeedName_ID
FK_RSSFeedTitle_ID
FK_RSSFeedUrl_ID
FK_RSSFeedDescription_ID
FK_RSSFeedPubDate_ID
RSSFeedName
FK_Text_ID
RSSFeedTitle
FK_Text_ID
RSSFeedDescription
FK_Text_ID
RSSFeedUrl
FK_Url_ID
RSSFeedPubDate
Value
Text
Value
Url
Value
This would actually make a whole lot of sense because any ST could be instantiated at any level, and interestingly,
doing it this way always creates m:1 relationships (many parent instances with 1 child instance per FK).
From the ST schema, we can determine, for example, all the tables referencing Url, and so forth, up the tree to the root
node -- each ST that includes Url somewhere in the hierarchy.
We still flag ST's with unique/normalized because this guides the DB engine in determining whether or not to first query
the table for an existing record.
Interestingly, this also means that any fully qualified named ST value can be persisted (for example, RSSFeedName.Text.Value)
and all the ID's are accumulated to create the RSSFeedItem.
This kind of architecture leads to very thin tables, and tables higher up in the food chain are almost always FK's.
In this architecture:
1. Every table represents a semantic type
2. Every FK references a child semantic type
3. Every non-FK field is a native type
Rather than indicating that an ST is "unique" or "indexed", I am going to use the concept "normalized" to represent
that the values in the table are guaranteed unique. This instructs the persistence object to check for the existence
of the "unique key" consisting of the native type(s) in the ST.
But where to we put the "normalized" flag? This would be on any ST that contains an NT.
Tables which FK to a normalized (UQ) table are not necessarily UQ themselves unless indicated.
Tables that are purely FK's cannot be cached for normalization performance.
---------------------------
A very interesting thing happens when I looked at how to persist semantic data:
The semantic types themselves become tables
The structure of a semantic type is persisted through the use of foreign keys
You often get tables that are actually nothing more than containers for FK's
The "bottom" tables are the ones that actually persist the native types and have non-FK fields.
This is completely orthogonal to the usual database architecture. Let me illustrate:
Here's what a typical non-normalized schema might look like for managing items in an RSS Feed:
RSSFeedItem
ID
FeedName
FeedTitle
FeedUrl
FeedDescription
FeedPubDate
If we want to normalize this, we can break this up into two tables:
RSSFeed
ID
FeedName
RSSUrl
RSSFeedItem
ID
RSSFeed_ID
FeedTitle
FeedUrl
FeedDescription
FeedPubDate
Contrast this with the "semantic" table structure illustrated earlier.
-----------------
How to handle UI specificity? An example will illustrate the issue, and how it leads into dealing with non-UI issues as well:
I have an RSSFeed with feed items, and I want to display the last 40 (some arbitrary number) feeds from the database and also
merge in new feeds.
By merge, I mean: if the feed item is already in the view, don't display a second, identical, record
I also want to indicate which feeds are new (coming from the feed reader), which fields are old (coming from the database), and which feeds I've read.
We need a UI semantic type, something like:
FeedItemUserState
RSSFeedItem
State (old, new, read)
1. The first question is, how do we query the database? An example syntax might be:
"select top 40 RSSFeedItem where RSSFeed.FeedName='Code Project' order by RSSPubDate desc"
OK, that's not bad, I don't see any implementation problems with the DB service figuring out the tables needed to execute this query.
2. Next, how does the UI determine how to merge fields? Well, it can determine that there are unique keys (i.e., normalized values), which in our case are:
RSSFeedName
RSSFeedUrl
Technically, only RSSFeedUrl is needed, but the merge function, being somewhat dumb, will probably end up using both fields.
3. OK, that doesn't sound bad, but I wonder, should this be part of the UI behavior, or should we have an "unique record aggregator" receptor that takes data from
different streams and ensures their uniqueness (as in, it only emits unique records?)
That seems like a good idea, as this kind of behavior will probably be desired in many places.
4. Now, how do we include the user state for each feed? Well, the query would be changed to look something like this:
"select top 40 (RSSFeedItem, FeedItemUserState) where RSSFeed.FeedName='Code Project' order by RSSPubDate desc"
or (because State is a poor semantic description):
FeedItemUserState
RSSFeedItem
UserVisitedUrlState (old, new, read)
"select top 40 (RSSFeedItem, UserVisitedUrlState) where RSSFeed.FeedName='Code Project' order by RSSPubDate desc"
But this will create a semantic type that is undefined, as the tree now includes a branch (probably under RSSFeedItem) that includes State.
Also note the (), which I'm intending to indicate: get the top 40 RSSFeedItem records joined with the UserVisitedUrlState for each record.
The query does make a certain kind of sense, in that FeedItemUserState can be acquired from the Url unique key shared in the linkage.
*** Is there a better way to do this? ***
A. In the above description for FeedItemUserState, would it be better to define it this way (especially given the semantic name for the state):
UrlUserState
Url
UrlVisitedState (old, new, read)
After all, the state corresponds more to whether the user has visited the URL, not the feed item. We should look briefly also at including a user id in case we are running this on a network with multiple users:
UserUrlVisitedState
User
Url
UrlVisitedState (old, new, read)
Now the query would look like:
"select top 40 (RSSFeedItem, UrlVisitedState) where RSSFeed.FeedName='Code Project' and User.ID='123' order by RSSPubDate desc"
Note the semantics here though -- we have some really abstract concepts like "UserUrlVisitedState" which have child ST's that are a decomposition of the parent ST name. Interesting.
Also note that because User and Url are normalized, these entries will be unique for a unique user-url key.
Also, technically, the query might be better like this:
"select top 40 (RSSFeedItem, UserUrlVisitedState) where RSSFeed.FeedName='Code Project' and User.ID='123' order by RSSPubDate desc"
because it creates the full type UserUrlVisitedState which becomes easier to work with in terms of updating and persistence.
Now that we have that cleared up:
B. I don't want to have to create an ST for every kind of joined data structure I might need. This is an important issue...
B1. We need to determine where "UrlVisitedState" goes in the semantic tree. Does it become a native type of the Url ST? Does it become a native type of the RSSFeedItem ST that we are "joining" with? Does it actually matter? Not really. Joined types (ST or NT) can be added to the root type. If it's a collection, then the records will appear as a native type List<someType>. Also, deciding that the joined types always are added to the definition of the root type makes it clear as to where the additional data can be found.
So, the first order of business is the reality that new types are going to be defined and created by receptors. Let's assume these are going to be named by the user for now (so we don't have to come up with some naming scheme) so the query now would read:
"select top 40 (RSSFeedItem, UserUrlVisitedState) as UserFeedItem where RSSFeed.FeedName='Code Project' and User.ID='123' order by RSSPubDate desc"
where the "as..." part is required whenever joining two or more ST's.
5. How to implement the display logic? I hate hard-coding this stuff, but the only alternative seems to be a rather specific configuration UI and
behavior on the part of the list viewer.
*** Is there a more generic way to do this? ***
6. How does the UI persist state change? Again, this points to a solution-specific receptor (at least one that is specific to having state information that affects row colorization.)
7. How do we get session information, like the user ID, into receptors? Or, to put it another way, how do receptors access session information?
8. Does the aggregator update the state to "old", or does the UI receptor do this? What does "old" mean anyways? Do we want to see feed items as "new" for a while before they become "old" ? Or does "old" correspond to the age of the feed item itself, such that, say, feed items older than two days are marked as "old"? Should a feed be marked as old only when the user says so? Some of this is UI aspects, some of it might be how we automate the concepts of "new" and "old" in the aggregator or other places. Are there (at least) two different concepts of "old": old, as in, say, the pub date is more than 48 hours, and "old" as in "since the list was displayed?" I sort of like that idea, as I can see truly new items, and after a designated time, non-new items that have "aged", or "grown stale" are indicated differently.
9. How does the database determine whether to create new instances of a record or to "use" an existing instance?
For example, RSSFeedItem should always be unique, but the only real piece of information that determines this uniqueness is the URL value.
On the other hand, there might be scenarios where multiple records with other variances might point to the same URL--these would not be omitted.
Are those "other variances" to be considered part of the unique key, such that we should be able to make a determination, always, from the
"normalized" types in the signal, as to whether a structure should actually persist to the DB?
Let's assume so and consider this problem more rigorously later. Furthermore, we will assume that if the "normalized" fields exist, then the operation
is at best an update operation, otherwise, if one or more of the normalized fields do not exist, then we have an insert operation.
One of the complexities is that ST's, not just NT's, can be flagged as "normalized." This causes some issue when we have a structure like:
RSSFeedName => Text => Value(string)
where RSSFeedName and Text are ST's and Value is a string NT.
We want to normalize RSSFeedName, not Text, because we want Text normalized only in the context of RSSFeedName. , but RSSFeedName is an ST, not an NT,
so how do we query this to determine its normalization?
Contrast this with:
RSSFeedUrl => Url => Value(string)
Here, it makes more sense to always normalize Url because of its semantic specificity as compared to the more generic semantic description "Text."
Also, what do we do when more than one item in the structure-path is normalized, for example: RSSFeedUrl, Url, and Value could all be designated as Normalized.
Both questions might lie in the concept that the first normalized field we encounter when in traversing the structure becomes the anchor point for
determining normalization. Thus, the SQL would have to look like:
select a.ID, b.ID
from RSSFeedName a
left join Text b on a.FK_TextID = b.ID
where b.Value = 'Code Project'
where the left joins represent drilling down into the tree until all values are encountered.
If no rows are returned, there is no entry, so the operation should be considered an insert.
We have to make these tests from the bottom up when we insert the structure so as to ensure that we don't create duplication at lower levels. So,
the first query for URL that gets run would be:
select a.ID from Url a where a.Value = 'http://whatever'
No rows means "insert".
A returned row means that the returned ID would be used for the FK_UrlID value. This gets tested:
select a.ID from RssFeedUrl a where a.Value = [FK_UrlID]
On the other hand, inserting a unique feed name presents a problem. Text is not designated as normalized, but RSSFeedName is.
So, if the query:
select a.ID, b.ID
from RSSFeedName a
left join Text b on a.FK_TextID = b.ID
where b.Value = 'Code Project'
returns no rows, we first insert the feed name value into Text, then insert the resulting id into RSSFeedName.FK_TextID.
This would continue up the tree of ST's as we build the FK_[]ID values. Whenever an ST "fills" with the FK_[]ID values, we can insert it, or,
if it is normalized, test it, but this query doesn't drill down, it tests only the fields at that level. So, for example, if RSSFeedItem itself
is normalized, then only the FK_[]ID's for RSSFeedItem are tested to determine whether the insert should be made. At this level, "update" is
meaningless -- the ID's either exist (do nothing) or don't (insert.)
Back to the question at hand -- if RSSFeedItem is not flagged as normalized, we don't do duplicate record checks. If it is, we check for an existing record.
Because of the way the feed reader works -- it can emit the same records -- we definitely need to designate RSSFeedItem as normalized.
***
However, this leads to a realization -- all the SE's in RSSFeedItem must therefore also be normalized. SO, WHAT REALLY ARE WE DOING HERE?
I think we need to go back to the concept of unique fields -- the composite unique fields define the unique key to determine normalization. If there are no unique fields defined, then the data cannot be normalized.
So, back again to the question, what happens with something like Text, whose value we don't mark as unique, but we want RSSFeedTitle's Text to be unique?
And wait! in RSSFeedItem, (RSSFeedTitle, RSSFeedUrl) would be designated as the unique key, so that's all we need for determining the FK ID's of the other fields. But how does this prevent
us from writing the data for fields like RSSFeedDescription?
Let's say we take a top-to-bottom approach:
Scenario: No RSSFeedItem record matches.
1. We want to write out RSSFeedItem.
2. It has two unique keys, RSSFeedTitle and RSSFeedUrl
3. We drill into RSSFeedTitle and RSSFeedUrl
4. RSSFeedTitle has, at the bottom, an ST (Text) with a Value NT
5. We query this and discover there is no record
6. We add the record
7. We propagate back up the tree, adding the FK's
8. Back at RSSFeedName, we still need RSSFeedUrl's FK
9. Repeat 4-7 with RSSFeedUrl
10. Back at RSSFeedName, we have enough to query RSSFeedItem
11. We find no record
12. We see that we don't have values yet for the other FK's.
13. For each missing FK, We drill into the ST until we get to the NT's, insert their values, and propagate the FK's back up the tree.
14. Now we can write RSSFeedItem
Scenario: Existing RSSFeedItem record matches:
1. We want to write out RSSFeedItem.
2. It has two unique keys, RSSFeedTitle and RSSFeedUrl
3. We drill into RSSFeedTitle and RSSFeedUrl
4. RSSFeedTitle has, at the bottom, an ST (Text) with a Value NT
5. We query this and discover there is a record
6. We propagate back up the tree, adding the FK's, and possibly querying additional tables to get the parent FK's, but only if we can identify a unique composite key each step of the way
7. We repeat the process with RSSFeedUrl, discovering it's FK
8. Back at RSSFeedName, we have enough to query RSSFeedItem
9. We find a record and therefore take no further action (or, we can drill into the non-unique ST's and perform updates on the NT's in case values have changed.)
So, it seems like what we need is not a "Normalize" flag but a "Unique" flag.
Furthermore, uniqueness can be unique to the parent structure, or global to the ST definition. An interesting point.
Notes:
1. Multiple normalized NT's in an ST define a composite key.
2. If an ST is defined as normalized, all the ST FK's and NT's in the remaining structure define the composite key.
Other thoughts unrelated to DB stuff:
Q: When should something be a native type and when should something be an ST that wraps a native type in a Value field?
A: An ST wrapper should be created when there is sufficient semantic meaning to indicate re-use in other structures. For example,
Url, Zip5, and even Text all have semantic meaning that indicates re-use. Second, even when a type doesn't have re-use value, if it has
sub-types that are semantic, like Text, DateTime, etc., then it needs to be an ST to leverage the benefits of drilling down into the sub-types
so that linkages can be made with other types that either currently exist or may be defined later.