-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added glossary and design decisions. #3
base: initial-core
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some feedback:
- There are some formatting issues that need to be addressed.
- I think the glossary and design decisions should be separate documents.
- I think some of the entries in the glossary are better off in Scaladocs for the respective classes, and what we should have instead are higher-level descriptions of the concepts being represented.
I haven't gone over the wording in great detail.
We should also wait until #2 is merged and rebase on master
or do something else to prevent a merge commit between these branches.
README.md
Outdated
@@ -1,5 +1,52 @@ | |||
#Summary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you can see if you look at this branch in the code view, there are some formatting issues with the headings.
You need a space between #
s and the text.
README.md
Outdated
The usual collection of *Adapters* and *Writers* are conceptually similar but take better advantage of the new data model implementation. The *Adapter* life cycle, in particular, is much cleaner. This new version of LaTiS will continue to support the same service interface (and more). | ||
|
||
#Glossary | ||
Items that are capitalized are Scala class names. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically "code things" are surrounded in backticks like this:
`DataType`
So they look like this: DataType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those ticks are mapped to the html "code" tag so that seems like the right thing to do when referring to a specific class. Maybe another case where we need to differentiate concepts from actual implementation?
README.md
Outdated
* **Tuple** - A *Tuple* is a *DataType* consisting of a list of *DataType* objects. Unlike a function, the *Datatype* objects in a *Tuple* are associated but are not dependant on each other. | ||
An example is a 2D point consisting of the x,y *Tuple*. The scalars x and y are associated but x is not a function of y nor is y a function of x. Elements of a *Tuple* also do not have to be of the same type. Tuple is a member of the *DataType* algebraic data type. | ||
|
||
#Design Decisions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this should live as a separate document.
README.md
Outdated
The usual collection of *Adapters* and *Writers* are conceptually similar but take better advantage of the new data model implementation. The *Adapter* life cycle, in particular, is much cleaner. This new version of LaTiS will continue to support the same service interface (and more). | ||
|
||
#Glossary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should also be a separate document.
I like the idea of the README as relatively short document giving a brief overview, some "getting started" tips (like how to pull in the library to SBT or how to download and run a barebones thing), and pointers to additional documentation, like a glossary or design decisions.
I also wonder how much of this duplicates what would be in the docs for certain classes. For instance, the entry for Metadata
here seems like it could be a better fit for the Scaladocs and what could go here is a higher-level description of the concept of metadata without our implementation details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Glossary" does lead me to expect definitions of terms. What do we mean by "metadata" as opposed to how do we implement it. I agree with Chris that the latter belongs in the scaladoc. The question becomes whether we want to duplicate information and risk it getting out of date. I'm inclined to document the high level concepts that are unlikely to change in the glossary. Maybe add a link to the code or scaladoc?
README.md
Outdated
* **fdml** - (Functional Datamodel Markup Language) An xml file format to describe the contents of a *Dataset*. | ||
See the FDML Schema section below. | ||
* **Function** - A *Function* is a *DataType* consisting of exactly two *DataType* objects, a domain and a range. Function is a member of the DataType algebraic data type. | ||
* **functional** **data** **model** - Datsets consist of ordered sequences of samples that are decoupled from metadata. Operations that implement the functional algebra can be performed on these samples to create new datasets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe **functional data model**
will work just fine here.
There's probably also more that can be said about the FDM.
I'm not sure that I agree with everything Chris said. I think I do like the idea of making a separate document for the glossary; I can see how it could clutter up the README. I like the design decisions in the README. But I don't think we need to move anything from the glossary into Scaladocs. Sure, adding stuff to Scaladocs is a good idea, but I think we can do that and still keep the entries in the glossary. I like the idea of having a "one-stop-shop" glossary for our terms better than having to go combing through documentation to find something. |
README.md
Outdated
@@ -1,5 +1,52 @@ | |||
# Summary | |||
|
|||
*LaTiS version 3* is a rewrite of the *Functional Data Model* and *Algebra* based on lessons learned from version 2, and with a bit more inspiration from Functional Programming. It will incubate as *latis3 version 0.x* until we are ready to finalize the API and release it as *latis version 3.x*. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little nitpicky, but we should be consistent with how we capitalize "LaTiS". For example, we have "Latis3 version 0.x" in this paragraph and I think that should be "LaTiS3 version 0.x".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't apologize, these are actual mistakes!
README.md
Outdated
Items that are capitalized are Scala class names. | ||
|
||
* **arity** - The number of un-flattened *DataTypes* in the domain of a function. | ||
For example, a function with a domain of (x, y) and wavelength has an arity of 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "...with a domain of (x, y) and a range of wavelength..."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, actually wavelength is part of the domain here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh! I misread. Twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we say "domain of ((x, y), wavelength)"?
README.md
Outdated
For example, a function with a domain of (x, y) and wavelength has an arity of 2. | ||
* **Data** - A recursive algebraic data type consisting of *ScalarData*, *TupleData*, and *FunctionData*. Unlike the *DataType*, which describes the model, the *Data* objects contain actual values representing measurements. Since *TupleData* and *FunctionData* contain multiple *Data* objects they can be represented as trees just like DataType objects. | ||
* **Dataset** - A *Dataset* instance consists of three things: | ||
1. *MetaData* for the dataset itself. A list of key/value pairs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use a colon? "MetaData for the dataset itself: A list of key/value pairs."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, we don't capitalize the "D" in "Metadata".
README.md
Outdated
* **DataType** - A recursive algebraic data type consisting of *Scalars*, *Tuples*, and *Functions*. The *DataType* is purely in the realm of the model. Since *Tuples* and Functions themselves contain multiple *DataType* objects they can be nested with arbitrary complexity, meaning a *DataType* can be represented as a tree. | ||
Also, every *DataType* must contain a *MetaData* object. | ||
* **dimensionality** - The number of flattened *DataTypes* in the domain of a function. | ||
For example, a function with a domain of (x, y) and wavelength has a dimensionality of 3. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "...with a domain of (x, y) and a range of wavelength..."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, wavelength is part of the domain. I will add a range to make things less confusing for both arity and dimensionality.
README.md
Outdated
* **range** - The dependant variables of a function. | ||
* **Sample** - A *Sample* is no more than the dimensionality of a dataset followed by an array of *Data* objects. Nested tuples are not allowed in a *Sample* but *ScalarData* and *FunctionData* are. | ||
* **Scalar** - A *Scalar* is a *DataType* consisting of a single atomic variable. A *DataType* of arbitrary complexity must ultimately consist of *Scalar* objects as the leaf nodes of the representation tree. *Scalar* is a member of the *DataType* algebraic data type. | ||
* **Tuple** - A *Tuple* is a *DataType* consisting of a list of *DataType* objects. Unlike a function, the *Datatype* objects in a *Tuple* are associated but are not dependant on each other. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be "dependent" instead of "dependant".
README.md
Outdated
* **Function** - A *Function* is a *DataType* consisting of exactly two *DataType* objects, a domain and a range. Function is a member of the DataType algebraic data type. | ||
* **functional data model** - Datsets consist of ordered sequences of samples that are decoupled from metadata. Operations that implement the functional algebra can be performed on these samples to create new datasets. | ||
* **Metadata** - An instance of *Metadata* is simply a list of key/value pairs to describe a *DataType* or *Dataset*. Every *Scalar* object is required to contain a *MetaData* object, but *MetaData* objects may be empty for *Tuple* and *Function* objects. | ||
* **range** - The dependant variables of a function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be "dependent" instead of "dependant".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we don't celebrate Independance Day do we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should though... 💃
README.md
Outdated
|
||
# Design Decisions | ||
* *DataType* and *Data* objects are completely decoupled: | ||
* Con: Trying to maintain two identical tree structures is difficult. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this pro/con approach you took here.
README.md
Outdated
* Con: Seems arbitrary to require both a single *Data* item followed by a list of additional *Data* items. | ||
* Pro: Ensures that at least one Data item is in the *TupleData*. | ||
* *Sample* objects are not implemented as a tuple of domain and range, as a *Function* is, but are instead implemented as an integer representing dimensionality followed by the domain and range combined into a single list: | ||
* Con: The dimensionality number feels arbitrary. simply breaking a *Sample* into domain and range would be more elegant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accidental double space between "feels" and "arbitrary". Need to capitalize "simply".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a few comments. Once they're addressed, this all looks awesome to me.
README.md
Outdated
@@ -42,7 +42,7 @@ An example is a 2D point consisting of the x,y *Tuple*. The scalars x and y are | |||
* Con: Seems arbitrary to require both a single *Data* item followed by a list of additional *Data* items. | |||
* Pro: Ensures that at least one Data item is in the *TupleData*. | |||
* *Sample* objects are not implemented as a tuple of domain and range, as a *Function* is, but are instead implemented as an integer representing dimensionality followed by the domain and range combined into a single list: | |||
* Con: The dimensionality number feels arbitrary. simply breaking a *Sample* into domain and range would be more elegant. | |||
* Con: The dimensionality number feels arbitrary. simply breaking a *Sample* into domain and range would be more elegant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still a double space before the word "simply", which still isn't capitalized. :)
I went ahead and approved this PR. I'll let the discussion about the glossary get resolved under Chris's "requested changes". I gave my two cents above. |
@@ -0,0 +1,16 @@ | |||
# Design Decisions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the Geotrellis Architecture Decision Records, they appear to be focusing on larger architectural things, like whether they're using the stlib Future
or scalaz Task
.
We would need to have some standard by which we judge something to be worthy of including in this document.
I could imagine describing DataType
vs. Data
and the nature of Sample
, but I think the bit about TupleData
falls below the bar.
docs/design_decisions.md
Outdated
* Pro: Computation efficiencies require data to be unencumbered with metadata and model concepts. | ||
* *TupleData* is implemented as a tuple consisting of the first *Data* item and all of the rest: | ||
* Con: Seems arbitrary to require both a single *Data* item followed by a list of additional *Data* items. | ||
* Pro: Ensures that at least one Data item is in the *TupleData*. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that this is true. There's an apply
method on the companion object that accepts Seq[Data]
, which could be empty:
scala> TupleData(List())
scala.NotImplementedError: an implementation is missing
at scala.Predef$.$qmark$qmark$qmark(Predef.scala:284)
at latis.data.TupleData$.apply(Data.scala:44)
... 36 elided
docs/design_decisions.md
Outdated
* Con: Trying to maintain two identical tree structures is difficult. | ||
* Pro: Computation efficiencies require data to be unencumbered with metadata and model concepts. | ||
* *TupleData* is implemented as a tuple consisting of the first *Data* item and all of the rest: | ||
* Con: Seems arbitrary to require both a single *Data* item followed by a list of additional *Data* items. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't say it's arbitrary. As you note, it (sort of) ensures that there's at least one item, and that's because this (a single element paired with a list of elements) is how you encode non-empty lists.
@@ -0,0 +1,25 @@ | |||
# Glossary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll reiterate that I think the glossary ought to describe concepts (what "metadata" is, what "dataset" means to us) rather than implementation (Metadata
is Map[String, String]
).
The latter is hopefully apparent from the code and could be described in the Scaladocs. Having implementation details here just adds another place where things are likely to get out of date if our discipline wavers.
Definitions of concepts are less likely to change with time and probably useful enough to be placed at such a high level.
README.md
Outdated
* Pro: The model contains all the information needed to represent nested tuples. | ||
|
||
|
||
see glossary [here](docs/glossary.md) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For accessibility reasons it's probably better for the links to be on "glossary" and "design decisions."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! That totally slipped passed me, but yes, "here" or "see more" links are notorious accessibility issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This information is valuable but it got me thinking about what belongs in the code docs vs here at the top level. I want to remain DRY so we don't have to worry about things getting out of sync. I'd like to see the broad architectural issues here and implementation details in the code. That boundary isn't always very clear, though.
It also struck me that some of this feels like a quick-start guide. I'm wondering if we need another document for that. There are ways to make docs with code that compile so we are more likely to keep them up to date.
I'm fine with this as a good start. We can be agile and evolve the docs as v3 incubates.
README.md
Outdated
The usual collection of *Adapters* and *Writers* are conceptually similar but take better advantage of the new data model implementation. The *Adapter* life cycle, in particular, is much cleaner. This new version of LaTiS will continue to support the same service interface (and more). | ||
|
||
#Glossary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Glossary" does lead me to expect definitions of terms. What do we mean by "metadata" as opposed to how do we implement it. I agree with Chris that the latter belongs in the scaladoc. The question becomes whether we want to duplicate information and risk it getting out of date. I'm inclined to document the high level concepts that are unlikely to change in the glossary. Maybe add a link to the code or scaladoc?
README.md
Outdated
The usual collection of *Adapters* and *Writers* are conceptually similar but take better advantage of the new data model implementation. The *Adapter* life cycle, in particular, is much cleaner. This new version of LaTiS will continue to support the same service interface (and more). | ||
|
||
#Glossary | ||
Items that are capitalized are Scala class names. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those ticks are mapped to the html "code" tag so that seems like the right thing to do when referring to a specific class. Maybe another case where we need to differentiate concepts from actual implementation?
README.md
Outdated
Items that are capitalized are Scala class names. | ||
|
||
* **arity** - The number of un-flattened *DataTypes* in the domain of a function. | ||
For example, a function with a domain of (x, y) and wavelength has an arity of 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we say "domain of ((x, y), wavelength)"?
README.md
Outdated
For example, a function with a domain of (x, y) and wavelength has an arity of 2. | ||
* **Data** - A recursive algebraic data type consisting of *ScalarData*, *TupleData*, and *FunctionData*. Unlike the *DataType*, which describes the model, the *Data* objects contain actual values representing measurements. Since *TupleData* and *FunctionData* contain multiple *Data* objects they can be represented as trees just like DataType objects. | ||
* **Dataset** - A *Dataset* instance consists of three things: | ||
1. *MetaData* for the dataset itself. A list of key/value pairs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, we don't capitalize the "D" in "Metadata".
docs/design_decisions.md
Outdated
* *DataType* and *Data* objects are completely decoupled: | ||
* Con: Trying to maintain two identical tree structures is difficult. | ||
* Pro: Computation efficiencies require data to be unencumbered with metadata and model concepts. | ||
* *TupleData* is implemented as a tuple consisting of the first *Data* item and all of the rest: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if an implementation detail like this should be in the code and/or scaladoc. I was imagining these design decisions to be cross-cutting architectural ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A decision about not having empty tuples might qualify for this high level.
docs/design_decisions.md
Outdated
* Pro: Computation efficiencies require data to be unencumbered with metadata and model concepts. | ||
* *TupleData* is implemented as a tuple consisting of the first *Data* item and all of the rest: | ||
* Con: Seems arbitrary to require both a single *Data* item followed by a list of additional *Data* items. | ||
* Pro: Ensures that at least one Data item is in the *TupleData*. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also disambiguates construction from a Seq[Data], which was the original reason for it.
docs/glossary.md
Outdated
* **dimensionality** - The number of flattened *DataTypes* in the domain of a function. | ||
For example, a function with a domain of (x, y) and wavelength with a range of irradiance has a dimensionality of 3. | ||
* **domain** - The independent variables of a function. | ||
* **fdml** - (Functional Datamodel Markup Language) An xml file format to describe the contents of a *Dataset*. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. What about "Function Data Model Language"? Is "markup" relevant in this context, other that the rich history of *mls?
docs/glossary.md
Outdated
* **fdml** - (Functional Datamodel Markup Language) An xml file format to describe the contents of a *Dataset*. | ||
See the FDML Schema section below. | ||
* **Function** - A *Function* is a *DataType* consisting of exactly two *DataType* objects, a domain and a range. Function is a member of the DataType algebraic data type. | ||
* **functional data model** - Datsets consist of ordered sequences of samples that are decoupled from metadata. Operations that implement the functional algebra can be performed on these samples to create new datasets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like this needs a higher level definition.
No description provided.