Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added glossary and design decisions. #3

Open
wants to merge 6 commits into
base: initial-core
Choose a base branch
from

Conversation

pfisterp
Copy link

@pfisterp pfisterp commented Aug 1, 2018

No description provided.

@sapols sapols requested review from sapols and dlindhol August 1, 2018 20:05
@lindholc lindholc self-requested a review August 2, 2018 00:52
Copy link
Member

@lindholc lindholc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some feedback:

  • There are some formatting issues that need to be addressed.
  • I think the glossary and design decisions should be separate documents.
  • I think some of the entries in the glossary are better off in Scaladocs for the respective classes, and what we should have instead are higher-level descriptions of the concepts being represented.

I haven't gone over the wording in great detail.

We should also wait until #2 is merged and rebase on master or do something else to prevent a merge commit between these branches.

README.md Outdated
@@ -1,5 +1,52 @@
#Summary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can see if you look at this branch in the code view, there are some formatting issues with the headings.

You need a space between #s and the text.

README.md Outdated
The usual collection of *Adapters* and *Writers* are conceptually similar but take better advantage of the new data model implementation. The *Adapter* life cycle, in particular, is much cleaner. This new version of LaTiS will continue to support the same service interface (and more).

#Glossary
Items that are capitalized are Scala class names.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically "code things" are surrounded in backticks like this:

`DataType`

So they look like this: DataType

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those ticks are mapped to the html "code" tag so that seems like the right thing to do when referring to a specific class. Maybe another case where we need to differentiate concepts from actual implementation?

README.md Outdated
* **Tuple** - A *Tuple* is a *DataType* consisting of a list of *DataType* objects. Unlike a function, the *Datatype* objects in a *Tuple* are associated but are not dependant on each other.
An example is a 2D point consisting of the x,y *Tuple*. The scalars x and y are associated but x is not a function of y nor is y a function of x. Elements of a *Tuple* also do not have to be of the same type. Tuple is a member of the *DataType* algebraic data type.

#Design Decisions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this should live as a separate document.

README.md Outdated
The usual collection of *Adapters* and *Writers* are conceptually similar but take better advantage of the new data model implementation. The *Adapter* life cycle, in particular, is much cleaner. This new version of LaTiS will continue to support the same service interface (and more).

#Glossary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should also be a separate document.

I like the idea of the README as relatively short document giving a brief overview, some "getting started" tips (like how to pull in the library to SBT or how to download and run a barebones thing), and pointers to additional documentation, like a glossary or design decisions.

I also wonder how much of this duplicates what would be in the docs for certain classes. For instance, the entry for Metadata here seems like it could be a better fit for the Scaladocs and what could go here is a higher-level description of the concept of metadata without our implementation details.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Glossary" does lead me to expect definitions of terms. What do we mean by "metadata" as opposed to how do we implement it. I agree with Chris that the latter belongs in the scaladoc. The question becomes whether we want to duplicate information and risk it getting out of date. I'm inclined to document the high level concepts that are unlikely to change in the glossary. Maybe add a link to the code or scaladoc?

README.md Outdated
* **fdml** - (Functional Datamodel Markup Language) An xml file format to describe the contents of a *Dataset*.
See the FDML Schema section below.
* **Function** - A *Function* is a *DataType* consisting of exactly two *DataType* objects, a domain and a range. Function is a member of the DataType algebraic data type.
* **functional** **data** **model** - Datsets consist of ordered sequences of samples that are decoupled from metadata. Operations that implement the functional algebra can be performed on these samples to create new datasets.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe **functional data model** will work just fine here.

There's probably also more that can be said about the FDM.

@sapols
Copy link

sapols commented Aug 2, 2018

I'm not sure that I agree with everything Chris said. I think I do like the idea of making a separate document for the glossary; I can see how it could clutter up the README. I like the design decisions in the README. But I don't think we need to move anything from the glossary into Scaladocs. Sure, adding stuff to Scaladocs is a good idea, but I think we can do that and still keep the entries in the glossary. I like the idea of having a "one-stop-shop" glossary for our terms better than having to go combing through documentation to find something.

README.md Outdated
@@ -1,5 +1,52 @@
# Summary

*LaTiS version 3* is a rewrite of the *Functional Data Model* and *Algebra* based on lessons learned from version 2, and with a bit more inspiration from Functional Programming. It will incubate as *latis3 version 0.x* until we are ready to finalize the API and release it as *latis version 3.x*.
Copy link

@sapols sapols Aug 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little nitpicky, but we should be consistent with how we capitalize "LaTiS". For example, we have "Latis3 version 0.x" in this paragraph and I think that should be "LaTiS3 version 0.x".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't apologize, these are actual mistakes!

README.md Outdated
Items that are capitalized are Scala class names.

* **arity** - The number of un-flattened *DataTypes* in the domain of a function.
For example, a function with a domain of (x, y) and wavelength has an arity of 2.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "...with a domain of (x, y) and a range of wavelength..."?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, actually wavelength is part of the domain here.

Copy link

@sapols sapols Aug 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh! I misread. Twice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we say "domain of ((x, y), wavelength)"?

README.md Outdated
For example, a function with a domain of (x, y) and wavelength has an arity of 2.
* **Data** - A recursive algebraic data type consisting of *ScalarData*, *TupleData*, and *FunctionData*. Unlike the *DataType*, which describes the model, the *Data* objects contain actual values representing measurements. Since *TupleData* and *FunctionData* contain multiple *Data* objects they can be represented as trees just like DataType objects.
* **Dataset** - A *Dataset* instance consists of three things:
1. *MetaData* for the dataset itself. A list of key/value pairs.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use a colon? "MetaData for the dataset itself: A list of key/value pairs."

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, we don't capitalize the "D" in "Metadata".

README.md Outdated
* **DataType** - A recursive algebraic data type consisting of *Scalars*, *Tuples*, and *Functions*. The *DataType* is purely in the realm of the model. Since *Tuples* and Functions themselves contain multiple *DataType* objects they can be nested with arbitrary complexity, meaning a *DataType* can be represented as a tree.
Also, every *DataType* must contain a *MetaData* object.
* **dimensionality** - The number of flattened *DataTypes* in the domain of a function.
For example, a function with a domain of (x, y) and wavelength has a dimensionality of 3.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "...with a domain of (x, y) and a range of wavelength..."?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, wavelength is part of the domain. I will add a range to make things less confusing for both arity and dimensionality.

README.md Outdated
* **range** - The dependant variables of a function.
* **Sample** - A *Sample* is no more than the dimensionality of a dataset followed by an array of *Data* objects. Nested tuples are not allowed in a *Sample* but *ScalarData* and *FunctionData* are.
* **Scalar** - A *Scalar* is a *DataType* consisting of a single atomic variable. A *DataType* of arbitrary complexity must ultimately consist of *Scalar* objects as the leaf nodes of the representation tree. *Scalar* is a member of the *DataType* algebraic data type.
* **Tuple** - A *Tuple* is a *DataType* consisting of a list of *DataType* objects. Unlike a function, the *Datatype* objects in a *Tuple* are associated but are not dependant on each other.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "dependent" instead of "dependant".

README.md Outdated
* **Function** - A *Function* is a *DataType* consisting of exactly two *DataType* objects, a domain and a range. Function is a member of the DataType algebraic data type.
* **functional data model** - Datsets consist of ordered sequences of samples that are decoupled from metadata. Operations that implement the functional algebra can be performed on these samples to create new datasets.
* **Metadata** - An instance of *Metadata* is simply a list of key/value pairs to describe a *DataType* or *Dataset*. Every *Scalar* object is required to contain a *MetaData* object, but *MetaData* objects may be empty for *Tuple* and *Function* objects.
* **range** - The dependant variables of a function.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "dependent" instead of "dependant".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we don't celebrate Independance Day do we?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should though... 💃

README.md Outdated

# Design Decisions
* *DataType* and *Data* objects are completely decoupled:
* Con: Trying to maintain two identical tree structures is difficult.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this pro/con approach you took here.

README.md Outdated
* Con: Seems arbitrary to require both a single *Data* item followed by a list of additional *Data* items.
* Pro: Ensures that at least one Data item is in the *TupleData*.
* *Sample* objects are not implemented as a tuple of domain and range, as a *Function* is, but are instead implemented as an integer representing dimensionality followed by the domain and range combined into a single list:
* Con: The dimensionality number feels arbitrary. simply breaking a *Sample* into domain and range would be more elegant.
Copy link

@sapols sapols Aug 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental double space between "feels" and "arbitrary". Need to capitalize "simply".

Copy link

@sapols sapols left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few comments. Once they're addressed, this all looks awesome to me.

README.md Outdated
@@ -42,7 +42,7 @@ An example is a 2D point consisting of the x,y *Tuple*. The scalars x and y are
* Con: Seems arbitrary to require both a single *Data* item followed by a list of additional *Data* items.
* Pro: Ensures that at least one Data item is in the *TupleData*.
* *Sample* objects are not implemented as a tuple of domain and range, as a *Function* is, but are instead implemented as an integer representing dimensionality followed by the domain and range combined into a single list:
* Con: The dimensionality number feels arbitrary. simply breaking a *Sample* into domain and range would be more elegant.
* Con: The dimensionality number feels arbitrary. simply breaking a *Sample* into domain and range would be more elegant.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still a double space before the word "simply", which still isn't capitalized. :)

@sapols
Copy link

sapols commented Aug 2, 2018

I went ahead and approved this PR. I'll let the discussion about the glossary get resolved under Chris's "requested changes". I gave my two cents above.

@@ -0,0 +1,16 @@
# Design Decisions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the Geotrellis Architecture Decision Records, they appear to be focusing on larger architectural things, like whether they're using the stlib Future or scalaz Task.

We would need to have some standard by which we judge something to be worthy of including in this document.

I could imagine describing DataType vs. Data and the nature of Sample, but I think the bit about TupleData falls below the bar.

* Pro: Computation efficiencies require data to be unencumbered with metadata and model concepts.
* *TupleData* is implemented as a tuple consisting of the first *Data* item and all of the rest:
* Con: Seems arbitrary to require both a single *Data* item followed by a list of additional *Data* items.
* Pro: Ensures that at least one Data item is in the *TupleData*.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this is true. There's an apply method on the companion object that accepts Seq[Data], which could be empty:

scala> TupleData(List())
scala.NotImplementedError: an implementation is missing
  at scala.Predef$.$qmark$qmark$qmark(Predef.scala:284)
  at latis.data.TupleData$.apply(Data.scala:44)
  ... 36 elided

* Con: Trying to maintain two identical tree structures is difficult.
* Pro: Computation efficiencies require data to be unencumbered with metadata and model concepts.
* *TupleData* is implemented as a tuple consisting of the first *Data* item and all of the rest:
* Con: Seems arbitrary to require both a single *Data* item followed by a list of additional *Data* items.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't say it's arbitrary. As you note, it (sort of) ensures that there's at least one item, and that's because this (a single element paired with a list of elements) is how you encode non-empty lists.

@@ -0,0 +1,25 @@
# Glossary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll reiterate that I think the glossary ought to describe concepts (what "metadata" is, what "dataset" means to us) rather than implementation (Metadata is Map[String, String]).

The latter is hopefully apparent from the code and could be described in the Scaladocs. Having implementation details here just adds another place where things are likely to get out of date if our discipline wavers.

Definitions of concepts are less likely to change with time and probably useful enough to be placed at such a high level.

README.md Outdated
* Pro: The model contains all the information needed to represent nested tuples.


see glossary [here](docs/glossary.md)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For accessibility reasons it's probably better for the links to be on "glossary" and "design decisions."

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! That totally slipped passed me, but yes, "here" or "see more" links are notorious accessibility issues.

Copy link
Member

@dlindhol dlindhol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This information is valuable but it got me thinking about what belongs in the code docs vs here at the top level. I want to remain DRY so we don't have to worry about things getting out of sync. I'd like to see the broad architectural issues here and implementation details in the code. That boundary isn't always very clear, though.

It also struck me that some of this feels like a quick-start guide. I'm wondering if we need another document for that. There are ways to make docs with code that compile so we are more likely to keep them up to date.

I'm fine with this as a good start. We can be agile and evolve the docs as v3 incubates.

README.md Outdated
The usual collection of *Adapters* and *Writers* are conceptually similar but take better advantage of the new data model implementation. The *Adapter* life cycle, in particular, is much cleaner. This new version of LaTiS will continue to support the same service interface (and more).

#Glossary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Glossary" does lead me to expect definitions of terms. What do we mean by "metadata" as opposed to how do we implement it. I agree with Chris that the latter belongs in the scaladoc. The question becomes whether we want to duplicate information and risk it getting out of date. I'm inclined to document the high level concepts that are unlikely to change in the glossary. Maybe add a link to the code or scaladoc?

README.md Outdated
The usual collection of *Adapters* and *Writers* are conceptually similar but take better advantage of the new data model implementation. The *Adapter* life cycle, in particular, is much cleaner. This new version of LaTiS will continue to support the same service interface (and more).

#Glossary
Items that are capitalized are Scala class names.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those ticks are mapped to the html "code" tag so that seems like the right thing to do when referring to a specific class. Maybe another case where we need to differentiate concepts from actual implementation?

README.md Outdated
Items that are capitalized are Scala class names.

* **arity** - The number of un-flattened *DataTypes* in the domain of a function.
For example, a function with a domain of (x, y) and wavelength has an arity of 2.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we say "domain of ((x, y), wavelength)"?

README.md Outdated
For example, a function with a domain of (x, y) and wavelength has an arity of 2.
* **Data** - A recursive algebraic data type consisting of *ScalarData*, *TupleData*, and *FunctionData*. Unlike the *DataType*, which describes the model, the *Data* objects contain actual values representing measurements. Since *TupleData* and *FunctionData* contain multiple *Data* objects they can be represented as trees just like DataType objects.
* **Dataset** - A *Dataset* instance consists of three things:
1. *MetaData* for the dataset itself. A list of key/value pairs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, we don't capitalize the "D" in "Metadata".

* *DataType* and *Data* objects are completely decoupled:
* Con: Trying to maintain two identical tree structures is difficult.
* Pro: Computation efficiencies require data to be unencumbered with metadata and model concepts.
* *TupleData* is implemented as a tuple consisting of the first *Data* item and all of the rest:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if an implementation detail like this should be in the code and/or scaladoc. I was imagining these design decisions to be cross-cutting architectural ones.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A decision about not having empty tuples might qualify for this high level.

* Pro: Computation efficiencies require data to be unencumbered with metadata and model concepts.
* *TupleData* is implemented as a tuple consisting of the first *Data* item and all of the rest:
* Con: Seems arbitrary to require both a single *Data* item followed by a list of additional *Data* items.
* Pro: Ensures that at least one Data item is in the *TupleData*.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also disambiguates construction from a Seq[Data], which was the original reason for it.

docs/glossary.md Outdated
* **dimensionality** - The number of flattened *DataTypes* in the domain of a function.
For example, a function with a domain of (x, y) and wavelength with a range of irradiance has a dimensionality of 3.
* **domain** - The independent variables of a function.
* **fdml** - (Functional Datamodel Markup Language) An xml file format to describe the contents of a *Dataset*.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. What about "Function Data Model Language"? Is "markup" relevant in this context, other that the rich history of *mls?

docs/glossary.md Outdated
* **fdml** - (Functional Datamodel Markup Language) An xml file format to describe the contents of a *Dataset*.
See the FDML Schema section below.
* **Function** - A *Function* is a *DataType* consisting of exactly two *DataType* objects, a domain and a range. Function is a member of the DataType algebraic data type.
* **functional data model** - Datsets consist of ordered sequences of samples that are decoupled from metadata. Operations that implement the functional algebra can be performed on these samples to create new datasets.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like this needs a higher level definition.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants