What targets/databases has this been used with? - Assume schema properties from documents #5
Comments
@MadLittleMods I believe it was run with I can think of a few possibilities to consider. That should be a valid JSON schema as written, but I'm sure there are targets/libraries/functions that expect a The other thing that comes to mind, is you may need to write |
Could we update Or should targets just support btw, I work at GitLab, specifically on the Gitter team trying to get data from Mongo to Snowflake for analysis in Looker. GitLab also made We are currently discussing usage in https://gitlab.com/meltano/meltano/issues/113 |
@MadLittleMods Thanks for your patience on this, I've had a lot of discussions to aggregate thoughts on how this fits into the Singer world. It's a bit of a unique situation with MongoDB's stance on data typing and schema. Here is a brain dump of what I've gathered: Who's Responsible for Data Types? Schema w/NoSQL Your suggestion of emitting a schema as it changes should work with most targets, as they should be updating their internal schema upon receipt of a message. However, I have performance concerns about the time it would take to perform a full schema generation + diff for each object retrieved. This might not be an issue in practice, but seemed worthwhile to mention. Potential Path So, in the general case of a strict schema mode, I can imagine something like a schema being specified in a specialized object within each collection, or in a specific Does that make sense? If you want to solve for your immediate use case without the general concerns, you can feel free to do the development work in a Fork and we can continue discussing it for incorporation. |
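To make the "emit a schema as it changes" idea and its cost concern concrete, here is a minimal sketch. It is not how tap-mongodb works today. It only tracks top-level field types, so the per-document check stays cheap, and it re-emits a SCHEMA message only when a document adds a field or type not seen before. The helper names (`top_level_types`, `merge_schema`, `messages`) and the `_id` key property are assumptions made for illustration.

```python
import copy

# Exact-type lookup, so bool is not mistaken for int.
JSON_TYPES = {
    bool: "boolean",
    int: "integer",
    float: "number",
    str: "string",
    dict: "object",
    list: "array",
    type(None): "null",
}


def top_level_types(doc):
    """Map each top-level field to a JSON Schema type name (hypothetical helper).

    Unknown Python types (ObjectId, datetime, ...) fall back to "string".
    """
    return {key: JSON_TYPES.get(type(value), "string") for key, value in doc.items()}


def merge_schema(schema, doc):
    """Fold one document into the running schema; return True if it changed."""
    changed = False
    props = schema.setdefault("properties", {})
    for key, type_name in top_level_types(doc).items():
        types = props.setdefault(key, {"type": []})["type"]
        if type_name not in types:
            types.append(type_name)
            changed = True
    return changed


def messages(stream, docs):
    """Yield Singer-style messages, re-emitting SCHEMA only when it changes."""
    schema = {"type": "object", "properties": {}}
    for doc in docs:
        if merge_schema(schema, doc):
            yield {
                "type": "SCHEMA",
                "stream": stream,
                "schema": copy.deepcopy(schema),  # snapshot, not the live dict
                "key_properties": ["_id"],  # assumption: _id is the key
            }
        yield {"type": "RECORD", "stream": stream, "record": doc}
```

A full recursive schema generation and diff per object would cost more than this top-level check, which is the trade-off raised above.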
Perhaps that was a bit long-winded for a simple answer, but the TLDR as it pertains to your question is that the tap should provide as much information as it can about the schema of the data, and the target should use that, along with its knowledge of the destination platform, to make the best decision at the time of data typing. The rest is just the ever-present question of |
@dmosorast Just to be clear, I assume the proposed I think my current use case can be solved by just checking the latest document in the collection. I know in my case, the schema of For reference, with
|
@MadLittleMods I think the canonical form for
{
  "type": "object",
  "properties": {
    "id": {"type": "string"},
    "d": {"type": "object"},
    "t": {"type": "string" ... (whatever is needed for date fields)}
  }
}
Then let the target unnest the fields and the output should have at least 3 columns: |
@MadLittleMods Correct, the configuration option I mentioned would be for the tap itself. I would expect it to need to at least be defined on a stream level, and maybe default to something that makes sense when a collection is not present in the config (whichever works out the best between the two modes). For example, the config could be something like:
Checking the last document in the collection and inferring schema from that sounds like a good first pass without being too opinionated, and the de-nesting that |
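As a rough illustration of the "infer from the last document" first pass, the sketch below builds a JSON schema from one sample document. It is an assumption-laden example, not the tap's implementation: `infer_schema` is a hypothetical helper, the sample field names `id`, `d`, and `t` are borrowed from the example earlier in this thread, and mapping `datetime` values to `"format": "date-time"` is my guess at what date fields need.

```python
from datetime import datetime


def infer_schema(value):
    """Return a JSON Schema fragment describing a single Python value."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, datetime):
        return {"type": "string", "format": "date-time"}
    if isinstance(value, dict):
        # Recurse so nested documents keep their own properties.
        return {
            "type": "object",
            "properties": {key: infer_schema(item) for key, item in value.items()},
        }
    if isinstance(value, list):
        return {"type": "array"}
    if value is None:
        return {"type": "null"}
    # Fallback for str and for types such as ObjectId (assumption).
    return {"type": "string"}


# Stand-in for the latest document in a collection.
latest = {"id": "abc", "d": {"n": 1}, "t": datetime(2019, 1, 1)}
schema = infer_schema(latest)
```

With pymongo, the sample could be fetched with something like `collection.find_one(sort=[("_id", -1)])`, assuming `_id` is an ObjectId that increases over time. Fields that are missing from that one document will not appear in the schema, which is the main limitation of this mode.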
Closing this for now. Feel free to open another issue or reopen if needed! |
@MadLittleMods @micaelbergeron @dmosorast This Issue became fairly wide-ranging in its discussion. Could one of you take a stab at writing up a |
tldr;
tap-mongodb is tested with target-csv which doesn't have any strict schema issues and can just vacuum up whatever documents
properties. This means that taps that work with strict schemas, don't pull any data when a document flows through.
schema_modes in config.json which can assume the schema fields from the latest document in the collection, see Implement schema_modes config to infer schema fields from document #6
tap-mongodb generates a schema with no properties which isn't playing nice with target-snowflake. It errors but even after fixing the error, none of the fields are transferred. Just a bunch of rows with time_extracted field
Just curious, what targets has tap-mongodb been used with so I can test and compare with target-snowflake?