
SPARKC-577, round two #1250

Open · wants to merge 17 commits into base: b3.0
Conversation


@absurdfarce commented Jun 16, 2020

Description

How did the Spark Cassandra Connector Work or Not Work Before this Patch

The connector was using internal serializable representations of keyspace/table metadata because the corresponding Java driver classes weren't serializable. This changed in Java driver v4.6.0.

General Design of the patch

Existing connector metadata types were converted to traits, with the existing impls becoming a "default" implementation. Trait impls based on Java driver metadata types were also added. Worth noting that the existing representations served two distinct functions:

  1. Metadata retrieval/access
  2. As a definition/descriptor for future table creation

Schema loading functions return an impl based on Java driver metadata types, so (1) is the default for most operations; (2) is primarily used in the ColumnMapper code.
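A minimal sketch of the trait split described above. The names follow the PR's terminology (TableDef, DriverTableDef), but everything here is illustrative: in particular, DriverTableMetadata is a simplified stand-in for the Java driver's TableMetadata so the example is self-contained.

```scala
// Stand-in for the Java driver's table metadata (serializable as of 4.6.0).
case class DriverTableMetadata(keyspace: String, name: String, columns: Seq[String])

// The former concrete class becomes a trait...
trait TableDef {
  def keyspaceName: String
  def tableName: String
  def columnNames: Seq[String]
}

// ...the existing implementation becomes the "default" impl, still usable
// as a descriptor for tables to be created (role 2)...
case class DefaultTableDef(keyspaceName: String,
                           tableName: String,
                           columnNames: Seq[String]) extends TableDef

// ...and a new impl cuts through directly to driver metadata (role 1).
class DriverTableDef(meta: DriverTableMetadata) extends TableDef {
  def keyspaceName: String = meta.keyspace
  def tableName: String = meta.name
  def columnNames: Seq[String] = meta.columns
}
```

Callers that only read metadata never see which impl they hold; ColumnMapper-style callers can keep constructing the default impl by hand.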

Also worth noting: the impl for (1) is largely a direct cut-through to Java driver types and methods, but in some cases that wasn't adequate to implement the trait. For example, the connector metadata types stored some information (such as clustering-column info) at the column level, while the Java driver metadata types represent it at the table level. To work around this, the new impls (based on Java driver metadata types) act as providers of Java metadata types for their own level and every level above them, up to the keyspace. So DriverColumnDef can provide column metadata for the column it represents as well as table and keyspace metadata for the enclosing structures. Distinct traits identify which classes can provide which Java metadata type(s).
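The "provider" idea might be sketched as follows. All type and trait names here are hypothetical stand-ins (the real PR uses the Java driver's KeyspaceMetadata/TableMetadata/ColumnMetadata); the point is that a column-level def can still answer table-level questions, such as whether it is a clustering column.

```scala
// Simplified stand-ins for the Java driver metadata hierarchy.
case class KsMeta(name: String)
case class TblMeta(keyspace: KsMeta, name: String, clusteringColumns: Seq[String])
case class ColMeta(table: TblMeta, name: String)

// Distinct traits identify which Java metadata type(s) a class can provide.
trait KeyspaceMetadataProvider { def keyspaceMetadata: KsMeta }
trait TableMetadataProvider extends KeyspaceMetadataProvider {
  def tableMetadata: TblMeta
  def keyspaceMetadata: KsMeta = tableMetadata.keyspace
}

// A column-level def provides metadata for its own level and every
// enclosing level, so table-level facts remain reachable.
class DriverColumnDef(column: ColMeta) extends TableMetadataProvider {
  def columnMetadata: ColMeta = column
  def tableMetadata: TblMeta = column.table
  def isClusteringColumn: Boolean =
    tableMetadata.clusteringColumns.contains(column.name)
}
```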

Fixes: SPARKC-577

How Has This Been Tested?

Still a WIP; it hasn't been tested meaningfully yet.

Checklist:

  • I have a ticket in the OSS JIRA
  • I have performed a self-review of my own code
  • Locally all tests pass (make sure tests fail without your patch)

@absurdfarce absurdfarce changed the title Sparkc 577 round two SPARKC-577, round two Jun 16, 2020
Commit messages (truncated as shown by GitHub):

  • …nently restored KeyspaceDef as a mechanism for accessing downstream "def" types (TableDef, IndexDef, etc.) from a Schema instance.
  • …umns effectively duplicates columns and was only used in one spot.
  • …se toInternal. Hope was this would address some of the ongoing IT issues we're seeing... but it doesn't look like that was successful. Also expanded Schema's IT to include tests for additional functionality (now that it's being provided largely by an impl backed by Java driver metadata).
  • …ter this change so with any luck we should be good now.
    def tableFromCassandra(session: CqlSession,
                           keyspaceName: String,
                           tableName: String): DriverTableDef = {

      fromCassandra(session, Some(keyspaceName), Some(tableName)).tables.headOption match {
absurdfarce (Author):
I agree that the use of headOption here is a bit cumbersome, but unfortunately we're stuck with fromCassandra() for the foreseeable future. A fair number of users access that functionality through com.datastax.spark.connector.schemaFromCassandra(), which is used in a number of places throughout the code base. It would probably be cleaner to rework this API to return KeyspaceDefs or TableDefs directly rather than using Schema as a common return type (which in turn requires the "Schema with only some keyspaces/tables" nonsense used here). That seemed like a larger and/or discrete effort, however... and this ticket is already big enough. :)

@jtgrabowski (Collaborator):
@absurdfarce I'm not sure if this is the way to go. I thought we wanted to use Driver types and get rid of SCC types completely. The less code to maintain the better. Do you think it would be possible to refactor SCC so that it relies solely on Driver types?

@absurdfarce (Author) commented Jun 25, 2020

There's a bit of a complication there @jtgrabowski , although I'm not completely sure it's fatal.

The old Table/ColumnDef impl served two use cases in the code base. It represented data retrieved from C* metadata but it also could be created to define tables/columns which should be created by SCC. My original take on this ticket tried to separate those concerns... but it snowballed quickly. That led me to the approach used here of pulling these impls into traits and creating parallel impls.

I suppose we might be able to get there by leaving the Table/ColumnDef impls alone and changing the blahFromCassandra() calls to return Java driver types, leaving everything else untouched. I believe we do have at least one case where TableDef.copy() is used to tweak something we get from C*, so in that case we'll have to construct a TableDef based on the underlying Java driver type... but that's probably doable.

There's a second (smaller) problem as well: ColumnDef contains info that is maintained in ColumnMetadata and some info from TableMetadata. So callers interested in that functionality may need to be changed to get a column and then get the corresponding table. Unfortunately the Java driver doesn't allow that traversal directly; they'll have to get the table name and look it up as a separate op. That's entirely doable, but it does represent a bit of an increase in complexity.
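The two-step traversal described above might look like the following sketch. These are simplified stand-ins for the driver types, not the real API: the driver's ColumnMetadata does not link back to its TableMetadata, so the caller takes the parent table's name and looks the table up in the keyspace as a separate operation.

```scala
// Stand-in driver types: a column knows its parent table only by name.
case class ColumnMeta(name: String, parentTable: String)
case class TableMeta(name: String, clusteringColumns: Seq[String])
case class KeyspaceMeta(tables: Map[String, TableMeta])

// Table-level facts about a column require a second lookup by table name.
def isClustering(ks: KeyspaceMeta, col: ColumnMeta): Boolean =
  ks.tables.get(col.parentTable)                    // separate lookup op
    .exists(_.clusteringColumns.contains(col.name)) // then check the table
```

Callers that previously read clustering info straight off a ColumnDef would need both the column and its keyspace in scope, which is the complexity increase mentioned above.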

Anyways, if you're okay with the approach suggested above I can try this again and see how feasible that is to implement.

@jtgrabowski (Collaborator) commented Jun 26, 2020

I see. It's far more complicated than we all expected :) What would we gain with the current approach versus the described one? I initially thought we were going for less code to maintain, but now I'm not sure this whole thing is worth the effort. wdyt?

@absurdfarce (Author) commented Jun 28, 2020

Yeah, it's definitely more complicated than we expected. :)

It seems to me that the following things are true:

  • We won't be able to remove Column/TableDef from the current code base
    • We need these classes (or something very much like them) to describe structures to be created by ColumnMapper
  • There is value in decoupling this function of describing structures to be created from analyzing keyspace/table/column metadata
    • This certainly doesn't require parallel trait impls (like what I've used in this PR)
    • In other contexts I've argued for removing intermediaries and exposing the Java driver API more directly... and I see no reason to argue for something different here
      • A further point: IIRC Russ dealt directly with Java driver metadata already in his DSv2 impl
  • There are two potential issues that occur to me when considering such an impl
    • There are occasions when we do a TableDef.copy() based on retrieved metadata, so in the new world we'd have to support the ability to create a TableDef from Java driver metadata (discussed already on this PR)
    • There are a few spots in code where metadata is expected to conform to FieldDef/StructDef traits so we might have to wrap Java driver metadata in these cases

Given all of these factors I think there's still a decent argument for trying the following:

  • Changing *FromCassandra() methods to return Java driver metadata types
  • Preserving Table/ColumnDef for use as a means to define table creation in ColumnMapper

Doing so gives us the decoupling between table creation and metadata and moves us further along in the goal of interacting with the Java driver types more directly in the code base.
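A sketch of what that proposal could look like. The helper name fromDriverMetadata and the stand-in DriverTableMeta type are assumptions for illustration only; the real code would use the Java driver's TableMetadata and whatever factory name the refactor settles on.

```scala
// Stand-in for the Java driver's TableMetadata.
case class DriverTableMeta(keyspace: String, name: String, columns: Seq[String])

// TableDef survives only as a descriptor for tables to be created
// (the ColumnMapper use case).
case class TableDef(keyspaceName: String, tableName: String, columnNames: Seq[String])

object TableDef {
  // Covers the TableDef.copy() use case discussed above: build a
  // mutable-by-copy descriptor from retrieved driver metadata.
  def fromDriverMetadata(meta: DriverTableMeta): TableDef =
    TableDef(meta.keyspace, meta.name, meta.columns)
}

// Metadata retrieval now yields driver types directly, not connector types.
def tableFromCassandra(schema: Map[String, DriverTableMeta],
                       table: String): Option[DriverTableMeta] =
  schema.get(table)
```

With this split, metadata readers depend only on driver types, while table creation keeps a small connector-owned descriptor.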

wdyt @jtgrabowski ?

@jtgrabowski (Collaborator):
Let's give it a try if you think it's feasible. I traced a couple of *FromCassandra invocations and it looks like it will take a bit of work to convert them.
