[Audit] specify the validation rules #10

Open
qishipengqsp opened this issue Jul 25, 2022 · 2 comments
Labels
Future work This kind of issues are future works to do

Comments

@qishipengqsp
Contributor

Three possible modes for validating the results:

  • run ACID tests, result validation, and throughput tests separately
  • run result validation along with the ACID tests
  • run benchmarking (throughput measurement) along with the ACID tests and result validation at the same time.

As for how to validate the results, there are two possible approaches:

  • self-validation by the driver
  • cross-validation
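
As a rough illustration of the cross-validation option, the sketch below compares the result rows returned by two systems for the same query while ignoring row order. The function name `cross_validate` and the row-as-tuple representation are assumptions made for this example; they are not part of the benchmark driver.

```python
from collections import Counter


def cross_validate(results_a, results_b):
    """Compare two result sets as multisets of rows, ignoring row order.

    Hypothetical helper: rows are assumed to be sequences of hashable values.
    Returns (match, only_in_a, only_in_b).
    """
    count_a = Counter(tuple(row) for row in results_a)
    count_b = Counter(tuple(row) for row in results_b)
    only_in_a = count_a - count_b  # rows (with multiplicity) missing from B
    only_in_b = count_b - count_a  # rows (with multiplicity) missing from A
    return (not only_in_a and not only_in_b), only_in_a, only_in_b


if __name__ == "__main__":
    ok, extra_a, extra_b = cross_validate([(1, "x"), (2, "y")], [(2, "y"), (1, "x")])
    print(ok, dict(extra_a), dict(extra_b))
```

Comparing multisets rather than sets means a system that returns duplicate rows is flagged instead of silently passing.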
@qishipengqsp added the "Future work" label on Jul 25, 2022
@rickatultipa

Here are my general comments on validating graph query results. I hope they help explain why result validation is necessary and what can be done to identify skewed or invalid results.

There are 2 types of operations against any graph database: TP and AP.

  • TP operations are characterized by ACID guarantees on meta-data, and their validation isn't much different from traditional SQL/RDBMS. Our focus should be on AP operations.
  • AP operations generally fall into 3 subtypes: K-Hop, Path and Algorithm (or compound/sophisticated queries).

Result validation for these AP operations has long been ignored or skipped, but there are ways to quickly zoom in on the source data and identify whether any results are skewed:

  1. K-Hop: A regular K-hop query should be implemented with BFS rather than DFS, but you cannot tell which one a system uses unless you validate the results. Results should also always be de-duplicated; clearly, some vendor systems don't de-duplicate by default (including Neo4j, if you are curious why this problem is so prevalent). Another possible problem is that the system's data modeling is outright wrong. For instance, some systems store the edge between a pair of vertices only once, meaning the inverse edge is NOT stored, which makes the K-hop results wrong.
  2. Path: Taking shortest path as an example, finding the shortest paths between a pair of vertices means finding ALL of them, not just 1, when more than one shortest path exists. ArangoDB has an accelerated way of finding a shortest path quickly, but it returns only 1 path, which is clearly wrong.
  3. Algorithm: Taking the popular PageRank as an example, there are two key points to watch out for. First, the entire graph has to be traversed iteratively, not just a handful of vertices and edges. Remember that Neo4j allows you to run PageRank against only a selected number of vertices, say, limit 1000? Second, the results should be ranked, for example in descending order, and the top-10 results, for instance, should be identical across all systems participating in the benchmark. If a system doesn't support top-10 or its results are skewed, you know something is wrong there.
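
To make the K-Hop and Path points concrete, here is a minimal reference sketch that a validator could compare a system's output against. It assumes an undirected graph given as an edge list, and it assumes "K-hop" means the vertices whose shortest distance from the source is exactly K; the function names (`build_undirected`, `k_hop`, `all_shortest_paths`) are hypothetical and do not come from any vendor API.

```python
from collections import deque


def build_undirected(edges):
    """Store every edge in both directions so inverse edges are never lost."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def k_hop(adj, source, k):
    """BFS by levels: vertices at shortest distance exactly k (assumed semantics)."""
    visited = {source}
    frontier = {source}
    for _ in range(k):
        next_frontier = set()
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in visited:  # de-duplicate across all levels
                    visited.add(v)
                    next_frontier.add(v)
        frontier = next_frontier
    return frontier


def all_shortest_paths(adj, source, target):
    """Return every shortest path from source to target, not just one."""
    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:  # first time v is reached
                dist[v] = dist[u] + 1
                parents[v] = [u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:  # another equally short way into v
                parents[v].append(u)
    if target not in dist:
        return []
    paths = []

    def walk(node, suffix):
        # Follow parent links back to the source, building each path in order.
        if node == source:
            paths.append([source] + suffix)
            return
        for p in parents[node]:
            walk(p, [node] + suffix)

    walk(target, [])
    return paths


if __name__ == "__main__":
    adj = build_undirected([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")])
    print(sorted(k_hop(adj, "A", 2)))
    print(sorted(all_shortest_paths(adj, "A", "D")))
```

On this small graph there are two shortest paths from A to D, so a system that returns only one of them would fail the comparison, and a K-hop query starting from E only works because the inverse edge E to D is stored.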

On the other hand, similar to SNB, I'm sure there are vendors who will encapsulate everything under the hood and expose things only via interfaces/APIs. But can they provide equivalent GQL (dialects) showing how a query is implemented? I know I may be asking too much, but I'm only throwing my thoughts out here so that we can strengthen and improve the validation rules.

Best
Ricky

@qishipengqsp
Contributor Author

Thanks for the helpful suggestions. We should certainly improve the validation rules, and we will return to this topic when we get to their detailed design.

@qishipengqsp changed the title from "[Validation] specify the validation rules" to "[Audit] specify the validation rules" on Sep 1, 2022