feat(doris): add catalog support for Apache Doris #31580

Open: @liujiwen-up wants to merge 6 commits into base: master

@liujiwen-up (Contributor) commented Dec 20, 2024

SUMMARY

Add catalog support for Apache Doris.
To stay compatible with different BI tools, Apache Doris exposes an information_schema in each catalog. When no catalog is specified, the default catalog is internal. This feature corresponds to the Doris PR apache/doris#28919.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@korbit-ai (bot) left a comment

I've completed my review and didn't find any issues... but I did find this chicken.

  \\
   (o>
  <_ )
   ^^
Files scanned
File Path Reviewed
superset/db_engine_specs/doris.py


codecov (bot) commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 95.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 83.77%. Comparing base (76d897e) to head (1d7483b).
Report is 1243 commits behind head on master.

Files with missing lines             Patch %   Lines
superset/db_engine_specs/doris.py    95.00%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #31580       +/-   ##
===========================================
+ Coverage   60.48%   83.77%   +23.28%     
===========================================
  Files        1931      538     -1393     
  Lines       76236    39141    -37095     
  Branches     8568        0     -8568     
===========================================
- Hits        46114    32790    -13324     
+ Misses      28017     6351    -21666     
+ Partials     2105        0     -2105     
Flag         Coverage Δ
hive         48.75% <35.00%> (-0.41%) ⬇️
javascript   ?
mysql        76.47% <35.00%> (?)
postgres     76.56% <35.00%> (?)
presto       53.27% <35.00%> (-0.54%) ⬇️
python       83.77% <95.00%> (+20.28%) ⬆️
sqlite       76.02% <35.00%> (?)
unit         60.90% <95.00%> (+3.27%) ⬆️


@pull-request-size pull-request-size bot added size/L and removed size/M labels Dec 20, 2024
@liujiwen-up liujiwen-up changed the title feat:add catalog for apache doris feat(doris): add catalog support for Apache Doris Dec 20, 2024
@michael-s-molina (Member)

@mistercrunch Any idea about the docker image error?

@rusackas (Member)

@betodealmeida is definitely the subject matter expert on catalog support lately!

@villebro (Member) left a comment

This PR is missing a proper description, making it difficult to understand key details in the featured changes. Also, a few general comments that caught my eye while reviewing.

tests/unit_tests/db_engine_specs/test_doris.py (outdated, resolved)
tests/unit_tests/db_engine_specs/test_doris.py (outdated, resolved)
@villebro (Member) left a comment

Thanks for the improvements @liujiwen-up! One last comment, after that I feel this is good to go 👍

Comment on lines 251 to 260
        if uri.database and "." in uri.database:
            current_catalog, _ = uri.database.split(".", 1)
        else:
            current_catalog = "internal"

        # In Apache Doris, each catalog has an information_schema for BI tool
        # compatibility. See: https://github.com/apache/doris/pull/28919
        adjusted_database = ".".join(
            [catalog or current_catalog or "", "information_schema"]
        ).rstrip(".")

@villebro (Member) commented Dec 23, 2024

This may sound like a total nit, but I actually had some issues following what's going on here, especially the catalog or current_catalog or "" logic. As current_catalog is unnecessary if catalog is defined, I would have maybe just reused the latter variable for all these uses. Something like:

        if catalog:
            pass
        elif uri.database and "." in uri.database:
            catalog, _ = uri.database.split(".", 1) or ""  # notice how I also moved the `or ""` part here
        else:
            catalog = "internal"

Then later just

        adjusted_database = ".".join([catalog, "information_schema"])

Also, why is .rstrip(".") needed? I don't see how we can ever hit that, as adjusted_database will always end with .information_schema, right?
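
The simplification suggested above can be sketched as a standalone helper. This is a minimal sketch under stated assumptions, not the code in the PR: resolve_catalog and information_schema_database are hypothetical names, and the database argument stands in for uri.database. Because str.split always returns a non-empty list, no "or" fallback is used here, and because the joined result always ends in .information_schema, there is nothing for rstrip(".") to strip.

```python
from typing import Optional

# Default catalog name, taken from the PR summary ("the default catalog is internal").
DEFAULT_CATALOG = "internal"


def resolve_catalog(catalog: Optional[str], database: Optional[str]) -> str:
    """Hypothetical helper: explicit catalog first, then URI prefix, then default."""
    if catalog:
        return catalog
    if database and "." in database:
        # "catalog.schema" -> "catalog"
        return database.split(".", 1)[0]
    return DEFAULT_CATALOG


def information_schema_database(
    catalog: Optional[str], database: Optional[str]
) -> str:
    """Hypothetical helper: build "<catalog>.information_schema"."""
    return ".".join([resolve_catalog(catalog, database), "information_schema"])
```

With a single catalog variable there is one place where the fallback order is decided, which is the readability point raised in the comment.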

@liujiwen-up (Contributor, Author)

@villebro Thanks for your advice. After more in-depth testing with Doris, we found a remaining problem. The previous tests only covered connecting a data source. SQL Lab also goes through this function, so the function cannot always use the information_schema database. When a schema value is present, the user-provided schema should be used for querying. The current implementation has the correct behavior:

  1. When connecting a data source, the schema is empty, so the information_schema database is used.
  2. When the schema has a value, the user-provided schema is used.
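
The two cases above can be sketched as follows. This is an illustrative sketch only: adjust_database and its parameters are hypothetical names rather than the function in superset/db_engine_specs/doris.py, and the internal default is taken from the PR summary.

```python
from typing import Optional


def adjust_database(
    database: Optional[str],
    catalog: Optional[str] = None,
    schema: Optional[str] = None,
) -> str:
    """Hypothetical helper returning "<catalog>.<schema>" for the connection."""
    if not catalog:
        if database and "." in database:
            # Reuse the catalog prefix of "catalog.schema" from the URI.
            catalog = database.split(".", 1)[0]
        else:
            # No catalog anywhere: fall back to the default catalog.
            catalog = "internal"
    # Case 1: empty schema -> information_schema.
    # Case 2: schema provided -> use the user-provided schema.
    return f"{catalog}.{schema or 'information_schema'}"
```

For example, adjust_database("hive.sales") covers the data source case, while adjust_database("hive.sales", schema="marketing") covers the SQL Lab case with a selected schema.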

@liujiwen-up (Contributor, Author)

@villebro Could you please help move this forward? Thank you.
Also, once the PR is merged, which Superset release will include it? I need this information to update the official Apache Doris documentation for Superset users.

@villebro (Member) left a comment

One more round to simplify the code (it'll be easier for maintainers to carry this code forward with less duplication and ambiguity). We should be able to get this into 5.0 as long as we can get this merged before the release cut (no rush yet, but let's try to finish this as soon as possible).

One more thing that comes to mind: is it really necessary to assign information_schema to the connection string if no schema is selected? Typically we just leave it unspecified (if someone wants to access tables in the information_schema, they can just choose that schema explicitly).

superset/db_engine_specs/doris.py (outdated, resolved)
4 participants