Improvements of exposed data #215

Closed
claudiu-cristea opened this issue Sep 16, 2023 · 5 comments

@claudiu-cristea

claudiu-cristea commented Sep 16, 2023

I'm missing some information that would allow me to fully understand an entry from https://api.developers.italia.it/v1/software:

The code platform

With self-hosted GitLab or Bitbucket software, it's difficult to tell which platform is behind a URL. Is it GitLab or Bitbucket? Which API should I use to fetch more info about the project? This info is not part of the publiccodeYml blob either. However, https://github.com/italia/publiccode-crawler knows it, and it would be nice to have it exposed at the same level as id, url, publiccodeYml, etc. For instance:

{
  "data": [
    {
      "id": "9c07d69d-f66e-44b7-93ff-f19509e47dcf",
      "platformType": "gitlab",
      "url": "https://riuso.comune.salerno.it/root/simel_2.git",
      ...
    },
    ...
  ],
  ...
}

Of course, platformType could be any of github, gitlab, bitbucket.

The project's full path

Let's take a hypothetical case: a self-hosted GitLab project with this URL: https://example.com/base/path/group1/group2/group3/project.git. Note that the GitLab instance is installed at https://example.com/base/path (in a subdirectory, relative to the domain).

If a consumer of the https://api.developers.italia.it/v1/software API wants to determine the project's full path (namespace and project) by extracting it from the URL, they will fail, because the URL path is misleading. Most probably they will assume that everything after the host is the project's full path:

  • namespace: base/path/group1/group2/group3
  • project name: project

But this is wrong, as the project's namespace is group1/group2/group3. Again, this information is also missing from the publiccodeYml blob.
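The naive extraction described above can be reproduced with a minimal Python sketch. The helper name `naive_full_path` is hypothetical and only illustrates the pitfall; it is not part of any API or library mentioned in this issue.

```python
from urllib.parse import urlparse


def naive_full_path(repo_url: str) -> str:
    """Treat everything after the host as the project's full path (the naive guess)."""
    path = urlparse(repo_url).path.strip("/")
    # Drop a trailing ".git" suffix, if present.
    return path[:-4] if path.endswith(".git") else path


url = "https://example.com/base/path/group1/group2/group3/project.git"
guess = naive_full_path(url)
namespace, _, project = guess.rpartition("/")

print(namespace)  # base/path/group1/group2/group3 (wrong: includes the install sub-directory)
print(project)    # project
```

Nothing in the URL itself marks where the GitLab installation path ends and the namespace begins, which is why the guess cannot be corrected without extra information.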

I think this info should be exposed, something like:

    {
      "id": "9c07d69d-f66e-44b7-93ff-f19509e47dcf",
      "fullPath": "group1/group2/group3/project",
      "url": "https://example.com/base/path/group1/group2/group3/project.git",
      ...
    },

Moreover, as far as I can see, this info is already available and exposed by the /software/{softwareId}/logs path. With the full path exposed, a consumer can derive the base URL of the self-hosted GitLab/Bitbucket platform.
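As a sketch of that derivation, assuming the API exposed the proposed (and so far hypothetical) `fullPath` field next to `url`, a consumer could strip the full path from the end of the URL. The function name `platform_base_url` is made up for this example.

```python
def platform_base_url(repo_url: str, full_path: str) -> str:
    """Derive the platform's base URL by removing the project's full path from the repo URL."""
    trimmed = repo_url.rstrip("/")
    if trimmed.endswith(".git"):
        trimmed = trimmed[:-4]
    suffix = "/" + full_path.strip("/")
    if not trimmed.endswith(suffix):
        raise ValueError(f"{repo_url!r} does not end with full path {full_path!r}")
    return trimmed[: -len(suffix)]


base = platform_base_url(
    "https://example.com/base/path/group1/group2/group3/project.git",
    "group1/group2/group3/project",  # value of the proposed "fullPath" field
)
print(base)  # https://example.com/base/path
```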

@bfabio
Member

bfabio commented Sep 20, 2023

Hey @ClaudiuCristea,

First off, thanks for bringing this up! It's always great to see community members contributing ideas to improve the project. I've got a few reservations on the proposed changes:

  • Introducing strings as symbolic names (github, gitea, etc.) for code hostings might add an extra layer of complexity. It means we'd have to maintain a list of these symbolic names and ensure they're consistent across the board.

  • Different versions of APIs or software for code hosting platforms like Gitea, GitLab, etc., can pose a challenge. How do we ensure compatibility and handle discrepancies between versions? We'd need more symbolic names for each of them (e.g. gitlab-v3, gitlab-v4, etc.).

  • If a code hosting platform isn't supported, would we default to "other"? This seems like it would bring us back to the initial problem, where we're not providing enough clarity.

  • Projects sometimes migrate from one code hosting platform to another for various reasons. If a project switches its hosting, the platformType might not reflect the current state until the next crawl. This lag could lead to misinformation or confusion.

I think it all revolves around your wanting to know the base URL and the API type in order to query it, but I believe the task of detecting the API or the underlying technology might be better suited to a dedicated library.
For instance, in publiccode-crawler, we use go-vcsurl for this purpose which is limited to GitHub, GitLab (cloud or self-hosted) and Bitbucket, but can be expanded.

I think it's fair to think of developers-italia-api as being agnostic: it knows where the software is, but makes no assumptions about how you access it.

This way, instead of relying on a potentially outdated platformType, we can ensure real-time accuracy, providing insights into the hosting platform, its version, and other relevant metadata. This approach not only reduces the manual overhead, but also eliminates the need for clients to be up to date with our hypothetical symbolic names AND for them to implement the actual logic.

On top of that, most projects are on github.com or gitlab.com, so just by looking at the URL, we can tell where they are from.
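For the common hosted case, the check can be as small as a host lookup. This is only an illustrative Python sketch: the table `KNOWN_HOSTS` and the function `platform_from_host` are hypothetical names, and this is not a description of how go-vcsurl or publiccode-crawler implement detection.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical lookup table covering only the well-known hosted services.
KNOWN_HOSTS = {
    "github.com": "github",
    "gitlab.com": "gitlab",
    "bitbucket.org": "bitbucket",
}


def platform_from_host(repo_url: str) -> Optional[str]:
    """Return a platform name for well-known hosts, or None for anything else."""
    host = (urlparse(repo_url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return KNOWN_HOSTS.get(host)


print(platform_from_host("https://github.com/italia/publiccode-crawler"))      # github
print(platform_from_host("https://riuso.comune.salerno.it/root/simel_2.git"))  # None
```

A `None` result is exactly the self-hosted case discussed in this issue, where the host name alone says nothing about the platform.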

Can you maybe reuse publiccode-crawler, or adapt it for your needs?

I hope these points resonate with your thoughts. If there's anything else in the proposed change that might need attention, I'd be happy to discuss further. Let's keep the collaboration going!

@claudiu-cristea
Author

claudiu-cristea commented Sep 21, 2023

@bfabio,

Thank you for the reply. A few remarks:

I looked at https://github.com/alranel/go-vcsurl, and I'm thinking of a very similar approach to guess the API from the URL. Given all your points, I understand that exposing the API type is not going to be supported here. I see most of them as valid points. Some may be debatable but, yes, that's it: I can live with maintaining my own API guesser.

I looked at the code of https://github.com/alranel/go-vcsurl (though I have zero Go knowledge), but I can't find an answer to my second point: detecting the code hosting platform's URL. Or, the other way around, detecting the project's full path from the full URL. Of course, I'm referring to the situation where a GitLab instance is located not at the host root but in a subdirectory. It seems to me that the code assumes the code platform is installed directly under the host (which, I agree, is true in most cases). Maybe I'm missing something?

@bfabio
Member

bfabio commented Sep 22, 2023

> I looked at the code of https://github.com/alranel/go-vcsurl (though I have zero Go knowledge), but I can't find an answer to my second point: detecting the code hosting platform's URL. Or, the other way around, detecting the project's full path from the full URL. Of course, I'm referring to the situation where a GitLab instance is located not at the host root but in a subdirectory. It seems to me that the code assumes the code platform is installed directly under the host (which, I agree, is true in most cases). Maybe I'm missing something?

@claudiu-cristea you're spot on about go-vcsurl and publiccode-crawler currently assuming the code platform is right under the host. This is due to historical and practical reasons: namely, we never had to deal with that scenario :)

The thought was to potentially extend go-vcsurl or a similar library to manage these cases. It doesn't have to be go-vcsurl specifically; another library could work just as well. The key is to centralize the logic for detecting platform URLs and extracting project paths, making it more adaptable to different setups.

publiccode-crawler and/or go-vcsurl have to be extended regardless if we want new code hosting platforms (italia/publiccode-crawler#132) or plain git URLs (italia/publiccode-crawler#196)

I'm not an expert with PHP, but I think there is a way to load a Go library and call it with FFI?

@claudiu-cristea
Author

@bfabio, thank you for clarifying, and sorry for the late feedback.

We took a slightly different approach because we're using code hosting platform plugins (e.g. a GitHub plugin). Each plugin knows how to determine whether it is responsible for handling a given URL. We then cache the result, so next time we know which API to use.
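A rough Python sketch of that plugin-plus-cache idea follows. Everything in it is an assumption made for illustration: the names `Plugin`, `PluginRegistry`, and `host_is`, the choice to cache per repository URL, and the host-based checks. It does not reflect the actual plugin code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional
from urllib.parse import urlparse


@dataclass
class Plugin:
    name: str
    supports: Callable[[str], bool]  # decides whether this plugin handles the URL


class PluginRegistry:
    def __init__(self, plugins: List[Plugin]) -> None:
        self._plugins = plugins
        self._cache: Dict[str, Optional[str]] = {}
        self.lookups = 0  # number of cache misses, kept only for demonstration

    def resolve(self, repo_url: str) -> Optional[str]:
        """Ask each plugin in turn, caching the answer per repository URL."""
        if repo_url not in self._cache:
            self.lookups += 1
            self._cache[repo_url] = next(
                (p.name for p in self._plugins if p.supports(repo_url)), None
            )
        return self._cache[repo_url]


def host_is(expected: str) -> Callable[[str], bool]:
    """Build a simple host-based check (a stand-in for real plugin logic)."""
    return lambda url: urlparse(url).hostname == expected


registry = PluginRegistry(
    [Plugin("github", host_is("github.com")), Plugin("gitlab", host_is("gitlab.com"))]
)
first = registry.resolve("https://github.com/italia/publiccode-crawler")
second = registry.resolve("https://github.com/italia/publiccode-crawler")  # served from cache
print(first, second, registry.lookups)
```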

We also solved the "GitLab installed under a sub-dir" case by performing some additional HTTP requests, but only in the non-standard case.
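The comment does not say which requests are performed, so the following Python sketch is only one possible way to do it. It assumes the GitLab REST API's `GET /api/v4/projects/:id` endpoint, which accepts a URL-encoded project path as `:id`, and tries each possible split between install path and project path, starting with the standard root install. The names `find_gitlab_base`, `probe`, and `fake_probe` are hypothetical; `probe` stands in for a real HTTP request so the example runs offline.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import quote, urlparse


def find_gitlab_base(
    repo_url: str, probe: Callable[[str], bool]
) -> Optional[Tuple[str, str]]:
    """Return (base_url, full_path) for the first split whose API URL the probe accepts."""
    parsed = urlparse(repo_url)
    path = parsed.path.strip("/")
    if path.endswith(".git"):
        path = path[:-4]
    segments = [s for s in path.split("/") if s]
    root = f"{parsed.scheme}://{parsed.netloc}"
    # split == 0 is the standard root install; a project needs namespace + name,
    # so at least two segments must remain for the full path.
    for split in range(0, len(segments) - 1):
        base = "/".join([root] + segments[:split])
        full_path = "/".join(segments[split:])
        api_url = f"{base}/api/v4/projects/{quote(full_path, safe='')}"
        if probe(api_url):
            return base, full_path
    return None


def fake_probe(api_url: str) -> bool:
    """Stand-in for an HTTP GET that succeeds only for the real install location."""
    return api_url.startswith("https://example.com/base/path/api/v4/")


result = find_gitlab_base(
    "https://example.com/base/path/group1/group2/group3/project.git", fake_probe
)
print(result)  # ('https://example.com/base/path', 'group1/group2/group3/project')
```

With a real probe, private projects would not be readable without authentication, so a failed lookup does not prove the split is wrong.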

Thank you again for the support. Closing this issue.

@bfabio
Member

bfabio commented Oct 31, 2023

@claudiu-cristea nice to know that approach makes sense, the plugins are kinda like the scanners in publiccode-crawler
