
[FEA]: Add more use cases to the Vector Database Upload Example #1452

Closed
9 of 15 tasks
mdemoret-nv opened this issue Jan 4, 2024 · 1 comment
Labels: feature request, sherlock

mdemoret-nv (Contributor) commented Jan 4, 2024

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request?

Medium

Please provide a clear description of the problem this feature solves

The VDB upload example is great for showing how a VDB upload can be done quickly with Morpheus for RSS feeds, but it is limited to a single use case. The example could be improved by adding other document types or additional sources to parse.

Describe your ideal solution

Add new capabilities to the VDB upload example to support additional workflows and ingest different types of documents. The specific document types and their sources are flexible, as long as the example demonstrates how to ingest a different type of document.

Completion Criteria

  • Find a new type of document to ingest. Guidelines for choosing the document type:
    • It should be different from the current RSS feed document type (i.e. it should NOT be a webpage)
    • A parser should be available in LangChain and Haystack (Haystack support is optional)
    • The dataset should be relatively large (aiming for a 5-10 minute pipeline runtime)
    • The dataset should be publicly available and easily downloaded (i.e. no access restrictions)
    • It should be possible to download the dataset repeatedly, or to cache the downloads locally. For example, when running the RSS example, some websites will block your IP if they see many requests in a short time, so we cache at the request level to avoid hitting a server every time the demo runs.
    • [Optional] It should be possible to ingest the document with RAPIDS, which makes CSV and JSON formats appealing (see the ingest sketch after this list)
    • [Optional] It should be related to cyber security
  • Create new source classes, to be used instead of RSSSourceStage and WebScraperStage, to process the documents
    • Create unit tests to verify the correct functionality of these classes (see the test sketch after this list)
  • Add new commands to the CLI to choose which workflow to run
    • The organization of the commands may need to be refactored to support multiple workflows; look into click command chaining (see the CLI sketch after this list)
  • Update the README.md for the upload example with new information about the different workflows and how to invoke them
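
For illustration, here is a minimal sketch of what ingesting a non-webpage document type could look like, using a LangChain parser and (optionally) RAPIDS cuDF for CSV/JSON data. The file paths, the `content` column, and the helper names (`load_pdf_documents`, `load_json_records`) are hypothetical placeholders, not part of the existing example:

```python
# Hypothetical ingest sketch -- paths, column names and helper names are placeholders.
from langchain.document_loaders import PyPDFLoader  # PDF parser shipped with LangChain (needs pypdf)

import cudf  # RAPIDS GPU DataFrame library, optional path for CSV/JSON datasets


def load_pdf_documents(pdf_path: str) -> list[str]:
    """Parse a single PDF into page-level text chunks via LangChain."""
    loader = PyPDFLoader(pdf_path)
    return [doc.page_content for doc in loader.load()]


def load_json_records(json_path: str) -> cudf.DataFrame:
    """Read a line-delimited JSON dataset on the GPU with cuDF."""
    df = cudf.read_json(json_path, lines=True)
    # Keep only the text column that will be embedded and uploaded to the VDB,
    # dropping rows with no text.
    return df[["content"]].dropna()
```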
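
A unit test for such a loader could follow the usual pytest pattern. The sketch below exercises the hypothetical `load_json_records` helper from the previous snippet, so the module path and fixture data are assumptions as well:

```python
# Hypothetical pytest sketch; the module path and the load_json_records helper
# come from the illustration above, not from the existing example.
import json

import pytest

cudf = pytest.importorskip("cudf")  # skip cleanly on machines without the RAPIDS stack

from vdb_upload.sources import load_json_records  # hypothetical module path


def test_load_json_records_drops_empty_rows(tmp_path):
    # Small line-delimited JSON fixture with one null text field.
    records = [{"content": "example document text"}, {"content": None}]
    fixture = tmp_path / "records.jsonl"
    fixture.write_text("\n".join(json.dumps(r) for r in records))

    df = load_json_records(str(fixture))

    # Only the non-null row should survive the dropna() in the helper.
    assert len(df) == 1
    assert df["content"].iloc[0] == "example document text"
```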
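
For the CLI refactor, click's command chaining (`chain=True` on `click.group`) lets several workflow sub-commands be invoked in a single call. A standalone sketch with hypothetical workflow names, not tied to the existing example's CLI:

```python
# Minimal click command-chaining sketch; the workflow names are hypothetical.
import click


@click.group(chain=True)
@click.option("--vdb-name", default="vdb_example", help="Target vector database collection.")
@click.pass_context
def cli(ctx, vdb_name):
    """Entry point that allows multiple workflow commands to be chained in one call."""
    ctx.ensure_object(dict)
    ctx.obj["vdb_name"] = vdb_name


@cli.command("rss")
@click.pass_context
def rss_workflow(ctx):
    """Run the existing RSS-feed ingest workflow."""
    click.echo(f"Running RSS workflow -> {ctx.obj['vdb_name']}")


@cli.command("pdf")
@click.pass_context
def pdf_workflow(ctx):
    """Run a new PDF ingest workflow."""
    click.echo(f"Running PDF workflow -> {ctx.obj['vdb_name']}")


if __name__ == "__main__":
    # Example invocation chaining both workflows: python cli.py --vdb-name demo rss pdf
    cli(obj={})
```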

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
mdemoret-nv added the feature request and sherlock labels on Jan 4, 2024
mdemoret-nv (Contributor, Author) commented:

Closed in #1454
