-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add description field with make_model_config_json function #203
Add description field with make_model_config_json function #203
Conversation
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
09e9da3
to
2d48813
Compare
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Codecov Report
@@ Coverage Diff @@
## main #203 +/- ##
==========================================
+ Coverage 91.15% 91.23% +0.08%
==========================================
Files 37 37
Lines 4092 4130 +38
==========================================
+ Hits 3730 3768 +38
Misses 362 362
|
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
CHANGELOG.md
Outdated
@@ -29,6 +29,7 @@ Inspired from [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) | |||
- Make make_model_config_json function more concise by @thanawan-atc in ([#191](https://github.com/opensearch-project/opensearch-py-ml/pull/191)) | |||
- Enabled auto-truncation for any pretrained models by @Yerzhaisang in ([#192](https://github.com/opensearch-project/opensearch-py-ml/pull/192)) | |||
- Generalize make_model_config_json function by @thanawan-atc in ([#200](https://github.com/opensearch-project/opensearch-py-ml/pull/200)) | |||
- Enable make_model_config_json to add model description to model config file by @thanawan-atc in ([#203](https://github.com/opensearch-project/opensearch-py-ml/pull/203)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1.1.0
is already released. Let's add another section for 1.2.0
?
@@ -109,7 +109,7 @@ def check_values(self, oml_obj, pd_obj): | |||
|
|||
def check_exception(self, ed_exc, pd_exc): | |||
"""Checks that either an exception was raised or not from both opensearch_py_ml and pandas""" | |||
assert (ed_exc is None) == (pd_exc is None) and type(ed_exc) == type(pd_exc) | |||
assert (ed_exc is None) == (pd_exc is None) and isinstance(ed_exc, type(pd_exc)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this for? Just curious to know why it's failing now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to look at code history, but also couldn't figure out why it's failing now. But it says in the error message that we can't do type(xxx) == type(xxx)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you check if it's passing all the tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It passes the integration test. So, doesn’t it mean that it passes all? Also, one strange thing is when I run nox -s lint on Mac and CloudDevDesktop, it doesn’t show the warning. It only failed on GitHub workflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -68,8 +68,8 @@ def clean_test_folder(TEST_FOLDER): | |||
|
|||
|
|||
def test_init(): | |||
assert type(ml_client._client) == OpenSearch | |||
assert type(ml_client._model_uploader) == ModelUploader | |||
assert isinstance(ml_client._client, OpenSearch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same error here.
readme_data = MarkDownFile.read_file(readme_file_path) | ||
start_str = f"# {self.model_id}" | ||
start = readme_data.find(start_str) | ||
if start == -1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean? Can we add a comment here? Let's add a detailed description about this parsing logic.
@@ -1006,6 +1008,36 @@ def set_up_accelerate_config( | |||
"Failed to open config file for ml common upload: " + file_path + "\n" | |||
) | |||
|
|||
def get_model_description_from_md_file(self, readme_file_path) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as this is an internal function, let's add _
in front of it.
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
model_name = self.model_id.split("/")[1] | ||
start_str = f"# {model_name}" | ||
start = readme_data.find(start_str) | ||
end = readme_data.find("## ", start) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you try with multiple sentence transformer models, how does this work. I feel like this is brittle. But I understand for the scraping we might need to depend on some placeholder. I just wanted to make sure if you checked with multiple different models?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I randomly tried with many different models and it works. I can also add that it has to start with the word "This is". All the models I have seen use this format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talked to Dhrubo offline and ended up not checking "This is" and made edits in the section ending part instead.
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
@@ -1035,7 +1035,7 @@ def _get_model_description_from_md_file(self, readme_file_path) -> str: | |||
model_name = self.model_id.split("/")[1] | |||
start_str = f"# {model_name}" | |||
start = readme_data.find(start_str) | |||
end = readme_data.find("## ", start) | |||
end = readme_data.find("# ", start + len(start_str)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach has a potential issue: if there are other occurrences of # within the description section that are not intended to be section headers, it might lead to incorrect results. For example, if the description contains a sentence like "This model has an accuracy of 90% # which is impressive", the code might incorrectly identify that # as the start of a new section.
# Find the index of the next line that starts with a '#'
end = readme_data.find('\n# ', start + len(start_str))
What do you think about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talked to Dhrubo offline and decided to use end = readme_data.find("\n#", start + len(start_str))
with no spacebar after #
. Note that clients can always overwrite description and we can do similar thing with the GitHub Actions workflow.
@@ -1006,6 +1008,55 @@ def set_up_accelerate_config( | |||
"Failed to open config file for ml common upload: " + file_path + "\n" | |||
) | |||
|
|||
def _get_model_description_from_md_file(self, readme_file_path) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fromt the comment below,
Get description of the model from README.md file
Seems we should rename the method name as _get_model_description_from_readme_file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
@@ -1006,6 +1008,55 @@ def set_up_accelerate_config( | |||
"Failed to open config file for ml common upload: " + file_path + "\n" | |||
) | |||
|
|||
def _get_model_description_from_md_file(self, readme_file_path) -> str: | |||
""" | |||
Get description of the model from README.md file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the README.md file
from huggingface? Can you add some example link?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's from huggingface. I can add example.
description = re.sub(r"\*", "", description) | ||
|
||
# Remove unnecessary part if exists(e.g. " For an introduction to ..." | ||
unnecessary_part = description.find(" For an introduction to") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how general this description method. Is the readme file always follow such pattern with " For an introduction to" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the readme pattern is unknown, or it's hard to predict the pattern. We'd better make the method flexible or more general for future pattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are couple models that have this "For an introduction to ... " line (https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1), so I remove it if it exists since Dhrubo manually removes it from the current model listing. It will do nothing if it doesn't exist though.
@@ -1126,6 +1193,9 @@ def make_model_config_json( | |||
}, | |||
} | |||
|
|||
if description is not None: | |||
model_config_content["description"] = description |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If description is None
, can we add some general description? For example "This is a sentence transformer model which can convert text into embedding (float vector)"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
* Add description field Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Restore notebook Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Debug test Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Resolve linting issues Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Update CHANGELOG.md Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Debug test_sentencetransformermodel_pytest.py Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Improve test coverage Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Edit test name Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Change CHANGELOG.md & Add comment to sentencetransformermodel.py Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Correct linting Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Improve add description Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Correct linting Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Loosen restriction Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Update sentencetransformermodel.py Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Change function name + Add comment + Add default description Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Debug Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> --------- Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> (cherry picked from commit 20435b1)
* updating notebook + bumping version (#199) * updating notebook + bumping version Signed-off-by: Dhrubo Saha <[email protected]> * addressing comments Signed-off-by: Dhrubo Saha <[email protected]> --------- Signed-off-by: Dhrubo Saha <[email protected]> (cherry picked from commit 1237aa6) * Add description field with make_model_config_json function (#203) * Add description field Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Restore notebook Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Debug test Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Resolve linting issues Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Update CHANGELOG.md Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Debug test_sentencetransformermodel_pytest.py Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Improve test coverage Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Edit test name Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Change CHANGELOG.md & Add comment to sentencetransformermodel.py Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Correct linting Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Improve add description Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Correct linting Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Loosen restriction Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Update sentencetransformermodel.py Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Change function name + Add comment + Add default description Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Debug Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> --------- Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> (cherry picked from commit 20435b1) --------- Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> Co-authored-by: Dhrubo Saha <[email protected]>
* updating notebook + bumping version (#199) * updating notebook + bumping version Signed-off-by: Dhrubo Saha <[email protected]> * addressing comments Signed-off-by: Dhrubo Saha <[email protected]> --------- Signed-off-by: Dhrubo Saha <[email protected]> (cherry picked from commit 1237aa6) * Add description field with make_model_config_json function (#203) * Add description field Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Restore notebook Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Debug test Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Resolve linting issues Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Update CHANGELOG.md Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Debug test_sentencetransformermodel_pytest.py Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Improve test coverage Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Edit test name Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Change CHANGELOG.md & Add comment to sentencetransformermodel.py Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Correct linting Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Improve add description Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Correct linting Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Loosen restriction Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Update sentencetransformermodel.py Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Change function name + Add comment + Add default description Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Debug Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> --------- Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> (cherry picked from commit 20435b1) * Backport Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Fix linting Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Update test_sentencetransformermodel_pytest.py Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> --------- Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> Co-authored-by: Dhrubo Saha <[email protected]>
* updating notebook + bumping version (#199) * updating notebook + bumping version Signed-off-by: Dhrubo Saha <[email protected]> * addressing comments Signed-off-by: Dhrubo Saha <[email protected]> --------- Signed-off-by: Dhrubo Saha <[email protected]> (cherry picked from commit 1237aa6) * Add description field with make_model_config_json function (#203) * Add description field Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Restore notebook Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Debug test Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Resolve linting issues Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Update CHANGELOG.md Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Debug test_sentencetransformermodel_pytest.py Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Improve test coverage Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Edit test name Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Change CHANGELOG.md & Add comment to sentencetransformermodel.py Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Correct linting Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Improve add description Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Correct linting Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Loosen restriction Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Update sentencetransformermodel.py Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Change function name + Add comment + Add default description Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Debug Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> --------- Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> (cherry picked from commit 20435b1) * Resolve merged conflict Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Correct linting Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> * Correct CHANGELOG.md Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> --------- Signed-off-by: Thanawan Atchariyachanvanit <[email protected]> Co-authored-by: Dhrubo Saha <[email protected]>
Description
Enable make_model_config_json function to add description field to model config file
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.