Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add description field with make_model_config_json function #203

Merged

Conversation

thanawan-atc
Copy link
Contributor

@thanawan-atc thanawan-atc commented Aug 1, 2023

Description

Enable make_model_config_json function to add description field to model config file

Check List

  • New functionality includes testing.
    • All tests pass
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
@codecov
Copy link

codecov bot commented Aug 1, 2023

Codecov Report

Merging #203 (f5f7552) into main (1237aa6) will increase coverage by 0.08%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #203      +/-   ##
==========================================
+ Coverage   91.15%   91.23%   +0.08%     
==========================================
  Files          37       37              
  Lines        4092     4130      +38     
==========================================
+ Hits         3730     3768      +38     
  Misses        362      362              
Files Changed Coverage Δ
...search_py_ml/ml_models/sentencetransformermodel.py 73.93% <100.00%> (+2.57%) ⬆️

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
CHANGELOG.md Outdated
@@ -29,6 +29,7 @@ Inspired from [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
- Make make_model_config_json function more concise by @thanawan-atc in ([#191](https://github.com/opensearch-project/opensearch-py-ml/pull/191))
- Enabled auto-truncation for any pretrained models by @Yerzhaisang in ([#192](https://github.com/opensearch-project/opensearch-py-ml/pull/192))
- Generalize make_model_config_json function by @thanawan-atc in ([#200](https://github.com/opensearch-project/opensearch-py-ml/pull/200))
- Enable make_model_config_json to add model description to model config file by @thanawan-atc in ([#203](https://github.com/opensearch-project/opensearch-py-ml/pull/203))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.1.0 is already released. Let's add another section for 1.2.0?

@@ -109,7 +109,7 @@ def check_values(self, oml_obj, pd_obj):

def check_exception(self, ed_exc, pd_exc):
"""Checks that either an exception was raised or not from both opensearch_py_ml and pandas"""
assert (ed_exc is None) == (pd_exc is None) and type(ed_exc) == type(pd_exc)
assert (ed_exc is None) == (pd_exc is None) and isinstance(ed_exc, type(pd_exc))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this for? Just curious to know why it's failing now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to look at code history, but also couldn't figure out why it's failing now. But it says in the error message that we can't do type(xxx) == type(xxx).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you check if it's passing all the tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It passes the integration test. So, doesn’t it mean that it passes all? Also, one strange thing is when I run nox -s lint on Mac and CloudDevDesktop, it doesn’t show the warning. It only failed on GitHub workflow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here’s the error
IMG_2158

@@ -68,8 +68,8 @@ def clean_test_folder(TEST_FOLDER):


def test_init():
assert type(ml_client._client) == OpenSearch
assert type(ml_client._model_uploader) == ModelUploader
assert isinstance(ml_client._client, OpenSearch)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same error here.

readme_data = MarkDownFile.read_file(readme_file_path)
start_str = f"# {self.model_id}"
start = readme_data.find(start_str)
if start == -1:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean? Can we add a comment here? Let's add a detailed description about this parsing logic.

@@ -1006,6 +1008,36 @@ def set_up_accelerate_config(
"Failed to open config file for ml common upload: " + file_path + "\n"
)

def get_model_description_from_md_file(self, readme_file_path) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as this is an internal function, let's add _ in front of it.

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
model_name = self.model_id.split("/")[1]
start_str = f"# {model_name}"
start = readme_data.find(start_str)
end = readme_data.find("## ", start)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try with multiple sentence transformer models, how does this work. I feel like this is brittle. But I understand for the scraping we might need to depend on some placeholder. I just wanted to make sure if you checked with multiple different models?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I randomly tried with many different models and it works. I can also add that it has to start with the word "This is". All the models I have seen use this format.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked to Dhrubo offline and ended up not checking "This is" and made edits in the section ending part instead.

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
@@ -1035,7 +1035,7 @@ def _get_model_description_from_md_file(self, readme_file_path) -> str:
model_name = self.model_id.split("/")[1]
start_str = f"# {model_name}"
start = readme_data.find(start_str)
end = readme_data.find("## ", start)
end = readme_data.find("# ", start + len(start_str))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach has a potential issue: if there are other occurrences of # within the description section that are not intended to be section headers, it might lead to incorrect results. For example, if the description contains a sentence like "This model has an accuracy of 90% # which is impressive", the code might incorrectly identify that # as the start of a new section.

# Find the index of the next line that starts with a '#'
end = readme_data.find('\n# ', start + len(start_str))

What do you think about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked to Dhrubo offline and decided to use end = readme_data.find("\n#", start + len(start_str)) with no spacebar after #. Note that clients can always overwrite description and we can do similar thing with the GitHub Actions workflow.

@@ -1006,6 +1008,55 @@ def set_up_accelerate_config(
"Failed to open config file for ml common upload: " + file_path + "\n"
)

def _get_model_description_from_md_file(self, readme_file_path) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fromt the comment below,

Get description of the model from README.md file

Seems we should rename the method name as _get_model_description_from_readme_file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@@ -1006,6 +1008,55 @@ def set_up_accelerate_config(
"Failed to open config file for ml common upload: " + file_path + "\n"
)

def _get_model_description_from_md_file(self, readme_file_path) -> str:
"""
Get description of the model from README.md file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the README.md file from huggingface? Can you add some example link?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's from huggingface. I can add example.

description = re.sub(r"\*", "", description)

# Remove unnecessary part if exists(e.g. " For an introduction to ..."
unnecessary_part = description.find(" For an introduction to")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how general this description method. Is the readme file always follow such pattern with " For an introduction to" ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the readme pattern is unknown, or it's hard to predict the pattern. We'd better make the method flexible or more general for future pattern

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are couple models that have this "For an introduction to ... " line (https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1), so I remove it if it exists since Dhrubo manually removes it from the current model listing. It will do nothing if it doesn't exist though.

@@ -1126,6 +1193,9 @@ def make_model_config_json(
},
}

if description is not None:
model_config_content["description"] = description
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If description is None, can we add some general description? For example "This is a sentence transformer model which can convert text into embedding (float vector)"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Copy link
Collaborator

@ylwu-amzn ylwu-amzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@dhrubo-os dhrubo-os merged commit 20435b1 into opensearch-project:main Aug 9, 2023
14 checks passed
thanawan-atc referenced this pull request in thanawan-atc/opensearch-py-ml Aug 10, 2023
* Add description field

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Restore notebook

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Debug test

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Resolve linting issues

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Update CHANGELOG.md

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Debug test_sentencetransformermodel_pytest.py

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Improve test coverage

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Edit test name

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Change CHANGELOG.md & Add comment to sentencetransformermodel.py

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Correct linting

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Improve add description

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Correct linting

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Loosen restriction

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Update sentencetransformermodel.py

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Change function name + Add comment + Add default description

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Debug

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

---------

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
(cherry picked from commit 20435b1)
@thanawan-atc thanawan-atc mentioned this pull request Aug 10, 2023
1 task
dhrubo-os added a commit that referenced this pull request Aug 10, 2023
* updating notebook + bumping version (#199)

* updating notebook + bumping version

Signed-off-by: Dhrubo Saha <[email protected]>

* addressing comments

Signed-off-by: Dhrubo Saha <[email protected]>

---------

Signed-off-by: Dhrubo Saha <[email protected]>
(cherry picked from commit 1237aa6)

* Add description field with make_model_config_json function (#203)

* Add description field

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Restore notebook

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Debug test

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Resolve linting issues

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Update CHANGELOG.md

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Debug test_sentencetransformermodel_pytest.py

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Improve test coverage

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Edit test name

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Change CHANGELOG.md & Add comment to sentencetransformermodel.py

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Correct linting

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Improve add description

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Correct linting

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Loosen restriction

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Update sentencetransformermodel.py

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Change function name + Add comment + Add default description

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Debug

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

---------

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
(cherry picked from commit 20435b1)

---------

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Co-authored-by: Dhrubo Saha <[email protected]>
dhrubo-os added a commit that referenced this pull request Aug 22, 2023
* updating notebook + bumping version (#199)

* updating notebook + bumping version

Signed-off-by: Dhrubo Saha <[email protected]>

* addressing comments

Signed-off-by: Dhrubo Saha <[email protected]>

---------

Signed-off-by: Dhrubo Saha <[email protected]>
(cherry picked from commit 1237aa6)

* Add description field with make_model_config_json function (#203)

* Add description field

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Restore notebook

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Debug test

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Resolve linting issues

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Update CHANGELOG.md

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Debug test_sentencetransformermodel_pytest.py

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Improve test coverage

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Edit test name

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Change CHANGELOG.md & Add comment to sentencetransformermodel.py

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Correct linting

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Improve add description

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Correct linting

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Loosen restriction

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Update sentencetransformermodel.py

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Change function name + Add comment + Add default description

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Debug

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

---------

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
(cherry picked from commit 20435b1)

* Backport

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Fix linting

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Update test_sentencetransformermodel_pytest.py

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

---------

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Co-authored-by: Dhrubo Saha <[email protected]>
dhrubo-os added a commit that referenced this pull request Aug 23, 2023
* updating notebook + bumping version (#199)

* updating notebook + bumping version

Signed-off-by: Dhrubo Saha <[email protected]>

* addressing comments

Signed-off-by: Dhrubo Saha <[email protected]>

---------

Signed-off-by: Dhrubo Saha <[email protected]>
(cherry picked from commit 1237aa6)

* Add description field with make_model_config_json function (#203)

* Add description field

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Restore notebook

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Debug test

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Resolve linting issues

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Update CHANGELOG.md

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Debug test_sentencetransformermodel_pytest.py

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Improve test coverage

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Edit test name

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Change CHANGELOG.md & Add comment to sentencetransformermodel.py

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Correct linting

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Improve add description

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Correct linting

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Loosen restriction

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Update sentencetransformermodel.py

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Change function name + Add comment + Add default description

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Debug

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

---------

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
(cherry picked from commit 20435b1)

* Resolve merged conflict

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Correct linting

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

* Correct CHANGELOG.md

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>

---------

Signed-off-by: Thanawan Atchariyachanvanit <[email protected]>
Co-authored-by: Dhrubo Saha <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants