Clinton/ed 448/sdk html and text #820

clinton-encord · 2024-12-11T23:55:20Z

Introduction and Explanation

Add object and classification support for html-text

Some things left to do:

Add object and classification support for plain-text (although this may already be done due to html text being done, just haven't checked or added tests for it)
When calling ClassificationInstance.set_for_frames on a non-geometric label row (i.e. audio or text), validate the ranges
- Audio range should be over entire file
- Text range should be [0, 0]

JIRA

Link ticket(s)

Documentation

There should be enough internal documentation for a product owner to write customer-facing documentation or a separate PR linked if writing the customer documentation directly. Link all that are relevant below.

Internal: notion link
Customer docs PR: link
OpenAPI/SDK
- Generated docs: link to example if possible
- Command to generate: here

Tests

Make a quick statement and post any relevant links of CI / test results. If the testing infrastructure isn’t yet in-place, note that instead.

What are the critical unit tests?
Explain the Integration Tests such that it’s clear Correctness is satisfied. Link to test results if possible.

Known issues

If there are any known issues with the solution, make a statement about what they are and why they are Ok to leave unsolved for now. Make tickets for the known issues linked to the original ticket linked above

github-actions · 2024-12-11T23:56:13Z

Unit test report (Pydantic 1.x)

204 tests 204 ✅ 5s ⏱️
1 suites 0 💤
1 files 0 ❌

Results for commit f14e19b.

♻️ This comment has been updated with latest results.

github-actions · 2024-12-11T23:56:15Z

Unit test report ((Pydantic 2.x)

204 tests 204 ✅ 6s ⏱️
1 suites 0 💤
1 files 0 ❌

Results for commit f14e19b.

♻️ This comment has been updated with latest results.

clinton-encord · 2024-12-11T23:57:57Z

encord/objects/ontology_labels_impl.py

-            file_type=label_row_dict.get("file_type", None),
+            file_type=file_type,


For the longest time, this has just been None lolol.

I believe that we should be able to get the file_type on first requesting the label_row and instead of having to be included in the blob when we initialise labels:
i.e: The above method is called in lr.initialise_labels().
Probably worth checking whether we get the file_type before and we just don't provide it in the label blob and then here it gets overridden by the empty field in the actual label blob. Pretty confident that this is the case, checked. Let me know if this comment is obtuse cause this is a bad change somewhat.

Yes your comment is correct. I'm unsure as to what you want me to do though.
Are you trying to suggest that we don't overwrite it? And keep what we have in file_type before calling initialise_labels?

clinton-encord · 2024-12-12T00:00:03Z

encord/objects/ontology_object_instance.py

+        if range_html is not None:
+            if self._ontology_object.shape != Shape.TEXT:
+                raise LabelRowError(
+                    "Setting range_html of the object instance is only allowed for objects with the 'text' shape"
+                )
+            elif self._parent is not None and self._parent.file_type != "text/html":
+                raise LabelRowError("Cannot add range_html to a non-html text file")
+            elif len(self.range_list) > 0:
+                raise LabelRowError("Cannot add range_html to an object instance that has a non-html range")


An important set of checks to ensure range_html is used properly on an ObjectInstance.

clinton-encord · 2024-12-12T00:01:12Z

encord/objects/ontology_object_instance.py

+            if len(range_html) > 0:
+                self._range_html = range_html
+            else:
+                self._range_html = None


This allows a user to REMOVE the range_html, by setting it to an empty array.

github-actions · 2024-12-12T00:27:53Z

SDK integration test report

279 tests ±0 272 ✅ +48 18m 44s ⏱️ - 2h 15m 4s
1 suites ±0 7 💤 ± 0
1 files ±0 0 ❌ - 47

Results for commit f573bf6. ± Comparison against base commit e6715e0.

♻️ This comment has been updated with latest results.

Jim-Encord

Lots of comments. Please take care that you aren't breaking historical backwards compatiblity around the file type changes.

encord/objects/coordinates.py

Jim-Encord · 2024-12-12T10:51:56Z

encord/objects/ontology_labels_impl.py

+                annotation = obj.get_annotations()[0]
+                ret[obj.object_hash]["range_html"] = [x.to_dict() for x in obj.range_html]
+                ret[obj.object_hash]["range"] = []
+                ret[obj.object_hash]["createdBy"] = annotation.created_by


style:
Why so much C&P. Validation of non-type specific objects should be in a unified way in a unified place. i.e: Do createdBy stuff first and then use if statements to validate other fields

Gotcha on C&P.

I don't understand the validation stuff.

Jim-Encord · 2024-12-12T10:52:42Z

encord/objects/ontology_labels_impl.py

+        elif self.data_type == DataType.PLAIN_TEXT or self.data_type == DataType.PDF:
+            pass
+        elif (
+            self.data_type == DataType.IMAGE


style:
Could use your introduced geometric_types

Doesn't work with exhaustive guard.

Jim-Encord · 2024-12-12T10:55:14Z

encord/objects/ontology_labels_impl.py


        ret["labels"] = self._to_encord_labels(frame_level_data)

-        if self._label_row_read_only_data.duration is not None:
+        if self._label_row_read_only_data.duration is not None and self.data_type != DataType.PLAIN_TEXT:


Why do you need this condition?
Can we not ensure that for data_type PLAIN_TEXT, that we don't provide a duration.

It's not entirely clear to me why we need the original branching condition given that it seems fine to propagate the None into the ret dictionary. Maybe this is just someone else's ugly serialisation code.

Yea you're right about the original branching condition. Ideally, we only have to check the data_type before adding properties to the data_unit, and not have to check whether the value from _label_row_read_only_data is None or not.

Maybe there was a historical reason for this? If no discernible reason, will clean up this function (albeit probably in a separate PR to reduce surface area)

@sergei-encord any thoughts on this?

encord/objects/ontology_labels_impl.py

Jim-Encord · 2024-12-12T10:59:58Z

encord/objects/ontology_labels_impl.py

-            file_type=label_row_dict.get("file_type", None),
+            file_type=file_type,


I believe that we should be able to get the file_type on first requesting the label_row and instead of having to be included in the blob when we initialise labels:
i.e: The above method is called in lr.initialise_labels().
Probably worth checking whether we get the file_type before and we just don't provide it in the label blob and then here it gets overridden by the empty field in the actual label blob. Pretty confident that this is the case, checked. Let me know if this comment is obtuse cause this is a bad change somewhat.

encord/objects/ontology_object_instance.py

Jim-Encord · 2024-12-12T11:08:43Z

encord/objects/ontology_object_instance.py

+            if "createdAt" in d and d["createdAt"] is not None:
+                created_at = parse_datetime(d["createdAt"])
+            else:
+                created_at = datetime.now()


😮
This doesn't make sense to me.
On creation yes, we should be providing a created_at timestamp. But I strongly disagree that on fail find, we should fallback to providing now as the created_at. I'd rather have None or the Unix Epoch time than now.

If you are using this branch as a mechanism somehow of updating created_at, if this is the first time creating labels, then I would reconsider refactoring such that it doesn't have this risky side effect.

Ah that is very correct! I put this in at the start just to get it working. CreatedAt should always be there, no defaults required.

Jim-Encord · 2024-12-12T11:12:56Z

tests/objects/test_label_structure.py

+    label_row.add_classification_instance(classification_instance)
+    assert len(label_row.get_classification_instances()) == 1
+    classification_instance.set_for_frames(Range(start=2000, end=2499))
+    range_list = classification_instance.range_list


may just be personal:
This shouldn't be an ordered list imo, unless we are sorting it on requesting to ensure that we sort by the start index (If you are doing that then great!). The actual notion of classifications as you might view them in the label editor has the chronological ordering or here the lexicographic but if we are guaranteeing a stable order, it really needs to be derived from that.

Sorry, I don't quite understand this. ChatGPT suggests that:

They are questioning whether range_list is inherently or intentionally ordered. If it's being treated as an ordered list, they are asking whether you are explicitly sorting it by a meaningful criterion (e.g., the start index of ranges).

If that is the case, then yes, range_list is being sorted by start index of ranges. See the add_range function in encord/common/range_manager.py.

Cool, yeah I was asking whether it has a stable ordering

encord/objects/classification_instance.py

encord/objects/html_node.py

arthur-encord · 2024-12-27T11:08:44Z

encord/objects/html_node.py

+        offset (int): The offset of the content from the xpath
+    """
+
+    node: str


maybe we should call this xpath? ATM node is already in the class name HtmlNode, so feels like the field name can be more descriptive

In the actual range_html in the objects_index, we call it "node" though.

Worried it might be confusing to have it called "node" in the exports and in our DB, but then have it called "xpath" here in this one place.

encord/objects/html_node.py

tests/objects/test_label_structure.py

arthur-encord · 2025-01-02T09:17:28Z

encord/objects/coordinates.py

+        range (Ranges): Ranges of chars for simple text files
+    """
+
+    range_html: Optional[List[HtmlRange]] = None


should be plural? ranges and ranges_html (and same for Audio above? could be fine to change since no one is using it yet I assume)

Yeap agreed, but in the label format, we just call it range though. So using the same here for consistency.

arthur-encord · 2025-01-02T09:29:31Z

encord/objects/classification_instance.py

@@ -31,7 +31,7 @@

 from encord.common.range_manager import RangeManager


Would it be reasonable to expect label_row.add_classification_instance(cls_instance) to work for text/html/audio without having to call cls_instance.set_for_frames, given that we only support global classifications for these?

arthur-encord · 2025-01-02T09:43:45Z

encord/objects/html_node.py

+        return f"{self.start.node}-{self.start.offset}-{self.end.node}-{self.end.offset}"
+
+    @classmethod
+    def from_dict(cls, d: dict):


maybe we can use this method to be able to construct TextCoordinates.range_html from a dict rather than having to import HtmlRange & HtmlNode?
No big deal though, it's fine as is

arthur-encord · 2025-01-02T09:59:07Z

encord/objects/coordinates.py

+    range_html: Optional[List[HtmlRange]] = None
+    range: Optional[Ranges] = None
+
+    def __post_init__(self):


Could separate TextCoordinates and HtmlCoordinates to not have to do the logic below

Yes makes sense!

arthur-encord · 2025-01-02T10:00:24Z

encord/objects/coordinates.py

+    """
+
+    range_html: Optional[List[HtmlRange]] = None
+    range: Optional[Ranges] = None


could be nice for the text ranges to be able to be a list of int as well?

encord/objects/classification_instance.py

…on text files - Only allow classifications on frame=0 for non-geometric files

clinton-encord added 26 commits June 19, 2024 15:50

Bump version number

8047030

Merge branch 'master' of github.com:encord-team/encord-client-python

53fa0d8

Merge branch 'master' of github.com:encord-team/encord-client-python

6debbe8

Merge branch 'master' of github.com:encord-team/encord-client-python

d726863

Merge branch 'master' of github.com:encord-team/encord-client-python

4830734

Merge branch 'master' of github.com:encord-team/encord-client-python

a9fc54f

Merge branch 'master' of github.com:encord-team/encord-client-python

e0ba8d1

Merge branch 'master' of github.com:encord-team/encord-client-python

8690215

Merge branch 'master' of github.com:encord-team/encord-client-python

b3047d2

Merge branch 'master' of github.com:encord-team/encord-client-python

c5d32c4

Merge branch 'master' of github.com:encord-team/encord-client-python

b27a0f0

Merge branch 'master' of github.com:encord-team/encord-client-python

c7826e7

Merge branch 'master' of github.com:encord-team/encord-client-python

3a9aa3a

Merge branch 'master' of github.com:encord-team/encord-client-python

e86bdc0

Merge branch 'master' of github.com:encord-team/encord-client-python

100ed74

Merge branch 'master' of github.com:encord-team/encord-client-python

913f4cc

Merge branch 'master' of github.com:encord-team/encord-client-python

a46e751

Merge branch 'master' of github.com:encord-team/encord-client-python

362f166

Merge branch 'master' of github.com:encord-team/encord-client-python

510b9aa

Merge branch 'master' of github.com:encord-team/encord-client-python

973bc1a

Merge branch 'master' of github.com:encord-team/encord-client-python

3143c2b

Merge branch 'master' of github.com:encord-team/encord-client-python

c349686

Merge branch 'master' of github.com:encord-team/encord-client-python

93c68e2

Merge branch 'master' of github.com:encord-team/encord-client-python

97583a2

Merge branch 'master' of github.com:encord-team/encord-client-python

ccfeffd

Merge branch 'master' of github.com:encord-team/encord-client-python

93e3754

clinton-encord commented Dec 11, 2024

View reviewed changes

clinton-encord commented Dec 12, 2024

View reviewed changes

Jim-Encord reviewed Dec 12, 2024

View reviewed changes

arthur-encord reviewed Dec 27, 2024

View reviewed changes

clinton-encord force-pushed the clinton/ed-448/sdk-html-and-text branch from fc26087 to 92ec53a Compare December 30, 2024 09:15

Merge branch 'master' of github.com:encord-team/encord-client-python

a358161

clinton-encord force-pushed the clinton/ed-448/sdk-html-and-text branch from 6fac349 to 1084f6f Compare December 31, 2024 09:27

arthur-encord force-pushed the clinton/ed-448/sdk-html-and-text branch from 51454cc to b6f0245 Compare January 2, 2025 09:01

arthur-encord approved these changes Jan 2, 2025

View reviewed changes

clinton-encord force-pushed the clinton/ed-448/sdk-html-and-text branch from b6f0245 to 5c18d7a Compare January 2, 2025 14:20

Laverne-Encord reviewed Jan 3, 2025

View reviewed changes

encord/objects/classification_instance.py Outdated Show resolved Hide resolved

Laverne-Encord reviewed Jan 3, 2025

View reviewed changes

encord/objects/classification_instance.py Outdated Show resolved Hide resolved

Laverne-Encord reviewed Jan 3, 2025

View reviewed changes

encord/objects/classification_instance.py Outdated Show resolved Hide resolved

clinton-encord added 3 commits January 6, 2025 13:27

Merge branch 'master' of github.com:encord-team/encord-client-python

3a10a64

SDK and HTML PR

f83755f

- Use HtmlCoordinates and TextCoordinates to hold location of labels …

3ee74fc

…on text files - Only allow classifications on frame=0 for non-geometric files

clinton-encord force-pushed the clinton/ed-448/sdk-html-and-text branch from dbcb6c3 to 3ee74fc Compare January 6, 2025 05:27

clinton-encord added 4 commits January 6, 2025 14:15

Added test for serialising plain text labels

f573bf6

Update comment

ac38461

Use range=[] for non-geometric data

df37f4c

Fix ruff formatting

f14e19b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clinton/ed 448/sdk html and text #820

Clinton/ed 448/sdk html and text #820

clinton-encord commented Dec 11, 2024

github-actions bot commented Dec 11, 2024 •

edited

Loading

github-actions bot commented Dec 11, 2024 •

edited

Loading

clinton-encord Dec 11, 2024

Jim-Encord Dec 12, 2024

clinton-encord Dec 18, 2024

clinton-encord Dec 12, 2024

clinton-encord Dec 12, 2024

github-actions bot commented Dec 12, 2024 •

edited

Loading

Jim-Encord left a comment

Jim-Encord Dec 12, 2024

clinton-encord Dec 18, 2024

Jim-Encord Dec 12, 2024

clinton-encord Dec 18, 2024

Jim-Encord Dec 12, 2024

clinton-encord Dec 18, 2024 •

edited

Loading

Jim-Encord Dec 12, 2024

Jim-Encord Dec 12, 2024

clinton-encord Dec 18, 2024

Jim-Encord Dec 12, 2024

clinton-encord Dec 18, 2024

Jim-Encord Dec 18, 2024

arthur-encord Dec 27, 2024

clinton-encord Dec 30, 2024

arthur-encord Jan 2, 2025

clinton-encord Jan 2, 2025

arthur-encord Jan 2, 2025

arthur-encord Jan 2, 2025

arthur-encord Jan 2, 2025

clinton-encord Jan 2, 2025

arthur-encord Jan 2, 2025

		file_type=label_row_dict.get("file_type", None),
		file_type=file_type,

		@@ -31,7 +31,7 @@

		from encord.common.range_manager import RangeManager

Clinton/ed 448/sdk html and text #820

Are you sure you want to change the base?

Clinton/ed 448/sdk html and text #820

Conversation

clinton-encord commented Dec 11, 2024

Introduction and Explanation

JIRA

Documentation

Tests

Known issues

github-actions bot commented Dec 11, 2024 • edited Loading

Unit test report (Pydantic 1.x)

github-actions bot commented Dec 11, 2024 • edited Loading

Unit test report ((Pydantic 2.x)

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Dec 12, 2024 • edited Loading

SDK integration test report

Jim-Encord left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clinton-encord Dec 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Dec 11, 2024 •

edited

Loading

github-actions bot commented Dec 11, 2024 •

edited

Loading

github-actions bot commented Dec 12, 2024 •

edited

Loading

clinton-encord Dec 18, 2024 •

edited

Loading