Rework mime type white list #2198

Tschuppi81 · 2025-11-10T15:20:08Z

Org: Ensure mime type validator on file upload fields in form code

TYPE: Feature
LINK: ogc-2738

linear · 2025-11-10T15:20:13Z

codecov · 2025-11-10T15:27:31Z

❌ 3 Tests Failed:

Tests completed	Failed	Passed	Skipped
2360	3	2357	14

View the top 3 failed test(s) by shortest run time

tests/onegov/form/test_fields.py::test_upload_multiple_field

Stack Traces | 0.016s run time

def test_upload_multiple_field() -> None:
        def create_field(
            validators: Validators[FormT, Self] | None = None  # type:ignore[misc]
        ) -> tuple[Form, UploadMultipleField]:
            form = Form()
            field = UploadMultipleField(validators=validators)
            field = field.bind(form, 'uploads')  # type: ignore[attr-defined]
            return form, field
    
        # failing mime type validator
        form, field = create_field(validators=[ExpectedExtensions(['.json'])])
        file1 = create_file('text/plain', 'baz.txt', b'baz')
        field.process(DummyPostData({'uploads': [file1]}))
        assert not field.validate(form)
        assert len(field.data) == 1
        assert field.data[0]['filename'] == 'baz.txt'
        assert field.data[0]['mimetype'] == 'text/plain'
        validator = find_validator(field, WhitelistedMimeType)
        assert validator
>       assert validator.whitelist == {'application/json'}  # type:ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E       AssertionError: assert {'application...s-excel', ...} == {'application/json'}
E         
E         Extra items in the left set:
E         'application/vnd.openxmlformats-officedocument.presentationml.presentation'
E         'image/xpm'
E         'text/csv'
E         'application/rtf'
E         'image/bmp'
E         'application/pdf'
E         'image/mpo'
E         'image/x-portable-anymap'
E         'text/plain'
E         'image/tiff'
E         'audio/mpeg'
E         'image/x-pcx'
E         'video/ogg'
E         'video/mpeg'
E         'application/CDFV2-unknown'
E         'application/msword'
E         'application/vnd.ms-excel'
E         'image/png'
E         'image/x-tga'
E         'image/avif'
E         'image/xbm'
E         'image/x-MS-bmp'
E         'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
E         'audio/wav'
E         'video/mp4'
E         'application/CDFV2'
E         'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
E         'image/x-xcf'
E         'image/svg+xml'
E         'image/icns'
E         'image/jp2'
E         'audio/mp4'
E         'image/gif'
E         'application/zip'
E         'image/jpeg'
E         'audio/webm'
E         'application/xml'
E         'image/x-icon'
E         'application/x-ole-storage'
E         'image/sgi'
E         'image/palm'
E         'image/vnd.adobe.photoshop'
E         'video/webm'
E         'video/x-msvideo'
E         'image/webp'
E         'video/quicktime'
E         Extra items in the right set:
E         'application/json'
E         
E         Full diff:
E           {
E         +     'application/CDFV2',
E         +     'application/CDFV2-unknown',
E         -     'application/json',
E         ?                  ^  ^
E         +     'application/msword',
E         ?                  ^ + ^^
E         +     'application/pdf',
E         +     'application/rtf',
E         +     'application/vnd.ms-excel',
E         +     'application/vnd.openxmlformats-officedocument.presentationml.presentation',
E         +     'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
E         +     'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
E         +     'application/x-ole-storage',
E         +     'application/xml',
E         +     'application/zip',
E         +     'audio/mp4',
E         +     'audio/mpeg',
E         +     'audio/wav',
E         +     'audio/webm',
E         +     'image/avif',
E         +     'image/bmp',
E         +     'image/gif',
E         +     'image/icns',
E         +     'image/jp2',
E         +     'image/jpeg',
E         +     'image/mpo',
E         +     'image/palm',
E         +     'image/png',
E         +     'image/sgi',
E         +     'image/svg+xml',
E         +     'image/tiff',
E         +     'image/vnd.adobe.photoshop',
E         +     'image/webp',
E         +     'image/x-MS-bmp',
E         +     'image/x-icon',
E         +     'image/x-pcx',
E         +     'image/x-portable-anymap',
E         +     'image/x-tga',
E         +     'image/x-xcf',
E         +     'image/xbm',
E         +     'image/xpm',
E         +     'text/csv',
E         +     'text/plain',
E         +     'video/mp4',
E         +     'video/mpeg',
E         +     'video/ogg',
E         +     'video/quicktime',
E         +     'video/webm',
E         +     'video/x-msvideo',
E           }

.../onegov/form/test_fields.py:300: AssertionError

tests/onegov/form/test_parser.py::test_parse_multiplefileinput

Stack Traces | 0.026s run time

def test_parse_multiplefileinput() -> None:
        form = parse_form("Files = *.pdf|*.doc (multiple)")()
    
        assert form['files'].label.text == 'Files'
        assert isinstance(form['files'], FileField)
        assert form['files'].widget.multiple is True  # type: ignore[attr-defined]
    
        # verify attached mime type validator
        validator = find_validator(form['files'], WhitelistedMimeType)
        assert validator
>       assert validator.whitelist == {  # type:ignore[attr-defined]
            'application/msword', 'application/pdf'
        }
E       AssertionError: assert {'application...s-excel', ...} == {'application...lication/pdf'}
E         
E         Extra items in the left set:
E         'video/ogg'
E         'video/x-msvideo'
E         'image/jpeg'
E         'image/sgi'
E         'image/jp2'
E         'image/tiff'
E         'image/mpo'
E         'application/rtf'
E         'application/xml'
E         'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
E         'text/csv'
E         'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
E         'image/svg+xml'
E         'image/palm'
E         'image/gif'
E         'image/x-tga'
E         'video/webm'
E         'image/x-icon'
E         'image/webp'
E         'image/x-MS-bmp'
E         'image/x-pcx'
E         'audio/mpeg'
E         'application/vnd.ms-excel'
E         'video/mpeg'
E         'image/x-portable-anymap'
E         'application/vnd.openxmlformats-officedocument.presentationml.presentation'
E         'image/vnd.adobe.photoshop'
E         'application/zip'
E         'application/CDFV2'
E         'image/icns'
E         'audio/webm'
E         'video/mp4'
E         'application/CDFV2-unknown'
E         'image/xpm'
E         'image/avif'
E         'image/xbm'
E         'image/x-xcf'
E         'audio/mp4'
E         'video/quicktime'
E         'audio/wav'
E         'text/plain'
E         'image/bmp'
E         'application/x-ole-storage'
E         'image/png'
E         
E         Full diff:
E           {
E         +     'application/CDFV2',
E         +     'application/CDFV2-unknown',
E               'application/msword',
E               'application/pdf',
E         +     'application/rtf',
E         +     'application/vnd.ms-excel',
E         +     'application/vnd.openxmlformats-officedocument.presentationml.presentation',
E         +     'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
E         +     'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
E         +     'application/x-ole-storage',
E         +     'application/xml',
E         +     'application/zip',
E         +     'audio/mp4',
E         +     'audio/mpeg',
E         +     'audio/wav',
E         +     'audio/webm',
E         +     'image/avif',
E         +     'image/bmp',
E         +     'image/gif',
E         +     'image/icns',
E         +     'image/jp2',
E         +     'image/jpeg',
E         +     'image/mpo',
E         +     'image/palm',
E         +     'image/png',
E         +     'image/sgi',
E         +     'image/svg+xml',
E         +     'image/tiff',
E         +     'image/vnd.adobe.photoshop',
E         +     'image/webp',
E         +     'image/x-MS-bmp',
E         +     'image/x-icon',
E         +     'image/x-pcx',
E         +     'image/x-portable-anymap',
E         +     'image/x-tga',
E         +     'image/x-xcf',
E         +     'image/xbm',
E         +     'image/xpm',
E         +     'text/csv',
E         +     'text/plain',
E         +     'video/mp4',
E         +     'video/mpeg',
E         +     'video/ogg',
E         +     'video/quicktime',
E         +     'video/webm',
E         +     'video/x-msvideo',
E           }

.../onegov/form/test_parser.py:386: AssertionError

tests/onegov/pas/test_views.py::test_view_upload_json

Stack Traces | 1.77s run time

client = <tests.shared.client.Client object at 0x7fd6daa9f250>
people_json = {'count': 2, 'next': None, 'previous': None, 'results': [{'created': '2024-12-23T16:44:12.056040+01:00', 'firstName': ...13799+01:00', 'firstName': 'Heinz', 'fullName': 'Achermann Heinz', 'id': '56603980-d390-4cf8-bc2b-5335affe22be', ...}]}
organization_json = {'count': 2, 'next': None, 'previous': None, 'results': [{'created': '2024-12-23T16:44:24.920938+01:00', 'description'...12-23T16:44:24.928269+01:00', 'description': '', 'id': 'f124b1fe-99b8-4bc2-b40d-bc8dec4997bc', 'isActive': True, ...}]}
memberships_json = {'count': 2, 'next': None, 'previous': None, 'results': [{'address': False, 'department': '', 'description': '', 'email': False, ...}, {'address': False, 'department': '', 'description': '', 'email': False, ...}]}

    def test_view_upload_json(
        client: Client[TestPasApp],
        people_json: dict[str, Any],
        organization_json: dict[str, Any],
        memberships_json: dict[str, Any]
    ) -> None:
        """ Test successful import of all data using fixtures.
    
        *1. Understanding the Data and Models**
    
        **people.json**: Contains individual person data (Parliamentarians). Key
             fields are firstName, officialName, primaryEmail, tags, title, id.
             This maps to the Parliamentarian model.
    
        **organizations.json**:
            The organizationTypeTitle dictates the type of organization.
            - "Kommission":  Maps to Commission model.
            - "Kantonsrat":  This is a special case. It's not a Commission. It
            represents the Parliament itself. We link this as ParliamentarianRole
            directly on the Parliamentarian model with role='member' and associated
            with the Kantonsrat organization.
            - "Fraktion":  Maps to ParliamentaryGroup.
            - "Sonstige": Could be various types. Let's see how these are intended
              to be modeled. We need more clarity on how "Sonstige" is categorized.
    
        **memberships.json**: Connects person and organization.
            It defines the role within that organization, start, end dates.
            The nested person and organization blocks are crucial for establishing
            relationships.
        """
    
        client.login_admin()
    
        # --- Reference: Previously used function for local file testing ---
        # import os
        # def yield_paths():
        #     """ Yields paths in this order: organization, membership, people """
        #     base_path = '.../local/json/files' # Adjust this path
        #     yield [base_path + '/organization.json']
        #     membership_count = 7 # Adjust as needed
        #     membership_paths = [
        #         f'{base_path}/memberships_{i}.json'
        #         for i in range(1, membership_count + 1)
        #     ]
        #     assert all(
        #         os.path.exists(path) for path in membership_paths
        #     ), "Some membership paths don't exist"
        #     yield membership_paths
        #
        #     # Yield people paths after validating existence
        #     # Adjust as needed:
        #     people_paths = [f'{base_path}/people_{i}.json' for i in range(1, 3)]
        #     assert all(
        #         os.path.exists(path) for path in people_paths
        #     ), "Some people paths don't exist"
        #     yield people_paths
    
    
        # def upload_file(filepath):
        #     with open(filepath, 'rb') as f:
        #         content = f.read()
        #         return Upload(
        #             os.path.basename(filepath),
        #             content,
        #             'application/json'
        #         )
    
        # # Get all paths
        # paths_generator = yield_paths()
    
        # org_paths = next(paths_generator)
        # page.form['organizations_source'] = [
        #     upload_file(path) for path in org_paths
        # ]
    
        # membership_paths = next(paths_generator)
        # page.form['memberships_source'] = [
        #     upload_file(path) for path in membership_paths
        # ]
    
        # people_paths = next(paths_generator)
        # page.form['people_source'] = [
        #     upload_file(path) for path in people_paths
        # ]
    
    
        # --- End Reference ---
    
        def create_upload_object(
            filename: str, data: dict[str, list[Any]]
        ) -> Upload:
            """Creates a webtest Upload object from a dictionary."""
            content_bytes = json.dumps(data).encode('utf-8')
            return Upload(
                filename,
                content_bytes,
                'application/json'
            )
    
        def do_upload_procedure(
            org_data: dict[str, Any],
            member_data: dict[str, Any],
            ppl_data: dict[str, Any]
        ) -> ExtendedResponse:
            """Uploads data using Upload objects created from fixtures."""
            page = client.get('/pas-import')
    
            # Create Upload objects from the fixture data
            # We wrap the list in the expected 'results' structure if needed,
            # matching the fixture structure.
            org_upload = create_upload_object('organization.json', org_data)
            # Assuming memberships_json fixture contains the 'results' list
            memberships_upload = create_upload_object(
                'memberships.json', member_data
            )
            # Assuming people_json fixture contains the 'results' list
            people_upload = create_upload_object('people.json', ppl_data)
    
            # Assign the Upload objects to the form fields
            # Note: The form expects a list of uploads, even if there's only one.
            page.form['validate_schema'] = False
            page.form['organizations_source'] = [org_upload]
            page.form['memberships_source'] = [memberships_upload]
            page.form['people_source'] = [people_upload]
    
            # Submit the form
            result = page.form.submit().maybe_follow()
    
            # Add assertions as needed
            assert result.status_code == 200
            assert result.status_code == 200, f"Import failed: {result.text}"
            return result
    
        # --- First Import ---
        result1 = do_upload_procedure(
            organization_json, memberships_json, people_json
        )
    
        # Check the import logs after first import
        logs_page = client.get('/import-logs')
        assert logs_page.status_code == 200
        # Todo: This should validate all columns on all table
        # For example address is not checked here.
    
        # --- Second Import (Test idempotency) ---
        # Run the import again with the same data, to test robustness
        do_upload_procedure(
            organization_json, memberships_json, people_json
        )
    
        # Check logs again after second import
        logs_page = client.get('/import-logs')
        assert logs_page.status_code == 200, "Could not load import logs page"
        # Should now have two logs
>       assert len(logs_page.pyquery('tbody tr')) == 2
E       AssertionError: assert 1 == 2
E        +  where 1 = len([<tr>])
E        +    where [<tr>] = [<html.no-js>]('tbody tr')
E        +      where [<html.no-js>] = <200 OK text/html body=b'<!DOCTYP...\n\n'/35383>.pyquery

.../onegov/pas/test_views.py:410: AssertionError


src/onegov/form/validators.py

Tschuppi81 · 2025-12-04T12:02:05Z

src/onegov/pas/forms/data_import.py

    people_source = UploadMultipleField(
        label=_('People Data (JSON)'),
        description=_('JSON file containing parliamentarian data.'),
+        validators=[WhitelistedMimeType(MIME_TYPES_JSON)]


But of course json files are allowed if explicitly enabled

Since these import files aren't stored we could also just not have the validator here, since it could lead to false positives. There's nothing dangerous about a JSON parser opening these files, whatever they may contain.

Tschuppi81 · 2025-12-04T12:05:07Z

I saw that file types are handled differently for view_upload_file_by_json in handle_file_upload. Basically, all file types are allowed there. Shall we keep this?

Tschuppi81 · 2025-12-04T12:16:23Z

Should I completely remove the type application/octet-stream? It is mostly used in conjunction with application/zip.

Daverball · 2025-12-04T12:18:26Z

> I saw that file types are handled differently for view_upload_file_by_json in handle_file_upload. Basically, all file types are allowed there. Shall we keep this?

We can make sure to set supported_content_types on GeneralFileCollection. That's the only one that would allow anything to be uploaded currently through those views.

Daverball

Looks good overall, but there are a couple of details we should iron out.

src/onegov/form/validators.py

Daverball · 2025-12-04T12:55:29Z

src/onegov/org/forms/event.py

-                'text/csv',
-                'text/plain',
-            }),
+            WhitelistedMimeType(),


This is not a file that we store and that unsuspecting users could download after the fact, so a strict whitelist isn't that important here. That being said, we could probably trim it a little, since all we seem to accept for event imports are .xls and .xlsx files. It might be worth adding application/x-ole-storage for old Excel files, though, and application/octet-stream is probably fine here as well.

So I would keep the original whitelist, get rid of the bottom three and add application/x-ole-storage.

Daverball · 2025-12-04T12:57:19Z

src/onegov/pas/forms/data_import.py

    people_source = UploadMultipleField(
        label=_('People Data (JSON)'),
        description=_('JSON file containing parliamentarian data.'),
+        validators=[WhitelistedMimeType(MIME_TYPES_JSON)]


Since these import files aren't stored we could also just not have the validator here, since it could lead to false positives. There's nothing dangerous about a JSON parser opening these files, whatever they may contain.
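A minimal sketch of the reasoning above, assuming the import only needs the parsed data and never stores the upload: parsing the bytes is itself the check, so a file that is not JSON is rejected whatever mime type was detected for it. The helper name parse_json_upload is hypothetical and not part of onegov.

```python
from __future__ import annotations

import json
from typing import Any


def parse_json_upload(content: bytes) -> dict[str, Any] | None:
    """Return the parsed JSON object, or None if content is not one."""
    try:
        data = json.loads(content)
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Not valid JSON (or not decodable text): reject the upload.
        return None
    # The import fixtures are JSON objects with 'count' and 'results' keys,
    # so anything that is not an object is rejected as well.
    return data if isinstance(data, dict) else None


parsed = parse_json_upload(b'{"count": 2, "results": []}')
rejected = parse_json_upload(b'%PDF-1.7')
```

With this approach a false positive from mime type detection cannot block a valid import file, which is the concern raised in the comment.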

Daverball · 2025-12-04T13:06:03Z

src/onegov/form/fields.py

    action: Literal['keep', 'replace', 'delete']
    file: IO[bytes] | None
    filename: str | None
+    validators = [WhitelistedMimeType()]


This is not very robust; we should definitely override __init__ instead. The only remaining question is whether we want to add an extra parameter allowed_mimetypes or change the default of the validators argument to (WhitelistedMimeType(),).

I kind of like the extra parameter better, since it means we don't need to import WhitelistedMimeType everywhere.

You can then pass it on to super().__init__ as validators=[*(validators or ()), WhitelistedMimeType(allowed_mimetypes)].
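A minimal sketch of the extra-parameter variant, under stated assumptions: Field and WhitelistedMimeType below are simplified stand-ins rather than the real WTForms or onegov classes, and DEFAULT_WHITELIST is a placeholder. Only the validators expression passed to super().__init__ is taken from the comment.

```python
from __future__ import annotations

from collections.abc import Collection, Sequence
from typing import Any

# Placeholder default; the real default whitelist is defined elsewhere.
DEFAULT_WHITELIST = frozenset({'application/pdf', 'text/plain'})


class WhitelistedMimeType:
    """Stand-in validator that only records its whitelist."""

    def __init__(self, whitelist: Collection[str] | None = None) -> None:
        self.whitelist = (
            DEFAULT_WHITELIST if whitelist is None else frozenset(whitelist)
        )


class Field:
    """Stand-in base field that only stores its validators."""

    def __init__(self, validators: Sequence[Any] | None = None) -> None:
        self.validators = list(validators or ())


class UploadField(Field):
    def __init__(
        self,
        validators: Sequence[Any] | None = None,
        allowed_mimetypes: Collection[str] | None = None,
    ) -> None:
        # Always append the mime type validator, so callers that pass
        # their own validators cannot accidentally drop it.
        super().__init__(validators=[
            *(validators or ()),
            WhitelistedMimeType(allowed_mimetypes),
        ])


default_field = UploadField()
json_field = UploadField(allowed_mimetypes={'application/json'})
```

Each instance builds its own validator list here, so a field with a custom whitelist does not affect other fields, and call sites do not need to import WhitelistedMimeType.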

Daverball · 2025-12-04T13:06:16Z

src/onegov/form/fields.py


    upload_field_class: type[UploadField] = UploadField
    upload_widget: Widget[UploadField] = UploadWidget()
+    validators = [WhitelistedMimeType()]


Same thing here

src/onegov/form/fields.py

Daverball · 2025-12-04T14:51:53Z

> Should I completely remove type application/octet-stream ? It is mostly used in conjunction with application/zip

It's probably fine to remove it for now. There may, however, be the rare false positive for files that libmagic cannot identify correctly. PDFs, ZIPs and other binary file formats can all end up as application/octet-stream: it is the generic catch-all content type for binary data that could not be detected as anything else.
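To illustrate the false-positive case, here is a toy sketch. sniff_mime_type is a hypothetical stand-in for libmagic that only knows three signatures; real detection is far more thorough, but the fallback behaviour is the point.

```python
FALLBACK = 'application/octet-stream'

# A few well-known magic numbers; libmagic knows far more.
SIGNATURES = (
    (b'%PDF-', 'application/pdf'),
    (b'PK\x03\x04', 'application/zip'),
    (b'\x89PNG\r\n\x1a\n', 'image/png'),
)


def sniff_mime_type(head: bytes) -> str:
    """Toy stand-in for content-based mime type detection."""
    for magic, mime_type in SIGNATURES:
        if head.startswith(magic):
            return mime_type
    # Nothing matched: report the generic catch-all type.
    return FALLBACK


def is_allowed(head: bytes, whitelist: set[str]) -> bool:
    return sniff_mime_type(head) in whitelist


whitelist = {'application/pdf', 'application/zip'}
recognized = is_allowed(b'%PDF-1.7\n', whitelist)
# Data that is not recognized falls back to octet-stream and is
# rejected unless the fallback type itself is whitelisted.
unrecognized = is_allowed(b'\x00\x01binary', whitelist)
```

Removing application/octet-stream from the whitelist therefore trades a stricter check for the occasional rejection of a legitimate file whose type could not be detected.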

Tschuppi81 · 2025-12-05T10:07:01Z

src/onegov/file/utils.py

-        'image/x-pcx',
-        'image/x-portable-pixmap',
-        'image/x-tga'
+        'image/x-xcf',


The resulting list for get_supported_image_mime_types is:

- image/tiff
- image/palm
- image/x-MS-bmp
- image/avif
- image/x-portable-anymap
- image/x-icon
- image/xbm
- image/x-tga
- image/xpm
- image/icns
- image/mpo
- image/bmp
- image/webp
- image/jp2
- image/sgi
- image/vnd.adobe.photoshop
- video/mpeg
- image/jpeg
- image/png
- image/gif
- image/svg+xml
- image/x-pcx

Rework mime type white list

8699bc1

Tschuppi81 added 16 commits November 17, 2025 11:00

Ensure mime type validator for file fields in formcode

66ac404

Adds mime type validator by default

5ebe0fb

Merge branch 'master' into feature/ogc-2738-pentest-arbitrary-file-upload

9416abf

Fix wrongly attached validators

7eed62a

Revert

4ab42df

improve tests and fix linter issues

cf252e2

Rework tests

6e9afd6

Fix file size validator and align

3f718d2

Set mime types for all upload fields

5572020

Merge branch 'master' into feature/ogc-2738-pentest-arbitrary-file-upload

a6bd385

Fix missing validator

1293c8e

Cleanup unused import

af685f5

Add fixme

a50f6f0

Fix linting errors

458a82d

Extend test

8d99429

Fix more linter issues

c31d0e3


Tschuppi81 requested a review from Daverball December 4, 2025 12:03

Daverball requested changes Dec 4, 2025


Tschuppi81 added 3 commits December 5, 2025 10:00

Remove validators from field list

eee980c

Add old ms office doc types

248819a

Remove non-standard svg type

5621994

Update supported image mime type

21274d1


Remove unused json validator

c4a1eef
