@Tschuppi81
Contributor

@Tschuppi81 Tschuppi81 commented Nov 10, 2025

Org: Ensure mime type validator on file upload fields in form code

TYPE: Feature
LINK: ogc-2738

@linear

linear bot commented Nov 10, 2025

@codecov

codecov bot commented Nov 10, 2025

❌ 3 Tests Failed:

Tests completed: 2360, Failed: 3, Passed: 2357, Skipped: 14
View the top 3 failed test(s) by shortest run time
tests/onegov/form/test_fields.py::test_upload_multiple_field
Stack Traces | 0.016s run time
def test_upload_multiple_field() -> None:
        def create_field(
            validators: Validators[FormT, Self] | None = None  # type:ignore[misc]
        ) -> tuple[Form, UploadMultipleField]:
            form = Form()
            field = UploadMultipleField(validators=validators)
            field = field.bind(form, 'uploads')  # type: ignore[attr-defined]
            return form, field
    
        # failing mime type validator
        form, field = create_field(validators=[ExpectedExtensions(['.json'])])
        file1 = create_file('text/plain', 'baz.txt', b'baz')
        field.process(DummyPostData({'uploads': [file1]}))
        assert not field.validate(form)
        assert len(field.data) == 1
        assert field.data[0]['filename'] == 'baz.txt'
        assert field.data[0]['mimetype'] == 'text/plain'
        validator = find_validator(field, WhitelistedMimeType)
        assert validator
>       assert validator.whitelist == {'application/json'}  # type:ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E       AssertionError: assert {'application...s-excel', ...} == {'application/json'}
E         
E         Extra items in the left set:
E         'application/vnd.openxmlformats-officedocument.presentationml.presentation'
E         'image/xpm'
E         'text/csv'
E         'application/rtf'
E         'image/bmp'
E         'application/pdf'
E         'image/mpo'
E         'image/x-portable-anymap'
E         'text/plain'
E         'image/tiff'
E         'audio/mpeg'
E         'image/x-pcx'
E         'video/ogg'
E         'video/mpeg'
E         'application/CDFV2-unknown'
E         'application/msword'
E         'application/vnd.ms-excel'
E         'image/png'
E         'image/x-tga'
E         'image/avif'
E         'image/xbm'
E         'image/x-MS-bmp'
E         'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
E         'audio/wav'
E         'video/mp4'
E         'application/CDFV2'
E         'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
E         'image/x-xcf'
E         'image/svg+xml'
E         'image/icns'
E         'image/jp2'
E         'audio/mp4'
E         'image/gif'
E         'application/zip'
E         'image/jpeg'
E         'audio/webm'
E         'application/xml'
E         'image/x-icon'
E         'application/x-ole-storage'
E         'image/sgi'
E         'image/palm'
E         'image/vnd.adobe.photoshop'
E         'video/webm'
E         'video/x-msvideo'
E         'image/webp'
E         'video/quicktime'
E         Extra items in the right set:
E         'application/json'
E         
E         Full diff:
E           {
E         +     'application/CDFV2',
E         +     'application/CDFV2-unknown',
E         -     'application/json',
E         ?                  ^  ^
E         +     'application/msword',
E         ?                  ^ + ^^
E         +     'application/pdf',
E         +     'application/rtf',
E         +     'application/vnd.ms-excel',
E         +     'application/vnd.openxmlformats-officedocument.presentationml.presentation',
E         +     'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
E         +     'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
E         +     'application/x-ole-storage',
E         +     'application/xml',
E         +     'application/zip',
E         +     'audio/mp4',
E         +     'audio/mpeg',
E         +     'audio/wav',
E         +     'audio/webm',
E         +     'image/avif',
E         +     'image/bmp',
E         +     'image/gif',
E         +     'image/icns',
E         +     'image/jp2',
E         +     'image/jpeg',
E         +     'image/mpo',
E         +     'image/palm',
E         +     'image/png',
E         +     'image/sgi',
E         +     'image/svg+xml',
E         +     'image/tiff',
E         +     'image/vnd.adobe.photoshop',
E         +     'image/webp',
E         +     'image/x-MS-bmp',
E         +     'image/x-icon',
E         +     'image/x-pcx',
E         +     'image/x-portable-anymap',
E         +     'image/x-tga',
E         +     'image/x-xcf',
E         +     'image/xbm',
E         +     'image/xpm',
E         +     'text/csv',
E         +     'text/plain',
E         +     'video/mp4',
E         +     'video/mpeg',
E         +     'video/ogg',
E         +     'video/quicktime',
E         +     'video/webm',
E         +     'video/x-msvideo',
E           }

.../onegov/form/test_fields.py:300: AssertionError
tests/onegov/form/test_parser.py::test_parse_multiplefileinput
Stack Traces | 0.026s run time
def test_parse_multiplefileinput() -> None:
        form = parse_form("Files = *.pdf|*.doc (multiple)")()
    
        assert form['files'].label.text == 'Files'
        assert isinstance(form['files'], FileField)
        assert form['files'].widget.multiple is True  # type: ignore[attr-defined]
    
        # verify attached mime type validator
        validator = find_validator(form['files'], WhitelistedMimeType)
        assert validator
>       assert validator.whitelist == {  # type:ignore[attr-defined]
            'application/msword', 'application/pdf'
        }
E       AssertionError: assert {'application...s-excel', ...} == {'application...lication/pdf'}
E         
E         Extra items in the left set:
E         'video/ogg'
E         'video/x-msvideo'
E         'image/jpeg'
E         'image/sgi'
E         'image/jp2'
E         'image/tiff'
E         'image/mpo'
E         'application/rtf'
E         'application/xml'
E         'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
E         'text/csv'
E         'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
E         'image/svg+xml'
E         'image/palm'
E         'image/gif'
E         'image/x-tga'
E         'video/webm'
E         'image/x-icon'
E         'image/webp'
E         'image/x-MS-bmp'
E         'image/x-pcx'
E         'audio/mpeg'
E         'application/vnd.ms-excel'
E         'video/mpeg'
E         'image/x-portable-anymap'
E         'application/vnd.openxmlformats-officedocument.presentationml.presentation'
E         'image/vnd.adobe.photoshop'
E         'application/zip'
E         'application/CDFV2'
E         'image/icns'
E         'audio/webm'
E         'video/mp4'
E         'application/CDFV2-unknown'
E         'image/xpm'
E         'image/avif'
E         'image/xbm'
E         'image/x-xcf'
E         'audio/mp4'
E         'video/quicktime'
E         'audio/wav'
E         'text/plain'
E         'image/bmp'
E         'application/x-ole-storage'
E         'image/png'
E         
E         Full diff:
E           {
E         +     'application/CDFV2',
E         +     'application/CDFV2-unknown',
E               'application/msword',
E               'application/pdf',
E         +     'application/rtf',
E         +     'application/vnd.ms-excel',
E         +     'application/vnd.openxmlformats-officedocument.presentationml.presentation',
E         +     'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
E         +     'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
E         +     'application/x-ole-storage',
E         +     'application/xml',
E         +     'application/zip',
E         +     'audio/mp4',
E         +     'audio/mpeg',
E         +     'audio/wav',
E         +     'audio/webm',
E         +     'image/avif',
E         +     'image/bmp',
E         +     'image/gif',
E         +     'image/icns',
E         +     'image/jp2',
E         +     'image/jpeg',
E         +     'image/mpo',
E         +     'image/palm',
E         +     'image/png',
E         +     'image/sgi',
E         +     'image/svg+xml',
E         +     'image/tiff',
E         +     'image/vnd.adobe.photoshop',
E         +     'image/webp',
E         +     'image/x-MS-bmp',
E         +     'image/x-icon',
E         +     'image/x-pcx',
E         +     'image/x-portable-anymap',
E         +     'image/x-tga',
E         +     'image/x-xcf',
E         +     'image/xbm',
E         +     'image/xpm',
E         +     'text/csv',
E         +     'text/plain',
E         +     'video/mp4',
E         +     'video/mpeg',
E         +     'video/ogg',
E         +     'video/quicktime',
E         +     'video/webm',
E         +     'video/x-msvideo',
E           }

.../onegov/form/test_parser.py:386: AssertionError
tests/onegov/pas/test_views.py::test_view_upload_json
Stack Traces | 1.77s run time
client = <tests.shared.client.Client object at 0x7fd6daa9f250>
people_json = {'count': 2, 'next': None, 'previous': None, 'results': [{'created': '2024-12-23T16:44:12.056040+01:00', 'firstName': ...13799+01:00', 'firstName': 'Heinz', 'fullName': 'Achermann Heinz', 'id': '56603980-d390-4cf8-bc2b-5335affe22be', ...}]}
organization_json = {'count': 2, 'next': None, 'previous': None, 'results': [{'created': '2024-12-23T16:44:24.920938+01:00', 'description'...12-23T16:44:24.928269+01:00', 'description': '', 'id': 'f124b1fe-99b8-4bc2-b40d-bc8dec4997bc', 'isActive': True, ...}]}
memberships_json = {'count': 2, 'next': None, 'previous': None, 'results': [{'address': False, 'department': '', 'description': '', 'email': False, ...}, {'address': False, 'department': '', 'description': '', 'email': False, ...}]}

    def test_view_upload_json(
        client: Client[TestPasApp],
        people_json: dict[str, Any],
        organization_json: dict[str, Any],
        memberships_json: dict[str, Any]
    ) -> None:
        """ Test successful import of all data using fixtures.
    
        *1. Understanding the Data and Models**
    
        **people.json**: Contains individual person data (Parliamentarians). Key
             fields are firstName, officialName, primaryEmail, tags, title, id.
             This maps to the Parliamentarian model.
    
        **organizations.json**:
            The organizationTypeTitle dictates the type of organization.
            - "Kommission":  Maps to Commission model.
            - "Kantonsrat":  This is a special case. It's not a Commission. It
            represents the Parliament itself. We link this as ParliamentarianRole
            directly on the Parliamentarian model with role='member' and associated
            with the Kantonsrat organization.
            - "Fraktion":  Maps to ParliamentaryGroup.
            - "Sonstige": Could be various types. Let's see how these are intended
              to be modeled. We need more clarity on how "Sonstige" is categorized.
    
        **memberships.json**: Connects person and organization.
            It defines the role within that organization, start, end dates.
            The nested person and organization blocks are crucial for establishing
            relationships.
        """
    
        client.login_admin()
    
        # --- Reference: Previously used function for local file testing ---
        # import os
        # def yield_paths():
        #     """ Yields paths in this order: organization, membership, people """
        #     base_path = '.../local/json/files' # Adjust this path
        #     yield [base_path + '/organization.json']
        #     membership_count = 7 # Adjust as needed
        #     membership_paths = [
        #         f'{base_path}/memberships_{i}.json'
        #         for i in range(1, membership_count + 1)
        #     ]
        #     assert all(
        #         os.path.exists(path) for path in membership_paths
        #     ), "Some membership paths don't exist"
        #     yield membership_paths
        #
        #     # Yield people paths after validating existence
        #     # Adjust as needed:
        #     people_paths = [f'{base_path}/people_{i}.json' for i in range(1, 3)]
        #     assert all(
        #         os.path.exists(path) for path in people_paths
        #     ), "Some people paths don't exist"
        #     yield people_paths
    
    
        # def upload_file(filepath):
        #     with open(filepath, 'rb') as f:
        #         content = f.read()
        #         return Upload(
        #             os.path.basename(filepath),
        #             content,
        #             'application/json'
        #         )
    
        # # Get all paths
        # paths_generator = yield_paths()
    
        # org_paths = next(paths_generator)
        # page.form['organizations_source'] = [
        #     upload_file(path) for path in org_paths
        # ]
    
        # membership_paths = next(paths_generator)
        # page.form['memberships_source'] = [
        #     upload_file(path) for path in membership_paths
        # ]
    
        # people_paths = next(paths_generator)
        # page.form['people_source'] = [
        #     upload_file(path) for path in people_paths
        # ]
    
    
        # --- End Reference ---
    
        def create_upload_object(
            filename: str, data: dict[str, list[Any]]
        ) -> Upload:
            """Creates a webtest Upload object from a dictionary."""
            content_bytes = json.dumps(data).encode('utf-8')
            return Upload(
                filename,
                content_bytes,
                'application/json'
            )
    
        def do_upload_procedure(
            org_data: dict[str, Any],
            member_data: dict[str, Any],
            ppl_data: dict[str, Any]
        ) -> ExtendedResponse:
            """Uploads data using Upload objects created from fixtures."""
            page = client.get('/pas-import')
    
            # Create Upload objects from the fixture data
            # We wrap the list in the expected 'results' structure if needed,
            # matching the fixture structure.
            org_upload = create_upload_object('organization.json', org_data)
            # Assuming memberships_json fixture contains the 'results' list
            memberships_upload = create_upload_object(
                'memberships.json', member_data
            )
            # Assuming people_json fixture contains the 'results' list
            people_upload = create_upload_object('people.json', ppl_data)
    
            # Assign the Upload objects to the form fields
            # Note: The form expects a list of uploads, even if there's only one.
            page.form['validate_schema'] = False
            page.form['organizations_source'] = [org_upload]
            page.form['memberships_source'] = [memberships_upload]
            page.form['people_source'] = [people_upload]
    
            # Submit the form
            result = page.form.submit().maybe_follow()
    
            # Add assertions as needed
            assert result.status_code == 200
            assert result.status_code == 200, f"Import failed: {result.text}"
            return result
    
        # --- First Import ---
        result1 = do_upload_procedure(
            organization_json, memberships_json, people_json
        )
    
        # Check the import logs after first import
        logs_page = client.get('/import-logs')
        assert logs_page.status_code == 200
        # Todo: This should validate all columns on all table
        # For example address is not checked here.
    
        # --- Second Import (Test idempotency) ---
        # Run the import again with the same data, to test robustness
        do_upload_procedure(
            organization_json, memberships_json, people_json
        )
    
        # Check logs again after second import
        logs_page = client.get('/import-logs')
        assert logs_page.status_code == 200, "Could not load import logs page"
        # Should now have two logs
>       assert len(logs_page.pyquery('tbody tr')) == 2
E       AssertionError: assert 1 == 2
E        +  where 1 = len([<tr>])
E        +    where [<tr>] = [<html.no-js>]('tbody tr')
E        +      where [<html.no-js>] = <200 OK text/html body=b'<!DOCTYP...\n\n'/35383>.pyquery

.../onegov/pas/test_views.py:410: AssertionError


people_source = UploadMultipleField(
    label=_('People Data (JSON)'),
    description=_('JSON file containing parliamentarian data.'),
    validators=[WhitelistedMimeType(MIME_TYPES_JSON)]
Contributor Author

But of course JSON files are allowed if explicitly enabled.

Member

Since these import files aren't stored we could also just not have the validator here, since it could lead to false positives. There's nothing dangerous about a JSON parser opening these files, whatever they may contain.
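
One way to read this suggestion is to let the parser itself act as the check instead of the detected content type. The following is a minimal sketch under that assumption; `load_json_upload` is a hypothetical helper and not part of the code under review.

```python
import json
from typing import Any


def load_json_upload(raw: bytes) -> dict[str, Any]:
    """Parse uploaded bytes as a JSON object, or raise ValueError.

    Hypothetical helper: the upload is never stored, so the only
    requirement is that it parses into the expected top-level shape.
    """
    try:
        data = json.loads(raw.decode('utf-8'))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError(f'Not valid JSON: {exc}') from exc

    # The import fixtures in the test above are objects with a
    # 'results' list, so anything else is rejected here.
    if not isinstance(data, dict):
        raise ValueError('Expected a JSON object at the top level')
    return data
```

A file that is mislabeled by content-type detection but contains valid JSON would pass this check, while a non-JSON payload fails at parse time.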

@Tschuppi81 Tschuppi81 requested a review from Daverball December 4, 2025 12:03
@Tschuppi81
Contributor Author

I saw that file types are handled differently for view_upload_file_by_json in handle_file_upload. Basically, all file types are allowed. Shall we keep this?

@Tschuppi81
Contributor Author

Should I completely remove the type application/octet-stream? It is mostly used in conjunction with application/zip.

@Daverball
Member

> I saw that file types are handled differently for view_upload_file_by_json in handle_file_upload. Basically, all file types are allowed. Shall we keep this?

We can make sure to set supported_content_types on GeneralFileCollection. That's the only one that would allow anything to be uploaded currently through those views.

Member

@Daverball Daverball left a comment

Looks good overall, but there are a couple of details we should iron out.

'text/csv',
'text/plain',
}),
WhitelistedMimeType(),
Member

This is not a file that we store and that could later be downloaded by unsuspecting users, so a strict whitelist isn't that important here. That being said, we could probably trim it a little, since all we seem to accept for event imports are .xls and .xlsx files. It might be worth adding application/x-ole-storage for old Excel files, though, and application/octet-stream is probably fine here as well.

So I would keep the original whitelist, get rid of the bottom three and add application/x-ole-storage.
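
As a rough illustration of what a trimmed whitelist for Excel-only imports could look like: the full original whitelist is not shown in this thread, so the set below is an assumption built from content types mentioned elsewhere in this conversation, and `EVENT_IMPORT_MIME_TYPES` and `is_allowed_event_import` are hypothetical names.

```python
# Assumed content types for .xls/.xlsx event imports; not the
# project's actual whitelist.
EVENT_IMPORT_MIME_TYPES = frozenset({
    # legacy .xls
    'application/vnd.ms-excel',
    # .xlsx
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    # old Excel files detected as generic OLE containers
    'application/x-ole-storage',
    # generic binary fallback, tolerated because the file is not stored
    'application/octet-stream',
})


def is_allowed_event_import(content_type: str) -> bool:
    """Return True if the detected content type may be imported."""
    return content_type in EVENT_IMPORT_MIME_TYPES
```

Whether further container types are needed depends on what the detector actually reports for real-world .xlsx files, which would have to be checked against sample uploads.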

action: Literal['keep', 'replace', 'delete']
file: IO[bytes] | None
filename: str | None
validators = [WhitelistedMimeType()]
Member

This is not very robust; we should definitely override __init__ instead. The only remaining question is whether we want to add an extra parameter allowed_mimetypes or change the default of the validators argument to (WhitelistedMimeType(),).

I kind of like the extra parameter better, since it means we don't need to import WhitelistedMimeType everywhere.

You can then pass it on to super().__init__ as validators=[*(validators or ()), WhitelistedMimeType(allowed_mimetypes)].
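
A minimal sketch of the extra-parameter variant, assuming simplified stand-ins: `BaseField` and the `WhitelistedMimeType` class below are placeholders written for this example (including the placeholder default whitelist), not the real form framework or the project's validator.

```python
from __future__ import annotations

from collections.abc import Callable, Collection, Sequence
from typing import Any

Validator = Callable[[Any, Any], None]


class WhitelistedMimeType:
    """Stand-in validator that only records its whitelist."""

    # Placeholder default; the real default whitelist is much longer.
    DEFAULT_WHITELIST = frozenset({'application/pdf', 'text/plain'})

    def __init__(self, whitelist: Collection[str] | None = None) -> None:
        self.whitelist = set(
            self.DEFAULT_WHITELIST if whitelist is None else whitelist
        )

    def __call__(self, form: Any, field: Any) -> None:
        # The real validator would inspect the uploaded file here.
        return None


class BaseField:
    """Stand-in for the framework's field base class."""

    def __init__(self, validators: Sequence[Validator] | None = None) -> None:
        self.validators = list(validators or ())


class UploadField(BaseField):
    def __init__(
        self,
        validators: Sequence[Validator] | None = None,
        allowed_mimetypes: Collection[str] | None = None,
    ) -> None:
        # Always append the mime type validator, so callers cannot
        # drop it by passing their own validators list.
        super().__init__(
            validators=[
                *(validators or ()),
                WhitelistedMimeType(allowed_mimetypes),
            ]
        )
```

With this shape, a caller would write something like `UploadField(allowed_mimetypes={'application/json'})` without importing the validator, whereas a class-level `validators` attribute is replaced as soon as a caller passes `validators=` explicitly.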


upload_field_class: type[UploadField] = UploadField
upload_widget: Widget[UploadField] = UploadWidget()
validators = [WhitelistedMimeType()]
Member

Same thing here

@Daverball
Member

Daverball commented Dec 4, 2025

> Should I completely remove the type application/octet-stream? It is mostly used in conjunction with application/zip.

It's probably fine to remove it for now. There may, however, be the rare false positive for files that libmagic cannot identify correctly. PDFs, zips and other binary file formats can all end up as application/octet-stream: it is the generic catch-all content type for binary data that couldn't be detected as anything else.
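
To make the fallback behaviour concrete, here is a toy sketch. It is not libmagic: the signature table and the function names are made up for this example and only show why an unrecognized binary file ends up rejected once application/octet-stream is no longer whitelisted.

```python
from collections.abc import Collection

# Toy signature table; real detection is far more thorough.
SIGNATURES: tuple[tuple[bytes, str], ...] = (
    (b'%PDF-', 'application/pdf'),
    (b'PK\x03\x04', 'application/zip'),
    (b'\x89PNG\r\n\x1a\n', 'image/png'),
)


def sniff_content_type(head: bytes) -> str:
    """Guess a content type from the first bytes of a file."""
    for magic, content_type in SIGNATURES:
        if head.startswith(magic):
            return content_type
    # Nothing matched: fall back to the generic binary type.
    return 'application/octet-stream'


def is_whitelisted(head: bytes, whitelist: Collection[str]) -> bool:
    """Check the sniffed content type against a whitelist."""
    return sniff_content_type(head) in whitelist
```

A PDF with a damaged or unusual header would be sniffed as application/octet-stream here and fail a whitelist that contains only application/pdf, which is the kind of false positive described above.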

'image/x-pcx',
'image/x-portable-pixmap',
'image/x-tga',
'image/x-xcf',
Contributor Author

The resulting list for get_supported_image_mime_types is:

 - image/tiff
 - image/palm
 - image/x-MS-bmp
 - image/avif
 - image/x-portable-anymap
 - image/x-icon
 - image/xbm
 - image/x-tga
 - image/xpm
 - image/icns
 - image/mpo
 - image/bmp
 - image/webp
 - image/jp2
 - image/sgi
 - image/vnd.adobe.photoshop
 - video/mpeg
 - image/jpeg
 - image/png
 - image/gif
 - image/svg+xml
 - image/x-pcx
