[Bug]: Incorrect Encoding Handling of text/plain;charset=UTF-8 Responses Containing Binary Data #3000

@MichaelSuen-thePointer

Version

1.55.0

Steps to reproduce

Description:

I encountered an issue with the library when handling server responses that declare the Content-Type header text/plain;charset=UTF-8 but contain binary data. Specifically, the response body read inside a page.on('requestfinished') handler does not match what the server sent. The library appears to decode the body as UTF-8 internally even though the bytes are not valid UTF-8, which corrupts or loses data.

Steps to Reproduce:

Set up a server that responds with the Content-Type text/plain;charset=UTF-8 but sends binary data in the response body. Here is a simple Flask server as an example.

from flask import Flask, Response

app = Flask(__name__)

@app.route('/')
def index():
    data = b'\xb5\x6e\x4c\x0d\x28\xd4\x8b\xea\x8b\x9a\x5c\x3f\x72\xf9\xa1\xcf'
    return Response(data, content_type='text/plain;charset=utf-8')

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=5000)

Use the library to intercept the response via the page.on('requestfinished') event.

import asyncio
from playwright.async_api import async_playwright, Request

async def on_request_finished(request: Request):
    response = await request.response()
    if response.ok:
        body = await response.body()
        print(f"Request finished: {request.url}, Status: {response.status}, Body size: {len(body)} bytes")
        print(f"Body data in hex: {body.hex()} (should be b56e4c0d28d48bea8b9a5c3f72f9a1cf)")
    else:
        print(f"Request finished: {request.url}, Status: {response.status}, No body (response not OK)")

async def main():
    async with async_playwright() as p:
        browser_type = p.chromium
        browser = await browser_type.launch(headless=False)
        page = await browser.new_page()
        page.on('requestfinished', on_request_finished)
        await page.goto('http://127.0.0.1:5000/')
        browser_closed_event = asyncio.Event()
        page.on("close", lambda: browser_closed_event.set())
        await browser_closed_event.wait()

asyncio.run(main())

Observe the response data printed by the handler.

Expected behavior

The library should capture the raw response data as-is, without applying any unnecessary or incorrect UTF-8 decoding, even if the Content-Type header specifies charset=UTF-8.

Actual behavior

The library seems to apply an erroneous UTF-8 decoding to the binary data, causing data corruption or loss. The captured data does not match the actual data sent by the server.
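To rule out the server side, the bytes on the wire can be checked without any browser involved. The following is a minimal sketch, not part of the original report: it swaps the Flask app for a standard-library http.server handler (the names BinaryAsTextHandler and fetch_raw_body are hypothetical) that sends the same 16 bytes with the same Content-Type, then fetches them with urllib, which returns the body without any charset decoding.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Same payload as the Flask example in the report.
DATA = bytes.fromhex("b56e4c0d28d48bea8b9a5c3f72f9a1cf")


class BinaryAsTextHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the Flask route: binary body, text/plain header."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain;charset=utf-8")
        self.send_header("Content-Length", str(len(DATA)))
        self.end_headers()
        self.wfile.write(DATA)

    def log_message(self, format, *args):
        pass  # keep the output limited to the hex dump below


def fetch_raw_body() -> bytes:
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), BinaryAsTextHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # An empty ProxyHandler bypasses any proxy configured in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(f"http://127.0.0.1:{server.server_port}/") as resp:
            return resp.read()  # raw bytes, no charset decoding applied
    finally:
        server.shutdown()
        server.server_close()


raw = fetch_raw_body()
print(raw.hex())
```

If the printed hex equals the payload, the server is delivering the bytes intact and the corruption happens somewhere between the browser and the body returned to the handler.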

On my machine, the output is as follows (the server sends 16 bytes, but 24 bytes are captured):

Request finished: http://127.0.0.1:5000/, Status: 200, Body size: 24 bytes
Body data in hex: efbfbd6e4c0d28d48bea8b9a5c3f72efbfbdefbfbdefbfbd (should be b56e4c0d28d48bea8b9a5c3f72f9a1cf)
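The captured hex contains several efbfbd runs, which is the UTF-8 encoding of U+FFFD, the replacement character. A minimal sketch of one way to test the hypothesis, using Python's own decoder as a stand-in (this is an assumption about the kind of conversion involved, not a statement about the library's or the browser's actual code path):

```python
sent = bytes.fromhex("b56e4c0d28d48bea8b9a5c3f72f9a1cf")
observed = bytes.fromhex("efbfbd6e4c0d28d48bea8b9a5c3f72efbfbdefbfbdefbfbd")

# The payload is not valid UTF-8, so a strict decode fails.
try:
    sent.decode("utf-8")
    strictly_valid = True
except UnicodeDecodeError:
    strictly_valid = False

# Lossy round trip: decode as UTF-8, substituting U+FFFD for every invalid
# byte, then encode the resulting text back to UTF-8.
round_tripped = sent.decode("utf-8", errors="replace").encode("utf-8")

print(f"valid UTF-8: {strictly_valid}")
print(f"round trip:  {round_tripped.hex()}")
print(f"matches captured body: {round_tripped == observed}")
```

If the round trip matches the captured body, the corruption is consistent with a decode-to-text and re-encode step somewhere in the pipeline. Because each invalid byte is replaced by the same three-byte sequence, the original bytes cannot be recovered from the captured body.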

Additional context

No response

Environment

- Operating System: Windows 10
- CPU: AMD 2700X
- Browser: Chromium and Chrome (others not tested)
- Python Version: 3.12.3
