Description
Version
1.55.0
Steps to reproduce
Description:
I encountered an issue with the library when handling server responses that use the text/plain;charset=UTF-8 Content-Type header but contain binary data. Specifically, the page.on('requestfinished') event captures the response data incorrectly. It seems like the library internally attempts an invalid UTF-8 conversion on non-UTF-8 binary data, resulting in data corruption or loss.
Steps to Reproduce:
Set up a server that responds with the Content-Type text/plain;charset=UTF-8 but sends binary data in the response body. Here is a simple Flask server as an example:
from flask import Flask, Response

app = Flask(__name__)

@app.route('/')
def index():
    data = b'\xb5\x6e\x4c\x0d\x28\xd4\x8b\xea\x8b\x9a\x5c\x3f\x72\xf9\xa1\xcf'
    return Response(data, content_type='text/plain;charset=utf-8')

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=5000)

Use the library to intercept the response via the page.on('requestfinished') event.
import asyncio
from playwright.async_api import async_playwright, Request

async def on_request_finished(request: Request):
    response = await request.response()
    if response.ok:
        body = await response.body()
        print(f"Request finished: {request.url}, Status: {response.status}, Body size: {len(body)} bytes")
        print(f"Body data in hex: {body.hex()} (should be b56e4c0d28d48bea8b9a5c3f72f9a1cf)")
    else:
        print(f"Request finished: {request.url}, Status: {response.status}, No body (response not OK)")

async def main():
    async with async_playwright() as p:
        browser_type = p.chromium
        browser = await browser_type.launch(headless=False)
        page = await browser.new_page()
        page.on('requestfinished', on_request_finished)
        await page.goto('http://127.0.0.1:5000/')
        browser_closed_event = asyncio.Event()
        page.on("close", lambda: browser_closed_event.set())
        await browser_closed_event.wait()

asyncio.run(main())

Observe the response data captured by the code.
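To rule out the server as the source of the corruption, the same response can be fetched without a browser. The sketch below is only an illustration: it swaps the Flask app for a hypothetical standard-library handler (`BinaryAsTextHandler`) that sends the same 16 bytes with the same Content-Type, and reads them back with `urllib`. The helper name `fetch_raw_body` is likewise an assumption, not part of the original report.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Same 16 bytes as the Flask example in this report.
DATA = bytes.fromhex("b56e4c0d28d48bea8b9a5c3f72f9a1cf")


class BinaryAsTextHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the Flask route: binary body, text/plain header."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain;charset=utf-8")
        self.send_header("Content-Length", str(len(DATA)))
        self.end_headers()
        self.wfile.write(DATA)

    def log_message(self, format, *args):
        pass  # suppress per-request logging


def fetch_raw_body() -> bytes:
    """Serve DATA on an ephemeral local port and fetch it without a browser."""
    server = HTTPServer(("127.0.0.1", 0), BinaryAsTextHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # An empty ProxyHandler bypasses any proxy configured in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        url = f"http://127.0.0.1:{server.server_port}/"
        with opener.open(url) as response:
            return response.read()
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    body = fetch_raw_body()
    print(f"Body size: {len(body)} bytes, hex: {body.hex()}")
```

If a non-browser client receives the 16 bytes unchanged, the altered body seen through page.on('requestfinished') is presumably introduced somewhere on the browser or library side rather than by the server.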
Expected behavior
The library should capture the raw response data as-is, without applying any unnecessary or incorrect UTF-8 decoding, even if the Content-Type header specifies charset=UTF-8.
Actual behavior
The library seems to apply an erroneous UTF-8 decoding to the binary data, causing data corruption or loss. The captured data does not match the actual data sent by the server.
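The corruption pattern can be checked in plain Python, independently of Playwright. The sketch below decodes the original bytes as UTF-8 with replacement of invalid sequences and then re-encodes the result. The function name `lossy_utf8_round_trip` is hypothetical, and Python's decoder only stands in for whatever decoder the browser or library actually uses, so a match is suggestive rather than proof of where the conversion happens.

```python
# Bytes sent by the server in this report.
sent = bytes.fromhex("b56e4c0d28d48bea8b9a5c3f72f9a1cf")

# Hex string reported as the captured body in this issue.
reported = "efbfbd6e4c0d28d48bea8b9a5c3f72efbfbdefbfbdefbfbd"


def lossy_utf8_round_trip(raw: bytes) -> bytes:
    """Decode as UTF-8, replacing invalid sequences with U+FFFD, then re-encode."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


round_tripped = lossy_utf8_round_trip(sent)
print(f"Body size: {len(round_tripped)} bytes")
print(f"Hex: {round_tripped.hex()}")
print(f"Matches reported output: {round_tripped.hex() == reported}")
```

Each invalid byte (b5, f9, a1, and the truncated lead byte cf) becomes U+FFFD, which encodes as ef bf bd, while the valid sequences d4 8b and ea 8b 9a pass through unchanged. That accounts for the growth from 16 to 24 bytes.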
On my machine, the output is:
Request finished: http://127.0.0.1:5000/, Status: 200, Body size: 24 bytes
Body data in hex: efbfbd6e4c0d28d48bea8b9a5c3f72efbfbdefbfbdefbfbd (should be b56e4c0d28d48bea8b9a5c3f72f9a1cf)
Additional context
No response
Environment
- Operating System: [Windows 10]
- CPU: [AMD 2700X]
- Browser: [Chromium & Chrome] (others not tested)
- Python Version: [3.12.3]