Using VPC endpoint and PandasCursor together #576

Open
KitauchiShinji opened this issue Feb 28, 2025 · 3 comments

I am accessing Athena from a closed network via a VPC endpoint.
Passing the VPC endpoint URL as endpoint_url= works as expected with the default cursor, but it fails when used with PandasCursor.

I checked the code and found that endpoint_url= is also applied when the boto3 client for S3 is created. I suspect this is the cause of the error, because the S3 requests are then sent to the Athena endpoint.

If possible, I would appreciate it if endpoint_url= and PandasCursor could be used together.

  • Python: 3.12.1
  • PyAthena: 3.12.2
from pyathena import connect
from pyathena.pandas.cursor import PandasCursor

cursor = connect(
    work_group='XXXXXXX',
    endpoint_url='https://vpce-XXXXXXX.athena.XXXXXXX.vpce.amazonaws.com',
    region_name='XXXXXXX').cursor(PandasCursor)

df = cursor.execute('''
    SELECT * FROM XXXXXXX.XXXXXXX LIMIT 10
''').as_pandas()

print(df)
Failed to get content length.
Traceback (most recent call last):
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyathena\result_set.py", line 434, in _get_content_length
    response = retry_api_call(
               ^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyathena\util.py", line 84, in retry_api_call
    return retry(func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\tenacity\__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\tenacity\__init__.py", line 376, in iter
    result = action(retry_state)
             ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\tenacity\__init__.py", line 398, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
                                     ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\tenacity\__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\botocore\client.py", line 569, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\botocore\client.py", line 1023, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
Traceback (most recent call last):
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyathena\result_set.py", line 434, in _get_content_length
    response = retry_api_call(
               ^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyathena\util.py", line 84, in retry_api_call
    return retry(func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\tenacity\__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\tenacity\__init__.py", line 376, in iter
    result = action(retry_state)
             ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\tenacity\__init__.py", line 398, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
                                     ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\tenacity\__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\botocore\client.py", line 569, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\botocore\client.py", line 1023, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\tmp\sample3.py", line 10, in <module>
    df = cursor.execute('''
         ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyathena\pandas\cursor.py", line 162, in execute
    self.result_set = AthenaPandasResultSet(
                      ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyathena\pandas\result_set.py", line 143, in __init__
    df = self._as_pandas()
         ^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyathena\pandas\result_set.py", line 386, in _as_pandas
    df = self._read_csv()
         ^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyathena\pandas\result_set.py", line 269, in _read_csv
    length = self._get_content_length()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyathena\result_set.py", line 443, in _get_content_length
    raise OperationalError(*e.args) from e
pyathena.error.OperationalError: An error occurred (404) when calling the HeadObject operation: Not Found
@laughingman7743
Owner

laughingman7743 commented Feb 28, 2025

Maybe the endpoint_url setting isn't being passed to the S3 client. 🤔
https://github.com/laughingman7743/PyAthena/blob/master/pyathena/pandas/result_set.py#L170-L178

@KitauchiShinji
Author

Thank you for your reply.
I think the connection is passed to S3FileSystem, but connection._client_kwargs contains endpoint_url, so the Athena endpoint URL is also applied to the S3 client:

if connection:
    self._client = connection.session.client(
        "s3",
        region_name=connection.region_name,
        config=connection.config,
        **connection._client_kwargs,
    )

# test under my environment
from pyathena import connect
from pyathena.pandas.cursor import PandasCursor

cursor = connect(
    work_group='XXXXXXX',
    endpoint_url='https://vpce-XXXXXXX.athena.XXXXXXX.vpce.amazonaws.com',
    region_name='XXXXXXX').cursor(PandasCursor)

print(cursor._connection._client_kwargs)
# {'endpoint_url': 'https://vpce-XXXXXXX.athena.XXXXXXX.vpce.amazonaws.com'}
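
One possible direction, sketched here under assumptions rather than as PyAthena's actual fix: filter endpoint_url out of the kwargs before they reach the S3 client, and optionally substitute an S3-specific endpoint. The helper name `s3_client_kwargs` and the `s3_endpoint_url` parameter are hypothetical and do not exist in PyAthena.

```python
from typing import Any, Dict, Optional


def s3_client_kwargs(
    client_kwargs: Dict[str, Any],
    s3_endpoint_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a copy of client_kwargs suitable for an S3 client.

    Hypothetical helper, not part of PyAthena. The Athena endpoint_url
    is dropped so it is not reused for S3 requests.
    """
    # Copy everything except the Athena-specific endpoint.
    kwargs = {k: v for k, v in client_kwargs.items() if k != "endpoint_url"}
    # Optionally point the S3 client at its own endpoint (assumed parameter).
    if s3_endpoint_url is not None:
        kwargs["endpoint_url"] = s3_endpoint_url
    return kwargs


athena_kwargs = {
    "endpoint_url": "https://vpce-XXXXXXX.athena.XXXXXXX.vpce.amazonaws.com"
}
print(s3_client_kwargs(athena_kwargs))
```

With kwargs filtered this way, the S3 client would fall back to boto3's default S3 endpoint resolution unless an S3 endpoint is given explicitly.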

@laughingman7743
Owner

laughingman7743 commented Mar 3, 2025

  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyathena\pandas\result_set.py", line 269, in _read_csv
    length = self._get_content_length()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Kitauchi.Shinji\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyathena\result_set.py", line 443, in _get_content_length
    raise OperationalError(*e.args) from e
pyathena.error.OperationalError: An error occurred (404) when calling the HeadObject operation: Not Found

https://github.com/laughingman7743/PyAthena/blob/master/pyathena/pandas/result_set.py#L264
https://github.com/laughingman7743/PyAthena/blob/master/pyathena/result_set.py#L428
The error occurs while checking the size of the CSV result file. The file cannot be found for some reason. Are you sure that the file actually exists in S3? There may be another bug.
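
Whether the result file exists can be checked independently of PyAthena. The following is a minimal sketch, assuming the query's output location is known as an `s3://bucket/key` URI and that an S3 client is available that does not reuse the Athena endpoint_url. The helper names `parse_s3_uri` and `result_size` are hypothetical; `head_object` is the same S3 operation that fails in the traceback.

```python
from typing import Any, Tuple
from urllib.parse import urlparse


def parse_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def result_size(s3_client: Any, output_location: str) -> int:
    """Return the size in bytes of the object at output_location.

    s3_client is expected to behave like a boto3 S3 client; a missing
    object makes head_object raise botocore's ClientError (404).
    """
    bucket, key = parse_s3_uri(output_location)
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return response["ContentLength"]


# Example usage (requires boto3 and network access to S3):
# import boto3
# s3 = boto3.client("s3", region_name="XXXXXXX")
# print(result_size(s3, "s3://XXXXXXX/XXXXXXX.csv"))
```

If this call succeeds with a plain S3 client but the same object returns 404 through the cursor, that would point to the endpoint configuration rather than a missing file.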
