3 changes: 3 additions & 0 deletions docs/install.rst
@@ -86,6 +86,9 @@ hdf5

Note that additional software also needs to be installed.

parquet
For using :ref:`Parquet files <io_parquet>` via pandas.

remote
For reading and writing from :ref:`Remote Sources <io_remotes>` with `fsspec`.

12 changes: 10 additions & 2 deletions docs/io.rst
@@ -390,8 +390,16 @@ Avro files (fastavro)
:start-after: begin_complex_schema
:end-before: end_complex_schema

-.. module:: petl.io.gsheet
-.. _io_gsheet:
+.. module:: petl.io.parquet
+.. _io_parquet:
+
+Parquet files
+^^^^^^^^^^^^^
+
+These functions read and write Parquet via pandas:
+
+.. autofunction:: petl.io.parquet.fromparquet
+.. autofunction:: petl.io.parquet.toparquet

Google Sheets (gspread)
^^^^^^^^^^^^^^^^^^^^^^^
2 changes: 2 additions & 0 deletions petl/io/__init__.py
@@ -45,3 +45,5 @@
from petl.io.remotes import SMBSource

from petl.io.gsheet import fromgsheet, togsheet, appendgsheet

from petl.io.parquet import fromparquet, toparquet

Check warning

Code scanning / Ruff (reported by Codacy)

`petl.io.parquet.fromparquet` imported but unused; consider removing, adding to `__all__`, or using a redundant alias (F401)

Check warning

Code scanning / Ruff (reported by Codacy)

`petl.io.parquet.toparquet` imported but unused; consider removing, adding to `__all__`, or using a redundant alias (F401)

Check warning

Code scanning / Prospector (reported by Codacy)

'petl.io.parquet.fromparquet' imported but unused (F401)
64 changes: 64 additions & 0 deletions petl/io/parquet.py
@@ -0,0 +1,64 @@
# -*- coding: utf-8 -*-

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Missing module docstring

Check warning

Code scanning / Pylint (reported by Codacy)

Missing module docstring
from __future__ import absolute_import, print_function, division

# standard library dependencies
from petl.compat import PY2

Check warning

Code scanning / Ruff (reported by Codacy)

`petl.compat.PY2` imported but unused (F401)

Check warning

Code scanning / Prospector (reported by Codacy)

Unused PY2 imported from petl.compat (unused-import)

Check notice

Code scanning / Pylintpython3 (reported by Codacy)

Unused PY2 imported from petl.compat

Check notice

Code scanning / Pylint (reported by Codacy)

Unused PY2 imported from petl.compat
from petl.io.pandas import fromdataframe, todataframe
# internal dependencies
from petl.util.base import Table
from petl.io.sources import read_source_from_arg, write_source_from_arg


# third-party dependencies
import pandas as pd

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

third party import "pandas" should be placed before first party imports "petl.compat.PY2", "petl.io.pandas.fromdataframe", "petl.util.base.Table", "petl.io.sources.read_source_from_arg"


def fromparquet(source=None, **kwargs):
    """
    Extract data from a Parquet file and return as a PETL table.
    The input can be a local filesystem path or any URL supported by fsspec (e.g., S3, GCS).
    Example:
        >>> import petl as etl
        >>> # read a Parquet file into a PETL table
        ... table = etl.fromparquet('data/example.parquet')
        >>> table
        +-------+------+
        | name  | age  |
        +=======+======+
        | 'Amy' |   22 |
        +-------+------+
        | 'Bob' |   34 |
        +-------+------+
    :param source: path or URL to Parquet file
    :param kwargs: passed through to pandas.read_parquet
    :returns: a PETL Table
    """

    src = read_source_from_arg(source)
    with src.open('rb') as f:

Check warning

Code scanning / Pylint (reported by Codacy)

Variable name "f" doesn't conform to snake_case naming style
        df = pd.read_parquet(f, **kwargs)

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Module 'pandas' has no 'read_parquet' member

Check warning

Code scanning / Pylint (reported by Codacy)

Variable name "df" doesn't conform to snake_case naming style

Check warning

Code scanning / Pylint (reported by Codacy)

Module 'pandas' has no 'read_parquet' member
    return fromdataframe(df)

def toparquet(table, source=None, **kwargs):
    """
    Write a PETL table out to a Parquet file via pandas.
    :param table: PETL table to write
    :param source: filesystem path or fsspec-supported URL for output
    :param kwargs: passed through to pandas.DataFrame.to_parquet
    :returns: the original PETL table
    """
    src = write_source_from_arg(source)
    with src.open('wb') as f:

Check warning

Code scanning / Pylint (reported by Codacy)

Variable name "f" doesn't conform to snake_case naming style
        df = todataframe(table)
        df.to_parquet(f, **kwargs)
    return table



Table.fromparquet = fromparquet
Table.toparquet = toparquet

Check warning

Code scanning / Pylint (reported by Codacy)

Exactly one space required before assignment
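
`fromparquet` above reads the file as soon as it is called. Other petl readers, such as `fromcsv`, return a view object that only opens the source when it is iterated. A rough sketch of that pattern follows. `LazyView` and `fake_read_parquet` are hypothetical names, and the fake reader stands in for `pandas.read_parquet` so the snippet runs without pandas or pyarrow.

```python
class LazyView:
    """Hypothetical table view: defers loading until iteration."""

    def __init__(self, loader):
        # zero-argument callable returning (header, rows)
        self._loader = loader

    def __iter__(self):
        # the source is read here, not in __init__
        header, rows = self._loader()
        yield tuple(header)
        for row in rows:
            yield tuple(row)


load_calls = []


def fake_read_parquet():
    # Stand-in for pandas.read_parquet; records each call so laziness is visible.
    load_calls.append("read")
    return ["name", "age"], [("Amy", 22), ("Bob", 34)]


view = LazyView(fake_read_parquet)
calls_before_iteration = len(load_calls)  # nothing has been read yet
rows = list(view)                         # iterating triggers the read
calls_after_iteration = len(load_calls)
```

Whether the extra class is worth it here is a judgment call; the eager version in the diff is simpler, and the sketch only illustrates the alternative.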
23 changes: 23 additions & 0 deletions petl/test/io/test_parquet.py
@@ -0,0 +1,23 @@
import pandas as pd

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Django was not configured. For more information run pylint --load-plugins=pylint_django --help-msg=django-not-configured

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Missing module docstring

Check warning

Code scanning / Pylint (reported by Codacy)

Missing module docstring
import petl as etl


def make_sample(tmp_path):

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Missing function or method docstring
    df = pd.DataFrame([{'x': 1}, {'x': 2}, {'x': 3}])

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Module 'pandas' has no 'DataFrame' member

Check warning

Code scanning / Pylint (reported by Codacy)

Missing function docstring

Check warning

Code scanning / Pylint (reported by Codacy)

Module 'pandas' has no 'DataFrame' member

Check warning

Code scanning / Pylint (reported by Codacy)

Variable name "df" doesn't conform to snake_case naming style
    path = tmp_path / 'foo.parquet'
    df.to_parquet(path)
    return path


def test_fromparquet(tmp_path):

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Missing function or method docstring
    tbl = etl.io.fromparquet(str(make_sample(tmp_path)))

Check warning

Code scanning / Pylint (reported by Codacy)

Missing function docstring
    assert tbl.header() == ('x',)

Check warning

Code scanning / Bandit (reported by Codacy)

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
    assert list(tbl.values()) == [(1,), (2,), (3,)]

Check warning

Code scanning / Bandit (reported by Codacy)

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.


def test_toparquet(tmp_path):

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Missing function or method docstring
    tbl = etl.fromdicts([{'y': 10}, {'y': 20}])

Check warning

Code scanning / Pylint (reported by Codacy)

Missing function docstring
    out = tmp_path / 'out.parquet'
    tbl.toparquet(str(out))
    df2 = pd.read_parquet(out)

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Module 'pandas' has no 'read_parquet' member
    assert list(df2['y']) == [10, 20]

Check warning

Code scanning / Pylint (reported by Codacy)

Module 'pandas' has no 'read_parquet' member
47 changes: 35 additions & 12 deletions petl/util/base.py
@@ -240,34 +240,57 @@
return r




import operator

Check warning

Code scanning / Ruff (reported by Codacy)

Module level import not at top of file (E402)

Check warning

Code scanning / Ruff (reported by Codacy)

Redefinition of unused `operator` from line 8 (F811)

Check warning

Code scanning / Prospector (reported by Codacy)

Reimport 'operator' (imported line 8) (reimported)

Check warning

Code scanning / Prospector (reported by Codacy)

redefinition of unused 'operator' from line 8 (F811)

Check warning

Code scanning / Prospector (reported by Codacy)

Import "import operator" should be placed at the top of the module (wrong-import-position)

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Imports from package operator are not grouped

Check notice

Code scanning / Pylintpython3 (reported by Codacy)

Reimport 'operator' (imported line 8)

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Import "import operator" should be placed at the top of the module

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

standard import "operator" should be placed before first party imports "petl.compat.imap", "petl.errors.FieldSelectionError", "petl.comparison.comparable_itemgetter"

Check warning

Code scanning / Pylint (reported by Codacy)

standard import "import operator" should be placed before "from petl.compat import imap, izip, izip_longest, ifilter, ifilterfalse, reduce, next, string_types, text_type"

Check warning

Code scanning / Pylint (reported by Codacy)

Import "import operator" should be placed at the top of the module

Check notice

Code scanning / Pylint (reported by Codacy)

Reimport 'operator' (imported line 8)

Check warning

Code scanning / Pylint (reported by Codacy)

Imports from package operator are not grouped

 def itervalues(table, field, **kwargs):
+    """
+    Iterate over the value(s) in the given field(s).
+    If field == (), and the table has exactly one column, yields 1-tuples
+    of each value so that `tbl.values()` on a single-column table returns
+    [(v,), (v,), …]. Otherwise, behaves exactly as before.
+    """
     missing = kwargs.get('missing', None)
     it = iter(table)
     try:
         hdr = next(it)
     except StopIteration:
         hdr = []

+    # which column(s) were requested?
     indices = asindices(hdr, field)
-    assert len(indices) > 0, 'no field selected'
-    getvalue = operator.itemgetter(*indices)
+
+    # special case: no field & single-column table → default to that column
+    if not indices and field == () and len(hdr) == 1:
+        indices = [0]
+
+    assert indices, 'no field selected'

Check warning

Code scanning / Bandit (reported by Codacy)

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

+    getter = operator.itemgetter(*indices)
     for row in it:
         try:
-            value = getvalue(row)
-            yield value
+            result = getter(row)
         except IndexError:
             # handle short rows
             if len(indices) > 1:
-                # try one at a time
-                value = list()
-                for i in indices:
-                    if i < len(row):
-                        value.append(row[i])
-                    else:
-                        value.append(missing)
-                yield tuple(value)
+                vals = [
+                    row[i] if i < len(row) else missing
+                    for i in indices
+                ]
+                yield tuple(vals)
             else:
                 yield missing
+        else:
+            # wrap single result in tuple only for our special single-column case
+            if len(indices) == 1 and field == ():
+                yield (result,)
+            else:
+                yield result

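The behaviour described in the `itervalues` docstring can be tried without petl. The sketch below is a simplified stand-in: `itervalues_sketch` is a hypothetical name, it takes column positions directly instead of resolving field names through petl's `asindices`, and it raises `ValueError` where the diff uses `assert`.

```python
import operator


def itervalues_sketch(table, indices=(), missing=None):
    """Stand-in for the patched itervalues, working on column positions."""
    it = iter(table)
    try:
        hdr = next(it)
    except StopIteration:
        hdr = []

    indices = list(indices)
    wrap_single = False
    # special case from the diff: no field requested and a one-column table
    if not indices and len(hdr) == 1:
        indices = [0]
        wrap_single = True
    if not indices:
        raise ValueError('no field selected')

    getter = operator.itemgetter(*indices)
    for row in it:
        try:
            result = getter(row)
        except IndexError:
            # short row: pad each missing position
            if len(indices) > 1:
                yield tuple(row[i] if i < len(row) else missing
                            for i in indices)
            else:
                # as in the diff, a short row yields the bare missing value
                yield missing
        else:
            # 1-tuples only in the single-column special case
            yield (result,) if wrap_single else result


single = [('x',), (1,), (2,), (3,)]
two_columns = [('a', 'b'), (1, 2), (3,)]

single_values = list(itervalues_sketch(single))
padded_values = list(itervalues_sketch(two_columns, indices=(0, 1)))
one_field_values = list(itervalues_sketch(two_columns, indices=(1,)))
```

Under these assumptions, the single-column table yields 1-tuples, while an explicit single field still yields bare values.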




class TableWrapper(Table):
4 changes: 4 additions & 0 deletions requirements-docs.txt
@@ -8,3 +8,7 @@ rinohtype

setuptools
setuptools-scm

# add parquet dependencies
pandas
pyarrow
4 changes: 3 additions & 1 deletion requirements-tests.txt
@@ -6,4 +6,6 @@ pytest>=4.6.6,<7.0.0
tox
coverage
coveralls
-mock; python_version < '3.0'
+mock; python_version < '3.0'
+pandas>=1.0
+pyarrow>=3.0.0
1 change: 1 addition & 0 deletions setup.py
@@ -34,6 +34,7 @@
'xlsx': ['openpyxl>=2.6.2'],
'xpath': ['lxml>=4.4.0'],
'whoosh': ['whoosh'],
'parquet': ['pandas>=1.3.0', 'pyarrow>=4.0.0'],

Check warning

Code scanning / Pylint (reported by Codacy)

Exactly one space required after comma
},
use_scm_version={
"version_scheme": "guess-next-dev",