extract conditions and packages (#257)
* updated platformdirs 3.10.0 -> 4.1.0 in requirements

* updated packaging 22.0 -> 23.2  in requirements

* pip-compile requirements

* updated known pruefis

* added conditions flavour

* moved conditions parsing to flavour conditions

* 🩹corrected loop in process_pruefi

* ➕added package table and functions to extract conditions

* removed unused arguments in process_package_conditions

* ➕added tests for packagetable, linting issue

* linting/typechecking

* ➕added edifactformat -> file mapping

* move test docx files

* Add new test files into new folder structure

* ✅ Update tests to new test data folder structure

* ✅ Add unit test to compare csv export files

* 📝 Add information about how many pages each format has

* 🎨 Improve handling of output-path

* ✅ fix test

* ✅ Add test for change history

* 🎨 Move changehistory functions into an extra module

* 🎨 Fix imports

* 🎨 Use function to check for change history section

* 🎨 split commands into separate modules

* 🚛 move dump conditions function

* ✅ harmonize cli tests

* 🚛 move pruefi command

* 🚛 rename changehistory.function to __init__.py

* 🎨 clean up imports

* 🎨 clean up arguments for pruefi command

* 🎨 add enum for ahbexportfileformat

* ✅ fix tests

* 🚛 move changehistory command

* ✅ test the cli command pruefi directly

* 🚛 move cli test for changehistory in extra file

* 🔥 remove unused imports

* 🚛 rename cli pruefi module

* 📝 add documentation

* ➕➖ replace attrs with pydantic and use pyproject.toml for dependencies

* 🔄 Replace attrs with pydantic

* 🔥 remove unused code

* 🎨 use enum for file exports

* 🎨 Use the click validation for filetype and make it required

* 🎨 use functions to check for paragraph and table kinds

* 🎨 Use ConfigDict to remove DeprecationWarning

* ➕ add freezegun to mock datetime.now() in the tests

* 🎨 use timezone.utc instead of UTC

* 🚨remove unused imports

* 🚧 WIP of the get_ahb_table rework

* 🎉🚧 Finally COMDIS is there

this commit fixes the issue that all pruefis above the change history section were not exported

* 🚨remove unused import

* ✅ Improve the test to check the current state of the cli tool

* ✅ clean up tests

* ✅ Use sort instead of sorted

* 🎨 Further improvements of the get_ahb_table function

* 🎨 remove warning after tests

src/kohlrahbi/pruefis/__init__.py:111
  /Users/kevin/workspaces/hochfrequenz/kohlrahbi/src/kohlrahbi/pruefis/__init__.py:111: SyntaxWarning: invalid escape sequence '\d'
    """

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
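For context, a minimal sketch of the kind of fix such a warning usually needs; the helper below is hypothetical and is not the actual code in `src/kohlrahbi/pruefis/__init__.py`. A `\d` inside an ordinary (non-raw) docstring or string literal triggers the `SyntaxWarning`, while a raw string keeps the regex-style notation intact:

```python
import re


def is_valid_pruefi(pruefi: str) -> bool:
    r"""
    Hypothetical example: this raw-string docstring may mention \d{5} without
    emitting "SyntaxWarning: invalid escape sequence '\d'".
    """
    # Regex patterns should also be raw strings for the same reason.
    return bool(re.fullmatch(r"\d{5}", pruefi))
```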

* ➕ Add pydantic to pyproject.toml

* 🚨 Fix linter warnings

* fixed tests

* unittests changed path formats

* linting

* 🚨 fix further linter warnings

* ✅ add test file to test cli conditions command

* WIP

* WIP

* restructured and cleaned ahbconditions, read_functions and packagetable

* 🩹 add missing function after merge

* 🩹 fix imports

* 🩹fix more imports

* 🚧 WIP

* refactoring conditions and packagetables

* WIP

* WIP2

* added test for conditions/__init__.py

* WIP testing

* added more tests for read_functions

* 🩹linting/type_check

* removed expected json from .gitignore

* Added test, removed unused function

* added even more tests

* updated readme

* automatically remove testfiles

* solved interference of test_outputs

* changed default output path for conditions to unify all subroutines

* updated readme: --input-path -> --edi-energy-mirror-path/-eemp

* added missing time freeze

* Update src/kohlrahbi/ahbtable/ahbcondtions.py

Co-authored-by: kevin <[email protected]>

* Update src/kohlrahbi/ahbtable/ahbcondtions.py

Co-authored-by: kevin <[email protected]>

* Update src/kohlrahbi/ahbtable/ahbcondtions.py

Co-authored-by: kevin <[email protected]>

* simplified if statements

* Removed condition dict extraction from unfolded ahb table

as it is not used

* reorganized duplicate code

* moved function to parse conditions text due to circular import

* fixed minor issue

* Update src/kohlrahbi/ahbtable/ahbpackagetable.py

Co-authored-by: kevin <[email protected]>

* Update src/kohlrahbi/ahbtable/ahbpackagetable.py

Co-authored-by: kevin <[email protected]>

* simplified unnecessary line

* Update src/kohlrahbi/ahbtable/ahbpackagetable.py

Co-authored-by: kevin <[email protected]>

* reduced call of get_format_of_pruefidentifikator fct.

* updated doc strings and added assume-yes flag to conditions command

* Update src/kohlrahbi/read_functions.py

Co-authored-by: kevin <[email protected]>

* Update src/kohlrahbi/read_functions.py

Co-authored-by: kevin <[email protected]>

* removed unused imports

* Added explanation to duplicate code warning

* Removed unused function

---------

Co-authored-by: hf-krechan <[email protected]>
Co-authored-by: kevin <[email protected]>
3 people committed May 14, 2024
1 parent b757e59 commit 715c179
Showing 85 changed files with 1,424 additions and 275 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -178,6 +178,7 @@ src/kohlrahbi/cache/
#however include
!unittests/test-edi-energy-mirror-repo/edi_energy_de/*/expected-output/**/*.xlsx
!unittests/test-edi-energy-mirror-repo/edi_energy_de/*/expected-output/**/*.csv
!unittests/test-edi-energy-mirror-repo/edi_energy_de/*/expected-output/**/*.json

# Word temporary
~$*.doc*
19 changes: 13 additions & 6 deletions README.md
@@ -115,39 +115,46 @@ kohlrahbi --help
To extract all AHB tables for each pruefi of a specific format version, you can run the following command.

```bash
kohlrahbi ahb --input-path ../edi_energy_mirror/edi_energy_de/ --output-path ./output/ --file-type csv --format-version FV2310
kohlrahbi ahb --edi-energy-mirror-path ../edi_energy_mirror/ --output-path ./output/ --file-type csv --format-version FV2310
```

To extract the AHB tables for a specific pruefi of a specific format version, you can run the following command.

```bash
kohlrahbi ahb --input-path ../edi_energy_mirror/edi_energy_de/ --output-path ./output/ --file-type csv --pruefis 13002 --format-version FV2310
kohlrahbi ahb -eemp ../edi_energy_mirror/ --output-path ./output/ --file-type csv --pruefis 13002 --format-version FV2310
```

You can also provide multiple pruefis.

```bash
kohlrahbi ahb --input-path ../edi_energy_mirror/edi_energy_de/ --output-path ./output/ --file-type csv --pruefis 13002 --pruefis 13003 --pruefis 13005 --format-version FV2310
kohlrahbi ahb -eemp ../edi_energy_mirror/ --output-path ./output/ --file-type csv --pruefis 13002 --pruefis 13003 --pruefis 13005 --format-version FV2310
```

And you can also provide multiple file types.

```bash
kohlrahbi ahb --input-path ../edi_energy_mirror/edi_energy_de/ --output-path ./output/ --file-type csv --file-type xlsx --file-type flatahb --pruefis 13002 --format-version FV2310
kohlrahbi ahb -eemp ../edi_energy_mirror/ --output-path ./output/ --file-type csv --file-type xlsx --file-type flatahb --pruefis 13002 --format-version FV2310
```

### Extract all conditions

To extract all conditions for each format of a specific format version, you can run the following command.

```bash
kohlrahbi conditions --input-path ../edi_energy_mirror/edi_energy_de/ --output-path ./output/ --format-version FV2310
kohlrahbi conditions -eemp ../edi_energy_mirror/ --output-path ./output/ --format-version FV2310
```
This will provide you with:
* all conditions
* all packages

found in all AHBs (including the condition texts from the package tables) within the specified folder containing the .docx files.
The output will be saved for each EDIFACT format separately as `conditions.json` and `packages.json` in the specified output path.
Please note that the condition information collected here may be more comprehensive than the information collected for the AHBs above, because `conditions` uses a different routine than `ahb`.
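As a rough sketch (not part of the kohlrahbi CLI itself), the exported files can be read back with plain `json`; the directory name and the UTILMD format below are examples, since one subdirectory is created per EDIFACT format found in the mirror:

```python
import json
from pathlib import Path

# Example path: --output-path ./output/ and the UTILMD format; adjust to your run.
output_dir = Path("./output/UTILMD")

with open(output_dir / "conditions.json", encoding="utf-8") as file:
    conditions = json.load(file)  # entries: condition_key, condition_text, edifact_format

with open(output_dir / "packages.json", encoding="utf-8") as file:
    packages = json.load(file)  # entries: package_key, package_expression, edifact_format

condition_texts = {entry["condition_key"]: entry["condition_text"] for entry in conditions}
print(f"{len(condition_texts)} conditions and {len(packages)} packages for UTILMD")
```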

### Extract change history

```bash
kohlrahbi changehistory --input-path ../edi_energy_mirror/edi_energy_de/ --output-path ./output/ --format-version FV2310
kohlrahbi changehistory -eemp ../edi_energy_mirror/ --output-path ./output/ --format-version FV2310
```

## `.docx` Data Sources
6 changes: 5 additions & 1 deletion requirements.txt
@@ -1,5 +1,5 @@
#
# This file is autogenerated by pip-compile with Python 3.11
# This file is autogenerated by pip-compile with Python 3.12
# by the following command:
#
# pip-compile pyproject.toml
@@ -10,6 +10,10 @@ attrs==23.2.0
    # via maus
click==8.1.7
    # via kohlrahbi (pyproject.toml)
colorama==0.4.6
    # via
    #   click
    #   colorlog
colorlog==6.8.2
    # via kohlrahbi (pyproject.toml)
et-xmlfile==1.1.0
1 change: 1 addition & 0 deletions src/_kohlrahbi_version.py
@@ -0,0 +1 @@
version = "0.4.2.dev91+g53b6228"
32 changes: 0 additions & 32 deletions src/kohlrahbi/ahb/__init__.py
@@ -43,24 +43,6 @@ def load_pruefi_docx_file_map_from_file(path_to_pruefi_docx_file_map_file: Path)
    return pruefi_docx_file_map


def get_or_cache_document(ahb_file_path: Path, path_to_document_mapping: dict) -> Document:
    """
    Get the document from the cache or read it from the file system.
    """
    if ahb_file_path not in path_to_document_mapping:
        if not ahb_file_path.exists():
            logger.warning("The file '%s' does not exist", ahb_file_path)
            raise FileNotFoundError(f"The file '{ahb_file_path}' does not exist")
        try:
            doc = docx.Document(str(ahb_file_path))
            path_to_document_mapping[ahb_file_path] = doc
            logger.debug("Saved %s document in cache", ahb_file_path)
        except IOError as ioe:
            logger.exception("There was an error opening the file '%s'", ahb_file_path, exc_info=True)
            raise click.Abort() from ioe
    return path_to_document_mapping[ahb_file_path]


def process_ahb_table(
    ahb_table: AhbTable,
    pruefi: str,
@@ -204,20 +186,6 @@ def table_header_contains_text_pruefidentifikator(table: Table) -> bool:
    return table.row_cells(0)[-1].paragraphs[-1].text.startswith("Prüfidentifikator")


def create_pruefi_docx_filename_map(format_version: EdifactFormatVersion, edi_energy_mirror_path: Path):
    """Creates a mapping of pruefis to their corresponding docx files."""

    ahb_documents_path = get_ahb_documents_path(edi_energy_mirror_path, format_version)

    pruefis = find_pruefidentifikatoren(ahb_documents_path)

    if not pruefis:
        log_no_pruefis_warning(format_version.value, ahb_documents_path)
        pruefis = get_default_pruefi_map(ahb_documents_path)

    save_pruefi_map_to_toml(pruefis, format_version.value)


def get_pruefi_to_file_mapping(basic_input_path: Path, format_version: EdifactFormatVersion) -> dict[str, str]:
    """Returns the pruefi to file mapping. If the cache file does not exist, it creates it."""
    default_path_to_cache_file = Path(__file__).parents[1] / "cache" / f"{format_version}_pruefi_docx_filename_map.toml"
124 changes: 124 additions & 0 deletions src/kohlrahbi/ahbtable/ahbcondtions.py
@@ -0,0 +1,124 @@
"""This module contains the ahbconditions class."""

import json
import re
from pathlib import Path

from docx.table import Table as DocxTable # type: ignore[import-untyped]
from maus.edifact import EdifactFormat
from pydantic import BaseModel, ConfigDict

from kohlrahbi.logger import logger


class AhbConditions(BaseModel):
"""
Class which contains a dict of conditions for each edifact format
"""

conditions_dict: dict[EdifactFormat, dict[str, str]] = {}

model_config = ConfigDict(arbitrary_types_allowed=True)

@classmethod
def from_docx_table(cls, docx_tables: list[DocxTable], edifact_format: EdifactFormat) -> "AhbConditions":
"""
Create an AhbPackageTable object from a docx table.
"""
table_data = []
for table in docx_tables:
for row in table.rows:
if row.cells[-1].text and row.cells[0].text != "EDIFACT Struktur":
row_data = row.cells[-1].text
table_data.append(row_data)

conditions_dict = {}
are_there_conditions = len(table_data) > 0
if are_there_conditions:
conditions_dict = AhbConditions.collect_conditions(
conditions_list=table_data, edifact_format=edifact_format
)

return cls(conditions_dict=conditions_dict)

@staticmethod
def collect_conditions(
conditions_list: list[str], edifact_format: EdifactFormat
) -> dict[EdifactFormat, dict[str, str]]:
"""collect conditions from list of all conditions and store them in conditions dict."""
conditions_dict: dict[EdifactFormat, dict[str, str]] = {edifact_format: {}}

conditions_str = "".join(conditions_list)
conditions_dict = parse_conditions_from_string(conditions_str, edifact_format, conditions_dict)
logger.info("The package conditions for %s were collected.", edifact_format)
return conditions_dict

def include_condition_dict(self, to_add=dict[EdifactFormat, dict[str, str]] | None) -> None:
""" " Include a dict of conditions to the conditions_dict"""
if to_add is None:
logger.info("Conditions dict to be added is empty.")
for edifact_format, edi_cond_dict in to_add.items():
for condition_key, condition_text in edi_cond_dict.items():
if edifact_format in self.conditions_dict:
if (
condition_key in self.conditions_dict[edifact_format]
and len(condition_text) > len(self.conditions_dict[edifact_format][condition_key])
or condition_key not in self.conditions_dict[edifact_format]
):
self.conditions_dict[edifact_format][condition_key] = condition_text
else:
self.conditions_dict[edifact_format] = {condition_key: condition_text}

logger.info("Conditions were updated.")

def dump_as_json(self, output_directory_path: Path) -> None:
"""
Writes all collected conditions to a json file.
The file will be stored in the directory:
'output_directory_path/<edifact_format>/conditions.json'
"""
for edifact_format, format_cond_dict in self.conditions_dict.items():
condition_json_output_directory_path = output_directory_path / str(edifact_format)
condition_json_output_directory_path.mkdir(parents=True, exist_ok=True)
file_path = condition_json_output_directory_path / "conditions.json"
# resort ConditionKeyConditionTextMappings for output
sorted_condition_dict = {k: format_cond_dict[k] for k in sorted(format_cond_dict, key=int)}
array = [
{"condition_key": i, "condition_text": sorted_condition_dict[i], "edifact_format": edifact_format}
for i in sorted_condition_dict
]
with open(file_path, "w", encoding="utf-8") as file:
json.dump(array, file, ensure_ascii=False, indent=2)

logger.info(
"The conditions.json file for %s is saved at %s",
edifact_format,
file_path,
)


def parse_conditions_from_string(
conditions_text: str, edifact_format: EdifactFormat, conditions_dict: dict[EdifactFormat, dict[str, str]]
) -> dict[EdifactFormat, dict[str, str]]:
"""
Takes string with some conditions and sorts it into a dict.
"""
# Split the input into parts enclosed in square brackets and other parts
matches = re.findall(
r"\[(\d+)](.*?)(?=\[\d+]|$)",
conditions_text,
re.DOTALL,
)
for match in matches:
# make text prettier:
text = match[1].strip()
text = re.sub(r"\s+", " ", text)

# check whether condition was already collected:
existing_text = conditions_dict[edifact_format].get(match[0])
is_condition_key_collected_yet = existing_text is not None
if is_condition_key_collected_yet and existing_text is not None:
key_exits_but_shorter_text = len(text) > len(existing_text)
if not is_condition_key_collected_yet or key_exits_but_shorter_text:
conditions_dict[edifact_format][match[0]] = text
return conditions_dict
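A brief usage sketch of `parse_conditions_from_string` as defined above; the condition keys and German condition texts are invented for illustration, and `EdifactFormat.UTILMD` is assumed to be importable from `maus.edifact`:

```python
from maus.edifact import EdifactFormat

from kohlrahbi.ahbtable.ahbcondtions import parse_conditions_from_string

# Invented example; real texts come from the rightmost cells of the AHB docx tables.
raw_text = "[1] Wenn Zählerstand vorhanden   [2] Wenn kein   Zählerstand vorhanden"

conditions: dict[EdifactFormat, dict[str, str]] = {EdifactFormat.UTILMD: {}}
conditions = parse_conditions_from_string(raw_text, EdifactFormat.UTILMD, conditions)

# Whitespace is collapsed and the longer text wins if a key appears twice:
# {EdifactFormat.UTILMD: {'1': 'Wenn Zählerstand vorhanden',
#                         '2': 'Wenn kein Zählerstand vorhanden'}}
print(conditions)
```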
125 changes: 125 additions & 0 deletions src/kohlrahbi/ahbtable/ahbpackagetable.py
@@ -0,0 +1,125 @@
"""
class which contains AHB package condition table
"""

import json
import re
from pathlib import Path

import pandas as pd
from docx.table import Table as DocxTable # type: ignore[import-untyped]
from maus.edifact import EdifactFormat
from pydantic import BaseModel, ConfigDict

from kohlrahbi.ahbtable.ahbcondtions import parse_conditions_from_string
from kohlrahbi.logger import logger


class AhbPackageTable(BaseModel):
"""
This class contains the AHB Package table as you see it in the beginning AHB documents,
but in a machine readable format.
Caution: if two PackageTables objects are combined so far only the package_dict field is updated.
"""

table: pd.DataFrame = pd.DataFrame()
package_dict: dict[EdifactFormat, dict[str, str]] = {}
model_config = ConfigDict(arbitrary_types_allowed=True)

@classmethod
def from_docx_table(cls, docx_tables: list[DocxTable]) -> "AhbPackageTable":
"""
Create an AhbPackageTable object from a docx table.
"""
table_data = []
for table in docx_tables:
for row in table.rows:
row_data = [cell.text for cell in row.cells]
table_data.append(row_data)

headers = table_data[0]
data = table_data[1:]
df = pd.DataFrame(data, columns=headers)
return cls(table=df)

def provide_conditions(self, edifact_format: EdifactFormat) -> dict[EdifactFormat, dict[str, str]]:
"""collect conditions from package table and store them in conditions dict."""
conditions_dict: dict[EdifactFormat, dict[str, str]] = {edifact_format: {}}
there_are_conditions = (self.table["Bedingungen"] != "").any()
if there_are_conditions:
for conditions_text in self.table["Bedingungen"][self.table["Bedingungen"] != ""]:
conditions_dict = parse_conditions_from_string(conditions_text, edifact_format, conditions_dict)
logger.info("The package conditions for %s were collected.", edifact_format)
return conditions_dict

def provide_packages(self, edifact_format: EdifactFormat):
"""collect conditions from package table and store them in conditions dict."""
package_dict: dict[EdifactFormat, dict[str, str]] = {edifact_format: {}}

there_are_packages = (self.table["Paket"] != "").any()
if there_are_packages:
for _, row in self.table.iterrows():
package = row["Paket"]
# Use re.search to find the first match
match = re.search(r"\[(\d+)P\]", package)
if not match:
raise ValueError("No valid package key found in the package column.")
# Extract the matched digits
package = match.group(1)
if package != "1":
package_conditions = row["Paketvoraussetzung(en)"].strip()
# check whether package was already collected:
existing_text = package_dict[edifact_format].get(package)
is_package_key_collected_yet = existing_text is not None
if is_package_key_collected_yet:
key_exits_but_shorter_text = len(package_conditions) > len(
existing_text # type: ignore[arg-type]
) # type: ignore[arg-type]
if not is_package_key_collected_yet or key_exits_but_shorter_text:
package_dict[edifact_format][package] = package_conditions

logger.info("Packages for %s were collected.", edifact_format)
self.package_dict = package_dict

def include_package_dict(self, to_add=dict[EdifactFormat, dict[str, str]] | None) -> None:
"""Include a dict of conditions to the conditions_dict"""
if to_add is None:
logger.info("Packages dict to be added is empty.")
for edifact_format, edi_cond_dict in to_add.items():
for package_key, package_conditions in edi_cond_dict.items():
if edifact_format in self.package_dict:
if (
package_key in self.package_dict[edifact_format]
and len(package_conditions) > len(self.package_dict[edifact_format][package_key])
or package_key not in self.package_dict[edifact_format]
):
self.package_dict[edifact_format][package_key] = package_conditions
else:
self.package_dict[edifact_format] = {package_key: package_conditions}

logger.info("Packages were updated.")

def dump_as_json(self, output_directory_path: Path) -> None:
"""
Writes all collected packages to a json file.
The file will be stored in the directory:
'output_directory_path/<edifact_format>/conditions.json'
"""
for edifact_format, format_pkg_dict in self.package_dict.items():
package_json_output_directory_path = output_directory_path / str(edifact_format)
package_json_output_directory_path.mkdir(parents=True, exist_ok=True)
file_path = package_json_output_directory_path / "packages.json"
# resort PackageKeyConditionTextMappings for output
sorted_package_dict = {k: format_pkg_dict[k] for k in sorted(format_pkg_dict, key=int)}
array = [
{"package_key": i + "P", "package_expression": sorted_package_dict[i], "edifact_format": edifact_format}
for i in sorted_package_dict
]
with open(file_path, "w", encoding="utf-8") as file:
json.dump(array, file, ensure_ascii=False, indent=2)

logger.info(
"The package.json file for %s is saved at %s",
edifact_format,
file_path,
)
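To illustrate how the class above fits together, here is a hedged sketch that builds the underlying DataFrame directly instead of reading a docx file; the column names (`Paket`, `Paketvoraussetzung(en)`, `Bedingungen`) match the code above, while the rows are invented and `EdifactFormat.UTILMD` is assumed to exist in `maus.edifact`:

```python
from pathlib import Path

import pandas as pd
from maus.edifact import EdifactFormat

from kohlrahbi.ahbtable.ahbpackagetable import AhbPackageTable

# Invented rows; in kohlrahbi the table is built via AhbPackageTable.from_docx_table().
df = pd.DataFrame(
    {
        "Paket": ["[2P]", "[3P]"],
        "Paketvoraussetzung(en)": ["[1] Wenn Zählerstand vorhanden", "[2] Wenn Marktlokation vorhanden"],
        "Bedingungen": ["", ""],
    }
)

package_table = AhbPackageTable(table=df)
package_table.provide_packages(EdifactFormat.UTILMD)
# The numeric key of e.g. "[2P]" is extracted and mapped to its package expression:
# {EdifactFormat.UTILMD: {'2': '[1] Wenn Zählerstand vorhanden', '3': '[2] Wenn Marktlokation vorhanden'}}
print(package_table.package_dict)

# Writing packages.json per EDIFACT format works the same way as for conditions:
package_table.dump_as_json(Path("./output"))
```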