extract conditions and packages (#257)
* updated platformdirs 3.10.0 -> 4.1.0 in requirements

* updated packaging 22.0 -> 23.2  in requirements

* pip-compile requirements

* updated known pruefis

* added conditions flavour

* moved conditions parsing to flavour conditions

* 🩹corrected loop in process_pruefi

* ➕added package table and functions to extract conditions

* removed unused arguments in process_package_conditions

* ➕added tests for packagetable, linting issue

* linting/typechecking

* ➕added edifactformat -> file mapping

* move test docx files

* Add new test files into new folder structure

* ✅ Update tests to new test data folder structure

* ✅ Add unit test to compare csv export files

* 📝 Add information about how many pages each format has

* 🎨 Improve handling of output-path

* ✅ fix test

* ✅ Add test for change history

* 🎨 Move changehistory functions into an extra module

* 🎨 Fix imports

* 🎨 Use function to check for change history section

* 🎨 split commands into separate modules

* 🚛 move dump conditions function

* ✅ harmonize cli tests

* 🚛 move pruefi command

* 🚛 rename changehistory.function to __init__.py

* 🎨 clean up imports

* 🎨 clean up arguments for pruefi command

* 🎨 add enum for ahbexportfileformat

* ✅ fix tests

* 🚛 move changehistory command

* ✅ test the cli command pruefi directly

* 🚛 move cli test for changehistory in extra file

* 🔥 remove unused imports

* 🚛 rename cli pruefi module

* 📝 add documentation

* ➕➖ replace attrs with pydantic and use pyproject.toml for dependencies

* 🔄 Replace attrs with pydantic

* 🔥 remove unused code

* 🎨 use enum for file exports

* 🎨 Use the click validation for filetype and make it required

* 🎨 use functions to check for paragraph and table kinds

* 🎨 Use ConfigDict to remove DeprecationWarning

* ➕ add freezegun to mock datetime.now() in the tests

* 🎨 use timezone.utc instead of UTC

* 🚨remove unused imports

* 🚧 WIP of the get_ahb_table rework

* 🎉🚧 Finally COMDIS is there

this commit fixes the issue that all pruefis above the change history section were not exported

* 🚨remove unused import

* ✅ Improve the test to check the current state of the cli tool

* ✅ clean up tests

* ✅ Use sort instead of sorted

* 🎨 Further improvements of the get_ahb_table function

* 🎨 remove warning after tests

src/kohlrahbi/pruefis/__init__.py:111
  /Users/kevin/workspaces/hochfrequenz/kohlrahbi/src/kohlrahbi/pruefis/__init__.py:111: SyntaxWarning: invalid escape sequence '\d'
    """

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
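For context, a minimal sketch of the kind of fix such a warning usually needs; the helper below is hypothetical and is not the actual code in `src/kohlrahbi/pruefis/__init__.py`. A `\d` inside an ordinary (non-raw) docstring or string literal triggers the `SyntaxWarning`, while a raw string keeps the regex-style notation intact:

```python
import re


def is_valid_pruefi(pruefi: str) -> bool:
    r"""
    Hypothetical example: this raw-string docstring may mention \d{5} without
    emitting "SyntaxWarning: invalid escape sequence '\d'".
    """
    # Regex patterns should also be raw strings for the same reason.
    return bool(re.fullmatch(r"\d{5}", pruefi))
```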

* ➕ Add pydantic to pyproject.toml

* 🚨 Fix linter warnings

* fixed tests

* unittests changed path formats

* linting

* 🚨 fix further linter warnings

* ✅ add test file to test cli conditions command

* WIP

* WIP

* restructured and cleaned ahbconditions, read_functions and packagetable

* 🩹 add missing function after merge

* 🩹 fix imports

* 🩹fix more imports

* 🚧 WIP

* refactoring conditions and packagetables

* WIP

* WIP2

* added test for conditions/__init__.py

* WIP testing

* added more tests for read_functions

* 🩹linting/type_check

* removed expected json from .gitignore

* Added test, removed unused function

* added even more tests

* updated readme

* automatically remove testfiles

* solved interference of test_outputs

* changed default output path for conditions to unify all subroutines

* updated readme: --input-path -> --edi-energy-mirror-path/-eemp

* added missing time freeze

* Update src/kohlrahbi/ahbtable/ahbcondtions.py

Co-authored-by: kevin <[email protected]>

* Update src/kohlrahbi/ahbtable/ahbcondtions.py

Co-authored-by: kevin <[email protected]>

* Update src/kohlrahbi/ahbtable/ahbcondtions.py

Co-authored-by: kevin <[email protected]>

* simplified if statements

* Removed condition dict extraction from unfolded ahb table

as it is not used

* reorganized duplicate code

* moved function to parse conditions text due to circular import

* fixed minor issue

* Update src/kohlrahbi/ahbtable/ahbpackagetable.py

Co-authored-by: kevin <[email protected]>

* Update src/kohlrahbi/ahbtable/ahbpackagetable.py

Co-authored-by: kevin <[email protected]>

* simplified unnecessary line

* Update src/kohlrahbi/ahbtable/ahbpackagetable.py

Co-authored-by: kevin <[email protected]>

* reduced call of get_format_of_pruefidentifikator fct.

* updated doc strings and added assume-yes flag to conditions command

* Update src/kohlrahbi/read_functions.py

Co-authored-by: kevin <[email protected]>

* Update src/kohlrahbi/read_functions.py

Co-authored-by: kevin <[email protected]>

* removed unused imports

* Added explanation to duplicate code warning

* Removed unused function

---------

Co-authored-by: hf-krechan <[email protected]>
Co-authored-by: kevin <[email protected]>
3 people committed May 14, 2024
1 parent b757e59 commit 715c179
Showing 85 changed files with 1,424 additions and 275 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -178,6 +178,7 @@ src/kohlrahbi/cache/
#however include
!unittests/test-edi-energy-mirror-repo/edi_energy_de/*/expected-output/**/*.xlsx
!unittests/test-edi-energy-mirror-repo/edi_energy_de/*/expected-output/**/*.csv
!unittests/test-edi-energy-mirror-repo/edi_energy_de/*/expected-output/**/*.json

# Word temporary
~$*.doc*
19 changes: 13 additions & 6 deletions README.md
@@ -115,39 +115,46 @@ kohlrahbi --help
To extract all AHB tables for each pruefi of a specific format version, you can run the following command.

```bash
kohlrahbi ahb --input-path ../edi_energy_mirror/edi_energy_de/ --output-path ./output/ --file-type csv --format-version FV2310
kohlrahbi ahb --edi-energy-mirror-path ../edi_energy_mirror/ --output-path ./output/ --file-type csv --format-version FV2310
```

To extract the AHB tables for a specific pruefi of a specific format version, you can run the following command.

```bash
kohlrahbi ahb --input-path ../edi_energy_mirror/edi_energy_de/ --output-path ./output/ --file-type csv --pruefis 13002 --format-version FV2310
kohlrahbi ahb -eemp ../edi_energy_mirror/ --output-path ./output/ --file-type csv --pruefis 13002 --format-version FV2310
```

You can also provide multiple pruefis.

```bash
kohlrahbi ahb --input-path ../edi_energy_mirror/edi_energy_de/ --output-path ./output/ --file-type csv --pruefis 13002 --pruefis 13003 --pruefis 13005 --format-version FV2310
kohlrahbi ahb -eemp ../edi_energy_mirror/ --output-path ./output/ --file-type csv --pruefis 13002 --pruefis 13003 --pruefis 13005 --format-version FV2310
```

And you can also provide multiple file types.

```bash
kohlrahbi ahb --input-path ../edi_energy_mirror/edi_energy_de/ --output-path ./output/ --file-type csv --file-type xlsx --file-type flatahb --pruefis 13002 --format-version FV2310
kohlrahbi ahb -eemp ../edi_energy_mirror/ --output-path ./output/ --file-type csv --file-type xlsx --file-type flatahb --pruefis 13002 --format-version FV2310
```

### Extract all conditions

To extract all conditions for each format of a specific format version, you can run the following command.

```bash
kohlrahbi conditions --input-path ../edi_energy_mirror/edi_energy_de/ --output-path ./output/ --format-version FV2310
kohlrahbi conditions -eemp ../edi_energy_mirror/ --output-path ./output/ --format-version FV2310
```
This will provide you with:
* all conditions
* all packages

found in all AHBs (including the condition texts from the package tables) within the specified folder containing the .docx files.
The output will be saved for each EDIFACT format separately as `conditions.json` and `packages.json` in the specified output path.
Please note that the condition information collected here may be more comprehensive than the information collected for the AHBs above, because `conditions` uses a different routine than `ahb`.
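As a rough sketch (not part of the kohlrahbi CLI itself), the exported files can be read back with plain `json`; the directory name and the UTILMD format below are examples, since one subdirectory is created per EDIFACT format found in the mirror:

```python
import json
from pathlib import Path

# Example path: --output-path ./output/ and the UTILMD format; adjust to your run.
output_dir = Path("./output/UTILMD")

with open(output_dir / "conditions.json", encoding="utf-8") as file:
    conditions = json.load(file)  # entries: condition_key, condition_text, edifact_format

with open(output_dir / "packages.json", encoding="utf-8") as file:
    packages = json.load(file)  # entries: package_key, package_expression, edifact_format

condition_texts = {entry["condition_key"]: entry["condition_text"] for entry in conditions}
print(f"{len(condition_texts)} conditions and {len(packages)} packages for UTILMD")
```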

### Extract change history

```bash
kohlrahbi changehistory --input-path ../edi_energy_mirror/edi_energy_de/ --output-path ./output/ --format-version FV2310
kohlrahbi changehistory -eemp ../edi_energy_mirror/ --output-path ./output/ --format-version FV2310
```

## `.docx` Data Sources
6 changes: 5 additions & 1 deletion requirements.txt
@@ -1,5 +1,5 @@
#
# This file is autogenerated by pip-compile with Python 3.11
# This file is autogenerated by pip-compile with Python 3.12
# by the following command:
#
# pip-compile pyproject.toml
@@ -10,6 +10,10 @@ attrs==23.2.0
    # via maus
click==8.1.7
    # via kohlrahbi (pyproject.toml)
colorama==0.4.6
    # via
    #   click
    #   colorlog
colorlog==6.8.2
    # via kohlrahbi (pyproject.toml)
et-xmlfile==1.1.0
1 change: 1 addition & 0 deletions src/_kohlrahbi_version.py
@@ -0,0 +1 @@
version = "0.4.2.dev91+g53b6228"
32 changes: 0 additions & 32 deletions src/kohlrahbi/ahb/__init__.py
@@ -43,24 +43,6 @@ def load_pruefi_docx_file_map_from_file(path_to_pruefi_docx_file_map_file: Path)
    return pruefi_docx_file_map


def get_or_cache_document(ahb_file_path: Path, path_to_document_mapping: dict) -> Document:
    """
    Get the document from the cache or read it from the file system.
    """
    if ahb_file_path not in path_to_document_mapping:
        if not ahb_file_path.exists():
            logger.warning("The file '%s' does not exist", ahb_file_path)
            raise FileNotFoundError(f"The file '{ahb_file_path}' does not exist")
        try:
            doc = docx.Document(str(ahb_file_path))
            path_to_document_mapping[ahb_file_path] = doc
            logger.debug("Saved %s document in cache", ahb_file_path)
        except IOError as ioe:
            logger.exception("There was an error opening the file '%s'", ahb_file_path, exc_info=True)
            raise click.Abort() from ioe
    return path_to_document_mapping[ahb_file_path]


def process_ahb_table(
    ahb_table: AhbTable,
    pruefi: str,
@@ -204,20 +186,6 @@ def table_header_contains_text_pruefidentifikator(table: Table) -> bool:
    return table.row_cells(0)[-1].paragraphs[-1].text.startswith("Prüfidentifikator")


def create_pruefi_docx_filename_map(format_version: EdifactFormatVersion, edi_energy_mirror_path: Path):
    """Creates a mapping of pruefis to their corresponding docx files."""

    ahb_documents_path = get_ahb_documents_path(edi_energy_mirror_path, format_version)

    pruefis = find_pruefidentifikatoren(ahb_documents_path)

    if not pruefis:
        log_no_pruefis_warning(format_version.value, ahb_documents_path)
        pruefis = get_default_pruefi_map(ahb_documents_path)

    save_pruefi_map_to_toml(pruefis, format_version.value)


def get_pruefi_to_file_mapping(basic_input_path: Path, format_version: EdifactFormatVersion) -> dict[str, str]:
    """Returns the pruefi to file mapping. If the cache file does not exist, it creates it."""
    default_path_to_cache_file = Path(__file__).parents[1] / "cache" / f"{format_version}_pruefi_docx_filename_map.toml"
124 changes: 124 additions & 0 deletions src/kohlrahbi/ahbtable/ahbcondtions.py
@@ -0,0 +1,124 @@
"""This module contains the ahbconditions class."""

import json
import re
from pathlib import Path

from docx.table import Table as DocxTable # type: ignore[import-untyped]
from maus.edifact import EdifactFormat
from pydantic import BaseModel, ConfigDict

from kohlrahbi.logger import logger


class AhbConditions(BaseModel):
"""
Class which contains a dict of conditions for each edifact format
"""

conditions_dict: dict[EdifactFormat, dict[str, str]] = {}

model_config = ConfigDict(arbitrary_types_allowed=True)

@classmethod
def from_docx_table(cls, docx_tables: list[DocxTable], edifact_format: EdifactFormat) -> "AhbConditions":
"""
Create an AhbPackageTable object from a docx table.
"""
table_data = []
for table in docx_tables:
for row in table.rows:
if row.cells[-1].text and row.cells[0].text != "EDIFACT Struktur":
row_data = row.cells[-1].text
table_data.append(row_data)

conditions_dict = {}
are_there_conditions = len(table_data) > 0
if are_there_conditions:
conditions_dict = AhbConditions.collect_conditions(
conditions_list=table_data, edifact_format=edifact_format
)

return cls(conditions_dict=conditions_dict)

@staticmethod
def collect_conditions(
conditions_list: list[str], edifact_format: EdifactFormat
) -> dict[EdifactFormat, dict[str, str]]:
"""collect conditions from list of all conditions and store them in conditions dict."""
conditions_dict: dict[EdifactFormat, dict[str, str]] = {edifact_format: {}}

conditions_str = "".join(conditions_list)
conditions_dict = parse_conditions_from_string(conditions_str, edifact_format, conditions_dict)
logger.info("The package conditions for %s were collected.", edifact_format)
return conditions_dict

def include_condition_dict(self, to_add=dict[EdifactFormat, dict[str, str]] | None) -> None:
""" " Include a dict of conditions to the conditions_dict"""
if to_add is None:
logger.info("Conditions dict to be added is empty.")
for edifact_format, edi_cond_dict in to_add.items():
for condition_key, condition_text in edi_cond_dict.items():
if edifact_format in self.conditions_dict:
if (
condition_key in self.conditions_dict[edifact_format]
and len(condition_text) > len(self.conditions_dict[edifact_format][condition_key])
or condition_key not in self.conditions_dict[edifact_format]
):
self.conditions_dict[edifact_format][condition_key] = condition_text
else:
self.conditions_dict[edifact_format] = {condition_key: condition_text}

logger.info("Conditions were updated.")

def dump_as_json(self, output_directory_path: Path) -> None:
"""
Writes all collected conditions to a json file.
The file will be stored in the directory:
'output_directory_path/<edifact_format>/conditions.json'
"""
for edifact_format, format_cond_dict in self.conditions_dict.items():
condition_json_output_directory_path = output_directory_path / str(edifact_format)
condition_json_output_directory_path.mkdir(parents=True, exist_ok=True)
file_path = condition_json_output_directory_path / "conditions.json"
# resort ConditionKeyConditionTextMappings for output
sorted_condition_dict = {k: format_cond_dict[k] for k in sorted(format_cond_dict, key=int)}
array = [
{"condition_key": i, "condition_text": sorted_condition_dict[i], "edifact_format": edifact_format}
for i in sorted_condition_dict
]
with open(file_path, "w", encoding="utf-8") as file:
json.dump(array, file, ensure_ascii=False, indent=2)

logger.info(
"The conditions.json file for %s is saved at %s",
edifact_format,
file_path,
)


def parse_conditions_from_string(
conditions_text: str, edifact_format: EdifactFormat, conditions_dict: dict[EdifactFormat, dict[str, str]]
) -> dict[EdifactFormat, dict[str, str]]:
"""
Takes string with some conditions and sorts it into a dict.
"""
# Split the input into parts enclosed in square brackets and other parts
matches = re.findall(
r"\[(\d+)](.*?)(?=\[\d+]|$)",
conditions_text,
re.DOTALL,
)
for match in matches:
# make text prettier:
text = match[1].strip()
text = re.sub(r"\s+", " ", text)

# check whether condition was already collected:
existing_text = conditions_dict[edifact_format].get(match[0])
is_condition_key_collected_yet = existing_text is not None
if is_condition_key_collected_yet and existing_text is not None:
key_exits_but_shorter_text = len(text) > len(existing_text)
if not is_condition_key_collected_yet or key_exits_but_shorter_text:
conditions_dict[edifact_format][match[0]] = text
return conditions_dict
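A brief usage sketch of `parse_conditions_from_string` as defined above; the condition keys and German condition texts are invented for illustration, and `EdifactFormat.UTILMD` is assumed to be importable from `maus.edifact`:

```python
from maus.edifact import EdifactFormat

from kohlrahbi.ahbtable.ahbcondtions import parse_conditions_from_string

# Invented example; real texts come from the rightmost cells of the AHB docx tables.
raw_text = "[1] Wenn Zählerstand vorhanden   [2] Wenn kein   Zählerstand vorhanden"

conditions: dict[EdifactFormat, dict[str, str]] = {EdifactFormat.UTILMD: {}}
conditions = parse_conditions_from_string(raw_text, EdifactFormat.UTILMD, conditions)

# Whitespace is collapsed and the longer text wins if a key appears twice:
# {EdifactFormat.UTILMD: {'1': 'Wenn Zählerstand vorhanden',
#                         '2': 'Wenn kein Zählerstand vorhanden'}}
print(conditions)
```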
125 changes: 125 additions & 0 deletions src/kohlrahbi/ahbtable/ahbpackagetable.py
@@ -0,0 +1,125 @@
"""
class which contains AHB package condition table
"""

import json
import re
from pathlib import Path

import pandas as pd
from docx.table import Table as DocxTable # type: ignore[import-untyped]
from maus.edifact import EdifactFormat
from pydantic import BaseModel, ConfigDict

from kohlrahbi.ahbtable.ahbcondtions import parse_conditions_from_string
from kohlrahbi.logger import logger


class AhbPackageTable(BaseModel):
"""
This class contains the AHB Package table as you see it in the beginning AHB documents,
but in a machine readable format.
Caution: if two PackageTables objects are combined so far only the package_dict field is updated.
"""

table: pd.DataFrame = pd.DataFrame()
package_dict: dict[EdifactFormat, dict[str, str]] = {}
model_config = ConfigDict(arbitrary_types_allowed=True)

@classmethod
def from_docx_table(cls, docx_tables: list[DocxTable]) -> "AhbPackageTable":
"""
Create an AhbPackageTable object from a docx table.
"""
table_data = []
for table in docx_tables:
for row in table.rows:
row_data = [cell.text for cell in row.cells]
table_data.append(row_data)

headers = table_data[0]
data = table_data[1:]
df = pd.DataFrame(data, columns=headers)
return cls(table=df)

def provide_conditions(self, edifact_format: EdifactFormat) -> dict[EdifactFormat, dict[str, str]]:
"""collect conditions from package table and store them in conditions dict."""
conditions_dict: dict[EdifactFormat, dict[str, str]] = {edifact_format: {}}
there_are_conditions = (self.table["Bedingungen"] != "").any()
if there_are_conditions:
for conditions_text in self.table["Bedingungen"][self.table["Bedingungen"] != ""]:
conditions_dict = parse_conditions_from_string(conditions_text, edifact_format, conditions_dict)
logger.info("The package conditions for %s were collected.", edifact_format)
return conditions_dict

def provide_packages(self, edifact_format: EdifactFormat):
"""collect conditions from package table and store them in conditions dict."""
package_dict: dict[EdifactFormat, dict[str, str]] = {edifact_format: {}}

there_are_packages = (self.table["Paket"] != "").any()
if there_are_packages:
for _, row in self.table.iterrows():
package = row["Paket"]
# Use re.search to find the first match
match = re.search(r"\[(\d+)P\]", package)
if not match:
raise ValueError("No valid package key found in the package column.")
# Extract the matched digits
package = match.group(1)
if package != "1":
package_conditions = row["Paketvoraussetzung(en)"].strip()
# check whether package was already collected:
existing_text = package_dict[edifact_format].get(package)
is_package_key_collected_yet = existing_text is not None
if is_package_key_collected_yet:
key_exits_but_shorter_text = len(package_conditions) > len(
existing_text # type: ignore[arg-type]
) # type: ignore[arg-type]
if not is_package_key_collected_yet or key_exits_but_shorter_text:
package_dict[edifact_format][package] = package_conditions

logger.info("Packages for %s were collected.", edifact_format)
self.package_dict = package_dict

def include_package_dict(self, to_add=dict[EdifactFormat, dict[str, str]] | None) -> None:
"""Include a dict of conditions to the conditions_dict"""
if to_add is None:
logger.info("Packages dict to be added is empty.")
for edifact_format, edi_cond_dict in to_add.items():
for package_key, package_conditions in edi_cond_dict.items():
if edifact_format in self.package_dict:
if (
package_key in self.package_dict[edifact_format]
and len(package_conditions) > len(self.package_dict[edifact_format][package_key])
or package_key not in self.package_dict[edifact_format]
):
self.package_dict[edifact_format][package_key] = package_conditions
else:
self.package_dict[edifact_format] = {package_key: package_conditions}

logger.info("Packages were updated.")

def dump_as_json(self, output_directory_path: Path) -> None:
"""
Writes all collected packages to a json file.
The file will be stored in the directory:
'output_directory_path/<edifact_format>/conditions.json'
"""
for edifact_format, format_pkg_dict in self.package_dict.items():
package_json_output_directory_path = output_directory_path / str(edifact_format)
package_json_output_directory_path.mkdir(parents=True, exist_ok=True)
file_path = package_json_output_directory_path / "packages.json"
# resort PackageKeyConditionTextMappings for output
sorted_package_dict = {k: format_pkg_dict[k] for k in sorted(format_pkg_dict, key=int)}
array = [
{"package_key": i + "P", "package_expression": sorted_package_dict[i], "edifact_format": edifact_format}
for i in sorted_package_dict
]
with open(file_path, "w", encoding="utf-8") as file:
json.dump(array, file, ensure_ascii=False, indent=2)

logger.info(
"The package.json file for %s is saved at %s",
edifact_format,
file_path,
)
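To illustrate how the class above fits together, here is a hedged sketch that builds the underlying DataFrame directly instead of reading a docx file; the column names (`Paket`, `Paketvoraussetzung(en)`, `Bedingungen`) match the code above, while the rows are invented and `EdifactFormat.UTILMD` is assumed to exist in `maus.edifact`:

```python
from pathlib import Path

import pandas as pd
from maus.edifact import EdifactFormat

from kohlrahbi.ahbtable.ahbpackagetable import AhbPackageTable

# Invented rows; in kohlrahbi the table is built via AhbPackageTable.from_docx_table().
df = pd.DataFrame(
    {
        "Paket": ["[2P]", "[3P]"],
        "Paketvoraussetzung(en)": ["[1] Wenn Zählerstand vorhanden", "[2] Wenn Marktlokation vorhanden"],
        "Bedingungen": ["", ""],
    }
)

package_table = AhbPackageTable(table=df)
package_table.provide_packages(EdifactFormat.UTILMD)
# The numeric key of e.g. "[2P]" is extracted and mapped to its package expression:
# {EdifactFormat.UTILMD: {'2': '[1] Wenn Zählerstand vorhanden', '3': '[2] Wenn Marktlokation vorhanden'}}
print(package_table.package_dict)

# Writing packages.json per EDIFACT format works the same way as for conditions:
package_table.dump_as_json(Path("./output"))
```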