diff --git a/evals/evaluation/lm_evaluation_harness/model_card/README.md b/evals/evaluation/lm_evaluation_harness/model_card/README.md new file mode 100644 index 00000000..29d33ead --- /dev/null +++ b/evals/evaluation/lm_evaluation_harness/model_card/README.md @@ -0,0 +1,244 @@ +# Model Card Generator + +Model Card Generator allows users to create interactive HTML and static Markdown reports containing model performance and fairness metrics. + +**Model Card Sections** + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Section | Subsection | Description |
| ------- | ---------- | ----------- |
| Model Details | Overview | A brief, one-line description of the model card. |
| | Documentation | A thorough description of the model and its usage. |
| | Owners | The individuals or teams who own the model. |
| | Version | The version of the model. |
| | Licenses | The model's license for use. |
| | References | Links providing more information about the model. |
| | Citations | How to reference this model card. |
| | Path | The path where the model is stored. |
| | Graphics | Collection of overview graphics. |
| Model Parameters | Model Architecture | The architecture of the model. |
| | Data | The datasets used to train and evaluate the model. |
| | Input Format | The data format for inputs to the model. |
| | Input Format Map | The data format for inputs to the model, in key-value format. |
| | Output Format | The data format for outputs from the model. |
| | Output Format Map | The data format for outputs from the model, in key-value format. |
| Quantitative Analysis | Performance Metrics | The model performance metrics being reported. |
| | Graphics | Collection of performance graphics. |
| Considerations | Users | Who are the intended users of the model? |
| | Use Cases | What are the intended use cases of the model? |
| | Limitations | What are the known technical limitations of the model? E.g., what kind(s) of data should the model not be expected to perform well on? What factors might degrade model performance? |
| | Tradeoffs | What are the known tradeoffs in accuracy/performance of the model? |
| | Ethical Considerations | What are the ethical (or environmental) risks involved in the application of this model? |
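These sections correspond to fields in the Model Card metadata JSON that the generator consumes (see Step 5 below). As a rough orientation, a metadata file covering only the Model Details section could be produced as in the following sketch; the field names mirror the example model card used in this repository's tests, all values are placeholders, and the complete set of accepted fields is defined by the linked JSON schema.

```python
# Hypothetical minimal Model Card metadata covering only the "Model Details"
# section. Field names mirror the example in tests/test_model_card_gen.py;
# all values are placeholders -- consult the JSON schema for the full field list.
import json

model_card_metadata = {
    "schema_version": "0.0.1",
    "model_details": {
        "name": "my-model",  # the required model name (see Step 5)
        "overview": "A brief, one-line description of the model.",
        "documentation": "A thorough description of the model and its usage.",
        "version": {"name": "1.0", "date": "2025-01-01"},
    },
}

# Save it where --input_mc_metadata_json can pick it up.
with open("model_card_metadata.json", "w") as f:
    json.dump(model_card_metadata, f, indent=2)
```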
## Steps to generate a Model Card

**Step 1**: Clone the GitHub repository.

```shell
git clone https://github.com/opea-project/GenAIEval.git
```

**Step 2**: Navigate to the `model_card` directory.

```shell
cd evals/evaluation/lm_evaluation_harness/model_card/
```

**Step 3**: Create and activate a virtual environment, e.g. using virtualenv:

```shell
python3 -m virtualenv mg_venv
source mg_venv/bin/activate
```

**Step 4**: Install the required dependencies using `pip`.

```shell
pip install -r requirements.txt
```

**Step 5**: Prepare the input Model Card metadata JSON.

Draft your Model Card metadata following the specified [JSON schema](https://github.com/intel/intel-xai-tools/tree/main/model_card_gen/intel_ai_safety/model_card_gen/schema/v0.0.1/model_card.schema.json) and save the content in a `.json` file. Refer to the table above for the sections and fields to include. You can add any fields that comply with the schema, but ensure the required model name field is included.
For guidance, refer to the example Model Card JSONs available [here](https://github.com/intel/intel-xai-tools/tree/main/model_card_gen/intel_ai_safety/model_card_gen/docs/examples/json). Provide the path to the Model Card metadata JSON through the `input_mc_metadata_json` argument.

Optionally, specify the template for rendering the model card by replacing `MODEL_CARD_TEMPLATE` with either "html" for an interactive HTML model card or "md" for a static Markdown version. By default, the template type is set to HTML.
Additionally, provide the directory where the generated model card and related files should be saved using the `OUTPUT_DIRECTORY` argument.

```shell
INPUT_MC_METADATA_JSON_PATH=/path/to/model_card_metadata.json
MODEL_CARD_TEMPLATE="html"
OUTPUT_DIRECTORY=/path/to/output

python examples/main.py --input_mc_metadata_json ${INPUT_MC_METADATA_JSON_PATH} --mc_template_type ${MODEL_CARD_TEMPLATE} --output_dir ${OUTPUT_DIRECTORY}
```

**Step 6 (Optional)**: Generate Performance Metrics

Draft a Metrics by Threshold CSV file based on the generated metric results. Example metric files are available [here](https://github.com/intel/intel-xai-tools/tree/main/model_card_gen/intel_ai_safety/model_card_gen/docs/examples/csv), and this [notebook](https://github.com/intel/intel-xai-tools/tree/main/notebooks/model_card_gen/hugging_face_model_card/hugging-face-model-card.ipynb) walks through creating them step by step. The "Metrics by Threshold" section of the Model Card lets you visually analyze how metric values vary across probability thresholds. Provide the path to the Metrics by Threshold CSV file using the `metrics_by_threshold` argument.

Draft a Metrics by Group CSV file based on the generated metric results; the example files and notebook linked above cover this format as well. The "Metrics by Group" section of the Model Card organizes and displays a model's performance metrics by distinct groups or subcategories within the data. Provide the path to the Metrics by Group CSV file using the `metrics_by_group` argument.
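As a minimal sketch of what these two files can contain, they can also be assembled directly with pandas. The column names below follow the fixtures used in this repository's tests (`threshold`, `precision`, `recall`, `f1`, `accuracy` for Metrics by Threshold; `feature` and `group` plus metric columns for Metrics by Group); the metric values are purely illustrative.

```python
# Minimal sketch: assemble illustrative "Metrics by Threshold" and
# "Metrics by Group" CSV files with pandas. Column names follow the test
# fixtures in tests/test_model_card_gen.py; the values are made up.
import pandas as pd

metrics_by_threshold = pd.DataFrame(
    [
        {"threshold": 0.0, "precision": 0.50, "recall": 1.00, "f1": 0.60, "accuracy": 0.50},
        {"threshold": 0.5, "precision": 0.55, "recall": 0.80, "f1": 0.65, "accuracy": 0.60},
        {"threshold": 1.0, "precision": 0.70, "recall": 0.40, "f1": 0.51, "accuracy": 0.62},
    ]
)

metrics_by_group = pd.DataFrame(
    [
        {"feature": "sex_Female", "group": 0.0, "binary_accuracy": 0.8, "auc": 0.9},
        {"feature": "sex_Female", "group": 1.0, "binary_accuracy": 0.9, "auc": 0.9},
        {"feature": "Overall", "group": "Overall", "binary_accuracy": 0.8, "auc": 0.9},
    ]
)

# Write the CSVs; pass these paths via --metrics_by_threshold / --metrics_by_group.
metrics_by_threshold.to_csv("metrics_by_threshold.csv", index=False)
metrics_by_group.to_csv("metrics_by_group.csv", index=False)
```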
+ +```shell +INPUT_MC_METADATA_JSON_PATH=/path/to/model_card_metadata.json +MODEL_CARD_TEMPLATE="html" +OUTPUT_DIRECTORY=/path/to/output +METRICS_BY_THRESHOLD=/path/to/metrics_by_threshold.csv +METRICS_BY_GROUP=/path/to/metrics_by_group.csv + +python examples/main.py --input_mc_metadata_json ${INPUT_MC_METADATA_JSON_PATH} --mc_template_type ${MODEL_CARD_TEMPLATE} --output_dir ${OUTPUT_DIRECTORY} --metrics_by_threshold ${METRICS_BY_THRESHOLD} --metrics_by_group ${METRICS_BY_GROUP} +``` + +**Step 7 (Optional)**: Generate Metrics by Threshold for `lm_evaluation_harness` + +Additionally, you can generate a `Metrics by Threshold` CSV for some of the `lm_evaluation_harness` tasks. Currently, we support the tasks that produce numeric metrics, like log probabilities or log likelihoods, to determine the best label in text generation and question answering scenarios. In the future, we aim to expand our parsing logic and the Model Card Generator to support a wider array of text generation tasks. + +To generate Metrics by Threshold file for supported tasks, provide the path to the metric results JSONL file in place of `METRICS_RESULTS_PATH`. + +```shell +INPUT_MC_METADATA_JSON_PATH=/path/to/model_card_metadata.json +MODEL_CARD_TEMPLATE="html" +OUTPUT_DIRECTORY=/path/to/output +METRICS_RESULTS_PATH=/path/to/metrics_results.jsonl + +python ./examples/main.py --input_mc_metadata_json ${INPUT_MC_METADATA_JSON_PATH} --mc_template_type ${MODEL_CARD_TEMPLATE} --output_dir ${OUTPUT_DIRECTORY} --metric_results_path ${METRICS_RESULTS_PATH} +``` + +Consider an example of a result JSON file from an `lm_evaluation_harness` task as follows. +``` +[ + { + "doc_id": 0, + "target": "Neither", + "arguments": [ + ["Lorem ipsum dolor sit amet, consectetur adipiscing elit", " True"], + ["Lorem ipsum dolor sit amet, consectetur adipiscing elit", " Neither"], + ["Lorem ipsum dolor sit amet, consectetur adipiscing elit", " False"] + ], + "filtered_resps": [ + [-10.0, false], + [-9.0, false], + [-11.0, false] + ], + "acc": 0.0 + }, + { + "doc_id": 1, + "target": "True", + "arguments": [ + ["Lorem ipsum dolor sit amet, consectetur adipiscing elit", " True"], + ["Lorem ipsum dolor sit amet, consectetur adipiscing elit", " Neither"], + ["Lorem ipsum dolor sit amet, consectetur adipiscing elit", " False"] + ], + "filtered_resps": [ + [-12.0, false], + [-10.5, false], + [-13.0, false] + ], + "acc": 1.0 + }, + ... +] +``` +The `filtered_resps` field contains log likelihoods for each response option, representing the model's confidence levels. When the path to this JSON file is specified in the `metric_results_path` argument, these log likelihood values are parsed and converted into probabilities using the softmax function. These probabilities are then used to calculate performance metrics across various thresholds, ranging from 0.0 to 1.0, and are compiled into the `metrics_by_threshold` CSV file which would look as follows: + +| Threshold | Precision | Recall | F1 Score | Accuracy | Label | +|-----------|-----------|--------|----------|----------|---------| +| 0.000 | 0.500 | 0.600 | 0.545 | 0.550 | True | +| 0.001 | 0.510 | 0.610 | 0.556 | 0.560 | True | +| ... | ... | ... | ... | ... | ... | +| 1.000 | 0.700 | 0.750 | 0.724 | 0.720 | True | +| 0.000 | 0.400 | 0.500 | 0.444 | 0.450 | False | +| 0.001 | 0.410 | 0.510 | 0.455 | 0.460 | False | +| ... | ... | ... | ... | ... | ... 
| +| 1.000 | 0.600 | 0.650 | 0.624 | 0.620 | False | +| 0.000 | 0.300 | 0.400 | 0.345 | 0.350 | Neither | +| 0.001 | 0.310 | 0.410 | 0.356 | 0.360 | Neither | +| ... | ... | ... | ... | ... | ... | +| 1.000 | 0.500 | 0.550 | 0.524 | 0.520 | Neither | + +The `model_card_gen` tool uses the generated `metrics_by_threshold` dataframe to format and present the evaluation results in a comprehensive model card. + +You can find an example of a generated Model Card [here](https://github.com/intel/intel-xai-tools/tree/main/model_card_gen/intel_ai_safety/model_card_gen/docs/examples/html) diff --git a/evals/evaluation/lm_evaluation_harness/model_card/arguments.py b/evals/evaluation/lm_evaluation_harness/model_card/arguments.py new file mode 100644 index 00000000..d16cd60c --- /dev/null +++ b/evals/evaluation/lm_evaluation_harness/model_card/arguments.py @@ -0,0 +1,45 @@ +# Copyright (C) 2025 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + + +import argparse + + +def parse_arguments(): + parser = argparse.ArgumentParser(description="Generate a model card with optional metrics processing.") + parser.add_argument( + "--input_mc_metadata_json", + type=str, + required=True, + help="Path to the JSON file containing input model card metadata.", + ) + parser.add_argument( + "--metrics_by_threshold", + type=str, + default=None, + help="Metrics by threshold dataframe or the path to the metrics by threshold CSV file.", + ) + parser.add_argument( + "--metrics_by_group", + type=str, + default=None, + help="Metrics by group dataframe or Path to the metrics by group CSV file.", + ) + parser.add_argument( + "--metric_results_path", + type=str, + default=None, + help="Path to the metric results JSONL file for which metrics by threshold dataframe needs to be generated.", + ) + parser.add_argument( + "--mc_template_type", + type=str, + default="html", + help="Template to use for rendering the model card. html for an interactive HTML model card or md for a static Markdown version. Defaults to html", + ) + parser.add_argument( + "--output_dir", type=str, default=None, help="Directory to save the generated model card and related files." + ) + args = parser.parse_args() + + return args diff --git a/evals/evaluation/lm_evaluation_harness/model_card/examples/main.py b/evals/evaluation/lm_evaluation_harness/model_card/examples/main.py new file mode 100644 index 00000000..fea7ba9d --- /dev/null +++ b/evals/evaluation/lm_evaluation_harness/model_card/examples/main.py @@ -0,0 +1,48 @@ +# Copyright (C) 2025 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + + +import os + +from evals.evaluation.lm_evaluation_harness.model_card.arguments import parse_arguments +from evals.evaluation.lm_evaluation_harness.model_card.generate_model_card import generate_model_card +from evals.evaluation.lm_evaluation_harness.model_card.utils import generate_metrics_by_threshold, generate_pred_prob + + +def main(): + args = parse_arguments() + metric_results_path = args.metric_results_path + output_dir = args.output_dir + metrics_by_threshold = args.metrics_by_threshold + # Generate the metrics by threshold for the metric results if provided by the user + + if metric_results_path: + if not os.path.exists(args.metric_results_path): + raise FileNotFoundError( + f"The file at {metric_results_path} does not exist. Please provide a valid file path." 
+ ) + + try: + y_pred_prob, labels, num_options, class_label_index_map = generate_pred_prob(metric_results_path) + metrics_by_threshold = generate_metrics_by_threshold( + y_pred_prob, labels, num_options, class_label_index_map, output_dir + ) + except OSError as e: + print(f"Error: {e}") + except Exception: + print("Task is currently not supported for metrics by threshold generation.") + return + + # Generate the model card + model_card = generate_model_card( + args.input_mc_metadata_json, + metrics_by_threshold, + args.metrics_by_group, + mc_template_type=args.mc_template_type, + output_dir=output_dir, + ) + return model_card + + +if __name__ == "__main__": + main() diff --git a/evals/evaluation/lm_evaluation_harness/model_card/generate_model_card.py b/evals/evaluation/lm_evaluation_harness/model_card/generate_model_card.py new file mode 100644 index 00000000..e0bcf51c --- /dev/null +++ b/evals/evaluation/lm_evaluation_harness/model_card/generate_model_card.py @@ -0,0 +1,69 @@ +# Copyright (C) 2025 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + + +import json +import os + +from intel_ai_safety.model_card_gen.model_card_gen import ModelCardGen +from intel_ai_safety.model_card_gen.validation import validate_json_schema +from jsonschema import ValidationError + + +def generate_model_card( + input_mc_metadata_json_path, + metric_by_threshold=None, + metric_by_group=None, + mc_template_type="html", + output_dir=None, +): + """Generates an HTML or Markdown representation of a model card. + + Parameters: + input_mc_metadata_json_path (json, required): The model card JSON object containing the model's metadata and other details. + metric_threshold_csv (str, optional): The file path to a CSV containing metric threshold data. + metric_grp_csv (str, optional): The file path to a CSV containing metric group data. + mc_template_type (str, optional): Template to use for rendering the model card. Options include "html" for an interactive HTML model card or "md" for a static Markdown version. Defaults to "html" + output_dir (str, optional): The directory where the model card file will be saved. Defaults to the current directory. + + Returns: + str: The HTML or Markdown representation of the model card. + """ + if output_dir is None: + output_dir = os.getcwd() + + if os.path.exists(input_mc_metadata_json_path) and os.path.isfile(input_mc_metadata_json_path): + try: + with open(input_mc_metadata_json_path, "r") as file: + model_card_json = json.load(file) + + except json.JSONDecodeError as e: + raise ValueError("The file content is not valid JSON.") from e + else: + raise FileNotFoundError(f"The JSON file at {input_mc_metadata_json_path} does not exist.") + + try: + validate_json_schema(model_card_json) + + except ValidationError as e: + raise ValidationError( + "Warning: The schema version of the uploaded JSON does not correspond to a model card schema version or " + "the uploaded JSON does not follow the model card schema." 
+ ) + + model_card = ModelCardGen.generate( + model_card_json, + metrics_by_threshold=metric_by_threshold, + metrics_by_group=metric_by_group, + template_type=mc_template_type, + ) + + model_card_name = f"Model Card.{mc_template_type}" + + full_path = os.path.join(output_dir, model_card_name) + model_card.export_model_card(full_path) + + if mc_template_type == "html": + return model_card._repr_html_() + else: + return model_card._repr_md_() diff --git a/evals/evaluation/lm_evaluation_harness/model_card/requirements.txt b/evals/evaluation/lm_evaluation_harness/model_card/requirements.txt new file mode 100644 index 00000000..3a75af66 --- /dev/null +++ b/evals/evaluation/lm_evaluation_harness/model_card/requirements.txt @@ -0,0 +1,8 @@ +intel-ai-safety-model-card-gen@git+https://github.com/intel/intel-xai-tools.git#subdirectory=model_card_gen +kaleido +lm-eval==0.4.3 +lxml +numpy +pandas +plotly +scikit-learn diff --git a/evals/evaluation/lm_evaluation_harness/model_card/utils.py b/evals/evaluation/lm_evaluation_harness/model_card/utils.py new file mode 100644 index 00000000..f5b05cd6 --- /dev/null +++ b/evals/evaluation/lm_evaluation_harness/model_card/utils.py @@ -0,0 +1,282 @@ +# Copyright (C) 2025 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + + +import json +import os + +import numpy as np +import pandas as pd +from scipy.special import softmax +from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score + +RESPONSE_MAP = {"Yes": "True", "It's impossible to say": "Neither", "No": "False", "no": "False", "yes": "True"} + + +def generate_pred_prob(metric_results_path): + """Processes a JSON file containing model evaluation results to generate predicted probabilities and labels. + + Parameters: + metric_results_path (str): The file path to the JSON file containing the evaluation results. + + Returns: + tuple: A tuple containing: + - predicted_probabilities (list): A list of predicted probabilities for each evaluation instance. + - labels (list): A list of true labels for each evaluation instance. + - num_labels (int): The number of distinct labels or options available. + - class_label_index_map (dict): A mapping from class indices to class labels, used for interpreting the predicted probabilities. + """ + if not metric_results_path: + raise ValueError("The results_path is None or an empty string. Please provide a valid file path.") + + if not os.path.exists(metric_results_path): + raise FileNotFoundError(f"The file at {metric_results_path} does not exist. 
Please provide a valid file path.") + + try: + with open(metric_results_path, "r") as f: + data = json.load(f) + except json.JSONDecodeError as e: + raise ValueError("The file content is not a valid JSON array.") + + labels = [] + map_target = False + + num_labels = len(data[0]["arguments"]) + predicted_probabilities = [] + + if isinstance(data[0]["target"], list): + if data[0]["target"][0] in RESPONSE_MAP and data[0]["arguments"][0][1].strip() in RESPONSE_MAP.values(): + map_target = True + elif data[0]["target"] in RESPONSE_MAP and data[0]["arguments"][0][1].strip() in RESPONSE_MAP.values(): + map_target = True + + if num_labels == 2: + is_ans_match_type_q = False + has_diff_labels = False + are_labels_identical = False + is_one_based_indexing = False + + if set([i[1] for i in data[1]["arguments"]]) != set([i[1] for i in data[0]["arguments"]]) and isinstance( + data[0]["target"], str + ): + has_diff_labels = True + + if len(set([data[0]["arguments"][0][1], data[0]["arguments"][1][1]])) == 1: + are_labels_identical = True + + if "answer" in data[0]["doc"]: + for item in data: + ref = item["doc"]["answer"] + if isinstance(ref, str): + ref = int(ref) + + if ref >= 2: + is_one_based_indexing = True + + if "answer_matching_behavior" in data[0]["doc"]: + reference_label = data[0]["doc"]["answer_matching_behavior"].strip() + is_ans_match_type_q = True + + for item in data: + target_label = item["target"] + class_label_index_map = {} + + if are_labels_identical: + if "label" in item["doc"]: + target_label = item["doc"]["label"] + + elif "answer" in item["doc"]: + target_label = item["doc"]["answer"] + if isinstance(target_label, str): + target_label = int(target_label) + + if is_one_based_indexing: + target_label -= 1 + + if isinstance(target_label, str): + target_label = target_label.strip() + options = [i[1] for i in data[0]["arguments"]] + + if isinstance(target_label, list): + if map_target: + target_label = RESPONSE_MAP[target_label[0]] + + if set([i[1] for i in data[1]["arguments"]]) != set(options): + options = [i for i in range(num_labels)] + class_label_index_map = {options[i]: options[i] for i in range(len(options))} + + else: + class_label_index_map = {i: options[i] for i in range(num_labels)} + else: + target_label = target_label[0] + else: + if map_target: + target_label = RESPONSE_MAP[target_label] + class_label_index_map = {i: options[i] for i in range(num_labels)} + + if set([i[1] for i in data[1]["arguments"]]) != set(options): + options = [i for i in range(num_labels)] + class_label_index_map = {options[i]: options[i] for i in range(len(options))} + + else: + class_label_index_map = {i: options[i] for i in range(num_labels)} + + if has_diff_labels and not are_labels_identical: + arguments = [quest[1].strip() for quest in item["arguments"]] + + if isinstance(target_label, int) and (isinstance(arguments[0], str)): + target_label = str(target_label) + try: + target_label = arguments.index(target_label) + except: + exit() + + if is_ans_match_type_q and item["arguments"][0][1].strip() != reference_label: + + target_label = 0 if target_label == 1 else 1 + filtered_resps = item["filtered_resps"][::-1] + else: + + filtered_resps = item["filtered_resps"] + + labels.append(target_label) + + # Convert log likelihoods to probabilities + log_likelihoods = np.array([resp[0] for resp in filtered_resps]).reshape(1, -1) + probs = softmax(log_likelihoods, axis=1)[0][1] # Extract probabilities for the positive class + predicted_probabilities.append(probs) + + else: + options = [i[1] for i in 
data[0]["arguments"]] + has_diff_labels = False + + if set([i[1] for i in data[1]["arguments"]]) != set(options): + options = [i for i in range(num_labels)] + has_diff_labels = True + class_label_index_map = {options[i]: options[i] for i in range(len(options))} + + else: + class_label_index_map = {i: options[i] for i in range(num_labels)} + for item in data: + if isinstance(item["target"], list): + if map_target: + target_label = RESPONSE_MAP[item["target"][0]] + else: + target_label = item["target"][0] + else: + if map_target: + target_label = RESPONSE_MAP[item["target"]] + else: + target_label = item["target"] + + if isinstance(item["target"], str): + target_label = target_label.strip() + if has_diff_labels: + option_resp = {i: item["filtered_resps"][i][0] for i in range(len(item["arguments"]))} + arguments = [quest[1].strip() for quest in item["arguments"]] + if target_label in arguments: + target_label = arguments.index(target_label) + + else: + + option_resp = { + item["arguments"][i][1]: item["filtered_resps"][i][0] for i in range(len(item["arguments"])) + } + if len(option_resp) < len(options): + + log_likelihoods = [option_resp[option] for option in options[: len(option_resp)]] + log_likelihoods = np.array(log_likelihoods + [0] * (len(options) - len(option_resp))).reshape(1, -1) + + else: + + log_likelihoods = np.array([option_resp[option] for option in options]).reshape(1, -1) + + probs = softmax(log_likelihoods, axis=1) + predicted_probabilities.append(probs) + labels.append(target_label) + + return predicted_probabilities, labels, num_labels, class_label_index_map + + +def generate_metrics_by_threshold( + prediction_probabilities, labels, num_labels, label_index_map, metric_by_threshold_path=None +): + """Generates a CSV file containing metrics by threshold dataframe. + + Parameters: + prediction_probabilities (array-like): Predicted probabilities for each label. + labels (array-like): True labels for the data. + num_labels (int): Number of distinct labels. + label_index_map (dict): Mapping from labels to label indices. + metric_by_threshold_path (str, optional): Path to save the metrics CSV file. Defaults to './metric_by_threshold.csv'. + + Return: + metric_by_threshold (Dataframe): Dataframe with performance metrics at a variable threshold, ranging from 0 to 1. 
+ """ + + if isinstance(labels[0], str) and label_index_map != {}: + index_label_map = {v.strip() if isinstance(v, str) else v: k for k, v in label_index_map.items()} + + filtered_data = [ + (index_label_map[label.strip()], prob) + for label, prob in zip(labels, prediction_probabilities) + if label.strip() in index_label_map + ] + labels, prediction_probabilities = zip(*filtered_data) + + prob_thresholds = np.linspace(0, 1, 1001) + metrics_by_threshold = pd.DataFrame() + + if num_labels == 2: + # Calculate metrics by threshold for binary label tasks + + metrics_dict = { + "threshold": prob_thresholds, + "precision": [ + precision_score(labels, prediction_probabilities > theta, zero_division=0) for theta in prob_thresholds + ], + "recall": [ + recall_score(labels, prediction_probabilities > theta, zero_division=0) for theta in prob_thresholds + ], + "f1": [f1_score(labels, prediction_probabilities > theta, zero_division=0) for theta in prob_thresholds], + "accuracy": [accuracy_score(labels, prediction_probabilities > theta) for theta in prob_thresholds], + } + metrics_by_threshold = pd.DataFrame.from_dict(metrics_dict) + + else: + # Calculate metrics by threshold for tasks having multiple distinct labels + + for label_index in range(num_labels): + prediction_probabilities = np.vstack(prediction_probabilities) + predicted_probabilities_per_label = prediction_probabilities[:, label_index] + binary_labels = [1 if label == label_index else 0 for label in labels] + metrics_dict_per_label = { + "threshold": prob_thresholds, + "precision": [ + precision_score(binary_labels, predicted_probabilities_per_label > theta, zero_division=0) + for theta in prob_thresholds + ], + "recall": [ + recall_score(binary_labels, predicted_probabilities_per_label > theta) for theta in prob_thresholds + ], + "f1": [f1_score(binary_labels, predicted_probabilities_per_label > theta) for theta in prob_thresholds], + "accuracy": [ + accuracy_score(binary_labels, predicted_probabilities_per_label > theta) + for theta in prob_thresholds + ], + "label": [label_index_map[label_index]] * len(prob_thresholds), + } + metrics_by_threshold = pd.concat( + [metrics_by_threshold, pd.DataFrame.from_dict(metrics_dict_per_label)], ignore_index=True + ) + + if not metric_by_threshold_path: + metric_by_threshold_path = "./metric_by_threshold.csv" + else: + if os.path.exists(metric_by_threshold_path): + metric_by_threshold_path = os.path.join(metric_by_threshold_path, "metric_by_threshold.csv") + + # Save the DataFrame to the specified path + metrics_by_threshold.to_csv(metric_by_threshold_path, index=False) + + return metrics_by_threshold diff --git a/tests/requirements.txt b/tests/requirements.txt index f0b7a773..bee5612b 100644 --- a/tests/requirements.txt +++ b/tests/requirements.txt @@ -1,4 +1,5 @@ bigcode-eval@git+https://github.com/bigcode-project/bigcode-evaluation-harness.git@e5c2f31625223431d7987f43b70b75b9d26ba118 +intel-ai-safety-model-card-gen@git+https://github.com/intel/intel-xai-tools.git#subdirectory=model_card_gen jieba jsonlines langchain_community diff --git a/tests/test_model_card_gen.py b/tests/test_model_card_gen.py new file mode 100644 index 00000000..600c931d --- /dev/null +++ b/tests/test_model_card_gen.py @@ -0,0 +1,93 @@ +# Copyright (C) 2025 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +import json +import os +import pkgutil +import unittest + +import pandas as pd +from intel_ai_safety.model_card_gen.model_card_gen import ModelCardGen +from intel_ai_safety.model_card_gen.validation import ( + 
_LATEST_SCHEMA_VERSION, + _SCHEMA_FILE_NAME, + _find_json_schema, + validate_json_schema, +) + +PACKAGE = "intel_ai_safety.model_card_gen" + +model_card_example = { + "schema_version": "0.0.1", + "model_details": { + "name": "dolore", + "path": "elit do incididunt", + "version": {"name": "adipisicing", "diff": "sit pariatur ex Lorem dolore", "date": "2011-03-31"}, + "overview": "amet qui non dolor", + "documentation": "ut", + }, +} + +metrics_by_threshold = pd.DataFrame( + [ + {"threshold": 0.0, "precision": 0.5, "recall": 1.0, "f1": 0.6, "accuracy": 0.5}, + {"threshold": 0.1, "precision": 0.4, "recall": 0.5, "f1": 0.5, "accuracy": 0.5}, + {"threshold": 0.2, "precision": 0.4, "recall": 0.5, "f1": 0.5, "accuracy": 0.5}, + {"threshold": 0.3, "precision": 0.4, "recall": 0.5, "f1": 0.5, "accuracy": 0.5}, + ] +) + +metrics_by_group = pd.DataFrame( + [ + {"feature": "sex_Female", "group": 0.0, "binary_accuracy": 0.8, "auc": 0.9}, + {"feature": "Overall", "group": "Overall", "binary_accuracy": 0.8, "auc": 0.9}, + {"feature": "sex_Female", "group": 1.0, "binary_accuracy": 0.9, "auc": 0.9}, + ] +) + + +class TestModelCardGen(unittest.TestCase): + + def test_init(self): + """Test ModelCardGen initialization.""" + mcg = ModelCardGen(model_card=model_card_example) + self.assertIsNotNone(mcg.model_card) + + def test_read_json(self): + """Test ModelCardGen._read_json method.""" + mcg = ModelCardGen(model_card=model_card_example) + self.assertEqual(mcg.model_card, ModelCardGen._read_json(model_card_example)) + + def test_validate_json(self): + """Test JSON validates.""" + self.assertEqual(validate_json_schema(model_card_example), _find_json_schema()) + + def test_schemas(self): + """Test JSON schema loads.""" + schema_file = os.path.join("schema", "v" + _LATEST_SCHEMA_VERSION, _SCHEMA_FILE_NAME) + json_file = pkgutil.get_data(PACKAGE, schema_file) + schema = json.loads(json_file) + self.assertEqual(schema, _find_json_schema(_LATEST_SCHEMA_VERSION)) + + def test_load_from_csv(self): + """Test if metrics files are loaded properly and generate model card.""" + mcg = ModelCardGen.generate(metrics_by_threshold=metrics_by_threshold, metrics_by_group=metrics_by_group) + self.assertIsNotNone(mcg.model_card) + + def test_load_template(self): + """Test ModelCardGen generates a model card using the specified template type.""" + for template_type in ("md", "html"): + with self.subTest(template_type=template_type): + mcg = ModelCardGen.generate(template_type=template_type) + self.assertIsNotNone(mcg.model_card) + + def test_missing_threshold_column_exception(self): + """Test if the correct exception is raised when the 'threshold' column is missing in the CSV.""" + with self.assertRaises(AssertionError) as context: + example_df = pd.DataFrame(data={"col1": [1, 2]}) + ModelCardGen.generate(metrics_by_threshold=example_df) + self.assertTrue("No column named 'threshold'" in str(context.exception)) + + +if __name__ == "__main__": + unittest.main()