Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Integrated Vectorization] Changes for Admin Configuration page #798

Merged
merged 83 commits into from
May 7, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
83 commits
Select commit Hold shift + click to select a range
b53c1b0
downloading the contents of url & uploading blob
komalg1 Apr 9, 2024
b706206
changes for downloading url content & uploading to blob
komalg1 Apr 10, 2024
39d373b
Merge branch 'main' into komal/iv-addurl-pull
komalg1 Apr 11, 2024
636bfcf
Adding url as the metadata to the blob
komalg1 Apr 11, 2024
4404b89
PR only for adding environment var
komalg1 Apr 11, 2024
71bab88
removing chnages unrelated to env variable
komalg1 Apr 11, 2024
b54c002
Grouping integrated vec. with azure search
komalg1 Apr 11, 2024
7edbf1e
upadting description to match the usage
komalg1 Apr 11, 2024
73cd9c0
grouping search variables
komalg1 Apr 12, 2024
4b84909
Spike for resource creation
komalg1 Apr 15, 2024
a96be78
Merge branch 'main' into komal/pull-resources-creation
komalg1 Apr 15, 2024
92c5923
Failing build due to poetry lock file
komalg1 Apr 15, 2024
1879966
Merge branch 'main' into komal/pull-resources-creation
komalg1 Apr 15, 2024
f1baf8c
pull resources creation
komalg1 Apr 15, 2024
4cb2a16
Pull resource creation refactoring
komalg1 Apr 16, 2024
fd76585
Merge branch 'main' into komal/pull-resources-creation
komalg1 Apr 17, 2024
f5f4190
updating index field names
komalg1 Apr 17, 2024
c82910a
fixing tests for when using IV
komalg1 Apr 17, 2024
b370b77
changes for rbac IV
komalg1 Apr 18, 2024
49ef655
role for storage & tests
komalg1 Apr 18, 2024
8798b64
unit tests for helpers
komalg1 Apr 18, 2024
0c8a8fa
unit tests for helper classes
komalg1 Apr 18, 2024
44bd0be
Merge branch 'main' into komal/pull-resources-creation
komalg1 Apr 18, 2024
05e1663
code review changes
komalg1 Apr 19, 2024
7e5e103
Merge branch 'main' into komal/pull-resources-creation
komalg1 Apr 19, 2024
811eaeb
tests update
komalg1 Apr 19, 2024
6cf0868
Explore Data changes
komalg1 Apr 20, 2024
b18a1af
Merge branch 'main' into komal/iv-explorepage
komalg1 Apr 22, 2024
16d11c3
decoupled explore & delete pages
komalg1 Apr 22, 2024
e949a85
Merge branch 'main' into komal/iv-explorepage
komalg1 Apr 23, 2024
42a76bf
showing only the chunk number to match push model
komalg1 Apr 23, 2024
a65fa1b
Merge branch 'main' into komal/iv-explorepage
komalg1 Apr 23, 2024
6d28e44
Deleting unused files
komalg1 Apr 23, 2024
0a49fa2
Deleting unused files
komalg1 Apr 23, 2024
23ee585
unused code
komalg1 Apr 23, 2024
669230d
unit tests for explore & delete
komalg1 Apr 23, 2024
a5cd945
Merge branch 'main' into komal/iv-explorepage
komalg1 Apr 23, 2024
8015c37
Code review changes
komalg1 Apr 24, 2024
310de32
Merge branch 'main' into komal/iv-explorepage
komalg1 Apr 24, 2024
1c13a72
Chat changes for IV
komalg1 Apr 25, 2024
cf85d16
missed mapping
komalg1 Apr 25, 2024
8cd9ddf
Merge branch 'main' into komal/iv-questionanswer
komalg1 Apr 26, 2024
4a4da6c
Merge branch 'main' into komal/iv-questionanswer
komalg1 Apr 26, 2024
277e633
Question Answer tool changes
komalg1 Apr 26, 2024
c3e4026
module load issue
komalg1 Apr 29, 2024
8057b01
mapping chunk id as chunk number
komalg1 Apr 29, 2024
2f68227
Merge branch 'main' of https://github.com/komalg1/chat-with-your-data…
komalg1 Apr 29, 2024
5f753d1
Merge branch 'main' into komal/iv-questionanswer
komalg1 Apr 29, 2024
c363956
sync main
komalg1 Apr 29, 2024
78b8197
Merge branch 'main' into komal/iv-questionanswer
komalg1 Apr 29, 2024
388c789
test fix
komalg1 Apr 29, 2024
de7e50e
code review comments
komalg1 Apr 29, 2024
11ad356
method name change
komalg1 Apr 29, 2024
a44fafd
Admin config changes
komalg1 Apr 29, 2024
5962563
Merge branch 'main' of https://github.com/komalg1/chat-with-your-data…
komalg1 Apr 30, 2024
090fea3
Merge branch 'Azure-Samples:main' into main
komalg1 Apr 30, 2024
f57755e
Merge branch 'main' into komal/iv-adminconfig
komalg1 Apr 30, 2024
e34ba0a
Merge branch 'Azure-Samples:main' into main
komalg1 Apr 30, 2024
09859d8
Merge branch 'main' into komal/iv-adminconfig
komalg1 Apr 30, 2024
aa31589
Refactoring Document Processor
komalg1 May 1, 2024
38e17f1
Merge branch 'Azure-Samples:main' into main
komalg1 May 1, 2024
0f2cd43
Merge branch 'main' into komal/iv-adminconfig
komalg1 May 1, 2024
d57d6ba
unit test fix
komalg1 May 1, 2024
132271a
code review changes
komalg1 May 2, 2024
ca5603e
Merge branch 'Azure-Samples:main' into main
komalg1 May 2, 2024
e21f304
Merge branch 'main' into komal/iv-adminconfig
komalg1 May 2, 2024
e63edee
updating configHelper path
komalg1 May 2, 2024
cb45503
test fix
komalg1 May 2, 2024
30b29ea
test fix
komalg1 May 2, 2024
26204e8
unit test fix
komalg1 May 3, 2024
ed350b5
code review comments
komalg1 May 3, 2024
e79dc09
unit test fix
komalg1 May 3, 2024
b9f12b2
config type
komalg1 May 3, 2024
25dfd7d
Merge branch 'main' of https://github.com/komalg1/chat-with-your-data…
komalg1 May 3, 2024
48ed004
Merge branch 'main' into komal/iv-adminconfig
komalg1 May 3, 2024
b4e0ed7
Merge branch 'Azure-Samples:main' into main
komalg1 May 3, 2024
be061a9
Merge branch 'main' into komal/iv-adminconfig
komalg1 May 3, 2024
6493317
loading env variables inAdmin.py
komalg1 May 6, 2024
ac0f10b
type declaration for env_helper
komalg1 May 7, 2024
990a72e
moving env helper below app insghts
komalg1 May 7, 2024
bf572b5
Merge branch 'Azure-Samples:main' into main
komalg1 May 7, 2024
3a5210e
Merge branch 'main' into komal/iv-adminconfig
komalg1 May 7, 2024
c951288
unused variable
komalg1 May 7, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 0 additions & 2 deletions .env.sample
Original file line number Diff line number Diff line change
Expand Up @@ -19,8 +19,6 @@ AZURE_SEARCH_CONVERSATIONS_LOG_INDEX=conversations-log
AZURE_SEARCH_USE_INTEGRATED_VECTORIZATION=false
AZURE_SEARCH_INDEXER_NAME=
AZURE_SEARCH_DATASOURCE_NAME=
AZURE_SEARCH_IV_MAX_PAGE_LENGTH=2000
AZURE_SEARCH_IV_PAGE_OVERLAP_LENGTH=500
# Azure OpenAI for generating the answer and computing the embedding of the documents
AZURE_OPENAI_RESOURCE=
AZURE_OPENAI_API_KEY=
Expand Down
4 changes: 0 additions & 4 deletions code/backend/Admin.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,12 +2,8 @@
import os
import logging
import sys
from dotenv import load_dotenv
from azure.monitor.opentelemetry import configure_azure_monitor


load_dotenv()

sys.path.append(os.path.join(os.path.dirname(__file__), ".."))

logging.captureWarnings(True)
Expand Down
15 changes: 5 additions & 10 deletions code/backend/batch/AddURLEmbeddings.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,8 @@
import logging
import traceback
import azure.functions as func

from utilities.helpers.DocumentProcessorHelper import DocumentProcessor
from utilities.helpers.ConfigHelper import ConfigHelper

from utilities.helpers.embedders.EmbedderFactory import EmbedderFactory
from utilities.helpers.EnvHelper import EnvHelper

bp_add_url_embeddings = func.Blueprint()
logger = logging.getLogger(__name__)
Expand All @@ -14,6 +12,7 @@

@bp_add_url_embeddings.route(route="AddURLEmbeddings")
def add_url_embeddings(req: func.HttpRequest) -> func.HttpResponse:
env_helper: EnvHelper = EnvHelper()
logger.info("Python HTTP trigger function processed a request.")

# Get Url from request
Expand All @@ -28,12 +27,8 @@ def add_url_embeddings(req: func.HttpRequest) -> func.HttpResponse:
# Check if url is present, compute embeddings and add them to VectorStore
if url:
try:
config = ConfigHelper.get_active_config_or_default()
document_processor = DocumentProcessor()
processors = list(
filter(lambda x: x.document_type == "url", config.document_processors)
)
document_processor.process(source_url=url, processors=processors)
embedder = EmbedderFactory.create(env_helper)
embedder.embed_file(url, ".url")
except Exception:
return func.HttpResponse(
f"Error: {traceback.format_exc()}", status_code=500
Expand Down
21 changes: 3 additions & 18 deletions code/backend/batch/BatchPushResults.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,10 +5,8 @@
from urllib.parse import urlparse

from utilities.helpers.AzureBlobStorageClient import AzureBlobStorageClient
from utilities.helpers.DocumentProcessorHelper import DocumentProcessor
from utilities.helpers.ConfigHelper import ConfigHelper
from utilities.helpers.EnvHelper import EnvHelper

from utilities.helpers.embedders.EmbedderFactory import EmbedderFactory

bp_batch_push_results = func.Blueprint()
logger = logging.getLogger(__name__)
Expand Down Expand Up @@ -39,24 +37,11 @@ def do_batch_push_results(msg: func.QueueMessage) -> None:
msg.get_body().decode("utf-8"),
)

document_processor = DocumentProcessor()
blob_client = AzureBlobStorageClient()
# Get the file name from the message
file_name = _get_file_name_from_message(msg)
# Generate the SAS URL for the file
file_sas = blob_client.get_blob_sas(file_name)
# Get file extension's processors
file_extension = file_name.split(".")[-1]

# Process the file
if env_helper.AZURE_SEARCH_USE_INTEGRATED_VECTORIZATION:
document_processor.process_using_integrated_vectorisation(source_url=file_sas)
else:
processors = list(
filter(
lambda x: x.document_type.lower() == file_extension.lower(),
ConfigHelper.get_active_config_or_default().document_processors,
)
)
document_processor.process(source_url=file_sas, processors=processors)
blob_client.upsert_blob_metadata(file_name, {"embeddings_added": "true"})
embedder = EmbedderFactory.create(env_helper)
embedder.embed_file(file_sas, file_name)
2 changes: 1 addition & 1 deletion code/backend/batch/GetConversationResponse.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@

from utilities.helpers.EnvHelper import EnvHelper
from utilities.helpers.OrchestratorHelper import Orchestrator
from utilities.helpers.ConfigHelper import ConfigHelper
from utilities.helpers.config.ConfigHelper import ConfigHelper


bp_get_conversation_response = func.Blueprint()
Expand Down
88 changes: 0 additions & 88 deletions code/backend/batch/utilities/helpers/DocumentProcessorHelper.py

This file was deleted.

6 changes: 0 additions & 6 deletions code/backend/batch/utilities/helpers/EnvHelper.py
Original file line number Diff line number Diff line change
Expand Up @@ -78,12 +78,6 @@ def __load_config(self, **kwargs) -> None:
self.AZURE_SEARCH_USE_INTEGRATED_VECTORIZATION = self.get_env_var_bool(
"AZURE_SEARCH_USE_INTEGRATED_VECTORIZATION", "False"
)
self.AZURE_SEARCH_IV_MAX_PAGE_LENGTH = int(
os.getenv("AZURE_SEARCH_IV_MAX_PAGE_LENGTH", "800")
)
self.AZURE_SEARCH_IV_PAGE_OVERLAP_LENGTH = int(
os.getenv("AZURE_SEARCH_IV_PAGE_OVERLAP_LENGTH", "100")
)

self.AZURE_AUTH_TYPE = os.getenv("AZURE_AUTH_TYPE", "keys")
# Azure OpenAI
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2,15 +2,15 @@
import json
import logging
from string import Template
from .AzureBlobStorageClient import AzureBlobStorageClient
from ..document_chunking.Strategies import ChunkingSettings, ChunkingStrategy
from ..document_loading import LoadingSettings, LoadingStrategy
from .DocumentProcessorHelper import Processor
from .OrchestratorHelper import (
from ..AzureBlobStorageClient import AzureBlobStorageClient
from ...document_chunking.Strategies import ChunkingSettings, ChunkingStrategy
from ...document_loading import LoadingSettings, LoadingStrategy
from .EmbeddingConfig import EmbeddingConfig
from ..OrchestratorHelper import (
OrchestrationSettings,
OrchestrationStrategy,
)
from .EnvHelper import EnvHelper
from ..EnvHelper import EnvHelper

CONFIG_CONTAINER_NAME = "config"
CONFIG_FILE_NAME = "active.json"
Expand All @@ -24,7 +24,7 @@ def __init__(self, config: dict):
self.example = Example(config["example"])
self.logging = Logging(config["logging"])
self.document_processors = [
Processor(
EmbeddingConfig(
document_type=c["document_type"],
chunking=(
ChunkingSettings(c["chunking"])
Expand All @@ -49,6 +49,11 @@ def __init__(self, config: dict):
self.orchestrator = OrchestrationSettings(
config.get("orchestrator", self.default_orchestration_settings)
)
self.integrated_vectorization_config = (
IntegratedVectorizationConfig(config["integrated_vectorization_config"])
if self.env_helper.AZURE_SEARCH_USE_INTEGRATED_VECTORIZATION
else None
)

def get_available_document_types(self):
document_types = [
Expand Down Expand Up @@ -107,6 +112,14 @@ def __init__(self, logging: dict):
self.log_tokens = logging["log_tokens"]


class IntegratedVectorizationConfig:
def __init__(self, integrated_vectorization_config: dict):
self.max_page_length = integrated_vectorization_config["max_page_length"]
self.page_overlap_length = integrated_vectorization_config[
"page_overlap_length"
]


class ConfigHelper:
_default_config = None

Expand Down Expand Up @@ -142,6 +155,11 @@ def _set_new_config_properties(config: dict, default_config: dict):
if config.get("example") is None:
config["example"] = default_config["example"]

if config.get("integrated_vectorization_config") is None:
config["integrated_vectorization_config"] = default_config[
"integrated_vectorization_config"
]

@staticmethod
def get_active_config_or_default():
env_helper = EnvHelper()
Expand Down Expand Up @@ -175,9 +193,7 @@ def get_default_config():
if ConfigHelper._default_config is None:
env_helper = EnvHelper()

config_file_path = os.path.join(
os.path.dirname(__file__), "config", "default.json"
)
config_file_path = os.path.join(os.path.dirname(__file__), "default.json")

with open(config_file_path) as f:
logger.info(f"Loading default config from {config_file_path}")
Expand Down
27 changes: 27 additions & 0 deletions code/backend/batch/utilities/helpers/config/EmbeddingConfig.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
from ..DocumentLoadingHelper import LoadingSettings
from ..DocumentChunkingHelper import ChunkingSettings


class EmbeddingConfig(ChunkingSettings, LoadingSettings):
def __init__(
self,
document_type: str,
chunking: ChunkingSettings | None,
loading: LoadingSettings | None,
use_advanced_image_processing: bool,
):
self.document_type = document_type
self.chunking = chunking
self.loading = loading
self.use_advanced_image_processing = use_advanced_image_processing

def __eq__(self, other):
if isinstance(self, other.__class__):
return (
self.document_type == other.document_type
and self.chunking == other.chunking
and self.loading == other.loading
and self.use_advanced_image_processing
== other.use_advanced_image_processing
)
return False
4 changes: 4 additions & 0 deletions code/backend/batch/utilities/helpers/config/default.json
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,10 @@
}
}
],
"integrated_vectorization_config": {
"max_page_length": "800",
"page_overlap_length": "100"
},
"logging": {
"log_user_interactions": true,
"log_tokens": true
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
from abc import ABC, abstractmethod


class EmbedderBase(ABC):
@abstractmethod
def embed_file(self, source_url: str, file_name: str):
pass
15 changes: 15 additions & 0 deletions code/backend/batch/utilities/helpers/embedders/EmbedderFactory.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
from ..EnvHelper import EnvHelper
from ..AzureBlobStorageClient import AzureBlobStorageClient
from .PushEmbedder import PushEmbedder
from .IntegratedVectorizationEmbedder import (
IntegratedVectorizationEmbedder,
)


class EmbedderFactory:
@staticmethod
def create(env_helper: EnvHelper):
if env_helper.AZURE_SEARCH_USE_INTEGRATED_VECTORIZATION:
return IntegratedVectorizationEmbedder(env_helper)
else:
return PushEmbedder(AzureBlobStorageClient())
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
from .EmbedderBase import EmbedderBase
from ..EnvHelper import EnvHelper
from ..LLMHelper import LLMHelper
from ...integrated_vectorization.AzureSearchIndex import AzureSearchIndex
from ...integrated_vectorization.AzureSearchIndexer import AzureSearchIndexer
from ...integrated_vectorization.AzureSearchDatasource import AzureSearchDatasource
from ...integrated_vectorization.AzureSearchSkillset import AzureSearchSkillset
from ..config.ConfigHelper import ConfigHelper
import logging

logger = logging.getLogger(__name__)


class IntegratedVectorizationEmbedder(EmbedderBase):
def __init__(self, env_helper: EnvHelper):
self.env_helper = env_helper
self.llm_helper: LLMHelper = LLMHelper()

def embed_file(self, source_url: str, file_name: str):
self.process_using_integrated_vectorization(source_url=source_url)
cecheta marked this conversation as resolved.
Show resolved Hide resolved

def process_using_integrated_vectorization(self, source_url: str):
config = ConfigHelper.get_active_config_or_default()
try:
search_datasource = AzureSearchDatasource(self.env_helper)
search_datasource.create_or_update_datasource()
search_index = AzureSearchIndex(self.env_helper, self.llm_helper)
search_index.create_or_update_index()
search_skillset = AzureSearchSkillset(
self.env_helper, config.integrated_vectorization_config
)
search_skillset_result = search_skillset.create_skillset()
search_indexer = AzureSearchIndexer(self.env_helper)
indexer_result = search_indexer.create_or_update_indexer(
self.env_helper.AZURE_SEARCH_INDEXER_NAME,
skillset_name=search_skillset_result.name,
)
return indexer_result
except Exception as e:
logger.error(f"Error processing {source_url}: {e}")
raise e
Loading