Skip to content

Conversation

@paulinek13
Copy link
Contributor

This PR resolves #130 by lazy loading data files in garak/detectors/unsafe_content.py.
The data files are now loaded only when needed and cached for subsequent use.

@paulinek13 paulinek13 force-pushed the perf/130/lazyload_unsafe_content branch from 1acd37f to b476280 Compare October 28, 2025 18:17
Copy link
Collaborator

@jmartin-tech jmartin-tech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting, can we validate and quantify benefit here?

I am reluctant to introduce something that uses lru_cache decorators when accessing data_path files as these can be overridden with user supplied files. Plugins are all now lazy loaded in standard installs.

At this time garak does not execute in a service mode, however at least one user has already suggested a service daemon mode in #1325 and this could introduce a need to invalidate the cache as a cursory investigation suggests reimport of a package file does not clear the lru_cache entires.

@leondz
Copy link
Collaborator

leondz commented Oct 29, 2025

This is a great idea. As @jmartin-tech mentions - validity of the cached result is predicated on the value data_path resolves to.

I don't think we want to worry about supporting the possibility that file content changes mid-run - really that would invalidate the run - so I see no risks from caching there.

At this time garak does not execute in a service mode, however at least one user has already suggested a service daemon mode in #1325 and this could introduce a need to invalidate the cache as a cursory investigation suggests reimport of a package file does not clear the lru_cache entires.

If we build it, service mode would need invalidation for many things, this is true

Also: does the caching persist across executions? If not, how many times are we really calling this code?

Good idea though - suspect there are multiple other places this concept can help

@leondz leondz added the quality-speed This affects the speed of program use label Oct 29, 2025
@paulinek13
Copy link
Contributor Author

does the caching persist across executions

It does not

how many times are we really calling this code?

Well, actually the loading happens only once per garak run.

My bad as I should have checked that first, because it seems that the problem I'm trying to solve here doesn't really exist.

--

Avoiding loading the CSVs at module import time (as mentioned in #130) makes sense in case of listing probes with --list_probes (files loaded but not needed), but even then it takes a fraction of a second so I don't know whether it's worth the added complexity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

quality-speed This affects the speed of program use

Projects

None yet

Development

Successfully merging this pull request may close these issues.

lazyload riskywords / unsafe_content

3 participants