can't add personal data db/collection to auth.json #1684

Closed
rxng opened this issue Jun 13, 2024 · 6 comments

rxng commented Jun 13, 2024

According to the instructions, we can add a database built with make_db.py to auth.json, but they do not specify exactly how to do this.

To make a new one for the user, fill `user_path_jon` with documents (these can be soft or hard linked to avoid duplicates across multiple users), then run:
```bash
python src/make_db.py --user_path=gptdocsdb/jon --collection_name=JonData --langchain_type=personal --hf_embedding_model=hkunlp/instructor-large --persist_directory=users/jon/db_dir_JonData
```

Then you'll have:

```text
(h2ogpt) jon@pseudotensor:~/h2ogpt$ ls -alrt users/jon/db_dir_JonData/
total 264
drwx------ 13 jon jon   4096 Apr 16 12:28 ../
drwx------  2 jon jon   4096 Apr 16 12:28 d7ccacb6-93fe-4380-9340-b7f5edffb655/
-rw-------  1 jon jon 249856 Apr 16 12:28 chroma.sqlite3
-rw-------  1 jon jon     41 Apr 16 12:28 embed_info
drwx------  3 jon jon   4096 Apr 16 12:28 ./
```

You can add that database to the auth.json for their entry if using an auth.json type file, and they will see it when they log in.
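A minimal sketch of assembling the `make_db.py` call from Python, so that every flag is passed as its own argument and two flags cannot run together. The flag names and values are the ones quoted above; the helper name `make_db_command` is hypothetical and not part of h2oGPT.

```python
import shlex


def make_db_command(user_path, collection_name, persist_directory,
                    langchain_type="personal",
                    hf_embedding_model="hkunlp/instructor-large"):
    """Build the argument list for src/make_db.py (hypothetical helper).

    Only flags that appear in the quoted instructions are used.
    """
    return [
        "python", "src/make_db.py",
        f"--user_path={user_path}",
        f"--collection_name={collection_name}",
        f"--langchain_type={langchain_type}",
        f"--hf_embedding_model={hf_embedding_model}",
        f"--persist_directory={persist_directory}",
    ]


cmd = make_db_command("gptdocsdb/jon", "JonData", "users/jon/db_dir_JonData")
# Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Running the list through `subprocess.run(cmd, check=True)` from the h2oGPT checkout would execute it; the sketch only prints the command so it can be inspected first.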


h2ogpt is being run as follows, and everything works well except that it does not load the correct collection for the user:
`python generate.py --base_model=mistral-7b-instruct-v0.2.Q8_0.gguf --score_model=None --prompt_type=instruct --auth_access=closed --auth=auth.json --guest_name='' --auth_freeze`

I have tried the following, adding db parameters, but it does not work.

```json
{
  "jon": {
    "password": "jon1306",
    "userid": "acb8fef1a77d122b5e12b261202ada7a",
    "selection_docs_state": {
      "langchain_modes": [
        "JonData",
        "LLM",
        "Disabled"
      ],
      "langchain_mode_types": {
        "JonData": "personal"
      }
    },
    "dbs": "users/jon/db_dir_JonData",
    "load_db_if_exists": "users/jon/db_dir_JonData"
  }
}
```


How do we make it so that when a user logs in, their collection JonData is automatically added?
Or is there any way to simply specify a per-user user_path? That would be easiest.

pseudotensor (Collaborator) commented Jun 13, 2024

If you are trying this for shared collection, did you try the CLI options?

https://github.com/h2oai/h2ogpt/blob/main/docs/README_LangChain.md#multiple-embeddings-and-sources

i.e.

```bash
python generate.py --model_lock="[{'base_model': 'llama', 'model_path_llama': 'Phi-3-mini-4k-instruct-q4.gguf', 'tokenizer_base_model': 'microsoft/Phi-3-mini-4k-instruct'}]" --use_auth_token=$HUGGING_FACE_HUB_TOKEN --langchain_modes="['UserData', 'MyData', 'UserData2']"
```

Would show all users those 2 by default.

Even if a user logs in that already had a db entry, they will be forced to see those CLI ones.

If the system is already online, there's currently no way to add a collection to all users at once without restarting, e.g. via some kind of global user settings. Is that what you are trying to achieve?

pseudotensor (Collaborator) commented

For personal collections, there are no CLI options; that is only in the db/json file. By default, a sqlite3 db is used in newer h2oGPT to address speed issues with json, so one would have to edit the db using operations like those in src/db_utils.py.

I'll think about how to handle this better, probably adding an option to add things via the admin page is best. Would that work for you?

rxng (Author) commented Jun 14, 2024

Thanks for your quick response! Maybe my explanation was confusing. What I was trying to achieve is that when a user logs in, their own collection is automatically loaded for them.

I tried every single parameter and just found a way to do it via the auth.json file, by adding the line `"langchain_mode": "JonData",` above the selection_docs_state entry, like so:

    "langchain_mode": "JonData",
    "selection_docs_state": {
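For illustration, the same edit could be scripted rather than made by hand. This is a minimal sketch that works on an already loaded auth dict; the key names come from the auth.json shown earlier in this thread, while the helper `set_default_collection` is hypothetical and not part of h2oGPT. It also assumes that the position of `langchain_mode` within the user's entry does not matter, which this thread does not confirm.

```python
import copy
import json


def set_default_collection(auth, username, collection, mode_type="personal"):
    """Return a copy of `auth` with `collection` preselected for `username`.

    Hypothetical helper: key names follow the auth.json shown in this thread.
    """
    updated = copy.deepcopy(auth)  # leave the caller's dict untouched
    entry = updated[username]
    docs_state = entry.setdefault("selection_docs_state", {})
    modes = docs_state.setdefault("langchain_modes", [])
    if collection not in modes:
        modes.insert(0, collection)  # make sure the collection is listed
    docs_state.setdefault("langchain_mode_types", {})[collection] = mode_type
    entry["langchain_mode"] = collection  # the line that made it load at login
    return updated


auth = {
    "jon": {
        "selection_docs_state": {
            "langchain_modes": ["LLM", "Disabled"],
            "langchain_mode_types": {},
        }
    }
}
updated = set_default_collection(auth, "jon", "JonData")
print(json.dumps(updated["jon"], indent=2))
```

To apply it to a real file, load auth.json with `json.load`, call the helper, and write the result back with `json.dump` while h2oGPT is stopped.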

The only question I have is: if we then wanted to add more documents to the collection via make_db.py, would we have to restart the entire h2ogpt instance for it to use the updated collection?

It would definitely be great if there was an admin page where these things could easily be managed :)

pseudotensor (Collaborator) commented

[4 images]

rxng (Author) commented Jun 28, 2024

[4 images]

that's so amazing @pseudotensor !!

pseudotensor (Collaborator) commented

Note that if your auth file is a .json file, just pass the same name with a .db extension to the CLI instead, and we'll migrate it to the .db format that is required for this control.

h2ogpt/src/db_utils.py, lines 80 to 101 in 3498b03:

```python
# Connect to an SQLite database (change the database path as necessary)
if auth_filename.endswith('.json'):
    json_filename = auth_filename
    db_filename = auth_filename[:-4] + '.db'
else:
    assert auth_filename.endswith('.db')
    db_filename = auth_filename
    json_filename = auth_filename[:-3] + '.json'
if os.path.isfile(db_filename) and os.path.getsize(db_filename) == 0:
    os.remove(db_filename)
if os.path.isfile(json_filename) and os.path.getsize(json_filename) == 0:
    os.remove(json_filename)
if os.path.isfile(json_filename) and not os.path.isfile(db_filename):
    # then make, one-time migration
    with open(json_filename, 'rt') as f:
        auth_dict = json.load(f)
    create_table(db_filename)
    upsert_auth_dict(db_filename, auth_dict, verbose=verbose)
    # Slow way:
    # [upsert_user(db_filename, username1, auth_dict[username1]) for username1 in auth_dict]
```
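To see which file names the excerpt derives for a given `--auth` value, here is a small standalone sketch that copies only the slicing and the migration condition from the quoted lines. The helpers `resolve_auth_filenames` and `needs_migration` are hypothetical names, and the zero-size file cleanup from the excerpt is deliberately left out.

```python
import os


def resolve_auth_filenames(auth_filename):
    """Return (json_filename, db_filename) using the slicing quoted above."""
    if auth_filename.endswith('.json'):
        json_filename = auth_filename
        db_filename = auth_filename[:-4] + '.db'  # slicing copied as quoted
    else:
        assert auth_filename.endswith('.db')
        db_filename = auth_filename
        json_filename = auth_filename[:-3] + '.json'
    return json_filename, db_filename


def needs_migration(auth_filename):
    """True when a .json file exists but the matching .db file does not."""
    json_filename, db_filename = resolve_auth_filenames(auth_filename)
    return os.path.isfile(json_filename) and not os.path.isfile(db_filename)


print(resolve_auth_filenames('auth.db'))    # ('auth.json', 'auth.db')
print(resolve_auth_filenames('auth.json'))  # ('auth.json', 'auth..db')
```

With the slicing as quoted, passing `auth.db` pairs it with `auth.json`, which matches the advice to pass the `.db` name. Passing `auth.json` directly yields a double-dot name, because `[:-4]` removes only `json` and keeps the dot.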
