custom dataset creation for unisar #37
Maybe you can try converting your dataset into the Spider format as a first step? Then you can use UniSAR on it directly! Pinging @DreamerDeo for further help if needed.
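For reference, here is a minimal sketch (with hypothetical table and column names) of what a Spider-style `tables.json` entry could look like for two tables linked by a foreign key. The field names follow the public Spider format, but double-check against the `tables.json` shipped with Spider before relying on it:

```python
import json

# A minimal Spider-style schema entry for two hypothetical tables
# ("customers" and "orders") linked by customer_id. Column index 0 is
# always the special "*" column; every other column is a
# [table_index, column_name] pair.
tables_entry = {
    "db_id": "my_custom_db",                      # hypothetical database name
    "table_names_original": ["customers", "orders"],
    "table_names": ["customers", "orders"],
    "column_names_original": [
        [-1, "*"],
        [0, "customer_id"], [0, "name"],
        [1, "order_id"], [1, "customer_id"], [1, "amount"],
    ],
    "column_names": [
        [-1, "*"],
        [0, "customer id"], [0, "name"],
        [1, "order id"], [1, "customer id"], [1, "amount"],
    ],
    "column_types": ["text", "number", "text", "number", "number", "number"],
    "primary_keys": [1, 3],    # customers.customer_id, orders.order_id
    "foreign_keys": [[4, 1]],  # orders.customer_id -> customers.customer_id
}

# Spider's tables.json is a list of such entries.
with open("tables.json", "w") as f:
    json.dump([tables_entry], f, indent=2)
```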
Hi @srewai, many thanks for your interest in UniSAR. To make UniSAR work on your own dataset, the only job is to run step2 to generate files in the same format as the fairseq_preprocessed_file. If you don't want to include the schema-linking information (exact match, partial match, and value match; a rough illustration of the idea appears after this reply), you can simply skip step1.py; the performance difference with and without schema linking is about 3% on Spider. You could also simply transform your data into the Spider format, as @SivilTaram mentioned (the easiest way); then other Spider codebases (e.g., IRNet, RAT-SQL) could be employed too. Otherwise, here is what I think you need to do:
That's all you need to do to preprocess the data. Then you can train and infer as described in train.py and step3.py. Note that the current version of UniSAR is built on BART-large (English). That is, if a non-English parser is desired, you need to use the mBART preprocessing (step2.py) and training command (train.py) as described in the official fairseq documentation. If you need further help, please let me know. I'm very glad to help :)
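As a rough, hypothetical illustration of the schema-linking idea mentioned above (this is not UniSAR's actual step1.py implementation), exact and partial matches between question tokens and column names could be computed like this; value matching would additionally compare question tokens against database cell values:

```python
# Hypothetical sketch of schema linking: tag each column as an exact or
# partial match against the question tokens.
def link_schema(question, columns):
    q_tokens = question.lower().split()
    q_text = " ".join(q_tokens)
    tags = {}
    for col in columns:
        col_tokens = col.lower().split()
        if " ".join(col_tokens) in q_text:
            tags[col] = "exact_match"      # full column name appears in the question
        elif any(tok in q_tokens for tok in col_tokens):
            tags[col] = "partial_match"    # only some column tokens appear
        else:
            tags[col] = "no_match"
    return tags

print(link_schema("show the order amount for each customer id",
                  ["customer id", "order id", "amount"]))
# {'customer id': 'exact_match', 'order id': 'partial_match', 'amount': 'exact_match'}
```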
First of all, thank you very much @DreamerDeo and @SivilTaram. In any case, thank you so much once again! I will be back soon with some questions :D
@srewai Actually, mBART shows promising (though not as good as English) performance in German (i.e., the question and DB are in German and the output SQL also involves German headers). In our other project on multilingual text-to-SQL, we compare (1) mBART-CC25 with (2) RAT-SQL + XLM-R-Large. (That dataset/paper will be released later this month; I will ping you on this issue. Welcome to follow :) ) Of course, you could try mBART first since it's more scalable and training is extremely fast (thanks to fairseq).
@DreamerDeo, that would be amazing. Please ping me when the multilingual paper is released. I'm waiting for it :)
@srewai Q1: I will need to create a SQLite db out of these tables, correct?
This function would return the values in the same format as
Q2: What are the steps for inference on new tables? If you want online inference like a demo, please try
You can just enter your question about the table, and the command line will return the predicted SQL. (For Q1, a rough sketch of building such a SQLite file follows this comment.)
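As a rough sketch of Q1, assuming pandas and sqlite3 and hypothetical file names (Spider keeps each database under `database/<db_id>/<db_id>.sqlite`), one table from an Excel sheet could be loaded into such a SQLite file like this:

```python
import os
import sqlite3
import pandas as pd

# Hypothetical paths and table names -- adjust to your own data.
df = pd.read_excel("orders.xlsx")  # one sheet = one table

# Spider-style layout: database/<db_id>/<db_id>.sqlite
os.makedirs("database/my_custom_db", exist_ok=True)
conn = sqlite3.connect("database/my_custom_db/my_custom_db.sqlite")
df.to_sql("orders", conn, index=False, if_exists="replace")
conn.close()
```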
@DreamerDeo, great! Let me try that out. So first I will need to convert my Excel file to JSON, after which I can use the function:
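The specific conversion function referred to above isn't reproduced in this thread. Purely as a hypothetical illustration (the file names and the record layout are assumptions, not necessarily the format that function expects), an Excel sheet could be dumped to JSON with pandas like this:

```python
import json
import pandas as pd

# Hypothetical sketch: dump an Excel sheet to JSON as one dict per row.
df = pd.read_excel("orders.xlsx")
records = df.to_dict(orient="records")  # e.g. [{"order_id": 1, "status": "shipped"}, ...]

with open("orders.json", "w") as f:
    json.dump(records, f, indent=2, default=str)  # default=str handles timestamps
```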
Hi @srewai, this is the paper about multilingual Spider: https://arxiv.org/pdf/2212.13492.pdf. Welcome to read it :) The codebase and dataset will be released in about one month. (I need more time to prepare for the ACL deadline.)
Hi,
Thank you.
@longxudou Maybe Longxu can help with this question.
@epejhan90 Thanks for your interest!

Q1: What are the preprocessing steps to add my table to the db and run the three steps? The codebase doesn't support this out of the box, but you can implement it along the lines you describe. I think it's reasonable to append the schema of the new db to tables.json, but you should also store the content of the db, via (1) creating a db SQLite file as Spider does, or (2) storing all the content in another JSON file. Note that the current version of the UniSAR codebase only supports non-value SQL generation. This is because (1) the research benchmarks mainly focus on SQL-sketch accuracy rather than SQL execution accuracy, and (2) it shortens the input to avoid exceeding the maximum length. If you want to make your SQL executable, just append the values after the corresponding columns; it works in my experiments (see the sketch below).

Q2: My table also has timestamp columns; does the model support that? I didn't take timestamp cases into account, but I think you can just treat a timestamp as a special column. You don't have to change the pipeline, except for replacing this column with concrete values after SQL generation.

Q3: How many samples would be enough for my new db? If the domains of your db (assumed to be in English) overlap with the Spider train set, I think it's unnecessary to fine-tune the model further. If you want to fine-tune the model for better performance, I think fifty cases are enough and still scalable.
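As a hypothetical sketch of "append the values after the corresponding columns" (the exact serialization UniSAR expects may differ), the model input could be built like this:

```python
# Hypothetical sketch of appending DB values after their columns in the
# serialized schema, so the generated SQL can copy real values.
def serialize_schema_with_values(question, schema, values):
    parts = [question]
    for table, columns in schema.items():
        cols = []
        for col in columns:
            cell_values = values.get((table, col), [])
            if cell_values:
                cols.append(f"{col} ( {' , '.join(map(str, cell_values))} )")
            else:
                cols.append(col)
        parts.append(f"{table} : {' , '.join(cols)}")
    return " | ".join(parts)

schema = {"orders": ["order_id", "status", "amount"]}
values = {("orders", "status"): ["shipped", "cancelled"]}
print(serialize_schema_with_values("show all cancelled orders", schema, values))
# show all cancelled orders | orders : order_id , status ( shipped , cancelled ) , amount
```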
@longxudou Thank you so much for your reply. I was able to fix it. The only remaining problem is that in my queries, instead of the real values I put in my question, I get the token 'value'. (I'm using the interactive script.) Do you know how I can fix it, or have I missed some preprocessing? It is the same for other databases.
@epejhan90 This is because the released checkpoint is the no-value version, which is trained on "query_toks_no_value" rather than "query_toks" from the Spider dataset. If you want to fix this (make the SQL contain real values), just follow these two steps: (1) append the values in the input; (2) train the model on the "query_toks" SQL, which contains the values. I will release a new version of the checkpoint next month, which can predict values. You can use that version as well.
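As a minimal sketch of step (2), assuming the standard Spider train_spider.json fields, the target SQL strings could be built from "query_toks" instead of "query_toks_no_value":

```python
import json

# Minimal sketch: build target SQL strings from Spider-style examples.
# "query_toks_no_value" has literals replaced by the placeholder token
# "value"; "query_toks" keeps the original literals, so train on it if
# the model should generate real values.
with open("train_spider.json") as f:
    examples = json.load(f)

targets_no_value = [" ".join(ex["query_toks_no_value"]) for ex in examples]
targets_with_value = [" ".join(ex["query_toks"]) for ex in examples]
```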
@longxudou Thank you. I don't have access to any GPU at the moment and I'm running the code on Google Colab. Do you think it is possible to retrain the model there?
@epejhan90 I think you could first fine-tune our non-value checkpoint to produce values by constructing such a training corpus. Otherwise, it is time-consuming to fine-tune BART-large from scratch from the vanilla BART checkpoint on a single GPU: in my experiments, I used four V100 16GB GPUs for about 10 hours.
Hello @SivilTaram,
I have two tables, and they can be linked via a primary/foreign key relationship. I would like to use UniSAR on these tables. Could you please share steps or hints on how I can create a custom dataset to make use of UniSAR? I appreciate your help :)