
custom dataset creation for unisar #37

Open
srewai opened this issue Nov 8, 2022 · 15 comments

srewai commented Nov 8, 2022

Hello @SivilTaram,

I have two tables that can be linked via a primary/foreign key, and I would like to use UniSAR on them. Could you please share steps or hints on how to create a custom dataset for UniSAR? I appreciate your help :)

SivilTaram (Collaborator) commented

Maybe you can try converting your dataset into the Spider format as a first step? Then you can directly use UniSAR on it! Ping @DreamerDeo for further help if needed.

longxudou (Contributor) commented Nov 14, 2022

Hi @srewai, many thanks for your interest in UniSAR.

To make UniSAR work on your own dataset, the only job is to call step2 to generate files in the same format as fairseq_preprocessed_file. If you don't want to include the schema-linking information (e.g., exact match, partial match, and value match), you can skip step1.py entirely. The performance difference on Spider with/without schema linking is about 3%.

You could simply transform your data into the Spider format, as @SivilTaram mentioned (the easiest way). Moreover, other Spider codebases (e.g., IRNet, RAT-SQL) could then be employed too.

Otherwise, here is what I think you need to do:

  1. The trickiest part is deriving the schema-linking tags (step1.py). The format differences between your data and Spider will make some functions in step1 (e.g., DBContext) fail. Luckily, it is not very hard to implement a derive_schema_linking function yourself: the inputs are the question, the table headers, and the table values, and the output should be the schema-linking information, i.e. return value_match, value_alignment, exact_match, partial_match. You can simply use fuzzy string matching to achieve this (see the sketch after this list).
  2. Make sure the schema-linking file follows the same format as schema_linking_file. Then you can directly call step2.py to generate the fairseq training files.
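
For illustration, here is a minimal sketch of such a function using difflib for fuzzy matching. The threshold and the dictionary-of-positions output layout are my assumptions; check schema_linking_file for the exact format the pipeline expects:

import difflib
from typing import Dict, List, Tuple

def derive_schema_linking(
    question_tokens: List[str],
    headers: List[str],
    values: List[str],
    threshold: float = 0.85,
) -> Tuple[Dict[int, str], Dict[int, str], Dict[int, str], Dict[int, str]]:
    """Map question-token positions to fuzzily matching headers/values."""
    value_match: Dict[int, str] = {}
    value_alignment: Dict[int, str] = {}
    exact_match: Dict[int, str] = {}
    partial_match: Dict[int, str] = {}

    def similar(a: str, b: str) -> bool:
        return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

    for i, tok in enumerate(t.lower() for t in question_tokens):
        for header in headers:
            h = header.lower()
            if tok == h:
                exact_match[i] = header          # token is exactly a header
            elif tok in h.split() or similar(tok, h):
                partial_match[i] = header        # token covers part of a header
        for val in values:
            v = str(val).lower()
            if tok == v or similar(tok, v):
                value_match[i] = str(val)        # token matches a cell value
                value_alignment[i] = v           # record the aligned surface form
    return value_match, value_alignment, exact_match, partial_match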

That's all the preprocessing you need to do. Then you can train and infer as described in train.py and step3.py.

Note that the current version of UniSAR is built on BART-large (English). That is, if you want a non-English parser, you need to use the mBART preprocessing (step2.py) and training command (train.py) as described in the official fairseq documentation.

If you need further help, please let me know. I'm very glad to assist :)

srewai (Author) commented Nov 14, 2022

First of all, thank you very much @DreamerDeo and @SivilTaram.
@DreamerDeo, it's amazing that you shared such detailed data-preparation steps. I am going to implement them and get back to you with any concerns. It's great that you went the extra mile and mentioned the non-English parser.
In addition to the English parser, I would like to test mBART as well. Do you think mBART is a good choice for parsing German tables?

In any case, thank you so much once again! I will be back soon with some questions :D

longxudou (Contributor) commented Nov 15, 2022

@srewai Actually, mBART shows promising (though not as good as English) performance on German (i.e., the question and DB are in German, and the output SQL also contains German headers).

In another project of ours on multilingual text-to-SQL, we compared (1) mBART-CC25 with (2) RAT-SQL + XLM-R-large. (That dataset/paper will be released later this month; I will ping you on this issue. Welcome to follow :) )
The set-match accuracy of mBART (without constrained decoding) is about 40%, while RAT-SQL achieves 62%.
Through case studies (done by native-speaker vendors), most of vanilla mBART's bad cases can be categorized as synonym problems (e.g., nation -> citizen in English). This can be alleviated by carefully designing the construction of the prefix tree that constrains the decoding process (a bit tricky, but not very hard); see the sketch below.
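
For illustration, a minimal token-level prefix tree (trie) for constrained decoding; the class and method names here are mine, not from our codebase. The idea is that at each decoding step, only tokens that extend some valid schema item are allowed:

from typing import Dict, List

class PrefixTree:
    """Trie over token-ID sequences of valid schema items (tables/columns)."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixTree"] = {}  # token id -> subtree
        self.is_end = False  # True if a complete schema item ends here

    def add(self, token_ids: List[int]) -> None:
        node = self
        for tid in token_ids:
            node = node.children.setdefault(tid, PrefixTree())
        node.is_end = True

    def allowed_next(self, prefix: List[int]) -> List[int]:
        """Token IDs that extend `prefix` toward some complete schema item."""
        node = self
        for tid in prefix:
            if tid not in node.children:
                return []  # prefix left the trie: nothing is allowed
            node = node.children[tid]
        return list(node.children)

# During beam search, mask the logits so that only allowed_next(...) tokens keep
# probability mass; both fairseq and transformers expose hooks for this kind of
# prefix-constrained generation.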

Of course, you could try mBART first, since it's more scalable and training is extremely fast (thanks to fairseq).
But if you have enough computational resources, I would recommend trying mT5 rather than mBART for non-English settings. It's more powerful (a lot of multilingual NLP work also prefers mT5).

srewai (Author) commented Nov 16, 2022

@DreamerDeo, that would be amazing. Please ping me when the multilingual paper is released; I'm waiting for it :)
For now I only have two tables without NL questions, so I don't think it is possible to create a training dataset. However, I would like to run UniSAR inference on these tables, which I have already cleaned and processed. If I understand correctly, to run inference with UniSAR I will need to create a SQLite DB out of these tables, correct? Or could you please share the steps for inference on new tables?
Thank you.

longxudou (Contributor) commented

@srewai
You can wait for our multilingual Spider to train your own parser. It supports Chinese/Japanese/German/French/Spanish/Vietnamese, with about 10K examples and 120 tables.

Q1: I will need to create a SQLite DB out of these tables, correct?
No. Our code can read database values from .json too. Try this function:

def read_dataset_values_from_json(db_id: str, db_content_dict: Dict[str, Any], tables: List[str]):

It returns values in the same format as:

def read_dataset_values(db_id: str, dataset_path: str, tables: List[str]):
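
A hedged usage sketch; the import path below is a placeholder (locate the actual module in the repo), and the exact layout of db_content_dict should be checked against the function body:

import json

# Placeholder import path; find where the function actually lives in the repo.
from step1_schema_linking import read_dataset_values_from_json

with open("my_db_content.json") as f:
    db_content_dict = json.load(f)   # keyed by db_id, per the signature above

values = read_dataset_values_from_json(
    db_id="my_db",                   # illustrative db_id
    db_content_dict=db_content_dict,
    tables=["alarmturbine"],         # illustrative table name
)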

Q2: The steps for inference on new tables.
If you want offline prediction, just serialize your table schema and question into the format shown in https://github.com/microsoft/ContextualSP/tree/master/unified_parser_text_to_sql/dataset_post/spider_sl

If you want online inference like a demo, please try:

def infer_query(self, question, db_id):

You can just type your question about the table, and the command line will return the predicted SQL.
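
For example (assuming parser is the object the interactive script constructs, which exposes infer_query):

# Sketch: `parser` stands for whatever object the interactive script builds.
sql = parser.infer_query("how many has status between 10 and 20", "my_db")
print(sql)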

srewai (Author) commented Nov 18, 2022

@DreamerDeo, great! Let me try that out. So first I need to convert my Excel file to JSON, after which I can use the function:

def read_dataset_values_from_json(db_id: str, db_content_dict: Dict[str, Any], tables: List[str]):
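
For instance, a minimal pandas sketch of that conversion, assuming one sheet per table; the "header"/"rows" keys are a guess at the content layout and still need to be verified against what read_dataset_values_from_json actually reads:

import json
import pandas as pd

# sheet_name=None loads every sheet into a dict: sheet name -> DataFrame.
sheets = pd.read_excel("my_tables.xlsx", sheet_name=None)

db_content = {
    "my_db": {
        table_name: {
            "header": list(df.columns),
            "rows": df.astype(str).values.tolist(),
        }
        for table_name, df in sheets.items()
    }
}

with open("my_db_content.json", "w") as f:
    json.dump(db_content, f, ensure_ascii=False, indent=2)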

longxudou (Contributor) commented Dec 30, 2022

Hi @srewai, this is the paper about multilingual Spider: https://arxiv.org/pdf/2212.13492.pdf . Welcome to read :)

The codebase and dataset will be released in about one month (I need more time to prepare for the ACL deadline).
When the dataset is released, I will also ping you in this issue :)

epejhan90 commented
Hi,
Thank you for the explanation. I still have three questions about running the model on my own data. I was able to run the three steps of unified_parser_text_to_sql on the Spider data and also run the interactive option. Now I would like to query my own data.

  1. I have one table at the moment. To add my data to the DB, I have created its schema and added the table to tables.json. Should I also manually add some queries about my table to dev.json? My general question is: what preprocessing is needed to add my table to the DB so that I can run the three steps?

  2. My table also has timestamp columns; does the model support those?

  3. How many samples would be enough to add for my new DB?

Thank you.

SivilTaram (Collaborator) commented

@longxudou Maybe Longxu can help with this question.

SivilTaram reopened this May 9, 2023
longxudou (Contributor) commented
@epejhan90 Thanks for your interest!

Q1: What preprocessing is needed to add my table to the DB so that I can run the three steps?

The codebase doesn't support this out of the box, but you can implement it along the lines you describe. It is reasonable to append the schema of the new DB to tables.json (see the sketch below). However, you should also store the DB contents somewhere else, either by (1) creating a SQLite DB file as Spider does, or (2) storing all the contents in another JSON file.
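
For reference, a minimal Spider-style tables.json entry for a single-table DB; the field names follow the standard Spider schema format, while the table and column names are just illustrative:

import json

new_schema = {
    "db_id": "alarmturbine_db",
    "table_names": ["alarm turbine"],
    "table_names_original": ["alarmturbine"],
    # column 0 is always the special "*" column with table index -1
    "column_names": [[-1, "*"], [0, "id"], [0, "status"], [0, "timestamp"]],
    "column_names_original": [[-1, "*"], [0, "id"], [0, "status"], [0, "timestamp"]],
    "column_types": ["text", "number", "number", "time"],
    "primary_keys": [1],
    "foreign_keys": [],
}

with open("tables.json") as f:
    tables = json.load(f)
tables.append(new_schema)
with open("tables.json", "w") as f:
    json.dump(tables, f, indent=2)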

Note that the current version of the UniSAR codebase only supports non-value SQL generation. This is because (1) the research benchmarks mainly focus on SQL-sketch accuracy rather than SQL execution accuracy, and (2) it shortens the input, which avoids exceeding the maximum input length.

If you want to make your SQL executable, just append the values after the corresponding columns in the input; that worked in my experiments (a sketch of the serialization follows).
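
A sketch of what appending values after the corresponding columns could look like; the separator tokens here are illustrative, so copy the exact serialization from the step2 output:

from typing import Dict, List

def serialize_with_values(
    question: str,
    schema: Dict[str, List[str]],          # table -> columns
    picked_values: Dict[str, List[str]],   # column -> candidate DB values
) -> str:
    parts = [question]
    for table, columns in schema.items():
        cols = []
        for col in columns:
            vals = picked_values.get(col, [])
            # append candidate cell values right after the column name
            cols.append(f"{col} ( {' , '.join(vals)} )" if vals else col)
        parts.append(f"{table} : {' , '.join(cols)}")
    return " | ".join(parts)

print(serialize_with_values(
    "how many has status between 10 and 20",
    {"alarmturbine": ["id", "status"]},
    {"status": ["10", "20"]},
))
# -> how many has status between 10 and 20 | alarmturbine : id , status ( 10 , 20 )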

Q2: My table also has timestamp columns; does the model support those?

I did not take timestamp cases into account, but I think you can just treat a timestamp as a special column. You don't have to change the pipeline, except for replacing the placeholder with concrete values after SQL generation.
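
As an illustration of replacing placeholders with concrete values after generation, here is a heuristic post-processing step (not part of the codebase) that fills the predicted 'value' slots with literals extracted from the question:

import re

def fill_placeholders(pred_sql: str, question: str) -> str:
    """Replace each 'value' placeholder, in order, with a literal from the question."""
    # Match ISO-style timestamps first, then plain numbers.
    literals = re.findall(r"\d{4}-\d{2}-\d{2}[T ]?[\d:]*|\d+(?:\.\d+)?", question)
    for lit in literals:
        pred_sql = pred_sql.replace("'value'", f"'{lit}'", 1)
    return pred_sql

print(fill_placeholders(
    "select count ( * ) from alarmturbine where alarmturbine@status between 'value' and 'value'",
    "how many has status between 10 and 20",
))
# -> ... between '10' and '20'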

Q3: How many samples would be enough for my new DB?

If the domains of your DB (assuming it is in English) overlap with the Spider training set, I think it is unnecessary to fine-tune the model further.
See https://github.com/ryanzhumich/editsql/blob/master/data/spider/train_db_ids.txt for details of the domain coverage in Spider.

If you do want to fine-tune the model for better performance, I think fifty examples are enough, and that also scales well.

epejhan90 commented May 19, 2023

@longxudou Thank you so much for your reply; I was able to fix it. The only remaining problem is that the predicted queries contain the literal token 'value' instead of the real values from my question (I'm using the interactive script).
For example:
Ask a question: how many has status between 10 and 20
pred: select count ( * ) from alarmturbine where alarmturbine@status between 'value' and 'value'

Do you know how I can fix this, or have I missed some preprocessing? It is the same for other databases.

longxudou (Contributor) commented May 19, 2023

@epejhan90 This is because the released checkpoint is the no-value version, which was trained on "query_toks_no_value" rather than "query_toks" from the Spider dataset. If you want the generated SQL to contain real values, follow these two steps: (1) append the values to the input; (2) train the model on the "query_toks" SQL with values (see the sketch below). I will release a new version of the checkpoint next month that can predict values; you can use that version as well.
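
A sketch of that switch when building training targets from Spider's train_spider.json (the field names are the standard Spider ones):

import json

WITH_VALUES = True  # False reproduces the released no-value targets

with open("train_spider.json") as f:
    examples = json.load(f)

targets = []
for ex in examples:
    toks = ex["query_toks"] if WITH_VALUES else ex["query_toks_no_value"]
    targets.append(" ".join(toks))
# Write `targets` out as the fairseq target side, mirroring what step2.py produces.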

epejhan90 commented
@longxudou Thank you. I don't have access to a GPU at the moment and I'm running the code on Google Colab. Do you think it is possible to retrain the model there?

longxudou (Contributor) commented May 19, 2023

@epejhan90 I think you could first fine-tune our non-value checkpoint to produce values, after constructing the training corpus. Otherwise, fine-tuning BART-large from the vanilla BART model on a single GPU is time-consuming: in my experiments, I used four V100-16G GPUs for 10 hours.
