Using My own dataset with csv #287

Open
aqsa27 opened this issue Oct 24, 2019 · 14 comments

Comments

@aqsa27

aqsa27 commented Oct 24, 2019

Hi, I am trying to build a cdQA pipeline with my own dataset, which is in CSV. Can you tell me what format my dataset should be in?
The only converter I can find is the PDF one; there is none for CSV.
Is there any way to convert my dataset into the dataframe format that cdQA accepts?

@ghost

ghost commented Oct 25, 2019

title | paragraphs
The Article Title | [Paragraph 1 of Article, ... , Paragraph N of Article]
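As a rough illustration of the layout above, here is a minimal pandas sketch that builds a dataframe with one row per article. The sample titles and paragraph strings are made up for illustration; only the two column names, `title` and `paragraphs`, come from this thread.

```python
import pandas as pd

# Hypothetical sample articles; replace with your own content.
articles = [
    {
        "title": "The Article Title",
        "paragraphs": ["Paragraph 1 of Article", "Paragraph 2 of Article"],
    },
    {
        "title": "Another Article Title",
        "paragraphs": ["Only paragraph of this article"],
    },
]

# One row per article: `title` is a string, `paragraphs` is a list of strings.
df = pd.DataFrame(articles, columns=["title", "paragraphs"])
print(df)
```

Each cell in `paragraphs` holds a Python list, not a single long string, so an article with N paragraphs still occupies exactly one row.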

@aqsa27
Author

aqsa27 commented Oct 25, 2019

Is there any automated way to convert the data into this format?

@ghost

ghost commented Oct 25, 2019

https://github.com/cdqa-suite/cdQA/blob/88a1ff2bb249f24edc427737ccb0b8f8959cf0b6/cdqa/scrapper/bs4_bnpp_newsroom.py
This is the script they have used. It's a good starting point.

@aqsa27
Author

aqsa27 commented Oct 25, 2019

I will try with this

@aqsa27
Author

aqsa27 commented Nov 1, 2019

The PDF converter does not read my file. Is there a required format for the PDF file as well?

@swebalaji

I want to do the same as well. Kindly help with this.

@andrelmfarias
Collaborator

andrelmfarias commented Nov 22, 2019

Hi,

Unfortunately, our pdf_converter does not generalize well. I will be working on a solution to that soon. For now, I advise you to use another library, such as pdfminer, to convert your pdf into text, and then do some preprocessing to build the dataframe in the format presented in the readme.
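The workaround described above could look roughly like the sketch below. It assumes the pdfminer.six package and its `extract_text` helper for the PDF step; the helper names `pdf_to_text`, `text_to_paragraphs`, and `build_dataframe`, the blank-line splitting rule, and the `min_chars` threshold are all illustrative choices, not part of cdQA.

```python
import re
import pandas as pd


def pdf_to_text(pdf_path):
    """Extract raw text from a PDF using pdfminer.six (pip install pdfminer.six)."""
    # Imported here so the rest of the sketch works without pdfminer installed.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def text_to_paragraphs(text, min_chars=20):
    """Split on blank lines, collapse whitespace, drop very short fragments."""
    chunks = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(chunk.split()) for chunk in chunks]
    return [c for c in cleaned if len(c) >= min_chars]


def build_dataframe(docs):
    """docs maps a title to that document's raw text."""
    rows = [
        {"title": title, "paragraphs": text_to_paragraphs(text)}
        for title, text in docs.items()
    ]
    return pd.DataFrame(rows, columns=["title", "paragraphs"])


# Usage with in-memory text; for a real file use pdf_to_text("my_file.pdf").
sample = (
    "First paragraph of the document,\nwrapped over two lines.\n\n"
    "12\n\n"
    "Second paragraph of the document."
)
df = build_dataframe({"My document": sample})
print(df.loc[0, "paragraphs"])
```

The length filter is there to drop stray page numbers and headers that PDF extraction tends to leave behind; the right threshold depends on your documents.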

@fmikaelian
Collaborator

@aqsa27 what does your csv look like? Can you share the format or a sample here?

@nayakvidya

@aqsa27, @fmikaelian - I would also like to have a look at the csv. Can you please share a sample? Also, once we build the csv, how do you train the model? If I use the existing QAPipeline:

cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
cdqa_pipeline.fit_retriever(df=df)

the results do not match the questions asked during testing. Any pointers on training the model?

@aqsa27
Author

aqsa27 commented Nov 25, 2019

> @aqsa27 what does your csv look like? Can you share the format or a sample here?

Hi,

My dataset contains 4 columns: question, answer, date, and additional information.
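One possible way to reshape a CSV like this into the `title` / `paragraphs` layout is sketched below. The mapping is an assumption, not something cdQA prescribes: the question is used as the title, and the answer plus any additional information become the paragraphs. The exact column names and the sample rows are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample; the column names are guessed from the description above.
csv_text = """question,answer,date,additional_information
What are the opening hours?,The office is open from 9 to 5.,2019-10-24,It is closed on public holidays.
Where is the office?,It is on the second floor.,2019-10-25,
"""

# In practice, pass the path of your csv file instead of the StringIO object.
raw = pd.read_csv(io.StringIO(csv_text))


def row_to_paragraphs(row):
    # Keep the non-empty text fields as separate paragraphs.
    parts = [row["answer"], row["additional_information"]]
    return [str(p).strip() for p in parts if pd.notna(p) and str(p).strip()]


df = pd.DataFrame({
    "title": raw["question"],
    "paragraphs": [row_to_paragraphs(row) for _, row in raw.iterrows()],
})
print(df)
```

Whether each question should be its own row, or several related rows should be grouped under one title, depends on how the questions will be asked later.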

@aqsa27
Author

aqsa27 commented Nov 25, 2019

> @aqsa27, @fmikaelian - I would also like to have a look at the csv. Can you please share a sample? Also, once we build the csv, how do you train the model? If I use the existing QAPipeline:
>
> cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
> cdqa_pipeline.fit_retriever(df=df)
>
> the results do not match the questions asked during testing. Any pointers on training the model?

I create a new dataframe from my csv and use it to train my model:
cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
cdqa_pipeline.fit_retriever(df=newdf)

The answers with this method are not 100% accurate, but they are a lot more relevant.

@aqsa27
Author

aqsa27 commented Nov 25, 2019

> Hi,
>
> Unfortunately, our pdf_converter does not generalize well. I will be working on a solution to that soon. For now, I advise you to use another library, such as pdfminer, to convert your pdf into text, and then do some preprocessing to build the dataframe in the format presented in the readme.

Hi,

Can you give us an example of the format? Not the one mentioned in the readme, but a live example of a csv in the recommended format.

@andrelmfarias
Collaborator

andrelmfarias commented Nov 27, 2019

> Hi,
> Can you give us an example of the format? Not the one mentioned in the readme, but a live example of a csv in the recommended format.

See one of our official tutorials (linked in our readme and in our examples repository): https://colab.research.google.com/github/cdqa-suite/cdQA/blob/master/examples/tutorial-first-steps-cdqa.ipynb

If you run this notebook, the csv will be saved in the directory ./data/bnpp_newsroom_v1.1/

You can ignore the columns date, category, link, and abstract. You only need title and paragraphs.
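A small sketch of loading such a csv and keeping only the two required columns follows. It assumes the `paragraphs` column was written to disk as the string form of a Python list, which is what pandas produces when a list column is saved with `to_csv`; the inline sample row is invented so the snippet runs without the downloaded file.

```python
import io
from ast import literal_eval

import pandas as pd

# Stand-in for a saved csv; in practice pass the file path to read_csv.
csv_text = '''date,title,category,link,abstract,paragraphs
01.01.2019,The Article Title,News,https://example.com/a,Short abstract,"['Paragraph 1 of Article', 'Paragraph 2 of Article']"
'''

# A csv stores each list as its string representation, so parse it back.
df = pd.read_csv(io.StringIO(csv_text), converters={"paragraphs": literal_eval})

# Keep only the two columns needed.
df = df[["title", "paragraphs"]]
print(df)
```

Without the converter, every `paragraphs` cell would be one long string rather than a list of paragraph strings.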

@falcon-codz

Can you fix my issue, #345?


6 participants