Using My own dataset with csv #287

aqsa27 · 2019-10-24T18:35:15Z

Hi, I am trying to build a cdQA system with my own dataset, which is in CSV. Can you tell me what format my dataset should be in?
There is only a PDF converter, nothing for CSV.
Is there any way of converting my dataset into a dataframe that cdQA accepts?

ghost · 2019-10-25T00:29:37Z

title	paragraphs
The Article Title	[Paragraph 1 of Article, ... , Paragraph N of Article]
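To make the shape concrete, here is a minimal sketch of a dataframe in that layout, assuming pandas is used to build it. The article titles and paragraph texts are made up; the point is only that each row is one document and that `paragraphs` holds a Python list of strings.

```python
import pandas as pd

# Hypothetical articles; only the shape matters: one row per document,
# with "paragraphs" holding a Python list of strings.
df = pd.DataFrame(
    {
        "title": ["First article", "Second article"],
        "paragraphs": [
            ["First paragraph of article one.", "Second paragraph of article one."],
            ["Only paragraph of article two."],
        ],
    }
)

print(df)
```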

aqsa27 · 2019-10-25T01:34:53Z

Is there any automated way to convert the data into this format?

ghost · 2019-10-25T01:43:38Z

https://github.com/cdqa-suite/cdQA/blob/88a1ff2bb249f24edc427737ccb0b8f8959cf0b6/cdqa/scrapper/bs4_bnpp_newsroom.py
This is the script they have used. It's a good starting point.

aqsa27 · 2019-10-25T12:42:17Z

I will try with this

aqsa27 · 2019-11-01T15:10:08Z

The PDF converter does not read my file. Is there a required format for the PDF file as well?

swebalaji · 2019-11-18T11:29:36Z

I want to do the same as well. Kindly help with this.

andrelmfarias · 2019-11-22T10:41:07Z

Hi,

Unfortunately, our pdf_converter does not generalize well. I will be working on a solution to that soon. For now, I advise you to use another library, such as pdfminer, to convert your PDF into text, and then do some preprocessing to build the dataframe in the format presented in the readme.
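A rough sketch of that workaround, assuming the pdfminer.six package and its `extract_text` helper. The function names `split_into_paragraphs` and `pdf_to_row`, the blank-line splitting rule and the `min_chars` threshold are all illustrative choices, not part of cdQA.

```python
import re


def split_into_paragraphs(text, min_chars=20):
    """Split raw text on blank lines and drop very short fragments."""
    chunks = re.split(r"\n\s*\n", text)
    # Collapse line breaks and runs of whitespace inside each chunk.
    cleaned = [" ".join(chunk.split()) for chunk in chunks]
    return [c for c in cleaned if len(c) >= min_chars]


def pdf_to_row(pdf_path, title):
    """Build one {'title', 'paragraphs'} record from a PDF file."""
    # Imported here so the splitting helper works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return {"title": title, "paragraphs": split_into_paragraphs(extract_text(pdf_path))}
```

A list of such records can be passed to `pandas.DataFrame` to get the `title` and `paragraphs` columns. Real PDFs often need more cleanup than this (headers, footers, hyphenation), so treat the splitting rule as a starting point.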

fmikaelian · 2019-11-23T11:31:09Z

@aqsa27 what does your csv look like? Can you share the format or a sample here?

nayakvidya · 2019-11-25T06:07:07Z

@aqsa27, @fmikaelian - I would also like to have a look at the csv. Can you please share a sample? Also, once we build the csv, how do we train the model? If I use the existing QAPipeline:

cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
cdqa_pipeline.fit_retriever(df=df)

the results do not match the questions asked while testing. Any pointers on training the model?

aqsa27 · 2019-11-25T16:17:02Z

> @aqsa27 what does your csv look like? Can you share the format or a sample here?

Hi,

My dataset contains 4 columns: question, answer, date and additional information.
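One possible way to map such a file onto `title` and `paragraphs` is sketched below. The column names are guesses based on the description above, the sample rows are invented, and using the question as the title with the answer and additional information as paragraphs is only an assumption about what makes sense for this data.

```python
import io

import pandas as pd

# Stand-in for the real file; column names are guessed from the description.
raw_csv = io.StringIO(
    "question,answer,date,additional information\n"
    "What is cdQA?,A closed-domain QA system.,2019-10-24,Built on a retriever and a reader.\n"
    "Which columns are needed?,title and paragraphs.,2019-10-25,\n"
)

source = pd.read_csv(raw_csv)


def row_to_paragraphs(row):
    """Collect the non-empty text fields of one row into a list of paragraphs."""
    parts = [row["answer"], row["additional information"]]
    return [str(p).strip() for p in parts if pd.notna(p) and str(p).strip()]


newdf = pd.DataFrame(
    {
        "title": source["question"],
        "paragraphs": [row_to_paragraphs(row) for _, row in source.iterrows()],
    }
)

print(newdf)
```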

aqsa27 · 2019-11-25T16:18:38Z

> @aqsa27, @fmikaelian - I would also like to have a look at the csv. Can you please share a sample? Also, once we build the csv, how do we train the model? If I use the existing QAPipeline:
>
> cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
> cdqa_pipeline.fit_retriever(df=df)
>
> the results do not match the questions asked while testing. Any pointers on training the model?

I create a new dataframe from my csv and use that to train my model:

cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
cdqa_pipeline.fit_retriever(df=newdf)

The answer with this method is not 100% accurate, but it's a lot more relevant.

aqsa27 · 2019-11-25T16:20:12Z

> Hi,
>
> Unfortunately, our pdf_converter does not generalize well. I will be working on a solution to that soon. For now, I advise you to use another library, such as pdfminer, to convert your PDF into text, and then do some preprocessing to build the dataframe in the format presented in the readme.

Hi,

Can you give us an example of the format? Not the one mentioned in the readme, but a live example of a csv in the recommended format.

andrelmfarias · 2019-11-27T15:01:46Z

> Hi,
> Can you give us an example of the format? Not the one mentioned in the readme, but a live example of a csv in the recommended format.

One of our official tutorials (found in our readme and our examples repository) produces one: https://colab.research.google.com/github/cdqa-suite/cdQA/blob/master/examples/tutorial-first-steps-cdqa.ipynb

If you run this notebook, the csv will be saved in the directory ./data/bnpp_newsroom_v1.1/

You can ignore the columns date, category, link and abstract. You only need title and paragraphs.
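As an illustration of reading such a file back, the sketch below round-trips a tiny made-up corpus through CSV and keeps only the two needed columns. It assumes pandas and relies on the fact that a list saved to CSV comes back as a string, so `ast.literal_eval` is used as a converter to turn it into a list again. The in-memory buffer and the row contents are placeholders for the real file.

```python
import io
from ast import literal_eval

import pandas as pd

# Made-up corpus with the extra columns mentioned above.
original = pd.DataFrame(
    {
        "date": ["2019-11-27"],
        "title": ["Some title"],
        "category": ["news"],
        "link": ["https://example.com"],
        "abstract": ["Short abstract"],
        "paragraphs": [["First paragraph.", "Second paragraph."]],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the stringified list back into a real list while loading.
df = pd.read_csv(buffer, converters={"paragraphs": literal_eval})

# Keep only the columns the pipeline needs.
df = df[["title", "paragraphs"]]

print(df)
```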

falcon-codz · 2020-02-28T06:20:40Z

Can you fix my issue #345?
