
Tutorial for its usage #23

Open
kcmtest opened this issue Aug 2, 2020 · 29 comments

Comments

@kcmtest

kcmtest commented Aug 2, 2020

Can you put up a tutorial for its usage? I do see the repository, but I'm getting confused about what I'm supposed to run. The web version of PubTator is straightforward: I just put in PMIDs and it returns the results. I would be glad if you could put up a tutorial.

I ran this

bash execute.sh
wget: download/bioconcepts2pubtatorcentral_offset.gz.log: No such file or directory

but this exist here "https://github.com/greenelab/pubtator/blob/master/download/bioconcepts2pubtator_offsets.gz.log"

I'm not sure what I'm doing wrong.

@danich1
Contributor

danich1 commented Aug 3, 2020

execute.sh cannot find the download folder, which is why you are getting the "No such file or directory" error. Please make sure you are running bash execute.sh at the same level as the download folder.

@kcmtest
Author

kcmtest commented Aug 3, 2020

"Please make sure you are running bash execute.sh at the same level as the download folder." So should I be inside my download folder? You mean I can simply get into the download folder and run it?

@danich1
Contributor

danich1 commented Aug 3, 2020

"So should I be inside my download folder? You mean I can simply get into the download folder and run it?"

No. You should be outside of the download folder. The current directory should look like this:

data/
download/
mapper/
scripts/
execute.sh
... (other files)

Then run bash execute.sh. The download process should work from there. I just tested the download feature and it works for me.
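
In case it helps, here is a minimal sketch of the intended setup, assuming you start from a fresh clone of this repository:

# clone the repository and run the pipeline from its root,
# so that execute.sh sits next to the download/ folder
git clone https://github.com/greenelab/pubtator.git
cd pubtator
bash execute.sh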

@kcmtest
Author

kcmtest commented Aug 3, 2020

To make my confusion clear: I have to copy each of your folders exactly the way you have them laid out, and only then can I run it? I was thinking that I would simply run execute.sh and it would work.

@danich1
Contributor

danich1 commented Aug 3, 2020

"I was thinking that I would simply run execute.sh and it would work."

Right, the intended goal here is for execute.sh to handle everything. Did you clone the repository or just download the individual file? Forgive my confusion, but I'm not sure how things are set up on your end.

@kcmtest
Author

kcmtest commented Aug 3, 2020

"Did you clone the repository" I will clone it ...and will update you..thank you for clarifying

@danich1
Contributor

danich1 commented Aug 3, 2020

"Did you clone the repository" I will clone it ...and will update you..thank you for clarifying

No problem. Please let me know if you run into any other issues.

@kcmtest
Author

kcmtest commented Aug 3, 2020

I cloned it and it's running. How big will the download be? And once it's done, do I have to rerun it every now and then?

@danich1
Contributor

danich1 commented Aug 3, 2020

"I cloned it and it's running. How big will the download be? And once it's done, do I have to rerun it every now and then?"

The file should be about 18+GB. I say 18+ because Pubtator Central updates their server monthly; therefore, your downloaded file should be at least 18GB to be correct.
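
If you want a rough progress check while it runs, something like this should work; note that the exact filename is an assumption based on the wget log in your first comment:

# show the size downloaded so far (filename assumed from the wget log above)
ls -lh download/bioconcepts2pubtatorcentral_offset.gz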

@kcmtest
Author

kcmtest commented Aug 3, 2020

Thank you for the information. Once it's done I will be back with questions to bug you again.

With regards

@kcmtest
Author

kcmtest commented Aug 3, 2020

I will have to read the README file properly before I come back to you. I will run the test example first.
The download is finished, I think, but for the last couple of hours this has been running and I'm not sure what it is.
It's not downloading anything, I guess, but what is it?

(screenshot attached: Screenshot from 2020-08-04 12-37-23)

@kcmtest
Author

kcmtest commented Aug 5, 2020

The bash script has been running for about 33 hours as of now. Is it expanding the file, or what exactly is going on? Since I'm not sure, I haven't terminated the process. I would be glad if you could tell me.

@danich1
Contributor

danich1 commented Aug 5, 2020

"The bash script has been running for about 33 hours as of now. Is it expanding the file, or what exactly is going on? Since I'm not sure, I haven't terminated the process. I would be glad if you could tell me."

The 33-hour process is my pipeline converting PubTator Central's annotations into XML format to be processed later. It is a large file that can take up to a few days to fully process. There is no solution here other than to wait until all the pieces have been completed.
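
If you just want to confirm it is still making progress, watching the working directory grow is usually enough; this is plain shell, nothing project-specific:

# total size of the working data directory; rerun occasionally and watch it grow
du -sh data/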

@kcmtest
Author

kcmtest commented Aug 6, 2020

Unfortunately the machine was restarted. It seems I have to do it again, or can it resume from where it left off?

@danich1
Contributor

danich1 commented Aug 6, 2020

"Unfortunately the machine was restarted. It seems I have to do it again, or can it resume from where it left off?"

The older version of the code required you to start from scratch. The newly updated version allows you to start from anywhere in the pipeline. I highly recommend using the newly upgraded version and reading its docs; it could make your life easier when restarting the parsers.

@kcmtest
Author

kcmtest commented Aug 10, 2020

30988903it [58:10:58, 147.95it/s] 
30988894it [11:25:20, 753.61it/s]  
1097it [2:10:37,  7.14s/it]
sys:1: DtypeWarning: Columns (4,10) have mixed types. Specify dtype option on import or set low_memory=False.
1097it [1:44:13,  5.70s/it]
274it [10:05:12, 132.53s/it]
Traceback (most recent call last):
  File "scripts/download_full_text.py", line 124, in <module>
    download_full_text(args.input, args.document_batch, args.temp_dir)
  File "scripts/download_full_text.py", line 58, in download_full_text
    response = call_api(query)
  File "/home/punit/anaconda3/envs/pubtator/lib/python3.8/site-packages/ratelimit/decorators.py", line 113, in wrapper
    return func(*args, **kargs)
  File "/home/punit/anaconda3/envs/pubtator/lib/python3.8/site-packages/ratelimit/decorators.py", line 80, in wrapper
    return func(*args, **kargs)
  File "scripts/download_full_text.py", line 21, in call_api
    raise Exception(response.text)
Exception

Is this an error or something else? Do let me know; I'm not sure.

@danich1
Contributor

danich1 commented Aug 10, 2020

This error was generated because PubTator Central's server sent back an error code. I don't know what caused it, so my suggestion is to try rerunning that part of the pipeline, and if the error comes up again I'll take a look.

@kcmtest
Author

kcmtest commented Aug 10, 2020

". I don't know what caused it, so my suggestion is try rerunning that part of the pipeline and if the error comes again I'll take a look." i simply ran this

bash execute.sh

shall i run this again?

@danich1
Contributor

danich1 commented Aug 10, 2020

No, don't do that. Run this command:

 python scripts/download_full_text.py \
    --input data/pubtator-pmids-to-pmcids.tsv \
    --document_batch 100000 \
    --output data/pubtator-central-full-text.xml

If you run bash execute.sh you will restart everything. Not ideal.

@kcmtest
Author

kcmtest commented Aug 10, 2020

Thank you for the immediate help.

This is what I got after running the above command. Sorry for asking these fundamental doubts; I mostly use R, so I'm not sure about these errors.

download_full_text.py: error: the following arguments are required: --temp_dir

I made a new folder and now it's running:

python scripts/download_full_text.py --input data/pubtator-pmids-to-pmcids.tsv --document_batch 100000 --output data/pubtator-central-full-text.xml --temp_dir /run/media/punit/data4/tupa/
0it [00:00, ?it/s]

@kcmtest
Author

kcmtest commented Aug 10, 2020

The error I received after running the above:

0it [02:10, ?it/s]
Traceback (most recent call last):
  File "scripts/download_full_text.py", line 124, in <module>
    download_full_text(args.input, args.document_batch, args.temp_dir)
  File "scripts/download_full_text.py", line 58, in download_full_text
    response = call_api(query)
  File "/home/punit/anaconda3/envs/pubtator/lib/python3.8/site-packages/ratelimit/decorators.py", line 113, in wrapper
    return func(*args, **kargs)
  File "/home/punit/anaconda3/envs/pubtator/lib/python3.8/site-packages/ratelimit/decorators.py", line 80, in wrapper
    return func(*args, **kargs)
  File "scripts/download_full_text.py", line 21, in call_api
    raise Exception(response.text)
Exception: <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>Submitted URI too large!</title>
<link rev="made" href="mailto:[email protected]" />
<style type="text/css"><!--/*--><![CDATA[/*><!--*/ 
    body { color: #000000; background-color: #FFFFFF; }
    a:link { color: #0000CC; }
    p, address {margin-left: 3em;}
    span {font-size: smaller;}
/*]]>*/--></style>
</head>

<body>
<h1>Submitted URI too large!</h1>
<p>


    The length of the requested URL exceeds the capacity limit for
	this server. The request cannot be processed.
   
</p>
<p>
If you think this is a server error, please contact
the <a href="mailto:[email protected]">webmaster</a>.

</p>

<h2>Error 414</h2>
<address>
  <a href="/">www.ncbi.nlm.nih.gov</a><br />
  <span>Apache</span>
</address>
</body>
</html>

@danich1
Contributor

danich1 commented Aug 10, 2020

Basically the program is sending too many IDs to be processed. Change document_batch to 100 or 1000 and run again. The default parameter is too high for PubTator Central's API.
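
For example, reusing the paths from your earlier run (adjust --temp_dir to your own directory):

python scripts/download_full_text.py \
    --input data/pubtator-pmids-to-pmcids.tsv \
    --document_batch 1000 \
    --output data/pubtator-central-full-text.xml \
    --temp_dir /run/media/punit/data4/tupa/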

@kcmtest
Author

kcmtest commented Aug 10, 2020

"Basically the program is sending too many ids to be processed. Change document_batch to be 100 or 1000 and run again. The default parameter is too high for Pubtator Central's api."

okay i will try small numbers

@kcmtest
Author

kcmtest commented Aug 10, 2020

python scripts/download_full_text.py --input data/pubtator-pmids-to-pmcids.tsv --document_batch 100 --output data/pubtator-central-full-text.xml --temp_dir /run/media/punit/data4/tupa/
38it [1:26:01, 135.83s/it]
Traceback (most recent call last):
  File "scripts/download_full_text.py", line 124, in <module>
    download_full_text(args.input, args.document_batch, args.temp_dir)
  File "scripts/download_full_text.py", line 58, in download_full_text
    response = call_api(query)
  File "/home/punit/anaconda3/envs/pubtator/lib/python3.8/site-packages/ratelimit/decorators.py", line 113, in wrapper
    return func(*args, **kargs)
  File "/home/punit/anaconda3/envs/pubtator/lib/python3.8/site-packages/ratelimit/decorators.py", line 80, in wrapper
    return func(*args, **kargs)
  File "scripts/download_full_text.py", line 21, in call_api
    raise Exception(response.text)
Exception

Please do have a look

I did check the folder; I see XML files there, 38 files totaling around 553 MB.

@danich1
Contributor

danich1 commented Aug 10, 2020

For ease of debugging, please upload this file: data/pubtator-pmids-to-pmcids.tsv. I'll need it so I can see what's causing the issue.
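
If you want to take a quick look at it yourself in the meantime, plain shell tools are enough:

# row count and a peek at the first few PMID-to-PMCID mappings
wc -l data/pubtator-pmids-to-pmcids.tsv
head -n 5 data/pubtator-pmids-to-pmcids.tsv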

@kcmtest
Author

kcmtest commented Aug 10, 2020

"For ease of debugging, please upload this file: data/pubtator-pmids-to-pmcids.tsv. I'll need it so I can see what's causing the issue."

Sorry for the late reply; I'm doing it now. I will share a link since the file is more than 10 MB: https://drive.google.com/file/d/1G-6ehkeR_V8IhqiBryCMVe1jGc9GPB8Y/view?usp=sharing

@kcmtest
Author

kcmtest commented Aug 12, 2020

Hello sir, I would be glad to know what was going wrong on my side.

@cgreene
Member

cgreene commented Aug 12, 2020

Hi @krushnach80 - you have encountered a research project that is in progress but on someone's back burner at the moment. If you need faster responses, it sounds like you might be better served by interacting directly with the PubTator API or something similar: https://www.ncbi.nlm.nih.gov/research/pubtator/

@kcmtest
Author

kcmtest commented Aug 12, 2020

Thank you, sir. I found something that would be easier for me: https://cran.rstudio.com/web/packages/pubtatordb/vignettes/pubtatordb.html

But I would love to use your tool as well.
