Discover how a browser interacts with pdf2doc.com and reverse engineer a solution that does not require a browser.
A combination of the Chrome Developer Tools and the curl command-line utility was used to conduct this research.
-
The website uses the Plupload API to handle file uploads.
-
Uploads and progress notifications are done through AJAX - the page never reloads.
-
Upon page load, a 16-character session ID (sid) is generated using the following method:

    function randomString() {
        for (var t = "0123456789abcdefghiklmnopqrstuvwxyz", e = 16, i = "", n = 0; e > n; n++) {
            var a = Math.floor(Math.random() * t.length);
            i += t.substring(a, a + 1)
        }
        return i
    }
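Since sid is created entirely on the client, the same scheme is easy to reproduce outside the browser. Here is a minimal sketch in Python 3 (standard library only, which is my assumption; it is not the Python 2.7 plus requests setup planned for the tool). The names SID_ALPHABET, SID_LENGTH and random_sid are mine, not the site's.

```python
import random

# Same alphabet as the site's randomString(); note that it contains no "j".
SID_ALPHABET = "0123456789abcdefghiklmnopqrstuvwxyz"
SID_LENGTH = 16


def random_sid(rng=random):
    """Return a 16-character session ID drawn uniformly from SID_ALPHABET."""
    return "".join(rng.choice(SID_ALPHABET) for _ in range(SID_LENGTH))


if __name__ == "__main__":
    print(random_sid())
```

Any 16-character string would probably do just as well, but matching the original alphabet keeps the requests indistinguishable from the browser's.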
-
Before uploading, the Plupload library generates a ~30 character unique file ID (fid) using the following method:

    var guid = (function() {
        var counter = 0;
        return function(prefix) {
            var guid = new Date().getTime().toString(32), i;
            for (i = 0; i < 5; i++) {
                guid += Math.floor(Math.random() * 65535).toString(32);
            }
            return (prefix || 'o_') + guid + (counter++).toString(32);
        };
    }());
-
The upload process begins when a POST request with a Content-Type of "multipart/form-data" is sent to the /upload/<sid> endpoint. The request also contains three parameters:

    name    The filename, e.g. "Test.pdf"
    id      The file ID (fid)
    file    The file itself, in binary format
NOTE: Although sid and fid are generated using the methods above, the fact that they are created client-side means that you can substitute your own values if you so wish. fid appears to accept any value, while sid must be 16 characters long in order to be processed.

An example curl looks like this:

    curl -X POST -F "name=ID-Test.pdf" -F "id=testing" -F "file=@Test.pdf" -H "Content-Type: multipart/form-data" http://pdf2doc.com/upload/3sw4i3wpq25qm46s

The response is sent as JSON and looks like this:

    {
        "data": { "file": "Test.pdf", "file_size_human": "74K" },
        "id": "testing",
        "jsonrpc": "2.0",
        "result": null
    }
-
Immediately after uploading, the page sends a GET request to /convert/<sid>/<fid>?rnd=<rnd>. rnd is generated using Math.random() and can be omitted from the request. I believe it simply acts as a cache-busting mechanism.

An example curl looks like this:

    curl http://pdf2doc.com/convert/3sw4i3wpq25qm46s/testing

And the response:

    {"status": "success"}

I wasn't able to get a conversion to fail (I didn't really try) but it is certainly possible - and if it does, this is probably where you can find out.
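The upload and convert requests can be sketched without a browser or curl. The Python 3 example below (standard library only, my assumption) builds the multipart/form-data body by hand and returns an unsent request object, so nothing touches the network until you pass it to urllib.request.urlopen. The helper names build_upload_request and convert_url are hypothetical, and the application/pdf part type is my guess at what the server tolerates; I have not verified it against the site.

```python
import urllib.request
import uuid

BASE_URL = "http://pdf2doc.com"


def build_upload_request(sid, fid, filename, file_bytes):
    """Return an unsent POST request carrying the name, id and file fields."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields: name (the filename) and id (the fid).
    for field, value in (("name", filename), ("id", fid)):
        parts.append(
            ('--%s\r\nContent-Disposition: form-data; name="%s"\r\n\r\n%s\r\n'
             % (boundary, field, value)).encode("utf-8"))
    # The file part; the Content-Type here is an assumption.
    parts.append(
        ('--%s\r\nContent-Disposition: form-data; name="file"; filename="%s"\r\n'
         'Content-Type: application/pdf\r\n\r\n'
         % (boundary, filename)).encode("utf-8"))
    parts.append(file_bytes)
    parts.append(("\r\n--%s--\r\n" % boundary).encode("utf-8"))
    return urllib.request.Request(
        "%s/upload/%s" % (BASE_URL, sid),
        data=b"".join(parts),
        headers={"Content-Type": "multipart/form-data; boundary=%s" % boundary},
        method="POST",
    )


def convert_url(sid, fid):
    """URL that triggers the conversion; rnd is left out since it is optional."""
    return "%s/convert/%s/%s" % (BASE_URL, sid, fid)
```

Sending would then be a matter of calling urllib.request.urlopen on the request, decoding the JSON reply, and issuing a GET to convert_url(sid, fid).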
-
The conversion can be monitored through the /status/<sid>/<fid>?rnd=<rnd> endpoint. rnd serves the same purpose here as it did previously.

An example curl looks like this:

    curl http://pdf2doc.com/status/3sw4i3wpq25qm46s/testing

Response:

    {
        "fid": "testing",
        "progress": 0,
        "sid": "3sw4i3wpq25qm46s",
        "status": "processing",
        "status_text": null
    }

Presumably, progress changes over time to reflect how close the conversion is to being completed. In addition, the JSON format changes once the conversion is completed:

    {
        "convert_result": "Test.doc",
        "fid": "testing",
        "progress": 100,
        "savings": null,
        "sid": "3sw4i3wpq25qm46s",
        "status": "success",
        "thumb_url": "\/files\/3sw4i3wpq25qm46s\/testing\/thumb.png?nimg"
    }

convert_result is the filename of the newly converted document. thumb_url is a URI leading to a 125x77 screenshot of the converted document. The query (nimg) appears to be another randomly generated cache-busting mechanism.

NOTE: If you visit this endpoint before hitting the previous one (/convert), you will get the following error:

    { "details": "Conversion error.", "status": "error" }

This does not actually mean the conversion failed; it just means that it was never started. The conversion must be triggered manually.
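The status responses above suggest a simple polling loop: keep asking until status is "success", and stop on "error". Here is a rough Python 3 sketch of that logic. The fetch_status argument is a hypothetical callable that returns the decoded /status JSON as a dict (how it talks to the server is left out on purpose), and the poll interval and retry cap are arbitrary values I picked, not anything the site dictates.

```python
import time


def wait_for_conversion(fetch_status, poll_interval=1.0, max_polls=60,
                        sleep=time.sleep):
    """Poll until the conversion succeeds and return convert_result.

    Raises RuntimeError if the server reports an error or the poll budget
    runs out.
    """
    for _ in range(max_polls):
        payload = fetch_status()
        status = payload.get("status")
        if status == "success":
            return payload["convert_result"]
        if status == "error":
            # Also what you get when /convert was never called.
            raise RuntimeError(payload.get("details", "unknown error"))
        # Anything else (only "processing" has been observed) means keep waiting.
        sleep(poll_interval)
    raise RuntimeError("conversion did not finish after %d polls" % max_polls)
```

Passing the sleep function in as a parameter is only there so the loop can be exercised without actually waiting.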
-
Finally, to download the file, the page sends a GET request to /download/<sid>/<fid>/<convert_result>?rnd=<rnd>. This link is generated in an anonymous function assigned as a click event handler:

    $("#" + data.fid + " div.plupload_file_button" + (thumbnail_clickable ? ", #" + data.fid + " .plupload_thumb" : "")).click(function() {
        downloadURI("download/" + data.sid + "/" + data.fid + "/" + data.convert_result + "?rnd=" + Math.random(), data.convert_result);
    });

And here's the source for downloadURI:

    function downloadURI(uri, name) {
        if (HTMLElement.prototype.click) {
            var link = document.createElement("a");
            link.download = name;
            link.href = uri;
            link.style.display = "none";
            document.body.appendChild(link);
            link.click();
            setTimeout(function() {
                link.remove();
            }, 500);
        } else {
            window.location.href = uri;
        }
    }

Example curl:

    curl http://pdf2doc.com/download/3sw4i3wpq25qm46s/testing/Test.doc

The response is, of course, the file itself. However, if sid, fid, etc. are invalid, the server will send back a 500 Server Error response.
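The download step, including the 500 response for bad identifiers, could look something like the following in Python 3 (standard library only, my assumption). download_url and download_file are hypothetical helper names. Percent-encoding the filename is my own precaution for names containing spaces; I have not confirmed how the server handles such names.

```python
import urllib.error
import urllib.parse
import urllib.request

BASE_URL = "http://pdf2doc.com"


def download_url(sid, fid, convert_result):
    """Build the /download URL; rnd is omitted since it looks optional."""
    return "%s/download/%s/%s/%s" % (
        BASE_URL, sid, fid, urllib.parse.quote(convert_result))


def download_file(sid, fid, convert_result, opener=urllib.request.urlopen):
    """Return the converted file's bytes, or None if the server answers 500."""
    try:
        response = opener(download_url(sid, fid, convert_result))
        try:
            return response.read()
        finally:
            response.close()
    except urllib.error.HTTPError as err:
        if err.code == 500:
            # Seen when sid, fid or the filename is invalid.
            return None
        raise
```

The opener parameter defaults to urllib.request.urlopen and exists so the function can be tried out with a stand-in instead of the live site.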
-
There are four main API endpoints: /upload, /convert, /status, and /download.
-
Every endpoint uses a combination of a session ID (sid) and file ID (fid), which are both generated client-side.
-
There is no form of authentication.
-
There may be rate-limiting, but I don't expect this tool to be used so frequently by its users that rate-limiting actually becomes a problem.
I will be using Python 2.7 with the requests library to automate this process. In addition, I will utilize Tkinter and py2exe to package the software into a Windows executable with a GUI.