@tonofshell Not sure if you've seen my comment in another issue (I posted it last Saturday), so I'm posting it here again.
Can you please tell me which line in your code causes the script to scan the whole file?
I have a question about these lines (in startElement()).
    if self.row == 1:
        self.out.write(str(attributes.keys())[1:-1] + "\n")
    if len(attributes) > 0:
        self.out.write(str(attributes.values())[1:-1] + "\n")
It seems to me that attributes is a dictionary, and key order in a Python dictionary is not guaranteed (before Python 3.7). Even where insertion order is preserved, nothing forces every row of the XML to list its attributes in the same order. How do you know that the keys and values of each row of data will come out in exactly the same order? (If it happens to yield the right thing, it's luck.) For example, if the attributes of row 0 are {id: 0, name: "a", link: "url0"}, how can you be sure that row 1's attributes dict would not be something like {name: "b", id: 1, link: "url1"}? In that case, the resulting CSV would be:
id, name, link
0, "a", "url0"
"b", 1, "url1"
If you agree that this could be a problem, we should control the order by keeping an explicit list of attributes and writing every row in that order. (That is why I hard-coded the column names, although I do plan to make that part more generic.)
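A minimal sketch of what keeping a list of attributes could look like, assuming Python 3 and the standard `xml.sax` and `csv` modules. The class name `OrderedRowHandler`, the helper `convert`, and the sample XML are hypothetical and are not the code from this repository. The column list is either passed in or taken once from the first element that has attributes, and every later row is written by looking up each column by name:

```python
import csv
import io
import xml.sax


class OrderedRowHandler(xml.sax.ContentHandler):
    """Write one CSV line per element, always in the same column order."""

    def __init__(self, out, columns=None):
        super().__init__()
        self.out = out
        # Hypothetical parameter: a fixed column list. If None, the
        # column order is taken once, from the first element with attributes.
        self.columns = list(columns) if columns is not None else None
        self.writer = None

    def startElement(self, name, attributes):
        if len(attributes) == 0:
            return  # e.g. the root element, which carries no data
        if self.writer is None:
            if self.columns is None:
                self.columns = list(attributes.keys())
            self.writer = csv.DictWriter(
                self.out, fieldnames=self.columns, lineterminator="\n"
            )
            self.writer.writeheader()
        # Look each value up by column name, so the order in which the
        # attributes appear in this particular element does not matter.
        # Missing attributes become empty fields.
        self.writer.writerow({c: attributes.get(c, "") for c in self.columns})


def convert(xml_text, columns=None):
    """Parse xml_text and return the CSV output as a string."""
    out = io.StringIO()
    xml.sax.parseString(xml_text.encode("utf-8"), OrderedRowHandler(out, columns))
    return out.getvalue()


# Sample input (made up): the second row lists its attributes in a different order.
sample = (
    '<rows>'
    '<row id="0" name="a" link="url0"/>'
    '<row name="b" id="1" link="url1"/>'
    '</rows>'
)
print(convert(sample, ["id", "name", "link"]))
```

With this approach the second row is written as `1,b,url1` even though its attributes appear in a different order in the input. Using `csv.DictWriter` instead of slicing `str(...)` also takes care of quoting values that contain commas or quotes.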
MapReduce might not be a good idea for this task. Tomorrow I am going to write an MPI script that does the conversion, and I will change the code to address the potential problem described above.