Memgraph split files#376
Conversation
| if rel_type not in file_handles: | ||
| split_jsonl_path = f"{out_base}_{rel_type}.jsonl" | ||
| file_handles[rel_type] = open(split_jsonl_path, "w", encoding="utf-8") | ||
| file_handles[rel_type].write(line) |
There was a problem hiding this comment.
@EvanDietzMorris I see the issue here with the original code, mainly related to checking whether the output file exists or not and if not, the corresponding file_handle is set to None. I cannot remember exactly why I did this, but probably was meant to not override existing files. I agree it is better to just output the file without doing this check. But I see one issue here with hardcoding the ext name to jsonl. In fact, the output file name is csv rather than jsonl. I think it is better to leverage the out_ext here to make the output path f"{out_base}_{rel_type}{out_ext}" here.
There was a problem hiding this comment.
isn't this part just writing the split edge files before they are converted to csv? unless we change a lot of other stuff in ORION those would always be jsonl.. maybe I'm misunderstanding though
There was a problem hiding this comment.
Yes, you are right, Evan. It is always jsonl, but since we pass in edges_input_file as an input parameter to this method, it is probably better to just use that input file to get its base and ext even though it is currently just edges.jsonl.
| # split a large edge jsonl file into multiple jsonl files, one per predicate (relationship type) | ||
| # for subsequent conversions by edge types | ||
| base, ext = os.path.splitext(edges_input_file) | ||
| out_base, out_ext = os.path.splitext(output_base_file) |
There was a problem hiding this comment.
I think we will need the original input base and ext file for later use. We could rename them as in_base and in_ext to be clearer, though.
There was a problem hiding this comment.
yea, feel free to make changes or rename those, I was mainly just trying to fix the None file handler issue
| all_file_names = [] | ||
| for rel_type in file_handles.keys(): | ||
| input_split_file = f"{base}_{rel_type}{ext}" | ||
| input_split_file = f"{out_base}_{rel_type}.jsonl" |
There was a problem hiding this comment.
I think we need to keep the original code here since it is input_split_file?
hyi
left a comment
There was a problem hiding this comment.
@EvanDietzMorris Looking into this more, I think the changes you made here look good and clearer. My comments can be ignored. Feel free to merge it.
I think there was a bug in the kgx_file_converter, where when it was splitting the edges file for memgraph it wasn't handling the existence check for file handlers correctly, leading to an error when calling close() on None in the finally clause, and only writing 1 line per file. This should fix that.
This also changes it so that it writes the intermediate split edges files to the output directory instead of next to the input files, in case you need all of outputs written in a different place.