Skip to content

Memgraph split files#376

Merged
EvanDietzMorris merged 2 commits intoneo4j-sources-issuefrom
memgraph-split-files
Mar 6, 2026
Merged

Memgraph split files#376
EvanDietzMorris merged 2 commits intoneo4j-sources-issuefrom
memgraph-split-files

Conversation

@EvanDietzMorris
Copy link
Contributor

I think there was a bug in the kgx_file_converter, where when it was splitting the edges file for memgraph it wasn't handling the existence check for file handlers correctly, leading to an error when calling close() on None in the finally clause, and only writing 1 line per file. This should fix that.

This also changes it so that it writes the intermediate split edges files to the output directory instead of next to the input files, in case you need all of outputs written in a different place.

@github-actions github-actions bot added the Biological Context QC Require validation of biological context to ensure accuracy and consistency label Mar 6, 2026
@EvanDietzMorris EvanDietzMorris requested a review from hyi March 6, 2026 08:04
if rel_type not in file_handles:
split_jsonl_path = f"{out_base}_{rel_type}.jsonl"
file_handles[rel_type] = open(split_jsonl_path, "w", encoding="utf-8")
file_handles[rel_type].write(line)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EvanDietzMorris I see the issue here with the original code, mainly related to checking whether the output file exists or not and if not, the corresponding file_handle is set to None. I cannot remember exactly why I did this, but probably was meant to not override existing files. I agree it is better to just output the file without doing this check. But I see one issue here with hardcoding the ext name to jsonl. In fact, the output file name is csv rather than jsonl. I think it is better to leverage the out_ext here to make the output path f"{out_base}_{rel_type}{out_ext}" here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this part just writing the split edge files before they are converted to csv? unless we change a lot of other stuff in ORION those would always be jsonl.. maybe I'm misunderstanding though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, Evan. It is always jsonl, but since we pass in edges_input_file as an input parameter to this method, it is probably better to just use that input file to get its base and ext even though it is currently just edges.jsonl.

# split a large edge jsonl file into multiple jsonl files, one per predicate (relationship type)
# for subsequent conversions by edge types
base, ext = os.path.splitext(edges_input_file)
out_base, out_ext = os.path.splitext(output_base_file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will need the original input base and ext file for later use. We could rename them as in_base and in_ext to be clearer, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, feel free to make changes or rename those, I was mainly just trying to fix the None file handler issue

all_file_names = []
for rel_type in file_handles.keys():
input_split_file = f"{base}_{rel_type}{ext}"
input_split_file = f"{out_base}_{rel_type}.jsonl"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to keep the original code here since it is input_split_file?

Copy link
Contributor

@hyi hyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EvanDietzMorris Looking into this more, I think the changes you made here look good and clearer. My comments can be ignored. Feel free to merge it.

@EvanDietzMorris EvanDietzMorris merged commit 5e6bb72 into neo4j-sources-issue Mar 6, 2026
2 checks passed
@EvanDietzMorris EvanDietzMorris deleted the memgraph-split-files branch March 6, 2026 17:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Biological Context QC Require validation of biological context to ensure accuracy and consistency

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants