Extend importer module to allow bulk import from Rivet #958

Draft
Copilot wants to merge 2 commits into main from copilot/extend-importer-module-bulk-import

Conversation

Contributor

Copilot AI commented Mar 17, 2026

The importer module was hardcoded to fetch INSPIRE IDs and submission files exclusively from hepdata.net, and always assigned user ID 1 as the Coordinator. This blocked bulk import of ~780 Rivet analyses hosted at an alternate web location.

Changes

api.py

  • get_inspire_ids: new ids_url parameter — when set, fetches the INSPIRE ID list directly from that URL (expects a JSON array of integers, e.g. inspire.json) instead of constructing the HEPData /search/ids endpoint. n_latest still applies client-side; last_updated is ignored when ids_url is used.
  • _download_file: new files_url parameter — when set, downloads from {files_url}/ins{inspire_id}.tar.gz instead of {base_url}/download/submission/ins{inspire_id}/original.
  • _import_record / import_records: new coordinator_id (default 1) and files_url parameters, replacing the hardcoded admin_user_id = 1.
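The URL construction and n_latest selection described above can be sketched with small pure helpers. This is a hypothetical illustration, not the actual api.py code: the real get_inspire_ids and _download_file also perform the HTTP fetch and download themselves.

```python
# Hypothetical helpers illustrating the logic described above; names and
# structure are illustrative, not the actual importer/api.py implementation.

def build_ids_url(base_url, ids_url=None):
    """Return the URL to fetch the INSPIRE ID list from."""
    if ids_url:
        # Direct URL to a JSON array of integers (e.g. inspire.json).
        return ids_url
    return "{0}/search/ids?inspire_ids=true".format(base_url)

def build_download_url(base_url, inspire_id, files_url=None):
    """Mirror layout when files_url is set, else the HEPData endpoint."""
    if files_url:
        return "{0}/ins{1}.tar.gz".format(files_url, inspire_id)
    return "{0}/download/submission/ins{1}/original".format(base_url, inspire_id)

def select_latest(inspire_ids, n_latest=None):
    """Client-side n_latest filter (assumes newest IDs sort last)."""
    if n_latest:
        return inspire_ids[-n_latest:]
    return inspire_ids
```

Note that when files_url is set, base_url plays no part in the download URL, which is what allows pointing at a plain web directory.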

cli.py

  • import-records: adds --coordinator-id/-c, --files-url/-f
  • bulk-import-records: adds --ids-url, --files-url/-f, --coordinator-id/-c
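As a rough sketch, the new options could be declared as follows with Click (the CLI framework HEPData uses). The wiring and help text below are illustrative only, including the stub command body; the real command hands these values to the importer API.

```python
# Illustrative Click declaration of the new bulk-import-records options;
# not the actual cli.py code.
import click

@click.command(name='bulk-import-records')
@click.option('--ids-url', default=None,
              help='URL of a JSON array of INSPIRE IDs (e.g. inspire.json).')
@click.option('--files-url', '-f', default=None,
              help='Base URL serving ins<inspire_id>.tar.gz tarballs.')
@click.option('--coordinator-id', '-c', default=1, type=int,
              help='User ID to assign as Coordinator (default 1).')
def bulk_import_records(ids_url, files_url, coordinator_id):
    # Stub body: echo the parsed options instead of importing records.
    click.echo('coordinator_id={0}'.format(coordinator_id))
```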

Example — bulk import from a Rivet mirror:

hepdata importer bulk-import-records \
  --ids-url https://example.com/hepdata/inspire.json \
  --files-url https://example.com/hepdata \
  --coordinator-id 42

Tests

  • Extended test_get_inspire_ids to cover ids_url success, ids_url + n_latest, and ids_url 404.
  • Updated test_import_records call-signature assertions and added a case for non-default coordinator_id/files_url propagation.
  • Added files_url download-error case to test_import_record.
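The three ids_url cases can be exercised with a simple test double for the HTTP fetch. This is a hypothetical, dependency-injected stand-in, not the project's actual test code (which patches the HTTP layer instead):

```python
# Hypothetical stand-in for get_inspire_ids(ids_url=...), written with an
# injected fetch function so the three cases run without network access.
import json

def get_ids(fetch, ids_url, n_latest=None):
    """fetch(url) -> (status_code, body_bytes); returns False on HTTP error."""
    status, body = fetch(ids_url)
    if status != 200:
        return False
    inspire_ids = json.loads(body)
    if n_latest:
        # Client-side subset, mirroring the ids_url + n_latest case.
        inspire_ids = inspire_ids[-n_latest:]
    return inspire_ids

# The three scenarios: success, success with n_latest, and a 404.
ok = lambda url: (200, b"[101, 102, 103]")
missing = lambda url: (404, b"Not Found")

assert get_ids(ok, "https://example.com/inspire.json") == [101, 102, 103]
assert get_ids(ok, "https://example.com/inspire.json", n_latest=2) == [102, 103]
assert get_ids(missing, "https://example.com/inspire.json") is False
```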
Original prompt

This section details the original issue to resolve:

<issue_title>records: extend importer module to allow bulk import from Rivet</issue_title>
<issue_description>The importer module (CLI) was written to import records from hepdata.net to a developer's local instance. It uses a list of INSPIRE IDs given at https://www.hepdata.net/search/ids?inspire_ids=true and it downloads files using a URL pattern url = "{0}/download/submission/ins{1}/original".format(base_url, inspire_id) where base_url = 'https://hepdata.net'.

The importer module should be extended to get the list of INSPIRE IDs and the download files from an alternate location, for example, a simple web directory with the INSPIRE IDs contained in the name of the files. It should also be possible to create records with any user assigned as the Coordinator (rather than just admin_user_id = 1). The ability to import only a subset of the complete list of INSPIRE IDs would be useful.

These changes should be carefully tested locally and on the QA system before importing to the production instance. Such an extension would be a quicker way of importing the 780 records obtained from Rivet than using the normal submission web interface.

See also discussion with @20DM in HEPData/hepdata_lib#229.

A list of the Rivet analyses can be seen at https://gitlab.com/hepcedar/rivet/-/issues/485 .</issue_description>

<agent_instructions>Most relevant information is given in the first and last comments.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@GraemeWatt @20DM : thanks, that's great! I'll look into modifying the [`importer`](https://github.com/HEPData/hepdata/blob/main/hepdata/modules/records/importer/api.py) module soon.

I picked a random submission (ins2705058.tar.gz) and uploaded it to my Sandbox. A few (optional) comments for your consideration:

  1. You give http://rivet.hepforge.org/analyses#BESIII_2023_I2705058 as an additional resource. This is not strictly necessary (see submission docs) since the link will automatically be added after the record is finalised from the nightly harvesting of the analyses.json file. Moreover, the automatic link added will be http://rivet.hepforge.org/analyses/BESIII_2023_I2705058 with a / instead of a #. So if you want to keep the Rivet analysis in the submission.yaml file, better to use a link with a / instead of a #, or just remove it completely.
  2. The comment has a weird markup that is not rendered by HEPData. It looks like you are taking this from the journal abstract given by the INSPIRE record (JSON). The INSPIRE JSON also provides the arXiv abstract (second item of abstracts) that uses LaTeX markup and can be rendered by HEPData. HEPData uses the arXiv abstract from INSPIRE if possible (code). Since HEPData already stores the paper abstract (although it is only displayed if there is no comment), I don't think you need to duplicate it in the comment. So I would just use the additional information "NUMERICAL VALUES HAVE BEEN DIGITISED FROM THE PAPER." as the comment or omit the comment completely if there is nothing to add. (Another possibility is to use the Description from the Rivet .info file as the comment, but in this case it contains Beam energy must be specified as analysis option "ENERGY" when rivet-merging samples. which is not relevant to the HEPData record.)
  3. It looks like Tables 1 and 2 share a common independent variable axis, so it would make sense to combine them into one table with two dependent variables, then the "Custom Rivet identifier" would not need to be given since the YODA export would give the correct identifiers automatically. Of course, I realise that some compromises need to be made in the interest of automation, and so the best overall encoding for 780 submissions is going to be different than if each submission was prepared separately.</comment_new>
    <comment_new>@GraemeWatt
    Thanks for making the changes to the tarballs. I haven't started looking at this yet, since I didn't see that it was particularly urgent, but I'll try to look into it within the next couple of months.</comment_new>
    <comment_new>@GraemeWatt
    The links given in the [previous comment](https://github.com/HEPData/hepdata/...


Co-authored-by: GraemeWatt <11544204+GraemeWatt@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Extend importer module to allow bulk import from Rivet" to "Extend importer module to allow bulk import from Rivet" on Mar 17, 2026
Copilot AI requested a review from GraemeWatt March 17, 2026 17:17


Development

Successfully merging this pull request may close these issues.

records: extend importer module to allow bulk import from Rivet
