-
Notifications
You must be signed in to change notification settings - Fork 2
Bulk import of identifier (and other?) table data from files #62
Description
JournalsDB tables were initially populated by loading existing classic files into tables. Subsequent to this, additions to the tables are made by exporting the tables to Google Sheets, making edits, and reimporting the data. The export/import process records the editorial history and enforces some data requirements, but it is time-intensive if more than a few records are being added.
We need to implement a process in run.py that will trigger the bulk import of data from files, but modified so that the data undergo validity checks prior to db insert. This discussion will focus on the bulk import of ISSNs.
Starting from the assumption that the bibcodes for new records already exist in the master table, we need a task that:
- will read a formatted text file containing a bibstem, an ISSN_type, and an ISSN
- verify that the bibstems are already in journalsdb (i.e. have a masterid)
- verify that the ISSN isn't already in journalsdb.idents
- check whether there is an existing ISSN of the given ISSN_type for that bibstem
If the given ISSN is truly new, the task should import the masterid, ISSN_type, and ISSN into idents. If it is not new, or if there is some other issue, it should provide actionable information to the user about what is wrong.
The procedure for this will be something like:
- read the file
- search journalsdb for the bibstem, and whether that bibstem already has any ISSNs
- search journalsdb for the ISSN, and return a warning if it has already been assigned to another bibstem, or the same bibstem but with a different ISSN_type
- if bibstem exists, ISSN does not, and ISSN_type has not already been assigned to bibstem, insert (masterid, ISSN, ISSN_type)
- if bibstem exists, ISSN does not, but ISSN_type has already been assigned (and you're sure the new record is a valid alternate ISSN for this bibstem), insert (masterid, ISSN, ISSN_type_altN), where the altN is generated from existing table data for alternate ISSNs
- if bibstem does not exist, reject the row and warn the user
- if bibstem exists and ISSN exists for the same bibstem, reject the row and notify the user the ISSN is already present
- if bibstem exists, and ISSN exists but for a different bibstem and/or ISSN_type, reject the row and warn the user there's a problematic record that requires investigation
The code for inserting an identifier of given type already exists in journalsmanager.tasks.task_db_load_identifier, so the new coding required is for reading the file, doing the required sqlalchemy queries to validate the incoming data, and generating logging messages of the appropriate level (info/warning) that will direct the curator on what needs to be resolved.