refactor data access part 1 models validators [Please do not merge] #2007
Draft
superryeti wants to merge 3 commits into MIT-LCP:dev
Discussed in [Issue MIT-LCP#1987](MIT-LCP#1927 (comment)).
Quick Summary: the DataSource model decides where a project's files are stored (determined by `data_location`) and how they can be accessed (determined by `access_mechanism`). A single project can have multiple DataSources.
About the fields:
- `files_available` - determines whether the files can be viewed/downloaded for the given type of datasource.
- `email` - for GCP group access, this stores the email of the group.
- `uri` - the URI for the data on the external service. For s3 this is of the form `s3://<bucket_name>`; for gsutil it is of the form `gs://<bucket_name>`.
Data location and access mechanism are tightly coupled. For example, as of now the Research Environment is only implemented for GOOGLE_CLOUD_STORAGE, and direct data access is only available for resources that are stored directly on the server. The validator first checks that the appropriate fields are provided for the given data location type, and then checks that an expected access mechanism is used for that data location.
Note: This does not upload the data to the location. Currently the upload is expected to be done separately, before adding the datasource.
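As a rough illustration of the shape described above, here is a minimal sketch using plain dataclasses rather than the Django model from the PR. The enum members, the `project_slug` field, and the example bucket names are assumptions made for this example, not the PR's actual definitions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DataLocation(Enum):
    # Illustrative subset; the real choices live in the PR's model.
    DIRECT = "Direct"
    GOOGLE_CLOUD_STORAGE = "Google Cloud Storage"
    AWS_S3 = "AWS S3"


class AccessMechanism(Enum):
    # Illustrative subset; names are assumptions.
    GOOGLE_GROUP_EMAIL = "Google Group Email"
    S3 = "S3"
    RESEARCH_ENVIRONMENT = "Research Environment"


@dataclass
class DataSource:
    project_slug: str                      # hypothetical link to the project
    data_location: DataLocation            # where the files are stored
    access_mechanism: Optional[AccessMechanism] = None  # how they are accessed
    files_available: bool = False          # can files be viewed/downloaded?
    email: Optional[str] = None            # group email for GCP group access
    uri: Optional[str] = None              # e.g. s3://<bucket_name> or gs://<bucket_name>


# A single project can have multiple data sources.
sources = [
    DataSource(
        project_slug="example-project",
        data_location=DataLocation.GOOGLE_CLOUD_STORAGE,
        access_mechanism=AccessMechanism.RESEARCH_ENVIRONMENT,
        uri="gs://example-bucket",
    ),
    DataSource(
        project_slug="example-project",
        data_location=DataLocation.AWS_S3,
        access_mechanism=AccessMechanism.S3,
        uri="s3://example-bucket",
    ),
]
```

Creating a `DataSource` in this sketch only records where the data lives; as noted, the upload itself is expected to happen separately.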
Context:
We are breaking PR #1967 into smaller PRs (easier to review and work on). This branch is expected to merge into #1967, not dev.
This PR introduces the `DataAccess` model and validators.
Quick Summary about the model
The `DataSource` model should be used to decide where the files are stored (determined by `data_location`) for a project and how they can be accessed (determined by `access_mechanism`). A single project can have multiple DataSources.
About the fields
- `files_available` - determines if the files can be viewed/downloaded for the given type of datasource. (@kshalot had notes about this field here #1967 (comment))
- `email` - For GCP group access, this would store the email of the group.
- `uri` - The URI for the data on the external service. For s3 this would be of the form `s3://<bucket_name>`, for gsutil this would be of the form `gs://<bucket_name>`.
Quick Summary about validators
The validation is based on four aspects: required fields, forbidden fields, required access mechanisms, and forbidden access mechanisms.
Required Fields: For each data location (such as Google BigQuery, Google Cloud Storage, AWS Open Data, and AWS S3), certain fields must be present. For instance, Google BigQuery requires an 'email', while Google Cloud Storage, AWS Open Data, and AWS S3 require a 'uri'. If a required field is missing, a validation error is raised.
Forbidden Fields: Conversely, for certain data locations, some fields must not be present. For example, for 'Direct' data location, 'uri' and 'email' fields should not be present. If they are found, a validation error is raised.
Required Access Mechanisms: Each data location may also require one of several specified access mechanisms. For instance, Google BigQuery and Google Cloud Storage can require either a 'Google Group Email' or a 'Research Environment' access mechanism, while AWS Open Data and AWS S3 require an 'S3' access mechanism. If none of the acceptable access mechanisms are found, a validation error is raised.
Forbidden Access Mechanisms: Finally, some data locations forbid certain access mechanisms. Specifically, the 'Direct' data location forbids the 'Google Group Email', 'S3', and 'Research Environment' access mechanisms. If any of these are present, a validation error is raised.
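The four rule sets above can be sketched as lookup tables plus a single check function. This is a simplified illustration, not the PR's code: the string keys, the `validate_data_source` function, and the local `ValidationError` class (a stand-in for Django's `django.core.exceptions.ValidationError`) are assumptions made for this example.

```python
REQUIRED_FIELDS = {
    "Google BigQuery": ["email"],
    "Google Cloud Storage": ["uri"],
    "AWS Open Data": ["uri"],
    "AWS S3": ["uri"],
}
FORBIDDEN_FIELDS = {
    "Direct": ["uri", "email"],
}
REQUIRED_ACCESS_MECHANISMS = {
    "Google BigQuery": ["Google Group Email", "Research Environment"],
    "Google Cloud Storage": ["Google Group Email", "Research Environment"],
    "AWS Open Data": ["S3"],
    "AWS S3": ["S3"],
}
FORBIDDEN_ACCESS_MECHANISMS = {
    "Direct": ["Google Group Email", "S3", "Research Environment"],
}


class ValidationError(ValueError):
    """Local stand-in for Django's ValidationError (assumption)."""


def validate_data_source(data_location, access_mechanism=None, email=None, uri=None):
    fields = {"email": email, "uri": uri}

    # Required fields: each must be present for this data location.
    for name in REQUIRED_FIELDS.get(data_location, []):
        if not fields[name]:
            raise ValidationError(f"{data_location} requires '{name}'.")

    # Forbidden fields: none may be present for this data location.
    for name in FORBIDDEN_FIELDS.get(data_location, []):
        if fields[name]:
            raise ValidationError(f"{data_location} must not set '{name}'.")

    # Required access mechanisms: one of the acceptable ones must be used.
    allowed = REQUIRED_ACCESS_MECHANISMS.get(data_location)
    if allowed and access_mechanism not in allowed:
        raise ValidationError(f"{data_location} requires one of: {', '.join(allowed)}.")

    # Forbidden access mechanisms: reject any that are disallowed.
    if access_mechanism in FORBIDDEN_ACCESS_MECHANISMS.get(data_location, []):
        raise ValidationError(f"{data_location} forbids '{access_mechanism}'.")
```

Keeping the rules in tables means that supporting a new data location is mostly a matter of adding entries rather than new branches.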
Quick Note about the interface
This is so that we can quickly test whether the validators work and create datasources.