How to Load RDF Data
We're assuming your RDF data is stored on an EBS volume (called DATA) in some specific availability zone (called ZONE). (Note: Add link to a test snapshot) We're also assuming that this volume has a partition table and that the first partition holds the data, so we will mount the first partition, /dev/xvdp1. We will call the machine you are doing this on the HOST.
(An alternative strategy is to create a new DATA volume; attach, format, and mount it; then copy the data from some other location such as S3 before you schedule the load.)
RDFeasy targets the r3 series of instances in AWS. Up to at least the r3.2xlarge instance, the size of the database file that fits on the SSD is the factor that limits how much you can load, and this is expected to hold up to the r3.4xlarge without software changes.
- Start up an instance of product B00KRI3DWW in the same ZONE as DATA; log in, but do not proceed until the database password (visible on login) has been assigned.
Instructions for logging in are in the Basic Usage instructions.
Strictly speaking, this step is optional, but if you're going to burn a machine image, you might as well burn one that has the latest security fixes on it.
You DO NOT want to update the operating system during the first boot, before the new password has been installed in the database.
The following procedure is overkill but bulletproof.
sudo service virtuoso stop
sudo apt-get update
sudo apt-get upgrade -y
sudo reboot
In these steps, the var directory for Virtuoso is copied to an SSD and remounted, full-text indexing is disabled to improve scalability, and the raw data is added to the system:
- initialize_ssd
- wait_until virtuoso_ready (junk output from the curl command is normal when this script runs)
- disable_fulltext
- use the AWS Console or API to assign DATA to /dev/xvdp
- sudo mkdir /mnt/data ; sudo mount /dev/xvdp1 /mnt/data
In the Virtuoso bulk loading process, it is necessary first to populate the database table db.dba.load_list with a list of files to be loaded. This is detailed in the Virtuoso Documentation.
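As a rough sketch of what such a registration step looks like (the graph IRI and file pattern below are placeholders, not values taken from this data set), you could generate the statements and feed them to Virtuoso's isql client:

```shell
# Hypothetical sketch: write an SQL script that registers every *.gz file
# under /mnt/data in db.dba.load_list. GRAPH is a placeholder IRI.
GRAPH='http://example.com/mygraph'
cat > /tmp/schedule_load.sql <<EOF
ld_dir('/mnt/data', '*.gz', '$GRAPH');
SELECT ll_file, ll_state FROM db.dba.load_list;
EOF
# Then run it against the local server, using the database password
# assigned at first login (DBA_PASSWORD is a placeholder):
# isql 1111 dba "$DBA_PASSWORD" /tmp/schedule_load.sql
```

The bundled schedule scripts do essentially this with the paths and graph IRIs appropriate to each edition of the data set.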
Shell scripts are included to schedule the loading of certain data sets. If you use the snap-7dbc8eaf data set that includes :BaseKB Gold and :SubjectiveEye, there are two different loader scripts:
schedule_small_load -- loads the Compact Edition of :BaseKB on an r3.xlarge instance.
schedule_large_load -- loads the Complete Edition of :BaseKB on an r3.2xlarge instance.
Looking at the source code for these scripts may give you some idea as to how to write your own scripts to load your own data sets. The RDFeasy directory is checked in with Git; feel free to fork it if you wish to write your own loading scripts.
A single instance of the RDF Bulk Loader can reach nearly 100% CPU usage on a 4 core or smaller machine (r3.large or r3.xlarge). 100% CPU usage can be attained with 2 copies of the bulk loader running concurrently (r3.2xlarge) and presumably one runs 4 copies on an r3.4xlarge and 8 copies on an r3.8xlarge.
A single instance of the bulk loader is created by the command
rdf_loader_run
(which prints some trash to the console). Multiple RDF loaders can be started by running this command more than once.
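The scaling advice above can be sketched as a small wrapper, assuming rdf_loader_run is on the PATH as described:

```shell
# Sketch: start N concurrent bulk loaders and wait for all of them.
# N=2 matches the r3.2xlarge advice above; use 4 on an r3.4xlarge, etc.
N=${N:-2}
for i in $(seq 1 "$N"); do
  rdf_loader_run > "/tmp/loader_$i.log" 2>&1 &   # one loader per pair of cores
done
wait   # returns once every background loader has exited
```

Capturing each loader's output in its own log file keeps the console trash out of the way and makes it easy to check later that each loader exited cleanly.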
The following script waits until the end of the load, which could take a few hours for a large data set.
wait_and_beep still_loading
By creating the machine image, you take a snapshot of the database state which can be restored later.
- create an EBS volume large enough to hold the database snapshot (call it NEW). It is a conservative choice to create a volume as large as the SSD on the machine you are running on, but it is reasonable to create a volume 20% larger than the data file to allow for temporary files created by large queries.
- attach NEW to /dev/xvdf on the HOST with the AWS Console or API
- copy_to_ebs
- add_ebs_database_to_fstab
- shred_evidence_and_halt
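The 20% sizing rule for NEW can be worked through with a little shell arithmetic; the 48 GiB figure below is only an illustrative stand-in for the size of your actual database file:

```shell
# Sketch: size the NEW volume ~20% larger than the database file.
DB_BYTES=$((48 * 1024 * 1024 * 1024))          # stand-in: a 48 GiB data file
PADDED=$((DB_BYTES * 12 / 10))                 # add 20% headroom
GIB=$(( (PADDED + 1073741823) / 1073741824 ))  # round up to whole GiB
echo "Create NEW with at least ${GIB} GiB"
```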
The shred_evidence_and_halt script removes cryptographic key information to make the AMI safe for general distribution. Your cryptographic keys will be installed when you create a new instance based on this AMI; however, the loss of key information on the HOST means you will not be able to log into it if you reboot it. This condition can be repaired by mounting the root filesystem of HOST on another computer and editing /home/ubuntu/.ssh/authorized_keys, but it is a best practice to terminate HOST once you've created an image from it.
Finally, you need to create the machine image. This can be done from the EC2 Management Console. You should make sure that exactly one EBS volume (the NEW volume) is attached to the machine and that this volume is marked with "Delete on Termination" as true. (This way you can spin up and terminate many instances of this image without accumulating large EBS volumes.)
The time scale of image creation is 'an hour or so' for data sets that fill an r3.xlarge or r3.2xlarge. Terminate HOST when the image is complete.
Launch the machine image on the same-sized instance as you used to create it. See the usage instructions for the new AMI.