---
marp: true
theme: uncover
---
- Understanding the Computer Cluster setup
- Learning command line Linux
- Navigating the cluster
- Submitting cluster jobs
- You will need to be registered to gain access to the cluster
- Software for Windows systems:
- FileZilla Client - https://filezilla-project.org
- PuTTY: https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
- macOS software:
- FileZilla Client - https://filezilla-project.org
- 130 nodes (on CPU partition)
- 754 GB RAM
- 112 (HT) cores
- 14,560 usable cores
- Simply put, 14,560 processes that can be run in parallel
- Huge data storage (/cephfs2: 7.1 PB)
- (Almost) never turned off
- Specialist software manages long-running jobs
- Compute cluster needed for modern life sciences datasets
- Maintained by Scientific Computing
- On a Mac, open the Terminal
- Shell -> New Tab -> Homebrew (good colour scheme)
- Connect to a head node (either hal, hex or max):
  `ssh -Y username@hal`
- Enter your cluster password
- Connect to atg first if connecting from outside (Scientific Computing need to set up access)
- On a PC, open PuTTY and connect to a head node (either hal, hex or max):
  `username@hal`
- Connect to atg first if connecting from outside
- FileZilla Client: https://filezilla-project.org
- Free and available for Windows and macOS
- Normal logon credentials, and
  - Host: `hal`
  - Port: `22`
- Analogous to a single cluster node
- 21 TB of data storage, 80 CPUs and 97 GB RAM
- Maintained by the Cell Biology Division
- Software can be installed as per users' requirements
- Let us know if you would like an account (and you are a member of the Cell Biology Division)
- `sean-pc-10.lmb.internal`
- NOT BACKED UP!!!
- Similar to Windows and macOS, Linux is an operating system
- Free and open source
- Different types of Linux, e.g. Android
- The LMB cluster uses AlmaLinux
- Big difference: you need to use the command line
- Not so intuitive, but more powerful
- Shells - command-line interpreter programs
- We recommend using Bash - arguably the best known
- This is now the LMB cluster default (but didn't use to be)
- Ask Scientific Computing to make it your default
- Otherwise, temporarily start the Bash shell with:
  `bash`
- Each command is actually a program
- Modified by flags, options and arguments
`command [-flag(s)] [-option(s) [value]] [argument(s)]`

```
ls
directory1 file1.txt file2.txt file3.txt
```
`command [-flag(s)]`

```
ls -l
total 12
drwxrwxr-x 2 swingett swingett 4096 Jul 15 15:59 directory1
-rw-rw-r-- 1 swingett swingett 0 Jul 15 15:57 file1.txt
-rw-rw-r-- 1 swingett swingett 17 Jul 15 16:35 file2.txt
-rw-rw-r-- 1 swingett swingett 37 Jul 15 16:34 file3.txt
```
`command [-flag(s)]`

```
ls -l --human-readable
total 12K
drwxrwxr-x 2 swingett swingett 4.0K Jul 15 15:59 directory1
-rw-rw-r-- 1 swingett swingett 0 Jul 15 15:57 file1.txt
-rw-rw-r-- 1 swingett swingett 17 Jul 15 16:35 file2.txt
-rw-rw-r-- 1 swingett swingett 37 Jul 15 16:34 file3.txt
```
`command [-flag(s)]`

```
ls -l -h
total 12K
drwxrwxr-x 2 swingett swingett 4.0K Jul 15 15:59 directory1
-rw-rw-r-- 1 swingett swingett 0 Jul 15 15:57 file1.txt
-rw-rw-r-- 1 swingett swingett 17 Jul 15 16:35 file2.txt
-rw-rw-r-- 1 swingett swingett 37 Jul 15 16:34 file3.txt
```
`command [-flag(s)]`

```
ls -lh
total 12K
drwxrwxr-x 2 swingett swingett 4.0K Jul 15 15:59 directory1
-rw-rw-r-- 1 swingett swingett 0 Jul 15 15:57 file1.txt
-rw-rw-r-- 1 swingett swingett 17 Jul 15 16:35 file2.txt
-rw-rw-r-- 1 swingett swingett 37 Jul 15 16:34 file3.txt
```
`command [-flag(s)] [-option(s) [value]]`

```
ls -l --sort=size
total 12
drwxrwxr-x 2 swingett swingett 4096 Jul 15 15:59 directory1
-rw-rw-r-- 1 swingett swingett 37 Jul 15 16:34 file3.txt
-rw-rw-r-- 1 swingett swingett 17 Jul 15 16:35 file2.txt
-rw-rw-r-- 1 swingett swingett 0 Jul 15 15:57 file1.txt
```
`command [-flag(s)] [-option(s) [value]] [argument(s)]`

```
ls -l file2.txt file3.txt
-rw-rw-r-- 1 swingett swingett 17 Jul 15 16:35 file2.txt
-rw-rw-r-- 1 swingett swingett 37 Jul 15 16:34 file3.txt
```
- Locations are represented as a line of text
- Folders are separated by forward slashes:
  `/lmb/home/jsmith/file1.txt`
- Relative paths:
  `../pjones/file2.txt`, `./file4.txt`, `~/folderA/file5.txt`
* ls (-l)
* pwd
* cd (. .. - ~)
* cp (-r)
* mv (move and rename)
* mkdir
* rmdir
* rm (-f -r)
* arrows / history / autocomplete / CTRL + A and CTRL + E
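The commands above can be tried out safely in one short session; a minimal sketch using a throwaway temporary directory (file and folder names are invented for illustration):

```shell
cd "$(mktemp -d)"          # start in a fresh temporary directory
mkdir project              # create a directory
cd project                 # move into it
pwd                        # print the current (working) directory
echo "some data" > notes.txt
cp notes.txt backup.txt    # copy a file
mv backup.txt old.txt      # mv renames (and moves) files
rm old.txt                 # delete a file
cd ..                      # '..' is the parent directory
rm -r project              # -r removes a directory and its contents
```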
- Use only alphanumeric characters, the underscore (_) and the dot (.):
  `my_file1.txt`
- No spaces!
- The file extension can tell you what a file is
- Hidden files begin with a dot:
  `.hidden_file.log`
* cat
* head (wonderland.txt, numeric flags)
* tail
* more
* nano (demo with / without filename)
* gzip
* zcat
* gunzip
- Redirect to a file:
  `cat file1.txt > file1_copy.txt`
  `cat file1.txt file2.txt file3.txt > combined.txt`
- Append to a file:
  `cat file4.txt >> combined.txt`
- Redirects work with other commands too (i.e. not just cat)
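A runnable sketch of overwriting (`>`) versus appending (`>>`), using invented throwaway files in a temporary directory:

```shell
cd "$(mktemp -d)"
printf 'first\n'  > file1.txt
printf 'second\n' > file4.txt

cat file1.txt > combined.txt    # '>' creates (or overwrites) combined.txt
cat file4.txt >> combined.txt   # '>>' appends to the existing file
cat combined.txt                # prints: first, then second
```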
- Takes the output from one command and passes it to another:
  `zcat file.txt.gz | more`
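For instance (a sketch with a made-up file): compress some text, then pipe the decompressed stream into another command:

```shell
cd "$(mktemp -d)"
printf 'line1\nline2\nline3\n' | gzip > file.txt.gz   # make a small gzipped file
zcat file.txt.gz | head -2    # decompress and keep only the first two lines
```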
- Searches text files
- Returns the lines in a text file where the search term is found:
  `grep organoid thesis.txt`
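A minimal sketch (the `thesis.txt` contents here are invented), including two commonly used flags:

```shell
cd "$(mktemp -d)"
printf 'Introduction\nWe grew an organoid culture\nMethods\n' > thesis.txt

grep organoid thesis.txt      # prints the matching line
grep -i ORGANOID thesis.txt   # -i makes the search case-insensitive
grep -c organoid thesis.txt   # -c counts the matching lines
```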
- Symbols that represent other characters
- Example files:
  `england.txt`, `northern_ireland.txt`, `scotland.txt`, `wales.txt`
- The asterisk matches zero or more characters:
  `ls *land.txt` lists `england.txt northern_ireland.txt scotland.txt`
- The question mark matches exactly one character:
  `ls wa?es.txt` lists `wales.txt`
- A character class matches any single character in the list:
  `ls [es]*.txt` lists `england.txt scotland.txt`
- Symbolic links are akin to shortcuts on Windows and aliases on macOS
- Link to a single file:
  `ln -s /target_folder/target_file_of_interest.txt`
- Link to a single file, but give the link a different name:
  `ln -s /target_folder/target_file_of_interest.txt link.txt`
- Link to multiple files:
  `ln -s /target_folder/*.txt .`
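A runnable sketch (folder and file names invented): create a link, read through it, and spot the leading `l` in the `ls -l` listing:

```shell
cd "$(mktemp -d)"
mkdir target_folder
echo "hello" > target_folder/target_file_of_interest.txt

ln -s "$PWD/target_folder/target_file_of_interest.txt" link.txt  # link with a new name
cat link.txt        # reads the target file through the link
ls -l link.txt      # first character of the listing is 'l' for a link
```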
- Simple description: `whatis`
- Detailed manual: `man`
- Google, ChatGPT
- Forums
- Cheat sheets
<style scoped> table { font-size: 20px; } </style>
| Column | Description (ls -l) |
|---|---|
| 1 | File type (- file / d directory / l link) |
| 2 | Permission string (owner / group / everyone) (rwx) |
| 3 | Number of hard links |
| 4 | Owner name |
| 5 | Owner group |
| 6 | File size in bytes |
| 7 | Modification time |
| 8 | File name |
- Add execute privileges for the user (owner):
  `chmod u+x [files]`
- Add write privileges for the group:
  `chmod g+w [files]`
- Remove read privileges for others:
  `chmod o-r [files]`
- Add read privileges for everyone:
  `chmod a+r [files]`
- There is also a "numerical" system to do this
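The numerical system encodes each rwx triplet as a single digit (r=4, w=2, x=1, summed). A sketch comparing the two styles on a throwaway file:

```shell
cd "$(mktemp -d)"
touch script.sh         # new file, no execute bit set

chmod u+x script.sh     # symbolic: add execute for the owner
chmod 754 script.sh     # numeric: 7=rwx owner, 5=r-x group, 4=r-- others
ls -l script.sh         # permission string shows -rwxr-xr--
```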
- Variables (e.g. `$USER`) - built-in and user-defined
- Display to screen using `echo`
- Order lines with `sort`
- Transfer data with `curl`
- Fix line endings with `dos2unix` and `mac2unix`
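A few of these in action (the variable name `place` is invented for the example):

```shell
place="cluster"                    # user-defined variable
echo "Hello from the $place"       # prints: Hello from the cluster
echo "$USER"                       # built-in variable holding your username

printf 'banana\napple\ncherry\n' | sort    # alphabetical order
printf '10\n2\n1\n' | sort -n              # -n sorts numerically: 1 2 10
```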
- `$PATH` - the list of directories searched for programs, e.g. `/usr/bin:/usr/local/sbin:/usr/sbin`
- `which` - locate a program, e.g. `which ls` returns `/usr/bin/ls`
- `ps` / `top` - view running processes
- `nohup` (no hang up) - keep a job running after you log out
- Backgrounding with `&`
- Cancel a job with CTRL + C
- Suspend with CTRL + Z, then resume in the background with `bg` (`fg` will foreground a job)
- `kill [process ID]`
- `kill -9 [process ID]`
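A sketch tying these together: start a long-running command in the background, check it with `ps`, then terminate it:

```shell
sleep 100 &        # '&' puts the command in the background
pid=$!             # $! holds the process ID of the last background job
ps -p "$pid"       # confirm the process is running
kill "$pid"        # ask it to terminate (SIGTERM)
# kill -9 "$pid"   # forceful SIGKILL - a last resort
```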
- Cluster architecture
- Logging in to the cluster
- Using Linux and command line shells
- BASH
- Navigating
- Copying; deleting; moving; linking files
- Reading; writing; searching files
- Compressing data
- Re-direction, appending, piping
- Wildcards
- File permissions
- Downloading
- Variables
- Running programs
- Checking running programs (`ps`, `top`)
- `$PATH`
- Running in the background (`&`, `bg`, `nohup`)
- Where to get help
- Try accessing the cluster from outside the LMB via atg: https://www.mrc-lmb.cam.ac.uk/scicomp/index.php?id=ssh-x2go
- Microsoft text editor (Visual Studio Code)
- Windows / Mac / Linux
- Edit remote files (even via atg)
- Built-in terminal
- View webpages
- Transfer files
- Free
- Command palette
  - Shift + Command + P (Mac)
  - Ctrl + Shift + P (Windows/Linux)
- `Remote-SSH: Open SSH Configuration File...`

```
Host hal_on_site
    HostName hal
    User your_username

Host hal_external
    HostName hal
    User your_username
    IdentityFile ~/.ssh/hal
    ProxyCommand ssh -q -W %h:%p atg.mrc-lmb.cam.ac.uk
```
- `Remote-SSH: Connect current window to host...`
- Clusters require a job management and scheduling system
- Keeps the nodes all in contact with one another etc.
- The LMB cluster uses Slurm
- Slurm is open-source software for large and small Linux clusters
- Uses the command line
- `man` pages are available
- `squeue`
- `squeue -u $USER`
- `sqsummary` - CPU node state
- `sinfo` - partition node information
- `qinfo` - interactive webpage: http://nagios2/qinfo/
- Interactive jobs: run short operations that complete quickly while you wait, then check the results and perform another calculation if required
- Submitted jobs: long-running jobs that do not require user intervention
- Move to a compute node:
  `srun --pty bash`
- The prompt changes:
  `username@fmb376`
- There are options, e.g. reserve 8 cores:
  `srun -c 8 --pty bash`
- The job runs without further user input
- Write a Bash script (`test.sh`):

```
#!/bin/bash
echo Sleeping!
sleep 100
```

- Execute the script on a head node:
  `bash test.sh` (prints `Sleeping!`)
- Submit the script to the queue on a head node:
  `sbatch test.sh`
- More options:
  `sbatch -J test_job -c 2 --mail-type=ALL --mail-user=$USER@mrc-lmb.cam.ac.uk --mem=2G test.sh`
<style scoped> table { font-size: 20px; } </style>
| Command | Function |
|---|---|
| -J [jobname] | Specify an easily identifiable jobname |
| -c [number of cores] | Number of cores on a node to reserve for the job [default: 1] |
| --mem=[RAM]G | GB of RAM to reserve for the job [default: 5] |
| --mail-type=ALL | Send email updates on job progress |
| --mail-user=$USER@mrc-lmb.cam.ac.uk | Recipient’s email address |
- `sacct -j [job id]`
- To get the maximum memory usage:
  `sacct --format=jobID%20,CPUTime,MaxRSS -j [job id]`
- Cancel a job:
  `scancel [job id]`
Slurm scripts (actual email address needed):

```
#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=john_smith@mrc-lmb.cam.ac.uk

# Bash commands
echo Hello World!
```

Run on a head node:
`sbatch test.slurm`
- Process many related tasks simultaneously
- Specify the job array in a Slurm script
- Submit using `sbatch`
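A sketch of what an array script might look like (the 1-10 task range and job name are invented for the example; `$SLURM_ARRAY_TASK_ID` is the per-task index that Slurm sets):

```
#!/bin/bash
#SBATCH --job-name=array_test
#SBATCH --array=1-10
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G

# Each of the 10 tasks receives its own index
echo "Processing task number $SLURM_ARRAY_TASK_ID"
```

Submitted once with `sbatch`, this queues 10 independent tasks.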
- Import specific software versions
- To list available modules:
  `module avail`
- To use a module:
  `module load [module name]`
- Exit codes - 0 means success!
- Viewing images requires XQuartz (Mac) or VcXsrv (Windows)
- Not so responsive - maybe transfer images to your local machine first?
- More details: https://www.mrc-lmb.cam.ac.uk/scicomp
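An exit code can be inspected with the special variable `$?`, which holds the code of the most recent command; a quick sketch:

```shell
true                 # a built-in command that always succeeds
echo $?              # prints 0

ls /no/such/path 2> /dev/null   # a command that fails
echo $?              # prints a non-zero code
```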
- `~` (home directory) - config files and scripts
- `/cephfs` & `/cephfs2` - very large data storage / suitable location for processing data
- `/scratch` - suitable location for processing data, BUT FILES ARE AUTOMATICALLY DELETED - DON'T STORE FILES HERE!
- `/istore` or `/isilon` - a place to store data
- Your group may have its own dedicated storage area
- Check your quota in the dashboard
- Refer to Scientific Computing for further information
- `~` (home directory) - config files and scripts
- Create a named folder in `/data1`, `/data2` or `/data3/scratch` to store data files
- Much smaller storage capacity compared to the cluster (terabytes)
- To/from the cluster, to/from another machine via the intranet
- Download: `scp user@host:[target_to_download] [destination_path]`
- Upload: `scp [target_to_upload] user@host:[destination_path]`
- Perform a recursive copy for folders: `-r`
- To/from the cluster, to/from another machine via the internet
- Linux command line equivalent of FileZilla
- `sftp [hostname]`
- `mget -r [files_to_download]`
- `mput -r [files_to_upload]`
- `ftp [hostname]`
- `bin` - binary transfer mode
- `prompt` - toggle per-file confirmation prompts
- LMB FTP folder: `/ftp/pub/`
- LMB FTP host: `ftp.mrc-lmb.cam.ac.uk`
- Log in as 'anonymous' with no password
- Interactive, but runs in the background (needed for SFTP)
- Start a named session: `screen -S [screen_name]`
- Detach: CTRL + A, then press D
- List sessions: `screen -ls`
- Reattach: `screen -r [ID_number]`
- End a session: `exit`
- Web interface
- Jupyter Notebook - create and share documents that contain live code, equations, plots and descriptive text
- Not supported on the cluster
- Installed on the Cell Biology Workstation (Xeon)
- We can set you up with an account
- Course: https://github.com/StevenWingett/data-analysis-with-python-course
- `/public/genomics/soft/bin`
- Add to `$PATH`?
- `softwaregroup`
- Enable software and dependencies to be bundled into one file
- Most effective way to distribute versioned bioinformatics software
- On the cluster, containers can only be run from `/public/singularity/`
- Add files to that folder: `singularitygroup`
- Also installed on Xeon, where containers can be run from any location
- NGS QC
- ATAC-seq
- ChIP-seq
- Cut and Run/Tag
- RNA-seq
- Single Cell RNA-seq (10x)
- Single Cell RNA-seq (Parse)
- Taxonomy Profiling
- NGS data downloading
- Online tutorial: https://www.youtube.com/watch?v=PPEneJfFsOI
- Linux / Bash
- Cell Biology Xeon Workstation
- Compute cluster and its architecture
- Slurm
- RStudio Server / JupyterHub / Visual Studio Code
- Find a reason to have a go in the coming weeks
- Thanks for listening!!!
https://stevenwingett.github.io/Bioinformatics_Computer_Cluster_Course











