This project builds a real-time Vietnamese Sign Language (VSL) recognition system using deep learning and computer vision. The system consists of two main models:
| Model | File | Function | Architecture |
|---|---|---|---|
| Model 1 | asl_landmarks_best.joblib | Recognize letters (A–Z) and digits (0–9) | SVM/MLP with MediaPipe Hands |
| Model 2 | VSL.h5 | Recognize Vietnamese words/phrases (60 sentences) | LSTM Neural Network |
```
LBS/
├── archive/
│   └── asl_dataset/              # Dataset of letters & digits (A–Z, 0–9)
├── Dataset_Staged/               # Staged VSL dataset by difficulty level
│   ├── 01_Alphabet_Numbers/      # Letters, digits
│   ├── 02_Simple_Words/          # Simple words
│   ├── 03_Complex_Words/         # Complex words
│   └── 04_Advanced/              # Advanced
├── func_files/                   # Utility scripts
│   ├── convert_to_gif.py         # Convert to GIF
│   ├── data_crawl.py             # Data collection
│   ├── extract_kp.py             # Keypoint extraction
│   └── make_2d.py                # 2D conversion
├── training_method_1/            # Method 1: SVM/MLP for letters
│   ├── train_3.ipynb             # Training notebook
│   ├── cam.py                    # Realtime inference
│   └── asl_landmarks_best.joblib # Trained model
├── training_method_2/            # Method 2: GCN (experimental)
│   ├── gcn.py                    # Graph Convolutional Network
│   ├── preprocessing.py          # Preprocessing
│   └── models/                   # Checkpoints
├── training_method_3/            # Method 3: LSTM for words/phrases
│   ├── CollectData.ipynb         # Data collection
│   ├── ActionDetection.ipynb     # Model training
│   ├── Model.py                  # Realtime inference
│   └── VSL.h5                    # Trained model
└── requirements.txt              # Dependencies
```
- Data source: ASL Dataset (American Sign Language) with folders A–Z and 0–9
- Format: Static images (JPG, PNG, BMP)
- Total classes: 36 classes (26 letters + 10 digits)
Use MediaPipe Hands to extract 21 hand landmarks:
```python
import numpy as np

# Each landmark has 3 coordinates (x, y, z)
# Total features: 21 * 3 = 63
def lm_to_feat(lm, handed):
    pts = np.array([[p.x, p.y, p.z] for p in lm.landmark])
    # Normalization: mirror the x-axis for a left hand
    if handed == 'Left':
        pts[:, 0] = 1.0 - pts[:, 0]
    # Translate so the wrist (landmark 0) is at the origin
    wrist = pts[0].copy()
    pts -= wrist
    # Normalize by the palm scale (wrist to middle-finger MCP, landmark 9)
    palm_scale = np.linalg.norm(pts[9, :2]) + 1e-6
    pts /= palm_scale
    return pts.flatten()  # 63 features
```

- Rotation augmentation: Random rotation of ±10° to improve robustness
- Augment ratio: 1x (each sample generates 1 augmented sample)
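A minimal sketch of this augmentation, assuming it operates on the 63-dimensional output of `lm_to_feat` (where the wrist is already at the origin); the name `augment_rotation` is illustrative, not from the project code:

```python
import numpy as np

def augment_rotation(feat, max_deg=10.0):
    """Rotate the x/y coordinates of a 63-dim landmark feature vector
    by a random angle in [-max_deg, +max_deg] degrees; z is left unchanged."""
    pts = feat.reshape(21, 3).copy()
    theta = np.deg2rad(np.random.uniform(-max_deg, max_deg))
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s],
                    [s,  c]])
    # The wrist sits at the origin after normalization, so a plain
    # 2D rotation pivots around the wrist as intended.
    pts[:, :2] = pts[:, :2] @ rot.T
    return pts.flatten()
```

With an augment ratio of 1x, each original sample is passed through this once, doubling the training set.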
Use GridSearchCV to search for the best model between:
- SVM (Support Vector Machine)
  - Kernel: RBF
  - C: [1, 5, 10]
  - Gamma: ['scale', 0.1, 0.01]
  - Class weight: balanced
- MLP (Multi-Layer Perceptron)
  - Hidden layers: (128, 64)
  - Activation: ReLU
  - Alpha: [1e-4, 1e-3]
  - Learning rate: [1e-3, 5e-4]
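Wired into `GridSearchCV`, the SVM grid above might look like the following sketch (the MLP grid is set up analogously with `MLPClassifier`; this is not the notebook's exact code):

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Pipeline mirroring the project's scaler + classifier setup
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC(kernel='rbf', class_weight='balanced')),
])

# Grid values taken from the list above
param_grid = {
    'clf__C': [1, 5, 10],
    'clf__gamma': ['scale', 0.1, 0.01],
}

search = GridSearchCV(pipe, param_grid,
                      cv=StratifiedKFold(n_splits=5, shuffle=True),
                      n_jobs=-1)
# search.fit(X, y); search.best_estimator_ is the tuned pipeline
```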
```python
pipe = Pipeline([
    ('scaler', StandardScaler()),  # Feature normalization
    ('clf', clf)                   # SVC or MLPClassifier from the grid search
])

# Cross-validation: 5-fold Stratified
cv = StratifiedKFold(n_splits=5, shuffle=True)
```

- Confidence threshold: 0.60
- History voting: 7 most recent frames
- Method: Majority voting to reduce noise
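The voting scheme can be sketched as follows (the class and method names are illustrative, not taken from `cam.py`):

```python
from collections import Counter, deque

class PredictionSmoother:
    """Majority vote over the most recent frame-level predictions."""

    def __init__(self, history=7, conf_threshold=0.60):
        self.buffer = deque(maxlen=history)
        self.conf_threshold = conf_threshold

    def update(self, label, confidence):
        # Drop low-confidence frames entirely
        if confidence < self.conf_threshold:
            return None
        self.buffer.append(label)
        winner, votes = Counter(self.buffer).most_common(1)[0]
        # Emit a label only when a strict majority of the buffer agrees
        if votes > len(self.buffer) // 2:
            return winner
        return None
```

Called once per frame with the classifier's top label and probability, it returns a stable label only when most recent frames agree, which suppresses single-frame jitter.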
The model can recognize 60 common Vietnamese words/phrases:
| Group | Example |
|---|---|
| Greetings | xin chao, cam on, chuc mung, hen gap lai cac ban |
| Personal information | ban ten la gi, toi la hoc sinh, toi la nguoi Diec |
| Health | ban khoe khong, toi bi dau dau, toi can gap bac si |
| Locations | toi dang o cong vien, toi di sieu thi, toi song o Ha Noi |
| Emotions | toi cam thay rat vui, toi dang buon, toi thay nho ban |
| Emergency | cap cuu, toi bi cuop, toi bi lac |
(Note: sentences are kept in Vietnamese without diacritics for label naming.)
- Use a webcam to record videos of performing the signs
- Per sentence: 60 video sequences
- Per sequence: 60 frames
- Storage: `.npy` files containing the keypoints of each frame
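A helper matching this layout might look like the sketch below (`save_frame` and the constants are illustrative; the MediaPipe capture loop is omitted):

```python
import os
import numpy as np

DATA_PATH = 'Data'
SEQUENCES = 60        # videos recorded per sentence
SEQUENCE_LENGTH = 60  # frames per video

def save_frame(action, sequence, frame_num, keypoints):
    """Store one frame's 126-dim keypoint vector as Data/<action>/<seq>/<frame>.npy."""
    folder = os.path.join(DATA_PATH, action, str(sequence))
    os.makedirs(folder, exist_ok=True)
    np.save(os.path.join(folder, f'{frame_num}.npy'), keypoints)
```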
```
# Data folder structure
Data/
├── xin chao/
│   ├── 0/                 # Sequence 0
│   │   ├── 0.npy
│   │   ├── 1.npy
│   │   └── ... (60 files)
│   ├── 1/                 # Sequence 1
│   └── ... (60 sequences)
├── cam on/
└── ... (60 actions)
```

Use MediaPipe Holistic to extract landmarks of both hands:
```python
import numpy as np

def extract_keypoints(results):
    # Left hand: 21 points * 3 coordinates = 63 features
    lh = (np.array([[res.x, res.y, res.z]
                    for res in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    # Right hand: 21 points * 3 coordinates = 63 features
    rh = (np.array([[res.x, res.y, res.z]
                    for res in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([lh, rh])  # 126 features
```

```python
model = Sequential()
model.add(LSTM(64, return_sequences=True, activation='relu',
               input_shape=(60, 126)))  # Input: 60 frames, 126 features
model.add(LSTM(128, return_sequences=True, activation='relu'))
model.add(LSTM(64, return_sequences=False, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(60, activation='softmax'))  # 60 classes

model.compile(optimizer='Adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
```

| Parameter | Value |
|---|---|
| Sequence Length | 60 frames |
| Features per frame | 126 (21×3×2 hands) |
| LSTM layers | 3 (64→128→64 units) |
| Dense layers | 2 (64→32 units) |
| Optimizer | Adam |
| Loss | Categorical Crossentropy |
| Epochs | 100 |
| Train/Test split | 80/20 |
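Putting the table together, the training step could be sketched as follows (assuming sequences `X` of shape `(n, 60, 126)` and integer labels `y`; `build_model` and `train` are illustrative names, not the notebook's exact code):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.utils import to_categorical

NUM_CLASSES = 60

def build_model():
    # The 3 LSTM + 2 Dense architecture from the table above
    model = Sequential([
        Input(shape=(60, 126)),
        LSTM(64, return_sequences=True, activation='relu'),
        LSTM(128, return_sequences=True, activation='relu'),
        LSTM(64, return_sequences=False, activation='relu'),
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(NUM_CLASSES, activation='softmax'),
    ])
    model.compile(optimizer='Adam', loss='categorical_crossentropy',
                  metrics=['categorical_accuracy'])
    return model

def train(X, y, epochs=100):
    """X: (n, 60, 126) keypoint sequences; y: integer labels in [0, NUM_CLASSES)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, to_categorical(y, num_classes=NUM_CLASSES), test_size=0.20)
    model = build_model()
    model.fit(X_train, y_train, epochs=epochs,
              validation_data=(X_test, y_test), verbose=0)
    return model
```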
```python
# Thresholds and parameters
CONF_THRESHOLD = 0.80        # Confidence threshold to accept a prediction
PRESENCE_START_FRAMES = 5    # Frames with visible hands required to start recording
ABSENCE_END_FRAMES = 10      # Frames without hands required to reset
HOLD_PRED_FRAMES = 45        # Frames to keep the result on screen
```

```
# State machine
STATE_IDLE → STATE_RECORDING → STATE_WAIT_ABSENCE → STATE_IDLE
```

Inference procedure:
- IDLE: Wait for stable hand presence (5 frames)
- RECORDING: Record 60 frames of keypoints
- PREDICT: Predict once when 60 frames are collected
- WAIT_ABSENCE: Wait for the user to lower hands to reset
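The four states can be sketched as a per-frame update (a simplified sketch; the real `Model.py` also handles drawing and the 45-frame result hold):

```python
IDLE, RECORDING, WAIT_ABSENCE = 'IDLE', 'RECORDING', 'WAIT_ABSENCE'
PRESENCE_START_FRAMES = 5
ABSENCE_END_FRAMES = 10
SEQUENCE_LENGTH = 60

class SignStateMachine:
    def __init__(self):
        self.state = IDLE
        self.presence = 0
        self.absence = 0
        self.frames = []

    def step(self, hands_visible, keypoints=None):
        """Feed one frame; returns the 60-frame sequence when it is complete."""
        if self.state == IDLE:
            # Require a stable run of frames with hands before recording
            self.presence = self.presence + 1 if hands_visible else 0
            if self.presence >= PRESENCE_START_FRAMES:
                self.state, self.frames = RECORDING, []
        elif self.state == RECORDING:
            self.frames.append(keypoints)
            if len(self.frames) == SEQUENCE_LENGTH:
                self.state, self.presence = WAIT_ABSENCE, 0
                return self.frames  # hand off to the model for prediction
        elif self.state == WAIT_ABSENCE:
            # Reset only after the hands have left the frame for a while
            self.absence = self.absence + 1 if not hands_visible else 0
            if self.absence >= ABSENCE_END_FRAMES:
                self.state, self.absence = IDLE, 0
        return None
```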
- Python 3.8+
- Webcam
- GPU (recommended for training)
```bash
pip install -r requirements.txt
```

```
torch>=2.0.0
tensorflow>=2.5.0
mediapipe
opencv-python
scikit-learn>=1.0.0
numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.5.0
joblib
```
Model 1 – Letters & Digits:

```bash
cd training_method_1
python cam.py
```

Model 2 – Words/Phrases:

```bash
cd training_method_3
python Model.py
```

Model 1:

```bash
# Open Jupyter Notebook
jupyter notebook training_method_1/train_3.ipynb
```

Model 2:

```bash
# Collect new data
jupyter notebook training_method_3/CollectData.ipynb

# Train the model
jupyter notebook training_method_3/ActionDetection.ipynb
```

- Evaluation method: Classification Report + Confusion Matrix
- Cross-validation: 5-fold Stratified
- Evaluation method: Accuracy Score + Confusion Matrix
- Metrics: Categorical Accuracy
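Both evaluations boil down to a few scikit-learn calls; a generic sketch (`evaluate` is an illustrative helper, not a project function):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def evaluate(y_true, y_pred):
    """Print per-class precision/recall/F1 and the confusion matrix,
    and return overall accuracy."""
    print(classification_report(y_true, y_pred, zero_division=0))
    print(confusion_matrix(y_true, y_pred))
    return accuracy_score(y_true, y_pred)
```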
- Location: `archive/asl_dataset/`
- Content: Static images of 36 signs (A–Z, 0–9)
- Structure: One folder per class
- Location: `Dataset_Staged/`
- Content: Vietnamese sign language videos by difficulty level
- Levels:
  - 01: Letters, digits, tone markers
  - 02: Common simple words
  - 03: Complex words
  - 04: Advanced
| File | Function |
|---|---|
| `func_files/convert_to_gif.py` | Convert videos to GIFs |
| `func_files/data_crawl.py` | Collect data from sources |
| `func_files/extract_kp.py` | Extract keypoints from videos |
| `func_files/make_2d.py` | Convert data to 2D format |
- Run `training_method_1/cam.py`
- Place your hand in the camera frame
- Perform the sign for a letter or digit
- The result will be displayed on the screen
- Press `q` to quit
- Run `training_method_3/Model.py`
- Wait for the IDLE state
- Place your hands in the camera frame and keep them stable
- Perform the sign (60 frames will be recorded)
- Wait for the result to be displayed
- Lower your hands to reset and perform a new sign
- Press `q` or `ESC` to quit
- Trần Gia Khánh - 23021599
- Vũ Nhật Tường Vân - 23021747
- Course: Human–Computer Interaction