This project builds a real-time Vietnamese Sign Language (VSL) recognition system using deep learning and computer vision. The system consists of two main models:
| Model | File | Function | Architecture |
|---|---|---|---|
| Model 1 | asl_landmarks_best.joblib | Recognize letters (A–Z) and digits (0–9) | SVM/MLP with MediaPipe Hands |
| Model 2 | VSL.h5 | Recognize Vietnamese words/phrases (60 sentences) | LSTM Neural Network |
```
LBS/
├── archive/
│   └── asl_dataset/              # Dataset of letters & digits (A–Z, 0–9)
├── Dataset_Staged/               # Staged VSL dataset by difficulty level
│   ├── 01_Alphabet_Numbers/      # Letters, digits
│   ├── 02_Simple_Words/          # Simple words
│   ├── 03_Complex_Words/         # Complex words
│   └── 04_Advanced/              # Advanced
├── func_files/                   # Utility scripts
│   ├── convert_to_gif.py         # Convert to GIF
│   ├── data_crawl.py             # Data collection
│   ├── extract_kp.py             # Keypoint extraction
│   └── make_2d.py                # 2D conversion
├── training_method_1/            # Method 1: SVM/MLP for letters
│   ├── train_3.ipynb             # Training notebook
│   ├── cam.py                    # Realtime inference
│   └── asl_landmarks_best.joblib # Trained model
├── training_method_2/            # Method 2: GCN (experimental)
│   ├── gcn.py                    # Graph Convolutional Network
│   ├── preprocessing.py          # Preprocessing
│   └── models/                   # Checkpoints
├── training_method_3/            # Method 3: LSTM for words/phrases
│   ├── CollectData.ipynb         # Data collection
│   ├── ActionDetection.ipynb     # Model training
│   ├── Model.py                  # Realtime inference
│   └── VSL.h5                    # Trained model
└── requirements.txt              # Dependencies
```
- Data source: ASL Dataset (American Sign Language) with folders A–Z and 0–9
- Format: Static images (JPG, PNG, BMP)
- Total classes: 36 classes (26 letters + 10 digits)
Use MediaPipe Hands to extract 21 hand landmarks:
```python
import numpy as np

# Each landmark has 3 coordinates (x, y, z)
# Total features: 21 * 3 = 63
def lm_to_feat(lm, handed):
    pts = np.array([[p.x, p.y, p.z] for p in lm.landmark])
    # Normalization: mirror the x-axis for a left hand
    if handed == 'Left':
        pts[:, 0] = 1.0 - pts[:, 0]
    # Translate so the wrist (landmark 0) is at the origin
    wrist = pts[0].copy()
    pts -= wrist
    # Normalize by the palm scale (wrist to middle-finger MCP, landmark 9)
    palm_scale = np.linalg.norm(pts[9, :2]) + 1e-6
    pts /= palm_scale
    return pts.flatten()  # 63 features
```

- Rotation augmentation: Random rotation of ±10° to improve robustness
- Augment ratio: 1x (each sample generates 1 augmented sample)
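A minimal sketch of this augmentation, assuming it operates on the 63-dimensional output of `lm_to_feat` (where the wrist is already at the origin); the name `augment_rotation` is illustrative, not from the project code:

```python
import numpy as np

def augment_rotation(feat, max_deg=10.0):
    """Rotate the x/y coordinates of a 63-dim landmark feature vector
    by a random angle in [-max_deg, +max_deg] degrees; z is left unchanged."""
    pts = feat.reshape(21, 3).copy()
    theta = np.deg2rad(np.random.uniform(-max_deg, max_deg))
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s],
                    [s,  c]])
    # The wrist sits at the origin after normalization, so a plain
    # 2D rotation pivots around the wrist as intended.
    pts[:, :2] = pts[:, :2] @ rot.T
    return pts.flatten()
```

With an augment ratio of 1x, each original sample is passed through this once, doubling the training set.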
Use GridSearchCV to search for the best model between:
- SVM (Support Vector Machine)
  - Kernel: RBF
  - C: [1, 5, 10]
  - Gamma: ['scale', 0.1, 0.01]
  - Class weight: balanced
- MLP (Multi-Layer Perceptron)
  - Hidden layers: (128, 64)
  - Activation: ReLU
  - Alpha: [1e-4, 1e-3]
  - Learning rate: [1e-3, 5e-4]
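Wired into `GridSearchCV`, the SVM grid above might look like the following sketch (the MLP grid is set up analogously with `MLPClassifier`; this is not the notebook's exact code):

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Pipeline mirroring the project's scaler + classifier setup
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC(kernel='rbf', class_weight='balanced')),
])

# Grid values taken from the list above
param_grid = {
    'clf__C': [1, 5, 10],
    'clf__gamma': ['scale', 0.1, 0.01],
}

search = GridSearchCV(pipe, param_grid,
                      cv=StratifiedKFold(n_splits=5, shuffle=True),
                      n_jobs=-1)
# search.fit(X, y); search.best_estimator_ is the tuned pipeline
```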
```python
pipe = Pipeline([
    ('scaler', StandardScaler()),  # Feature normalization
    ('clf', clf)                   # SVC or MLPClassifier from the grid search
])

# Cross-validation: 5-fold Stratified
cv = StratifiedKFold(n_splits=5, shuffle=True)
```

- Confidence threshold: 0.60
- History voting: 7 most recent frames
- Method: Majority voting to reduce noise
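The voting scheme can be sketched as follows (the class and method names are illustrative, not taken from `cam.py`):

```python
from collections import Counter, deque

class PredictionSmoother:
    """Majority vote over the most recent frame-level predictions."""

    def __init__(self, history=7, conf_threshold=0.60):
        self.buffer = deque(maxlen=history)
        self.conf_threshold = conf_threshold

    def update(self, label, confidence):
        # Drop low-confidence frames entirely
        if confidence < self.conf_threshold:
            return None
        self.buffer.append(label)
        winner, votes = Counter(self.buffer).most_common(1)[0]
        # Emit a label only when a strict majority of the buffer agrees
        if votes > len(self.buffer) // 2:
            return winner
        return None
```

Called once per frame with the classifier's top label and probability, it returns a stable label only when most recent frames agree, which suppresses single-frame jitter.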
The model can recognize 60 common Vietnamese words/phrases:
| Group | Example |
|---|---|
| Greetings | xin chao, cam on, chuc mung, hen gap lai cac ban |
| Personal information | ban ten la gi, toi la hoc sinh, toi la nguoi Diec |
| Health | ban khoe khong, toi bi dau dau, toi can gap bac si |
| Locations | toi dang o cong vien, toi di sieu thi, toi song o Ha Noi |
| Emotions | toi cam thay rat vui, toi dang buon, toi thay nho ban |
| Emergency | cap cuu, toi bi cuop, toi bi lac |
(Note: sentences are kept in Vietnamese without diacritics for label naming.)
- Use a webcam to record videos of performing the signs
- Per sentence: 60 video sequences
- Per sequence: 60 frames
- Storage: `.npy` files containing the keypoints of each frame
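A helper matching this layout might look like the sketch below (`save_frame` and the constants are illustrative; the MediaPipe capture loop is omitted):

```python
import os
import numpy as np

DATA_PATH = 'Data'
SEQUENCES = 60        # videos recorded per sentence
SEQUENCE_LENGTH = 60  # frames per video

def save_frame(action, sequence, frame_num, keypoints):
    """Store one frame's 126-dim keypoint vector as Data/<action>/<seq>/<frame>.npy."""
    folder = os.path.join(DATA_PATH, action, str(sequence))
    os.makedirs(folder, exist_ok=True)
    np.save(os.path.join(folder, f'{frame_num}.npy'), keypoints)
```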
```
# Data folder structure
Data/
├── xin chao/
│   ├── 0/                 # Sequence 0
│   │   ├── 0.npy
│   │   ├── 1.npy
│   │   └── ... (60 files)
│   ├── 1/                 # Sequence 1
│   └── ... (60 sequences)
├── cam on/
└── ... (60 actions)
```

Use MediaPipe Holistic to extract landmarks of both hands:
```python
import numpy as np

def extract_keypoints(results):
    # Left hand: 21 points * 3 coordinates = 63 features
    lh = (np.array([[res.x, res.y, res.z]
                    for res in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    # Right hand: 21 points * 3 coordinates = 63 features
    rh = (np.array([[res.x, res.y, res.z]
                    for res in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([lh, rh])  # 126 features
```

```python
model = Sequential()
model.add(LSTM(64, return_sequences=True, activation='relu',
               input_shape=(60, 126)))  # Input: 60 frames, 126 features
model.add(LSTM(128, return_sequences=True, activation='relu'))
model.add(LSTM(64, return_sequences=False, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(60, activation='softmax'))  # 60 classes

model.compile(optimizer='Adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
```

| Parameter | Value |
|---|---|
| Sequence Length | 60 frames |
| Features per frame | 126 (21×3×2 hands) |
| LSTM layers | 3 (64→128→64 units) |
| Dense layers | 2 (64→32 units) |
| Optimizer | Adam |
| Loss | Categorical Crossentropy |
| Epochs | 100 |
| Train/Test split | 80/20 |
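Putting the table together, the training step could be sketched as follows (assuming sequences `X` of shape `(n, 60, 126)` and integer labels `y`; `build_model` and `train` are illustrative names, not the notebook's exact code):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.utils import to_categorical

NUM_CLASSES = 60

def build_model():
    # The 3 LSTM + 2 Dense architecture from the table above
    model = Sequential([
        Input(shape=(60, 126)),
        LSTM(64, return_sequences=True, activation='relu'),
        LSTM(128, return_sequences=True, activation='relu'),
        LSTM(64, return_sequences=False, activation='relu'),
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(NUM_CLASSES, activation='softmax'),
    ])
    model.compile(optimizer='Adam', loss='categorical_crossentropy',
                  metrics=['categorical_accuracy'])
    return model

def train(X, y, epochs=100):
    """X: (n, 60, 126) keypoint sequences; y: integer labels in [0, NUM_CLASSES)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, to_categorical(y, num_classes=NUM_CLASSES), test_size=0.20)
    model = build_model()
    model.fit(X_train, y_train, epochs=epochs,
              validation_data=(X_test, y_test), verbose=0)
    return model
```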
```python
# Thresholds and parameters
CONF_THRESHOLD = 0.80        # Confidence threshold to accept a prediction
PRESENCE_START_FRAMES = 5    # Frames with visible hands required to start recording
ABSENCE_END_FRAMES = 10      # Frames without hands required to reset
HOLD_PRED_FRAMES = 45        # Frames to keep the result on screen
```

```
# State machine
STATE_IDLE → STATE_RECORDING → STATE_WAIT_ABSENCE → STATE_IDLE
```

Inference procedure:
- IDLE: Wait for stable hand presence (5 frames)
- RECORDING: Record 60 frames of keypoints
- PREDICT: Predict once when 60 frames are collected
- WAIT_ABSENCE: Wait for the user to lower hands to reset
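The four states can be sketched as a per-frame update (a simplified sketch; the real `Model.py` also handles drawing and the 45-frame result hold):

```python
IDLE, RECORDING, WAIT_ABSENCE = 'IDLE', 'RECORDING', 'WAIT_ABSENCE'
PRESENCE_START_FRAMES = 5
ABSENCE_END_FRAMES = 10
SEQUENCE_LENGTH = 60

class SignStateMachine:
    def __init__(self):
        self.state = IDLE
        self.presence = 0
        self.absence = 0
        self.frames = []

    def step(self, hands_visible, keypoints=None):
        """Feed one frame; returns the 60-frame sequence when it is complete."""
        if self.state == IDLE:
            # Require a stable run of frames with hands before recording
            self.presence = self.presence + 1 if hands_visible else 0
            if self.presence >= PRESENCE_START_FRAMES:
                self.state, self.frames = RECORDING, []
        elif self.state == RECORDING:
            self.frames.append(keypoints)
            if len(self.frames) == SEQUENCE_LENGTH:
                self.state, self.presence = WAIT_ABSENCE, 0
                return self.frames  # hand off to the model for prediction
        elif self.state == WAIT_ABSENCE:
            # Reset only after the hands have left the frame for a while
            self.absence = self.absence + 1 if not hands_visible else 0
            if self.absence >= ABSENCE_END_FRAMES:
                self.state, self.absence = IDLE, 0
        return None
```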
- Python 3.8+
- Webcam
- GPU (recommended for training)
```bash
pip install -r requirements.txt
```

```
torch>=2.0.0
tensorflow>=2.5.0
mediapipe
opencv-python
scikit-learn>=1.0.0
numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.5.0
joblib
```
Model 1 – Letters & Digits:

```bash
cd training_method_1
python cam.py
```

Model 2 – Words/Phrases:

```bash
cd training_method_3
python Model.py
```

Model 1:

```bash
# Open Jupyter Notebook
jupyter notebook training_method_1/train_3.ipynb
```

Model 2:

```bash
# Collect new data
jupyter notebook training_method_3/CollectData.ipynb

# Train the model
jupyter notebook training_method_3/ActionDetection.ipynb
```

- Evaluation method: Classification Report + Confusion Matrix
- Cross-validation: 5-fold Stratified
- Evaluation method: Accuracy Score + Confusion Matrix
- Metrics: Categorical Accuracy
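Both evaluations boil down to a few scikit-learn calls; a generic sketch (`evaluate` is an illustrative helper, not a project function):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def evaluate(y_true, y_pred):
    """Print per-class precision/recall/F1 and the confusion matrix,
    and return overall accuracy."""
    print(classification_report(y_true, y_pred, zero_division=0))
    print(confusion_matrix(y_true, y_pred))
    return accuracy_score(y_true, y_pred)
```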
- Location: `archive/asl_dataset/`
- Content: Static images of 36 signs (A–Z, 0–9)
- Structure: One folder per class
- Location: `Dataset_Staged/`
- Content: Vietnamese sign language videos by difficulty level
- Levels:
  - 01: Letters, digits, tone markers
  - 02: Common simple words
  - 03: Complex words
  - 04: Advanced
| File | Function |
|---|---|
| `func_files/convert_to_gif.py` | Convert videos to GIFs |
| `func_files/data_crawl.py` | Collect data from sources |
| `func_files/extract_kp.py` | Extract keypoints from videos |
| `func_files/make_2d.py` | Convert data to 2D format |
- Run `training_method_1/cam.py`
- Place your hand in the camera frame
- Perform the sign for a letter or digit
- The result will be displayed on the screen
- Press `q` to quit
- Run `training_method_3/Model.py`
- Wait for the IDLE state
- Place your hands in the camera frame and keep them stable
- Perform the sign (60 frames will be recorded)
- Wait for the result to be displayed
- Lower your hands to reset and perform a new sign
- Press `q` or `ESC` to quit
- Trần Gia Khánh - 23021599
- Vũ Nhật Tường Vân - 23021747
- Course: Human–Computer Interaction