khanhtran0111/LBS

Vietnamese Sign Language Recognition (VSL)

Built with Python, TensorFlow, MediaPipe, and OpenCV.

📋 Introduction

This project builds a real-time Vietnamese Sign Language (VSL) recognition system using deep learning and computer vision. The system comprises two main models:

Model     File                       Function                                         Architecture
Model 1   asl_landmarks_best.joblib  Recognizes letters (A–Z) and digits (0–9)        SVM/MLP on MediaPipe Hands landmarks
Model 2   VSL.h5                     Recognizes Vietnamese words/phrases (60 classes)  LSTM neural network

Project Structure

LBS/
├── archive/
│   └── asl_dataset/          # Dataset of letters & digits (A–Z, 0–9)
├── Dataset_Staged/           # Staged VSL dataset by difficulty level
│   ├── 01_Alphabet_Numbers/  # Letters, digits
│   ├── 02_Simple_Words/      # Simple words
│   ├── 03_Complex_Words/     # Complex words
│   └── 04_Advanced/          # Advanced
├── func_files/               # Utility scripts
│   ├── convert_to_gif.py     # Convert to GIF
│   ├── data_crawl.py         # Data collection
│   ├── extract_kp.py         # Keypoint extraction
│   └── make_2d.py            # 2D conversion
├── training_method_1/        # Method 1: SVM/MLP for letters
│   ├── train_3.ipynb         # Training notebook
│   ├── cam.py                # Realtime inference
│   └── asl_landmarks_best.joblib  # Trained model
├── training_method_2/        # Method 2: GCN (experimental)
│   ├── gcn.py                # Graph Convolutional Network
│   ├── preprocessing.py      # Preprocessing
│   └── models/               # Checkpoints
├── training_method_3/        # Method 3: LSTM for words/phrases
│   ├── CollectData.ipynb     # Data collection
│   ├── ActionDetection.ipynb # Model training
│   ├── Model.py              # Realtime inference
│   └── VSL.h5                # Trained model
└── requirements.txt          # Dependencies

Model 1: Letter & Digit Recognition

Data Collection Method

  • Data source: ASL Dataset (American Sign Language) with folders A–Z and 0–9
  • Format: Static images (JPG, PNG, BMP)
  • Total classes: 36 classes (26 letters + 10 digits)

Feature Extraction Method

Use MediaPipe Hands to extract 21 hand landmarks:

import numpy as np

# Each landmark has 3 coordinates (x, y, z)
# Total features: 21 * 3 = 63

def lm_to_feat(lm, handed):
    pts = np.array([[p.x, p.y, p.z] for p in lm.landmark])

    # Normalization: mirror x for the left hand so both hands share one orientation
    if handed == 'Left':
        pts[:, 0] = 1.0 - pts[:, 0]

    # Translate so the wrist (landmark 0) is at the origin
    wrist = pts[0].copy()
    pts -= wrist

    # Scale by the wrist-to-middle-finger-MCP distance (landmark 9) in the image plane
    palm_scale = np.linalg.norm(pts[9, :2]) + 1e-6
    pts /= palm_scale

    return pts.flatten()  # 63 features

Data Augmentation

  • Rotation augmentation: Random rotation of ±10° to improve robustness
  • Augmentation ratio: 1× (one augmented copy is generated per original sample)
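The rotation step can be sketched as follows. This is a minimal illustration, not the repo's actual code: it assumes the input is the 63-dim vector produced by `lm_to_feat` (wrist already at the origin), and `rotate_landmarks` is a hypothetical name.

```python
import numpy as np

def rotate_landmarks(feat, max_deg=10.0, rng=None):
    """Rotate the (x, y) plane of a 63-dim landmark feature vector by a random angle."""
    if rng is None:
        rng = np.random.default_rng()
    pts = feat.reshape(21, 3).copy()
    theta = np.deg2rad(rng.uniform(-max_deg, max_deg))
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    # Rotate around the wrist, which sits at the origin after normalization;
    # z is left untouched since the rotation is in the image plane.
    pts[:, :2] = pts[:, :2] @ rot.T
    return pts.flatten()
```

Because the rotation happens after the wrist-centering step, the wrist stays fixed and only the relative hand pose changes, which is what makes the augmented sample plausible.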

Model Architecture

GridSearchCV is used to select the better of two candidate models:

  1. SVM (Support Vector Machine)

    • Kernel: RBF
    • C: [1, 5, 10]
    • Gamma: ['scale', 0.1, 0.01]
    • Class weight: balanced
  2. MLP (Multi-Layer Perceptron)

    • Hidden layers: (128, 64)
    • Activation: ReLU
    • Alpha: [1e-4, 1e-3]
    • Learning rate: [1e-3, 5e-4]

Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

pipe = Pipeline([
    ('scaler', StandardScaler()),  # Feature normalization
    ('clf', SVC())                 # Classifier (swapped for MLPClassifier in the MLP search)
])

# Cross-validation: 5-fold Stratified
cv = StratifiedKFold(n_splits=5, shuffle=True)

Realtime Inference

  • Confidence threshold: 0.60
  • History voting: 7 most recent frames
  • Method: Majority voting to reduce noise
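The confidence gate plus 7-frame majority vote can be sketched like this. It is a stand-alone illustration of the technique described above, not the actual `cam.py` code; `vote` and `history` are hypothetical names.

```python
from collections import Counter, deque

HISTORY_LEN = 7        # vote over the 7 most recent accepted frames
CONF_THRESHOLD = 0.60  # per-frame confidence threshold

history = deque(maxlen=HISTORY_LEN)

def vote(label, confidence):
    """Add one frame's prediction; return the majority label once the window is full."""
    if confidence >= CONF_THRESHOLD:
        history.append(label)
    if len(history) < HISTORY_LEN:
        return None
    winner, count = Counter(history).most_common(1)[0]
    # Only emit a label that wins a strict majority of the window
    return winner if count > HISTORY_LEN // 2 else None
```

Low-confidence frames never enter the window, so a single noisy detection cannot flip the displayed letter; the prediction only changes once a new label dominates the last 7 accepted frames.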

Model 2: VSL Word/Phrase Recognition

List of 60 Recognized Sentences

The model can recognize 60 common Vietnamese words/phrases:

Group Example
Greetings xin chao, cam on, chuc mung, hen gap lai cac ban
Personal information ban ten la gi, toi la hoc sinh, toi la nguoi Diec
Health ban khoe khong, toi bi dau dau, toi can gap bac si
Locations toi dang o cong vien, toi di sieu thi, toi song o Ha Noi
Emotions toi cam thay rat vui, toi dang buon, toi thay nho ban
Emergency cap cuu, toi bi cuop, toi bi lac

(Notes: Sentences are kept in Vietnamese without diacritics for label naming.)

Data Collection Method

  1. Use a webcam to record videos of performing the signs
  2. Per sentence: 60 video sequences
  3. Per sequence: 60 frames
  4. Storage: .npy files containing keypoints of each frame
# Data folder structure
Data/
├── xin chao/
│   ├── 0/        # Sequence 0
│   │   ├── 0.npy
│   │   ├── 1.npy
│   │   └── ... (60 files)
│   ├── 1/        # Sequence 1
│   └── ... (60 sequences)
├── cam on/
└── ... (60 actions)
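A minimal sketch of the per-frame saving loop implied by this layout; `save_frame` and its parameters are illustrative names, and the actual `CollectData.ipynb` may differ.

```python
import os
import numpy as np

SEQUENCES_PER_ACTION = 60
FRAMES_PER_SEQUENCE = 60

def save_frame(action, sequence, frame_num, keypoints, data_path='Data'):
    """Store one frame's keypoint vector as <data_path>/<action>/<sequence>/<frame>.npy."""
    folder = os.path.join(data_path, action, str(sequence))
    os.makedirs(folder, exist_ok=True)
    np.save(os.path.join(folder, f'{frame_num}.npy'), keypoints)
```

Calling this once per captured frame reproduces the `Data/<action>/<sequence>/<frame>.npy` tree above, with 60 sequences of 60 frames per action.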

Feature Extraction Method

Use MediaPipe Holistic to extract landmarks of both hands:

import numpy as np

def extract_keypoints(results):
    # Left hand: 21 points * 3 coordinates = 63 features
    lh = (np.array([[res.x, res.y, res.z]
                    for res in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))

    # Right hand: 21 points * 3 coordinates = 63 features
    rh = (np.array([[res.x, res.y, res.z]
                    for res in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))

    return np.concatenate([lh, rh])  # 126 features

LSTM Model Architecture

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(64, return_sequences=True, activation='relu',
               input_shape=(60, 126)))  # Input: 60 frames, 126 features
model.add(LSTM(128, return_sequences=True, activation='relu'))
model.add(LSTM(64, return_sequences=False, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(60, activation='softmax'))  # 60 classes

model.compile(optimizer='Adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])

Hyperparameters

Parameter Value
Sequence Length 60 frames
Features per frame 126 (21×3×2 hands)
LSTM layers 3 (64→128→64 units)
Dense layers 2 (64→32 units)
Optimizer Adam
Loss Categorical Crossentropy
Epochs 100
Train/Test split 80/20
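Assembling the training tensors from the `Data/` layout shown earlier can be sketched as below. `load_sequences` is a hypothetical helper, and the `np.eye` one-hot encoding stands in for Keras's `to_categorical`; the repo's notebook may organize this differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def load_sequences(actions, data_path='Data', n_seq=60, n_frames=60):
    """Stack saved .npy frames into X of shape (n, n_frames, 126) and one-hot y."""
    X, y = [], []
    for label, action in enumerate(actions):
        for seq in range(n_seq):
            frames = [np.load(f'{data_path}/{action}/{seq}/{f}.npy')
                      for f in range(n_frames)]
            X.append(frames)
            y.append(label)
    X = np.array(X)                  # (n_actions * n_seq, n_frames, 126)
    y = np.eye(len(actions))[y]      # one-hot labels, shape (n, n_actions)
    return train_test_split(X, y, test_size=0.20)
```

With the full dataset (60 actions × 60 sequences) this yields an input tensor of shape `(3600, 60, 126)`, matching the LSTM's `input_shape=(60, 126)` and the 80/20 split in the table above.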

Realtime Inference

# Thresholds and parameters
CONF_THRESHOLD = 0.80      # Confidence threshold to accept prediction
PRESENCE_START_FRAMES = 5  # Number of frames with visible hands to start recording
ABSENCE_END_FRAMES = 10    # Number of frames without hands to reset
HOLD_PRED_FRAMES = 45      # Number of frames to display the result

# State Machine
STATE_IDLE → STATE_RECORDING → STATE_WAIT_ABSENCE → STATE_IDLE

Inference procedure:

  1. IDLE: Wait for stable hand presence (5 frames)
  2. RECORDING: Record 60 frames of keypoints
  3. PREDICT: Predict once when 60 frames are collected
  4. WAIT_ABSENCE: Wait for the user to lower hands to reset
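The cycle above can be sketched as a small state machine. This is a minimal stand-alone illustration, not the actual `Model.py` code: `SignStateMachine` is a hypothetical class, `hand_present` would come from MediaPipe hand detection, and `keypoints` from `extract_keypoints`.

```python
PRESENCE_START_FRAMES = 5  # frames with visible hands to start recording
ABSENCE_END_FRAMES = 10    # frames without hands to reset
SEQ_LEN = 60               # frames per recorded sequence

class SignStateMachine:
    def __init__(self):
        self.state = 'IDLE'
        self.presence = 0
        self.absence = 0
        self.frames = []

    def step(self, hand_present, keypoints):
        """Feed one frame; return the 60-frame sequence when it is ready to predict."""
        if self.state == 'IDLE':
            # Require stable hand presence before recording starts
            self.presence = self.presence + 1 if hand_present else 0
            if self.presence >= PRESENCE_START_FRAMES:
                self.state, self.frames = 'RECORDING', []
        elif self.state == 'RECORDING':
            self.frames.append(keypoints)
            if len(self.frames) == SEQ_LEN:
                self.state, self.absence = 'WAIT_ABSENCE', 0
                return self.frames  # hand off to the model for one prediction
        elif self.state == 'WAIT_ABSENCE':
            # Only return to IDLE once the hands have clearly left the frame
            self.absence = self.absence + 1 if not hand_present else 0
            if self.absence >= ABSENCE_END_FRAMES:
                self.state, self.presence = 'IDLE', 0
        return None
```

Predicting exactly once per RECORDING→WAIT_ABSENCE transition avoids the flicker of per-frame classification and forces a clean reset between signs.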

Setup and Usage

System Requirements

  • Python 3.8+
  • Webcam
  • GPU (recommended for training)

Install dependencies

pip install -r requirements.txt

Main dependencies

torch>=2.0.0
tensorflow>=2.5.0
mediapipe
opencv-python
scikit-learn>=1.0.0
numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.5.0
joblib

Run realtime inference

Model 1 – Letters & Digits:

cd training_method_1
python cam.py

Model 2 – Words/Phrases:

cd training_method_3
python Model.py

Retrain the models

Model 1:

# Open Jupyter Notebook
jupyter notebook training_method_1/train_3.ipynb

Model 2:

# Collect new data
jupyter notebook training_method_3/CollectData.ipynb

# Train model
jupyter notebook training_method_3/ActionDetection.ipynb

Model Evaluation

Model 1 (Letters & Digits)

  • Evaluation method: Classification Report + Confusion Matrix
  • Cross-validation: 5-fold Stratified

Model 2 (Words/Phrases)

  • Evaluation method: Accuracy Score + Confusion Matrix
  • Metrics: Categorical Accuracy

Dataset

ASL Dataset (Model 1)

  • Location: archive/asl_dataset/
  • Content: Static images of 36 signs (A–Z, 0–9)
  • Structure: One folder per class

VSL Dataset (Model 2)

  • Location: Dataset_Staged/

  • Content: Vietnamese sign language videos by difficulty level

  • Levels:

    • 01: Letters, digits, tone markers
    • 02: Common simple words
    • 03: Complex words
    • 04: Advanced

Utility Files

File Function
func_files/convert_to_gif.py Convert videos to GIFs
func_files/data_crawl.py Collect data from sources
func_files/extract_kp.py Extract keypoints from videos
func_files/make_2d.py Convert data to 2D format

Usage Guide

Letter/Digit Recognition

  1. Run training_method_1/cam.py
  2. Place your hand in the camera frame
  3. Perform the sign for a letter or digit
  4. The result will be displayed on the screen
  5. Press q to quit

Word/Phrase Recognition

  1. Run training_method_3/Model.py
  2. Wait for the IDLE state
  3. Place your hands in the camera frame and keep them stable
  4. Perform the sign (60 frames will be recorded)
  5. Wait for the result to be displayed
  6. Lower your hands to reset and perform a new sign
  7. Press q or ESC to quit

Contributors

  • Trần Gia Khánh - 23021599
  • Vũ Nhật Tường Vân - 23021747
  • Course: Human–Computer Interaction
