Tiny-NN — Fully Connected Neural Networks in C++20 + CUDA 12.8

Tiny-NN is a high-performance implementation of fully connected neural networks supporting both CPU and GPU execution. It's designed for easy experimentation and benchmarking, featuring:

  • CPU execution (parallelized)
  • CUDA execution with memory reuse (weights and biases uploaded only once per layer)
  • Training with backpropagation and SGD
  • Model serialization using json.hpp (MIT licensed) included in the repository
  • Simple MNIST dataset integration and ASCII preview
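Training pairs a forward pass through each fully connected layer with a backward pass that applies SGD updates in place. The following is a minimal CPU-only sketch of one layer, assuming a ReLU activation; the struct and member names are illustrative, not the repository's actual API:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One fully connected layer: y = relu(W x + b).
// Hypothetical names; the real tiny-nn classes may differ.
struct Dense {
    std::size_t in, out;
    std::vector<float> W;                 // row-major, out x in
    std::vector<float> b;
    std::vector<float> x_cache, z_cache;  // saved for backprop

    Dense(std::size_t in_, std::size_t out_)
        : in(in_), out(out_), W(in_ * out_, 0.01f), b(out_, 0.0f) {}

    std::vector<float> forward(const std::vector<float>& x) {
        x_cache = x;
        z_cache.assign(out, 0.0f);
        std::vector<float> y(out);
        for (std::size_t o = 0; o < out; ++o) {
            float z = b[o];
            for (std::size_t i = 0; i < in; ++i) z += W[o * in + i] * x[i];
            z_cache[o] = z;
            y[o] = z > 0.0f ? z : 0.0f;  // ReLU
        }
        return y;
    }

    // dy = dLoss/dy; returns dLoss/dx and applies one SGD step in place.
    std::vector<float> backward_sgd(const std::vector<float>& dy, float lr) {
        std::vector<float> dx(in, 0.0f);
        for (std::size_t o = 0; o < out; ++o) {
            float dz = z_cache[o] > 0.0f ? dy[o] : 0.0f;  // ReLU derivative
            for (std::size_t i = 0; i < in; ++i) {
                dx[i] += W[o * in + i] * dz;            // read W before updating it
                W[o * in + i] -= lr * dz * x_cache[i];  // SGD update
            }
            b[o] -= lr * dz;
        }
        return dx;
    }
};
```

One training step then chains forward() layer by layer, computes the loss gradient at the output, and chains backward_sgd() in reverse order.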

Requirements

  • C++20 compatible compiler
  • CUDA 12.8 (for GPU support)
  • CMake >= 3.24
  • Python 3.12 (optional, for dataset download and preview)

Although development was done on Windows 10/11 using Visual Studio 2022, the project can be built on any OS with a compatible C++20 compiler and CUDA installation.

Setup

  1. Clone or copy the repository to your machine:
git clone https://github.com/Vicen-te/tiny-nn.git
cd tiny-nn
  2. Download the MNIST dataset using the Python script (recommended):
python scripts/download_mnist.py
  • This downloads and saves the MNIST dataset in data/mnist/.
  • Alternatively, you can download the dataset manually from Kaggle.
  3. Optional: generate a small model using Python (arguments: input layer size, hidden layer size, output layer size):
python data/generate_model.py 128 64 10
  4. Optional: preview MNIST digits:
  • Python: python scripts/preview.py
  • C++: the ascii_preview() function in MNISTLoader
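The ASCII preview works by mapping each grayscale pixel (0–255) to a character on a density ramp. A rough sketch of the idea, assuming row-major 8-bit pixels; the actual ascii_preview() in MNISTLoader may use a different ramp:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Render a grayscale image (0 = background, 255 = brightest ink) as ASCII art.
// A sketch of the idea behind MNISTLoader::ascii_preview(); the real
// implementation in the repository may differ.
std::string ascii_preview(const std::vector<std::uint8_t>& pixels,
                          std::size_t width, std::size_t height) {
    static const char ramp[] = " .:-=+*#%@";      // sparse to dense
    constexpr std::size_t levels = sizeof(ramp) - 2;  // highest usable index
    std::string out;
    out.reserve((width + 1) * height);
    for (std::size_t r = 0; r < height; ++r) {
        for (std::size_t c = 0; c < width; ++c) {
            std::size_t idx = pixels[r * width + c] * levels / 255;
            out += ramp[idx];
        }
        out += '\n';
    }
    return out;
}
```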

Build

Visual Studio:

  • Open Visual Studio -> File -> Open -> Folder... and select the project folder.
  • Visual Studio will detect CMake. For GPU usage, choose x64 configuration.
  • Build -> Build All.

PowerShell / Developer Command Prompt (recommended):

Option 1: Specify all options manually

mkdir build
cd build
cmake .. -G "Visual Studio 17 2022" -A x64 -DCUDA_TOOLKIT_ROOT_DIR="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.8"
cmake --build . --config Release
  • -G "Visual Studio 17 2022" selects Visual Studio 2022
  • -A x64 selects 64-bit architecture (recommended for CUDA)
  • -DCUDA_TOOLKIT_ROOT_DIR is optional; CMake can auto-detect CUDA

Note: The -A x64 option is recommended if you want to use CUDA on Windows. On Linux or macOS, this is not necessary.

Option 2: Let CMake detect everything automatically (recommended)

cmake -B build -S .
cmake --build build --config Release
  • CMake will detect Visual Studio and CUDA if installed in standard locations
  • -S is the source folder, -B is the build folder

Both methods produce the same result. Use Option 2 for simplicity and fewer manual settings.

Run

From the build/Release folder:

tiny-nn.exe <mode>

Modes:

  • train or t → Train model
  • inference or i → Run inference on a sample
  • benchmark or b → Compare CPU vs CUDA performance
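A CLI like this typically dispatches on its first argument, accepting both the long and short spellings listed above. A minimal sketch; parse_mode and the Mode enum are illustrative names, not the project's actual code:

```cpp
#include <string_view>

enum class Mode { Train, Inference, Benchmark, Unknown };

// Map the command-line mode argument (long or short form) to an enum.
Mode parse_mode(std::string_view arg) {
    if (arg == "train" || arg == "t") return Mode::Train;
    if (arg == "inference" || arg == "i") return Mode::Inference;
    if (arg == "benchmark" || arg == "b") return Mode::Benchmark;
    return Mode::Unknown;
}
```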

Expected output

Training (train / t)

  • Training progress printed to console
  • Training duration in seconds
  • Model saved as JSON to ./data/models/fc_digit_classification.json
  • ASCII MNIST preview of a single sample image

Inference (inference / i)

  • Output values of selected sample
  • Maximum value and its index
  • ASCII preview of the sample

Benchmark (benchmark / b)

  • CPU vs GPU inference correctness check
  • Average inference timings per method
  • CSV results saved to ./data/results/bench.csv

Currently, the benchmark mode only measures inference, not training; measuring training performance would require additional implementation work.
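An inference benchmark of this kind usually times many repeated calls with std::chrono and appends the averages to a CSV file. A generic sketch, assuming a callable inference function; the helper names and CSV columns here are illustrative, not taken from the repository:

```cpp
#include <chrono>
#include <fstream>
#include <functional>
#include <string>

// Time `runs` invocations of `infer` and return the average in milliseconds.
double average_ms(const std::function<void()>& infer, int runs) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) infer();
    auto t1 = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> total = t1 - t0;
    return total.count() / runs;
}

// Append one result row; write a header first if the file is new or empty.
void append_csv(const std::string& path, const std::string& method,
                double avg_ms) {
    std::ifstream probe(path);
    bool empty = !probe.good() ||
                 probe.peek() == std::ifstream::traits_type::eof();
    std::ofstream out(path, std::ios::app);
    if (empty) out << "method,avg_ms\n";
    out << method << ',' << avg_ms << '\n';
}
```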

Notes & Improvements

  • Currently, the weights W and biases b are uploaded to the GPU once per layer; the input vector is uploaded for each inference.
  • cuBLAS GEMM is already used for matrix multiplications, replacing the simple custom fully connected kernel.
  • Intermediate GPU buffers (dX/dY) are allocated per layer and per batch and are not fully reused, although CUDA streams already enable asynchronous execution.
  • Possible future improvements for higher performance:
    • Reuse intermediate GPU buffers across layers and batches.
    • Implement more efficient batching and overlap data transfers with computation.
  • Profiling can be done with Nsight Systems / Nsight Compute.

About

A tiny neural network framework for fully-connected layers with CPU and CUDA support
