Tiny-NN is a high-performance implementation of fully connected neural networks supporting both CPU and GPU execution. It's designed for easy experimentation and benchmarking, featuring:
- CPU execution (parallelized)
- CUDA execution with memory reuse (weights and biases uploaded only once per layer)
- Training with backpropagation and SGD
- Model serialization using `json.hpp` (MIT licensed), included in the repository
- Simple MNIST dataset integration and ASCII preview
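As orientation, here is a minimal sketch of the computation a fully connected layer performs on the CPU path: `y = relu(W·x + b)`. The function name, signature, and row-major layout are illustrative, not Tiny-NN's actual API.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch of one fully connected layer's forward pass:
// y = relu(W * x + b), with W stored row-major as [out x in].
std::vector<float> fc_forward(const std::vector<float>& W,  // [out * in]
                              const std::vector<float>& b,  // [out]
                              const std::vector<float>& x)  // [in]
{
    const std::size_t out = b.size();
    const std::size_t in  = x.size();
    std::vector<float> y(out, 0.0f);
    for (std::size_t r = 0; r < out; ++r) {
        float acc = b[r];                       // start from the bias
        for (std::size_t c = 0; c < in; ++c)
            acc += W[r * in + c] * x[c];        // dot product of row r with x
        y[r] = std::max(acc, 0.0f);             // ReLU activation
    }
    return y;
}
```

The CPU backend parallelizes loops like the outer one above across threads; on the GPU the same product is computed with cuBLAS GEMM.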
Requirements:
- C++20 compatible compiler
- CUDA 12.8 (for GPU support)
- CMake >= 3.24
- Python 3.12 (optional, for dataset download and preview)
Although development was done on Windows 10/11 using Visual Studio 2022, the project can be built on any OS with a compatible C++20 compiler and CUDA installation.
- Clone or copy the repository to your machine:

```sh
git clone https://github.com/Vicen-te/tiny-nn.git
cd tiny-nn
```

- Download the MNIST dataset using the Python script (recommended):
```sh
python scripts/download_mnist.py
```

This will download and save the MNIST dataset in `data/mnist/`.

- Alternatively, you can download the dataset manually from Kaggle.
- Optional: generate a small model using Python (arguments: input layer, hidden layer, output layer):

```sh
python data/generate_model.py 128 64 10
```

- Optional: preview MNIST digits:
  - Python: `python scripts/preview.py`
  - C++: the `ascii_preview()` function in `MNISTLoader`
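An ASCII preview maps each grayscale pixel to a character whose visual density roughly matches its brightness. The sketch below illustrates the idea; `MNISTLoader::ascii_preview()` in the repo may use a different character ramp and layout.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative ASCII preview of a grayscale image (0 = black, 255 = white).
// Each pixel is mapped onto a 10-character brightness ramp.
std::string ascii_preview_sketch(const std::vector<std::uint8_t>& pixels,
                                 int width, int height)
{
    static const char ramp[] = " .:-=+*#%@";  // dark -> bright, 10 levels
    std::string out;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int level = pixels[y * width + x] * 9 / 255;  // scale 0..255 to 0..9
            out += ramp[level];
        }
        out += '\n';  // one text row per image row
    }
    return out;
}
```

For MNIST, `width` and `height` are both 28.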
- Open Visual Studio -> File -> Open -> Folder... and select the project folder.
- Visual Studio will detect CMake. For GPU usage, choose x64 configuration.
- Build -> Build All.
Option 1: specify the generator and architecture manually:

```sh
mkdir build
cd build
cmake .. -G "Visual Studio 17 2022" -A x64 -DCUDA_TOOLKIT_ROOT_DIR="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.8"
cmake --build . --config Release
```

- `-G "Visual Studio 17 2022"` selects Visual Studio 2022
- `-A x64` selects the 64-bit architecture (recommended for CUDA)
- `-DCUDA_TOOLKIT_ROOT_DIR` is optional; CMake can auto-detect CUDA
Note: The -A x64 option is recommended if you want to use CUDA on Windows. On Linux or macOS, this is not necessary.
Option 2: let CMake detect the toolchain:

```sh
cmake -B build -S .
cmake --build build --config Release
```

- CMake will detect Visual Studio and CUDA if installed in standard locations
- `-S` is the source folder, `-B` is the build folder
Both methods produce the same result. Use Option 2 for simplicity and fewer manual settings.
From the `build/Release` folder:

```sh
tiny-nn.exe <mode>
```

Modes:
- train or t → Train model
- inference or i → Run inference on a sample
- benchmark or b → Compare CPU vs CUDA performance
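A mode dispatch like this is typically a small string comparison; the sketch below shows one plausible way to map the `<mode>` argument, though the actual `main()` in the repo may differ.

```cpp
#include <string>

enum class Mode { Train, Inference, Benchmark, Unknown };

// Hypothetical sketch: map the command-line <mode> argument (full word or
// single-letter shorthand) to an action.
Mode parse_mode(const std::string& arg)
{
    if (arg == "train"     || arg == "t") return Mode::Train;
    if (arg == "inference" || arg == "i") return Mode::Inference;
    if (arg == "benchmark" || arg == "b") return Mode::Benchmark;
    return Mode::Unknown;  // unrecognized argument
}
```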
Train mode output:
- Training progress printed to console
- Training duration in seconds
- Trained model saved as JSON to ./data/models/fc_digit_classification.json
- ASCII MNIST preview of a single sample image
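The repo serializes models with `json.hpp`; this stdlib-only sketch merely illustrates one plausible JSON layout for a single layer. The field names are hypothetical, not the format actually written to `fc_digit_classification.json`.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch: emit one layer's parameters as a JSON object with
// "weights" and "biases" arrays (field names are illustrative).
std::string layer_to_json(const std::vector<float>& weights,
                          const std::vector<float>& biases)
{
    std::ostringstream os;
    os << "{\"weights\":[";
    for (std::size_t i = 0; i < weights.size(); ++i)
        os << (i ? "," : "") << weights[i];
    os << "],\"biases\":[";
    for (std::size_t i = 0; i < biases.size(); ++i)
        os << (i ? "," : "") << biases[i];
    os << "]}";
    return os.str();
}
```

In practice `json.hpp` handles the escaping, nesting, and parsing that this sketch omits.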
Inference mode output:
- Output values of the selected sample
- Maximum value and its index
- ASCII preview of the sample
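Reporting the maximum output value and its index is a plain argmax over the final layer's outputs; for MNIST the index is the predicted digit. A minimal sketch (the helper name is illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Return (index, value) of the largest element in the network's output
// vector; for a 10-way digit classifier, the index is the predicted digit.
std::pair<std::size_t, float> argmax(const std::vector<float>& out)
{
    auto it = std::max_element(out.begin(), out.end());
    return { static_cast<std::size_t>(it - out.begin()), *it };
}
```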
Benchmark mode output:
- CPU vs GPU inference correctness check
- Average inference timings per method
- CSV results saved to ./data/results/bench.csv
Currently, benchmark only measures inference, not training. Measuring training performance would require additional implementation.
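A benchmark of this shape usually times a number of inference runs with `std::chrono` and averages them. The sketch below shows the idea and emits one CSV row; the function name and the actual columns of `bench.csv` are assumptions, not the repo's real format.

```cpp
#include <chrono>
#include <functional>
#include <sstream>
#include <string>

// Hypothetical sketch: time `iterations` runs of an inference callable and
// format the average as a CSV row "method,iterations,avg_microseconds".
std::string bench_csv_row(const std::string& method,
                          const std::function<void()>& run, int iterations)
{
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) run();   // repeated inference calls
    auto end = std::chrono::steady_clock::now();
    double avg_us =
        std::chrono::duration<double, std::micro>(end - start).count()
        / iterations;                              // average per call
    std::ostringstream os;
    os << method << ',' << iterations << ',' << avg_us;
    return os.str();
}
```

For GPU timings, a real benchmark would also synchronize the device (or use CUDA events) before reading the clock, so that asynchronous kernel launches are fully accounted for.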
- Currently, weights `W` and biases `b` are uploaded to the GPU once per layer. The input vector is uploaded for each inference.
- cuBLAS GEMM is already used for matrix multiplications, replacing the simple custom FC kernel.
- Intermediate GPU buffers (`dX`/`dY`) are allocated per layer and batch and are not fully reused, though CUDA streams enable asynchronous execution.
- For higher performance (future improvements):
  - Reusing intermediate GPU buffers across layers and batches via CUDA streams.
  - Implementing more efficient batching and overlapping of data transfers with computation.
- Profiling can be done with Nsight Systems / Nsight Compute.