A tiny GPT model to generate Spanish poetry, built from scratch.
> [!NOTE]
> This project focuses only on the pre-training process, which means you won't get a production-ready model. Instead, you will get a base model that you can use for your own post-training process.
You can use uv to install all the dependencies and create a virtual environment.
1. Clone the repository:

   ```bash
   git clone https://github.com/Pacatro/gpoetry.git
   cd gpoetry
   ```

2. Install dependencies and create a virtual environment:

   ```bash
   uv sync
   ```

3. Run the tests:

   ```bash
   uv run pytest
   ```

4. Run the application. To see all available commands and options, run:

   ```bash
   uv run gpoetry --help
   ```
The application is structured as a CLI with two main commands: `train` and `gen`.

To train a new model, use the `train` command. This will train a new model using the parameters in the configuration file and save it to the `models` directory.
```bash
uv run gpoetry train [OPTIONS]
```

Options:

| Option | Description | Default |
|---|---|---|
| `-t, --tokenization` | The tokenizer type (`word` or `char`). | `char` |
| `-b, --batch-size` | The training batch size. | `32` |
| `-e, --epochs` | The number of epochs. | `5` |
| `-l, --lr` | The learning rate. | `3e-4` |
| `-s, --train-size` | The training split size. | `0.8` |
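The `word`/`char` choice trades vocabulary size against sequence length: character tokenization yields a tiny vocabulary but long sequences, while word tokenization does the opposite. A minimal illustration (the project's actual tokenizer classes live in its own code and may differ):

```python
# Illustrative comparison of char vs. word tokenization;
# not the project's actual tokenizer implementation.
text = "En un lugar de la Mancha"

# Character tokenization: small vocabulary, long sequences.
char_vocab = sorted(set(text))
char_ids = [char_vocab.index(c) for c in text]

# Word tokenization: larger vocabulary, short sequences.
word_vocab = sorted(set(text.split()))
word_ids = [word_vocab.index(w) for w in text.split()]

print(len(char_ids), len(word_ids))  # sequence lengths: 24 vs. 6
```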
Example of training with word tokenization:

```bash
uv run gpoetry train --tokenization word
```

To generate poetry with a trained model, use the `gen` command. This command will load the latest model from the `models` directory and generate text.
```bash
uv run gpoetry gen [OPTIONS]
```

Options:

| Option | Description | Default |
|---|---|---|
| `-i, --init-text` | The initial text to use for generation. | `INIT_TOKEN` |
| `-t, --temperature` | Controls the randomness of the generated text. | `0.6` |
| `-k, --top-k` | Samples from the top K most likely next tokens. | `50` |
| `-l, --gen-limit` | The generation limit in tokens. | `1000` |
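The temperature and top-k options interact at sampling time: logits are divided by the temperature (lower values sharpen the distribution), then only the K most likely tokens are kept before drawing. A minimal sketch of this scheme (illustrative only; the project's actual sampling code may differ):

```python
import math
import random

def sample_next(logits: list[float], temperature: float = 0.6, top_k: int = 50) -> int:
    """Sample a token index with temperature scaling and top-k filtering.

    Illustrative sketch; not the project's actual implementation.
    """
    # Scale logits: lower temperature makes the distribution peakier.
    scaled = [l / temperature for l in logits]
    # Keep only the indices of the top-k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving candidates (max-subtracted for stability).
    m = max(scaled[i] for i in ranked)
    weights = [math.exp(scaled[i] - m) for i in ranked]
    return random.choices(ranked, weights=weights, k=1)[0]

# With a very low temperature the most likely token dominates.
print(sample_next([2.0, 1.0, 0.1], temperature=0.01, top_k=2))
```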
Example of generating text with a higher temperature:

```bash
uv run gpoetry gen --temperature 0.8
```

GPoeTry uses a standard GPT (Generative Pre-trained Transformer) architecture, implemented in `gpoetry/core/model.py`. It consists of:
- **Token and Positional Embeddings**: to represent the input tokens and their positions in the sequence.
- **Transformer Blocks**: a stack of `NUM_LAYERS` blocks. Each block contains:
  - A multi-head self-attention mechanism (`MHSelfAttention`).
  - A feed-forward neural network (`MLP`).
  - Layer normalization and residual connections.
- **Language Model Head**: a final linear layer that maps the Transformer's output to vocabulary-sized logits.
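The components above can be sketched in PyTorch as follows. This is a minimal, illustrative version: the project's real classes (`MHSelfAttention`, `MLP`) live in `gpoetry/core/model.py`, the hyperparameter values here are placeholders, and causal masking is omitted for brevity.

```python
import torch
from torch import nn

# Illustrative hyperparameters; the project's configuration may differ.
VOCAB_SIZE, BLOCK_SIZE, EMBED_DIM, NUM_HEADS, NUM_LAYERS = 100, 32, 64, 4, 2

class Block(nn.Module):
    """One Transformer block: self-attention + MLP, each on a residual path."""
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(EMBED_DIM)
        self.attn = nn.MultiheadAttention(EMBED_DIM, NUM_HEADS, batch_first=True)
        self.ln2 = nn.LayerNorm(EMBED_DIM)
        self.mlp = nn.Sequential(
            nn.Linear(EMBED_DIM, 4 * EMBED_DIM), nn.GELU(),
            nn.Linear(4 * EMBED_DIM, EMBED_DIM),
        )

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h)          # causal mask omitted for brevity
        x = x + a                          # residual connection
        return x + self.mlp(self.ln2(x))   # residual connection

class TinyGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)  # token embeddings
        self.pos_emb = nn.Embedding(BLOCK_SIZE, EMBED_DIM)  # positional embeddings
        self.blocks = nn.Sequential(*[Block() for _ in range(NUM_LAYERS)])
        self.lm_head = nn.Linear(EMBED_DIM, VOCAB_SIZE)     # vocabulary-sized logits

    def forward(self, idx):
        pos = torch.arange(idx.shape[1])
        x = self.tok_emb(idx) + self.pos_emb(pos)
        return self.lm_head(self.blocks(x))

logits = TinyGPT()(torch.randint(0, VOCAB_SIZE, (1, 8)))
print(logits.shape)  # one logit vector per position: (batch, seq, vocab)
```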
This project uses the `andreamorgar/spanish_poetry` dataset from Hugging Face, which contains over 5k Spanish poems from different authors.
This project is licensed under the MIT License - see the LICENSE file for details.
Created by Paco Algar Muñoz.