GPoeTry

A tiny GPT model to generate Spanish poetry, built from scratch.

Note

This project focuses only on the pre-training process, which means you won't get a production-ready model. Instead, you get a base model that you can use for your own post-training process.

Getting Started

You can use uv to install all the dependencies and create a virtual environment.

  1. Clone the repository

    git clone https://github.com/Pacatro/gpoetry.git
    cd gpoetry
  2. Install dependencies and create a virtual environment

    uv sync
  3. Run tests

    uv run pytest
  4. Run the application

    To see all available commands and options, run:

    uv run gpoetry --help

Usage

The application is structured as a CLI with two main commands: train and gen.

Training the model

To train a new model, use the train command. This will train a new model using the parameters in the configuration file and save it to the models directory.

uv run gpoetry train [OPTIONS]

Options:

| Option | Description | Default |
| --- | --- | --- |
| `-t, --tokenization` | The tokenizer type (`word` or `char`). | `char` |
| `-b, --batch-size` | The training batch size. | `32` |
| `-e, --epochs` | The number of epochs. | `5` |
| `-l, --lr` | The learning rate. | `3e-4` |
| `-s, --train-size` | The training split size. | `0.8` |

Example of training with word tokenization:

uv run gpoetry train --tokenization word
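The difference between the two tokenization modes can be sketched in a few lines. This is an illustrative sketch, not the project's actual tokenizer (which lives inside the `gpoetry` package); the function names and vocabulary-building strategy here are assumptions:

```python
# Hypothetical sketch of char vs. word tokenization, as selected by
# the -t/--tokenization option. The real implementation may differ.

def build_vocab(text: str, mode: str = "char") -> dict[str, int]:
    """Map each unique unit (character or whitespace-split word) to an id."""
    units = sorted(set(text)) if mode == "char" else sorted(set(text.split()))
    return {u: i for i, u in enumerate(units)}

def encode(text: str, vocab: dict[str, int], mode: str = "char") -> list[int]:
    """Turn text into a list of token ids under the chosen mode."""
    units = list(text) if mode == "char" else text.split()
    return [vocab[u] for u in units]

corpus = "en un lugar de la mancha"
char_vocab = build_vocab(corpus, "char")   # one id per unique character
word_vocab = build_vocab(corpus, "word")   # one id per unique word
```

Char tokenization yields a much smaller vocabulary but longer sequences; word tokenization is the opposite trade-off, which is why the CLI exposes both.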

Generating poetry

To generate poetry with a trained model, use the gen command. This command will load the latest model from the models directory and generate text.

uv run gpoetry gen [OPTIONS]

Options:

| Option | Description | Default |
| --- | --- | --- |
| `-i, --init-text` | The initial text to use for generation. | `INIT_TOKEN` |
| `-t, --temperature` | Controls the randomness of the generated text. | `0.6` |
| `-k, --top-k` | Samples from the top K most likely next tokens. | `50` |
| `-l, --gen-limit` | The generation limit in tokens. | `1000` |

Example of generating text with a higher temperature:

uv run gpoetry gen --temperature 0.8
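Temperature and top-k sampling as used by `gen` can be sketched as follows. This is a generic illustration of the technique, not the project's actual sampling code; the function name and defaults mirroring the CLI options are assumptions:

```python
import math
import random

def sample_next(logits, temperature=0.6, top_k=50, rng=None):
    """Pick the next token id: keep only the top_k logits, divide by
    temperature (higher -> flatter, more random distribution), apply
    softmax, then sample. Defaults mirror the -t/-k CLI options."""
    rng = rng or random.Random()
    k = min(top_k, len(logits))
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Numerically stable softmax over the kept logits
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(top, weights=probs, k=1)[0]
```

Raising `--temperature` flattens the distribution (more variety, less coherence); lowering `--top-k` restricts sampling to the model's strongest candidates.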

Model Architecture

GPoeTry uses a standard GPT (Generative Pre-trained Transformer) architecture, implemented in gpoetry/core/model.py. It consists of:

  • Token and Positional Embeddings: To represent the input tokens and their positions in the sequence.
  • Transformer Blocks: A stack of NUM_LAYERS blocks. Each block contains:
    • A multi-head self-attention mechanism (MHSelfAttention).
    • A feed-forward neural network (MLP).
    • Layer normalization and residual connections.
  • Language Model Head: A final linear layer that maps the Transformer's output to vocabulary-sized logits.
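The components above can be sketched in PyTorch. This is a minimal illustration of the described architecture, not the code in `gpoetry/core/model.py`; the hyperparameter values, class names other than those mentioned in the README, and module internals are all assumptions:

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters (the project's actual values may differ)
NUM_LAYERS, N_HEADS, EMB_DIM, BLOCK_SIZE, VOCAB = 4, 4, 64, 32, 100

class Block(nn.Module):
    """One Transformer block: self-attention + MLP, each with
    layer normalization and a residual connection."""
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(EMB_DIM)
        self.attn = nn.MultiheadAttention(EMB_DIM, N_HEADS, batch_first=True)
        self.ln2 = nn.LayerNorm(EMB_DIM)
        self.mlp = nn.Sequential(nn.Linear(EMB_DIM, 4 * EMB_DIM), nn.GELU(),
                                 nn.Linear(4 * EMB_DIM, EMB_DIM))

    def forward(self, x):
        T = x.shape[1]
        # Causal mask: each position may only attend to itself and the past
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a                        # residual around attention
        return x + self.mlp(self.ln2(x)) # residual around MLP

class TinyGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, EMB_DIM)       # token embeddings
        self.pos = nn.Embedding(BLOCK_SIZE, EMB_DIM)  # positional embeddings
        self.blocks = nn.Sequential(*[Block() for _ in range(NUM_LAYERS)])
        self.head = nn.Linear(EMB_DIM, VOCAB)         # LM head -> logits

    def forward(self, idx):
        x = self.tok(idx) + self.pos(torch.arange(idx.shape[1]))
        return self.head(self.blocks(x))  # (batch, seq, vocab) logits
```

The LM head maps each position's final hidden state to a distribution over the vocabulary, which is what the sampling step in `gen` draws from.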

Dataset

This project uses the andreamorgar/spanish_poetry dataset from Hugging Face, which contains over 5,000 Spanish poems from different authors.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Created by Paco Algar Muñoz.
