A tiny GPT model to generate Spanish poetry, built from scratch.
> [!NOTE]
> This project focuses only on the pre-training process, which means you won't get a production-ready model. Instead, you will get a base model that you can use for your own post-training process.
You can use uv to install all the dependencies and create a virtual environment.
1. Clone the repository:

   ```bash
   git clone https://github.com/Pacatro/gpoetry.git
   cd gpoetry
   ```

2. Install dependencies and create a virtual environment:

   ```bash
   uv sync
   ```

3. Run the tests:

   ```bash
   uv run pytest
   ```

4. Run the application. To see all available commands and options, run:

   ```bash
   uv run gpoetry --help
   ```
The application is structured as a CLI with two main commands: `train` and `gen`.

To train a new model, use the `train` command. This will train a new model using the parameters in the configuration file and save it to the `models` directory.
```bash
uv run gpoetry train [OPTIONS]
```

Options:

| Option | Description | Default |
|---|---|---|
| `-t, --tokenization` | The tokenizer type (`word` or `char`). | `char` |
| `-b, --batch-size` | The training batch size. | `32` |
| `-e, --epochs` | The number of epochs. | `5` |
| `-l, --lr` | The learning rate. | `3e-4` |
| `-s, --train-size` | The training split size. | `0.8` |
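The `word`/`char` choice trades vocabulary size against sequence length: character tokenization yields a tiny vocabulary but long sequences, while word tokenization does the opposite. A minimal illustration (the project's actual tokenizer classes live in its own code and may differ):

```python
# Illustrative comparison of char vs. word tokenization;
# not the project's actual tokenizer implementation.
text = "En un lugar de la Mancha"

# Character tokenization: small vocabulary, long sequences.
char_vocab = sorted(set(text))
char_ids = [char_vocab.index(c) for c in text]

# Word tokenization: larger vocabulary, short sequences.
word_vocab = sorted(set(text.split()))
word_ids = [word_vocab.index(w) for w in text.split()]

print(len(char_ids), len(word_ids))  # sequence lengths: 24 vs. 6
```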
Example of training with word tokenization:

```bash
uv run gpoetry train --tokenization word
```

To generate poetry with a trained model, use the `gen` command. This command will load the latest model from the `models` directory and generate text.
```bash
uv run gpoetry gen [OPTIONS]
```

Options:

| Option | Description | Default |
|---|---|---|
| `-i, --init-text` | The initial text to use for generation. | `INIT_TOKEN` |
| `-t, --temperature` | Controls the randomness of the generated text. | `0.6` |
| `-k, --top-k` | Samples from the top K most likely next tokens. | `50` |
| `-l, --gen-limit` | The generation limit in tokens. | `1000` |
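The temperature and top-k options interact at sampling time: logits are divided by the temperature (lower values sharpen the distribution), then only the K most likely tokens are kept before drawing. A minimal sketch of this scheme (illustrative only; the project's actual sampling code may differ):

```python
import math
import random

def sample_next(logits: list[float], temperature: float = 0.6, top_k: int = 50) -> int:
    """Sample a token index with temperature scaling and top-k filtering.

    Illustrative sketch; not the project's actual implementation.
    """
    # Scale logits: lower temperature makes the distribution peakier.
    scaled = [l / temperature for l in logits]
    # Keep only the indices of the top-k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving candidates (max-subtracted for stability).
    m = max(scaled[i] for i in ranked)
    weights = [math.exp(scaled[i] - m) for i in ranked]
    return random.choices(ranked, weights=weights, k=1)[0]

# With a very low temperature the most likely token dominates.
print(sample_next([2.0, 1.0, 0.1], temperature=0.01, top_k=2))
```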
Example of generating text with a higher temperature:

```bash
uv run gpoetry gen --temperature 0.8
```

GPoeTry uses a standard GPT (Generative Pre-trained Transformer) architecture, implemented in `gpoetry/core/model.py`. It consists of:
- **Token and Positional Embeddings**: to represent the input tokens and their positions in the sequence.
- **Transformer Blocks**: a stack of `NUM_LAYERS` blocks. Each block contains:
  - A multi-head self-attention mechanism (`MHSelfAttention`).
  - A feed-forward neural network (`MLP`).
  - Layer normalization and residual connections.
- **Language Model Head**: a final linear layer that maps the Transformer's output to vocabulary-sized logits.
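The components above can be sketched in PyTorch as follows. This is a minimal, illustrative version: the project's real classes (`MHSelfAttention`, `MLP`) live in `gpoetry/core/model.py`, the hyperparameter values here are placeholders, and causal masking is omitted for brevity.

```python
import torch
from torch import nn

# Illustrative hyperparameters; the project's configuration may differ.
VOCAB_SIZE, BLOCK_SIZE, EMBED_DIM, NUM_HEADS, NUM_LAYERS = 100, 32, 64, 4, 2

class Block(nn.Module):
    """One Transformer block: self-attention + MLP, each on a residual path."""
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(EMBED_DIM)
        self.attn = nn.MultiheadAttention(EMBED_DIM, NUM_HEADS, batch_first=True)
        self.ln2 = nn.LayerNorm(EMBED_DIM)
        self.mlp = nn.Sequential(
            nn.Linear(EMBED_DIM, 4 * EMBED_DIM), nn.GELU(),
            nn.Linear(4 * EMBED_DIM, EMBED_DIM),
        )

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h)          # causal mask omitted for brevity
        x = x + a                          # residual connection
        return x + self.mlp(self.ln2(x))   # residual connection

class TinyGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)  # token embeddings
        self.pos_emb = nn.Embedding(BLOCK_SIZE, EMBED_DIM)  # positional embeddings
        self.blocks = nn.Sequential(*[Block() for _ in range(NUM_LAYERS)])
        self.lm_head = nn.Linear(EMBED_DIM, VOCAB_SIZE)     # vocabulary-sized logits

    def forward(self, idx):
        pos = torch.arange(idx.shape[1])
        x = self.tok_emb(idx) + self.pos_emb(pos)
        return self.lm_head(self.blocks(x))

logits = TinyGPT()(torch.randint(0, VOCAB_SIZE, (1, 8)))
print(logits.shape)  # one logit vector per position: (batch, seq, vocab)
```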
This project uses the `andreamorgar/spanish_poetry` dataset from Hugging Face, which contains over 5k Spanish poems from different authors.
This project is licensed under the MIT License - see the LICENSE file for details.
Created by Paco Algar Muñoz.