Pierre-Louis © 2026

NLP-GENIUS

A music-focused NLP suite featuring a fine-tuned GPT-2 lyrics generator trained on 1M+ songs and a multi-model genre classifier, built from scratch in Python.

PythonPyTorchGPT-2Hugging Facescikit-learnNLPDocker
Completed in December 2024, built in 6 weeks

About the Project

NLP-GENIUS is a comprehensive natural language processing project built for the NLP course at EPITA, centered around the Genius Song Lyrics dataset (5M+ songs).
The headline feature is a title-to-lyrics generator: a GPT-2 model fine-tuned on over 1 million song lyrics (18 hours on an RTX 4090) that takes a song title as input and generates coherent, stylistically appropriate lyrics using beam search with nucleus sampling (top-k=50, top-p=0.95, temperature=0.7). Model quality is evaluated via perplexity on a 200k-sample validation set.
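To illustrate what those decoding parameters do, here is a simplified NumPy sketch of temperature, top-k, and nucleus (top-p) filtering. This is an assumed, stripped-down version of the filtering Hugging Face applies during generation, not the project's actual decoding code:

```python
import numpy as np

def sample_token(logits, top_k=50, top_p=0.95, temperature=0.7, rng=None):
    """Sample one token id from raw logits: temperature scaling,
    then top-k filtering, then nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Top-k: drop everything below the k-th highest logit.
    if top_k and top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    # Softmax (exp(-inf) is exactly 0, so masked tokens vanish).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Nucleus: smallest set of tokens whose cumulative probability >= top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()
    return int(rng.choice(probs.size, p=filtered))
```

Lower temperature sharpens the distribution, top-k caps the candidate set at 50 tokens, and top-p then trims it further to the smallest set covering 95% of the probability mass.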
The project also includes a genre classification pipeline that benchmarks Logistic Regression (SAGA solver, L2 penalty), Multinomial Naive Bayes (with GridSearchCV hyperparameter tuning), and a soft-voting ensemble of both, all using tiktoken's GPT-4 encoding for tokenization and spaCy stop-word filtering. Additionally, it features a from-scratch Byte Pair Encoding (BPE) tokenizer implementation and n-gram language models for lyric generation comparison.
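A toy scikit-learn sketch of that benchmark structure, using a handful of invented lyric snippets and plain bag-of-words features in place of the real tiktoken/spaCy preprocessing:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny synthetic "lyrics" standing in for the Genius data.
lyrics = [
    "guitars screaming in the night",
    "drop the beat and rhyme on time",
    "steel strings and whiskey bars",
    "flow so cold, mic check one two",
]
genres = ["rock", "rap", "rock", "rap"]

clf = make_pipeline(
    CountVectorizer(),
    VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(solver="saga", penalty="l2", max_iter=1000)),
            ("nb", MultinomialNB()),
        ],
        voting="soft",  # average class probabilities from both models
    ),
)
clf.fit(lyrics, genres)
pred = clf.predict(["guitars and steel strings"])
print(pred[0])
```

Soft voting averages the two models' predicted probabilities rather than their hard labels, which is why both estimators must expose `predict_proba`.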

Project Architecture

The Genius dataset flows through a shared preprocessing pipeline, then branches into four modules: GPT-2 lyrics generation, genre classification, BPE tokenization, and n-gram modeling. 


Motivation

I wanted to go beyond surface-level NLP tutorials and build a complete pipeline, from raw text preprocessing to transformer fine-tuning to classical ML classification, on a single large-scale dataset, comparing modern generative models against traditional statistical approaches.

Challenges Overcome

  • Scaling GPT-2 fine-tuning to 1M+ lyrics while managing GPU memory constraints, requiring careful batch sizing (4), sequence truncation (max_length=768), and 18 hours of continuous training on an RTX 4090
  • Achieving meaningful genre classification accuracy (~62%) on a highly imbalanced multi-class dataset where lyrical style differences between genres can be subtle
  • Iterating through solver configurations (LBFGS, SAG, SAGA) and penalty terms (L1, L2, ElasticNet) to find the optimal convergence trade-off for logistic regression on a large sparse feature matrix
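That solver/penalty search can be expressed as a scikit-learn grid of compatible sub-grids, sketched here on synthetic sparse data rather than the real lyric feature matrix (ElasticNet is omitted for brevity, since it also requires an `l1_ratio`):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic sparse features and hypothetical 3-class genre labels.
X = sparse_random(200, 50, density=0.1, random_state=0, format="csr")
y = np.random.default_rng(0).integers(0, 3, size=200)

# saga supports l1 and l2; lbfgs supports only l2 -- so the grid is a
# list of compatible sub-grids rather than a single cross-product.
grid = [
    {"solver": ["saga"], "penalty": ["l1", "l2"], "C": [0.1, 1.0]},
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.1, 1.0]},
]
search = GridSearchCV(LogisticRegression(max_iter=500), grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Passing a list of dicts to GridSearchCV avoids fitting invalid combinations (e.g. LBFGS with L1), which would otherwise raise errors mid-search.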

What I Learned

  • Fine-tuning a GPT-2 language model with Hugging Face Transformers: custom Dataset class, causal LM training loop with gradient clipping, and learning rate scheduling (StepLR)
  • Evaluating generative model quality using perplexity computed over a held-out validation set
  • Benchmarking classification pipelines with scikit-learn: Logistic Regression vs. Naive Bayes vs. soft-voting ensemble, with cross-validation and GridSearchCV for hyperparameter tuning
  • Implementing a Byte Pair Encoding (BPE) tokenizer from scratch to understand subword tokenization mechanics
  • Building reproducible text preprocessing pipelines: language filtering, annotation stripping, ASCII normalization, and train/validation splitting
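As an illustration of the from-scratch BPE mentioned above, here is a minimal merge-learning loop. This is an assumed shape of the algorithm; the project's actual implementation may differ in details such as pre-tokenization and end-of-word markers:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges from a list of words; return (merges, final vocab)."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the new merged symbol.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "low", "lower", "newest", "newest"], 3)
print(merges)
```

Each iteration greedily merges the single most frequent adjacent pair, so frequent substrings like "low" condense into one token after a few merges.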