Pierre-Louis © 2026

NLP-GENIUS

A music-focused NLP suite featuring a fine-tuned GPT-2 lyrics generator trained on 1M+ songs and a multi-model genre classifier, built from scratch in Python.

PythonPyTorchGPT-2Hugging Facescikit-learnNLPDocker
Completed in December 2024, built in 6 weeks

About the Project

NLP-GENIUS is a comprehensive natural language processing project built for the NLP course at EPITA, centered around the Genius Song Lyrics dataset (5M+ songs).
The headline feature is a title-to-lyrics generator: a GPT-2 model fine-tuned on over 1 million song lyrics (18 hours on an RTX 4090) that takes a song title as input and generates coherent, stylistically appropriate lyrics using beam search with nucleus sampling (top-k=50, top-p=0.95, temperature=0.7). Model quality is evaluated via perplexity on a 200k-sample validation set.
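To illustrate what those decoding parameters do, here is a simplified NumPy sketch of temperature, top-k, and nucleus (top-p) filtering. This is an assumed, stripped-down version of the filtering Hugging Face applies during generation, not the project's actual decoding code:

```python
import numpy as np

def sample_token(logits, top_k=50, top_p=0.95, temperature=0.7, rng=None):
    """Sample one token id from raw logits: temperature scaling,
    then top-k filtering, then nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Top-k: drop everything below the k-th highest logit.
    if top_k and top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    # Softmax (exp(-inf) is exactly 0, so masked tokens vanish).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Nucleus: smallest set of tokens whose cumulative probability >= top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()
    return int(rng.choice(probs.size, p=filtered))
```

Lower temperature sharpens the distribution, top-k caps the candidate set at 50 tokens, and top-p then trims it further to the smallest set covering 95% of the probability mass.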
The project also includes a genre classification pipeline that benchmarks Logistic Regression (SAGA solver, L2 penalty), Multinomial Naive Bayes (with GridSearchCV hyperparameter tuning), and a soft-voting ensemble of both, all using tiktoken's GPT-4 encoding for tokenization and spaCy stop-word filtering. Additionally, it features a from-scratch Byte Pair Encoding (BPE) tokenizer implementation and n-gram language models for lyric generation comparison.
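A toy scikit-learn sketch of that benchmark structure, using a handful of invented lyric snippets and plain bag-of-words features in place of the real tiktoken/spaCy preprocessing:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny synthetic "lyrics" standing in for the Genius data.
lyrics = [
    "guitars screaming in the night",
    "drop the beat and rhyme on time",
    "steel strings and whiskey bars",
    "flow so cold, mic check one two",
]
genres = ["rock", "rap", "rock", "rap"]

clf = make_pipeline(
    CountVectorizer(),
    VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(solver="saga", penalty="l2", max_iter=1000)),
            ("nb", MultinomialNB()),
        ],
        voting="soft",  # average class probabilities from both models
    ),
)
clf.fit(lyrics, genres)
pred = clf.predict(["guitars and steel strings"])
print(pred[0])
```

Soft voting averages the two models' predicted probabilities rather than their hard labels, which is why both estimators must expose `predict_proba`.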

Project Architecture

The Genius dataset flows through a shared preprocessing pipeline, then branches into four modules: GPT-2 lyrics generation, genre classification, BPE tokenization, and n-gram modeling. 


Motivation

I wanted to go beyond surface-level NLP tutorials and build a complete pipeline, from raw text preprocessing to transformer fine-tuning to classical ML classification, on a single large-scale dataset, comparing modern generative models against traditional statistical approaches.

Challenges Overcome

  • Scaling GPT-2 fine-tuning to 1M+ lyrics while managing GPU memory constraints, requiring careful batch sizing (4), sequence truncation (max_length=768), and 18 hours of continuous training on an RTX 4090
  • Achieving meaningful genre classification accuracy (~62%) on a highly imbalanced multi-class dataset where lyrical style differences between genres can be subtle
  • Iterating through solver configurations (LBFGS, SAG, SAGA) and penalty terms (L1, L2, ElasticNet) to find the optimal convergence trade-off for logistic regression on a large sparse feature matrix
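That solver/penalty search can be expressed as a scikit-learn grid of compatible sub-grids, sketched here on synthetic sparse data rather than the real lyric feature matrix (ElasticNet is omitted for brevity, since it also requires an `l1_ratio`):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic sparse features and hypothetical 3-class genre labels.
X = sparse_random(200, 50, density=0.1, random_state=0, format="csr")
y = np.random.default_rng(0).integers(0, 3, size=200)

# saga supports l1 and l2; lbfgs supports only l2 -- so the grid is a
# list of compatible sub-grids rather than a single cross-product.
grid = [
    {"solver": ["saga"], "penalty": ["l1", "l2"], "C": [0.1, 1.0]},
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.1, 1.0]},
]
search = GridSearchCV(LogisticRegression(max_iter=500), grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Passing a list of dicts to GridSearchCV avoids fitting invalid combinations (e.g. LBFGS with L1), which would otherwise raise errors mid-search.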

What I Learned

  • Fine-tuning a GPT-2 language model with Hugging Face Transformers: custom Dataset class, causal LM training loop with gradient clipping, and learning rate scheduling (StepLR)
  • Evaluating generative model quality using perplexity computed over a held-out validation set
  • Benchmarking classification pipelines with scikit-learn: Logistic Regression vs. Naive Bayes vs. soft-voting ensemble, with cross-validation and GridSearchCV for hyperparameter tuning
  • Implementing a Byte Pair Encoding (BPE) tokenizer from scratch to understand subword tokenization mechanics
  • Building reproducible text preprocessing pipelines: language filtering, annotation stripping, ASCII normalization, and train/validation splitting
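As an illustration of the from-scratch BPE mentioned above, here is a minimal merge-learning loop. This is an assumed shape of the algorithm; the project's actual implementation may differ in details such as pre-tokenization and end-of-word markers:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges from a list of words; return (merges, final vocab)."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the new merged symbol.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "low", "lower", "newest", "newest"], 3)
print(merges)
```

Each iteration greedily merges the single most frequent adjacent pair, so frequent substrings like "low" condense into one token after a few merges.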