NLP-GENIUS is a comprehensive natural language processing project built for the NLP course at EPITA, centered around the
Genius Song Lyrics dataset (5M+ songs).
The headline feature is a title-to-lyrics generator: a GPT-2 model fine-tuned on over 1 million song lyrics (18 hours on an RTX 4090). Given a song title, it generates coherent, stylistically appropriate lyrics using beam search combined with top-k and nucleus (top-p) sampling (top-k=50, top-p=0.95, temperature=0.7). Model quality is evaluated via perplexity on a 200k-sample validation set.
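
To make the decoding parameters concrete, here is a minimal, dependency-free sketch of how temperature, top-k, and top-p interact when picking the next token. It is an illustration only, not the project's generation code: the function names `filter_distribution` and `sample_token` are hypothetical, the beam search component is left out, and the filtering order (temperature, then top-k, then top-p) is an assumption modeled on common practice.

```python
import math
import random


def filter_distribution(logits, top_k=50, top_p=0.95, temperature=0.7):
    """Return {token_id: probability} after temperature, top-k and top-p filtering.

    Hypothetical helper for illustration; defaults mirror the values quoted above.
    """
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax over the scaled logits.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable token ids, then renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p (nucleus): keep the smallest prefix of the ranked tokens whose
    # cumulative (renormalized) probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample_token(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

With `top_k=1` the sketch degenerates to greedy decoding, while larger `top_k` and `top_p` values leave more candidate tokens in play and give more varied lyrics.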
The project also includes a genre classification pipeline that benchmarks Logistic Regression (SAGA solver, L2 penalty), Multinomial Naive Bayes (with GridSearchCV hyperparameter tuning), and a soft-voting ensemble of both, all using tiktoken's GPT-4 encoding for tokenization and spaCy stop-word filtering. It also includes a from-scratch Byte Pair Encoding (BPE) tokenizer implementation and n-gram language models that serve as a comparison baseline for lyric generation.
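
For readers unfamiliar with BPE, the following is a minimal byte-level sketch of the technique: repeatedly merge the most frequent adjacent pair of token ids into a new id. It is not the project's implementation; the function names (`train_bpe`, `merge_pair`, `encode`, `decode`) and the choice to operate on raw UTF-8 bytes are assumptions made for this example.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; returns {(left_id, right_id): new_id}."""
    ids = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
    merges = {}
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing left worth merging
            break
        new_id = 256 + len(merges)  # new ids follow the 256 byte values
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return merges


def encode(text, merges):
    """Apply the learned merges in training order (dicts keep insertion order)."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand token ids back into bytes, then into text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

On repetitive text such as lyrics, a handful of merges already shortens the sequence noticeably, for example `encode("la la la la", train_bpe("la la la la", 3))` yields fewer tokens than the 11 input bytes, and `decode` recovers the original string.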