Generating high-quality text with LSTMs is very difficult, but clickbait headlines are inherently low quality. A simple LSTM should have no trouble generating them!
Code on GitHub
I used a dataset of clickbait headlines collected by the team behind *Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media*.
The dataset contains 17,000 headlines from the following august publications:
To feed the data into the model, I read the entire 17,000-headline dataset as a single string and split it into 20-word samples.
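The splitting step is straightforward. Here's a minimal sketch of what it might look like (the function name and sample length parameter are mine, not from the original code):

```python
def make_samples(text, sample_len=20):
    """Split a long string into consecutive, non-overlapping
    lists of `sample_len` words each; a trailing partial
    chunk shorter than `sample_len` is dropped."""
    words = text.split()
    return [
        words[i : i + sample_len]
        for i in range(0, len(words) - sample_len + 1, sample_len)
    ]
```

A sliding window with a stride smaller than the sample length would yield more (overlapping) training samples from the same text, at the cost of more correlated data.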
My model was a two-layer LSTM with 256 units per layer and 20% dropout. I experimented with pretrained GloVe embeddings, but I found that clickbait headlines have too many made-up words that don't exist in GloVe's vocabulary. Training the embeddings from scratch gave much better results.
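In Keras, that architecture might look roughly like this. The vocabulary size, embedding dimension, and optimizer are assumptions on my part; only the two 256-unit LSTM layers, the 20% dropout, and the learned (non-GloVe) embedding layer come from the description above:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 10_000  # assumed; set to the actual vocabulary size
SEQ_LEN = 20         # matches the 20-word samples

model = models.Sequential([
    tf.keras.Input(shape=(SEQ_LEN,)),
    # Embeddings trained from scratch rather than loaded from GloVe
    layers.Embedding(VOCAB_SIZE, 128),
    layers.LSTM(256, return_sequences=True),
    layers.Dropout(0.2),
    layers.LSTM(256),
    layers.Dropout(0.2),
    # Predict a distribution over the vocabulary for the next word
    layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```

Trained this way, the model predicts the next word given the preceding words; sampling from the softmax output repeatedly generates a headline.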
I built the model using TensorFlow 2 and Keras.
See for yourself! Here are some of the best headlines it’s generated so far:
- we know your zodiac sign based on your zodiac sign
- the 17 most important canadian celebrity moments of 2015
- here’s how to make a vampire
- can you guess your favorite ’90s movie based on your favorite kitten
- are you more a canadian or taylor swift or oprah
These could easily pass for headlines on any of the esteemed websites listed above.
The easiest way to improve this would be to add more samples. I couldn’t find a bigger dataset online, but a simple web scraper would be able to get thousands more with only moderate effort.
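A scraper's parsing step can be done with nothing but the standard library. This is a hypothetical sketch: the tag and class name (`<h2 class="headline">`) are made up, since the actual markup depends on whichever site you scrape. The fetching side (e.g. with `urllib` or `requests`) would feed pages into this parser:

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect the text inside <h2 class="headline"> tags
    (hypothetical markup -- adjust to the target site)."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._in_headline = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "headline") in attrs:
            self._in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_headline = False

    def handle_data(self, data):
        if self._in_headline and data.strip():
            self.headlines.append(data.strip())
```

Feeding each fetched page through `HeadlineParser().feed(html)` and collecting `.headlines` would grow the training set quickly.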