
When this was posted last time, I read it as "This WOD (Workout of the Day) Does Not Exist". It inspired me to make a workout generator trained on workouts from crossfit.com using a character-level RNN.

A year removed from it and I still laugh at some of the ridiculous workouts.

https://thiswoddoesnotexist.com/


https://thiswoddoesnotexist.com/

I got to playing around with recurrent neural networks and made a site that generates Crossfit "Workouts of the Day" (WODs). It's trained on the workouts from crossfit.com.

My motivations were to learn more about how character-based RNNs work, remember how to host a site on my Digital Ocean VPS with Flask, and do some fun frontend work. It posed a few unique challenges, like scraping the crossfit website, experimenting with different network architectures, and finding ways to validate the efficacy of those networks.

It's terribly overfit and will sometimes generate workouts verbatim from the crossfit.com database, but since it's just a fun project, it was more important for me to get consistently good, grammatically correct results and some overfit ones rather than a bunch of nonsense text and a few hidden gems.

My next step is to either sum up the key takeaways in a blog post about the full stack of the application and call it finished, or continue to play around with network hyperparameters and training techniques, since nurturing my neural network knowledge for NLP was a huge goal.

Always looking for feedback and happy to answer any questions!


I love this project! Have you tried actually completing any of the generated workouts?

How do you calculate overfitting for something like this? I know how you would do so with a more traditional supervised learning model with numerical inputs/outputs but NLP still seems a little like black magic to me since I haven't dived into a project using it myself. Is it just a comparison of similarity between generated posts and all posts in the training set? How do you calculate how "close" an output is to an input?


Haha no I might get kicked out of my gym if I try to do

- Workout of the day (WOD)

- 15 Dumbbell Throws

To understand how I calculated loss, I have to describe the whole model, so let's take it from the top.

The model consists of 3 distinct layers. The first layer is a character embedding. We need a unique representation for each character in the entire corpus. Without checking, I believe it was ~80 different characters. This includes all the uppercase letters, lowercase letters, numbers, punctuation symbols, and even the newline character (`\n`). One way to encode these characters as a state vector is to one-hot encode each character. With one-hot encoding, something like 'a' would be (potentially) encoded as `[1,0,0,0,...,0]`. Each character has its own unit vector orthogonal to all the other vectors. With ~80 different characters, a one-hot encoded corpus would span `R^80`. That seemed prohibitively large to me, so I went with the learned embedding route. With an embedding, you reduce the dimensionality of the state vectors from `R^80` to something much smaller by no longer making each vector orthogonal to the others. In this system, `a` could be encoded as `[.34, 0, .01,..., 0]`, characters no longer have their own unique dimension, and their dot products are no longer 0 since they are not independent. But this is actually something we want! We learn from the corpus how different characters relate to each other. This may put all the number character vectors closer to each other, since they are used in similar ways in the workouts.

So the benefit of the embedding over the one-hot encoding is two-fold: a more compact representation and a vector representation that is able to show similarities between different characters. Note to self: exploring the embedding created by the Crossfit workout corpus would be super interesting.
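
To make the embedding idea concrete, here's a minimal PyTorch sketch (not the actual site code; `vocab_size` and `embed_dim` are illustrative values) contrasting one-hot vectors with a learned embedding:

    import torch
    import torch.nn as nn

    vocab_size, embed_dim = 80, 16

    # One-hot: each character gets its own orthogonal unit vector in R^80.
    one_hot = torch.eye(vocab_size)            # shape: (80, 80)

    # Learned embedding: each character maps to a dense vector in R^16,
    # whose values are learned from the corpus during training.
    embedding = nn.Embedding(vocab_size, embed_dim)

    char_index = torch.tensor([24])            # e.g., 'a' -> 24
    print(one_hot[char_index].shape)           # torch.Size([1, 80])
    print(embedding(char_index).shape)         # torch.Size([1, 16])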

The next layer (or actually a series of layers) is the LSTM layer. To avoid writing a novel here, here's a resource that can explain it better than I can (https://colah.github.io/posts/2015-08-Understanding-LSTMs/). It's a node that maintains a hidden state that allows it to "remember" previous inputs during the run. These cells are followed by some hidden layers. Look for a blog post soon on my page for a more in-depth explanation as I learn to explain it better.

The output from our LSTM cells is fed into a final fully connected layer that is the size of our vocab (~80 characters). A softmax activation is attached to the fully connected layer so that our final output is a probability distribution across all the different characters.
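
Roughly, the whole stack looks something like this (a hedged PyTorch sketch, not my exact architecture; layer sizes are made up). One difference from the description above: the sketch returns raw logits and applies the softmax later, at sampling time or inside the loss, which is the usual PyTorch idiom:

    import torch
    import torch.nn as nn

    class WodGenerator(nn.Module):
        """Character-level generator: embedding -> LSTM -> fully connected -> logits."""
        def __init__(self, vocab_size=80, embed_dim=16, hidden_dim=256, num_layers=2):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
            self.fc = nn.Linear(hidden_dim, vocab_size)

        def forward(self, x, hidden=None):
            # x: (batch, seq_len) of character indices
            emb = self.embedding(x)               # (batch, seq_len, embed_dim)
            out, hidden = self.lstm(emb, hidden)  # (batch, seq_len, hidden_dim)
            logits = self.fc(out)                 # (batch, seq_len, vocab_size)
            return logits, hidden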

So, the way our network works (at predict time) is we feed in a single character, and it's converted to a number (e.g., a->24). That number is embedded as a vector, and that vector goes through the LSTM layers (which hold some hidden state that "remembers" that `a` passed through). Then a fully connected layer and softmax give a probability distribution over the characters. I sample from that distribution, which yields the next character.

As an example (and how the site works), when you click "Pump It Up", I prime the network with the text "Workout of the day (WOD)". After priming the network (which gives some state to the LSTM cells), I take the next generated character, print it to the screen, and then feed it back into the network. Without fail, after priming with "Workout of the day (WOD)", the next character generated is "\n". The ")" character that was fed in just before would not be enough on its own to generate "\n", but the LSTM has enough built-up state to know it's time for a line break. I find that so cool, and it's why I went with the character model when a word-based model could likely generate better workouts.
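
For the curious, that predict-time loop might look roughly like the following sketch (it assumes the `WodGenerator` from the sketch above, plus hypothetical `char_to_idx` / `idx_to_char` lookup tables, so it isn't the site's actual code):

    import torch

    def generate(model, prime_text, char_to_idx, idx_to_char, length=300):
        """Prime the network with seed text, then sample one character at a time."""
        model.eval()
        hidden, logits = None, None
        with torch.no_grad():
            # Prime: push the seed text through the network to build up LSTM state.
            for ch in prime_text:
                x = torch.tensor([[char_to_idx[ch]]])
                logits, hidden = model(x, hidden)

            out = []
            for _ in range(length):
                probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over ~80 chars
                idx = torch.multinomial(probs, 1).item()      # sample the next character
                out.append(idx_to_char[idx])
                x = torch.tensor([[idx]])                     # feed the sample back in
                logits, hidden = model(x, hidden)
        return "".join(out)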

Now that we understand the network, particularly the output as a probability distribution over the characters, we can finally talk about training loss. The naive way to calculate loss would be to feed in a character, produce a character, and then give a +1 if the produced character matched the expected character from the training text. But we can do something much smarter: instead of comparing character by character, we compare the output probability distribution against the true next character using cross-entropy loss. This is so much more powerful than simply comparing character outputs. This loss function is both how our model is trained (propagating that loss back through the network) and how we evaluate the network at testing time.
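
A training step under that scheme could look something like this sketch (again assuming the model from the earlier sketch; `nn.CrossEntropyLoss` takes raw logits, so the softmax is handled inside the loss):

    import torch
    import torch.nn as nn

    def train_step(model, optimizer, batch):
        """batch: (batch_size, seq_len + 1) tensor of character indices."""
        criterion = nn.CrossEntropyLoss()
        inputs, targets = batch[:, :-1], batch[:, 1:]   # predict each next character
        logits, _ = model(inputs)                       # (batch, seq_len, vocab_size)
        # Cross entropy compares the predicted distribution at every position
        # against the true next character from the training text.
        loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()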

This validation testing relied on the fact that the characters in a workout depend on one another, but the workouts themselves are independent. With this in mind, I was able to randomly split the whole workouts into training and testing batches. I trained on a subset of workouts, then tested the efficacy of the models using the testing set. Then I summed and averaged the losses of the run, plotted the results, and ran through the entire corpus of workouts again with a new random selection of workouts. By exploiting the independence across workouts, I was able to perform cross-validation.
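
The split itself is simple precisely because each workout is an independent unit; a sketch (with hypothetical helper names, not my actual code) might look like:

    import random

    def split_workouts(workouts, test_frac=0.2, seed=None):
        """Randomly split whole workouts (each one string) into train/test sets.

        Characters within a workout depend on one another, but workouts are
        independent, so splitting at the workout boundary keeps test data unseen.
        """
        rng = random.Random(seed)
        shuffled = workouts[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_frac)
        return shuffled[n_test:], shuffled[:n_test]   # train, test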

Did this work? Honestly, I did not see much of a divergence between the training and testing efficacy, but it was the best thing I could think of to test whether my model was overfitting.

---

I hope this stream of consciousness gives you a little overview of how this works and my theory on testing. This will serve as a good rough draft for what I've been meaning to write for a while. I really could not use the model's loss on testing vs. training to spot overfitting in the network, so maybe that approach was flawed. I need to continue to do research into testing on sequence data. I am doing some stock market time series investigation right now, so I really hope to learn what the state-of-the-art techniques are for validation testing on time series data, which, in essence, is sequence data just like these crossfit workouts.


I clicked the link and it told me "rest day" :D Nice. haha


Crossfit.com is VERY liberal with their rest days :P I thought about removing all the duplicate rest days from the training text, but I wanted to stay faithful to the original material.


I took the class during my undergrad at Tech and I could not agree more. There was such a disconnect between the class material and the project.

Additionally, this is the class that gave us "Jill Watson", the robot TA.

http://www.cbc.ca/news/technology/robot-ta-ai-1.3585801

