So, I don't know a ton about Word2Vec, which probably doesn't help, but I do understand that it makes NLP tasks much easier, since you're going from a massive space (the English language) into a smaller embedding that you learn.
That being said, how are you embedding images? Is it based on how similar they are, and if so, what does "similarity" mean? Also, what dataset did you use?
Any more info on how you do this task would be awesome :).
Our launch model is trained on ImageNet, which has enough variety that it usually generalizes well even when your data looks quite different from ImageNet's distribution. We're planning to train on a wider variety of image data in the future, but we wanted to get something into people's hands quickly.
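To make the idea concrete: once each image is mapped to a fixed-length embedding vector (commonly the penultimate-layer activations of an ImageNet-trained CNN), "similarity" between images is typically measured as cosine similarity between those vectors. Here's a minimal sketch of that comparison step; the vectors are made-up stand-ins for real embeddings, and the specific model/metric any given product uses may differ:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors:
    # 1.0 = same direction, ~0 = unrelated, -1.0 = opposite.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
cat_a = rng.normal(size=128)                 # pretend embedding of a cat photo
cat_b = cat_a + 0.1 * rng.normal(size=128)   # a slightly perturbed "nearby" image
truck = rng.normal(size=128)                 # an unrelated image

print(cosine_similarity(cat_a, cat_b))  # high, close to 1.0
print(cosine_similarity(cat_a, truck))  # low, near 0
```

Nearest-neighbor search over these vectors is then what turns "embedding" into a usable similarity lookup.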