This only uses LDA on your starred repository descriptions, to find topic terms that describe your starred repositories. These topic terms are then used to query the GitHub search API to find matching repositories. The results are then sorted by star count.
That is a clever way to make use of a search API like GitHub's. The principled way to do this, though, is to run LDA over all descriptions on GitHub, then use that similarity index to find similar repositories. You could run LDA over code, too.
I'll note that there is a cold start problem with this implementation: using LDA on such a small set of short documents will often lead to uninformative topics with overly specific words. You need a big corpus to capture, e.g., synonym relationships.
Your point is quite interesting, although I'm not sure running LDA on the code itself would be useful. I spent half a year writing my postgraduate thesis on a recommender system for streaming services based on LDA; in particular, we wanted to infer who is watching what, and when, in a shared account.
From all the tests I did with LDA I believe the best thing would be to run it on the README files.
That's right, but one additional level of depth in fetching repositories increases the API latency so much that it almost becomes unusable for a web app, at least for a hobby web app :P Hence the shortcuts. Open to ideas and suggestions.
As I said, your approach is a clever way to use the GitHub API. I think you need to change the title and readme to indicate that this isn't an LDA index of GitHub descriptions. To ML practitioners, that's what you are implying with a title of "Show HN: Using LDA to suggest GitHub repositories based on what you have starred".
Typo when you don't have the right kind of repos starred: "yeild" should be yield
I've worked with NLP a bit before, but haven't worked with LDA and have only read the Wikipedia article and the gensim documentation. One thing I don't understand is why you only generate a single topic for each user, and then query the top n (5) terms.

From what I understand of LDA, its usefulness is in partitioning text into k separate topics based on how often words are used in similar contexts. In my mind, this is more or less analogous (please tell me if this is wrong) to finding k centroids for a vector representation of the text after training a word2vec mapping (in an appropriately low dimension given the document size) on that text.

However, if you are only finding a single topic, you are only using one centroid, so your search terms will be the n tokens that are closest to that centroid. I'm pretty sure that the tokens (from the text) closest to the centroid of a word2vec mapping trained on a text will mostly consist of high-frequency words and semi-stop words (by this I mean words used in varied contexts because of their role in the language, but not filtered by the stop word check).
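To make the centroid intuition concrete, here's a toy sketch with made-up 2-D vectors standing in for a trained word2vec model (all tokens and coordinates are invented for illustration):

```python
from math import dist

# Hypothetical low-dimensional "word vectors"; a real word2vec
# model would learn these from the corpus.
vectors = {
    "the":     (0.5, 0.5),   # high-frequency, context-neutral word
    "code":    (0.6, 0.4),
    "parser":  (0.9, 0.1),
    "fortran": (0.1, 0.9),
}

# Single centroid of all tokens (the "one topic" case).
centroid = tuple(sum(v[i] for v in vectors.values()) / len(vectors)
                 for i in range(2))

# Tokens nearest the centroid tend to be the generic, frequent ones.
nearest = sorted(vectors, key=lambda w: dist(vectors[w], centroid))
print(nearest[0])  # → "the"
```

With one centroid, the most "central" token is the semi-stop word, which is exactly the worry about querying with the top terms of a single topic.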
Then if someone has many different topical interests, LDA might over-represent whichever topic has the plurality of text dedicated to it. For example, if my starred repos are something like 30% Fortran, 30% JavaScript, 40% Java, I believe your algorithm's queries will mostly contain Java terms. This seems to run counter to the goal of using LDA, which would be (to my understanding) to identify these latent topics and give relevant queries for each one, or for combinations of them.
I think a good way to address this would be to implement some way to change the default number of topics. One approach may be to use a trained (perhaps on github instances itself) word2vec instance to determine the "spread" of the incoming tokens: you could construct cliques based on pairwise distance between vectors and do something with that (let k be the number of cliques, or the number of cliques of size greater than m, etc.).
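As a rough sketch of that idea, here the number of topics k is taken from the number of connected components of a distance-threshold graph, a cheap stand-in for the clique construction above; the vectors are invented rather than coming from a trained word2vec instance:

```python
from math import dist

def count_clusters(vectors, threshold=0.3):
    """Count groups of tokens whose vectors sit close together,
    via union-find over pairs within the distance threshold."""
    tokens = list(vectors)
    parent = {t: t for t in tokens}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    for i, a in enumerate(tokens):
        for b in tokens[i + 1:]:
            if dist(vectors[a], vectors[b]) < threshold:
                parent[find(a)] = find(b)

    return len({find(t) for t in tokens})

# Two tight groups of invented token vectors -> suggests num_topics=2.
toy = {"js": (0.0, 0.0), "node": (0.1, 0.0),
       "fortran": (1.0, 1.0), "hpc": (0.9, 1.0)}
k = count_clusters(toy)
print(k)  # → 2
```

The returned k could then be fed to the LDA call instead of hard-coding a single topic.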
A different approach might be to precompute the vector average of each github repo. Then you could perform richer comparisons directly to documents (e.g. compare each clique's centroid to the repos) without directly querying github for tons of repos.
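A minimal sketch of that comparison, with invented repo vectors and a hand-written cosine similarity (a real version would precompute the averaged vectors offline):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return num / den

# Precomputed (here: invented) average word vectors per repo.
repo_vectors = {
    "graphics-engine": (0.9, 0.1),
    "web-framework":   (0.1, 0.9),
}

# e.g. the centroid of one clique of the user's interest tokens
query_centroid = (0.8, 0.2)

best = max(repo_vectors,
           key=lambda r: cosine(repo_vectors[r], query_centroid))
print(best)  # → "graphics-engine"
```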
Very interesting. I've implemented something similar[1] using a pure Collaborative Filtering approach[2][3], that I think works better for me, but it's unable to recommend unpopular repositories.
The New York Times recommender system uses a hybrid approach (Content Based + Collaborative Filtering) called Collaborative Topic Modeling on top of LDA[4]. It would be interesting to try that.
Really nice links... Thanks! Will take a look at it and compare and contrast for better tuning. :) Feel free to do the same and raise some pull requests :)
I appreciate the effort, but it is giving me, a Scala guy and an author of books on subjects such as distributed computing and Akka, links to PHP libraries. I think the ranking should weight language interest very highly.
I personally play around with a lot of tools and languages and wanted to make sure that a language barrier didn't limit the tool. In the sense that if there are a lot of repositories out there in JS for something I am starting in Python, I may want to take a look at why so many people chose to do it in JS instead of Python, and then make my decision based on that.
This would definitely be lost if I filtered based on languages. In its current form, it gives importance to the topic and not the language, for this specific reason.
There's a variant of LDA called "structured LDA" that allows you to introduce a linear combination of "structural variables" into the topic distribution prior. If you wanted to make this project "language-aware", this might be a good use case for sLDA.
Does this only recommend awesome-style repos? I got about 20 of those despite having over 400 stars, the majority in computer graphics. Also a lot of Ruby, a language I don't use (less than 1% of my stars). Odd.
The problem is this: the suggestions are generated using the repo descriptions as the starting point, dropping all references to language and all non-English words. What this means is that even if you have a lot of stars on Ruby repositories, if the search on terms derived from that process leads to non-Ruby repositories, that is what is shown. Will look into tuning it better.
I'd suggest adding a filter at the top of the results to filter by language, i.e. a checkbox next to each language found in the results to specify which ones to keep.
OTOH, I kind of like the idea that something like this could lead me to discover a programming language that is better suited to the subjects that I am interested in.
Well, not sure why, but it definitely didn't work for me. Random would have outperformed it. I like the idea, though; suggested GitHub repos would be fun to have a look at.
Take a look at the repositories suggested; you may find something interesting after some digging, or add a new interest to your list. After some more starring, the tool might actually start suggesting repositories to your liking.
Also, if you find something that seems badly off, feel free to create issues on the repo and I will try to see how it can be better tuned/filtered.
Cool project! Thanks for publishing and sharing it.
It'd be interesting to know what topic terms it produces for each of my repos. It looks like it's taking all the repo descriptions, producing a topic model over that corpus with a single topic (`LdaModel(num_topics=1)`), and retrieving the top N terms for that topic. Those topic terms will be the most frequent words from the topic, so I think this will end up producing the most frequent words from the cleaned token set.
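If that reading is right, a plain frequency count over the cleaned tokens should approximate the same top-N ranking. A toy check of the intuition, with made-up tokens:

```python
from collections import Counter

# Cleaned tokens from (hypothetical) starred-repo descriptions.
tokens = ["parser", "json", "parser", "cli", "json", "parser", "rust"]

# With a single topic, the topic's top terms track corpus frequency,
# so a Counter gives roughly the same ranking as LdaModel(num_topics=1).
top_terms = [w for w, _ in Counter(tokens).most_common(2)]
print(top_terms)  # → ["parser", "json"]
```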
I'd be curious to see what happens if you could run LDA over the full dataset, produce multiple topics, and suggest repos based on those topics. This would be a pretty fun extension to the project!
If you're just running LDA over the repo description (and not looking into the content of any file, e.g. README), might http://ghtorrent.org/ be able to provide this?
Or, it might be interesting to try producing a vector representation per repo by taking the description (and readme?), and doing something like: produce word vectors for each word, and sum the word vectors. https://spacy.io/ is a nice-to-use library that could help here.
Once you have a vector representation for each repo, a distance metric such as cosine similarity could find related repos. Or (depending on the dataset size / performance) an approximation like spill trees or an LSH forest.
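For the approximation route, here's a toy sketch of random-hyperplane LSH: each vector is hashed by the sign of its dot product with a few hyperplanes, and only vectors sharing a bucket need an exact comparison. The hyperplanes and repo vectors are invented (and fixed, rather than random, so the example is reproducible):

```python
# Fixed "random" hyperplanes; a real implementation would sample them.
hyperplanes = [(1.0, -1.0), (0.5, 0.5)]

def lsh_key(vec):
    """Hash a vector to a tuple of sign bits, one per hyperplane."""
    return tuple(int(sum(h * v for h, v in zip(plane, vec)) >= 0)
                 for plane in hyperplanes)

repos = {
    "graphics-engine": (0.9, 0.1),
    "shader-toolkit":  (0.8, 0.3),
    "web-framework":   (0.1, 0.9),
}

buckets = {}
for name, vec in repos.items():
    buckets.setdefault(lsh_key(vec), []).append(name)

# Similar vectors land in the same bucket.
print(buckets)
```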
I am so curious about your implementation. For instance, what sort of preprocessing did you have to carry out? I wrote a script some time back to analyze Paul Graham's essays (link: https://github.com/futureUnsure/pg-essay-lda), and had to remove dates and times because they appeared a lot and distorted the top topics. I'm wondering if you had to do something similar for text that describes code?
Also, did you write an LDA library yourself or did you leverage an existing library?
I apologize in advance if my questions sound naive/stupid, am just a noob...
Thanks, I am using the gensim package for LDA. In a nutshell:
1. Get descriptions of repos the user is interested in
2. Cleanup/filtering/tokenization
3. Use LDA to generate topics
4. Use the topics to search for repositories GitHub can provide.
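A rough sketch of those four steps, with placeholder names, a simple frequency count standing in for gensim's LDA, and an invented stopword list, descriptions, and query format:

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "for", "in", "of", "and", "with"}

def clean_tokens(descriptions):
    # Step 2: cleanup / filtering / tokenization
    tokens = []
    for desc in descriptions:
        for word in re.findall(r"[a-z]+", desc.lower()):
            if word not in STOPWORDS:
                tokens.append(word)
    return tokens

def top_topic_terms(tokens, n=5):
    # Step 3: stand-in for the LDA call; with a single topic the
    # top terms track frequency, so a Counter approximates it here.
    return [w for w, _ in Counter(tokens).most_common(n)]

# Step 1: descriptions of starred repos (hypothetical examples)
descriptions = ["A fast JSON parser", "JSON tooling for the CLI",
                "Streaming JSON parser in C"]

terms = top_topic_terms(clean_tokens(descriptions))

# Step 4: the terms would be joined into a GitHub search query
query = "+".join(terms)
print(query)
```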