This only uses LDA on your starred repository descriptions, to find topic terms that describe your starred repositories. These topic terms are then used to query the GitHub search API to find matching repositories. The results are then sorted by star count.
That is a clever way to make use of a search API like GitHub's. The principled way to do this, though, is to run LDA over all descriptions on GitHub, then use that similarity index to find similar repositories. You could run LDA over code, too.
I'll note that there is a cold start problem with this implementation: using LDA on such a small set of short documents will often lead to uninformative topics with overly specific words. You need a big corpus to capture, e.g., synonym relationships.
Your point is quite interesting, although I'm not sure running LDA on the code itself would be useful. I spent half a year writing my postgraduate thesis on a recommender system for streaming services based on LDA; in particular, we wanted to infer who is watching what, and when, in a shared account.
From all the tests I did with LDA I believe the best thing would be to run it on the README files.
That's right, but one additional level of depth in fetching repositories increases the API latency so much that it almost becomes unusable for a web app, at least for a hobby web app :P Hence the shortcuts. Open to ideas and suggestions.
As I said, your approach is a clever way to use the GitHub API. I think you need to change the title and readme to indicate that this isn't an LDA index of GitHub descriptions. To ML practitioners, that's what you are implying with a title of "Show HN: Using LDA to suggest GitHub repositories based on what you have starred".
Typo when you don't have the right kind of repos starred: "yeild" should be yield
I've worked with NLP a bit before, but haven't worked with LDA and have only read the Wikipedia article and the gensim documentation. One thing I don't understand is why you only generate a single topic for each user, and then query the top n (5) terms.

From what I understand of LDA, its usefulness is in partitioning text into k separate topics based on how often words are used in similar contexts. In my mind, this is more or less analogous (please tell me if this is wrong) to finding k centroids for a vector representation of the text after training a word2vec mapping (in an appropriately low dimension given the document size) on that text.

However, if you are only finding a single topic, you are only using one centroid, so your search terms will be the n tokens that are closest to that centroid. I'm pretty sure that the tokens (from the text) closest to the centroid of a word2vec mapping trained on a text will mostly consist of high-frequency words and semi-stop words (by this I mean words used in varied contexts because of their role in the language, but not filtered by the stop word check).
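To make the centroid intuition concrete, here's a toy sketch with made-up 2-D vectors standing in for a trained word2vec model (all tokens and coordinates are invented for illustration):

```python
from math import dist

# Hypothetical low-dimensional "word vectors"; a real word2vec
# model would learn these from the corpus.
vectors = {
    "the":     (0.5, 0.5),   # high-frequency, context-neutral word
    "code":    (0.6, 0.4),
    "parser":  (0.9, 0.1),
    "fortran": (0.1, 0.9),
}

# Single centroid of all tokens (the "one topic" case).
centroid = tuple(sum(v[i] for v in vectors.values()) / len(vectors)
                 for i in range(2))

# Tokens nearest the centroid tend to be the generic, frequent ones.
nearest = sorted(vectors, key=lambda w: dist(vectors[w], centroid))
print(nearest[0])  # → "the"
```

With one centroid, the most "central" token is the semi-stop word, which is exactly the worry about querying with the top terms of a single topic.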
Then if someone has many different topical interests, LDA might over-represent whichever topic has the plurality of text dedicated to it. For example, if my starred repos are something like 30% Fortran, 30% JavaScript, 40% Java, I believe your algorithm's queries will mostly contain Java terms. This seems to run counter to the goal of using LDA, which would be (to my understanding) to identify these latent topics and give relevant queries for each one, or for combinations of them.
I think a good way to address this would be to implement some way to change the default number of topics. One approach may be to use a trained (perhaps on github instances itself) word2vec instance to determine the "spread" of the incoming tokens: you could construct cliques based on pairwise distance between vectors and do something with that (let k be the number of cliques, or the number of cliques of size greater than m, etc.).
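As a rough sketch of that idea, here the number of topics k is taken from the number of connected components of a distance-threshold graph, a cheap stand-in for the clique construction above; the vectors are invented rather than coming from a trained word2vec instance:

```python
from math import dist

def count_clusters(vectors, threshold=0.3):
    """Count groups of tokens whose vectors sit close together,
    via union-find over pairs within the distance threshold."""
    tokens = list(vectors)
    parent = {t: t for t in tokens}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    for i, a in enumerate(tokens):
        for b in tokens[i + 1:]:
            if dist(vectors[a], vectors[b]) < threshold:
                parent[find(a)] = find(b)

    return len({find(t) for t in tokens})

# Two tight groups of invented token vectors -> suggests num_topics=2.
toy = {"js": (0.0, 0.0), "node": (0.1, 0.0),
       "fortran": (1.0, 1.0), "hpc": (0.9, 1.0)}
k = count_clusters(toy)
print(k)  # → 2
```

The returned k could then be fed to the LDA call instead of hard-coding a single topic.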
A different approach might be to precompute the vector average of each github repo. Then you could perform richer comparisons directly to documents (e.g. compare each clique's centroid to the repos) without directly querying github for tons of repos.
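A minimal sketch of that comparison, with invented repo vectors and a hand-written cosine similarity (a real version would precompute the averaged vectors offline):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return num / den

# Precomputed (here: invented) average word vectors per repo.
repo_vectors = {
    "graphics-engine": (0.9, 0.1),
    "web-framework":   (0.1, 0.9),
}

# e.g. the centroid of one clique of the user's interest tokens
query_centroid = (0.8, 0.2)

best = max(repo_vectors,
           key=lambda r: cosine(repo_vectors[r], query_centroid))
print(best)  # → "graphics-engine"
```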
Very interesting. I've implemented something similar[1] using a pure Collaborative Filtering approach[2][3], that I think works better for me, but it's unable to recommend unpopular repositories.
The New York Times recommender system uses a hybrid approach (Content Based + Collaborative Filtering) called Collaborative Topic Modeling on top of LDA[4]. It would be interesting to try that.
Really nice links... Thanks! Will take a look at it and compare and contrast for better tuning. :) Feel free to do the same and raise some pull requests :)
I appreciate the effort, but it is giving me, a Scala guy and an author of books on subjects such as distributed computing and Akka, links to PHP libraries. I think the ranking should weight language interest very highly.
I personally play around with a lot of tools and languages and wanted to make sure that a language barrier didn't limit the tool. In the sense that if there are a lot of repositories out there in JS for something I am starting in Python, I may want to take a look at why so many people chose to do it in JS instead of Python, and then make my decision based on that.
This would definitely be lost if I filtered based on languages. In its current form, it gives importance to the topic and not the language, for this specific reason.
There's a variant of LDA called "structured LDA" that allows you to introduce a linear combination of "structural variables" into the topic distribution prior. If you wanted to make this project "language-aware", this might be a good use case for sLDA.
Does this only recommend awesome-style repos? I got about 20 of those despite having over 400 stars, the majority in computer graphics. Also a lot of Ruby, a language I don't use (less than 1% of my stars). Odd.
The problem is this: the suggestions are generated using the repo descriptions as the starting point, dropping all references to language and all non-English words. What this means is that even if you have a lot of stars on Ruby repositories, if the search on terms derived from that process leads to non-Ruby repositories, that is what is shown. Will look into tuning it better.
I'd suggest adding a filter at the top of the results to filter by language, i.e. a checkbox next to each language found in the results to specify which ones to keep.
OTOH, I kind of like the idea that something like this could lead me to discover a programming language that is better suited to the subjects that I am interested in.
Well, not sure why, but it definitely didn't work for me. Random would have outperformed it. I like the idea, though; suggested GitHub repos would be fun to have a look at.
Take a look at the repositories suggested; you may find something interesting after some digging, or add a new interest to your list. After some more starring, the tool might actually start suggesting repositories to your liking.
Also, if you find something that seems badly off, feel free to create issues on the repo and I will try to see how it can be better tuned/filtered.
Cool project! Thanks for publishing and sharing it.
It'd be interesting to know what topic terms it produces for each of my repos. It looks like it's taking all the repo descriptions, producing a topic model over that corpus with a single topic (`LdaModel(num_topics=1)`), and retrieving the top N terms for that topic. Those topic terms will be the most frequent words from the topic, so I think this will end up producing the most frequent words from the cleaned token set.
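If that reading is right, a plain frequency count over the cleaned tokens should approximate the same top-N ranking. A toy check of the intuition, with made-up tokens:

```python
from collections import Counter

# Cleaned tokens from (hypothetical) starred-repo descriptions.
tokens = ["parser", "json", "parser", "cli", "json", "parser", "rust"]

# With a single topic, the topic's top terms track corpus frequency,
# so a Counter gives roughly the same ranking as LdaModel(num_topics=1).
top_terms = [w for w, _ in Counter(tokens).most_common(2)]
print(top_terms)  # → ["parser", "json"]
```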
I'd be curious to see what happens if you could run LDA over the full dataset, produce multiple topics, and suggest repos based on those topics. This would be a pretty fun extension to the project!
If you're just running LDA over the repo description (and not looking into the content of any file, e.g. README), might http://ghtorrent.org/ be able to provide this?
Or, it might be interesting to try producing a vector representation per repo by taking the description (and readme?), and doing something like: produce word vectors for each word, and sum the word vectors. https://spacy.io/ is a nice-to-use library that could help here.
Once you have a vector representation for each repo, a distance metric such as cosine similarity could find related repos. Or (depending on the dataset size / performance) an approximation like spill trees or an LSH forest.
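For the approximation route, here's a toy sketch of random-hyperplane LSH: each vector is hashed by the sign of its dot product with a few hyperplanes, and only vectors sharing a bucket need an exact comparison. The hyperplanes and repo vectors are invented (and fixed, rather than random, so the example is reproducible):

```python
# Fixed "random" hyperplanes; a real implementation would sample them.
hyperplanes = [(1.0, -1.0), (0.5, 0.5)]

def lsh_key(vec):
    """Hash a vector to a tuple of sign bits, one per hyperplane."""
    return tuple(int(sum(h * v for h, v in zip(plane, vec)) >= 0)
                 for plane in hyperplanes)

repos = {
    "graphics-engine": (0.9, 0.1),
    "shader-toolkit":  (0.8, 0.3),
    "web-framework":   (0.1, 0.9),
}

buckets = {}
for name, vec in repos.items():
    buckets.setdefault(lsh_key(vec), []).append(name)

# Similar vectors land in the same bucket.
print(buckets)
```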
I am so curious about your implementation. For instance, what sort of preprocessing did you have to carry out? I wrote a script some time back to analyze Paul Graham's essays (link: https://github.com/futureUnsure/pg-essay-lda), and had to remove dates and times because they appeared a lot and distorted the top topics. I'm wondering if you had to do something similar for text that describes code?
Also, did you write an LDA library yourself or did you leverage an existing library?
I apologize in advance if my questions sound naive/stupid, am just a noob...
Thanks, I am using the gensim package for LDA. In a nutshell:
1. Get descriptions of repos the user is interested in
2. Cleanup/filtering/tokenization
3. Use LDA to generate topics
4. Use the topics to search for repositories GitHub can provide.
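A rough sketch of those four steps, with placeholder names, a simple frequency count standing in for gensim's LDA, and an invented stopword list, descriptions, and query format:

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "for", "in", "of", "and", "with"}

def clean_tokens(descriptions):
    # Step 2: cleanup / filtering / tokenization
    tokens = []
    for desc in descriptions:
        for word in re.findall(r"[a-z]+", desc.lower()):
            if word not in STOPWORDS:
                tokens.append(word)
    return tokens

def top_topic_terms(tokens, n=5):
    # Step 3: stand-in for the LDA call; with a single topic the
    # top terms track frequency, so a Counter approximates it here.
    return [w for w, _ in Counter(tokens).most_common(n)]

# Step 1: descriptions of starred repos (hypothetical examples)
descriptions = ["A fast JSON parser", "JSON tooling for the CLI",
                "Streaming JSON parser in C"]

terms = top_topic_terms(clean_tokens(descriptions))

# Step 4: the terms would be joined into a GitHub search query
query = "+".join(terms)
print(query)
```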