
Cool.

What is the structure of the sent.model file inside the corpusEN.bin zip archive?

It's a strikingly small file for something called a corpus. If I had a larger corpus, or a corpus in a different language, how would I go about building one of these sent.model files with more data?



The corpusEN.bin file is the pre-trained sentence model provided by OpenNLP (http://opennlp.sourceforge.net/models-1.5/), which I use only to split text into sentences. It is not the training data used for summarization.
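
For anyone curious what is actually inside the archive, here is a minimal sketch that lists the entries of a zip file using only Python's standard zipfile module. The helper names (list_archive_entries, build_demo_archive) are hypothetical, and the demo archive is a made-up stand-in: apart from sent.model, which is mentioned above, the entry names and contents are assumptions and not the real layout of corpusEN.bin.

```python
import io
import zipfile


def list_archive_entries(source):
    """Return (name, uncompressed_size) pairs for each entry in a zip archive.

    `source` may be a filesystem path or a file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def build_demo_archive():
    """Build a tiny in-memory zip standing in for corpusEN.bin.

    The entries and their contents are invented for illustration only.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("manifest.properties", "Language=en\n")
        archive.writestr("sent.model", b"placeholder bytes")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, size in list_archive_entries(build_demo_archive()):
        print(f"{name}\t{size} bytes")
```

To look at the real file, pass its path instead of the demo buffer, e.g. list_archive_entries("corpusEN.bin"), assuming the file sits in the current directory.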



