Cool. What is the structure of the sent.model file inside the corpusEN.bin zip a... | Hacker News


natch on Oct 12, 2013 | on: TextTeaser – An automatic summarization algorithm

Cool. What is the structure of the sent.model file inside the corpusEN.bin zip archive? It's a strikingly small file for something called corpus. Say I have a larger corpus, or a corpus in a different language, how would I go about building one of these sent.model files with more data?

MojoJolo on Oct 12, 2013

The corpusEN.bin file is the training data provided by OpenNLP which I used to split sentences (http://opennlp.sourceforge.net/models-1.5/). It's not the training data used for summarization.
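Since corpusEN.bin is described above as a zip archive containing sent.model, one quick way to see what is in it is to list the archive entries. The following is only a rough sketch using Python's standard zipfile module. The helper name list_archive_entries is made up for illustration, and it assumes the file really is a plain zip readable without a password. It shows entry names and sizes only, not the internal format of sent.model.

```python
import zipfile


def list_archive_entries(path):
    """Return (entry name, uncompressed size in bytes) for each entry in a zip archive.

    Hypothetical helper for illustration; `path` is wherever your copy
    of the archive (for example corpusEN.bin) lives.
    """
    with zipfile.ZipFile(path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]
```

Calling list_archive_entries("corpusEN.bin") on a local copy should list sent.model alongside whatever else is packaged with it, which at least confirms how small the model file is.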
