Dialing down the memory constraint

Nov 15, 2012 at 2:04 PM

I see that you mentioned in the footnotes that it would not be difficult to lower the required memory from 90GB, but would you mind elaborating? I'm looking to lower the RAM requirement to about 10 to 12 GB.

Nov 16, 2012 at 3:03 AM

Parsing the pages is not memory intensive in itself, because the parser runs inside a 4MB memory manager (it's a class in the project).

During that step we keep track of all unique words and keep the content of each page as an integer stream (where each integer corresponds to a word in the word dictionary). These structures consume a lot of memory.
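To make the layout concrete, here is a minimal Python sketch of a word dictionary plus per-page integer streams. It is not the project's actual code: the `WordDictionary` class and its method names are hypothetical, and whitespace splitting stands in for the real tokenizer.

```python
from array import array


class WordDictionary:
    """Hypothetical helper: assigns each unique word a small integer id."""

    def __init__(self):
        self.word_to_id = {}   # word -> id
        self.id_to_word = []   # id -> word

    def encode(self, words):
        """Turn a sequence of words into a compact stream of integer ids."""
        ids = array("I")       # unsigned ints, far smaller than a list of str
        for word in words:
            word_id = self.word_to_id.get(word)
            if word_id is None:
                # First time we see this word: give it the next free id.
                word_id = len(self.id_to_word)
                self.word_to_id[word] = word_id
                self.id_to_word.append(word)
            ids.append(word_id)
        return ids

    def decode(self, ids):
        """Map a stream of ids back to the original words."""
        return [self.id_to_word[i] for i in ids]


vocab = WordDictionary()
page_a = vocab.encode("the cat sat on the mat".split())
page_b = vocab.encode("the dog sat".split())
```

Every page shares the same dictionary, so a repeated word costs one integer per occurrence. The dictionary grows only with the number of unique words, while the streams grow with the total text, which is why the streams are the part worth moving out of memory.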

You could instead send each decoded page to a separate thread that writes it to disk and then frees the memory. Then you would only have to keep the word dictionary in memory, which is much smaller.
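A rough Python sketch of that writer thread, under some assumptions: pages arrive already encoded as `array("I")` integer streams, they all go into one file with a length prefix per page, and the names `writer_loop`, `spill_pages`, `read_pages` and `max_pending` are made up for illustration. The bounded queue is my addition so the parser blocks instead of piling up pages if the disk is slow.

```python
import queue
import threading
from array import array


def writer_loop(path, pages):
    """Drain encoded pages from the queue and append them to one file."""
    with open(path, "wb") as out:
        while True:
            ids = pages.get()
            if ids is None:                      # sentinel: parsing is done
                break
            array("I", [len(ids)]).tofile(out)   # length prefix
            ids.tofile(out)                      # the page's integer stream
            # No reference to the page is kept, so its memory can be freed.


def spill_pages(encoded_pages, path, max_pending=64):
    """Feed pages to the writer thread; at most max_pending wait in memory."""
    pages = queue.Queue(maxsize=max_pending)
    writer = threading.Thread(target=writer_loop, args=(path, pages))
    writer.start()
    for ids in encoded_pages:
        pages.put(ids)       # blocks while the queue is full
    pages.put(None)
    writer.join()


def read_pages(path):
    """Stream the pages back from disk one at a time for later steps."""
    with open(path, "rb") as f:
        while True:
            header = array("I")
            try:
                header.fromfile(f, 1)
            except EOFError:                     # clean end of file
                return
            ids = array("I")
            ids.fromfile(f, header[0])
            yield ids
```

With this shape, only the word dictionary and the few pages waiting in the queue are in memory at any moment. Later steps can iterate over `read_pages(path)` instead of a list of pages held in RAM.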

The other step that takes a lot of memory is when we calculate the term frequency dictionaries, the dictionaries of unique bigrams, and the set of concepts in which each of these appears. You could instead do two passes. In the first pass you count how often each term and bigram appears, then get rid of the rare words and bigrams. In the second pass you build the concept set only for the surviving terms and bigrams.
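The two passes could look roughly like this in Python. This is only a sketch under assumptions, not the project's code: each page is treated as one concept, a bigram is taken to be two adjacent word ids, and `read_concepts`, `min_count` and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def terms_and_bigrams(ids):
    """Yield each term as a 1-tuple and each adjacent pair as a 2-tuple."""
    for i, term in enumerate(ids):
        yield (term,)
        if i + 1 < len(ids):
            yield (term, ids[i + 1])


def two_pass_index(read_concepts, min_count):
    """read_concepts() must return a fresh iterator of (concept_id, ids)."""
    # Pass 1: only counts, one integer per distinct term or bigram.
    counts = Counter()
    for _, ids in read_concepts():
        counts.update(terms_and_bigrams(ids))

    # Drop the rare terms and bigrams before building anything bigger.
    survivors = {key for key, n in counts.items() if n >= min_count}
    del counts

    # Pass 2: concept sets, but only for the survivors.
    concepts = defaultdict(set)
    for concept_id, ids in read_concepts():
        for key in terms_and_bigrams(ids):
            if key in survivors:
                concepts[key].add(concept_id)
    return concepts
```

For example, `two_pass_index(lambda: iter(pages.items()), min_count=2)` works on a small dict of pages. On the full data, `read_concepts` would re-read the pages from disk on each call. The saving comes from never allocating a concept set for a term or bigram that would be thrown away anyway.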