Manual for WikiPrep#

The wikiprep# program is written entirely in C# and based on .NET 4.0. It processes Wikipedia dumps, which are available here:

http://dumps.wikimedia.org/enwiki/latest/

Download the file enwiki-latest-pages-articles.xml.bz2.

You should expand the file to get the uncompressed XML file (you can use 7-Zip, for example). While the program can handle the original bz2-compressed file, processing will be significantly slower.[1]

The processing time for the uncompressed Wikimedia dump from October 2012 (about 38 GB) is less than 6 hours using 14 cores on a 2010 Intel Nehalem server.

The program requires a large amount of memory: processing the October 2012 Wikimedia dump has a peak memory demand of about 90 GB.[2]

To install the program, use the 64-bit installation program (it will not install on 32-bit Windows). The installer adds the program directory to the PATH variable.

The source code for the project is provided as a Visual Studio 2010 solution.

wikiprep# is significantly faster than the original wikiprep Perl program by Evgeniy Gabrilovich.[3] We achieve this processing speed through a number of optimizations:

  • The program is multi-threaded.
  • We wrote our own memory manager for pages that are read from the XML file into memory. A single worker thread feeds pages to the memory manager, which holds about 4 MB of pages. This is typically enough to store a few dozen pages at a time.
  • Other worker threads extract pages from the memory manager for processing.
  • All worker threads operate on the same 4 MB of memory at all times; this allows the CPU to cache the memory manager and eliminates the need for garbage collection.
  • We wrote our own linear text processor, which can process any HTML or Wikipedia-type XML page to extract tag information and the body of articles. The text processor analyzes the document (as a byte array) in a single pass, which eliminates the need for regular expressions and other expensive string operations; a minimal sketch of this idea follows this list. We found that text processing is the single greatest bottleneck in achieving a high degree of parallelization.
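
To illustrate the last point, here is a minimal sketch of a single forward pass over a page buffer that extracts tag contents without regular expressions. This is not the actual wikiprep# code; the class and method names are made up for illustration.

using System.Text;

// Sketch only: scan a page buffer once, left to right, and pull out the
// contents of the first <title>...</title> pair as a string.
static class LinearScanSketch
{
    public static string ExtractTitle(byte[] page)
    {
        byte[] open = Encoding.UTF8.GetBytes("<title>");
        byte[] close = Encoding.UTF8.GetBytes("</title>");

        int start = IndexOf(page, open, 0);
        if (start < 0) return null;
        start += open.Length;

        int end = IndexOf(page, close, start);
        if (end < 0) return null;

        return Encoding.UTF8.GetString(page, start, end - start);
    }

    // Forward search for a byte pattern, starting at 'from'; no string
    // allocations and no regular expressions are involved.
    private static int IndexOf(byte[] haystack, byte[] needle, int from)
    {
        for (int i = from; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }
}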

 


Usage

The program should be used from a command-line window. Use wikiprep --help to show all the command-line options:

WIKIPREP -wiki <path> [-threads <int>] [-wordthreshold <int>] [-bigramthreshold <int>] [-minconceptlength <int>] [-minconstructs <int>] [-debug]

-wiki <path>: Path to the Wikimedia XML file (required)

-threads <int>: Number of worker threads for wiki decoding (default is 2)

-wordthreshold <int>: Number of concepts a word has to appear in for inclusion (default is 3)

-bigramthreshold <int>: Number of concepts a bigram has to appear in for inclusion (default is 3)

-minconceptlength <int>: Minimum length of a concept for inclusion (default is 100)

-minconstructs <int>: Minimum number of inlinks/outlinks for a concept to be included (default is 5)

-debug: Output extra debug information (optional)

The only required argument is the path to the Wikimedia XML file. The number of threads should not exceed the number of physical cores. Note that you can achieve significantly faster processing by increasing the number of worker threads on a multi-processor, multi-core machine.

The default thresholds are chosen to exclude short concepts, concepts with few inlinks and outlinks, and rare words.
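
For example, a typical invocation on an 8-core machine might look like this (the path to the dump file is illustrative):

WIKIPREP -wiki D:\dumps\enwiki-latest-pages-articles.xml -threads 8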

Output files

The program produces a number of tab-delimited text files with variable-length rows. The notation {} indicates a repeating group of fields. A short sketch of how these files can be read back appears after the listing below.

 

words.txt

unique stemmed word (Porter stemmer), wordid, # of concepts the word appears in (this file includes stopwords)

wordfrequency.txt

word, wordid, # of concepts, # of bigrams, {bigram second word, bigram second wordid, # of concepts bigram appears in}

(this file excludes stopwords)

wordfrequency_details.txt

wordid, # of concepts, {conceptid, # of times word appears in concept}, # of bigrams, {bigram second wordid, # of concepts bigram appears in, {conceptid, # of times word appears in concept}}

(this file excludes stopwords)

concepts.txt

Concept title, conceptid, number of words in concept

(disambiguation pages and category pages are excluded)

concept_titles.txt

conceptid, list of wordids making up the concept title (-1 indicates a nonword token; integers < -1 encode natural numbers, where a stored value X translates into -1-X, so for example -6 encodes 5)

concept_word.txt

conceptid, {wordids}

concept_outlinks.txt

conceptid, {conceptids of all outlinks}

concept_categories.txt

conceptid, {categoryids}

categories.txt

category name, categoryid

categories_titles.txt

categoryid, list of wordids making up the category title (-1 indicates a nonword token; integers < -1 encode natural numbers, where a stored value X translates into -1-X)

categories_parentmatrix.txt

categoryid, {list of child categoryids}

disambiguation.txt

disambiguation text, {list of concept ids}

redirects.txt

text, mapped conceptid
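
The sketch below shows one way these files could be read back, assuming the tab-delimited layout described above. It is not part of wikiprep#; the class and method names are made up, and it uses concept_outlinks.txt and the *_titles.txt encoding as examples.

using System.Collections.Generic;
using System.IO;

// Sketch only: read concept_outlinks.txt (conceptid followed by the
// conceptids of all outlinks, one concept per tab-delimited row) and
// decode a single token of a *_titles.txt word list.
static class OutputReaderSketch
{
    public static Dictionary<int, int[]> ReadConceptOutlinks(string path)
    {
        var outlinks = new Dictionary<int, int[]>();
        foreach (string line in File.ReadLines(path))
        {
            string[] fields = line.Split('\t');
            if (fields.Length == 0 || fields[0].Length == 0) continue;

            int conceptId = int.Parse(fields[0]);
            var targets = new List<int>();
            for (int i = 1; i < fields.Length; i++)
                if (fields[i].Length > 0) targets.Add(int.Parse(fields[i]));

            outlinks[conceptId] = targets.ToArray();
        }
        return outlinks;
    }

    // Decodes one entry of a title word list: -1 is a nonword token,
    // values below -1 encode the natural number -1 - X (e.g. -6 encodes 5),
    // and non-negative values are ordinary wordids.
    public static string DescribeTitleToken(int x)
    {
        if (x == -1) return "nonword";
        if (x < -1) return "number " + (-1 - x);
        return "wordid " + x;
    }
}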

 


[1] We did not try to optimize the processing of bz2 files, since a 7-Zip decompression takes only a few minutes. We believe the bottleneck when using bz2 files is the DotNetZip library, which uses a single thread for decompression and seems to starve the worker threads.

[2] It shouldn’t be difficult to optimize the program to use less memory by writing more intermediate files to disk.

[3] Our project does not use any of Gabrilovich's code but is written from scratch. Our program produces different output files; however, it should be possible to map our files to Gabrilovich's format if required.
