mimno edited this page Apr 3, 2020 · 3 revisions

MALLET represents data as lists of "instances". All MALLET instances include a data object. An instance can also include a name and (in classification contexts) a label. For example, if the application is guessing the language of web pages, an instance might consist of a vector of word counts (data), the URL of the page (name) and the language of the page (label).

For information about the MALLET data import API, see the data import developer's guide.

There are two primary ways to import data into MALLET format: one for source data spread across many separate files, and one for data contained in a single file with one instance per line.

One instance per file: After downloading and building MALLET, change to the MALLET directory. Assume that text-only (.txt) versions of English web pages are in files in a directory called sample-data/web/en and text-only versions of German pages are in sample-data/web/de. Now run this command:

bin/mallet import-dir --input sample-data/web/* --output web.mallet

MALLET will use the directory names as labels and the filenames as instance names. Note: make sure you are in the mallet directory, not the mallet/bin directory; otherwise you will get a ClassNotFoundException.
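To make the directory-to-label mapping concrete, here is a minimal Python sketch of the same idea. It is not MALLET code: the function name read_labeled_dirs, the file names, and the choice to use the full file path as the instance name are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def read_labeled_dirs(class_dirs):
    """Yield (name, label, text) for every file in each class directory.

    Assumption: the label is the directory's base name and the instance
    name is the file path, mirroring the behaviour described above.
    """
    for class_dir in sorted(Path(d) for d in class_dirs):
        for path in sorted(class_dir.iterdir()):
            if path.is_file():
                yield str(path), class_dir.name, path.read_text(encoding="utf-8")


# Build a tiny stand-in for sample-data/web/{en,de} in a temporary directory.
root = Path(tempfile.mkdtemp())
(root / "en").mkdir()
(root / "de").mkdir()
(root / "en" / "page1.txt").write_text("the quick brown fox", encoding="utf-8")
(root / "de" / "seite1.txt").write_text("der schnelle braune fuchs", encoding="utf-8")

instances = list(read_labeled_dirs([root / "en", root / "de"]))
for name, label, text in instances:
    print(label, Path(name).name, text)
```

Each yielded triple corresponds to one instance: the text is the data, the path is the name, and the parent directory is the label.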

One file, one instance per line: Assume the data is in the following format:

[URL]  [language]  [text of the page...]

After downloading and building MALLET, change to the MALLET directory and run the following command:

bin/mallet import-file --input /data/web/data.txt --output web.mallet

In this case, the first token of each line (whitespace delimited, with optional comma) becomes the instance name, the second token becomes the label, and all additional text on the line is interpreted as a sequence of word tokens. Note that the data in this case will be a vector of feature/value pairs, such that a feature consists of a distinct word type and the value is the number of times that word occurs in the text.
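The line layout described above can be sketched in Python as follows. This is an illustration under stated assumptions, not MALLET's implementation: the line pattern is modelled on the description (name, label, remaining text), the example URL is hypothetical, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
import re
from collections import Counter

# Assumed line pattern: name, label, then the rest of the line as text,
# with whitespace or commas allowed between the first three fields.
LINE_PATTERN = re.compile(r"^(\S*)[\s,]*(\S*)[\s,]*(.*)$")


def parse_line(line):
    """Split one input line into (name, label, word_counts)."""
    match = LINE_PATTERN.match(line.rstrip("\n"))
    name, label, text = match.groups()
    # Simplification: lowercase and split on whitespace rather than
    # applying a token regular expression.
    counts = Counter(text.lower().split())
    return name, label, counts


# Hypothetical input line: URL, language label, page text.
name, label, counts = parse_line("http://example.com/a.html en The cat saw the other cat")
print(name, label, dict(counts))
```

The resulting Counter plays the role of the feature vector: each distinct word type is a feature and its count is the value.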

There are many additional options to the import-dir and import-file commands. Add the --help option to either of these commands to get a full list. Some commonly used options to either command are:

--keep-sequence. This option preserves the document as a sequence of word features, rather than a vector of word feature counts. Use this option for sequence labeling tasks. The MALLET topic modeling toolkit also requires feature sequences rather than feature vectors.

--preserve-case. MALLET by default converts all word features to lowercase; this option keeps the original capitalization.

--remove-stopwords. This option tells MALLET to ignore a standard list of very common English adverbs, conjunctions, pronouns and prepositions. There are several other options related to stopword specification.

--token-regex. MALLET divides documents into tokens using a regular expression. By default, a token must begin and end with Unicode letter characters and can include Unicode letters and punctuation. Other options include:

  • For non-English text, a good choice is --token-regex '[\p{L}\p{M}]+', which means Unicode letters and marks (required for Indic scripts). MALLET currently does not support Chinese or Japanese word segmentation.
  • If you would like to include punctuation inside tokens (for example contractions like "don't" and internet addresses), you might use --token-regex '[\p{L}\p{P}]*\p{L}', which means any sequence of letters or punctuation marks that ends in a letter. Note that this will include quotation marks at the beginning of words.
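The effect of the two token patterns above can be explored with a short Python sketch. Python's standard re module does not support \p{...} classes, so this sketch swaps in the unicodedata module and groups characters by Unicode general category instead of running the regular expressions themselves. The function tokenize and its parameters are hypothetical names introduced for illustration.

```python
import unicodedata


def tokenize(text, allowed, trim_trailing=""):
    """Return maximal runs of characters whose Unicode general category
    begins with a letter in `allowed`, after dropping trailing characters
    whose category begins with a letter in `trim_trailing`."""
    tokens, run = [], []
    for ch in text + " ":  # the trailing space flushes the last run
        if unicodedata.category(ch)[0] in allowed:
            run.append(ch)
            continue
        # End of a run: trim trailing characters, then keep what is left.
        while run and unicodedata.category(run[-1])[0] in trim_trailing:
            run.pop()
        if run:
            tokens.append("".join(run))
        run = []
    return tokens


# The word "Hindi" in Devanagari, written with escapes to avoid encoding issues.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"

# Letters only: combining marks break the word into pieces.
letters_only = tokenize(hindi, "L")
# Letters and marks, analogous to [\p{L}\p{M}]+: the word stays whole.
letters_and_marks = tokenize(hindi, "LM")
# Letters and punctuation ending in a letter, analogous to [\p{L}\p{P}]*\p{L}.
with_punct = tokenize('He said "don\'t" stop.', "LP", trim_trailing="P")

print(letters_only, letters_and_marks, with_punct)
```

In the last call the contraction keeps its leading quotation mark but loses the trailing one, which matches the caveat in the second bullet.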

SVMLight format: SVMLight-style data in the format

target feature:value feature:value ...

can be imported with

bin/mallet import-svmlight --input train test --output train.mallet test.mallet

The --input and --output arguments can each take multiple files, which are processed together using the same Pipe. The target and feature fields can be either indices or strings. If they are indices, the indices in the MALLET alphabets may differ from those in the file, though the data is equivalent. Real-valued targets are not supported.
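The following Python sketch shows one way to read such lines and why file indices and alphabet indices can diverge. It is not MALLET's code: the Alphabet class is a hypothetical stand-in, and assigning indices in first-seen order is an assumption chosen to illustrate the remapping. Comment handling and other SVMLight details are omitted.

```python
def parse_svmlight_line(line):
    """Split 'target feature:value ...' into (target, {feature: value})."""
    target, *pairs = line.split()
    features = {}
    for pair in pairs:
        feature, _, value = pair.rpartition(":")
        features[feature] = float(value)
    return target, features


class Alphabet:
    """Hypothetical stand-in for an alphabet: maps each new entry to the
    next dense index, in the order entries are first seen."""

    def __init__(self):
        self.index = {}

    def lookup(self, entry):
        return self.index.setdefault(entry, len(self.index))


lines = ["1 7:1.0 3:2.0", "0 3:1.0 9:4.0"]
feature_alphabet = Alphabet()
remapped_rows = []
for line in lines:
    target, features = parse_svmlight_line(line)
    remapped = {feature_alphabet.lookup(f): v for f, v in features.items()}
    remapped_rows.append((target, remapped))
    print(target, remapped)
```

Here the file's feature 7 becomes internal index 0 and feature 3 becomes index 1, yet every value is still attached to the same underlying feature.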

SVMLight format for sequence input: In some cases we want to convert feature:count pairs to sequences of tokens, for example as topic model input. This tool assumes that strings have already been converted to integer IDs. There are two input files. The first is the vocabulary, one word per line, ordered by word ID starting at 0:

cat
dog
rabbit

The second contains the actual documents, one per line. By default each line has positions for the instance name and label; these can be turned off by changing the --line-regex format and the command-line options that specify which capturing groups in the line regex correspond to which fields. Each data segment must start with the number of distinct words in the document:

1	Y	2 1:5 2:4
2	N	3 0:3 1:6 2:1
3	Y	2 0:4 2:2

The first instance has name 1, label Y, and two distinct words: five tokens of word ID 1 (dog) and four tokens of word ID 2 (rabbit). To run:

bin/mallet run cc.mallet.classify.tui.MultFileToSequences --input test.docs --vocabulary test.vocab --output test.seq 

To confirm correct formatting (note word order is randomized):

bin/mallet info --print-instances --input test.seq
1	Y	0: rabbit (2)
1: dog (1)
2: dog (1)
3: rabbit (2)
4: rabbit (2)
5: dog (1)
6: dog (1)
7: dog (1)
8: rabbit (2)
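The conversion from feature:count pairs to a token sequence can be sketched in Python as follows. This is an illustration of the idea only, not the tool's implementation; the function name expand_document is hypothetical, and the line is assumed to be tab-separated into name, label, and data as in the example above.

```python
import random
from collections import Counter


def expand_document(line, vocabulary, rng):
    """Turn 'name<TAB>label<TAB>n id:count ...' into (name, label, tokens)."""
    name, label, data = line.rstrip("\n").split("\t")
    fields = data.split()
    num_types, pairs = int(fields[0]), fields[1:]
    if len(pairs) != num_types:
        raise ValueError("declared number of distinct words does not match the data")
    tokens = []
    for pair in pairs:
        word_id, count = pair.split(":")
        # Repeat the vocabulary word once per counted occurrence.
        tokens.extend([vocabulary[int(word_id)]] * int(count))
    rng.shuffle(tokens)  # the original word order is unknown, so randomize it
    return name, label, tokens


vocabulary = ["cat", "dog", "rabbit"]
name, label, tokens = expand_document("1\tY\t2 1:5 2:4", vocabulary, random.Random(0))
print(name, label, Counter(tokens))
```

As in the printed instance above, the result has nine tokens in total: five of dog and four of rabbit, in an arbitrary order.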