Fxt consists of a number of components, which we cover briefly here to give an overview of how to set up and configure a feature extraction pipeline.
There are two main components within the pipeline:
- Indexing
- Feature extraction
The first step is to build an index. Currently, an existing Indri index is required.
The Indri index will need to have the following fields indexed:
- mainbody
- heading
- inlink
- body
- title
- table
- td
- a
- applet
- object
- embed
Assuming you already have an Indri index at the path qs-indri, you can build a Fxt index with the following:
indexer qs-indri myindex
generate_static_doc_features qs-indri myindex/static_doc
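If you want to script the two indexing steps, a minimal Python sketch could look like the following. It assumes that `indexer` and `generate_static_doc_features` are on your `PATH`; the function names `build_index_commands` and `build_fxt_index` are hypothetical helpers for this example and are not part of Fxt.

```python
import subprocess


def build_index_commands(indri_index, fxt_index):
    """Return the two indexing commands as argument lists, in run order."""
    return [
        ["indexer", indri_index, fxt_index],
        ["generate_static_doc_features", indri_index, f"{fxt_index}/static_doc"],
    ]


def build_fxt_index(indri_index, fxt_index):
    """Run both steps in order, stopping at the first one that fails."""
    for cmd in build_index_commands(indri_index, fxt_index):
        # check=True raises CalledProcessError on a non-zero exit status,
        # so the second step never runs against a half-built index.
        subprocess.run(cmd, check=True)
```

Calling `build_fxt_index("qs-indri", "myindex")` would then run the same two commands as shown above.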
With an index created, it is now possible to run the feature extraction component via the extractor program. To do this we need to:
- Create a query file in the right format:
  The query format is a text file with one entry per line as <query id>;<query terms>. For example, a file with three queries looks like:
      51;horse hooves
      52;avp
      53;discovery channel store
- Apply stemming to the queries if the original Indri index used stemming:
  If there was no stemming configured, then you can skip this step. For indexes that use stemming, the queries need to be stemmed manually. For example, if the Krovetz stemmer was used, then you can use the kstem program to apply stemming:
      sed 's/;/ /' queryfile | kstem | sed 's/ /;/' > queryfile.kstem
- Configuration options via INI file:
  The path to the Fxt index and the features to be extracted are configured via an INI configuration file. Each available feature can be configured individually, and a feature is extracted only if it is explicitly enabled. See the example configuration file.
- Labels:
  Instance labels need to be converted into a specific format (qrels). The label file is a text file with one label entry per line as <query id> 0 <docid> <label>. For example, a label file for the three queries mentioned earlier would look like:
      51 0 docid-1 1
      51 0 docid-2 0
      52 0 docid-7 0
      52 0 docid-2 1
      53 0 docid-1 0
      53 0 docid-8 1
- Document list (Stage0 run):
  The feature extraction process assumes that you already have a candidate set of documents for each query that you wish to extract the features for. This candidate set is stored in a run file, which uses the following format, one entry per line:
      <query id> Q0 <docid> 0 <score> <run identifier>
- Add labels to the Stage0 run file:
      ./script/label.awk labels.txt myrun.txt > stage0.run
- Perform feature extraction:
      ./extractor -c config.ini queryfile.kstem stage0.run output.csv
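Because the extractor depends on three hand-prepared text files (queries, labels, and the run file), it can help to check them before starting a long extraction. The following is a minimal Python sketch of parsers for the three formats described in the steps above. It is not part of Fxt: the function names are hypothetical, labels are assumed to be integers (as in the example label file), and the run lines in the demo are made-up sample data.

```python
from collections import defaultdict


def parse_queries(lines):
    """Parse '<query id>;<query terms>' lines into {query_id: terms}."""
    queries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first ';' only, so the query terms stay intact.
        qid, sep, terms = line.partition(";")
        if not sep or not qid or not terms:
            raise ValueError(f"bad query line: {line!r}")
        queries[qid] = terms
    return queries


def parse_labels(lines):
    """Parse '<query id> 0 <docid> <label>' lines into {(qid, docid): label}."""
    labels = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if len(parts) != 4:
            raise ValueError(f"bad label line: {line!r}")
        qid, _zero, docid, label = parts
        labels[(qid, docid)] = int(label)  # assumption: labels are integers
    return labels


def parse_run(lines):
    """Parse '<query id> Q0 <docid> 0 <score> <run identifier>' lines.

    Returns {query_id: [(docid, score), ...]} in file order.
    """
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if len(parts) != 6:
            raise ValueError(f"bad run line: {line!r}")
        qid, _q0, docid, _zero, score, _run_id = parts
        run[qid].append((docid, float(score)))
    return dict(run)


def missing_queries(queries, run):
    """Return query ids that appear in the run but not in the query file."""
    return sorted(set(run) - set(queries))


if __name__ == "__main__":
    queries = parse_queries(["51;horse hooves", "52;avp", "53;discovery channel store"])
    labels = parse_labels(["51 0 docid-1 1", "51 0 docid-2 0"])
    # Illustrative run lines only; the scores and run name are invented.
    run = parse_run(["51 Q0 docid-1 0 12.5 myrun", "54 Q0 docid-9 0 3.0 myrun"])
    print(queries["51"])
    print(labels[("51", "docid-1")])
    print(missing_queries(queries, run))
```

In practice you would pass open file handles instead of lists, for example `parse_queries(open("queryfile.kstem"))`, and treat a non-empty result from `missing_queries` as a sign that the run file and query file are out of sync.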