Quick Start Guide

Fxt consists of several components. This guide covers them briefly to give an overview of how to set up and configure a feature extraction pipeline.

There are two main components within the pipeline:

Indexing
Feature extraction

Indexing

The first step is to build an index. Currently, an existing Indri index is required.

The Indri index must have the following fields indexed:

mainbody
heading
inlink
body
title
table
td
a
applet
object
embed

Assuming you already have an Indri index at the path qs-indri, you can build an Fxt index with the following commands:

indexer qs-indri myindex
generate_static_doc_features qs-indri myindex/static_doc

Feature Extraction

With an index created, it is now possible to run the feature extraction component via the extractor program. To do this we need to:

Create a query file in the right format:

The query file is a text file with one entry per line, in the form <query id>;<query terms>. For example, a file with three queries looks like:
```
51;horse hooves
52;avp
53;discovery channel store
```
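
As a quick sanity check, a query file can be validated before it is passed on. The sketch below is not part of Fxt: the function name check_queries and the file name queries.txt are hypothetical, and the rule it checks (a non-empty id before the first ; and non-empty terms after it) is an assumption inferred from the format described above.

```shell
# Write the three example queries to a scratch file (hypothetical name).
cat > queries.txt <<'EOF'
51;horse hooves
52;avp
53;discovery channel store
EOF

# check_queries FILE (hypothetical helper, not shipped with Fxt):
# print every line that is not "<query id>;<query terms>" and
# return non-zero if at least one such line was found.
check_queries() {
    awk '
        {
            sep = index($0, ";")            # position of the first ";" (0 if absent)
            id = substr($0, 1, sep - 1)     # text before the separator
            terms = substr($0, sep + 1)     # text after the separator
            if (sep == 0 || id == "" || terms ~ /^[ \t]*$/) {
                printf "line %d is malformed: %s\n", NR, $0
                bad = 1
            }
        }
        END { exit bad }
    ' "$1"
}

check_queries queries.txt && echo "queries.txt looks well-formed"
```

Only the first ; is treated as the separator, which matches how the sed commands in the stemming step handle it.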
Apply stemming to the queries if the original Indri index used stemming:

If the Indri index was built without stemming, skip this step. Otherwise, the queries must be stemmed manually. For example, if the Krovetz stemmer was used, the kstem program can apply stemming. The command replaces the first ; on each line with a space before stemming and restores it afterwards:
```
sed 's/;/ /' queryfile | kstem | sed 's/ /;/' > queryfile.kstem
```
Configuration options via INI file:

The path to the Fxt index and the features to extract are set in an INI configuration file. A feature is extracted only if it is explicitly enabled. See the example configuration file.

Labels:

Instance labels must be converted into the qrels format. The label file is a text file with one entry per line, in the form <query id> 0 <docid> <label>. For example, a label file for the three queries above would look like:
```
51 0 docid-1 1
51 0 docid-2 0
52 0 docid-7 0
52 0 docid-2 1
53 0 docid-1 0
53 0 docid-8 1
```
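
To confirm that a label file follows this layout, it can help to count the entries per query. The sketch below is an illustration only: summarize_labels is a hypothetical helper, not an Fxt tool, and it assumes that fields are separated by whitespace and that the second column is always the literal 0.

```shell
# Write the example labels to a scratch file (file name as used in the labeling step).
cat > labels.txt <<'EOF'
51 0 docid-1 1
51 0 docid-2 0
52 0 docid-7 0
52 0 docid-2 1
53 0 docid-1 0
53 0 docid-8 1
EOF

# summarize_labels FILE (hypothetical helper):
# warn on stderr about lines that are not "<query id> 0 <docid> <label>",
# then print "<query id> <number of entries>" for each query, sorted by id.
summarize_labels() {
    awk '
        NF != 4 || $2 != "0" {
            printf "line %d is malformed: %s\n", NR, $0 > "/dev/stderr"
            next                            # skip malformed lines in the counts
        }
        { count[$1]++ }
        END { for (q in count) print q, count[q] }
    ' "$1" | sort -n
}

summarize_labels labels.txt
# prints:
# 51 2
# 52 2
# 53 2
```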
Document list (Stage0 run):

The feature extraction process assumes that you already have a candidate set of documents for each query. This candidate set is known as a run file, with one entry per line in the format <query id> Q0 <docid> 0 <score> <run identifier>.
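
If the candidate documents come from a tool that does not write this format, they can be converted with a short awk command. The sketch below assumes a hypothetical input file candidates.txt containing "<query id> <docid> <score>" per line. The scores are invented for illustration, to_run_file is not an Fxt tool, and the fourth column is written as a literal 0, following the format as stated above.

```shell
# Hypothetical candidate list: <query id> <docid> <score>
cat > candidates.txt <<'EOF'
51 docid-1 14.2
51 docid-2 11.7
52 docid-7 9.3
EOF

# to_run_file RUN_ID FILE (hypothetical helper):
# emit "<query id> Q0 <docid> 0 <score> <run identifier>" for each
# three-field input line; lines with any other field count are dropped.
to_run_file() {
    awk -v run="$1" 'NF == 3 { print $1, "Q0", $2, 0, $3, run }' "$2"
}

to_run_file myrun candidates.txt > myrun.txt
cat myrun.txt
# prints:
# 51 Q0 docid-1 0 14.2 myrun
# 51 Q0 docid-2 0 11.7 myrun
# 52 Q0 docid-7 0 9.3 myrun
```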

Add labels to the Stage0 run file:

./script/label.awk labels.txt myrun.txt > stage0.run

Perform feature extraction:

./extractor -c config.ini queryfile.kstem stage0.run output.csv
