-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathINSTRUCTIONS.txt
337 lines (279 loc) · 21.8 KB
/
INSTRUCTIONS.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V
Overview
-------------------------------------------------------
This file lists information about how to run the algorithms listed in the paper describing RedThread, and how to read the results. Algorithms and supporting code are located in the RedThread/source folder. Algorithms are run by calling the RedThread/source/runScript.py function, passing arguements to specify the algorithm, dataset, and run parameters (such as query budget, etc.) via stdin. By default, results are stored in the RedThread/resultStorage folder - this can be changed by editting the RedThread/source/nameOfFolderToSaveResultsTo.py file.
^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^
V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V
Required Packages In Order To Run Algorithms
-------------------------------------------------------
This code was run with the following versions of packages:
numpy 1.11.2
scipy 0.18.1
Python 2.7.5
This code is also required to run inside of an active git repository.
The version of git used was:
git version 1.8.3.1
This code was designed and used on linux-based systems. No guarentee is made for use on other platforms.
^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^
V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V
How To Run The Experiments (for examples, see the section "Commands to replicate runs presented in the paper for the public datasets" below)
-------------------------------------------------------
After navigating into the RedThread/source folder, the code is run by passing commands via stdin to the file runScript.py, for example:
echo <command string> | python runScript.py
The command string is a list of key-value pairs that are comma-delimited: the keys do not need to occur in any particular order. The keys are:
algorithmName:<value1>,algorithmParameters:<value2>,queryBudget:<value3>,datasetName:<value4>,seedNodesType:<value5>,typeOfRun:<value6>
Description:
algorithmName:
This is names the algorithm to be run, and can take one of the following
values:
AlgorithmRandom
AlgorithmSimpleRandomWalk
AlgorithmWeightedRandomWalk
AlgorithmRedThreadWithoutFeedback
AlgorithmRedThreadWithoutEvidenceWeights
AlgorithmRedThread
AlgorithmRedThreadWithoutModalityParameters
AlgorithmRedThreadWithoutParameters
This list can also be found in the file:
RedThread/source/runScriptInputParameters.py
which is where the algorithm names are translated into the instances
of the algorithms to run.
algorithmParameters:
This is a list of key-value pairs that are underscore-delimited: keys
do not need to occur in any particular order. Different algorithms
may have different parameters - details are listed in the subsection
"Algorithm Parameters" below.
queryBudget:
This is the maximum number of times for a given seed node than an
algorithm may recieve a label for a node it returns. This must be a
non-negative integer.
datasetName:
The name of the dataset to use. This must match one of the keys listed
in the dictionary:
datasetsToUseDict
located in the file:
RedThread/source/runScriptInputParameters.py
Details for specifying new datasets can be found in the section "How To
Add Datasets so that Algorithms Can Run on Them" below.
seedNodesType:
This parameter specifies the seed-node file to use. The value
provided must match one of the keys listed in the dictionary:
dictMappingAlgoNameToDictMappingSeedNodeDescriptionToSeedNodeFilePath[<datasetName>]
in the file:
RedThread/source/runScriptInputParameters.py
where <datasetName> is the value specified for the datasetName parameter
passed in. Currently, "normal" is the only available value of this
parameter. The intent of this parameter is to allow experimentation
with multiple different seed node sets without having to manual change
data files or paths.
typeOfRun:
This parameter essentially specifies whether the dataset being used is a
human trafficking dataset or not, and in the case it is, whether or not
a high-confidence matcher should be used. Technically, this parameter
can take one of the following values, but users of this code other
than the authors should only use the nonHT setting - the authors do not
guarentee support for other users to be able to seamlessly use any other
setting.
"nonHT":
The dataset is not a human trafficking dataset. The algorithms
will be restricted to returning at most queryBudget many nodes
per seed node.
"HT_noHCM" :
The dataset used is a human trafficking dataset, but we will
not use a high-confidence matcher. That is, the algorithm will
be allowed to return more than queryBudget many nodes, but it
will not be allowed to return more than queryBudget nodes which
we deem to *not* be near-duplicates. As always, the number of
nodes for which the algorithm recieves labels for a given seed
node is no more than queryBudget; the only additional nodes
an algorithm may return are ones we deem to be near-duplicates,
for which the algorithm recieves no label and no credit for.
"HT_HCM":
Similar to the HT_noHCM setting, this setting (HT_HCM) indicates
that the dataset in use is a human trafficking dataset and that
a near-duplicate detector will be used to filter out nodes that
algorithms return which are very similar to other known positives.
Unlike the HT_noHCM setting, however, a high-confidence matcher
will be used to query the node node. Node returned by the
algorithm are subsequently put into three categories:
near-duplicates (we'll abbrevaite as ND here),
high-confidence matches (HCM), and non-near-duplicate-non-high-confidence-matches (nonHCM).
When the algorithm returns a node, first it is run through
the near duplicate detector, then, if it is not a near-duplicate,
it is run through the high-confidence matcher, and finally, if
it is not a high-confidence matcher, we query the user / query
the ground-truth. The algorithm does not recieve credit for
ND nodes, it does receive credit for finding HCM nodes but does
not revieve labels for finding them (i.e., after the algorithm
is finished, we count if it found HCM nodes that were positive,
but we do not tell the algorithm as it runs if an HCM node is
positive or negative), and the algorithm recieves both credit
and labels for nonHCM nodes. At most queryBudget many nonHCM
nodes may be returned.
V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V
Algorithm Parameters
-------------------------------------------------------
Below is listed the algorithm available to run and thier parameters callable from the command line. The mechanisms used to extract the parameters, etc., can be seen in the parseInputParameters function present in the algorithm files at RedThread/source/Algorithms/ and some in the function getRestOfAlgorithmParametersUsedInPaper in RedThread/source/runScriptInputParameters.py. As mention above, algorithm parameters are fed into the runsScript.py as a substring of the command string, and in the following format:
algorithmParameters:<key1>:<value1>_<key2>:<value2>_....<keyN>:<valueN>
Keys do not need to be listed in any particular order.
List of algorithms and thier callable parameters:
AlgorithmRandom:
queueSize:
As used in paper, was set to 1000. For the random algorithm, this
parameter does not actually effect anything since the algorithm
does not internally use a queue.
AlgorithmSimpleRandomWalk
queueSize:
As used in paper, was set to 1000. For the simple randome walk,
this parameter does not actually effect anything since the algorithm
does not internally use a queue.
AlgorithmWeightedRandomWalk
queueSize:
As used in paper, was set to 1000. For the weighted random walk,
this parameter does not actually effect anything since the algorithm
does not internally use a queue.
AlgorithmRedThreadWithoutFeedback
queueSize:
As used in paper, was set to 1000.
AlgorithmRedThreadWithoutEvidenceWeights
clearWeightsWithNewSeedNode:
This is either 0.0 or 1.0, where 0.0 indicates that we retain the
modality weights learned going from one seed node to another, while
1.0 indicates that the modality weights are restored to uniform for
each new seed node.
learningRate:
As used in the paper, this was set to 0.3
queueSize:
As used in paper, this was set to 1000
AlgorithmRedThreadWithoutModalityParameters
queueSize:
As used in paper, this was set to 1000
AlgorithmRedThreadWithoutParameters
queueSize:
As used in paper, this was set to 1000
AlgorithmRedThread
clearWeightsWithNewSeedNode:
This is either 0.0 or 1.0, where 0.0 indicates that we retain the
modality weights learned going from one seed node to another, while
1.0 indicates that the modality weights are restored to uniform for
each new seed node.
learningRate:
As used in the paper, this was set to 0.3
queueSize:
As used in paper, this was set to 1000
^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^
^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^
V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V
How To Add Datasets so that Algorithms Can Run on Them
-------------------------------------------------------
The path to datafiles and the name under which the code should recognize them is specified in the file:
source/runScriptInputParameters.py
in the dictionaries:
datasetsToUseDict
dictMappingAlgoNameToDictMappingSeedNodeDescriptionToSeedNodeFilePath
In order to add a new datafile to run:
(1) add an entry to datasetsToUseDict listing the name to call the dataset by
(as the key) and the path to the data as the value (see the subsection
"Expected Data Format" below for details on the data expected).
(2) add an entry to dictMappingAlgoNameToDictMappingSeedNodeDescriptionToSeedNodeFilePath
(2a) The key used for this dictionary should match the one used for
datasetsToUseDict.
(2b) The value should be a dictionary mapping names for seed node files
to the path of the seed nodes one desires to use. See the description
of the seedNodesType parameter in the section "How To Run The
Experiments" for more detail.
V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V
Expected Data Format
-------------------------------------------------------
We expect the following files to be located in the top-level of a single folder,
the path to whom would be specified by the path provided in the dictionary
datasetsToUseDict. They are in expected to be in the typical csv format:
different records are delimited by newlines and fields within a record are
delimited by commas.
groundTruthNodeLabels.csv
A two column file, where the first column is the name of a target node
and the second column is the "class label" of the node. A single node
might appear multiple times with different values listed in the
second field on different lines; the values in the second field for a
single target node correspond to the members of the classes to which
the node in question belongs.
For RedThread and its variations, positives are nodes which have
class labels that are non-disjiont with the class labels of the seed
node and which have a path to the seed node in the graph such that each
target entity on the path is positive.
Any node that is not listed in the groundTruthNodeLabels.csv file is
considered negative.
nodeAttributeFile.csv
A two column file, where the first column is the name of a node and the
second column is the entity type of the node.
seedNodes.csv (this can be located in an alternate directory by specifying
the path in dictMappingAlgoNameToDictMappingSeedNodeDescriptionToSeedNodeFilePath)
A single column file where each record lists a seed node to be used in
experiments. The seed nodes are used in reverse order of how they are
listed in this file, i.e., the bottom seed is first, and the top seed
is last.
weightedAdjacencyList.csv
A three column file where the first two columns list nodes that are
adjacent in the graph. The third column is not used in processing and
only present for compatability with various versions of other code
(not included in the software release for RedThread) designed to read
it. We expect the third column to be the constant value: 1
^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^
^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^
V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V
How To Change The Location Where Results Are Saved
-------------------------------------------------------
As noted in the overview, the file RedThread/source/nameOfFolderToSaveResultsTo.py determines where the results for the algorithm are saved. This file contains a path variable that can be changed as desired. Currently, the code is set to save content at RedThread/resultStorage , so as to require less set-up from first-time users in order to us the code.
^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^
V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V
How To Read Data Into A (Postgresql) database
-------------------------------------------------------
This code produces a series of sql files containing the results for the experiment runs. These can be read into a posgres database by using the \i command in the database terminal after creating the relavent tables. The script to create the relavent tables is located at RedThread/source/createTables.sql
^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^
V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V
How To Produce The List of Positives And Negatives For Each Run
-------------------------------------------------------
After loading the data into a (Postgresql) database (see the section "How To Read Data Into A (Postgresql) database" above), run the script helper_collectResultsForEachUUID.sql , found in the source subdirectory.
^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^
V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~V~~V~V~V
Commands to replicate runs presented in the paper for the public datasets.
-------------------------------------------------------
After navigating to the RedThread/source , run the below-listed commands to reproduce the results listed in the paper on the public datasets.
# Memetracker, not resetting for each seed node
echo algorithmName:AlgorithmRandom,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmWeightedRandomWalk,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmSimpleRandomWalk,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutParameters,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutEvidenceWeights,algorithmParameters:clearWeightsWithNewSeedNode:0.0_learningRate:0.3_queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutFeedback,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutModalityParameters,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThread,algorithmParameters:clearWeightsWithNewSeedNode:0.0_learningRate:0.3_queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
# Memetracker, resetting for each seed node
echo algorithmName:AlgorithmRandom,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmWeightedRandomWalk,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmSimpleRandomWalk,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutParameters,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutEvidenceWeights,algorithmParameters:clearWeightsWithNewSeedNode:1.0_learningRate:0.3_queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutFeedback,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutModalityParameters,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThread,algorithmParameters:clearWeightsWithNewSeedNode:1.0_learningRate:0.3_queueSize:1000,queryBudget:40,datasetName:memeTracker,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
----------------------------------------------------------
# Discogs, not resetting for each seed node
echo algorithmName:AlgorithmRandom,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmWeightedRandomWalk,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmSimpleRandomWalk,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutParameters,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutEvidenceWeights,algorithmParameters:clearWeightsWithNewSeedNode:0.0_learningRate:0.3_queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutFeedback,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutModalityParameters,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThread,algorithmParameters:clearWeightsWithNewSeedNode:0.0_learningRate:0.3_queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
# Discogs, resetting for each seed node
echo algorithmName:AlgorithmRandom,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmWeightedRandomWalk,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmSimpleRandomWalk,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutParameters,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutEvidenceWeights,algorithmParameters:clearWeightsWithNewSeedNode:1.0_learningRate:0.3_queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutFeedback,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThreadWithoutModalityParameters,algorithmParameters:queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
echo algorithmName:AlgorithmRedThread,algorithmParameters:clearWeightsWithNewSeedNode:1.0_learningRate:0.3_queueSize:1000,queryBudget:40,datasetName:discogs,seedNodesType:normal,typeOfRun:nonHT | python runScript.py
^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^_^