Jon Gunderson edited this page Jul 20, 2020
These scripts are in the faeAuditor/faeData directory:
Python Script | Param 1 | Param 2 | Output | Description |
---|---|---|---|---|
adjust-audit-websites.py | audit ID | none | updated audit information | Changes http to https for websites with a low number of pages found |
audit-urls-to-csv.py | none | none | none | not sure |
test_urllib.py | none | none | none | not sure |
audit-one-group-urls-to-csv.py | none | none | none | not sure |
create_csv_from_urls.py | Text file with one URL on each line, with optional grouping ids | none | CSV file of URLs | Converts a text file of URLs to the CSV format used to populate the faeAuditor database |
populate_websites_from_csv.py | JSON-formatted file with audit and grouping information | CSV-formatted file with URL information | updated database | Populates the faeAuditor database with URLs that carry up to two grouping labels |
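The http to https adjustment described for adjust-audit-websites.py can be illustrated with a short sketch. This is not the script's actual code: the function name `upgrade_to_https`, the `pages_found` argument, and the `min_pages` threshold are all hypothetical, since the page does not say what counts as a "low number of pages".

```python
from urllib.parse import urlsplit, urlunsplit


def upgrade_to_https(url, pages_found, min_pages=2):
    """Return the URL with an https scheme when it used http and few pages were found.

    min_pages is an assumed threshold; the real script may use a different rule.
    """
    parts = urlsplit(url)
    if parts.scheme == "http" and pages_found < min_pages:
        # SplitResult is a named tuple, so _replace swaps only the scheme.
        return urlunsplit(parts._replace(scheme="https"))
    return url


print(upgrade_to_https("http://www.bc.edu/schools/lsoe/", pages_found=1))
```

URLs that already use https, or websites with enough pages found, are returned unchanged.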
The populate_websites_from_csv.py Python script creates, updates, and adds URLs to the set of URLs for an Audit. The first file is a JSON file that contains information about the Audit and any grouping information used to associate a URL with up to two levels of grouping (a group and a sub-group). The second file is a CSV file with the URL information for each website. The grouping levels are hierarchical, so a top-level group can have sub-groups (e.g. universities with colleges as a sub-group).
Command to run within the faeAuditor virtual environment:
python populate_websites_from_csv.py file1.json file2.csv
{ "title" : "Penn State University",
"audit_slug" : "psu",
"depth" : 3,
"max_pages" : 200,
"ruleset_id" : "ARIA_STRICT",
"wait_time" : 90000,
"groups" : []
}
{
"title" : "2020 Illinois",
"audit_slug" : "2020-illinois",
"depth" : 2,
"ruleset_id" : "ARIA_STRICT",
"wait_time" : 90000,
"max_pages" : 200,
"groups" : [
{
"id" : "college",
"title" : "College",
"title_plural" : "Colleges",
"members" : [
{
"slug" : "ahs",
"title": "Applied Health Sciences",
"abbrev": "AHS"
},
{
"slug" : "cob",
"title": "Gies College of Business",
"abbrev": "LAS"
},
{
"slug" : "coe",
"title": "College of Education",
"abbrev": "EDUC"
},
{
"slug" : "faa",
"title": "College of Fine and Applied Arts",
"abbrev": "FAA"
},
{
"slug" : "las",
"title": "Liberal Arts and Sciences",
"abbrev": "LAS"
},
{
"slug" : "engr",
"title": "Grainger College of Engineering",
"abbrev": "ENGR"
},
{
"slug" : "aces",
"title": "Agriculture and Consumer Sciences",
"abbrev": "ACES"
}
]
}
]
}
{ "title" : "2018 Basketball Conferences II",
"audit_slug" : "2018-bb-conf-2",
"depth" : 2,
"ruleset_id" : "ARIA_STRICT",
"wait_time" : 90000,
"max_pages" : 200,
"groups" : [
{
"id" : "conference",
"title" : "Conference",
"title_plural" : "Conferences",
"members" : [
{
"slug" : "acc",
"title": "Atlantic Coast",
"abbrev": "ACC"
},
{
"slug" : "atlantic10",
"title": "Atlantic 10",
"abbrev": "Atl10"
},
{
"slug" : "big10",
"title": "Big Ten",
"abbrev": "Big10"
},
{
"slug" : "big12",
"title": "Big 12",
"abbrev": "Big12"
},
{
"slug" : "bigwest",
"title": "Big West",
"abbrev": "BigWest"
},
{
"slug" : "bigeast",
"title": "Big East",
"abbrev": "BigEast"
},
{
"slug" : "conferenceusa",
"title": "Conference USA",
"abbrev": "ConfUSA"
},
{
"slug" : "ivy",
"title": "Ivy League",
"abbrev": "IVY"
},
{
"slug" : "midamerican",
"title": "Midamerican",
"abbrev": "Midamerican"
},
{
"slug" : "missourivalley",
"title": "Missouri Valley",
"abbrev": "MoValley"
},
{
"slug" : "noconference",
"title": "No Conference",
"abbrev": "Noconference"
},
{
"slug" : "pac12",
"title": "Pacific 12",
"abbrev": "Pac12"
},
{
"slug" : "sec",
"title": "Southeast",
"abbrev": "SEC"
},
{
"slug" : "wac",
"title": "Western Athletic",
"abbrev": "WAC"
}
]
},
{
"id" : "university",
"title" : "University",
"title_plural" : "Universities",
"members" : [
{
"group": "acc",
"slug": "bc",
"title": "Boston College",
"abbrev": "BC"
},
{
"group": "acc",
"slug": "clemson",
"title": "Clemson",
"abbrev": "Clemson"
},
{
"group": "acc",
"slug": "duke",
"title": "Duke",
"abbrev": "Duke"
}
........
{
"group": "wac",
"slug": "grand_canyon",
"title": "Grand Canyon",
"abbrev": "Grand Canyon"
}
]
}
]
}
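Before running the populate script, it can help to check that a JSON file follows the pattern of the examples above. The sketch below is an illustration only: `check_audit_config` is a hypothetical helper, and treating the seven top-level keys as required is an assumption drawn from the examples, not from the script itself. It also checks that each second-level member's "group" value matches a "slug" in the first-level group.

```python
import json

# Keys that appear in every example on this page (assumed to be required).
EXPECTED_KEYS = ("title", "audit_slug", "depth", "max_pages",
                 "ruleset_id", "wait_time", "groups")


def check_audit_config(config):
    """Return a list of problems found in an audit configuration dict."""
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in config]
    groups = config.get("groups", [])
    if len(groups) > 2:
        problems.append("more than two grouping levels")
    if len(groups) == 2:
        top_slugs = {member["slug"] for member in groups[0].get("members", [])}
        for member in groups[1].get("members", []):
            if member.get("group") not in top_slugs:
                problems.append(
                    f"{member['slug']}: unknown parent group {member.get('group')!r}")
    return problems


# A shortened configuration in the same shape as the examples above.
sample = json.loads("""
{ "title": "2018 Basketball Conferences II", "audit_slug": "2018-bb-conf-2",
  "depth": 2, "ruleset_id": "ARIA_STRICT", "wait_time": 90000, "max_pages": 200,
  "groups": [
    { "id": "conference", "title": "Conference", "title_plural": "Conferences",
      "members": [ { "slug": "acc", "title": "Atlantic Coast", "abbrev": "ACC" } ] },
    { "id": "university", "title": "University", "title_plural": "Universities",
      "members": [ { "group": "acc", "slug": "bc", "title": "Boston College", "abbrev": "BC" } ] }
  ] }
""")

print(check_audit_config(sample))  # an empty list means no problems were found
```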
CSV Column descriptions:
"[depth]","[max pages]","[wait time]","[Title]","[url]","[id1]","[id2]","[id3]"
Column | Required | Description |
---|---|---|
depth | Yes | Depth to which the website is spidered; if blank, the default defined for the audit is used |
max pages | Yes | Maximum number of pages to include in the audit; if blank, the default defined for the audit is used |
wait time | Yes | Maximum time to wait for a web page to respond; if blank, the default defined for the audit is used |
Title | Yes | A human-readable title for the website |
url | Yes | The URL at which spidering of the website starts |
id1 | Optional | Group id for the top-level group |
id2 | Optional | Group id for the sub-group within the top-level group |
id3 | Optional | Id for distinguishing websites that share the same top-level and sub-group ids |
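The column rules in the table can be sketched as a small row parser. This is an assumption about how blank values might be handled, not the code in populate_websites_from_csv.py: `parse_row`, `COLUMNS`, and the `defaults` dict are hypothetical names, and the example row is made up for illustration with one value filled in.

```python
import csv
import io

# Column order taken from the header line documented above.
COLUMNS = ["depth", "max_pages", "wait_time", "title", "url", "id1", "id2", "id3"]


def parse_row(row, defaults):
    """Map one CSV row to a dict, using audit defaults for blank numeric fields."""
    # Pad short rows so the optional id columns are always present.
    padded = list(row) + [""] * (len(COLUMNS) - len(row))
    record = dict(zip(COLUMNS, padded))
    for key in ("depth", "max_pages", "wait_time"):
        value = record[key].strip()
        record[key] = int(value) if value else defaults[key]
    return record


# Defaults as they would come from the audit's JSON file.
defaults = {"depth": 2, "max_pages": 200, "wait_time": 90000}

line = ('"","50","","Boston College: Admissions",'
        '"http://www.bc.edu/bc-web/admission.html","acc","bc","admissions"\n')
record = parse_row(next(csv.reader(io.StringIO(line))), defaults)
print(record["depth"], record["max_pages"], record["wait_time"])
```

Here the blank depth and wait time fall back to the audit defaults, while the explicit max pages value of 50 is kept.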
Sample CSV data:
"","","","Boston College: Admissions","http://www.bc.edu/bc-web/admission.html","acc","bc","admissions"
"","","","Boston College: Business","http://www.bc.edu/bc-web/schools/caroll-school.html","acc","bc","business"
"","","","Boston College: Education","http://www.bc.edu/schools/lsoe/","acc","bc","education"
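Rows in this fully quoted format can be produced with Python's csv module. The sketch below is not create_csv_from_urls.py itself. It assumes an input format the page does not specify (a URL followed by up to three whitespace-separated ids on each line) and uses the URL's host name as a placeholder title; `urls_to_csv` is a hypothetical name.

```python
import csv
import io
from urllib.parse import urlsplit


def urls_to_csv(lines):
    """Convert 'url [id1 [id2 [id3]]]' lines into fully quoted CSV text."""
    out = io.StringIO()
    # QUOTE_ALL quotes every field, including empty ones, as in the sample data.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        url, ids = fields[0], fields[1:4]
        ids += [""] * (3 - len(ids))
        title = urlsplit(url).netloc  # placeholder title (assumption)
        # depth, max pages, and wait time are left blank to use the audit defaults.
        writer.writerow(["", "", "", title, url] + ids)
    return out.getvalue()


print(urls_to_csv(["http://www.bc.edu/schools/lsoe/ acc bc education"]), end="")
```

In practice the placeholder titles would be edited by hand to something more descriptive, such as "Boston College: Education".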