Jon Gunderson edited this page Jul 20, 2020 · 20 revisions

Python Scripts for creating groups of URLs

These scripts are in the faeAuditor/faeData directory:

| Python Script | Param 1 | Param 2 | Output | Description |
|---|---|---|---|---|
| adjust-audit-websites.py | audit ID | none | updated audit information | Changes http to https for websites with a low number of pages found |
| audit-urls-to-csv.py | none | none | none | not sure |
| test_urllib.py | none | none | none | not sure |
| audit-one-group-urls-to-csv.py | none | none | none | not sure |
| create_csv_from_urls.py | Text file with one URL on each line, with optional grouping ids | none | CSV file of URLs | Converts a text file of URLs to the CSV format used for populating the faeAuditor database |
| populate_websites_from_csv.py | JSON-formatted file with audit and grouping information | CSV-formatted file with URL info | updated database | Populates the faeAuditor database with URLs that have up to two grouping labels |
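
The http-to-https change described for adjust-audit-websites.py can be sketched as follows. This is only an illustration and not the script's actual code: the field names `url` and `pages_found` and the `min_pages` threshold are assumptions, since the page does not say what counts as a low number of pages.

```python
from urllib.parse import urlsplit, urlunsplit


def to_https(url):
    """Return url with an http scheme replaced by https; other URLs are unchanged."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


def upgrade_low_page_sites(sites, min_pages=2):
    """Switch sites with fewer than min_pages pages found to https.

    sites is a list of dicts with 'url' and 'pages_found' keys
    (hypothetical names, as is the min_pages threshold).
    """
    return [
        dict(site, url=to_https(site["url"]))
        if site["pages_found"] < min_pages
        else site
        for site in sites
    ]
```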

Populating faeAuditor Database with URL and Group Information

The populate_websites_from_csv.py Python script creates or updates the set of URLs for an Audit. It takes two files. The first is a JSON file that describes the Audit and any grouping information used to associate each URL with up to two levels of grouping: a group and a sub-group. The levels are hierarchical, so a high-level group can have sub-groups (e.g. universities with colleges as a sub-group). The second is a CSV file that lists the URLs to audit.

Command to run within the faeAuditor virtual environment:

python populate_websites_from_csv.py file1.json file2.csv

JSON Format Example: No Grouping

{ "title"        : "Penn State University",
  "audit_slug"   : "psu",
  "depth"        : 3,
  "max_pages"    : 200,
  "ruleset_id"   : "ARIA_STRICT",
  "wait_time"    : 90000,
  "groups"       : []
}
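
Before running the script it may help to check that the JSON file has the keys used in the examples on this page. The sketch below is an assumption-based check, not part of faeAuditor: the list of required keys is inferred from the examples, and the actual script may accept fewer or more.

```python
import json

# Keys that appear in every example on this page (inferred, not authoritative).
REQUIRED_KEYS = {
    "title", "audit_slug", "depth", "max_pages",
    "ruleset_id", "wait_time", "groups",
}


def load_audit_config(text):
    """Parse the audit JSON text and check it against the inferred key list."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise ValueError("missing keys: " + ", ".join(sorted(missing)))
    # The page describes at most two levels of grouping.
    if not isinstance(config["groups"], list) or len(config["groups"]) > 2:
        raise ValueError("'groups' must be a list with at most two entries")
    return config
```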

JSON Format Example: 1 level of Grouping

{
  "title"        : "2020 Illinois",
  "audit_slug"   : "2020-illinois",
  "depth"        : 2,
  "ruleset_id"   : "ARIA_STRICT",
  "wait_time"    : 90000,
  "max_pages"    : 200,
  "groups"       : [
  {
    "id"           : "college",
    "title"        : "College",
    "title_plural" : "Colleges",
    "members" : [
      {
        "slug" : "ahs",
        "title": "Applied Health Sciences",
        "abbrev": "AHS"
      },
      {
        "slug" : "cob",
        "title": "Gies College of Business",
        "abbrev": "COB"
      },
      {
        "slug" : "coe",
        "title": "College of Education",
        "abbrev": "EDUC"
      },
      {
        "slug" : "faa",
        "title": "College of Fine and Applied Arts",
        "abbrev": "FAA"
      },
      {
        "slug" : "las",
        "title": "Liberal Arts and Sciences",
        "abbrev": "LAS"
      },
      {
        "slug" : "engr",
        "title": "Grainger College of Engineering",
        "abbrev": "ENGR"
      },
      {
        "slug" : "aces",
        "title": "Agriculture and Consumer Sciences",
        "abbrev": "ACES"
      }
    ]
  }
  ]
}
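
Member slugs appear to be the ids that the CSV file uses to refer to a group, so repeated slugs or abbreviations within one group are worth catching early. A minimal sketch of such a check (the helper name `find_duplicates` is hypothetical and not part of faeAuditor):

```python
from collections import Counter


def find_duplicates(members, field):
    """Return the sorted values of field that occur more than once in members."""
    counts = Counter(member[field] for member in members)
    return sorted(value for value, n in counts.items() if n > 1)
```

It could be called once per group, for example `find_duplicates(group["members"], "slug")` and `find_duplicates(group["members"], "abbrev")`, where `group` is one entry of the "groups" list.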

JSON Format Example: 2 levels of Grouping

{ "title"        : "2018 Basketball Conferences II",
  "audit_slug"   : "2018-bb-conf-2",
  "depth"        : 2,
  "ruleset_id"   : "ARIA_STRICT",
  "wait_time"    : 90000,
  "max_pages"    : 200,
  "groups"       : [
  {
    "id"           : "conference",
    "title"        : "Conference",
    "title_plural" : "Conferences",
    "members" : [
       {
        "slug" : "acc",
        "title": "Atlantic Coast",
        "abbrev": "ACC"
       },
       {
        "slug" : "atlantic10",
        "title": "Atlantic 10",
        "abbrev": "Atl10"
       },
       {
        "slug" : "big10",
        "title": "Big Ten",
        "abbrev": "Big10"
       },
       {
        "slug" : "big12",
        "title": "Big 12",
        "abbrev": "Big12"
       },
       {
        "slug" : "bigwest",
        "title": "Big West",
        "abbrev": "BigWest"
       },
       {
        "slug" : "bigeast",
        "title": "Big East",
        "abbrev": "BigEast"
       },
       {
        "slug" : "conferenceusa",
        "title": "Conference USA",
        "abbrev": "ConfUSA"
       },
       {
        "slug" : "ivy",
        "title": "Ivy League",
        "abbrev": "IVY"
       },
       {
        "slug" : "midamerican",
        "title": "Midamerican",
        "abbrev": "Midamerican"
       },
       {
        "slug" : "missourivalley",
        "title": "Missouri Valley",
        "abbrev": "MoValley"
       },
       {
        "slug" : "noconference",
        "title": "No Conference",
        "abbrev": "Noconference"
       },
       {
        "slug" : "pac12",
        "title": "Pacific 12",
        "abbrev": "Pac12"
       },
       {
        "slug" : "sec",
        "title": "Southeastern",
        "abbrev": "SEC"
       },
       {
        "slug" : "wac",
        "title": "Western Athletic",
        "abbrev": "WAC"
       }
    ]
  },
  {
    "id"           : "university",
    "title"        : "University",
    "title_plural" : "Universities",
    "members" : [
       {
        "group": "acc",
        "slug": "bc",
        "title": "Boston College",
        "abbrev": "BC"
       },
       {
        "group": "acc",
        "slug": "clemson",
        "title": "Clemson",
        "abbrev": "Clemson"
       },
       {
        "group": "acc",
        "slug": "duke",
        "title": "Duke",
        "abbrev": "Duke"
       }

       ........

       {
        "group": "wac",
        "slug": "grand_canyon",
        "title": "Grand Canyon",
        "abbrev": "Grand Canyon"
       }
    ]
  }
  ]
}
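
In the two-level example, each member of the second group points to its parent through the "group" key, which holds the slug of a member of the first group. The sketch below checks those references; it assumes the first entry of "groups" is the top level and the second is the sub-level, as in the example, and the function name is hypothetical.

```python
def check_group_references(groups):
    """Return slugs of second-level members whose 'group' is not a first-level slug."""
    if len(groups) < 2:
        return []  # nothing to check with zero or one level of grouping
    top_slugs = {member["slug"] for member in groups[0]["members"]}
    return [
        member["slug"]
        for member in groups[1]["members"]
        if member.get("group") not in top_slugs
    ]
```

An empty list means every sub-group refers to an existing top-level group.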

CSV file of URLs format

CSV Column descriptions:

"[depth]","[max pages]","[wait time]","[Title]","[url]","[id1]","[id2]","[id3]"
| Column | Required | Description |
|---|---|---|
| depth | Yes (may be blank) | Depth of spidering of the website; if blank, the default value defined for the audit is used |
| max pages | Yes (may be blank) | Maximum number of pages to include in the audit; if blank, the default value defined for the audit is used |
| wait time | Yes (may be blank) | Maximum time to wait for a web page to respond; if blank, the default value defined for the audit is used |
| Title | Yes | A human-readable title for the website |
| url | Yes | The URL at which to start spidering the website |
| id1 | Optional | Group id for the top-level group |
| id2 | Optional | Group id for the sub-group of the top-level group |
| id3 | Optional | Id for distinguishing websites that share the same top-level and sub-group ids |
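
The fallback to audit defaults for blank columns can be sketched with Python's csv module. This is an illustration under assumptions, not the code in populate_websites_from_csv.py: the dictionary keys are hypothetical names for the columns above, and rows with missing trailing ids are padded with empty strings.

```python
import csv
import io

# Hypothetical key names for the eight columns, in file order.
FIELDS = ["depth", "max_pages", "wait_time", "title", "url", "id1", "id2", "id3"]


def parse_rows(csv_text, audit_defaults):
    """Parse CSV text into dicts, filling blank numeric columns from audit_defaults."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if not record:
            continue  # skip empty lines
        # Pad rows that omit the optional id columns (an assumption).
        record = record + [""] * (len(FIELDS) - len(record))
        row = dict(zip(FIELDS, (value.strip() for value in record)))
        for key in ("depth", "max_pages", "wait_time"):
            row[key] = int(row[key]) if row[key] else audit_defaults[key]
        rows.append(row)
    return rows
```

Here `audit_defaults` would hold the "depth", "max_pages" and "wait_time" values from the audit's JSON file.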

Sample CSV data:

"","","","Boston College: Admissions","http://www.bc.edu/bc-web/admission.html","acc","bc","admissions"
"","","","Boston College: Business","http://www.bc.edu/bc-web/schools/caroll-school.html","acc","bc","business"
"","","","Boston College: Education","http://www.bc.edu/schools/lsoe/","acc","bc","education"
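
Rows in the fully quoted style of the sample data can be produced with Python's csv module and `csv.QUOTE_ALL`. This is a minimal sketch of the output format only; it is not taken from create_csv_from_urls.py, and the function name is hypothetical.

```python
import csv
import io


def format_rows(rows):
    """Return CSV text with every field quoted, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()
```

Each row is a list of eight strings in column order, with empty strings for the columns that should fall back to the audit defaults.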