- Author: Etienne P Jacquot
- Date: February 25th, 2025
- Course: Penn Carey Law: LAW-9580 Cybercrime (Levy)
- Statement of Academic Integrity: I have adhered to the academic integrity guidelines of the Code of Academic Integrity in this assignment. All of the work included (written and code) is my own and does not represent the opinions or views of the University of Pennsylvania, the Penn Carey Law School, the Annenberg School for Communication, or any other entities or individuals associated with UPenn.
- License & Liability: MIT License
The Fair Use Index dataset included in this repository was copied manually from the web pages of the U.S. Copyright Office Fair Use Index on February 24th, 2025.
- Excel data file: consists of 6 columns and 250 rows:
```
Index(['Case', 'Year', 'Court', 'Jurisdiction', 'Categories', 'Outcome'], dtype='object')
```
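The schema can be sketched with pandas as follows. The exact loading code in the notebook isn't shown here, and the example row is a hypothetical placeholder, not a case drawn from the dataset:

```python
import pandas as pd

# Illustrative row only -- the real dataset has 250 rows copied from the
# Fair Use Index web pages; this case and its values are hypothetical.
df = pd.DataFrame(
    {
        "Case": ["Doe v. Roe"],
        "Year": [2020],
        "Court": ["S.D.N.Y."],
        "Jurisdiction": ["Second Circuit"],
        "Categories": ["Education/Scholarship/Research"],
        "Outcome": ["Fair use found"],
    }
)

# Matches the Index(...) output above.
print(list(df.columns))
```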
This analysis is performed in a Jupyter Notebook with Python 3.11, using the Fair Use Index dataset to understand the distribution of the cases the U.S. Copyright Office provides as a reference for the public.
Focusing on time, outcomes, and categories, this analysis emphasizes Education & Research, given my nearly 10 years of experience as an IT staff member at the Annenberg School for Communication at the University of Pennsylvania, in support of my graduate coursework for LAW-9580 Cybercrime at Penn Carey Law.
The steps taken in this coding analysis are primarily aimed at cleaning the data of inconsistencies and visualizing the cleaned data to surface trends in the cases referenced in the Fair Use Index dataset.
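The kind of cleaning involved can be sketched with pandas as below. The messy values are hypothetical stand-ins for the inconsistencies found in the dataset, and the notebook's actual cleaning steps may differ:

```python
import pandas as pd

# Hypothetical messy Outcome values: stray whitespace and inconsistent casing
# make the same outcome appear as several distinct labels.
df = pd.DataFrame(
    {"Outcome": [" Fair use found ", "fair use found", "Fair use not found"]}
)

# Normalize: strip whitespace, then standardize capitalization.
df["Outcome"] = df["Outcome"].str.strip().str.capitalize()

# The three raw values collapse into two distinct outcomes.
print(df["Outcome"].nunique())
```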
Please be advised that the following AI tools were used in support of this course work, consistent with Penn Carey Law's guidance & policies on AI outlined here: https://www.law.upenn.edu/its/docs/ai/
- GitHub: The Copilot VS Code extension was used to assist in writing and auto-generating Python code snippets throughout the Jupyter notebook analysis & scripting, along with formatting this README documentation.
- OpenAI: The GPT-4o model was used to assist in filtering all categories, differentiating parent categories from subcategories.
- Anthropic: The Claude 3.7 Sonnet model was used to assist in generating the advanced reference on data visualizations produced in this analysis.
Using Python 3.11, run the following to replicate the code in a local virtual environment:
- For the analysis.ipynb notebook, make sure to install the dependencies in the requirements.txt file:

```shell
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
- If using OpenAI to make API calls, you must save your API key in a ./secret_key.json file.
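Loading the key might look like the sketch below. The `"api_key"` field name is an assumption (match it to however the notebook stores the key), and the demo writes a throwaway file so the snippet is self-contained:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Demo only: write a placeholder secret_key.json to a temp directory.
# In the repo, the file lives at ./secret_key.json with your real key.
with TemporaryDirectory() as tmp:
    secret_path = Path(tmp) / "secret_key.json"
    secret_path.write_text(json.dumps({"api_key": "sk-placeholder"}))

    # Read the key back; "api_key" is an assumed field name.
    api_key = json.loads(secret_path.read_text())["api_key"]

print(api_key)
```

Keeping the key in a gitignored JSON file avoids committing credentials to the repository.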
An export of the cleaned dataset is provided for reference:
Below are image exports of the data visualizations generated from the analysis of the Fair Use Index dataset, a subset of which are included in my official coursework submission paper.
Notably, the AI model summarizes all available categories in the dataset into five or six parent-level categories:
- Education & Research
- Journalism & Commentary
- Legal & Governance
- Media & Entertainment
- Technology & Digital
- Visual Arts
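The idea behind the consolidation can be sketched as a lookup from subcategory to parent category. The real assignments were produced by GPT-4o, so this hand-written mapping is illustrative only, with hypothetical subcategory labels:

```python
# Illustrative-only mapping: the actual subcategory-to-parent assignments were
# generated by GPT-4o; these entries are hypothetical examples of the idea.
PARENT_MAP = {
    "Education/Scholarship/Research": "Education & Research",
    "News reporting": "Journalism & Commentary",
    "Parody/Satire": "Media & Entertainment",
    "Photograph": "Visual Arts",
}


def to_parent(category: str) -> str:
    # Fall back to the original label when no parent category is known.
    return PARENT_MAP.get(category, category)


print(to_parent("News reporting"))  # -> Journalism & Commentary
```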
- Using the pdf-download.sh script, download each PDF from the US Govt Fair Use Index website for review:

```shell
./pdf-download.sh
```
- The PDFs are stored in the ./pdfs/ directory. Manually inspect them to confirm all files are valid and not corrupted. If any are corrupted, delete those files and re-run the script to download the PDFs again.
- This ./pdfs/ directory will be used as a RAG knowledge base for informing an AI model about the Fair Use Index cases in future analysis. The example will use Open WebUI (TBD):

```shell
docker run open-webui ....
```
For questions, comments, concerns, or feedback, please feel free to reach out to me at [email protected]