49 changes: 49 additions & 0 deletions README.md
@@ -1,3 +1,51 @@
### Assignments

You should begin working on assignments as soon as they are released.

Each assignment is due at 11:59pm Pacific on Sundays, unless otherwise stated.

You do not need any resources outside the class to do the assignments. If
you are stuck, please reach out on Ed Discussion!

### Weighting

Assignments have different weights. See [course evaluation portion of syllabus](https://github.com/datasci-w266/2025-spring-main/blob/master/syllabus/README.md#course-evaluation).

### Collaboration policy

The work you submit must be your own. You are permitted to discuss the
assignment at a high level with other students, but you should not share code,
implementation details, or specific solutions.

More specifically:
- You may collaborate verbally with other students, but you should not look at
or help debug each other's code.
- You can get help on general programming issues, but only in a very general
capacity that does not relate to a specific part of the assignment. (example:
"Oh, I had issues with that function crashing, but updating TensorFlow to 2.14
fixed it for me.") If in doubt, ask in a *private* question on Ed Discussion.
- You should not search for or use solutions to the assignment problems that you
find on the web. If you inadvertently come across such a solution, you
**must** cite it appropriately in your answer. (Bad search: "how to implement
GloVe in TensorFlow". Good search: "how to use `tf.variable_scope`" or "how to
concatenate matrices")
- You should not use external libraries to shortcut parts of the assignment. The
libraries imported in the starter code should be more than sufficient,
although you're welcome to use standard Python libraries like `pdb` or
`itertools` if you desire.

## Work in git!

Please do your work in a `git` repository; you can simply clone the course repo
to get started. If there are any updates or fixes to an assignment, we'll
update the master repo and you can patch your local clone with `git pull`. Be sure
to use the assignment submit script (see Assignment 0 for instructions) to submit.
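The clone-then-pull workflow above can be sketched end to end. The block below is a self-contained demo that stands up a throwaway local repository in place of the course GitHub repo (the `upstream` and `w266` paths under a temp directory are illustrative):

```shell
set -e
tmp=$(mktemp -d)

# Stand-in for the course repo on GitHub (path is illustrative)
git -c init.defaultBranch=main init -q "$tmp/upstream"
git -C "$tmp/upstream" -c user.name=staff -c user.email=staff@example.com \
    commit -q --allow-empty -m "initial assignment release"

# Clone it, as you would clone the course repo
git clone -q "$tmp/upstream" "$tmp/w266"

# Later, staff push a fix to the master repo...
git -C "$tmp/upstream" -c user.name=staff -c user.email=staff@example.com \
    commit -q --allow-empty -m "assignment fix"

# ...and you patch your clone with git pull
git -C "$tmp/w266" pull -q
git -C "$tmp/w266" log --oneline -1
```

The final `git log` line shows the clone now contains the "assignment fix" commit; your real workflow is the same, just with the GitHub URL in place of the temp path.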

**Do not** make pull requests on the course repo!

Be sure to run **`submit.sh`** often to save your work!
# DATASCI 266: Natural Language Processing with Deep Learning

Understanding language is fundamental to human interaction. Our brains have
@@ -18,3 +66,4 @@ language generation, question answering, and summarization.
* [Notebooks & Materials](materials/)


Empty file added a0/.clean
Empty file.
49 changes: 49 additions & 0 deletions a0/README.md
@@ -0,0 +1,49 @@
# Assignment 0: Hello 266!

This assignment is a quick walk-through to help you get set up logistically for the course. It isn't a real assignment (there are no problems to solve), but it counts towards 2% of your grade (because you can't do the other assignments without doing it correctly!).

**Reminder:** You may only use 2 late days for any one deliverable in this course. See the [syllabus](../../syllabus/) for details.

If you haven't yet, please:

- Sign in to Ed Discussion via bCourses

**READ ALL OF THESE STEPS BEFORE RUNNING ANYTHING!**

Now we'll get you all set up with the software packages and the course GitHub.

1. [Set up](https://calmail.berkeley.edu/manage/account/create_account) a @berkeley.edu account if you don't already have one (@ischool.berkeley.edu is **not** sufficient!)

2. **Set up your computing environment:** We are going to use Google Colab for this class. It provides free access to a GPU, which we will need in later assignments. In your UC Berkeley Google Drive, create a folder for this class. You will save the Colab notebooks you run in this folder.

3. **Clone the course repo.** On your laptop or local machine, run this command in a terminal:
`git clone https://github.com/datasci-w266/2025-spring-main.git ./w266`
You will use this local copy to get updates as we post them and to store your work. This git repo is independent of the GDrive folder. You will also submit work from this repository.

4. **Create your personal submission repo** at [this link](https://classroom.github.com/a/jxD7Rs8V). We'll use this for holding assignments that you have completed so the instructors can collect them for grading; it's private to you and the instructors.

You'll use the submit.sh script discussed in step 8 to push work from your laptop (where you do your work) to this private classroom repo.

5. **Open and run the a0 notebook in Colab.** Copy the `a0.ipynb` notebook to the folder you created in GDrive, then double-click it to open it in Colab. This notebook makes some simple checks and gives a taste of some of the older NLP datasets we'll be working with. You don't need to write any code here - just run the cells and save. After you have run the notebook, go to File -> Download, download an .ipynb version, and overwrite the copy in your local git repository, where you can commit it. (Do **NOT** run `git push` in the local repo.)

6. **Run the check_python.sh script on your laptop.** This will identify the version of Python available on your machine. On a Mac, run something like `/bin/bash ./check_python.sh` in a terminal. If your machine has a `python` executable, the submit.sh script will run as is. If your Python executable is named `python3`, you'll need to change line 75 of submit.sh so that it uses `python3` instead of `python`. The check_python script will tell you which version you have.

7. **Answer the questions in the answers file** in the `assignment/a0` directory by editing and saving the answers file. Run the presubmit script, `answers_test.py`, yourself using the Python executable you just identified. Unlike in future assignments, this presubmit checks your answers for you and flags any errors, e.g. if you have entered the wrong number or not deleted the correct set of answers. Then run `git commit` in your local repo to commit your changes. Do NOT rename the file to answers.txt; the file you are committing must be named simply `answers`.


8. **Run the submit script on your laptop/local machine:** From the top level (root) of your local assignment repo (on your laptop or desktop machine), run `bash ./assignment/submit.sh -u your-github-username -a 0` (replacing `your-github-username` with your GitHub username). This pushes to your private repo in GitHub Classroom. The script will try to verify the submission, but you should also visit the repo on GitHub and confirm that your changes show up. (For all assignments in this course, it's your responsibility to make sure your submission has made it to your classroom GitHub repo!)
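The probe in step 6 boils down to asking which Python executable your machine has. Here is a minimal sketch of that check (not the actual check_python.sh, which does more):

```shell
# Probe for python and python3, roughly what check_python.sh determines
for exe in python python3; do
  if command -v "$exe" >/dev/null 2>&1; then
    echo "$exe -> $("$exe" --version 2>&1)"
  else
    echo "$exe -> not found"
  fi
done
```

If only `python3` turns up, that is the case where you edit line 75 of submit.sh.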

**Note:** There is no need to send pull requests or use any of the other usual git machinery. All you need to do is run the submit.sh script and check that your code appears in a branch named a0-submit in your "classroom" repository -- the one you set up in step 4. If you can't find it, **this is a problem**. If you can't figure it out, ask (preferably publicly) on Ed Discussion and someone will help you out. There are a small number of points in each assignment for submitting your homework in the right place.

Each student who correctly submits their work will receive 5 points, and each student who submits a complete answers file named `answers` that is parseable by the autograder receives another 5 points.

**When you run the submit.sh script for the first time**, it will ask you whether you want to use https or ssh. If you choose https, you will need to create a [personal access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token) in your GitHub account and use it in lieu of your password.

If you choose ssh, you need to follow the directions [here](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent) to generate an ssh key on your laptop and then [here](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account) to add the public key to your GitHub account.
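Key generation for the ssh route looks roughly like this (a sketch of the steps in the linked GitHub docs; it writes to a temporary directory so it's safe to try, whereas for real use you'd accept the default `~/.ssh` path and set a passphrase):

```shell
tmp=$(mktemp -d)

# Generate an ed25519 key pair; -N "" means no passphrase (demo only)
ssh-keygen -t ed25519 -C "you@berkeley.edu" -f "$tmp/id_ed25519" -N "" -q

# The *public* half is what you paste into GitHub -> Settings -> SSH and GPG keys
cat "$tmp/id_ed25519.pub"
```

Never share or upload the private key (`id_ed25519` without the `.pub` suffix); GitHub only ever needs the public half.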

## Next...

Continue on to [Assignment 1](../a1/) once it's released. (Unlike Assignment 0, Assignment 1 isn't just a setup exercise. Don't wait too long to get started!)


**Again, it is YOUR responsibility to make sure your submission has made it to your classroom GitHub repo and into the correct branch!**
288 changes: 288 additions & 0 deletions a0/a0.ipynb
@@ -0,0 +1,288 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "kDFiCs_0MaeI"
},
"source": [
"# Assignment 0\n",
"\n",
"This notebook will help verify that you're all set up with the Python packages we'll be using this semester.\n",
"\n",
"**Your task:** just run the cells below, and verify that the output is as expected. If anything looks wrong, weird, or crashes, update your Python installation or contact the course staff. We don't want library issues to get in the way of the real coursework!"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": true,
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Rk5s0520MaeO",
"outputId": "bd870023-8024-4cce-cbab-d73ca02cdc92"
},
"outputs": [],
"source": [
"# Version checks\n",
"import importlib\n",
"def version_greater_equal(v1, v2):\n",
"    for x, y in zip(v1.split('.'), v2.split('.')):\n",
"        if int(x) != int(y):\n",
"            return int(x) > int(y)\n",
"    return True\n",
"\n",
"assert version_greater_equal('1.2.3', '0.1.1')\n",
"assert version_greater_equal('1.2.3', '0.5.1')\n",
"assert version_greater_equal('1.2.3', '1.2.3')\n",
"assert version_greater_equal('0.22.0', '0.20.3')\n",
"assert not version_greater_equal('1.1.1', '1.2.3')\n",
"assert not version_greater_equal('0.5.1', '1.2.3')\n",
"assert not version_greater_equal('0.20.3', '0.22.0')\n",
"\n",
"def version_check(libname, min_version):\n",
"    m = importlib.import_module(libname)\n",
"    print(\"%s version %s is\" % (libname, m.__version__),\n",
"          \"OK\" if version_greater_equal(m.__version__, min_version)\n",
"          else \"out-of-date. Please upgrade!\")\n",
"\n",
"version_check(\"numpy\", \"1.26.4\")\n",
"version_check(\"matplotlib\", \"3.10.0\")\n",
"version_check(\"pandas\", \"2.2.2\")\n",
"version_check(\"nltk\", \"3.9.1\")\n",
"version_check(\"keras\", \"3.5.0\")\n",
"version_check(\"tensorflow\", \"2.17.1\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MJeLTePaMaeT"
},
"source": [
"## TensorFlow\n",
"\n",
"We'll be using [TensorFlow](https://www.tensorflow.org/) to build deep learning models this semester. TensorFlow is a whole programming system in itself, based around the idea of a computation graph and deferred execution. We'll be talking a lot more about it in Assignment 1, but for now you should just test that it loads on your system.\n",
"\n",
"Run the cell below; you should see:\n",
"```\n",
"Hello, TensorFlow!\n",
"42\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Yeo7loI7MaeU",
"outputId": "3fc90e42-1f77-4f5b-d3a7-679a879b4564"
},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"\n",
"hello = tf.constant(\"Hello, TensorFlow!\")\n",
"tf.print(hello)\n",
"\n",
"a = tf.constant(10)\n",
"b = tf.constant(32)\n",
"tf.print(a + b)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6tpeOZp4MaeV"
},
"source": [
"## NLTK\n",
"\n",
"[NLTK](http://www.nltk.org/) is a large compilation of Python NLP packages. It includes implementations of a number of classic NLP models, as well as utilities for working with linguistic data structures, preprocessing text, and managing corpora.\n",
"\n",
"NLTK is included with Anaconda, but the corpora need to be downloaded separately. Be warned that this will take up around 3.2 GB of disk space if you download everything! If this is too much, you can download individual corpora as you need them through the same interface.\n",
"\n",
"Type the following into a Python shell on the command line. It'll open a pop-up UI with the downloader:\n",
"\n",
"```\n",
"import nltk\n",
"nltk.download()\n",
"```\n",
"\n",
"Alternatively, you can download individual corpora by name. The cell below will download the famous [Reuters-21578 benchmark corpus](https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "WibuBpFrMaeW",
"outputId": "73611592-0c72-4cf5-a920-0837b80e5e59"
},
"outputs": [],
"source": [
"import nltk\n",
"assert(nltk.download('punkt'))\n",
"assert(nltk.download('punkt_tab'))\n",
"assert(nltk.download('reuters')) # should return True if successful, or already installed"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TvBDxW_LMaeX"
},
"source": [
"Now we can look at a few sentences. Expect to see:\n",
"```\n",
"ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said .\n",
"\n",
"They told Reuter correspondents in Asian capitals a U . S . Move against Japan might boost protectionist sentiment in the U . S . And lead to curbs on American imports of their products .\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "_GweB4zRMaeX",
"outputId": "c2308d00-f00b-4172-d405-d2b35b1ec4af"
},
"outputs": [],
"source": [
"from nltk.corpus import reuters\n",
"# Look at the first two sentences\n",
"for s in reuters.sents()[:2]:\n",
"    print(\" \".join(s))\n",
"    print(\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WjYOd3vQMaeY"
},
"source": [
"NLTK also includes a sample of the [Penn treebank](https://www.cis.upenn.edu/~treebank/), which we'll be using later in the course for parsing and part-of-speech tagging. Here's a sample of sentences, and an example tree. Expect to see:\n",
"```\n",
"The top money funds are currently yielding well over 9 % .\n",
"\n",
"(S\n",
" (NP-SBJ (DT The) (JJ top) (NN money) (NNS funds))\n",
" (VP\n",
" (VBP are)\n",
" (ADVP-TMP (RB currently))\n",
" (VP (VBG yielding) (NP (QP (RB well) (IN over) (CD 9)) (NN %))))\n",
" (. .))\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ONFIsI-aMaeZ",
"outputId": "ed373f3b-f195-4416-8960-e4b1b7884a25"
},
"outputs": [],
"source": [
"assert(nltk.download(\"treebank\")) # should return True if successful, or already installed\n",
"print(\"\")\n",
"from nltk.corpus import treebank\n",
"# Look at the parse of a sentence.\n",
"# Don't worry about what this means yet!\n",
"idx = 45\n",
"print(\" \".join(treebank.sents()[idx]))\n",
"print(\"\")\n",
"print(treebank.parsed_sents()[idx])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OcQzQGo9Maea"
},
"source": [
"We can also look at the [Europarl corpus](http://www.statmt.org/europarl/), which consists of *parallel* text - a sentence and its translations to multiple languages. You should see:\n",
"```\n",
"ENGLISH: Resumption of the session I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .\n",
"```\n",
"and its translation into French and Spanish."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "MGOrVrQtMaeb",
"outputId": "2de281fd-2d2c-46c7-f76f-7c90cf39aca5"
},
"outputs": [],
"source": [
"assert(nltk.download(\"europarl_raw\")) # should return True if successful, or already installed\n",
"print(\"\")\n",
"from nltk.corpus import europarl_raw\n",
"\n",
"idx = 0\n",
"\n",
"print(\"ENGLISH: \" + \" \".join(europarl_raw.english.sents()[idx]))\n",
"print(\"\")\n",
"print(\"FRENCH: \" + \" \".join(europarl_raw.french.sents()[idx]))\n",
"print(\"\")\n",
"print(\"SPANISH: \" + \" \".join(europarl_raw.spanish.sents()[idx]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "f6aRtMyvMaeb"
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
},
"colab": {
"provenance": []
}
},
"nbformat": 4,
"nbformat_minor": 0
}