- Will open on Monday, November 16 at 8:00 am, and will remain open until midnight on Tuesday, December 1. https://utk.campuslabs.com
- Please let me know how you'd like to improve this course
- Nov 19, Nov 24, During Final Exam (1:00 p.m. – 3:15 p.m. Tuesday, December 8)
- Preferences
- TwitterSentiment: Nov 19, Nov 24
- Olympics: Nov 24, Nov 19
- Security Vulnerabilities: Final Exam, 24th, 19th
- Digital Currency: Final Exam, 24th, 19th
- FoodRec: Nov 24, Final
- AppPermissions: Nov 24, Final
- TitleIV: Final, Nov 24th, Nov 19th
- RoadSafety: Final Exam, 24th
- StockPrices: Final Exam, 24th
- Spotify Song: Nov 24, Final
- KickTool: Final Exam, 24th, 19th
- Presidential Poll: Final Exam, 24th
- BikeShare: Final Exam, 24th
- Emoji Context: 24th, Final, 19th
- DebateAnalysis: Final, 24
- RateMyProfessors: Final, 24
- NewsBias: Final, 24
- NetworkIntrusion: 24, Final
- Billboard 100: 24, Final
- Nov 19: KickTool, TwitterSentiment (up to 20 min per presentation)
- Nov 24: FoodRec, AppPermissions, Spotify Song, Emoji Context, NetworkIntrusion, Billboard 100, Olympics (up to 10 min per presentation)
- Final: Security Vulnerabilities, Digital Currency, TitleIV, RoadSafety, StockPrices, Presidential Poll, Debate, RateMyProfessor, NewsBias, BikeShare (up to 11 min per presentation)
Similar to progress reports with additional sections:
- Objective (research question)
- Data that was used: how obtained, how processed, integrated, and validated
- What models or algorithms were used
- Results: A description of the results
- Primary issues encountered during the project
- Future work: ideas generated, improvements that would make sense, etc
- Org chart: rough timeline and responsibilities for each member
- s+ NewsBias: https://tennessee.zoom.us/j/91408896943
- s+ Olympics: https://tennessee.zoom.us/j/96713656851
- s+ TwitterSentiment: https://tennessee.zoom.us/j/5611854477
- s+ NetworkIntrusion: https://tennessee.zoom.us/j/4128432326
- s+ AppPermissions: https://tennessee.zoom.us/j/91672070940
- s+ SecurityVulnerabilities: https://tennessee.zoom.us/j/96606022135
- s+ Emoji Context: https://tennessee.zoom.us/j/7355437957
- s+ Debate Analysis: https://tennessee.zoom.us/j/8671212362
- s+ Billboard100: https://tennessee.zoom.us/j/3939099316
- s+ RoadSafety(medicalbenchmarks): https://tennessee.zoom.us/j/7887946203
- s+ RateMyProfessors: https://tennessee.zoom.us/j/8145700408
- s+ Digital Currency: https://zoom.us/j/93762046342?pwd=TlA4TVZ0UVVCc3lxVG8vYmNYQW1wZz09
- s+ stockprices: https://zoom.us/j/5985407628?pwd=L3hwTnVzV3NsOERaR3lVY0Y3OXpvUT09
- s+ titleiv: https://tennessee.zoom.us/j/92931020017
- s- bikeshare: https://tennessee.zoom.us/j/97472848970
- s- Presidential Poll: https://tennessee.zoom.us/j/4698520661
- s+ spotifysong: https://tennessee.zoom.us/j/3218554360
- s+ FoodRec: https://tennessee.zoom.us/j/2365079269
- MP3 Part D feature extraction is due on Nov 24
- MP3 Part D Analysis introduced
- MP3 Part D is introduced
- Questions
- Work on final projects
- Questions on MP3C: MP3 Part C is Due at the end of class
- Work on final projects
- Questions on MP3 Part C
- Work on final projects
- ** No class: Engineering day **
- Introducing MP3 Part C
- Work on final projects or
- Ask questions
- MP3 Part B Due
- Work on final projects or
- Ask questions
- MP3 Part A status: need nrec >= 1000
- Work on final projects or
- Ask questions
- MP3 Part A Due
- Work on final projects or
- Ask questions
- Introducing MP3 Part B
- Work on final projects or
- Ask questions
- Work on final projects or
- Ask questions
- MP2 due
- Revisit MP3
- Work on final projects
- Questions on gcp
- Introducing MP3
- Work on final projects
- Schedule for Miniproject2: Due on Oct 6 end of class
- Introducing GCP
- Other matters as needed
Final project proposals are due at the end of the class
The group needs to submit a project proposal (1.5-2 pages in IEEE format; see https://www.overleaf.com/latex/templates/preparation-of-papers-for-ieee-sponsored-conferences-and-symposia/zfnqfzzzxghk).
The proposal should provide:
- an objective,
- a brief motivation for the project,
- a detailed discussion of the data that will be obtained or used in the project,
- the responsibilities of each member,
- a timeline of milestones, and
- the expected outcome.
The proposal pdf will be committed to fdac20/ProjectName/proposal.pdf
- Please fill project teams: https://github.com/orgs/fdac20/teams, the first person in the pitch can add the rest.
- !!!Do a pull request on Miniproject1 if you have not done so yet: will stop accepting PRs for MP1 on Thursday. If you experience difficulties, please let me know!!!
- Lecture on data discovery and data storing (databases).
- Introducing Miniproject2.
- Final project team formation
- Most teams formed (create fdac20/ProjectName repo and a team of the same name; invite members of the team)
- Start brainstorming/writing final project proposal (see Sep 24)
- First 20 min: Final project pitches (Regular zoom number https://tennessee.zoom.us/j/2766448345)
- I have created teams for each of the submitted pitches and added the first person to that team: please add your team members to your team (e.g., go to https://github.com/orgs/fdac20/teams/emojicontext and add team members)
- Presentations of the Miniproject by the person selected in each group
- Group 1: jgray51, Group 2: isikkema, Group 3: chupi, Group 4: delayed, Group 5: lcourtn5,
- Group 6: tuckermiles70, Group 7: abhidya, Group 8: alambe22, Group 9: bbible, Group 10: hjw848
- First 20 min: Final project pitches (Regular zoom number https://tennessee.zoom.us/j/2766448345)
- Small group presentations of Miniproject1
- Group 1 https://tennessee.zoom.us/j/95725208783 rharri63 jgray51 cmuncey mstanto4 jjack113 mphill66
- Group 2 https://tennessee.zoom.us/j/93213528802 jlangst6 isikkema jherman4 jbryan74 llocke2 wboyd8
- Group 3 https://tennessee.zoom.us/j/94956030633 jneely10 bmarti68 aravi hdoerr nskuda crizzo
- Group 4 https://tennessee.zoom.us/j/91499146968 jcharl12 ssmit285 rcongmon dxh594 dreid6 oiqbal
- Group 5 https://tennessee.zoom.us/j/95661325820 lcourtn5 kvyas1 jjelinek istone1 bli43 spatel95
- Group 6 https://tennessee.zoom.us/j/95829164881 mwermert aengelvi tnguye69 dnguye18 tmiles7 zdong7
- Group 7 https://tennessee.zoom.us/j/98055424312 wph612 vrajago2 rflint zables abhidya tjames17
- Group 8 https://tennessee.zoom.us/j/96123759849 alambe22 jgurganu cfei1 eswanger mmohandi lsangeor
- Group 9 https://tennessee.zoom.us/j/97793867441 ktailor1 bbible3 spatel84 chayne10 kfidan cmathew9 clampe1
- Group 10 https://tennessee.zoom.us/j/92620040619 rholmber jzhu34 mcox59 hjw848 hchoi9 jpi jalle119
- Results of the voting are tallied by the first person in the group at #8
- We will use regular zoom number today: https://tennessee.zoom.us/j/2766448345
- Please don't forget to enable issues on your fork of Miniproject1 so your peer can raise issues
- After all questions are answered we will switch to https://tennessee.zoom.us/j/99284951954 so that each group is ready for the next class.
- Discuss ideas with your assigned peers in the small groups, finish Miniproject1
- Thank you for submitting project ideas to fdac20/FinalProjectPitches
- Please submit ** Practice0 task ** if you have not done so
- If you still have any questions regarding MP1, I will try to answer them:
- Think about selecting the course project (see course projects for the last five years at fdac19, fdac18, fdac17, fdac16, fdac for inspiration)
- Please submit Practice0 task if you have not done so
- See the simple text analysis of your descriptions
- Introducing the MiniProject1 process
- and template
- Attend only if you need help with forking/ssh/Practice0 task.
- TAs and I will help you set up ssh/putty so that you can access jupyter notebooks, here are two additional zoom conference rooms: https://tennessee.zoom.us/j/94932963014 https://tennessee.zoom.us/j/98393877697
- Make sure you do not have any issues with the following:
- Have you accepted github invite: iqbo, johnpi, bhl19950207, michael-cox
- Make sure your ssh/putty setup works: Full details
- Can you complete fdac20/Practice0?
- If you need a refresher on unix tools: edX on unix for data science
- Previous lecture has been recorded (please see the link below)
- We will go over the lecture on key tools used for the class and on version control practices
- If you have not submitted the PR before class please:
- go to https://github.com/fdac20/students and click the fork button on the top right side of the screen.
- Next, add your netid.md and netid.key files (replace 'netid' by your own netid).
- Then, click on create pull request.
- Complete instructions are at https://github.com/fdac20/news/blob/master/Preliminary.md.
- Make sure you have accepted your github invitations at https://github.com/fdac20
- Please follow through ssh/putty setup - Full details
- Create your github account
- fork repo students
- create your utid.md file providing your name and interests (see fdac20/students/README.md), and also provide your utid.key with your public ssh key. Once done, please
- submit a pull request to fdac20/students
- Make sure you do it during the class so we can start ready on Aug 25
Class video recordings
- Join from a PC, Mac, iPad, iPhone or Android device: please click this URL to start or join: https://tennessee.zoom.us/j/2766448345 Or go to https://tennessee.zoom.us/join and enter the class session/meeting ID: 276 644 8345
- Join from a dial-in phone line (note: these are NOT toll-free numbers): dial +1 646 558 8656 or +1 408 638 0968. Meeting ID: 276 644 8345. Participant ID: shown after joining the meeting. International numbers available: https://tennessee.zoom.us/zoomconference?m=leg4C6yjhpfGHE-_Q9EYRNHXCUMBC-2T
- Course: [COSC-445/COSC-545]
- ** Zoom link above 04:30PM-5:45PM TTh**
- Instructor: Audris Mockus, [email protected] office hours - upon request
- TA: David Reid [email protected] office hours - upon request
- TA: James Hammer [email protected] office hours - upon request
- ** Syllabus **
- Need help?
Simple rules:
- There are no stupid questions. However, it may be worth going over the following steps:
- Think of what the right answer may be.
- Search online: stack overflow, etc.
- code snippets: On GH gist.github.com or, if anyone contributes, for this class
- answers to questions: Stack Overflow
- Look through issues
- Post the question as an issue.
- Ask instructor: email for 1-on-1 help, or to set up a time to meet
The course will combine theoretical underpinning of big data with intense practice. In particular, approaches to ethical concerns, reproducibility of the results, absence of context, missing data, and incorrect data will be both discussed and practiced by writing programs to discover the data in the cloud, to retrieve it by scraping the deep web, and by structuring, storing, and sampling it in a way suitable for subsequent decision making. At the end of the course students will be able to discover, collect, and clean digital traces, to use such traces to construct meaningful measures, and to create tools that help with decision making.
Upon completion, students will be able to discover, gather, and analyze digital traces, will learn how to avoid mistakes common in the analysis of low-quality data, and will have produced a working analytics application.
In particular, in addition to practicing critical thinking, students will acquire the following skills:
- Use Python and other tools to discover, retrieve, and process data.
- Use data management techniques to store data locally and in the cloud.
- Use data analysis methods to explore data and to make predictions.
A great volume of complex data is generated as a result of human activities, including both work and play. To exploit that data for decision making it is necessary to create software that discovers, collects, and integrates the data.
Digital archeology relies on traces that are left over in the course of ordinary activities, for example the logs generated by sensors in mobile phones, the commits in version control systems, or the email sent and the documents edited by a knowledge worker. Understanding such traces is complicated in contrast to data collected using traditional measurement approaches.
Traditional approaches rely on a highly controlled and well-designed measurement system. In meteorology, for example, the temperature is taken in specially designed and carefully selected locations to avoid direct sunlight and to be at a fixed distance from the ground. Such measurement can then be trusted to represent these controlled conditions and the analysis of such data is, consequently, fairly straightforward.
The measurements from geolocation or other sensors in mobile phones are affected by numerous (yet not recorded) factors: was the phone kept in the pocket, was it indoors or outside? The devices are not calibrated or may not work properly, so the corresponding measurements would be inaccurate. Locations (without mobile phones) may not have any measurement, yet may be of the greatest interest. This lack of context and inaccurate or missing data necessitates fundamentally new approaches that rely on patterns of behavior to correct the data, to fill in missing observations, and to elucidate unrecorded context factors. These steps are needed to obtain meaningful results from a subsequent analysis.
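As a rough illustration of filling in missing observations, the sketch below uses plain linear interpolation between neighboring known readings. This is a deliberately simple stand-in, not the behavior-pattern methods the course refers to; the function name `fill_missing` and the sample readings are made up for this example.

```python
from typing import List, Optional


def fill_missing(readings: List[Optional[float]]) -> List[Optional[float]]:
    """Linearly interpolate None gaps that have a known value on both sides.

    Gaps at the very start or end have no second neighbor, so they stay None
    rather than being guessed.
    """
    filled = list(readings)
    known = [i for i, value in enumerate(readings) if value is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for k in range(1, gap):
            fraction = k / gap
            filled[left + k] = readings[left] + fraction * (readings[right] - readings[left])
    return filled


# Hypothetical sensor readings; None marks a missing observation.
readings = [20.0, None, None, 23.0, None, 25.0, None]
filled = fill_missing(readings)
print(filled)
```

Leaving the trailing gap unfilled is a design choice: extrapolating past the last known reading would add a value with no supporting observation.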
The course will cover basic principles and effective practices to increase the integrity of the results obtained from voluminous but highly unreliable sources.
- Ethics: legal aspects, privacy, confidentiality, governance
- Reproducibility: version control, ipython notebook
- Fundamentals of big data analysis: extreme distributions, transformations, quantiles, sampling strategies, and logistic regression
- The nature of digital traces: lack of context, missing values, and incorrect data
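To make a few of these fundamentals concrete, here is a minimal sketch using only the Python standard library. It assumes a synthetic heavy-tailed (Pareto) sample as a stand-in for real digital traces; the variable names, the shape parameter 1.2, and the sample sizes are illustrative choices, not course requirements.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic "extreme" distribution: Pareto-distributed values (all >= 1).
sizes = [random.paretovariate(1.2) for _ in range(10_000)]

# With heavy tails the mean is pulled up by a few huge values,
# so the median and other quantiles are more informative summaries.
mean = statistics.mean(sizes)
median = statistics.median(sizes)
deciles = statistics.quantiles(sizes, n=10)  # 9 cut points

# A log transformation compresses the long right tail.
log_sizes = [math.log(value) for value in sizes]

# Simple random sampling without replacement for a cheaper first look.
sample = random.sample(sizes, k=500)

print(f"mean={mean:.2f} median={median:.2f}")
print("deciles:", [round(d, 2) for d in deciles])
print(f"max={max(sizes):.1f} max(log)={max(log_sizes):.2f}")
```

Comparing the mean with the median and deciles is a quick way to notice that a few extreme values dominate the raw scale.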
Students are expected to have basic programming skills; in particular, they should be able to use regular expressions, programming concepts such as variables, functions, loops, and data structures like lists and dictionaries (for example, COSC 365).
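As a quick self-check of these prerequisites, the sketch below combines a regular expression, a function, a loop, a list, and a dictionary. The input lines and the names `LINE_RE` and `parse_preferences` are invented for illustration; they only mimic a "Project: slot, slot" style of list.

```python
import re

# "- <project>: <slot>, <slot>, ..." with named groups for readability.
LINE_RE = re.compile(r"^-\s*(?P<project>[^:]+):\s*(?P<slots>.+)$")


def parse_preferences(lines):
    """Map each project name to its ordered list of preferred slots."""
    prefs = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not look like "- name: a, b"
        # Split on commas and drop empty pieces left by a trailing comma.
        slots = [s.strip() for s in match.group("slots").split(",") if s.strip()]
        prefs[match.group("project").strip()] = slots
    return prefs


lines = [
    "- TwitterSentiment: Nov 19, Nov 24",
    "- Olympics: Nov 24, Nov 19",
    "- RoadSafety: Final Exam, 24th,",  # stray trailing comma on purpose
    "not a preference line",
]
prefs = parse_preferences(lines)
print(prefs)
```

If reading and modifying a snippet like this feels comfortable, the programming side of the course should be manageable.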
Familiarity with version control systems (e.g., COSC 340), Python (e.g., COSC 370), and introductory-level probability (e.g., ECE 313) and statistics, such as random variables, distributions, and regression, would be beneficial but is not expected. Everyone is expected, however, to be willing and highly motivated to catch up in the areas where they have gaps in the relevant skills.
All the assignments and projects for this class will use github and Python. Knowledge of Python is not a prerequisite for this course, provided you are comfortable learning on your own as needed. While we have strived to make the programming component of this course straightforward, we will not devote much time to teaching programming, Python syntax, or any of the libraries and APIs. You should feel comfortable with:
- How to look up Python syntax on Google and StackOverflow.
- Basic programming concepts like functions, loops, arrays, dictionaries, strings, and if statements.
- How to learn new libraries by reading documentation and reusing examples
- Asking questions on StackOverflow or as a GitHub issue.
These apply to real life, as well.
- Must apply "good programming style" learned in class
- Optimize for readability
- Bonus points for:
- Creativity (as long as requirements are fulfilled)
- Agree on an editor and environment that you're comfortable with
- The person who's less experienced/comfortable should have more keyboard time
- Switch who's "driving" regularly
- Make sure to save the code and send it to others on the team
- Class Participation - 15%: students are expected to read all material covered in a week and come to class prepared to take part in the classroom discussions (online). Asking and responding to other student questions (issues) counts as a key factor for classroom participation. With online format and collaborative nature of the projects, this should not be hard to accomplish.
- Assignments - 40%: Each assignment will involve writing (or modifying a template of) a small Python program.
- Project - 45%: one original project done alone or in a group of 2 or 3 students. The project will explore one or more of the themes covered in the course that students find particularly compelling. The group needs to submit a project proposal (2 pages IEEE format) approximately 1.5 months before the end of term. The proposal should provide a brief motivation of the project, detailed discussion of the data that will be obtained or used in the project, along with a time-line of milestones, and expected outcome.
As a programmer you will never write anything from scratch, but will reuse code, frameworks, or ideas. You are encouraged to learn from the work of your peers. However, if you don't try to do it yourself, you will not learn. Deliberate practice (activities designed for the sole purpose of effectively improving specific aspects of an individual's performance) is the only way to reach perfection.
Please respect the terms of use and/or license of any code you find, and if you re-implement or duplicate an algorithm or code from elsewhere, credit the original source with an inline comment.
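One possible shape for such an inline credit is sketched below. The source URL and license are placeholders to be replaced with the real ones, and the `chunks` helper is only a generic example of reused code.

```python
def chunks(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    # Adapted from: <URL of the original answer or repository>  (placeholder)
    # License: <license of the original code>                   (placeholder)
    # Changes: renamed variables and added the docstring.
    for start in range(0, len(items), size):
        yield items[start:start + size]


batches = list(chunks([1, 2, 3, 4, 5], 2))
print(batches)
```

Noting what was changed, next to where the code came from, makes it easier for a reviewer to tell reused work from your own.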
This class assumes you are confident with this material, but in case you need a brush-up...
- A MongoDB Schema Analyzer. One JavaScript file that you run with the mongo shell command on a database collection and it attempts to come up with a generalized schema of the datastore. It was also written about on the official MongoDB blog.
- Modern Applied Statistics with S (4th Edition) by William N. Venables, Brian D. Ripley. ISBN 0387954570
- R
- Code School
- Quick-R
- Git and GitHub
- GitHub Pages