CS205: Projects in Parallel Data Science at SCALE

Spring 2017

About

Extreme-scale data science, at the convergence of big data and massively parallel computing, is enabling simulation, modelling and real-time analysis of complex natural and social phenomena at unprecedented scales. The aim of the project is to gain practical experience of this interplay by applying parallel computation principles to a compute- and data-intensive problem. Applying interdisciplinary principles and skills of parallel computation and data science from CS205 and other courses, the goal is to construct a novel parallel software solution for an open-ended data science application that requires orders-of-magnitude compute scaling, using Harvard’s supercomputer, Odyssey. Additionally, the project provides an opportunity to apply novel concepts and technologies to create niche applications and research outputs.

Requirements

As a project team (4 to 5 members), you will identify a data science problem, analyse its compute scaling requirements, collect the data, design and implement parallel software, and demonstrate the scaled performance of an end-to-end application.

The parallel software solution

It should be implemented on a heterogeneous distributed-memory architecture with either many-core or multi-core compute nodes, and evaluated on 8 compute nodes (note: each compute node on Odyssey is either a multi-core node with 32 (or 64) cores or a node with a many-core GPU with hundreds of cores).
It should be written as a hybrid parallel program in one of:
- MPI+OpenMP
- MPI+OpenACC (or CUDA)
- PGAS + X
- Spark (with GPU acceleration)
Its performance should be evaluated on large data sets to demonstrate both weak and strong scaling using appropriate metrics (throughput, efficiency, iso-efficiency).
It should solve a problem with a non-trivial computation graph and hierarchical parallelism.
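As an illustration of the scaling metrics named above, the following minimal sketch computes speed-up and parallel efficiency for a strong-scaling run (fixed total problem size), efficiency for a weak-scaling run (fixed work per worker), and throughput. The function names and the timing values are hypothetical and only serve the example; they are not measurements from Odyssey, and the course may expect additional metrics such as iso-efficiency.

```python
def strong_scaling(timings):
    """Speed-up and parallel efficiency for a fixed total problem size.

    timings maps worker count p to wall-clock time T(p) and must include p = 1.
    """
    t1 = timings[1]
    return {
        p: {"speedup": t1 / tp, "efficiency": t1 / (p * tp)}
        for p, tp in sorted(timings.items())
    }


def weak_scaling(timings):
    """Weak-scaling efficiency T(1) / T(p), with work per worker held constant."""
    t1 = timings[1]
    return {p: t1 / tp for p, tp in sorted(timings.items())}


def throughput(items_processed, seconds):
    """Work items completed per second."""
    return items_processed / seconds


# Hypothetical wall-clock times in seconds (illustrative only, not measured).
STRONG_TIMINGS = {1: 64.0, 2: 34.0, 4: 18.0, 8: 10.0}
WEAK_TIMINGS = {1: 10.0, 2: 10.5, 4: 11.0, 8: 12.5}

if __name__ == "__main__":
    for p, m in strong_scaling(STRONG_TIMINGS).items():
        print(f"p={p}: speed-up {m['speedup']:.2f}, efficiency {m['efficiency']:.2f}")
    for p, e in weak_scaling(WEAK_TIMINGS).items():
        print(f"p={p}: weak-scaling efficiency {e:.2f}")
```

In a strong-scaling plot the ideal speed-up equals the number of workers, while in a weak-scaling plot the ideal efficiency stays at 1 as workers and problem size grow together.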

Advanced concepts and technologies

To create novel parallel software solutions, or to pursue a research-oriented outcome, you can make use of the advanced concepts and technologies explored in the course. Implementations in the form of libraries and open source software are available as a base for building niche applications:

‘Cutting edge’ P100 GPU architecture with 10 TeraFlops throughput, available from a national resource (https://www.xsede.org/). Create an account on the XSEDE portal (https://portal.xsede.org/) if you wish to go down the MPI+OpenACC path in the project.
Communication-avoiding algorithms to achieve strong scaling (installed on Odyssey as modules: https://portal.rc.fas.harvard.edu/apps/modules/CANDMC/2016-fasrc01, https://portal.rc.fas.harvard.edu/apps/modules/CombBLAS/15.0-fasrc01).
Polyhedral model software to synthesise tiled parallel programs (OpenMP): https://sourceforge.net/projects/pocc/
Parallel algorithm formulation over semirings (libraries).
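To make the semiring formulation concrete, here is a minimal serial sketch of one common example: all-pairs shortest paths expressed as repeated matrix products over the (min, +) semiring, where `min` takes the place of addition and `+` takes the place of multiplication. The choice of semiring, the function names and the small graph are assumptions made for illustration; the course libraries may expose a different interface. The appeal for parallel work is that the algorithm reduces to matrix products, which distribute across nodes in the same way as ordinary linear algebra.

```python
import math

INF = math.inf  # "no edge" marker; acts as the additive identity of (min, +)


def min_plus_product(a, b):
    """Matrix product over the (min, +) semiring."""
    inner = len(b)
    return [
        [min(a[i][k] + b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def all_pairs_shortest_paths(weights):
    """Shortest path lengths by repeated squaring of the weight matrix.

    weights[i][j] is the edge weight from i to j, INF if there is no edge,
    and 0 on the diagonal. After squaring until paths of up to n - 1 edges
    are covered, entry (i, j) holds the shortest distance from i to j.
    """
    n = len(weights)
    dist = weights
    covered_edges = 1
    while covered_edges < n - 1:
        dist = min_plus_product(dist, dist)
        covered_edges *= 2
    return dist


# Hypothetical 3-node graph: 0 -> 1 (weight 4), 1 -> 2 (weight 1), 0 -> 2 (weight 7).
EXAMPLE_WEIGHTS = [
    [0, 4, 7],
    [INF, 0, 1],
    [INF, INF, 0],
]

if __name__ == "__main__":
    for row in all_pairs_shortest_paths(EXAMPLE_WEIGHTS):
        print(row)
```

In the example graph the direct edge from node 0 to node 2 costs 7, but the product finds the cheaper two-edge route through node 1 with cost 5.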

Project Deliverables

Web site (max. 5 pages)
- Introduction, including a comparison with existing work on the problem.
- Technical description of the parallel software solution, programming models, with links to code repository.
- Application scaling plots (speed-up, throughput, weak and strong scaling).
- Advanced Features
- Citations
Software with evaluation data sets and test cases (in the GitHub repo)
Presentation to the students and staff

Project Milestones

The milestones for your final project will be graded at each step according to the grading criteria given below. It is important to adhere to the deadlines, as the late submission policy does not apply to projects.

Milestones	Deadline
Project Team announcement (sign up document deposited in Git repo)	22nd March 2017
Project Proposal (1 page web site)	25th March 2017
Interim progress report (website populated with preliminary results)	15th April
Project deliverables (web site, code, README)	1st May
Project presentation to class (10 mins. + Qs)	2nd May 2017
Weekly meetings with project supervisors	20th March to 28th April

Project submissions: Github and Piazza

All submissions are per group. Create your own repository on GitHub with a link to your project web page.
All project deliverables, including those related to milestones, should be deposited in GitHub repos for peer evaluation.
All project-related correspondence should be posted on Piazza.
Send project-related emails (meeting schedules) to [email protected] and general queries to Piazza. Email the project supervisors directly only for critical or personal matters.

Project Supervisors

Manju
Charles
Rafael
WeiWeiPan

Project choices and programming environments

You can choose any data/computational science problem that you have already worked on in any other course: AC 209a, AC 209b, AM205, AM207, AC297R.
Alternatively, your own research work, combined with the advanced concepts above, would be suitable for generating research output.
Supervisors may offer projects based on their research interests.
You can re-use any code from the CS205 homework sets and build your application software on top of it. However, to gain further credit, the code should be augmented with additional parallel code that meets the requirements specified above.
You can implement the solution in any programming language of your choice (discuss with supervisors).
In the interdisciplinary spirit of the subject area and the CS205 course, projects and project teams should span multiple disciplines.

Research output (optional)

Optionally, the project may take the path of research and generate a research paper as output. In this case the project requirements are:

To implement a parallel algorithm as above, in support of the research problem being addressed.
To produce as final output a technical paper of journal quality, comparable in depth to papers published in leading journals in computational/data science or parallel computing.
To choose a parallel solution, which could range from a novel parallel graph algorithm to optimising a scientific application on Odyssey with new insights, or anything in between these two bounds of theory and experimentation.

Project Grade

The project will be graded on the depth of work undertaken, communication (web site, presentation) and participation.

10%: Project review (Peer and Supervisor meetings).
40%: Project software, README, overall quality (base line).
30%: Advanced features.
10%: Project web site.
10%: Presentation to class.

Project Scale and Scope

To uniformly assess the different projects the following criteria will be applied:

A team of 2 will be expected to produce output equivalent to at least 2 homeworks for the baseline outcome.
A team of 4 will therefore be expected to produce double that output in qualitative terms (not necessarily code, but features, analysis, evaluation, innovation). A team of 5 should exceed this threshold.
Advanced features can be divided equally between modelling and parallel software.

Resources

Harvard Computational and Data Science: references in courses listed above in project choices
NERSC data science projects: http://www.nersc.gov/science/science-highlights-presentations/
Exascale Computing Project: https://exascaleproject.org/researchareas
Previous years' projects: http://iacs-courses.seas.harvard.edu/courses/cs205/projects.html
