Elements of Data Science is an introduction to data science in Python for people with no programming experience. My goal is to present a small, powerful subset of Python that allows you to do real work in data science as quickly as possible.
At the same time, I want to make sure the material is presented clearly. I don't assume that the reader knows anything about programming, statistics, or data science. When I use a term, I try to define it immediately, and when I use a programming feature, I try to explain it.
There are a few places where I use a programming feature before it is fully explained, but I keep them to a minimum, and I'll let you know which details you can safely ignore for now.
This "book" is in the form of Jupyter notebooks. Jupyter is a software development tool you can run in a web browser, so you don't have to install any software. A Jupyter notebook is a document that contains text, Python code, and results. So you can read it like a book, but you can also modify the code, run it, develop new programs, and test them.
The notebooks contain exercises where you can practice what you learn. Most of the exercises are meant to be quick, but a few are more substantial.
This material is a work in progress, so suggestions are welcome. The best way to provide feedback is to click here and create an issue in this GitHub repository.
For each of the notebooks below, you have two options: if you view the notebook on NBViewer, you can read it, but you can't run the code. If you run the notebook on Colab, you'll be able to run the code, do the exercises, and save your modified version of the notebook in Google Drive (if you have an account).
Variables and values: The first notebook explains how to use Jupyter and introduces the most basic programming features in Python, variables and values.
Press this button to run this notebook on Colab:
or click here to read it on NBViewer
Times and places: This notebook shows how to represent times, dates, and locations in Python, and uses the GeoPandas library to plot points on a map.
Press this button to run this notebook on Colab:
or click here to read it on NBViewer
Lists and Arrays: This notebook presents lists and NumPy arrays. It discusses absolute, relative, and percent errors, and ways to summarize them.
Press this button to run this notebook on Colab:
or click here to read it on NBViewer
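To give a flavor of the error measures mentioned above, here is a minimal sketch using NumPy. The values in `actual` and `estimate` are invented for illustration and are not taken from the notebook, and the mean of the percent errors is only one of several possible ways to summarize them.

```python
import numpy as np

# Hypothetical values, invented for this example
actual = np.array([100.0, 200.0, 400.0])
estimate = np.array([110.0, 190.0, 400.0])

# Absolute error: the size of the difference, ignoring its sign
abs_error = np.abs(estimate - actual)

# Relative error: absolute error as a fraction of the actual value
rel_error = abs_error / actual

# Percent error: relative error expressed as a percentage
pct_error = rel_error * 100

# One possible summary: the mean of the percent errors
mean_pct_error = pct_error.mean()

print(abs_error)
print(pct_error)
print(mean_pct_error)
```

Because the arithmetic applies to whole arrays at once, no explicit loop is needed.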
Loops and Files: This notebook presents the for loop and the if statement; then it uses them to speed-read War and Peace and count the words.
Press this button to run this notebook on Colab:
or click here to read it on NBViewer
Dictionaries: This notebook presents one of the most powerful features of Python, dictionaries, and uses them to count the unique words in War and Peace.
Press this button to run this notebook on Colab:
or click here to read it on NBViewer
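As a rough idea of how a dictionary can count words, here is a small sketch that combines a for loop, an if statement, and a dictionary. The short string in `text` is a stand-in invented for this example; the notebook works with the full text of the novel, and its code may differ.

```python
# A short stand-in text, invented for this example
text = "the quick brown fox jumps over the lazy dog and the fox"

counts = {}
for word in text.split():
    if word in counts:
        # Seen before: add one to the existing count
        counts[word] += 1
    else:
        # First appearance: start the count at one
        counts[word] = 1

# Each key appears once, so the number of keys is the number of unique words
num_unique = len(counts)

print(counts["the"])
print(num_unique)
```

Each key in the dictionary is a word and each value is the number of times that word appears.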
Plotting: This notebook introduces Matplotlib, a plotting library for Python, and uses it to generate a few common data visualizations and one less common one, a Zipf plot.
Press this button to run this notebook on Colab:
or click here to read it on NBViewer
DataFrames: This notebook presents DataFrames, which are used to represent tables of data. It uses data from the National Survey of Family Growth to find the average weight of babies in the U.S.
Press this button to run this notebook on Colab:
or click here to read it on NBViewer
Distributions: This notebook explains what a distribution is and presents three ways to represent one: a probability mass function (PMF), a cumulative distribution function (CDF), or a probability density function (PDF). It also shows how to compare a distribution to another distribution or to a mathematical model.
Press this button to run this notebook on Colab:
or click here to read it on NBViewer
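To make two of these representations concrete, here is a minimal sketch that builds a PMF and a CDF from a small sample using only the standard library. The values in `sample` are invented for illustration, and the plain dictionaries used here are an assumption of this sketch, not necessarily the representation used in the notebook.

```python
from collections import Counter

# Hypothetical sample, invented for this example
sample = [1, 2, 2, 3, 3, 3, 4]

n = len(sample)
counts = Counter(sample)

# PMF: maps each value to the fraction of the sample equal to it
pmf = {x: counts[x] / n for x in sorted(counts)}

# CDF: maps each value to the fraction of the sample
# that is less than or equal to it
cdf = {}
running_total = 0
for x in sorted(counts):
    running_total += counts[x]
    cdf[x] = running_total / n

print(pmf)
print(cdf)
```

The PMF values add up to 1, and the CDF climbs from the smallest value to 1 at the largest.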
Relationships: This notebook explores relationships between variables using scatter plots, violin plots, and box plots. It quantifies the strength of a relationship using the correlation coefficient and uses simple regression to estimate the slope of a line.
Press this button to run this notebook on Colab:
or click here to read it on NBViewer
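For a sense of what these two quantities look like in code, here is a minimal sketch using NumPy. The data in `x` and `y` are invented for illustration, and `np.corrcoef` and `np.polyfit` are simply one convenient way to compute a correlation and a least-squares slope; the notebook may use different tools.

```python
import numpy as np

# Hypothetical data, invented for this example
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Pearson correlation coefficient: np.corrcoef returns a 2x2 matrix,
# and the off-diagonal element is the correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# Simple regression: fit a degree-1 polynomial (a line) by least squares
slope, intercept = np.polyfit(x, y, 1)

print(r)
print(slope, intercept)
```

The correlation describes how closely the points follow a line, while the slope describes how steep that line is, so the two numbers answer different questions.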
Regression: This notebook presents multiple regression and uses it to explore the relationship between age, education, and income. It uses visualization to interpret multivariate models. It also presents binary variables and logistic regression.
Press this button to run this notebook on Colab:
or click here to read it on NBViewer
Inference: This notebook presents computational inference, a process for computing p-values, standard errors, and confidence intervals using randomization methods rather than analysis.
Press this button to run this notebook on Colab: