Data science is a multidisciplinary field that unifies statistics, data analysis, machine learning, and related methods to extract knowledge and insights from data. It normally involves collecting data, cleaning data, performing exploratory data analysis, building and evaluating machine learning models, and communicating insights to stakeholders.
Business analytics: a scientific process that transforms data into insights.
Descriptive analytics includes techniques that explain what has happened in the past.
Predictive analytics includes techniques that predict the future by using models created from past data or determining one variable's impact on another.
Prescriptive analytics, the final phase of business analytics, specifies the best course of action for a business activity in the form of the output of a prescriptive model (recommendations to the organization).
R is a programming language and software environment for statistical analysis, graphics representation, and reporting.
- R is a well-developed, simple, and effective programming language that includes conditionals, loops, user-defined recursive functions, and input and output facilities.
- R has an effective data handling and storage facility.
- R provides a suite of operators for calculations on arrays, lists, vectors, and matrices.
- R provides a large, coherent, and integrated collection of tools for data analysis.
- R provides graphical facilities for data analysis and display, either directly on screen or as printed output.
In conclusion, R is the world's most widely used statistics programming language. It's the #1 choice of data scientists and is supported by a vibrant and talented community of contributors.
Variables are assigned R-objects, and the data type of the R-object becomes the data type of the variable. There are many types of R-objects. The frequently used ones are:
- Vectors
- Lists
- Matrices
- Arrays
- Factors
- Data Frames
The simplest of these objects is the vector object, and there are six data types of atomic vectors, also termed the six classes of vectors. The other R-objects are built upon the atomic vectors.
Examples: Logical - TRUE or FALSE
Numeric - 12.3, 5, 999
Integer - 2L, 34L, 0L
Complex - 3 + 2i
Character - 'a', "good", "TRUE", '23.4'
Raw - charToRaw("Hello") - "Hello" is stored as 48 65 6c 6c 6f
- Vectors:
apple <- c('r', 'o', 'g')
print(apple)
- Lists:
list1 <- list(c(2, 3, 4), 21, 3.4, sin)
print(list1)
- Matrices:
m <- matrix(c('a', 'a', 'b', 'c', 'b', 'a'), nrow = 2, ncol = 3, byrow = TRUE)
print(m)
- Arrays:
a <- array(c('green', 'red'), dim = c(3, 3, 2))
print(a)
- Factors:
# create a vector first
apple_colors <- c('r', 'o', 'g', 'r', 'o')
# create a factor object
factor_apple <- factor(apple_colors)
# print the factor and its number of levels
print(factor_apple)
print(nlevels(factor_apple))
- Data frames:
BMI <- data.frame(
  gender = c('male', 'female', 'male'),
  height = c(152, 142, 156.8),
  weight = c(56, 54, 34),
  age = c(54, 33, 22)
)
print(BMI)
Variables are used to store data, and the unique name given to a variable is called an identifier.
Arithmetic (+, -, *, /, ^, %% for modulus, and %/% for integer division, e.g. the quotient 16 %/% 5 = 3), Relational (>, <, >=, <=, ==, !=), Logical (and &, or |, not !), and assignment (a <- 8 and 8 -> a both assign the value 8 to a) operators.
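For illustration, a few of these operators in action (the values used are arbitrary):
a <- 8               # assignment with <-
8 -> b               # assignment with ->, also stores 8 in b
16 %% 5              # modulus: 1
16 %/% 5             # integer division (quotient): 3
a == b               # TRUE
(a > 5) & (b < 10)   # TRUE, both conditions hold
!(a != b)            # TRUE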
String
string <- "hello world"
print(string)
Multiline comment (R has no dedicated multi-line comment syntax, so a block of text can be skipped by wrapping it in if(FALSE)):
if (FALSE) {
"this is a multi line
comment, and this is how we
put it"
}
dplyr:
The dplyr package is used to transform and summarize tabular data with rows and columns.
Its key verbs are select, filter, arrange, summarise, and mutate (plus group_by for grouped summaries), as in the sketch below.
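A minimal dplyr sketch on the built-in mtcars dataset (the particular columns and chain are illustrative):
library(dplyr)
mtcars %>%
  select(mpg, cyl, wt) %>%          # keep only some columns
  filter(cyl == 4) %>%              # keep rows where cyl equals 4
  mutate(wt_kg = wt * 453.6) %>%    # add a derived column (wt is in 1000 lbs)
  arrange(desc(mpg))                # sort rows by mpg, descending
mtcars %>%
  group_by(cyl) %>%                 # summarise mean mpg per cylinder count
  summarise(mean_mpg = mean(mpg))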
tidyr:
The tidyr package helps you create tidy data; tidy data is easy to visualize and model. Key functions: gather (makes wide data longer), spread (makes long data wider), separate (splits a column into multiple columns), unite (combines multiple columns into one).
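A minimal tidyr sketch on small made-up tables (the data frames and column names are illustrative):
library(tidyr)
scores <- data.frame(name = c("A", "B"), math = c(90, 80), stats = c(85, 95))
long <- gather(scores, key = "subject", value = "score", math, stats)  # wide -> long
wide <- spread(long, key = subject, value = score)                     # long -> wide
dates <- data.frame(date = "2021-01-15")
parts <- separate(dates, date, into = c("year", "month", "day"), sep = "-")  # split a column
unite(parts, "date", year, month, day, sep = "-")                            # combine columns back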
%>% is called the forward pipe operator in R. It provides a mechanism for chaining commands.
This operator forwards a value, or the result of an expression, into the next function call/expression. It is defined by the package magrittr (CRAN) and is heavily used by dplyr (CRAN).
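For example (using magrittr's pipe; the numbers are arbitrary):
library(magrittr)
# x %>% f(y) is equivalent to f(x, y)
c(1, 4, 9, 16) %>% sqrt() %>% sum()   # same as sum(sqrt(c(1, 4, 9, 16)))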
x <- 20
if (x > 18) {
  print("major")
} else {
  print("minor")
}
# for loop
vec <- c(1, 2, 3, 4, 5)
for (val in vec) {
  print(val)
}
# while loop
i <- 1
while (i < 6) {
  print(i)
  i <- i + 1
}
# repeat loop, exited with break
x <- 1
repeat {
  print(x)
  x <- x + 1
  if (x == 6) {
    break
  }
}
# break: exit the loop when val reaches 3
num <- 1:5
for (val in num) {
  if (val == 3) {
    break
  }
  print(val)
}
# next: skip the iteration where val is 3
num <- 1:5
for (val in num) {
  if (val == 3) {
    next
  }
  print(val)
}
A script is a set of commands to be executed in the console; it is run with source("myScript.R").
Functions are stored as R objects, and there are thousands of functions at the core of R, e.g.
append(), c(), identical(), length(), and so on.
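For illustration, a few of these built-in functions in use:
x <- c(1, 2, 3)      # combine values into a vector
y <- append(x, 4)    # append an element
length(y)            # 4
identical(x, y)      # FALSE, the vectors differ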
R enables you to import data from different sources.
- Table: a table can be loaded in R using the read.table function.
- CSV: a .csv file is imported using the read.csv function.
- Excel: an Excel file is imported with the read_excel function from the readxl package, e.g. read_excel("filename.xlsx", sheet = 'sheetname').
You can also export data from R to files in another location.
- To export a table: write.table(file_name, "c:/file_name.txt", sep = "\t")
- To export an Excel file: write.xlsx(file_name, "c:/file_name.xlsx") (write.xlsx is provided by the openxlsx package)
- To export a CSV file: write.csv(file_name, "c:/file_name.csv")
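A minimal, self-contained sketch of a CSV export/import round trip, written to a temporary file so it runs anywhere (the mtcars dataset is used only as an example):
f <- tempfile(fileext = ".csv")
write.csv(mtcars, f, row.names = FALSE)  # export a data frame to CSV
df <- read.csv(f)                        # import it back
head(df)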
Bar plot, Pie chart, Histogram, Kernel density plot, Line chart, Box plot, Heat map, Word cloud.
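A few of these plots with base R graphics on the built-in mtcars data (purely illustrative):
hist(mtcars$mpg, main = "Histogram of mpg")                 # histogram
barplot(table(mtcars$cyl), main = "Cars per cylinder")      # bar plot
boxplot(mpg ~ cyl, data = mtcars, main = "mpg by cyl")      # box plot
plot(density(mtcars$mpg), main = "Kernel density of mpg")   # kernel density plot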
Hypothesis: an assumption.
The hypothesis needs analysis to be validated.
Simple hypothesis: a relationship between 2 variables.
Complex hypothesis: a relationship between more than 2 variables.
Null hypothesis (H0): e.g. mean = 100.
Alternate hypothesis (H1): e.g. mean != 100.
Statistical hypothesis: statistical inference performed on data from a scientific study.
A hypothesis test is a formal procedure in statistics used to test whether a hypothesis can be accepted or not.
It is a statistical technique used to select, manipulate, and analyze a subset of data points to discover hidden patterns and trends in the larger data set.
Statistical tests are statistical methods that help us reject or not reject our null hypothesis. They're based on probability distributions and can be one-tailed or two-tailed, depending on the hypotheses that we've chosen.
There are other ways in which statistical tests can differ and one of them is based on their assumptions of the probability distribution that the data in question follows.
Parametric tests are those statistical tests that assume the data approximately follows a normal distribution, amongst other assumptions. (In the standard normal distribution the mean is zero and the standard deviation is 1; in any normal distribution the mean (average), median (midpoint), and mode (most frequent observation) are all equal.) Examples: z-test, t-test, ANOVA, MANOVA.
Important note - the assumption is that the data of the whole population follows a normal distribution, not the sample data that you're working with.
Nonparametric tests are those statistical tests that don't assume anything about the distribution followed by the data and hence are also known as distribution-free tests (examples include Chi-square, Mann-Whitney U-test).
Nonparametric tests are based on the ranks held by different data points.
Common parametric tests are focused on analyzing and comparing the mean or variance of data.
The mean is the most commonly used measure of central tendency to describe data, however, it is also heavily impacted by outliers. Thus it is important to analyze your data and determine whether the mean is the best way to represent it. If yes, then parametric tests are the way to go! If not, and the median better represents your data, then nonparametric tests might be the better option.
As mentioned above, parametric tests have a couple of assumptions that need to be met by the data:
- Normality - the sample data come from a population that approximately follows a normal distribution
- Homogeneity of variance - the sample data come from a population with the same variance
- Independence - the sample data consists of independent observations and is sampled randomly
- Outliers - the sample data don't contain any extreme outliers
The degrees of freedom are essentially the number of independent values that can vary in a set of data while measuring statistical parameters.
If you want to compare the means of two groups then the right tests to choose between are the z-test and the t-test.
t-test is a classic method for comparing mean values of two samples that are normally distributed (i.e. they have a Gaussian distribution). Such samples are described as being parametric and the t-test is a parametric test. In R the t.test() command will carry out several versions of the t-test.
A one-sample z-test is used to determine whether the population mean is equal to or different from a predefined standard (or theoretical) value of the mean, when the population standard deviation is known and the sample size is large.
One-sample (one-sample z-test or a one-sample t-test): one group will be a sample and the second group will be the population. So youβre basically comparing a sample with a standard value from the population. We are basically trying to see if the sample comes from the population, i.e. does it behave differently from the population or not?
Two-sample (two-sample z-test and a two-sample t-test): both groups will be separate samples. As in the case of one-sample tests, both samples must be randomly selected from the population and the observations must be independent of one another.
Difference: the z-test assumes the population standard deviation (σ) is known, while for the t-test it is unknown and must be estimated from the sample.
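A minimal sketch with t.test() on simulated data (the sample sizes, means, and seed are arbitrary):
set.seed(1)
x <- rnorm(30, mean = 100, sd = 10)
t.test(x, mu = 100)            # one-sample t-test against a mean of 100
y <- rnorm(30, mean = 105, sd = 10)
t.test(x, y)                   # two-sample (Welch) t-test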
ANOVA - short for "analysis of variance" - is a statistical technique for testing if 3(+) population means are all equal.
Multivariate analysis of variance (MANOVA) is a procedure for comparing multivariate sample means. As a multivariate procedure, it is used when there are two or more dependent variables and is often followed by significance tests involving individual dependent variables separately.
One-way ANOVA, two-way ANOVA.
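A one-way ANOVA sketch using aov() on the built-in PlantGrowth dataset (chosen only for illustration):
fit <- aov(weight ~ group, data = PlantGrowth)  # compare mean weight across the 3 groups
summary(fit)                                    # F statistic and p-value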
The U-test is used for comparing the median values of two samples. You use it when the data are not normally distributed, so it is described as a non-parametric test. The U-test is often called the Mann-Whitney U-test but is generally attributed to Wilcoxon (Wilcoxon Rank Sum test), hence in R the command is wilcox.test().
It is used to observe how closely a sample matches a population.
Multiple regression predicts the value of a variable based on the values of two or more other variables.
It considers more than one quantitative or qualitative variable (X1, ..., Xn) to predict a quantitative dependent variable Y.
Non-linear regression fits a line that is not straight, i.e. polynomial, logarithmic, square root, reciprocal, and exponential regression.
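A multiple linear regression sketch with lm() on the built-in mtcars data (the predictors wt and hp are just an example):
model <- lm(mpg ~ wt + hp, data = mtcars)   # predict mpg from weight and horsepower
summary(model)                              # coefficients, R-squared
predict(model, newdata = data.frame(wt = 3, hp = 110))  # prediction for a new observation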
Clustering is a type of unsupervised learning.
It groups things based on similarities.
(1) Prototype-based (partitional) clustering - based on centroids, e.g. K-means and fuzzy c-means.
(2) Hierarchical clustering - based on a dendrogram; agglomerative (bottom-up) and divisive (top-down).
(3) Density-based clustering - used to identify clusters of any shape in a data set containing noise and outliers.
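A minimal clustering sketch on the numeric columns of the built-in iris data (k = 3 is assumed only for illustration):
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3)   # prototype/centroid-based clustering
table(km$cluster, iris$Species)          # compare clusters with the actual species
hc <- hclust(dist(iris[, 1:4]))          # agglomerative hierarchical clustering
plot(hc)                                 # dendrogram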
- Get the data
- Explore and visualize data for insights
- Clean the data for machine learning algorithms
- Select and train a model
- Tune the parameters (if possible) for performance enhancement
- Present your findings and solutions
- Create, launch, and maintain a scalable system
When the data is split into training and testing sets, it is possible that a specific type of data point goes entirely into either the training or the testing portion. This would lead the model to perform poorly. Hence over-fitting and under-fitting problems can be better avoided with cross-validation techniques. Cross-validation is a technique used to estimate the accuracy of predictive models.
We are doing evaluation because we want to get an accurate measure of how well the model performs. If our dataset is small, our test set is going to be small. Thus it might not be a good random assortment of data points and by random chance end up with easy or difficult data points in our evaluation set. Since our goal is to get the best possible measure of our metrics (accuracy, precision, recall, and F1 score), we can do a little better than just a single training and test set. Instead of doing a single train/test split, we'll split our data into a training set and test set multiple times.
This process for creating multiple training and test sets is called k-fold cross-validation. The k is the number of chunks we split our dataset into.
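A hand-rolled 5-fold cross-validation sketch in base R (the model, dataset, and error metric are illustrative assumptions):
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))   # assign each row to one of k folds
rmse <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)             # train on k-1 folds
  pred  <- predict(fit, newdata = test)                # evaluate on the held-out fold
  rmse[i] <- sqrt(mean((test$mpg - pred)^2))
}
mean(rmse)   # average error across the k folds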
Reducing the number of input variables for a predictive model is referred to as dimensionality reduction.
Perhaps the most popular technique for dimensionality reduction in machine learning is Principal Component Analysis, or PCA for short. This is a technique that comes from the field of linear algebra and can be used as a data preparation technique to create a projection of a dataset prior to fitting a model.
Principal components are linear combinations of the original variables. They tend to capture as much variance as possible in a dataset.
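A PCA sketch with prcomp() on the numeric columns of iris (chosen only for illustration):
pca <- prcomp(iris[, 1:4], scale. = TRUE)   # scale variables before PCA
summary(pca)                                # variance explained by each component
head(pca$x[, 1:2])                          # projection onto the first two components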
Mean - Arithmetic average
Median - Midpoint of the distribution (50th percentile)
Mode - Most frequent observation
Variance - Variance measures how far a set of numbers is spread out; it is the average of the squared differences from the mean.
Standard Deviation - The standard deviation measures the amount of variation or dispersion from the average; it is the square root of the variance.
IQR - The interquartile range (IQR) is a measure of statistical dispersion, equal to the difference between the upper and lower quartiles: IQR = Q3 - Q1
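These measures in R, on a small illustrative vector:
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
mean(x); median(x)
names(sort(table(x), decreasing = TRUE))[1]   # mode: most frequent value
var(x); sd(x)                                 # note: var() uses the sample (n - 1) formula
IQR(x)                                        # Q3 - Q1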
Probability:
Disjoint - events do not have any common outcomes - P(A and B) = 0 e.g. A man cannot be dead and alive.
Non-disjoint - events can have common outcomes - P(A and B) != 0 e.g. A student can get 100 marks in statistics and 100 marks in probability.
Furthermore, the practice files are included in the "RStudio" folder, and the datasets are included in the "dataset" folder.