PG-Introduction to Data Analysis, Semester 3, First Internal Exam, September 2023
AI010302 INTRODUCTION TO DATA ANALYTICS
Answer Key
Section A
1. Supervised learning is a type of machine learning in which machines are trained using well-labelled training data and, on the basis of that data, predict the output. Labelled data means that the input data is already tagged with the correct output. In supervised learning, the training data provided to the machine acts as a supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
2. The least squares estimates of β0 and β1 are:
β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
β̂0 = ȳ − β̂1x̄
The mean response is then estimated using the equation:
Ŷ = β̂0 + β̂1x
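A minimal sketch of these estimates in Python, using NumPy and made-up illustrative data (the x and y values are assumptions for this example):

import numpy as np

# Illustrative data (assumed for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates of slope and intercept
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Estimated mean response at a new value x0
x0 = 2.5
y_hat = beta0_hat + beta1_hat * x0
print(beta0_hat, beta1_hat, y_hat)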
3. Detecting the problem of multicollinearity
Multicollinearity is a statistical phenomenon that occurs when two or more independent variables in a regression model are highly correlated with each other. It can be detected either by using the Variance Inflation Factor (VIF) or by calculating the correlation matrix of the independent variables.
The VIF measures the strength of the correlation between the independent variables. It is computed by taking each independent variable in turn and regressing it against every other independent variable. The resulting R2 value indicates how well that independent variable is described by the other independent variables; a high R2 means the variable is highly correlated with the others. This is captured by the VIF, which is given by:
VIF = 1 / (1 − R2)
So the closer the R2 value is to 1, the higher the VIF and the greater the multicollinearity associated with that particular independent variable.
· VIF starts at 1 and has no upper limit.
· VIF = 1 indicates no correlation between the independent variable and the other variables.
· VIF exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others.
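A hedged sketch of the VIF calculation in Python, using statsmodels' variance_inflation_factor on synthetic data (the predictors x1, x2, x3 are assumptions for illustration; x2 is constructed to be nearly collinear with x1):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Assumed illustrative data: x2 is nearly collinear with x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)
x3 = rng.normal(size=100)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF is computed against a design matrix that includes a constant
X_const = sm.add_constant(X)
for i, col in enumerate(X_const.columns):
    if col != "const":
        print(col, variance_inflation_factor(X_const.values, i))

Here x1 and x2 should show very large VIFs, while x3 stays near 1.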
4. Sensitivity and Specificity
Sensitivity (true positive rate) is the probability of a positive test result, conditioned on the individual truly being positive. Specificity (true negative rate) is the probability of a negative test result, conditioned on the individual truly being negative.
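Equivalently, in terms of confusion-matrix counts (TP, FN, TN, FP):
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)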
5. Binary classification: It refers to those classification tasks that have two class labels. Examples include email spam detection (spam or not).
Section B
6. To find the coefficient of determination, simply square the correlation coefficient. The resulting value ranges between zero and one, and can be converted to a percentage to express what portion of the variation in y is explained by the changes in x. A value of 0.70 for the coefficient of determination means that 70% of the variability in the outcome variable (y) can be explained by the predictor variable (x). This also means that the model used to predict the values is a relatively good fit.
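For example, a correlation coefficient of r = 0.837 squares to a coefficient of determination of r² ≈ 0.70, the 70% case described above.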
7. Feature selection is a way to reduce the number of features and hence the computational complexity of the model. Feature selection is often very useful for overcoming the overfitting problem. It helps us determine the smallest set of features needed to predict the response variable with high accuracy. AIC is most often used for model selection: by calculating and comparing the AIC scores of several possible models, you can choose the one that best fits the data (the model with the lowest AIC is preferred). AIC measures the relative information value of the model using the maximum likelihood estimate and the number of parameters (independent variables) in the model. The formula for AIC is:
AIC = 2K − 2ln(L)
where K is the number of parameters estimated when fitting the model and L is the maximum value of the model's likelihood function (i.e., the likelihood that the model could have produced your observed y-values).
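A hedged sketch of AIC-based model comparison in Python, using statsmodels' OLS and synthetic data (the variables x1, x2 and the true model are assumptions for illustration):

import numpy as np
import statsmodels.api as sm

# Assumed illustrative data: y depends on x1 only; x2 is irrelevant
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 3.0 + 2.0 * x1 + rng.normal(size=200)

# Fit two candidate models and compare their AIC scores
m1 = sm.OLS(y, sm.add_constant(np.column_stack([x1]))).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# The model with the lower AIC is preferred
print("AIC with x1 only:   ", m1.aic)
print("AIC with x1 and x2: ", m2.aic)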
8. The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems. It is mainly used in text classification, which involves high-dimensional training datasets. The Naïve Bayes classifier is one of the simplest and most effective classification algorithms, and it helps in building fast machine learning models that can make quick predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object belonging to each class.
Naive Bayes classifiers are in fact a family of algorithms rather than a single algorithm. All of them share a common principle: every pair of features being classified is independent of each other, and all features contribute equally to the outcome.
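A minimal sketch of a Naïve Bayes classifier in Python, using scikit-learn's GaussianNB on synthetic data (the dataset is an assumption standing in for a real task such as spam detection):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumed synthetic data standing in for a real classification task
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gaussian Naive Bayes: features are assumed conditionally independent given the class
model = GaussianNB()
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))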
9. kNN classifier: K Nearest Neighbor is one of the fundamental algorithms in machine learning. Machine learning models use a set of input values to predict output values, and KNN is one of the simplest machine learning algorithms, mostly used for classification. It classifies a data point based on how its neighbors are classified: KNN assigns new data points to a class based on a similarity measure computed against the previously stored data points.
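A minimal sketch of a kNN classifier in Python, using scikit-learn's KNeighborsClassifier (the Iris dataset is used purely as an illustrative stand-in):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris used here purely as a stand-in dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new point is assigned the majority class among its k nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))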
Section C
10. The matrix expression for the expected value of Y in multiple regression (the equation for the mean response) is:
E[Y] = Xβ
The equation for the LS estimator is:
β̂ = (XᵀX)⁻¹XᵀY
Significance of weighted least squares for the MLR model:
The method of ordinary least squares assumes that there is constant variance in the errors (homoscedasticity). The method of weighted least squares can be used when this assumption of constant variance in the errors is violated (heteroscedasticity).
The model under consideration is:
Y = Xβ + Є,
where Є is assumed to be (multivariate) normally distributed with mean vector 0 and a nonconstant (diagonal) variance-covariance matrix with entries σi².
If we define the reciprocal of each variance, σi², as the weight, wi = 1/σi², then let matrix W be a diagonal matrix containing these weights:
W = diag(w1, …, wn)
The weighted least squares estimate is then:
β̂WLS = (XᵀWX)⁻¹XᵀWY
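A hedged sketch of weighted least squares in Python, using statsmodels' WLS on synthetic heteroscedastic data (the data-generating process and the weights are assumptions for illustration):

import numpy as np
import statsmodels.api as sm

# Assumed illustrative data: the error standard deviation grows with x
rng = np.random.default_rng(2)
x = np.linspace(1, 10, 100)
sigma = 0.5 * x
y = 1.0 + 2.0 * x + rng.normal(scale=sigma)

X = sm.add_constant(x)

# Weights are the reciprocals of the error variances, w_i = 1 / sigma_i^2
w = 1.0 / sigma ** 2
wls_fit = sm.WLS(y, X, weights=w).fit()
print(wls_fit.params)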
11. Logistic regression
Logistic regression models a relationship between predictor variables and a categorical response variable. Binary logistic regression is used when the response is binary (i.e., it has two possible outcomes). Particular issues with modelling a categorical response variable include nonnormal error terms, nonconstant error variance, and constraints on the response function (i.e., the response is bounded between 0 and 1).
Our goal is to find the logistic regression function 𝑝(𝐱) such that the predicted
responses 𝑝(𝐱ᵢ) are as close as possible to the actual response 𝑦ᵢ for each observation 𝑖 = 1, …, 𝑛. Remember that the actual response can be only 0 or 1 in
binary classification problems! This means that each 𝑝(𝐱ᵢ) should be close to either 0 or 1.
The logistic function, or sigmoidal function, is a popular choice for p(x), as it has values very close to either 0 or 1 across most of its domain:
p(x) = 1 / (1 + e^(−f(x))),
where f(x) = β0 + β1x1 + ⋯ + βkxk denotes the linear predictor.
The conditional probability of class membership is modelled via
P(Y = 1 | X = x) = p(x).
Given a feature input x, the discriminant function favours Y = 1 when p(x) ≥ 0.5 (equivalently, when the linear predictor f(x) ≥ 0), and Y = 0 otherwise.
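A minimal sketch of binary logistic regression in Python, using scikit-learn's LogisticRegression on synthetic one-feature data (the data-generating sigmoid and its coefficients are assumptions for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed synthetic data: class 1 becomes more likely as x grows
rng = np.random.default_rng(3)
x = rng.normal(size=(200, 1))
p = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x[:, 0])))
y = rng.binomial(1, p)

clf = LogisticRegression()
clf.fit(x, y)

# predict_proba returns p(x); predict applies the 0.5 threshold
print(clf.predict_proba(x[:5]))
print(clf.predict(x[:5]))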