PG-Introduction to Data Analysis, Semester 3, First Internal Exam, September 2023

 

AI010302 INTRODUCTION TO DATA ANALYTICS

Answer Key

Section A

1.     Supervised learning is the type of machine learning in which machines are trained using well-"labelled" training data, and on the basis of that data, machines predict the output. Labelled data means that some input data is already tagged with the correct output.

In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.

2.     Least squares estimates of β0 and β1 are:

β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²,  β̂0 = ȳ − β̂1x̄

Mean response estimation is done using the equation:

Ŷ = β̂0 + β̂1x
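
As an illustration, here is a minimal sketch in Python that computes these estimates on made-up data (the sample values and variable names are invented for illustration):

```python
import numpy as np

# Hypothetical sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates of the slope and intercept
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Estimated mean response at a new value x = 6
y_hat = b0 + b1 * 6
print(b0, b1, y_hat)
```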

3.     Detecting the problem of multicollinearity

Multicollinearity is a statistical phenomenon that occurs when two or more independent variables in a regression model are highly correlated with each other.

It can be detected either by using the Variance Inflation Factor (VIF) or by calculating the correlation matrix of the independent variables.

VIF measures the strength of the correlation between the independent variables. It is computed by taking each variable in turn and regressing it against every other independent variable. The R2 value is determined to find out how well an independent variable is described by the other independent variables. A high value of R2 means that the variable is highly correlated with the other variables. This is captured by the VIF, which is given by:

VIF = 1 / (1 − R2)

So, the closer the R2 value is to 1, the higher the value of VIF and the higher the multicollinearity associated with that particular independent variable (a computational sketch follows the list below).

·         VIF starts at 1 and has no upper limit

·         VIF = 1, no correlation between the independent variable and the other variables

·         VIF exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others
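
A minimal sketch of the VIF calculation in Python, assuming the statsmodels library and made-up data (the variable names and values are invented; x3 is deliberately constructed to be nearly collinear with x1 and x2):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical predictors; x3 is nearly a linear combination of x1 and x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.1, size=100)
X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF for each predictor (the constant column is skipped)
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```

Here x1, x2, and x3 should all show very large VIF values, flagging the multicollinearity.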

4.     Sensitivity and Specificity

Sensitivity (true positive rate) is the probability of a positive test result, conditioned on the individual truly being positive.

Specificity (true negative rate) is the probability of a negative test result, conditioned on the individual truly being negative.
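
In terms of confusion-matrix counts (TP = true positives, FN = false negatives, TN = true negatives, FP = false positives), the two quantities can be written as:

```latex
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}
```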

[Figure: confusion matrix and class statistics, from Rosaria Silipo, Towards Data Science]

 

5.     Binary classification: It refers to classification tasks that have two class labels. An example is email spam detection (spam or not spam).

 

Section B

 

6.     To find the coefficient of determination, simply square the correlation coefficient. The resulting value ranges between zero and one and can be converted to a percentage to express what portion of the variation in y is explained by the changes in x. A value of 0.70 for the coefficient of determination means that 70% of the variability in the outcome variable (y) can be explained by the predictor variable (x). This also means that the model used to predict the values is a relatively accurate fit.
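
As a worked illustration of the 0.70 figure above (the correlation value here is back-calculated, so it is illustrative rather than taken from the question):

```latex
r \approx 0.8367 \;\Rightarrow\; r^2 = (0.8367)^2 \approx 0.70 = 70\%
```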

 

7.     Feature selection is a way to reduce the number of features and hence reduce the computational complexity of the model. Feature selection is also often very useful for overcoming the overfitting problem. It helps us determine the smallest set of features that is needed to predict the response variable with high accuracy. AIC is most often used for model selection. By calculating and comparing the AIC scores of several possible models, you can choose the one that is the best fit for the data. AIC determines the relative information value of the model using the maximum likelihood estimate and the number of parameters (independent variables) in the model. The formula for AIC is:

AIC = 2K – 2ln(L)

K is the number of parameters estimated when fitting the model and L is the maximized value of the model's likelihood function (i.e., the likelihood that the model could have produced your observed y-values); ln(L) is its natural logarithm. Among candidate models, the one with the lowest AIC is preferred.
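
A minimal sketch of AIC-based model comparison in Python, assuming statsmodels and made-up data (the variables and coefficients are invented; y depends on x1 only, so the smaller model should win):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: y depends on x1 only; x2 is pure noise
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 2.0 * x1 + rng.normal(size=n)

# Candidate models with different feature sets
X_a = sm.add_constant(np.column_stack([x1]))
X_b = sm.add_constant(np.column_stack([x1, x2]))

fit_a = sm.OLS(y, X_a).fit()
fit_b = sm.OLS(y, X_b).fit()

# Lower AIC indicates the better trade-off between fit and complexity
print("AIC (x1 only):   ", fit_a.aic)
print("AIC (x1 and x2): ", fit_b.aic)
```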

 

8.     Naïve Bayes algorithm is a supervised learning algorithm which is based on Bayes' theorem and is used for solving classification problems. It is mainly used in text classification, which involves high-dimensional training datasets. The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.

Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem. It is not a single algorithm but a family of algorithms that share a common principle: every pair of features being classified is independent of each other, and all features contribute equally to the outcome.
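
A minimal text-classification sketch in Python, assuming scikit-learn and a toy made-up spam dataset (all strings and labels here are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: 1 = spam, 0 = not spam
texts = [
    "win a free prize now",
    "lowest price guaranteed, buy now",
    "meeting rescheduled to Monday",
    "please review the attached report",
]
labels = [1, 1, 0, 0]

# Bag-of-words features followed by a multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize, buy now"]))    # likely spam
print(model.predict(["see the report Monday"]))  # likely not spam
```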

 

9.     kNN classifier: K Nearest Neighbour is one of the fundamental algorithms in machine learning. Machine learning models use a set of input values to predict output values. kNN is one of the simplest machine learning algorithms and is mostly used for classification. It classifies a data point based on how its neighbours are classified: new data points are assigned a class according to a similarity measure against the earlier stored data points.
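
A minimal kNN sketch in Python, assuming scikit-learn and made-up two-dimensional data (the points and labels are invented for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D training points and their class labels
X_train = [[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [6, 6]]
y_train = [0, 0, 0, 1, 1, 1]

# Classify by majority vote among the k = 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(knn.predict([[2, 2], [6, 7]]))  # expected: [0 1]
```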

 

Section C

 

10.  The matrix expression for the expected value of Y (the mean response) in multiple regression is:

E[Y] = Xβ,

where Y is the n×1 vector of responses, X is the n×p design matrix, and β is the p×1 vector of regression coefficients.

The equation for the LS estimator is:

β̂ = (XᵀX)⁻¹XᵀY

Significance of weighted least squares for the MLR model:

The method of ordinary least squares assumes that there is constant variance in the errors (homoscedasticity). The method of weighted least squares can be used when this ordinary least squares assumption is violated (heteroscedasticity).

The model under consideration is:

Y=Xβ+Є,

where Є is assumed to be (multivariate) normally distributed with mean vector 0 and nonconstant variance-covariance matrix Σ = diag(σ₁², σ₂², …, σₙ²).

If we define the reciprocal of each variance, σᵢ², as the weight, wᵢ = 1/σᵢ², then let matrix W be a diagonal matrix containing these weights: W = diag(w₁, w₂, …, wₙ).

The weighted least squares estimate is then:

β̂_WLS = (XᵀWX)⁻¹XᵀWY
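
A minimal WLS sketch in Python, assuming statsmodels and made-up heteroscedastic data (the weights here use the known inverse variances; in practice these would have to be estimated):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data where the error variance grows with x
rng = np.random.default_rng(2)
n = 100
x = np.linspace(1, 10, n)
sigma = 0.5 * x                      # nonconstant error standard deviation
y = 1.0 + 2.0 * x + rng.normal(scale=sigma)

X = sm.add_constant(x)

# Weights are the reciprocals of the error variances, w_i = 1 / sigma_i^2
w = 1.0 / sigma**2
wls_fit = sm.WLS(y, X, weights=w).fit()
ols_fit = sm.OLS(y, X).fit()

print(wls_fit.params)  # WLS coefficient estimates
print(ols_fit.params)  # OLS estimates for comparison
```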

11.  Logistic regression

Logistic regression models a relationship between predictor variables and a categorical response variable. Binary Logistic Regression is used when the response is binary (i.e., it has two possible outcomes). Particular issues with modelling a categorical response variable include nonnormal error terms, nonconstant error variance, and constraints on the response function (i.e., the response is bounded between 0 and 1).

Our goal is to find the logistic regression function 𝑝(𝐱) such that the predicted responses 𝑝(𝐱) are as close as possible to the actual response 𝑦 for each observation 𝑖 = 1, …, 𝑛. Remember that the actual response can be only 0 or 1 in binary classification problems! This means that each 𝑝(𝐱) should be close to either 0 or 1.

The logistic function (sigmoid function) is a popular choice for p(x), as its values lie very close to either 0 or 1 across most of its domain:

p(x) = e^η / (1 + e^η) = 1 / (1 + e^(−η)),

where η = β0 + β1x1 + … + βkxk denotes the linear predictor.

The conditional probability of category membership is then modelled via

P(Y = 1 | X = x) = p(x) = e^η / (1 + e^η).

Given a feature input x, the discriminant function favours Y = 1 when p(x) > 0.5, i.e., when the linear predictor η > 0, and Y = 0 otherwise.
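
A minimal binary logistic regression sketch in Python, assuming scikit-learn and made-up data (the feature, the true sigmoid used to generate labels, and all values are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 1-D feature with a binary response
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
p = 1.0 / (1.0 + np.exp(-(x[:, 0] - 5.0)))  # true sigmoid probability
y = (rng.uniform(size=200) < p).astype(int)

clf = LogisticRegression()
clf.fit(x, y)

# Predicted probabilities p(x) and 0/1 class labels (threshold 0.5)
print(clf.predict_proba([[2.0], [8.0]]))
print(clf.predict([[2.0], [8.0]]))
```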
