example and current example from the data] C --> D[Sort the distances in ascending order] D --> E[Pick the first K entries] E --> F[Get the labels of the selected K entries] F --> G[/Return mode of the K labels/] {{< /mermaid >}} -->

The selection of the hyperparameter k has a significant effect on the classifier. In general, for a low value of k, the classifier may overfit and perform poorly on new unseen data. The value of k is chosen so that it balances bias and variance. When k is small, we restrict the region of a given prediction and force our classifier to be "more blind" to the overall distribution. A small value of k provides the most flexible fit, which has low bias but high variance; graphically, the decision boundary will be more jagged. On the other hand, a higher k averages more voters in each prediction and is hence more resilient to outliers. Larger values of k yield smoother decision boundaries, which means lower variance but increased bias. The value of k is chosen so that the desired accuracy of the kNN classifier is achieved. A simple method to select the value of k is to plot error versus k and choose the k at which the error is minimum.

### 2.2 Naïve Bayes Classifier

Bayes' Rule (also known as Bayes' Theorem) is a statistical principle for combining prior knowledge of the classes, P(Y), with new evidence gathered from data: the class-conditional probability P(X|Y) and the evidence P(X). Bayes' Rule stipulates that:

$$ P(Y|\boldsymbol{X}) = \frac{P(\boldsymbol{X}|Y)\,P(Y)}{P(\boldsymbol{X})} \tag{4} $$

Bayes' rule (4) gives the probability of event Y given that event X is true. Event X is also termed the evidence. P(Y) is the prior probability of Y, i.e. the probability of the event before the evidence is seen. P(Y|X) is the posterior probability of Y, i.e. the probability of the event after the evidence is seen.
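As a small numeric illustration of Bayes' rule (4), here is a sketch of a spam filter updating its belief about a message after observing that it contains a link. All probabilities below are made up for illustration:

```python
# Hypothetical numbers: prior and likelihoods for a toy spam filter.
p_spam = 0.3                      # P(Y): prior probability of spam
p_link_given_spam = 0.8           # P(X|Y): likelihood of the evidence given spam
p_link_given_not_spam = 0.2       # likelihood of the evidence given not spam

# P(X): total probability of the evidence, summed over both classes
p_link = p_link_given_spam * p_spam + p_link_given_not_spam * (1 - p_spam)

# Bayes' rule (4): posterior = likelihood * prior / evidence
p_spam_given_link = p_link_given_spam * p_spam / p_link
print(round(p_spam_given_link, 4))  # 0.6316
```

Seeing the evidence raises the probability of spam from the prior 0.3 to about 0.63.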
A Naïve Bayes classifier estimates the class-conditional probability by assuming that the attributes are conditionally independent, given the class label y. Here, P(Y) is the class probability and P(X_i|Y = y) is a conditional probability. The conditional independence assumption can be formally stated as follows:

$$ P(\boldsymbol{X}|Y = y) = \prod_{i=1}^{d} P(X_i|Y = y) \tag{5} $$

where each attribute set $$ \boldsymbol{X} = \{X_1, X_2, \dots, X_d\} $$ consists of d attributes. Naïve Bayes is also called Simple Bayes, as it assumes that the features of a measurement are independent of each other and make an equal contribution to the outcome.

## 3 METRICS

A classifier model doesn't always give accurate results. There are several metrics to measure how the classifier behaves on unseen data, such as the confusion matrix, accuracy, F1 score, precision, recall, and heat map. Different evaluation metrics are used for different kinds of problems. We build a model, get feedback from metrics, make improvements, and continue until we achieve the desired accuracy. Evaluation metrics explain the performance of a model. An important aspect of evaluation metrics is their capability to discriminate among model results.

### 3.1 Confusion Matrix

The confusion matrix is a table that summarizes how successful the classification model is at predicting examples belonging to various classes. One axis of the confusion matrix is the label that the model predicted, and the other axis is the actual label. In a binary classification problem, there are two classes. Let's say the model predicts two classes: "spam" and "not_spam":

|  | spam (predicted) | not spam (predicted) |
| --- | --- | --- |
| spam (actual) | 23 (TP) | 1 (FN) |
| not spam (actual) | 12 (FP) | 556 (TN) |

The above confusion matrix shows that of the 24 examples that actually were spam, the model correctly classified 23 as spam. In this case, we say that we have 23 true positives, or TP = 23.
The model incorrectly classified 1 example as not_spam. In this case, we have 1 false negative, or FN = 1. Similarly, of the 568 examples that actually were not spam, 556 were correctly classified (556 true negatives, or TN = 556), and 12 were incorrectly classified (12 false positives, FP = 12).

### 3.2 Precision/Recall

The two most frequently used metrics to assess a model are precision and recall. Precision is the ratio of correct positive predictions to the overall number of positive predictions:

$$ precision = \frac{TP}{TP + FP} \tag{6} $$

Recall is the ratio of correct positive predictions to the overall number of positive examples in the dataset:

$$ recall = \frac{TP}{TP + FN} \tag{7} $$

In the case of the spam detection problem, we want high precision (we want to avoid mistakenly classifying a legitimate message as spam) and we are ready to tolerate lower recall (we tolerate some spam messages in our inbox). A classifier model often has to choose between high precision and high recall; it's usually impossible to maximize both. Hyperparameter tuning helps to maximize precision or recall.

### 3.3 Accuracy

Accuracy is the number of correctly classified examples divided by the total number of classified examples. In terms of the confusion matrix, it is given by:

$$ accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{8} $$

Accuracy is a useful metric when errors in predicting all classes are equally important.

### 3.4 F1 Score

The F1-score is the harmonic mean of the precision and recall values for a classification problem.
The formula for the F1-score is as follows:

$$ F1 = \left( \frac{recall^{-1} + precision^{-1}}{2} \right)^{-1} \tag{9} $$

$$ F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} \tag{10} $$

The general formula for positive real β, where β is chosen such that recall is considered β times as important as precision, is:

$$ F_{\beta} = (1+{\beta}^2) \cdot \frac{precision \cdot recall}{({\beta}^2 \cdot precision) + recall} \tag{11} $$

Equation (11), or $ F_\beta $, measures the effectiveness of a model with respect to a user who attaches $ \beta $ times as much importance to recall as to precision.

### 3.5 Heat Map

A heat map can be elucidated as a cross table or spreadsheet that contains colors instead of numbers. The default color gradient sets the lowest value in the heat map to dark blue, the highest value to bright red, and mid-range values to light gray, with a corresponding transition (or gradient) between these extremes. Heat maps are well suited for visualizing large amounts of multi-dimensional data and can be used to identify clusters of rows with similar values, as these are displayed as areas of similar color.

## 4 RESULT

The value of a hyperparameter, like k in the k-NN classifier, plays a significant role in correctly classifying the labels or target variables. The error versus k plot provides a guideline for choosing k, and the value of k with the minimum error is chosen.

{{< image src="/images/k_value_vs_error.png" alt="k_value_vs_error" title="K-value versus Error" caption="K-value versus Error" >}}

The figure shows the fluctuation of the error at different values of k; the graph is not continuous. We prefer the minimum-error k-value over other k-values, as it gives more accurate predictions. The minimum error of the k-NN classifier model on the test set is at k = 12, where the error is 0.0467 (i.e. 4.67%). Hence, k = 12 is chosen as the k-value for the k-NN classifier.
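The "plot error versus k and take the minimum" rule amounts to a simple argmin over the measured errors. A minimal sketch with made-up error values (in the real experiment, the errors come from evaluating the classifier at each k, as in Appendix A):

```python
# Hypothetical error values measured at k = 1..8; the real values come
# from evaluating the kNN classifier at each k.
errors = {1: 0.09, 2: 0.08, 3: 0.075, 4: 0.06,
          5: 0.055, 6: 0.0467, 7: 0.05, 8: 0.052}

# Choose the k with the minimum error
best_k = min(errors, key=errors.get)
print(best_k, errors[best_k])  # 6 0.0467
```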
The performance metrics of the kNN classifier with the parameters metric as 'minkowski' and n_neighbors as 12 are:

Confusion matrix:

```
[[136   6]
 [  8 150]]
```

- Precision for label '0': 0.94
- Precision for label '1': 0.96
- Recall for label '0': 0.96
- Recall for label '1': 0.95
- F1-score for label '0': 0.95
- F1-score for label '1': 0.96
- Accuracy: 0.95
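As a sanity check, these values follow directly from the confusion matrix via equations (6), (7), (8), and (10), treating each label in turn as the positive class:

```python
# Recomputing the reported metrics from the kNN confusion matrix
# (rows: actual label 0/1, columns: predicted label 0/1)
cm = [[136, 6],
      [8, 150]]

# Treating label '0' as the positive class:
TP, FN, FP, TN = cm[0][0], cm[0][1], cm[1][0], cm[1][1]
precision_0 = TP / (TP + FP)                # eq. (6): 136/144
recall_0 = TP / (TP + FN)                   # eq. (7): 136/142
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)  # eq. (10)

# Treating label '1' as the positive class swaps the roles:
precision_1 = TN / (TN + FN)                # 150/156
recall_1 = TN / (TN + FP)                   # 150/158
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

accuracy = (TP + TN) / (TP + TN + FP + FN)  # eq. (8): 286/300

print(round(precision_0, 2), round(recall_0, 2), round(f1_0, 2))  # 0.94 0.96 0.95
print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))  # 0.96 0.95 0.96
print(round(accuracy, 2))  # 0.95
```

The recomputed values agree with the reported precision, recall, and accuracy; the per-label F1-scores round to 0.95 and 0.96.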

The model classified the labels with TP = 136, FN = 6, FP = 8, TN = 150, which means the model misclassified 6 labels as label '1' that are actually label '0', and misclassified 8 labels as label '0' that are actually label '1'. Hence, the model has an accuracy of about 95%.

{{< image src="/images/heat_map.jpg" alt="Heat Map" title="Heat Map" caption="Heat Map" >}}

Figure: Heat map of the predicted label over the true label.

The heat map is a graphical representation of the values in the confusion matrix obtained from the predicted labels and the actual target names. In the above heat map, the red square denotes the maximum value in the confusion matrix, and the color fades as the value decreases. The diagonal elements have higher values, as shown in the heat map, which indicates a higher performance of the classification model: the predicted label matches the true label for a given input. For the given model, 'prime minister of nepal' supplied as input is assigned the label 'talk.politics.mideast'; similarly, 'joker' supplied as input is assigned the label 'comp.sys.ibm.pc.hardware'. Here, two different inputs have been assigned two different labels, of which the label assigned to 'prime minister of nepal' is correct, whereas 'joker' has not been assigned the correct label. This is due to Naïve Bayes treating the input features as independent, as well as the lack of data being supplied.

## 5 Conclusion

The two popular classifiers, k-NN and Naïve Bayes, provide good accuracy for the model. Many parameters contribute to model performance, and the right choice of hyperparameters also yields better results. There is no rule of thumb for selecting the right hyperparameter value on the first trial, and a hyperparameter value that works fine for one model may not yield the same result for another. A good model is one that considers all the performance metrics, like accuracy, F1-score, precision, recall, etc.
Though we have many metrics to evaluate model performance, some analysis is needed to identify the metric that addresses a given classification problem in the best possible way. The contribution of all performance metrics needs to be analyzed to make the model accurate.

## 6 Appendix

### A k-NN classifier

[You can find the dataset here!](dataset.csv)

```python
# Importing the libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Importing the dataset
df = pd.read_csv('datasets/dataset.csv', index_col=0)

# Quick look at the data
df.head(4)

# Standardizing the variables
scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS', axis=1))
scaled_features = scaler.transform(df.drop('TARGET CLASS', axis=1))
df_feat = pd.DataFrame(scaled_features, columns=df.columns[:-1])

# After standardization
print(df_feat.head())

# Train test split
X_train, X_test, y_train, y_test = train_test_split(
    scaled_features, df['TARGET CLASS'], test_size=0.3, random_state=42)

# Initializing error and k_value lists
error = []
k_value = []
for k in range(40):
    k_value.append(k + 1)
    # Using KNN
    knn = KNeighborsClassifier(n_neighbors=k + 1)
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    error_ = 1 - accuracy_score(y_test, pred)
    error.append(error_)

# Plotting k_value and error
plt.plot(k_value, error)
plt.xlabel('K value', fontsize=13)
plt.ylabel('Error', fontsize=13)
plt.savefig('k_value_vs_error.png', dpi=1000, bbox_inches="tight")
plt.show()


def performance_report(X_train, y_train, X_test, y_test, n_neighbors=3):
    # k-NN classifier
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    confusion_matrix_ = confusion_matrix(y_test, pred)
    classification_report_ = classification_report(y_test, pred)
    return {
        'confusion_matrix': confusion_matrix_,
        'classification_report': classification_report_
    }


# To numpy
error_np = np.array(error)
k_value_np = np.array(k_value)
error_min_index = error_np.argmin().item()  # numpy int to python int
k_value_ = k_value_np[error_min_index]
print('K = {} and error = {}'.format(k_value_, error_np[error_min_index]))

# For the minimum error
performance_report_ = performance_report(X_train, y_train, X_test, y_test,
                                         n_neighbors=k_value_)
print('For k = {}: \n {}\n{}'.format(k_value_,
                                     performance_report_['confusion_matrix'],
                                     performance_report_['classification_report']))
```

### B Naïve Bayes classifier

```python
# Importing the libraries
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Importing the dataset
data = fetch_20newsgroups()
print(data.target_names)

# Training the data on these categories
categories = data.target_names
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
print(train.data[5])

# Pipelining the model
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Fitting the data
model.fit(train.data, train.target)
labels = model.predict(test.data)

# Heat map
mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
# plt.tight_layout()
plt.savefig('images/lab04/heat_map.jpg', dpi=1000, bbox_inches="tight")
plt.show()


# Predicting
def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]


# Prediction
print(predict_category('Jesus christ'))
print(predict_category('Prime minister of Nepal'))
print(predict_category('Everest'))
```