V = [ I, am, happy, because, learning, NLP, ..., hated, the, movie]
{{< /admonition >}}

Here `V` represents the vocabulary. It works like a dictionary that keeps only the unique words of the corpus.

### Feature Extraction

Feature extraction is the process of deriving useful information from raw data. For sentiment analysis with logistic regression, that information is how often each word occurs in a particular corpus.

{{< admonition type=note title="Example" open=true >}}
**I am happy because I am learning NLP**

*freqs* = [ 1, 1, 1, 1, 1, 1, ..., 0, 0, 0]
{{< /admonition >}}

**Vocabulary for sentiment analysis**

| Vocabulary | PosFreq(1) | NegFreq(0) |
|:---:|:---:|:---:|
| I | 3 | 3 |
| am | 3 | 3 |
| happy | 2 | 0 |
| because | 1 | 1 |
| learning | 1 | 1 |
| NLP | 1 | 1 |
| sad | 0 | 2 |
| not | 0 | 1 |

`freqs`: a dictionary mapping each (word, class) pair to its frequency.

$$ X_m = [1, \sum_w freqs(w,1), \sum_w freqs(w,0)] $$

where $ X_m $ is the feature vector of tweet $ m $, 1 is the bias term, and the two sums are the positive and negative frequencies, respectively, of the unique words $ w $ in tweet $ m $.

{{< admonition type=note title="Example" open=true >}}
**I am sad, I am not learning NLP**

$$ X_m = [1, 8, 11] $$
{{< /admonition >}}

### Preprocessing

A sentence may contain stop words, punctuation, handles, URLs, etc., which add little value when deciding whether its sentiment is positive or negative. Removing them lets the logistic regression algorithm run faster and usually more accurately.

#### a. Stop words and punctuation

| Stop words | Punctuation |
|:---:|:----:|
| and | , |
| is | . |
| at | : |
| has | ! |
| for | " |
| of | ' |
| ... | ... |

#### b. Stemming and Lowercasing

{{< admonition type=note title="Example" open=true >}}
tuning, tune, tuned reduces to **tun**

Great, GREAT, great reduces to **great**
{{< /admonition >}}

This process reduces the number of unique words in the vocabulary.

## 2 Logistic Regression

In statistics, the logistic model (or `logit model`) is used to model the probability of a certain class or event, such as pass/fail, win/lose, alive/dead or healthy/sick. It can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each object detected in the image is assigned a probability between 0 and 1, and the probabilities sum to one.

{{< image src="/images/logistic-regression.jpg" alt="Logistic Regression" title="Logistic Regression" caption="Logistic Regression" width=80% >}}

The sigmoid function is

$$ h(x^{(i)}, \theta) = \cfrac{1}{1 + e^{-\theta^T x^{(i)}}} $$

where $ \theta $ is the vector of weights (parameters) and $ x^{(i)} $ is the $ i^{th} $ training example.

{{< image src="/images/sigmoid-function.png" alt="Sigmoid Function" title="Sigmoid Function" caption="Sigmoid Function" width=50% >}}

### Training Logistic Regression

{{< image src="/images/training-lr.png" alt="Training LR" title="Training LR" caption="Training LR" width=50% >}}

## 3 Cost Function for Logistic Regression

$$ J(\theta) = -\cfrac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log h(x^{(i)}, \theta)+(1-y^{(i)})\log(1-h(x^{(i)}, \theta))] $$

For a single training example, only one of the two terms is non-zero, so the cost reduces to

$$ J(\theta) = \begin{cases} -\log h(x^{(i)}, \theta) &\text{if } y=1 \\\ -\log(1-h(x^{(i)}, \theta)) &\text{if } y=0 \end{cases} $$

Case-1: When `y=1`

$$ J( \theta ) =0\ \text{if}\ h\left( x^{( i)} ,\ \theta \right) =1\\\ J( \theta )\rightarrow \infty \ \text{if}\ h\left( x^{( i)} ,\ \theta \right)\rightarrow 0 $$

{{< image src="/images/lr-y-1.png" alt="Cost function (y=1)" title="Cost function (y=1)" caption="Cost function `(y=1)`" width=50% >}}

Case-2: When `y=0`

$$ J( \theta ) =0\ \text{if}\ h\left( x^{( i)} ,\ \theta \right) =0\\\ J( \theta )\rightarrow \infty \ \text{if}\ h\left( x^{( i)} ,\ \theta \right)\rightarrow 1 $$

{{< image src="/images/lr-y-0.png"
alt="Cost function (y=0)" title="Cost function (y=0)" caption="Cost function `(y=0)`" width=50% >}}

{{< admonition type=tip title="Tips" open=true >}}
The two cases show that the cost function imposes its `maximum` penalty on a confidently incorrect prediction: the cost grows without bound as the prediction approaches the wrong label.
{{< /admonition >}}

## References

1. [Sentiment analysis - Wikipedia](https://en.wikipedia.org/wiki/Sentiment_analysis)
2. [Natural Language Processing ML - educative](https://www.educative.io/courses/natural-language-processing-ml/)

## Appendices

### Utils.py

```python
import re
import string

import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer


def process_tweet(tweet):
    """Process tweet function.

    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean


def build_freqs(tweets, ys):
    """Build frequencies.

    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
        frequency
    """
    # Convert np array to list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1

    return freqs
```

## IPYNB

```python
#!/usr/bin/env python
# coding: utf-8

# # Assignment 1: Logistic Regression
# Welcome to week one of this specialization. You will learn about logistic regression. Concretely, you will be implementing logistic regression for sentiment analysis on tweets. Given a tweet, you will decide if it has a positive sentiment or a negative one. Specifically you will:
#
# * Learn how to extract features for logistic regression given some text
# * Implement logistic regression from scratch
# * Apply logistic regression on a natural language processing task
# * Test using your logistic regression
# * Perform error analysis
#
# We will be using a data set of tweets. Hopefully you will get more than 99% accuracy.
# Run the cell below to load in the packages.

# ## Import functions and data

# In[1]:


# run this cell to import nltk
import nltk
from os import getcwd


# ### Imported functions
#
# Download the data needed for this assignment. Check out the [documentation for the twitter_samples dataset](http://www.nltk.org/howto/twitter.html).
#
# * twitter_samples: if you're running this notebook on your local computer, you will need to download it using:
# ```Python
# nltk.download('twitter_samples')
# ```
#
# * stopwords: if you're running this notebook on your local computer, you will need to download it using:
# ```python
# nltk.download('stopwords')
# ```
#
# #### Import some helper functions that we provided in the utils.py file:
# * `process_tweet()`: cleans the text, tokenizes it into separate words, removes stopwords, and converts words to stems.
# * `build_freqs()`: this counts how often a word in the 'corpus' (the entire set of tweets) was associated with a positive label '1' or a negative label '0', then builds the `freqs` dictionary, where each key is a (word,label) tuple, and the value is the count of its frequency within the corpus of tweets.

# In[2]:


# add folder, tmp2, from our local workspace containing pre-downloaded corpora files to nltk's data path
# this enables importing of these files without downloading it again when we refresh our workspace
filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)


# In[3]:


import numpy as np
import pandas as pd
from nltk.corpus import twitter_samples

from utils import process_tweet, build_freqs


# ### Prepare the data
# * The `twitter_samples` contains subsets of 5,000 positive tweets, 5,000 negative tweets, and the full set of 10,000 tweets.
# * If you used all three datasets, we would introduce duplicates of the positive tweets and negative tweets.
# * You will select just the five thousand positive tweets and five thousand negative tweets.

# In[4]:


# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')


# * Train test split: 20% will be in the test set, and 80% in the training set.
#

# In[5]:


# split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg


# * Create the numpy array of positive labels and negative labels.

# In[6]:


# combine positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)


# In[7]:


# Print the shape train and test sets
print("train_y.shape = " + str(train_y.shape))
print("test_y.shape = " + str(test_y.shape))


# In[8]:


# create frequency dictionary
freqs = build_freqs(train_x, train_y)

# check the output
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))


# ### Process tweet
# The given function `process_tweet()` tokenizes the tweet into individual words, removes stop words and applies stemming.
# In[9]:


# test the function below
print('This is an example of a positive tweet: \n', train_x[0])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[0]))


# # Part 1: Logistic regression
# ### Part 1.1: Sigmoid

# In[10]:


# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def sigmoid(z):
    '''
    Input:
        z: is the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # calculate the sigmoid of z
    h = 1 / (1 + np.exp(-z))
    ### END CODE HERE ###

    return h


# In[11]:


# Testing your function
if (sigmoid(0) == 0.5):
    print('SUCCESS!')
else:
    print('Oops!')

if (sigmoid(4.92) == 0.9927537604041685):
    print('CORRECT!')
else:
    print('Oops again!')


# In[12]:


# verify that when the model predicts close to 1, but the actual label is 0, the loss is a large positive value
-1 * (1 - 0) * np.log(1 - 0.9999)  # loss is about 9.2


# * Likewise, if the model predicts close to 0 ($h(z) = 0.0001$) but the actual label is 1, the first term in the loss function becomes a large number: $-1 \times log(0.0001) \approx 9.2$. The closer the prediction is to zero, the larger the loss.

# In[13]:


# verify that when the model predicts close to 0 but the actual label is 1, the loss is a large positive value
-1 * np.log(0.0001)  # loss is about 9.2


# In[14]:


# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # get 'm', the number of rows in matrix x
    m = x.shape[0]

    for i in range(0, num_iters):

        # get z, the dot product of x and theta
        z = np.dot(x, theta)

        # get the sigmoid of z
        h = sigmoid(z)

        # calculate the cost function
        J = -(1./m)*(np.dot(y.T, np.log(h)) + np.dot((1-y).T, np.log(1-h)))

        # update the weights theta
        theta = theta - (alpha/m) * np.dot(x.T, (h-y))

    ### END CODE HERE ###
    # J is a (1, 1) array; extract it as a Python float
    J = J.item()
    return J, theta


# In[15]:


# Check the function
# Construct a synthetic test case using numpy PRNG functions
np.random.seed(1)
# X input is 10 x 3 with ones for the bias terms
tmp_X = np.append(np.ones((10, 1)), np.random.rand(10, 2) * 2000, axis=1)
# Y Labels are 10 x 1
tmp_Y = (np.random.rand(10, 1) > 0.35).astype(float)
# tmp_X.shape[0]

# Apply gradient descent
tmp_J, tmp_theta = gradientDescent(tmp_X, tmp_Y, np.zeros((3, 1)), 1e-8, 700)
print(f"The cost after training is {tmp_J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(tmp_theta)]}")


# In[16]:


# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def extract_features(tweet, freqs):
    '''
    Input:
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output:
        x: a feature vector of dimension (1,3)
    '''
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)
    # print(','.join(word_l))
    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3))

    # bias term is set to 1
    x[0,0] = 1

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # loop through each word in the list of words
    for word in word_l:

        # increment the word count for the positive label 1
        x[0,1] += freqs.get((word, 1), 0)

        # increment the word count for the negative label 0
        x[0,2] += freqs.get((word, 0), 0)

    ### END CODE HERE ###
    assert(x.shape == (1, 3))
    return x


# In[17]:


# Check your function

# test 1
# test on training data
tmp1 = extract_features(train_x[0],
                        freqs)
print(tmp1)


# #### Expected output
# ```
# [[1.00e+00 3.02e+03 6.10e+01]]
# ```

# In[18]:


# test 2:
# check for when the words are not in the freqs dictionary
tmp2 = extract_features('blorb bleeeeb bloooob', freqs)
print(tmp2)


# In[19]:


# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))
print('Extracting features >>>')
for i in range(len(train_x)):
    X[i, :] = extract_features(train_x[i], freqs)
    if (i % 1000 == 0):
        print(f'{i} ->> feature extracted !')

# training labels corresponding to X
Y = train_y

# Apply gradient descent
J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")


# In[20]:


# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def predict_tweet(tweet, freqs, theta):
    '''
    Input:
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output:
        y_pred: the probability that the tweet is positive
    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # extract the features of the tweet and store it into x
    x = extract_features(tweet, freqs)

    # make the prediction using x and theta
    y_pred = sigmoid(np.dot(x, theta))

    ### END CODE HERE ###

    return y_pred


# In[21]:


# Run this cell to test your function
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    print('%s -> %f' % (tweet, predict_tweet(tweet, freqs, theta)))


# In[22]:


# Feel free to check the sentiment of your own tweet below
my_tweet = 'I am learning :)'
predict_tweet(my_tweet, freqs, theta)


# In[23]:


# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def test_logistic_regression(test_x, test_y, freqs, theta):
    """
    Input:
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a
            dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # the list for storing predictions
    y_hat = []

    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)

        if y_pred > 0.5:
            # append 1 to the list
            y_hat.append(1)
        else:
            # append 0 to the list
            y_hat.append(0)

    # With the above implementation, y_hat is a list, but test_y is (m,1) array
    # convert both to one-dimensional arrays in order to compare them using the '==' operator
    accuracy = sum(np.asarray(y_hat) == np.squeeze(test_y)) / len(test_x)

    ### END CODE HERE ###

    return accuracy


# In[24]:


tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
# print(tmp_accuracy)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")


# # Part 5: Error Analysis
#
# In this part you will see some tweets that your model misclassified. Why do you think the misclassifications happened? Specifically what kind of tweets does your model misclassify?

# In[25]:


# Some error analysis done for you
print('Label Predicted Tweet')
for x,y in zip(test_x,test_y):
    y_hat = predict_tweet(x, freqs, theta)

    if np.abs(y - (y_hat > 0.5)) > 0:
        print('THE TWEET IS:', x)
        print('THE PROCESSED TWEET IS:', process_tweet(x))
        print('%d\t%0.8f\t%s' % (y, y_hat, ' '.join(process_tweet(x)).encode('ascii', 'ignore')))


# # Part 6: Predict with your own tweet

# In[26]:


# Feel free to change the tweet below
my_tweet = 'I am happy for learning NLP! :D'
print(process_tweet(my_tweet))
y_hat = predict_tweet(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else:
    print('Negative sentiment')
```
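
The feature vector, sigmoid, and cost described in the sections above can also be checked by hand, without NLTK or the tweet corpus. The following is a minimal sketch rather than the assignment's code. It makes three assumptions: `toy_features` and `example_cost` are hypothetical helper names; the weights `theta` are made-up values, not trained ones; and the sums run over the unique words of the tweet, as the worked example $ X_m = [1, 8, 11] $ implies (the notebook's `extract_features` instead adds the counts of every processed token). The `freqs` dictionary copies only the rows of the vocabulary table for the six words in the tweet.

```python
import math

# Rows for the six words of the example tweet, copied from the vocabulary table.
# Keys are (word, label) pairs; label 1 is positive, label 0 is negative.
freqs = {
    ("I", 1): 3, ("I", 0): 3,
    ("am", 1): 3, ("am", 0): 3,
    ("happy", 1): 2, ("happy", 0): 0,
    ("because", 1): 1, ("because", 0): 1,
    ("learning", 1): 1, ("learning", 0): 1,
    ("NLP", 1): 1, ("NLP", 0): 1,
}


def toy_features(words, freqs):
    """Return [bias, positive sum, negative sum] over the unique words."""
    unique_words = set(words)
    pos = sum(freqs.get((w, 1), 0) for w in unique_words)
    neg = sum(freqs.get((w, 0), 0) for w in unique_words)
    return [1, pos, neg]


def sigmoid(z):
    """Logistic function for a scalar z."""
    return 1.0 / (1.0 + math.exp(-z))


def example_cost(h, y):
    """Cross-entropy cost of one example with prediction h and label y."""
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))


x = toy_features("I am happy because I am learning NLP".split(), freqs)
theta = [0.0, 0.1, -0.1]  # hypothetical weights, not trained
h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))

print(x)                                  # [1, 11, 9]
print(round(h, 4))                        # 0.5498
print(round(example_cost(h, 1), 4))       # 0.5981
print(round(example_cost(0.0001, 1), 2))  # 9.21
```

The last line reproduces the "loss is about 9.2" check from the notebook: a prediction of 0.0001 for a tweet whose label is 1 costs far more than the mildly positive prediction of about 0.55.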