# Sentiment Analysis with Logistic Regression


## 1 Introduction

Sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. It is widely applied to voice-of-the-customer materials such as reviews and survey responses, to online and social media, and to healthcare materials, for applications that range from marketing to customer service to clinical medicine. This post explains sentiment analysis with logistic regression using a real Twitter dataset.

## 2 Terminology

### Corpus Vocabulary

In the context of NLP tasks, the text corpus refers to the set of texts used for the task. For example, if we were building a model to analyze news articles, our text corpus would be the entire set of articles we used to train and evaluate the model. The vocabulary is the set of unique words used in that corpus, e.g. every distinct word that appears in a collection of tweets. When processing raw text for NLP, everything is done around the vocabulary.

Example
I am happy because I am learning NLP … I hated the movie.
V = [I, am, happy, because, learning, NLP, …, hated, the, movie]

Here V represents the vocabulary. It behaves like a set: each word is kept only once, no matter how often it occurs in the corpus.
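As a minimal sketch of the idea, the vocabulary can be built by walking over the corpus and keeping each word the first time it is seen. The function name `build_vocabulary` and the whitespace tokenization are assumptions made for illustration; the notebook code later in this post uses a tweet tokenizer instead.

```python
def build_vocabulary(corpus):
    """Return the unique words of a corpus in first-seen order."""
    vocabulary = []
    seen = set()
    for text in corpus:
        # Assumption: naive whitespace tokenization, no cleaning.
        for word in text.split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


corpus = ["I am happy because I am learning NLP", "I hated the movie"]
V = build_vocabulary(corpus)
print(V)
# ['I', 'am', 'happy', 'because', 'learning', 'NLP', 'hated', 'the', 'movie']
```

The repeated words "I" and "am" appear only once in the result.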

### Feature Extraction

Feature extraction is the process of deriving informative numeric values from raw data. For sentiment analysis with logistic regression, the features are based on word frequencies: how often each word occurs in the positive and in the negative part of the corpus.

Example
I am happy because I am learning NLP
freqs = [ 1, 1, 1, 1, 1, 1, …, 0, 0, 0]
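Reading the vector above as one entry per vocabulary word (1 if the word occurs in the tweet, 0 otherwise), it could be produced as in the sketch below. The helper name `sparse_features` is hypothetical, and the vocabulary is the short one from the earlier example without the elided words.

```python
def sparse_features(tweet, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in the tweet."""
    words = set(tweet.split())  # assumption: whitespace tokenization
    return [1 if word in words else 0 for word in vocabulary]


V = ["I", "am", "happy", "because", "learning", "NLP", "hated", "the", "movie"]
features = sparse_features("I am happy because I am learning NLP", V)
print(features)  # [1, 1, 1, 1, 1, 1, 0, 0, 0]
```

The vector is as long as the vocabulary, so it grows with the corpus. The three-element representation described next avoids that.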

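Vocabulary for sentiment analysis

| Vocabulary | PosFreq (1) | NegFreq (0) |
|------------|-------------|-------------|
| I          | 3           | 3           |
| am         | 3           | 3           |
| happy      | 2           | 0           |
| because    | 1           | 1           |
| learning   | 1           | 1           |
| NLP        | 1           | 1           |
| not        | 0           | 1           |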

freqs: dictionary mapping from (word, class) to frequency.

$$X_m = [1, \sum_w freqs(w,1), \sum_w freqs(w,0)]$$

where $X_m$ is the feature vector of tweet $m$, the leading 1 is the bias term, and the two sums run over the words $w$ of tweet $m$, adding up their positive and negative frequencies respectively.

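Example
I am sad, I am not learning NLP
$$X_m = [1, 8, 11]$$

Each distinct word is counted once. The positive sum is 3 + 3 + 1 + 1 = 8 (I, am, learning, NLP). The words I, am, not, learning and NLP contribute 3 + 3 + 1 + 1 + 1 = 9 to the negative sum; sad is not listed in the table above, and the stated total of 11 corresponds to a negative frequency of 2 for it.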
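The calculation can be sketched in a few lines of Python. The dictionary copies the counts from the table above; the entry for "sad" is an assumption, chosen so that the result matches the stated vector, and the function name `tweet_features` is hypothetical.

```python
# (word, class) -> count; class 1 = positive, class 0 = negative.
freqs = {
    ("I", 1): 3, ("I", 0): 3,
    ("am", 1): 3, ("am", 0): 3,
    ("happy", 1): 2, ("happy", 0): 0,
    ("because", 1): 1, ("because", 0): 1,
    ("learning", 1): 1, ("learning", 0): 1,
    ("NLP", 1): 1, ("NLP", 0): 1,
    ("not", 1): 0, ("not", 0): 1,
    # ASSUMPTION: "sad" is not in the table; these counts are illustrative.
    ("sad", 1): 0, ("sad", 0): 2,
}


def tweet_features(words, freqs):
    """Return [bias, positive sum, negative sum] over the distinct words."""
    unique = list(dict.fromkeys(words))  # drop repeats, keep order
    pos = sum(freqs.get((w, 1), 0) for w in unique)
    neg = sum(freqs.get((w, 0), 0) for w in unique)
    return [1, pos, neg]


words = "I am sad I am not learning NLP".split()
print(tweet_features(words, freqs))  # [1, 8, 11]
```

Note that the `extract_features` function in the notebook code further down loops over every token instead, so repeated words such as "I" and "am" would be counted once per occurrence there.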

### Preprocessing

A tweet may contain stop words, punctuation, handles, URLs, etc. that add little value when deciding whether its sentiment is positive or negative. Removing them shrinks the vocabulary, which makes logistic regression faster to train and can improve its accuracy.

#### a. Stop words and punctuation

| Stop words | Punctuation |
|------------|-------------|
| and        | ,           |
| is         | .           |
| at         | :           |
| has        | !           |
| for        | "           |
| of         | '           |

#### b. Stemming and Lowercasing

Example
tuning, tune, tuned reduce to the stem tun
Great, GREAT, great reduce to great

Stemming and lowercasing reduce the number of unique words in the vocabulary.
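The preprocessing steps above can be sketched with the standard library alone. This is a toy illustration under stated assumptions: the stop-word set holds only the six words from the table (not NLTK's full list), and `naive_stem` is a hypothetical suffix stripper, not the Porter stemmer used in the notebook code, so its output may differ from a real stemmer.

```python
import string

# Assumption: tiny stop-word list taken from the table above.
STOP_WORDS = {"and", "is", "at", "has", "for", "of"}
# Assumption: toy suffix rules; a real stemmer applies many more.
SUFFIXES = ("ing", "ed", "e")


def naive_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, then stem."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if word and word not in STOP_WORDS:
            tokens.append(naive_stem(word))
    return tokens


print([naive_stem(w) for w in ["tuning", "tune", "tuned"]])  # ['tun', 'tun', 'tun']
print(preprocess("Great, GREAT, great!"))                    # ['great', 'great', 'great']
print(preprocess("Tuning is great, and tuned!"))             # ['tun', 'great', 'tun']
```

Three surface forms collapse into one stem and three capitalizations into one word, which is how the vocabulary shrinks.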

## 3 Logistic Regression

In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event, such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each object being detected in the image would be assigned a probability between 0 and 1, with the probabilities summing to one. For sentiment analysis there are two classes: positive (1) and negative (0).

Logistic regression applies the sigmoid function to $\theta^T x^{(i)}$:

$$h(x^{(i)}, \theta) = \cfrac{1}{1 + e^{-\theta^T x^{(i)}}}$$

where $\theta$ is the vector of weights and $x^{(i)}$ is the feature vector of the $i^{th}$ training example.

## 4 Cost Function for Logistic Regression

$$J(\theta) = -\cfrac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h(x^{(i)}, \theta)+(1-y^{(i)})\log(1-h(x^{(i)}, \theta))\right]$$

For a single training example, only one of the two terms is non-zero, so the cost reduces to:

$$J(\theta) = \begin{cases} -\log h(x^{(i)}, \theta) &\text{if } y^{(i)}=1 \\ -\log(1-h(x^{(i)}, \theta)) &\text{if } y^{(i)}=0 \end{cases}$$

Case 1: when $y^{(i)}=1$

$$J(\theta) = 0 \ \text{ if } h(x^{(i)}, \theta) = 1, \qquad J(\theta) \rightarrow \infty \ \text{ if } h(x^{(i)}, \theta) \rightarrow 0$$

Case 2: when $y^{(i)}=0$

$$J(\theta) = 0 \ \text{ if } h(x^{(i)}, \theta) = 0, \qquad J(\theta) \rightarrow \infty \ \text{ if } h(x^{(i)}, \theta) \rightarrow 1$$

In both cases the cost is close to zero when the prediction agrees with the label and grows without bound as the prediction approaches the wrong label, so confident wrong predictions are penalized most heavily.
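The two cases can be checked numerically with a small standard-library sketch. The function name `example_loss` is an assumption made for illustration; it computes the per-example loss for a prediction `h` strictly between 0 and 1 (at exactly 0 or 1 the logarithm is undefined).

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def example_loss(h, y):
    """Per-example loss for prediction h (0 < h < 1) and label y (0 or 1)."""
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))


print(sigmoid(0))               # 0.5
print(example_loss(0.9999, 1))  # close to 0: confident and correct
print(example_loss(0.0001, 1))  # about 9.21: confident and wrong
print(example_loss(0.9999, 0))  # about 9.21: confident and wrong
```

The loss of roughly 9.2 for a prediction of 0.0001 against a label of 1 is the same value the notebook code below verifies with `-1 * np.log(0.0001)`.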

The helper functions `process_tweet()` and `build_freqs()` used by the notebook are defined in `utils.py`:

```python
import re
import string

import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer


def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean


def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
        frequency
    """
    # Convert np array to list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1

    return freqs
```

The full notebook, exported as a Python script:

```python
#!/usr/bin/env python
# coding: utf-8

# # Assignment 1: Logistic Regression
# Welcome to week one of this specialization. You will learn about logistic
# regression. Concretely, you will be implementing logistic regression for
# sentiment analysis on tweets. Given a tweet, you will decide if it has a
# positive sentiment or a negative one. Specifically you will:
#
# * Learn how to extract features for logistic regression given some text
# * Implement logistic regression from scratch
# * Apply logistic regression on a natural language processing task
# * Test using your logistic regression
# * Perform error analysis
#
# We will be using a data set of tweets. Hopefully you will get more than 99%
# accuracy.

# ## Import functions and data
import nltk
from os import getcwd

# ### Imported functions
#
# Download the data needed for this assignment. Check out the documentation
# for the twitter_samples dataset: http://www.nltk.org/howto/twitter.html
#
# * twitter_samples: if you're running this notebook on your local computer,
#   you will need to download it using:
#       nltk.download('twitter_samples')
# * stopwords: if you're running this notebook on your local computer, you
#   will need to download it using:
#       nltk.download('stopwords')
#
# #### Import some helper functions that we provided in the utils.py file:
# * process_tweet(): cleans the text, tokenizes it into separate words,
#   removes stopwords, and converts words to stems.
# * build_freqs(): this counts how often a word in the 'corpus' (the entire
#   set of tweets) was associated with a positive label '1' or a negative
#   label '0', then builds the freqs dictionary, where each key is a
#   (word, label) tuple, and the value is the count of its frequency within
#   the corpus of tweets.

# add folder, tmp2, from our local workspace containing pre-downloaded corpora
# files to nltk's data path; this enables importing of these files without
# downloading them again when we refresh our workspace
filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)

import numpy as np
import pandas as pd
from nltk.corpus import twitter_samples

from utils import process_tweet, build_freqs

# ### Prepare the data
# * The twitter_samples contains subsets of 5,000 positive tweets, 5,000
#   negative tweets, and the full set of 10,000 tweets.
# * If you used all three datasets, we would introduce duplicates of the
#   positive tweets and negative tweets.
# * You will select just the five thousand positive tweets and five thousand
#   negative tweets.

# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# * Train test split: 20% will be in the test set, and 80% in the training set.

# split the data into two pieces, one for training and one for testing
# (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# * Create the numpy array of positive labels and negative labels.

# combine positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)),
                    np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)),
                   np.zeros((len(test_neg), 1)), axis=0)

# Print the shape of the train and test sets
print("train_y.shape = " + str(train_y.shape))
print("test_y.shape = " + str(test_y.shape))

# create frequency dictionary
freqs = build_freqs(train_x, train_y)

# check the output
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))

# ### Process tweet
# The given function process_tweet() tokenizes the tweet into individual
# words, removes stop words and applies stemming.

# test the function below on a single tweet (the first training example)
print('This is an example of a positive tweet: \n', train_x[0])
print('\nThis is an example of the processed version of the tweet: \n',
      process_tweet(train_x[0]))


# # Part 1: Logistic regression
# ### Part 1.1: Sigmoid
def sigmoid(z):
    '''
    Input:
        z: is the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''
    # calculate the sigmoid of z
    h = 1 / (1 + np.exp(-z))
    return h


# Testing your function
if (sigmoid(0) == 0.5):
    print('SUCCESS!')
else:
    print('Oops!')

if (sigmoid(4.92) == 0.9927537604041685):
    print('CORRECT!')
else:
    print('Oops again!')

# verify that when the model predicts close to 1, but the actual label is 0,
# the loss is a large positive value
-1 * (1 - 0) * np.log(1 - 0.9999)  # loss is about 9.2

# * Likewise, if the model predicts close to 0 (h(z) = 0.0001) but the actual
#   label is 1, the first term in the loss function becomes a large number:
#   -1 x log(0.0001) ~ 9.2. The closer the prediction is to zero, the larger
#   the loss.

# verify that when the model predicts close to 0 but the actual label is 1,
# the loss is a large positive value
-1 * np.log(0.0001)  # loss is about 9.2


def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''
    # get 'm', the number of rows in matrix x
    m = x.shape[0]

    for i in range(0, num_iters):
        # get z, the dot product of x and theta
        z = np.dot(x, theta)
        # get the sigmoid of z
        h = sigmoid(z)
        # calculate the cost function
        J = -(1./m)*(np.dot(y.T, np.log(h)) + np.dot((1-y).T, np.log(1-h)))
        # update the weights theta
        theta = theta - (alpha/m) * np.dot(x.T, (h-y))

    # J is a (1,1) array; squeeze it before converting to a Python float
    J = float(np.squeeze(J))
    return J, theta


# Check the function
# Construct a synthetic test case using numpy PRNG functions
np.random.seed(1)
# X input is 10 x 3 with ones for the bias terms
tmp_X = np.append(np.ones((10, 1)), np.random.rand(10, 2) * 2000, axis=1)
# Y Labels are 10 x 1
tmp_Y = (np.random.rand(10, 1) > 0.35).astype(float)

# Apply gradient descent
tmp_J, tmp_theta = gradientDescent(tmp_X, tmp_Y, np.zeros((3, 1)), 1e-8, 700)
print(f"The cost after training is {tmp_J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(tmp_theta)]}")


def extract_features(tweet, freqs):
    '''
    Input:
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple
               (word, label)
    Output:
        x: a feature vector of dimension (1,3)
    '''
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)

    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3))

    # bias term is set to 1
    x[0, 0] = 1

    # loop through each word in the list of words
    for word in word_l:
        # increment the word count for the positive label 1
        x[0, 1] += freqs.get((word, 1), 0)
        # increment the word count for the negative label 0
        x[0, 2] += freqs.get((word, 0), 0)

    assert(x.shape == (1, 3))
    return x


# Check your function
# test 1: test on training data
tmp1 = extract_features(train_x[0], freqs)
print(tmp1)
# Expected output:
# [[1.00e+00 3.02e+03 6.10e+01]]

# test 2: check for when the words are not in the freqs dictionary
tmp2 = extract_features('blorb bleeeeb bloooob', freqs)
print(tmp2)

# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))
print('Extracting features >>>')
for i in range(len(train_x)):
    X[i, :] = extract_features(train_x[i], freqs)
    if (i % 1000 == 0):
        print(f'{i} ->> feature extracted !')

# training labels corresponding to X
Y = train_y

# Apply gradient descent
J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")


def predict_tweet(tweet, freqs, theta):
    '''
    Input:
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple
               (word, label)
        theta: (3,1) vector of weights
    Output:
        y_pred: the probability of a tweet being positive or negative
    '''
    # extract the features of the tweet and store it into x
    x = extract_features(tweet, freqs)
    # make the prediction using x and theta
    y_pred = sigmoid(np.dot(x, theta))
    return y_pred


# Run this cell to test your function
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.',
              'great', 'great great', 'great great great',
              'great great great great']:
    # predict_tweet returns a (1,1) array; .item() extracts the scalar
    print('%s -> %f' % (tweet, predict_tweet(tweet, freqs, theta).item()))

# Feel free to check the sentiment of your own tweet below
my_tweet = 'I am learning :)'
predict_tweet(my_tweet, freqs, theta)


def test_logistic_regression(test_x, test_y, freqs, theta):
    """
    Input:
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of
                tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    # the list for storing predictions
    y_hat = []

    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)
        if y_pred > 0.5:
            # append 1 to the list
            y_hat.append(1)
        else:
            # append 0 to the list
            y_hat.append(0)

    # With the above implementation, y_hat is a list, but test_y is (m,1) array
    # convert both to one-dimensional arrays in order to compare them using
    # the '==' operator
    accuracy = sum(np.asarray(y_hat) == np.squeeze(test_y)) / len(test_x)
    return accuracy


tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")

# # Part 5: Error Analysis
#
# In this part you will see some tweets that your model misclassified. Why do
# you think the misclassifications happened? Specifically what kind of tweets
# does your model misclassify?

# Some error analysis done for you
print('Label Predicted Tweet')
for x, y in zip(test_x, test_y):
    y_hat = predict_tweet(x, freqs, theta)

    if np.abs(y - (y_hat > 0.5)) > 0:
        print('THE TWEET IS:', x)
        print('THE PROCESSED TWEET IS:', process_tweet(x))
        # y and y_hat are one-element arrays; .item() extracts the scalars
        print('%d\t%0.8f\t%s' % (y.item(), y_hat.item(),
                                 ' '.join(process_tweet(x)).encode('ascii', 'ignore')))

# # Part 6: Predict with your own tweet
# Feel free to change the tweet below
my_tweet = 'I am happy for learning NLP! :D'
print(process_tweet(my_tweet))
y_hat = predict_tweet(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else:
    print('Negative sentiment')
```
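
To see the training loop in isolation, here is a dependency-free sketch of the same update that `gradientDescent` performs, $\theta \leftarrow \theta - \frac{\alpha}{m} X^T (h - y)$, run on four made-up feature rows. The data, the learning rate and the helper names `cost` and `gradient_step` are all assumptions chosen for illustration; they are not taken from the tweet dataset.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(row, theta):
    """Probability of the positive class for one feature row."""
    return sigmoid(sum(t * v for t, v in zip(theta, row)))


def cost(X, y, theta):
    """Average cross-entropy cost over all rows."""
    total = 0.0
    for row, label in zip(X, y):
        h = predict(row, theta)
        total += label * math.log(h) + (1 - label) * math.log(1 - h)
    return -total / len(X)


def gradient_step(X, y, theta, alpha):
    """One update: theta <- theta - (alpha / m) * X^T (h - y)."""
    m = len(X)
    grad = [0.0] * len(theta)
    for row, label in zip(X, y):
        error = predict(row, theta) - label
        for j, value in enumerate(row):
            grad[j] += error * value
    return [t - alpha * g / m for t, g in zip(theta, grad)]


# Hypothetical rows in the form [bias, positive sum, negative sum].
X = [[1, 8, 1], [1, 6, 2], [1, 1, 7], [1, 2, 9]]
y = [1, 1, 0, 0]

theta = [0.0, 0.0, 0.0]
initial_cost = cost(X, y, theta)
for _ in range(200):
    theta = gradient_step(X, y, theta, alpha=0.01)
final_cost = cost(X, y, theta)

print(f"cost before: {initial_cost:.4f}, cost after: {final_cost:.4f}")
print([1 if predict(row, theta) > 0.5 else 0 for row in X])
```

With all weights at zero every prediction is 0.5, so the starting cost is $\log 2$. The weight on the positive sum then grows and the weight on the negative sum shrinks, which drives the cost down on this toy data.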