Online Messages Sentiments Analysis Based on Long Short-Term Memory

In December 2019, an extremely infectious and deadly disease emerged in China. The novel coronavirus disease COVID-19 broke out in Wuhan and spread rapidly to other countries. COVID-19 became a worldwide disaster, affecting not only physical but also emotional health on a global scale. We wanted to record this change with a sentiment analysis model and to examine the relationship between world events and the positivity of posts on social media. To analyze this relationship, we used a set of movie reviews as a training sample to construct a sentiment analysis model based on the Long Short-Term Memory (LSTM) neural network and calculated each text's sentiment score. We then analyzed the overall trend of the data and discussed the reasons behind it. The principal result was that, as the pandemic progressed, online sentiment generally became more positive. We believe that this is because people gradually became more accustomed to life in the COVID-19 era.

are representative of the public's overall messages on social media and can validly express the public's sentiment.
2) Because only English texts were gathered during data collection, the conclusions of this paper are valid only for the United States, Britain, and other English-speaking countries or districts. Because the data lack source locations, the article assumes no significant difference among data from different areas, such as urban and rural areas.
3) Because the article's sentiment analysis model scores only messages from December 2019 to March 2020, the article assumes that the TextBlob model, which produced the pre-calculated sentiment scores of tweets from April to July, applies roughly the same scoring criteria as the article's model.
4) When calculating the sentiment scores for the December-to-March texts, only a randomly selected 1% of the data is used, and the article assumes that this selection is representative of the whole population.

Movie Review Data Set
The movie review dataset provides the training sample and the test data for the article's LSTM-based sentiment analysis model. Downloaded from an open-source GitHub repository (https://github.com/yangbeans/Text_Classification_LSTM), the dataset has two files containing positive and negative film reviews, respectively.
Using a C++ program, we selected and randomly ordered the reviews and divided them into 431 positive files and 428 negative files, each containing around 300 words and ending with a full sentence. To keep the machine from guessing a text's sentiment tag from the class proportions in the sample, we assigned 400 positive files and 400 negative files to the training sample; the remaining files form the test data.
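The balanced split described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' C++ program: the function name `balanced_split`, the file names, and the fixed seed are all assumptions made for the example.

```python
import random

def balanced_split(pos_files, neg_files, n_per_class=400, seed=0):
    """Return (train, test) lists of (file_name, label) pairs.

    Label 1 marks a positive review and 0 a negative one (an assumed encoding).
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    pos = list(pos_files)
    neg = list(neg_files)
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Equal numbers of each class go to the training sample.
    train = [(f, 1) for f in pos[:n_per_class]] + [(f, 0) for f in neg[:n_per_class]]
    # Whatever is left over becomes the test data.
    test = [(f, 1) for f in pos[n_per_class:]] + [(f, 0) for f in neg[n_per_class:]]
    rng.shuffle(train)  # mix the classes before batching
    return train, test

# Hypothetical file names standing in for the 431 positive and 428 negative files.
pos_files = [f"pos_{i:03d}.txt" for i in range(431)]
neg_files = [f"neg_{i:03d}.txt" for i in range(428)]
train, test = balanced_split(pos_files, neg_files)
print(len(train), len(test))
```

With 431 and 428 files, this leaves 31 positive and 28 negative files for testing.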

News/Message Boards/Blogs Data Set
The news/message boards/blogs data set is the subject of the sentiment analysis. The texts extracted from this data set are used for sentiment score calculation and are evaluated as positive or negative. Downloaded from IEEE Dataport (Note 2), the data set contains messages on social media from December 2019 to March 2020.
Using a Python program, we extracted the texts from the JSON files and divided them into text files, each containing a single message labeled with its date. We then gathered all the December data and randomly selected 1% of each day's January-to-March data as our data files, each containing the date label and the message text. Most of the texts are around 200 words long, and the majority have fewer than 300 words, which suits the article's sentiment analysis model.
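The extraction and per-day sampling step might look like the sketch below. The JSON field names `date` and `text`, the one-record-per-line layout, and the rule of keeping at least one message per day are assumptions for illustration; the actual schema of the IEEE Dataport files is not described here.

```python
import json
import random
from collections import defaultdict

def group_by_day(json_lines):
    """Group message texts by calendar day (YYYY-MM-DD)."""
    by_day = defaultdict(list)
    for line in json_lines:
        record = json.loads(line)
        day = record["date"][:10]          # assumed ISO-8601 timestamp field
        by_day[day].append(record["text"])  # assumed message-text field
    return by_day

def sample_per_day(by_day, fraction=0.01, keep_all_prefix="2019-12", seed=0):
    """Keep every December message and a random fraction of each other day."""
    rng = random.Random(seed)
    sampled = {}
    for day in sorted(by_day):
        texts = by_day[day]
        if day.startswith(keep_all_prefix):
            sampled[day] = list(texts)
        else:
            k = max(1, round(len(texts) * fraction))  # assumption: at least one per day
            sampled[day] = rng.sample(texts, k)
    return sampled

# Toy input: 3 December messages and 200 January messages.
lines = [json.dumps({"date": "2019-12-31T08:00:00", "text": f"dec {i}"}) for i in range(3)]
lines += [json.dumps({"date": "2020-01-01T09:30:00", "text": f"jan {i}"}) for i in range(200)]
sampled = sample_per_day(group_by_day(lines))
print({day: len(texts) for day, texts in sampled.items()})
```

Sampling within each day, rather than over the whole period, keeps every day represented in proportion to its volume.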
Using the sentiment analysis model, we obtained the sentiment score for each message and built a scored news/message boards/blogs data set in which each entry contains the date tag and the corresponding sentiment score.

Corona-Virus (COVID-19) Geo-Tagged Tweets Data Set
The coronavirus (COVID-19) geo-tagged tweets data set and the processed December-to-March data set serve as the subjects of the sentiment trend analysis. This data set was downloaded from IEEE Dataport (Note 3) and contains tweet IDs and the corresponding sentiment scores from April to July. The sentiment scores in the data set were calculated with TextBlob's simplified text processing model, which is used for sentiment analysis of emotional polarity; we assume that its criteria are roughly the same as our model's.
We gathered the data set's sentiment scores, arranged the data, and labeled the distribution and proportion of the different sentiments to create the data for the period from April to July.

Definition of Neural Network
An artificial neural network is a computational model that mimics biological neural networks using mathematical and computational operations. Its structure comprises multiple layers of artificial neurons, and the model computes by passing data and intermediate results from layer to layer. Like a biological neural network, an artificial neural network can learn: after each round of feedback, it adjusts its parameters to minimize an assigned loss. The model can be used to solve regression problems, recognize images, analyze text, and so on.
Supervised learning lets the machine observe sample data that humans have labeled in advance and find the pattern in it. The machine adjusts its parameters to form a model that predicts the output for inputs that differ from the sample. This strategy requires a large amount of data: sample data from which the machine learns, and test data with which humans evaluate the model's performance. Supervised learning can solve regression problems as well as text sentiment analysis.
Unsupervised learning, on the other hand, lets the machine group unlabeled sample data into categories on its own. It is employed in clustering analysis and in Generative Adversarial Networks.

Definition of RNN and LSTM
A recurrent neural network (RNN) is a kind of artificial neural network typically used for automatic speech recognition and natural language processing. Instead of processing each input independently, as a typical feedforward neural network does, an RNN feeds its state from earlier steps back into the computation, which allows it to detect and learn patterns in a sequence. This is beneficial in natural language processing. Take machine translation, for example: word-by-word translation is a poor strategy, because the meaning of a word often depends on its context in the sentence. An RNN can analyze the whole sequence and capture that dependence. For this reason, RNNs are powerful when the context of the sequence is critical for analyzing the data.
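To make the role of the feedback connection concrete, the following sketch runs a one-unit recurrent network over two sequences that contain the same values in a different order. The weights `w_x`, `w_h`, and `b` are arbitrary illustrative values, not parameters from the article's model.

```python
import math

def rnn_forward(xs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a single-unit RNN and return the hidden state after each step."""
    h = 0.0  # initial hidden state
    states = []
    for x in xs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

forward = rnn_forward([1.0, 0.0, -1.0])
backward = rnn_forward([-1.0, 0.0, 1.0])
print(forward[-1], backward[-1])
```

The two final states differ even though both sequences contain the same three values, which is the order sensitivity that a model treating each input independently would lack.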
Long short-term memory (LSTM) is a widely used recurrent neural network architecture. Its feedback connections allow the model to analyze a sequence comprehensively. An LSTM unit usually contains a cell and three gates (an input gate, an output gate, and a forget gate), by which the machine decides which information should be remembered for further calculation and which should be discarded. This selective forgetting and memorizing helps mitigate the vanishing gradient problem, which can cause severe difficulties when a standard recurrent neural network is trained on long sequences. Hence LSTM performs well in sentiment analysis, where information from earlier data is needed to analyze new data in the sequence.
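The gate mechanism can be illustrated with a single scalar LSTM step written in plain Python. This is a sketch of the standard LSTM update equations with hand-picked weights; the parameter layout (a dictionary of `(w_x, w_h, b)` triples per gate) is an assumption for readability and is not taken from the article's code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps gate name -> (w_x, w_h, b)."""
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # how much new information to write
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write
    c = f * c_prev + i * g            # updated cell state
    h = o * math.tanh(c)              # updated hidden state
    return h, c

# Hand-picked weights: the input gate is nearly closed in both settings.
keep = {
    "input": (0.0, 0.0, -10.0),
    "forget": (0.0, 0.0, 10.0),    # forget gate open: remember the cell state
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
erase = dict(keep, forget=(0.0, 0.0, -10.0))  # forget gate closed: discard it

_, c_keep = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=keep)
_, c_erase = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=erase)
print(round(c_keep, 3), round(c_erase, 3))
```

With the forget gate open, the previous cell state of 0.7 passes through almost unchanged; with it closed, the state is reset to nearly zero.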

Model Introduction and Efficiency
The objective is to use a long short-term memory model to analyze the sentiment in a text. The model uses the 100-dimensional GloVe word vectors from Stanford as its vocabulary. Given the assigned sample data and the December-to-March messages, the model reads each text, adjusts its length to 300 words by cutting off longer messages and padding shorter ones with zeros, and processes the text with the pre-trained vocabulary. Considering the sample size and other factors, we chose a batch size of 40, a learning rate of 0.01, and 20 epochs. The result is measured by a sentiment score ranging from -1 to 1. A negative score indicates a negative emotion, and a positive score indicates a positive emotion. The absolute value of the score indicates polarity: the sentiment is more extreme when the absolute value is close to 1, and the position tends to be neutral when the score is around zero.
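The length adjustment to 300 words can be sketched as follows. The whitespace tokenizer, the toy `word_index`, the use of index 0 for unknown words, and padding at the end of the sequence are assumptions; the text above states only that longer messages are cut off and shorter ones are filled with zeros.

```python
MAX_LEN = 300  # fixed input length stated in the text

def encode(text, word_index, max_len=MAX_LEN):
    """Map a text to exactly max_len vocabulary indices."""
    tokens = text.lower().split()                     # naive whitespace tokenizer (assumption)
    ids = [word_index.get(tok, 0) for tok in tokens]  # 0 for unknown words (assumption)
    ids = ids[:max_len]                               # cut off longer messages
    return ids + [0] * (max_len - len(ids))           # pad shorter ones with zeros

# Toy vocabulary standing in for the pre-trained GloVe word list.
word_index = {"good": 1, "movie": 2, "bad": 3}

short = encode("Good movie", word_index)
long_ = encode("bad " * 400, word_index)
print(short[:4], len(short), len(long_))
```

Every encoded text then has the same length, so texts can be stacked into batches of 40 and each index replaced by its 100-dimensional vector before the LSTM layer.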
The model runs for 20 epochs, and the sample accuracy increases from around 50% to over 96%. On the test data, which includes 31 positive files and 28 negative files, the model reaches an accuracy close to 95%. We conclude that the model is accurate enough for the analysis that follows.
The core part of the model's code is shown in Figure 1.

The R
We collate [...] December to March and integrate them with the pre-calculated [...] April to July into a single sentiment distribution file. Then we [...] We determine that data with a sentiment score greater than or equal to [...] with that between 0.1 and 0.5, exclusively, is considered to be "neutral towards [...] with a score smaller than or equal to -0.5 is "negative," while text with [...], is "neutral towards negative." The data with a sentiment score between [...] is determined to be "neutral."
[...] and counting each tag's appearance number in a single day, we [...] each day. Considering the lack of data in December and frequent [...] weeks. The first seven days in a month will be referred to as the [...] month will be divided into four files, each of which contains the [...] in the corresponding week, while the data of the last remaining [...] with the 4th week's data. Finally, we get the proportion of [...]
Vol. 14, No. 11
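The tagging and weekly grouping can be sketched as below. Only two cut-offs are legible in the text (scores strictly between 0.1 and 0.5, and scores smaller than or equal to -0.5 being "negative"); the remaining boundaries, the "neutral towards positive" label, and the rule that days after the 28th join the fourth week are assumptions chosen to be symmetric and consistent with the surviving fragments.

```python
from collections import Counter
from datetime import date

def tag(score):
    """Map a sentiment score in [-1, 1] to one of five tags."""
    if score >= 0.5:          # assumed threshold for "positive"
        return "positive"
    if score > 0.1:           # between 0.1 and 0.5, exclusive
        return "neutral towards positive"
    if score <= -0.5:         # smaller than or equal to -0.5
        return "negative"
    if score < -0.1:          # assumed mirror of the positive side
        return "neutral towards negative"
    return "neutral"          # assumed: from -0.1 to 0.1

def week_of_month(d):
    """Days 1-7 form week 1; remaining days after week 4 join week 4."""
    return min((d.day - 1) // 7 + 1, 4)

def weekly_proportions(records):
    """records: iterable of (date, score). Returns tag shares per (year, month, week)."""
    counts = {}
    for d, score in records:
        key = (d.year, d.month, week_of_month(d))
        counts.setdefault(key, Counter())[tag(score)] += 1
    return {
        key: {t: n / sum(c.values()) for t, n in c.items()}
        for key, c in counts.items()
    }

# Toy records with made-up scores, for illustration only.
records = [
    (date(2020, 1, 2), 0.6),
    (date(2020, 1, 5), -0.7),
    (date(2020, 1, 30), 0.3),
    (date(2020, 1, 31), 0.0),
]
print(weekly_proportions(records))
```

Grouping by week smooths out days with very few messages, which matters for months with sparse data.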