For some time now, it has been claimed that society is polarized, but how can we check whether that is true? Can polarization be measured? It can. Sociology, linguistics and computer science come together to study this phenomenon in the public sphere, which today is very well represented by social networks, where citizens state their position on any issue without inhibition. Using natural language processing and machine learning techniques, we can analyze the comments published on those networks.
By a polarized position we mean one that comes out for or against an idea or message without considering the points in between. We also take an opinion that offers no arguments to be an emotional opinion, the result of feeling rather than reasoning. With this in mind, studying polarization requires analyzing a large number of opinions. These opinions are reflected in the comments published on social networks, so we can collect those comments and study whether they show an emotional position or, on the contrary, reflect reasoning and present thoughtful arguments.
The first step is to create a corpus, or database of comments, extracted from a social network. These comments must refer to the same topic, for example, elections. Once we have our corpus, we identify which features of natural language (which is nothing other than human language) serve as evidence of emotion or argumentation. For example, a comment with many emoticons, capital letters or exclamation marks is probably more emotional than one that contains many argumentative verbs. For the program to understand us, we have to convert this evidence into numbers. We do so by computing the percentage of each element (emoticons, capital letters, exclamation marks, verbs, etc.) in each comment. In other words, these pieces of evidence are the values that represent the comments numerically. This search for and recording of evidence in the text is, broadly speaking, what we call natural language processing.
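To make the idea of turning evidence into numbers concrete, here is a minimal sketch in Python. It is not the code used in the study: the word list `ARGUMENT_WORDS`, the emoticon pattern `EMOTICON_RE`, the helper names and the choice of denominator for each percentage are all assumptions made for illustration.

```python
import re

# Hypothetical cue list: the article does not publish the words actually used.
ARGUMENT_WORDS = {"because", "therefore", "consider", "argue", "suggests", "shows"}

# Assumed pattern: a few ASCII emoticons such as :) ;-( =D plus common emoji ranges.
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]|[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def pct(part, whole):
    """Percentage of `part` in `whole`, or 0.0 when `whole` is empty."""
    return 100.0 * part / whole if whole else 0.0


def extract_features(comment):
    """Represent one comment as a dictionary of percentages."""
    letters = [c for c in comment if c.isalpha()]
    words = re.findall(r"[^\W\d_]+", comment.lower())
    tokens = comment.split()
    return {
        # share of letters that are capitals
        "uppercase_pct": pct(sum(c.isupper() for c in letters), len(letters)),
        # share of characters that are exclamation marks
        "exclamation_pct": pct(comment.count("!"), len(comment)),
        # emoticons per whitespace-separated token
        "emoticon_pct": pct(len(EMOTICON_RE.findall(comment)), len(tokens)),
        # share of words found in the (assumed) argumentative word list
        "argument_pct": pct(sum(w in ARGUMENT_WORDS for w in words), len(words)),
    }


print(extract_features("SHAME!!! :("))
print(extract_features("I argue this because data shows it."))
```

The first comment scores high on capitals, exclamation marks and emoticons, while the second scores only on argumentative words, which is the kind of contrast the numeric representation is meant to capture.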
The second part of the process is the one that involves artificial intelligence, and it is divided into two phases. The first phase requires human minds: part of the corpus must be set aside and a few of its comments tagged manually, to serve as examples from which the machine learning algorithm, that is, the program, can “learn”. Put another way, one or more people hand-label a handful of comments as “emotional” or “unemotional”. The second phase consists of training the algorithm. The algorithm takes the previously labeled data, detects the patterns that allow it to be classified correctly and generates a model (a representation of those detected patterns, used to classify new messages). Thus, when we process the rest of the comments stored in our corpus, the model will label them automatically, based on what it learned in the training phase.
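The two phases can be sketched with a deliberately simple learner. The article does not say which algorithm was used, so the nearest-centroid classifier below is only a stand-in chosen because it fits in a few lines; the feature values and the function names `train` and `classify` are invented for the example.

```python
def train(vectors, labels):
    """Training phase: the 'model' is the mean feature vector of each label."""
    groups = {}
    for vector, label in zip(vectors, labels):
        groups.setdefault(label, []).append(vector)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in groups.items()
    }


def classify(model, vector):
    """Label a new comment with the class whose centroid is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, vector))
    return min(model, key=lambda label: squared_distance(model[label]))


# Phase 1: a handful of hand-labeled comments (invented values), each one as
# [uppercase %, exclamation %, argumentative-verb %].
labeled_vectors = [
    [80.0, 12.0, 0.0],
    [65.0, 9.0, 0.0],
    [4.0, 0.0, 14.0],
    [6.0, 1.0, 10.0],
]
labels = ["emotional", "emotional", "unemotional", "unemotional"]

# Phase 2: train, then label comments the model has not seen.
model = train(labeled_vectors, labels)
print(classify(model, [70.0, 10.0, 1.0]))  # prints "emotional"
print(classify(model, [3.0, 0.0, 9.0]))    # prints "unemotional"
```

A real study would use far more labeled examples and would check the model against held-out comments before trusting its labels.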
Finally, processing our entire corpus with the generated model tells us how many comments are emotional and how many are not. The aim of this technique is not so much to find out how many comments are for and how many against an opinion (which could be ascertained with other technologies such as sentiment analysis) as to measure how much emotion there is in the analyzed sample, so that we can conclude whether polarization prevails in it.
This is an actual model, built as part of a Master’s thesis in the Master of Digital Letters at the Complutense University of Madrid, to study YouTube comments about the 2021 election campaign for the Madrid Assembly. The model was improved by taking other parameters into account, such as the length of the text, and by applying sentiment analysis to obtain greater precision in the classification of comments. It was also found that the model could reliably identify emotional comments, but that the opposite label does not so much mean that a comment is reasoned and unemotional as that emotion cannot be clearly identified in it. Of 16,691 comments analyzed, 8,230 were labeled by the model as “emotional”. That is a high share of confirmed emotionality: roughly half of the comments (about 49%) lack reasoning features and contain evidence of subjectivity, which suggests a significant level of polarization in the context studied.
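The final measurement is a simple proportion, which can be checked with a few lines of Python. The function name `emotional_share` is ours, and the label list is rebuilt from the two counts reported above rather than from the real comments.

```python
from collections import Counter


def emotional_share(labels):
    """Percentage of labels equal to 'emotional' (0.0 for an empty list)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return 100.0 * counts["emotional"] / total if total else 0.0


# Rebuild a label list with the reported counts: 8,230 of 16,691 comments.
labels = ["emotional"] * 8230 + ["unemotional"] * (16691 - 8230)
print(f"{emotional_share(labels):.1f}% emotional")  # prints "49.3% emotional"
```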
Lys Mayor Duenas is a computational linguist and a graduate of the Master of Digital Letters at the Complutense University of Madrid.
Chronicles of the Intangible is a space for dissemination of computer science, coordinated by the academic society SISTEDES (Society of Software Engineering and Software Development Technologies). The intangible is the non-material part of computer systems (that is, the software), and its history and future are recounted here. The authors are professors from Spanish universities, coordinated by Ricardo Peña Marí (Professor at the Complutense University of Madrid) and Macario Polo Usaola (Professor at the University of Castilla-La Mancha).