Experiment: AI-based sentiment analysis of news articles about Malaysian politicians
A blog by Digitum Labs, implementing partner for our MyDemokrasi pilot
Image credit: iStock
Against a backdrop of disillusionment with politicians of all creeds, and the low voter engagement that results and tends to preserve the electoral status quo, this pilot set out to understand how an AI-enabled digital platform might enhance election processes and increase voter participation.
The team set out to understand whether a digital platform could become a trusted single source of information about candidates’ views on key issues, their commitments and monitorable campaign pledges, and their records in office, and how it might create opportunities for voters and candidates to engage more usefully than is possible on social platforms.
The pilot asked:
Are the electorate, candidates and elected officials willing and able to engage with such a platform?
How could information be gathered and verified?
Which technologies are needed to underpin the digital platform (AI, blockchain, big data)?
What would it take to build trust in such a platform and ensure its neutrality?
We believed that sentiment analysis (“the process of analysing digital text to determine if the emotional tone of the message is positive, negative, or neutral”1) could offer value to politicians and political candidates as well as to voters.
The platform will act as a single source of information on politicians and political candidates, including collating media coverage about politicians and linking to articles. We believe that some voters will value access to details and a volume of information linked from various sources, whereas others will value data, summaries and insights so long as they trust the analysis is accurate and unbiased. We also believe that having a publicly available, transparent score of public opinion comparable across politicians might ultimately influence behaviours and support greater accountability. The sentiment analysis might also provide valuable insights into media biases and trends, potentially informing both public and scholarly discourse on political communication. Validating these assumptions is beyond the scope of this experiment.
The significance of this experiment lies in understanding the media portrayal of political figures, which can influence public opinion and electoral outcomes. By leveraging advanced large language models (LLMs)2, we aim to automate and enhance the efficiency of sentiment analysis in political journalism.
Aim of the experiment
This experiment was designed to test whether an LLM could accurately assign a sentiment label to each article, and therefore the feasibility of using LLMs to automate sentiment analysis scores of Malaysian politicians through news articles.
Hypothesis: LLMs can be effectively used to analyse the text of news articles about Malaysian politicians and classify their sentiment scores.
Minimum proof: The AI model and methodology being prototyped should be accurate (>95%) and efficient3 enough to give sentiment scores based only on article headlines and excerpts, or on full articles.
Methodology
The AI model’s scores were compared to human scoring of 150 articles: a randomly selected sample of 50 for each of three categories (positive, negative, neutral).
The dataset:
Politician name
Article headline
Article excerpt (lede)
Sentiment score
The sentiment score is applied in the following manner:
Headlines that make politicians look good → positive.
Headlines that make politicians look bad → negative.
Headlines containing mixed sentiments, inconsequential, or otherwise → neutral.
Data collection
We collected profiles of politicians currently in the Malaysian Parliament (Members of Parliament) from the website of the Parliament of Malaysia.
News articles were sourced from three news sources between January 1, 2024, and April 3, 2024. We collected 34,805 articles in total.
Name of news site | Number of articles
Bernama | 21,621
Free Malaysia Today | 2,729
Astro Awani | 10,455
These three news sources were selected as they represent a range of perspectives: government-aligned (Bernama), independent and critical (Free Malaysia Today), and mainstream (Astro Awani).
Data processing
For each politician, we obtained an embeddings vector of their name using the text-embedding-3-large embedding model provided by OpenAI4. Embeddings are numerical representations of text that capture the semantic meaning, allowing similar words or phrases to have similar vector representations. This enables the analysis of textual data in a way that accounts for context and relationships between words.
The embedding vectors enable us to determine if the article is related to a politician but not if it is relevant. For example, it might be related because the article includes the name of the politician but it is not relevant because the article is referring to someone else with the same name. The determination of relevancy is made by the sentiment analysis model in the subsequent step.
The embeddings were shortened to 512 dimensions.
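As a minimal sketch of this step, assuming the standard OpenAI Python SDK (the pilot’s actual request code is not shown in this post), the name embedding might be obtained as follows:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_name(name: str) -> list[float]:
    """Return a shortened (512-dimensional) embedding for a politician's name."""
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=name,
        dimensions=512,  # shorten the vector at the API level
    )
    return response.data[0].embedding
```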
News articles were filtered based on their “relatedness” to each politician using cosine similarity, which assigns a value between -1 and 1; the closer the value is to 1, the greater the similarity. We included every article returning a value of 0.5 or greater.
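A sketch of the relatedness filter itself, assuming article headlines and excerpts are embedded in the same way as politician names (the function and variable names are illustrative, not the pilot’s actual code):

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, in the range [-1, 1]."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def related_articles(politician_vec, articles, threshold=0.5):
    """Keep article IDs whose embedding meets the 0.5 relatedness threshold.

    `articles` is assumed to be a list of (article_id, embedding) pairs.
    """
    return [
        article_id
        for article_id, article_vec in articles
        if cosine_similarity(politician_vec, article_vec) >= threshold
    ]
```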
After data processing, we had 10,910 articles in scope (related) from the three news sources (Bernama, Free Malaysia Today and Astro Awani) across 286 politicians.
Sentiment analysis
The sentiment analysis works by feeding a prompt into the AI which then reviews the words and phrases in the text and provides an outcome – a determination of whether the sentiment of the article is negative, positive or neutral. The algorithm calculates whether words or phrases relate to the subject, in this case a politician, negatively or positively. We used the gpt-4-turbo-preview model as it was the most advanced OpenAI model at the time of the experiment.
We broke the prompt down into three parts: describing the task, describing the desired output, and providing information about the politician.
Through trial and error, we learnt that:
Describing the meaning of "sentiment" (for example, ‘how the article "paints" the politician in question’) was more effective, reducing ambiguity and acknowledging reporting bias.
Including contextual information about the politician helped the model understand the relevance and specific sentiment of the article.
Words such as “positive” or “negative” capture qualitative nuances better than numbers and allow for strong, actionable classification of news articles.
Including a “skip” option filtered out irrelevant articles, improving the accuracy of the sentiment analysis.
Striking a balance between providing enough detail and keeping the task simple helped the model handle it effectively.
After testing and evaluating these various prompts, the following final prompt was chosen for its balance between providing sufficient context and maintaining simplicity for accurate sentiment classification.
System prompt:
You will be provided with a news headline and a brief excerpt. Your task is to classify how the news article paints a certain Malaysian politician. Please answer with ONLY one of 'positive', 'very positive', 'negative', 'very negative', and 'neutral'. If the article is irrelevant to the politician, please type 'skip'.
Politician: [Name]
Party: [Party]
Constituency: [Area], [State]
Parliament: [Parliament]
Position in Cabinet: [Position]
User prompt:
[Article headline]
[Article excerpt]
By using this prompt, GPT-4 should output one of “positive”, “very positive”, “negative”, “very negative”, “neutral”, or “skip.”
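As an illustration only, a classification call with this prompt might look like the following, assuming the OpenAI Python SDK; the helper name, template fields and temperature setting are our assumptions rather than the pilot’s exact code:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_TEMPLATE = (
    "You will be provided with a news headline and a brief excerpt. Your task is to "
    "classify how the news article paints a certain Malaysian politician. Please answer "
    "with ONLY one of 'positive', 'very positive', 'negative', 'very negative', and "
    "'neutral'. If the article is irrelevant to the politician, please type 'skip'.\n"
    "Politician: {name}\n"
    "Party: {party}\n"
    "Constituency: {area}, {state}\n"
    "Parliament: {parliament}\n"
    "Position in Cabinet: {position}"
)

def classify_sentiment(politician: dict, headline: str, excerpt: str) -> str:
    """Return a single sentiment label for one article about one politician."""
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        temperature=0,  # keep the labelling as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_TEMPLATE.format(**politician)},
            {"role": "user", "content": f"{headline}\n{excerpt}"},
        ],
    )
    return response.choices[0].message.content.strip().lower()
```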
Limitations
The use of cosine similarity to filter relevant articles is a preliminary step, and while effective, it may not be perfect. Some relevant articles could be filtered out, and some irrelevant articles might be included, affecting the accuracy of the sentiment analysis. We truncated the embedding vectors to 512 dimensions, which reduces memory usage and costs but may limit accuracy.
The sentiment labels used (positive, very positive, neutral, negative, very negative, skip) might not capture the full spectrum of sentiments expressed in the news articles. The absence of articles labelled as “very positive” or “very negative” suggests that the AI is less effective at determining the strength of feeling.
The "eyeball5" method for verifying sentiment labels relies on subjective human judgement, which can introduce inconsistencies and biases. A more systematic approach to creating a golden dataset with multiple reviewers could improve reliability.
The analysis focuses primarily on headlines and brief excerpts. While headlines are indicative, they may not fully capture the sentiment expressed in the full article. Analysing entire articles could provide a more accurate sentiment assessment, at extra cost.
Review
Observations
We scanned a total of 34,805 news articles and collected 5,169 labelled samples. The sentiment analysis revealed the following distribution:
4,199 (81.2%) labelled as irrelevant (skip)
970 (18.7%) labelled as relevant, of which
0 (0%) labelled as very positive
481 (49.5%) labelled as positive
402 (41.4%) labelled as neutral
87 (8.9%) labelled as negative
From these, we randomly selected 50 samples each of positive, neutral, and negative labels and manually reviewed them (“eyeballing”) to verify the accuracy of the sentiment labels.
This experiment demonstrates the feasibility of using LLMs like GPT-4 for sentiment analysis of news articles about Malaysian politicians. By analysing 34,805 news articles, we found that 18.7% were relevant to politicians, with 49.5% labelled as positive, 41.4% as neutral, and 8.9% as negative. No articles were classified as very positive or very negative. Only 30.7% of the 286 MPs had any sentiment score marked as relevant, highlighting the limited media coverage for most MPs. These findings underscore the model's potential but also reveal the need for further refinement.
Only 88 out of 286 MPs (30.7%) had at least one sentiment score marked as relevant. The reasons why most MPs did not receive significant coverage are outside the scope of this experiment, but we note the following implications:
The current dataset may be too limited to provide a balanced view of media sentiment towards all politicians.
The lack of data for many MPs could affect the reliability and generalisability of the sentiment analysis model. The model may be more accurate for frequently mentioned politicians but less so for those with sparse coverage.
The data sparsity could introduce bias, with the model potentially overrepresenting the sentiments towards high-profile politicians while under-representing others.
We suggest the following workarounds to combat the data sparsity problem:
Expand the dataset by including more news sources and a longer time frame to ensure a more comprehensive and balanced analysis.
Supplement the news sentiment scores with social media sentiment scores.
The absence of “very positive” and “very negative” labels suggests these extreme classifications might not be useful for this context, at least for the news articles that we have collected.
Key insights
Accuracy of the model
Different aggregation methods or combinations thereof could be explored to refine the sentiment scores: for example, a weighted average to account for political bias in each source, a moving average to dampen short-term swings, and standard deviation to illustrate polarisation across different news sources (see the sketch after this list).
Balance the trade-offs between cost and accuracy.
Sentiments towards politicians can change rapidly due to current events. The fixed period of data collection (January 1, 2024, to April 3, 2024) may not capture these dynamic changes, leading to potentially outdated or unrepresentative sentiment scores.
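To make the aggregation ideas above concrete, here is a rough sketch of how labels could be mapped to numbers and aggregated per politician; the numeric mapping, column names and source weights are all assumptions for illustration, not the pilot’s actual method:

```python
import numpy as np
import pandas as pd

# Illustrative mapping from labels to scores; the pilot's actual mapping may differ.
LABEL_TO_SCORE = {
    "very negative": -2, "negative": -1, "neutral": 0, "positive": 1, "very positive": 2,
}

def aggregate_scores(df: pd.DataFrame, source_weights: dict[str, float]) -> dict[str, float]:
    """Aggregate per-article labels for one politician in several ways.

    `df` is assumed to have columns 'label', 'source' and 'date';
    `source_weights` up- or down-weights each outlet (e.g. to account for political bias).
    """
    df = df.sort_values("date")
    scores = df["label"].map(LABEL_TO_SCORE)
    weights = df["source"].map(source_weights)
    return {
        "simple_mean": float(scores.mean()),
        "weighted_mean": float(np.average(scores, weights=weights)),
        "moving_mean": float(scores.rolling(window=7, min_periods=1).mean().iloc[-1]),  # dampened score
        "std_dev": float(scores.std()),  # a rough proxy for polarisation across coverage
    }
```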
Recommended next steps
Golden dataset: Develop a high-quality dataset starting from the manually reviewed ("eyeball") dataset to evaluate the accuracy of various AI models. This task can be crowdsourced within the team, using human feedback to verify each label produced by the AI. If a label is incorrect, the correct score will be provided. Subsequent iterations of the model and prompting techniques can be evaluated against this golden dataset to measure their accuracy. We define accuracy as the rate of correct labels out of all labels.
Explore ways to improve accuracy by using the golden dataset as a benchmark, such as:
Evaluating different LLMs (we currently use OpenAI’s models, which power ChatGPT; alternatives include Google’s Gemini or Meta’s Llama) or pre-LLM sentiment analysis tools such as natural language processing toolkits.
Experimenting with different prompting methods, e.g., few-shot prompting, where we add examples of positive and negative articles to the prompt; this might improve accuracy but at a higher cost (see the sketch after this list).
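For illustration, a few-shot variant of the prompt could prepend labelled examples (ideally drawn from the golden dataset) as earlier conversation turns before the article to classify; the placeholders below are hypothetical, not real labelled articles:

```python
# Hypothetical labelled examples; in practice these would come from the golden dataset.
FEW_SHOT_EXAMPLES = [
    ("<headline that praises the politician>\n<excerpt>", "positive"),
    ("<headline critical of the politician>\n<excerpt>", "negative"),
    ("<headline unrelated to the politician>\n<excerpt>", "skip"),
]

def build_few_shot_messages(system_prompt: str, headline: str, excerpt: str) -> list[dict]:
    """Build a chat message list with worked examples before the article to classify."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_text, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": example_text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": f"{headline}\n{excerpt}"})
    return messages
```

The longer message list increases token usage per article, which is the cost trade-off noted above.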
Conclusion
Despite limitations such as limited contextual understanding, training data bias, and the narrow scope of data sources, the preliminary findings are promising. Future work will focus on improving accuracy by developing a high-quality golden dataset, expanding data sources, and exploring different aggregation methods (we currently use simple averaging; weighted averages and standard deviation could be added) and visualisation techniques (we currently use heat maps; bar charts and other formats could be explored). By addressing these limitations, we aim to provide deeper insights into media portrayal and public perception of political figures, ultimately enhancing the robustness and applicability of sentiment analysis in political journalism.
If you’d like to dig in further…
Publish date: 30th May 2025