NLP: Political Bias in American Media using Document Embedding.

Tapas Mahanta
12 min read · Jul 17, 2020

While issues such as fake news receive plenty of attention, bias in news media often goes unexamined unless it is analyzed actively. These days almost every issue is politicized, and search-engine suggestions keep reinforcing people's existing beliefs. As a result, there is very little room for opinions to change, which polarizes our country and society further every day.

And unfortunately, mainstream media outlets are not helping to break down these barriers. Though we have some idea of the political alignment of media outlets (left, right, or center), we do not actively analyze what we read for political bias. Being a news junkie, I regularly saw political agendas take priority over news reporting, which frustrated me.

So I decided to analyze for myself how biased mainstream news outlets are. I chose three news sources (washingtonpost.com, usatoday.com, FoxNews.com), representing liberal-, center-, and conservative-aligned outlets respectively, and focused on topics that are usually controversial, such as mail-in voting, Obamacare, abortion, and, unfortunately, masks and healthcare.

I used News API, an easy-to-use free API that returns news articles for a given date range and list of topics. I then used Selenium with a pre-existing wrapper to scrape the article content from those links.
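As a rough sketch of what this collection step can look like (the endpoint is News API's standard /v2/everything route, but the API key, dates, domains, and CSS selector below are illustrative placeholders, not the exact script used here):

```python
# A minimal sketch of collecting article URLs via News API and scraping the
# text with Selenium. The key, dates, domains, and selector are assumptions.
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

API_KEY = "YOUR_NEWSAPI_KEY"  # hypothetical placeholder

def fetch_article_urls(query, from_date, to_date, domains):
    """Ask News API for article URLs matching a topic within a date range."""
    resp = requests.get(
        "https://newsapi.org/v2/everything",
        params={"q": query, "from": from_date, "to": to_date,
                "domains": domains, "language": "en", "apiKey": API_KEY},
    )
    resp.raise_for_status()
    return [a["url"] for a in resp.json().get("articles", [])]

def scrape_body(driver, url):
    """Load a URL in the browser and return the visible article text."""
    driver.get(url)
    paragraphs = driver.find_elements(By.CSS_SELECTOR, "article p")  # selector is a guess
    return "\n".join(p.text for p in paragraphs)

if __name__ == "__main__":
    urls = fetch_article_urls("mail-in voting", "2020-06-01", "2020-07-15",
                              "washingtonpost.com,usatoday.com,foxnews.com")
    driver = webdriver.Chrome()
    articles = [scrape_body(driver, u) for u in urls[:5]]
    driver.quit()
```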

The idea is to classify news articles into left, right, and center. I labeled each article based on its title (and, when the title was unclear, by skimming the content). There are a few issues with doing so. Firstly, for some articles it is difficult to judge whether the content is politically biased at all. Secondly, all the labeling was done by me, so there may be an inherent bias in the labels. It would be helpful to get opinions from others to reduce personal bias in the labeling. If you want to contribute to creating better labels, I'll leave a link to the spreadsheet at the end, where you can label articles based on their titles.

I scraped about 500 news articles from washingtonpost.com, usatoday.com, and FoxNews.com, and tagged each on topic, sentiment, and bias (left, right, or center). Out of those, I filtered 114 articles covering 13 topics for training and testing, and 55 articles on different topics (never seen by the model).

Sample of the final data

LDA:

First, let's use a topic-modeling method called LDA (Latent Dirichlet Allocation) to see if we can learn the underlying bias in the news articles.

At a high level, the LDA model assumes that each document contains several topics, so topics overlap within a document. The words in each document contribute to these topics. The topics themselves need not be known or specified a priori, but the number of topics must be specified in advance. Finally, words can overlap between topics, so several topics may share the same words.

LDA is a bag-of-words-based model (it can be seen as a generalization of the plain bag-of-words model). In bag-of-words, a document is a simple probability distribution over words (word frequencies), while LDA adds one more layer (topics) between words and documents. Basically, it builds a topic-per-document model and a word-per-topic model, both modeled as Dirichlet distributions.

source: https://towardsdatascience.com/document-embedding-techniques-fed3e7a6a25d#e586
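For reference, a minimal LDA setup with gensim looks roughly like the sketch below; the toy corpus, preprocessing, and number of topics are illustrative assumptions, not the exact code behind the results in this post.

```python
# A minimal gensim LDA sketch. In practice, the scraped articles replace raw_docs.
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

raw_docs = [
    "Mail-in voting and the election dominate the health care debate.",
    "The court ruling on abortion law drew criticism from police unions.",
    "Schools and students prepare to reopen during the pandemic.",
]
tokenized = [[t for t in simple_preprocess(d) if t not in STOPWORDS] for d in raw_docs]

dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# The number of topics is a free parameter that has to be chosen up front.
lda = LdaModel(bow_corpus, num_topics=3, id2word=dictionary, passes=10, random_state=0)
for topic_id, words in lda.print_topics(num_words=6):
    print(topic_id, words)
```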

I visualized the latent topics learned from my LDA model. For the most part, topics seem to form around issues, but looking carefully at topics 0 and 1, we can notice some keywords that can be associated with political bias.

Topic Modelling using LDA

For example, topic 1 seems to have captured some left-wing political bias through keywords such as "trump", "voting", "election", and "health", which represent issues that are important to liberals.

Topic 0, on the other hand, seems to have captured right-wing political bias to some extent through keywords such as "police", "abortion", and "law".

Though LDA has captured some bias, the topics are mostly formed along issues. LDA also performs poorly on sentiment-style classification tasks. So we will explore another document-embedding method, called Doc2Vec.

Doc2Vec:

Doc2Vec (Paragraph Vectors) is a generalization of the Word2vec model. Along with word vectors, it learns vectors/embeddings (numerical representations of text) for documents, where a document can be a sentence, a paragraph, or a whole page. It is based on the distributional hypothesis: words that appear in the same context probably have similar meanings.

There are two variants, Distributed Memory (PV-DM) and Distributed Bag-of-Words (PV-DBOW), which are analogous to the CBOW and skip-gram methods in Word2vec, respectively.

In Word2vec, given a word, we try to predict its neighboring words (defined by a window size), or vice versa. But we are not really interested in the input or output words; rather, we are interested in the weights, i.e. the vector representation the model learns during this process.

source:https://towardsdatascience.com/nlp-101-word2vec-skip-gram-and-cbow-93512ee24314

With Doc2Vec, we learn an embedding (vector) for each paragraph along with the word vectors (in PV-DBOW, only the paragraph id is used as input).

Similar to the CBOW model of Word2Vec, the model learns to predict a center word based on the context. For example, given the sentence "The cat sat on the sofa", the CBOW model would learn to predict the word "sat" given the context words the, cat, on, and sofa. Similarly, in PV-DM, the central idea is to randomly sample consecutive words from a paragraph and predict a center word from them, taking as input both the context words and a paragraph id. In the paragraph matrix, each column represents the vector of one paragraph. "Average/Concatenate" indicates whether the word vectors and the paragraph vector are averaged or concatenated.

The Distributed Bag-of-Words (PV-DBOW) model is slightly different from the PV-DM model. The DBOW model "ignores the context words in the input, but forces the model to predict words randomly sampled from the paragraph in the output." For the example above, say the model learns by predicting 2 sampled words; then, in order to learn the document vector, two words are sampled from {the, cat, sat, on, the, sofa}, as shown in the diagram.

We will try both variants discussed in the original paper: the PV-DBOW model and the PV-DM model (with both averaging and concatenation of the context words and paragraph id).
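Training the three variants in gensim looks roughly like the sketch below; the vector size, epoch count, and toy corpus are assumptions for illustration, not the exact settings used for the reported results.

```python
# Training the three Doc2Vec variants with gensim.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

raw_docs = [
    "The cat sat on the sofa.",
    "Mail-in voting was debated ahead of the election.",
    "The court ruled on the new abortion law.",
]
tagged = [TaggedDocument(simple_preprocess(d), [i]) for i, d in enumerate(raw_docs)]

models = {
    "PV-DBOW":        Doc2Vec(tagged, dm=0, vector_size=100, epochs=40, min_count=1),
    "PV-DM (mean)":   Doc2Vec(tagged, dm=1, dm_mean=1, vector_size=100, epochs=40, min_count=1),
    "PV-DM (concat)": Doc2Vec(tagged, dm=1, dm_concat=1, vector_size=100, epochs=40, min_count=1),
}

# The learned vector for document 0 under each variant
# (model.dv in gensim 4.x; model.docvecs in gensim 3.x).
for name, model in models.items():
    print(name, model.dv[0][:5])
```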

For the performance measure, we will choose the F-beta score (with beta = 0.5), since our focus is on reducing false positives: we do not want the model to identify something as biased (right/left) when it is not (center-aligned or factual reporting). I am okay with the model identifying a few biased articles as not biased, since I assigned bias generously while labeling. So the focus is on penalizing false positives without completely ignoring false negatives, i.e. more weight is given to precision than recall.

We will treat each article as a document, learn its representation, and afterwards classify each document using a Logistic Regression classifier.

beta = 0.5 gives recall half the importance of precision.
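A minimal sketch of this evaluation setup with scikit-learn is shown below; the random stand-in vectors and labels are placeholders for the learned Doc2Vec vectors and my hand-assigned labels.

```python
# Document vectors in, logistic regression out, evaluated with F0.5.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(114, 100))    # stand-in for 114 Doc2Vec article vectors
y = rng.integers(0, 3, size=114)   # stand-in labels: 0=center, 1=left, 2=right

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

pred = clf.predict(X_test)
print("F0.5 (weighted):", fbeta_score(y_test, pred, beta=0.5, average="weighted"))
```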

Among all three, the PV-DBOW (Distributed Bag-of-Words) model had the best result, followed by PV-DM with concatenation, with PV-DM with mean performing the poorest.

The results are satisfactory, although predictions lean towards the "center" class: some moderately left- or right-wing articles are also labeled "center". But the model performs well at finding similarities between texts, i.e. given a sufficiently biased article, it returns articles of similar bias from the corpus.

For example, given the following article (tag 9), which has a visible right-wing bias, it returns documents with similar biases.

Similarly, given an article (tag 65) with a left-wing bias, it returns documents with similar biases.
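In gensim, such a similarity query looks roughly like this (assuming the model was trained on the full article corpus, so that a tag like 9 from the example above actually exists):

```python
# Find the documents most similar to the article tagged 9.
# model.dv in gensim 4.x (model.docvecs in 3.x) holds the document vectors.
model = models["PV-DBOW"]                       # one of the Doc2Vec models above
for tag, score in model.dv.most_similar(9, topn=5):
    print(tag, round(score, 3))
```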

But the issue here is that the model is learning document embeddings based on keywords. In this case, although the first two similar vectors share both the bias and the issue/topic (abortion, Russian bounty), the third most similar vector belongs to the same topic but the opposite political alignment. For tag 9 in the previous example, the third and fourth most similar vectors are on the same topic but carry a different political bias.

This suggests the model is learning the topic (presumably from keywords) but has failed to capture the underlying bias through semantics.

By closely studying a few article contents, I noticed that the biases (or the sentences that form the bias) tend to use more negative nouns and adjectives. Strong opinions or criticism usually use negative sentiment to grab the reader's attention (or that is what newspapers intend to do). Below are some examples of how strong (rather, negative) words are used to emphasize a bias.

Example 1: McEnany on Monday insisted that the intelligence wouldn’t be brought to Trump until it was verified, a claim that runs contrary to common sense (given the outsize risks that ignoring an unverified threat might pose) and reports suggesting that it in fact had been brought to the president’s attention.

Example 2: Democrats would strip all of that away. The potential for fraud in such a system is obvious. But the liberal media keep telling you we have nothing to worry about, repeating over and over again that “there is no evidence voting by mail leads to voter fraud.”


For our training on all three models, we will keep only the sentences from each article that contain negative words. This reduced the corpus size by half on average. Now we have a smaller but more relevant corpus for our task.
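A sketch of this filtering step is shown below; using NLTK's opinion lexicon is an assumption about how "negative words" could be defined, not necessarily the exact word list used here.

```python
# Keep only the sentences of an article that contain at least one negative word.
import nltk
from nltk.corpus import opinion_lexicon
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("opinion_lexicon", quiet=True)
NEGATIVE = set(opinion_lexicon.negative())

def keep_negative_sentences(article_text):
    """Return only the sentences that contain a negative word."""
    kept = []
    for sentence in sent_tokenize(article_text):
        words = {w.lower() for w in word_tokenize(sentence)}
        if words & NEGATIVE:
            kept.append(sentence)
    return " ".join(kept)

example = ("Democrats would strip all of that away. "
           "The potential for fraud in such a system is obvious. "
           "The bill passed the committee on Tuesday.")
print(keep_negative_sentences(example))
```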

Though this resulted in lower accuracy, the model no longer learns directly from topic keywords; it captures the underlying bias better than it did with the previous approach.

Positive Similarity Results for PV-DM model with concatenation
Negative Similarity Results for PV-DM model with concatenation

You can find more illustrations in the Jupyter notebook.

We see that the top two similar articles have the same political bias (left-wing) as the test vector. More importantly, they belong to different topics (Russian bounty, DACA). This tells us the model is learning the underlying bias, which the previous approach lacked. The flip side of this approach (keeping only negative-word sentences) versus the previous one (all text) is that the latter has higher accuracy when classifying the documents with Logistic Regression. Below are the CV scores and the best hyper-parameters found via GridSearch for all three variants.
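A grid search of this kind might look like the following; the parameter grid and fold count are assumptions for illustration, not the grid actually used.

```python
# A possible GridSearchCV setup for the logistic-regression step, scored with F0.5.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

f05 = make_scorer(fbeta_score, beta=0.5, average="weighted")
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10], "solver": ["lbfgs", "liblinear"]},
    scoring=f05,
    cv=5,
)
grid.fit(X_train, y_train)   # X_train, y_train from the earlier classification sketch
print(grid.best_params_, grid.best_score_)
```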

Though all models have lower CV scores, they perform considerably well on never-before-seen articles: almost a 20 percentage point improvement in accuracy over the majority-prediction classifier (majority vote of all 3 predictions).

Trained topic results
Never before seen topic results

Similar documents no longer belong to the same topic. Of course, the accuracy dropped, but now documents of similar bias are nearer to each other irrespective of topic. We can see this by plotting the t-SNE vectors of the documents in Tableau and creating clusters.

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a technique primarily used to visualize high-dimensional representations, such as those learned by an artificial neural network. In simpler terms, t-SNE gives you a feel or intuition for how the data is arranged in a high-dimensional space.
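A quick way to produce such a projection with scikit-learn is sketched below; the perplexity value and matplotlib plot are illustrative choices (the coordinates here were exported to Tableau instead).

```python
# Project the document vectors to 2-D with t-SNE and color by bias label.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

coords = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)  # X: document vectors
plt.scatter(coords[:, 0], coords[:, 1], c=y, cmap="coolwarm", s=15)            # y: bias labels
plt.title("t-SNE of Doc2Vec article vectors")
plt.show()
```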

t-SNE plots of document vectors learned by the model.

We see that, for the most part, documents of similar political bias are closer together inside a cluster. We could achieve good results with any distance-based clustering algorithm.

To understand what keywords or patterns the model has learned from the data and is using to classify news articles, we plot the most popular keywords from articles the model predicted as left, right, or center.
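One simple way to produce such keyword lists is sketched below; the articles_by_label dictionary is a hypothetical placeholder for the article texts grouped by the model's predicted label.

```python
# Count the most frequent non-stopword tokens per predicted bias label.
from collections import Counter
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

def top_keywords(articles, n=20):
    """Return the n most common non-stopword tokens across a set of articles."""
    counts = Counter()
    for text in articles:
        counts.update(t for t in simple_preprocess(text) if t not in STOPWORDS)
    return counts.most_common(n)

# Hypothetical grouping: predicted label -> list of article texts.
articles_by_label = {
    "left":   ["Voting rights and health care dominated the election coverage."],
    "right":  ["The new abortion law and police funding drew sharp criticism."],
    "center": ["Schools and students prepare for the fall semester."],
}
for label, articles in articles_by_label.items():
    print(label, top_keywords(articles, n=5))
```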

Most Popular keywords on left-wing articles

We can see the issues, persons, or organizations that are the point of focus or criticism for liberals and conservatives.

Most Popular keywords on rightwing articles

We can also see the topics that are least politicized, which include schools, students, police, etc.

Political bias often falls in a grey area and is subjective, so there is always scope for improvement. The next improvements that can be made are:

  1. As mentioned above, the labeling of articles was done by me; you can contribute by labeling the data. I have posted a link to a shared spreadsheet, and you are welcome to update the bias (Alignment column) in the data. This will help make the solution more robust.
  2. We can try classifiers other than Logistic Regression.
  3. Distance-based methods such as k-nearest neighbors, with varying k to provide a majority prediction, should perform well based on the Tableau representations of the t-SNE vectors we saw before.
  4. PV-DBOW was able to recognize only the left bias, while PV-DM was able to recognize only the right bias. This can be tweaked further for better results.

You can find all the code for scraping the data and modeling here.

Bonus:

Also, when we cluster the document vectors, the model seems to have differentiated the articles based on both source and bias. From the t-SNE graph, we see that the top-left quadrant has a majority of articles from the Washington Post, the majority of Fox News articles are in the bottom-right quadrant, and USA Today articles lie in the bottom-left quadrant.

Putting the same articles under the political bias (alignment) tag, we see that a similar division exists (left-wing articles top-left and right-wing bottom-right), but a lot of articles in both the left and right quadrants turn out to be center-aligned.

This suggests that, though there is a clear division in the writing styles of these newspapers, the division based on political bias does not map directly onto the news source. We can see a few center-aligned articles from all sources.

References:

https://towardsdatascience.com/document-embedding-techniques-fed3e7a6a25d

https://towardsdatascience.com/nlp-101-word2vec-skip-gram-and-cbow-93512ee24314

https://medium.com/swlh/a-text-classification-approach-using-vector-space-modelling-doc2vec-pca-74fb6fd73760

https://medium.com/@amarbudhiraja/understanding-document-embeddings-of-doc2vec-bfe7237a26da
