Detecting incel misogyny on Reddit

By Virginie Soffer, Nov. 11, 2025

Media requests

In 5 seconds: Dominic Forest and Camille Demers have taught computers to automatically recognize when men's posts on the popular social-media platform target women.

The study focused on automatically detecting the discourse of involuntary celibates on Reddit.
Photo: Getty

On Nov. 7, 2017, the American social-media platform Reddit shut down r/Incels, an online forum with more than 40,000 members. This was in line with a new policy the company brought in, banning content that incites violence.

Deprived of their outlet, where did all those misogynistic incels go? Where online did they wind up conversing, and, importantly, could their conversations still be detected on Reddit itself?

To answer those questions, Canadian researchers at Université de Montréal sifted through reams of posts, applying advanced automated text analysis techniques to distinguish incel comments from other online discussions.

Their results are published in the French-language scientific journal Traitement Automatique des Langues.

They blame women

Incels (a portmanteau of “involuntary” and “celibate”) are men who share a worldview that blames women and an unjust society for their inability to form romantic relationships.

Their misogyny has led to murderous acts of violence in North America, including a mass killing in Isla Vista, Calif., in 2014 and a van attack on pedestrians in Toronto in 2018.

The problem for researchers who study their online presence, however, is that it's hard for computer algorithms to detect incel rhetoric because it uses coded language and is constantly reinventing itself.

That's where the new study — authored by Dominic Forest, a professor in UdeM's School of Library and Information Sciences (known by the acronym EBSI), and his doctoral student Camille Demers — comes in.

Their research began in an EBSI classroom in the fall of 2021. Having participated in international competitions on detecting cyberbullying, Forest suggested that his data-mining students tackle the topic, too.

Demers, then a master’s student, and her group took up the challenge.

“I saw that their work had potential,” recalled Forest. He suggested they continue the project outside the classroom. “This led to a collaboration and the project evolved considerably before reaching its final form.”

A protracted effort

Transforming a class project into a rigorous scientific study that addresses the challenges of processing massive amounts of constantly evolving data was in fact a protracted effort.

How do you train a machine to recognize incel discourse? The first step is to feed it a slew of examples. But instead of manually labelling tens of thousands of comments, the researchers opted for a “community bags” approach.

That is, rather than evaluating each message in a Reddit forum (known as a “subreddit”), they classified entire subreddits as representative of a specific type of discourse.

Drawing on previous work, they identified 23 subreddits as incel strongholds.

They then extracted 40,000 comments from these forums and labeled them “incel,” and compiled an equivalent sample of comments from more than 13,000 other subreddits to create another corpus tagged “non-incel.”
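The “community bags” idea can be illustrated with a minimal Python sketch. It is not the researchers' code: the subreddit names, the `subreddit` and `body` field names and the `label_by_community` helper are all assumptions made for illustration.

```python
# Hypothetical forum names, for illustration only; the study's list of
# 23 subreddits is not reproduced here.
INCEL_SUBREDDITS = {"example_incel_forum_a", "example_incel_forum_b"}


def label_by_community(comments):
    """Label each comment by the subreddit it was posted in, not by its text."""
    labeled = []
    for comment in comments:
        in_bag = comment["subreddit"] in INCEL_SUBREDDITS
        labeled.append((comment["body"], "incel" if in_bag else "non-incel"))
    return labeled


sample = [
    {"subreddit": "example_incel_forum_a", "body": "first comment"},
    {"subreddit": "cooking", "body": "second comment"},
]
print(label_by_community(sample))
```

The trade-off is that every comment inherits the label of its forum, so some individual labels will be wrong, but no one has to read tens of thousands of comments by hand.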

Overcoming an imbalance

The next technical challenge was to overcome the problems created by the imbalance between incel and non-incel discourse on the Internet.

Incel statements make up only a tiny proportion of online conversations, so an algorithm trained on a realistic sampling of online comments could become “lazy” and identify no comments as incel.

It would almost always be right, yet it would miss every incel comment it was meant to find.
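The arithmetic behind that shortcut can be sketched in a few lines of Python. The figures are illustrative assumptions (10 incel comments among 1,000), not numbers from the study, and `evaluate` is a hypothetical helper.

```python
def evaluate(y_true, y_pred, positive="incel"):
    """Return (accuracy, recall for the positive class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Assumed proportions for illustration: 1 per cent of comments are incel.
y_true = ["incel"] * 10 + ["non-incel"] * 990
# A "lazy" model that never predicts the incel class.
y_pred = ["non-incel"] * 1000

accuracy, recall = evaluate(y_true, y_pred)
print(accuracy, recall)  # 0.99 0.0
```

The model is right 99 per cent of the time while finding none of the comments of interest, which is why accuracy alone is a misleading yardstick here.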

To prevent the computer from taking the easy way out, the researchers trained it on data in which incel comments were overrepresented. They ran a series of tests, varying the proportion of incel comments from 10 to 90 per cent.

That way, they could find the best balance for the machine to learn to distinguish between the two sets of data.
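One way to build training sets with a chosen share of incel comments is sketched below. This is an assumption about how such a test could be set up, not the study's procedure: the pool sizes, the training-set size of 200 and the five proportions tried are placeholders.

```python
import random


def sample_with_proportion(incel, non_incel, incel_share, size, seed=0):
    """Draw a training set of `size` items with the given share of incel items."""
    rng = random.Random(seed)
    n_incel = round(size * incel_share)
    mixed = rng.sample(incel, n_incel) + rng.sample(non_incel, size - n_incel)
    rng.shuffle(mixed)
    return mixed


# Placeholder pools standing in for the two labelled corpora.
incel_pool = [(f"incel comment {i}", "incel") for i in range(1000)]
other_pool = [(f"other comment {i}", "non-incel") for i in range(1000)]

for share in (0.1, 0.3, 0.5, 0.7, 0.9):
    train = sample_with_proportion(incel_pool, other_pool, share, size=200)
    n_incel = sum(label == "incel" for _, label in train)
    print(f"target share {share:.0%}: {n_incel} incel comments out of {len(train)}")
```

Each resulting set would then be used to train and evaluate a classifier, so the proportions can be compared on equal footing.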

Data collection was another headache. As access to Reddit data changed during the project, the researchers turned to compressed archives made available by an online community of enthusiasts.

They then sorted conversations by month to avoid seasonal biases, such as men potentially feeling more isolated over the December-January holidays.
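The article does not say how the monthly sorting was used afterwards. One plausible reading, sketched here as an assumption, is to group comments by calendar month and draw the same number from each. The `created_utc` field name follows the usual layout of Reddit comment archives; `sample_evenly_by_month` is a hypothetical helper.

```python
import random
from collections import defaultdict
from datetime import datetime, timezone


def sample_evenly_by_month(comments, per_month, seed=0):
    """Group comments by calendar month, then draw the same number from each."""
    rng = random.Random(seed)
    by_month = defaultdict(list)
    for comment in comments:
        posted = datetime.fromtimestamp(comment["created_utc"], tz=timezone.utc)
        by_month[posted.month].append(comment)
    sampled = []
    for month in sorted(by_month):
        bucket = by_month[month]
        sampled.extend(rng.sample(bucket, min(per_month, len(bucket))))
    return sampled


# Synthetic data: December is five times busier than the other months.
comments = []
for month in range(1, 13):
    count = 50 if month == 12 else 10
    ts = datetime(2021, month, 15, tzinfo=timezone.utc).timestamp()
    comments.extend({"created_utc": ts, "body": f"comment {month}-{i}"} for i in range(count))

balanced = sample_evenly_by_month(comments, per_month=5)
print(len(balanced))  # 60
```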

A best overall model

After testing three methods for converting text from human-readable to numerical form and four classification algorithms, the researchers found a model that performed best.

It combined a text conversion method called SBERT with a logistic regression algorithm. This model achieved an overall F-score (a metric that combines a model's precision and recall into a single measure of performance) of 79.7 per cent.
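For readers unfamiliar with the metric, here is a minimal sketch of the common F1 variant of the F-score. The article does not specify which variant the study reports, and the counts below are invented for illustration.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall (the F1 variant of the F-score)."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 incel comments found, 20 false alarms, 20 missed.
print(round(f1_score(true_pos=80, false_pos=20, false_neg=20), 3))  # 0.8
```

Unlike plain accuracy, this score drops to zero for a model that never flags a single incel comment.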

However, powerful new analytic models such as SBERT are frustratingly hermetic.

“They’re more effective, but in practice it’s impossible to know why they made a given decision,” Demers noted. “We couldn’t determine which characteristics it was taking into consideration.”

On the other hand, TF-IDF statistical weighting — an approach Forest considers more traditional — is slightly less effective but more transparent. So the researchers tried it.

With TF-IDF, they were able to extract the vocabulary that the machine deemed most relevant for identifying incel comments. Several terms carried particular weight.

There was “incel,” of course, but also “chad” (a man considered attractive by the incel community), “woman,” “ugly,” “lonely,” “virgin” and “normies” (people considered normal by incels).
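To show why TF-IDF weights are easier to inspect, here is a simplified Python sketch. It ranks terms by the difference in average TF-IDF weight between the two classes, which is a stand-in chosen for illustration and not the study's method. The token lists are toy data, not comments from the corpus.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append(
            {t: (c / len(doc)) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weighted


def top_terms(docs, labels, target, k=3):
    """Rank terms by mean TF-IDF in the target class minus the mean elsewhere."""
    class_size = Counter(labels)
    scores = Counter()
    for weights, label in zip(tfidf(docs), labels):
        sign = 1.0 if label == target else -1.0
        for term, weight in weights.items():
            scores[term] += sign * weight / class_size[label]
    return [term for term, _ in scores.most_common(k)]


# Toy token lists for illustration only.
docs = [
    ["chad", "lonely", "today"],
    ["lonely", "virgin", "today"],
    ["recipe", "today", "pasta"],
    ["game", "today", "score"],
]
labels = ["incel", "incel", "non-incel", "non-incel"]
print(sorted(top_terms(docs, labels, target="incel")))
```

Because every score is tied to a visible word, a reader can see which terms drive the result. A word such as “today,” which appears in every document, gets a weight of zero.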

“In addition to detecting incel rhetoric, this project enabled us to describe it,” said Forest. “It showed what involuntary celibates talk about and what vocabulary their communities use.”
