The next technical challenge was to overcome the problems created by the imbalance between incel and non-incel discourse on the Internet.
Incel statements make up only a tiny proportion of online conversations, so an algorithm trained on a realistic sampling of online comments could become “lazy” and identify no comments as incel.
It would almost always be right, but it would never flag the comments it was built to find.
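A small sketch makes the problem concrete. The 1-in-100 ratio below is an assumed figure chosen for illustration, not one reported by the researchers, and the "classifier" is simply a list of zeros.

```python
# Hypothetical corpus: 1 = incel comment, 0 = everything else.
# The 1-in-100 ratio is an assumption for illustration only.
labels = [1] * 10 + [0] * 990

# A "lazy" classifier that never flags anything.
predictions = [0] * len(labels)

# Accuracy: share of comments labelled correctly.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: share of actual incel comments that were found.
found = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
recall = found / sum(labels)

print(f"accuracy: {accuracy:.1%}")  # accuracy: 99.0%
print(f"recall:   {recall:.1%}")    # recall:   0.0%
```

High accuracy paired with zero recall is the signature of a model that has learned only to say "no".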
To prevent the computer from taking the easy way out, the researchers trained it on data in which incel comments were overrepresented. They ran a series of tests, varying the proportion of incel comments from 10 to 90 per cent.
That way, they could find the best balance for the machine to learn to distinguish between the two sets of data.
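One way such a sweep could be set up is sketched below. The function name `resample_to_ratio`, the placeholder comment pools, the training-set size and the 10-point step are all assumptions made for illustration. The article does not say how the researchers built their training sets.

```python
import random

def resample_to_ratio(incel, other, incel_share, size, seed=0):
    """Draw `size` labelled comments with the requested share of incel comments."""
    if not 0 < incel_share < 1:
        raise ValueError("incel_share must be strictly between 0 and 1")
    rng = random.Random(seed)
    n_incel = round(size * incel_share)
    n_other = size - n_incel
    # Sample without replacement; this assumes both pools are large enough.
    sample = [(text, 1) for text in rng.sample(incel, n_incel)]
    sample += [(text, 0) for text in rng.sample(other, n_other)]
    rng.shuffle(sample)
    return sample

# Placeholder pools standing in for labelled comments.
incel_pool = [f"incel comment {i}" for i in range(500)]
other_pool = [f"other comment {i}" for i in range(500)]

# Sweep the incel share from 10 to 90 per cent (the step size is assumed).
for pct in range(10, 100, 10):
    train = resample_to_ratio(incel_pool, other_pool, pct / 100, size=200)
    share = sum(label for _, label in train) / len(train)
    print(f"{pct}% requested -> {share:.0%} incel in training set")
    # A real experiment would train and score a classifier at this point.
```

Whatever ratio is used for training, the model would still need to be scored on data with a realistic mix, otherwise the results would overstate how well it works in practice.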
Data collection was another headache. As access to Reddit data changed during the project, the researchers turned to compressed archives made available by an online community of enthusiasts.
They then sorted conversations by month to avoid seasonal biases, such as men potentially feeling more isolated over the December-January holidays.
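The month-by-month sorting can be pictured with the following sketch. It assumes each comment is a dictionary carrying a `created_utc` Unix timestamp, and it assumes the goal is to draw the same number of comments from every calendar month. Both points are illustrative guesses, as is the helper name `sample_evenly_by_month`.

```python
import random
from collections import defaultdict
from datetime import datetime, timezone

def sample_evenly_by_month(comments, per_month, seed=0):
    """Group comments by calendar month (UTC) and draw up to `per_month` from each."""
    by_month = defaultdict(list)
    for comment in comments:
        posted = datetime.fromtimestamp(comment["created_utc"], tz=timezone.utc)
        by_month[posted.month].append(comment)
    rng = random.Random(seed)
    sample = []
    for month in sorted(by_month):
        pool = by_month[month]
        sample.extend(rng.sample(pool, min(per_month, len(pool))))
    return sample

def ts(year, month, day):
    """Unix timestamp for midnight UTC on the given date."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()

# Toy records: a burst of December posts and a few from other months.
comments = (
    [{"body": f"dec {i}", "created_utc": ts(2021, 12, 1 + i)} for i in range(20)]
    + [{"body": f"jun {i}", "created_utc": ts(2021, 6, 1 + i)} for i in range(5)]
    + [{"body": f"mar {i}", "created_utc": ts(2021, 3, 1 + i)} for i in range(5)]
)

balanced = sample_evenly_by_month(comments, per_month=5)
print(len(balanced))  # 15: five comments from each of the three months
```

Capping each month at the same count keeps a busy holiday period from dominating the sample.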
A best overall model
After testing three methods for converting text from human-readable to numerical form and four classification algorithms, the researchers found that the best-performing model combined a text conversion method called SBERT with a logistic regression algorithm.
This model achieved an overall F-score (a metric that combines a model's precision and recall into a single number) of 79.7 per cent.
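For readers unfamiliar with the metric, here is a minimal sketch of how an F-score is computed from a model's predictions. It assumes the balanced variant (F1, with `beta=1`), since the article does not specify which variant was reported, and the labels are toy values rather than the study's data.

```python
def f_score(true_labels, predicted_labels, beta=1.0):
    """F-beta score for the positive class (label 1); beta=1 gives F1."""
    tp = fp = fn = 0
    for y, p in zip(true_labels, predicted_labels):
        if p == 1 and y == 1:
            tp += 1  # incel comment correctly flagged
        elif p == 1 and y == 0:
            fp += 1  # ordinary comment wrongly flagged
        elif p == 0 and y == 1:
            fn += 1  # incel comment missed
    if tp == 0:
        return 0.0  # nothing found, so precision and recall give no credit
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Toy example: 4 incel comments, 3 found, 1 false alarm.
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(f"{f_score(true_labels, predicted):.3f}")  # 0.750
```

Unlike plain accuracy, the score drops to zero for a model that flags nothing, which is why it suits imbalanced problems like this one.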
However, powerful new analytic models such as SBERT are frustratingly hermetic.
“They’re more effective, but in practice it’s impossible to know why they made a given decision,” Demers noted. “We couldn’t determine which characteristics it was taking into consideration.”
On the other hand, TF-IDF statistical weighting — an approach Forest considers more traditional — is slightly less effective but more transparent. So the researchers tried it.
With TF-IDF, they were able to extract the vocabulary that the machine deemed most relevant for identifying incel comments.
Several terms carried the most weight: "incel", of course, but also "chad" (a man considered attractive by the incel community), "woman", "ugly", "lonely", "virgin" and "normies" (people considered normal by incels).
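The transparency of TF-IDF can be illustrated with a small sketch. Note the substitution: it ranks terms by the difference in their average TF-IDF weight between the two classes, which is a simplified stand-in and not the researchers' method, who presumably read the weights from their trained classifier. The four example comments, the whitespace tokenizer and the function names are invented for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def top_terms(docs, labels, k=3):
    """Rank terms by mean TF-IDF in class 1 minus mean TF-IDF in class 0."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    score = Counter()
    for vec, label in zip(tfidf(docs), labels):
        for term, weight in vec.items():
            score[term] += weight / n_pos if label == 1 else -weight / n_neg
    return [term for term, _ in score.most_common(k)]

# Invented toy comments; 1 marks the class of interest.
docs = [
    "lonely virgin lonely",
    "chad wins and i stay lonely",
    "great match last night",
    "great recipe for dinner",
]
labels = [1, 1, 0, 0]
print(top_terms(docs, labels, k=2))  # ['lonely', 'virgin']
```

Because every score is tied to a visible word, the ranking can be read and questioned directly, which is the kind of inspection that SBERT embeddings do not allow.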
“In addition to detecting incel rhetoric, this project enabled us to describe it,” said Forest. “It showed what involuntary celibates talk about and what vocabulary their communities use.”