A tale of two papers
- UdeM Nouvelles
12/15/2023
- Jeff Heinrich
We put ChatGPT 4 to the test, asking it to read two scholarly studies produced at the university and summarize them in a pair of newsy articles for UdeM Nouvelles. The results were ... mixed.
How good is ChatGPT at reading and synthesizing science, specifically the highly detailed science published as peer-reviewed studies in scholarly journals? To find out, we put the chatbot to the test, twice. We fed it a pair of studies from 2022 involving researchers at Université de Montréal, chosen at their suggestion: one on brain injuries in children by psychology professor Miriam Beauchamp, the other on astrophysics by assistant professor of physics Laurence Perreault-Levasseur.
Studies like these figure regularly on UdeM Nouvelles, the university's news site, written by our in-house writers who digest the material, interview the researchers and validate the resulting article with them before publishing. Could that process be sped up with ChatGPT? Could we have OpenAI's software help do the work for us, with equally good results? The answer, we learned, hinges on two factors: how precisely we phrase our commands and how well the software understands our intent.
Within minutes of starting our exercise with ChatGPT 4, the latest version, we got back several drafts of the news articles we wanted. But though grammatically correct, they weren't perfect: what was spewed out needed re-writing; the human element, and the writer's experience as a communicator, proved essential to getting an accurate and readable result. Just to be sure, we invited each researcher to see what she thought of the final drafts and the process itself.
At the same time, we asked an independent expert to evaluate how well we did in prompting ChatGPT with the right information to produce good results. The expert – Montreal freelance journalist Yann Fortier – gives training seminars on how best to use ChatGPT as a writing tool (UdeM is a new client) and he had some solid advice on how to better frame our commands to ChatGPT in future – at the risk, of course, of someday having it replace us as writers entirely.
***
So, the studies. First up, the one on astrophysics.
"You work in media relations for a major university," we began, telling ChatGPT to imagine it were us.
"Great," it replied cheerily. "Working in media relations for a major university involves a range of responsibilities and skills." And it went on to list 10 of them, from developing a communication strategy to "staying informed."
"You need to render a complicated scientific study into plain English for a 750-word article," we continued.
That's "a challenging but rewarding task," ChatGPT replied, going on to give another 10-point list, ranging from the need to "understand the study" and "identify key elements," to concluding the text "with a takeaway."
So far, so good.
Then we entered the URL of the astronomy study. (Unlike earlier, free versions, ChatGPT 4, which is pay-to-play at about $25 a month, accepts URLs, a useful shortcut that spares us from entering the data manually.)
The reply took a few seconds.
"The scientific study titled "Posterior samples of source galaxies in strong gravitational lenses with score-based priors" focuses on improving the accuracy of inferring information from images of galaxies that are distorted by the gravitational pull of other galaxies. This process, known as strong gravitational lensing, can significantly distort the images of distant galaxies, making it difficult to accurately understand their properties."
Then came another list, only five points now, detailing the study's key aspects and concluding that it "represents a significant step forward in the field of astrophysics."
Next we asked ChatGPT to "refine the text using this source material," giving a new URL with supplementary information from the study. To which it provided a re-write of its earlier five points and ended with a more elaborate takeaway: "This study represents a significant step in high-dimensional data inference, improving our understanding of the universe through advanced image reconstruction techniques in astrophysics."
Fine, now please "re-write employing a newspaper-style inverted pyramid structure," we instructed. To which ChatGPT opened with a classic journalistic formulation – "In a groundbreaking astrophysics study ..." – and continued with two long, detailed paragraphs. "Continue," we prompted, and it did, for another long paragraph, so we repeated the command, and it did again.
"Simplify for a Grade 11 audience," we then asked, trying to dumb things down a bit.
"In a recent study," ChatGPT replied, "scientists developed a new way to look at images of distant galaxies that have been warped by the gravity of other galaxies in between. This warping effect is called strong gravitational lensing. The scientists used a special kind of math called score-based modeling to make clearer pictures of these distant galaxies. This is important because it helps us understand things like dark matter and how galaxies form and change. Their method is much better than older ones, giving us clearer and more accurate images. This could lead to new discoveries in space science!"
Maybe a little too simple. How about you "re-write from the beginning, this time for an undergraduate audience," we suggested.
So it did: "In a recent astrophysics study, researchers developed a sophisticated method to analyze images of galaxies that are distorted by the gravitational effect of other galaxies, a phenomenon known as strong gravitational lensing. This distortion can significantly alter our perception of these distant galaxies, making it challenging to study their true properties.
"To address this, the researchers used a technique called score-based modeling, which involves advanced mathematics and stochastic differential equations. This method allows them to create more accurate and clearer images of the distorted galaxies (and) represents a significant leap forward, offering a more reliable way to study and understand the intricate details of distant galaxies and the cosmic forces acting upon them."
Three "Continue ..."s later, we asked ChatGPT to "insert quotes from the lead researchers," but here the software balked. "I couldn't find specific quotes," it replied. "However, if you're interested in exploring the study further or seeking comments from the researchers, I would recommend contacting them directly through their affiliated institutions or looking for interviews and presentations they may have given on this topic." Fair enough.
Then we noticed that nowhere in all the answers ChatGPT had given was our physicist Laurence Perreault-Levasseur or any of her co-authors mentioned, so we asked it to do a re-write incorporating their names and university affiliations, and it did. Finally, after responding to a long series of additional requests – find a better headline, sub-head and lead paragraph; use less jargon; attribute some material as direct quotes by the researchers (it refused) – it gave us:
Astrophysics breakthrough reveals distant galaxy secrets
New technique by Université de Montréal and Mila scientists unlocks clearer space images
Researchers from Université de Montréal and Mila, including Alexandre Adam, Adam Coogan, and others, have made a significant breakthrough in space science. They've developed a new way to look at images of faraway galaxies that are usually blurred and twisted by the gravity of other galaxies in between. This warping effect has long been a challenge for scientists trying to study these distant galaxies.
The team used a complex but effective method to make these images clearer and more accurate. This is a big step forward in understanding how galaxies form and evolve. It's like having a much better camera for looking at space, providing scientists with a more detailed and accurate view of these distant galaxies.
After several more paragraphs in the same vein, it was time for a re-write by a human being, which is where we came in. Here's the result:
***
Revealing the secrets of distant galaxies
A new technique developed by Université de Montréal and Mila scientists unlocks clearer images of space.
In a breakthrough in space science, researchers at Université de Montréal and its affiliated AI institute, Mila, have found a way to look at images of faraway galaxies that are usually blurred and twisted by the gravity of other galaxies in between.
This warping effect has long been a challenge for scientists trying to study these distant galaxies. But led by physics professor Laurence Perreault-Levasseur, the scientists used a complex but effective method to make these images clearer and more accurate.
For astrophysicists, the development is a big step forward in understanding how galaxies form and evolve. It's like having a much better camera for looking at space, they say, providing them with a better view of these distant galaxies.
"This new method is exciting for space research," said Perreault-Levasseur. "It will allow us to study distant galaxies in much more detail than before. We'll learn more about the history of the universe, see how galaxies were billions of years ago and and how they changed over time."
And so on. Can you spot the improvements?
We reached out to Perreault-Levasseur, expecting to hear a point of view we sometimes get from researchers faced with "popularized" versions of their work: that the news article would be better if the science hadn't been oversimplified.
Unlike ChatGPT's, the astrophysicist's response wasn't instantaneous. "Your emails went to my spam folder," she apologized, adding that she'd been busy travelling for work (a seminar in Princeton, a conference in Waterloo, another in New York).
Yes, she said, she did have some issues with what we sent her.
In general, it lacked a feeling for "the bigger picture ... something we scientists would typically spend most of our time and words explaining when talking to the public."
In particular, she added, "while I think your text is pretty good based on what GPT told you, one thing that's not quite accurate is that we're not really 'de-blurring', we're 'reconstructing' undistorted images of background sources."
There followed a detailed explanation of how that works: she used the analogy that "gravitational lensing is like looking at the flame of a candle through the foot of a wine glass," i.e. there's distortion, and it needs to be "un-done."
Isn't that the same as "de-blurring"? Not quite. It's more a matter of reconstructing what the galaxy actually looks like without the distortion, and using artificial intelligence to do so.
In her study, Perreault-Levasseur explained, "we showed how we can use machine learning and deep learning to get a much more complex and realistic representation of galaxies from other images of undistorted galaxies, and then use that to 'fill' the missing information in our reconstructions.
"This allows us not only to do reconstructions that are far better than what existing methods could do, but also, much more importantly, to quantify the uncertainties of our reconstructions ... This is an important step forward in using machine learning to make scientific discoveries in astrophysics."
Then the clincher:
"What's interesting in your using GPT, a deep learning model, to produce a news article is that the whole point of our using deep learning for scientific inference was missed!"
Score one for the human.
***
Now on to the second study, the one about brain trauma in children.
"Write a news article of 800 words (headline, subhead, paragraph breaks, inverted pyramid style) from the attached study," we asked ChatGPT, providing it with a PDF of Miriam Beauchamp's research.
The response began well enough, if a bit wordily:
Study reveals parental discrepancies in rating child behavior post mild traumatic brain injury
Research highlights significant differences between mothers' and fathers' perceptions of child behavior following early childhood injuries.
A recent study published in Child Neuropsychology has uncovered significant differences in how mothers and fathers perceive and report behavioral problems in children who have sustained mild traumatic brain injuries (mTBI) during early childhood. The research, led by Lara-Kim Huynh and colleagues from the University of Montreal, provides new insights into parental reporting biases, which could have implications for child neuropsychological assessments and interventions.
After this introduction, the text cycled through another eight long paragraphs, each with a title: key findings, parental stress and family burden, in-depth analysis, clinical implications, methodology, limitations and future research, a conclusion and, finally, something short "about the authors."
Take 2: this time for an audience of undergraduates but "written in plain language," as per our command. ChatGPT began promisingly with a simpler headline – New study shows mothers and fathers differ in rating child behavior post-brain injury – and continued in a similarly "plain" way.
Take 3: this time with instructions to include details of the study size and authors, ChatGPT provided an even more idiomatic headline and text:
Study finds parents differ in assessing child behavior after brain injuries
A groundbreaking study led by Miriam Beauchamp has revealed significant differences in how mothers and fathers perceive their children's behavior following a mild traumatic brain injury (mTBI) in early childhood. The research, involving 85 children with mTBI, provides new insights into parental reporting biases.
On the fourth and final attempt, we recapitulated our commands thus: "Write a news article of 800 words (headline, subhead, para breaks, inverted pyramid style) from the attached study. Plain English, please; avoid jargon and try to use the active voice. Audience: readers with an undergraduate-level education."
After some fine-tuning and revision by us, the final version came out thus:
Moms and Dads differ on how a brain injury changes how their child behaves
When their young child suffers mild trauma to the brain, parents observe changes in behavior, but not in the same way, a new study led by UdeM researchers suggests.
Mothers and fathers observe their child's behavior changing after a mild traumatic brain injury (mTBI), but not in the same way. That's the conclusion of a new study led by Université de Montréal psychology professor Miriam Beauchamp.
Published in Child Neuropsychology, the study compared 225 children ages 1.5 to 5 years: 85 had mTBI, 58 had orthopedic injuries and 82 had typical development. Using a questionnaire, the researchers found that mothers reported more behavioral issues after mTBI than fathers did.
This difference wasn't seen in children with orthopedic injuries or in typically developing children. Interestingly, parental stress or the added family burden of having an injured child didn't seem to influence these differing views.
Early childhood is a crucial time for monitoring behavior, as untreated issues can lead to problems later in life like anxiety or mood disorders. Assessing behavior after an early mTBI is challenging, especially for very young children who can't easily express their feelings or symptoms. This makes parents' reports on questionnaires a key source of information.
Previous research often focused only on mothers' reports and reflected their generally negative perceptions of how their child was doing. After an early mTBI, mothers are more likely to report behavioral changes in their children than fathers.
Recognizing these discrepancies is vital for improving how clinicians assess child behavior and manage mTBI, Beauchamp said.
In her study, she used a questionnaire called the Child Behavior Checklist to query parents six months after their child was injured. She found that the views of both parents – not just the mother's – are equally important for a full understanding of behavioral changes in the child.
"Parental differences in rating their child's behavior can be informative," said Beauchamp. "These variations can show how a child's behavior changes in different contexts. High agreement between parents on certain behaviors can suggest these are stable over time."
Behavioral changes vary depending on the child, the severity of the injury, and other individual factors. They can include:
- increased irritability or mood swings;
- less ability to focus or concentrate, which can impact school performance;
- being unusually tired, having less energy, or showing a lack of interest in activities they used to enjoy;
- difficulty falling asleep, staying asleep, or changes in sleep patterns;
- anxiety or fearfulness, including being hesitant to participate in activities similar to the event that caused the injury;
- sadness, withdrawal from social interactions, loss of interest in hobbies and activities;
- more headaches or stomach aches, with no clear physical cause;
- newly aggressive behavior or outbursts;
- changes in how the child interacts with peers and adults, such as difficulty understanding social cues or maintaining friendships;
- poorer marks in school or difficulty with schoolwork that was not previously an issue.
Not bad, right?
But what did the researchers think? We shared the text with them. Five days later, we had their responses.
"My overall impression?" replied Beauchamp. "This exercise can be useful for quickly understanding the main conclusions of a study, and ChatGPT seems generally good at choosing the right level of language needed for a non-expert audience.
"However, in my opinion there is major confusion between what the authors of the study say (e.g. their interpretation and conclusions in the discussion of the article) and what comes from the existing literature (e.g. what is reported in the introduction)," she continued.
"ChatGPT also takes the liberty of attributing quotes to the authors that aren't real quotes, which seems problematic to me."
(Actually, we had to explain, those quotes weren't ChatGPT's doing, they were ours. Human error? Not quite: they would have been submitted to the researchers for validation before publication, and they would have had the liberty to change them or reject them altogether.)
Beauchamp summed up with some pros and cons:
Pros:
- The general conclusion of the study is fair and formulated in a simple way.
- The methodological details (number of participants, study groups/population) are reported and in sufficient detail for the study to be understood. Note, however, that this information was included only after a prompt to that effect had been added.
- I found the "follow up questions" interesting. When asked to give examples from the questionnaire used in our study, ChatGPT explains that it can't do so because the content of the questionnaire is not included in the study, it is not freely available in the public domain, and it is protected by copyright, which is absolutely true. I was surprised by this answer, because in other areas, ChatGPT does not state its sources and is not careful about where the information comes from.
Cons:
- For me the biggest problem is that there is confusion between what is reported in the introduction and what is reported in the discussion. You get the impression that the authors of the article are stating certain things when actually the conclusions come from studies previously published by other researchers. (ChatGPT does not seem to differentiate between sources of information: what belongs to the existing literature versus what is said by the authors of the article.)
- There is no mention of the study limitations, something we are careful to point out when we speak about our studies in the media. Researchers usually explain the nuances of their work. This is not addressed at all.
- Although the level of popularization generally seems adequate, ChatGPT does not go any further to explain certain terms, something I would usually do when speaking to a non-expert audience.
Beauchamp's PhD student and the first author of the study, Lara-Kim Huynh, chimed in as well:
- The quotes reported by ChatGPT are problematic, as they're not found in the text of the study. Beauchamp is quoted as saying "Parental differences in rating their child's behavior can be informative," whereas the study clearly attributes that idea to another author: "De Los Reyes (2013) suggested that parental discrepancies can be informative, rather than be interpreted as measurement error."
- ChatGPT does a good job of identifying the article's general conclusion. However, the structure of the summary does not respect the structure of the ideas presented in the text. As a result, it's hard to grasp why the authors conducted this study and what the implications are for future research. Instead, ChatGPT seems to report randomly chosen phrases from the text. It rephrases the study's conclusion several times, interspersed with information from the methodology, making it difficult to understand.
- Some sentences are actually incorrect, such as: "Previous research often focused only on mothers' reports and reflected their generally negative perceptions of how their child was doing." ChatGPT seems to have merged two sentences to create new information: "Moreover, most studies documenting behavioral changes after pediatric mTBI rely on a sole informant (...)" in the introduction, and "Together, the findings suggest that higher ratings of behavioral problems by mothers compared to fathers on the CBCL are not due to a generally more negative perception of their child's functioning (...)" in the discussion.
- The title "Moms and Dads differ on how a brain injury changes how their child behaves" is very poorly formulated and does not reflect the content of the article.
***
Could ChatGPT – and we who command it – do better next time? It certainly could, our coach, Yann Fortier, explained after taking a close look at how our "conversations" with the software went.
"I really like your sequence of requests, starting at a macro level and moving on down to the micro level," he responded.
But then he had a couple of big, potentially game-changing suggestions:
"Ask it what it thinks would be the best approach for generating a 750-word article, specifying the context, etc. to make its job easier. Your method of repeatedly asking it to "continue" is certainly effective, however, I see an optimization risk.
"In my opinion, ChatGPT doesn't know in advance that it has to 'continue'. So it probably tends to adopt a 'will to conclude', and to do so at the end of each of its replies rather than at the very end of the text.
"That's why, as an example, I prefer an approach like:
"Summarize content X into 10 main sections, numbered and listed from 1 to 10. Step 1 should be the introduction, step 10 the conclusion. Each section should flow logically and continuously."
"Once the results are displayed, ask it to write a 75-word text for step 1. Then, "Continue by writing a 75-word text for step 2." When you get that, say "Same for step 3." And on through steps 4, 5 and 6.
"It's like you're giving it a gradual mini-apprenticeship. And I see two advantages of doing things this way: you'll reach your goal of 750 words in a more direct way, instead of beating around the bush, and ChatGPT will start 'understanding' each step and thereby follow a more logical narrative arc, and will avoid looking for a 'punchline' to end each of the segments.
"I invite you to take the test to see if the answers can prove more accurate," Fortier concluded.
And so we well might. Like ChatGPT itself, our use of it is just beginning.