How Wording Exposes Academic Fraud

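@海德沙龙: Scientific research has become a big industry (the United States produced 300 PhDs in 1900 and now produces more than 50,000 a year; 0.11% of people aged 25 to 39 hold a PhD, and the share is higher in Europe), so some falsification and fraud are inevitable. Finding traces of falsification in a vast body of papers has therefore become a new science in its own right, and a new study finds that fraudulent papers show recognizable features in their wording.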

Stanford researchers uncover patterns in how scientists lie about their data

When scientists falsify data, they try to cover it up by writing differently in their published works. A pair of Stanford researchers have devised a way of identifying these written clues.

Even the best poker players have “tells” that give away when they’re bluffing with a weak hand. Scientists who commit fraud have similar, but even more subtle, tells, and a pair of Stanford researchers have cracked the writing patterns of scientists who attempt to pass along falsified data.

The work, published in the Journal of Language and Social Psychology, could eventually help scientists identify falsified research before it is published.

There is a fair amount of research dedicated to understanding the ways liars lie. Studies have shown that liars generally tend to express more negative emotion terms and use fewer first-person pronouns. Fraudulent financial reports typically display higher levels of linguistic obfuscation – phrasing that is meant to distract from or conceal the fake data – than accurate reports.

To see if similar patterns exist in scientific academia, Jeff Hancock, a professor of communication at Stanford, and graduate student David Markowitz searched the archives of PubMed, a database of life sciences journals, from 1973 to 2013 for retracted papers. They identified 253 papers, primarily from biomedical journals, that were retracted for documented fraud, and compared the writing in these to unretracted papers from the same journals and publication years that covered the same topics.

They then rated the level of fraud of each paper using a customized “obfuscation index,” which rated the degree to which the authors attempted to mask their false results. This was achieved through a summary score of causal terms, abstract language, jargon, positive emotion terms and a standardized ease of reading score.
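As a rough illustration of how such a summary score could be assembled, the sketch below standardizes five per-paper features and combines them into one number. The article does not give the exact formula, so the sign convention is an assumption: causal terms, abstraction and jargon are taken to raise the score, while positive emotion terms and reading ease are taken to lower it. The feature names and the sample values are made up for the example.

```python
from statistics import mean, pstdev

# Assumed sign convention (not stated in the article): +1 means the feature
# is taken to rise with obfuscation, -1 means it is taken to fall.
SIGNS = {
    "causal": +1,
    "abstraction": +1,
    "jargon": +1,
    "positive_emotion": -1,
    "reading_ease": -1,
}


def zscores(values):
    """Standardize a list of numbers to mean 0 and unit (population) deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def obfuscation_index(papers):
    """Return one summary score per paper from its raw feature values."""
    columns = {name: zscores([p[name] for p in papers]) for name in SIGNS}
    return [
        sum(SIGNS[name] * columns[name][i] for name in SIGNS)
        for i in range(len(papers))
    ]


# Hypothetical feature values, for illustration only.
papers = [
    {"causal": 1.2, "abstraction": 3.0, "jargon": 20.0,
     "positive_emotion": 2.5, "reading_ease": 30.0},
    {"causal": 1.8, "abstraction": 3.6, "jargon": 23.0,
     "positive_emotion": 1.5, "reading_ease": 22.0},
    {"causal": 1.0, "abstraction": 2.8, "jargon": 19.0,
     "positive_emotion": 2.8, "reading_ease": 33.0},
]
scores = obfuscation_index(papers)
```

In this toy data the second paper has the most jargon, the fewest positive emotion terms and the lowest reading ease, so it receives the highest score. Standardizing first keeps any one feature from dominating simply because it is measured on a larger scale.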

“We believe the underlying idea behind obfuscation is to muddle the truth,” said Markowitz, the lead author on the paper. “Scientists faking data know that they are committing a misconduct and do not want to get caught. Therefore, one strategy to evade this may be to obscure parts of the paper. We suggest that language can be one of many variables to differentiate between fraudulent and genuine science.”

The results showed that fraudulent retracted papers scored significantly higher on the obfuscation index than papers retracted for other reasons. For example, fraudulent papers contained approximately 1.5 percent more jargon than unretracted papers.

“Fraudulent papers had about 60 more jargon-like words per paper compared to unretracted papers,” Markowitz said. “This is a non-trivial amount.”
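The two figures quoted, roughly 1.5 percent more jargon and about 60 more jargon-like words per paper, can be tied together with a line of arithmetic. The sketch below assumes that "1.5 percent" means 1.5 percentage points of a paper's total word count, which the article does not state explicitly. Under that reading, the figures imply a typical paper length.

```python
extra_jargon_words = 60      # from Markowitz's quote
extra_jargon_share = 0.015   # 1.5 percentage points as a fraction (assumed reading)

# If 60 extra words correspond to 1.5% of a paper, the implied paper length is:
implied_length = extra_jargon_words / extra_jargon_share
print(round(implied_length))
```

That works out to about 4,000 words per paper, a plausible length for a journal article, which suggests the two figures describe the same difference.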

The researchers say that scientists might commit data fraud for a variety of reasons. Previous research points to a “publish or perish” mentality that may motivate researchers to manipulate their findings or fake studies altogether. The change the researchers found in the writing, however, is directly related to the author’s goal of covering up lies through the manipulation of language. For instance, a fraudulent author may use fewer positive emotion terms to curb praise for the data, for fear of triggering inquiry.

In the future, a computerized system based on this work might be able to flag a submitted paper so that editors could give it a more critical review before publication, depending on the journal’s threshold for obfuscated language. But the authors warn that this approach isn’t currently feasible given the false-positive rate.
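A minimal sketch of what such threshold-based flagging might look like, and of why the false-positive rate matters, follows. It is not the authors' system: the function names, the scores, the fraud labels and the threshold are all hypothetical values chosen for illustration.

```python
def flag_for_review(scores, threshold):
    """Flag every paper whose obfuscation score exceeds the journal's threshold."""
    return [score > threshold for score in scores]


def false_positive_rate(flags, is_fraud):
    """Share of genuine (non-fraudulent) papers that were flagged anyway."""
    genuine_flags = [flag for flag, fraud in zip(flags, is_fraud) if not fraud]
    if not genuine_flags:
        return 0.0
    return sum(genuine_flags) / len(genuine_flags)


# Hypothetical scores and ground-truth labels, for illustration only.
scores = [1.9, -0.4, 0.7, -1.2, 0.3, 2.4]
is_fraud = [True, False, False, False, False, True]

flags = flag_for_review(scores, threshold=0.5)
fpr = false_positive_rate(flags, is_fraud)
```

Here both fraudulent papers are flagged, but so is one of the four genuine papers, a false-positive rate of 25 percent. Raising the threshold would spare that paper at the risk of missing real fraud, which is the trade-off behind the authors' warning.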

“Science fraud is of increasing concern in academia, and automatic tools for identifying fraud might be useful,” Hancock said. “But much more research is needed before considering this kind of approach. Obviously, there is a very high error rate that would need to be improved, but also science is based on trust, and introducing a ‘fraud detection’ tool into the publication process might undermine that trust.”

Translation: 混乱阈值 (@混乱阈值)
Proofreading: 沈沉 (@你在何地-sxy)
Editing: 辉格 (@whigzhou)
