Calliope's Magic Library

That Discourse about Citations, You Know the One

After I punted a couple of thoughts onto Bluesky about the current academic citation discourse, Aversatrix thumped really hard so I have been trying to percolate a post on this ever since.

If you're not familiar with the issue at hand, arXiv, a preprint repository, has clarified that it will ban anyone who submits LLM shit in their academic work. The basic problem appears to be that people, primarily in the natural sciences, are using LLMs, which are generating false citations, and given the culture of citation in the natural sciences as a whole, those are going into the papers without being checked.

Scientists, especially those who choose to remain on the CSAM generation site, are angry. Their anger boils down to frustration that they are on the hook for the citations someone else puts in their team-written paper, and that no one can check all the citations used in a realistic way. They object that one author may not speak the language a citation was written in, or that they may not have expertise in the field that their colleague, who inserted the citation, does have.

This is, in all honesty, bullshit. And it is in fact bullshit in several ways.

So, in case you don't already know, I'm an English professor. I earned my Ph.D. in literature and cultural studies in 2013, and I teach comparative humanities and freshman composition, mostly 102. English 102 (often called by other numbers at other schools) is the second of the basic writing courses, and is traditionally focused on building up to full-fledged argumentation and use of research to back it up. That's in contrast to English 101, which may have more personal narratives, minimal research, and a greater focus on working one's way through readings and basic writing tasks before building full-fledged arguments (in 102). I am not explicitly an expert in the pedagogy of composition; my field of expertise is literature. However, I have been teaching composition for over a decade, and, to get to one of the points of this paragraph, I teach the class that teaches these scientists how to do their writing. They will undoubtedly have taken a research methods course in their field as well, and some of the things taught there will "trump" what I teach. However, we use APA, and so I'm familiar with the standards expected of citation in that style (which is the one used by scientists).

And, to be blunt, you do, in fact, need to know what's in the papers you're citing.

There's a kind of unspoken but widely acknowledged admission in the sciences that many people don't read the papers; they read the abstracts. That's not necessarily bad. If you're coming from a humanities background it can sound foolhardy, if not downright unethical, but the abstracts for scientific papers are responsible for bottom-lining everything of note in the paper. The paper is, in large part, providing the data necessary to prove the points made in the abstract and introduction. So there is pressure and expectation in scientific writing to have a lot of sources, because the abstracts make it easy to go through a lot of them in an afternoon. The expectation also makes sense, given that any single study could be a fluke, while a number of studies -- or, even better, studies and metastudies -- indicate more solid conclusions you can rest your own ideas on.

You can see where this is going. I'm not going to be polite here. Anyone who chooses to hand over any part of the writing and research project to an LLM is a fucking asshole. Don't do that. DO NOT DO THAT. You are hobbling your future ability to read, research, craft ideas, and write them down. But some assholes, faced with this pressure, are going to choose to use the plagiarism engine. And the plagiarism engine, quite famously, invents information-shaped bullshit out of nowhere.

"Bullshit" is an actual philosophical term, as you may already know. I found a PDF of what I believe to be the original essay here. My definition is based on but slightly divergent from Frankfurt's. "Bullshit" is a statement that has no referent to external reality. This is in contrast to both truth and lies. Truth is, as best as the speaker is able to do, a statement reflecting reality as it really is. "The Sun is up" [at 2 pm]. A lie is, again as best as the speaker is able to do, a statement that intentionally obscures or fails to reflect reality as it really is. "The Sun is down" [at 2 pm].

Bullshit has no referent to reality at all. It could be true. But the speaker either doesn't know or doesn't care. This distinction may seem to rely on what you're raring to call the authorial fallacy, but when evaluating a lie, specifically, we have to consider that. The truth seems easy (it's not), but with a lie, the word itself indicates the speaker knows it's false. We've all promulgated falsehoods because the facts changed or better research was done after we learned a fact. I told people for years that all -- all -- the caffeine from tea is infused into water at or above 180 degrees Fahrenheit within thirty seconds. That's not true, but it was considered true when I learned it, somewhere around 2005 (at least, that's when I read the book that said it; who knows now when the book was written?).

LLMs are, at their core, Bullshit Machines (based on stolen works). They may deliver things that happen to be true. They may not. They have no reference to anything outside their dataset, and then their programming means they aren't referring to their dataset; they're generating word strings based on it. You've seen it. I've seen it. Students sometimes turn in Rhetorical Analyses on authors who don't exist, who wrote essays that don't exist. But words follow other words in a way that can be measured statistically, and the admittedly advanced algorithms of the Bullshit Machines can start from "an analysis of X text" and generate a lot of words that all kind of go into a statistically "correct" order. And they mean literally nothing.

That's a long tangent, but it's necessary here. Scientists ought to know better. They ought to be remembering that what they're doing is examining the world around them. LLMs don't do that.

Now, that's what we might call the problem with the ideals of the people who are complaining. There are yet more reasons they're full of shit.

These two practical issues emerge more directly from my experience teaching, and I can characterize them as starting "at both ends" of the problem. That is to say, at one end there are one or more people writing a text; at the other end, there are one or more people reading that text when it's finished. Using LLM citations disrupts both ends of this continuum in a way that does in fact necessitate the banning of people who did it from the academic arena.[1]

First, and this is the obvious one, what is an author doing putting a citation into their essay if they haven't read it? This is a bit of an idealistic way to phrase it, but bear with me: if you have not read the work, you should not ever cite it in any way. Because citations are for things that you fucking read, and nothing else. That is actually practical. When you put a citation into a piece of research, you are explicitly claiming that you read it. And, while different fields may have different definitions of what it means to "read" an article (see above), every field still expects one to have done the reading if one puts the citation in. This is so important that one of the things I learned early on in academic writing is to cite the exact edition I read. No other. Because if I use an old edition, cite the newer one, and the text I quoted was changed, by a typo or a revision, then I effectively fucked up my citation, because I am pinky swearing that I read the exact edition that is in my citations.

This is what we might call "walking the walk." An expert is really an expert in two ways: the development of skills and the accruing of information. If one is not bothering to accrue information, and is putting bullshit citations in instead, then one is not an expert. And so, any such person does not have the qualifications for academic publishing.

Now, let me stop you right there. A degree is not required to get published in an academic journal. Any person with a good argument and the necessary things to back it up can be published. But those things are what demonstrate the author or authors have the qualifications.

In my classroom I refer to this using one of the old standard terms from the rhetorical tradition: ethos. "Ethos" has been defined in many ways, but let's just start with the way the Encyclopedia Britannica defines it:

... Ethos was the natural disposition or moral character, an abiding quality, and pathos a temporary and often violent emotional state. For Renaissance writers the distinction was a different one: ethos described character and pathos an emotional appeal.

"Ethos" is the character of the person making the argument. It is, in its simplest form, whether the reader (in this case) can trust the author(s). What I typically tell my students is that someone who's been established as an expert in a field is usually "ethical" in this narrow sense (not necessarily the greater, more standard sense of the English "ethical," though they're clearly related). However, the work itself is the thing where "ethos" is actually necessary. If the work is sloppy, poorly researched, difficult to understand, or otherwise badly made, it's also poor in ethos. The character of the author reveals itself through the writing and research; there's no other way for it to do so, unless one happens to know who they are in advance, or looks them up partway through reading.

I said this section was about two things and that the second had to do with the reader, but the previous paragraph is surely about that? Yes and no. As you might have already reckoned, I used the word "continuum" earlier because these two things are closely locked together. The second one is not strictly that the reader can't trust the writer, because that's the first thing.

It's that the reader must now, logically, assume everything in the text is just as bad as the worst part.

This would not always be true, were the problem to change. If one paragraph of an essay is somehow riddled with typos and the rest is not, one's opinion of the author may change a bit, but the rest of the essay is readable and presentable. But when it comes to the sources, those gel together into an edifice. "This," the author says, "is the body of my knowledge as it pertains to this subject. These citations are the things that demonstrate I know what I'm talking about and that you should listen to me" (as in the previous point). If one of those "bricks" is faulty, the entire wall is liable to fall in on itself.

Think of it this way: if you were interviewing a candidate for a job position within a field that has a particular dress code, and that candidate didn't dress that way, you might be disinclined to hire them. But if a candidate dressed the part and you learned they lied to you, even one time, you would throw their resume into the trash. I'm sure someone is going to say lying is such an ingrained part of the interview process that they wouldn't, so transpose that to something else I suppose, like dating, or if you're already extremely angry from the part earlier where I accurately described the output of LLMs, imagine instead a games journalist who lied about a game. That should get you into the right headspace, especially if you imagine them to be a woman.

Setting that aside, you see how there's no other option here than to ban people who do this from publishing in a place where the editors and publishers are on the hook for the academic credibility of everything in it. The text and the authors prove themselves untrustworthy if they include citations they did not check or read, and especially so if they use false citations.

And this is the necessary point to make before wrapping this up: these citations are false. They are not true, and academic research cannot ethically rely on bullshit -- and obviously not on lies, either.

To end, the people complaining on the CSAM site are making the claim, as I said, that it's unreasonable to expect Bob to check John's citations. Well, that may actually be true. But it doesn't matter. Because if Bob and John both put their names on a work that is justified by falsehood, they're both implicated in the lie. The question of whether that extra work was unreasonable or not was supposed to come up earlier, not at the point of submission. If it's not something someone can do, don't put yourself into the position of needing to do it. Because this isn't a matter of humanities people not understanding what your field is like. It's a matter of you putting your name on lies and getting mad about consequences.

  1. Note that arXiv is only banning these people for a year, and not (so far as I can tell) insisting on evidence that they took retraining in how to do research properly, which is probably the minimum I would call for if I were in the editors' and publishers' positions. But I'm not, so oh well.

#academia #writing