Building The Public Interest Corpus for AI and Computational Research

The CNI Interviews Podcast | Season 4, Episode 4

This episode explores an effort to create a public-interest corpus for AI training using digitized materials from research libraries, archives, and special collections. Dan Cohen, dean of the Libraries at Northeastern University, and Thomas Padilla, public interest artificial intelligence strategist for Authors Alliance, outline their goal to expand AI’s access to high-value, long-form academic content, typically absent from commercial models.


Gerry Bayne: Welcome to the CNI Interviews podcast. I'm Gerry Bayne for EDUCAUSE, and I'm highlighting conversations from the Coalition for Networked Information 2025 spring meeting. The CNI meeting is a venue for technology leaders in higher education, library administration, digital publishing, and research to share and broaden their knowledge of digital information issues. In this episode, we explore a new effort to build a large-scale corpus of scholarly materials to support AI development grounded in more academic values. I'm joined in this conversation by Dan Cohen, who is Vice Provost for Information Collaboration, Dean of the Library, and Professor of History at Northeastern University, and Thomas Padilla, Public Interest Artificial Intelligence Strategist for Authors Alliance. Both Dan and Thomas are helping lead the early stages of this initiative. They're thinking through how academic and research libraries and their archives might play a central role in shaping AI systems that serve the public interest. I started out by asking Dan what first prompted the project.

Dan Cohen: So, the origin of the project is really our look at what AI looks like right now. There are many big models out there from OpenAI or Anthropic or Google. These are great and powerful models that are getting more powerful all the time. But something that I think we noticed, and that others have noticed, is that they're built on a set of materials that's rather different from what you'd find, say, in the average research library. There are YouTube videos, there are Reddit comments or tweets, and yes, there are some books, but certainly not 100 million books, which is a rough estimate of the number of books that are out there on the shelves of research libraries. And there are other materials as well that research libraries have. Libraries have archives and special collections. These are all the primary sources and secondary sources for research in the academy, and they are also valued by the general public as a sort of storehouse of truth. And so, our feeling was that AI itself, just writ large, would benefit from a greater incorporation of materials like those on the shelves of research libraries. In particular, we thought that books would be a good starting place, since they are, you know, some of our greatest cultural productions. They are long-format, very rationally designed objects that have deep thought and carry out some of our most important transmissions of knowledge and information from one generation to the next. So, we thought it was important to think about how we could have some version of AI that was more focused on these kinds of materials, and then how to do that. And that's a very big question, how you do that. A lot of these books are still under copyright. And there are also ethical and privacy questions that we could discuss as well.
But the main factor is how do we, as a coalition of institutions, nonprofit institutions, libraries, archives, potentially museums and other organizations, how can we get together and build some kind of project that would enable AI to use these very high value productions of culture, like books?

Gerry Bayne: What are you wanting to do with the books in relation to AI?

Thomas Padilla: So, we want to establish a corpus of books, at scale, that could be used for the development of AI itself. As Dan was mentioning, the data resources that we know are used to train the predominant models that are available are really not making use of the most representative set of data possible. And, singly and in aggregate across research libraries, we have a massive amount of unique and highly valuable data. So we do want to, of course, prioritize uses of a public interest corpus by researchers and students in all different kinds of disciplines, to continue to advance knowledge in different areas.

Dan Cohen: Right now, you can ask Claude or ChatGPT questions about history, and that's great. And they'll generally answer them correctly with the materials that they found, as you know, largely off the web and other born-digital sources. But you can't ask the kind of longitudinal, deep research questions without actually having the kind of long-form contextual information that you'd find in a book. So, I'm by training a Victorianist. There are millions of books actually written and published in the Victorian period itself. And then, of course, there are millions of books about the Victorian period, ranging from fiction to nonfiction and other formats. And really, only a small slice of those are currently in the weights of these large language models. So, we believe there is a kind of opportunity, and a gap, for doing more academically focused or small language models, or finding ways to bring in a slice of the collection that might be topical and have that produce actually a better result than the generative AI that currently exists.

Thomas Padilla: A point of emphasis in this project is, you know, trying to develop a solution for enabling training with in-copyright content, in-copyright books. One of the things that we commonly hear from researchers is that they're just quite practically frustrated by their inability to access and do computational work with in-copyright books. And that's a vast amount of cultural production where there are all sorts of legal, policy, and just logistical barriers to them getting access to data to ask questions and do research that perhaps has more contemporary salience. Right? So, it's a real challenge for disciplines across the humanities, social sciences, computer science, and so forth.

Gerry Bayne: What's been the most difficult part of the project and maybe the most surprising?

Thomas Padilla: One of the challenges pertains to scope, right? So, as Dan was alluding to earlier, our focus in this initial project is on proving the use case for books at scale. But of course, just at this conference, we've had people say, well, I have special collections pertaining to the historic development of water resources in the American West. Might that be useful in a project like this? And I think, you know, the answer is yes, that would be interesting, and that would be useful, especially in the climate context. You know, it's a bunch of non-repeatable climate data, right? Often tied into special collections, field notebooks, and so forth. That would be highly valuable to understanding, say, something like climate change.

Dan Cohen: Do we even have a sense of how much of the iceberg has actually been digitized? Right. And it probably is similar to an iceberg, where it's maybe 10%. Right? I mean, look, we know that there are efforts that have been made, Google Books being among them, and the books stored in HathiTrust, where, you know, you can talk about 10 or 20 or maybe even 30 million books that may have been digitized globally, but that's still a minority of the books that are out there. They're obviously concentrated in certain cultures and regions of the world. So, there'll be lots of materials. And then when you add in the billions, literally billions, of objects that are in special collections and archives, you end up with a significant amount of material that's nowhere near being ingested into any kind of...

Thomas Padilla: Another big goal of the project is that we're trying to level the playing field for noncommercial actors to gain access to training data at scale. You know, currently, access to the best, most comprehensive, largest training data is largely the province of very well-resourced, for-profit companies. So where does that leave an academic researcher or an institution that, say, cannot secure an industry affiliation? That kind of leaves them in the lurch, right? So, we're trying to level the playing field and commit to this public interest.

Gerry Bayne: Let's talk about resources. What are you looking at in terms of resources to get these projects off the ground?

Dan Cohen: Whenever you talk about AI, you're talking about potentially very resource-intensive kinds of processes, whether it's digitization and, you know, cleaning of data that goes into a corpus that will be used to train a model, or the compute power itself, of course, although maybe the costs are coming down on that, as we've seen recently with DeepSeek and other models that are trained for a fraction of the cost of some of the frontier models. The question for us is this question of scope and parameters, right? So is the end result of this maybe just a curatorial service that shovels out a whole bunch of books and other materials to a researcher, or to a member of the public who wants to create a small language model, or so on? Are we doing the training ourselves? Are we running actual technical infrastructure with significant amounts of compute? Or are we running a more modest kind of technical infrastructure that EDUCAUSE's audience would be quite familiar with, where maybe we have some APIs and others can build some software to pull out sections of the corpus that they might want to use? So all those different scale models would really produce a different sort of organization. And of course, we're also quite aware that there are organizations out there that might want to take on this role. So really, our goal for this project, for the calendar year of 2025, is to listen to a lot of people, to talk to a lot of people like your membership, and to decide what we think the best game plan for this is, and to really map it out. We'll wind up writing, with our colleagues, a report to say we believe the organization for a Public Interest Corpus should look like this. We're trying to gather a lot of use cases, satisfy those, and then talk about how a lightweight organization might be able to serve those needs.

Gerry Bayne: Is there any sort of solicitation going on that we could point our members to, any sort of information seeking, or any way they could participate that would help you guys?

Thomas Padilla: We'll be holding a workshop in Oakland in the fall, likely in October. We, of course, also hold one-on-one informational sessions. If people are interested in talking with a member, or a couple of members, of the project team, we're happy to set up those calls. And we may need to offer a virtual workshop as well, given the level of interest that we're getting in the project.

Dan Cohen: And we should note, there are also ways to reach out and get in touch with us. Yeah, traditional means, email, etc. publicinterestcorpus.org is where you can find all the further information about the project. And I believe there is a contact form on that website as well.

This episode features:

Dan Cohen
Vice Provost for Information Collaboration, Dean of the Library, and Professor of History
Northeastern University

Thomas Padilla
Public Interest Artificial Intelligence Strategist
Authors Alliance