Where’s the data? Notes from an international forum on limited use text mining

I’m attending a two-day workshop on issues related to data access for text and data mining (TDM). We are 25 participants from different areas: researchers who do TDM, librarians who oversee digital content, content providers who package and sell data to academic libraries (principally large publishers), and, finally, lawyers.

I am excited to be here because these issues strike me as both complicated and intractable. I have tried for several years to gain greater access to data in our university library, with no success. I have also worked extensively with limited-use data and wished I could be more open with it. I’ve even looked into different solutions like https://sureshot.io/data-enrichment/ to see how the way we manage our data can be improved. Whenever I ask how the situation can improve, a finger-pointing circle begins in which everyone points at someone else and nothing changes.

The overarching question that we are all implicitly asking ourselves: Will anything change after our meeting?

Here we go.

Phase 1: The Fish Bowl

The day begins with an event called the fish bowl. During it, individuals representing a certain constituency sit in the middle of a circle and talk about the issue while everyone else listens. Then at the end, people on the outside can ask questions of those on the inside.

Group 1 = Researchers

Have all of you had a problem with TDM with library data? Yes. The system doesn’t work. Not really much more to say. The success stories with TDM have been with places like the HathiTrust Research Center, JSTOR for Research, and even Gale distributing data directly. But there were no success stories of accessing data through one’s library.

Group 2 = Librarians

There is mostly broad support to improve the situation. One participant says:

If we are going to make something available to human readers, then we need to make it available to machine-assisted reading.

Another offers this slogan: “If you can read it, you should be able to mine it.”

Still another suggests that researchers need to be clearer about what they want from library resources. This is an important point as librarians can’t solve all problems all at once.

Another: But even in cases where a researcher has a legal right to mine content, you haven’t necessarily facilitated researchers’ access to the actual data. There are many more steps that need to be taken.

When a researcher outside the circle asks whether he can have access to certain data right now, he is told either “I don’t know” or “be patient.” Libraries need to be more transparent about what is and isn’t mineable, both to forestall researcher frustration and to hold libraries and publishers accountable.

Is there a specific resource at your library that says what is TDM-accessible and what is not? This is a key first step. What percentage of your databases are mineable?

Group 3 = Content Providers

We need to learn what people are doing with our material so we can sell the idea for TDM. We need the success stories.

How can we empower users to use our data?

So many challenges in terms of making data accessible in a way that is usable.

The problem of surveilling users — TDM is a two-way looking glass. What happens when content providers are observing what scientists are doing, either to take advantage of it or to restrict access to it?

Need to do away with content providers requiring logins and accounts to use library data. That’s the library’s job.

Group 4 = Academic Societies

Thinking about a second class of researchers beyond the bleeding edge: where’s the Google Ngram tool for library data?

Group 5 = Legal Perspective

If you have data, what can you do?

Fair use means anything beyond the underlying data is protected. You can transform it.

But how can you get the data?

You can scrape things, usually in violation of terms of service. Unclear consequences of this.

What about cross-border issues?

In Europe IP issues are getting more and more restrictive, not less.

Is TDM even a copyright issue, and should it ever have been one?

Can you post a matrix of word counts for in-copyright data? Yes.
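For readers who have not seen one, here is a minimal sketch of what such a matrix of word counts might look like, written in Python. This is my own illustration, not something shown at the workshop: the two sample texts, the document IDs, and the function names `tokenize` and `word_count_matrix` are all made up. The point is that the matrix keeps how often each word occurs in each document but not the order of the words, so the original text cannot be read from it.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_count_matrix(docs):
    """Build a document-by-word count matrix.

    `docs` maps a document ID to its full text. Returns the sorted
    vocabulary (column labels) and one row of counts per document,
    in the same order as `docs`.
    """
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    matrix = [[counts[doc_id][word] for word in vocab] for doc_id in docs]
    return vocab, matrix


# Hypothetical sample texts standing in for in-copyright documents.
docs = {
    "doc_a": "The whale, the sea.",
    "doc_b": "The sea is calm.",
}

vocab, matrix = word_count_matrix(docs)
print(vocab)
for doc_id, row in zip(docs, matrix):
    print(doc_id, row)
```

Each row is a document and each column a word from the shared vocabulary. Real projects would typically use a more careful tokenizer and a sparse matrix format, but the shape of what gets shared is the same.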

Need literacy right now for librarians and researchers. Librarians should stop being gatekeepers of data and start facilitating. Leave the legal issues to the lawyers.


Everyone has a role to play:

  • Content providers need to embrace TDM rights (or stop limiting them).
  • Librarians need to stop signing licenses that limit users’ ability to use the information in the library.
  • Faculty need to support librarians in these efforts and accept trade-offs. Faculty continue to demand access to everything and this puts librarians in a very weak negotiating position. We have to be willing to say no as a community to bad practices on the part of publishers.


These are just a few of the insights gleaned. A white paper is forthcoming, as is a declaration. The content provider group is working on various ways of continuing to provide third-party access to publisher data in “non-consumptive ways”. The lawyers are going to work on a literacy campaign for researchers so they know what their rights are.

More to come.