It is Fair and Enabling – Data Mining from JNU Depot

Author: Dr. Kalyan C. Kankanala
Data mining in today’s context is not limited to the collection, organisation, management, searching and display of relevant data. It means much more than that, and includes the creation of information, knowledge and intelligence from data sets, among other things. At a basic level, data mining involves extraction of data, which is mostly done by using data mining software, applications or tools, and analysis of the extracted data for a defined purpose. The data sets extracted may be further used by humans or machines for different purposes ranging from understanding trends and relations to training and developing artificial intelligence.
Data and Scientific Publications
Data can be extracted or mined only if it is available in one expressed form or another. One source of data is scientific publications. Most publications are expressions and articulations of studies, trials, experiments, and findings, among others. Though most publishers and authors assert copyright over their publications, that copyright protection does not normally extend to the facts and data on which those publications are based. Therefore, using facts and data that form part of publications is not copyright infringement. However, making copies of publications in order to extract the data and facts in them may give rise to infringement liability unless it is fair dealing/use or otherwise exempted.
The JNU Depot and Mining
Nature India has reported that a giant database called JNU Depot is being created by Carl Malamud to facilitate data mining by scholars, researchers and others. An excerpt from the article reads as follows:
“Over the past year, Malamud has — without asking publishers — teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 up to the present day. The cache, which is still being created, will be kept on a 576-terabyte storage facility at Jawaharlal Nehru University (JNU) in New Delhi. “This is not every journal article ever written, but it’s a lot,” Malamud says. It’s comparable to the size of the core collection in the Web of Science database, for instance. Malamud and his JNU collaborator, bioinformatician Andrew Lynn, call their facility the JNU data depot.
No one will be allowed to read or download work from the repository, because that would breach publishers’ copyright. Instead, Malamud envisages, researchers could crawl over its text and data with computer software, scanning through the world’s scientific literature to pull out insights without actually reading the text.”
Fair Dealing/Use and Exemptions
Two distinguished IP professionals have taken opposing views with respect to JNU Depot/Mining infringement liability. While Arul argues that data mining from JNU Depot amounts to fair dealing/use, Prashant argues that the data dump of millions of infringing materials may not be considered fair. Swaroop, an upcoming IP practitioner from Chennai, believes that non-material reproduction may not give rise to infringement liability. I am inclined to agree with Arul’s and Swaroop’s conclusions, albeit for slightly different reasons. It is pertinent to note that each of them assumes a different set of facts, and I have made my own assumptions based on information available in the source article for purposes of this opinion.
From my point of view, two activities are pertinent for determining whether Malamud’s and JNU’s actions amount to copyright infringement, and whether they are fair use/dealing or otherwise exempted. The first activity is making copies of millions of articles and storing them in a medium to form the JNU Depot. The second activity is mining the data and text to gain research knowledge, insights and intelligence from the articles on JNU Depot. I will deal with the second activity first.
The act of mining data in today’s context will certainly not amount to infringement, and even if it does, it is fair use/dealing. Before I begin, it is important to note that the data mining is done by a tool/application, which does not permit access to the full article, and only enables the user to analyse the data in relevant articles and gain insights. My first assumption is that what the data mining tool collects, processes, analyses and provides insights about is the facts in the articles and not their specific expression. On that assumption, the question of copyright infringement does not arise. As the data mining tool will only be looking at and collecting data such as facts, results, trials, conclusions, etc., in the articles, which are not copyrightable, the copyright owners of the articles will not be able to go beyond the first step of infringement analysis, which is proving that copyright protection subsists in the data utilised by the tool. Even if they manage to overcome this obstacle, several fair use/dealing precedents stand in their way.
Over the years, copyright law has made exceptions for technological progress and innovative activities aimed at a social or public purpose. In other words, if a technology or activity serves the larger good of the public, copyright infringement liability cannot come in its way. Several examples of exceptions being made under fair use/dealing or other exceptions exist with respect to different types of works. More than thirty years ago, making copies of movies and programs for time shifting and space shifting was permitted as fair use in the popular Sony case. In that case, a recording technology was not proscribed merely because it could be used to reproduce works.
Much later, Google was permitted to make copies of works on the world wide web to provide the facility to search and research, which is beneficial to the public and serves a social goal. In another case involving Google, reproducing and making copies of books was permitted to enable users to see excerpts from books. Though many file sharing facilities such as Napster, Grokster, etc., were shut down, specific principles were laid down within which they can operate. Today, a strong intermediary exception has evolved to protect enabling technologies from claims of copyright infringement as long as knowledge, supervision, active action and direct financial benefits from infringing works can be avoided. Aggregating and disseminating content to make it easily and conveniently available to the general public has been permitted, and many businesses rely on such an exception to further technological progress and profitability.
Several technologies and activities have been exempted as fair dealing/use or otherwise by Courts, and text/data mining of articles on JNU Depot will certainly be considered fair use/dealing. The data mining tool is an enabling technology that helps researchers gain research insights, which could go a long way in furthering their research work, which is beneficial to society. The fact that it only provides insights and does not provide the articles from which the insights have been generated substantiates the fairness of the technology and its results. Furthermore, the data mining tool is an enabling technology, which acts as an intermediary between the JNU Depot and the researcher. Therefore, the exception provided to intermediaries under the copyright law is also available to the data mining tool, and the tool and its use are exempted even if the use is not considered fair use/dealing.
Now, coming to the act of making the JNU Depot, which includes millions of articles, I believe that the ends justify the means. Simply put, copying and storing so many articles in one place amounts to copyright infringement, but the purpose for which the articles have been reproduced, downloaded, copied and stored makes the infringing activity fair dealing/use. I am convinced by Arul’s articulation that the use of the articles for private/personal use and for research is fair dealing, and that the course of instruction exception will apply to the JNU Depot with a little bit of stretching. However, I would argue here that the act of making a data depot would amount to fair use even if the depot is made by an organisation with a profit motive.
In my opinion, making millions of copies of articles for purposes of data mining to gain insights, knowledge and intelligence justifies the act as fair and exempts it under the copyright law. Reproducing works and maintaining them to facilitate searching, accessing, excerpting, sampling, sharing, and so on, has earlier been exempted, and making a depot for mining is more justifiable than some of the exempted purposes. If JNU Depot and similar depots for data mining are not exempted as fair dealing/use or otherwise, the progress of artificial intelligence, which is dependent on data mining, may slow down due to copyright issues. Data is the raw material for neural networks, and their efficiency and accuracy depend to some extent on the quality and quantity of the data available for training the models for specific applications. For example, to make an AI application that provides inputs about the surroundings to persons with blindness, the application must be trained using data pertaining to millions of surroundings to form the necessary networks. If images, videos and other data that are the subject of copyright protection cannot be used, then the application will be relatively less effective, as it will have been trained using that much less data.
To conclude, the activity of data mining is certainly fair use/dealing or otherwise excepted as it is merely an intermediary/enabling technology and uses only facts in the articles, which are not copyrightable. Furthermore, as the data mining tool merely provides research insights and does not provide the articles to its users, the use of the full articles to generate research insights amounts to fair dealing/use. Uses which provide far fewer benefits to society and the general public have earlier been exempted, and there is no reason why data mining will not be exempted as fair dealing/use or otherwise.
The JNU Depot will also be considered as fair dealing/use because its ends justify the means. Unless reproduction of millions of copies is permitted, the data from them cannot be extracted and analysed, and insights, knowledge and intelligence created. Not permitting such data repositories for research and social purposes may impede the progress of AI technologies/applications. The fact that the JNU Depot sits at JNU for research use puts it squarely within the fair dealing and educational exceptions under the Indian copyright law, and the number of articles in the depot cannot be a determinative factor in assessing whether it is fair or not.
Note: I have consciously stayed away from citing cases and paras from cases, which will be part of the next post on the topic by my colleague, Ashwini.
My sincere thanks to Professor Arul for prompting me to write this post. If not for him, I wouldn’t have sat down to pen this one.
The Plan to Mine the World’s Research Papers available at, visited on 26th August, 2019.
Dr. Arul George Scaria, “Should Indian Copyright Law Prevent Text and Data Mining?”, available at, visited on 1st September, 2019.
Prashant Reddy, “Malamud’s New TDM Venture May Not be Shielded by Section 52 of the Copyright Act”, available at, visited on 1st September, 2019.
Swaroop Mamidipudi, “Is the JNU Data Depot Even “Reproducing” Papers?”, available at, visited on 1st September, 2019.
Sony Corp. of America v. Universal City Studios, Inc., 464 U.S. 417.
A&M Records, Inc. v. Napster, Inc., 239 F.3d 1004.
MGM Studios, Inc. v. Grokster, Ltd., 545 U.S. 913.
Authors Guild v. Google, Inc. 804 F.3d 202 (2d Cir. 2015)
