The first question to ask is: Has the copyright holder already granted a license for you to include their work? Sometimes authors have already provided permission through grants such as Creative Commons licenses. The license itself will identify which uses can be made without first getting the author's permission.
If a copyright holder has not already applied a Creative Commons license, they may be willing (often for a fee) to grant publication permission under specific terms and conditions. Step 2 addresses how to ask for that permission.
Public domain works are open for use with no permission needed. Just because material is online, however, does not mean it's in the "public domain." Public domain instead refers to works for which copyright protections have expired, or works that were ineligible for protection.
How can you figure out whether copyright has expired? The UC has provided helpful general rules of thumb in its Public Domain guide. For more detailed inquiries, we recommend using Cornell's chart, Copyright Term and the Public Domain in the United States, in combination with the Stanford Copyright Renewal Database. In brief:
Published prior to 1927: Work is in the public domain.
Published 1927-1977: If the work was published without a © notice, it is in the public domain. If it was published with a © notice, but no registration renewal was ever filed, it is in the public domain.
Published 1978-present: Work is likely not yet in the public domain. (Remember, copyright usually lasts for the life of the author plus 70 years.)
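The rule of thumb above can be written as a small decision function. This is only a sketch of the three brackets as stated here, not of the fuller Cornell chart; the function name, parameter names, and returned labels are made up for illustration, and the defaults assume the cautious case (a notice was present and a renewal was filed).

```python
def public_domain_status(year_published, had_notice=True, renewal_filed=True):
    """Apply the three-bracket rule of thumb for U.S. publications.

    Hypothetical helper: the cut-off years come from the list above;
    the parameter names and return labels are illustrative only.
    """
    if year_published < 1927:
        return "public domain"
    if year_published <= 1977:
        # Published without a (c) notice, or with a notice but never renewed
        if not had_notice or not renewal_filed:
            return "public domain"
        return "may still be protected"
    # 1978 to present
    return "likely still protected"


print(public_domain_status(1915))
print(public_domain_status(1950, had_notice=True, renewal_filed=False))
print(public_domain_status(1995))
```

A result other than "public domain" means only that the rule of thumb does not settle the question; the detailed chart and the renewal database are the next stop.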
Material that is ineligible for copyright protection includes:
Ideas and facts
U.S. government works (although foreign government and state government works may indeed be protected, as may works funded but not produced by the federal government)
Scientific principles, theorems, formulae, and natural laws
Scientific and other research methodologies, statistical techniques, and educational processes
Laws, regulations, judicial opinions, and legislative reports
Words, names, numbers, symbols, signs, rules of grammar and diction, and punctuation
⇒ Are you planning on using the work in a different way, or for a different purpose, than the original creator did? In other words, in copyright terms, is your use “transformative”?
⇒ Are you using an amount of that work that is narrowly tailored to your new purpose?
Courts weigh four factors when deciding whether a use is fair:
The purpose and character of the use, including whether the intended use is commercial or for nonprofit educational purposes. Tip: Uses in nonprofit educational institutions are more likely to be fair use than uses for commercial purposes. This may work in your favor for publishing the project, but not necessarily for a subsequent commercially licensed database based on your project.
The nature of the copyrighted work. Tip: Republishing factual work is more likely to be fair use than incorporating a creative, artistic work such as a musical composition.
The amount and significance of the portion used in relation to the entire work. Tip: Using smaller portions of a work is more likely to be fair use than using larger portions, or portions that represent the "heart" of the underlying work.
Potential impact on the market for the original. Tip: Uses that have little or no impact on the copyright holder's ability to sell or license the original work are more likely to be fair. If the copyright holder offers licenses for uses similar to yours, using the work without that license could harm the market for the license, which weighs against fair use.
Text mining for non-commercial scholarship is likely fair use (see Authors Guild v. HathiTrust, 755 F.3d 87 (2d Cir. 2014)), but we must distinguish between using materials to text mine and subsequently republishing what you downloaded or scraped for that analysis. Republishing the content itself, rather than just your analysis of that content, might not be fair use. You'll have to consider all four factors.
Further, even if text mining is fair use, we have to consider whether we've signed contracts that restrict what would otherwise be fair use. That is, even if the mining would be fair use, the database provider might have terms and conditions about how it is to be conducted or how the content can be used. This is explained in workflow Step 3's discussion of Contractual Limitations.
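To make the distinction between mining a text and republishing it more concrete, here is a minimal sketch of an analysis whose output contains only derived data (word counts), not the source passages. The sample text and the function name are invented for illustration. Whether sharing any particular derived dataset is fair use, or is permitted by a provider's terms, still depends on the factors and contracts discussed above.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=10):
    """Return the top_n most frequent words as (word, count) pairs.

    Hypothetical helper: the result holds aggregate counts only,
    not the original sentences.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


# Stand-in text; in a real project this would be the mined source material.
sample = "The quick brown fox jumps over the lazy dog. The dog sleeps."

print(word_frequencies(sample, top_n=2))  # [('the', 3), ('dog', 2)]
```

Publishing a table like this is a different act from publishing `sample` itself, which is the republication scenario that needs its own fair use analysis.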