Using data ethically
When you plan to collect data from human (or some animal) subjects for a study or an experiment, you must follow the guidelines provided by your Institutional Review Board (IRB). Whitman's basic guidelines can be found here. These are meant to protect the welfare, rights, and privacy of the subjects. You may also need to check with an IRB to use secondary data that are publicly available.
Tips for writing about data and statistics
What does "curated data" mean?
Some repositories curate their data -- others do not. What's the difference? Curated data have generally been checked by the repository to make sure that they conform to certain standards for presenting and describing the data, and for preserving data for future use. This helps to ensure that new users can understand what the data represent, and makes these data easier to share and reuse in the long term.
There are different levels of curation: some repositories require much more detailed metadata (information about the data) than others, and some (such as ICPSR) perform certain checks and write descriptions by hand rather than relying on automation.
Uncurated data are less uniformly described, and therefore may be more difficult to use in contexts beyond the work of the initial investigators. They may or may not be in sustainable file formats.
Bureau of Justice Statistics
Data related to criminal justice in the United States
Bureau of Labor Statistics
Economic data from the Bureau of Labor Statistics.
Caselaw Access Project
CAP includes all official, book-published United States case law — every volume designated as an official report of decisions by a court within the United States. Each volume has been converted into structured, case-level data broken out by majority and dissenting opinion, with human-checked metadata for party names, docket number, citation, and date.
Census Bureau Data
Data on the US population and economy from the US Census. There are interactive tools to help you find data in specific areas.
Datasets from federal agencies. Data relevant for the social sciences may be found under many topics, including Consumer, Education, Finance, Global Development, Business, Cities, Counties, Manufacturing, and Law.
Historical Statistics of the United States
Topics ranging from migration and health to crime and the Confederate States of America are each placed in historical context. Users can graph individual tables and create customized tables and spreadsheets (coverage runs from 1774 through the 2000 census).
ICPSR
ICPSR maintains a data archive of more than 500,000 files of research in the social sciences. It hosts 16 specialized collections of data in education, aging, criminal justice, substance abuse, terrorism, and other fields.
Links to national statistical agencies of around 200 nations and territories.
Pew Research Center
Pew Research Center is a nonpartisan fact tank that conducts public opinion polling, demographic research, content analysis and other data-driven social science research.
Dryad is a curated repository for scientific data (especially in the life sciences). Most data are associated with peer-reviewed publications, and all are freely available for reuse.
Datasets from federal agencies. Data relevant for the sciences may be found under many topics, including Science & Research, Agriculture, Climate, Weather, Ocean, and Health.
Figshare allows researchers to upload their data and other kinds of research output (posters, articles, figures, etc.) in formats of their choosing.
Zenodo allows researchers to share data that are not associated with specific subject repositories.
Open Science Framework (OSF)
OSF offers a free, open-source platform for hosting and sharing scholarly research through its lifecycle, including repository services for data.
Archive of scientific data, especially for environmental science.
Early English Books Online through the Text Creation Partnership offers fully searchable texts by authors writing in English between 1475 and 1700. These works correspond to the digital facsimile editions available through ProQuest's EEBO database (http://eebo.chadwyck.com/home).
Getty Art and Architecture Thesaurus
The searchable interface provides definitions and equivalent terminology for vocabulary related to art and architecture. The data set can be downloaded at http://vocab.getty.edu/
Google Ngram Viewer
This tool allows you to visualize how words and phrases were used in the Google corpus of digitized books.
Hathi Trust
Digital library of books and journals scanned by a partnership of major research institutions and libraries. Log in to access the largest number of volumes and features.
In addition to digital text collections, there are also collections of music and video and the Wayback Machine to find archived versions of web pages.
JSTOR Data for Research
JSTOR has digital tools that allow you to do various analyses (word frequencies, n-grams) and visualizations on JSTOR content, including scholarly journal literature and one set of primary resources (19th century British pamphlets).
Searchable, public domain electronic texts for download as plain text or ebook.
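The word-frequency and n-gram analyses mentioned above for the Google Ngram Viewer and JSTOR Data for Research can also be tried on any plain text you download yourself. The following is a minimal Python sketch, not the method used by either tool: the function name ngram_counts, the simple tokenizing rule, and the sample sentence are all illustrative assumptions.

```python
from collections import Counter
import re


def ngram_counts(text, n=2):
    """Count n-grams (runs of n consecutive words) in a plain-text string.

    Hypothetical helper for illustration; real tools tokenize more carefully.
    """
    # Lowercase the text and keep runs of letters (apostrophes allowed) as words.
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of length n across the word list.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Made-up sample sentence; replace it with the contents of a downloaded text file.
sample = "the data are curated and the data are shared"
bigrams = ngram_counts(sample, n=2)
print(bigrams.most_common(2))
```

Setting n=1 gives plain word frequencies; larger values of n count longer phrases.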
Where to look for data
The links for the Social Sciences, Sciences, and Humanities in the center of the page are just a few common starting points. To find research data in a specific field or subfield, search in the Registry of Research Data Repositories (re3data).
If you're looking more generally for data sets that you can jump in and start analyzing or visualizing, StatSci.org has a collection of data sets from various institutions and textbooks.
The Dataverse Network contains multitudes: there are many Dataverses (data repositories), created by different institutions or researchers, and data may be related to social sciences, sciences, or other fields. You can browse or search to find data that are publicly available; some data are restricted. You must create an account if you want to publish data.
© 2014 Whitman College Penrose Library