Using data ethically
When you plan to collect data from human (or some animal) subjects for a study or an experiment, you must follow the guidelines provided by your Institutional Review Board (IRB). Whitman's basic guidelines can be found here. These are meant to protect the welfare, rights, and privacy of the subjects. You may also need to check with an IRB before using secondary data that are publicly available.
Tips for writing about data and statistics
Writing with Statistics - Purdue OWL
The Purdue Online Writing Lab offers some guidelines on how to incorporate statistical information into your writing effectively and accurately.
What does "curated data" mean?
Some repositories curate their data -- others do not. What's the difference? Curated data have generally been checked by the repository to make sure that they conform to certain standards for presenting and describing the data, and for preserving data for future use. This helps to ensure that new users can understand what the data represent, and makes these data easier to share and reuse in the long term.
There are different levels of curation -- some repositories require much more detailed metadata (information about the data) than others, and some (such as ICPSR) perform certain checks and write descriptions by hand rather than through automation.
Uncurated data are less uniformly described, and therefore may be more difficult to use in contexts beyond the work of the initial investigators. They may or may not be in sustainable file formats.
The Inter-University Consortium for Political and Social Research is an archive of curated social science research data available for re-use by students and faculty at member institutions.
Census Bureau Data
Data on the US population and economy from the US Census. There are interactive tools to help you find data in specific areas.
Bureau of Labor Statistics
Economic data from the Bureau of Labor Statistics.
Bureau of Justice Statistics
Data related to criminal justice in the United States.
Datasets from federal agencies. Data relevant for the social sciences may be found under many topics, including Consumer, Education, Finance, Global Development, Business, Cities, Counties, Manufacturing, and Law.
Links to national statistical agencies of around 200 nations and territories.
Historical Statistics of the United States
A subscription database containing quantitative data on various aspects of American history.
Pew Research Center
Pew Research Center is a nonpartisan fact tank that conducts public opinion polling, demographic research, content analysis and other data-driven social science research.
Dryad is a curated repository for scientific data (especially in the life sciences). Most data are associated with peer-reviewed publications, and all are freely available for reuse.
Archive of scientific data, especially for environmental science.
Datasets from federal agencies. Data relevant for the sciences may be found under many topics, including Science & Research, Agriculture, Climate, Weather, Ocean, and Health.
Figshare allows researchers to upload their data and other kinds of research output (posters, articles, figures, etc.) in formats of their choosing.
Zenodo allows researchers to share data that are not associated with specific subject repositories.
Open Science Framework (OSF)
OSF offers a free, open-source platform for hosting and sharing scholarly research through its lifecycle, including repository services for data.
Early English Books Online through the Text Creation Partnership offers fully searchable texts by authors writing in English between 1475 and 1700. These works correspond to the digital facsimile editions available through ProQuest's EEBO database (http://eebo.chadwyck.com/home).
Searchable, public domain electronic texts for download as plain text or ebook.
In addition to digital text collections, there are collections of music and video, as well as the Wayback Machine for finding archived versions of web pages.
JSTOR Data for Research
JSTOR has digital tools that allow you to do various analyses (word frequencies, n-grams) and visualizations on JSTOR content, including scholarly journal literature and one set of primary resources (19th century British pamphlets).
Google Ngram Viewer
This tool allows you to visualize how words and phrases were used in the Google corpus of digitized books.
Getty Art and Architecture Thesaurus
The searchable interface provides definitions and equivalent terminology for vocabulary related to art and architecture. The data set can be downloaded at http://vocab.getty.edu/
HathiTrust Digital Library
The HathiTrust Digital Library contains millions of digitized books and periodicals which are full-text searchable. Those that are in the public domain may be read online.
Where to look for data
The links for the Social Sciences, Sciences, and Humanities in the center of the page are just a few common starting points. To find research data in a specific field or subfield, search in the Registry of Research Data Repositories (re3data).
If you're looking more generally for data sets that you can start analyzing or visualizing right away, StatSci.org has a collection of data sets from various institutions and textbooks.
The Dataverse Network contains multitudes: there are many Dataverses (data repositories), created by different institutions or researchers, and data may be related to social sciences, sciences, or other fields. You can browse or search to find data that are publicly available; some data are restricted. You must create an account if you want to publish data.
© 2014 Whitman College Penrose Library