Data Sources
Data for Equity & Inclusion
SlaveVoyages: “The SlaveVoyages website is a collaborative digital initiative that compiles and makes publicly accessible records of the largest slave trades in history. Search these records to learn about the broad origins and forced relocation of more than 12 million African people who were sent across the Atlantic in slave ships, and hundreds of thousands more who were trafficked within the Americas.”
gibboda: Gender Isn’t Binary But Other Data Are!
hate speech corpora: particularly interesting for training NLP models
Police Scorecard is the first nationwide public evaluation of policing in the United States. The Scorecard calculates levels of police violence, accountability, racial bias and other policing outcomes for over 16,000 municipal and county law enforcement agencies, covering nearly 100% of the US population.
Data Challenges
TidyTuesday: a weekly data project aimed at the R ecosystem. I’ve written some suggestions and included a lot of resources for getting started with TidyTuesday.
data for social good provides real data and structures (similar to Kaggle) for working through models and coming up with predictions – all on data which benefits the social good.
ASA DataFest: The American Statistical Association (ASA) DataFest is a celebration of data in which teams of undergraduates work around the clock to find and share meaning in a large, rich, and complex data set.
Fall Data Challenge: Each year, the contest challenges undergraduate and high school students to work in teams to analyze real-world data and make recommendations to combat critical issues.
ASA Data Expo: The Annual Data Challenge Expo is open to anyone who is interested in participating, including government, industry, academia, retirees, and students. Each year, this contest challenges participants to analyze a core data set using statistical and visualization tools and methods. Student awards are given at three levels: $1,500, $1,000, and $500. (An annual spring competition, with presentations at the Joint Statistical Meetings.)
kaggle competitions: Many different data challenges with cash prizes. Alternatively, compete in a competition that has already closed for practice creating a full data analysis.
R packages for connecting to APIs:
- coinmarketcapr: Connects to Coin Market Cap to get cryptocurrency market cap prices.
- rtweet: R client for accessing Twitter (stream and REST) API.
- epidata: R package to link to the Economic Policy Institute API. The Economic Policy Institute provides researchers, media, and the public with easily accessible, up-to-date, and comprehensive historical data on the American labor force. It is compiled from Economic Policy Institute analysis of government data sources. Use it to research wages, inequality, and other economic indicators over time and among demographic groups. Data is usually updated monthly.
- acs: R package to link to the U.S. Census API. Provides a general toolkit for downloading, managing, analyzing, and presenting data from the U.S. Census, including SF1 (Decennial short-form), SF3 (Decennial long-form), and the American Community Survey (ACS).
- tidyhydat: Canadian hydrometric data — Historical data is contained within HYDAT, the Canadian national Water Data Archive, which is published quarterly by the Government of Canada’s Department of Environment and Climate Change. Data in this archive range from 1850 to 2017. tidyhydat also provides functions to access real-time data over the web. This package would be of interest to anyone who has need for Canadian hydrometric data in R.
- ipumsr: An easy way to import census, survey and geographic data provided by ‘IPUMS’ into R plus tools to help use the associated metadata to make analysis easier. ‘IPUMS’ data describing 1.4 billion individuals drawn from over 750 censuses and surveys is available free of charge from their website.
- essurvey: Download data from the European Social Survey directly from their website. There are two families of functions that allow you to download and interactively check all countries and rounds available.
- data360r: Makes it easy to engage with the Application Program Interface (API) of the TCdata360 and Govdata360 platforms. These APIs provide access to over 5000 trade, competitiveness, and governance indicator data, metadata, and related information from sources both inside and outside the World Bank Group. Package functions include easier download of data sets, metadata, and related information, as well as searching based on user-inputted query.
- Lahman: Provides the tables from the ‘Sean Lahman Baseball Database’ as a set of R data.frames. It uses the data on pitching, hitting and fielding performance and other tables from 1871 through 2015, as recorded in the 2016 version of the database.
- wbstats: Women in Parliament dataset and a link to World Bank data
R packages containing multiple datasets:
- lterdatasampler: Data samples with features that are useful in introductory environmental data science and statistics courses.
- gibboda: Gender Isn’t Binary But Other Data Are!
- openintro: an R package for data and supplemental functions for OpenIntro resources
- tidyverse is a suite of R packages that contain some fantastic datasets
- fivethirtyeight: an R package that pulls in data that 538 has already publicly posted
- mosaicData
- The R package dslabs has some great datasets, described in this Simply Statistics blog
Dynamic Data Sets / Databases:
The amazing TidyTuesday datasets. For example, data on Juneteenth or Wealth Inequality
Awesome Public Datasets, a huge number of datasets automatically generated from blogs, answers, and user responses.
World Atlas is a public database with 2,500 datasets. The datasets come from places like Gapminder, United Nations Data, the World Bank, and Open Numbers, and it is described here in detail.
College Scorecard. A tremendous amount of information about all universities (though some of it is collected only from students on financial aid).
National Health and Nutrition Examination Survey from the CDC, also available in two different R packages: nhanesA and NHANES.
Medicare dataset (discussed on whitehouse.gov)
gapminder.org – a fascinating website with amazing graphics (social and economic data broken down by country). Click on the spreadsheet links to download the data.
Wolfram/Alpha is billed as a computational search engine. Put in “nachos” and you get a detailed nutritional analysis; put in “GDP of Albania” and you get several forms of GDP, a historical graph, and other economic variables; put in your favorite college and you get lots of info (including the number of degrees in mathematics in 2009, its location on a map, and a link to a satellite view of campus). While the case-by-case data display is not so convenient for building datasets, there are pretty good links to the sources that Wolfram is pulling data from. For example, the Wolfram/Alpha page of info on a college or university has a data source link at the bottom to the National Center for Education Statistics website, where you can download your own custom data files from IPEDS (Integrated Postsecondary Education Data System). Want to know the average faculty salary by rank for all the schools in your comparison group? Or take the nacho search: it gives a link to the USDA’s National Nutrient Database, and a few clicks later I’ve got a spreadsheet with data on 50+ nutrients in 7400+ foods (and that’s the abbreviated data!).
The Census Bureau, including a guide to getting the most out of the Census.gov website.
Baby names (popularity by year and state), compiled by the Social Security Administration
New & Continuously Revised Static Data Sets / Databases:
- National Survey of Children’s Health including a dozen or so variables on 85,000 kids.
- Boston College COVID-19 Sleep and Well-Being Dataset
- Collection of datasets compiled by Robin Donatello at CSU Chico.
- Harvard Dataverse
- hate speech corpora, particularly interesting for training NLP models
- The R package dslabs has some great datasets, described in this Simply Statistics blog
- CDC health datasets which are freely available and formatted. To be analyzed correctly, these survey data require proper weighting, clustering and stratification adjustments. There are many such datasets available, including NHAMCS (OPD and ED), NAMCS, BRFSS, NSFG, NHIS, NIS-Child, NIS-Teen, NHANES, NVSS. A quick Google of any of these acronyms will take you directly to each webpage.
- kaggle.com is a repository for data used in analysis competitions.
- UC Irvine’s Machine Learning Repository (huge and fantastic database!).
- The GitHub site and other info for many of 538’s analyses.
FiveThirtyEight.com has been very forward thinking in making the data and code used in many of their articles accessible on GitHub. With consultation from Andrew Flowers and Andrei Scheinkman of FiveThirtyEight, we go one step further in our package by pre-processing the data so that it is more accessible to statistics and data science novices and by providing ample documentation in the help files.
See a usage example and R package called fivethirtyeight. For a more detailed outline of all data sets and a discussion on our motivation and data guidelines, see the package vignette.
And this: How to explore data from 538
- Data is Plural: a set of fun and interesting new datasets, and the spreadsheet with all relevant info
- The StudentLife Study. In 2013, four dozen Dartmouth College students agreed to let a custom smartphone app surveil them for the StudentLife Study. During the 10 weeks of the spring academic term, the app collected data on the students’ physical activity, GPS coordinates, eating schedule, sleep habits, phone usage, and more. The study combined all that information with a slew of other data, including the students’ class deadlines, academic performance, and their responses to surveys about stress, depression, personality, and sleep quality. The study’s public (and anonymized) dataset clocks in at 53 gigabytes.
- Shonda Kuiper’s (Grinnell College) many data resources
- realclimate.org keeps an up to date catalog of many different types of climate data
- An analysis of Denny’s restaurants vs. La Quinta hotels
- American Time Use survey
- FEC contributions data (as part of Hadley Wickham’s dplyr package)
- Yahoo big data datasets
- SF OKCupid Users: Everett Wetchler wrote a Python script back in the day to scrape the public profiles of San Francisco OkCupid users. He pulled one snapshot (June 26, 2012) of all OkCupid users who lived within 25 miles of San Francisco, subject to some other caveats. It might be of interest to students given the recent press that data-driven approaches to online dating have been getting, specifically the Wired article “How a Math Genius Hacked OkCupid to Find True Love” and Amy Webb’s TED Talk “How I hacked online dating”.
- This growing dataset repository presents raw data from real medical studies and offers (a) a vignette summarizing the study, research question and study design; (b) a data dictionary with clear documentation of variables and codes; (c) a complete citation for the associated study publication; and (d) a variety of data formats compatible with the majority of statistical packages.
- Clinical Trials datasets from Teaching Statistics in the Health Sciences
- DASL (Data and Story Library) – a collection of datasets and related documentation which may be searched by data subjects or by statistical techniques
- Bessie Chu’s compilation of datasets
- 21 Places to Find Free Datasets for Data Science Projects
- Lots of fun data from OpenIntro
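The CDC entry in the list above notes that survey data must be weighted to be analyzed correctly. As a rough illustration of why, here is a minimal sketch in Python (chosen only for illustration; the toy values, the weights, and the `weighted_mean` helper are all made up and do not come from any CDC dataset). It handles only the weighting of a point estimate. Clustering and stratification mainly affect standard errors and call for dedicated survey software, such as the `survey` package in R.

```python
def weighted_mean(values, weights):
    """Return the mean of `values`, weighting each one by its survey weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical toy sample: 1 = respondent has the condition, 0 = does not.
has_condition = [1, 1, 1, 0]

# Hypothetical survey weights: how many people in the population
# each respondent stands for (the last respondent was under-sampled).
weights = [100, 100, 100, 700]

# Naive estimate: treats every respondent as equally representative.
unweighted = sum(has_condition) / len(has_condition)

# Weighted estimate: accounts for unequal selection probabilities.
weighted = weighted_mean(has_condition, weights)

print(f"Unweighted prevalence: {unweighted:.2f}")
print(f"Weighted prevalence:   {weighted:.2f}")
```

In this toy sample the unweighted estimate is far above the weighted one, because the three respondents with the condition each represent far fewer people than the one without it.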
Journals / Journal articles that provide corresponding data:
The GitHub site and other info for many of 538’s analyses.
And this: How to explore data from 538
Nature – Many articles have a “Data availability” section. See Hurricane-induced selection on the morphology of an island lizard, which includes a link to the data.
Journal of Statistics and Data Science Education (check the archive) or more recent papers
Static Data Sets / Databases:
- DASL in Australia
- Statlib Dataset Archive – one of the original sources for archived data
- National Institute of Standards and Technology (NIST) education data sets
- CHANCE Project Datasets – data from recent media coverage of current events. Only a few datasets here, but many excellent references to teaching applications of statistics in the news can be found at the main CHANCE page
- A data repository from statsci.org – a statistics and bioinformatics group in Australia
Visualizing Data:
- Information is Beautiful
- From Mark Ward at Purdue: Websites for Visualizing Data
- Nathan Yau’s amazing visualizations: FlowingData, most of which include corresponding datasets.
- Kerry Lock Morgan has posted a compilation of data visualizations
- Caitlin Hudon’s GitHub site of Data Viz Resources
- No data here, but I have to link to these amazing gifs which get cleaner as they go, by Darkhorse Analytics.