Data Sources

Art by @allison_horst

Data for Equity & Inclusion

  • SlaveVoyages: “The SlaveVoyages website is a collaborative digital initiative that compiles and makes publicly accessible records of the largest slave trades in history. Search these records to learn about the broad origins and forced relocation of more than 12 million African people who were sent across the Atlantic in slave ships, and hundreds of thousands more who were trafficked within the Americas.”
  • gibboda: Gender Isn’t Binary But Other Data Are!
  • hate speech corpora: particularly useful for training NLP models
  • Police Scorecard is the first nationwide public evaluation of policing in the United States. The Scorecard calculates levels of police violence, accountability, racial bias and other policing outcomes for over 16,000 municipal and county law enforcement agencies, covering nearly 100% of the US population.

Data Challenges

  • TidyTuesday: A weekly data project aimed at the R ecosystem. I’ve written some suggestions and included many resources for getting started with TidyTuesday.

  • data for social good: Provides real data and structures (similar to Kaggle) for working through models and coming up with predictions, all on data that benefit the social good.

  • ASA DataFest: The American Statistical Association (ASA) DataFest is a celebration of data in which teams of undergraduates work around the clock to find and share meaning in a large, rich, and complex data set.

  • Fall Data Challenge: Each year, the contest challenges undergraduate and high school students to work in teams to analyze real-world data and make recommendations to combat critical issues.

  • ASA Data Expo: The Annual Data Challenge Expo is open to anyone who is interested in participating, including government, industry, academia, retirees, and students. Each year, this contest challenges participants to analyze a core data set using statistical and visualization tools and methods. Student awards are given at three levels: $1,500, $1,000, and $500. (An annual spring competition, with presentations at the Joint Statistical Meetings.)

  • Kaggle competitions: Many different data challenges with cash prizes. Alternatively, work through a competition that has already closed, for practice creating a full data analysis.

  • Other data science competitions

R packages for connecting to APIs:

  • coinmarketcapr: Connects to CoinMarketCap to get cryptocurrency market cap prices.
  • rtweet: R client for accessing Twitter (stream and REST) API.
  • epidata: R package to link to the API. The Economic Policy Institute provides researchers, media, and the public with easily accessible, up-to-date, and comprehensive historical data on the American labor force. It is compiled from Economic Policy Institute analysis of government data sources. Use it to research wages, inequality, and other economic indicators over time and among demographic groups. Data is usually updated monthly.
  • acs: R package to link to the API. Provides a general toolkit for downloading, managing, analyzing, and presenting data from the U.S. Census, including SF1 (Decennial short-form), SF3 (Decennial long-form), and the American Community Survey (ACS).
  • tidyhydat: Canadian hydrometric data — Historical data is contained within HYDAT, the Canadian national Water Data Archive, which is published quarterly by the Government of Canada’s Department of Environment and Climate Change. Data in this archive range from 1850 to 2017. tidyhydat also provides functions to access real-time data over the web. This package would be of interest to anyone who has need for Canadian hydrometric data in R.
  • ipumsr: An easy way to import census, survey and geographic data provided by ‘IPUMS’ into R plus tools to help use the associated metadata to make analysis easier. ‘IPUMS’ data describing 1.4 billion individuals drawn from over 750 censuses and surveys is available free of charge from their website.
  • essurvey: Download data from the European Social Survey directly from their website. There are two families of functions that allow you to download and interactively check all countries and rounds available.
  • data360r: Makes it easy to engage with the Application Program Interface (API) of the TCdata360 and Govdata360 platforms. These APIs provide access to over 5000 trade, competitiveness, and governance indicator data, metadata, and related information from sources both inside and outside the World Bank Group. Package functions include easier download of data sets, metadata, and related information, as well as searching based on user-inputted query.
  • Lahman: Provides the tables from the ‘Sean Lahman Baseball Database’ as a set of R data.frames. It uses the data on pitching, hitting and fielding performance and other tables from 1871 through 2015, as recorded in the 2016 version of the database.
  • wbstats: Access to World Bank data, including the Women in Parliament dataset.

R packages containing multiple datasets:

Dynamic Data Sets / Databases:

  • The amazing TidyTuesday datasets. For example, data on Juneteenth or Wealth Inequality

  • Awesome Public Datasets, a huge number of datasets automatically generated from blogs, answers, and user responses.

  • World Atlas is a public database with 2,500 datasets. The datasets come from places like Gapminder, United Nations Data, the World Bank, and Open Numbers, and it is described here in detail.

  • College Scorecard. A tremendous amount of information about all universities (though some of it is collected only from students on financial aid).

  • National Park Service Visitor Use Statistics

  • Financial and Economic data

  • Behavioral Risk Factor Surveillance System

  • General Social Survey

  • National Health and Nutrition Examination Survey from the CDC, also available in two different R packages: nhanesA and NHANES

  • Medicare dataset (discussed on whitehouse.gov)

  • State Health Facts

  • gapminder.org – a fascinating website with amazing graphics (social and economic data broken down by country). Click on the spreadsheet links to download the data.

  • Wolfram/Alpha is billed as a computational search engine. Put in “nachos” and you get a detailed nutritional analysis. Put in “GDP of Albania” and you get several forms of GDP, a historical graph, and other economic variables. Put in your favorite college and you get lots of info (including the number of degrees in mathematics in 2009, its location on a map, and a link to a satellite view of campus). The case-by-case data display is not convenient for building datasets, but there are pretty good links to the sources that Wolfram is pulling data from. For example, the Wolfram/Alpha page of info on a college or university has a data source link at the bottom to the National Center for Education Statistics website, where you can download your own custom data files from IPEDS (the Integrated Postsecondary Education Data System). Want to know the average faculty salary by rank for all the schools in your comparison group? Similarly, the nacho search gives a link to the USDA’s National Nutrient Database, and a few clicks later I’ve got a spreadsheet with data on 50+ nutrients in 7,400+ foods (and that’s the abbreviated data!).

  • The Census Bureau, including a guide to getting the most out of the Census.gov website.

  • Baby names (popularity by year and state), compiled by the Social Security Administration

New & Continuously Revised Static Data Sets / Databases:

Journals / Journal articles that provide corresponding data:

Static Data Sets / Databases:

Visualizing Data: