Catalogues / lists / repositories of open data sources

Open data — meta-directories (lists of lists)

This wiki post presents a curated overview of meta-directories: directories, catalogues, registries and curated lists whose primary content is other open data sources (portals, repositories, databases, datasets, or collections of creative assets). Their value lies in pointing to those sources. Individual datasets, portals, repositories and asset collections are described in a second wiki post below. This is a wiki post, so please feel free to add to it and enrich or improve it where you can!


1. Cross-domain meta-directories

General-purpose directories that span many types of open data source. The broadest starting points.

  • DataPortals.org — ~520 open data portals worldwide. Long-standing curated registry maintained by an international group of open data experts; covers local, regional and national levels.
  • OpenDataSoft – Open Data Sources catalogue — 2,900+ portals worldwide, organised by geography. Generally regarded as the most comprehensive list.
  • Open Data Inception — Geotagged map view of 1,600+ open data portals worldwide; built on the OpenDataSoft list but useful for browsing by location.
  • DataCatalogs.org — Long-running curated catalogue of open data catalogues (now redirects to / merged with DataPortals.org).
  • CKAN Portals listing — Index of CKAN-powered open data portals; CKAN underpins a large share of government data portals globally.
  • PortalJS Data Portals listing — Modern catalogue of open data portals maintained by the PortalJS project (Datopian/CKAN ecosystem).
  • List of open government data sites (Wikipedia) — Wikipedia-maintained country-by-country index of national, regional and municipal OGD portals.
  • EasyData Open Data Portals Catalogus — Dutch-language catalogue of open data portals (NL-based, global in scope).
  • Awesome Public Datasets — Topic-organised GitHub list of high-value public datasets; hundreds of entries, community-maintained.
  • sindresorhus/awesome — The hub-of-hubs: 1,000+ topical “awesome” lists, several of which (datasets, transit, citizen science, ML) are themselves directories of open data sources.
  • brandonhimpfen/awesome-open-data — Curated list of open data resources, tools and platforms across domains.
  • CoolDatasets — Curated, lightly categorised collection of public datasets across topics.
  • Open Data Impact Map — Global database of organisations that use open data (Center for Open Data Enterprise); useful for finding sectoral data users and sources.
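
Many of the portals indexed by the directories above run CKAN, which exposes a JSON "Action API". The following is a minimal sketch of how one such portal could be searched, assuming the portal has the standard `package_search` action enabled. The function names and the portal base URL passed in are illustrative, not taken from any of the listed directories.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def package_search_url(portal_base, query, rows=5):
    """Build a CKAN Action API `package_search` URL for a portal."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_base.rstrip('/')}/api/3/action/package_search?{params}"


def summarise_results(payload):
    """Reduce a `package_search` response to (title, resource formats) pairs."""
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    summary = []
    for package in payload["result"]["results"]:
        # Collect the distinct, non-empty resource formats (CSV, JSON, ...).
        formats = sorted(
            {r["format"] for r in package.get("resources", []) if r.get("format")}
        )
        summary.append((package.get("title") or package["name"], formats))
    return summary


def search_portal(portal_base, query, rows=5):
    """Query a live portal (needs network access; portal_base is an assumption)."""
    with urlopen(package_search_url(portal_base, query, rows), timeout=30) as response:
        return summarise_results(json.load(response))
```

Calling `search_portal(base_url, "air quality")` against a CKAN portal of your choice should return a short list of dataset titles with their file formats. Non-CKAN portals (Socrata, OpenDataSoft and others) use different APIs.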

2. Directories of government & intergovernmental portals

Meta-lists specifically of official government / IGO open data portals.


3. Registries of research data repositories (generalist)

Cross-disciplinary registries of repositories that hold research data.

  • re3data.org – Registry of Research Data Repositories — 3,300+ research data repositories across all disciplines, with rich metadata (subjects, certifications, policies, APIs). Run by DataCite + KIT + Purdue + partners. The canonical registry for scientific repositories.
  • FAIRsharing.org — Curated registry of data standards, databases and policies; 2,000+ databases catalogued with FAIR-compliance metadata.
  • OpenDOAR — Global directory of ~6,000 academic open-access repositories (including data); operated by Jisc.
  • Open Access Directory: Data Repositories — Wiki-maintained directory of data repositories, hosted by Simmons University.

4. Catalogues of databases within a domain

Meta-lists of the major databases in a specific field — the canonical “where are all the databases for X” references.

Life sciences & biomedical

  • NAR Online Molecular Biology Database Collection — Curated catalogue of ~1,650 molecular-biology and bioinformatics databases, classified into 15 categories and 41 sub-categories; maintained alongside the annual Nucleic Acids Research Database Issue. The canonical meta-list for the life sciences.
  • FAIRsharing (life sciences view) — (Also in §3.) Especially deep for biomedical databases and standards.

Astronomy

  • VizieR (CDS) — The most complete library of published astronomical catalogues; ~24,000 catalogues and tables gathered by the Centre de Données astronomiques de Strasbourg. The reference meta-catalogue for astronomy.
  • NASA Astrophysics Data System (ADS) — 15M+ records; indexes external data catalogues and archives alongside the literature.

Linguistics & language

Linked / semantic data

  • Linked Open Data Cloud — Diagram and dataset of ~1,300 interlinked Linked Open Data datasets across nine domains (geography, government, life sciences, linguistics, media, etc.). Maintained by the Insight Centre for Data Analytics; CC BY.

Cultural heritage

  • Europeana and DPLA — Each aggregates thousands of institutions, so each effectively functions as a meta-directory of cultural-heritage collections (also listed as sources in the second wiki post below).

5. Dataset search engines & aggregators (that index many sources)

Tools that don’t host data themselves but catalogue/index it across many sources.


7. Directories of map, transport & traffic data

Meta-lists specific to geospatial and mobility data sources.

  • NAPCORE — (Also in §2.) Directory of the EU’s 30+ mobility National Access Points.
  • Mobility Database (MobilityData) — Catalogue of 6,000+ GTFS / GTFS-RT / GBFS public-transport feeds across 99+ countries.
  • Transitland Atlas — Open feed registry of GTFS / GTFS-RT / GBFS / MDS feeds from 2,500+ operators across 55+ countries.
  • OpenAddresses — Aggregates 2,600+ open government address sources worldwide (a directory of address datasets as much as a dataset).
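
The static GTFS feeds that these registries catalogue are ZIP archives of CSV text files, so they can be inspected with the standard library alone. The sketch below reads `stops.txt` from an already downloaded feed and computes a rough bounding box. The function names are illustrative, and it does not cover GTFS-RT (Protocol Buffers) or GBFS (JSON) feeds.

```python
import csv
import io
import zipfile


def read_stops(gtfs_zip):
    """Return the rows of stops.txt from a GTFS archive as dictionaries.

    `gtfs_zip` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(gtfs_zip) as archive:
        with archive.open("stops.txt") as raw:
            # utf-8-sig strips the byte-order mark that some feeds include.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.DictReader(text))


def bounding_box(stops):
    """Compute (min_lat, min_lon, max_lat, max_lon) over stops with coordinates."""
    points = [
        (float(s["stop_lat"]), float(s["stop_lon"]))
        for s in stops
        if s.get("stop_lat") and s.get("stop_lon")
    ]
    if not points:
        raise ValueError("no stops with coordinates found")
    lats, lons = zip(*points)
    return (min(lats), min(lons), max(lats), max(lons))
```

A bounding box like this is a quick sanity check that a feed picked from a registry actually covers the area you expect.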

9. Curated dataset lists for data journalism

Curated, regularly updated lists aimed at journalists and storytellers — strong for finding interesting rather than merely official datasets.

  • Data Is Plural — Jeremy Singer-Vine’s weekly newsletter of useful/curious datasets, running since 2015; 1,750+ datasets, with a browsable archive as a “dataset of datasets.”
  • Data Liberation Project — Initiative (now run by MuckRock + Big Local News) that obtains, documents and publishes hard-to-get government datasets of public interest.
  • FiveThirtyEight Data — Index of the datasets behind FiveThirtyEight’s data journalism (politics, sports, science, economics), released as plain CSVs.
  • BuzzFeed News GitHub — Data and analysis behind BuzzFeed News investigations.
  • ProPublica Data Store — Datasets compiled and cleaned by ProPublica’s investigative team (many free, some priced).
  • The Pudding — Datasets underlying The Pudding’s visual essays.
  • Awesome Public Datasets — (Also in §1.) Widely used by data journalists as a starting point.

10. Directories of open design & creative assets

Meta-lists of openly licensed creative assets (icons, fonts, images, CC media).

  • Open Source Design – Resources — Curated directory of openly licensed icons, fonts, images, CC media and design tools. The best single meta-list for creative open assets.
  • Openverse — Search engine indexing 800 million+ openly licensed and public-domain images and audio files across hundreds of sources; WordPress Foundation successor to CC Search.
  • Creative Commons Search — Meta-search across CC-licensed works.

11. Directories for data preservation & “data rescue”

Meta-lists / clearinghouses of preservation efforts (especially the 2025 US federal-data rescue).


12. Standards, metrics & community references

Not data sources, but the infrastructure and benchmarks that catalogue or rank them.

Useful. I need to find a tool which analyzes all this data.

Wow this is a nice collection! Thanks for compiling these, makes it easier to explore.

Open data — individual sources

In this post we present a curated overview of individual open data sources: portals, repositories, databases, datasets and openly licensed asset collections. For directories and curated lists of these sources (meta-directories), see the wiki post above. This is a wiki post, so please feel free to add to it and enrich or improve it where you can!

Note: Links live as of the date this list was produced. Items marked :warning: are not strictly open-licensed (controlled access, partial open, or commercial-with-free-tier) and are flagged inline.


1. Cross-domain (general)

Large general-purpose open datasets and knowledge bases that don’t fit a single domain.

  • Wikidata — Wikimedia’s free, collaborative, multilingual structured knowledge base; ~115 million items, CC0. The data backbone behind Wikipedia.
  • DBpedia — Structured knowledge extracted from Wikipedia, queryable via SPARQL; millions of entities across 125+ languages, interlinked to other Linked Open Data datasets.
  • Kaggle Datasets — 400,000+ datasets, community-curated and tied to ML competitions/notebooks.
  • Hugging Face Datasets — 500,000+ datasets, ML-focused, with built-in tooling.
  • Registry of Open Data on AWS — Open datasets hosted on AWS (Common Crawl, Sentinel imagery, genomics, climate, transport); free to access.
  • freeCodeCamp Open Data — Open datasets, analyses and demos published monthly by the freeCodeCamp community.
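
Since DBpedia is queryable via SPARQL, here is a minimal sketch of what such a query could look like from Python. It assumes the public endpoint at `https://dbpedia.org/sparql` and the standard SPARQL 1.1 JSON results format. The example query and the helper names are illustrative only.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"  # assumed public endpoint

# Illustrative query: the five most populous Dutch cities known to DBpedia.
EXAMPLE_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:country dbr:Netherlands ;
        dbo:populationTotal ?population .
}
ORDER BY DESC(?population)
LIMIT 5
"""


def sparql_url(endpoint, query):
    """Encode a SPARQL query as a GET request URL."""
    return f"{endpoint}?{urlencode({'query': query})}"


def rows_from_results(payload):
    """Flatten SPARQL JSON results into one plain dictionary per row."""
    variables = payload["head"]["vars"]
    return [
        {name: binding[name]["value"] for name in variables if name in binding}
        for binding in payload["results"]["bindings"]
    ]


def run_query(endpoint, query):
    """Run a query against a live endpoint (needs network access)."""
    request = Request(
        sparql_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request, timeout=30) as response:
        return rows_from_results(json.load(response))
```

The same helpers should work against other SPARQL endpoints such as Wikidata's query service, although that service expects a descriptive `User-Agent` header and uses its own vocabulary, so the example query would need rewriting.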

2. Government & intergovernmental portals

Official open data portals run by governments and intergovernmental organisations — typically the largest single sources.

Pan-European / EU institutions

  • data.europa.eu — The official portal for European data. ~1.5 million datasets aggregated from 36+ European countries, EU institutions, agencies and bodies. Built on CKAN.
  • Eurostat — The EU’s statistical office; thousands of official statistical datasets.
  • European Environment Agency (EEA) data hub — Environment-focused data hub of the EEA.
  • INSPIRE Geoportal — Pan-European geospatial open data under the INSPIRE Directive.

National (selected, by data volume / relevance)

City-level (selected examples)

Intergovernmental / international organisations

  • World Bank Open Data — Thousands of global development, economy and poverty indicators. Free, APIs.
  • UN Data — UN statistical databases aggregating dozens of agencies.
  • OECD Data — Comparable statistics across OECD members.
  • IMF Data — Economic and financial data from the IMF.
  • WHO – Global Health Observatory — World Health Organization’s data portal.
  • FAOSTAT — Food and agriculture data from 245+ countries.
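
As an illustration of the APIs mentioned for World Bank Open Data, the sketch below builds an indicator request and converts the response into a year-to-value mapping. It assumes the v2 REST API's JSON layout (a two-element array of paging metadata followed by records) and annual data. The helper names are illustrative, and the indicator code in the usage note is just an example.

```python
from urllib.parse import urlencode

WORLD_BANK_API = "https://api.worldbank.org/v2"


def indicator_url(country, indicator, per_page=100):
    """Build a request URL for one indicator and one country (ISO code)."""
    params = urlencode({"format": "json", "per_page": per_page})
    return f"{WORLD_BANK_API}/country/{country}/indicator/{indicator}?{params}"


def series_from_response(payload):
    """Turn the [metadata, records] response into a {year: value} mapping.

    Records without a value are skipped. Assumes annual data, where the
    "date" field is a four-digit year.
    """
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        raise ValueError("unexpected World Bank response shape")
    return {
        int(record["date"]): record["value"]
        for record in payload[1]
        if record["value"] is not None
    }
```

For example, `indicator_url("NLD", "SP.POP.TOTL")` gives a URL for total population of the Netherlands. Fetch it with any HTTP client, parse the JSON, and pass the result to `series_from_response`.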

3. Scientific & research data repositories (generalist)

Multidisciplinary repositories for research data — usually with DOI assignment, versioning and metadata standards.

  • Zenodo — CERN/OpenAIRE generalist research repository; millions of records, 50 GB per record, DOIs included.
  • Figshare — Generalist repository; millions of items. Free for public deposits.
  • Dryad — Curated research-data repository with editorial review; partners with many journals.
  • Harvard Dataverse — Large Dataverse instance; 150,000+ datasets across disciplines.
  • Mendeley Data — Elsevier-operated generalist research data repository.
  • Open Science Framework (OSF) — Free, open-source project hosting + repository; Center for Open Science.
  • UCI Machine Learning Repository — UC Irvine; one of the oldest and most-cited collections of ML benchmark datasets.
  • Yelp Open Dataset — Large subset of Yelp businesses/reviews/users as JSON, for academic and educational use.
  • LODUM (University of Münster) — University open-data initiative publishing institutional data as Linked Open Data.

4. Scientific & research data repositories (domain-specific)

Major discipline-specific repositories — usually the canonical source for their field.

Life sciences & biomedical

Earth observation, climate & environmental science

Astronomy & physics

Social sciences & economics

  • ICPSR — 500,000+ files of social science research.
  • FRED — 800,000+ economic time series from 100+ sources.
  • UK Data Service — UK’s largest collection of economic, population and social data.
  • CESSDA Data Catalogue — Consortium of European Social Science Data Archives.
  • IPUMS — Harmonised census and survey microdata, global.

Linguistics & language

  • CLARIN — Pan-European language-resources research infrastructure (ERIC).
  • Mozilla Common Voice — Crowdsourced speech; 30,000+ validated hours, 100+ languages, CC0.
  • LDC — :warning: Linguistic Data Consortium; open catalogue, mostly licensed resources.
  • OPUS — Open parallel corpora.

Humanities & cultural heritage

  • Europeana — 50 million+ digitised cultural-heritage items from 3,000+ institutions.
  • DPLA — 50 million+ items from US libraries, archives and museums.
  • Open Context — 1M+ CC-licensed archaeological resources.
  • Smithsonian Open Access — 4.5M+ CC0 records.

5. Search engines & cross-portal aggregators

Tools that don’t host data themselves but index it across many sources.


6. Industry / specialised

Open data for specific industries or use cases.


7. Map, geospatial, transport & traffic data

Openly licensed map data, public-transport feeds, real-time traffic, speed limits, addresses, boundaries and related infrastructure.

Foundational map data

  • OpenStreetMap — World’s largest crowdsourced open geographic database (OSMF, since 2004); billions of features under ODbL. Weekly planet dump (~85 GB PBF), with replication diffs published down to the minute. Speed-limit coverage is partial (~12% of roads tagged).
  • Overture Maps Foundation — Open map data from AWS, Meta, Microsoft, TomTom + 30 members (Linux Foundation). Quarterly GeoParquet releases: Places, Buildings, Transportation, Base, Addresses, Divisions.
  • Natural Earth — Public-domain vector + raster map data at 1:10m/1:50m/1:110m scales.
  • GADM — Administrative boundaries to 5 levels; v4.1 has 400,276 areas, v5 released Jan 2026. :warning: non-commercial only.
  • geoBoundaries — CC BY administrative boundaries for every country; commercial use allowed.
  • Who’s on First — Gazetteer of administrative places with structured identifiers.
  • GeoNames — 25M+ geographic names, CC BY.
  • OurAirports — ~85,000 airports worldwide, CC0.

OSM extracts, tooling & derivatives

  • Geofabrik Downloads — Daily OSM extracts by country/subdivision; PBF + Shapefile. The de facto standard server.
  • BBBike Extracts — Free user-defined OSM extracts of any polygon; many formats.
  • Protomaps — Subscription-free map tiles; slice arbitrary OSM regions.
  • Mapillary — Crowdsourced street-level imagery (CC BY-SA); 2B+ images.
  • KartaView — Open street-level imagery alternative to Mapillary.

Addresses

  • OpenAddresses — Aggregates 2,600+ open government address sources; ~600M addresses worldwide.
  • Overture Addresses theme — Growing global open address dataset in Overture’s releases.

Public transport — feeds

National Access Points for mobility (EU)

Under EU ITS Directive 2010/40/EU and its Delegated Regulations, every member state must run a National Access Point (NAP) for mobility data (real-time traffic, multimodal travel, truck parking, EV charging). More than 30 are operational.

Real-time traffic — open feeds

Truly open live traffic is rare (TomTom, HERE, Mapbox, INRIX, Google sell it). The open exceptions are mainly EU NAPs and national road authorities.

Speed limits

Aviation, maritime & rail

Cycling, walking, micromobility & EV charging

  • Open Charge Map — Global open EV-charging registry; 800,000+ points.
  • CycleStreets — Open cycle-routing data and tools.
  • Strava Metro — :warning: Aggregated cycling/walking data; free for agencies, not open-licensed.

Routing & isochrone engines (open backends)


8. AI training data corpora

Large openly licensed datasets used to train ML / generative-AI models. These carry distinctive legal and ethical caveats, noted inline.

  • Common Crawl — Open web-crawl repository; 10+ petabytes since 2008, refreshed ~monthly (2B+ pages each). The dominant text source behind most LLMs. :warning: a Nov 2025 investigation alleged it under-honoured publisher opt-outs.
  • LAION — German non-profit; Re-LAION-5B (Aug 2024) is the safety-rescreened replacement for the withdrawn LAION-5B (~5.5B text-image pairs). Backbone of Stable Diffusion.
  • Common Pile v0.1 — EleutherAI’s ~8 TB copyright-clean text corpus (June 2025); successor to The Pile.
  • The Stack v2 — BigCode/Hugging Face; permissively licensed source-code corpus.
  • Mozilla Common Voice — Open speech corpus; 30,000+ hours, 100+ languages, CC0.
  • Pile of Law — Open ~256 GB legal-text corpus.
  • Have I Been Trained? (Spawning.ai) — Tool to search LAION-style datasets and opt images out of training.

Caveat: AI corpora sit at the contested edge of “open data.” Copyright (The Pile/Books3), consent (LAION CSAM removal) and opt-out compliance (Common Crawl) are live issues. Verify licences before reuse.


9. Data-journalism datasets

The individual data collections published by data-journalism teams (the curated lists of these live in §9 of the meta-directories wiki post above).


10. Open design & creative assets

Openly licensed creative assets: icons, fonts, images, audio/video, colour systems.

Icons

Fonts

Images & photos

CC media (audio / video)

Colour & design systems


11. Data preservation, archives & “data rescue”

Resources focused on preserving open datasets — increasingly important given the 2025 US federal data removals.


12. Standards, tooling & community

Not data sources themselves — the infrastructure that produces and packages open data.