hydrodataset¶
A Python package for accessing hydrological datasets with a unified API, optimized for deep learning workflows.
- 🌊 Unified Interface: Consistent API across 20+ hydrological datasets
- ⚡ Fast Access: NetCDF caching for instant data loading
- 🎯 Standardized Variables: Common naming across all datasets
- 🔗 Built on AquaFetch: Powered by the comprehensive AquaFetch backend
- 📊 ML-Ready: Optimized for integration with torchhydro
Core Philosophy¶
This library has been redesigned to serve as a powerful data-adapting layer on top of the AquaFetch package.
While AquaFetch handles the complexities of downloading and reading numerous public hydrological datasets, hydrodataset takes the next step: it standardizes this data into a clean, consistent NetCDF (.nc) format. This format is specifically optimized for seamless integration with hydrological modeling libraries like torchhydro.
The core workflow is:
1. Fetch: Use a hydrodataset class for a specific dataset (e.g., CamelsAus).
2. Standardize: The class uses AquaFetch as its primary backend to fetch the raw data and exposes it through a consistent, unified interface across all datasets.
3. Cache: On the first run, hydrodataset processes the data into an xarray.Dataset and saves it as .nc files, one for timeseries and one for attributes, in the local directory specified in hydro_setting.yml in your home directory.
4. Access: All subsequent data requests are read directly from the fast .nc cache, giving you analysis-ready data instantly.
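The workflow above boils down to a process-once, read-from-cache pattern. The sketch below illustrates that pattern only: it uses JSON and the standard library as a stand-in for xarray and NetCDF, and `load_with_cache` and `process_raw_data` are hypothetical names, not part of hydrodataset.

```python
import json
import tempfile
from pathlib import Path


def load_with_cache(cache_file: Path, build):
    """Return cached data if present; otherwise build it once and cache it."""
    if cache_file.exists():
        # Fast path: later requests read straight from the cache file.
        return json.loads(cache_file.read_text())
    # Slow path (first run only): process the raw data, then persist it.
    data = build()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(data))
    return data


calls = []  # records how often the expensive step actually runs


def process_raw_data():
    # Stand-in for the expensive fetch-and-standardize step.
    calls.append(1)
    return {"basin_01": [1.2, 3.4, 2.8]}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "demo_timeseries.json"
    first = load_with_cache(cache, process_raw_data)   # builds and writes the cache
    second = load_with_cache(cache, process_raw_data)  # reads the cache only

print("same data:", first == second, "| expensive step ran", len(calls), "time(s)")
```

The second call never touches `process_raw_data`, which is why repeated access is cheap once the cache exists.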
Installation¶
We strongly recommend using a virtual environment to manage dependencies.
Using uv (Recommended)¶
We recommend using uv for fast, reliable package and environment management:
For more advanced usage or to work on the project locally:
The --all-extras flag installs base dependencies plus all optional dependencies for development and documentation.
Using pip (Alternative)¶
If you prefer traditional pip:
Quick Start¶
The primary goal of hydrodataset is to provide a simple, unified API for accessing various hydrological datasets. Here's a complete example showing the core workflow:
⚠️ Important Note on First-Time Data Download
If you haven't pre-downloaded the datasets, the first access will trigger automatic downloads via AquaFetch, which can take considerable time depending on dataset size:
- Small datasets (< 1GB, e.g., CAMELS-CL, CAMELS-COL): ~10-30 minutes
- Medium datasets (1-5GB, e.g., CAMELS-AUS, CAMELS-BR): ~30 minutes to 1 hour
- Large datasets (10-20GB, e.g., CAMELS-US, LamaH-CE): ~1-3 hours
- Very large datasets (> 30GB, e.g., HYSETS): ~3-6 hours or more
Download times vary based on your internet connection speed and server availability.
We strongly recommend downloading datasets manually during off-peak hours if possible.
After the initial download, all subsequent access will be fast thanks to NetCDF caching.
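To get a rough lower bound for your own connection, you can divide the dataset size by your bandwidth. The helper below is a back-of-the-envelope sketch: `estimate_download_minutes` is a hypothetical function, it treats sizes as decimal gigabytes, and it ignores server throttling and protocol overhead, so real downloads will usually take longer than it suggests.

```python
def estimate_download_minutes(size_gb: float, speed_mbit_s: float) -> float:
    """Ideal transfer time in minutes for a file of size_gb at speed_mbit_s."""
    if speed_mbit_s <= 0:
        raise ValueError("speed_mbit_s must be positive")
    size_megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB, 1 byte = 8 bits
    seconds = size_megabits / speed_mbit_s
    return seconds / 60


# Example: a 14.6 GB dataset over an assumed 100 Mbit/s link.
minutes = estimate_download_minutes(14.6, 100)
print(f"ideal transfer time: {minutes:.1f} minutes")
```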
Basic Example¶
Standardized Variable Names¶
A key feature of the new architecture is the use of standardized variable names. This allows you to use the same variable name to fetch the same type of data across different datasets, without needing to know the specific, internal naming scheme of each one.
For example, you can get streamflow from both CAMELS-US and CAMELS-AUS using the same variable name:
Similarly, you can use precipitation, temperature_max, etc., across datasets. A comprehensive list of these standardized names and their coverage across all datasets is in progress and will be published soon.
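Conceptually, standardized names work like a per-dataset lookup table that translates a common name into whatever the source files call that variable. The sketch below shows the idea with made-up data: the native names (`q_native_us` and so on), `NATIVE_NAMES`, and `to_native` are all hypothetical and do not reflect the package's internals or the real column names.

```python
# Hypothetical lookup tables: standardized name -> dataset-specific name.
NATIVE_NAMES = {
    "camels_us": {
        "streamflow": "q_native_us",
        "precipitation": "p_native_us",
    },
    "camels_aus": {
        "streamflow": "q_native_aus",
        "precipitation": "p_native_aus",
    },
}


def to_native(dataset: str, variables: list[str]) -> list[str]:
    """Translate standardized variable names into one dataset's native names."""
    table = NATIVE_NAMES[dataset]
    missing = [v for v in variables if v not in table]
    if missing:
        raise KeyError(f"{dataset} has no mapping for: {missing}")
    return [table[v] for v in variables]


# The caller uses the same standardized name for both datasets.
for name in ("camels_us", "camels_aus"):
    print(name, to_native(name, ["streamflow"]))
```

Keeping the translation in one table per dataset means user code never has to change when it switches datasets.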
Supported Datasets¶
hydrodataset currently provides unified access to 27 hydrological datasets across the globe. Below is a summary of all supported datasets:
| Dataset Name | Paper | Temporal Resolution | Data Version | Region | Basins | Time Span | Release Date | Size |
|---|---|---|---|---|---|---|---|---|
| BULL | Paper / Code | Daily | Version 3 (code) / Version 2 (data) | Spain | 484 | 1951-01-02 to 2021-12-31 | 2024-03-10 | 2.2G |
| CAMELS-AUS | Paper (V1) / Paper (V2) | Daily | Version 1 / Version 2 | Australia | 561 | 1950-01-01 to 2022-03-31 | 2024-12 | 2.1G |
| CAMELS-BR | Paper | Daily | Version 1.2 / Version 1.1 | Brazil | 897 | 1980-01-01 to 2024-10-22 | 2025-03-21 | 1.4G |
| CAMELS-CH | Paper | Daily | Version 0.9 / Version 0.6 | Switzerland | 331 | 1981-01-01 to 2020-12-31 | 2025-03-14 | 793.1M |
| CAMELS-CL | Paper | Daily | Dataset | Chile | 516 | 1913-02-15 to 2018-03-09 | 2018-09-28 | 208M |
| CAMELS-COL | Paper | Daily | Version 2 | Colombia | 347 | 1981-05 to 2022-12 | 2025-05 | 80.9M |
| CAMELS-DE | Paper | Daily | Version 1.1 / Version 0.1 | Germany | 1582 | 1951-01-01 to 2020-12-31 | 2025-08-07 | 2.2G |
| CAMELS-DK | Paper | Daily | Version 6.0 | Denmark | 304 | 1989-01-02 to 2023-12-31 | 2025-02-14 | 1.41G |
| CAMELS-FI | Meeting | Yearly/Daily | Version 1.0.1 | Finland | 320 | 1961-01-01 to 2023-12-31 | 2025-07 | 382M |
| CAMELS-FR | Paper | Daily/Monthly/Yearly | Version 3.2 / Version 3 | France | 654 | 1970-01-01 to 2021-12-31 | 2025-08-12 | 364M |
| CAMELS-GB | Paper | Daily | Dataset | United Kingdom | 671 | 1970-10-01 to 2015-09-30 | 2025-05 (new data link) | 244M |
| CAMELS-IND | Paper | Daily | Version 2.2 | India | 472 (242 sufficient flow) | 1980-01-01 to 2020-12-31 | 2025-03-13 | 529.4M |
| CAMELS-LUX | Paper | Hourly/Daily | Version 1.1 | Luxembourg | 56 | 2004-11-01 to 2021-10-31 | 2024-09-27 | 1.4G |
| CAMELS-NZ | Paper | Hourly/Daily | Version 2 / Version 1 | New Zealand | 369 | 1972-01-01 to 2024-08-02 | 2025-08-05 | 4.81G |
| CAMELS-SE | Paper | Daily | Version 1 | Sweden | 50 | 1961-2020 | 2024-02 | 16.19M |
| CAMELS-US | Paper | Daily | Version 1.2 | United States | 671 | 1980-2014 | 2022-06-24 | 14.6G |
| CAMELSH-KR | - | Hourly | Version 1 | South Korea | 178 | 2000-2019 | 2025-03-23 | 3.1G |
| CAMELSH | Paper | Hourly | Version 6 + 3 + 2 | United States | 9008 | 1980-2024 | 2025-08-14 | 4.2G+3.57G+2.18G |
| Caravan-DK | Paper | Daily | Version 7 / Version 5 | Denmark | 308 | 1981-01-02 to 2020-12-31 | 2025-04-11 | 521.6M |
| Caravan | Paper / Code | Daily | Version 1.6 | Global | 16299 | 1950-2023 | 2025-05 | 24.8G |
| EStream | Paper / Code | Daily (weekly, monthly, yearly available) | Version 1.3 / Version 1.1 | Europe | 17130 | 1950-01-01 to 2023-06-30 | 2025-06-30 | 12.3G |
| GRDC-Caravan | Paper | Daily | Version 0.6 / Version 0.2 | Global | 5357 | 1950-2023 | 2025-05-06 | 16.4G |
| HYPE | Paper (draft) | Daily/Monthly/Yearly | Version 1.1 | Costa Rica | 605 | 1985-01-01 to 2019-12-31 | 2020-09-14 | 616.5M |
| HYSETS | Paper / Code | Daily | Dataset (dynamic attributes) | North America | 14425 | 1950-01-01 to 2023-12-31 | 2024-09 | 41.9G |
| LamaH-CE | Paper | Daily/Hourly | Version 1.0 | Central Europe | 859 | 1981-01-01 to 2019-12-31 | 2021-08-02 | 16.3G |
| LamaH-Ice | Paper | Daily/Hourly | Version 1.5 / old version | Iceland | 111 | 1950-01-01 to 2021-12-31 | 2025-08-12 | 9.6G |
| Simbi | Paper | Daily/Monthly | Version 6.0 | Haiti | 24 | 1920-01-01 to 2005-12-31 | 2024-07-02 | 125M |
Key Features¶
🎯 Unified API Across All Datasets¶
Access any dataset using the same method calls:
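One common way to get identical method calls across datasets is a shared base class that defines the public methods and leaves only the dataset-specific loading to subclasses. The toy sketch below illustrates that design; `ToyDataset`, `ToyUs`, `ToyAus`, and `read_streamflow` are hypothetical names and are not the package's HydroDataset class or its methods.

```python
from abc import ABC, abstractmethod


class ToyDataset(ABC):
    """Hypothetical base class: public API lives here, loading in subclasses."""

    @abstractmethod
    def _load(self, basin_id: str) -> list[float]:
        """Return the streamflow series for one basin (dataset-specific)."""

    def read_streamflow(self, basin_ids: list[str]) -> dict[str, list[float]]:
        # Shared public method: identical signature for every dataset.
        return {basin_id: self._load(basin_id) for basin_id in basin_ids}


class ToyUs(ToyDataset):
    def _load(self, basin_id: str) -> list[float]:
        return [1.0, 2.0]  # made-up values


class ToyAus(ToyDataset):
    def _load(self, basin_id: str) -> list[float]:
        return [3.0]  # made-up values


# The calling code is the same regardless of which dataset is used.
for ds in (ToyUs(), ToyAus()):
    print(type(ds).__name__, ds.read_streamflow(["basin_a"]))
```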
⚡ Fast NetCDF Caching¶
The first access processes the data and caches it as NetCDF files; all subsequent reads come straight from that cache:
- Timeseries data: {dataset}_timeseries.nc
- Attribute data: {dataset}_attributes.nc
- Configured via ~/hydro_setting.yml
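As a rough illustration of the naming scheme above, the helper below builds the two cache file paths for a dataset. `cache_paths` is a hypothetical helper, and the `camels_us` prefix and the `hydro_cache` directory are assumed values; in practice the directory comes from ~/hydro_setting.yml.

```python
from pathlib import Path


def cache_paths(cache_dir: str, dataset: str) -> dict[str, Path]:
    """Build the timeseries and attribute cache paths for one dataset."""
    root = Path(cache_dir)
    return {
        "timeseries": root / f"{dataset}_timeseries.nc",
        "attributes": root / f"{dataset}_attributes.nc",
    }


# Assumed directory and dataset prefix, for illustration only.
paths = cache_paths("hydro_cache", "camels_us")
for kind, path in paths.items():
    print(kind, "->", path.name)
```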
🔄 Standardized Variable Names¶
Use common names across all datasets:
- streamflow - River discharge
- precipitation - Rainfall
- temperature_max / temperature_min - Temperature extremes
- potential_evapotranspiration - PET
- And many more...
📊 xarray Integration¶
All data is returned as xarray.Dataset objects:
- Labeled dimensions and coordinates
- Built-in metadata and units
- Easy slicing, selection, and computation
- Compatible with Dask for large datasets
Project Status & Future Work¶
The new, unified API architecture is currently in active development.
- Current Implementation: hydrodataset provides access to 27 hydrological datasets (see the Supported Datasets table above). The new unified architecture, based on the HydroDataset base class, has been fully implemented and tested for the camels_us and camels_aus datasets, which serve as reference implementations.
- In Progress: We are in the process of migrating all other datasets supported by the library to this new architecture.
- Release Schedule: We plan to release new versions frequently in the short term as more datasets are integrated. Please check back for updates.
Credits¶
This package was created with Cookiecutter and the giswqs/pypackage project template. Data fetching and reading are now powered by AquaFetch.