hydrodataset


A Python package for accessing hydrological datasets with a unified API, optimized for deep learning workflows.

  • 🌊 Unified Interface: Consistent API across 20+ hydrological datasets
  • ⚡ Fast Access: NetCDF caching for instant data loading
  • 🎯 Standardized Variables: Common naming across all datasets
  • 🔗 Built on AquaFetch: Powered by the comprehensive AquaFetch backend
  • 📊 ML-Ready: Optimized for integration with torchhydro

Core Philosophy

This library has been redesigned to serve as a powerful data-adapting layer on top of the AquaFetch package.

While AquaFetch handles the complexities of downloading and reading numerous public hydrological datasets, hydrodataset takes the next step: it standardizes this data into a clean, consistent NetCDF (.nc) format. This format is specifically optimized for seamless integration with hydrological modeling libraries like torchhydro.

The core workflow is:

  1. Fetch: Use a hydrodataset class for a specific dataset (e.g., CamelsAus).
  2. Standardize: AquaFetch acts as the primary backend for fetching the raw data, while hydrodataset maintains a consistent, unified interface across all datasets.
  3. Cache: On the first run, hydrodataset processes the data into an xarray.Dataset and saves separate .nc files for timeseries and attributes in a local directory configured via hydro_setting.yml in your home directory.
  4. Access: All subsequent data requests are read directly from the fast .nc cache, giving you analysis-ready data instantly.
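As a sketch of these four steps in code (the module path hydrodataset.camels_aus and its read methods are assumptions based on the camels_us pattern shown in the Basic Example below):

from hydrodataset import SETTING
from hydrodataset.camels_aus import CamelsAus  # module path assumed to mirror camels_us

# 1. Fetch / 2. Standardize: the class wires the AquaFetch backend behind the unified API
ds = CamelsAus(SETTING["local_data_path"]["datasets-origin"])

# 3. Cache: the first read downloads, processes, and writes the .nc cache files
# 4. Access: the same call on any later run reads straight from that cache
flow = ds.read_ts_xrdataset(
    gage_id_lst=ds.read_object_ids()[:1],
    t_range=["1990-01-01", "1990-12-31"],
    var_lst=["streamflow"],
)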

Installation

We strongly recommend using a virtual environment to manage dependencies. Our preferred tool is uv, which provides fast, reliable package and environment management:

# Install uv if you haven't already
pip install uv

# Install hydrodataset with uv
uv pip install hydrodataset

For more advanced usage or to work on the project locally:

# Clone the repository
git clone https://github.com/OuyangWenyu/hydrodataset.git
cd hydrodataset

# Create virtual environment and install all dependencies
uv sync --all-extras

The --all-extras flag installs base dependencies plus all optional dependencies for development and documentation.

Using pip (Alternative)

If you prefer traditional pip:

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install the package
pip install hydrodataset

Quick Start

The primary goal of hydrodataset is to provide a simple, unified API for accessing various hydrological datasets. Here's a complete example showing the core workflow:

⚠️ Important Note on First-Time Data Download

If you haven't pre-downloaded the datasets, the first access will trigger automatic downloads via AquaFetch, which can take considerable time depending on dataset size:

  • Small datasets (< 1GB, e.g., CAMELS-CL, CAMELS-COL): ~10-30 minutes
  • Medium datasets (1-5GB, e.g., CAMELS-AUS, CAMELS-BR): ~30 minutes to 1 hour
  • Large datasets (10-20GB, e.g., CAMELS-US, LamaH-CE): ~1-3 hours
  • Very large datasets (> 30GB, e.g., HYSETS): ~3-6 hours or more

Download times vary based on your internet connection speed and server availability.

We strongly recommend downloading datasets manually during off-peak hours if possible.

After the initial download, all subsequent access will be fast thanks to NetCDF caching.

Basic Example

from hydrodataset import SETTING
from hydrodataset.camels_us import CamelsUs

# All datasets are expected to be in the directory defined in your hydro_setting.yml
# An example hydro_setting.yml on Windows looks like this:
# local_data_path:
#   root: 'D:\data\waterism' # Update with your root data directory
#   datasets-origin: 'D:\data\waterism\datasets-origin'
#   cache: 'D:\data\waterism\cache'
data_path = SETTING["local_data_path"]["datasets-origin"]

# Initialize the dataset class
ds = CamelsUs(data_path)

# 1. Check which features are available
print("Available static features:")
print(ds.available_static_features)

print("Available dynamic features:")
print(ds.available_dynamic_features)

# 2. Get a list of all basin IDs
basin_ids = ds.read_object_ids()

# 3. Read static (attribute) data for a subset of basins
# Note: We use standardized names like 'area' and 'p_mean'
attr_data = ds.read_attr_xrdataset(
    gage_id_lst=basin_ids[:2],
    var_lst=["area", "p_mean"]
)
print("Static attribute data:")
print(attr_data)

# 4. Read dynamic (time-series) data for the same basins
# Note: We use standardized names like 'streamflow' and 'precipitation'
ts_data = ds.read_ts_xrdataset(
    gage_id_lst=basin_ids[:2],
    t_range=["1990-01-01", "1995-12-31"],
    var_lst=["streamflow", "precipitation"]
)
print("Time-series data:")
print(ts_data)
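Since both results are ordinary xarray.Dataset objects, standard xarray operations apply directly. A minimal follow-up sketch (the basin and time dimension names are assumptions; inspect ts_data.dims for the actual names):

# Select one basin and compute its mean streamflow per calendar year.
# The "basin" and "time" names are assumptions; check ts_data.dims/coords.
one_basin = ts_data.sel(basin=basin_ids[0])
annual_mean = one_basin["streamflow"].groupby("time.year").mean()
print(annual_mean)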

Standardized Variable Names

A key feature of the new architecture is the use of standardized variable names. This allows you to use the same variable name to fetch the same type of data across different datasets, without needing to know the specific, internal naming scheme of each one.

For example, you can get streamflow from both CAMELS-US and CAMELS-AUS using the same variable name:

# us_ds and aus_ds are CamelsUs and CamelsAus instances,
# each initialized with data_path as in the Basic Example above.

# Get streamflow from CAMELS-US
us_ds.read_ts_xrdataset(gage_id_lst=["01013500"], var_lst=["streamflow"], t_range=["1990-01-01", "1995-12-31"])

# Get streamflow from CAMELS-AUS
aus_ds.read_ts_xrdataset(gage_id_lst=["A4260522"], var_lst=["streamflow"], t_range=["1990-01-01", "1995-12-31"])
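Because the vocabulary is shared, one loop can serve heterogeneous datasets. A minimal sketch, assuming the returned xarray.Dataset exposes the variable under the same standardized name:

# Fetch the same standardized variable from two different datasets in one loop.
for ds, basin in [(us_ds, "01013500"), (aus_ds, "A4260522")]:
    flow = ds.read_ts_xrdataset(
        gage_id_lst=[basin],
        var_lst=["streamflow"],
        t_range=["1990-01-01", "1995-12-31"],
    )
    # Assumption: the Dataset variable carries the requested standardized name.
    print(type(ds).__name__, float(flow["streamflow"].mean()))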

Similarly, you can use precipitation, temperature_max, etc., across datasets. A comprehensive list of these standardized names and their coverage across all datasets is in progress and will be published soon.

Supported Datasets

hydrodataset currently provides unified access to 27 hydrological datasets across the globe. Below is a summary of all supported datasets:

| Dataset Name | Paper | Temporal Resolution | Data Version | Region | Basins | Time Span | Release Date | Size |
|---|---|---|---|---|---|---|---|---|
| BULL | Paper / Code | Daily | Version 3 (code) / Version 2 (data) | Spain | 484 | 1951-01-02 to 2021-12-31 | 2024-03-10 | 2.2G |
| CAMELS-AUS | Paper (V1) / Paper (V2) | Daily | Version 1 / Version 2 | Australia | 561 | 1950-01-01 to 2022-03-31 | 2024-12 | 2.1G |
| CAMELS-BR | Paper | Daily | Version 1.2 / Version 1.1 | Brazil | 897 | 1980-01-01 to 2024-10-22 | 2025-03-21 | 1.4G |
| CAMELS-CH | Paper | Daily | Version 0.9 / Version 0.6 | Switzerland | 331 | 1981-01-01 to 2020-12-31 | 2025-03-14 | 793.1M |
| CAMELS-CL | Paper | Daily | Dataset | Chile | 516 | 1913-02-15 to 2018-03-09 | 2018-09-28 | 208M |
| CAMELS-COL | Paper | Daily | Version 2 | Colombia | 347 | 1981-05 to 2022-12 | 2025-05 | 80.9M |
| CAMELS-DE | Paper | Daily | Version 1.1 / Version 0.1 | Germany | 1582 | 1951-01-01 to 2020-12-31 | 2025-08-07 | 2.2G |
| CAMELS-DK | Paper | Daily | Version 6.0 | Denmark | 304 | 1989-01-02 to 2023-12-31 | 2025-02-14 | 1.41G |
| CAMELS-FI | Meeting | Yearly/Daily | Version 1.0.1 | Finland | 320 | 1961-01-01 to 2023-12-31 | 2025-07 | 382M |
| CAMELS-FR | Paper | Daily/Monthly/Yearly | Version 3.2 / Version 3 | France | 654 | 1970-01-01 to 2021-12-31 | 2025-08-12 | 364M |
| CAMELS-GB | Paper | Daily | Dataset | United Kingdom | 671 | 1970-10-01 to 2015-09-30 | 2025-05 (new data link) | 244M |
| CAMELS-IND | Paper | Daily | Version 2.2 | India | 472 (242 sufficient flow) | 1980-01-01 to 2020-12-31 | 2025-03-13 | 529.4M |
| CAMELS-LUX | Paper | Hourly/Daily | Version 1.1 | Luxembourg | 56 | 2004-11-01 to 2021-10-31 | 2024-09-27 | 1.4G |
| CAMELS-NZ | Paper | Hourly/Daily | Version 2 / Version 1 | New Zealand | 369 | 1972-01-01 to 2024-08-02 | 2025-08-05 | 4.81G |
| CAMELS-SE | Paper | Daily | Version 1 | Sweden | 50 | 1961-2020 | 2024-02 | 16.19M |
| CAMELS-US | Paper | Daily | Version 1.2 | United States | 671 | 1980-2014 | 2022-06-24 | 14.6G |
| CAMELSH-KR | - | Hourly | Version 1 | South Korea | 178 | 2000-2019 | 2025-03-23 | 3.1G |
| CAMELSH | Paper | Hourly | Version 6 + 3 + 2 | United States | 9008 | 1980-2024 | 2025-08-14 | 4.2G+3.57G+2.18G |
| Caravan-DK | Paper | Daily | Version 7 / Version 5 | Denmark | 308 | 1981-01-02 to 2020-12-31 | 2025-04-11 | 521.6M |
| Caravan | Paper / Code | Daily | Version 1.6 | Global | 16299 | 1950-2023 | 2025-05 | 24.8G |
| EStream | Paper / Code | Daily (weekly, monthly, yearly available) | Version 1.3 / Version 1.1 | Europe | 17130 | 1950-01-01 to 2023-06-30 | 2025-06-30 | 12.3G |
| GRDC-Caravan | Paper | Daily | Version 0.6 / Version 0.2 | Global | 5357 | 1950-2023 | 2025-05-06 | 16.4G |
| HYPE | Paper (draft) | Daily/Monthly/Yearly | Version 1.1 | Costa Rica | 605 | 1985-01-01 to 2019-12-31 | 2020-09-14 | 616.5M |
| HYSETS | Paper / Code | Daily | Dataset (dynamic attributes) | North America | 14425 | 1950-01-01 to 2023-12-31 | 2024-09 | 41.9G |
| LamaH-CE | Paper | Daily/Hourly | Version 1.0 | Central Europe | 859 | 1981-01-01 to 2019-12-31 | 2021-08-02 | 16.3G |
| LamaH-Ice | Paper | Daily/Hourly | Version 1.5 / old version | Iceland | 111 | 1950-01-01 to 2021-12-31 | 2025-08-12 | 9.6G |
| Simbi | Paper | Daily/Monthly | Version 6.0 | Haiti | 24 | 1920-01-01 to 2005-12-31 | 2024-07-02 | 125M |

Key Features

🎯 Unified API Across All Datasets

Access any dataset using the same method calls:

# Same API works for all datasets
ds.read_object_ids()                          # Get basin IDs
ds.read_attr_xrdataset(...)                   # Read attributes
ds.read_ts_xrdataset(...)                     # Read timeseries

⚡ Fast NetCDF Caching

First access processes and caches data as NetCDF files. All subsequent reads are instant:

  • Timeseries data: {dataset}_timeseries.nc
  • Attribute data: {dataset}_attributes.nc
  • Configured via ~/hydro_setting.yml
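Because the cache is plain NetCDF, it can also be opened directly with xarray when you want to bypass the API. A sketch, assuming the {dataset} placeholder resolves to a name like camels_us (the exact file name is an assumption):

import os
import xarray as xr
from hydrodataset import SETTING

# Cache directory as configured in ~/hydro_setting.yml; the concrete file name
# "camels_us_timeseries.nc" is an assumption based on the pattern above.
cache_dir = SETTING["local_data_path"]["cache"]
ts = xr.open_dataset(os.path.join(cache_dir, "camels_us_timeseries.nc"))
print(ts)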

🔄 Standardized Variable Names

Use common names across all datasets:

  • streamflow - River discharge
  • precipitation - Rainfall
  • temperature_max / temperature_min - Temperature extremes
  • potential_evapotranspiration - PET
  • And many more...

📊 xarray Integration

All data returned as xarray.Dataset objects:

  • Labeled dimensions and coordinates
  • Built-in metadata and units
  • Easy slicing, selection, and computation
  • Compatible with Dask for large datasets
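For instance, label-based time slicing and metadata lookup work the same way on every dataset (here reusing ts_data from the Basic Example; the "units" attribute key is an assumption, since recorded metadata varies by dataset):

# Label-based selection on the time coordinate, plus metadata access.
subset = ts_data.sel(time=slice("1991-01-01", "1991-12-31"))
# Assumption: units may or may not be recorded for a given dataset.
print(subset["streamflow"].attrs.get("units", "no units recorded"))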

Project Status & Future Work

The new, unified API architecture is currently in active development.

  • Current Implementation: hydrodataset provides access to 27 hydrological datasets (see the Supported Datasets table above). The new unified architecture based on the HydroDataset base class has been fully implemented and tested for the camels_us and camels_aus datasets, which serve as reference implementations.
  • In Progress: We are in the process of migrating all other datasets supported by the library to this new architecture.
  • Release Schedule: We plan to release new versions frequently in the short term as more datasets are integrated. Please check back for updates.

Credits

This package was created with Cookiecutter and the giswqs/pypackage project template. Data fetching and reading are now powered by AquaFetch.