
Data Preparation Guide

This guide explains how to prepare and use data with hydromodel, covering both public CAMELS datasets and custom data.

Overview

hydromodel supports two main data sources:

1. Public CAMELS Datasets (using the hydrodataset package)
   - 11 global CAMELS variants (US, GB, AUS, BR, CL, etc.)
   - Automatic download and caching
   - Standardized, quality-controlled format
2. Custom Data (using the hydrodatasource package)
   - Your own basin data
   - Flexible data organization
   - Integration with cloud storage

Option 1: Using CAMELS Datasets (hydrodataset)

Step 1: Install hydrodataset

```bash
pip install hydrodataset
```

Step 2: Configure Data Path (Optional)

hydromodel automatically uses default paths, but you can customize:

Default paths:

- Windows: `C:\Users\YourUsername\hydromodel_data\`
- macOS/Linux: `~/hydromodel_data/`

To customize, create ~/hydro_setting.yml:

```yaml
local_data_path:
  root: 'D:/data'
  datasets-origin: 'D:/data'  # CAMELS datasets location
```

Important: Provide only the datasets-origin directory. The system automatically appends the dataset name (e.g., CAMELS_US, CAMELS_GB).

Example: If your data is in D:/data/CAMELS_US/, set datasets-origin: 'D:/data'.
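As a quick check that the setting was picked up, you can print the resolved dataset path; a minimal sketch (the exact appended folder name is taken from the example above and may vary by dataset):

```python
import os

from hydrodataset import SETTING

# Where hydrodataset will look for CAMELS-US, given the setting above
# ("CAMELS_US" is the appended dataset folder name from the example)
camels_us_dir = os.path.join(
    SETTING["local_data_path"]["datasets-origin"], "CAMELS_US"
)
print(camels_us_dir)  # e.g. D:/data/CAMELS_US
```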

Step 3: Download Data

The data downloads automatically on first use:

```python
from hydrodataset.camels_us import CamelsUs
from hydrodataset import SETTING

# Initialize dataset (auto-downloads if not present)
data_path = SETTING["local_data_path"]["datasets-origin"]
ds = CamelsUs(data_path, download=True)

# Get available basins
basin_ids = ds.read_object_ids()
print(f"Downloaded {len(basin_ids)} basins")
```

Note: First download may take 30-120 minutes depending on dataset size. CAMELS-US is ~70GB.

Step 4: Use with hydromodel

```python
from hydromodel.trainers.unified_calibrate import calibrate

config = {
    "data_cfgs": {
        "data_source_type": "camels_us",  # Dataset name
        "basin_ids": ["01013500", "01022500"],
        "train_period": ["1990-10-01", "2000-09-30"],
        "test_period": ["2000-10-01", "2010-09-30"],
        "warmup_length": 365,
        "variables": ["precipitation", "potential_evapotranspiration", "streamflow"]
    },
    # ... other configs
}

results = calibrate(config)
```

Available CAMELS Datasets

| Dataset | Region | Basins | Package Name |
|------------|---------------|--------|--------------|
| CAMELS-US | United States | 671 | camels_us |
| CAMELS-GB | Great Britain | 671 | camels_gb |
| CAMELS-AUS | Australia | 222 | camels_aus |
| CAMELS-BR | Brazil | 897 | camels_br |
| CAMELS-CL | Chile | 516 | camels_cl |
| CAMELS-CH | Switzerland | 331 | camels_ch |
| CAMELS-DE | Germany | 1555 | camels_de |
| CAMELS-DK | Denmark | 304 | camels_dk |
| CAMELS-FR | France | 654 | camels_fr |
| CAMELS-NZ | New Zealand | 70 | camels_nz |
| CAMELS-SE | Sweden | 54 | camels_se |

Usage example:

```python
# Use different datasets by changing data_source_type
config["data_cfgs"]["data_source_type"] = "camels_gb"
config["data_cfgs"]["basin_ids"] = ["28015"]  # GB basin ID
```

CAMELS Data Structure

CAMELS datasets provide standardized variables:

Time Series Variables:

- precipitation (mm/day or mm/hour)
- potential_evapotranspiration (mm/day)
- streamflow (mm/day or m³/s)
- temperature (°C)
- More, depending on the dataset

Basin Attributes:

- area (km²)
- elevation (m)
- latitude, longitude
- Climate, soil, and vegetation attributes

For detailed documentation, see:

- hydrodataset GitHub
- hydrodataset documentation
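As an illustration, time series and attributes can be read directly through the dataset object. The sketch below assumes the xarray-based readers (`read_ts_xrdataset`, `read_attr_xrdataset`) and CAMELS-US variable names (`prcp`, `streamflow`, `area_gages2`, `elev_mean`) found in recent hydrodataset versions; check the hydrodataset docs for the exact API of your release:

```python
from hydrodataset.camels_us import CamelsUs
from hydrodataset import SETTING

ds = CamelsUs(SETTING["local_data_path"]["datasets-origin"])

# Read time series for two basins (reader name assumed; see hydrodataset docs)
ts = ds.read_ts_xrdataset(
    gage_id_lst=["01013500", "01022500"],
    t_range=["1990-10-01", "2000-09-30"],
    var_lst=["prcp", "streamflow"],
)

# Read static attributes (attribute names vary by CAMELS variant)
attrs = ds.read_attr_xrdataset(
    gage_id_lst=["01013500", "01022500"],
    var_lst=["area_gages2", "elev_mean"],
)
print(ts, attrs)
```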


Option 2: Using Custom Data (hydrodatasource)

Step 1: Install hydrodatasource

```bash
pip install hydrodatasource
```

Step 2: Organize Your Data

Create a directory with this structure:

```text
my_basin_data/
├── attributes/
│   └── attributes.csv              # Basin metadata (required)
├── timeseries/
│   ├── 1D/                         # Daily time series
│   │   ├── basin_001.csv          # One file per basin
│   │   ├── basin_002.csv
│   │   └── ...
│   └── 1D_units_info.json          # Variable units (required)
└── shapes/                         # Basin boundaries (optional)
    └── basins.shp
```
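A small helper can create this skeleton before you start copying files in; a minimal sketch (adjust the root path to your setup):

```python
from pathlib import Path

root = Path("my_basin_data")

# Create the expected directory layout
for sub in ["attributes", "timeseries/1D", "shapes"]:
    (root / sub).mkdir(parents=True, exist_ok=True)
```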

Step 3: Prepare Required Files

3.1 attributes.csv

Minimum required columns: basin_id and area (km²)

```csv
basin_id,area,lat,lon,elevation
basin_001,1250.5,30.5,105.2,850
basin_002,856.3,31.2,106.1,920
```

Important:

- basin_id: string identifier (matches the time series filename)
- area: basin area in km²
- Other columns are optional but recommended

3.2 basin_XXX.csv (Time Series)

Required column: time. Other columns are your variables.

```csv
time,prcp,PET,streamflow
1990-01-01,5.2,2.1,45.3
1990-01-02,0.0,2.3,42.1
1990-01-03,12.5,1.8,58.7
```

Important:

- time format: YYYY-MM-DD for daily data
- Variable names: lowercase, with underscores for multi-word names
- Missing values: use empty cells or NaN (not -9999 or 0)
- No duplicate timestamps
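A quick pandas check for these requirements; a sketch assuming the column names and paths from the examples above:

```python
import pandas as pd

df = pd.read_csv("my_basin_data/timeseries/1D/basin_001.csv", parse_dates=["time"])

# No duplicate timestamps
assert not df["time"].duplicated().any(), "duplicate timestamps found"

# Missing values should be NaN, never sentinel values like -9999
assert not (df.select_dtypes("number") == -9999).any().any(), "found -9999 sentinels"

# Report the largest gap in the daily series
gaps = df["time"].diff().dt.days
print("largest gap (days):", gaps.max())
```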

3.3 1D_units_info.json (Units Definition)

Define physical units for all variables:

```json
{
  "prcp": "mm/day",
  "PET": "mm/day",
  "streamflow": "m^3/s",
  "temp": "degC"
}
```

Common units:

- Precipitation/ET: mm/day or mm/hour
- Streamflow: m^3/s or mm/day
- Temperature: degC or K
- Area: km^2

Step 4: Verify Data Structure

```python
from hydrodatasource.reader.data_source import SelfMadeHydroDataset

# Initialize dataset
dataset = SelfMadeHydroDataset(
    data_path="D:/my_basin_data",
    time_unit="1D"
)

# Check basins
basin_ids = dataset.read_object_ids()
print(f"Found {len(basin_ids)} basins: {basin_ids}")

# Check time series
data = dataset.read_timeseries(
    gage_id_lst=["basin_001"],
    t_range=["1990-01-01", "2000-12-31"],
    var_lst=["prcp", "PET", "streamflow"]
)

print(f"Data shape: {data['1D'].shape}")  # [n_basins, n_time, n_vars]
```

Step 5: Use with hydromodel

```python
from hydromodel.trainers.unified_calibrate import calibrate

config = {
    "data_cfgs": {
        "data_source_type": "selfmadehydrodataset",  # Use custom data
        "data_source_path": "D:/my_basin_data",      # Your data path
        "basin_ids": ["basin_001", "basin_002"],
        "train_period": ["1990-01-01", "2000-12-31"],
        "test_period": ["2001-01-01", "2010-12-31"],
        "warmup_length": 365,
    },
    "model_cfgs": {
        "model_name": "xaj_mz",
    },
    "training_cfgs": {
        "algorithm": "SCE_UA",
        "loss_func": "RMSE",
        "output_dir": "results",
        "experiment_name": "my_basins",
        "rep": 10000,
        "ngs": 100,
    },
    "evaluation_cfgs": {
        "metrics": ["NSE", "KGE", "RMSE"],
    },
}

results = calibrate(config)
```

Option 3: Using Flood Event Data (hydrodatasource)

Overview

Flood event data is designed for event-based hydrological modeling where you focus on specific flood episodes rather than continuous time series. This is particularly useful for:

- Flood forecasting and warning systems
- Peak flow estimation
- Event-based rainfall-runoff analysis
- Unit hydrograph calibration

Key Differences from Continuous Data

| Feature | Continuous Data | Flood Event Data |
|---------|-----------------|------------------|
| Data Structure | Complete time series | Individual flood events with gaps |
| Input Features | 2D: [prcp, PET] | 4D: [prcp, PET, marker, event_id] |
| Warmup Handling | Removed after simulation | Included in each event (NaN markers) |
| Time Coverage | Full period | Only flood periods + warmup |
| Use Case | Long-term water balance | Flood peak prediction |
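To make the 4D input layout concrete, here is a sketch of how such a feature array could be assembled for one basin (illustrative only; the actual loader builds this internally):

```python
import numpy as np

# Columns follow the table above: [prcp, PET, marker, event_id]
prcp     = np.array([0.0, 5.0, 12.5, 8.0, 0.0])
pet      = np.array([3.2, 4.0, 3.8, 3.5, 3.0])
marker   = np.array([np.nan, 1, 1, 1, 0])   # NaN=warmup, 1=flood, 0=gap
event_id = np.array([1, 1, 1, 1, 1])

# Shape: [n_time, n_features]
inputs = np.stack([prcp, pet, marker, event_id], axis=1)
print(inputs.shape)  # (5, 4)
```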

Data Structure and Components

1. Event Format

Each flood event contains:

```python
event = {
    "rain": np.array([...]),           # Precipitation (with NaN in warmup)
    "ES": np.array([...]),             # Evapotranspiration
    "inflow": np.array([...]),         # Streamflow (with NaN in warmup)
    "flood_event_markers": np.array([...]),  # NaN=warmup, 1=flood
    "event_id": 1,                     # Event identifier
    "time": np.array([...])            # Datetime array
}
```

2. Three Key Periods in Each Event

```text
[Warmup Period] → [Flood Period] → [GAP]
  marker=NaN        marker=1         marker=0
```

Warmup Period (e.g., 30 days before the flood):

- Contains NaN values in observations
- Used to initialize model states
- Length specified by warmup_length in the config
- Extracted from real data before the flood event

Flood Period (the actual event):

- Contains valid observations (marker=1)
- The period of interest for simulation
- Used for model calibration and evaluation

GAP Period (between events):

- An artificial buffer (10 time steps by default)
- Designed for visualization clarity
- NOT used in simulation (marker=0, ignored by the model)
- Created with: precipitation=0, ET=0.27, flow=0

3. How Events are Concatenated

When loading multiple events, they are combined as:

```text
Event1: [warmup-NaN][flood-1][GAP-0]
Event2: [warmup-NaN][flood-1][GAP-0]
Event3: [warmup-NaN][flood-1]
```

Important: The final data structure includes GAP periods, but these are automatically skipped during simulation based on the marker values.

Simulation Behavior

Event Detection

During simulation (unified_simulate.py), the system:

  1. Reads the flood_event_markers array (3rd feature)
  2. Uses find_flood_event_segments_as_tuples() to identify events
  3. Only processes segments where marker > 0 (i.e., marker=1)
  4. Skips GAP periods (marker=0) completely
```python
# Simulation automatically identifies event segments
flood_event_array = inputs[:, basin_idx, 2]  # marker column
event_segments = find_flood_event_segments_as_tuples(
    flood_event_array, warmup_length
)

# Each event is simulated independently
for start, end, orig_start, orig_end in event_segments:
    event_inputs = inputs[start:end+1, :, :3]  # Extract event data
    result = model(event_inputs, ...)  # Simulate this event
```

Why GAP Doesn't Affect Results

The GAP period is present in the loaded data but does not participate in simulation:

- ✅ GAP helps separate events visually in plots
- ✅ GAP provides clear boundaries between independent floods
- ❌ GAP data is never fed to the hydrological model
- ❌ GAP does not contribute to loss calculation

This design ensures that:

1. Each flood event is simulated independently
2. Model states are reset via warmup for each event
3. Events don't interfere with each other
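For intuition, a simplified version of the marker-based segmentation could look like the sketch below. This is a stand-in for the idea, not hydromodel's actual find_flood_event_segments_as_tuples implementation (which also returns original-index tuples and handles gaps more carefully):

```python
import numpy as np

def find_event_segments(markers, warmup_length):
    """Return (start, end) index pairs covering warmup + flood for each event.

    markers: 1D array with NaN=warmup, 1=flood, 0=gap (see above).
    """
    segments = []
    in_flood = False
    for i, m in enumerate(markers):
        if m == 1 and not in_flood:            # flood starts
            start = max(i - warmup_length, 0)  # include preceding warmup
            in_flood = True
        elif m != 1 and in_flood:              # flood ends (gap or next warmup)
            segments.append((start, i - 1))
            in_flood = False
    if in_flood:                               # event runs to the end of the array
        segments.append((start, len(markers) - 1))
    return segments

markers = np.array([np.nan, np.nan, 1, 1, 1, 0, np.nan, 1, 1])
print(find_event_segments(markers, warmup_length=2))  # [(0, 4), (5, 8)]
```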

Configuration Example

```yaml
data:
  dataset: "floodevent"
  dataset_name: "my_flood_events"
  data_source_path: "D:/flood_data"
  is_event_data: true
  time_unit: ["1D"]

  # Datasource parameters
  datasource_kwargs:
    warmup_length: 30      # Days before flood for warmup
    offset_to_utc: false   # Time zone handling
    version: null

  basin_ids: ["basin_001"]
  train_period: ["2000-01-01", "2020-12-31"]
  test_period: ["2020-01-01", "2023-12-31"]

  variables: ["rain", "ES", "inflow", "flood_event"]
  warmup_length: 30

model:
  name: "xaj"
  params:
    source_type: "sources"
    source_book: "HF"
    kernel_size: 15
```

Data File Structure

```text
my_flood_events/
├── attributes/
│   └── attributes.csv
├── timeseries/
│   ├── 1D/
│   │   ├── basin_001.csv      # Must include 'flood_event' column
│   │   └── basin_002.csv
│   └── 1D_units_info.json
└── shapes/
    └── basins.shp
```

basin_001.csv Format

```csv
time,rain,ES,inflow,flood_event
2020-06-01,0.0,3.2,5.1,0
2020-06-02,2.5,3.5,5.8,0
2020-07-01,5.0,4.0,10.2,1    # Flood event starts
2020-07-02,12.5,3.8,25.5,1
2020-07-03,8.0,3.5,30.1,1
2020-07-04,0.0,3.2,20.5,1
2020-07-05,0.0,3.0,12.0,0    # Event ends
```

(The # annotations mark where the event starts and ends for this guide; do not include them in the actual CSV.)

Key points:

- flood_event column: 0 = no flood, 1 = flood period
- Continuous time series with all periods marked
- The system automatically extracts events with warmup, as sketched below
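To illustrate what "extracts events with warmup" means, here is a simplified single-event sketch (the real extraction is handled by hydrodatasource; column names follow the example above):

```python
import pandas as pd

df = pd.read_csv("my_flood_events/timeseries/1D/basin_001.csv", parse_dates=["time"])
warmup_length = 30

# Locate the flood period (flood_event == 1); single-event example
flood_idx = df.index[df["flood_event"] == 1]
start, end = flood_idx[0], flood_idx[-1]

# Prepend the warmup window taken from the real data before the event
event = df.iloc[max(start - warmup_length, 0): end + 1].copy()

# Warmup rows get NaN markers, matching the event format described earlier
event.loc[event.index < start, "flood_event"] = float("nan")
print(event.head())
```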

Best Practices

1. Warmup Length:
   - Typical: 30 days for daily data
   - Should be long enough to initialize soil moisture states
   - Too short: poor initial conditions
   - Too long: data availability issues

2. Event Selection:
   - Focus on significant floods (peak > threshold)
   - Include complete rising and recession limbs
   - Ensure the warmup period has valid data

3. Data Quality:
   - Check for missing data in warmup periods
   - Verify flood markers are correctly assigned
   - Ensure precipitation and flow are synchronized

4. Marker Assignment:

   ```python
   # Example: mark floods where flow exceeds its 95th percentile
   # (flow is a pandas Series of streamflow, e.g. df["inflow"])
   threshold = flow.quantile(0.95)
   flood_event = (flow > threshold).astype(int)
   ```

Common Issues

1. "Warmup period contains all NaN" - Ensure data exists before each flood event - Check warmup_length is not too long - Verify CSV has continuous time series

2. "No flood events found" - Check flood_event column exists - Verify flood markers are 1 (not True or other values) - Ensure train_period covers some flood events

3. "Simulation results are all zeros" - Check if events are detected: markers should be 1 for flood periods - Verify warmup_length matches the actual warmup in data - Ensure model parameters are physically reasonable

Advanced: Manual Event Creation

For custom event extraction:

```python
from hydroutils import hydro_event

# Extract events from continuous data
events = hydro_event.extract_flood_events(
    df=continuous_data,
    warmup_length=30,
    flood_event_col="flood_event",
    time_col="time"
)

# Each event includes warmup automatically
for event in events:
    print(f"Event: {event['event_name']}")
    print(f"  Total length: {len(event['data'])}")
    print(f"  Warmup markers: {event['data']['flood_event'].isna().sum()}")
    print(f"  Flood markers: {(event['data']['flood_event']==1).sum()}")
```

Data Requirements for XAJ Model

Required Variables

| Variable | Description | Unit | Typical Source |
|----------|-------------|------|----------------|
| prcp | Precipitation | mm/day | Rain gauge, gridded data (CHIRPS, ERA5) |
| PET | Potential Evapotranspiration | mm/day | Penman, Priestley-Taylor, or reanalysis |
| streamflow | Observed streamflow | m³/s | Stream gauge |
| area | Basin area | km² | GIS analysis |

Optional Variables

| Variable | Description | Unit | Usage |
|----------|-------------|------|-------|
| temp | Temperature | °C | Snow module (if enabled) |
| elevation | Basin elevation | m | PET estimation |
| lat, lon | Coordinates | degrees | Spatial analysis |

Data Quality Guidelines

1. Time Resolution: daily (1D) is standard for the XAJ model

2. Data Completeness:
   - Training period: ≥5 years of continuous data
   - Warmup period: ≥1 year before training
   - Missing data: <5% acceptable; continuous gaps <7 days

3. Physical Consistency:
   - Precipitation ≥ 0
   - Streamflow ≥ 0
   - PET ≥ 0
   - Check the water balance: P ≈ Q + ET within 20% (see the sketch after this list)

4. Unit Consistency:
   - Ensure all units match units_info.json
   - Use consistent timestamps (no daylight saving shifts)
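The water balance check can be scripted in a few lines; a sketch assuming the column names from the custom-data examples and a basin area taken from attributes.csv:

```python
import pandas as pd

df = pd.read_csv("my_basin_data/timeseries/1D/basin_001.csv", parse_dates=["time"])
area_km2 = 1250.5  # from attributes.csv

# Convert streamflow from m^3/s to mm/day:
# Q [m^3/s] * 86400 s/day / (area [m^2]) * 1000 mm/m
q_mm_day = df["streamflow"] * 86400 / (area_km2 * 1e6) * 1000

p_mean = df["prcp"].mean()
q_mean = q_mm_day.mean()
pet_mean = df["PET"].mean()

# Long-term balance implies ET = P - Q; it should be positive and below mean PET
et_implied = p_mean - q_mean
print(f"P={p_mean:.2f}, Q={q_mean:.2f}, implied ET={et_implied:.2f} mm/day")
print(f"balance plausible: {0 < et_implied < pet_mean}")
```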

Advanced Features

NetCDF Caching (For Large Datasets)

Convert CSV to NetCDF for 10x faster access:

```python
from hydrodatasource.reader.data_source import SelfMadeHydroDataset

dataset = SelfMadeHydroDataset(
    data_path="D:/my_basin_data",
    time_unit="1D"
)
basin_ids = dataset.read_object_ids()

# Cache all data as NetCDF (one-time operation)
dataset.cache_xrdataset(
    gage_id_lst=basin_ids,
    t_range=["1990-01-01", "2010-12-31"],
    var_lst=["prcp", "PET", "streamflow"]
)

# Now access is much faster
data_xr = dataset.read_ts_xrdataset(
    gage_id_lst=["basin_001"],
    t_range=["1990-01-01", "2000-12-31"],
    var_lst=["prcp", "PET", "streamflow"]
)
```

Multi-Scale Time Series

Support different time scales in one dataset:

```text
timeseries/
├── 1h/                     # Hourly data
│   ├── basin_001.csv
│   └── 1h_units_info.json
├── 1D/                     # Daily data (most common)
│   ├── basin_001.csv
│   └── 1D_units_info.json
└── 8D/                     # 8-day data (e.g., MODIS)
    ├── basin_001.csv
    └── 8D_units_info.json
```

Specify the time unit when initializing the dataset:

```python
dataset = SelfMadeHydroDataset(
    data_path="D:/my_basin_data",
    time_unit="1h"  # or "1D", "8D"
)
```

Cloud Storage (MinIO/S3)

For large datasets in the cloud:

```python
from hydrodatasource.reader.data_source import SelfMadeHydroDataset

dataset = SelfMadeHydroDataset(
    data_path="s3://my-bucket/basin-data",
    time_unit="1D",
    minio_paras={
        "endpoint_url": "http://minio.example.com:9000",
        "key_id": "access_key",
        "secret_key": "secret_key"
    }
)
```

Complete Workflow Example

Here's a complete example from raw data to calibration:

```python
import json
from pathlib import Path

import pandas as pd
from hydrodatasource.reader.data_source import SelfMadeHydroDataset
from hydromodel.trainers.unified_calibrate import calibrate

# Step 0: Make sure the directory layout exists
for sub in ["attributes", "timeseries/1D"]:
    Path("my_data", sub).mkdir(parents=True, exist_ok=True)

# Step 1: Prepare attributes
attributes = pd.DataFrame({
    'basin_id': ['basin_001', 'basin_002'],
    'area': [1250.5, 856.3],
    'lat': [30.5, 31.2],
    'lon': [105.2, 106.1]
})
attributes.to_csv("my_data/attributes/attributes.csv", index=False)

# Step 2: Prepare time series (assume you have daily_data_001.csv)
# Make sure it has columns: time, prcp, PET, streamflow
daily_data = pd.read_csv("daily_data_001.csv")
daily_data.to_csv("my_data/timeseries/1D/basin_001.csv", index=False)

# Step 3: Create units info
units = {
    "prcp": "mm/day",
    "PET": "mm/day",
    "streamflow": "m^3/s"
}
with open("my_data/timeseries/1D_units_info.json", "w") as f:
    json.dump(units, f, indent=2)

# Step 4: Verify data loads correctly
dataset = SelfMadeHydroDataset(
    data_path="my_data",
    time_unit="1D"
)
print(f"Basins: {dataset.read_object_ids()}")

# Step 5: Cache for faster access (optional)
dataset.cache_xrdataset(
    gage_id_lst=['basin_001'],
    t_range=["1990-01-01", "2010-12-31"],
    var_lst=["prcp", "PET", "streamflow"]
)

# Step 6: Run calibration with hydromodel
config = {
    "data_cfgs": {
        "data_source_type": "selfmadehydrodataset",
        "data_source_path": "my_data",
        "basin_ids": ["basin_001"],
        "train_period": ["1990-01-01", "2000-12-31"],
        "test_period": ["2001-01-01", "2010-12-31"],
        "warmup_length": 365,
    },
    "model_cfgs": {
        "model_name": "xaj_mz",
    },
    "training_cfgs": {
        "algorithm": "SCE_UA",
        "loss_func": "RMSE",
        "output_dir": "results",
        "experiment_name": "my_basin_001",
        "rep": 5000,
        "ngs": 50,
    },
    "evaluation_cfgs": {
        "metrics": ["NSE", "KGE", "RMSE"],
    },
}

results = calibrate(config)
print("Calibration complete!")
```

Troubleshooting

Common Issues

1. "Basin ID not found" - Check basin_id column in attributes.csv matches CSV filenames - Basin IDs must be strings (not numbers) - Filenames: {basin_id}.csv (e.g., basin_001.csv)

2. "Time column not found" - CSV must have time column (case-sensitive) - Format: YYYY-MM-DD for daily, YYYY-MM-DD HH:MM for hourly

3. "Unit info file not found" - Create {time_unit}_units_info.json in timeseries folder - Example: 1D_units_info.json for daily data

4. "Variable not found in units info" - Every variable in CSV must be in units_info.json - Check spelling matches exactly (case-sensitive)

5. "Data shape mismatch" - All basins should have same variables - All basins should cover the requested time range

6. CAMELS data download fails - Check internet connection - Check disk space (CAMELS-US needs ~70GB) - Try manual download from official sources - Set download=False if data already exists

Data Validation Checklist

Before using your custom data:

- [ ] attributes.csv exists with basin_id and area columns
- [ ] Time series files named {basin_id}.csv
- [ ] All CSV files have a time column
- [ ] {time_unit}_units_info.json exists
- [ ] All variables in the CSVs are in units_info.json
- [ ] No negative precipitation or streamflow values
- [ ] Time series is continuous (no large gaps)
- [ ] Data covers warmup + train + test periods
- [ ] Units are physically reasonable
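The file-level items on this checklist can be scripted; a minimal sketch assuming the directory layout described above:

```python
import json
from pathlib import Path

import pandas as pd

root = Path("my_basin_data")

# attributes.csv with required columns
attrs = pd.read_csv(root / "attributes" / "attributes.csv", dtype={"basin_id": str})
assert {"basin_id", "area"} <= set(attrs.columns), "missing basin_id/area"

# Units file for the time unit in use
units = json.loads((root / "timeseries" / "1D_units_info.json").read_text())

for basin_id in attrs["basin_id"]:
    csv_path = root / "timeseries" / "1D" / f"{basin_id}.csv"
    assert csv_path.exists(), f"missing {csv_path}"
    df = pd.read_csv(csv_path)
    assert "time" in df.columns, f"{csv_path}: no time column"
    # Every variable must have a unit entry
    missing = set(df.columns) - {"time"} - set(units)
    assert not missing, f"{csv_path}: no units for {missing}"
    # No negative precipitation or streamflow
    for col in ("prcp", "streamflow"):
        if col in df.columns:
            assert (df[col].dropna() >= 0).all(), f"{csv_path}: negative {col}"

print("basic validation passed")
```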

Data Conversion Tools

From GIS Shapefile

Extract basin attributes from shapefile:

```python
import geopandas as gpd

# Read shapefile
basins = gpd.read_file("basins.shp")

# Reproject to an equal-area CRS so .area is meaningful
# (.area on lat/lon data would be in square degrees; EPSG:6933 is one option)
basins = basins.to_crs(epsg=6933)

# Calculate area (convert m² to km²)
basins['area'] = basins.geometry.area / 1e6

# Export to CSV (assumes basin_id, lat, lon columns exist in the shapefile)
basins[['basin_id', 'area', 'lat', 'lon']].to_csv(
    "attributes/attributes.csv",
    index=False
)
```

From Other Formats

```python
# From Excel
import pandas as pd
df = pd.read_excel("basin_data.xlsx")
df.to_csv("timeseries/1D/basin_001.csv", index=False)

# From NetCDF
import xarray as xr
ds = xr.open_dataset("data.nc")
df = ds.to_dataframe().reset_index()
df.to_csv("timeseries/1D/basin_001.csv", index=False)
```

Summary

Quick Decision Guide

Choose CAMELS (hydrodataset) if:

- ✅ You need quality-controlled data
- ✅ You are working with well-studied basins
- ✅ You want a standardized format
- ✅ You need consistent attributes

Choose Custom Data (hydrodatasource) if:

- ✅ You are using your own field data
- ✅ You are working with ungauged basins
- ✅ You need specific time periods
- ✅ You have proprietary data

Key Points

  1. Public Data: Use hydrodataset for CAMELS variants
  2. Custom Data: Use hydrodatasource with selfmadehydrodataset format
  3. Data Structure: Follow standard directory layout
  4. Required Files: attributes.csv, time series CSVs, units_info.json
  5. Data Quality: Check completeness, consistency, and physical validity
  6. Performance: Use NetCDF caching for large datasets

Additional Resources

- hydrodataset GitHub: https://github.com/OuyangWenyu/hydrodataset
- hydrodataset docs: https://hydrodataset.readthedocs.io/
- hydrodatasource GitHub: https://github.com/OuyangWenyu/hydrodatasource
- hydromodel docs: usage.md, quickstart.md
- CAMELS official sites:
  - US: https://ral.ucar.edu/solutions/products/camels
  - GB: https://catalogue.ceh.ac.uk/documents/8344e4f3-d2ea-44f5-8afa-86d2987543a9