# Data Preparation Guide
This guide explains how to prepare and use data with hydromodel, covering both public CAMELS datasets and custom data.
## Overview

hydromodel supports two main data sources:

1. Public CAMELS Datasets - via the `hydrodataset` package
   - 11 global CAMELS variants (US, GB, AUS, BR, CL, etc.)
   - Automatic download and caching
   - Standardized, quality-controlled format
2. Custom Data - via the `hydrodatasource` package
   - Your own basin data
   - Flexible data organization
   - Integration with cloud storage
## Option 1: Using CAMELS Datasets (hydrodataset)

### Step 1: Install hydrodataset

```bash
pip install hydrodataset
```
### Step 2: Configure Data Path (Optional)
hydromodel automatically uses default paths, but you can customize:
Default paths:
- Windows: C:\Users\YourUsername\hydromodel_data\
- macOS/Linux: ~/hydromodel_data/
To customize, create ~/hydro_setting.yml:
```yaml
# ~/hydro_setting.yml (datasets-origin is the key that matters here;
# your settings file may contain other local_data_path entries)
local_data_path:
  datasets-origin: 'D:/data'
```
Important: Provide only the datasets-origin directory. The system automatically appends the dataset name (e.g., CAMELS_US, CAMELS_GB).
Example: If your data is in D:/data/CAMELS_US/, set datasets-origin: 'D:/data'.
### Step 3: Download Data
The data downloads automatically on first use:
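A minimal sketch of triggering the first download, assuming hydrodataset's `Camels` reader with `download`/`region` arguments; the exact constructor signature can differ between versions, so verify against the hydrodataset docs:

```python
from hydrodataset import Camels

# first instantiation downloads and caches CAMELS-US under the
# datasets-origin directory (assumed constructor signature)
camels = Camels("camels/camels_us", download=True, region="US")

# later runs reuse the cache instead of re-downloading
basin_ids = camels.read_object_ids()
print(f"{len(basin_ids)} basins available")
```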
Note: First download may take 30-120 minutes depending on dataset size. CAMELS-US is ~70GB.
### Step 4: Use with hydromodel
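A hedged sketch of pulling forcing and streamflow for a calibration run; `read_relevant_cols`, `read_target_cols`, and `read_constant_cols` belong to hydrodataset's reader interface, but the variable names below follow CAMELS-US conventions and should be checked against the dataset's own column lists:

```python
from hydrodataset import Camels

camels = Camels("camels/camels_us", region="US")

basins = ["01013500"]                     # a CAMELS-US gauge ID
t_range = ["2010-10-01", "2014-10-01"]

# forcing inputs for XAJ (precipitation and PET)
forcing = camels.read_relevant_cols(basins, t_range, ["prcp", "PET"])

# observed streamflow as the calibration target
flow = camels.read_target_cols(basins, t_range, ["streamflow"])

# basin area for unit conversion (m³/s -> mm/day)
area = camels.read_constant_cols(basins, ["area_gages2"])
```

These arrays then feed hydromodel's calibration workflow (see usage.md).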
### Available CAMELS Datasets
| Dataset | Region | Basins | Package Name |
|---|---|---|---|
| CAMELS-US | United States | 671 | camels_us |
| CAMELS-GB | Great Britain | 671 | camels_gb |
| CAMELS-AUS | Australia | 222 | camels_aus |
| CAMELS-BR | Brazil | 897 | camels_br |
| CAMELS-CL | Chile | 516 | camels_cl |
| CAMELS-CH | Switzerland | 331 | camels_ch |
| CAMELS-DE | Germany | 1555 | camels_de |
| CAMELS-DK | Denmark | 304 | camels_dk |
| CAMELS-FR | France | 654 | camels_fr |
| CAMELS-NZ | New Zealand | 70 | camels_nz |
| CAMELS-SE | Sweden | 54 | camels_se |
Usage example:
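For instance, loading a non-US variant (hedged: the path argument is illustrative and the `region` parameter mirrors the pattern above; names may differ by hydrodataset version):

```python
from hydrodataset import Camels

# Great Britain variant; data is cached under datasets-origin/camels_gb
camels_gb = Camels("camels/camels_gb", region="GB")
```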
### CAMELS Data Structure
CAMELS datasets provide standardized variables:
Time Series Variables:
- precipitation (mm/day or mm/hour)
- potential_evapotranspiration (mm/day)
- streamflow (mm/day or m³/s)
- temperature (°C)
- And more depending on dataset
Basin Attributes:
- area (km²)
- elevation (m)
- latitude, longitude
- Climate, soil, vegetation attributes
For detailed documentation, see:

- [hydrodataset GitHub](https://github.com/OuyangWenyu/hydrodataset)
- [hydrodataset documentation](https://hydrodataset.readthedocs.io/)
## Option 2: Using Custom Data (hydrodatasource)

### Step 1: Install hydrodatasource

```bash
pip install hydrodatasource
```
### Step 2: Organize Your Data

Create a directory with this structure (the layout follows hydrodatasource's selfmadehydrodataset convention):

```
my_basin_data/
├── attributes.csv              # basin attributes (one row per basin)
└── timeseries/
    ├── 1D_units_info.json      # units for the daily variables
    └── 1D/
        ├── basin_001.csv
        ├── basin_002.csv
        └── basin_003.csv
```
### Step 3: Prepare Required Files

#### 3.1 attributes.csv

Minimum required columns: `basin_id` and `area` (km²)

```csv
basin_id,area,elevation,lat,lon
basin_001,1250.5,850.2,45.123,-110.456
basin_002,876.3,1020.7,45.456,-110.789
```
Important:
- basin_id: String identifier (matches filename)
- area: Basin area in km²
- Other columns optional but recommended
#### 3.2 basin_XXX.csv (Time Series)

Required column: `time`. Other columns are your variables.

```csv
time,prcp,PET,streamflow
2010-01-01,5.2,1.1,10.3
2010-01-02,0.0,1.3,9.8
2010-01-03,12.7,0.9,15.6
```
Important:
- time format: YYYY-MM-DD (for daily data)
- Variable names: Lowercase, underscores for multi-word
- Missing values: Use empty cells or NaN (not -9999 or 0)
- No duplicate time stamps
#### 3.3 1D_units_info.json (Units Definition)

Define physical units for all variables:

```json
{
    "prcp": "mm/day",
    "PET": "mm/day",
    "streamflow": "m^3/s",
    "temp": "degC"
}
```
Common units:
- Precipitation/ET: mm/day or mm/hour
- Streamflow: m^3/s or mm/day
- Temperature: degC or K
- Area: km^2
### Step 4: Verify Data Structure
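A plain-pandas self-check (a sketch, assuming the hypothetical `my_basin_data` root from Step 2) that exercises the key requirements from Step 3:

```python
import json
from pathlib import Path

import pandas as pd

root = Path("my_basin_data")  # hypothetical dataset root

# attributes.csv must exist with basin_id and area columns
attrs = pd.read_csv(root / "attributes.csv", dtype={"basin_id": str})
assert {"basin_id", "area"} <= set(attrs.columns), "missing required columns"

# every variable needs a unit definition
units = json.loads((root / "timeseries" / "1D_units_info.json").read_text())

ts_dir = root / "timeseries" / "1D"
for basin_id in attrs["basin_id"]:
    csv = ts_dir / f"{basin_id}.csv"
    assert csv.exists(), f"missing {csv}"
    df = pd.read_csv(csv, parse_dates=["time"])
    assert not df["time"].duplicated().any(), f"duplicate timestamps in {csv}"
    missing = set(df.columns) - {"time"} - set(units)
    assert not missing, f"variables without units in {csv}: {missing}"
    for col in ("prcp", "streamflow"):
        if col in df.columns:
            assert (df[col].dropna() >= 0).all(), f"negative {col} in {csv}"

print("Data structure looks OK")
```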
### Step 5: Use with hydromodel
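A hedged sketch of reading the custom dataset with hydrodatasource's `SelfMadeHydroDataset`; the import path and method signatures follow its documented pattern but may differ between versions:

```python
from hydrodatasource.reader.data_source import SelfMadeHydroDataset

# point the reader at the directory created in Step 2
datasource = SelfMadeHydroDataset("my_basin_data", time_unit=["1D"])

# forcing + streamflow as an xarray Dataset
ts = datasource.read_ts_xrdataset(
    gage_id_lst=["basin_001", "basin_002"],
    t_range=["2010-01-01", "2015-12-31"],
    var_lst=["prcp", "PET", "streamflow"],
    time_units=["1D"],
)

# basin attributes (area is needed for unit conversion)
attrs = datasource.read_attr_xrdataset(["basin_001"], ["area"])
```

The resulting arrays feed hydromodel's calibration setup the same way CAMELS data does (see usage.md).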
## Option 3: Using Flood Event Data (hydrodatasource)

### Overview
Flood event data is designed for event-based hydrological modeling where you focus on specific flood episodes rather than continuous time series. This is particularly useful for:
- Flood forecasting and warning systems
- Peak flow estimation
- Event-based rainfall-runoff analysis
- Unit hydrograph calibration
### Key Differences from Continuous Data
| Feature | Continuous Data | Flood Event Data |
|---|---|---|
| Data Structure | Complete time series | Individual flood events with gaps |
| Input Features | 2D: [prcp, PET] | 4D: [prcp, PET, marker, event_id] |
| Warmup Handling | Removed after simulation | Included in each event (NaN markers) |
| Time Coverage | Full period | Only flood periods + warmup |
| Use Case | Long-term water balance | Flood peak prediction |
### Data Structure and Components

#### 1. Event Format

Each flood event contains:

```text
[warmup period] + [flood period]                 # one event
Features per time step (4D): [prcp, PET, marker, event_id]
  marker   : 1 = flood period, 0 = GAP
  event_id : integer identifying which flood the step belongs to
Observations: NaN during warmup, valid during the flood period
```
#### 2. Three Key Periods in Each Event

```text
|-- Warmup (NaN obs) --|-- Flood (marker=1) --|-- GAP (marker=0) --|
```
Warmup Period (e.g., 30 days before flood):
- Contains NaN values in observations
- Used to initialize model states
- Length specified by warmup_length in config
- Extracted from real data before the flood event
Flood Period (actual event):

- Contains valid observations (marker=1)
- The period of interest for simulation
- Used for model calibration and evaluation
GAP Period (between events):

- Artificial buffer (10 time steps by default)
- Designed for visualization clarity
- NOT used in simulation (marker=0, ignored by model)
- Created with: precipitation=0, ET=0.27, flow=0
#### 3. How Events are Concatenated

When loading multiple events, they are combined as:

```text
[Event 1: warmup + flood] + [GAP] + [Event 2: warmup + flood] + [GAP] + ...
```
Important: The final data structure includes GAP periods, but these are automatically skipped during simulation based on the marker values.
### Simulation Behavior

#### Event Detection

During simulation (`unified_simulate.py`), the system:

1. Reads the `flood_event_markers` array (3rd feature)
2. Uses `find_flood_event_segments_as_tuples()` to identify events
3. Only processes segments where `marker > 0` (i.e., marker=1)
4. Skips GAP periods (marker=0) completely
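A sketch of the segment-detection idea (not the actual `unified_simulate.py` source): scan the marker array and emit `(start, end)` index tuples for runs where the marker is positive, so GAP steps never reach the model:

```python
def find_flood_event_segments(markers):
    """Return (start, end) index pairs for runs where marker > 0."""
    segments, start = [], None
    for i, m in enumerate(markers):
        if m > 0 and start is None:
            start = i                        # a flood segment begins
        elif m <= 0 and start is not None:
            segments.append((start, i - 1))  # the segment just ended
            start = None
    if start is not None:                    # series ends inside a flood
        segments.append((start, len(markers) - 1))
    return segments

# GAP steps (marker=0) never enter a segment
print(find_flood_event_segments([0, 1, 1, 0, 0, 1, 1, 1, 0]))
# -> [(1, 2), (5, 7)]
```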
#### Why GAP Doesn't Affect Results
The GAP period is present in the loaded data but does not participate in simulation:
- ✅ GAP helps separate events visually in plots
- ✅ GAP provides clear boundaries between independent floods
- ❌ GAP data is never fed to the hydrological model
- ❌ GAP does not contribute to loss calculation
This design ensures that:

1. Each flood event is simulated independently
2. Model states are reset via warmup for each event
3. Events don't interfere with each other
### Configuration Example
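A hedged YAML sketch; the key names are modeled on the settings this guide references (`warmup_length`, `train_period`) and should be checked against your hydromodel version's actual config schema:

```yaml
# flood-event calibration config (illustrative key names)
data_cfgs:
  source: selfmadehydrodataset
  data_path: flood_event_data
  basin_ids: ["basin_001"]
  warmup_length: 30                        # days of spin-up before each flood
  train_period: ["2010-01-01", "2018-12-31"]
  test_period: ["2019-01-01", "2021-12-31"]
model_cfgs:
  name: xaj
```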
### Data File Structure

```
flood_event_data/
├── attributes.csv
└── timeseries/
    ├── 1D_units_info.json
    └── 1D/
        ├── basin_001.csv
        └── basin_002.csv
```
#### basin_001.csv Format

```csv
time,prcp,PET,streamflow,flood_event
2010-06-01,0.0,3.2,5.1,0
2010-06-02,12.4,2.8,6.0,0
2010-06-03,45.2,2.1,18.7,1
2010-06-04,30.8,2.3,42.5,1
2010-06-05,5.1,2.6,25.3,1
2010-06-06,0.0,3.0,12.2,0
```
Key Points:
- flood_event column: 0=no flood, 1=flood period
- Continuous time series with all periods marked
- System automatically extracts events with warmup
### Best Practices

1. Warmup Length:
   - Typical: 30 days for daily data
   - Should be long enough to initialize soil moisture states
   - Too short: poor initial conditions
   - Too long: data availability issues
2. Event Selection:
   - Focus on significant floods (peak > threshold)
   - Include complete rising and recession limbs
   - Ensure the warmup period has valid data
3. Data Quality:
   - Check for missing data in warmup periods
   - Verify flood markers are correctly assigned
   - Ensure precipitation and flow are synchronized
4. Marker Assignment:

    ```python
    # Example: mark floods based on threshold
    threshold = flow.quantile(0.95)
    flood_event = (flow > threshold).astype(int)
    ```
### Common Issues
1. "Warmup period contains all NaN"
- Ensure data exists before each flood event
- Check warmup_length is not too long
- Verify CSV has continuous time series
2. "No flood events found"
- Check flood_event column exists
- Verify flood markers are 1 (not True or other values)
- Ensure train_period covers some flood events
3. "Simulation results are all zeros" - Check if events are detected: markers should be 1 for flood periods - Verify warmup_length matches the actual warmup in data - Ensure model parameters are physically reasonable
### Advanced: Manual Event Creation
For custom event extraction:
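A hedged pandas sketch that slices each marked flood (plus its warmup window) out of the continuous series; column names follow the `basin_001.csv` format above:

```python
import pandas as pd

df = pd.read_csv("basin_001.csv", parse_dates=["time"])
warmup_length = 30  # days of spin-up kept before each flood

flood = df["flood_event"].astype(int)
# indices where the marker flips 0 -> 1 (event start) and 1 -> 0 (event end)
starts = df.index[(flood == 1) & (flood.shift(fill_value=0) == 0)]
ends = df.index[(flood == 1) & (flood.shift(-1, fill_value=0) == 0)]

events = []
for s, e in zip(starts, ends):
    segment = df.iloc[max(0, s - warmup_length) : e + 1].copy()
    # warmup rows keep their forcing but get NaN observations
    segment.loc[segment.index < s, "streamflow"] = float("nan")
    events.append(segment)

print(f"extracted {len(events)} flood events")
```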
## Data Requirements for XAJ Model

### Required Variables

| Variable | Description | Unit | Typical Source |
|---|---|---|---|
| `prcp` | Precipitation | mm/day | Rain gauge, gridded data (CHIRPS, ERA5) |
| `PET` | Potential evapotranspiration | mm/day | Penman, Priestley-Taylor, or reanalysis |
| `streamflow` | Observed streamflow | m³/s | Stream gauge |
| `area` | Basin area | km² | GIS analysis |

### Optional Variables

| Variable | Description | Unit | Usage |
|---|---|---|---|
| `temp` | Temperature | °C | Snow module (if enabled) |
| `elevation` | Basin elevation | m | PET estimation |
| `lat`, `lon` | Coordinates | degrees | Spatial analysis |
### Data Quality Guidelines

1. Time Resolution: Daily (1D) is standard for the XAJ model
2. Data Completeness:
   - Training period: ≥5 years of continuous data
   - Warmup period: ≥1 year before training
   - Missing data: <5% acceptable; continuous gaps <7 days
3. Physical Consistency:
   - Precipitation ≥ 0
   - Streamflow ≥ 0
   - PET ≥ 0
   - Check water balance: P ≈ Q + ET (within 20%)
4. Unit Consistency:
   - Ensure all units match `units_info.json`
   - Use consistent time stamps (no daylight saving shifts)
## Advanced Features

### NetCDF Caching (For Large Datasets)
Convert CSV to NetCDF for 10x faster access:
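A plain pandas/xarray sketch of the conversion (hydrodatasource also has its own caching; this stand-alone version assumes the Option 2 directory layout):

```python
import json
from pathlib import Path

import pandas as pd
import xarray as xr

root = Path("my_basin_data")
ts_dir = root / "timeseries" / "1D"
units = json.loads((root / "timeseries" / "1D_units_info.json").read_text())

# stack every basin CSV into one Dataset with a 'basin' dimension
datasets = []
for csv in sorted(ts_dir.glob("basin_*.csv")):
    df = pd.read_csv(csv, parse_dates=["time"]).set_index("time")
    datasets.append(df.to_xarray().expand_dims(basin=[csv.stem]))
cache = xr.concat(datasets, dim="basin")

# attach units so the cache file is self-describing
for var, unit in units.items():
    if var in cache:
        cache[var].attrs["units"] = unit

cache.to_netcdf(root / "timeseries_1D.nc")
# reload later with: xr.open_dataset(root / "timeseries_1D.nc")
```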
### Multi-Scale Time Series
Support different time scales in one dataset:
```
timeseries/
├── 1D_units_info.json
├── 3h_units_info.json
├── 1D/                  # daily series
│   ├── basin_001.csv
│   └── basin_002.csv
└── 3h/                  # 3-hourly series
    ├── basin_001.csv
    └── basin_002.csv
```
Specify in config:
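A hedged snippet; the exact key for the time scale depends on your hydromodel config schema:

```yaml
data_cfgs:
  time_unit: ["3h"]   # or ["1D"] for daily
```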
### Cloud Storage (MinIO/S3)
For large datasets in the cloud:
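hydrodatasource's own MinIO integration is configured through its settings file (see its docs); as a generic stand-in, the s3fs sketch below reads a basin CSV straight from object storage (endpoint, bucket, and credentials are placeholders):

```python
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(
    key="ACCESS_KEY",
    secret="SECRET_KEY",
    client_kwargs={"endpoint_url": "http://minio.example.com:9000"},
)

with fs.open("my-bucket/basins/timeseries/1D/basin_001.csv") as f:
    df = pd.read_csv(f, parse_dates=["time"])
```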
## Complete Workflow Example
Here's a complete example from raw data to calibration:
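A hedged end-to-end sketch: it builds the Option 2 directory from synthetic placeholder series, then hands off to calibration. The layout matches this guide; the random data and the final calibration step are illustrative only:

```python
import json
from pathlib import Path

import numpy as np
import pandas as pd

# 1. directory layout (see Option 2)
root = Path("my_basin_data")
ts_dir = root / "timeseries" / "1D"
ts_dir.mkdir(parents=True, exist_ok=True)

# 2. basin attributes (basin_id must be a string)
pd.DataFrame({"basin_id": ["basin_001"], "area": [1250.5]}).to_csv(
    root / "attributes.csv", index=False
)

# 3. daily forcing + streamflow (synthetic placeholders for real data)
time = pd.date_range("2010-01-01", "2015-12-31", freq="D")
rng = np.random.default_rng(0)
pd.DataFrame(
    {
        "time": time,
        "prcp": rng.gamma(0.6, 5.0, len(time)).round(2),
        "PET": np.clip(rng.normal(2.5, 1.0, len(time)), 0, None).round(2),
        "streamflow": rng.gamma(2.0, 3.0, len(time)).round(2),
    }
).to_csv(ts_dir / "basin_001.csv", index=False)

# 4. units definition
units = {"prcp": "mm/day", "PET": "mm/day", "streamflow": "m^3/s"}
(root / "timeseries" / "1D_units_info.json").write_text(json.dumps(units, indent=2))

# 5. calibrate with hydromodel: the entry point depends on your version;
#    see usage.md / quickstart.md for the exact script or function.
```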
## Troubleshooting

### Common Issues
1. "Basin ID not found"
- Check basin_id column in attributes.csv matches CSV filenames
- Basin IDs must be strings (not numbers)
- Filenames: {basin_id}.csv (e.g., basin_001.csv)
2. "Time column not found"
- CSV must have time column (case-sensitive)
- Format: YYYY-MM-DD for daily, YYYY-MM-DD HH:MM for hourly
3. "Unit info file not found"
- Create {time_unit}_units_info.json in timeseries folder
- Example: 1D_units_info.json for daily data
4. "Variable not found in units info"
- Every variable in CSV must be in units_info.json
- Check spelling matches exactly (case-sensitive)
5. "Data shape mismatch" - All basins should have same variables - All basins should cover the requested time range
6. CAMELS data download fails
- Check internet connection
- Check disk space (CAMELS-US needs ~70GB)
- Try manual download from official sources
- Set download=False if data already exists
### Data Validation Checklist

Before using your custom data:

- [ ] `attributes.csv` exists with `basin_id` and `area` columns
- [ ] Time series files named `{basin_id}.csv`
- [ ] All CSV files have a `time` column
- [ ] `{time_unit}_units_info.json` exists
- [ ] All variables in CSV are in `units_info.json`
- [ ] No negative precipitation or streamflow values
- [ ] Time series is continuous (no large gaps)
- [ ] Data covers warmup + train + test periods
- [ ] Units are physically reasonable
## Data Conversion Tools

### From GIS Shapefile

Extract basin attributes from a shapefile:
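A hedged geopandas sketch; the shapefile's ID column (`BASIN_ID` here) and the equal-area CRS are placeholders to adapt to your data:

```python
import geopandas as gpd
import pandas as pd

gdf = gpd.read_file("basins.shp")

# measure area in an equal-area projection (EPSG:6933 is one global choice)
projected = gdf.to_crs(epsg=6933)
area_km2 = projected.geometry.area / 1e6               # m² -> km²
centroids = projected.geometry.centroid.to_crs(epsg=4326)

attrs = pd.DataFrame(
    {
        "basin_id": gdf["BASIN_ID"].astype(str),
        "area": area_km2.round(1),
        "lat": centroids.y.round(4),
        "lon": centroids.x.round(4),
    }
)
attrs.to_csv("attributes.csv", index=False)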
### From Other Formats
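For example, flattening a NetCDF forcing file into per-basin CSVs (hedged: `forcings.nc`, its `basin` dimension, and the coordinate names are placeholders for your file's actual structure):

```python
import xarray as xr

ds = xr.open_dataset("forcings.nc")

for basin_id in ds["basin"].values:
    df = ds.sel(basin=basin_id).to_dataframe().reset_index()
    # the time coordinate must end up as a 'time' column
    if "date" in df.columns:
        df = df.rename(columns={"date": "time"})
    df = df.drop(columns=["basin"], errors="ignore")
    df.to_csv(f"my_basin_data/timeseries/1D/{basin_id}.csv", index=False)
```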
## Summary

### Quick Decision Guide

Choose CAMELS (hydrodataset) if:

- ✅ You need quality-controlled data
- ✅ Working with well-studied basins
- ✅ Want standardized format
- ✅ Need consistent attributes

Choose Custom Data (hydrodatasource) if:

- ✅ Using your own field data
- ✅ Working with ungauged basins
- ✅ Need specific time periods
- ✅ Have proprietary data
### Key Points

- Public Data: Use `hydrodataset` for CAMELS variants
- Custom Data: Use `hydrodatasource` with the selfmadehydrodataset format
- Data Structure: Follow the standard directory layout
- Required Files: `attributes.csv`, time series CSVs, `units_info.json`
- Data Quality: Check completeness, consistency, and physical validity
- Performance: Use NetCDF caching for large datasets
## Additional Resources
- hydrodataset GitHub: https://github.com/OuyangWenyu/hydrodataset
- hydrodataset docs: https://hydrodataset.readthedocs.io/
- hydrodatasource GitHub: https://github.com/OuyangWenyu/hydrodatasource
- hydromodel docs: usage.md, quickstart.md
- CAMELS official sites:
- US: https://ral.ucar.edu/solutions/products/camels
- GB: https://catalogue.ceh.ac.uk/documents/8344e4f3-d2ea-44f5-8afa-86d2987543a9