Abstract
This study aims to evaluate the performance of three prominent large language models (LLMs), DeepSeek R1, ChatGPT-4o, and Gemini 2, in addressing key questions within four core fields of hydrology and water science: machine learning and optimization, remote sensing, flood modeling, and sediment transport. The LLMs’ responses are systematically compared to benchmark responses derived from review articles in the respective fields. To assess the LLMs’ efficiency, a novel evaluation rubric is introduced, incorporating four key criteria: relevancy, accuracy, authenticity, and novelty. The findings show that each model can address the core aspects of the benchmark questions. DeepSeek R1 achieved the highest overall scores in machine learning and optimization, flood modeling, and sediment transport, while ChatGPT-4o demonstrated superior performance in remote sensing. Notably, DeepSeek R1 and Gemini 2 exhibited the lowest response similarity in 95% of the evaluated questions, whereas ChatGPT-4o and Gemini 2 showed the highest similarity in 70% of cases.
| Original language | English |
|---|---|
| Article number | 106772 |
| Number of pages | 17 |
| Journal | Environmental Modelling & Software |
| Volume | 196 |
| Early online date | 7 Nov 2025 |
| Publication status | Published - 30 Jan 2026 |
| MoE publication type | A1 Journal article-refereed |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 11 Sustainable Cities and Communities
Keywords
- ChatGPT-4o
- DeepSeek R1
- Gemini 2
- hydrology
- large language model (LLM)
- water science
Fingerprint
Dive into the research topics of 'How Well Do DeepSeek, ChatGPT, and Gemini Respond to Water Science Questions?'. Together they form a unique fingerprint.