TY - GEN
T1 - Data Quality Management for Real-World Graduation Prediction
AU - Nguyen-Pham, Hong Duyen
AU - Vo, Khoa Tan
AU - Nguyen, Thu
AU - Nguyen-Hoang, Tu Anh
AU - Dinh, Ngoc Thanh
AU - Nguyen, Hong Tri
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - The rapid growth of diverse and multi-sourced data has rendered traditional data storage models inadequate to handle the sheer volume and complexity. Data lakes, which store all raw data and all data versions in an easily accessible format, are well-suited for deep data analysis and the discovery of valuable insights. However, the quality of this data is not guaranteed, raising the question of how to utilize this vast repository effectively. Our research proposes a four-step data quality management process (profile, implement, monitor, and improve) to oversee and ensure data usability within a data lake. This process employs five commonly used evaluation criteria: accuracy, completeness, consistency, uniqueness, and timeliness. Our study focuses on higher education data, an area that has not been extensively explored in previous research, using real-world data from a university's computer science department. The application context is managing the quality of input data for a machine-learning model that predicts student graduation outcomes. Two advanced boosting machine learning models, LightGBM and CatBoost, are employed, resulting in a 5% improvement in performance. Our research aims to provide a comprehensive solution for assessing data quality in higher education, saving significant time, effort, and cost while enhancing the reliability of data utilization from data lakes.
AB - The rapid growth of diverse and multi-sourced data has rendered traditional data storage models inadequate to handle the sheer volume and complexity. Data lakes, which store all raw data and all data versions in an easily accessible format, are well-suited for deep data analysis and the discovery of valuable insights. However, the quality of this data is not guaranteed, raising the question of how to utilize this vast repository effectively. Our research proposes a four-step data quality management process (profile, implement, monitor, and improve) to oversee and ensure data usability within a data lake. This process employs five commonly used evaluation criteria: accuracy, completeness, consistency, uniqueness, and timeliness. Our study focuses on higher education data, an area that has not been extensively explored in previous research, using real-world data from a university's computer science department. The application context is managing the quality of input data for a machine-learning model that predicts student graduation outcomes. Two advanced boosting machine learning models, LightGBM and CatBoost, are employed, resulting in a 5% improvement in performance. Our research aims to provide a comprehensive solution for assessing data quality in higher education, saving significant time, effort, and cost while enhancing the reliability of data utilization from data lakes.
KW - big data
KW - data quality management
KW - educational data mining
KW - graduation prediction
UR - http://www.scopus.com/inward/record.url?scp=105000630648&partnerID=8YFLogxK
U2 - 10.1109/ATC63255.2024.10908184
DO - 10.1109/ATC63255.2024.10908184
M3 - Conference article in proceedings
AN - SCOPUS:105000630648
T3 - International Conference on Advanced Technologies for Communications
SP - 701
EP - 706
BT - Proceedings - 2024 International Conference on Advanced Technologies for Communications, ATC 2024
A2 - Bao, Vo Nguyen Quoc
A2 - Anh, Do Trung
PB - IEEE
T2 - International Conference on Advanced Technologies for Communications
Y2 - 17 October 2024 through 19 October 2024
ER -