Data Quality Management for Real-World Graduation Prediction

Hong Duyen Nguyen-Pham, Khoa Tan Vo, Thu Nguyen, Tu Anh Nguyen-Hoang, Ngoc Thanh Dinh, Hong Tri Nguyen

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

Abstract

The rapid growth of diverse and multi-sourced data has rendered traditional data storage models inadequate to handle its sheer volume and complexity. Data lakes, which store all raw data and all data versions in an easily accessible format, are well suited to deep data analysis and the discovery of valuable insights. However, the quality of this data is not guaranteed, raising the question of how to utilize this vast repository effectively. Our research proposes a four-step data quality management process (profile, implement, monitor, and improve) to oversee and ensure data usability within a data lake. This process employs five commonly used evaluation criteria: accuracy, completeness, consistency, uniqueness, and timeliness. Our study focuses on higher education data, an area that has not been extensively explored in previous research, using real-world data from a university's computer science department. The application context is managing the quality of input data for a machine-learning model that predicts student graduation outcomes. Two advanced boosting machine learning models, LightGBM and CatBoost, are employed, resulting in a 5% improvement in performance. Our research aims to provide a comprehensive solution for assessing data quality in higher education, saving significant time, effort, and cost while enhancing the reliability of data utilization from data lakes.
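As a rough illustration of what the profiling step might measure, the following minimal sketch scores four of the five criteria named above on a toy dataset. It is not the authors' implementation: the record fields (`student_id`, `gpa`, `updated`), the valid GPA range, and the one-year freshness window are all assumptions made for the example, and consistency is omitted because it depends on cross-field rules that the abstract does not specify.

```python
from datetime import date

# Hypothetical student records; field names and values are illustrative only.
records = [
    {"student_id": "S001", "gpa": 3.4, "updated": date(2024, 9, 1)},
    {"student_id": "S002", "gpa": None, "updated": date(2024, 9, 3)},
    {"student_id": "S002", "gpa": 2.9, "updated": date(2022, 1, 15)},
    {"student_id": "S003", "gpa": 4.7, "updated": date(2024, 8, 20)},
]

# All helpers assume a non-empty list of rows.

def completeness(rows, field):
    """Share of rows where the field is present and not None."""
    return sum(r.get(field) is not None for r in rows) / len(rows)

def uniqueness(rows, key):
    """Number of distinct key values divided by the number of rows."""
    return len({r[key] for r in rows}) / len(rows)

def accuracy(rows, field, low, high):
    """Share of non-missing values inside an assumed valid range."""
    values = [r[field] for r in rows if r.get(field) is not None]
    return sum(low <= v <= high for v in values) / len(values)

def timeliness(rows, field, as_of, max_age_days):
    """Share of rows updated within max_age_days of the reference date."""
    return sum((as_of - r[field]).days <= max_age_days for r in rows) / len(rows)

profile = {
    "completeness(gpa)": completeness(records, "gpa"),
    "uniqueness(student_id)": uniqueness(records, "student_id"),
    "accuracy(gpa in 0..4)": accuracy(records, "gpa", 0.0, 4.0),
    "timeliness(updated, 365 d)": timeliness(records, "updated", date(2024, 10, 17), 365),
}
for name, score in profile.items():
    print(f"{name}: {score:.2f}")
```

Scores like these could be recomputed on each data-lake refresh to support the monitoring step, with thresholds chosen per field; the thresholds themselves are a design decision outside the scope of this sketch.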

Original language: English
Title of host publication: Proceedings - 2024 International Conference on Advanced Technologies for Communications, ATC 2024
Editors: Vo Nguyen Quoc Bao, Do Trung Anh
Publisher: IEEE
Pages: 701-706
Number of pages: 6
ISBN (Electronic): 979-8-3503-5398-3
DOIs
Publication status: Published - 2024
MoE publication type: A4 Conference publication
Event: International Conference on Advanced Technologies for Communications - Ho Chi Minh City, Viet Nam
Duration: 17 Oct 2024 → 19 Oct 2024

Publication series

Name: International Conference on Advanced Technologies for Communications
Publisher: IEEE
ISSN (Print): 2162-1039
ISSN (Electronic): 2162-1020

Conference

Conference: International Conference on Advanced Technologies for Communications
Abbreviated title: ATC
Country/Territory: Viet Nam
City: Ho Chi Minh City
Period: 17/10/2024 → 19/10/2024

Keywords

  • big data
  • data quality management
  • educational data mining
  • graduation prediction
