Abstract
The volume of data collected around the world is growing exponentially, a phenomenon popularly known as big data. The volume and velocity of big data are facilitating the transition of machine learning (ML), deep learning (DL) and artificial intelligence (AI) from research laboratories to real life. Numerous other claims are made about big data. Can we, however, rely on data blindly? What happens when a dataset used to train ML models contains a hidden statistical paradox? Data, like fossil fuels, is valuable, but it must be refined carefully to yield accurate outcomes. Statistical paradoxes are hard to observe with classical data cleaning and analysis techniques, so they need to be investigated separately in training datasets. In this paper, we discuss the impact of Simpson’s paradox on categorical data and demonstrate its effects in AI and ML application scenarios. Next, we provide an algorithm that automatically identifies the confounding variable and detects Simpson’s paradox in categorical datasets. The algorithm is evaluated on datasets from two real-world case studies. Its results uncover the existence of the paradox and indicate that Simpson’s paradox is severely harmful in automatic data analysis, especially in AI, ML and DL.
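The paper's own algorithm is not reproduced in this record. As an illustration only, the following is a minimal sketch of what checking for Simpson's paradox in categorical count data can look like. It assumes the candidate confounder is already known, whereas the paper describes identifying it automatically. The function names are hypothetical, and the counts are the widely cited kidney-stone treatment figures used in textbooks, not data from the paper's case studies.

```python
from collections import defaultdict

# Illustrative counts (classic kidney-stone example, NOT from the paper):
# (treatment, stone_size, successes, trials)
COUNTS = [
    ("A", "small", 81, 87),
    ("A", "large", 192, 263),
    ("B", "small", 234, 270),
    ("B", "large", 55, 80),
]


def rate_by_treatment(counts):
    """Pool successes and trials per treatment and return success rates."""
    totals = defaultdict(lambda: [0, 0])
    for treatment, _stratum, successes, trials in counts:
        totals[treatment][0] += successes
        totals[treatment][1] += trials
    return {t: s / n for t, (s, n) in totals.items()}


def has_simpsons_reversal(counts, first="A", second="B"):
    """Return True if the aggregate ordering of `first` vs `second`
    is reversed in every stratum of the candidate confounder.

    Assumes both treatments appear in every stratum.
    """
    overall = rate_by_treatment(counts)
    overall_diff = overall[first] - overall[second]
    if overall_diff == 0:
        return False
    strata = sorted({row[1] for row in counts})
    for stratum in strata:
        rates = rate_by_treatment([r for r in counts if r[1] == stratum])
        # A reversal means the within-stratum difference has the opposite sign.
        if (rates[first] - rates[second]) * overall_diff >= 0:
            return False
    return True


print(has_simpsons_reversal(COUNTS))
```

In this example treatment B has the higher pooled success rate, yet treatment A is better within both the small-stone and large-stone groups, so the check reports a reversal. A model trained only on the pooled column would learn the misleading ordering.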
| Original language | English |
|---|---|
| Title of host publication | Database and Expert Systems Applications - 33rd International Conference, DEXA 2022, Proceedings |
| Editors | Christine Strauss, Alfredo Cuzzocrea, Gabriele Kotsis, Ismail Khalil, A Min Tjoa |
| Publisher | Springer |
| Pages | 323-335 |
| Number of pages | 13 |
| ISBN (Print) | 978-3-031-12422-8 |
| DOIs | |
| Status | Published - 2022 |
| Publication type (Finnish Ministry of Education and Culture, OKM) | A4 Article in conference proceedings |
| Event | International Conference on Database and Expert Systems Applications - Vienna, Austria. Duration: 22 Aug 2022 → 24 Aug 2022. Conference number: 33 |
Publication series
| Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
|---|---|
| Volume | 13426 LNCS |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Conference
| Conference | International Conference on Database and Expert Systems Applications |
|---|---|
| Abbreviated title | DEXA |
| Country/Territory | Austria |
| City | Vienna |
| Period | 22/08/2022 → 24/08/2022 |
Funding
Acknowledgements. This work has been partially conducted in the project “ICT programme” which was supported by the European Union through the European Social Fund.
Paper title: Detecting Simpson’s Paradox: A Machine Learning Perspective