Efficiently Enumerating Substrings with Statistically Significant Frequencies of Locally Optimal Occurrences in Gigantic String

Atsuyoshi Nakamura, Ichigaku Takigawa, Hiroshi Mamitsuka

Research output: Article in book/conference proceedings; Conference article in proceedings; Scientific; peer-reviewed

Abstract

We propose a new frequent substring pattern mining method that enumerates all substrings with statistically significant frequencies of their locally optimal occurrences in a given single sequence. Our target application is genome sequences, about half of which are said to be covered by interspersed and consecutive (tandem) repeats; detecting these repeats is an important task in molecular life sciences. We evaluate the statistical significance of frequent substrings using a string generation model with a memoryless stationary information source. We combine this idea with an existing algorithm, ESFLOO-0G.C (Nakamura et al. 2016), to enumerate all statistically significant substrings with locally optimal occurrences. We further develop a parallelized version of our algorithm. Experimental results on synthetic datasets showed that the proposed algorithm achieved a far higher F-measure than conventional algorithms in extracting substrings (of various lengths and frequencies) embedded in a randomly generated string with noise. The large-scale experiment on the whole human genome sequence, with 3,095,677,412 bases (letters), showed that our parallel algorithm covers 75% of all positions analyzed, around 4% and 24% higher than the recent report and the current cutting-edge knowledge, respectively, implying a biologically unique finding.
Original language: English
Title of host publication: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2020)
Publisher: AAAI Press
Pages: 5240-5247
Number of pages: 8
ISBN (Print): 978-1-57735-835-0
Status: Published - 3 Apr 2020
MoE publication type: A4 Article in conference proceedings
Event: AAAI Conference on Artificial Intelligence - New York, United States
Duration: 7 Feb 2020 to 12 Feb 2020
Conference number: 34
https://aaai.org/Conferences/AAAI-20/

Publication series

Name: Proceedings of the AAAI Conference on Artificial Intelligence
Publisher: AAAI Press
Number: 4
Volume: 34
ISSN (Print): 2159-5399
ISSN (Electronic): 2374-3468

Conference

Conference: AAAI Conference on Artificial Intelligence
Abbreviated title: AAAI
Country/Territory: United States
City: New York
Period: 07/02/2020 to 12/02/2020
