Host-IP clustering technique for deepweb characterization

Denis Shestakov*, Tapio Salakoski

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Scientific; peer-review

    7 Citations (Scopus)


    A huge portion of today's Web consists of web pages filled with information from myriads of online databases. This part of the Web, known as the deep Web, is to date relatively unexplored, and even major characteristics such as the number of searchable databases on the Web are somewhat disputed. In this paper, we aim at a more accurate estimation of the main parameters of the deep Web by sampling one national web domain. We propose the Host-IP clustering sampling technique, which addresses drawbacks of existing approaches to characterizing the deep Web, and report our findings based on a survey of the Russian Web conducted in September 2006. The obtained estimates, together with the proposed sampling method, could be useful for further studies on handling data in the deep Web.

    Original language: English
    Title of host publication: Advances in Web Technologies and Applications - Proceedings of the 12th Asia-Pacific Web Conference, APWeb 2010
    Number of pages: 3
    Publication status: Published - 9 Jul 2010
    MoE publication type: A4 Article in a conference publication
    Event: International Asia Pacific Web Conference - Busan, Korea, Republic of
    Duration: 6 Apr 2010 → 8 Apr 2010
    Conference number: 12


    Conference: International Asia Pacific Web Conference
    Abbreviated title: APWeb
    Country: Korea, Republic of


    • Deep web
    • Host-IP clustering sampling
    • Search interface discovery
    • Virtual hosting
    • Web characterization
