In recent years the Internet has become an important source of information, able to supply data for a wide range of machine learning tasks. To take advantage of this, the ML4DS group has developed tools for exploring web domains. These tools, known as web crawlers, can explore millions of web pages, extract their contents and save them for subsequent processing. The crawl can be complemented with additional information about each website's status, or with a detailed analysis of its DNS (Domain Name System) records. The extracted information can then be combined to obtain statistical data about a given domain.
For example, by analyzing the content of the .es domain we can determine the number of available web pages, the number of redirections, or how many domains are down. Moreover, by combining web content with machine learning tools we can identify specific, predefined categories of web page activity (parking/PPC, Business, E-commerce, Blog, etc.). The ML4DS group has experience analyzing and extracting statistics over the .es and .eu domains, where millions of web domains were explored.
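The kind of per-domain statistics mentioned above can be illustrated with a minimal sketch. It is not the group's actual code: the `CrawlRecord` type, its fields and the `summarize` function are hypothetical names chosen for this example, and it assumes that a domain counts as "down" when none of its requests returned an HTTP response.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class CrawlRecord:
    """One fetched URL (hypothetical record layout for this sketch)."""
    domain: str
    url: str
    status: Optional[int]  # HTTP status code, or None if no response was received


def summarize(records: Iterable[CrawlRecord]) -> dict:
    """Count available pages, redirections and unreachable domains."""
    pages = 0
    redirections = 0
    seen = set()        # every domain that was requested at least once
    responded = set()   # domains that returned at least one HTTP response
    for record in records:
        seen.add(record.domain)
        if record.status is None:
            continue
        responded.add(record.domain)
        if 200 <= record.status < 300:
            pages += 1
        elif 300 <= record.status < 400:
            redirections += 1
    return {
        "domains": len(seen),
        "pages": pages,
        "redirections": redirections,
        # Assumption: "down" means no response at all for the domain.
        "domains_down": len(seen - responded),
    }


if __name__ == "__main__":
    sample = [
        CrawlRecord("a.es", "http://a.es/", 200),
        CrawlRecord("a.es", "http://a.es/old", 301),
        CrawlRecord("b.es", "http://b.es/", None),
        CrawlRecord("c.es", "http://c.es/", 404),
    ]
    print(summarize(sample))
```

In a real crawl the records would be read from the stored crawl output rather than built by hand, and the definition of "down" could also take DNS resolution failures into account.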
The web crawler is built on Python and other open source technologies, such as MongoDB, MPI (for parallelization), Scrapy (as the crawling framework) and Spark (for distributed data processing). Combining these technologies allows the web crawling process to be parallelized and eases subsequent access to the downloaded information. Running the latest version of the software on a mini-cluster of 45 cores, we are able to crawl more than ten million web pages, from three hundred thousand domains, in a little over three days.
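One common way to parallelize a crawl of this kind is to split the list of domains into disjoint shards, one per worker process. The sketch below shows that idea with a stable hash and the standard library only. It is an illustration under stated assumptions, not the group's implementation: `assign_worker` and `partition` are hypothetical helpers, and in an MPI setting each rank would simply crawl the shard whose index matches its rank.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def assign_worker(domain: str, n_workers: int) -> int:
    """Map a domain to a worker index in [0, n_workers) using a stable hash.

    A cryptographic digest is used instead of Python's built-in hash(),
    which is randomized between interpreter runs for strings.
    """
    digest = hashlib.sha1(domain.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_workers


def partition(domains: Iterable[str], n_workers: int) -> Dict[int, List[str]]:
    """Group domains into per-worker shards; every domain lands in exactly one."""
    shards: Dict[int, List[str]] = defaultdict(list)
    for domain in domains:
        shards[assign_worker(domain, n_workers)].append(domain)
    return dict(shards)


if __name__ == "__main__":
    # 45 matches the core count quoted in the text; the domains are made up.
    example = ["example.es", "ejemplo.es", "tienda.es", "blog.es"]
    for worker, shard in sorted(partition(example, 45).items()):
        print(worker, shard)
```

Keeping all pages of a domain on the same worker makes it easier to apply per-domain politeness limits. For scale, the figures quoted above correspond to an aggregate rate on the order of 40 pages per second across the 45 cores.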
This link provides an example of 100 crawled web sites, including the downloaded web content as well as their DNS information.