The Volume and Evolution of Web Page Templates
Source:
14th International World Wide Web Conference (2005)
Abstract:
Web pages contain a combination of unique content and template
material, which is present across multiple pages and used primarily
for formatting, navigation, and branding. We study the nature,
evolution, and prevalence of these templates on the web. As part of
this work, we develop new randomized algorithms for template
extraction that perform approximately twenty times faster than
existing approaches with similar quality. Our results show that
40-50% of the content on the web is template content. Over the last
eight years, the fraction of template content has doubled, and the
growth shows no sign of abating. Text, links, and total HTML bytes
within templates are all growing as a fraction of total content at a
rate of between 6 and 8% per year. We discuss the deleterious
implications of this growth for information retrieval and ranking,
classification, and link analysis.