Looking ahead to the year 2025, we expect big dreams such as nanotechnology, artificial intelligence, next-generation cloud and high-performance computing to be realized. The impact of such technologies on human life is unimaginable at this point in time. We are not far from tiny nano factories and nano robots doing smart jobs for us at home. The computers around us will be a million times faster and smaller, ready to serve us at a fraction of today's energy consumption. These possibilities are beautiful and likely to be realized, but two important questions remain: ‘Are we ready for that?’ and ‘Are we laying the correct foundation for next-generation computing?’ The answer may be ‘Yes’ or ‘No’ depending on an individual's perception and context. However, it is certainly ‘No’ if we look at the current unstructured nature of the World Wide Web, the biggest information store freely available at our fingertips.
Given the tremendous size of the Web, the way we have organized our web resources, and the rate of web adoption in developing countries, it will soon become difficult to identify relevant information and services of interest easily. Total dependence on purely text-based search engines for finding information will not be sufficient: we will lose credible information that search engines cannot surface effectively, and such a loss may become unaffordable in the near future.
A lot of research has been happening around web standardization. The research on classifying web sites by Christoph Lindemann and Lars Littig and the research on extracting and managing structured Web data by Michael John Cafarella are remarkable.
This paper proposes a few techniques for structuring the Web to make it more usable.
This is the first paper in a series of research on the topic ‘Structured Web: 2025’.
Fig. 1 Conceptual view of Web showing scattered information without specific structure
Due to the heterogeneity of the Web and its lack of structure, it is crucial to identify the properties of a Web resource that best reflect its functionality. In the relational database world, we call this a schema. To read any tuple from a database, we first need to know its schema. This principle applies equally to Web resources. Once the schema is known, the second step is to allow the tuple to be read by anybody.
Here I propose a two-step methodology to describe the structure of a Web resource.
A. Every Web resource should describe and expose its properties.
B. Every Web resource should be accessible through a unified structure.
Here I consider a Web resource to be anything that is publicly accessible.
IV. PROPERTIES DESCRIPTION
This applies to one of the major types of Web resource, i.e. the Web site. Every Web site should describe its schema using the properties below and expose it for public access.
Web site properties
| Sr. No. | Web site properties description |
| 13 | Documents (Word, PDF, XLS etc.) |
| | Size of pages |
| 17 | Count of pages |
| 18 | External site out-degree |
| | Domain dictionary keywords |
| | Popular URLs of the site |
| | Web resource structure (see Section V) |
Using the above information made available by each Web site, organizations can write crawlers that visit web sites, retrieve these details, and maintain a database of all this information.
Search engines can use the domain dictionary keywords to index the web site against those keywords.
The security property indicates whether the web site can be used openly by anybody or whether registration is required.
V. RESOURCE ACCESSIBILITY
Once we know a web site's properties, we can understand its general structure. The next level of categorization is based on how the actual web site content is made available for public access. This content access is different from content access through a rendered web page. Directly exposing content through URLs will help categorise the overall information in the form of a relational database table, like the one below.
Web resource structure
| Sr. No. | Web site content access structure |
A. WebResource_Properties.XML file
Now the question is how a web site will expose its properties and content access structure to the outside world. The answer is a single XML file with a standard schema, published by every website owner. This file would be named WebResource_Properties.XML. It should be present in each site's root virtual folder and should be publicly accessible using the following URL format:
Using the above mechanism, we can build a relational database table of all the websites that expose web resource properties.
One can easily write a piece of software that lists all sites ‘from Ireland, in the Health care domain, with audio and images, having a page count > 20 and without any security’ for accessing content.
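As a rough illustration of how such a file could be turned into one row of that table, the sketch below parses a sample properties document with Python's standard XML parser. The paper does not fix the schema, so the element names (`WebResourceProperties`, `Country`, `Domain`, `PageCount`, `Security`, `Keyword`) and the sample values are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical content of a WebResource_Properties.XML file.
# Element names and values are illustrative assumptions, not a defined standard.
SAMPLE_XML = """\
<WebResourceProperties>
  <Country>Ireland</Country>
  <Domain>Health care</Domain>
  <PageCount>42</PageCount>
  <Security>none</Security>
  <DomainKeywords>
    <Keyword>clinic</Keyword>
    <Keyword>pharmacy</Keyword>
  </DomainKeywords>
</WebResourceProperties>
"""


def parse_properties(xml_text):
    """Convert one properties document into a flat, row-like dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "country": root.findtext("Country"),
        "domain": root.findtext("Domain"),
        "page_count": int(root.findtext("PageCount")),
        "security": root.findtext("Security"),
        "keywords": [k.text for k in root.iter("Keyword")],
    }


row = parse_properties(SAMPLE_XML)
print(row)
```

A crawler could apply a function like `parse_properties` to each site's file and store the resulting rows in a database.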
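A minimal sketch of such a piece of software is shown here, using an in-memory SQLite table. The table name `web_resource`, its columns, and the sample sites are all hypothetical; they only stand in for whatever properties the published files would actually expose.

```python
import sqlite3

# In-memory table standing in for the crawled properties database.
# Table name, column names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE web_resource (
           site TEXT, country TEXT, domain TEXT,
           has_audio INTEGER, has_images INTEGER,
           page_count INTEGER, security TEXT)"""
)
conn.executemany(
    "INSERT INTO web_resource VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("clinic.example", "Ireland", "Health care", 1, 1, 42, "none"),
        ("shop.example", "Ireland", "Commerce", 0, 1, 300, "none"),
        ("ward.example", "Ireland", "Health care", 1, 1, 12, "none"),
        ("members.example", "Ireland", "Health care", 1, 1, 80, "registration"),
    ],
)


def find_sites(connection):
    """Sites from Ireland, Health care domain, audio and images, >20 pages, no security."""
    cursor = connection.execute(
        """SELECT site FROM web_resource
           WHERE country = ? AND domain = ?
             AND has_audio = 1 AND has_images = 1
             AND page_count > ? AND security = ?
           ORDER BY site""",
        ("Ireland", "Health care", 20, "none"),
    )
    return [r[0] for r in cursor]


matches = find_sites(conn)
print(matches)
```

With the sample rows above, only `clinic.example` satisfies every condition; the other sites fail on domain, page count, or security.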
B. Ranking Website
Another way of classifying web resources and web sites is to rank them. This ranking should be based on:
- Volume and quality of relevant content
- Number of users and/or web traffic
Web site ranking should be done by independent organizations so that it reflects real usability. A rank is always tied to a domain, so ranks should only be compared between sites within the same domain.
Examples of domains include affiliate site, archive site, blog, corporate site, commerce site, database site, development site, directory site, download site, and employment site.
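The idea that a rank is only meaningful within a domain can be sketched as follows. The paper does not define a scoring formula, so the equal weighting of a content score and a traffic score, the field names, and the sample sites are assumptions chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical per-site scores in the range 0..1; values are made up for the example.
sites = [
    {"site": "a.example", "domain": "Blogs", "content_score": 0.9, "traffic_score": 0.2},
    {"site": "b.example", "domain": "Blogs", "content_score": 0.4, "traffic_score": 0.8},
    {"site": "c.example", "domain": "Commerce site", "content_score": 0.7, "traffic_score": 0.7},
]


def rank_within_domain(site_list, content_weight=0.5):
    """Return, per domain, the site names ordered from best to worst score."""
    by_domain = defaultdict(list)
    for s in site_list:
        # Assumed scoring rule: weighted mix of content and traffic scores.
        score = (content_weight * s["content_score"]
                 + (1 - content_weight) * s["traffic_score"])
        by_domain[s["domain"]].append((score, s["site"]))
    ranks = {}
    for domain, scored in by_domain.items():
        # Highest score first; ties broken alphabetically by site name.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        ranks[domain] = [site for _, site in scored]
    return ranks


ranks = rank_within_domain(sites)
print(ranks)
```

Sites are grouped by domain before sorting, so a blog is never ranked against a commerce site.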
VI. TWO VIEWS OF WEB SITE
The figure below shows two views of a web site:
A. The view that is rendered in the browser and that the user can see directly. Search engines work on this view when indexing a web site. A search engine cannot reach a web resource that has no link in a browser-rendered page, so files that sit on a web server without any link from a web page are never crawled.
B. The view provided through the WebResource_Properties.XML file, a sample of which is shown on the right side of the figure.
Fig. 2 Web Virtual directory and two views of it
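The limitation of the first view can be illustrated with a small link-following sketch. The page names, the link graph, and the file list are hypothetical; the point is only that a crawler which follows links from the home page never sees a file that no page links to.

```python
from collections import deque

# Hypothetical link graph: page -> pages or files it links to.
links = {
    "/index.html": ["/about.html", "/products.html"],
    "/about.html": ["/index.html"],
    "/products.html": ["/catalog.pdf"],
}

# Everything actually stored in the site's virtual directory (illustrative).
files_on_server = {
    "/index.html", "/about.html", "/products.html",
    "/catalog.pdf", "/reports/2009.xls",
}


def reachable_by_links(start, link_graph):
    """Breadth-first traversal of the link graph, as a link-following crawler would do."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


crawled = reachable_by_links("/index.html", links)
hidden = files_on_server - crawled
print(sorted(hidden))
```

In this example `/reports/2009.xls` stays invisible to the link-following crawler, whereas a properties file listing the site's resources could expose it.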
By implementing the above guidelines, web information can be structured to a level that offers the following advantages.
A. Technology-neutral way of categorizing web sites
Using the above method, web sites can be categorized and the Web can be structured in a technology-neutral way.
B. Improved search engine optimization
Search engines no longer need to depend on text-based indexing alone; the additional web resource properties can help produce more meaningful search results.
C. Minimal work to get started
Web resource owners do not need to make any changes to their web applications. A single XML file is enough to make a large difference.
The figure below shows a conceptual view of the Web once such structuring has taken place over a period of time. The Web being a massive data store, it will take time for people to adopt such standards and apply them.
Fig. 3 Conceptual view of Web showing structured information after employing above techniques
The important point is that if we do not act in time, we stand to lose a great deal: millions of ideas, research results and opinions from billions of people might fall into a dark age simply because nobody could find them at the right time and carry the work further. People will keep reinventing the wheel, and the next generation will blame us for not managing the Web responsibly. If we start today, the hope is that the entire Web will be a structured data source by 2025 and that the next generation might use a structured query language to search the Web seamlessly.
Because “information that cannot be found easily is as good as information that is not present.”
Michael John Cafarella, Extracting and Managing Structured Web Data, University of Washington, 2009
Christoph Lindemann and Lars Littig, Classifying Web Sites, University of Leipzig, Johannisgasse 26, 2007
John M. Pierre, On the Automated Classification of Web Sites, California, USA, 2001