HTML as graphs: the HTML2GDL application

January 26th, 2009

Information visualization, besides being a scientific field, can be fun. The ever-challenging problem of converting text and numbers into something viewable and self-explanatory makes information visualization an interesting toy to play with. Some time ago I saw HtmlGraph, an applet that displays the hierarchical structure of HTML files. I really liked the idea: you don’t have to ramble through the woods of HTML trees to figure out a page’s structural pattern. I wondered how aiSee would display HTML graphs, so I created a simple application for this purpose.

But the never-ending search for perfection impels one to keep improving and evolving. In the end I had a fairly nice tool that might be useful to the community. I’ve called it HTML2GDL. Here is how http://www.aisee.com/gallery/ is displayed by HtmlGraph and by HTML2GDL:

aisee.com/gallery (HtmlGraph)

aisee.com/gallery (HTML2GDL)

The GDL source: aisee_gallery.gdl

The forcedir layout algorithm was used. I don’t know well how force-directed layout schemes work, so I tried different configuration options to obtain a nice graph. I found that the graph can be improved by setting the minimum temperature (the tempmin attribute) to a higher value. You can play with the attraction and repulsion graph attributes to control the length of the edges. Edge length can also be adjusted dynamically with the --priority-factor option (the default is 0): the higher the priority of an edge, the shorter it will be (edge.priority = priority-factor * node.level). Refer to the priority edge attribute.
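
To make the priority formula concrete, here is a small Python sketch. It is an illustration only, not code taken from html2gdl.pl. It assumes that node.level is the depth of the edge’s child node and that the product is rounded to an integer; the helper names edge_priority and gdl_edge are made up for this example, and the edge statement should be checked against the aiSee manual.

```python
def edge_priority(priority_factor, level):
    """edge.priority = priority-factor * node.level (rounding is an assumption)."""
    return int(round(priority_factor * level))


def gdl_edge(source, target, priority):
    """Format one GDL edge statement carrying the priority attribute."""
    return 'edge: { sourcename: "%s" targetname: "%s" priority: %d }' % (
        source, target, priority)


# With --priority-factor=0.5, deeper edges get a higher priority,
# which the layout turns into shorter edges.
for level in (2, 4, 6):
    print(gdl_edge("n%d" % (level - 1), "n%d" % level, edge_priority(0.5, level)))
```

With the default factor of 0 every edge gets priority 0, so all edges are treated alike.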

html2gdl.zip 7KB / Version: 1.0; Date: February 15, 2009
Notice: HTML2GDL now supports GraphViz package, read more…

The documentation (the complete list of html2gdl.pl command line options with explanations) is located at the beginning of the script.

html2gdl.pl can process a local file or fetch a URL directly:

> perl html2gdl.pl --url=http://yahoo.com/ --graph=yahoo.gdl
> perl html2gdl.pl --file=localfile.html --graph=yahoo.gdl

The graph definition will be written to the file given by --graph (note: an existing file will be overwritten), and tag statistics will be displayed (the Top 10 most used tags).
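
For readers curious how such a Top 10 list can be computed, here is a minimal Python sketch using only the standard library. It is not the script’s Perl implementation (which relies on HTML::TreeBuilder); the names TagCounter and top_tags are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count every start tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case; upper-case them for display.
        self.counts[tag.upper()] += 1


def top_tags(html, n=10):
    """Return the n most frequent tags as (tag, count) pairs."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts.most_common(n)


sample = "<html><body><div><p>a</p><p>b</p></div><div></div></body></html>"
for tag, count in top_tags(sample):
    print("%-6s %d" % (tag, count))
```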

The script uses two Perl modules from CPAN: LWP::Simple and HTML::TreeBuilder. If you don’t have these modules installed (the script refuses to work :) ), installation is straightforward; run the following commands:

> cpan HTML::TreeBuilder
> cpan LWP::Simple

I’ll briefly outline the main html2gdl.pl features and then provide a case study.

HTML2GDL features

The configuration options of the application can be divided into two groups: 1) HTML processing and 2) visual effects.

1. HTML Processing: specifying how tags are processed during graph construction:

  • remove-tags: tags and their descendants will not be displayed;
  • ignore-tags: these tags and their descendants will be colored white; this is useful when you want to blur a list of unimportant tags;
  • fold-tags: descendant tags will not be displayed, although the tag itself will be visible. You may fold entire tables or just TRs;
  • flatten-tags: tags are removed, but their descendants will be linked to the ancestors of the removed tags.

You can specify a starting level for the options above, i.e. only tags at the specified level and below will be affected.
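
The difference between removing, folding and flattening is easier to see on a toy tree. The following Python sketch is my own illustration, not the Perl code of html2gdl.pl: the Node class, the transform and outline functions and the single start_level parameter are assumptions made for the example, and ignore-tags is left out because it only affects coloring.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def transform(node, remove=(), fold=(), flatten=(), level=0, start_level=0):
    """Return the list of nodes that take the place of `node` in its parent."""
    active = level >= start_level          # options apply at this level and below
    if active and node.tag in remove:
        return []                          # tag and all descendants disappear
    if active and node.tag in fold:
        return [Node(node.tag)]            # tag stays visible, descendants hidden
    flattened = active and node.tag in flatten
    # Children of a flattened tag move up one level, so they keep `level`.
    child_level = level if flattened else level + 1
    kids = []
    for child in node.children:
        kids.extend(transform(child, remove, fold, flatten, child_level, start_level))
    return kids if flattened else [Node(node.tag, kids)]


def outline(node, depth=0):
    """Render the tree as indented text, one tag per line."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


page = Node("HTML", [
    Node("HEAD", [Node("TITLE")]),
    Node("BODY", [Node("CENTER", [Node("TABLE", [Node("TR", [Node("TD")])])])]),
])
(root,) = transform(page, remove={"HEAD"}, flatten={"CENTER", "TR"})
print("\n".join(outline(root)))
```

Here HEAD vanishes together with TITLE, while CENTER and TR are dropped and their children are attached one level higher, leaving HTML > BODY > TABLE > TD.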

2. Visual Effects: different node coloring methods and control over the size (big or small) of the nodes. There are three approaches to node coloring: by tag, by size and by level:

  • tag: a colorMap is provided that specifies colors for tag groups. The color legend is provided below.
  • size: the size of a node is the content length of its descendants (i.e. the length of the inner text without html tags). The color of the node will be in the [ColorStart, ColorEnd] interval: ‘fat’ nodes will have colors close to ColorStart, thin nodes will have a color near ColorEnd.
  • level: similar to size, but the level of the node in the hierarchy is used instead of the size.
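
As an illustration of size-based coloring, here is a minimal Python sketch. It assumes a plain linear blend between the two colors and normalizes by the largest node size; the default RGB values and the names lerp_color and color_by_size are hypothetical, and the real script may scale differently.

```python
def lerp_color(t, color_start, color_end):
    """Blend two RGB triples; t=1.0 gives color_start, t=0.0 gives color_end."""
    return tuple(round(e + (s - e) * t) for s, e in zip(color_start, color_end))


def color_by_size(size, max_size, color_start=(0, 0, 128), color_end=(255, 255, 255)):
    """'Fat' nodes land near color_start, thin ones near color_end."""
    t = size / max_size if max_size else 0.0
    return lerp_color(t, color_start, color_end)


print(color_by_size(2000, 2000))  # the fattest node: (0, 0, 128)
print(color_by_size(0, 2000))     # an empty node: (255, 255, 255)
```

Coloring by level would follow the same pattern with the node’s level in place of its size.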

Color Legend (for tag coloring):

  • blue: A
  • green: DIV
  • magenta: IMG
  • darkgrey: P, BR, I, S, U, BLOCKQUOTE, STRONG, … (text tags)
  • yellowgreen: UL, OL, LI, DL, DT, DD, DIR, MENU
  • cyan: H1, H2, … H6
  • orange: TABLE, TR, TD, … (table tags)
  • yellow: FORM, INPUT, SELECT, … (form tags)
  • red: APPLET, SCRIPT, OBJECT, … (external resources)
  • lightgrey: all other tags
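
The legend amounts to a lookup table with a fallback. Here is a Python sketch of that idea; it lists only the tags named explicitly in the legend, so tags hidden behind the “…” fall through to lightgrey here even though the real colorMap in html2gdl.pl covers them. COLOR_GROUPS, TAG_COLOR and color_for are names invented for the example.

```python
# Tag groups exactly as named in the legend (elided tags are not guessed).
COLOR_GROUPS = {
    "blue": ["A"],
    "green": ["DIV"],
    "magenta": ["IMG"],
    "darkgrey": ["P", "BR", "I", "S", "U", "BLOCKQUOTE", "STRONG"],
    "yellowgreen": ["UL", "OL", "LI", "DL", "DT", "DD", "DIR", "MENU"],
    "cyan": ["H1", "H2", "H3", "H4", "H5", "H6"],
    "orange": ["TABLE", "TR", "TD"],
    "yellow": ["FORM", "INPUT", "SELECT"],
    "red": ["APPLET", "SCRIPT", "OBJECT"],
}

# Invert the groups into a direct tag -> color mapping.
TAG_COLOR = {tag: color for color, tags in COLOR_GROUPS.items() for tag in tags}


def color_for(tag):
    """Look up the color of a tag; anything unlisted is lightgrey."""
    return TAG_COLOR.get(tag.upper(), "lightgrey")


for tag in ("a", "td", "h4", "span"):
    print(tag, color_for(tag))
```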

By default, the nodes are represented as circles; the radius can be fixed or dynamically computed according to node size or level.

Refer to html2gdl.pl for the full list of configuration options.

HTML2GDL Usage

Displaying all HTML tags might be overwhelming, especially for large pages. Rather than just a nice colored graph, my intention was to create a simplified view of HTML files. At the same time, the graph should be more informative. For this purpose the node attributes color and size were used.

First, let’s create the full graph (all HTML tags are shown) of http://yahoo.com/ but with a varying node radius: the more content (pure text with HTML tags stripped off) a node has, the bigger its radius (--node-radius=size). The edges at lower levels are shorter (this is controlled by --priority-factor).

> perl html2gdl.pl --url=http://yahoo.com/ --priority-factor=0.5 --node-radius=size --attraction=50 --repulsion=35 --graph=yahoo.gdl

GDL source
The resulting graph is displayed in the middle image. The HtmlGraph image on the left is provided for comparison. The right image shows the same graph as the middle one, but using the minbackward layout algorithm:

> perl html2gdl.pl --url=http://yahoo.com/ --layout=minbackward --priority-factor=0.5 --node-radius=size --attraction=50 --repulsion=35 --graph=yahoo.gdl

GDL source

IMHO, the last graph provides a better insight into the html structure. It clearly depicts where most of the content is located in the HTML document, under which tags and at which levels.

yahoo.com (HtmlGraph)

yahoo.com (HTML2GDL)

yahoo.com as tree

Notice: the HTML of the yahoo.com webpage is different when accessed through a browser. If you run wget http://yahoo.com you’ll get the same page as html2gdl.pl gets. I found this out when I wondered why Yahoo had designed its homepage using TABLE elements instead of DIVs.

Although useful, the graphs are cluttered with tags that add a lot of noise: the B, FONT and BR tags, for example. To generate a lighter graph, we’ll remove the HEAD, SPACER and BR tags and flatten the B, FONT, TR and CENTER tags. TRs are flattened in order to reduce the depth of the tree. CENTER is a child of the BODY tag and doesn’t carry any structural information. (Flattening means that tags are removed but their descendants are linked to the ancestors of the removed tags.) Finally, let’s give our graph a new look: --node-radius=level --node-color=size. (The node radius decreases for lower levels. The darker the node, the more content (text without HTML tags) it has inside; white nodes are empty.)

> perl html2gdl.pl --url=http://yahoo.com/ --graph=yahoo.gdl --layout=minbackward --node-radius=level --node-color=size --remove-tags='HEAD, SPACER, BR' --flatten-tags='B, FONT, TR, CENTER'

GDL source

The middle graph uses --node-radius=size only (GDL source). The right graph is a manual modification of the middle graph: I removed the first two nodes and their corresponding edges. The result is quite interesting: it highlights the main sections of the HTML page and their structure.

--node-color=size --node-radius=level

--node-radius=size

Dissected graph

Let’s examine another webpage: http://www.ibm.com/

> perl html2gdl.pl --url=http://ibm.com --graph=ibm.gdl

GDL source (The image on the left).

The same graph with --layout=minbackward is displayed in the middle. While working with the graph in aiSee, you can pan and zoom in and out. Each node has a label: the tag plus its class/id attributes. The label can be seen when zooming in, or as a tooltip when moving the mouse over the node. The middle graph indicates that there are a lot of UL and LI tags in the HTML. Let’s fold the UL tags, but only from the 7th level downward (because we want to keep the upper UL that has a FORM inside). The HEAD tag will be removed.

> perl html2gdl.pl --url=http://ibm.com --graph=ibm.gdl --layout=minbackward --fold-tags='UL' --fold-tags-sl=7 --remove-tags='HEAD'

GDL source (The image on the right).

www.ibm.com

www.ibm.com Tree

www.ibm.com Filtered Tree

Let’s add the final touch to the picture. The BODY tag will be flattened, the SCRIPT tags removed, and the --node-radius=level --node-color=size options added. Notice that flattening changes the level numbering of descendant tags, which is why --fold-tags-sl=6 is specified.

> perl html2gdl.pl --url=http://ibm.com --graph=ibm.gdl --fold-tags='UL' --fold-tags-sl=6 --remove-tags='HEAD, SCRIPT' --flatten-tags='BODY' --node-radius=level --node-color=size

GDL source

ibm.com

This article doesn’t cover all the features; I’ve outlined the most important ones. Refer to the html2gdl.pl script for the available options and documentation. Nevertheless, I’ll mention the --debug=1 parameter: besides the Top 10 tags list, the whole HTML skeleton will be printed. ID and class attributes are also shown, e.g. li#ibm-country.ibm-first. Note that unknown HTML tags are ignored. (Now I think that it might be useful to highlight deprecated HTML tags.)
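
The label format in that example reads like a CSS selector: tag, then #id, then .class. A small Python sketch of how such a label could be assembled follows; node_label is a hypothetical helper, and the handling of several classes (one dot-prefixed entry each) is my assumption rather than something taken from the script.

```python
def node_label(tag, attrs):
    """Build a selector-like label such as li#ibm-country.ibm-first."""
    label = tag.lower()
    if attrs.get("id"):
        label += "#" + attrs["id"]
    # Assumption: each class in a space-separated list gets its own ".name".
    for cls in (attrs.get("class") or "").split():
        label += "." + cls
    return label


print(node_label("LI", {"id": "ibm-country", "class": "ibm-first"}))
# li#ibm-country.ibm-first
```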

Conclusion

My intention was to create a tool that reduces the time spent evaluating HTML structures. Now I realize that it could be rewritten to be even more flexible :) For example, the --remove-tags, --fold-tags etc. parameters could be replaced with an XPath counterpart: you probably don’t want all DIVs to be folded, just those with a specific class. Furthermore, I think it would be useful to specify the root tag to start from: html > body > div#sidebar > table[3], i.e. show me the graph of the 3rd table underneath the DIV with the id sidebar. But, having looked at the problem from another point of view, I don’t think anyone will want to use this tool as a magnifying glass to inspect HTML sources. The aim is to provide a good overview.

I should learn more about the forcedir layout algorithm. Maybe the graph attributes can be adjusted further. I’ll say more: at the beginning I couldn’t understand why the graphs were so ugly (compared to HtmlGraph). I was thinking about nesting each level into a separate subgraph and adjusting attraction/repulsion accordingly. Fortunately, I inadvertently discovered that increasing the tempmin attribute yields better results.

I hope that this application will be helpful for those who use the “view source” button quite often :)
