Hi All,

It's been quite a few days since my last blog post. To summarize, the things that happened during this period were:

  • I finalized and made changes to my draft proposal (we discussed the DOM characteristics of websites and how effective relying on them would be).
  • I wrote a Fetcher that records the number of DOM nodes and the total HTML length of a page, to put things in perspective (a minimal sketch of the idea follows below).

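Since the Fetcher's code isn't included in this post, here is only a minimal sketch of the idea, assuming plain Selenium with Firefox; the function name, setup, and URL are placeholders rather than the actual code:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

def fetch_stats(url):
    """Fetch a page and return (total HTML length, total DOM nodes)."""
    driver = webdriver.Firefox()  # for Tbb, a Tor-Browser-driven session would be used instead
    try:
        driver.get(url)
        html_length = len(driver.page_source)
        # Count every element currently present in the DOM.
        dom_nodes = len(driver.find_elements(By.XPATH, "//*"))
        return html_length, dom_nodes
    finally:
        driver.quit()

if __name__ == "__main__":
    length, nodes = fetch_stats("https://www.wikipedia.org")
    print(f"Total HTML length: {length}, total DOM nodes: {nodes}")
```
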
That said, there are quite a few things I observed, and this post is about them.

Certain websites like Google and YouTube, when opened through a Selenium-driven browser, display a popup (iframe). This makes comparison difficult[1]. For some websites like Lloyds Bank and Dan.me.uk, sites that display error pages, it works. Click here to see the results. Here we can see the clear difference between the nodes returned from the Normal browser and Tbb for Lloyds Bank: 2135 and 84 respectively, giving it a score of 2441.6%, which shows that the page returned to Tbb might not contain nearly as much information as the one returned to the Normal Browser (suggesting that the website might have served an error page).
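
For reference, that score is just the percentage difference computed from the two node counts (the same formula I settle on below):

$$ \frac{2135 - 84}{84} \times 100 \approx 2441.6\% $$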

Another thing I noticed concerns the website adsabs.harvard.edu. It usually returns http://adsabs.harvard.edu/cgi-bin/access_denied for Tbb, but when I fetched the website this time it didn't return the ADS Access Denied page. The data, too, shows a similar number of nodes returned from both browsers.

So this works for sites that show error pages, but for some dynamic websites that change content according to region/IP, like Bing, it doesn't give satisfactory data. Another such example is Domino's Pizza, which in Tor redirects to a Domino's location page from which you can go to your desired location's website; this doesn't work for all locations (for India and Bangladesh it shows a "403 ERROR" Cloudflare error), whereas in a normal browser it redirects straight to the specific area. For that reason Domino's returned different HTML data, and thus we can't operate on sites like these.

- A bit more on the Cookie iframe/Popup:

[Screenshot: tor_google.com]

Solutions I could think of:

  • This kind of cookie popup hinders results, so one possible way is to programmatically dismiss these iframes by letting Selenium click the Allow button (not displayed in the screenshot). But different websites might have different iframes, and we might have to write a new case for each new website encountered with these popups (see the sketch after this list).

  • The other way I can think of to deal with this would be to reuse a cookie from the Non-Tor Selenium browsing, because if it were Selenium/automation itself that triggered this cookie popup, the Non-Tor fetches should have generated a similar screenshot. But no, they don't (at least I haven't faced it yet). So a bit more insight needs to be gathered on this point.
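
For the first option, here is a minimal sketch of what the per-site clicking could look like; the locator table and the XPath are hypothetical, and sites whose popup lives inside an iframe would additionally need driver.switch_to.frame(...) first:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Hypothetical per-site table of consent-button locators; every new site
# with a popup needs its own entry, which is exactly the drawback above.
CONSENT_BUTTONS = {
    "google.com": (By.XPATH, "//button[contains(., 'I agree')]"),
}

def dismiss_consent(driver, site):
    """Click the site's consent button if we know one; otherwise do nothing."""
    locator = CONSENT_BUTTONS.get(site)
    if locator is None:
        return
    try:
        WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable(locator)
        ).click()
    except Exception:
        pass  # No popup this time, or the locator no longer matches.
```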

- Difference between total DOM nodes:

What if the difference in total DOM nodes is large but the website still doesn't generate errors? Bing, for example, during one trial returned the following data:

|                      | Tor   | Non-Tor |
|----------------------|-------|---------|
| Total length of HTML | 63699 | 68332   |
| Total DOM nodes      | 187   | 227     |

The Screenshots generated were:

Non-Tor Screenshot:

[Screenshot: non-tor_bing.com]

Tor Screenshot:

[Screenshot: tor_bing.com]

Here, the noticeable differences in the screenshots are the Languages bar and the cookie permission prompt. So, in these types of cases, problems might arise. And a question could be asked here: what is a large difference? How much of a difference between the nodes returned by Tbb and the Normal Browser can be justified?

[Update: on running the code for Bing 2-3 times I saw some noticeable variation: 229 nodes for the non-Tor browser and 233 nodes for Tbb. This might be down to a different exit node.]

This suggests testing a website multiple times from multiple relays to get an average result and then deciding based on those values; I hope this would remove any inconsistency. That said, as we saw in the Bing case, where Tbb generated fewer nodes, we need to determine what calculated value of the difference we could safely omit.

Mathematically speaking:

  X (VALUE TO BE OMITTED) = DOM_nodes_Non_Tbb(website) - DOM_nodes_Tbb(website) 
It isn't always a case like Wikipedia or LinkedIn, where the number of nodes is the same; there are also cases like Google, which presents a cookie popup (if one wants to browse with their cookies saved), etc. So finding a suitable value for X should help minimise errors and also detect whether sites return error pages.
For finding X, I'm leaning towards calculating the percentage difference:

$$ X = \frac{\text{DOM\_nodes}_{\text{Non-Tbb}} - \text{DOM\_nodes}_{\text{Tbb}}}{\text{DOM\_nodes}_{\text{Tbb}}} \times 100 $$

and check websites that return an error page to answer: what percentage difference do these websites produce? I'll try testing websites from here to get these answers.
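
As a quick sketch, the numbers from above plugged into that formula (the function is just illustrative):

```python
def percentage_difference(nodes_non_tbb, nodes_tbb):
    """X = (Non-Tbb nodes - Tbb nodes) / Tbb nodes * 100."""
    return (nodes_non_tbb - nodes_tbb) / nodes_tbb * 100

# Lloyds Bank case from above: ~2441.6%, a strong error-page signal.
print(percentage_difference(2135, 84))

# Bing case: ~21.4%, a much smaller difference we may want to tolerate.
print(percentage_difference(227, 187))
```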

Another check that could be added on top of the doubtful cases would be searching for keywords in the generated HTML of both the Tbb and the Non-Tor browser (this might be helpful in cases that would otherwise be misleading).
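
A minimal sketch of that keyword check; the keyword list here is a guess, and a real one would come from surveying actual error/block pages:

```python
# Hypothetical keywords seen on typical error/block pages.
ERROR_KEYWORDS = ["access denied", "403 forbidden", "attention required"]

def looks_like_error_page(html):
    """Return True if the HTML contains any known error-page keyword."""
    lowered = html.lower()
    return any(keyword in lowered for keyword in ERROR_KEYWORDS)
```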

What are your thoughts? Do share; any suggestions would be welcome!

Thanks for reading :)


Update:

  • To make a site's cookie work, we need to import it along with a profile and also manually accept the consent popup, which amounts to the same thing as accepting it via XPath or any other method. So creating one for each different website could be an option?
  • Another inconsistency can be noticed between these two files: Tor and Non Tor. For cases like these, we'll have to generate reports on the same websites multiple times from different relays and then label a site as an error website (if the average nodes returned from the Tor browser < the average nodes returned from the Normal browser); a rough sketch follows below.
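
As a rough sketch of that averaging-and-labelling idea (the node counts and the threshold are placeholders, not measured values):

```python
from statistics import mean

def label_site(tor_node_counts, non_tor_node_counts, threshold_pct):
    """Average node counts over several fetches from different relays,
    then label the site as an error site if Tor's average falls short
    of the Non-Tor average by more than threshold_pct percent."""
    avg_tor = mean(tor_node_counts)
    avg_non_tor = mean(non_tor_node_counts)
    diff_pct = (avg_non_tor - avg_tor) / avg_tor * 100
    return "error" if diff_pct > threshold_pct else "ok"

# Placeholder numbers: three fetches per browser, 50% tolerance.
print(label_site([84, 90, 88], [2135, 2100, 2150], threshold_pct=50))
```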