Pre-GSoC
Hi All,
It has been quite a few days since my last blog post. Summarizing what happened during this period:
- I finalized my draft proposal and made changes to it (we discussed the DOM characteristics of websites and how effective comparing them would be).
- I wrote a Fetcher that records the number of DOM nodes and the total length of the HTML, to put things in perspective.
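This is not the actual Fetcher code, but a minimal, browser-independent sketch of the two measurements, assuming the fetched page is already available as an HTML string (in practice it would come from Selenium's `driver.page_source`):

```python
from html.parser import HTMLParser

class NodeCounter(HTMLParser):
    """Counts element (tag) nodes in an HTML document."""
    def __init__(self):
        super().__init__()
        self.nodes = 0

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag (and, by default, for
        # self-closing tags as well), so this counts element nodes.
        self.nodes += 1

def page_stats(html: str) -> dict:
    """Return the total HTML length and the element-node count for a page."""
    counter = NodeCounter()
    counter.feed(html)
    return {"html_length": len(html), "dom_nodes": counter.nodes}
```

With Selenium, `page_stats(driver.page_source)` would give the two numbers compared throughout this post.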
That said, I observed quite a few things, and this post is about them.
Certain websites like Google and YouTube, when opened through the Selenium browser, display a popup (iframe). This makes comparison difficult[1]. For websites that display error pages, like Lloyds Bank and Dan.me.uk, the approach works. Click here to see the results. There we can see a clear difference between the node counts returned by the normal browser and Tbb for Lloyds Bank: 2135 and 84 respectively, giving a score of 2441.6%. This tells us that the page returned to Tbb probably does not contain as much information as the one returned to the normal browser (suggesting that the website served an error page).
Another thing I noticed concerns adsabs.harvard.edu. It usually returns http://adsabs.harvard.edu/cgi-bin/access_denied for Tbb, but this time the fetch did not return the ADS Access Denied page. The data, too, suggests a similar number of nodes was returned.
That works for sites that show error pages, but for dynamic websites that change content according to region/IP, like Bing, it does not give satisfactory data. Another such example is Domino's Pizza, which in Tor redirects to Domino's Loc, from where you can go to your desired location's website; this does not work for all locations (for India and Bangladesh it shows a Cloudflare "403 ERROR"), whereas the normal browser redirects straight to the specific area. For that reason Domino's returned different HTML data, and we cannot operate on cases like these.
- A bit more on the Cookie iframe/Popup:
Solutions I could think of:
- This kind of cookie popup hinders the results, so one possible way is to remove these iframes programmatically by letting Selenium click the Allow button (not displayed in the screenshot). However, different websites may have different iframes, so we might have to write a new case for each new website encountered with these popups.
- The other way of dealing with this would be to reuse a cookie from non-Tor Selenium browsing, because if using Selenium or automation is what triggers the cookie popup, non-Tor fetches should have produced a similar screenshot. But no, they don't (at least I haven't seen one yet), so more insight needs to be gathered on this point.
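The per-site cases from the first option could be organized as a small lookup table mapping each known site to the XPath of its consent button. This is only a sketch; the site names and XPaths below are hypothetical placeholders, not verified selectors:

```python
from urllib.parse import urlparse

# Hypothetical per-site XPath selectors for the consent "Allow"/"Accept"
# button. Every new site with a popup needs its own entry, which is exactly
# the maintenance burden described above.
CONSENT_SELECTORS = {
    "example-search.com": "//button[@id='accept-all']",      # placeholder
    "example-video.com": "//button[contains(., 'Allow')]",   # placeholder
}

def consent_selector(url: str):
    """Return the XPath for a known site's consent button, else None."""
    host = urlparse(url).netloc.lower()
    for site, xpath in CONSENT_SELECTORS.items():
        if host == site or host.endswith("." + site):
            return xpath
    return None
```

When `consent_selector(url)` returns a selector, the Selenium driver could attempt `driver.find_element(By.XPATH, sel).click()` inside a try/except before taking measurements.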
- Difference between total DOM nodes:
What if the difference in total DOM nodes is large but the website still doesn't generate errors? Bing, for example, produced the following data during one trial:
| | Tor | Non-Tor |
|---|---|---|
| Total length of HTML | 63699 | 68332 |
| Total DOM nodes | 187 | 227 |
The Screenshots generated were:
Non-Tor Screenshot:
Tor Screenshot:
Here, the noticeable differences in the screenshots are the Languages: bar and the cookie permission. In these kinds of cases, problems might arise. A question that could be asked here is: what is a large difference? How much of a difference between the nodes returned by Tbb and the normal browser can be justified?
[Update: on running the code for Bing 2-3 times, I saw some noticeable differences: 229 nodes for the non-Tor browser and 233 nodes for Tbb. This might be down to a different exit node.]
Mathematically speaking:
X (the value used to decide omission) = DOM_nodes_Non_Tbb(website) - DOM_nodes_Tbb(website)
For finding X, I'm leaning towards calculating the percentage difference:
$$ X = \frac{\mathrm{nodes}_{\text{Non-Tbb}} - \mathrm{nodes}_{\text{Tbb}}}{\mathrm{nodes}_{\text{Tbb}}} \times 100 $$
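A direct translation of this formula into code; plugging in the Lloyds Bank counts from earlier reproduces (up to rounding) the 2441.6% score mentioned above:

```python
def node_diff_percent(nodes_non_tbb: int, nodes_tbb: int) -> float:
    """Percentage difference X between node counts, relative to the Tbb count."""
    return (nodes_non_tbb - nodes_tbb) / nodes_tbb * 100
```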
Another thing that could be added on top, for the doubtful cases, would be searching the generated HTML for keywords (which might help in cases that could otherwise be misleading), for both Tbb and the non-Tor browser.
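A sketch of that keyword check; the keyword list here is a made-up starting point and would need tuning against real error pages:

```python
# Hypothetical error-page keywords; real ones would be chosen from the
# error pages actually observed (Cloudflare blocks, access-denied pages, ...).
ERROR_KEYWORDS = ["access denied", "403", "captcha", "blocked"]

def suspicious_keywords(html: str) -> list:
    """Return the error-page keywords found in the page HTML (case-insensitive)."""
    text = html.lower()
    return [kw for kw in ERROR_KEYWORDS if kw in text]
```

Running this on both the Tbb and non-Tor HTML, and comparing the hits, could flag pages where the node counts alone are inconclusive.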
What are your thoughts? Do share; any suggestions are welcome!
Thanks for reading :)
Update:
- To make a site's cookie work, we need to import it with the profile and also manually accept the consent popup, which does the same thing as accepting it with XPath or any other method. So creating this for different websites could be an option?
- Another inconsistency can be seen between these two files: Tor and Non Tor. For cases like these, we'll have to generate reports on the same websites multiple times from different relays, and then label a site as an error website if the average node count returned from the Tor browser is less than the average returned from the normal browser.
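That labelling rule can be sketched as a small helper, assuming we already have node counts from several fetches per browser (function and parameter names are mine, not from the actual code):

```python
def is_error_site(tor_node_counts, non_tor_node_counts) -> bool:
    """Label a site as an error site if the average node count over several
    Tor fetches is below the average over normal-browser fetches."""
    avg_tor = sum(tor_node_counts) / len(tor_node_counts)
    avg_non_tor = sum(non_tor_node_counts) / len(non_tor_node_counts)
    return avg_tor < avg_non_tor
```

For the Lloyds Bank-style case the rule fires; for the Bing update above (233 for Tbb vs 229 for non-Tor) it does not, which matches the intuition that Bing was not serving an error page.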