Findings to ponder
Now that my Community Bonding period has started, my mentor asked me to break the problem into smaller ones and create issues for them, so that anyone can track the progress of the project through them. If you haven't visited my project's wiki page yet, please do so to get the latest details.

That said, my first aim was to build the CAPTCHA Monitor code locally so I could experiment with it, but due to some errors I couldn't get it to compile. Thanks to my mentor, though, I did come to understand the basic architecture of the code. He then asked me to start experimenting with smaller scripts that would soon become the base on which the new code would be developed.
So, while writing these scripts I ran my trials using the requests library, which has a handy `status_code` attribute that returns the status code of the queried website. For the Google Search example it returns `429` (with no reason phrase) over Tor but `200 OK` outside Tor.

This can be seen here:
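The trial above can be sketched as follows. This is a minimal illustration, assuming a Tor client is running locally with its SOCKS proxy on the default port 9050; the helper name `fetch_status` is mine, not from the project.

```python
import requests

# Route traffic through the local Tor SOCKS proxy (default port 9050).
# socks5h:// makes DNS resolution happen through Tor as well.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_status(url, use_tor=False):
    """Return the HTTP status code for url, optionally through Tor."""
    proxies = TOR_PROXIES if use_tor else None
    response = requests.get(url, proxies=proxies, timeout=30)
    return response.status_code

# Example (requires a running Tor client):
# fetch_status("https://www.google.com/search?q=tor")                # e.g. 200
# fetch_status("https://www.google.com/search?q=tor", use_tor=True)  # e.g. 429
```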
Now I'm using Selenium Wire, because it works on top of Selenium and adds some more powerful features (plain Selenium doesn't expose status codes) to what the requests library offers. So Selenium Wire now returns me the entire dump of request status codes, i.e. the status codes of every component of a website, including its images.
At first I wondered what would happen if requests for components were made in a different order, so I used a `dict` to check whether a request path is present in both runs and then compare the difference in status codes.
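The dict-based comparison might look something like this. It's a hypothetical sketch: I'm assuming the captured requests are available as `(path, status_code)` pairs, and the function name is mine.

```python
def diff_status_codes(tor_requests, nontor_requests):
    """Compare status codes per request path across the two browser runs.

    Both arguments are lists of (path, status_code) pairs. Using a dict
    keyed by path makes the comparison order-independent.
    """
    nontor = dict(nontor_requests)  # path -> status code
    differences = {}
    for path, tor_code in tor_requests:
        if path in nontor and nontor[path] != tor_code:
            differences[path] = (tor_code, nontor[path])
    return differences

print(diff_status_codes(
    [("/index.html", 200), ("/scripttemplates/otSDKStub.js", 403)],
    [("/index.html", 200), ("/scripttemplates/otSDKStub.js", 200)],
))
# {'/scripttemplates/otSDKStub.js': (403, 200)}
```

Note that this is exactly the approach that breaks when a component simply fails to load in one of the runs, as described next.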
But after running it, I realized the blunder I had made: what if just a particular image fails to load? The approach above fails in that case. I also realized that the first request a client makes to a website is to its root, and only then to the different components. A real-life example I ran into was Netflix.
The `/scripttemplates/otSDKStub.js` file gets two different status codes from the two browsers: `403` from Tor and `200` from non-Tor. This shows that the technique isn't reliable.
Now I plan to check only the very first status code, and only for the error codes, i.e. 4xx (client errors) and 5xx (server errors, the ones `raise_for_status()` raises for) rather than all of them, since we don't want websites that redirect to other pages (GDPR, consent pages, geo-IP-based content) to be labeled as blocking Tor.
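In code, the idea of flagging only the error classes (and leaving 3xx redirects for closer inspection) could look like this sketch; the function name is my own illustration.

```python
def looks_blocked(first_status_code):
    """True only for 4xx client errors and 5xx server errors.

    3xx redirects are deliberately excluded: they may be GDPR/consent
    pages or geo-IP redirections, not Tor blocking.
    """
    return 400 <= first_status_code < 600

print(looks_blocked(429))  # True  - likely blocking Tor
print(looks_blocked(302))  # False - a redirect, needs further inspection
print(looks_blocked(200))  # False - fine
```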
That being said, the issues I faced were:
- `http://` websites, as I was using `https://`. Citing an example below, while I was going through this particular website:
Client | http://adsabs.harvard.edu | https://adsabs.harvard.edu
---|---|---
Non-Tor | (screenshot) | (screenshot)
Tor | (screenshot) | (screenshot)
Clicking the links above, you'll notice that the browser also loads the HTTP version. I tried setting Untrusted/Insecure Certificates to true, but that didn't work.
- The script I'm using keeps trying to get status codes until all components have been fetched. For websites like Reddit, which keep loading slowly, it throws `Timeout loading page after 300000ms`. So I plan to break out of the loop before the error is reached, or use a try/except block to proceed.
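The try/except plan might be sketched like this. To keep the example self-contained, `load_page` and Python's built-in `TimeoutError` stand in for the real Selenium page load and its `TimeoutException`; the names are mine.

```python
def collect_requests(load_page):
    """Load a page, keeping whatever was captured if the load times out."""
    try:
        return load_page()
    except TimeoutError:
        # A slow page (e.g. Reddit) kept loading past the limit; proceed
        # with the components fetched so far instead of crashing.
        return []

def slow_page():
    # Stand-in for a page that never finishes loading.
    raise TimeoutError("Timeout loading page after 300000ms")

print(collect_requests(slow_page))  # []
```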
Hence, my final logic for checking a website using status codes would be:

- Check `request[0]`, and if `Tor == 4xx` or `Tor == 5xx`: break (blocked website).
- Check `request[0]`, and if `Tor == 3xx`: check for the redirection (it could be a captcha redirection, a safe redirection, or a GDPR consent page).
  - Click all possible translations of "Accept", "Ok", etc.
  - Check the request paths to see if they contain `captcha`. If they do, the website may contain a captcha (high possibility); else proceed further.
  - Use DOM tools and the Consensus Module.
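The decision logic above can be roughed out as follows. This is a sketch under my own assumptions: `requests_log` is the ordered list of `(path, status_code)` pairs captured through Tor, with the root document first, and the "Accept"-clicking, DOM tools, and Consensus Module steps are only noted as comments.

```python
def classify(requests_log):
    """Classify a site from its Tor-side request log (root request first)."""
    path, code = requests_log[0]
    if 400 <= code < 600:
        return "blocked"           # 4xx/5xx on the root: break, blocked website
    if 300 <= code < 400:
        # Redirect: could be a captcha redirection, a safe redirection,
        # or a GDPR consent page. Here one would click translations of
        # "Accept"/"Ok" and fall back to DOM tools / Consensus Module.
        if any("captcha" in p for p, _ in requests_log):
            return "possible captcha"
        return "redirect - inspect further"
    return "ok"

print(classify([("/", 429)]))                             # blocked
print(classify([("/", 302), ("/captcha/img.png", 200)]))  # possible captcha
print(classify([("/", 200)]))                             # ok
```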
The script and the files for the ticked bullets can be found here: Github Link. Any suggestions are welcome.

Thanks!!
Edit:
The `ProtocolException` error I faced was because I was forcing `https` requests on websites that only support `http` requests, thereby getting the error shown in the image.
I wasn't using `http` by default because my ISP randomly injects advertisements into `http` websites (DNS poisoning). I had been using Cloudflare DNS before, but it seems it isn't working in my case. So today I asked on IRC and got answers like running my own recursor or using DNS-over-TLS/HTTPS (DoT/DoH), but I suspect my ISP intercepts my DNS traffic too (which opens up another topic for research). I came across NextDNS (DoT with a blacklist), and it seems to be working as of now :)
Meanwhile, grepping (`grep -ril "captcha" | grep -v "non-tor" | grep "txt"`) in my test_run2 folder, which contains a separate folder for each website (for easy debugging), I noticed `captcha` appearing in my request paths. This might help distinguish websites returning captchas from those that don't, as some of the images do suggest captchas on the respective websites. I'll explore this in the next episode.