Is this relatively simple to create or other? Image location scrambler thingy.

3 posts by 2 authors in: Forums > CMS Builder
Last Post: May 8, 2023

By Dave - May 8, 2023

Hi Codee, 

There is no perfect solution to this issue, as browsers need to download images. A bot that perfectly replicates browser behaviour could still download them. However, there are ways to make it challenging for many scrapers.

One method involves requiring an HTTP_REFERER value, which is the URL of the page that linked to or referred to the current page or image. Simple bots often request a file directly without sending this value. You can filter based on this field (with PHP or .htaccess) to display a different image or an error if there is no referrer.
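
For example, a minimal PHP sketch of that referrer check might look something like this (the domain and file path are placeholders, and keep in mind that some clients spoof or omit the referrer entirely):

    <?php
    // Minimal sketch of a referrer check for an image served through a PHP script.
    // $imagePath and the example.com domain below are placeholders.
    $imagePath = '/path/outside/webroot/photo.jpg';
    $referer   = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

    // Only serve the image when the request was referred by a page on this site.
    if (strpos($referer, 'https://www.example.com/') !== 0) {
        header('HTTP/1.1 403 Forbidden');
        exit('Direct requests and hotlinking are not allowed.');
    }

    header('Content-Type: image/jpeg');
    readfile($imagePath);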

Another approach is to filter based on the HTTP_USER_AGENT value. All browsers send their name and version when making a web request, and well-behaved bots should include something that identifies their source. You can find more information about this here: https://datadome.co/threat-research/how-chatgpt-openai-might-use-your-content-now-in-the-future/
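
A rough PHP sketch of a user-agent filter (the bot names in the list are only examples; check each crawler's documentation for the exact string it sends):

    <?php
    // Sketch of a user-agent filter. The names below are examples only.
    $userAgent     = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $blockedAgents = array('GPTBot', 'CCBot', 'Bytespider');

    foreach ($blockedAgents as $agent) {
        if (stripos($userAgent, $agent) !== false) {
            header('HTTP/1.1 403 Forbidden');
            exit('Access denied.');
        }
    }
    // ...otherwise continue serving the page or image as usual.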

Additionally, you can ban access from blocks of IP addresses. If you know a certain service is scraping your site (or might) and you know their IP range, you can block it. However, be cautious not to exclude valid search engine indexing bots.
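
A simple PHP version of an IP check could look like this (203.0.113.* is a documentation-only range; substitute the real address block you want to exclude):

    <?php
    // Sketch of a simple IP block using a documentation-only example range.
    $remoteIp = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';

    if (strpos($remoteIp, '203.0.113.') === 0) {
        header('HTTP/1.1 403 Forbidden');
        exit('Access denied.');
    }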

Other ideas to consider:

  • Disallowing bots with robots.txt (see the example after this list)
  • Loading images with JavaScript to prevent access by bots that can't use JavaScript
  • Watermarking your images to make them traceable and less useful
  • Limiting the rate of requests so scrapers and bots can't download everything all at once
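
For the robots.txt idea, a minimal example that asks a couple of known AI crawlers not to fetch anything (the user-agent names are examples, and only well-behaved bots will honour it):

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /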

Hope that helps. The best solution might vary based on what you're trying to protect and why, but there are a few options.

Dave Edis - Senior Developer
interactivetools.com

By Codee - May 8, 2023

Dave,

Thank you. Yeah, I couldn't figure out how to make that happen, because downloading images is critical browser behaviour and monkeying with it causes issues (different issues in different browsers), and it's not hard to get past some of the robots.txt rules. For example, website coders can block the tool img2dataset by sending the headers "X-Robots-Tag: noai", "X-Robots-Tag: noindex", "X-Robots-Tag: noimageai", and "X-Robots-Tag: noimageindex". By default, img2dataset will ignore images with such headers. HOWEVER, img2dataset tells users that "to disable this behaviour and download all images, you may pass --disallowed_header_directives '[ ]'" ...which is part of what just ticks me off.
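
For reference, here's roughly how those headers could be sent from a PHP script that serves the images (just a sketch; the directive names are the ones quoted above, and a scraper that opts out of honouring them still gets the file):

    <?php
    // Rough sketch: send the X-Robots-Tag directives when serving an image via PHP.
    // The false argument adds each header instead of replacing the previous one.
    header('X-Robots-Tag: noai', false);
    header('X-Robots-Tag: noindex', false);
    header('X-Robots-Tag: noimageai', false);
    header('X-Robots-Tag: noimageindex', false);

    header('Content-Type: image/jpeg');
    readfile('photo.jpg'); // placeholder path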

That's why I wrote: I couldn't think of any other simple ways to combat this. Yes, I have some .htaccess blocks in place, but attempting to block everything would cripple site speed.