Script to automate building an adblocking hosts file


    #31
    Are you asking how to run a weekly job? If so, in crontab leave the day-of-month and month fields as * (every) and pick one day of the week.

    Code:
    # Minute   Hour   Day of Month       Month          Day of Week        Command
    # (0-59)  (0-23)     (1-31)    (1-12 or Jan-Dec)  (0-6 or Sun-Sat)
    So 0 0 * * 1 would run every Monday at midnight.
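
As a rough sketch of how that schedule could be turned into a crontab entry: the script path below is a made-up placeholder and weekly_cron_line is just an illustrative helper name, so adjust both to your own setup. The line that actually installs the entry is left commented out on purpose.

```shell
#!/bin/bash
# Hypothetical location of the hosts-building script; adjust to your setup.
SCRIPT="$HOME/bin/build-hosts.sh"

# Print a crontab line that runs the given command every Monday at midnight.
# Fields: minute hour day-of-month month day-of-week command
weekly_cron_line() {
    printf '0 0 * * 1 %s\n' "$1"
}

weekly_cron_line "$SCRIPT"

# To install the line, append it to your existing crontab (uncomment to use):
# ( crontab -l 2>/dev/null; weekly_cron_line "$SCRIPT" ) | crontab -
```

Printing the line first lets you check it by eye before touching the real crontab.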

    Please Read Me



      #32
      Originally posted by techiejames
      Instead of manually doing this, why not just have it in the script? It seems to do fine for me.
      Feel free to modify the source as you see fit. My goal was to write a script that does one thing: create the new hosts file. The mechanism for replacing the existing one is, as you've seen, an exercise left for the reader



        #33
        This thread is excellent! A really good read, thanks to all contributors. This must be how a lot of free software projects get started!

        Do any of you use AdAway on Android (GPLv3)? I think it does exactly this, pulls a list of websites from 4 different sources, merges them and adds them to the hosts file. For some reason AdAway asks you to reboot afterwards, does this script take effect immediately?

        The reason for the difference in performance between this and browser based blockers must be the effort taken to reshape the page to remove the empty advert spaces, right? I think I'd miss that function, but will give this a try.

        If you visit a site with pop-ups, do you still get a window, just nothing in it?

        Feathers
        samhobbs.co.uk



          #34
          Originally posted by SecretCode
          Wait, that would be so convenient. Why doesn't Linux do that?
          Theoretically, you can include "./" in $PATH and have the system check the cwd for executables, but as a general rule it is much better to have your binaries in $PATH than make $PATH dance around your binaries.



            #35
            Originally posted by kubicle
            Theoretically, you can include "./" in $PATH and have the system check the cwd for executables....
            If you have a blank entry in your $PATH, including starting or ending with the separator colon, that means the cwd. I've done that for three decades. It's only an issue if the cwd is writable by people (or bots or software) you don't trust; we don't do that in a typical linux install.
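
A minimal sketch of how one might check for this, assuming only what is described above (an empty entry, including a leading or trailing colon, means the cwd) plus the literal "." and "./" forms. path_has_cwd is an illustrative name, not a standard tool.

```shell
#!/bin/bash
# Succeed if the given PATH-style string (default: $PATH) searches the
# current directory via an empty entry, ".", or "./".
path_has_cwd() {
    local p="${1-$PATH}"
    # Wrapping the string in colons turns leading/trailing empty entries into "::".
    case ":$p:" in
        *::*|*:.:*|*:./:*) return 0 ;;
        *) return 1 ;;
    esac
}

if path_has_cwd; then
    echo "PATH includes the current directory"
else
    echo "PATH does not include the current directory"
fi
```

Calling it with an argument, for example path_has_cwd "/usr/bin::/bin", tests that string instead of the live $PATH.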

            Regards, John Little



              #36
              Originally posted by Feathers McGraw
              Do any of you use AdAway on Android (GPLv3)? I think it does exactly this, pulls a list of websites from 4 different sources, merges them and adds them to the hosts file.
              AdAway was my inspiration for writing the script. But rather than look at their source code, I wanted to write my own from scratch as a learning exercise. This was the first Bash shell script I ever wrote.

              Originally posted by Feathers McGraw
              For some reason AdAway asks you to reboot afterwards, does this script take effect immediately?
              I think that's because Android only reads its /etc/hosts file during boot. By default, "regular" Linux consults /etc/hosts before DNS every time an application performs a host name lookup. (*) No reboot is necessary. You might, though, need to flush your resolver cache. One of these should do it:
              Code:
              sudo service dnsmasq restart
              
              sudo service networking force-reload
              Originally posted by Feathers McGraw
              The reason for the difference in performance between this and browser based blockers must be the effort taken to reshape the page to remove the empty advert spaces, right? I think I'd miss that function, but will give this a try.
              Depends on how the site's HTML is structured. You might see a frame with an error, or the frame might simply not appear. I've experienced both. My script definitely makes Phoronix a much more pleasing site to visit.

              Originally posted by Feathers McGraw
              If you visit a site with pop-ups, do you still get a window, just nothing in it?
              Dunno, I always enable popup blockers.


              (*) Edit: Kubicle reminded me in a post below that this behavior can be changed.
              Last edited by SteveRiley; Oct 27, 2013, 01:26 AM.



                #37
                Originally posted by jlittle
                If you have a blank entry in your $PATH, including starting or ending with the separator colon, that means the cwd. I've done that for three decades. It's only an issue if the cwd is writable by people (or bots or software) you don't trust; we don't do that in a typical linux install.
                I'd still prefer a literal "." for clarity, a blank entry is easier to miss.

                I'm not terribly fond of relative path elements, especially if those are before the absolute path elements...as one might run something malicious by accident (like a browser plugin that places a modified sudo executable in your $HOME).

                It is usually much more convenient to place your executables in $PATH, but I do understand the reason why someone might wish to add cwd as a fallback.
                Last edited by kubicle; Oct 27, 2013, 12:29 AM.



                  #38
                  Originally posted by SteveRiley
                  "Regular" Linux always consults /etc/hosts before DNS every time an application performs a host name lookup.
                  This actually depends on configuration, although checking the local hosts file before DNS is the default on most distributions.

                  The lookup order is set in /etc/nsswitch.conf (and the older /etc/host.conf); the man pages give the details.
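
For anyone who wants to check their own system, here is a small sketch. It assumes the usual nsswitch.conf layout, where a "hosts:" line lists the sources in lookup order and "files" stands for /etc/hosts. hosts_order and files_before_dns are illustrative helper names only.

```shell
#!/bin/bash
# Print the sources on the "hosts:" line of an nsswitch.conf-style file
# (default: /etc/nsswitch.conf).
hosts_order() {
    awk '$1 == "hosts:" { $1 = ""; sub(/^ +/, ""); print; exit }' \
        "${1:-/etc/nsswitch.conf}"
}

# Succeed if "files" (i.e. /etc/hosts) is listed before "dns".
files_before_dns() {
    hosts_order "${1:-}" | awk '
        { for (i = 1; i <= NF; i++) {
              if ($i == "files") { ok = 1; exit }
              if ($i == "dns") exit
          } }
        END { exit !ok }'
}

if [ -r /etc/nsswitch.conf ]; then
    hosts_order
    if files_before_dns; then
        echo "/etc/hosts is checked before DNS"
    fi
fi
```

If "files" comes after "dns", or is missing altogether, a hosts-file blocklist may not be consulted first.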



                    #39
                    Kubicle, we've gone OT, I'll start a new thread.
                    Regards, John Little



                      #40
                      Originally posted by jlittle
                      Kubicle, we've gone OT
                      IIRC, this isn't the first time ...and likely not the last, I've got a tendency to do that.
                      Originally posted by jlittle
                      I'll start a new thread.
                      Sounds like a plan, meet you on the other side.



                        #41
                        Trying this now, so far so good!

                        Tested using the site below, and some adverts still showed, but they're not real adverts, so I'm not sure what that means! Haven't had any real ones get through. Browsing seems snappier with AdBlock turned off.

                        http://www.angelfire.com/alt2/entert...lock_test.html

                        Is there any reason why it would be a bad idea to do this on a router? Would it filter ads for every device connected?

                        Feathers
                        samhobbs.co.uk



                          #42
                          Originally posted by Feathers McGraw
                          Tested using the site below, and some adverts still showed, but they're not real adverts, so I'm not sure what that means! Haven't had any real ones get through. Browsing seems snappier with AdBlock turned off.
                          Here's the HTML from the portion of the page that delivers the images:
                          Code:
                          <img src="http://img236.echo.cx/img236/5108/adbannersportedtop9tr.gif" alt="Ad banner should be blocked" title=" Ad banner should be blocked"> 
                          <h3>[^ You should NOT be seeing this image above Ad banner was here ^]</h3>
                          <br>
                          <img src="http://img145.echo.cx/img145/3690/atribalfushionsported3ti.gif" alt="Ad should be blocked" title="Ad should be blocked"> 
                          <h3>[^ You should NOT be seeing this image above Ad image was here ^]</h3>
                          <br>
                          <img src="http://img207.echo.cx/img207/1241/realmedia6iw.gif" alt="Ad should be blocked" title="Ad should be blocked"> 
                          <h3>[^ You should NOT be seeing this image above Ad image was here ^]</h3>
                          <br>
                          <img src="http://img61.echo.cx/img61/2681/adtrackingpromo1gl.gif" alt=" Ad should be blocked" title="Ad should be blocked"> 
                          <h3>[^ You should NOT be seeing this image above Ad image was here ^]</h3>
                          <br>
                          <img src="http://img104.echo.cx/img104/9528/friendsaffiliatessported8zx.gif" alt="Ad should be blocked" title="This is NOT an AD"> 
                          <h3>[^ You SHOULD be seeing this image above ^]</h3>
                          <br>
                          <img src="http://img64.echo.cx/img64/6751/doubleclickaffsportedbottom6cf.gif" alt="Ad should be blocked" title="Ad should be blocked"> 
                          <h3>[^ You should NOT be seeing this image above Affilates was here ^]</h3>
                          You'll note that they're served from various hosts in the echo.cx domain. Browser-based ad blockers maintain lists of ad sites and also URL matching strings for common ad image names, and would thus block all those images based on the file names. My script can only block known ad hosts because it's DNS based.
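
To make the host-only limitation concrete, here is a minimal sketch that pulls the host name out of a URL and looks it up in a hosts-format file. It assumes the usual hosts layout (an address followed by one or more host names, with # comments). url_host and host_is_blocked are hypothetical helper names and are not part of my script.

```shell
#!/bin/bash
# Extract the host part of a URL (drops the scheme, path, and any port).
url_host() {
    local url="$1"
    url="${url#*://}"
    url="${url%%/*}"
    url="${url%%:*}"
    printf '%s\n' "$url"
}

# Succeed if the host ($2) appears as a name in the hosts-format file ($1).
host_is_blocked() {
    awk -v h="$2" '
        /^[[:space:]]*#/ { next }              # skip comment lines
        { for (i = 2; i <= NF; i++) {
              if ($i ~ /^#/) break             # ignore trailing comments
              if ($i == h) found = 1
          } }
        END { exit !found }' "$1"
}

url='http://img145.echo.cx/img145/3690/atribalfushionsported3ti.gif'
host=$(url_host "$url")
if [ -r /etc/hosts ] && host_is_blocked /etc/hosts "$host"; then
    echo "$host is listed in /etc/hosts"
else
    echo "$host is not listed in /etc/hosts"
fi
```

Only the host name takes part in the lookup; the file name at the end of the URL never does, which is why name-based matching is out of reach for a hosts file.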

                          Upon first glance, then, my script shouldn't block any of the images, because it has no entries for hosts in the echo.cx domain. However, the snippet of HTML above deserves a bit more investigation. The first, fourth, fifth, and sixth image links point to true images. But the second and third do not: instead, they point to HTML files! Let's download the second:
                          Code:
                          steve@t520:~/junk$ wget -S http://img145.echo.cx/img145/3690/atribalfushionsported3ti.gif
                          --2013-10-27 14:32:06--  http://img145.echo.cx/img145/3690/atribalfushionsported3ti.gif
                          Resolving img145.echo.cx (img145.echo.cx)... 208.94.1.239
                          Connecting to img145.echo.cx (img145.echo.cx)|208.94.1.239|:80... connected.
                          HTTP request sent, awaiting response... 
                            HTTP/1.1 200 OK
                            Server: nginx/1.0.4
                            Date: Sun, 27 Oct 2013 21:32:07 GMT
                            Content-Type: text/html
                            Transfer-Encoding: chunked
                            Connection: close
                            X-Powered-By: PHP/5.2.9
                            X-Server-Name-And-Port: _:14000
                            Expires: Sun, 27 Oct 2013 21:32:06 GMT
                            Cache-Control: no-cache
                            X-Server-Name-And-Port: _:14000
                          Length: unspecified [text/html]
                          Saving to: ‘atribalfushionsported3ti.gif’
                          
                              [ <=>                                                                               ] 19,145      --.-K/s   in 0.09s   
                          
                          2013-10-27 14:32:07 (207 KB/s) - ‘atribalfushionsported3ti.gif’ saved [19145]
                          Now, let's take a look at what this supposed "image" really is: http://paste.ubuntu.com/6314839/

                          You will see that it contains a number of links to sites my script does block (google-analytics.com, quantcast.com) and also tries to open a popup. My script plus your browser's pop-up blocker prevent the second "image" from loading. (The third "image" grabs exactly the same HTML as the second.)

                          Browser-based ad blockers will likely catch more ads, but they are slower and they work only in browsers. DNS blocking catches fewer ads but is faster and will work for every application that makes an Internet connection, including email clients, RSS readers, and more. It's up to each individual to determine which set of tradeoffs matters most.

                          Originally posted by Feathers McGraw
                          Is there any reason why it would be a bad idea to do this on a router? Would it filter ads for every device connected?
                          Absolutely you can place it on your router, and the outcome will be exactly as you expect -- so long as each node on your network is using your router as the DNS server.



                            #43
                            Thanks, that's really interesting!

                            Originally posted by SteveRiley
                            You will see that it contains a number of links to sites my script does block (google-analytics.com, quantcast.com)
                            Ahh, then I'm in a bind. I use Google Analytics on the site because it gives such an insight into which bits people are finding interesting/useful etc.

                            Blocking google-analytics at the router would break the connection from the Pi to the Google server. Finding a local equivalent would be ideal, I've tried a couple of wordpress plugins but unfortunately the counts were pretty wild.

                            Feathers
                            samhobbs.co.uk



                              #44
                              You can edit the output of my script and remove any references to Google Analytics before you copy the file to your router.



                                #45
                                Didn't fancy trawling through 33,000 lines for certain things so I thought I'd automate it. Was a good learning experience.

                                Code:
                                #!/bin/bash
                                
                                # Before calling this script, create a whitelist file containing phrases to allow, one phrase per line.
                                
                                if [ $# -ne 1 ]; then
                                    echo "Usage: $0 whitelist_file_location" >&2
                                    exit 1
                                fi
                                
                                INPUT_FILE=~/hosts-block
                                OUTPUT_FILE=~/hosts-block-less-whitelist
                                
                                # First, remove empty lines from the whitelist file (an empty pattern makes sed fail in the next step).
                                sed '/^$/d' "$1" > tt
                                mv tt "$1"
                                echo 'Removed empty lines from whitelist_file'
                                
                                cp "$INPUT_FILE" "$OUTPUT_FILE"
                                
                                # Now read each line of the whitelist and delete matching entries from OUTPUT_FILE.
                                # Each phrase is used as a sed regular expression, so it must not contain "/".
                                while read -r line; do
                                        sed -e "/$line/d" "$OUTPUT_FILE" > tt
                                        mv tt "$OUTPUT_FILE"
                                        echo "Removed any lines containing $line"
                                done < "$1"
                                If I've done anything embarrassingly inefficient, let me know lol.
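
One possible way to do the whole thing in a single pass, offered as a sketch rather than a drop-in replacement: grep can read all the phrases from the whitelist at once. Note that -F treats each phrase as a fixed string, not a regular expression, so this differs slightly from the sed version. filter_hosts is just an illustrative name.

```shell
#!/bin/bash
# Print the lines of a hosts file ($2) that contain none of the phrases
# listed one per line in a whitelist file ($1).
filter_hosts() {
    # Drop blank whitelist lines first: an empty pattern would match every line.
    grep -v '^[[:space:]]*$' "$1" |
        grep -v -F -f - "$2"
}

# When run with two arguments, filter the given files and print the result.
if [ $# -eq 2 ]; then
    filter_hosts "$1" "$2"
fi
```

Saved under any name you like and run with the whitelist and ~/hosts-block as arguments, its output can be redirected to ~/hosts-block-less-whitelist, with no temporary file and one read of the big file instead of one per phrase.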
                                samhobbs.co.uk
