Script to automate building an adblocking hosts file


    #61
    Actually, I'm not sure if this would work without lots of extra modification because it compares whole lines, and those lines have to be identical. So I'd have to change both the whitelist file and the downloaded hosts files so that both contain either "0.0.0.0 blahblahblah.com" or "127.0.0.1 blahblahblah.com", with the same number of spaces / tabs between both.

    For example, the first link ( http://winhelp2002.mvps.org/hosts.txt ) has two spaces between the terms:
    Code:
    127.0.0.1  www.alwayson-network.com
    127.0.0.1  adtools2.amakings.com
    127.0.0.1  ad.amgdgt.com
    127.0.0.1  vfdeprod.amobee.com
    127.0.0.1  banners.amsterdamcash.com
    127.0.0.1  widgets.amung.us
    127.0.0.1  whos.amung.us #[WebBug]
    127.0.0.1  gw.anametrix.net #[WebBug]
    127.0.0.1  www.anastasiasaffiliate.com
    127.0.0.1  advert.ananzi.co.za
    127.0.0.1  advert2.ananzi.co.za
    127.0.0.1  box.anchorfree.net
    127.0.0.1  rpt.anchorfree.net
    127.0.0.1  www.anticlown.com
    The second link ( http://hosts-file.net/.%5Cad_servers.txt ) contains a tab:
    Code:
    127.0.0.1	a-ads.com
    127.0.0.1	a-counter.kiev.ua
    127.0.0.1	a.admob.com
    127.0.0.1	a.adorika.net
    127.0.0.1	a.adroll.com
    127.0.0.1	a.adtwirl.com
    127.0.0.1	a.alimama.cn
    127.0.0.1	a.collective-media.net
    127.0.0.1	a.consumer.net
    127.0.0.1	a.ctasnet.com
    127.0.0.1	a.facdn.net
    127.0.0.1	a.fandango.com
    127.0.0.1	a.networkworld.com
    127.0.0.1	a.oix.net
    127.0.0.1	a.playlistmag.com
    Not sure which method would be most efficient considering all that. Plus the whitelist file would have to contain whole addresses - you couldn't simply put "google" in to whitelist everything with google in the name. Not sure if the comments after some of those entries would prevent a match either.
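
One way around the exact-line problem might be to normalize every entry to a single form first, and then match whitelist patterns against the hostname only. The following is a minimal sketch under assumptions: the function names `normalize_hosts` and `apply_whitelist` are made up for illustration, and the whitelist is assumed to be a plain file with one substring per line (so a line containing just `google` drops every host with google in its name).

```shell
#!/bin/sh
# Sketch only: normalize_hosts and apply_whitelist are illustrative names,
# not part of the original script.

# Rewrite every blocking entry as "0.0.0.0<TAB>hostname", whatever mix of
# spaces or tabs the source used, dropping comments and carriage returns.
normalize_hosts() {
    tab=$(printf '\t')
    tr -d '\r' |
    sed -E \
        -e 's/[[:space:]]*#.*$//' \
        -e '/^(127\.0\.0\.1|0\.0\.0\.0)[[:space:]]+/!d' \
        -e "s/^(127\.0\.0\.1|0\.0\.0\.0)[[:space:]]+([^[:space:]]+).*$/0.0.0.0${tab}\2/"
}

# Drop every entry whose hostname (second tab-separated field) contains
# any non-empty line of the whitelist file given as $1.
apply_whitelist() {
    awk -F '\t' -v wl="$1" '
        BEGIN { while ((getline line < wl) > 0) if (line != "") pat[line] }
        { for (p in pat) if (index($2, p)) next; print }
    '
}

# Usage (file names are placeholders):
#   normalize_hosts < downloaded-hosts.txt | apply_whitelist whitelist.txt
```

Because the comparison happens after normalization and only on the hostname, neither the spacing in the downloaded files nor trailing comments such as #[WebBug] affect the match.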

    "nothing is ever easy".

    Feathers
    samhobbs.co.uk


      #62
      Steve,

      thanks a lot for your great script. I've been using it successfully for some months, and it really does a great job.

      One suggestion, though: http://winhelp2002.mvps.org/hosts.txt recently replaced 127.0.0.1 with 0.0.0.0, which means that its entries are no longer included, since all the corresponding lines are deleted by the script. I therefore suggest updating it by changing

      # 2. Delete all lines that don't begin with 127.0.0.1

      to

      # 2. Delete all lines that don't begin with 127.0.0.1 or 0.0.0.0

      and

      sed -e '/^127.0.0.1/!d'

      to

      sed -e '/^\(127\.0\.0\.1\|0\.0\.0\.0\)/!d'

      I'm not a programmer but it seems to work for me
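
A note on how sed reads this kind of pattern: the `^` anchor applies only to the alternative it is written in, so the alternatives have to be grouped if both should be tied to the start of the line, and an unescaped dot matches any character. A small sketch of a grouped, escaped filter using `sed -E` (the function name `keep_blocking_lines` is hypothetical):

```shell
#!/bin/sh
# Keep only lines that start with 127.0.0.1 or 0.0.0.0 followed by
# whitespace. The parentheses make ^ apply to both alternatives.
keep_blocking_lines() {
    sed -E '/^(127\.0\.0\.1|0\.0\.0\.0)[[:space:]]/!d'
}

# Usage: a comment line that merely mentions 0.0.0.0 is dropped,
# as is an entry for any other address.
printf '%s\n' \
    '127.0.0.1  ad.amgdgt.com' \
    '0.0.0.0 ad.example.com' \
    '# see 0.0.0.0 entries' \
    '192.168.1.1 router' | keep_blocking_lines
```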


        #63
        Okay, I did some more homework and modified the script a bit more. BTW, I saved it in /root and created a symlink in /etc/cron.daily to execute it daily.

        I've implemented the following changes:

        1. The list of hosts files is updated and expanded. Right now, the generated hosts file is more than 18 MiB, with more than 565,000 entries.
        2. Fortunately, several hosts files are available as zip or 7z archives, and downloading those saves a lot of bandwidth. Please make sure that the unzip and unar packages are installed on your system.
        3. More bandwidth is saved by checking the timestamps of the hosts files with wget -N (where the server supports it). It compares the timestamp of each remote file with that of the corresponding local file, so only remote files with a newer timestamp are downloaded.

        Code:
        # If this is our first run, save a copy of the system's original hosts file and set to read-only for safety
        if [ ! -f ~/hosts-system ]
        then
         echo "Saving copy of system's original hosts file..."
         cp /etc/hosts ~/hosts-system
         chmod 444 ~/hosts-system
        fi
        
        # Perform work in temporary files
        temphosts1=$(mktemp)
        temphosts2=$(mktemp)
        
        # Obtain various hosts files and merge into one
        echo "Downloading ad-blocking hosts files..."
        # wget -nv -O - http://winhelp2002.mvps.org/hosts.txt >> $temphosts1
        mv mvps.zip hosts.zip; wget -N http://winhelp2002.mvps.org/hosts.zip; mv hosts.zip mvps.zip; unzip -p mvps.zip HOSTS >> $temphosts1
        # wget -nv -O - http://hosts-file.net/download/hosts.txt >> $temphosts1
        mv hosts-file.zip hosts.zip; wget -N http://hosts-file.net/download/hosts.zip; mv hosts.zip hosts-file.zip; unzip -p hosts-file.zip hosts.txt >> $temphosts1
        # wget -nv -O - http://someonewhocares.org/hosts/hosts >> $temphosts1
        wget -N http://someonewhocares.org/hosts/hosts; cat hosts >> $temphosts1
        wget -N "http://pgl.yoyo.org/as/serverlist.php?hostformat=hosts&showintro=1&mimetype=plaintext"; cat "serverlist.php?hostformat=hosts&showintro=1&mimetype=plaintext" >> $temphosts1
        wget -O - "https://spyeyetracker.abuse.ch/blocklist.php?download=hostfile" >> $temphosts1
        wget -O - "https://zeustracker.abuse.ch/blocklist.php?download=hostfile" >> $temphosts1
        # wget -nv -O - "http://www.malware.com.br/cgi/submit?action=list_hosts_win_127001" >> $temphosts1 # no longer freely available
        mv malwaredomains.txt hosts.txt; wget -N http://www.malwaredomainlist.com/hostslist/hosts.txt; mv hosts.txt malwaredomains.txt; cat malwaredomains.txt  >> $temphosts1
        # wget -nv -O - http://hostsfile.mine.nu/Hosts >> $temphosts1  # obviously no longer maintained
        wget -N http://rlwpx.free.fr/WPFF/hblc.7z; unar hblc.7z; cat Hosts.blc >> $temphosts1
        wget -N http://rlwpx.free.fr/WPFF/hpub.7z; unar hpub.7z; cat Hosts.pub >> $temphosts1
        wget -N http://rlwpx.free.fr/WPFF/hrsk.7z; unar hrsk.7z; cat Hosts.rsk >> $temphosts1
        # wget -N http://rlwpx.free.fr/WPFF/hmis.7z; unar hmis.7z; cat Hosts.mis >> $temphosts1  # too many false positives
        wget -N http://rlwpx.free.fr/WPFF/htrc.7z; unar htrc.7z; cat Hosts.trc >> $temphosts1
        
        # Do some work on the file:
        # 1. Remove MS-DOS carriage returns
        # 2. Delete all lines that don't begin with 127.0.0.1 or 0.0.0.0
        # 3. Delete any lines containing the word localhost because we'll obtain that from the original hosts file
        # 4. Delete any lines containing dropbox.com or downloads.sourceforge.net
        # 5. Replace 127.0.0.1 with 0.0.0.0 because then we don't have to wait for the resolver to fail
        # 6. Scrunch extraneous spaces or tabs separating address from name into a single tab
        # 7. Delete any comments on lines
        # 8. Clean up leftover trailing blanks
        # 9. Finally, delete all lines that don't begin with 0.0.0.0 to make sure that all remnants are removed
        # Pass all this through sort with the unique flag to remove duplicates and save the result
        echo "Parsing, cleaning, de-duplicating, sorting..."
        
        sed -e 's/\r//'                                        \
            -e '/^\(127\.0\.0\.1\|0\.0\.0\.0\)/!d'             \
            -e '/localhost/d'                                  \
            -e '/dropbox\.com\|downloads\.sourceforge\.net/d'  \
            -e 's/^127\.0\.0\.1/0.0.0.0/'                      \
            -e 's/[ \t]\+/\t/'                                 \
            -e 's/#.*$//'                                      \
            -e 's/[ \t]*$//'                                   \
            -e '/^0\.0\.0\.0/!d'                               \
            < $temphosts1 |
            sort -u > $temphosts2
        
        # Combine system hosts with adblocks
        echo Merging with original system hosts...
        echo -e "\n# Ad blocking hosts generated "`date` | cat ~/hosts-system - $temphosts2 > ~/hosts-block
        
        # Clean up temp files and remind user to copy new file
        echo "Cleaning up..."
        rm $temphosts1 $temphosts2 Hosts.*
        echo "Done."
        echo
        echo "Installing ad-blocking hosts file to /etc/hosts..."
        cp ~/hosts-block /etc/hosts
        echo
        echo "You can always restore your original hosts file with this command:"
        echo " sudo cp ~/hosts-system /etc/hosts"
        echo "so don't delete that file! (It's saved read-only for your protection.)"
        I'm very interested in your comments and suggestions.

        EDIT: I just made a small change. I had used http://someonewhocares.org/hosts/zero in the list above because that version already uses 0.0.0.0 instead of 127.0.0.1. However, I noticed that it doesn't support timestamps while http://someonewhocares.org/hosts/hosts does. So I decided to choose the latter one.
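
The timestamp check only works when the local copy has the same name as the remote file, which is why some downloads in the script are renamed before and after the wget call. That rename dance could be wrapped in a helper along these lines. This is a sketch, not part of the script: the function name `fetch_if_newer` and its argument order are assumptions.

```shell
#!/bin/sh
# fetch_if_newer URL REMOTE_NAME LOCAL_NAME
# Download URL with timestamping, but keep the file under LOCAL_NAME
# between runs so that sources with identical remote names don't collide.
fetch_if_newer() {
    url=$1 remote_name=$2 local_name=$3
    # Give the previous copy the name wget expects, so -N can compare
    # its timestamp with the remote one. Nothing to do on the first run.
    [ -f "$local_name" ] && mv "$local_name" "$remote_name"
    wget -N "$url"
    status=$?
    # Move the (possibly refreshed) file back to its distinct local name.
    [ -f "$remote_name" ] && mv "$remote_name" "$local_name"
    return "$status"
}

# Usage, mirroring the mvps line in the script:
#   fetch_if_newer http://winhelp2002.mvps.org/hosts.zip hosts.zip mvps.zip
#   unzip -p mvps.zip HOSTS >> $temphosts1
```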
        Last edited by tlu; Feb 16, 2014, 08:15 AM.


          #64
          Sorry to resurrect an old thread, but I'm re-building my cron hosts script for my new 14.04 install and decided to direct the output to dmesg instead of just echoing to the terminal. I use a cron job to run the script so terminal output isn't desired.

          I simply added >/dev/kmsg to the echo lines (the ones you want to see in dmesg anyway) and added -o /dev/kmsg to the wget commands. This gets all desired output into dmesg.
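
A rough sketch of that change, assuming a small helper so the destination is set in one place (the name `log`, the `LOG_TARGET` variable and the `hosts-block:` prefix are made up here). Writing to /dev/kmsg normally needs root, which a cron.daily job has.

```shell
#!/bin/sh
# Destination for status messages; anything written to /dev/kmsg
# shows up in dmesg. Override LOG_TARGET to log somewhere else.
LOG_TARGET=${LOG_TARGET:-/dev/kmsg}

log() {
    # The prefix makes the entries easy to find: dmesg | grep hosts-block
    printf 'hosts-block: %s\n' "$*" >> "$LOG_TARGET"
}

# Usage in the script, replacing the plain echo lines:
#   log "Downloading ad-blocking hosts files..."
#   wget -nv -o /dev/kmsg -O - http://someonewhocares.org/hosts/hosts >> $temphosts1
```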

          Cool, huh?



            #65
            So you feel my little hacky script is worthy of reporting its behavior to the kernel's log. That feels like a promotion of some kind! LOL


              #66
              LOL: Hey, I'm not trying to swell your head any more than it already is. dmesg is the first and easiest place I could think of to store the output. I used logger initially but "cat /var/log/syslog" was too verbose for me just to see if it worked. I suppose the "correct" way would be to put the results into a local email but I'm too lazy for all that. Besides, I get too much email as it is!

              I also added some of the archives from tlu's post, but all that garbage left behind just to save some bandwidth was more annoying than it was worth in my case. I used your process (direct text to $temphosts) except for the rlwpx.free.fr files which come in 7-zip format. I couldn't find a way to make p7zip output to stdout without acrobatics, so I've opted for 7z-to-stdout-then-delete-sourcefile instead. Works, but not pretty. The unar that tlu used has no benefit at all from what I can tell other than to drag in six otherwise unneeded dependencies. Each to his own I guess...
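
For reference, the 7z-to-stdout-then-delete approach might look roughly like this. It is a sketch under assumptions: the function name `append_7z` is hypothetical, it relies on the `-so` switch of the `7z` command (from the p7zip-full package) to write extracted data to stdout, and it reuses the `$temphosts1` variable from the script.

```shell
#!/bin/sh
# append_7z URL
# Download a .7z archive into the current directory, stream its contents
# to stdout, then delete the archive so nothing is left behind.
append_7z() {
    archive=${1##*/}              # file name part of the URL, e.g. hblc.7z
    wget -nv "$1" || return 1     # wget reports on stderr, stdout stays clean
    7z e -so "$archive" 2>/dev/null
    status=$?
    rm -f "$archive"
    return "$status"
}

# Usage in the script:
#   append_7z http://rlwpx.free.fr/WPFF/hblc.7z >> $temphosts1
```

The trade-off is that without a kept local copy there is no timestamp to compare, so the archive is downloaded in full on every run.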
