Script to automate building an adblocking hosts file


  • oshunluvr
    replied
    LOL! Hey, I'm not trying to swell your head any more than it already is. dmesg is the first and easiest place I could think of to store the output. I used logger initially, but "cat /var/log/syslog" was too verbose for me just to see if it worked. I suppose the "correct" way would be to put the results into a local email, but I'm too lazy for all that. Besides, I get too much email as it is!

    I also added some of the archives from tlu's post, but all that garbage left behind just to save some bandwidth was more annoying than it was worth in my case. I used your process (direct text to $temphosts) except for the rlwpx.free.fr files which come in 7-zip format. I couldn't find a way to make p7zip output to stdout without acrobatics, so I've opted for 7z-to-stdout-then-delete-sourcefile instead. Works, but not pretty. The unar that tlu used has no benefit at all from what I can tell other than to drag in six otherwise unneeded dependencies. Each to his own I guess...


  • SteveRiley
    replied
    So you feel my little hacky script is worthy of reporting its behavior to the kernel's log. That feels like a promotion of some kind! LOL


  • oshunluvr
    replied
    Sorry to resurrect an old thread, but I'm re-building my cron hosts script for my new 14.04 install and decided to direct the output to dmesg instead of just echoing to the terminal. I use a cron job to run the script so terminal output isn't desired.

    I simply added >/dev/kmsg to the echo lines (the ones you want to see in dmesg anyway) and added -o /dev/kmsg to the wget commands. This gets all desired output into dmesg.

    Cool, huh?
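
    For anyone wanting to try the same thing, here is a minimal sketch of that redirection, not the actual cron script. The names LOG_TARGET, log_msg and fetch are made up for illustration, and writing to /dev/kmsg needs root (which a system cron job normally has).

```shell
#!/bin/sh
# Hypothetical sketch: send script messages to the kernel log instead of the terminal.
# LOG_TARGET defaults to /dev/kmsg (root only); point it at a regular file to try
# this without root. Note that ">" overwrites a regular file on every message.
LOG_TARGET="${LOG_TARGET:-/dev/kmsg}"

# Write one tagged message line to the log target (shows up in dmesg).
log_msg() {
    echo "hosts-update: $*" > "$LOG_TARGET"
}

# Download a URL to stdout while wget's own messages go to the log target.
# -nv keeps wget's messages short, -o names wget's log file, -O - sends data to stdout.
fetch() {
    wget -nv -o "$LOG_TARGET" -O - "$1"
}

# Usage inside the hosts script (left commented out so nothing runs on load):
#   log_msg "Downloading ad-blocking hosts files..."
#   fetch http://someonewhocares.org/hosts/hosts >> "$temphosts"
```

    The "hosts-update:" tag is only there so the lines are easy to find with something like dmesg | grep hosts-update.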


  • tlu
    replied
    Okay, I did some more homework and modified the script a bit more. BTW, I saved it in /root and created a symlink in /etc/cron.daily to execute it daily.

    I've implemented the following changes:

    1. The list of hosts files is updated and expanded. Right now, the generated hosts file is more than 18 MiB with more than 565,000 entries.
    2. Fortunately, several hosts files are available as zip or 7z archives, and downloading those saves a lot of bandwidth. Please make sure that the unzip and unar packages are installed on your system.
    3. More bandwidth is saved by checking the timestamps of the hosts files with wget -N (where the server supports it). It compares the timestamp of each remote file with that of the corresponding local file, so only remote files with a newer timestamp are downloaded.

    Code:
    # If this is our first run, save a copy of the system's original hosts file and set to read-only for safety
    if [ ! -f ~/hosts-system ]
    then
     echo "Saving copy of system's original hosts file..."
     cp /etc/hosts ~/hosts-system
     chmod 444 ~/hosts-system
    fi
    
    # Perform work in temporary files
    temphosts1=$(mktemp)
    temphosts2=$(mktemp)
    
    # Obtain various hosts files and merge into one
    echo "Downloading ad-blocking hosts files..."
    # wget -nv -O - http://winhelp2002.mvps.org/hosts.txt >> $temphosts1
    mv mvps.zip hosts.zip; wget -N http://winhelp2002.mvps.org/hosts.zip; mv hosts.zip mvps.zip; unzip -p mvps.zip HOSTS >> $temphosts1
    # wget -nv -O - http://hosts-file.net/download/hosts.txt >> $temphosts1
    mv hosts-file.zip hosts.zip; wget -N http://hosts-file.net/download/hosts.zip; mv hosts.zip hosts-file.zip; unzip -p hosts-file.zip hosts.txt >> $temphosts1
    # wget -nv -O - http://someonewhocares.org/hosts/hosts >> $temphosts1
    wget -N http://someonewhocares.org/hosts/hosts; cat hosts >> $temphosts1
    wget -N "http://pgl.yoyo.org/as/serverlist.php?hostformat=hosts&showintro=1&mimetype=plaintext"; cat "serverlist.php?hostformat=hosts&showintro=1&mimetype=plaintext" >> $temphosts1
    wget -O - "https://spyeyetracker.abuse.ch/blocklist.php?download=hostfile" >> $temphosts1
    wget -O - "https://zeustracker.abuse.ch/blocklist.php?download=hostfile" >> $temphosts1
    # wget -nv -O - "http://www.malware.com.br/cgi/submit?action=list_hosts_win_127001" >> $temphosts1 # no longer freely available
    mv malwaredomains.txt hosts.txt; wget -N http://www.malwaredomainlist.com/hostslist/hosts.txt; mv hosts.txt malwaredomains.txt; cat malwaredomains.txt  >> $temphosts1
    # wget -nv -O - http://hostsfile.mine.nu/Hosts >> $temphosts1  # obviously no longer maintained
    wget -N http://rlwpx.free.fr/WPFF/hblc.7z; unar hblc.7z; cat Hosts.blc >> $temphosts1
    wget -N http://rlwpx.free.fr/WPFF/hpub.7z; unar hpub.7z; cat Hosts.pub >> $temphosts1
    wget -N http://rlwpx.free.fr/WPFF/hrsk.7z; unar hrsk.7z; cat Hosts.rsk >> $temphosts1
    # wget -N http://rlwpx.free.fr/WPFF/hmis.7z; unar hmis.7z; cat Hosts.mis >> $temphosts1  # too many false positives
    wget -N http://rlwpx.free.fr/WPFF/htrc.7z; unar htrc.7z; cat Hosts.trc >> $temphosts1
    
    # Do some work on the file:
    # 1. Remove MS-DOS carriage returns
    # 2. Delete all lines that don't begin with 127.0.0.1 or 0.0.0.0
    #    (the \( \) group anchors both alternatives to the start of the line)
    # 3. Delete any lines containing the word localhost because we'll obtain that from the original hosts file
    # 4. Delete any lines containing the words dropbox.com and downloads.sourceforge.net
    # 5. Replace 127.0.0.1 with 0.0.0.0 because then we don't have to wait for the resolver to fail
    # 6. Scrunch extraneous spaces separating address from name into a single tab
    # 7. Delete any comments on lines
    # 8. Clean up leftover trailing blanks
    # 9. Finally, delete all lines that don't begin with 0.0.0.0 to make sure that all remnants are removed
    # Pass all this through sort with the unique flag to remove duplicates and save the result
    echo "Parsing, cleaning, de-duplicating, sorting..."

    sed -e 's/\r//'                                        \
        -e '/^\(127\.0\.0\.1\|0\.0\.0\.0\)/!d'             \
        -e '/localhost/d'                                  \
        -e '/dropbox\.com\|downloads\.sourceforge\.net/d'  \
        -e 's/^127\.0\.0\.1/0.0.0.0/'                      \
        -e 's/ \+/\t/'                                     \
        -e 's/#.*$//'                                      \
        -e 's/[ \t]*$//'                                   \
        -e '/^0\.0\.0\.0/!d'                               \
        < "$temphosts1" |
        sort -u > "$temphosts2"
    
    # Combine system hosts with adblocks
    echo "Merging with original system hosts..."
    echo -e "\n# Ad blocking hosts generated $(date)" | cat ~/hosts-system - "$temphosts2" > ~/hosts-block
    
    # Clean up temp files and install the new hosts file
    echo "Cleaning up..."
    rm -f "$temphosts1" "$temphosts2" Hosts.*
    echo "Done."
    echo
    echo "Installing ad-blocking hosts file..."
    cp ~/hosts-block /etc/hosts
    echo
    echo "You can always restore your original hosts file with this command:"
    echo " sudo cp ~/hosts-system /etc/hosts"
    echo "so don't delete that file! (It's saved read-only for your protection.)"
    I'm very interested in your comments and suggestions.

    EDIT: I just made a small change. I had used http://someonewhocares.org/hosts/zero in the list above because that version already uses 0.0.0.0 instead of 127.0.0.1. However, I noticed that it doesn't support timestamps while http://someonewhocares.org/hosts/hosts does. So I decided to choose the latter one.
    Last edited by tlu; Feb 16, 2014, 08:15 AM.


  • tlu
    replied
    Steve,

    thanks a lot for your great script. I've been using it successfully for some months, and it really does a great job.

    One suggestion, though: http://winhelp2002.mvps.org/hosts.txt recently replaced 127.0.0.1 with 0.0.0.0, which means that its entries are no longer included, since the script deletes every line that doesn't begin with 127.0.0.1. That's why I suggest that you update it by changing

    # 2. Delete all lines that don't begin with 127.0.0.1

    to

    # 2. Delete all lines that don't begin with 127.0.0.1 or 0.0.0.0

    and

    sed -e '/^127.0.0.1/!d'

    to

    sed -e '/^\(127\.0\.0\.1\|0\.0\.0\.0\)/!d'

    I'm not a programmer but it seems to work for me
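
    A note on the alternation, as a minimal sketch: in sed's basic regular expressions, \| binds more loosely than ^, so an ungrouped /^127.0.0.1\|0.0.0.0/ anchors only the first alternative and also keeps any line that merely contains 0.0.0.0 somewhere. Grouping with \( \) and escaping the dots avoids that. The function names and sample lines below are made up for illustration, and \| itself is a GNU sed extension.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines that START with 127.0.0.1 or 0.0.0.0.
# The \( \) group makes ^ apply to both alternatives; \. matches a literal dot.
keep_block_lines() {
    sed -e '/^\(127\.0\.0\.1\|0\.0\.0\.0\)/!d'
}

# Made-up sample input: two real entries plus a comment that mentions 0.0.0.0.
sample() {
    printf '%s\n' \
        '127.0.0.1  ads.example.com' \
        '0.0.0.0 tracker.example.net' \
        '# use 0.0.0.0 instead of 127.0.0.1'
}

# Prints the two entries; the comment line is dropped.
sample | keep_block_lines
```

    In the full script the final /^0.0.0.0/!d step would catch such stray lines anyway, so this is more about the pattern saying what it means.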


  • Feathers McGraw
    replied
    Actually, I'm not sure if this would work without lots of extra modification because it compares whole lines, and those lines have to be identical. So I'd have to change both the whitelist file and the downloaded hosts files so that both contain either "0.0.0.0 blahblahblah.com" or "127.0.0.1 blahblahblah.com", with the same number of spaces / tabs between both.

    For example, the first link ( http://winhelp2002.mvps.org/hosts.txt ) has two spaces between the terms:
    Code:
    127.0.0.1  www.alwayson-network.com
    127.0.0.1  adtools2.amakings.com
    127.0.0.1  ad.amgdgt.com
    127.0.0.1  vfdeprod.amobee.com
    127.0.0.1  banners.amsterdamcash.com
    127.0.0.1  widgets.amung.us
    127.0.0.1  whos.amung.us #[WebBug]
    127.0.0.1  gw.anametrix.net #[WebBug]
    127.0.0.1  www.anastasiasaffiliate.com
    127.0.0.1  advert.ananzi.co.za
    127.0.0.1  advert2.ananzi.co.za
    127.0.0.1  box.anchorfree.net
    127.0.0.1  rpt.anchorfree.net
    127.0.0.1  www.anticlown.com
    The second link ( http://hosts-file.net/.%5Cad_servers.txt ) contains a tab:
    Code:
    127.0.0.1	a-ads.com
    127.0.0.1	a-counter.kiev.ua
    127.0.0.1	a.admob.com
    127.0.0.1	a.adorika.net
    127.0.0.1	a.adroll.com
    127.0.0.1	a.adtwirl.com
    127.0.0.1	a.alimama.cn
    127.0.0.1	a.collective-media.net
    127.0.0.1	a.consumer.net
    127.0.0.1	a.ctasnet.com
    127.0.0.1	a.facdn.net
    127.0.0.1	a.fandango.com
    127.0.0.1	a.networkworld.com
    127.0.0.1	a.oix.net
    127.0.0.1	a.playlistmag.com
    Not sure which method would be most efficient considering all that. Plus the whitelist file would have to contain whole addresses - you couldn't simply put "google" in to whitelist everything with google in the name. Not sure if the comments after some of those entries would prevent a match either.

    "nothing is ever easy".

    Feathers
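
    One possible way around the whole-line problem, as a rough sketch rather than a tested addition to Steve's script: normalize every entry to the same "0.0.0.0<TAB>name" shape first, then drop entries that match any pattern in a whitelist file with grep -v -f. That would also let a bare word such as "google" whitelist every name containing it. The function names and the whitelist file are assumptions of mine.

```shell
#!/bin/sh
# Hypothetical sketch: normalize hosts entries, then filter them by whitelist patterns.

# Reduce each entry to "0.0.0.0<TAB>hostname": strip carriage returns and comments,
# keep only 127.0.0.1 / 0.0.0.0 lines, and join address and name with a single tab.
# awk splits on any run of spaces or tabs, so both source formats come out identical.
normalize_hosts() {
    sed -e 's/\r$//' -e 's/#.*$//' |
        awk '($1 == "127.0.0.1" || $1 == "0.0.0.0") && NF >= 2 { print "0.0.0.0\t" $2 }'
}

# Drop every normalized entry that matches a pattern in the whitelist file ($1).
# Each whitelist line is a grep basic regex matched against the whole entry.
# Caution: a blank line in the whitelist matches everything and empties the output.
apply_whitelist() {
    grep -v -f "$1"
}

# Possible use in the script (commented out so nothing runs on load):
#   normalize_hosts < "$temphosts1" | sort -u | apply_whitelist whitelist > "$temphosts2"
```

    The trade-off is that substring patterns can whitelist more than intended, so short patterns need some care.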


  • Feathers McGraw
    replied
    Originally posted by jlittle View Post
    Right you are then, the comm command does this stuff, only it needs sorted input, so would have to be applied before Steve's merge step, and the whitelist would need to be sorted:

    Code:
    comm -23 $temphosts2 whitelist > $temphosts3
    Regards, John Little
    Thanks for this. If I put it all in one script (probably will for convenience) I'll look into it.


  • jlittle
    replied
    Originally posted by Feathers McGraw View Post
    If I've done anything embarrassingly inefficient, let me know lol.
    Right you are then, the comm command does this stuff, only it needs sorted input, so would have to be applied before Steve's merge step, and the whitelist would need to be sorted:

    Code:
    comm -23 $temphosts2 whitelist > $temphosts3
    Regards, John Little
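
    To make the sorted-input requirement concrete, here is a small sketch with made-up file contents. Both files are sorted under the same locale (LC_ALL=C is my assumption, chosen so that sort and comm agree on ordering), and comm -23 then prints only the lines unique to the first file.

```shell
#!/bin/sh
# Hypothetical sketch: remove exact whitelist lines from a block list with comm.
# comm needs both inputs sorted the same way, so sort and comm share LC_ALL=C.
export LC_ALL=C

workdir=$(mktemp -d)

# Made-up block list and whitelist; entries must match character for character.
printf '0.0.0.0\t%s\n' tracker.example.net ads.example.com cdn.example.org |
    sort -u > "$workdir/blocked"
printf '0.0.0.0\t%s\n' cdn.example.org | sort -u > "$workdir/whitelist"

# -2 suppresses lines found only in the whitelist, -3 suppresses lines found in both,
# which leaves the lines that appear only in the block list.
comm -23 "$workdir/blocked" "$workdir/whitelist" > "$workdir/filtered"

# Shows the ads.example.com and tracker.example.net entries.
cat "$workdir/filtered"
```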


  • jlittle
    replied
    Since, two years later, you're still interested in comments...

    Feathers had a problem with truncation copying the script, I think, and it brought to mind a reaction I had originally, but dismissed at the time as just being pernickety, though now I regret not speaking up. Anyway, I would lay out the long sed command using line continuations, maybe:
    Code:
    sed -e 's/\r//'               \
        -e '/^127.0.0.1/!d'       \
        -e '/localhost/d'         \
        -e 's/127.0.0.1/0.0.0.0/' \
        -e 's/ \+/\t/'            \
        -e 's/#.*$//'             \
        -e 's/[ \t]*$//'          \
        < $temphosts1 |
        sort -u > $temphosts2
    and the merge:
    Code:
    echo -e "\n# Ad blocking hosts generated "$(date) |
        cat ~/hosts-system - $temphosts2 > ~/hosts-block
    Looks cool IMO with syntax colouring, like kate provides.

    Regards, John Little


  • SteveRiley
    replied
    Originally posted by kubicle View Post
    In case you're interested in suggestions for the script
    Thanks, those are great suggestions. They would certainly be useful for broadening my Bash script understanding, and would make the code more in line with Linux conventions.


  • GreyGeek
    replied


  • Feathers McGraw
    replied
    I see.

    Having a read of that Privoxy link now, looks really interesting, thanks!


  • kubicle
    replied
    Originally posted by Feathers McGraw View Post
    Can you not add an IP address to a hosts file?
    Connections to IP addresses bypass DNS (DNS is there to match domain names to IP addresses, which obviously isn't necessary for IP addresses)...so the hosts file doesn't come into play.


  • Feathers McGraw
    replied
    Can you not add an IP address to a hosts file?


  • kubicle
    replied
    Nice script, but there are a few inherent weaknesses in /etc/hosts blocking. For example, it only affects DNS queries...so things that use IP addresses instead of DNS queries are unaffected (I think this is somewhat of a tendency these days...like Google safebrowsing).

    I'd personally prefer something like privoxy (http://en.wikipedia.org/wiki/Privoxy) for ad-blocking/privacy, but of course there is nothing wrong in using both.

    In case you're interested in suggestions for the script:
    1. since /etc/hosts is systemwide, I'd probably use something like /var/local/hostblock instead of the user's $HOME for storing the backup of hosts, and /usr/local/sbin for the script.
    2. use variables instead of hardcoded paths/filenames for easier modification.
    3. You could create separate dynamic hostblock files in /var/local/hostblock which could be used to generate the hosts file, like:
    hostblock.localhost (localhost hosts entries)
    hostblock.static (static addresses, like static lan addresses)
    hostblock.dynamic (user configurable dynamic addresses queried at runtime, like shortcut entries for dynamic DNS hostnames)
    hostblock.block (null addresses for ad-blocking)
    hostblock.blacklist (user configurable additions to blocked hosts)
    hostblock.whitelist (addresses user wants to whitelist, removed from blocked hosts)
    and possibly some configuration files:
    hostblock.conf (could be used to store the variables)
    hostblock.blocklists (store list of urls of adblocking hosts-files downloaded from the net)
    4. make it cronjob friendly, this could include variable times for changing hosts (entries in hostblock.dynamic could be checked every 10 minutes, hostblock.block once a week etc.) and some error checking to make sure it'll make a valid hosts file in case /etc/hosts is modified automatically.

    (All just suggestions, of course, if you prefer to keep it simple that's completely fine)
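
    To give a feel for suggestions 2 and 3, here's a rough sketch of how the pieces could be assembled. Everything in it is hypothetical: the HOSTBLOCK_DIR variable, the build_hosts function, and the exact role of each file are just one reading of the layout above, not an existing tool. hostblock.dynamic and the config files are left out to keep it short.

```shell
#!/bin/sh
# Hypothetical sketch: build a hosts file from separate hostblock.* pieces.
# HOSTBLOCK_DIR would be /var/local/hostblock in practice; override it for testing.
HOSTBLOCK_DIR="${HOSTBLOCK_DIR:-/var/local/hostblock}"

# Print the combined hosts file on stdout.
# Assumes .block, .blacklist and .whitelist hold lines of the form "0.0.0.0<TAB>name".
build_hosts() {
    dir="$HOSTBLOCK_DIR"

    # Fixed entries come first, unchanged; missing files are simply skipped.
    for part in localhost static; do
        if [ -f "$dir/hostblock.$part" ]; then
            cat "$dir/hostblock.$part"
        fi
    done

    echo "# Ad blocking hosts generated $(date)"

    # With no whitelist file, fall back to an empty pattern list (matches nothing).
    whitelist="$dir/hostblock.whitelist"
    [ -f "$whitelist" ] || whitelist=/dev/null

    # Blocked hosts = (block + blacklist) minus whitelist, compared as whole fixed lines.
    cat "$dir/hostblock.block" "$dir/hostblock.blacklist" 2>/dev/null |
        LC_ALL=C sort -u |
        grep -v -x -F -f "$whitelist"
}

# Possible use from a cron job (commented out so nothing runs on load):
#   build_hosts > /etc/hosts.new && mv /etc/hosts.new /etc/hosts
```

    Writing to a temporary name and renaming it afterwards is one simple way to avoid leaving a half-written /etc/hosts behind if something fails midway.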
