grep with a 4 word search



    Code:
    grep --version     # grep (GNU grep) 3.7
    Objective
    Search thousands of files for 4 words, where:
    the 4 words can be in any file
    the 4 words can be in any order
    the 4 words have an unknown number of words in between each
    the 4 words might have a prefix: tree, blacktree
    the 4 words might have a suffix: dog, dogs


    Objective
    find all files that contain 4 words in any order.


    Sample
    Text file has many Lines, here are 2 Lines:

    Line 3 = tree and claws or dog maybe paws .
    Line 4 = trees and zclaws or dogs maybe zpaws .


    This works
    Code:
    time grep -Eri '.*tree.* .*claws.* .*dog.* .*paws.*'
    correct output:
    Dir1/Dir2/filename.txt:Line 3 = tree and claws or dog maybe paws .
    Dir1/Dir2/filename.txt:Line 4 = trees and zclaws or dogs maybe zpaws .

    But this does not work:
    Code:
    time grep -Eri '.*claws.* .*dog.* .*paws.* .*tree.*'
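    The above does not work.
    It only changes the order of the words in the pattern:
    tree was the 1st word and is now the 4th word.
    The regex matches its parts left to right, so a line is only found when its words appear in the same order as in the pattern.
    Thus, changing the order of the words fails to find the file.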


    How do I structure the grep command
    so that the order of the words does not matter?
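One way to make the order irrelevant on a single line is to test for each word separately instead of spelling out a sequence. A minimal sketch, assuming your GNU grep was built with PCRE support (the -P option); the demo directory and its contents are made up for illustration, and this still only looks at one line at a time:

```shell
# Demo data (hypothetical): the four words in two different orders,
# plus one line that is missing two of them.
demo=$(mktemp -d)
printf '%s\n' \
    'tree and claws or dog maybe paws .' \
    'zpaws then dogs then zclaws then trees .' \
    'tree and claws only .' > "$demo/sample.txt"

# Each (?=.*word) is a lookahead that must succeed somewhere on the line,
# so all four words are required but no order is imposed.
any_order=$(grep -rniP '^(?=.*tree)(?=.*claws)(?=.*dog)(?=.*paws)' "$demo")
printf '%s\n' "$any_order"
```

If -P is not available on your system, this sketch will not run as written.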



    #2
    I'm not sure this is the best place to look for this kind of answer, but there are some smart people here. I would have tried StackExchange or someplace like that first.

    Anyway, it looks like it's doing what you've asked - four words in a row.

    What about:
    Code:
    time grep -i -e 'tree' -e 'claws' -e 'dog' -e 'paws' words3
    or maybe:
    Code:
    time find /path/to/files -type f -exec grep -i -e 'dog' -e 'paws' -e 'tree' -e 'claws' /dev/null {} +
    My test file:
    Code:
    bunch of gibberish
    more talk talk talk
    paws tree and claws or dog maybe .
    trees and zclaws or zpaws and dogs.

    My output from grep command:
    Code:
    paws tree and claws or dog maybe .
    trees and zclaws or zpaws and dogs.
    
    real 0m0.001s
    user 0m0.001s
    sys 0m0.000s
    My output from find command:
    Code:
    testfile:paws tree and claws or dog maybe .
    testfile:trees and zclaws or zpaws and dogs.
    
    real 0m0.002s
    user 0m0.002s
    sys 0m0.000s

    Then I made four copies of the testfile and changed them so that testfile1 had both matching lines, testfile2 had no matching lines, and testfile3 and testfile4 each had a different one of the matching lines. I then ran the find command on them.

    output:
    Code:
    testfile1:paws tree and claws or dog maybe .
    testfile1:trees and zclaws or zpaws and dogs.
    testfile3:paws tree and claws or dog maybe .
    testfile4:trees and zclaws or zpaws and dogs.
    
    real    0m0.003s
    user    0m0.001s
    sys     0m0.002s
    Seems to do what I think you want...
    Last edited by oshunluvr; Nov 09, 2022, 01:32 PM.



      #3
      Of course, all of these only look at sentences, not the whole file, and you want it to return only files containing ALL four words?



        #4
        Putting the four words in a file then doing this seems to do the same thing:

        Code:
        time grep -Ff words testfile*




          #5
          This seems to find those four words in the FILE rather than in one line:
          Code:
          find testfile* -type f -exec grep -q tree {} \; -exec grep -q paws {} \; -exec grep -q dog {} \; -exec grep -l claws {} \;
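The chain in #5 works because find treats each `-exec ... \;` as a test that is true only when the command exits with status 0, and consecutive tests are ANDed. A small sketch on made-up files (the names and contents are hypothetical) shows that only a file containing all four words, even on separate lines, is listed:

```shell
demo=$(mktemp -d)
printf 'tree\npaws\ndog\nclaws\n' > "$demo/all_four.txt"    # words on separate lines
printf 'tree paws dog\n'          > "$demo/three_only.txt"  # claws is missing
printf 'nothing relevant\n'       > "$demo/none.txt"

# find stops evaluating a file at the first grep that fails, so only
# files that pass the three quiet greps ever reach the final grep -l.
whole_file_hits=$(find "$demo" -type f \
    -exec grep -q tree  {} \; \
    -exec grep -q paws  {} \; \
    -exec grep -q dog   {} \; \
    -exec grep -l claws {} \;)
printf '%s\n' "$whole_file_hits"
```

Note that the greps here are case-sensitive, as in the post; adding -i to each would ignore case.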


            #6
            Code:
            time grep -Eri '.*tree.* .*claws.* .*dog.* .*paws.*'
            Above gave expected answer in under 1 second
            Above outputs 2 Lines:
            Dir1/Dir2/filename.txt:Line 3 = tree and claws or dog maybe paws .
            Dir1/Dir2/filename.txt:Line 4 = trees and zclaws or dogs maybe zpaws .


            Code:
            time grep -i -e 'tree' -e 'claws' -e 'dog' -e 'paws'
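            Above was still processing after 4 minutes; I guessed it hung.
            Note: words3 was removed. With no file argument and no -r, grep reads standard input, so it was most likely waiting for input rather than hung.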


            Code:
            time find . -type f -exec grep -i -e 'dog' -e 'paws' -e 'tree' -e 'claws' /dev/null {} +
            Above answers in under 1 second
            Above outputs not 2 Lines but 18 Lines including JPGs like:
            grep: ./DJI_0198.JPG: binary file matches

            Above was a surprise
            because it appears to have even more details
            because it goes inside binaries JPGs.

            i understand a Hex Editor is needed to confirm so
            i can search inside binaries JPGs for words
            tree
            claws
            dog
            paws
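A hex editor is one option, but grep itself can show what matched inside a binary. A minimal sketch, assuming GNU grep; the file below is a made-up stand-in for a JPG, not a real image:

```shell
demo=$(mktemp -d)
# Hypothetical binary: a few words embedded between NUL and control bytes.
printf 'JFIF\000\001\002tree\000\003paws\000' > "$demo/fake.jpg"

# By default grep only reports "binary file matches".
# -a (--text) treats the file as text, and -o prints only the matched parts.
binary_words=$(grep -a -o -i -e tree -e claws -e dog -e paws "$demo/fake.jpg")
printf '%s\n' "$binary_words"
```

On real images, short words like these may turn up as coincidental byte sequences or metadata rather than meaningful text.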



              #7
              Originally posted by oshunluvr
              Of course, all of these only look at sentences, not the whole file, and you want it to return only files containing ALL four words?
              Correct, look at the whole file.
              The 4 words have an unknown number of words in between each.

              files that contain all 4 words
              tree
              claws
              dog
              paws



                #8
                Maybe add a -I to the grep

                With grep, -I processes a binary file as if it did not contain matching data. This is equivalent to the --binary-files=without-match option.
                Code:
                find testfile* -type f -exec grep -qI tree {} \; -exec grep -qI paws {} \; -exec grep -qI dog {} \; -exec grep -lI claws {} \;



                  #9
                  If you don't want matches on words like 'endogenous' or 'street', I suggest adding beginning and end of word matches, for example grep '\<dog\>'. Ignoring case might be appropriate too.
                  Regards, John Little
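To see the trade-off between substring and whole-word matching, here is a small sketch on a made-up word list. It assumes GNU grep, where `\<` and `\>` anchor the start and end of a word:

```shell
demo=$(mktemp -d)
printf '%s\n' 'dog' 'dogs' 'endogenous' 'hotdog' > "$demo/words.txt"

plain=$(grep -c 'dog' "$demo/words.txt")       # substring: matches all four lines
whole=$(grep -c '\<dog\>' "$demo/words.txt")   # whole word: only "dog"
starts=$(grep -c '\<dog' "$demo/words.txt")    # word start: "dog" and "dogs"
printf 'plain=%s whole=%s starts=%s\n' "$plain" "$whole" "$starts"
```

Since the original objective allows prefixes (blacktree) and suffixes (dogs), strict `\<word\>` matching would drop those hits; anchoring only one side, or neither, is a per-word judgment call.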



                    #10
                    For future readers:
                    a shorter command, but it gives too many hits, because it matches any of the four words rather than all of them. It needs
                    1 command and
                    1 text file

                    Code:
                    grep -rif pattern_file.txt
                    or
                    Code:
                    grep -rif ~/Desktop/pattern_file.txt
                    pattern_file.txt =
                    tree
                    claws
                    dog
                    paws
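The extra hits come from how -f works: grep reads one pattern per line and reports a line when any of them matches. A short sketch on made-up files (paths and contents are hypothetical):

```shell
demo=$(mktemp -d)
printf '%s\n' tree claws dog paws > "$demo/pattern_file.txt"
mkdir "$demo/docs"
printf 'tree claws dog paws\n' > "$demo/docs/all.txt"
printf 'just a dog here\n'     > "$demo/docs/one.txt"

# -l lists file names only. one.txt is reported even though it contains
# just one of the four words, because the patterns are ORed.
or_hits=$(grep -rilf "$demo/pattern_file.txt" "$demo/docs" | sort)
printf '%s\n' "$or_hits"
```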



                      #11
                      fyi
                      grep with a 4 word search, meaning:
                      tree
                      claws
                      dog
                      paws

                      4 words in any file
                      4 words in any order
                      4 words have unknown # of words in between each word
                      4 words might have a prefix: tree, blacktree
                      4 words might have a suffix: dog, dogs

                      The code below is OK;
                      it is passing all testing thus far. Thank you oshunluvr.


                      Code:
                      time find . -type f -exec grep -qI tree {} \; -exec grep -qI paws {} \; -exec grep -qI dog {} \; -exec grep -lI claws {} \;

                      output
                      ./Dir1/Dir2/filename1.txt
                      ./Dir1/Dir2/filename2.txt
                      ...



                      The same code, reformatted vertically:

                      Code:
                      time find . -type f \
                      -exec grep -qI tree    {} \; \
                      -exec grep -qI paws    {} \; \
                      -exec grep -qI dog     {} \; \
                      -exec grep -lI claws   {} \;      # grep 4 word search



                        #12
                        (Love the reformat.)

                        For a very large number of files, running grep several times for each might be slow. (Back in the day, say on hard drives and much less memory for caching, it might have been significant.) If that's an issue, you'd want to do a grep -lRI with the most selective word, output to a temporary file. Then, for each of the remaining words, grep just the files in the current list, whittling it down.
                        Regards, John Little
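The whittling idea can also be sketched as a single pipeline, which is a variant that skips the temporary file. It assumes GNU grep and GNU xargs, and that tree is the most selective word; the demo files are made up:

```shell
demo=$(mktemp -d)
printf 'tree\npaws\ndog\nclaws\n' > "$demo/all four.txt"   # name with a space on purpose
printf 'tree\npaws\n'             > "$demo/two.txt"
printf 'dog\nclaws\n'             > "$demo/other.txt"

# Stage 1 scans everything for the first word. Each later stage only opens
# the files that survived the previous one. -Z and -0 pass file names
# NUL-terminated, and -r (GNU xargs) skips a stage when the list is empty.
whittled=$(grep -lRIiZ tree "$demo" \
    | xargs -0 -r grep -lIiZ paws \
    | xargs -0 -r grep -lIiZ dog \
    | xargs -0 -r grep -lIi  claws)
printf '%s\n' "$whittled"
```

Each file is read at most once per word, and most files are dropped after the first stage, which is where the saving over four greps per file would come from.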



                          #13
                          from John Little, we have code that finds text files:
                          Code:
                          grep -lRI 'tree'
                          But:
                          grep (GNU grep) 3.7 does not search the text inside .pdf, .doc, etc., so I am using
                          pdfgrep version 2.1.2


                          Question:
                          With pdfgrep, how do I
                          exclude .JPG, .PNG, ... and
                          include .PDF, .txt, MS Word .doc, LibreOffice .odt, .ods?


                          note change in code from -qI to -qi


                          Code:
                          word1=tree
                          word2=paws
                          word3=dog
                          word4=claws
                          time find . -type f \
                          -exec pdfgrep -qi "$word1" {} \; \
                          -exec pdfgrep -qi "$word2" {} \; \
                          -exec pdfgrep -qi "$word3" {} \; \
                          -exec pdfgrep -li "$word4" {} \;



                            #14
                            pdfgrep is new to me, but from its home page it only searches pdfs. Maybe you'll need a specialized tool for each file type.
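One way to follow the "one tool per file type" idea is to let find select files by extension before any search tool runs. This is a sketch only: the function names are hypothetical, pdfgrep is assumed to be installed for the PDF case, and .doc, .odt and .ods are not covered because they would need their own converters:

```shell
# PDFs only. -iname also catches .PDF, and JPG/PNG files never reach pdfgrep.
# The -q, -i and -l options are used the same way as in post #13.
pdf_files_with_all() {
    find "$1" -type f -iname '*.pdf' \
        -exec pdfgrep -qi tree  {} \; \
        -exec pdfgrep -qi paws  {} \; \
        -exec pdfgrep -qi dog   {} \; \
        -exec pdfgrep -li claws {} \;
}

# Plain text only: the same chain with grep.
txt_files_with_all() {
    find "$1" -type f -iname '*.txt' \
        -exec grep -qiI tree  {} \; \
        -exec grep -qiI paws  {} \; \
        -exec grep -qiI dog   {} \; \
        -exec grep -liI claws {} \;
}

# Demo on made-up files: the .JPG has the words but the wrong extension.
demo=$(mktemp -d)
printf 'tree paws dog claws\n' > "$demo/notes.txt"
printf 'tree paws dog claws\n' > "$demo/photo.JPG"
txt_hits=$(txt_files_with_all "$demo")
printf '%s\n' "$txt_hits"
```

Running `pdf_files_with_all .` and `txt_files_with_all .` one after the other would give the combined list for those two types.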

                            Using ls -Q **/*.pdf will go a long way these days (mucking around, I generated a single command line of about a gigabyte). In bash, the **/ construct needs shopt -s globstar. It handles spaces and quotes in names, so that might be good for the first scan for file names rather than grep -R. Otherwise, with a list of files of more than, say, 100 MB, the first list of files would have to come from find, for example find * -type f -name '*.pdf', as oshunluvr suggested.

                            Then, the steps whittling the list will be faster with something like
                            Code:
                            tr '\n' '\0' < filelist | xargs --null pdfgrep -l '\<word\>' > newlist
                            using nulls to replace newlines so that xargs --null handles the nasty file names correctly. I'd thought that grep would need --null-data or -z, but in my quick testing with some gnarly file names the --null to xargs was enough. Some folks use nulls throughout, including in the file lists, to cope with newlines in the names, but that makes it difficult to look in the lists, and I'd be surprised if you have file names with newlines in them. (There are only two characters Unix file names can't have: null and slash.)
                            Regards, John Little
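The list-file version of the whittling step can be tried out end to end with plain grep standing in for pdfgrep (a substitution made here only so the sketch runs without PDFs). The file names and contents are made up:

```shell
demo=$(mktemp -d)
printf 'tree paws\n' > "$demo/has both.txt"   # space in the name on purpose
printf 'tree only\n' > "$demo/tree.txt"

# First list: one file name per line, which is fine as long as no name
# contains a newline. The lists live outside $demo so they are not scanned.
grep -lRIi tree "$demo" > "$demo.filelist"

# Whittle: newlines become NULs so xargs passes each name to grep intact.
tr '\n' '\0' < "$demo.filelist" | xargs --null grep -lIi 'paws' > "$demo.newlist"
newlist=$(cat "$demo.newlist")
printf '%s\n' "$newlist"
```

Repeating the last step with newlist as the input and the next word whittles the list further.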
