**oshunluvr** · Nov 09, 2022, 01:29 PM

I'm not sure this is the best place to look for this kind of answer, but there are some smart people here. I would have tried StackExchange or someplace like that first.

Anyway, it looks like it's doing what you've asked - four words in a row.

What about:

Code:

time grep -i -e 'tree' -e 'claws' -e 'dog' -e 'paws' words3

or maybe:

Code:

time find /path/to/files -type f -exec grep -i -e 'dog' -e 'paws' -e 'tree' -e 'claws' /dev/null {} +

My test file:

Code:

bunch of gibberish
more talk talk talk
paws tree and claws or dog maybe .
trees and zclaws or zpaws and dogs.

My output from grep command:

Code:

paws tree and claws or dog maybe .
trees and zclaws or zpaws and dogs.

real 0m0.001s
user 0m0.001s
sys 0m0.000s

My output from find command:

Code:

testfile:paws tree and claws or dog maybe .
testfile:trees and zclaws or zpaws and dogs.

real 0m0.002s
user 0m0.002s
sys 0m0.000s

Then I made four copies of the testfile and changed them so that 1 had both matching lines, 2 had no matching lines, and 3 and 4 each had one (different) matching line, and ran the find command on them.

Output:

Code:

testfile1:paws tree and claws or dog maybe .
testfile1:trees and zclaws or zpaws and dogs.
testfile3:paws tree and claws or dog maybe .
testfile4:trees and zclaws or zpaws and dogs.

real    0m0.003s
user    0m0.001s
sys     0m0.002s

Seems to do what I think you want...

**oshunluvr** · Nov 09, 2022, 01:36 PM

Of course, all of these only look at sentences, not the whole file, and you want it to return only files containing ALL four words?

**oshunluvr** · Nov 09, 2022, 01:42 PM

Putting the four words in a file then doing this seems to do the same thing:

Code:

time grep -Ff words testfile*
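
A minimal sketch of why the two forms behave alike, assuming a made-up scratch file: several -e options and a pattern file given with -f are both lists of alternatives, so a line is printed when it contains any one of the words, not all of them. The names sample and words under the temporary directory are hypothetical.

```shell
# Hypothetical scratch files, created only for this illustration.
tmpdir=$(mktemp -d)
printf '%s\n' 'only a dog here' 'nothing relevant' 'tree and paws' > "$tmpdir/sample"
printf '%s\n' 'tree' 'claws' 'dog' 'paws' > "$tmpdir/words"

# Four -e patterns: a line matches if it contains ANY of them.
with_e=$(grep -e 'tree' -e 'claws' -e 'dog' -e 'paws' "$tmpdir/sample")

# The same four words read from a file as fixed strings (-F) via -f.
with_f=$(grep -Ff "$tmpdir/words" "$tmpdir/sample")

printf '%s\n' "$with_f"
rm -r "$tmpdir"
```

Both variables should hold the same two lines, which is why neither command can answer "which files contain all four words".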

**oshunluvr** · Nov 09, 2022, 01:54 PM

This seems to find those four words in the FILE rather than in one line:

Code:

find testfile* -type f -exec grep -q tree {} \; -exec grep -q paws {} \; -exec grep -q dog {} \; -exec grep -l claws {} \;
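
A sketch of why this works per file, using hypothetical file names: find treats each -exec ... \; as a test that is true when the command exits with status 0, and consecutive tests are joined by an implicit AND. A file therefore reaches the final grep -l only if every earlier grep -q found its word somewhere in that file.

```shell
# Hypothetical test files: one with all four words on separate lines, one missing "claws".
tmpdir=$(mktemp -d)
printf '%s\n' 'tree' 'paws' 'dog' 'claws' > "$tmpdir/all_four"
printf '%s\n' 'tree and paws' 'dog'       > "$tmpdir/three_only"

# Each grep -q is a silent yes/no test; the last grep -l prints the surviving file name.
all_words_files=$(find "$tmpdir" -type f \
  -exec grep -q tree {} \; \
  -exec grep -q paws {} \; \
  -exec grep -q dog {} \; \
  -exec grep -l claws {} \;)

printf '%s\n' "$all_words_files"
rm -r "$tmpdir"
```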

**joseph22** · Nov 09, 2022, 02:27 PM

Code:

time grep -Eri '.*tree.* .*claws.* .*dog.* .*paws.*'

Above gave expected answer in under 1 second
Above outputs 2 Lines:
Dir1/Dir2/filename.txt:Line 3 = tree and claws or dog maybe paws .
Dir1/Dir2/filename.txt:Line 4 = trees and zclaws or dogs maybe zpaws .
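
A small sketch of what this regular expression requires, with made-up input lines: the four words must appear on one line, in the order written, with at least one space between neighbours. A line holding the same words in another order does not match.

```shell
pattern='.*tree.* .*claws.* .*dog.* .*paws.*'

# Words in the same order as the pattern: counted as a match.
in_order=$(printf '%s\n' 'tree and claws or dog maybe paws' | grep -Eic "$pattern" || true)

# Same words, but "paws" comes first: not a match for this pattern.
out_of_order=$(printf '%s\n' 'paws tree and claws or dog' | grep -Eic "$pattern" || true)

printf 'in order: %s, out of order: %s\n' "$in_order" "$out_of_order"
```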

Code:

time grep -i -e 'tree' -e 'claws' -e 'dog' -e 'paws'

Above was still processing after 4 minutes, guessing it hung.
Note: words3 (the file name) was removed from the command.
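
One likely explanation for the apparent hang, offered as an assumption rather than a diagnosis: with no file operand and no -r, grep reads standard input, so at a terminal it sits waiting for typed input. The sketch below feeds the same command from a pipe, then shows -r with a directory operand searching files instead. The directory and file names are hypothetical.

```shell
# No file operand: grep reads standard input, supplied here by a pipe.
from_stdin=$(printf '%s\n' 'a tree' 'no match' | grep -i -e 'tree' -e 'claws' -e 'dog' -e 'paws')

# With -r and a directory operand, grep walks the directory instead.
tmpdir=$(mktemp -d)
printf '%s\n' 'dog and paws' > "$tmpdir/notes.txt"
from_dir=$(grep -ri -e 'tree' -e 'claws' -e 'dog' -e 'paws' "$tmpdir")

printf '%s\n' "$from_stdin" "$from_dir"
rm -r "$tmpdir"
```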

Code:

time find . -type f -exec grep -i -e 'dog' -e 'paws' -e 'tree' -e 'claws' /dev/null {} +

Above answers in under 1 second
Above outputs not 2 Lines but 18 Lines including JPGs like:
grep: ./DJI_0198.JPG: binary file matches

Above was a surprise
because it reports more matches:
it also looks inside binary files such as JPGs.

I understand a hex editor is needed to confirm, so that
I can search inside the binary JPGs for the words
tree
claws
dog
paws

**joseph22** · Nov 09, 2022, 02:29 PM

Originally posted by oshunluvr

Of course, all of these only look at sentences, not the whole file, and you want it to return only files containing ALL four words?

Correct, look at the whole file.
The 4 words have an unknown # of words in between each.

files that contain all 4 words
tree
claws
dog
paws

**oshunluvr** · Nov 09, 2022, 02:43 PM

Maybe add a -I to the grep

Using grep, -I will process a binary file as if it did not contain matching data; this is equivalent to the --binary-files=without-match option.

Code:

find testfile* -type f -exec grep -qI tree {} \; -exec grep -qI paws {} \; -exec grep -qI dog {} \; -exec grep -lI claws {} \;
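
A sketch of the effect of -I, assuming a fake "image" that is really just a file containing a NUL byte (which is what makes grep classify it as binary). The file names are hypothetical.

```shell
tmpdir=$(mktemp -d)
printf 'tree\000paws\n' > "$tmpdir/fake.jpg"    # NUL byte: grep treats this file as binary
printf 'tree\n'         > "$tmpdir/plain.txt"

# Without -I both files are listed; with -I the binary one is treated as non-matching.
without_I=$(grep -l  tree "$tmpdir/fake.jpg" "$tmpdir/plain.txt")
with_I=$(grep -lI tree "$tmpdir/fake.jpg" "$tmpdir/plain.txt")

printf 'without -I:\n%s\nwith -I:\n%s\n' "$without_I" "$with_I"
rm -r "$tmpdir"
```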

**jlittle** · Nov 09, 2022, 05:04 PM

If you don't want matches on words like 'endogenous' or 'street', I suggest adding beginning and end of word matches, for example grep '\<dog\>'. Ignoring case might be appropriate too.
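
A quick sketch of the difference, using a made-up word list: the plain pattern counts every line containing the letters "dog", while the \< and \> anchors only accept "dog" as a whole word. These anchors are a GNU-style extension; grep -w is a similar whole-word option.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'endogenous growth' 'a street' 'the dog barked' 'Dog' > "$tmpdir/words.txt"

# Substring match, ignoring case: also hits "endogenous".
loose=$(grep -ci 'dog' "$tmpdir/words.txt")

# Whole-word match, ignoring case: only "the dog barked" and "Dog".
whole=$(grep -ci '\<dog\>' "$tmpdir/words.txt")

printf 'substring: %s lines, whole word: %s lines\n' "$loose" "$whole"
rm -r "$tmpdir"
```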

**joseph22** · Nov 09, 2022, 07:15 PM

for future readers:
shorter command but too many hits, needs
1 command and
1 text file

Code:

grep -rif pattern_file.txt

or

Code:

grep -rif ~/Desktop/pattern_file.txt

pattern_file.txt =
tree
claws
dog
paws

**joseph22** · Nov 09, 2022, 07:37 PM

fyi
grep with a 4 word search, meaning:
tree
claws
dog
paws

4 words in any file
4 words in any order
4 words have unknown # of words in between each word
4 words might have a prefix: tree, blacktree
4 words might have a suffix: dog, dogs

code below is Ok,
code below is passing all testing thus far. Thank you oshunluvr.

Code:

time find . -type f -exec grep -qI tree {} \; -exec grep -qI paws {} \; -exec grep -qI dog {} \; -exec grep -lI claws {} \;

output
./Dir1/Dir2/filename1.txt
./Dir1/Dir2/filename2.txt
...

vertical reformat above code:

Code:

time find . -type f \
-exec grep -qI tree    {} \; \
-exec grep -qI paws    {} \; \
-exec grep -qI dog     {} \; \
-exec grep -lI claws   {} \;      # grep 4 word search

**jlittle** · Nov 10, 2022, 01:00 PM

(Love the reformat.)

For a very large number of files, running grep several times for each might be slow. (Back in the day, say on hard drives and with much less memory for caching, it might have been significant.) If that's an issue, you'd want to do a grep -lRI with the most selective word, output to a temporary file. Then, for each of the remaining words, grep just the files in the current list, whittling it down.
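
The whittling idea can be sketched roughly as follows. This is only an illustration under assumptions: the directory layout and file names are invented, "claws" is assumed to be the most selective word, and the list is newline-delimited (so names containing newlines are not handled).

```shell
# Hypothetical directory with three small text files.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/docs"
printf '%s\n' 'tree paws dog claws' > "$tmpdir/docs/all.txt"
printf '%s\n' 'claws and tree'      > "$tmpdir/docs/some.txt"
printf '%s\n' 'dog'                 > "$tmpdir/docs/none.txt"

list="$tmpdir/list"   # kept outside docs/ so the recursive scan does not read it

# First pass: one recursive scan for the word assumed to be most selective.
grep -lRI 'claws' "$tmpdir/docs" > "$list" || true

# Later passes: search only the files still on the list, shrinking it each time.
for word in tree paws dog; do
  tr '\n' '\0' < "$list" | xargs -0 -r grep -lI "$word" > "$list.new" || true
  mv "$list.new" "$list"
done

survivors=$(cat "$list")
printf '%s\n' "$survivors"
rm -r "$tmpdir"
```

Each pass starts one grep per batch of files rather than one per file, and later passes only open files that survived the earlier ones.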

**joseph22** · Nov 10, 2022, 06:54 PM

from John Little, we have code that finds text files:

Code:

grep -lRI 'tree'

But:
grep (GNU grep) 3.7 does not search inside .pdf, .doc, etc., so I am using
pdfgrep version 2.1.2

Question:
With pdfgrep, how do I
exclude .JPG, .PNG, ... and
include .PDF, .txt, MS Word .doc, LibreOffice .odt, .ods?

Note the change in code from -qI to -qi

Code:

word1=tree
word2=paws
word3=dog
word4=claws
# Quote the variables so a search term containing spaces stays one argument
time find . -type f \
-exec pdfgrep -qi "$word1" {} \; \
-exec pdfgrep -qi "$word2" {} \; \
-exec pdfgrep -qi "$word3" {} \; \
-exec pdfgrep -li "$word4" {} \;
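
One possible way to keep images away from pdfgrep, assuming file extensions can be trusted: let find select files by name before any -exec test runs. The sketch below uses empty placeholder files with hypothetical names, so the pdfgrep step (run only if pdfgrep is installed) is not expected to report anything. -iname is a GNU/BSD find extension. The other formats asked about (.doc, .odt, .ods) would need a different tool and are not covered here.

```shell
tmpdir=$(mktemp -d)
: > "$tmpdir/report.pdf"
: > "$tmpdir/SCAN.PDF"
: > "$tmpdir/photo.JPG"
: > "$tmpdir/notes.txt"

# Case-insensitive name test: only *.pdf names pass; images and text files are skipped.
pdf_candidates=$(find "$tmpdir" -type f -iname '*.pdf' | LC_ALL=C sort)
printf '%s\n' "$pdf_candidates"

# The same filter placed in front of the chained pdfgrep tests.
if command -v pdfgrep >/dev/null 2>&1; then
  find "$tmpdir" -type f -iname '*.pdf' \
    -exec pdfgrep -qi 'tree'  {} \; \
    -exec pdfgrep -qi 'paws'  {} \; \
    -exec pdfgrep -qi 'dog'   {} \; \
    -exec pdfgrep -li 'claws' {} \; 2>/dev/null || true
fi
rm -r "$tmpdir"
```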

**jlittle** · Nov 11, 2022, 05:59 AM

pdfgrep is new to me, but from its home page it only searches PDFs. Maybe you'll need a specialized tool for each file type.

Using ls -Q **/*.pdf will go a long way these days (mucking around, I generated a single command line of about a gigabyte). The **/ construct handles spaces and quotes in names, so that might be better for the first scan for file names than grep -R. Otherwise, with a list of files of more than, say, 100 MB, the first list of files would have to come from find, for example find * -type f -name '*.pdf', as oshunluvr suggested.

Then, the steps whittling the list will be faster with something like

Code:

tr '\n' '\0' < filelist | xargs --null pdfgrep -l '\<word\>' > newlist

using nulls to replace newlines so that xargs --null handles the nasty file names correctly. I'd thought that grep would need --null-data or -z but with my quick testing with some gnarly file names the --null to xargs was enough. Some folks use nulls throughout, to cope with newlines in the names, including in the file lists, but that makes it difficult to look in the lists, and I'd be surprised if you have file names with newlines in the names. (There are only two characters Unix file names can't have: null and slash.)
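
A small sketch of that pipeline with awkward names, using plain grep in place of pdfgrep so it runs without extra tools. The file names are invented, and xargs -0 is the short form of --null.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'a dog'  > "$tmpdir/my report.txt"    # space in the name
printf '%s\n' 'no pet' > "$tmpdir/it's here.txt"    # quote in the name

# A newline-delimited list, easy to inspect by eye.
printf '%s\n' "$tmpdir/my report.txt" "$tmpdir/it's here.txt" > "$tmpdir/filelist"

# Newlines become NULs, so xargs hands each name to grep as a single argument.
tr '\n' '\0' < "$tmpdir/filelist" | xargs -0 grep -l '\<dog\>' > "$tmpdir/newlist" || true

newlist=$(cat "$tmpdir/newlist")
printf '%s\n' "$newlist"
rm -r "$tmpdir"
```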

grep with a 4 word search
