(August 31, 2022, 01:36 PM)jixumbty Wrote: Given multiple sets of data stored as text files, how would you go about making an index for faster searching with grep?
Searching should include username, name, last name, email.
What procedure or methods would you use?
What are some of the best practices for searching and indexing data sets?
We use ripgrep nowadays for searching; it can emit JSON (or another preferred format) and runs across multiple cores and threads.
grep is kinda old. ripgrep can search inside various archive formats as well, and if you want to search even more, like PDFs and such things, take a look at ripgrep-all.
https://github.com/BurntSushi/ripgrep
rg -w 'Sherlock [A-Z]\w+'  (Time: 2.769s)
vs
LC_ALL=en_US.UTF-8 egrep -w 'Sherlock [A-Z]\w+'  (Time: 9.027s)
Another trick is to use parallel, but ripgrep is a lot faster:
find big_source_code_dir -type f | parallel -k -j150% -n 1000 -m grep -H -n pattern_str {}
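If GNU parallel isn't installed, the same fan-out idea works with xargs -P. A minimal sketch, where /tmp/srcdir and pattern_str are made-up sample data for the demo:

```shell
# Create a tiny tree of files to search (hypothetical sample data)
mkdir -p /tmp/srcdir
printf 'foo\npattern_str here\n' > /tmp/srcdir/a.txt
printf 'bar\n' > /tmp/srcdir/b.txt

# Fan the file list out to parallel grep processes: -P 4 runs up to
# four greps at once, -n 100 passes up to 100 files per invocation,
# -print0/-0 keeps filenames with spaces safe
find /tmp/srcdir -type f -print0 \
  | xargs -0 -P 4 -n 100 grep -H -n 'pattern_str'
```

Same idea as the parallel one-liner: split the file list into chunks and search the chunks concurrently.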
rg --colors 'match:none' \
--colors 'match:bg:0x33,0x66,0xFF' \
--colors 'match:fg:white' \
--colors 'match:style:bold' \
"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b"
That one searches for emails. Searching multiple fields is neater with awk, but you can search multiple words with ripgrep just as with egrep, of course:
rg -i "string1|string2|string3|string4"
etc. It's a really powerful tool.
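For the email pattern above, the same regex also works with plain GNU grep -E (GNU grep supports \b word boundaries, like ripgrep). A quick sketch on a made-up sample line:

```shell
# -o prints only the matching part, so the email is extracted
# from the surrounding text (sample line is hypothetical)
echo 'contact: jdoe@example.com (John Doe)' \
  | grep -E -o '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'
```

So if ripgrep isn't available on a box, the patterns in this thread still carry over.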
For searching on a monster box with 96 cores or something similar, use:
rg -j$(nproc) .....
It's hard to answer your question since I don't know how your files are structured. First name and last name can be matched with [A-Za-z] and a max length:
< /dev/urandom tr -dc '[:alnum:]\n' | awk 'length($0) < 12'
This will only print lines that are shorter than 12 characters; use your brain to tweak it for your own data.
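The same awk length filter on deterministic input, so you can see exactly which lines pass (the sample lines are made up):

```shell
# awk prints a line when the condition is true; length($0) is the
# character count of the current line, so only lines under 12
# characters get through
printf 'short\nthisisaverylongline\nmidlength\n' \
  | awk 'length($0) < 12'
```

Swap the condition for whatever constraint fits your name fields.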
Cheers