Index for faster searching on multiple databases / datasets
Given multiple sets of data stored as text files, how would you go about making an index for faster searching with grep?

Searching should include username, name, last name, email.

What procedure or methods would you use?

What are some of the best practices for searching and indexing data sets?
Reply
Look at breach-parse on GitHub, I think
Reply
This? https://github.com/hmaverickadams/breach-parse @nbit

https://pompur.in
Reply
(August 31, 2022, 01:48 PM)pompompurin Wrote: This? https://github.com/hmaverickadams/breach-parse @nbit


Yes, there's query and sorter scripts, and a splitter that separates/indexes the data by its first two characters into different files
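That bucketing idea is simple enough to sketch in a couple of lines of shell (the file names and the sample line format here are made up for illustration, not breach-parse's actual layout):

```shell
mkdir -p buckets

# Sample data standing in for a dump in email:password format.
printf '%s\n' 'alice@example.com:hunter2' 'bob@example.com:letmein' > data.txt

# Route each line into a bucket file named after its first two
# characters; close() avoids running out of file descriptors
# when there are hundreds of buckets.
awk '{ f = "buckets/" substr($0, 1, 2); print >> f; close(f) }' data.txt

# A lookup now only greps one small bucket instead of the whole dump:
grep '^alice' buckets/al
```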
Reply
(August 31, 2022, 01:58 PM)nbit Wrote:
(August 31, 2022, 01:48 PM)pompompurin Wrote: This? https://github.com/hmaverickadams/breach-parse @nbit


Yes, there's query and sorter scripts, and a splitter that separates/indexes the data by its first two characters into different files


Oh I wrote something that does that before.

https://pompur.in
Reply
(August 31, 2022, 01:36 PM)jixumbty Wrote: Given multiple sets of data stored as text files, how would you go about making an index for faster searching with grep?

Searching should include username, name, last name, email.

What procedure or methods would you use?

What are some of the best practices for searching and indexing data sets?


We use ripgrep nowadays for searching, with JSON or your preferred output format, across multiple cores and threads.

grep is kinda old. ripgrep can search inside various archive formats as well; if you want to search even more, like PDFs and such things, take a look at ripgrep-all.

https://github.com/BurntSushi/ripgrep



rg -w 'Sherlock [A-Z]\w+' | Time: 2.769s

vs

LC_ALL=en_US.UTF-8 egrep -w 'Sherlock [A-Z]\w+' | Time: 9.027s


Another trick is to use GNU parallel, but ripgrep is a lot faster:

find big_source_code_dir -type f | parallel -k -j150% -n 1000 -m grep -H -n pattern_str {}



rg --colors 'match:none' \
    --colors 'match:bg:0x33,0x66,0xFF' \
    --colors 'match:fg:white' \
    --colors 'match:style:bold' \
    "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b"


That one searches for emails. Searching specific fields is neater with awk, but you can search multiple terms with ripgrep just as with egrep, of course.
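A field-aware awk lookup might look like this; the colon-delimited username:email:firstname:lastname layout is just an assumption for the example:

```shell
# Hypothetical layout: username:email:firstname:lastname
printf '%s\n' 'jdoe:john@example.com:John:Doe' \
              'asmith:anna@example.com:Anna:Smith' > users.txt

# Match only where field 3 (first name) is exactly "Anna",
# instead of matching "Anna" anywhere on the line as grep would,
# then print that user's email.
awk -F: '$3 == "Anna" { print $2 }' users.txt
# → anna@example.com
```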


rg -i "string1|string2|string3|string4"


etc. It's a really powerful tool.

For searching on a monster box with 96 cores or something similar, use:


rg -j$(nproc) .....


It's hard to answer your question since I don't know how your files are laid out; first and last names can be matched with [A-Za-z] plus a max length:

< /dev/urandom tr -dc '\n[:alnum:]' | awk 'length($0) < 12'

This will only print lines shorter than 12 characters; use your brain to work out how to tweak it.
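Combining the two ideas, a bounded-length name match could look like this (grep -E is used so it runs anywhere; the 12-character cap per name is arbitrary):

```shell
printf '%s\n' 'John Doe' 'x' 'Bartholomew Featherstonehaugh III' > names.txt

# Capitalised first and last name, each at most 12 characters,
# anchored so nothing else is allowed on the line:
grep -E '^[A-Z][a-z]{1,11} [A-Z][a-z]{1,11}$' names.txt
# → John Doe
```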

Cheers
Reply
Thanks for the detailed answer
Reply
(August 31, 2022, 04:34 PM)lulziso Wrote: We use ripgrep nowadays for searching [...]
Reply
(August 31, 2022, 01:48 PM)pompompurin Wrote: This? https://github.com/hmaverickadams/breach-parse @nbit

hey pom sorry to bother you but read a message sent by me pls
Reply
I've been looking for this too. Nothing fancy: just build an index first, then return lines containing a specified string in JSON/JSONL files. That breach-parse script looks to be exclusive to combo lists, and I'm not sharp enough to modify it. Regular grep-style tools take ridiculously long without an index.
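A crude version of that index can be built with grep's byte offsets; everything below (file names, the @example.com search key) is made up for illustration:

```shell
printf '%s\n' '{"email":"a@example.com"}' \
              '{"note":"nothing"}' \
              '{"email":"b@example.com"}' > dump.jsonl

# Build once: record the byte offset of every line containing the
# key, so later lookups can seek instead of rescanning the file.
grep -b '@example.com' dump.jsonl | cut -d: -f1 > dump.idx

# Lookup: jump straight to each saved offset and read one line.
while read -r off; do
    tail -c "+$((off + 1))" dump.jsonl | head -n 1
done < dump.idx
```

For real dumps you'd store one index per search key, or sort the file once and binary-search it, but the seek-instead-of-scan idea is the same.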
Reply

