(August 31, 2022, 01:36 PM)jixumbty Wrote: Given multiple sets of data stored as text files, how would you go about making an index for faster searching with grep?
Searching should include username, name, last name, email.
What procedure or methods would you use?
What are some of the best practices for searching and indexing data sets?
We use ripgrep nowadays for searching; it can emit JSON (or another preferred format) and runs across multiple cores and threads.
grep is kinda old. ripgrep can search inside various archive formats as well, and if you want to search even more, like PDFs and such things, take a look at ripgrep-all.
https://github.com/BurntSushi/ripgrep
rg -w 'Sherlock [A-Z]\w+'  (Time: 2.769s)
vs
LC_ALL=en_US.UTF-8 egrep -w 'Sherlock [A-Z]\w+'  (Time: 9.027s)
Another trick is to use parallel, but ripgrep is a lot faster:
find big_source_code_dir -type f | parallel -k -j150% -n 1000 -m grep -H -n pattern_str {}
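If GNU parallel isn't installed, the same fan-out idea works with xargs -P. A minimal sketch, where /tmp/srcdir and pattern_str are made-up sample data for the demo:

```shell
# Create a tiny tree of files to search (hypothetical sample data)
mkdir -p /tmp/srcdir
printf 'foo\npattern_str here\n' > /tmp/srcdir/a.txt
printf 'bar\n' > /tmp/srcdir/b.txt

# Fan the file list out to parallel grep processes: -P 4 runs up to
# four greps at once, -n 100 passes up to 100 files per invocation,
# -print0/-0 keeps filenames with spaces safe
find /tmp/srcdir -type f -print0 \
  | xargs -0 -P 4 -n 100 grep -H -n 'pattern_str'
```

Same idea as the parallel one-liner: split the file list into chunks and search the chunks concurrently.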
rg --colors 'match:none' \
--colors 'match:bg:0x33,0x66,0xFF' \
--colors 'match:fg:white' \
--colors 'match:style:bold' \
"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b"
That one searches for emails. Searching multiple fields is neater with awk, but you can search multiple words with ripgrep just as with egrep, of course:
rg -i "string1|string2|string3|string4"
etc. It's a really powerful tool.
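For the email pattern above, the same regex also works with plain GNU grep -E (GNU grep supports \b word boundaries, like ripgrep). A quick sketch on a made-up sample line:

```shell
# -o prints only the matching part, so the email is extracted
# from the surrounding text (sample line is hypothetical)
echo 'contact: jdoe@example.com (John Doe)' \
  | grep -E -o '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'
```

So if ripgrep isn't available on a box, the patterns in this thread still carry over.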
For searching on a monster box with 96 cores or something similar, use:
rg -j$(nproc) .....
It's hard to answer your question since I don't know how your files are structured. First name and last name can be matched with [A-Za-z] and a max length:
< /dev/urandom tr -dc '[:alnum:]\n' | awk 'length($0) < 12'
This will only print lines that are shorter than 12 characters; use your brain to tweak it for your own data.
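The same awk length filter on deterministic input, so you can see exactly which lines pass (the sample lines are made up):

```shell
# awk prints a line when the condition is true; length($0) is the
# character count of the current line, so only lines under 12
# characters get through
printf 'short\nthisisaverylongline\nmidlength\n' \
  | awk 'length($0) < 12'
```

Swap the condition for whatever constraint fits your name fields.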
Cheers