working with large txt files
I saw some threads close to this, but not quite what I was looking for.

What is a good way to work with txt files that have a hundred million lines?

For example, say I have two word lists with a hundred million lines each and want to combine them while deleting duplicates.

I know how to do that in a text editor, but these files are so big that editors won't load them.

And OK, maybe it's better to leave them separated, but I still want to compare them and get the duplicates out.

Obviously this is a noob question, but thanks in advance for the help. I think this is a good way for me to learn to work with large files like this.
Reply
I found Total Commander very useful for this: https://www.ghisler.com/. Using its viewer function (F3) on a file, you can load only part of its contents. Basically it's just a notepad, but it only reads the data it needs to display the text on the screen.
Reply
UltraEdit can help
Reply
For deduplicating two files quickly, if they are not larger than RAM, you can load the strings into Python or C++.

The best speed comes from putting them in a suffix tree/suffix trie, but if the lines are short, stuffing them into a dictionary works fine too.

If the files are larger than RAM, that is likely a problem for both Python and an editor; then it is time to split them up.
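As a minimal sketch of the in-RAM approach above, assuming both word lists fit in memory (the function name and file paths here are just placeholders):

```python
def merge_dedup(path_a, path_b, out_path):
    """Merge two word lists into one file, keeping only the first
    occurrence of each line (order of first appearance preserved)."""
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for path in (path_a, path_b):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    word = line.rstrip("\n")
                    if word not in seen:
                        seen.add(word)
                        out.write(word + "\n")
```

A hundred million short lines will take a few gigabytes as a Python set, so watch your memory; if it doesn't fit, split the inputs first as suggested above.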
Reply
I would look into things like PowerShell and Python. You have reached a point where you don't want to hand-edit anymore, so scripting is the next step.
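For instance, the "compare and get the duplicates out" part of the original question is only a few lines of script. A sketch, assuming the first file fits in RAM (paths and the function name are placeholders):

```python
def common_lines(path_a, path_b):
    """Return the lines of path_b that also appear in path_a,
    i.e. the duplicates shared by the two word lists."""
    with open(path_a, encoding="utf-8") as f:
        words_a = {line.rstrip("\n") for line in f}
    dupes = []
    with open(path_b, encoding="utf-8") as f:
        for line in f:
            word = line.rstrip("\n")
            if word in words_a:
                dupes.append(word)
    return dupes
```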
Reply
(August 28, 2022, 01:22 PM)nef Wrote: I found total commander very useful for this. https://www.ghisler.com/ using their viewer function (F3) on a file you can load only a part of its contents. Basically it's just a notepad but it only reads the data it needs to display the text on the screen.


Thanks for directing me towards that tool, I will look at it, it looks useful.


(August 28, 2022, 07:27 PM)MCD Wrote: I would look into stuff like Powershell and Python and the likes. You have reached a point where you don't want to edit anymore so scripting is the next step.


Good advice, I think you are correct.
I wasn't sure what editing limitations there are with scripting, but that's why I need to look into it and figure it out.
Reply
Hmm, I have used vim on a CSV containing a few billion lines, but it is not exactly easy to use.
Reply
As said somewhere else, Windows PowerShell is quite powerful with Import-Csv/Export-Csv.
Reply
Python or Perl are the best options here.
Reply
Use cat/head/tail in a Linux shell; you can also use grep to search for content in the file.
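If you are on Windows without those tools, the same streaming idea is easy to sketch in Python: read the file lazily so even a hundred-million-line file never has to fit in memory (function names here are just illustrative, mimicking head and grep):

```python
from itertools import islice

def head(path, n=10):
    """Return the first n lines of a file, like `head -n`."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]

def grep(path, needle):
    """Yield lines containing needle, like `grep -F`,
    one line at a time without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if needle in line:
                yield line.rstrip("\n")
```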
Reply

