mersenneforum.org  

Old 2021-09-20, 15:17   #1
henryzz

Low memory line counting in huge files

I have been running a program that outputs a line to a file every time it encounters a composite. Many of these lines are duplicated and I am interested in targeting the composites that are encountered the most. The issue is that the files generated by my program can get extremely large (150+ GB).
In an ideal world I would run a command such as:
sort -n comps.txt | uniq -c | sort -n | tail -n 10000 > best10k.txt
However, running this in WSL2 requires memory equal to twice the file size, which is unmanageable (I have 48 GB of memory).
So far I have been splitting the file into pieces based on the first one or two digits using awk and then processing them separately, appending the results to one file:
awk '{file = "comps_" substr($0, 1, 2) ".txt"; print > file}' < comps.txt
However, sometimes the file is dominated by one composite, which means one of the split files gets too big. Also, the awk command is quite slow.
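The split files are then fed through essentially the same pipeline, something along these lines (the output file name here is just illustrative):
Code:
for f in comps_*.txt; do
    sort -n "$f" | uniq -c | sort -n | tail -n 10000 >> best10k_parts.txt
done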
Does anyone have any suggestions about how I could make this run faster and/or with less memory? The output of sort -n comps.txt | uniq -c should easily fit in memory (probably twice over).

If running PowerShell commands is a better option, that is possible.

Old 2021-09-20, 15:36   #2
kruoli

Have you tried using the --buffer-size=SIZE and --temporary-directory=DIR options of sort? It should not need that much memory then, and neither should uniq. Or is disk space a problem? Does your sort default to in-memory sorting? Also, use sort [other options] -r in the second occurrence and head instead of tail.
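For example, something along these lines (the buffer size and the temporary directory are placeholders to adapt to your system):
Code:
sort -n --buffer-size=4G --temporary-directory=/path/to/tmp comps.txt \
  | uniq -c \
  | sort -rn --buffer-size=4G --temporary-directory=/path/to/tmp \
  | head -n 10000 > best10k.txt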

Edit: I can think of a more efficient implementation with PowerShell 2; since you can now pass C# code to it directly, I'll give it a try later, because the normal PowerShell syntax is still a struggle for me. We should be able to nail the memory requirements down quite drastically this way.

Old 2021-09-20, 15:59   #3
chris2be8

How many different composites are there? If there is a reasonably small number, say under 100,000, I would write a Perl script to read the file and update a hash, using the composite as the key and the count of matches as the value. At EOF, sort the hash by value and output the top n records. This assumes you can fit all the unique composites into memory.
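The same hash-based counting can also be sketched with awk instead of Perl, keeping the "count composite" output format of uniq -c (file names as in your example):
Code:
awk '{ count[$0]++ } END { for (c in count) print count[c], c }' comps.txt | sort -rn | head -n 10000 > best10k.txt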

If that needs too much memory, I would do it in batches, outputting "composite count" records from each batch, sorted by composite. Then do a series of merges of two or more batches, replacing matching records with one whose count is the sum of the input counts. When eventually down to one file, just run through it extracting the n largest counts (and any other data you want, e.g. how many singletons there are).
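A rough sketch of the merge in shell, assuming each batch file holds "composite count" lines sorted by composite (the batch_*.txt names are just illustrative):
Code:
# merge the pre-sorted batch files, summing counts per composite in one streaming pass
sort -m -k1,1 batch_*.txt | awk '
    $1 != prev { if (NR > 1) print total, prev; prev = $1; total = 0 }
    { total += $2 }
    END { if (NR > 0) print total, prev }
' | sort -rn | head -n 10000 > best10k.txt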

HTH

Old 2021-09-20, 17:36   #4
henryzz

Quote:
Originally Posted by kruoli
Have you tried using the --buffer-size=SIZE and --temporary-directory=DIR options of sort? It should not need that much memory then, and neither should uniq. Or is disk space a problem? Does your sort default to in-memory sorting? Also, use sort [other options] -r in the second occurrence and head instead of tail.

Edit: I can think of a more efficient implementation with PowerShell 2; since you can now pass C# code to it directly, I'll give it a try later, because the normal PowerShell syntax is still a struggle for me. We should be able to nail the memory requirements down quite drastically this way.
I have used those options (and forgot about them when writing the post). This reduces the memory usage to the buffer size plus the file size. It slows things down significantly though.


The number of unique composites is fairly limited compared with the total number of composites. They should fit in memory with little issue. Summing the counts is probably a good idea. Maybe better would be a Perl or Python script that counts the composites. I suppose I should just bite the bullet and write that script.

Old 2021-09-20, 18:09   #5
nordi

Quote:
Originally Posted by henryzz
The number of unique composites is fairly limited compared with the total number of composites. They should fit in memory with little issue.
In that case you can use the following Python script:
Code:
from collections import Counter
with open("YOUR_FILE_NAME") as f:
    print(Counter(line.rstrip() for line in f).most_common(10000))

Old 2021-09-21, 08:36   #6
henryzz

Quote:
Originally Posted by nordi
In that case you can use the following Python script:
Code:
from collections import Counter
with open("YOUR_FILE_NAME") as f:
    print(Counter(line.rstrip() for line in f).most_common(10000))
Thank you. This works with very low memory usage (not even 1 GB).

The only issue is that it outputs in a different format:
Code:
[('comp1', count1), ('comp2', count2)]
while the previous method outputs:
Code:
count1 comp1
 count2 comp2

I can alter my parsing to cope with this, although if there is an easy solution, that would be useful.

This sort of code snippet is why I posted. If I had written this, it would have been done from scratch using hash tables. While I can read and write basic Python, I don't know the libraries at all.

Old 2021-09-21, 15:39   #7
nordi

Quote:
Originally Posted by henryzz
the previous method outputs:
Code:
count1 comp1
  count2 comp2

You can get that with
Code:
from collections import Counter
with open("FILE_NAME") as f:
    counter = Counter(line.rstrip() for line in f)

for composite, count in counter.most_common(10000):
    print(count, composite)
That's also easier to adjust if you want a different format.

Old 2021-09-21, 18:05   #8
henryzz

Quote:
Originally Posted by nordi
You can get that with
Code:
from collections import Counter
with open("FILE_NAME") as f:
    counter = Counter(line.rstrip() for line in f)

for composite, count in counter.most_common(10000):
    print(count, composite)
That's also easier to adjust if you want a different format.

Thank you. That does look much easier to fiddle with. Most of my contact with Python has been through editing existing scripts on this forum (sometimes heavily). This has led to a fairly narrow skill set. This solution would have been obvious to me in C#.