Collectives™ on Stack Overflow
Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
I am doing some file clean up before loading into my data warehouse and have run into a file sizing issue:
(Get-Content -path C:\Workspace\workfile\myfile.txt -Raw) -replace '\\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
My file is about 2GB. I am receiving the following error and not sure how to correct.
Get-Content : Exception of type 'System.OutOfMemoryException' was
thrown, ........
I am NOT a coder, but I do like learning so am building my own data warehouse. So if you do respond, keep my experience level in mind :)
–
A performant way of reading a text file line by line - without loading the entire file into memory - is to use a switch
statement with the -File
parameter.
A performant way of writing a text file is to use a System.IO.StreamWriter
instance.
As Mathias points out in his answer, using verbatim \"
with the regex-based -replace
operator actually replaces "
alone, due to the escaping rules of regexes. While you could address that with '\\"'
, in this case a simpler and better-performing alternative is to use the [string]
type's Replace()
method, which operates on literal substrings.
To put it all together:
# Note: Be sure to use a *full* path, because .NET's working dir. usually
# differs from PowerShell's.
$streamWriter = [System.IO.StreamWriter]::new('C:\Workspace\workfile\myfileCLEAN.txt')
switch -File C:\Workspace\workfile\myfile.txt {
default { $streamWriter.WriteLine($_.Replace('\"', '"')) }
$streamWriter.Close()
Note: If you're using an old version of Windows PowerShell, namely version 4 or below, use
New-Object System.IO.StreamWriter 'C:\Workspace\workfile\myfileCLEAN.txt'
instead of
[System.IO.StreamWriter]::new('C:\Workspace\workfile\myfileCLEAN.txt')
–
–
–
–
Get-Content -Raw
makes PowerShell read the entire file into a single string.
.NET can't store individual objects over 2GB in size in memory, and each character in a string takes up 2 bytes, so after reading the first ~1 billion characters (roughly equivalent to a 1GB ASCII-encoded text file), it reaches the memory limit.
Remove the -Raw
switch, -replace
is perfectly capable of operating on multiple input strings at once:
(Get-Content -path C:\Workspace\workfile\myfile.txt) -replace '\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
Beware that -replace
is a regex operator, and if you want to remove \
from a string, you need to escape it:
(Get-Content -path C:\Workspace\workfile\myfile.txt) -replace '\\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
While this will work, it'll still be slow due to the fact that we're still loading >2GB of data into memory before applying -replace
and writing to the output file.
Instead, you might want to pipe the output from Get-Content
to the ForEach-Object
cmdlet:
Get-Content -path C:\Workspace\workfile\myfile.txt |ForEach-Object {
$_ -replace '\\"','"'
} |Set-Content C:\Workspace\workfile\myfileCLEAN.txt
This allows Get-Content
to start pushing output prior to finishing reading the file, and PowerShell therefore no longer needs to allocate as much memory as before, resulting in faster execution.
–
Get-Content
loads the whole file into memory.
Try processing line by line to improve memory utilization.
$infile = "C:\Workspace\workfile\myfile.txt"
$outfile = "C:\Workspace\workfile\myfileCLEAN.txt"
foreach ($line in [System.IO.File]::ReadLines($infile)) {
Add-Content -Path $outfile -Value ($line -replace '\\"','"')
–
–
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.