添加链接
link管理
链接快照平台
  • 输入网页链接,自动生成快照
  • 标签化管理网页链接
Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Learn more about Collectives

Teams

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Learn more about Teams

I am doing some file clean up before loading into my data warehouse and have run into a file sizing issue:

(Get-Content -path C:\Workspace\workfile\myfile.txt -Raw) -replace '\\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt

My file is about 2GB. I am receiving the following error and not sure how to correct.

Get-Content : Exception of type 'System.OutOfMemoryException' was thrown, ........

I am NOT a coder, but I do like learning so am building my own data warehouse. So if you do respond, keep my experience level in mind :)

don't load the whole file ... load it line-by-line and do your replace on each line. then use Add-Content to send the result to a new file. – Lee_Dailey Nov 20, 2020 at 21:26
  • A performant way of reading a text file line by line - without loading the entire file into memory - is to use a switch statement with the -File parameter.

  • A performant way of writing a text file is to use a System.IO.StreamWriter instance.

  • As Mathias points out in his answer, using verbatim \" with the regex-based -replace operator actually replaces " alone, due to the escaping rules of regexes. While you could address that with '\\"', in this case a simpler and better-performing alternative is to use the [string] type's Replace() method, which operates on literal substrings.

    To put it all together:

    # Note: Be sure to use a *full* path, because .NET's working dir. usually
    #       differs from PowerShell's.
    $streamWriter = [System.IO.StreamWriter]::new('C:\Workspace\workfile\myfileCLEAN.txt')
    switch -File C:\Workspace\workfile\myfile.txt {
      default { $streamWriter.WriteLine($_.Replace('\"', '"')) }
    $streamWriter.Close()
    

    Note: If you're using an old version of Windows PowerShell, namely version 4 or below, use
    New-Object System.IO.StreamWriter 'C:\Workspace\workfile\myfileCLEAN.txt'
    instead of
    [System.IO.StreamWriter]::new('C:\Workspace\workfile\myfileCLEAN.txt')

    Thank you for pointing that out. I upgraded to v5 and wow! this works great and is super fast :) Thank you so much! – Susan Melton Dec 1, 2020 at 16:38 mklement0 - I'm working on some comparisons and will write a proper question soon, but is switch w/ -File the most performant way of reading? Interesting we aren't doing a StreamReader to complement the writer. I need to use ConvertFrom-Csv in my case, so I'm still working on figuring out what will be the best to measure, but thought I'd ask while I'm working on it – immobile2 Apr 5, 2022 at 9:49 @immobile2, among native PowerShell features, switch -File is the fastest method. I'm not certain, but I think its performance is comparable to a StreamReader loop or a [IO.File]::ReadLines() loop. The use of .NET methods is less obvious to PowerShell users, and there's the pitfall of the working-directory mismatch. Also, switch offers sophisticated filtering functionality (not needed here). – mklement0 Apr 5, 2022 at 12:47 Thanks! It is actually slower than my piping at the moment, but I think that is the StreamWriter for some reason. I'm going to go with an [IO.File] method though because my switch statement right now is needlessly doing a wildcard compare to identify the header and then stripping the header off later before writing to file – immobile2 Apr 6, 2022 at 17:24

    Get-Content -Raw makes PowerShell read the entire file into a single string.

    .NET can't store individual objects over 2GB in size in memory, and each character in a string takes up 2 bytes, so after reading the first ~1 billion characters (roughly equivalent to a 1GB ASCII-encoded text file), it reaches the memory limit.

    Remove the -Raw switch, -replace is perfectly capable of operating on multiple input strings at once:

    (Get-Content -path C:\Workspace\workfile\myfile.txt) -replace '\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
    

    Beware that -replace is a regex operator, and if you want to remove \ from a string, you need to escape it:

    (Get-Content -path C:\Workspace\workfile\myfile.txt) -replace '\\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
    

    While this will work, it'll still be slow due to the fact that we're still loading >2GB of data into memory before applying -replace and writing to the output file.

    Instead, you might want to pipe the output from Get-Content to the ForEach-Object cmdlet:

    Get-Content -path C:\Workspace\workfile\myfile.txt |ForEach-Object {
      $_ -replace '\\"','"'
    } |Set-Content C:\Workspace\workfile\myfileCLEAN.txt
    

    This allows Get-Content to start pushing output prior to finishing reading the file, and PowerShell therefore no longer needs to allocate as much memory as before, resulting in faster execution.

    This does work but is slow with a 2GB file taking almost an hour to process. Any hints on helping to speed up the process? – Susan Melton Nov 30, 2020 at 21:59

    Get-Content loads the whole file into memory.

    Try processing line by line to improve memory utilization.

    $infile = "C:\Workspace\workfile\myfile.txt"
    $outfile = "C:\Workspace\workfile\myfileCLEAN.txt"
    foreach ($line in [System.IO.File]::ReadLines($infile)) {
        Add-Content -Path $outfile -Value ($line -replace '\\"','"')
                    Get-Content only loads the whole file into memory if you use the -Raw switch or enclose the call in (...). While [System.IO.File]::ReadLines() is an efficient way to read a file, you're negating the performance benefits by using Add-Content in a loop, which opens and closes the output file in each iteration.
    – mklement0
                    Nov 20, 2020 at 23:37
                    This works but is as slow as using Get-Content. Only advantage I saw was less taxing on memory and I could more easily monitor the progress.
    – Susan Melton
                    Nov 30, 2020 at 22:02
            

    Thanks for contributing an answer to Stack Overflow!

    • Please be sure to answer the question. Provide details and share your research!

    But avoid

    • Asking for help, clarification, or responding to other answers.
    • Making statements based on opinion; back them up with references or personal experience.

    To learn more, see our tips on writing great answers.

  •