>>> with open("exercises.zip") as zip_file:
...     contents = zip_file.read()
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.10/codecs.py", line 322, in de
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8e in position 11: invalid start byte
We get an error because zip files aren't text files: they're binary files.
To read from a binary file, we need to open it with the mode rb instead of the default mode of rt:
>>> with open("exercises.zip", mode="rb") as zip_file:
...     contents = zip_file.read()
When you read from a binary file, you won't get back strings.
You'll get back a bytes object, also known as a byte string:
>>> with open("exercises.zip", mode="rb") as zip_file:
...     contents = zip_file.read()
>>> type(contents)
<class 'bytes'>
>>> contents[:20]
b'PK\x03\x04\n\x00\x00\x00\x00\x00Y\x8e\x84T\x00\x00\x00\x00\x00\x00'
Byte strings don't have characters in them: they have bytes in them.
The bytes in a file won't help us very much unless we understand what they mean.
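To make the difference between bytes and characters concrete, here is a small sketch that uses the first few bytes from the output above. The looks_like_zip helper is hypothetical, written only for illustration, and it assumes the usual 4-byte signature found at the start of a non-empty zip archive:

```python
# The first bytes of the zip file shown above (copied from the REPL output).
data = b"PK\x03\x04\n\x00\x00\x00"

# Indexing a bytes object gives an integer (0-255), not a 1-character string.
first = data[0]
print(first)      # 80, the ASCII code for "P"

# Slicing gives another bytes object.
signature = data[:4]
print(signature)  # b'PK\x03\x04'


# Hypothetical helper: a quick sanity check based on the leading signature.
def looks_like_zip(header):
    return header.startswith(b"PK\x03\x04")


print(looks_like_zip(data))           # True
print(looks_like_zip(b"#!/usr/bin"))  # False
```

Knowing that those four bytes mark a zip file is exactly the kind of format knowledge that raw bytes don't give us on their own.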
Use a library to read your binary file
You probably won't read a binary file yourself very often.
When working with binary files you'll typically use a library (either a built-in Python library or a third-party library) that knows how to process the specific type of file you're working with.
That library will do the work of decoding the bytes from your file into something that's easier to work with.
For example, Python's zipfile module can help us read data that's within a zip file:
>>> from zipfile import ZipFile
>>> with ZipFile("exercises.zip") as zip_file:
...     test_file = zip_file.read("exercises/test.py").decode("utf-8")
>>> test_file[:30]
'#!/usr/bin/env python3\nfrom __'
It's best to avoid implementing your own byte-checking or byte manipulation logic if someone has already done that work for you.
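If you'd like to try this without a zip file on disk, the following sketch builds a tiny archive in memory and reads it back with the same ZipFile class. The member name and its contents are made up for illustration:

```python
import io
from zipfile import ZipFile

# Build a tiny archive in memory so the example needs no file on disk.
buffer = io.BytesIO()
with ZipFile(buffer, mode="w") as zip_file:
    zip_file.writestr("exercises/test.py", "#!/usr/bin/env python3\nprint('hi')\n")

# Read it back: the library handles the zip format for us.
buffer.seek(0)
with ZipFile(buffer) as zip_file:
    names = zip_file.namelist()               # member names in the archive
    raw = zip_file.read("exercises/test.py")  # bytes
    text = raw.decode("utf-8")                # str

print(names)      # ['exercises/test.py']
print(text[:22])  # #!/usr/bin/env python3
```

Note that ZipFile.read still returns bytes: the library understands the archive format, but decoding a member's contents into text is up to us.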
Working at byte level in Python
Sometimes you'll work with a library or an API that requires you to work directly at the byte level.
In that case, you'll want to have at least a little bit of familiarity with binary files and byte strings.
For example, let's say we'd like to calculate the SHA-256 checksum of a given file.
Here we have a function called get_sha256_hash that does that:
import hashlib

def get_sha256_hash(filename):
    with open(filename, mode="rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
This function reads all of the binary data within this file.
We're reading bytes because Python's hashlib module requires us to work with bytes.
The hashlib module works at a low level: it works with bytes instead of with strings.
So we're passing in all the bytes in our file to get a hash object and then calling the hexdigest method on that hash object to get a string of hexadecimal characters that represents the SHA-256 checksum of this file:
>>> get_sha256_hash("exercises.zip")
'9e98242a21760945ec815668fc79d8621fa15dd23659ea29be2c5949153fe96d'
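The bytes requirement is easy to see by hashing a small value directly. In this sketch the sample text is made up for illustration: passing a string raises a TypeError, while the encoded bytes hash without trouble.

```python
import hashlib

text = "hello"  # sample value, made up for illustration

try:
    hashlib.sha256(text)  # a str is not accepted
except TypeError as error:
    print("TypeError:", error)

# Encoding the string to bytes first works.
digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
print(len(digest))  # 64 (hexadecimal characters, 4 bits each)
```

That's why get_sha256_hash opens its file in binary mode.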
This function works well, but reading very big files with this function might be a problem.
Reading binary files in chunks
Our get_sha256_hash function reads the whole file into memory all at once.
With a really big file that might take up a lot of memory.
With a text file, the usual way to solve this problem would be to read the file line-by-line.
But binary files don't necessarily have lines!
Instead, we could try to read chunk by chunk.
First we'll read an eight kilobyte chunk from our file:
import hashlib

def get_sha256_hash(filename, buffer_size=2**10*8):
    file_hash = hashlib.sha256()
    with open(filename, mode="rb") as f:
        chunk = f.read(buffer_size)
We make a new hash object first and then read one eight-kilobyte chunk (by passing the number of bytes to our file object's read method).
Now we need the rest of our file's chunks.
So we'll loop:
import hashlib

def get_sha256_hash(filename, buffer_size=2**10*8):
    file_hash = hashlib.sha256()
    with open(filename, mode="rb") as f:
        chunk = f.read(buffer_size)
        while chunk:
            file_hash.update(chunk)
            chunk = f.read(buffer_size)
    return file_hash.hexdigest()
We're repeatedly reading a chunk, updating our hash object, and then reading another chunk.
As long as we're not at the end of our file, we'll get back a truthy chunk when we read.
But when we read at the very end of our file we'll get back an empty byte string.
Empty byte strings (like empty strings) are falsey, so at the end of our file we'll break out of our loop.
Then we'll return the hexdigest just like we did before.
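The end-of-file behavior that ends the loop can be seen on its own with a small in-memory file. This sketch uses io.BytesIO as a stand-in for a file opened in binary mode, with a made-up 10-byte payload and a 4-byte chunk size:

```python
import io

# An in-memory stand-in for a binary file holding 10 bytes.
f = io.BytesIO(b"0123456789")

# Four reads of up to 4 bytes each: two full chunks, one short chunk,
# and finally an empty byte string once we've reached the end.
chunks = [f.read(4) for _ in range(4)]
print(chunks)            # [b'0123', b'4567', b'89', b'']

# The empty byte string is falsey, which is what stops a "while chunk:" loop.
print(bool(chunks[-1]))  # False
```

Notice that the last real chunk may be shorter than the requested size. That's fine for hashing, since update accepts chunks of any length.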
This modified get_sha256_hash function works just like before:
>>> get_sha256_hash("exercises.zip")
'9e98242a21760945ec815668fc79d8621fa15dd23659ea29be2c5949153fe96d'
But instead of reading our entire file into memory, we're now reading our file chunk-by-chunk.
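As a sanity check, we can confirm that hashing chunk-by-chunk gives the same digest as hashing everything at once, regardless of the chunk size. This is a minimal sketch that writes arbitrary sample data to a temporary file. The data and the 7-byte buffer size are chosen only for illustration:

```python
import hashlib
import os
import tempfile


def get_sha256_hash(filename, buffer_size=2**10*8):
    file_hash = hashlib.sha256()
    with open(filename, mode="rb") as f:
        chunk = f.read(buffer_size)
        while chunk:
            file_hash.update(chunk)
            chunk = f.read(buffer_size)
    return file_hash.hexdigest()


# Write 20 KB of arbitrary bytes to a temporary file (sample data only).
data = bytes(range(256)) * 80
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

try:
    chunked = get_sha256_hash(path)                      # default 8 KB chunks
    tiny_chunks = get_sha256_hash(path, buffer_size=7)   # deliberately odd size
finally:
    os.remove(path)

whole = hashlib.sha256(data).hexdigest()
print(chunked == whole == tiny_chunks)  # True
```

The buffer size affects memory use and the number of reads, not the resulting checksum.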
Aside: using an assignment expression
It's common to see an assignment expression used (via Python's walrus operator) when reading files chunk-by-chunk:
import hashlib

def get_sha256_hash(filename, buffer_size=2**10*8):
    file_hash = hashlib.sha256()
    with open(filename, mode="rb") as f:
        while chunk := f.read(buffer_size):
            file_hash.update(chunk)
    return file_hash.hexdigest()
Repeatedly reading data within a while loop is a pretty good use case for an assignment expression.
It may look a little bit weird, but it does save us a few lines of code.
The walrus operator was added in Python 3.8.
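On Python versions before 3.8, one alternative (not used in this article, shown here only as a sketch) is the two-argument form of the built-in iter function, which keeps calling a function until it returns a sentinel value. Here the sentinel is an empty byte string, and the temporary file holds made-up sample data:

```python
import hashlib
import os
import tempfile
from functools import partial


def get_sha256_hash(filename, buffer_size=2**10*8):
    file_hash = hashlib.sha256()
    with open(filename, mode="rb") as f:
        # iter(callable, sentinel) calls f.read(buffer_size) repeatedly
        # and stops as soon as it returns b"" (the end of the file).
        for chunk in iter(partial(f.read, buffer_size), b""):
            file_hash.update(chunk)
    return file_hash.hexdigest()


# Quick check against a temporary file (sample data only).
data = b"spam and eggs\n" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name
try:
    digest = get_sha256_hash(path)
finally:
    os.remove(path)

print(digest == hashlib.sha256(data).hexdigest())  # True
```

This reads the same chunks in the same order as the while loop versions, so the digest is identical.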
Avoid reading binary files if you can
When you read a binary file in Python, you'll get back bytes.
When you're reading a large binary file, you'll probably want to read it chunk-by-chunk.
But it's best to avoid reading binary files yourself if you can.
If there's a third-party library that can help you process your binary file, you should probably use that library to do the byte-based processing for you.