I'm trying to upload a file that weighs about 3 GB and I'm getting the following error:
"OverflowError: string longer than 2147483647 bytes"
If I understand correctly, there seems to be a 2 GB limit, but I didn't manage to find any reference to such a limitation or to a way of bypassing it (if possible).
import requests

datafile = 'someHugeFile'
with open(datafile, 'rb') as myfile:
    args = myfile.read()
    resp = requests.put(url, data=args, verify=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/requests-2.3.0-py2.7.egg/requests/api.py", line 99, in put
return request('put', url, data=data, **kwargs)
File "/usr/local/lib/python2.7/site-packages/requests-2.3.0-py2.7.egg/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/site-packages/requests-2.3.0-py2.7.egg/requests/sessions.py", line 456, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/site-packages/requests-2.3.0-py2.7.egg/requests/sessions.py", line 559, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/site-packages/requests-2.3.0-py2.7.egg/requests/adapters.py", line 327, in send
timeout=timeout
File "/usr/local/lib/python2.7/site-packages/requests-2.3.0-py2.7.egg/requests/packages/urllib3/connectionpool.py", line 493, in urlopen
body=body, headers=headers)
File "/usr/local/lib/python2.7/site-packages/requests-2.3.0-py2.7.egg/requests/packages/urllib3/connectionpool.py", line 291, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python2.7/httplib.py", line 995, in request
self._send_request(method, url, body, headers)
File "/usr/local/lib/python2.7/httplib.py", line 1029, in _send_request
self.endheaders(body)
File "/usr/local/lib/python2.7/httplib.py", line 991, in endheaders
self._send_output(message_body)
File "/usr/local/lib/python2.7/httplib.py", line 844, in _send_output
self.send(msg)
File "/usr/local/lib/python2.7/httplib.py", line 820, in send
self.sock.sendall(data)
File "/usr/local/lib/python2.7/ssl.py", line 234, in sendall
v = self.send(data[count:])
File "/usr/local/lib/python2.7/ssl.py", line 203, in send
v = self._sslobj.write(data)
OverflowError: string longer than 2147483647 bytes
Unfortunately it looks like this cannot be avoided when you do a POST request with several fields: the file (or files) is always read completely into memory.
It would be great if this could be avoided in requests, since I often have to send files which are larger than the available main memory on the system.
So Lukasa's comment works when you are uploading a single file, since then you can do a streaming upload. But I have to do a normal POST request with several variables in the data part and the file as part of a multipart upload.
I don't have the traceback at the moment, but when I get it, I will post it here.
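For illustration (my sketch, with url and the field names as placeholders), the multipart pattern described above looks like this; requests assembles the entire multipart body in memory before sending:
import requests

url = 'https://example.com/upload'  # placeholder endpoint
with open('someHugeFile', 'rb') as f:
    # Several regular form fields plus one file part: requests builds the
    # complete multipart/form-data body as a single in-memory payload.
    resp = requests.post(url,
                         data={'field1': 'value1', 'field2': 'value2'},
                         files={'file': f})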
Traceback (most recent call last):
File "/var/virtualenvs/centr/lib/python3.5/site-packages/elasticsearch/connection/http_requests.py", line 75, in perform_request
timeout=timeout or self.timeout, verify=False)
File "/var/virtualenvs/centr/lib/python3.5/site-packages/requests/sessions.py", line 609, in send
r = adapter.send(request, **kwargs)
File "/var/virtualenvs/centr/lib/python3.5/site-packages/requests/adapters.py", line 423, in send
timeout=timeout
File "/var/virtualenvs/centr/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/var/virtualenvs/centr/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 356, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python3.5/http/client.py", line 1106, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python3.5/http/client.py", line 1151, in _send_request
self.endheaders(body)
File "/usr/lib/python3.5/http/client.py", line 1102, in endheaders
self._send_output(message_body)
File "/usr/lib/python3.5/http/client.py", line 936, in _send_output
self.send(message_body)
File "/usr/lib/python3.5/http/client.py", line 908, in send
self.sock.sendall(data)
File "/usr/lib/python3.5/ssl.py", line 891, in sendall
v = self.send(data[count:])
File "/usr/lib/python3.5/ssl.py", line 861, in send
return self._sslobj.write(data)
File "/usr/lib/python3.5/ssl.py", line 586, in write
return self._sslobj.write(data)
OverflowError: string longer than 2147483647 bytes
This should throw the error:
datafile = 'someHugeFile'
with open(datafile, 'rb') as myfile:
    r = requests.post(endpoint, data={'key': myfile.read()}, verify=False)
If endpoint is https then ssl will have to process the payload. I wonder if requests or requests-toolbelt could have an option to do the signature with some other library that doesn't die when signing a 2 GB string. Of course, I would say that people shouldn't be signing such large things, but it's definitely a real crash that's happening in the real world.
@adamn That was not my proposed solution. My proposed solution was to not read the file in manually at all. You are bumping into the same error as before, which is that we are sending a single gigantic string to httplib.
This is a behaviour we can fix: if we spot someone uploading a gigantic single string via Python then we can resolve it. But at this point I strongly recommend you use an intermediary file object: either one on disk, or by doing the urlencoding yourself and wrapping the result in a BytesIO.
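A minimal sketch of that second option (my reading of the suggestion; url and the form fields are placeholders): urlencode the body yourself, then hand requests a file-like object so it streams the body instead of passing one giant string down to httplib.
import requests
from io import BytesIO
from urllib.parse import urlencode

url = 'https://example.com/upload'  # placeholder endpoint
# Encode the form fields manually instead of letting requests do it.
body = urlencode({'key': 'value'}).encode('utf-8')
# Wrapped in BytesIO, the body is streamed in small blocks rather than
# sent as one enormous string, so the 2 GB limit is never hit.
resp = requests.post(url, data=BytesIO(body),
                     headers={'Content-Type': 'application/x-www-form-urlencoded'})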
I've already come up with a workaround so won't be able to dig deeper into this, unfortunately. I still suspect that the SSL payload needs to be signed/encrypted, so the same thing will happen regardless of whether there is a file object or not, since the exception is raised by ssl.write itself and I presume that method needs the entire payload. Chunking the POST seems like the only real option. Anyway, thanks for the help.
@adamn No, that's not necessary. TLS uses stream encryption; it does not need the entire payload at once.
What you're missing is that when given a file object, requests will automatically stream it in smaller chunks (specifically, 8192-byte blocks). Those cause no problem.
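For illustration, the file-object form of the original upload would look like this (a sketch; url is a placeholder):
import requests

url = 'https://example.com/upload'  # placeholder endpoint
with open('someHugeFile', 'rb') as f:
    # requests detects the file object and streams it block by block,
    # so no multi-gigabyte string ever reaches httplib or ssl.write.
    resp = requests.put(url, data=f, verify=False)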
Sorry to comment on an old issue, but this looks similar to an issue we've run into and I'm trying to decide whether it's worth opening a new issue for it.
Again, requests.put where data is a huge string doesn't work, but we don't get an error. requests just hangs sending; a packet capture shows that no more data is being sent.
This behaviour is worse than an exception being raised.
Yes, the remote end is sending ACKs appropriately, no FIN or anything like that.
In fact, if you have a large file f, we see the problem if you do requests.put(url, data=f.read()) but not if you do requests.put(url, data=f). Obviously if we have a file handle, we wouldn't bother to call read on it, but the point is that both calls are supposed to generate the same request, and a packet capture shows that they do, up until the point at which one stops sending packets.
As luck would have it, I have already done so.
GitHub doesn't seem to want to let me attach files, so:
#!/usr/bin/env python
import requests

MB = 1024 ** 2
GB = MB * 1024

if __name__ == '__main__':
    data = 'x' * 4 * GB
    resp = requests.put('http://localhost:8000', data=data)
    print resp
And a server to run it against:
#!/usr/bin/env python
import BaseHTTPServer
import logging

READ_CHUNK = 1024 ** 2

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_PUT(self):
        logging.info("PUT request received")
        for header, value in self.headers.items():
            logging.info("header: %s = %s", header, value)
        length = int(self.headers['Content-Length'])
        logging.info("Content length %s, getting content...", length)
        while length:
            to_read = min(length, READ_CHUNK)
            logging.info("reading %s bytes...", to_read)
            self.rfile.read(to_read)
            length -= to_read
        logging.info("Received content")
        self.send_response(200)
        self.end_headers()

def run(server_class=BaseHTTPServer.HTTPServer):
    server_address = ('', 8000)
    httpd = server_class(server_address, Handler)
    httpd.serve_forever()

if __name__ == '__main__':
    logging.basicConfig(
        level=logging.DEBUG,
        format="%(asctime)s %(levelname)s: %(message)s")
    logging.debug("Starting server")
    run()
Obviously this isn't the server we were running against when we first encountered this problem :)
Huh. That behaves...very oddly. On my machine, over the loopback, I don't see any data sent at all: it's like Requests just gave up on sending it. Further debugging seems to show this is happening at the level of socket.sendall, which for some absurd reason is just not sending the complete request. By "not sending the complete request" I mean socket.sendall is returning early, but demonstrably is not sending all the data.
Naturally, the reason this happens is the same as the reason Python does lots of other stupid crap: socket.sendall is written in C. The very first thing that socket.sendall does is get the length of the data that was passed into it and shove it into a C int. Now, this is wrong to begin with: Py_buffer.len is a Py_ssize_t, and sizeof(ssize_t) is frequently larger than sizeof(int). So that's bonkers stupid, and probably the source of this bug.
In fact, it definitely is, since the current Python master has a changed sendall that uses the correct size. This seems to have been cleaned up around Python 3 time as a general "64-bit issue" (see python/cpython@19467d2) in the socket module.
That makes this ultimately a duplicate of CPython issue #18100, which has been open a long time awaiting patch review; given that Python 2.7 is now only getting security fixes, I doubt the CPython developers will fix it at this point.
This is a difficult issue for Requests to sensibly police. We can tell when people will definitely hit it (e.g. because the input is a string with a length greater than 2 GB), but there are many situations where people will hit it and we can't tell (e.g. because the string plus the headers is greater than 2 GB in size, or because there is a different type in use that CPython will treat as "stringish" that is larger than 2 GB). Given that this issue can be solved by moving to a newer version of Python, that it can be worked around by not reading gigantic strings into memory (which is best practice anyway), and that if we ever move off httplib we'll fix it automatically, I don't think we are under huge pressure to resolve it. For my part, this is getting pretty close to "Doctor, it hurts when I do this." "So don't do that, then!" territory.
However, I'm willing to be disagreed with here.
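A user-level way to sidestep the broken sendall is to slice the payload yourself so no single send sees an oversized buffer. A hypothetical sketch of that chunking idea (my own helper, not CPython or requests code):
def sendall_in_chunks(sock, payload, chunk_size=16 * 1024 * 1024):
    # Feed the socket bounded slices so the length handed to the C-level
    # send never comes anywhere near the 2**31 - 1 byte int limit.
    view = memoryview(payload)
    while view:
        sent = sock.send(view[:chunk_size])
        view = view[sent:]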
This also results in an OverflowError in self._sslobj.write(data):
files = {'file': open(tar_file_path, 'rb')}
headers = {'key': 'abc123'}
r = requests.post(url, files=files, headers=headers)
The file is 3GB in size.
I'm having the same basic issue as @gjedeer and see the same behavior as @cmbasnett (wrapping in BytesIO is not a solution). I'm trying to use a file object to upload something larger than 2 GB over a TLS-encrypted POST. Specifically, I'm trying to use a presigned URL to upload a file to S3. It appears that the underlying ssl library in Python doesn't like files over 2 GB. Is there an accepted workaround to this?
Basic code:
with open(self.path_to_data, 'rb') as f:
    fields = 'defined elsewhere...'
    files = {'file': f}
    request('post', url, data=fields, files=files)
Stack trace:
Traceback (most recent call last):
File "/Users/coltonhicks/.virtualenvs/mds_cli-AtYG3_5U/bin/mds", line 11, in <module>
load_entry_point('mdscli', 'console_scripts', 'mds')()
File "/Users/coltonhicks/.virtualenvs/mds_cli-AtYG3_5U/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/Users/coltonhicks/.virtualenvs/mds_cli-AtYG3_5U/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/Users/coltonhicks/.virtualenvs/mds_cli-AtYG3_5U/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/coltonhicks/.virtualenvs/mds_cli-AtYG3_5U/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/coltonhicks/.virtualenvs/mds_cli-AtYG3_5U/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/Users/coltonhicks/.virtualenvs/mds_cli-AtYG3_5U/lib/python3.7/site-packages/click/decorators.py", line 64, in new_func
return ctx.invoke(f, obj, *args, **kwargs)
File "/Users/coltonhicks/.virtualenvs/mds_cli-AtYG3_5U/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/Users/coltonhicks/dev/mds_cli/mdscli/cli.py", line 133, in upload_member_data
uploader.main()
File "/Users/coltonhicks/dev/mds_cli/mdscli/upload_member_data.py", line 30, in main
self.client.upload_member_data(self.mida, self.data_type, f)
File "/Users/coltonhicks/dev/mds_cli/mdscli/requests_client.py", line 300, in upload_member_data
logger.info(
File "/Users/coltonhicks/dev/mds_cli/mdscli/requests_client.py", line 186, in _request
res = requests.request(method, url, data=data, files=files)
File "/Users/coltonhicks/.virtualenvs/mds_cli-AtYG3_5U/lib/python3.7/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/coltonhicks/.virtualenvs/mds_cli-AtYG3_5U/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/Users/coltonhicks/.virtualenvs/mds_cli-AtYG3_5U/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/Users/coltonhicks/.virtualenvs/mds_cli-AtYG3_5U/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/Users/coltonhicks/.virtualenvs/mds_cli-AtYG3_5U/lib/python3.7/site-packages/urllib3/connectionpool.py", line 603, in urlopen
chunked=chunked)
File "/Users/coltonhicks/.virtualenvs/mds_cli-AtYG3_5U/lib/python3.7/site-packages/urllib3/connectionpool.py", line 355, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1224, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1055, in _send_output
self.send(chunk)
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 977, in send
self.sock.sendall(data)
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1015, in sendall
v = self.send(byte_view[count:])
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 984, in send
return self._sslobj.write(data)
OverflowError: string longer than 2147483647 bytes
Requests can't handle a put call for very large data objects. However, it
can accept the data as a file-like object instead, and the size issue will
not show up. Documented here: psf/requests#2717.
Issue: IT-19717
Change-Id: I826d3fa2eebbd3ba0389a0d7042701b4389e406d
Signed-off-by: Eric Ball <[email protected]>
Regarding the newer (since 2018) questions for a solution here:
The workaround using requests-toolbelt works for me in Python 3.8.
The requests code to upload was the following:
requests.request(method, url, *args, **kwargs)
...with the kwargs being:
{'files': {'file': ('{filename}', <_io.BufferedReader name='{filepath}'>), 'parent_dir': '/bert_files/bert_models'}, 'headers': {'Authorization': 'Token {token}'}}
Wrapping it like this:
from requests_toolbelt import MultipartEncoder

m = MultipartEncoder(
    fields={'file': (kwargs['files']['file'][1].name, open(kwargs['files']['file'][1].name, 'rb'), 'text/plain'),
            'parent_dir': kwargs['files']['parent_dir']})
del kwargs['files']
kwargs['data'] = m
kwargs['headers']['Content-Type'] = m.content_type
...leading to these kwargs:
{'headers': {'Authorization': 'Token {token}', 'Content-Type': 'multipart/form-data; boundary={boundary}'}, 'data': <MultipartEncoder: {'file': ('{filename}', <_io.BufferedReader name='{filepath}'>, 'text/plain'), 'parent_dir': '{parent_dir}'}>}
Like this it worked perfectly for me.
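Put together as a self-contained sketch (the endpoint, file path, part names, and token below are placeholders of mine, not values from this thread):
import requests
from requests_toolbelt import MultipartEncoder

url = 'https://example.com/upload'  # placeholder endpoint
with open('/path/to/hugefile', 'rb') as f:
    # MultipartEncoder streams each part, so the multipart body is never
    # materialised as one giant in-memory string.
    m = MultipartEncoder(fields={
        'file': ('hugefile', f, 'application/octet-stream'),
        'parent_dir': '/bert_files/bert_models',
    })
    resp = requests.post(url, data=m, headers={
        'Authorization': 'Token {token}',  # placeholder credential
        'Content-Type': m.content_type,
    })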