Description
Currently, for APIs that can use the BQ Storage Client to fetch data, such as to_dataframe_iterable or to_arrow_iterable, the client library always uses the maximum number of read streams recommended by the BQ server.
python-bigquery/google/cloud/bigquery/_pandas_helpers.py, lines 854 to 858 at ef8e927
This behavior has the advantage of maximizing throughput, but it can lead to out-of-memory issues when too many streams are opened and results are not read fast enough: we've encountered queries that open hundreds of streams and consume GBs of memory.
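For context, a minimal sketch of how these APIs are typically invoked (the table name and per-chunk handling are placeholders); the read session and its stream count are created implicitly, so the caller currently has no way to cap them:

```python
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query(
    "SELECT * FROM `my-project.my_dataset.large_table`"  # placeholder query
).result()

# Internally this creates a BQ Storage read session and opens as many read
# streams as the server recommends; for large results that can be hundreds.
for frame in rows.to_dataframe_iterable():
    ...  # each chunk is a pandas DataFrame; slow consumers fall behind the streams
```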
The BQ Storage Client API also suggests capping max_stream_count when resources are constrained:

> Typically, clients should either leave this unset to let the system to determine an upper bound OR set this a size for the maximum "units of work" it can gracefully handle.
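For illustration, here is a hedged sketch of how a caller using the Storage Read API directly can follow that guidance by capping max_stream_count (the project, dataset, table, and cap value are placeholders):

```python
from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

bqstorage_client = bigquery_storage.BigQueryReadClient()
session = bqstorage_client.create_read_session(
    parent="projects/my-project",  # placeholder project
    read_session=types.ReadSession(
        table="projects/my-project/datasets/my_dataset/tables/large_table",
        data_format=types.DataFormat.ARROW,
    ),
    # Cap the number of streams to a "unit of work" the consumer can handle,
    # instead of accepting the server-recommended maximum.
    max_stream_count=4,
)
```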
This problem has been encountered by others before and can be worked around by monkey-patching the create_read_session method on the BQ client object: #1292
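A hedged sketch of that workaround, assuming the patch is applied to a BigQueryReadClient instance that is then passed in as bqstorage_client (the helper name and the cap of 4 are arbitrary; see #1292 for the original version):

```python
import functools

from google.cloud import bigquery_storage


def cap_read_streams(bqstorage_client, max_streams=4):
    """Wrap create_read_session so every session is capped at max_streams."""
    original = bqstorage_client.create_read_session

    @functools.wraps(original)
    def patched(*args, **kwargs):
        # The BigQuery client library passes max_stream_count as a keyword,
        # so overriding it here bounds the number of streams per session.
        requested = kwargs.get("max_stream_count") or max_streams
        kwargs["max_stream_count"] = min(requested, max_streams)
        return original(*args, **kwargs)

    bqstorage_client.create_read_session = patched
    return bqstorage_client


patched_client = cap_read_streams(bigquery_storage.BigQueryReadClient())
# rows.to_dataframe_iterable(bqstorage_client=patched_client) will now open
# at most 4 read streams per session.
```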
However, it should really be fixed by allowing the max_stream_count parameter to be set through the public API.
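For example, the requested public surface might look roughly like the following; the max_stream_count keyword on to_dataframe_iterable is hypothetical here and does not exist in the library today:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical API sketch (not current behavior): expose the cap directly on
# the iterable helpers so no monkey-patching is required.
rows = client.query(
    "SELECT * FROM `my-project.my_dataset.large_table`"  # placeholder query
).result()
for frame in rows.to_dataframe_iterable(max_stream_count=4):
    ...  # at most 4 read streams, so buffered results stay bounded
```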