# pyarrow-bigquery

A simple library to **write to** and **download from** BigQuery tables as PyArrow tables.


---

## Installation

```bash
pip install pyarrow-bigquery
```


---

## Quick Start

This guide will help you quickly get started with `pyarrow-bigquery`, a library to write to and download from BigQuery tables as PyArrow tables.

### Reading

`pyarrow-bigquery` offers four methods to read BigQuery tables as PyArrow tables. Depending on your use case and/or the table size, you can choose the most suitable method.

**Read from a Table Location**

When the table is small enough to fit in memory, you can read it directly using `read_table`.

```python
import pyarrow.bigquery as bq

table = bq.read_table("gcp_project.dataset.small_table")

print(table.num_rows)
```

**Read from a Query**

Alternatively, if the query results are small enough to fit in memory, you can read them directly using `read_query`.

```python
import pyarrow.bigquery as bq

table = bq.read_query(
    project="gcp_project",
    query="SELECT * FROM `gcp_project.dataset.small_table`"
)

print(table.num_rows)
```

**Read in Batches**

If the target table is larger than memory or you prefer not to fetch the entire table at once, you can use the `bq.reader` iterator method with the `batch_size` parameter to limit how much data is fetched per iteration.

```python
import pyarrow.bigquery as bq

for table in bq.reader("gcp_project.dataset.big_table", batch_size=100):
    print(table.num_rows)
```

**Read Query in Batches**

Similarly, you can read data in batches from a query using `reader_query`.

```python
import pyarrow.bigquery as bq

for table in bq.reader_query(
    project="gcp_project",
    query="SELECT * FROM `gcp_project.dataset.small_table`"
):
    print(table.num_rows)
```

### Writing

The package provides two methods to write to BigQuery. Depending on your use case or the table size, you can choose the appropriate method.

**Write the Entire Table**

To write a complete table at once, use the `bq.write_table` method.

```python
import pyarrow as pa
import pyarrow.bigquery as bq

table = pa.Table.from_arrays([[1, 2, 3, 4]], names=['integers'])

bq.write_table(table, 'gcp_project.dataset.table')
```

**Write in Batches**

If you need to write data in smaller chunks, use the `bq.writer` method with the `schema` parameter to define the table structure.

```python
import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([
    ("integers", pa.int64())
])

# Example inputs for illustration; any table or record batch matching the schema works.
table = pa.Table.from_pylist([{"integers": 1}, {"integers": 2}])
record_batch = pa.RecordBatch.from_arrays([pa.array([3], type=pa.int64())], schema=schema)

with bq.writer("gcp_project.dataset.table", schema=schema) as writer:
    writer.write_batch(record_batch)
    writer.write_table(table)
```


---

## API Reference

#### `pyarrow.bigquery.write_table`

Writes a PyArrow Table to a BigQuery Table. No return value.

**Parameters:**

- `table`: `pa.Table`
The PyArrow table.

- `where`: `str`
The destination location in the BigQuery catalog.

- `project`: `str`, *default* `None`
The BigQuery execution project, also the billing project. If not provided, it will be extracted from `where`.

- `table_create`: `bool`, *default* `True`
Specifies if the BigQuery table should be created.

- `table_expire`: `None | int`, *default* `None`
The number of seconds after which the created table will expire. Used only if `table_create` is `True`. Set to `None` to disable expiration.

- `table_overwrite`: `bool`, *default* `False`
If the table already exists, it will be destroyed and a new one will be created.

- `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`
The worker backend for writing data.

- `worker_count`: `int`, *default* `os.cpu_count()`
The number of threads or processes to use for writing data to BigQuery.

- `batch_size`: `int`, *default* `100`
The batch size used for writes. The table will be automatically split into batches of this size.

```python
bq.write_table(table, 'gcp_project.dataset.table')
```
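
The optional parameters above can be combined in a single call. The sketch below uses placeholder project and table names, and the one-hour expiration is an arbitrary example value:

```python
import pyarrow as pa
import pyarrow.bigquery as bq

table = pa.Table.from_arrays([[1, 2, 3, 4]], names=['integers'])

# Recreate the table if it already exists and let it expire after one hour.
bq.write_table(
    table,
    'gcp_project.dataset.table',
    table_overwrite=True,
    table_expire=3600,
)
```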

#### `pyarrow.bigquery.writer` (Context Manager)

Context manager version of the write method. Useful when the PyArrow table is larger than memory or is available in chunks.

**Parameters:**

- `schema`: `pa.Schema`
The PyArrow schema.

- `where`: `str`
The destination location in the BigQuery catalog.

- `project`: `str`, *default* `None`
The BigQuery execution project, also the billing project. If not provided, it will be extracted from `where`.

- `table_create`: `bool`, *default* `True`
Specifies if the BigQuery table should be created.

- `table_expire`: `None | int`, *default* `None`
The number of seconds after which the created table will expire. Used only if `table_create` is `True`. Set to `None` to disable expiration.

- `table_overwrite`: `bool`, *default* `False`
If the table already exists, it will be destroyed and a new one will be created.

- `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`
The worker backend for writing data.

- `worker_count`: `int`, *default* `os.cpu_count()`
The number of threads or processes to use for writing data to BigQuery.

- `batch_size`: `int`, *default* `100`
The batch size used for writes. The table will be automatically split into batches of this size.
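
For illustration, the tuning parameters above can be passed straight to the context manager. This is a sketch with placeholder names and arbitrary settings, not a recommendation:

```python
import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("integers", pa.int64())])

# Placeholder destination; write with 4 worker threads in batches of 500 rows.
with bq.writer("gcp_project.dataset.table", schema=schema, worker_count=4, batch_size=500) as writer:
    writer.write_table(pa.Table.from_pylist([{"integers": i} for i in range(2000)]))
```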

Depending on your use case, you might want to use one of the methods below to write your data to a BigQuery table, using either `pa.Table` or `pa.RecordBatch`.

#### `pyarrow.bigquery.writer.write_table` (Context Manager Method)

Context manager method to write a table.

**Parameters:**

- `table`: `pa.Table`
The PyArrow table.

```python
import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("value", pa.list_(pa.int64()))])

with bq.writer("gcp_project.dataset.table", schema=schema) as writer:
    for a in range(1000):
        writer.write_table(pa.Table.from_pylist([{'value': [a] * 10}]))
```

#### `pyarrow.bigquery.writer.write_batch` (Context Manager Method)

Context manager method to write a record batch.

**Parameters:**

- `batch`: `pa.RecordBatch`
The PyArrow record batch.

```python
import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("value", pa.list_(pa.int64()))])

with bq.writer("gcp_project.dataset.table", schema=schema) as writer:
    for a in range(1000):
        writer.write_batch(pa.RecordBatch.from_pylist([{'value': [1] * 10}]))
```

### Reading

#### `pyarrow.bigquery.read_table`

**Parameters:**

- `source`: `str`
The BigQuery table location.

- `project`: `str`, *default* `None`
The BigQuery execution project, also the billing project. If not provided, it will be extracted from `source`.

- `columns`: `str`, *default* `None`
The columns to download. When not provided, all available columns will be downloaded.

- `row_restrictions`: `str`, *default* `None`
Row-level filtering executed on the BigQuery side. More information is available in the [BigQuery documentation](https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1beta1).

- `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`
The worker backend for fetching data.

- `worker_count`: `int`, *default* `os.cpu_count()`
The number of threads or processes to use for fetching data from BigQuery.

- `batch_size`: `int`, *default* `100`
The batch size used for fetching. The table will be automatically split into batches of this size.
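
A minimal sketch using the parameters listed above; the table name, column names, and filter are placeholders, and passing `columns` as a list of names is an assumption:

```python
import pyarrow.bigquery as bq

# Placeholder table; download two columns and filter rows on the BigQuery side.
table = bq.read_table(
    "gcp_project.dataset.table",
    columns=["name", "value"],        # assumed to accept a list of column names
    row_restrictions="value > 0",
)

print(table.num_rows)
```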

#### `pyarrow.bigquery.read_query`

**Parameters:**

- `project`: `str`
The BigQuery query execution (and billing) project.

- `query`: `str`
The query to be executed.

- `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`
The worker backend for fetching data.

- `worker_count`: `int`, *default* `os.cpu_count()`
The number of threads or processes to use for fetching data from BigQuery.

- `batch_size`: `int`, *default* `100`
The batch size used for fetching. The table will be automatically split into batches of this size.

```python
table = bq.read_query("gcp_project", "SELECT * FROM `gcp_project.dataset.table`")
```

#### `pyarrow.bigquery.reader`

**Parameters:**

- `source`: `str`
The BigQuery table location.

- `project`: `str`, *default* `None`
The BigQuery execution project, also the billing project. If not provided, it will be extracted from `source`.

- `columns`: `str`, *default* `None`
The columns to download. When not provided, all available columns will be downloaded.

- `row_restrictions`: `str`, *default* `None`
Row-level filtering executed on the BigQuery side. More information is available in the [BigQuery documentation](https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1beta1).

- `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`
The worker backend for fetching data.

- `worker_count`: `int`, *default* `os.cpu_count()`
The number of threads or processes to use for fetching data from BigQuery.

- `batch_size`: `int`, *default* `100`
The batch size used for fetching. The table will be automatically split into batches of this size.

```python
import pyarrow as pa
import pyarrow.bigquery as bq

parts = []

for part in bq.reader("gcp_project.dataset.table"):
    parts.append(part)

table = pa.concat_tables(parts)
```


#### `pyarrow.bigquery.reader_query`

**Parameters:**

- `project`: `str`
The BigQuery query execution (and billing) project.

- `query`: `str`
The query to be executed.

- `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`
The worker backend for fetching data.

- `worker_count`: `int`, *default* `os.cpu_count()`
The number of threads or processes to use for fetching data from BigQuery.

- `batch_size`: `int`, *default* `100`
The batch size used for fetching. The table will be automatically split into batches of this size.

```python
for batch in bq.reader_query("gcp_project", "SELECT * FROM `gcp_project.dataset.table`"):
    print(batch.num_rows)
```
