Commit 8dd6625

pw42020 and kevingurney authored
MINOR: [Docs][MATLAB] update README failing example code snippets (apache#45973)
### Rationale for this change

MATLAB currently has multiple "example code" sections in the README `matlab/doc/matlab_interface_for_apache_arrow_design.md` that have either been deprecated or were carried over from other languages where MATLAB does not have the same endpoints. Example code that works out of the box not only helps new developers get started using Arrow, but also helps experienced developers verify that their development setup is working.

The broken endpoints include:

#### Use Case 2

1. `arrow.Table(Var1, Var2, Var3)`
   a. With the current implementation, `arrow.Table` cannot take multiple `arrow.array`s and create a table from them. All examples inside of `matlab/test/arrow/tabular/tTable.m` create a table by first creating a MATLAB table.
2. `arrow.FeatherTableWriter`
3. `arrow.FeatherTableReader`
4. `arrow.matlab2arrow`

#### Use Case 3

1. `importFromCDataInterface`
2. `exportToCDataInterface`
3. `arrow.ipcwrite`
4. Python lines using `_import_from_c` or `_export_to_c`
   a. While these are functions inside of Python, MATLAB [does not allow variable or function names to begin with an underscore](https://www.mathworks.com/help/matlab/matlab_prog/variable-names.html). Therefore, calling them directly from MATLAB through its Python interface results in errors.

### What changes are included in this PR?

I changed the README example code in use case 2 and use case 3 to use code that can be copied and pasted into MATLAB and works out of the box, replacing the broken endpoints.

### Are these changes tested?

Yes. Each example in the changes was run in MATLAB to verify that it works out of the box.

### Are there any user-facing changes?

No.

Lead-authored-by: Patrick Walsh <[email protected]>
Co-authored-by: Patrick Walsh <[email protected]>
Co-authored-by: Kevin Gurney <[email protected]>
Signed-off-by: Kevin Gurney <[email protected]>
1 parent c56ec12 commit 8dd6625
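
The underscore restriction cited in the commit message can be illustrated with a small check. The sketch below is in Python (the language of the helper scripts in this change) and assumes MATLAB's documented naming rule: a name starts with a letter, followed by letters, digits, or underscores, up to 63 characters. The helper names `is_valid_matlab_name` and `callable_from_matlab` are hypothetical, and reserved keywords are not checked.

```python
import re

# Assumption: MATLAB's documented naming rule (a letter first, then letters,
# digits, or underscores). Reserved keywords are not checked in this sketch.
_MATLAB_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
NAME_LENGTH_MAX = 63  # the value MATLAB's namelengthmax reports


def is_valid_matlab_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rule sketched above."""
    return len(name) <= NAME_LENGTH_MAX and _MATLAB_NAME.fullmatch(name) is not None


def callable_from_matlab(dotted_path: str) -> bool:
    """Check every component of a dotted path such as 'pyarrow.Array._import_from_c'."""
    return all(is_valid_matlab_name(part) for part in dotted_path.split("."))
```

Under this rule `pyarrow.Array._import_from_c` is rejected because of its last component, which is why the updated examples route those calls through `pyrunfile`.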

1 file changed: +68 -39 lines changed

matlab/doc/matlab_interface_for_apache_arrow_design.md

Lines changed: 68 additions & 39 deletions
@@ -109,19 +109,7 @@ ans =
 
 To serialize MATLAB data to a file on disk (e.g. Feather, Parquet), a MATLAB developer could start by constructing an `arrow.Table` using one of several different approaches.
 
-They could individually compose the table from a set of `arrow.Array` objects (one for each table variable).
-
-###### Example Code:
-``` matlab
->> Var1 = arrow.array(["foo"; "bar"; "baz"]);
-
->> Var2 = arrow.array([today; today + 1; today + 2]);
-
->> Var3 = arrow.array([10; 20; 30]);
-
->> AT = arrow.Table(Var1, Var2, Var3);
-```
-Alternatively, they could directly convert from an existing MATLAB `table` to an `arrow.Table` using a function like `arrow.matlab2arrow` to convert between an existing MATLAB `table` and an `arrow.Table`.
+They could directly convert from an existing MATLAB `table` to an `arrow.tabular.Table` using a function like `arrow.table`.
 
 ###### Example Code:
 ``` matlab
@@ -131,33 +119,43 @@ Alternatively, they could directly convert from an existing MATLAB `table` to an
 
 >> Density = [10.2; 20.5; 11.2; 13.7; 17.8];
 
->> T = table(Weight, Radius, Density); % Create a MATLAB table
+% Create a MATLAB `table`
+>> T = table(Weight, Radius, Density);
 
->> AT = arrow.matlab2arrow(T); % Create an arrow.Table
+% Create an `arrow.tabular.Table` from the MATLAB `table`
+>> AT = arrow.table(T);
 ```
-To serialize the `arrow.Table`, `AT`, to a file (e.g. Feather) on disk, the user could then instantiate an `arrow.FeatherTableWriter`.
+
+To serialize the `arrow.Table`, `AT`, to a file (e.g. Feather) on disk, the user could then instantiate an `arrow.internal.io.feather.Writer`.
 
 ###### Example Code:
 ``` matlab
->> featherTableWriter = arrow.FeatherTableWriter();
-
->> featherTableWriter.write(AT, "data.feather");
+% Make an `arrow.tabular.RecordBatch` from the `arrow.tabular.Table` created in the previous step
+>> recordBatch = arrow.recordBatch(AT);
+>> filename = "data.feather";
+% Write the `arrow.tabular.RecordBatch` to disk as a Feather V1 file named `data.feather`
+>> writer = arrow.internal.io.feather.Writer(filename);
+>> writer.write(recordBatch);
 ```
-The Feather file could then be read and operated on by an external process like Rust or Go. To read it back into MATLAB after modification by another process, the user could instantiate an `arrow.FeatherTableReader`.
+The Feather V1 file could then be read and operated on by an external process like Rust or Go. To read it back into MATLAB, the user could instantiate an `arrow.internal.io.feather.Reader`.
 
 ###### Example Code:
 ``` matlab
->> featherTableReader = arrow.FeatherTableReader("data.feather");
+>> reader = arrow.internal.io.feather.Reader(filename);
 
->> AT = featherTableReader.read();
+% Read in the first RecordBatch
+>> newBatch = reader.read();
+
+% Create a MATLAB `table` from the `arrow.tabular.RecordBatch`
+>> AT = table(newBatch);
 ```
 #### Advanced MATLAB User Workflow for Implementing Support for Writing to Feather Files
 
-To add support for writing to Feather files, an advanced MATLAB user could use the MATLAB and C++ APIs offered by the MATLAB Interface for Apache Arrow to create `arrow.FeatherTableWriter`.
+To add support for writing to Feather V1 files, an advanced MATLAB user could use the MATLAB and C++ APIs offered by the MATLAB Interface for Apache Arrow to create `arrow.internal.io.feather.Writer`.
 
 They would need to author a [MEX function] (e.g. `featherwriteMEX`), which can be called directly by MATLAB code. Within their MEX function, they could use `arrow::matlab::unwrap_table` to convert between the MATLAB representation of the Arrow memory (`arrow.Table`) and the equivalent C++ representation (`arrow::Table`). Once the `arrow.Table` has been "unwrapped" into a C++ `arrow::Table`, it can be passed to the appropriate Arrow C++ library API for writing to a Feather file (`arrow::ipc::feather::WriteTable`).
 
-An analogous workflow could be followed to create `arrow.FeatherTableReader` to enable reading from Feather files.
+An analogous workflow could be followed to create `arrow.internal.io.feather.Reader` to enable reading from Feather V1 files.
 
 #### Enabling High-Level Workflows

@@ -179,47 +177,67 @@ Roughly speaking, local memory sharing workflows can be divided into two categor
 
 To share a MATLAB `arrow.Array` with PyArrow efficiently, a user could use the `exportToCDataInterface` method to export the Arrow memory wrapped by an `arrow.Array` to the C Data Interface format, consisting of two C-style structs, [`ArrowArray`] and [`ArrowSchema`], which represent the Arrow data and associated metadata.
 
-Memory addresses to the `ArrowArray` and `ArrowSchema` structs are returned by the call to `exportToCDataInterface`. These addresses can be passed to Python directly, without having to make any copies of the underlying Arrow data structures that they refer to. A user can then wrap the underlying data pointed to by the `ArrowArray` struct (which is already in the [Arrow Columnar Format]), as well as extract the necessary metadata from the `ArrowSchema` struct, to create a `pyarrow.Array` by using the static method `py.pyarrow.Array._import_from_c`.
+Memory addresses for the `ArrowArray` and `ArrowSchema` structs are returned by the call to `export`. These addresses can be passed to Python directly, without having to make any copies of the underlying Arrow data structures that they refer to. A user can then wrap the underlying data pointed to by the `ArrowArray` struct (which is already in the [Arrow Columnar Format]), as well as extract the necessary metadata from the `ArrowSchema` struct, to create a `pyarrow.Array` by using the static method `pyarrow.Array._import_from_c`.
+
+Multiple lines of Python are required to import the Arrow array from MATLAB. Therefore, the function [`pyrunfile`]((https://www.mathworks.com/help/matlab/ref/pyrunfile.html)) can be used which can run Python scripts defined in an external file.
 
 ###### Example Code:
+
+```python
+# Filename: import_from_c.py
+# Note: This file is located in same directory as the MATLAB file.
+import pyarrow as pa
+array = pa.Array._import_from_c(arrayMemoryAddress, schemaMemoryAddress)
+```
+
 ``` matlab
 % Create a MATLAB arrow.Array.
 >> AA = arrow.array([1, 2, 3, 4, 5]);
 
+% Export C Data Interface C-style structs for `arrow.array.Array` values and schema
+>> cArray = arrow.c.Array();
+>> cSchema = arrow.c.Schema();
+
 % Export the MATLAB arrow.Array to the C Data Interface format, returning the
 % memory addresses of the required ArrowArray and ArrowSchema C-style structs.
->> [arrayMemoryAddress, schemaMemoryAddress] = AA.exportToCDataInterface();
+>> AA.export(cArray.Address, cSchema.Address);
 
 % Import the memory addresses of the C Data Interface format structs to create a pyarrow.Array.
->> PA = py.pyarrow.Array._import_from_c(arrayMemoryAddress, schemaMemoryAddress);
+>> PA = pyrunfile("import_from_c.py", "array", arrayMemoryAddress=cArray.Address, schemaMemoryAddress=cSchema.Address);
 ```
 Conversely, a user can create an Arrow array using PyArrow and share it with MATLAB. To do this, they can call the method `_export_to_c` to export a `pyarrow.Array` to the C Data Interface format.
 
-The memory addresses to the `ArrowArray` and `ArrowSchema` structs populated by the call to `_export_to_c` can be passed to the static method `arrow.Array.importFromCDataInterface` to construct a MATLAB `arrow.Array` with zero copies.
+**NOTE:** Since the python calls to `_export_to_c` and `_import_from_c` have underscores at the beginning of their names, they cannot be called directly in MATLAB. MATLAB member functions or variables are [not allowed to start with an underscore](https://www.mathworks.com/help/matlab/matlab_prog/variable-names.html).
 
-The example code below is adapted from the [`test_cffi.py` test cases for PyArrow].
+To initialize a Python `pyarrow` array, `pyrunfile` can (again) be used to execute a Python script containing variables and functions with names that start with an underscore.
+
+The memory addresses to the `ArrowArray` and `ArrowSchema` structs populated by the call to `_export_to_c` can be passed to the static method `arrow.Array.importFromCDataInterface` to construct a MATLAB `arrow.Array` with zero copies.
 
 ###### Example Code:
+
+```python
+# Filename: export_to_c.py
+# Note: This file is located in same directory as the MATLAB file.
+import pyarrow as pa
+PA._export_to_c(arrayMemoryAddress, schemaMemoryAddress)
+```
+
 ``` matlab
 % Make a pyarrow.Array.
 >> PA = py.pyarrow.array([1, 2, 3, 4, 5]);
 
 % Create ArrowArray and ArrowSchema C-style structs adhering to the Arrow C Data Interface format.
->> array = py.pyarrow.cffi.ffi.new("struct ArrowArray*")
-
->> arrayMemoryAddress = py.int(py.pyarrow.cffi.ffi.cast("uintptr_t", array));
-
->> schema = py.pyarrow.cffi.ffi.new("struct ArrowSchema*")
-
->> schemaMemoryAddress = py.int(py.pyarrow.cffi.ffi.cast("uintptr_t", schema));
+>> cArray = arrow.c.Array();
+>> cSchema = arrow.c.Schema();
 
 % Export the pyarrow.Array to the C Data Interface format, populating the required ArrowArray and ArrowShema structs.
->> PA.export_to_c(arrayMemoryAddress, schemaMemoryAddress)
+>> pyrunfile("export_to_c.py", PA=PA, arrayMemoryAddress=cArray.Address, schemaMemoryAddress=cSchema.Address);
 
 % Import the C Data Interface structs to create a MATLAB arrow.Array.
->> AA = arrow.Array.importFromCDataInterface(arrayMemoryAddress, schemaMemoryAddress);
+>> AA = arrow.array.Array.import(cArray, cSchema);
 ```
 
+
 #### Out-of-Process Memory Sharing
 
 [MATLAB supports running Python code in a separate process]. A user could leverage the MATLAB Interface for Apache Arrow to share Arrow memory between MATLAB and PyArrow running within a separate Python process using one of the following approaches described below.
@@ -240,7 +258,18 @@ For large tables used in a multi-process "data processing pipeline", a user coul
 >> AT = arrow.Table(Var1, Var2, Var3);
 
 % Write the MATLAB arrow.Table to the Arrow IPC File Format on disk.
->> arrow.ipcwrite(AT, "data.arrow");
+>> recordBatch = arrow.recordBatch(AT);
+
+>> filename = "data.arrow"
+
+% Open `data.arrow` as an IPC file
+>> writer = arrow.io.ipc.RecordBatchFileWriter(filename, recordBatch.Schema);
+
+% Write the `RecordBatch` to `data.arrow`
+>> writer.writeRecordBatch(recordBatch);
+
+% Close the writer -- don't forget this step!
+>> writer.close()
 
 % Run Python in a separate process.
 >> pyenv("ExecutionMode", "OutOfProcess");
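
The `pyrunfile` pattern used in the updated examples (pass named inputs into a script, then read a named variable back) can be approximated in plain Python. This is only an analogy built on the standard library's `runpy.run_path`, not how MATLAB implements `pyrunfile`; the script contents, the file name `scale_values.py`, and the helper `run_script_with_inputs` are made up for illustration.

```python
import runpy
import tempfile
from pathlib import Path

# Hypothetical script body: it expects `values` and `scale` to be predefined,
# much like import_from_c.py expects arrayMemoryAddress and schemaMemoryAddress.
SCRIPT = "result = [value * scale for value in values]\n"


def run_script_with_inputs(source: str, output_name: str, **inputs):
    """Write `source` to a file, run it with `inputs` predefined, return one variable."""
    with tempfile.TemporaryDirectory() as tmp:
        script_path = Path(tmp) / "scale_values.py"
        script_path.write_text(source)
        # run_path executes the file and returns its final global namespace.
        namespace = runpy.run_path(str(script_path), init_globals=inputs)
    return namespace[output_name]


scaled = run_script_with_inputs(SCRIPT, "result", values=[1, 2, 3], scale=10)
```

Because the script runs as ordinary Python, it can freely use names that begin with an underscore; only the input and output variable names have to be expressible on the calling side.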
