
Separate transformation and materialization #4365

Closed
@franciscojavierarceo

Description

Is your feature request related to a problem? Please describe.
As briefly mentioned in #4277, our current structure of feature view decorators, whose naming convention references both the ingestion and the transformation pattern, is confusing.

Transformation and Materialization are two separate constructs that should be decoupled.

Feature Views are simply schema definitions that can be used online and offline, and historically they did not support transformations. We should change this.
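For context, a Feature View today is just such a schema definition bound to a source. A minimal example along the lines of the standard Feast quickstart (names are placeholders and the exact signature varies by version):

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32

driver = Entity(name="driver", join_keys=["driver_id"])

# A schema definition bound to a batch source; no transformation logic here.
driver_stats_fv = FeatureView(
    name="driver_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[Field(name="conv_rate", dtype=Float32)],
    source=FileSource(
        path="data/driver_stats.parquet",
        timestamp_field="event_timestamp",
    ),
)
```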

As a concrete, simple example, suppose a user had a Spark offline store and a MySQL online store with the Python feature server.

Suppose further that the user of the Feast service had three sets of data requiring three different write patterns:

  1. Batch data from a scheduled Spark job whose output is a large Parquet file, with an entity key and some features, to be materialized to the online store.
  2. Streaming data sent to Feast by an asynchronous Kinesis/Kafka event, to be pushed to the online store.
  3. Online data sent through a synchronous API call to the online store (e.g., to the write-to-online-store endpoint).

Cases (1) and (2) are asynchronous and make no guarantees about the consistency of the data at the moment a client requests those features; case (3), if explicitly chosen to be a synchronous write, would provide much stronger consistency guarantees. A sketch of these three write paths using today's client API follows.
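This is a minimal sketch of the three paths with the current FeatureStore API; the feature view and push source names are placeholders from the standard Feast demo, and exact signatures vary by version:

```python
from datetime import datetime, timedelta

import pandas as pd
from feast import FeatureStore
from feast.data_source import PushMode

store = FeatureStore(repo_path=".")

# (1) Batch: materialize the scheduled Spark job's output (registered as the
# feature view's batch source) into the online store; asynchronous w.r.t. reads.
store.materialize(
    start_date=datetime.utcnow() - timedelta(days=1),
    end_date=datetime.utcnow(),
)

# A decoded stream event / online write payload.
event_df = pd.DataFrame(
    {
        "driver_id": [1001],
        "conv_rate": [0.85],
        "event_timestamp": [datetime.utcnow()],
        "created": [datetime.utcnow()],
    }
)

# (2) Streaming: a Kinesis/Kafka consumer pushes events through a push source;
# also asynchronous w.r.t. reads.
store.push("driver_stats_push_source", event_df, to=PushMode.ONLINE)

# (3) Online: a synchronous write straight to the online store (what the
# feature server's write-to-online-store endpoint wraps).
store.write_to_online_store("driver_stats", event_df)
```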

If Feature Views allowed for feature transformations before writes, the current mental model of Feature Views as representing batch features alone breaks down. This lack of clarity is rooted in the mixing of transformations and materializations. Transformations can happen as part of a batch job, as part of a streaming pipeline, or during an API call, and can be executed by different computation engines (Spark, a Flink application, or a simple Python microservice). Materialization can technically be done independently of the computation engine (e.g., the output of a Spark job can be materialized to the online store by something else entirely).

If we want Feature Views to allow transformations, a Feature View no longer represents only a batch feature view, so adding a decorator named for that (as proposed in #4277) would be confusing.

Describe the solution you'd like
We should update Feast to use a transform decorator, and the write patterns should be coupled more tightly with the Feature View type. For example, Stream, On Demand, Batch, and regular Feature Views could all share the same transformation code but offer different guarantees about how the data will be written (Stream: asynchronously; On Demand: not at all; Batch: asynchronously; Feature View: synchronously). A rough sketch of what this could look like is below.
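The transform decorator does not exist today; the following is a hypothetical sketch (the decorator, the udf parameter, and the view wiring are all illustrative, not current Feast API) showing how one transformation could be reused across view types that differ only in their write semantics:

```python
import pandas as pd


# Hypothetical stand-in for the proposed decorator: it only tags a function
# as a reusable, write-agnostic transformation.
def transform(fn):
    fn.__is_feast_transform__ = True
    return fn


@transform
def adjust_conv_rate(df: pd.DataFrame) -> pd.DataFrame:
    # Pure transformation logic; no knowledge of how or when it is written.
    df["conv_rate_adjusted"] = df["conv_rate"] * 1.1
    return df


# The same function could then be attached to different view types, each of
# which determines the write guarantee (wiring below is illustrative only):
#   StreamFeatureView(..., udf=adjust_conv_rate)    -> written asynchronously
#   OnDemandFeatureView(..., udf=adjust_conv_rate)  -> computed at read time, never written
#   BatchFeatureView(..., udf=adjust_conv_rate)     -> written asynchronously
#   FeatureView(..., udf=adjust_conv_rate)          -> written synchronously
```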

Describe alternatives you've considered
N/a

Additional context
@tokoko @HaoXuAI @shuchu what do you think here? My thoughts aren't perfectly fleshed out, but this is something I've been thinking about and trying to find a way to articulate well.
