Add IDataView notebook #17
Draft: luisquintanilla wants to merge 2 commits into dotnet:main from luisquintanilla:mlnet-idataview
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
*.cs | ||
*.csproj | ||
bin/**/* | ||
obj/**/* |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,375 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# IDataView\n", | ||
"\n", | ||
"In this notebook, we'll cover:\n", | ||
"\n", | ||
"- What is an IDataView?\n", | ||
"- What's the difference between a DataFrame and an IDataView?\n", | ||
"- How to create an IDataView?\n", | ||
"- How to inspect data in an IDataView?" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## What is an IDataView?\n", | ||
"\n", | ||
"The [IDataView](https://docs.microsoft.com/dotnet/api/microsoft.ml.idataview?view=ml-dotnet) system is a set of interfaces and components that provide efficient, compositional processing of schematized data for machine learning and advanced analytics applications. It is designed to gracefully and efficiently handle high dimensional data and large data sets. It does not directly address distributed data and computation, but is suitable for single node processing of data partitions belonging to larger distributed data sets.\n", | ||
"\n", | ||
"### Schema\n", | ||
"\n", | ||
"IDataView has general schema support, in that a view can have an arbitrary number of columns, each having an associated name, index, data type, and optional annotation.\n", | ||
"\n", | ||
"Column names are case sensitive. Multiple columns can share the same name, in which case, one of the columns hides the others, in the sense that the name will map to one of the column indices, the visible one. \n", | ||
"\n", | ||
"All user interaction with columns should be via name, not index, so the hidden columns are generally invisible to the user. However, hidden columns are often useful for diagnostic purposes.\n", | ||
"\n", | ||
"### Supported Data Types\n", | ||
"\n", | ||
"The set of supported column data types forms an open type system, in the sense\n", | ||
"that additional types can be added at any time and in any assembly. However,\n", | ||
"there is a precisely defined set of standard types including:\n", | ||
"\n", | ||
"- Text\n", | ||
"- Boolean\n", | ||
"- Single and Double precision floating point\n", | ||
"- Signed integer values using 1, 2, 4, or 8 bytes\n", | ||
"- Unsigned integer values using 1, 2, 4, or 8 bytes\n", | ||
"- Values for ids and probabilistically unique hashes, using 16 bytes\n", | ||
"- Date time, date time zone, and timespan\n", | ||
"- Key types\n", | ||
"- Vector types\n", | ||
"- Image types\n", | ||
"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## What's the difference between a DataFrame and IDataView?\n", | ||
"\n", | ||
"DataFrame and IDataView are similar in that both represent data in a tabular format and let you apply transformations to it. Some key differences:\n", | ||
"\n", | ||
"- DataFrame only supports loading delimited files.\n", | ||
"- A DataFrame is held entirely in memory, so its size is limited by the memory available on your machine.\n", | ||
"\n", | ||
"The DataFrame is recommended when performing tasks like exploratory data analysis on a sample of your data.\n", | ||
"\n", | ||
"IDataView is recommended for training on larger datasets, because it is evaluated lazily and streams rows instead of holding the whole dataset in memory." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## How to create an IDataView\n", | ||
"\n", | ||
"You can create an IDataView by using any of the methods for loading data:\n", | ||
"\n", | ||
"- TextLoader\n", | ||
"- LoadFromTextFile\n", | ||
"- LoadFromEnumerable\n", | ||
"- Load" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Defining Schema\n", | ||
"\n", | ||
"IDataViews are schematized, so you need to provide a schema. There are several ways to define the schema:\n", | ||
"\n", | ||
"- Manually\n", | ||
"- Classes" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Manually defining IDataView Schema\n", | ||
"\n", | ||
"To manually define the schema, you can use `DataViewSchema.Builder`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"#r \"nuget:Microsoft.ML,1.7.1\"" | ||
], | ||
"outputs": [ | ||
{ | ||
"output_type": "execute_result", | ||
"data": { | ||
"text/html": "<div><div></div><div></div><div><strong>Installed Packages</strong><ul><li><span>Microsoft.ML, 1.7.1</span></li></ul></div></div>" | ||
}, | ||
"execution_count": 1, | ||
"metadata": {} | ||
} | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"using Microsoft.ML;\n", | ||
"using Microsoft.ML.Data;" | ||
], | ||
"outputs": [] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Let's say that we have data that looks like the following:\n", | ||
"\n", | ||
"| Student Name | Score | \n", | ||
"| --- | --- |\n", | ||
"| Jane | 80 |\n", | ||
"| John | 75 | \n", | ||
"| Jack | 90 |\n", | ||
"| Sally | 100 |\n", | ||
"\n", | ||
"We can define the schema as follows:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"var schemaBuilder = new DataViewSchema.Builder();\n", | ||
"schemaBuilder.AddColumn(\"StudentName\", TextDataViewType.Instance);\n", | ||
"schemaBuilder.AddColumn(\"Score\", NumberDataViewType.Single);\n", | ||
"var schema = schemaBuilder.ToSchema();" | ||
], | ||
"outputs": [] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"When we inspect the schema we can see its different properties." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"schema" | ||
], | ||
"outputs": [ | ||
{ | ||
"output_type": "execute_result", | ||
"data": { | ||
"text/html": "<table><thead><tr><th><i>index</i></th><th>Name</th><th>Index</th><th>IsHidden</th><th>Type</th><th>Annotations</th></tr></thead><tbody><tr><td>0</td><td>StudentName</td><td><div class=\"dni-plaintext\">0</div></td><td><div class=\"dni-plaintext\">False</div></td><td><table><thead><tr><th>RawType</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">System.ReadOnlyMemory<System.Char></div></td></tr></tbody></table></td><td><table><thead><tr><th>Schema</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ ]</div></td></tr></tbody></table></td></tr><tr><td>1</td><td>Score</td><td><div class=\"dni-plaintext\">1</div></td><td><div class=\"dni-plaintext\">False</div></td><td><table><thead><tr><th>RawType</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">System.Single</div></td></tr></tbody></table></td><td><table><thead><tr><th>Schema</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ ]</div></td></tr></tbody></table></td></tr></tbody></table>" | ||
}, | ||
"execution_count": 1, | ||
"metadata": {} | ||
} | ||
] | ||
}, | ||
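{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Columns are meant to be addressed by name rather than by index. As a minimal sketch, assuming the `schema` variable built above, the next cell looks up a column through the schema's string indexer, probes for a differently cased name with `GetColumnOrNull`, and then lists every column. The variable names `scoreInfo` and `lowercaseMatch` are illustrative only." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"using System;\n", | ||
"\n", | ||
"// Look up a column by name; the indexer throws if the name is not present.\n", | ||
"var scoreInfo = schema[\"Score\"];\n", | ||
"Console.WriteLine($\"{scoreInfo.Name} is at index {scoreInfo.Index}\");\n", | ||
"\n", | ||
"// Column names are case sensitive, so probe with GetColumnOrNull,\n", | ||
"// which returns null when no column has exactly this name.\n", | ||
"var lowercaseMatch = schema.GetColumnOrNull(\"score\");\n", | ||
"Console.WriteLine(lowercaseMatch.HasValue ? \"found\" : \"not found\");\n", | ||
"\n", | ||
"// Enumerate every column in the schema, including hidden ones.\n", | ||
"foreach (var column in schema)\n", | ||
"{\n", | ||
"    Console.WriteLine($\"{column.Index}: {column.Name} (hidden: {column.IsHidden})\");\n", | ||
"}" | ||
], | ||
"outputs": [] | ||
}, | ||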
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Define schema with classes\n", | ||
"\n", | ||
"You also have the option of creating new classes or using existing classes to define your schema. Using the same student data above, you can define the schema as follows:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"// Property names and types mirror the StudentName (text) and Score (Single) columns above\n", | ||
"public class TestScores\n", | ||
"{\n", | ||
"\tpublic string StudentName {get;set;}\n", | ||
"\tpublic float Score {get;set;}\n", | ||
"}" | ||
], | ||
"outputs": [] | ||
}, | ||
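{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"A class-based schema can also be used to build an IDataView from objects already in memory with `LoadFromEnumerable`. The cell below is a sketch under stated assumptions rather than part of the original walkthrough: the `StudentScore` class, the `inMemoryContext` variable, and the sample rows are hypothetical and simply mirror the student table above." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"using Microsoft.ML;\n", | ||
"\n", | ||
"// Hypothetical row type for this sketch; property names become column names.\n", | ||
"public class StudentScore\n", | ||
"{\n", | ||
"    public string StudentName { get; set; }\n", | ||
"    public float Score { get; set; }\n", | ||
"}\n", | ||
"\n", | ||
"// A separate context so this cell does not depend on later cells.\n", | ||
"var inMemoryContext = new MLContext();\n", | ||
"\n", | ||
"var inMemoryRows = new[]\n", | ||
"{\n", | ||
"    new StudentScore { StudentName = \"Jane\", Score = 80f },\n", | ||
"    new StudentScore { StudentName = \"John\", Score = 75f },\n", | ||
"    new StudentScore { StudentName = \"Jack\", Score = 90f },\n", | ||
"    new StudentScore { StudentName = \"Sally\", Score = 100f }\n", | ||
"};\n", | ||
"\n", | ||
"// The schema is inferred from the public properties of StudentScore.\n", | ||
"IDataView enumerableDataView = inMemoryContext.Data.LoadFromEnumerable(inMemoryRows);\n", | ||
"\n", | ||
"enumerableDataView.Schema" | ||
], | ||
"outputs": [] | ||
}, | ||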
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Loading data\n", | ||
"\n", | ||
"You can load data from a flat file using either a `TextLoader` or the `LoadFromTextFile` method.\n", | ||
"\n", | ||
"#### Loading data from a TextLoader" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"// Initialize MLContext\n", | ||
"var mlContext = new MLContext();" | ||
], | ||
"outputs": [] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"// Define TextLoader\n", | ||
"var textLoader =\n", | ||
" mlContext.Data.CreateTextLoader(\n", | ||
" columns: new TextLoader.Column[]\n", | ||
" {\n", | ||
" new TextLoader.Column(\"StudentName\",DataKind.String, 0),\n", | ||
" new TextLoader.Column(\"Score\", DataKind.Single, 1)\n", | ||
" },\n", | ||
" separatorChar: ',',\n", | ||
" hasHeader: true);" | ||
], | ||
"outputs": [] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"// Create IDataView\n", | ||
"var textLoaderDataView = textLoader.Load(\"student-scores.csv\");" | ||
], | ||
"outputs": [] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"textLoaderDataView.Schema" | ||
], | ||
"outputs": [ | ||
{ | ||
"output_type": "execute_result", | ||
"data": { | ||
"text/html": "<table><thead><tr><th><i>index</i></th><th>Name</th><th>Index</th><th>IsHidden</th><th>Type</th><th>Annotations</th></tr></thead><tbody><tr><td>0</td><td>StudentName</td><td><div class=\"dni-plaintext\">0</div></td><td><div class=\"dni-plaintext\">False</div></td><td><table><thead><tr><th>RawType</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">System.ReadOnlyMemory<System.Char></div></td></tr></tbody></table></td><td><table><thead><tr><th>Schema</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ ]</div></td></tr></tbody></table></td></tr><tr><td>1</td><td>Score</td><td><div class=\"dni-plaintext\">1</div></td><td><div class=\"dni-plaintext\">False</div></td><td><table><thead><tr><th>RawType</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">System.Single</div></td></tr></tbody></table></td><td><table><thead><tr><th>Schema</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ ]</div></td></tr></tbody></table></td></tr></tbody></table>" | ||
}, | ||
"execution_count": 1, | ||
"metadata": {} | ||
} | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"// Specify column index from file via LoadColumn attribute\n", | ||
"public class TestScoresAttributes\n", | ||
"{\n", | ||
"\t[LoadColumn(0)]\n", | ||
"\tpublic string StudentName {get;set;}\n", | ||
"\t\n", | ||
"\t[LoadColumn(1)]\n", | ||
"\tpublic float Score {get;set;}\n", | ||
"}" | ||
], | ||
"outputs": [] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"var textLoaderAttributes = \n", | ||
"\tmlContext.Data.CreateTextLoader<TestScoresAttributes>(separatorChar: ',', hasHeader:true);" | ||
], | ||
"outputs": [] | ||
}, | ||
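{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Loading data with LoadFromTextFile\n", | ||
"\n", | ||
"`LoadFromTextFile` creates the loader and loads the file in a single call. The following is a minimal sketch, assuming the `mlContext` created above and the `student-scores.csv` file used earlier; the `StudentScoreFileRow` class and the `textFileDataView` variable are hypothetical names introduced only for this example." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"using Microsoft.ML;\n", | ||
"using Microsoft.ML.Data;\n", | ||
"\n", | ||
"// Hypothetical row type: LoadColumn gives the zero-based column position in the file.\n", | ||
"public class StudentScoreFileRow\n", | ||
"{\n", | ||
"    [LoadColumn(0)]\n", | ||
"    public string StudentName { get; set; }\n", | ||
"\n", | ||
"    [LoadColumn(1)]\n", | ||
"    public float Score { get; set; }\n", | ||
"}\n", | ||
"\n", | ||
"// Same separator and header settings as the TextLoader example above.\n", | ||
"var textFileDataView =\n", | ||
"    mlContext.Data.LoadFromTextFile<StudentScoreFileRow>(\n", | ||
"        \"student-scores.csv\",\n", | ||
"        separatorChar: ',',\n", | ||
"        hasHeader: true);\n", | ||
"\n", | ||
"textFileDataView.Schema" | ||
], | ||
"outputs": [] | ||
}, | ||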
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Inspecting data in IDataView\n", | ||
"\n", | ||
"There are several ways to inspect the data in an IDataView:\n", | ||
"\n", | ||
"- Use cursors\n", | ||
"- Convert to IEnumerable\n", | ||
"\n", | ||
"### Use cursors" | ||
] | ||
} | ||
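, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"A cursor walks the rows of an IDataView one at a time and only reads the columns you ask for. The cell below is a hedged sketch, not necessarily the approach the finished notebook will take: it assumes the `textLoaderDataView` loaded earlier, and the variable names are illustrative." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"using System;\n", | ||
"\n", | ||
"// Columns to read from the IDataView created with the TextLoader above.\n", | ||
"var nameCol = textLoaderDataView.Schema[\"StudentName\"];\n", | ||
"var scoreCol = textLoaderDataView.Schema[\"Score\"];\n", | ||
"\n", | ||
"// Request only the columns that will be read.\n", | ||
"using (var cursor = textLoaderDataView.GetRowCursor(new[] { nameCol, scoreCol }))\n", | ||
"{\n", | ||
"    // Getters copy the current row's value into a caller-owned variable.\n", | ||
"    var nameGetter = cursor.GetGetter<ReadOnlyMemory<char>>(nameCol);\n", | ||
"    var scoreGetter = cursor.GetGetter<float>(scoreCol);\n", | ||
"\n", | ||
"    ReadOnlyMemory<char> studentName = default;\n", | ||
"    float studentScore = default;\n", | ||
"\n", | ||
"    while (cursor.MoveNext())\n", | ||
"    {\n", | ||
"        nameGetter(ref studentName);\n", | ||
"        scoreGetter(ref studentScore);\n", | ||
"        Console.WriteLine($\"{studentName}: {studentScore}\");\n", | ||
"    }\n", | ||
"}" | ||
], | ||
"outputs": [] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Convert to IEnumerable\n", | ||
"\n", | ||
"`CreateEnumerable` maps each row onto a strongly typed object. This sketch again assumes `mlContext` and `textLoaderDataView` from earlier cells; `StudentScoreRow` is a hypothetical class whose property names and types match the `StudentName` and `Score` columns." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"dotnet_interactive": { | ||
"language": "csharp" | ||
} | ||
}, | ||
"source": [ | ||
"using System;\n", | ||
"\n", | ||
"// Hypothetical row type matching the IDataView's column names and types.\n", | ||
"public class StudentScoreRow\n", | ||
"{\n", | ||
"    public string StudentName { get; set; }\n", | ||
"    public float Score { get; set; }\n", | ||
"}\n", | ||
"\n", | ||
"// reuseRowObject: false allocates a new object for every row.\n", | ||
"var scoreRows = mlContext.Data.CreateEnumerable<StudentScoreRow>(textLoaderDataView, reuseRowObject: false);\n", | ||
"\n", | ||
"foreach (var row in scoreRows)\n", | ||
"{\n", | ||
"    Console.WriteLine($\"{row.StudentName}: {row.Score}\");\n", | ||
"}" | ||
], | ||
"outputs": [] | ||
} | ||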
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": ".NET (C#)", | ||
"language": "C#", | ||
"name": ".net-csharp" | ||
}, | ||
"language_info": { | ||
"file_extension": ".cs", | ||
"mimetype": "text/x-csharp", | ||
"name": "C#", | ||
"pygments_lexer": "csharp", | ||
"version": "8.0" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
Student Name, Score | ||
Jane, 80 | ||
John, 75 | ||
Jack, 90 | ||
Sally, 100 |