diff --git a/machine-learning/.gitignore b/machine-learning/.gitignore new file mode 100644 index 0000000..b2d9098 --- /dev/null +++ b/machine-learning/.gitignore @@ -0,0 +1,4 @@ +*.cs +*.csproj +bin/**/* +obj/**/* \ No newline at end of file diff --git a/machine-learning/IDataView.ipynb b/machine-learning/IDataView.ipynb new file mode 100644 index 0000000..c6642e2 --- /dev/null +++ b/machine-learning/IDataView.ipynb @@ -0,0 +1,375 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# IDataView\n", + "\n", + "In this notebook, we'll cover:\n", + "\n", + "- What is an IDataView?\n", + "- What's the difference between a DataFrame and an IDataView?\n", + "- How to create an IDataView?\n", + "- How to inspect data in an IDataView?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What is an IDataView?\n", + "\n", + "The [IDataView](https://docs.microsoft.com/dotnet/api/microsoft.ml.idataview?view=ml-dotnet) system is a set of interfaces and components that provide efficient, compositional processing of schematized data for machine learning and advanced analytics applications. It is designed to gracefully and efficiently handle high-dimensional data and large data sets. It does not directly address distributed data and computation, but is suitable for single-node processing of data partitions belonging to larger distributed data sets.\n", + "\n", + "### Schema\n", + "\n", + "IDataView has general schema support, in that a view can have an arbitrary number of columns, each having an associated name, index, data type, and optional annotation.\n", + "\n", + "Column names are case-sensitive. Multiple columns can share the same name, in which case one of the columns hides the others, in the sense that the name will map to one of the column indices, the visible one. \n", + "\n", + "All user interaction with columns should be via name, not index, so the hidden columns are generally invisible to the user. 
However, hidden columns are often useful for diagnostic purposes.\n", + "\n", + "### Supported Data Types\n", + "\n", + "The set of supported column data types forms an open type system, in the sense\n", + "that additional types can be added at any time and in any assembly. However,\n", + "there is a precisely defined set of standard types including:\n", + "\n", + "- Text\n", + "- Boolean\n", + "- Single and Double precision floating point\n", + "- Signed integer values using 1, 2, 4, or 8 bytes\n", + "- Unsigned integer values using 1, 2, 4, or 8 bytes\n", + "- Values for ids and probabilistically unique hashes, using 16 bytes\n", + "- Date time, date time zone, and timespan\n", + "- Key types\n", + "- Vector types\n", + "- Image types\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What's the difference between a DataFrame and an IDataView?\n", + "\n", + "DataFrame and IDataView are very similar: both represent data in a tabular format and let you apply transformations to it. Some key differences:\n", + "\n", + "- DataFrame only supports loading delimited files.\n", + "- DataFrame holds its data in memory, so you're limited by the amount of memory on your machine.\n", + "\n", + "The DataFrame is recommended when performing tasks like exploratory data analysis on a sample of your data. \n", + "\n", + "IDataView is recommended for training on larger datasets. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## How to create an IDataView\n", + "\n", + "You can create an IDataView by using any of the methods for loading data:\n", + "\n", + "- TextLoader\n", + "- LoadFromTextFile\n", + "- LoadFromEnumerable\n", + "- Load" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Defining Schema\n", + "\n", + "IDataViews are schematized. Therefore, you need to provide the schema. 
There are several ways to define the schema:\n", + "\n", + "- Manually\n", + "- Classes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Manually defining IDataView Schema\n", + "\n", + "To manually define the schema, you can use `DataViewSchema.Builder`. " + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "dotnet_interactive": { + "language": "csharp" + } + }, + "source": [ + "#r \"nuget:Microsoft.ML,1.7.1\"" + ], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": "
Installed Packages
" + }, + "execution_count": 1, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "dotnet_interactive": { + "language": "csharp" + } + }, + "source": [ + "using Microsoft.ML;\n", + "using Microsoft.ML.Data;" + ], + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's say that we have data that looks like the following\n", + "\n", + "| Student Name | Score | \n", + "| --- | --- |\n", + "| Jane | 80 |\n", + "| John | 75 | \n", + "| Jack | 90 |\n", + "| Sally | 100 |\n", + "\n", + "We can define the schema as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "dotnet_interactive": { + "language": "csharp" + } + }, + "source": [ + "var schemaBuilder = new DataViewSchema.Builder();\n", + "schemaBuilder.AddColumn(\"StudentName\", TextDataViewType.Instance);\n", + "schemaBuilder.AddColumn(\"Score\", NumberDataViewType.Single);\n", + "var schema = schemaBuilder.ToSchema();" + ], + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When we inspect the schema we can see its different properties." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "dotnet_interactive": { + "language": "csharp" + } + }, + "source": [ + "schema" + ], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": "
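<table><thead><tr><th>index</th><th>Name</th><th>Index</th><th>IsHidden</th><th>Type</th><th>Annotations</th></tr></thead><tbody><tr><td>0</td><td>StudentName</td><td>0</td><td>False</td><td>RawType: System.ReadOnlyMemory&lt;System.Char&gt;</td><td>Schema: [ ]</td></tr><tr><td>1</td><td>Score</td><td>1</td><td>False</td><td>RawType: System.Single</td><td>Schema: [ ]</td></tr></tbody></table>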
" + }, + "execution_count": 1, + "metadata": {} + } + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Defining schema with classes\n", + "\n", + "You also have the option of creating new classes or using existing classes to define your schema. Using the same student data above, you can define the schema as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "dotnet_interactive": { + "language": "csharp" + } + }, + "source": [ + "public class TestScores\n", + "{\n", + "\tpublic string StudentName {get;set;}\n", + "\tpublic float Score {get;set;}\n", + "}" + ], + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Loading data\n", + "\n", + "You can load data from a flat file using either the TextLoader or the LoadFromTextFile method.\n", + "\n", + "#### Loading data from a TextLoader" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "dotnet_interactive": { + "language": "csharp" + } + }, + "source": [ + "// Initialize MLContext\n", + "var mlContext = new MLContext();" + ], + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "dotnet_interactive": { + "language": "csharp" + } + }, + "source": [ + "// Define TextLoader\n", + "var textLoader =\n", + " mlContext.Data.CreateTextLoader(\n", + " columns: new TextLoader.Column[]\n", + " {\n", + " new TextLoader.Column(\"StudentName\", DataKind.String, 0),\n", + " new TextLoader.Column(\"Score\", DataKind.Single, 1)\n", + " },\n", + " separatorChar: ',',\n", + " hasHeader: true);" + ], + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "dotnet_interactive": { + "language": "csharp" + } + }, + "source": [ + "// Create IDataView\n", + "var textLoaderDataView = textLoader.Load(\"data/student-scores.csv\");" + ], + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "dotnet_interactive": { + 
"language": "csharp" + } + }, + "source": [ + "textLoaderDataView.Schema" + ], + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": "
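<table><thead><tr><th>index</th><th>Name</th><th>Index</th><th>IsHidden</th><th>Type</th><th>Annotations</th></tr></thead><tbody><tr><td>0</td><td>StudentName</td><td>0</td><td>False</td><td>RawType: System.ReadOnlyMemory&lt;System.Char&gt;</td><td>Schema: [ ]</td></tr><tr><td>1</td><td>Score</td><td>1</td><td>False</td><td>RawType: System.Single</td><td>Schema: [ ]</td></tr></tbody></table>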
" + }, + "execution_count": 1, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "dotnet_interactive": { + "language": "csharp" + } + }, + "source": [ + "// Specify column index from file via LoadColumn attribute\n", + "public class TestScoresAttributes\n", + "{\n", + "\t[LoadColumn(0)]\n", + "\tpublic string StudentName {get;set;}\n", + "\t\n", + "\t[LoadColumn(1)]\n", + "\tpublic float Score {get;set;}\n", + "}" + ], + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "dotnet_interactive": { + "language": "csharp" + } + }, + "source": [ + "var textLoaderAttributes = \n", + "\tmlContext.Data.CreateTextLoader<TestScoresAttributes>(separatorChar: ',', hasHeader: true);" + ], + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Inspecting data in an IDataView\n", + "\n", + "There are several ways to inspect the data in an IDataView:\n", + "\n", + "- Use cursors\n", + "- Convert to IEnumerable\n", + "\n", + "### Use cursors" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".NET (C#)", + "language": "C#", + "name": ".net-csharp" + }, + "language_info": { + "file_extension": ".cs", + "mimetype": "text/x-csharp", + "name": "C#", + "pygments_lexer": "csharp", + "version": "8.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/machine-learning/data/student-scores.csv b/machine-learning/data/student-scores.csv new file mode 100644 index 0000000..3d157e8 --- /dev/null +++ b/machine-learning/data/student-scores.csv @@ -0,0 +1,5 @@ +Student Name, Score +Jane, 80 +John, 75 +Jack, 90 +Sally, 100 \ No newline at end of file