Skip to content

Add a doc about encoding configuration #1761

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from
Closed
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
327 changes: 327 additions & 0 deletions docs/encoding.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,327 @@
# Configuring VSCode and PowerShell to prevent encoding bugs
Copy link
Contributor

@rkeithhill rkeithhill Feb 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this apply to both Windows PowerShell and PowerShell Core? I thought PowerShell Core was immune from this issue? If so, maybe this should say "Windows PowerShell"? OK, it appears that pwsh on Windows doesn't use utf-8 (no BOM). But pwsh on Linux, and presumably macOS, do use utf-8. So maybe this should be "to prevent encoding bugs on Windows"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, the above was clarified after reading the rest of the doc.


## TL;DR - I have weird `–` characters in my script and just want to fix it

It's likely that your VSCode is set to encode files as UTF-8 with no BOM,
but your PowerShell is set to read them as Windows-1252.

In your VSCode configuration (<kbd>Ctrl</kbd>+<kbd>,</kbd>), set:

```json
"files.encoding": "utf8bom"
```

[Then make sure that any files you are working on are encoded in UTF-8 with a BOM.](#scripts)

If you want more information, read on.

## Introduction

In this document, you will find information on:

- What's meant by "encoding" and why you need to configure it
- What some common encoding problems look like
- How VSCode, PowerShell and the PowerShell extension depend on encodings
- Why you might choose one encoding over another
- How to configure VSCode and PowerShell to make encoding work
- Some other components you might need to configure to stop encoding issues from occurring.

## Why configuring your encoding is important

VSCode, being a text editor, manages the interface between
a human entering strings of characters into a buffer on the screen
and reading/writing blocks of bytes to the filesystem.
When VSCode saves a buffer (a file you have open) to the filesystem,
it chooses a text encoding to do this.

When PowerShell runs a file,
it similarly has to convert from bytes to chars
to reconstruct a file into a PowerShell program.
(This process of parsing a PowerShell script goes:
*bytes* -> *characters* -> *tokens* -> *abstract syntax tree* -> *execution*.)

Since VSCode writes to the file system and PowerShell reads from the filesystem,
this means they need to communicate using the same encoding,
so that the characters written by VSCode are
the same as the characters read by PowerShell.

Both VSCode and PowerShell are installed with a sensible default encoding configuration,
but especially in older PowerShell versions,
the most sensible encoding to use has changed since the default was configured,
and VSCode's default sometimes conflicts with PowerShell's.

In order to ensure you have no problems using PowerShell or the PowerShell extension in
VSCode, you will need to configure your VSCode and PowerShell settings properly.

## Why you might hit encoding issues, and how to tell if you have

Encoding problems arise because the encoding of VSCode or your script
file does not match the expected encoding of PowerShell,
and there is no way for Powershell to determine the encoding of the file automatically.

You're much more likely to hit encoding problems if you're using characters
not in the [7-bit ASCII character set](https://ascii.cl/),
such as accented latin characters (e.g. `É`, `ü`),
or non-latin characters like Cyrillic (`Д`, `Ц`) or Han Chinese (`脚`, `本`).

Common reasons for encoding issues are:

- The encodings of VSCode and PowerShell have not been changed from their defaults.
For PowerShell 5.1 and below, the default encoding is different from VSCode's.
- Another editor has opened and overwritten the file in a new encoding.
This often happens with the ISE.
- The file is checked into source control (like git) in a different encoding
to what VSCode or PowerShell expects. This can happen when collaborators edit
files with an editor with a different encoding configuration.

### Tell-tale signs of encoding issues

Often encoding errors present themselves as parse errors in scripts.

If you find strange character sequences occurring in your script,
you can look them up in [this handy reference](https://www.i18nqa.com/debug/utf8-debug.html),
which often confirms a UTF-8/Windows-1252 encoding problem.

In the example below, an en-dash (`–`) appears as the characters `–`:

```text
Send-MailMessage : A positional parameter cannot be found that accepts argument 'Testing FuseMail SMTP...'.
At C:\Users\<User>\<OneDrive>\Development\PowerShell\Scripts\Send-EmailUsingSmtpRelay.ps1:6 char:1
+ Send-MailMessage –From $from –To $recipient1 –Subject $subject ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidArgument: (:) [Send-MailMessage], ParameterBindingException
+ FullyQualifiedErrorId : PositionalParameterNotFound,Microsoft.PowerShell.Commands.SendMailMessage
```

This is because VSCode encodes the character `–` in UTF-8 as the bytes `0xE2 0x80 0x93`.
When these bytes are decoded as Windows-1252, they are interpreted as the characters `–`.

Some character sequences that might be evidence of an encoding configuration problem are:

- `–` instead of `–`
- `—` instead of `—`
- `Ä2` instead of `Ä`
- `Â` instead of ` ` (a non-breaking space)
- `é` instead of `é`

## The PowerShell extension and encodings

The PowerShell extension interacts with scripts in a number of ways:

1. When scripts are executed directly in the Integrated Console,
they are read off the filesystem by PowerShell directly.
This means that if PowerShell's encoding differs from VSCode's, something may go wrong here.
2. When scripts are edited in VSCode, the contents are sent by VSCode to the extension,
meaning it's not possible for the extension to get the wrong encoding of a file
(the [Language Server Protocol] mandates this file content be transferred in UTF-8).
3. When a script being edited in VSCode references another script that is not also open in VSCode,
the extension falls back to loading that script's content from the file system.
This load will detect a BOM, but defaults to UTF-8.

It's only in the 1st and 3rd scenarios that a problem may arise.

### Why can't the extension detect encodings and do the right thing?

The extension will perform BOM detection, and in those cases does do the right thing.

The problem occurs when assuming the encoding of BOM-less formats (like [UTF-8] with no BOM and [Windows-1252]).
In these cases, the extension has to settle on an encoding,
and currently just assumes UTF-8 rather than more complex logic.

The extension [does not have access (read or write) to VSCode's encoding settings](https://github.com/Microsoft/vscode/issues/824),
and instead tries to have a sane default - the same as VSCode's.

However, despite being loaded into PowerShell, the extension also can't control PowerShell's encoding
easily (PowerShell's encoding settings are fragmented, and can be changed back at any time in the Integrated Console).

These factors mean the PowerShell extension relies on the user to configure the encoding instead.

## Choosing the right encoding

To solve encoding issues, you need to choose a common text encoding that
you can encode all your PowerShell scripts in that everything you use will understand.

**The encoding most likely to work easily with VSCode, all versions of PowerShell
and most (especially Windows-based) applications is UTF-8 with a [byte-order mark] (BOM)**.

**However**, choosing an encoding is a question of what platforms and applications
will be reading and writing your text/PowerShell scripts.

On Windows, many applications have long used [Windows-1252],
although many .NET applications use [UTF-16]
(the Windows world often calls this "Unicode", a term that now [refers to a broader standard](https://en.wikipedia.org/wiki/Unicode)).

In the Linux world, on the web, and [where .NET is heading](https://github.com/dotnet/standard/issues/260#issuecomment-289549508)
[UTF-8] is now the dominant encoding.

Unicode encodings (the "UTF"s) also have the concept of a [byte-order mark] (BOM),
which may occur at the beginning of text to
tell a decoder what encoding the text is in and, in the case of
multibyte encodings, the [endianness](https://en.wikipedia.org/wiki/Endianness) of the encoding.
BOMs are also designed to be bytes that rarely occur in non-Unicode text,
allowing a reasonable guess that text is Unicode when a BOM is present.

BOMs are optional and their adoption has not caught on in the Linux world,
due to a dependable convention of UTF-8 being used everywhere.
This means that most Linux applications presume that text input is encoded in UTF-8.
While many Linux applications will recognise and correctly handle a BOM,
a number do not, leading to artifacts in text manipulated with those applications.

**Therefore**:

- If you work primarily with Windows applications and Windows PowerShell,
you should prefer an encoding like UTF-8 with BOM or UTF-16.
- If you work across platforms, you should prefer UTF-8 with BOM.
- If you work mainly in Linux-associated contexts, you should prefer UTF-8 without BOM.
- Windows-1252 and latin-1 are essentially legacy encodings that you should avoid if possible.
However, some older Windows applications may depend on them.
- It's also worth noting that script signing [is encoding-dependent](https://github.com/PowerShell/PowerShell/issues/3466),
meaning a change of encoding on a signed script will require resigning.

## Configuring VSCode

VSCode's default encoding is UTF-8 without BOM.

To set [VSCode's encoding](https://code.visualstudio.com/docs/editor/codebasics#_file-encoding-support),
go to the VSCode settings (<kbd>Ctrl</kbd>+<kbd>,</kbd>)
and set the `"files.encoding"` setting:

```json
"files.encoding": "utf8bom"
Copy link
Contributor

@rkeithhill rkeithhill Feb 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth pointing out that you can make this encoding change just for PoweShell files:

    "[powershell]": {
        "files.encoding": "utf8bom"
    }

```

Some possible values are:

- `"utf8"`: [UTF-8] without BOM
- `"utf8bom"`: [UTF-8] with BOM
- `"utf16le"`: Little endian [UTF-16]
- `"utf16be"`: Big endian [UTF-16]
- `"windows1252"`: [Windows-1252]

You should get a dropdown for this in the GUI view,
or completions for it in the JSON view.

You can also add the following to autodetect encoding when possible:

```json
"files.autoGuessEncoding": true
```

## Configuring PowerShell

PowerShell's default encoding varies depending on version:

- In PowerShell 6+, the default encoding is [UTF-8] without BOM on all platforms.
- In Windows PowerShell, the default encoding is usually [Windows-1252],
an extension of [latin-1]
(also known as ISO 8859-1).

In PowerShell 5+ you can find your default encoding with this:

```powershell
[psobject].Assembly.GetTypes() | ? { $_.Name -eq 'ClrFacade'} | % { $_.GetMethod('GetDefaultEncoding', [System.Reflection.BindingFlags]'nonpublic,static').Invoke($null, @()) }
```

It's not strictly possible to force PowerShell to use an input encoding,
and PowerShell 5.1 and below default to [Windows-1252] when there is no BOM.
For interoperability reasons then, it's best to save scripts you wish to
evaluate in PowerShell 5.1 and below in a Unicode format with a BOM.

If want to configure PowerShell to use a given encoding more generally,
this is possible to do for some aspects with profile settings.
See:

- [@mklement0]'s [answer about PowerShell encoding on StackOverflow](https://stackoverflow.com/a/40098904).
- [@rkeithhill]'s [blog post about dealing with BOM-less UTF-8 input in PowerShell](https://rkeithhill.wordpress.com/2010/05/26/handling-native-exe-output-encoding-in-utf8-with-no-bom/).

## What else you might need to configure for encoding

Any other tools you have that touch PowerShell scripts may:

1. Be affected by your encoding choices, or
2. *Worse*, re-encode your scripts in another encoding.

### Scripts

Scripts already on the file system may need to be re-encoded to your new chosen encoding.
To do this with VSCode, you can open the file and [save it again with the new encoding](https://stackoverflow.com/a/40365121).

If you need to re-encode multiple files, [this PowerShell snippet on StackOverflow may help](https://stackoverflow.com/a/1681610).

### The PowerShell Integrated Scripting Environment (ISE)

If you also edit scripts using the PowerShell ISE,
you will need to synchronize your encoding settings there.

The ISE should honor a BOM, but it is [also possible to use reflection to set the encoding](https://bensonxion.wordpress.com/2012/04/25/powershell-ise-default-saveas-encoding/).
Note that this would not be persisted between startups.

### Source control software

While some source control tools (like git) ignore encodings (git just tracks the bytes),
others (like TFS or Mercurial) may not, and even some git-based tools rely on decoding text.

When this is the case, make sure you:

- Configure the text encoding in your source control to match VSCode's.
- Ensure all your files are checked into source control in the relevant encoding.
- Be wary of changes to the encoding received through source control.
A key sign of this is a diff indicating changes but where nothing seems to have changed
(because bytes have but characters have not).

### Collaborators' environments

On top of configuring source control,
ensure that your collaborators on any files you share
(through source control, a file share or any other way)
don't have settings that will override your encoding by re-encoding PowerShell files.

### Other programs

Any other program that reads or writes a PowerShell script may re-encode it.

Some examples are:

- The clipboard (copying and pasting a script). This is common in scenarios like:
- Copying a script into a VM
- Copying a script out of an email or webpage
- Copying a script into or out of an MS Word or PowerPoint document
- Other text editors, such as:
- Notepad
- vim
- Any other PowerShell script editor
- Text editing utilities, like:
- `Get-Content`/`Set-Content`/`Out-File`
- PowerShell redirection operators like `>` and `>>`
- `sed`/`awk`
- File transfer programs, like:
- A web browser, when downloading scripts
- A file share

Some of the above deal in bytes rather than text,
but others offer encoding configurations.
In those cases where you need to configure an encoding,
you will need to make it the same as your editor encoding to prevent bugs.

## Other resources on encoding in PowerShell

There are a few other nice posts on encoding
and configuring encoding in PowerShell that are worth a read:

- [@mklement0]'s [summary of PowerShell encoding on StackOverflow](https://stackoverflow.com/questions/40098771/changing-powershells-default-output-encoding-to-utf-8)
- Previous issues opened on vscode-PowerShell for encoding problems:
- [#1308](https://github.com/PowerShell/vscode-powershell/issues/1308)
- [#1628](https://github.com/PowerShell/vscode-powershell/issues/1628)
- [#1680](https://github.com/PowerShell/vscode-powershell/issues/1680)
- [#1744](https://github.com/PowerShell/vscode-powershell/issues/1744)
- [#1751](https://github.com/PowerShell/vscode-powershell/issues/1751)
- [The classic *Joel on Software* writeup about Unicode](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)

[@mklement0]: https://github.com/mklement0
[@rkeithhill]: https://github.com/rkeithhill
[Windows-1252]: https://en.wikipedia.org/wiki/Byte_order_mark
[latin-1]: https://en.wikipedia.org/wiki/ISO/IEC_8859-1
[UTF-8]: https://en.wikipedia.org/wiki/UTF-8
[byte-order mark]: https://en.wikipedia.org/wiki/Byte_order_mark
[UTF-16]: https://en.wikipedia.org/wiki/UTF-16
[Language Server Protocol]: https://microsoft.github.io/language-server-protocol/