@MichalStrehovsky
Member

@MichalStrehovsky MichalStrehovsky commented Dec 3, 2025

When compiling hello world:

byte[] allocations before: 564000. string allocations before: 302000
byte[] allocations after: 625000. string allocations after: 241000

So this is mostly a wash allocation-wise. However, not allocating strings means we also avoid the UTF-8 -> UTF-16 -> UTF-8 conversions.
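
As a rough illustration of the conversions being avoided, here is a minimal sketch that writes a symbol name to a stream both ways. The class and method names are hypothetical and are not taken from the PR; the sketch only assumes that the name is already available as UTF-8 bytes.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper names for illustration; this is not code from the PR.
public static class SymbolNameWriting
{
    // Round trip: UTF-8 bytes -> UTF-16 string -> UTF-8 bytes.
    // Allocates a string and a new byte[] and transcodes twice.
    public static void WriteViaString(Stream output, ReadOnlySpan<byte> utf8Name)
    {
        string name = Encoding.UTF8.GetString(utf8Name);   // UTF-8 -> UTF-16
        byte[] encoded = Encoding.UTF8.GetBytes(name);     // UTF-16 -> UTF-8
        output.Write(encoded, 0, encoded.Length);
    }

    // Direct: the bytes are already UTF-8, so they are copied as-is.
    public static void WriteDirect(Stream output, ReadOnlySpan<byte> utf8Name)
    {
        output.Write(utf8Name);
    }
}
```

For well-formed input both methods emit the same bytes; the second one skips the intermediate string entirely.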

Cc @dotnet/ilc-contrib

@dotnet-policy-service
Contributor

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

@MichalStrehovsky MichalStrehovsky changed the title from "Switch object writer to UTF-8" to "Mostly switch object writer to UTF-8" Dec 5, 2025
@MichalStrehovsky MichalStrehovsky marked this pull request as ready for review December 5, 2025 05:49
Copilot AI review requested due to automatic review settings December 5, 2025 05:49
Contributor

Copilot AI left a comment

Pull request overview

This PR converts the object writer subsystem from using string to Utf8String to avoid UTF-8 ↔ UTF-16 ↔ UTF-8 conversions during compilation. The change primarily affects symbol naming, mangling, and object file generation code.

Key Changes

  • Introduced IsNull property to Utf8String for null checking
  • Added multiple Concat overload methods for efficient UTF-8 string concatenation
  • Updated all object writer interfaces and implementations to accept/return Utf8String instead of string
  • Changed dictionaries and collections to use Utf8String keys
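
To make the listed changes concrete, here is a minimal sketch of what a byte-backed string type with an `IsNull` property, a `Concat` helper, and dictionary-friendly equality could look like. `Utf8Text` is a hypothetical stand-in written for this example; it is not the real `Internal.Text.Utf8String`, whose members and signatures may differ.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for illustration; not the real Internal.Text.Utf8String.
public readonly struct Utf8Text : IEquatable<Utf8Text>
{
    private readonly byte[] _bytes;

    public Utf8Text(byte[] bytes) => _bytes = bytes;
    public Utf8Text(string s) => _bytes = Encoding.UTF8.GetBytes(s);

    // True for default(Utf8Text), where no backing array exists.
    public bool IsNull => _bytes == null;

    public ReadOnlySpan<byte> AsSpan() => _bytes;

    // Concatenates two UTF-8 sequences with one allocation and no transcoding.
    public static Utf8Text Concat(ReadOnlySpan<byte> first, ReadOnlySpan<byte> second)
    {
        byte[] result = new byte[first.Length + second.Length];
        first.CopyTo(result);
        second.CopyTo(result.AsSpan(first.Length));
        return new Utf8Text(result);
    }

    // Value equality over the bytes, so the type can serve as a dictionary key.
    // Note: in this sketch a null value compares equal to an empty one.
    public bool Equals(Utf8Text other) => AsSpan().SequenceEqual(other.AsSpan());
    public override bool Equals(object obj) => obj is Utf8Text other && Equals(other);

    public override int GetHashCode()
    {
        var hash = new HashCode();
        hash.AddBytes(AsSpan());
        return hash.ToHashCode();
    }

    public override string ToString() => IsNull ? "" : Encoding.UTF8.GetString(_bytes);
}
```

With such a type, `Utf8Text.Concat(prefix.AsSpan(), name.AsSpan())` builds a prefixed symbol name without ever materializing a UTF-16 string, and the result can be used directly as a `Dictionary` key.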

Reviewed changes

Copilot reviewed 44 out of 44 changed files in this pull request and generated 11 comments.

| File | Description |
|---|---|
| Internal/Text/Utf8String.cs | Added IsNull property and new Concat overloads for UTF-8 string operations |
| ObjectWriter/ObjectWriter.cs | Core object writer changed to use Utf8String for symbol names and relocations |
| ObjectWriter/StringTableBuilder.cs | Updated string table to work directly with UTF-8 bytes |
| ObjectWriter/CoffObjectWriter.cs | COFF format writer updated to use Utf8String |
| ObjectWriter/ElfObjectWriter.cs | ELF format writer updated to use Utf8String |
| ObjectWriter/MachObjectWriter.cs | Mach-O format writer updated to use Utf8String |
| ObjectWriter/UnixObjectWriter.cs | Unix-specific object writer base updated |
| Compiler/NodeMangler.cs | Name mangling infrastructure changed to return Utf8String |
| Compiler/WindowsNodeMangler.cs | Windows-specific name mangling updated with UTF-8 concatenation |
| Compiler/UnixNodeMangler.cs | Unix-specific name mangling updated with UTF-8 concatenation |
| Compiler/UserDefinedTypeDescriptor.cs | Debug info type descriptors updated to use Utf8String |
| DependencyAnalysis/NodeFactory.cs | Factory methods updated for Utf8String symbol names |
| DependencyAnalysis/*Node.cs | Various node types updated to use Utf8String for mangled names |

@MichalStrehovsky
Member Author

/azp run runtime-nativeaot-outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@MichalStrehovsky MichalStrehovsky merged commit 8c8bfb2 into dotnet:main Dec 8, 2025
111 of 120 checks passed
@MichalStrehovsky MichalStrehovsky deleted the utf8objwriter branch December 8, 2025 12:59

private Utf8String SanitizeNameWithHash(Utf8String literal)
{
    Utf8String mangledName = SanitizeName(literal);
Contributor

This code seems dangerous.
The result is not really a UTF-8 string but ASCII, and it needs to be: the next few lines would be incorrect if it were UTF-8 and contained any multi-byte characters (you can't simply cut off arbitrary UTF-8 at byte position 30).
So if anybody ever changes SanitizeName to actually be Utf8 this will create hard-to-spot errors.
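
To illustrate the concern, here is a minimal sketch contrasting a naive byte-position cut with one that respects UTF-8 character boundaries. All names are hypothetical and none of this is code from the PR; the boundary-aware version assumes its input is well-formed UTF-8.

```csharp
using System;
using System.Text;

// Hypothetical helpers for illustration; not code from the PR.
public static class Utf8Truncation
{
    // Naive cut: correct only if every character is a single byte (ASCII).
    public static ReadOnlySpan<byte> TruncateNaive(ReadOnlySpan<byte> utf8, int maxBytes)
        => utf8.Length <= maxBytes ? utf8 : utf8.Slice(0, maxBytes);

    // Boundary-aware cut: backs up past continuation bytes (10xxxxxx) so the
    // result never ends in the middle of a multi-byte sequence.
    public static ReadOnlySpan<byte> TruncateAtCharBoundary(ReadOnlySpan<byte> utf8, int maxBytes)
    {
        if (utf8.Length <= maxBytes) return utf8;

        int end = maxBytes;
        // utf8[end] is the first byte that would be dropped; if it is a
        // continuation byte, the cut would split a character.
        while (end > 0 && (utf8[end] & 0xC0) == 0x80) end--;
        return utf8.Slice(0, end);
    }

    // Strict decode used only to check whether a byte sequence is valid UTF-8.
    public static bool IsWellFormed(ReadOnlySpan<byte> utf8)
    {
        try
        {
            new UTF8Encoding(false, throwOnInvalidBytes: true).GetString(utf8);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

For an ASCII-only name the two cuts are identical, which is why the byte-position truncation is safe as long as SanitizeName keeps returning ASCII.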

Member Author

> So if anybody ever changes SanitizeName to actually be Utf8 this will create hard-to-spot errors.

It's not likely we'd ever allow SanitizeName to return characters outside the basic ASCII set.

Contributor

Would a Debug.Assert(Ascii.IsValid(mangledName)) make sense here?
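
A minimal sketch of how such an assertion might sit next to a byte-position truncation. The method name and its parameters are hypothetical, and the sketch works on a raw byte span rather than on Utf8String; `System.Text.Ascii` is available from .NET 8 onward.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical helper for illustration; not code from the PR.
public static class NameTruncation
{
    // Keeps at most maxBytes bytes of a name that is expected to be ASCII-only.
    public static ReadOnlySpan<byte> TruncateAsciiName(ReadOnlySpan<byte> sanitizedName, int maxBytes)
    {
        // Documents and checks (in Debug builds only) the assumption that makes
        // byte-position slicing safe: one byte per character.
        Debug.Assert(Ascii.IsValid(sanitizedName), "sanitized names are expected to be ASCII");

        return sanitizedName.Length <= maxBytes
            ? sanitizedName
            : sanitizedName.Slice(0, maxBytes);
    }
}
```

Because `Debug.Assert` is compiled out of Release builds, the check costs nothing in shipping compilers while still catching a future change to SanitizeName during development.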
