-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Mostly switch object writer to UTF-8 #122160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR converts the object writer subsystem from using string to Utf8String to avoid UTF-8 ↔ UTF-16 ↔ UTF-8 conversions during compilation. The change primarily affects symbol naming, mangling, and object file generation code.
Key Changes
- Introduced
IsNullproperty toUtf8Stringfor null checking - Added multiple
Concatoverload methods for efficient UTF-8 string concatenation - Updated all object writer interfaces and implementations to accept/return
Utf8Stringinstead ofstring - Changed dictionaries and collections to use
Utf8Stringkeys
Reviewed changes
Copilot reviewed 44 out of 44 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| Internal/Text/Utf8String.cs | Added IsNull property and new Concat overloads for UTF-8 string operations |
| ObjectWriter/ObjectWriter.cs | Core object writer changed to use Utf8String for symbol names and relocations |
| ObjectWriter/StringTableBuilder.cs | Updated string table to work directly with UTF-8 bytes |
| ObjectWriter/CoffObjectWriter.cs | COFF format writer updated to use Utf8String |
| ObjectWriter/ElfObjectWriter.cs | ELF format writer updated to use Utf8String |
| ObjectWriter/MachObjectWriter.cs | Mach-O format writer updated to use Utf8String |
| ObjectWriter/UnixObjectWriter.cs | Unix-specific object writer base updated |
| Compiler/NodeMangler.cs | Name mangling infrastructure changed to return Utf8String |
| Compiler/WindowsNodeMangler.cs | Windows-specific name mangling updated with UTF-8 concatenation |
| Compiler/UnixNodeMangler.cs | Unix-specific name mangling updated with UTF-8 concatenation |
| Compiler/UserDefinedTypeDescriptor.cs | Debug info type descriptors updated to use Utf8String |
| DependencyAnalysis/NodeFactory.cs | Factory methods updated for Utf8String symbol names |
| DependencyAnalysis/*Node.cs | Various node types updated to use Utf8String for mangled names |
src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/UserDefinedTypeDescriptor.cs
Show resolved
Hide resolved
src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/UserDefinedTypeDescriptor.cs
Show resolved
Hide resolved
src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/WindowsNodeMangler.cs
Show resolved
Hide resolved
src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/UserDefinedTypeDescriptor.cs
Show resolved
Hide resolved
src/coreclr/tools/Common/Compiler/ObjectWriter/StringTableBuilder.cs
Outdated
Show resolved
Hide resolved
|
/azp run runtime-nativeaot-outerloop |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
|
||
| private Utf8String SanitizeNameWithHash(Utf8String literal) | ||
| { | ||
| Utf8String mangledName = SanitizeName(literal); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That code seems to be dangerous.
The result is not really an Utf8-String but ASCII - and it needs to be because the next few lines would be incorrect if it would be UTF8 and would contain any multi-byte chars (you can't simply cut off arbitrary Utf8 at byte position 30).
So if anybody ever changes SanitizeName to actually be Utf8 this will create hard-to-spot errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if anybody ever changes SanitizeName to actually be Utf8 this will create hard-to-spot errors.
It's not likely we'd ever allow SanitizeName to return characters outside the basic ASCII set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a Debug.Assert(Ascii.IsValid(mangledName)) make sense here?
When compiling hello world:
byte[]allocations before: 564000.stringallocations before: 302000byte[]allocations after: 625000.stringallocations after: 241000So this is mostly a wash allocation-wise, however, not allocating string means we're also avoiding the UTF-8 -> UTF-16 -> UTF-8 conversions.
Cc @dotnet/ilc-contrib