This extension allows you to register all encodings available in the ICU data repository. It supports a total of 1,040 different encodings. The names of the encodings match the <code_set_name> option in each file.
It currently only performs encoding when reading data, converting from <code_set_name> to UTF-8.
To use this extension, you must first install and load it in DuckDB.
INSTALL encodings;
LOAD encodings;After that, all encodings are initialized in your database instance. To use them, simply refer to the correct encoding.
FROM read_csv('my_shift_jis.csv', encoding = 'shift_jis')Most of the code in this library consists of auto-generated maps from the Unicode data. The Unicode data is inlined in the third_party folder, and the generated code resides in src/include/generated and src/generated . Any changes to files in these folders should be made by modifying the scripts/converter.py script, rather than editing the files manually.
If it's necessary to regenerate the code (e.g., due to a bug fix in the replacement maps), simply run:
python scripts/converter.pyNote that the script does not generate formatted code, so you must also run:
make formatIt is also important to notice that the scripts/converter.py script also generates the test/sql/encodings.test test file.
Now to build the extension, run:
makeThe main binaries that will be built are:
./build/release/duckdb
./build/release/test/unittest
./build/release/extension/encodings/encodings.duckdb_extensionduckdbis the binary for the duckdb shell with the extension code automatically loaded.unittestis the test runner of duckdb. Again, the extension is already linked into the binary.encodings.duckdb_extensionis the loadable binary as it would be distributed.
Different tests can be created for DuckDB extensions. The primary way of testing DuckDB extensions should be the SQL tests in ./test/sql. These SQL tests can be run using:
make testTo install your extension binaries from S3, you will need to do two things. Firstly, DuckDB should be launched with the
allow_unsigned_extensions option set to true. How to set this will depend on the client you're using. Some examples:
CLI:
duckdb -unsignedPython:
con = duckdb.connect(':memory:', config={'allow_unsigned_extensions' : 'true'})NodeJS:
db = new duckdb.Database(':memory:', {"allow_unsigned_extensions": "true"});Secondly, you will need to set the repository endpoint in DuckDB to the HTTP url of your bucket + version of the extension you want to install. To do this run the following SQL query in DuckDB:
SET custom_extension_repository='bucket.s3.eu-west-1.amazonaws.com/<your_extension_name>/latest';Note that the /latest path will allow you to install the latest extension version available for your current version of
DuckDB. To specify a specific version, you can pass the version instead.
After running these steps, you can install and load your extension using the regular INSTALL/LOAD commands in DuckDB:
INSTALL encodings
LOAD encodings