Determining S3 listing order

Some third-party implementations of Amazon’s S3 protocol return object information (‘file listings’) in UTF-16 code-unit order rather than the Amazon-compatible Unicode code-point order.
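
The difference between the two orders only shows up for names that contain characters outside the Basic Multilingual Plane, next to characters in the range U+E000 to U+FFFF. The following is a minimal Python sketch of that effect, not Moonwalk code. The three file names are illustrative assumptions chosen to show the divergence; they are not the test names used in the experiment below.

```python
def code_point_key(name: str) -> bytes:
    # UTF-8 byte order is the same as Unicode code-point order.
    return name.encode("utf-8")


def utf16_key(name: str) -> bytes:
    # Big-endian UTF-16 bytes compare the same way as 16-bit code units.
    return name.encode("utf-16-be")


# Hypothetical names, for illustration only.
names = [
    "file_\ua98f_a.txt",      # U+A98F: in the BMP, below the surrogate range
    "file_\uff21_b.txt",      # U+FF21: in the BMP, above the surrogate range
    "file_\U000103a3_c.txt",  # U+103A3: outside the BMP (a surrogate pair in UTF-16)
]

by_code_point = sorted(names, key=code_point_key)
by_utf16 = sorted(names, key=utf16_key)

print("code-point order:", [n[-5] for n in by_code_point])
print("UTF-16 order:    ", [n[-5] for n in by_utf16])
```

In UTF-16, U+103A3 is stored as a surrogate pair beginning with the code unit 0xD800, which is smaller than 0xFF21, so name "c" moves ahead of name "b" in code-unit order even though its code point is larger.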

Since Moonwalk 2023.2, the Plugin Configuration panel for Moonwalk’s s3generic:// plugin (and for certain other plugins that provide third-party S3 support, such as s3cos://) includes a ‘UTF-16 listing order work-around’ option. Enabling it allows Moonwalk to correctly process results returned in this non-standard order, and so to scan your S3 buckets correctly and completely.

How do you determine whether you need to enable this option?

The following experiment will test the sort order of your S3-compatible device.

  1. Create a new folder on a Windows server with Moonwalk Agent installed
  2. Add files with the EXACT names shown below - use copy & paste to get them right
    • file_ꦏ_1.txt
    • file__2.txt
    • file__3.txt
    • file_𐎣_4.txt
      • Don’t worry about the order in which Windows shows the files, and don’t worry if some programs show the characters between the underscores as a box, a question mark or similar
  3. Use an Ingest policy to upload this folder to a test bucket on your S3-compatible storage
  4. Use a Gather Statistics policy to scan the location to which you just ingested the files
    a. Tick ‘Export raw file metadata’
    b. Untick the ‘Compress (gzip)’ option
    c. Choose ‘CSV’ format
  5. Check the exported CSV data (e.g. using Notepad) to determine the order in which the files appear:
    • If the files appear in 1, 2, 3, 4 order: congratulations, your S3-compatible device uses the expected AWS ordering - you should NOT tick the ‘UTF-16 listing order work-around’ box
    • If the files appear in 1, 4, 2, 3 order: your device is using UTF-16 code-unit order - you WILL need to tick the ‘UTF-16 listing order work-around’ box
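
If you would rather not judge the order by eye, the comparison in step 5 can be sketched in Python as below. This is a hedged illustration, not a Moonwalk tool: the function name classify_listing_order is hypothetical, and the sketch assumes you have already copied the file names out of the exported CSV into a list in the order they appear, since the CSV column layout is not described here.

```python
def classify_listing_order(names):
    """Report which sort order a list of names follows.

    `names` is assumed to hold the file names exactly as they appear,
    top to bottom, in the exported listing.
    """
    # UTF-8 byte order matches Unicode code-point order.
    by_code_point = sorted(names, key=lambda n: n.encode("utf-8"))
    # Big-endian UTF-16 byte order matches UTF-16 code-unit order.
    by_utf16 = sorted(names, key=lambda n: n.encode("utf-16-be"))

    matches_code_point = list(names) == by_code_point
    matches_utf16 = list(names) == by_utf16

    if matches_code_point and matches_utf16:
        # The names do not contain characters that separate the two orders.
        return "indistinguishable"
    if matches_code_point:
        return "code-point"
    if matches_utf16:
        return "utf-16"
    return "unsorted"


# Hypothetical observed order, for illustration only.
observed = [
    "file_\ua98f_a.txt",
    "file_\U000103a3_c.txt",
    "file_\uff21_b.txt",
]
print(classify_listing_order(observed))
```

A result of "utf-16" corresponds to the case where the work-around box needs to be ticked; "code-point" corresponds to the expected AWS ordering. A result of "indistinguishable" suggests the names lost their special characters somewhere along the way and the files should be recreated.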

Note: this option does not change the order in which results are actually returned; it only ensures that Moonwalk processes them correctly.