Bitcoin

Fixing Garbled Text When Syncing Oracle to Doris with SeaTunnel 2.3.9

When using SeaTunnel 2.3.9 to sync data from Oracle to Doris, you may encounter garbled characters—especially if the Oracle database uses the ASCII character set. But don’t panic—this article walks you through why this happens and how to fix it.

🧠 Root Cause

The issue stems from how SeaTunnel reads data from Oracle. If Oracle is using a character set like ASCII, and you’re syncing to Doris (which expects proper UTF-8 or other compatible encodings), Chinese characters can become unreadable.

The key is to intercept and re-encode the data when it is read from the Oracle ResultSet.

🔍 Understanding the SeaTunnel Reading Flow

Let’s look at the SeaTunnel internals that handle JDBC data ingestion:

1. JdbcSourceFactory

This class:

  • Loads your source configurations.
  • Constructs JdbcSourceConfig and JdbcDialect.
  • Creates a JdbcSource instance.

2. JdbcSource

This:

  • Initializes a SourceSplitEnumerator to split the tasks.
  • Creates a JdbcSourceReader to execute them.

3. JdbcSourceReader

Responsible for:

  • Building the JdbcInputFormat.
  • Repeatedly calling the pollNext() method to fetch data.

4. pollNext() Method

This method:

  • Calls open() in JdbcInputFormat to prepare the PreparedStatement and ResultSet.
  • Then calls nextRecord() to process the ResultSet and convert it to a SeaTunnelRow.

5. nextRecord() and the Encoding Problem

In JdbcInputFormat:

  • The nextRecord() method calls toInternal() in JdbcRowConverter.
  • The default implementation uses JdbcFieldTypeUtils.getString(rs, resultSetIndex).

💥 Problem: If the ResultSet contains Chinese characters stored as ASCII, this method returns garbled text.

✅ Solution Strategy

We need to detect the source encoding and re-encode the data at the moment it’s retrieved from the ResultSet.

Here’s how to do it:

🛠 Implementation Steps

Step 1: Add Charset Parameters

In JdbcInputFormat, add:

private final Map params;

In the constructor:

public JdbcInputFormat(JdbcSourceConfig config, Map tables) {
    this.jdbcDialect = JdbcDialectLoader.load(config.getJdbcConnectionConfig().getUrl(), config.getCompatibleMode());
    this.chunkSplitter = ChunkSplitter.create(config);
    this.jdbcRowConverter = jdbcDialect.getRowConverter();
    this.tables = tables;
    this.params = config.getJdbcConnectionConfig().getProperties(); // <-- get charset info here
}

Step 2: Pass params to the Row Converter

In the nextRecord() method of JdbcInputFormat, update the method call to:

SeaTunnelRow seaTunnelRow = jdbcRowConverter.toInternal(resultSet, splitTableSchema, params);

Step 3: Add Encoding Method

In AbstractJdbcRowConverter, define:

public static String convertCharset(byte[] value, String charSet) {
    if (value == null || value.length == 0) {
        return null;
    }
    log.info("Value bytes: {}", Arrays.toString(value));
    try {
        return new String(value, charSet);
    } catch (UnsupportedEncodingException e) {
        throw new RuntimeException(e);
    }
}

Step 4: Modify toInternal() for String Types

In AbstractJdbcRowConverter, update the STRING type handling like so:

case STRING:
    if (params == null || params.isEmpty()) {
        fields[fieldIndex] = JdbcFieldTypeUtils.getString(rs, resultSetIndex);
    } else {
        String sourceCharset = params.get("sourceCharset");
        if ("GBK".equalsIgnoreCase(sourceCharset)) {
            fields[fieldIndex] = convertCharset(JdbcFieldTypeUtils.getBytes(rs, resultSetIndex), sourceCharset);
        } else {
            fields[fieldIndex] = JdbcFieldTypeUtils.getString(rs, resultSetIndex);
        }
    }
    break;

Step 5: Rebuild and Deploy

After making the above changes:

  1. Rebuild the connector-jdbc module.
  2. Replace the existing connector-jdbc-2.3.9.jar under SeaTunnel’s connectors directory.
  3. Restart the SeaTunnel cluster.

🧾 Configuration Tips

  • If your Oracle database does not have encoding issues, you don’t need to pass the sourceCharset property.
  • If needed, pass it like this in your config:
sourceCharset=GBK
  • To debug logging from connector-jdbc, check the worker logs in the SeaTunnel logs directory.

✅ Summary

By adding a simple charset-switching mechanism and tweaking the JDBC source implementation, you can eliminate garbled characters when syncing Oracle data to Doris using SeaTunnel.

No more broken characters—your data pipeline just got smarter. 🚀

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Check Also
Close
Back to top button