Fixing Garbled Text When Syncing Oracle to Doris with SeaTunnel 2.3.9

When using SeaTunnel 2.3.9 to sync data from Oracle to Doris, you may encounter garbled characters—especially if the Oracle database uses the ASCII character set. But don’t panic—this article walks you through why this happens and how to fix it.
🧠 Root Cause
The issue stems from how SeaTunnel reads data from Oracle. If Oracle is using a character set like ASCII, and you’re syncing to Doris (which expects proper UTF-8 or other compatible encodings), Chinese characters can become unreadable.
The key is to intercept and re-encode the data when it is read from the Oracle ResultSet
.
🔍 Understanding the SeaTunnel Reading Flow
Let’s look at the SeaTunnel internals that handle JDBC data ingestion:
1. JdbcSourceFactory
This class:
- Loads your source configurations.
- Constructs
JdbcSourceConfig
andJdbcDialect
. - Creates a
JdbcSource
instance.
2. JdbcSource
This:
- Initializes a
SourceSplitEnumerator
to split the tasks. - Creates a
JdbcSourceReader
to execute them.
3. JdbcSourceReader
Responsible for:
- Building the
JdbcInputFormat
. - Repeatedly calling the
pollNext()
method to fetch data.
4. pollNext()
Method
This method:
- Calls
open()
inJdbcInputFormat
to prepare thePreparedStatement
andResultSet
. - Then calls
nextRecord()
to process theResultSet
and convert it to aSeaTunnelRow
.
5. nextRecord()
and the Encoding Problem
In JdbcInputFormat
:
- The
nextRecord()
method callstoInternal()
inJdbcRowConverter
. - The default implementation uses
JdbcFieldTypeUtils.getString(rs, resultSetIndex)
.
💥 Problem: If the ResultSet contains Chinese characters stored as ASCII, this method returns garbled text.
✅ Solution Strategy
We need to detect the source encoding and re-encode the data at the moment it’s retrieved from the ResultSet.
Here’s how to do it:
🛠 Implementation Steps
Step 1: Add Charset Parameters
In JdbcInputFormat
, add:
private final Map params;
In the constructor:
public JdbcInputFormat(JdbcSourceConfig config, Map tables) {
this.jdbcDialect = JdbcDialectLoader.load(config.getJdbcConnectionConfig().getUrl(), config.getCompatibleMode());
this.chunkSplitter = ChunkSplitter.create(config);
this.jdbcRowConverter = jdbcDialect.getRowConverter();
this.tables = tables;
this.params = config.getJdbcConnectionConfig().getProperties(); // <-- get charset info here
}
Step 2: Pass params
to the Row Converter
In the nextRecord()
method of JdbcInputFormat
, update the method call to:
SeaTunnelRow seaTunnelRow = jdbcRowConverter.toInternal(resultSet, splitTableSchema, params);
Step 3: Add Encoding Method
In AbstractJdbcRowConverter
, define:
public static String convertCharset(byte[] value, String charSet) {
if (value == null || value.length == 0) {
return null;
}
log.info("Value bytes: {}", Arrays.toString(value));
try {
return new String(value, charSet);
} catch (UnsupportedEncodingException e) {
throw new RuntimeException(e);
}
}
Step 4: Modify toInternal()
for String Types
In AbstractJdbcRowConverter
, update the STRING
type handling like so:
case STRING:
if (params == null || params.isEmpty()) {
fields[fieldIndex] = JdbcFieldTypeUtils.getString(rs, resultSetIndex);
} else {
String sourceCharset = params.get("sourceCharset");
if ("GBK".equalsIgnoreCase(sourceCharset)) {
fields[fieldIndex] = convertCharset(JdbcFieldTypeUtils.getBytes(rs, resultSetIndex), sourceCharset);
} else {
fields[fieldIndex] = JdbcFieldTypeUtils.getString(rs, resultSetIndex);
}
}
break;
Step 5: Rebuild and Deploy
After making the above changes:
- Rebuild the
connector-jdbc
module. - Replace the existing
connector-jdbc-2.3.9.jar
under SeaTunnel’sconnectors
directory. - Restart the SeaTunnel cluster.
🧾 Configuration Tips
- If your Oracle database does not have encoding issues, you don’t need to pass the
sourceCharset
property. - If needed, pass it like this in your config:
sourceCharset=GBK
- To debug logging from
connector-jdbc
, check the worker logs in the SeaTunnellogs
directory.
✅ Summary
By adding a simple charset-switching mechanism and tweaking the JDBC source implementation, you can eliminate garbled characters when syncing Oracle data to Doris using SeaTunnel.
No more broken characters—your data pipeline just got smarter. 🚀