As data sources grow more complex and business demands evolve quickly, general-purpose data integration frameworks run into recurring problems in production: irregular data structures, missing fields, sensitive information mixed into records, and unclear data semantics. To handle these scenarios, a leading publicly listed cybersecurity enterprise carried out secondary development on Apache SeaTunnel, building a scalable, easy-to-maintain data processing and intelligent fault-tolerance mechanism. This article introduces the functional extensions that were built and the design ideas behind them.
Background and Pain Points
In practical business scenarios, the data sources we face are highly heterogeneous, including but not limited to log files, FTP/SFTP files, Kafka messages, and database changes. The data itself may be structurally inconsistent, and may even arrive as unstructured text or semi-structured XML. The following problems are particularly prominent:
- Insufficient complex data parsing capability: complex data such as XML, key-value strings, and irregular records cannot be parsed and ingested.
- Lack of data completion and dictionary translation capability: raw logs cannot be supplemented with asset information, so data is incomplete and key information is missing, which limits analysis and makes it hard to extract value from the data.
- Limited file reading modes: newly appended logs cannot be captured and parsed in real time, which delays security threat detection and undermines real-time analysis and downstream alerting.
- Weak exception handling mechanisms: data senders may change the log format without notifying receivers, which interrupts running tasks. Without such notification, problems are hard to locate and resolve quickly.
To address these scenarios, we built multiple Transform plugins on SeaTunnel for regex parsing, XML parsing, key-value parsing, dynamic data completion, IP address completion, data masking, and dictionary translation, and extended SFTP/FTP reading to support incremental reads.
1. Regex Parsing (Regex Transform)
Used for parsing structured or semi-structured text fields. By configuring regular expressions and specifying group mappings, raw text can be split into multiple business fields. This method is widely used in log parsing and field splitting scenarios.
Core Parameters:
- source_field: The original field to be parsed
- regex: Regular expression, e.g., (\d+)-(\w+)
- groupMap: The mapping relationship between parsed result fields and regex capture group indexes
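The group mapping can be illustrated with a minimal Python sketch. This is not the plugin's code: the function `regex_parse` and the target field names `event_id` and `event_type` are hypothetical, and returning `None` for unmatched input is an assumption about how a tolerant parser might behave.

```python
import re

def regex_parse(row, source_field, regex, group_map):
    """Split row[source_field] into new fields using capture-group indexes."""
    match = re.search(regex, row.get(source_field) or "")
    out = dict(row)
    for target, index in group_map.items():
        # Assumption: unmatched input yields None instead of raising,
        # so a single malformed row does not stop the batch.
        out[target] = match.group(index) if match else None
    return out

row = {"raw": "1024-login"}
parsed = regex_parse(row, "raw", r"(\d+)-(\w+)", {"event_id": 1, "event_type": 2})
print(parsed)
```

Here the mapping plays the role of `groupMap`: each key is an output field and each value is the index of the capture group that feeds it.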
2. XML Parsing
Uses the VTD-XML parser together with XPath expressions to extract XML nodes, attributes, and text content and convert them into structured data.
Core Parameters:
- pathMap: Maps each result field to the XPath expression of the node or attribute to extract
- source_field: The XML string field name
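As a rough illustration of the path-to-field mapping, the sketch below uses Python's standard `xml.etree.ElementTree` rather than VTD-XML, so it only supports a limited XPath subset and its paths are relative to the root element. The function name `xml_parse`, the sample document, and the `path/@attr` convention for reading attributes are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def xml_parse(row, source_field, path_map):
    """Extract node text or attribute values into new fields."""
    root = ET.fromstring(row[source_field])
    out = dict(row)
    for target, path in path_map.items():
        # Hypothetical convention: "node/path/@attr" reads an attribute,
        # anything else reads the node's text.
        node_path, _, attr = path.partition("/@")
        node = root.find(node_path)
        if node is None:
            out[target] = None
        elif attr:
            out[target] = node.get(attr)
        else:
            out[target] = node.text
    return out

row = {"xml": '<event id="42"><host>web-01</host><user name="alice"/></event>'}
parsed = xml_parse(
    row,
    "xml",
    {"event_id": "./@id", "host": "./host", "user": "./user/@name"},
)
print(parsed["event_id"], parsed["host"], parsed["user"])
```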
3. Key-Value Parsing
Parses strings such as "key1=value1;key2=value2" into structured fields. Both the key-value delimiter and the field delimiter are configurable.
Core Parameters:
- source_field: Upstream key-value string field
- field_delimiter: Key-value pair delimiter (e.g., ;)
- kv_delimiter: Key and value delimiter (e.g., =)
- fields: Set of target mapped field keys
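The parameters above can be read as the arguments of a small parsing function. The following Python sketch is illustrative only; `kv_parse` is a hypothetical name, and filling absent keys with `None` is an assumption rather than documented plugin behavior.

```python
def kv_parse(row, source_field, field_delimiter=";", kv_delimiter="=", fields=()):
    """Parse a delimited key-value string and keep only the requested keys."""
    pairs = {}
    for item in (row.get(source_field) or "").split(field_delimiter):
        key, sep, value = item.partition(kv_delimiter)
        if sep:  # skip fragments that contain no key/value delimiter
            pairs[key.strip()] = value.strip()
    out = dict(row)
    for name in fields:
        # Assumption: requested keys that are absent become None.
        out[name] = pairs.get(name)
    return out

parsed = kv_parse(
    {"raw": "key1=value1;key2=value2"},
    "raw",
    fields=["key1", "key2", "key3"],
)
print(parsed)
```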
4. Dynamic Data Completion (Lookup Enrichment)
Dynamically fills in missing fields from auxiliary data streams or dictionary tables, for example device asset information or user attributes.
| Name | Type | Required | Description |
|---|---|---|---|
| source_table_join_field | String | Yes | Source table join field, used to get the source table field value |
| dimension_table_join_field | String | Yes | Dimension table join field, used to get the dimension table data |
| dimension_table_jdbc_url | String | Yes | JDBC URL of the dimension table |
| driver_class_name | String | Yes | JDBC driver class name |
| username | String | Yes | Username |
| password | String | Yes | Password |
| dimension_table_sql | String | Yes | SQL statement for the dimension table; the queried fields are passed to the downstream step |
| data_cache_expire_time_minutes | long | Yes | Cache refresh interval, in minutes |
Implementation Highlights:
- Support external data source lookup based on key fields
- Local cache to improve lookup performance
- Configurable timed cache data refresh
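The three highlights above (keyed lookup, local cache, timed refresh) can be sketched as follows. This is a minimal illustration under stated assumptions, not the plugin's implementation: `CachedLookup` and `load_assets` are hypothetical, and the loader callable stands in for running `dimension_table_sql` over JDBC.

```python
import time

class CachedLookup:
    """Keyed lookup against a dimension table held in a local, time-limited cache."""

    def __init__(self, load_dimension, expire_minutes, clock=time.monotonic):
        self._load = load_dimension      # callable returning {join_key: {column: value}}
        self._ttl = expire_minutes * 60  # cache lifetime in seconds
        self._clock = clock
        self._cache = {}
        self._loaded_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._cache = self._load()
            self._loaded_at = now

    def enrich(self, row, source_join_field):
        """Return the row with the matching dimension columns merged in."""
        self._refresh_if_stale()
        extra = self._cache.get(row.get(source_join_field), {})
        return {**row, **extra}

def load_assets():
    # Stand-in for the dimension table query; the values are invented.
    return {"10.0.0.5": {"asset_name": "db-server", "owner": "ops"}}

lookup = CachedLookup(load_assets, expire_minutes=10)
enriched = lookup.enrich({"src_ip": "10.0.0.5", "action": "login"}, "src_ip")
print(enriched)
```

Reloading the whole table on expiry keeps the sketch simple; a per-key cache with individual expiry would suit dimension tables too large to hold in memory.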
5. IP Address Completion
Derives geographic information such as country, city, and region from IP fields using a locally integrated IP2Location database.
Parameters:
- field: Source IP field
- output_fields: Geographic fields to extract (e.g., country, city)
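To show the shape of the transform without the IP2Location database, the sketch below replaces it with a tiny in-memory table of network ranges. Everything in it is an assumption for illustration: the function `ip_complete`, the table `GEO_TABLE`, the reserved documentation networks, and the invented location labels.

```python
import ipaddress

# Placeholder data: reserved documentation networks with invented labels.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), {"country": "Country A", "city": "City A"}),
    (ipaddress.ip_network("198.51.100.0/24"), {"country": "Country B", "city": "City B"}),
]

def ip_complete(row, field, output_fields):
    """Add the requested geographic fields for the IP address in row[field]."""
    out = dict(row)
    try:
        addr = ipaddress.ip_address(row.get(field) or "")
    except ValueError:
        addr = None  # malformed or missing IP: leave geo fields empty
    geo = {}
    if addr is not None:
        for network, info in GEO_TABLE:
            if addr in network:
                geo = info
                break
    for name in output_fields:
        out[name] = geo.get(name)
    return out

print(ip_complete({"ip": "192.0.2.10"}, "ip", ["country", "city"]))
```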
6. Data Masking
Masks sensitive information such as phone numbers, ID card numbers, email addresses, and IP addresses. Several masking rules are supported (masking, fuzzy replacement, etc.) to help meet privacy compliance requirements.
| Name | Type | Required | Description |
|---|---|---|---|
| field | String | Yes | The field to be desensitized |
| rule_type | String | Yes | Rule type: Positive desensitization, Equal desensitization |
| desensitize_type | String | Yes | Desensitization type: Phone number, ID number, Email, IP address; required when choosing positive desensitization |
| equal_content | String | No | Equal content; required when choosing equal desensitization |
| display_mode | String | Yes | Display mode: Full desensitization, Head-tail desensitization, Middle desensitization |
Common Masking Strategies:
- Mask middle four digits of phone numbers: 138****8888
- Mask email account name: x***@domain.com
- Mask IP address: 192.168.*.*
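The three strategies above can be written as small string functions. This is a sketch, not the plugin's rule engine: the function names are hypothetical, the phone rule assumes an 11-digit number, and falling back to full masking for unexpected input is an assumption.

```python
def mask_phone(phone):
    """Hide the middle four digits of an 11-digit phone number."""
    if len(phone) != 11:
        return "*" * len(phone)  # assumption: unexpected input is fully masked
    return phone[:3] + "****" + phone[-4:]

def mask_email(email):
    """Keep the first character of the account name and the full domain."""
    name, sep, domain = email.partition("@")
    if not sep or not name:
        return "*" * len(email)
    return name[0] + "***@" + domain

def mask_ip(ip):
    """Hide the last two octets of an IPv4 address."""
    parts = ip.split(".")
    if len(parts) != 4:
        return "*" * len(ip)
    return ".".join(parts[:2] + ["*", "*"])

print(mask_phone("13812348888"))
print(mask_email("xavier@domain.com"))
print(mask_ip("192.168.1.20"))
```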
7. Dictionary Translation
Convert encoded values into business semantics (e.g., gender code 1 => Male, 2 => Female), improving data readability and report quality.
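A dictionary translation amounts to a per-field code lookup, as in the minimal sketch below. The function `dict_translate`, the target field `gender_name`, and the choice to pass unknown codes through unchanged are all assumptions made for illustration.

```python
def dict_translate(row, field_map, dictionaries):
    """field_map: {target_field: source_field}; dictionaries: {source_field: {code: label}}."""
    out = dict(row)
    for target, source in field_map.items():
        code = row.get(source)
        # Assumption: codes missing from the dictionary pass through unchanged.
        out[target] = dictionaries.get(source, {}).get(code, code)
    return out

translated = dict_translate(
    {"gender": "1"},
    {"gender_name": "gender"},
    {"gender": {"1": "Male", "2": "Female"}},
)
print(translated)
```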
| Name | Type | Required | Description |
|---|---|---|---|
| fields | String | Yes | Field list, format: target field name = source field name |
| dict_fields | Array<Map> | Yes | Dictionary field list. Each dictionary object has the following attributes: |
| fieldName |
