Databricks Autoloader - dealing with combined files
I'm working with files that have several complexities:
- multiple tab-delimited files concatenated into one
- csv files with some meta data prior to the csv data
- csv files with an extra row after the header that should be ignored
- csv files with log information interspersed into the file
My question relates to whether Autoloader can split the stream (i.e. one input file to two or more output files) based on pattern matching, or whether it has some other mechanism for dealing with these scenarios.
Ignoring the metadata using skipRows isn't an option, as I want to retain the metadata in a separate output file. The rescuedDataColumn option doesn't appear to be a valid approach either, as the data doesn't fall into the three scenarios identified in the docs, i.e.:
- The column is missing from the schema.
- Type mismatches.
- Case mismatches.
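To make the splitting I have in mind concrete, here's a minimal pure-Python sketch of the per-file pattern matching I'd want Autoloader (or a downstream foreachBatch step over files read as plain text) to apply. The patterns and the `split_combined_file` helper are hypothetical illustrations, not real Autoloader features:

```python
import re

# Hypothetical patterns for illustration only; real files would need their own.
META_RE = re.compile(r"^#\s*meta:")              # metadata lines before the CSV header
LOG_RE = re.compile(r"^\[(INFO|WARN|ERROR)\]")   # log lines interspersed in the data

def split_combined_file(lines):
    """Split one combined file's lines into (metadata, data, logs) groups."""
    meta, data, logs = [], [], []
    for line in lines:
        if META_RE.match(line):
            meta.append(line)
        elif LOG_RE.match(line):
            logs.append(line)
        else:
            data.append(line)
    return meta, data, logs

sample = [
    "# meta: source=plant-7",
    "id,value",
    "units,kwh",             # extra row after the header, to drop in a later step
    "1,42",
    "[INFO] rotated file",
    "2,7",
]
meta, data, logs = split_combined_file(sample)
```

Each group could then be written to its own output location, which is the "1 input file to 2 or more output files" behaviour I'm asking about.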