
Databricks Autoloader - dealing with combined files

I'm working with some files that have several complexities:

  • multiple tab-delimited files concatenated into one
  • CSV files with metadata preceding the CSV data
  • CSV files with an extra row after the header that should be ignored
  • CSV files with log lines interspersed throughout the file

My question is whether Autoloader can split the stream (i.e. one input file into two or more output files) based on pattern matching, or whether it has some other mechanism for dealing with these scenarios.

Ignoring the metadata using skipRows isn't an option, as I want to retain the metadata in a separate output file. The rescued data column also doesn't appear to be a valid approach, since the data doesn't fall into the three scenarios identified in the docs, i.e.:

  1. The column is missing from the schema.
  2. Type mismatches.
  3. Case mismatches.
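As far as I can tell, Autoloader won't fan one input file out into multiple outputs by itself, so one workaround I've been considering (a sketch, not a Databricks-specific feature) is to ingest each file as raw lines via `cloudFiles.format = "text"` and do the pattern matching myself, e.g. inside a `foreachBatch` sink. The classifier below is a plain-Python illustration of that splitting step; the regex patterns, the header string, and the section names are all assumptions standing in for my real file layout:

```python
import re

# Hypothetical patterns -- these would need to match the real file layout.
LOG_LINE = re.compile(r"^\d{4}-\d{2}-\d{2}.*\b(INFO|WARN|ERROR)\b")
HEADER_LINE = re.compile(r"^id,name,value$")  # the known CSV header

def split_combined_file(lines):
    """Classify each line of a combined file as 'meta', 'log', or 'data'.

    Metadata is everything before the CSV header; log lines may appear
    anywhere; the single extra row after the header is dropped.
    """
    sections = {"meta": [], "log": [], "data": []}
    seen_header = False
    skip_next_row = False
    for line in lines:
        if LOG_LINE.match(line):
            sections["log"].append(line)
        elif not seen_header:
            if HEADER_LINE.match(line):
                seen_header = True
                skip_next_row = True  # ignore the junk row after the header
                sections["data"].append(line)
            else:
                sections["meta"].append(line)
        elif skip_next_row:
            skip_next_row = False  # drop the extra row, keep nothing
        else:
            sections["data"].append(line)
    return sections
```

In a Databricks job each of the three buckets could then be written to its own output path, so the metadata survives as a separate file instead of being discarded by skipRows.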
