
Fixing garbled text when reading files with Java Spark

Contents
  • I. The problem
  • II. A trail of failed attempts
    • 1. textFile + Configuration (garbled)
    • 2. spark.read().option (garbled)
    • 3. newAPIHadoopFile + Configuration (garbled)
    • 4. newAPIHadoopFile + custom InputFormat (garbled)
    • 5. newAPIHadoopRDD + custom InputFormat (garbled)
  • III. The final solution

    I. The problem

    The environment is JDK 1.8 and Spark 3.2.1. Reading a GB18030-encoded file from Hadoop produces garbled text (mojibake).
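The symptom can be reproduced outside Spark entirely. This minimal sketch (the sample string is made up) encodes text as GB18030 and then decodes the bytes as UTF-8, which is effectively what Spark's default text path does:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        Charset gb18030 = Charset.forName("GB18030");
        String original = "中文测试";                    // sample text, made up

        byte[] fileBytes = original.getBytes(gb18030);   // bytes as stored in HDFS

        // Decoding GB18030 bytes as UTF-8 produces U+FFFD replacement characters
        String garbled = new String(fileBytes, StandardCharsets.UTF_8);
        System.out.println("garbled:   " + garbled);

        // Decoding with the file's real charset recovers the text
        System.out.println("recovered: " + new String(fileBytes, gb18030));
    }
}
```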

    II. A trail of failed attempts

    I tried many approaches to solve this problem, but none of them worked.

    1. textFile + Configuration (garbled)

        String filePath = "hdfs:///user/test.deflate";
        // Create the SparkSession and SparkContext instances
        String encoding = "GB18030";

        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("Spark Example")
                .getOrCreate();

        JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
        Configuration entries = sc.hadoopConfiguration();
        entries.set("textinputformat.record.delimiter", "\n");
        entries.set("mapreduce.input.fileinputformat.inputdir", filePath);
        entries.set("mapreduce.input.fileinputformat.encoding", "GB18030");

        JavaRDD<String> rdd = sc.textFile(filePath);
    

    2. spark.read().option (garbled)

        Dataset<Row> load = spark.read().format("text").option("encoding", "GB18030").load(filePath);

        load.foreach(row -> {
            System.out.println(row.toString());
            System.out.println(new String(row.toString().getBytes(encoding), "UTF-8"));
            System.out.println(new String(row.toString().getBytes(encoding), "GBK"));
        });
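Looking back, the re-encoding trick in the snippet above cannot work: by the time `row.toString()` runs, the bytes have already been mis-decoded, and undecodable GB18030 sequences have been replaced with U+FFFD, so the original bytes are gone. A plain-Java sketch (sample string made up) shows why the round trip is lossy:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyDecodeDemo {
    public static void main(String[] args) {
        Charset gb18030 = Charset.forName("GB18030");
        byte[] raw = "中文测试".getBytes(gb18030);      // sample bytes, made up

        // First decode with the wrong charset: malformed sequences are
        // replaced by U+FFFD, destroying the original byte values
        String garbled = new String(raw, StandardCharsets.UTF_8);

        // Re-encoding and re-decoding cannot undo that loss
        String roundTripped = new String(garbled.getBytes(gb18030), StandardCharsets.UTF_8);
        System.out.println(roundTripped.equals("中文测试")); // prints false
    }
}
```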
    

    3. newAPIHadoopFile + Configuration (garbled)

        JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD = sc.newAPIHadoopFile(filePath, TextInputFormat.class, LongWritable.class, Text.class, entries);

        System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
        longWritableTextJavaPairRDD.foreach(k -> {
            System.out.println(k._2());
        });
    

    4. newAPIHadoopFile + custom InputFormat (garbled)

        JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDDjs = sc.newAPIHadoopFile(filePath, GBKInputFormat.class, LongWritable.class, Text.class, entries);

        System.out.println("longWritableTextJavaPairRDDjs count = " + longWritableTextJavaPairRDDjs.count());
        longWritableTextJavaPairRDDjs.foreach(k -> {
            System.out.println(k._2());
        });
    

    The GBKInputFormat.class used here is a copy of TextInputFormat.class with the internal UTF-8 encoding replaced by GB18030.

    5. newAPIHadoopRDD + custom InputFormat (garbled)

        JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD1 = sc.newAPIHadoopRDD(entries, GBKInputFormat.class, LongWritable.class, Text.class);
        System.out.println("longWritableTextJavaPairRDD1 count = " + longWritableTextJavaPairRDD1.count());
        longWritableTextJavaPairRDD1.foreach(k -> {
            System.out.println(k._2());
        });
    

    III. The final solution

    In all the attempts above, the character encoding I specified never seemed to take effect, and I don't know why; if anyone knows the reason, please enlighten me, thanks.

    The final solution is as follows:

        JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD = sc.newAPIHadoopFile(filePath, TextInputFormat.class, LongWritable.class, Text.class, new Configuration());

        System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
        longWritableTextJavaPairRDD.foreach(k -> {
            System.out.println(new String(k._2().copyBytes(), encoding));
        });

        JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD1 = sc.newAPIHadoopRDD(entries, TextInputFormat.class, LongWritable.class, Text.class);

        System.out.println("longWritableTextJavaPairRDD1 count = " + longWritableTextJavaPairRDD1.count());
        longWritableTextJavaPairRDD1.foreach(k -> {
            System.out.println(new String(k._2().copyBytes(), encoding));
        });
    

    The key is new String(k._2().copyBytes(), encoding): Hadoop's Text holds the record's raw bytes and its toString() always decodes them as UTF-8, so the fix is to pull the bytes out with copyBytes() and decode them yourself with the file's real charset.
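The decode-once idea can be checked without a cluster. In the sketch below, the byte array stands in for what k._2().copyBytes() returns (the sample string and class name are illustrative):

```java
import java.nio.charset.Charset;

public class CopyBytesDemo {
    public static void main(String[] args) {
        String encoding = "GB18030";

        // Stand-in for k._2().copyBytes(): the record's raw bytes from the file
        byte[] recordBytes = "编码测试".getBytes(Charset.forName(encoding));

        // Decode the raw bytes exactly once, with the file's real charset
        String line = new String(recordBytes, Charset.forName(encoding));
        System.out.println(line);
    }
}
```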

    That concludes this article on fixing garbled text when reading files with Java Spark.
