
process csv in scala

I am using Scala 2.7.7, and want to parse a CSV file and store the data in an SQLite database.

I ended up using the OpenCSV Java library to parse the CSV file, and the sqlitejdbc library for the database.

Using these Java libraries makes my Scala code look almost identical to the equivalent Java code (minus the semicolons, plus val/var).

As I am dealing with Java objects, I can't use Scala's List, Map, etc. unless I do a Java-to-Scala conversion or upgrade to Scala 2.8.

Is there a way I can simplify my code further using Scala features that I don't know about?

val filename = "file.csv";
val reader = new CSVReader(new FileReader(filename))
var aLine = new Array[String](10)
var lastSymbol = ""
while( (aLine = reader.readNext()) != null ) {
    if( aLine != null ) {
        val symbol = aLine(0)
        if( !symbol.equals(lastSymbol)) { 
            try {
                val rs = stat.executeQuery("select name from sqlite_master where name='" + symbol + "';" )
                if( !rs.next() ) {
                    stat.executeUpdate("drop table if exists '" + symbol + "';")
                    stat.executeUpdate("create table '" + symbol + "' (symbol,data,open,high,low,close,vol);")
                }
            }
            catch {
              case sqle : java.sql.SQLException =>
                 println(sqle)

            }
            lastSymbol = symbol
        }
        val prep = conn.prepareStatement("insert into '" + symbol + "' values (?,?,?,?,?,?,?);")
        prep.setString(1, aLine(0)) //symbol
        prep.setString(2, aLine(1)) //date
        prep.setString(3, aLine(2)) //open
        prep.setString(4, aLine(3)) //high
        prep.setString(5, aLine(4)) //low
        prep.setString(6, aLine(5)) //close
        prep.setString(7, aLine(6)) //vol
        prep.addBatch()
        prep.executeBatch()
     }
}
conn.close()


If you have a simple CSV file, an alternative would be not to use any CSV library at all, but just simply parse it in Scala, for example:


case class Stock(line: String) {
  val data = line.split(",")
  val date = data(0)
  val open = data(1).toDouble
  val high = data(2).toDouble
  val low = data(3).toDouble
  val close = data(4).toDouble
  val volume = data(5).toDouble
  val adjClose = data(6).toDouble

  def price: Double = low
}

scala> import scala.io._

scala> Source.fromFile("stock.csv") getLines() map (l => Stock(l))
res0: Iterator[Stock] = non-empty iterator


scala> res0.toSeq  
res1: Seq[Stock] = List(Stock(2010-03-15,37.90,38.04,37.42,37.64,941500,37.64), Stock(2010-03-12,38.00,38.08,37.66,37.89,834800,37.89) //etc...

This has the advantage that you can use the full Scala collections API.
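To make that advantage concrete, here is a self-contained sketch: a trimmed-down version of the Stock class above, with two sample rows inlined in place of reading stock.csv, and a couple of collection methods applied to the result:

```scala
// A sketch of what the collections API buys you, using a trimmed-down
// Stock and sample rows inlined instead of reading stock.csv.
case class Stock(line: String) {
  val data = line.split(",")
  val date = data(0)
  val close = data(4).toDouble
  val volume = data(5).toDouble
}

object StockDemo extends App {
  val stocks = List(
    "2010-03-15,37.90,38.04,37.42,37.64,941500,37.64",
    "2010-03-12,38.00,38.08,37.66,37.89,834800,37.89").map(Stock(_))

  // maxBy, filter, sortBy, groupBy, ... all work out of the box
  println(stocks.maxBy(_.close).date)      // date of the highest close
  println(stocks.count(_.volume > 900000)) // rows with heavy volume
}
```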

If you prefer to use parser combinators, there's also an example of a csv parser combinator on github.


The if statement after the while is redundant: the loop condition is already meant to ensure that aLine is not null. (In fact, in Scala an assignment such as aLine = reader.readNext() evaluates to Unit, so comparing it to null is always true and that while loop never terminates; the Iterator.continually version further down avoids the problem entirely.)

Also, I don't know exactly what the contents of aLine are, but you probably want to do something like

aLine.zipWithIndex.foreach { case (s, i) => prep.setString(i + 1, s) }

instead of counting up by hand from 1 to 7. Or alternatively, you can

for (i <- 1 to 7) { prep.setString(i, aLine(i - 1)) }

If you felt like adopting a more functional style, you could replace the while with

Iterator.continually(reader.readNext()).takeWhile(_ != null).foreach { aLine =>
  // Body of while goes here
}

(and also remove the var aLine declaration). But using the while is fine. One could also refactor to avoid lastSymbol (e.g. by using a recursive def), but I'm not really sure that's worth it.
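One way to drop lastSymbol entirely is to read all rows up front and group them by symbol, so each table is created exactly once per symbol. A minimal sketch, where createTable and insertRow are hypothetical stand-ins for the JDBC calls above (note that groupBy does not preserve order across groups, which is fine here since lastSymbol was only used to avoid recreating tables):

```scala
object GroupBySymbol extends App {
  // Hypothetical stand-ins for the statement / prepared-statement calls
  def createTable(symbol: String): Unit = println(s"create table $symbol")
  def insertRow(row: Array[String]): Unit = ()

  val rows: List[Array[String]] = List(
    Array("AAPL", "2010-03-15", "37.90"),
    Array("AAPL", "2010-03-16", "38.00"),
    Array("IBM",  "2010-03-15", "127.50"))

  // One createTable per distinct symbol, then all of that symbol's rows
  rows.groupBy(_(0)).foreach { case (symbol, group) =>
    createTable(symbol)
    group.foreach(insertRow)
  }
}
```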


If you want to parse it in Scala, the built-in parser combinators are quite powerful, and once you get the hang of them, pretty easy to use. I'm no expert, but with a few spec tests this proved to be functional:

import scala.util.parsing.combinator.RegexParsers

object CSVParser extends RegexParsers {
  def apply(f: java.io.File): Iterator[List[String]] = io.Source.fromFile(f).getLines().map(apply(_))
  def apply(s: String): List[String] = parseAll(fromCsv, s) match {
    case Success(result, _) => result
    case failure: NoSuccess => throw new Exception("Parse failed: " + failure.msg)
  }

  def fromCsv: Parser[List[String]] = rep1(mainToken)
  def mainToken = (doubleQuotedTerm | singleQuotedTerm | unquotedTerm) <~ ",?".r
  def doubleQuotedTerm: Parser[String] = "\"" ~> "[^\"]+".r <~ "\""
  def singleQuotedTerm = "'" ~> "[^']+".r <~ "'"
  def unquotedTerm = "[^,]+".r

  override def skipWhitespace = false
}

It's not what I would consider a feature-complete solution, and I'm not sure how it would handle UTF-8 etc., but it seems to work at least for ASCII CSVs that contain quotes.


If you want something a bit more idiomatic and quite a bit more type safe, may I suggest kantan.csv?

It lets you turn any source of CSV data into a collection of well-typed values. To rewrite the CSV parsing part of your example (and dealing with dates as Strings because I don't know what format you receive them in), you'd write:

import java.io.File
import kantan.csv.ops._

type Row = (String, String, Double, Double, Double, Double, Double)

// Do whatever it is you need to do with each row
def sqliteMagic(row: Row): Unit = ???

new File(filename).asUnsafeCsvRows[Row](',', false).foreach(sqliteMagic)

Note that I'm not particularly fond of using tuples when you can use more specific types. Using kantan.csv's shapeless module, you can write it a bit more nicely:

import java.io.File
import kantan.csv.ops._
import kantan.csv.generic.codecs._

case class Symbol(name: String, date: String, open: Double, high: Double, low: Double, close: Double, vol: Double)

def sqliteMagic(symbol: Symbol): Unit = ???

new File(filename).asUnsafeCsvRows[Symbol](',', false).foreach(sqliteMagic)

Note how you did not have to do any work to support the Symbol case class, which is pretty nice and still type safe thanks to shapeless.

Full disclosure: I'm the author of kantan.csv.
