Pricelist parser

I have to create a price-list parser that imports data from Excel or CSV and puts it into a database. I have no problem reading the data from the source. I need to automatically find the columns that contain the price, product title, and description.

What would you suggest? Are there common methods or libraries for this?

Data sample 1:

Intel Core 2 Duo E6300 (2.80GHz, 1066MHz, 2MB, S775) tray  |    83
Intel Core 2 Duo E6500 (2.93GHz, 1066MHz, 2MB, S775) tray  |    86

Data sample 2:

     Title                     Description                Guaranty     Price  
Intel Core 2 Duo E6300  |  2.80GHz, 1066MHz, 2MB, S775   |  12       |  83    
Intel Core 2 Duo E6500  |  2.93GHz, 1066MHz, 2MB, S775   |  6        |  86

Data sample 3:

 UPC                Title                      Price
 456546545     |  Intel Core 2 Duo E6300    |  83 
 4654654654    |  Intel Core 2 Duo E6500    |  out of stock


I recently wrote an address parser and the general strategy I used was to first pull out all the items that have a distinguishable pattern. In my case I first found the Postal Code which is analogous to price in your example. From there I found the state code, etc.

In your example I would find the price and remove it from the line. From there you will need to find some pattern in the data that lets you parse out the product code. Without seeing more REAL data it is hard to decide what that pattern is. In my address parser I used address suffixes (Rd, St, Court, etc.) to help identify the end of an address line.
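As a rough illustration of "find the price and remove it from the line", here is a minimal C# sketch. It assumes the price is the last `|`-separated field, as in your samples. `PriceExtractor` and `SplitOffPrice` are hypothetical names made up for this example, not part of any library. If the last field is not numeric (for example "out of stock"), it returns no price and leaves the line intact.

```csharp
using System;
using System.Globalization;

static class PriceExtractor
{
    // Assumption: the price, when present, is the last '|'-separated field.
    // Returns the parsed price (or null) and the line with the price removed.
    public static (decimal? Price, string Rest) SplitOffPrice(string line)
    {
        int bar = line.LastIndexOf('|');
        if (bar < 0)
            return (null, line.Trim()); // no separator: nothing to split off

        string lastField = line.Substring(bar + 1).Trim();

        // Invariant culture is an assumption; pick the culture that matches your files.
        bool isNumber = decimal.TryParse(lastField, NumberStyles.Number,
                                         CultureInfo.InvariantCulture, out decimal price);
        if (!isNumber)
            return (null, line.Trim()); // e.g. "out of stock": leave the line untouched

        return (price, line.Substring(0, bar).Trim());
    }
}
```

Calling `PriceExtractor.SplitOffPrice` on each line gives you the price first, and you can then look for patterns in what is left.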

If you can provide more data we could probably be more helpful.


If you're using SQL Server, I would suggest not creating a program at all and using SQL Server Integration Services, which has built-in support for CSV and Excel.


Depending on the quality of your input (are all input strings formatted the same way?), you could try the following:

string s = "Intel Core 2 Duo E6300 (2.80GHz, 1066MHz, 2MB, S775) tray  |    83";
// Locate each delimiter once instead of searching for it repeatedly.
int open = s.IndexOf('(');
int close = s.IndexOf(')');
int bar = s.IndexOf('|');
string firstPart = s.Substring(0, open).Trim(); //returns "Intel Core 2 Duo E6300"
string secondPart = s.Substring(open + 1, close - open - 1).Trim(); //returns "2.80GHz, 1066MHz, 2MB, S775"
string thirdPart = s.Substring(close + 1, bar - close - 1).Trim(); //returns "tray"
string fourthPart = s.Substring(bar + 1).Trim(); //returns "83"

But if your data is not uniformly formatted, you may need to do some (or a lot of) checking before you can use the calls above.
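One possible way to do that checking is to match the whole line against a pattern and reject anything that does not fit, rather than calling Substring on indices that may be -1. The sketch below swaps in a regular expression for the index arithmetic. `LineParser`, `TryParse`, and the group names are hypothetical, and the pattern only covers the "title (spec) note | price" layout of the first sample.

```csharp
using System.Text.RegularExpressions;

static class LineParser
{
    // Expected layout (an assumption): title ( spec ) note | price
    private static readonly Regex Pattern = new Regex(
        @"^(?<title>[^(|]+)\((?<spec>[^)]*)\)(?<note>[^|]*)\|(?<price>[^|]*)$");

    // Returns false, with empty outputs, when the line does not follow the layout.
    public static bool TryParse(string line, out string title, out string spec,
                                out string note, out string price)
    {
        title = spec = note = price = "";

        Match m = Pattern.Match(line);
        if (!m.Success)
            return false;

        title = m.Groups["title"].Value.Trim();
        spec = m.Groups["spec"].Value.Trim();
        note = m.Groups["note"].Value.Trim();
        price = m.Groups["price"].Value.Trim();
        return true;
    }
}
```

Lines for which `LineParser.TryParse` returns false can be logged or handed to a different parser instead of throwing an exception.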
