开发者

determine the type of a value which is represented as string in python

When I read a comma seperated file or string with the csv parser in python all items are represented as a string. see example below.

import csv
a = "1,2,3,4,5"
r = csv.reader([a])
for row in r:
    d = row

d ['1', '2', '3', '4', '5'] type(d[0]) <type 'str'>

I want to determine for each value if it is 开发者_运维技巧a string, float, integer or date. How can I do this in python?


You could do something like this:

from datetime import datetime

tests = [
    # (Type, Test)
    (int, int),
    (float, float),
    (datetime, lambda value: datetime.strptime(value, "%Y/%m/%d"))
]

def getType(value):
     for typ, test in tests:
         try:
             test(value)
             return typ
         except ValueError:
             continue
     # No match
     return str

>>> getType('2010/1/12')
<type 'datetime.datetime'>
>>> getType('2010.2')
<type 'float'>
>>> getType('2010')
<type 'int'>
>>> getType('2013test')
<type 'str'>

The key is in the tests order, for example the int test should be before the float test. And for dates you can add more tests for formats you want to support, but obviously you can't cover all possible cases.


This cannot be done in a reliable manner and that is not due to limitations in Python or any other programming language for that matter. A human being could not do this in a predictable manner without guessing and following a few rules (usually called Heuristics when used in this context).

So lets first design a few heuristics then encode them in Python. Things to consider are:

  • All the values are valid strings we know that because that is the basis of our problem so there is no point in checking for this at all. We should check everything else we can whatever falls through we can just leave as a string.
  • Dates are the most obvious thing to check first if they are formatted in predictable manner such as [YYYY]-[MM]-[DD]. (ISO ISO 8601 date format) they are easy to distinguish from other bits of text that contain numbers. If the dates are in a format with just numbers like YYYYMMDD then we are stuck as these dates will be indistinguishable from ordinary numbers.
  • We will do integers next because all integers are valid floats but not all floats are valid integers. We could just check if the text contains on digits (or digits and the letters A-F if hexadecimal numbers are possible) in this case treat the value as an integer.
  • Floats would be next as they are numbers with some formatting (the decimal point). It is easy to recognise 3.14159265 as a floating point number. However 5.0 which can be written simply as 5 is also a valid float but would have been caught in the previous steps and not be recognised as a float even if it was intended to be.
  • Any values that are left unconverted can be treated as strings.

Due to the possible overlaps I have mentioned above such a scheme can never be 100% reliable. Also any new data type that you need to support (complex number perhaps) would need its own set of heuristics and would have to placed in the most appropriate place in the chain of checks. The more likely a check is to match only the data type desired the higher up the chain it should be.

Now lets make this real in Python, most of the heuristics I mentioned above are taken care of for us by Python we just need to decide on the order in which to apply them:

from datetime import datetime

heuristics = (lambda value: datetime.strptime(value, "%Y-%m-%d"),
              int, float)

def convert(value):
    for type in heuristics:
        try:
            return type(value)
        except ValueError:
            continue
    # All other heuristics failed it is a string
    return value

values = ['3.14159265', '2010-01-20', '16', 'some words']

for value in values:
    converted_value = convert(value)
    print converted_value, type(converted_value)

This outputs the following:

3.14159265 <type 'float'>
2010-01-20 00:00:00 <type 'datetime.datetime'>
16 <type 'int'>
some words <type 'str'>


There's no real answer to this as far as I can tell since these are just strings. They're not integers or floats or whatever. Those are roles you decide. eg. Is 1 an integer or a float?

A couple of things come to mind though. One is to do some kind of pattern matching (eg. If it contains a decimal point, it's a float etc.). For parsing/guessing dates, you can try this or this.

You could also try 'casting' the element into whatever you want and catch exceptions to try the others. You can do something like try int, if it fails, try float and if that fails, try date etc.


What you want to acheive is difficult because the types are ambiguous: "1" could either be a string, or an int for example. At any rate, you could try something like this:

  • Dates: presumably they are in a known format: if so, you can try instantiating a datetime from the timestamp string (datetime.strptime()) and if it fails you know its not a datetime.

  • Floats: ensure all characters are either a digit and there is at least one "." in the string. Then convert to float (float(value))

  • Integers: regex the string and match digits. Ensure the string is the same lenght as the source string then convert (int(value))

  • If none of the above worked, it's a string.


Well..you can't.

How would you decide if "5" is meant as a string or an integer? How would you decide if "20100120" is meant as an integer or a date?

You can of course make educated guesses, and implement some kind of parse order. First try it as a date, then as a float, then as an int, and lastly as a string.


From the manual:

Return a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called — file objects and list objects are both suitable.

The interface requires that a string be returned each time next() is called.


The date is a bit harder. It depends on the format and how regular it is. Here is a clue to get you started on the rest.

>>> int('a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: 'a'
>>> int('1')
1
>>> float('1')
1.0
>>> float('1.0')
1.0

But notice:

>>> int(1.0)
1
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜