Finding patterns in these numbers
I am currently working on a project. In this project I have set of data which follows particular algorithm. I have to find the pattern.
1 355138022809833 RUPQ730562P 247001 20578330 70175500
2 355138022809841 RUPQ730563D 247001 72754950 71957850
3 355138023475287 RVSQ831978E 247001 39374170 25101090
4 355138023475295 RVSQ831979F 247001 06260280 87190670
5 355138023475303 RVSQ831980L 247001 05025410 26440510
6 355138023475352 RVSQ831985Y 247001 96637700 48209200
7 355138023475360 RVSQ831986A 247001 27362620 70790740
8 355138023475378 RVSQ831987P 247001 16576600 30002180
9 355138023475386 RVSQ831988D 247001 74778020 98010580
10 355138023475402 RVSQ831990M 247001 25716170 97946520
First column is the serial number. Next 3 columns are the input which will be given. Next 2 are the output of the algorithm.
So basically
I have 3 variables x, y, z (2nd, 3rd, 4th column of above data)
And
y1 = f1(x, y, z)
y2 = f2(x, y, z)
y1 is 5th column of above data
y2 is 6th column of above data
I have above data. Now I need to find function f1 and f2.
What procedure should I follow? What steps must be taken?
EDIT 1 by Krishna Kant Sharma
I posted this question to not to ask for the answering algorithm. I just asked for the steps necessary to take, to solve these kinds of problem, when we have alphabets in variables too. This is the first time, in my experience, some small portion of stackoverflow community have acted like closed minded people. Whats the whole point of stackoverflow? We are here to help each other understand and resolve problems. To lend a helping hand when some of us need it. So why dont we stop beating around some t开发者_C百科echnical pureness (like alphabets are not alphabetical characters), and solve the main problem.
More Data
11 355138023475436 RVSQ831993L 247001 07481830 49057990
12 355138023475444 RVSQ831994T 247001 65090950 87729430
13 355138023475451 RVSQ831995B 247001 06689330 60021180
14 355138023475469 RVSQ831996K 247001 05784310 69836640
15 355138023475477 RVSQ831997Z 247001 13157740 35850670
16 355138023475485 RVSQ831998Y 247001 68658020 77311320
17 355138023475501 RVSQ832000N 247001 01567780 26994970
18 355138023475519 RVSQ832001E 247001 43775370 58120770
19 355138023475527 RVSQ832002F 247001 42463550 55145190
20 355138023475535 RVSQ832003R 247001 85766840 15491950
First step you should undertake is to get to know what the context of this input data is. Then you would have the option to make assumptions on what the result columns might be and what algorithms/function are typically used in that context.
The next point is to analyze yourself the input data looking for patterns and matchings with real world things (e.g. zip codes, serial numbers, dates, etc.) Therefor you should look at different parts of the input, but also at similar blocks of input.
If you don't succeed at the points before, you'll have no other choice than trial and error. You can still sort out some functions or algorithms by looking at the input data (e.g. letters will render typical mathematic functions useless, so maybe it's some hash functions.)
To shed some light on your input data:
- the last character of x (and also y) is looking like a checksum char, so if you look for patterns check the number/text without the last char
- the letters in y might be some common abbreviation of currencies, business processes or something else
- the last 0 in the result columns might also be some checksum char or depending on the z column (not enough data to tell)
I'd try some (common) hash functions on some combinations of the input data which produce results of 8 digits and look for results.
Sorry, but I think what you are asking is impossible (from a computational point of view).
The system this data comes from could be doing, say,
SELECT Y1, Y2 FROM my_secret_data WHERE Col1 = x AND Col2 = Y and Col3 = Z;
Where my_secret_data
contains values that are not computationally derived.
So unless you had the underlying table, you could never find an algorithm that solves it (unless you had every possible combination of inputs and outputs - which would then mean reconstructing the entire table)
Outside computation, all I think you can do is look for patterns, and try to find out what the input / output values represent, and see where that takes you.
Edit:
All is not lost in certain situations; things would be different if the inputs, outputs and any functions the algorithm uses were continuous(given the inputs are alphanumeric this does not look the case here)
If they were, you could (probably) find an algorithm via interpolating (perhaps a neural network), but under these circumstances, given the values, I think you would need a lot more sample data.
Look at rows 5 and 6. All 3 input parameters are the same and yet the output is different. I don't think this is possible to solve having only the data you gave us.
Where are you getting these, and how are they being used?
If the entity that is using the real f1 and f2 is using them for authentication, and they're at all intelligent, f1 and f2 are going to be cryptographically secure functions, and you have essentially no chance of breaking them. If they're just checksums, you could try common checksum algorithms. I'd start with Wikipedia.
How much data can you get? Can you get the values of f1 and f2 for arbitrary inputs, or are you limited to what you can observe? If you can observe the values for inputs differing in one character, you can see how much of a change this makes. If the results are mostly the same, it's not a cryptographic hash, and you've got a better chance.
How important is this to you and your company? I'd say you have very little chance of succeeding, unless there's more than I'm seeing right now. It's extremely likely that any solution would involve a large number of brute-force searches.
Incidentally, in your solving process, don't use all the data. You can come up with some sort of function to fit any data points. Keep a little data back for tests to see if your derived function works on outside data.
Finally, is this even ethical? You haven't told us where these numbers come from, and it seems plausible that they're something another company generates for purposes of security, and their intentions may be good or bad. That's something to think about, if only because a company that behaves unethically to others might well behave unethically to its employees.
You can always define a piecewise function:
f1(355138022809833,RUPQ730562P,247001)=20578330 f1(355138022809841,RUPQ730563D,247001)=72754950
etc. as you don't require the continuity.
Given that the z value is not changing in your sample data, it can be removed. There's really no general approach to solve this if you only have data. On the other hand, if you have the ability to test the function with arbitrary inputs of your own devising, you can use techniques similar to differential cryptanalysis.
This seems like a problem that would be best-solved by a neural network. Hopefully you can get a bigger data set to train with though!
If you are sure there is a pattern you could try machine learning techniques. But your data-set to set up and train the "machine" is quite small (only 10 each). Further more you need a prediction, so several algorithms will not work for you like clustering, classification, association mining. Neural networks would be such a kind of technique. It's an option you can actually try out. Unfortunately I'm no machine learning/data mining expert and can't tell you the way. For Java have a look at WEKA.
Look at the data from various angles:
What values occur, which digits.
Look at differences. This exposes that x and y are sequential and that there is a close relation when you strip off the last digit/letter.
Look at patterns. y1 begins with 06 then 05 in rows 4, 5 and 13, 14. The difference of the serial numbers minus the check digit is 16. It might be a coincidence, or it might not.
Run statistical tests (not much data to go by here).
Look at the data in various number systems (hex, binary).
Look at the prime factorization of numbers.
Look at the effect of small differences in the data.
You may initially want to exclude the first two rows, because their serial numbers are far off the others, which may obscure a possible pattern.
Try to find out as much background about the computation as you can.
Some knowledge of cryptanalysis wouldn't be bad either.
Then make up some working hypothesis how the y1 and y2 values are computed and test it. For example, the first I'd check would be some bit twiddling with shift and xor (maybe a CRC), or some linear function of the serial number modulo 10000000, disregarding the trailing zeros.
Rinse and repeat. If you have enough patience and it is not too hard you could find it.
精彩评论