Fast file search algorithm for IP addresses
Question
What is the fastest way to find if an IP address exists in a file that contains IP addresses sorted as:
219.93.88.62 219.94.181.87 219.94.193.96 220.1.72.201 220.110.162.50 220.126.52.187 220.126.52.247
Constraints
- No database (e.g., MySQL, PostgreSQL, Oracle, etc.)
- Infrequent pre-processing is allowed (see possibilities section)
- Would be nice not to have to load the file each query (131Kb)
- Uses under 5 megabytes of disk space
- No e开发者_StackOverflowxtra PHP modules
File Details
- One IP address per line
- 9500+ lines
Possible Solutions
- Create a directory hierarchy (radix tree?) then use
is_dir()
(sadly, this uses 87 megabytes)
Scanning the file line by line to find an IP seems like a pain if you have 9,000 non-matches to check before you get to 232.0.17.1
Is your file constrained to a single file? e.g. lets say this list is banned IPs and you just want to see if one is "in" the list.
What if you made a DIR to contain multiple files:
BannedIPs
+- 0.ips
+- 1.ips
+- 37.ips
+- 123.ips
+- 253.ips
+- 254.ips
Each file only contains IP addresses that start with that number.
If you were lucky enough to have even distribution... you'd have 256 files, but each would only have ~37 entries.
Thus when you want to test: 232.0.17.1
you look in the 232.ips
file and scan for it.
Since your file stores the IP addresses in sorted order already you can quickly find a specific IP address in O(log(n)) time by using a binary search.
If you want to speed this up even further you can cache for example every 100th row in memory and use an in-memory binary search first, then you know which part of the file you need to read in to finish the search.
Having said that 131kB is really not that much, so the simplest and fastest solution is to buy more memory and cache the entire file in memory in a hash table.
EDIT I didn't notice the php
tag, I don't know if the following type of thing is possible in that language. But I'll leave it for the idea anyway.
An IPv4 address is representable as a 32-bit number, so I'd just make an array of int32
, translate your addresses into the 'ints` with the following Python-ish psuedocode:
x = 0
i = 24
s = '111.222.333.444'
for part in s.split('.'):
x += part.toint() << i
i -= 8
IPlist.append(x)
Then you can get the input address, convert it to an int
the same way, and do binary search on the array.
For ~10 k lines, the array will take ~40 kBytes.
Might not be fast, but I'd try this out: If the IP address file doesn't change much, read the file into an array and cache it (maybe Memcache) and search from there on every request.
精彩评论