How can I parse a space-delimited string in PHP?

Part of the PHP application I'm building parses an RSS feed of upcoming jobs and internships. The <description> for each feed entry is a series of tags or labels containing four standard pieces of information:

  1. Internship or job
  2. Full or part time
  3. Type (one of 4 types: Local Gov, HR, Non-profit, Other)
  4. Name of organization

However, everything is space-delimited, turning each entry into a mess like this:

  • Internship Full time Local Gov NASA
  • Job Part time HR Deloitte
  • Job Full time Non-profit United Way

I'm trying to parse each line and use the pieces of the string as variables. If this list were delimited in any standard way, I could easily use something like list($job, $time, $type, $name) = explode(",", $description) to parse the string and use the pieces individually.

I can't do that with this data, though. If I use explode(" ") I'll get lots of useless variables ("Full", "time", "Local", "Gov", for example).

Though the list isn't delimited, the first three pieces of information are standard and can only be one of 2–4 different options, essentially creating a dictionary of allowable terms (except the last one, the name of the organization, which is variable). Because of this it seems like I should be able to parse these strings, but I can't think of the best/cleanest/fastest way to do it.

preg_replace seems like it would require lots of messy regexes; a series of if/then statements (if the string contains "Local Gov" set $type to "Local Gov") seems tedious and would only capture the first three variables.

So, what's the most efficient way to parse a non-delimited string against a partial dictionary of allowed strings?

Update: I have no control over the structure of the incoming feed data. If I could I'd totally delimit this, but it's sadly not possible…

Update 2: To clarify, the first three options can only be the following:

  1. Internship | Job
  2. Full time | Part time
  3. Local Gov | HR | Non-profit | Other

That's the pseudo dictionary I'm talking about. I need to somehow strip those strings out of the main string and use what's left over as the organization name.


It's just a matter of getting your hands dirty it seems:

$input = 'Internship Full time Local Gov NASA';

// Preconfigure known data here; these will end up
// in the output array with the keys defined here
$known_data = array(
    'job'  => array('Internship', 'Job'),
    'time' => array('Full time', 'Part time'),
    // add more known strings here
);

$parsed = array();
foreach($known_data as $key => $options) {
    foreach($options as $option) {
        if(substr($input, 0, strlen($option)) == $option) {
            // Skip recognized token and next space
            $input = substr($input, strlen($option) + 1);
            $parsed[$key] = $option;
            break;
        }
    }
}

// Drop all remaining tokens into $parsed with numeric
// keys; you could do something else with them if desired
$parsed += explode(' ', $input);

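As a rough sketch of how the loop above might look once the 'type' options are added, the version below wraps it in a function. The function name parseEntry and the 'name' key are assumptions made for illustration; they are not part of the original answer. It also keeps the leftover text as one string rather than exploding it, so multi-word organization names survive.

```php
<?php
// Hypothetical wrapper around the loop above, with the 'type' options filled in.
function parseEntry($input) {
    $known_data = array(
        'job'  => array('Internship', 'Job'),
        'time' => array('Full time', 'Part time'),
        'type' => array('Local Gov', 'HR', 'Non-profit', 'Other'),
    );

    $parsed = array();
    foreach ($known_data as $key => $options) {
        foreach ($options as $option) {
            // Accept the option only when it starts the string and is followed by a space
            if (strpos($input, $option . ' ') === 0) {
                // Skip the recognized token and the space after it
                $input = substr($input, strlen($option) + 1);
                $parsed[$key] = $option;
                break;
            }
        }
    }

    // Whatever is left is treated as the organization name, kept as one string
    // ('name' is an assumed key, not from the original answer)
    $parsed['name'] = $input;
    return $parsed;
}

print_r(parseEntry('Job Full time Non-profit United Way'));
```

A field that matches none of its options is simply left out of the result, so callers would probably want to check that all three keys are present.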


<?php

$a = array (
'Internship Full time Local Gov NASA',
'Job Part time HR Deloitte',
'Job Full time Non-profit United Way',
);


foreach ($a as $s)
{
    if (preg_match ('/^\s*(Internship|Job)\s+(Part time|Full time)\s+(Local Gov|HR|Non-profit|Other)\s+(.*)/', $s, $match))
    {
        array_shift ($match);
        list($job, $time, $type, $name) =  $match;

        echo "$job, $time, $type, $name\n";
    }

}


Obviously, the optimal thing to do would be to change the RSS feed to use a different delimiter or (even better) put the four items into separate tags/elements/attributes/whatever.

But assuming that's not possible: Given what you describe, I would focus on making the code clear to read and maintain (and modify) at the expense of performance and compactness. The code will be larger, and it won't scale well if you go from 4 fields to 40 fields, but if you are confident that things won't change so much, you and anyone who has to take over maintaining the code will be happier. (Include a comment explaining the space-delimiting problem so that people understand why you did it the way you did.)

So, rethink the problem. Instead of parsing the string all at once, figure out how to pull just the first item off. (I would match each of the possibilities with preg_match() using ^ in the regexp to indicate that the match has to appear at the start of the string. If the regexp is really long because the dictionary is big but there are no special chars to worry about, consider storing the dictionary as an array and using implode() to create a string delimited by | to use as your regexp.)

Do that three times for the first three elements (removing it from the string each time, probably), then the fourth element is your last element.

Maybe put each of the element retrieval routines into its own function that calls a subsequent function that gets passed the dictionary. The subsequent function can then do the implode() and pulling of the substring off the string.

Something like that, anyway. It won't be compact code, but someone reading it will be able to tell what's going on and the regexps won't be too crazy.
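
The approach described above could be sketched roughly as follows. The function names takeLeadingTerm and parseDescription are hypothetical, chosen only for this example. Each call builds a ^-anchored pattern from a dictionary array with implode() and removes the matched term from the front of the string. preg_quote() is added as a precaution in case a dictionary term ever contains a regex metacharacter.

```php
<?php
// Hypothetical helper: remove one dictionary term from the front of $input
// and return it, or return null if none of the terms match.
function takeLeadingTerm(string &$input, array $dictionary): ?string
{
    // Try longer terms first so a short term cannot shadow a longer one
    usort($dictionary, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    // Build "term1|term2|..." with each term escaped for use inside /.../
    $alternatives = implode('|', array_map(function ($term) {
        return preg_quote($term, '/');
    }, $dictionary));

    // ^ forces the match to the start; the term must be followed by whitespace or the end
    if (!preg_match('/^(' . $alternatives . ')(?:\s+|$)/', $input, $m)) {
        return null;
    }

    // Drop the matched term and its trailing whitespace from the input
    $input = (string) substr($input, strlen($m[0]));
    return $m[1];
}

// Hypothetical driver: three dictionary lookups, then the remainder is the name.
function parseDescription(string $description): ?array
{
    $rest = trim($description);
    $job  = takeLeadingTerm($rest, ['Internship', 'Job']);
    $time = takeLeadingTerm($rest, ['Full time', 'Part time']);
    $type = takeLeadingTerm($rest, ['Local Gov', 'HR', 'Non-profit', 'Other']);

    if ($job === null || $time === null || $type === null) {
        return null; // the line does not follow the expected layout
    }
    return ['job' => $job, 'time' => $time, 'type' => $type, 'name' => $rest];
}

print_r(parseDescription('Internship Full time Local Gov NASA'));
```

Returning null for a line that does not fit the layout is one possible choice; logging the line or throwing an exception would work just as well, depending on how the feed is consumed.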


Try splitting the input with explode() on "\n" (double-quoted, so that PHP treats it as a newline). Then, inside a foreach, strip the known keywords from each line; you will probably have to explode() what remains on ' '.


function startsWith($key, $data) {
   // get the length of the key we are looking for
   $len = strlen($key);
   // Check if the key matches the initial portion of the string
   if ($key === substr($data, 0, $len)) {
      // if yes return the remainder of the string
      return substr($data, $len);
   } else {
      // return false
      return false;
   }
}

This lets you check whether the string starts with a given key and then process the remainder accordingly.
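
One way startsWith() might be applied to the three fixed fields is sketched below. The helper stripOption is hypothetical and not part of the answer; startsWith() is repeated in condensed form only so the snippet runs on its own. Because the function returns the remainder, which can be an empty string, the result is compared against false explicitly.

```php
<?php
// Condensed copy of startsWith() from above so this snippet is self-contained.
function startsWith($key, $data) {
    $len = strlen($key);
    return ($key === substr($data, 0, $len)) ? substr($data, $len) : false;
}

// Hypothetical helper: try each option in turn and return
// array(matched option or null, remaining string).
function stripOption(array $options, $data) {
    foreach ($options as $option) {
        $rest = startsWith($option, $data);
        // Use !== false: an empty remainder is still a successful match
        if ($rest !== false) {
            return array($option, ltrim($rest));
        }
    }
    return array(null, $data);
}

$line = 'Job Part time HR Deloitte';
list($job, $line)  = stripOption(array('Internship', 'Job'), $line);
list($time, $line) = stripOption(array('Full time', 'Part time'), $line);
list($type, $line) = stripOption(array('Local Gov', 'HR', 'Non-profit', 'Other'), $line);
$name = $line; // whatever remains is taken as the organization name

echo "$job | $time | $type | $name\n";
```

Note that this is a plain prefix check, so an option such as "HR" would also match the start of a longer word; the sample data does not contain such a case.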


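If the dictionary above is complete, you can just take out the non-functional words.

$input = str_replace(array(' time', ' Gov'), '', $input);

Now the first three fields are single words, so explode(' ', $input, 4) splits them off and leaves the organization name in the fourth element. Be aware that this also alters any organization name that happens to contain " time" or " Gov".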
