Take apart XML text in C
My objective is to read an XML text file and split each word and tag onto its own line in an array.
For example, if I input this text into my program:
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
I would get this:
<note>
<to>
Tove
</to>
<from>
...
Right now I have code that does this successfully, but only for the words, so instead of the above list I get:
note
to
Tove
...
I want to keep the tags, or I won't be able to do what I want with the result. I have been trying to get it to add the tags as well, but so far I have failed.
Here is my code:
//While the file is not empty
while(fgets(buffer, sizeof(buffer), stdin) != NULL){
int first = 0;
int last = 0;
//While words are left in line
while(last < INITIAL_SIZE && buffer[last] != '\0'){
int bool = 0;
//Tag detected
if(buffer[last] == '<'){
while(buffer[last] != '>'){
last++;
}
bool = 1;
}else{
//While more chars are in the word
while(last < INITIAL_SIZE && isalpha(buffer[last])){
last++;
}
}
//Word detected
if(first < last){
//Words array is full, add more space
if(numOfWords == sizeOfWords){
sizeOfWords = sizeOfWords + 10;
words = (char **) realloc(words, sizeOfWords*sizeof(char *));
}
//Allocate memory for array
words[numOfWords] = (char *) calloc(last-first+1, sizeof(char));
for(i = 0; i < (last-first); i++){
words[numOfWords][i] = buffer[first + i];
}
//Add terminator to "new word"
words[numOfWords][i] = '\0';
numOfWords++;
}
//Move "Array Pointers" accordingly
last++;
first = last;
}
}
Does anyone have any idea? With the above code, this is the printout:
<note
<to
Tove
to
<from
Jani
from
<heading
...
Don
t
forget
me
this
weekend
</body
</note
So after this wall of text, does anyone have any idea how I can modify my current code to get this to work? Or does anyone else have an alternative?
My basic way of thinking is this: first is the index of the first character included in the current word; last is the index of the first character not included in the current word.
In your program, when you detect a tag, you stop on the > without including it. Once you step past the > so that it becomes part of the tag, the unconditional last++ at the end of the loop is no longer needed, since the words are already being parsed correctly. In addition, you check only for \0 as the end of the string; you should also treat \n as the end of the line.
Here's my solution:
while (fgets(buffer, sizeof(buffer), stdin) != NULL) {
int first = 0;
int last = 0;
//While words are left in line
while (last < INITIAL_SIZE && buffer[last] != '\0'
&& buffer[last] != '\n') { // <--------- Add this
int Bool = 0;
//Tag detected
if (buffer[last] == '<') {
while (buffer[last] != '>') {
last++;
}
last++; // <--------- This
Bool = 1;
} else {
//While more chars are in the word
while (last < INITIAL_SIZE && isalpha(buffer[last])) {
last++;
}
}
//Word detected
if (first < last) {
//Words array is full, add more space
if (numOfWords == sizeOfWords) {
sizeOfWords = sizeOfWords + 10;
words = (char **) realloc(words,
sizeOfWords * sizeof(char *));
}
//Allocate memory for array
words[numOfWords] = (char *) calloc(last - first + 1,
sizeof(char));
for (i = 0; i < (last - first); i++) {
words[numOfWords][i] = buffer[first + i];
}
//Add terminator to "new word"
words[numOfWords][i] = '\0';
numOfWords++;
}
//Move "Array Pointers" accordingly
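if (first == last) { last++; } // Nothing was consumed (space, punctuation): skip one char so the loop always advances
first = last; // <--------- And change this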
}
}
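The half-open convention described above (first is included, last is not) also makes the copy step easy to pull out of the loop. The following is only a sketch: copy_token is a hypothetical helper name that does not appear in the original code, and it uses malloc plus memcpy in place of the calloc and character-by-character loop.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the original code): returns a newly
 * allocated, NUL-terminated copy of buffer[first..last), that is, first
 * is included and last is not. Returns NULL if the allocation fails.
 * The caller is responsible for calling free() on the result. */
char *copy_token(const char *buffer, size_t first, size_t last)
{
    assert(first <= last);            /* the range must not be reversed */

    size_t len = last - first;        /* number of characters to copy   */
    char *token = malloc(len + 1);    /* +1 for the terminator          */
    if (token == NULL)
        return NULL;

    memcpy(token, buffer + first, len);
    token[len] = '\0';
    return token;
}
```

With a helper like this, the body of the "Word detected" branch could shrink to something like words[numOfWords++] = copy_token(buffer, first, last); after checking the result for NULL.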
Even though it is highly doubtful that anyone would ever use this, I got it to work by using Boolean-type logic.
while (fgets(buffer, sizeof(buffer), stdin) != NULL) {
int first = 0;
int last = 0;
//While words are left in line
while (last < INITIAL_SIZE && buffer[last] != '\0' && buffer[last] != '\n'){
int Bool = 0;
//Tag detected
if (buffer[last] == '<'){
while (buffer[last] != '>')
last++;
Bool = 1;
}else
//While more chars are in the word
while(last < INITIAL_SIZE && !isspace(buffer[last]) && buffer[last] != '<')
last++;
//Word detected
if (first < last) {
//Words array is full, add more space
if (numOfWords == sizeOfWords) {
sizeOfWords = sizeOfWords + 10;
words = (char **) realloc(words, sizeOfWords * sizeof(char *));
}
//Allocate memory for array
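// +2 instead of +1: the copy below can write one extra char (the '>' or the restored '<') before the terminator
words[numOfWords] = (char *) calloc(last - first + 2, sizeof(char));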
int xHolder = 0;
if(buffer[first] == '/'){
words[numOfWords][0] = '<';
xHolder++;
Bool++;
}
for (i = 0; i < (last - first + Bool); i++) {
words[numOfWords][xHolder] = buffer[first + i];
xHolder++;
}
//Add terminator to "new word"
words[numOfWords][i] = '\0';
numOfWords++;
}
//Move "Array Pointers" accordingly
last++;
first = last;
}
}
The best advice I can give here is what was given to me when I posted this on comp.lang.c.
Functions
Pretty much everywhere you've written a full-line comment, the important words from the comment should be the name of a function called at that point.
ProcessFile
    while(fgets..)
        ProcessWords()
ProcessWords
    if(DetectTag)
        ...
Refactoring in this way makes complicated code much easier to read (for you, too). It allows your top-level logic to read like pseudocode, while all the fiddly bits can be grouped together. Maybe someday, tags will use curly braces. Put your literals in #defines or even enums. That way simple syntax changes can be made easily later on.
The goal is that you should be able to see the entire function body on the screen at the same time. This allows you to verify each piece separately.
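As a rough illustration of that advice, here is one way the scanning logic might be split into small functions, with the tag delimiters in #defines. All of the names (TAG_OPEN, TAG_CLOSE, skip_spaces, scan_tag, scan_word, next_token) are made up for this sketch and do not come from the original post. It also assumes that a word ends at whitespace or at the start of a tag, so "Don't" stays in one piece.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Delimiters live in one place, so a syntax change is a one-line edit. */
#define TAG_OPEN  '<'
#define TAG_CLOSE '>'

/* Skip whitespace starting at pos; return the new position. */
static size_t skip_spaces(const char *line, size_t pos)
{
    while (line[pos] != '\0' && isspace((unsigned char)line[pos]))
        pos++;
    return pos;
}

/* line[pos] must be TAG_OPEN. Return the index one past TAG_CLOSE,
 * or the end of the string if the tag is never closed. */
static size_t scan_tag(const char *line, size_t pos)
{
    assert(line[pos] == TAG_OPEN);
    while (line[pos] != '\0' && line[pos] != TAG_CLOSE)
        pos++;
    if (line[pos] == TAG_CLOSE)
        pos++;                        /* include the closing delimiter */
    return pos;
}

/* Return the index one past the word starting at pos. A word ends at
 * whitespace, at the start of a tag, or at the end of the string. */
static size_t scan_word(const char *line, size_t pos)
{
    while (line[pos] != '\0' && line[pos] != TAG_OPEN
           && !isspace((unsigned char)line[pos]))
        pos++;
    return pos;
}

/* Find the next token at or after *pos. On success, store its half-open
 * range in [*first, *last), advance *pos past it, and return 1.
 * Return 0 when only whitespace (or nothing) is left on the line. */
int next_token(const char *line, size_t *pos, size_t *first, size_t *last)
{
    size_t start = skip_spaces(line, *pos);
    if (line[start] == '\0') {
        *pos = start;
        return 0;
    }
    *first = start;
    *last = (line[start] == TAG_OPEN) ? scan_tag(line, start)
                                      : scan_word(line, start);
    *pos = *last;
    return 1;
}
```

The per-line loop then reads almost like the outline above: start with pos = 0 and call next_token(buffer, &pos, &first, &last) until it returns 0, storing buffer[first..last) each time. Each function fits on one screen and can be checked on its own.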
You might be having a problem in your inner loop.