Creating a confusion matrix from several data files, but getting ValueError: x and y must be the same size
I'm new to Python and am trying to do sentiment analysis using VADER.
I pulled data for 13 artists into individual dataframes, converted the lyrics to words, kept only the unique words, removed stopwords and so on, then put everything into a single dataframe:
# for each artist: clean the data, keep a single occurrence of each word, and add the result to the list
df_allocate = []
for df in df_all:
    df_clean = cleaning(df)
    df_words = to_unique_words(df_clean)
    df_allocate.append(df_words)
frames = df_allocate
# combine the per-artist word lists into a single dataframe
df_main = pd.concat(frames, ignore_index=True)
df_main = df_main.reset_index(drop=True)
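As a small illustration of the combining step above, here is a minimal sketch using two made-up frames in place of the real per-artist data (the frame contents are assumptions, not the actual dataset). It suggests that `pd.concat(..., ignore_index=True)` already produces a fresh 0..n-1 index, so the extra `reset_index(drop=True)` is probably not needed:

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical data) playing the role of per-artist frames.
df_a = pd.DataFrame({"Artist": ["A", "A"], "Lyric": ["la la", "na na"]})
df_b = pd.DataFrame({"Artist": ["B"], "Lyric": ["do re mi"]})

frames = [df_a, df_b]

# ignore_index=True discards the original indexes and numbers the rows from 0.
df_main = pd.concat(frames, ignore_index=True)

print(len(df_main))         # total number of rows across all frames
print(list(df_main.index))  # row labels after concatenation
```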
Now I'm trying to train a logistic regression model, predict test results and get a confusion matrix.
I think I'm getting confused about how data frames work and also how to train_test_split the data correctly.
Right now, I have:
for column_name in df_all:
    cv = CountVectorizer(max_features=100000)
    X = cv.fit_transform(df_main['Artist']).toarray()
    y = column_name.sentiment
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=20)
    classifier = LogisticRegression(random_state=25)
    classifier.fit(X_train, y_train)
    y_predict = classifier.predict(X_test)
    print_confusionMatrix = confusion_matrix(y_test, y_predict)
    print(print_confusionMatrix)
    print("accuracy score : ", accuracy_score(y_test, y_predict))
When I debug the program, I can see why it's complaining, but I don't know how to fix it. I looked up how to iterate through a dataframe and tried:
for df in df_all.index
but it didn't work.
The columns are Artist, Title, Album, Date, Lyric, Year, and sentiment. What I want to accomplish is to iterate through each artist (df_all holds the dataframe for each individual artist, which is why I use it) and predict the sentiment of their lyrics, in order to build a confusion matrix for all 13 artists.
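The per-artist goal described above could be sketched roughly as follows. This is only a sketch under stated assumptions: the helper `make_artist_df`, the lyrics, and the `"pos"`/`"neg"` labels are invented for illustration, and it assumes the text to vectorize is the `Lyric` column. The key point is that X and y are taken from the same dataframe inside the loop, so they have the same number of rows:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split


def make_artist_df(name):
    """Build a tiny, made-up frame standing in for one artist's data."""
    lyrics = [
        "love happy joy sunshine", "happy smile love dance",
        "joy dance sunshine smile", "love joy happy smile",
        "sad cry pain lonely", "pain tears sad dark",
        "lonely dark cry tears", "sad pain lonely cry",
    ]
    sentiment = ["pos"] * 4 + ["neg"] * 4
    return pd.DataFrame({"Artist": name, "Lyric": lyrics, "sentiment": sentiment})


# Hypothetical stand-in for the list of 13 per-artist dataframes.
df_all = [make_artist_df("Artist A"), make_artist_df("Artist B")]

results = {}
for df in df_all:
    artist = df["Artist"].iloc[0]

    # X and y both come from the same frame: one row per song.
    cv = CountVectorizer(max_features=100000)
    X = cv.fit_transform(df["Lyric"])  # sparse matrix; LogisticRegression accepts it directly
    y = df["sentiment"]

    # stratify=y keeps both classes present in the train and test halves.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.50, random_state=20, stratify=y
    )
    classifier = LogisticRegression(random_state=25)
    classifier.fit(X_train, y_train)
    y_predict = classifier.predict(X_test)

    cm = confusion_matrix(y_test, y_predict, labels=["neg", "pos"])
    acc = accuracy_score(y_test, y_predict)
    results[artist] = (cm, acc)

    print(artist)
    print(cm)
    print("accuracy score:", acc)
```

Passing `labels=` to `confusion_matrix` keeps the matrix the same shape for every artist, even if one class happens to be missing from a test split.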
In a previous attempt, I changed X and y to the following:
X = cv.fit_transform(df_main).toarray()
y = df_main.sentiment
However, this is where I get the error that x and y must be the same size.
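One likely cause of the size mismatch, shown here as a minimal sketch on made-up data: iterating over a whole DataFrame yields its column labels rather than its rows, so passing `df_main` itself to `CountVectorizer.fit_transform` appears to produce one "document" per column, while `df_main.sentiment` has one entry per row. The two-column frame below is an assumption standing in for the real `df_main`:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-in for df_main with only two of the columns.
df_main = pd.DataFrame({
    "Lyric": ["happy sunny day", "sad rainy night", "love and joy", "pain and tears"],
    "sentiment": ["pos", "neg", "pos", "neg"],
})

# Iterating over a DataFrame yields its column labels, not its rows.
print(list(df_main))

cv = CountVectorizer()
X_wrong = cv.fit_transform(df_main)           # one "document" per column label
X_right = cv.fit_transform(df_main["Lyric"])  # one document per row
y = df_main["sentiment"]

# Compare the number of samples in each X with the number of labels.
print(X_wrong.shape[0], X_right.shape[0], len(y))
```

Selecting a single text column (here `Lyric`) gives X the same number of rows as y, which is what `train_test_split` needs.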
Please push me in the right direction. I'm quite lost.