
Creating a confusion matrix out of various data files but getting ValueError: x and y must be the same size

I'm new to Python and trying to build a sentiment analysis using VADER.

I pulled data for 13 artists into individual dataframes, converted the lyrics to words, kept only the unique words, removed stopwords and so on, then put it all into a single df:

# for each artist: clean the data, keep a single occurrence of each word, and add the result to the list
df_allocate = []
for df in df_all:
    df_clean = cleaning(df)
    df_words = to_unique_words(df_clean)
    df_allocate.append(df_words)

frames = df_allocate
# create the new column with the information of words lists
df_main = pd.concat(frames, ignore_index=True)
df_main = df_main.reset_index(drop=True)

Now I'm trying to train a logistic regression model, predict test results and get a confusion matrix.

I think I'm getting confused about how data frames work and also how to train_test_split the data correctly.

Right now, I have:

for column_name in df_all:
    cv = CountVectorizer(max_features=100000)
    X = cv.fit_transform(df_main['Artist']).toarray()
    y = column_name.sentiment

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.50, random_state=20)

    classifier = LogisticRegression(random_state= 25)
    classifier.fit(X_train, y_train)

    y_predict = classifier.predict(X_test)

    print_confusionMatrix = confusion_matrix(y_test, y_predict)
    print(print_confusionMatrix)
    print("accuracy score : ", accuracy_score(y_test, y_predict))

When I debug the program, I can see why it's complaining, but I don't know how to fix it. I looked up how to iterate through a dataframe and tried

for df in df_all.index

but it didn't work.

The columns are Artist, Title, Album, Date, Lyric, Year, and sentiment. What I want to accomplish is to iterate through each artist (df_all holds the dataframes of the individual artists, which is why I use it), predict the sentiment of their lyrics, and build a confusion matrix for each of the 13 artists.
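To make the goal concrete, here is a minimal sketch of a per-artist loop in which X and y both come from the same dataframe, so they always have the same number of rows. The toy data and the helper `make_artist_df` are made up for illustration, and vectorizing the Lyric column is an assumption about which text should be the feature.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split


def make_artist_df(name):
    """Hypothetical stand-in for one artist's dataframe (made-up lyrics)."""
    lyrics = ["love happy sun", "joy love smile", "happy bright joy", "smile sun love",
              "sad cry rain", "pain dark cry", "rain sad lonely", "dark pain lonely"]
    labels = ["pos"] * 4 + ["neg"] * 4
    return pd.DataFrame({"Artist": name, "Lyric": lyrics, "sentiment": labels})


# df_all is a list with one dataframe per artist, as in the question.
df_all = [make_artist_df("Artist A"), make_artist_df("Artist B")]

results = {}
for df in df_all:
    artist = df["Artist"].iloc[0]

    # Features and labels are taken from the SAME dataframe: one row per lyric.
    cv = CountVectorizer(max_features=100000)
    X = cv.fit_transform(df["Lyric"])
    y = df["sentiment"]

    # stratify=y keeps both classes in the train and test halves of this toy data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.50, random_state=20, stratify=y)

    classifier = LogisticRegression(random_state=25)
    classifier.fit(X_train, y_train)
    y_predict = classifier.predict(X_test)

    # Fixing the label order gives every artist a matrix with the same layout.
    cm = confusion_matrix(y_test, y_predict, labels=["neg", "pos"])
    results[artist] = (cm, accuracy_score(y_test, y_predict))
    print(artist)
    print(cm)
    print("accuracy score : ", results[artist][1])
```

Whether one model per artist or one model over df_main is the better design depends on how much data each artist has; the sketch only shows how to keep X and y aligned inside the loop.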

In a previous attempt I changed X and y to use df_main, so it was:

X = cv.fit_transform(df_main).toarray()
y = df_main.sentiment

However, this is where I get the error that x and y must be the same size.
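One likely cause of the size mismatch, shown as a small sketch on made-up data: iterating over a whole DataFrame yields its column names, so `CountVectorizer` treats each column name as a document and returns one row per column instead of one row per lyric. Passing a single text column (assumed here to be Lyric) gives one row per lyric, which matches the length of the sentiment column.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up data: 4 rows, 3 columns.
df_main = pd.DataFrame({
    "Artist": ["A", "A", "B", "B"],
    "Lyric": ["love and sun", "rain and pain", "joy and love", "dark and cold"],
    "sentiment": ["pos", "neg", "pos", "neg"],
})
y = df_main["sentiment"]

# Passing the whole DataFrame: the vectorizer iterates over the column names
# ("Artist", "Lyric", "sentiment"), so X_wrong has one row per COLUMN.
X_wrong = CountVectorizer().fit_transform(df_main)

# Passing one text column: one row per lyric, aligned with y.
X_right = CountVectorizer().fit_transform(df_main["Lyric"])

print("rows in X_wrong:", X_wrong.shape[0])
print("rows in X_right:", X_right.shape[0])
print("rows in y:      ", len(y))
```

The same mismatch can arise in the loop version when X is built from df_main while y is taken from an individual artist's dataframe, since the two have different numbers of rows.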

Please push me in the right direction. I'm quite lost.
