Whatsapp Chat Analysis

Whatsapp Chat Analysis

Using Python Basics and PowerBI

Table of contents

No heading

No headings in the article.

Ever wanted to try analyzing datasets that aren’t structured?

Look no further, as I was in this particular dilemma until I got an opportunity to volunteer and carry out a WhatsApp data analysis for the DSN ( Data Scientists Network) Port Harcourt chapter as a member.

I jumped on this task as it will enable me to stretch my knowledge beyond structured data and learn how to work with unstructured data.

The questions to solve with this analysis include:

Who are the most active persons in the group? Who are the least active persons in the group? What is the total number of contacts in the group? What is the total number of messages in the group from August to October 2022? To do this, a visualization of the monthly performance of the group will be needed as well as a visualization of the most used word by members of the group with a word cloud.

Does it look like too much work? Take a deep breath,

It’s as simple as closing your eyes and opening them 😂

So come along as we go through the following processes:

Data Collection:

This is quite interesting as WhatsApp stores data in a .txt file extension and this makes it a new thing for me. So let’s go to it. On the WhatsApp group, open it and click on the three dots at the topmost right-hand corner of the app, then click on more, click on export chat and finally choose without media.

MEET THE AUTHOR (2).png

Data Preparation:

After exporting the chat as a txt file, there was a need to prepare the data and put into a structured form. Some people use power query, but I chose to use the basics of Python to get it done. A few codes were written as seen below:

```# Importing necessary libraries for the analysis import regex #regular expressions import pandas as pd #for dataframe import numpy as np #for numeric python import emoji
from collections import Counter import matplotlib.pyplot as plt


```# Preparing of data before importing the data as it has several processes
def date_time(s):
    pattern = '^([0-9]+)(\/)([0-9]+)(\/)([0-9]+), ([0-9]+):([0-9]+)[ ]?(AM|PM|am|pm)? -'
    result = regex.match(pattern, s)
    if result:
        return True
    return False

def find_author(s):
    s = s.split(":")
    if len(s)==2:
        return True
    else:
        return False

def getDatapoint(line):
    splitline = line.split(' - ')
    dateTime = splitline[0]
    date, time = dateTime.split(", ")
    message = " ".join(splitline[1:])
    if find_author(message):
        splitmessage = message.split(": ")
        author = splitmessage[0]
        message = " ".join(splitmessage[1:])
    else:
        author= None
    return date, time, author, message

more of the codes can be seen here as the dataset was converted to dataframe with the codes and then exported to Microsoft Excel for cleaning.

Data Cleaning: The Excel file was opened and there was also a need to clean and properly format the sheet.

Raw file converted to Excel.png

The following steps were done in cleaning the file:

  • Borders were put on the document
  • Font style was changed to make all the same style and font
  • The column name “Author” was changed to “Username”.
  • Renumbering of the serial number was done.
  • Separation of the “date” column into having days and months separately using the DAY and MONTH functions in Excel.
  • Thereafter, I made some changes manually to the sheet as I really love to see sparkling clean data, in fact, I am obsessed with clean data.😂 The changes are noting and categorizing when members are added to the group, noting and categorizing when members left the group, noted those that changed numbers, noted those that were removed from the group and finally noting when the group name was changed.

The data was saved to be visualized using PowerBI. Data Analysis and Visualization:

The dataset was uploaded into PowerBI and on the Power query editor, I changed the “datetime” type to “date”, Afterwards, I viewed the column quality to check that there were no errors in the columns.

and voila! PowerBI to the rescue…

Below is the dashboard that gives the visualization of what has happened on the WhatsApp group.

Dashboard.png

Results:

  1. There are 65 active members in the group
  2. Gospel Orok, Emmanuel and Evans are the three most active members with Gospel having 36% of the overall messages.
  3. There are 8 people which were the least active in the group.
  4. The total number of messages in the group is 1417
  5. Gospel Orok also sent the highest number of URLs to the group.
  6. The time between 8:35 pm and 8:50 pm happens to be the most active time in the group
  7. 27 members were added to the group within the period of August to October 2022.
  8. 6 people left the group, 1 person changed phone number and the group name was changed once.
  9. October had more chats/activities than August and September.
  10. Saturday is the day with the highest activities with 473 chats and 35 active persons accounting for 33% of total chats.
  11. “DATA” is the most used word in the group.

Limitation:

I was supposed to carry out sensitivity analysis, I believe it is possible with PowerBI as it was possible to use Word Cloud in PowerBI by using the Word Cloud extension, but for some reason, I wouldn't want to cloud the dashboard so I will skip it and perhaps use Python next time for the full analysis.

This is a wrap!

Does it look difficult to do? I guess not!

You can follow the steps and try out yours.

Do not forget to tag me when you do yours.

Cheers!

Thank you for reading.

Kindly drop any feedback and recommendation.

Connect with me on Twitter and LinkedIn