Unleashing the Power of Clean Data: A Guide to Cleaning Financial Data from Square Website and adding features



While square.com has many built-in analytics tools, it does not paint the full picture of a business’ data. Significant cleaning is required to make the best use of the data. I will go into detail of how I cleaned the data from a brewery’s square account to make it useable for further analysis.

Importing


There is an option to download csv files from square.com. These will be for individual items sold and span from January 1st of the year in question to January 1st of the following year.
square1
square2
                    import pandas as pd
                    import datetime as dt
                    import numpy as np

                    #importing square data
                    sales_2019 = pd.read_csv('data/items-2019-01-01-2020-01-01.csv')
                    sales_2020 = pd.read_csv('data/items-2020-01-01-2021-01-01.csv')
                    sales_2021 = pd.read_csv('data/items-2021-01-01-2022-01-01.csv')
                    sales_2022 = pd.read_csv('data/items-2022-01-01-2023-01-01.csv')
                    sales_2023 = pd.read_csv('data/items-2023-01-01-2024-01-01.csv')

                    df_list = [sales_2019, sales_2020, sales_2021, sales_2022, sales_2023]
                
Sales data for the years in question is imported, converted into dataframes and the resulting dataframes are put into a list.


Initial Cleaning


                    monetary_columns = ['Gross Sales', 'Discounts', 'Net Sales', 'Tax']
                    for df in df_list:
                        for col in monetary_columns:
                        #fixing monetary values
                            df[col] = df[col].str.replace(',', '', regex=True)
                            df[col] = df[col].str.replace('$', '', regex=True)
                            df[col] = df[col].astype('float')

                        #changing applicable columns to datetime
                        df['Date'] = pd.to_datetime(df['Date'])
                        # df['Time'] = pd.to_datetime(df['Time'], infer_datetime_format=True)

                        #adding new columns based off datetime
                        df['year'] = df['Date'].dt.year
                        df['month'] = df['Date'].dt.month
                        df['month_name'] = df['Date'].dt.month_name()
                        df['day'] = df['Date'].dt.day
                        df['Day_of_Week'] = df['Date'].dt.day_name()
                        df["period"] = df["year"].astype(str) + "." + df["month_name"].astype(str)
                        df['Time'] = df['Time'].apply(lambda x: dt.datetime.strptime(x, "%H:%M:%S"))
                        df['hour'] = df['Time'].dt.hour

                        #addition of new columns
                        df['Customer Name Init'] = df['Customer Name'].str.replace('[^A-Z]', '', regex=True) #changing customer names to initials for privacy
                        df['Customer Name'] = df['Customer Name'].str.replace(',', '', regex=True)
                        df['unit_price'] = df['Gross Sales'] / df['Count'] # adding a price per unit of product column
                
Customers without a name are understood to be cash paying customers and cannot be tracked.
                    # Fill in missing values
                    df['customer_name'].fillna('NoName',inplace=True)
                    df['customer_init'].fillna('NN',inplace=True)
                

Mapping item types


Since methods of entering the item type changed throughout the years, it was necessary to create dictionaries to link the ‘item’ to a beer name (i.e. IPA, Chocolate Stout) and a product type (i.e. draft, keg, can).
                    #dictionary import to map items to product types and beer names
                    import ast
                    def read_text_file(file_path):
                        with open(file_path, "r") as file:
                            content = file.read()
                        return content
                    file_path_ names = "data/names_dict.txt"
                    file_content_ names = read_text_file(file_path_ names)
                    names _dict = ast.literal_eval(file_content_ names)

                    file_path_product = "data/product_dict.txt"
                    file_content_product = read_text_file(file_path_product)
                    product_dict = ast.literal_eval(file_content_product)
                
These dictionaries could then be mapped to create new columns for the beer names and the product type.
                    df['beer_ names’] = df['Item'].map(names_dict)
                    df['product_type'] = df['Item'].map(product_dict)
                

Compiling into a single dataframe


used concat, dropped unneeded columns – as Customer ID is not consistent have to make my own later, renamed columns to be in keeping with proper formatting
                    df = pd.concat([sales_2019, sales_2020, sales_2021, sales_2022, sales_2023],
                                join='inner', axis='rows')

                    df = df.drop(['Payment ID', 'Customer ID', 'Customer Reference ID', 
                                'Price Point Name', 'Discounts', 'Net Sales', 'Tax', 
                                'Category', 'Commission', 'Employee',
                                'Unit', 'Dining Option', 'Details', 'Qty', 'SKU', 
                                'Modifiers Applied','Location', 
                                'Itemization Type', 'Fulfillment Note', 'Notes', 
                                'Device Name', 'Event Type', 'Time Zone'], axis='columns')

                    df = df.rename(columns={"Date":"date",
                                            "Time":"time",
                                            "Item":"item",
                                            "Gross Sales":"gross_sales",
                                            "Transaction ID":"transaction_id",
                                            "Customer Name":"customer_name",
                                            "Customer Name Init":"customer_init", 
                                            "Day_of_Week":"weekday",
                                            "Count":"count"})

                    df = df.sort_values(by="date", ascending=True, ignore_index=True)
                

Addition of more features


Customer ID


                    # addition of customer ID number
                    df['customer_id_no'] = pd.factorize(df['customer_name'])[0]
                    df['transaction_no'] = pd.factorize(df['transaction_id'])[0]
                    df['period_id'] = pd.factorize(df['period'])[0]
                

Unique transaction IDs needed to further consolidate the data


                    df = df.sort_values(['customer_id_no', 'date'])
                

Visit frequency of individual customers


                    df['visit_freq'] = df.groupby('customer_id_no')['transaction_no'].transform('nunique')
                    df = df.sort_values(['date', 'customer_id_no'])
                

Number of customers on a daily basis


                    df['daily_no_customers'] = df.groupby('date')['transaction_no'].transform('nunique')
                

Number of different beers sold daily


                    df = df.sort_values(['date', 'beer_ names'])
                    df['no_styles_sold'] = df.groupby('date')['beer_ names'].transform('nunique')
                

Addition of features beyond square data


Weather


Weather data might be an important factor in looking at a brewery’s sales on a day-to-day basis. Data for this project was pulled from the closest weather station to the brewery from the National Centers for Environmental Information. It lists the date, the precipitation in inches, snow in inches, the maximum temperature, minimum temperature and the percentage of sun.
After importing the csv and renaming the columns it can be merged with the square data. Any null values in precipitation or snowfall can be assumed to be zero.
                    sales_data = pd.merge(df, weather, on='date', how='left')
                    sales_data['precipitation'].fillna(0,inplace=True)
                    sales_data['snowfall'].fillna(0,inplace=True)
                

Untappd


The top social media site for beer lovers is Untappd. A user can log in a beer and rate it. The brewer can add in additional data such as ABV, a description of the beer, and IBU. Data can be exported through a paid account or scraped.
untappd
For the purposes of my project, I mapped the Untappd beer styles to an overall style to make fewer categories for easier analysis; for example, ‘IPA – American’ and ‘IPA – New England/Hazy’ are both mapped to ‘IPA’. After importing the csv, verifying the same naming convention for the beer names, and pulling out the features needed, the Untappd data can be merged with the original dataframe on the ‘beer_name’ column created earlier.
                    sales_data = pd.merge(sales_data, untappd_data, on='beer_ names’, how='left')
                

Instagram


Instagram or other social media data can be used as a feature to indicate whether an event was promoted or if there was some online visibility that attracted customers on a given day. The data can be scraped or, in my case, recorded manually. I wanted to be clear on the type of content that was promoted in a post.
I noted which beer, if any, was promoted in a post, whether the post was a general beer post or promoting an event, and the days since the last post.
instagram
                    sales_data = pd.merge(sales_data, insta_data, on='date', how='left')
                
After the final merging, the data is ready to be used for further analysis!