<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>hyuunk_.log</title>
        <link>https://velog.io/</link>
        <description></description>
        <lastBuildDate>Thu, 23 Apr 2026 08:21:35 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>hyuunk_.log</title>
            <url>https://velog.velcdn.com/images/hyuunk_/profile/4d95ae28-83c5-4a4a-80df-202ef8911399/social_profile.jpeg</url>
            <link>https://velog.io/</link>
        </image>
        <copyright>Copyright (C) 2019. hyuunk_.log. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/hyuunk_" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[Gamezone Sales Analysis Project ]]></title>
            <link>https://velog.io/@hyuunk_/Gamezone-Sales-Analysis-Project</link>
            <guid>https://velog.io/@hyuunk_/Gamezone-Sales-Analysis-Project</guid>
            <pubDate>Thu, 23 Apr 2026 08:21:35 GMT</pubDate>
            <description><![CDATA[<h2 id="overview">Overview</h2>
<p>This Project involved cleaning and preprocessing the raw dataset in Excel as well as EDA and visualization in Tableau. The dataset contained customer orders, region details, and marketing information. In cleaning process, the goal was to identify and address data quality issues, improve consistency, and document any limitations. 
Analysis process involves sales metrics, marketing channel metrics and regional metrics. </p>
<h2 id="index">Index</h2>
<ul>
<li>Data Cleaning </li>
<li>EDA </li>
<li>Dashboard </li>
<li>Recommendation</li>
</ul>
<h1 id="1-data-cleaning">1. Data Cleaning</h1>
<h2 id="cleaning-objectives">Cleaning Objectives</h2>
<ul>
<li>Fix formatting issues </li>
<li>Handle missing or inconsistent data </li>
<li>Standardize text and numerical values </li>
<li>Ensure proper data types for analysis </li>
</ul>
<h2 id="dataset">Dataset</h2>
<ol>
<li>Orders Table: ~20,000 rows of customer orders and transation details</li>
<li>Region Table: Country and region codes </li>
</ol>
<h2 id="issue-log-and-resolutions">Issue Log and Resolutions</h2>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/5a22eae3-aa54-4952-b6e6-f0a25fa1fd76/image.png" alt=""></p>
<h2 id="files">Files</h2>
<ul>
<li>Orders, Regions - Original Dataset(anonymized)</li>
<li>Orders_Cleaned, Regions_Cleaned - Cleaned version with corrections </li>
<li>Issue Log - Tabular format of th issues above </li>
<li>All can be found in the excel workbook <a href="https://docs.google.com/spreadsheets/d/14zln_by7jsUOQ2Ev4STZwCHaroGSNxoY/edit?usp=sharing&amp;ouid=117155271517985089386&amp;rtpof=true&amp;sd=true">here!</a></li>
</ul>
<h2 id="notes">Notes</h2>
<ul>
<li>Certain unsolved issues were left for business teams to validate due to external dependencies.</li>
<li>The cleanes dataset can now be confidently used for further analysis or dashboard reporting </li>
</ul>
<h2 id="contact">Contact</h2>
<p>For any questions or suggestions, feel free to open an issue or contact me. <a href="mailto:khc13258@gmail.com">khc13258@gmail.com</a></p>
<h1 id="2-eda">2. EDA</h1>
<h2 id="analysis-objectives">Analysis Objectives</h2>
<ul>
<li>How did total revenue dollars across all products perform during 2019-2022? Conduct some intial analysis to help product, marketing, and finance nanagers understand high-level trends.</li>
<li>Look at total usd_price across months, products, and averages. Identify minimums, and ant outliers</li>
<li>Can also slice by region and other demographics like marketing channel, or account creation method </li>
</ul>
<h2 id="kpi">KPI</h2>
<ul>
<li>Total Sales </li>
<li>Total Orders </li>
<li>AOV</li>
</ul>
<h2 id="data-overview-in-excel">Data overview in Excel</h2>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/cd4e32aa-2167-4b2f-a15b-b285913bee7c/image.png" alt=""></p>
<p>*<em>1. USD_Price/Month  *</em></p>
<ul>
<li>Total sales across 2019-2022 is -6.1&lt;, ranging in monthly sales from $80K to $500K (Finance Team)</li>
</ul>
<p>*<em>2. USD_price/Product *</em></p>
<ul>
<li>Top performing products is gaming monitor(almost $2M total sales), worst performing is gaming headset ($800). Might be missing data, would need to check with team (Finance team, Product team)</li>
</ul>
<p><strong>3. USD_price / Product Category</strong></p>
<ul>
<li>Worst performing category is headsets, which are reponsible for less than 2% of sales (Product team)</li>
</ul>
<p>*<em>4. USD_Price / Month *</em></p>
<ul>
<li>December 2020 saw a large spike in sales - seems like fall and winter months before the new year perform best. Maybe related to holiday season or promotion at that time. (Marketing team)</li>
</ul>
<p>*<em>5. USD_price / Product *</em></p>
<ul>
<li>All products had significant spike in sales 2020 onwards, following the general trend in total sales. This indicated a macro or company-wide impact(mostly likely due to COVID) (Finance team)</li>
</ul>
<p>*<em>6. USD_price / Product *</em></p>
<ul>
<li>Almost all products exhibited a splike in December 2020 - look into promotions or campaigns that happened at this time (Marketing team)</li>
</ul>
<h2 id="eda-in-tableau">EDA in Tableau</h2>
<h4 id="overall-usd">Overall USD</h4>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/9f84791a-f3a7-4900-94c0-de7fb4775440/image.png" alt=""></p>
<ul>
<li>Sales more than doubled in early 2020, with all-time high in Sep and Dec 2020. Sales dropped significantly in Feb 2021, nearing pre-COVID levels.</li>
</ul>
<h3 id="products-usd">Products USD</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/8ba43d08-e2e4-4f53-9fde-41a7c9d062ce/image.png" alt="">
<img src="https://velog.velcdn.com/images/hyuunk_/post/0c0a050a-854b-4d7a-868d-b0339bada59b/image.png" alt=""></p>
<ul>
<li>27in 4K Gaming monitor, Nintendo Switch, and Sony Playstation5 bundle are main drivers of the dip - all three exhibit the same plateauing behavior in 2020 and 2021.</li>
</ul>
<h3 id="marketing-usd">Marketing USD</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/be30f7c7-0de7-41e4-a3f1-93d90982d8eb/image.png" alt="">
Direct is the main driver of sales - all other channels pale in comparison </p>
<h3 id="region-usd">Region USD</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/399a17eb-a448-4ce4-a1a2-dff3048f5ade/image.png" alt=""></p>
<ul>
<li>All regions exhibit similar dips in 2020 and 2021, indicating global / macro trend </li>
</ul>
<h3 id="product-marketing">Product Marketing</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/35ffa8e1-6b91-4599-8185-de8c7d48881c/image.png" alt=""></p>
<ul>
<li>Whole all three top products(27in 4K Gaming Monitor, Nintendo Switch, Sony Playstation5 Bundle) exhibited dips in early 2020 and 2021, Sony playstation5 bundle had major dip in direct traffic in the beginning of 2021 compared to other products and marketing channels. Since then,  consistent decline for that products&#39;s direct channel sales. </li>
</ul>
<h3 id="product-region">Product Region</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/fda1ea7c-dfd2-4706-a4c8-acf96397014a/image.png" alt=""></p>
<ul>
<li>For Sony Playstation5 bundle, the drop is mostly contained to the NA region and direct traffic - may indicate a shift in trends or competitors there.</li>
</ul>
<h2 id="3-dashboard">3. Dashboard</h2>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/d799892d-4a79-415e-8277-d43d7bdfeacf/image.png" alt=""></p>
<h2 id="4-recommendation">4. Recommendation</h2>
<p>*<em>1. Finance / Analytics *</em></p>
<ul>
<li>(Directional) Investigate the main drivers of the dip in sales in the years following COVID - which products and regions?</li>
</ul>
<p>*<em>2. Marketing *</em></p>
<ul>
<li>(Actionable) Push more promotions and marketing strategies towards North American sales for gaming monitor and playstation, which have more transaction than other top2 products. </li>
<li>(Actionable) Double down on Winter sales by pushing promotions for Top3 products in October / November</li>
<li>(Actionable) Focus marketing strategy on social media channel, which shows potential uptick, in order to shift away from reliance on Direct traffic.</li>
</ul>
<p>*<em>3. Product *</em></p>
<ul>
<li>(Actionable) Remove headsets from the inventory as these make up less than 2% of overall revenue.</li>
</ul>
<p>*<em>4. Data *</em></p>
<ul>
<li>(Directional) Double check attribution to marketing channel is correct - direct semms oversized.</li>
</ul>
<p><strong>5. General</strong> </p>
<ul>
<li>(Contextual) Awareness than when people spend more time at home, there&#39;s greater willingness to spend on gaming products.</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Olist Dashboard Visualization (Tableau)]]></title>
            <link>https://velog.io/@hyuunk_/Olist-Dashboard-Visualization-Tableau</link>
            <guid>https://velog.io/@hyuunk_/Olist-Dashboard-Visualization-Tableau</guid>
            <pubDate>Mon, 09 Mar 2026 08:35:21 GMT</pubDate>
            <description><![CDATA[<h2 id="0-tableau-public-link">0. Tableau Public Link</h2>
<p><a href="https://public.tableau.com/views/1_17727892679710/HomeDashboard?:language=ko-KR&amp;publish=yes&amp;:sid=&amp;:display_count=n&amp;:origin=viz_share_link">링크텍스트</a></p>
<h2 id="1-date-schema">1. Date Schema</h2>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/f5a3017e-98e4-408f-9a58-61147cf43c57/image.png" alt=""></p>
<p>I have built multiple tables including customers, orders, order items, sellers, products, payments, and reviews of the Olist dataset.</p>
<blockquote>
<p>This dashboard helps identify logistics bottlenecks and understand delivery performance across different regions and routes.</p>
</blockquote>
<h2 id="2-sales-dashboard-overview">2. Sales Dashboard Overview</h2>
<blockquote>
<p><strong>The Sales dashboard</strong> provides an overview of key business metrics and sales performance within the Olist e-commerce platform.</p>
</blockquote>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/fb428f9a-c7e8-46f4-b98e-f031da503a85/image.png" alt=""></p>
<p>At the top of the dashboard, several key performance indicators (KPIs) summarize the overall business performance</p>
<p> <strong>The Sales KPI Line chart</strong> shows monthly sales performance trends, allowing users to track how sales volume changes over time.</p>
<p>On the right side, <strong>the Category Sales chart</strong> highlights the top-performing product categories based on order volume. Categories such as cama_mesa_banho, beleza_saude, and esporte_lazer generate the highest number of sales.</p>
<p><strong>The Brazil map visualization</strong> displays the average payment amount per customer by state, providing insights into regional purchasing behavior.</p>
<p> Finally, <strong>the Payment Type Breakdown chart</strong> shows the distribution of payment methods used by customers. Credit cards account for the majority of transactions, followed by boleto payments.</p>
<blockquote>
<p> Overall, this dashboard enables users to quickly understand sales performance, category trends, regional purchasing patterns, and payment behaviors within the Olist platform.</p>
</blockquote>
<h2 id="3-logistics-dashboard-overview">3. Logistics Dashboard Overview</h2>
<blockquote>
<p>T<strong>he Logistics dashboard</strong> focuses on analyzing delivery performance and identifying potential logistics bottlenecks across the Olist platform.</p>
</blockquote>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/f121b36a-72ec-4dc8-82fd-6bde4d66c970/image.png" alt=""></p>
<p>At the top of the dashboard, key indicators summarize the overall delivery performance</p>
<p><strong>The Average Late Rate by Categories chart</strong> highlights product categories with the highest delivery delay rates. Categories such as casa_conforto, moveis_colchao, and audio show relatively higher delay rates compared to others.</p>
<p><strong>The Seller-Customer Delivery Days chart</strong> analyzes the average delivery time by seller-to-customer routes. Routes such as SP → RJ and SP → RS show relatively longer delivery times, suggesting potential logistics challenges along these routes.</p>
<p><strong>The Average Delivery Days Map</strong> provides a geographic overview of delivery performance across Brazilian states, helping identify regions with slower delivery times.</p>
<p>Finally, <strong>the Orders Delayed Due to Logistics by State chart</strong> shows the number of orders delayed due to logistics issues in each state. States such as SP and RJ account for the largest share of logistics-related delivery delays.</p>
<blockquote>
<p>Overall, this dashboard helps identify delivery bottlenecks, high-delay product categories, and regional logistics performance within the Olist platform.</p>
</blockquote>
<h2 id="4-review-dashboard-overview">4. Review Dashboard Overview</h2>
<blockquote>
<p><strong>The Review dashboard</strong> analyzes customer feedback and satisfaction across the Olist platform.</p>
</blockquote>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/1c3ddc20-c045-4c33-973b-ccde447d0933/image.png" alt=""></p>
<p>At the top of the dashboard, several key indicators summarize the overall review statistic.</p>
<p><strong>The Total Orders by Reviews chart</strong> shows the distribution of review scores over time. Most orders receive high ratings (4–5 stars), indicating generally positive customer satisfaction across the platform.</p>
<p><strong>The Top 10 Highest Rated Categories chart</strong> highlights product categories with the highest average review scores. Categories such as cds_dvds_musicais and fashion_roupa_infanto receive consistently high customer ratings.</p>
<p>In contrast, <strong>the Top 10 Lowest Rated Categories chart</strong> identifies product categories with relatively lower review scores. These categories may require further investigation to understand potential issues related to product quality, delivery, or customer expectations.</p>
<blockquote>
<p>Overall, this dashboard helps identify customer satisfaction trends and category-level review performance within the Olist marketplace.</p>
</blockquote>
]]></description>
        </item>
        <item>
            <title><![CDATA[Payment / Logistics / Review EDA (Python) ]]></title>
            <link>https://velog.io/@hyuunk_/EDA-Python</link>
            <guid>https://velog.io/@hyuunk_/EDA-Python</guid>
            <pubDate>Mon, 09 Mar 2026 07:29:35 GMT</pubDate>
            <description><![CDATA[<blockquote>
<p>Further EDA was conducted based on the previous analysis of the Olist dataset.</p>
</blockquote>
<h2 id="1-payment-analysis">1. Payment Analysis</h2>
<h3 id="payment-types-distribution">Payment types distribution</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/6ef1cfbc-1e3e-43b5-bbf8-5780e6051dfb/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/dc75c42e-56a3-4af8-b992-311bbffd83ea/image.png" alt=""></p>
<p>Credit cards are the dominant payment method, accounting for about 74% of all transactions.</p>
<h3 id="payment-intallments--count">Payment Intallments / Count</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/da071bdc-efd2-462d-9239-ef10a0d28758/image.png" alt=""></p>
<p>This chart shows the distribution of payment installments.
We can see that most transactions are made with a single payment, indicating that many customers prefer paying in full rather than using installments. A Smaller portion of customers use 2 to 4 installments, while long installment plans are relatively rare.</p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/cfe148e5-1d66-4462-bc1a-b6ee102390fd/image.png" alt=""></p>
<p>Most orders (about 97–98%) are paid with a single payment.
Orders with multiple payments are very rare, indicating customers usually pay the full amount at once.</p>
<h2 id="2-customer-review-analysis">2. Customer Review Analysis</h2>
<h3 id="wordcloud">Wordcloud</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/da2914eb-e8aa-4cd5-adf7-e5a76b8a664e/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/b1d8f8c5-1c03-413b-99af-a567afe25fda/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/e5cc435a-249b-495e-a1ff-1afe70b44052/image.png" alt=""><img src="https://velog.velcdn.com/images/hyuunk_/post/886e6c44-14bc-44f6-964c-0b118d7bff3f/image.png" alt=""></p>
<blockquote>
<p>I created an <strong>average review score data mart</strong> by categories and its cnt on MySQl and then I&#39;ve loaded it on Python.</p>
</blockquote>
<pre><code>[MySQL]
use olistdata;

# Categories Review Datamart
SELECT
  COALESCE(t.product_category_name_english, p.product_category_name) AS category,
  COUNT(*) AS review_cnt,
  ROUND(AVG(r.review_score), 3) AS avg_review_score
FROM order_reviews r
JOIN orders_items oi
  ON r.order_id = oi.order_id
JOIN products p
  ON oi.product_id = p.product_id
LEFT JOIN product_category_name_translation t
  ON p.product_category_name = t.product_category_name
WHERE r.review_score IS NOT NULL
GROUP BY category
ORDER BY avg_review_score DESC;</code></pre><p><img src="https://velog.velcdn.com/images/hyuunk_/post/341e77e4-9cae-486e-b0ac-aa89533eb394/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/0713ddca-86c2-4bb0-9c60-5c394aa3e18d/image.png" alt=""></p>
<p> Categories like CDs/DVDs, children’s clothes, and books have the highest review scores (around 4.4–4.6), indicating strong customer satisfaction. In contrast, categories such as office furniture and electronics-related items have lower score (around 3.2~3.5), indicating more customer complaints or delivery/product issues.</p>
<h3 id="in-depth-analysis-of-three-problematic-sellers">In-depth Analysis of Three Problematic Sellers</h3>
<blockquote>
<p>Among the bottom five categories by average review score, office_furniture has the largest sample size (1,687). Therefore, I drilled down into this category(office_furniture) to identify specific sellers with consistently low review scores.</p>
</blockquote>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/ae2ba091-1112-4186-8ba4-7ab316c17d5d/image.png" alt=""></p>
<p>As a result, I identified three problematic sellers in the office_furniture category with the highest number of review scores of 2 or below.</p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/4b2d99d5-a958-4643-84d3-eb980a585c0b/image.png" alt=""></p>
<blockquote>
<p>These sellers show consistently high proportions of low ratings, suggesting potential issues with product quality, shipping, or customer service.</p>
</blockquote>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/235ea403-eefc-4970-97a2-602e8580fb7a/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/f828a6c7-fae3-47c3-9e16-ca6497090954/image.png" alt=""></p>
<p>Next, to analyze where the orders from these sellers were distributed, I filtered the dataset using the problematic seller_ids. The results show that about 60% of the orders from these sellers occurred in SP and RJ, where the sellers themselves are located.</p>
<blockquote>
<p><strong>Is delivery delay actually associated with review scores of 2 or below?</strong></p>
</blockquote>
<pre><code>[Python Code]

problem_orders[&#39;is_delayed&#39;] = (
    problem_orders[&#39;order_delivered_customer_date&#39;] &gt;
    problem_orders[&#39;order_estimated_delivery_date&#39;]
)

# 지연 vs 비지연 2점 이하 비율 비교 
problem_orders[&#39;is_low&#39;] = problem_orders[&#39;review_score&#39;] &lt;= 2

delay_compare = (
    problem_orders
    .groupby(&#39;is_delayed&#39;)[&#39;is_low&#39;]
    .mean()
    .reset_index()
)</code></pre><p><img src="https://velog.velcdn.com/images/hyuunk_/post/bbed04af-bf5f-49fa-823c-dde91171fb1c/image.png" alt=""></p>
<p>The direct cause of the low average review scores of these sellers is delivery delays.
However, the earlier hypothesis—that delivery delays were mainly caused by logistics infrastructure limitations—is not fully supported. Low ratings are not concentrated only in remote regions but also appear in SP and RJ, where the sellers themselves are located.
This suggests that the issue may stem not only from logistics infrastructure but also from seller-related problems, <strong>indicating that both seller factors and infrastructure limitations contribute to the delay.</strong></p>
<blockquote>
<p>[Next Step]
To more clearly determine whether the issue stems from seller-related problems or logistics infrastructure limitations, we can examine the difference between <strong>shipping_limit_date</strong> and <strong>order_delivered_carrier_date</strong>, which allows us to check whether the delay occurred at the seller stage.
If <strong>shipping_limit_date</strong> is later, the delay is more likely related to delivery or logistics issues. If order_delivered_carrier_date is later, the delay likely occurred at the seller’s fulfillment stage.</p>
</blockquote>
<pre><code>[Python Code]

problem_orders[&#39;order_delivered_carrier_date&#39;] = pd.to_datetime(problem_orders[&#39;order_delivered_carrier_date&#39;])
problem_orders[&#39;shipping_limit_date&#39;] = pd.to_datetime(problem_orders[&#39;shipping_limit_date&#39;])

problem_orders[&#39;seller_late&#39;] = (
    problem_orders[&#39;order_delivered_carrier_date&#39;] &gt;
    problem_orders[&#39;shipping_limit_date&#39;]
)

problem_orders.groupby(&#39;seller_id&#39;)[&#39;seller_late&#39;].mean().sort_values(ascending=False)</code></pre><p><img src="https://velog.velcdn.com/images/hyuunk_/post/074889d9-df7d-4f21-a11b-8c62dedadfb4/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/26c71643-c21f-44f8-95df-c938fea54de0/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/4c619d59-79eb-467f-b06b-537a5cf88b0d/image.png" alt=""></p>
<p>The seller shipment delay rate—defined as cases where carrier handover occurs after the <strong>shipping_limit_date—ranges from 16.7% to 30.3%</strong>, which is 1.8–3.3 times higher than the Olist overall average <strong>(9.2%)</strong>. This suggests that the primary driver of low review scores is more likely seller-side fulfillment and processing delays rather than external logistics infrastructure issues.</p>
<blockquote>
<p>However, the characteristics of the office_furniture category, where products are typically large and bulky, as well as the remaining limitations in logistics infrastructure, should not be overlooked.</p>
</blockquote>
<blockquote>
<p>Therefore, it would be useful to examine the correlation between delivery time or delay status and product characteristics, specifically product volume and weight, where volume is calculated using product length, height, and width.</p>
</blockquote>
<h3 id="correlation-between-delivery-time-and-product-characteristics">Correlation between delivery time and product characteristics</h3>
<blockquote>
<p>The <em><strong>base dataset</strong></em>  was created using the <strong>full dataset</strong> instead of focusing on a <strong>specific category(office_furniture)</strong>.</p>
</blockquote>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/ff21ace2-1ada-4508-b16f-2e14f4d920c3/image.png" alt="">
Orders with delivery delays took about 30 days on average, compared to around 10 days for non-delayed orders.</p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/2081a4b8-219a-47bb-836d-b625761ddc0f/image.png" alt="">
 Delayed orders are approximately 300g heavier on average than non-delayed orders.</p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/d1fe5c6f-f9d1-4bd4-a0b0-ef4ef6bd3464/image.png" alt="">
Delayed orders have about 10% larger product volume on average than non-delayed orders.</p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/7cb3d690-35bc-4855-8d41-5b116561d049/image.png" alt=""></p>
<p>However, the correlation analysis shows a coefficient of 0.016, indicating virtually no relationship between product volume and delivery delays.
![] (<a href="https://velog.velcdn.com/images/hyuunk_/post/cbfec604-5794-4940-95bf-b291f653e551/image.png">https://velog.velcdn.com/images/hyuunk_/post/cbfec604-5794-4940-95bf-b291f653e551/image.png</a>)</p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/66653c8a-630c-4656-be6c-655ec32b439b/image.png" alt=""><img src="https://velog.velcdn.com/images/hyuunk_/post/98d09955-f8f0-4441-82c3-8a5cdeaad22d/image.png" alt=""></p>
<p>The correlation between product weight and freight value is 0.61, indicating a relatively strong positive relationship. Similarly, product volume and freight value show a correlation of 0.59, suggesting a comparable positive association.</p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/763f581d-663a-4d52-a76f-319daa509af2/image.png" alt=""></p>
<p>Delayed orders took an average of about 30 days for delivery, while non-delayed orders took around 10 days on average.</p>
<blockquote>
<p><strong>[Conclusion]
Low review scores in the office_furniture category are primarily driven by seller-side shipment delays rather than logistics infrastructure constraints.</strong></p>
</blockquote>
<h2 id="3-logistics-infrastructure-analysis">3. Logistics Infrastructure Analysis</h2>
<h3 id="same-state-vs-differnet-state">Same State vs Differnet State</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/b05e376d-dacc-42a2-80e3-f02b4bc6340f/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/fd1e5516-cbd4-408a-a967-ad5b613206bb/image.png" alt=""></p>
<p>When the seller and customer are in the same state, the average delivery time is about 7 days, while it increases to about 14 days when they are in different states, indicating that cross-state deliveries take roughly twice as long.</p>
<blockquote>
<p>States with the highest average delivery days were identified for orders where the seller and customer are located in the same state.</p>
</blockquote>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/774e0246-9ffc-4b0b-9194-d42dc07a2461/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/6270f8d4-3409-45a7-8cbc-219ce258f219/image.png" alt=""></p>
<h3 id="delivery-delay-ratio-by-state">Delivery Delay Ratio by State</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/99d89e28-c143-4f39-8b35-94f483f1ec48/image.png" alt=""></p>
<blockquote>
<p>To gain insights into the causes of delivery delays, cases where <strong>shipping_limit_date</strong> is later than <strong>order_delivered_carrier_date</strong> were defined as <strong>seller_delay</strong>, while the opposite cases were defined as <strong>logistics_delay</strong>.</p>
</blockquote>
<pre><code>[Python Code]

# 배송소요일 계산 
base2[&#39;delivery_days&#39;] = (
    base2[&#39;order_delivered_customer_date&#39;] - 
    base2[&#39;order_purchase_timestamp&#39;]
).dt.days

# 최종 배송 지연 여부 8ㅑ![](https://velog.velcdn.com/images/hyuunk_/post/d3eff9d6-709c-4ad0-866c-bd8328aa0563/image.png)

base2[&#39;is_delayed&#39;] = (
    base2[&#39;order_delivered_customer_date&#39;] &gt; 
    base2[&#39;order_estimated_delivery_date&#39;]
)

# seller 출고 지연 여부 
base2[&#39;seller_late&#39;] = (
    base2[&#39;order_delivered_carrier_date&#39;] &gt;
    base2[&#39;shipping_limit_date&#39;]
)</code></pre><blockquote>
<p>Load only states with more than 200 total orders (order_id).</p>
</blockquote>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/68bda37a-8d43-4019-8d35-8393f321706a/image.png" alt=""></p>
<p>The analysis suggests key point that <strong>logistics infrastructure may play a larger role in delivery delays in RJ</strong>.
In this region, <strong>77% of delayed orders occur during the logistics stage</strong>, which is nearly 20 percentage points higher than the approximately 60% observed in other regions.</p>
<h3 id="average-lead-time-by-state-analysis">Average lead time by state analysis</h3>
<blockquote>
<p><strong>Seller processing lead time</strong> and <strong>logistics delivery lead time(carrier_delivery_days)</strong> were defined as follows.</p>
</blockquote>
<pre><code>[Python Code]

# Seller Processing Lead Time 
base2[&#39;seller_processing_days&#39;] = (
    base2[&#39;order_delivered_carrier_date&#39;] - 
    base2[&#39;order_purchase_timestamp&#39;]
).dt.days

# Carrier Delivery Days  
base2[&#39;carrier_delivery_days&#39;] = (
    base2[&#39;order_delivered_customer_date&#39;] - 
    base2[&#39;order_delivered_carrier_date&#39;]
).dt.days</code></pre><p><img src="https://velog.velcdn.com/images/hyuunk_/post/26f3eb36-4b18-48bf-8728-5b4cc5e03f8f/image.png" alt=""></p>
<p>Based on the average across all customer states, the seller processing lead time for delayed orders is 5.36 days, compared to 2.52 days for non-delayed orders, indicating it roughly doubles when delays occur.
However, the carrier delivery lead time increases much more significantly—from 7.44 days to 25.19 days, more than tripling.
This suggests that delivery delays are driven more by logistics delivery time than by seller processing time.</p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/30f582bc-679b-4475-a971-36f539b57471/image.png" alt=""></p>
<p>This boxplot analysis shows that carrier delivery time is significantly longer for delayed orders, with the median increasing from about 7 days to around 20 days. The distribution is also much wider, with several extreme delays observed, suggesting that <strong>logistics delivery time is a major driver of delivery delays.</strong></p>
<p>The median logistics delivery time for delayed orders is about three times higher than that of non-delayed orders, and in some cases extreme delivery delays exceeding 100 days were observed.</p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/b2b85fca-bf6d-4505-a40a-544ca4e00921/image.png" alt=""></p>
<h3 id="in-depth-analysis-of-rj-delivery-time">In-depth Analysis of RJ Delivery Time</h3>
<p>The distribution shows that non-delayed orders are concentrated within 3–10 days, while delayed orders are mainly distributed between 15–30 days with a long tail extending beyond 40 days. Considering the result above, this indicates that delivery delays are primarily driven by substantially longer carrier delivery times.</p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/b0c8743b-ad68-4b1c-a730-3712ca10fa7d/image.png" alt="">
A deeper analysis of logistics lead time in the RJ region reveals several key patterns.
Most deliveries are concentrated within 5–15 days, indicating relatively fast delivery lead times on average.
In fact, the majority of orders arrive well before the SLA threshold.
However, deliveries begin to exceed the SLA around the 95th percentile (P95).
The distribution also shows a long right tail, suggesting that a small number of extreme delays significantly contribute to the overall delay rate.
<strong>The long right tail in the distribution contributes significantly to the high delivery delay rate.</strong></p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/f94bf945-418a-48da-beef-534439d75fc5/image.png" alt=""></p>
<p>Comparing carrier delivery times between Brazil overall and the RJ region shows little difference in the median, but RJ exhibits longer delivery times in the upper percentiles, with delays of +6 days at P90, +9 days at P95, and +12 days at P99.</p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/e1d51db1-2a3e-4fd7-9a22-f1af6ba16e3c/image.png" alt=""></p>
<p>RJ experiences 2–3 times more extremely long delivery delays (30+ days) compared to the overall average across Brazilian states, indicating a structural tendency toward long-tail delivery delays.</p>
<p>RJ shows similar average delivery performance to the national level, but the severity of SLA exceedance and the frequency of extremely long deliveries are 2–3 times higher than the national average.
This suggests the presence of structural bottlenecks that occur under certain conditions.</p>
<blockquote>
<p>Therefore, it is necessary to drill down into the top 5% of RJ orders (those exceeding P95) to examine whether the delays are concentrated in specific product categories, time periods, or carriers.</p>
</blockquote>
<h3 id="top-5-tail-analysis-of-carrier-delivery-days-in-rj">Top 5% Tail Analysis of Carrier Delivery Days in RJ</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/8a841499-a623-4d1f-99fa-f00fff69169f/image.png" alt=""></p>
<p>Analysis of the top 5% delayed orders in the RJ region shows that the seller processing time remained within a normal range, averaging around 3 days, while the carrier delivery time increased abnormally to an average of 45 days.
This indicates that the primary cause of the delay lies in the logistics transportation stage rather than seller processing.</p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/cd512ac4-5e30-42c8-bf14-bafca1bbc514/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/340fa5ff-1c34-40c5-830a-e1b88aae6970/image.png" alt=""></p>
<p>76.7% of the top 5% delayed deliveries in RJ originate from sellers located in SP, which is about 10 percentage points higher than the overall share of SP-origin orders in RJ (66%).
This suggests the possibility of relatively more severe logistics bottlenecks along the SP→RJ delivery route.</p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/30298f67-31c6-4878-ab36-00231b675a6e/image.png" alt=""></p>
<h3 id="average-review-score-by-freight-cost-across-regions">Average Review Score by Freight Cost Across Regions</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/cf22a4bd-6b09-4203-8447-d2664038e7be/image.png" alt=""></p>
<p>Across all regions, a clear negative relationship is observed between delivery days and review scores, indicating that longer delivery times consistently lead to lower customer ratings.</p>
<h3 id="average-late-rate-by-category-top-15">Average Late Rate by Category (Top 15)</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/a8d039f4-3978-48b9-9075-a90c2e0e5ae6/image.png" alt=""></p>
<p>Several categories show late delivery rates above the overall average (~10%), with audio, fashion underwear, and Christmas items recording the highest delay rates.</p>
<h3 id="monthly-late-rate--order-volume">Monthly Late Rate &amp; Order Volume</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/07b34acd-4ea2-4b75-ab01-6b43d5732b62/image.png" alt=""></p>
<p>Late delivery rates fluctuate over time, with noticeable spikes during periods of high order volume, suggesting demand surges may contribute to delivery delays.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[RFM Data Mart Creation & Analysis (MySQL & Tableau)]]></title>
            <link>https://velog.io/@hyuunk_/RFM-Data-Mart-Creation-Analysis-MySQL-Tableau</link>
            <guid>https://velog.io/@hyuunk_/RFM-Data-Mart-Creation-Analysis-MySQL-Tableau</guid>
            <pubDate>Sat, 07 Mar 2026 08:15:34 GMT</pubDate>
            <description><![CDATA[<h2 id="1-creating-rfm-data-mart-mysql">1. Creating RFM Data Mart (MySQL)</h2>
<blockquote>
<p>Creating RFM Data Mart on MySQL 
Recency, Frequency, Monetary, avg_delivery_days, avg_review_score, delay_rate are produced as columns.</p>
</blockquote>
<pre><code>USE olistdata;

WITH pay AS (
  SELECT
    order_id,
    SUM(payment_value) AS order_payment
  FROM order_payments
  GROUP BY order_id
),

delivery AS (
  SELECT
    order_id,
    order_delivered_customer_date,
    order_estimated_delivery_date,
    DATEDIFF(order_delivered_customer_date, order_purchase_timestamp) AS delivery_days
  FROM orders
  WHERE order_status = &#39;delivered&#39;
),

base AS (
  SELECT
    c.customer_unique_id,
    o.order_id,
    o.order_purchase_timestamp,
    pay.order_payment,
    d.delivery_days,
    d.order_delivered_customer_date,
    d.order_estimated_delivery_date,
    r.review_score
  FROM orders o
  JOIN customers c
    ON o.customer_id = c.customer_id
  JOIN pay
    ON o.order_id = pay.order_id
  JOIN delivery d
    ON o.order_id = d.order_id
  LEFT JOIN order_reviews r
    ON o.order_id = r.order_id
  WHERE o.order_status = &#39;delivered&#39;
),

snapshot AS (
  SELECT MAX(order_purchase_timestamp) AS snap_dt
  FROM orders
  WHERE order_status = &#39;delivered&#39;
)

SELECT
  b.customer_unique_id,

  -- RFM
  DATEDIFF(MAX(s.snap_dt), MAX(b.order_purchase_timestamp)) AS recency,
  COUNT(DISTINCT b.order_id) AS frequency,
  SUM(b.order_payment) AS monetary,

  -- 배송 &amp; 리뷰
  AVG(b.delivery_days) AS avg_delivery_days,
  AVG(b.review_score) AS avg_review,

  -- 지연율
  AVG(
    CASE
      WHEN b.order_delivered_customer_date &gt; b.order_estimated_delivery_date THEN 1
      ELSE 0
    END
  ) AS delay_rate

FROM base b
CROSS JOIN snapshot s
GROUP BY b.customer_unique_id;</code></pre><p><img src="https://velog.velcdn.com/images/hyuunk_/post/3ae3af11-7b80-4633-af76-8bb19245b6d9/image.png" alt="">
<img src="https://velog.velcdn.com/images/hyuunk_/post/c806c634-0ced-4652-9df6-e72a26f8ae59/image.png" alt="">
<img src="https://velog.velcdn.com/images/hyuunk_/post/e279e510-364a-4450-8b7f-b6d3d913954f/image.png" alt=""></p>
<blockquote>
<p>Created Data Mart are saved as csv file in computer, and imported into Tableau for further analysis of RFM Data </p>
</blockquote>
<h2 id="2-score-caculation--visualization-tableau">2. Score Caculation &amp; Visualization (Tableau)</h2>
<h3 id="recency-score">Recency Score</h3>
<pre><code>The first step in Tableau was to create Recency Score and vizualize distribution with bar chart.</code></pre><p><img src="https://velog.velcdn.com/images/hyuunk_/post/f7f0aa5d-c16e-410d-bf88-2cb01aecf0e4/image.png" alt=""></p>
<blockquote>
<p>Recency was defined as the number of days since the customer&#39;s most recent purchase, calculated from the last date available in the dateset.
I assigned higher Recency score to customers with lower Recency values. </p>
</blockquote>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/34379282-d937-43d1-a4ff-89cf6f23b22f/image.png" alt=""></p>
<blockquote>
<p><strong>Inactive customer concentration</strong>
This distribution shows a strong concentration of customers with low Recency scores. About 31.56% of customers fall into Recency score 1, indicating that a large portion of customers have not made a purchase recently. The remaining scores are relatively evenly distributed, suggesting that active customers are spread across different recency levels.</p>
</blockquote>
<h3 id="frequency-score">Frequency Score</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/e3a153f6-3687-4638-9cea-5b534c1f4419/image.png" alt=""></p>
<blockquote>
<p>Frequency Score are calcuated like this. The more customers are involved in purchase the higher score they&#39;re assigned</p>
</blockquote>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/f1baf6f3-a4b6-415d-8c63-521558a0475f/image.png" alt=""></p>
<blockquote>
<p>97% of customers are involved in 1 purchase, only 3% of customers purchased over 2 on the other hand. </p>
</blockquote>
<h3 id="monetary-score">Monetary Score</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/4d6d1290-904d-44db-88d5-80a8d59abade/image.png" alt=""></p>
<blockquote>
<p>Monetary scores were assigned based on customer spending levels, with higher spending customers receiving higher scores.</p>
</blockquote>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/3da42151-8bd7-481e-8519-1a68e0eb2b57/image.png" alt=""></p>
<blockquote>
<p><strong>Long Tail Distribution</strong>
Most customers belong to the lower Monetary score groups, indicating that the majority of customers spend relatively small amounts, while only a small portion of customers are high-value spenders. </p>
</blockquote>
<h3 id="rfm-group-score">RFM Group Score</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/b8a203db-afa8-4dc9-a181-3430fe788fe9/image.png" alt=""></p>
<blockquote>
<p>Added three separate score, RFM Score has been calulated </p>
</blockquote>
<h2 id="3-rfm-analysis-focusing-on-relationship-between-average-arrived-days-and-average-review-score">3. RFM Analysis (Focusing on relationship between average arrived days and average review score)</h2>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/4f8d49ee-abcc-4d49-a53f-9266984dab1e/image.png" alt=""></p>
<blockquote>
<p>Most customers fall into the mid-range RFM scores, particularly between 10 and 13.
Customer satisfaction remains consistently high, with average review scores above 4 across most segments.
Delivery times are generally stable around 10–12 days, although a few segments show longer delivery periods.</p>
</blockquote>
<h2 id="4-arrived-day--reivew-score-analysis">4. Arrived Day &amp; Reivew Score Analysis</h2>
<h3 id="average-arrived-days-by-review-score-15">Average arrived days by review score (1~5)</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/251f2e58-f28e-4fe8-bcb9-9d92fe8e9b53/image.png" alt=""></p>
<blockquote>
<p>Clear negative relationship between delivery time and customer satisfaction are observed. As delivery speed improves, the average review score increases.</p>
</blockquote>
]]></description>
        </item>
        <item>
            <title><![CDATA[EDA (데이터 마트 생성 및 일별판매량 확인) ]]></title>
            <link>https://velog.io/@hyuunk_/EDA-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EB%A7%88%ED%8A%B8-%EC%83%9D%EC%84%B1-%EB%B0%8F-%EC%9D%BC%EB%B3%84%ED%8C%90%EB%A7%A4%EB%9F%89-%ED%99%95%EC%9D%B8</link>
            <guid>https://velog.io/@hyuunk_/EDA-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EB%A7%88%ED%8A%B8-%EC%83%9D%EC%84%B1-%EB%B0%8F-%EC%9D%BC%EB%B3%84%ED%8C%90%EB%A7%A4%EB%9F%89-%ED%99%95%EC%9D%B8</guid>
            <pubDate>Mon, 12 Jan 2026 09:29:14 GMT</pubDate>
            <description><![CDATA[<h3 id="1-데이터마트-생성">1. 데이터마트 생성</h3>
<blockquote>
<ul>
<li>EDA를 시작하기에 앞서, 간편함을 위해 데이터마트를 생성하였다. </li>
</ul>
</blockquote>
<ul>
<li>컬럼으로는 주문ID/고객ID/고객 주/배송소요일/지연일수/리뷰스코어/결측치를 제외한 order_purchase_timestamp, order_delivered_customer_date, order_estimated_Date가 9개의 컬럼이 포함되었다. </li>
</ul>
<h3 id="2-일별-주문량-계산">2. 일별 주문량 계산</h3>
<pre><code>SELECT
  DATE(order_purchase_timestamp) AS order_date,
  COUNT(*) AS daily_orders
FROM orders
WHERE order_status = &#39;delivered&#39;
  AND order_purchase_timestamp IS NOT NULL
GROUP BY order_date
ORDER BY order_date;</code></pre><ul>
<li>MySQL에서 전체 기간 주문량을 일별로 나타낸 CSV 파일을 파이썬에서 시각화하였다. 
<img src="https://velog.velcdn.com/images/hyuunk_/post/fb60d2c9-fab2-4e8c-aa6b-90db9a44caaf/image.png" alt=""><blockquote>
<ul>
<li>가장 눈에 띄는 특징으로는 2017-11-24일 주문량에서 매우 큰 이상치가 발견된다는 점이다. </li>
</ul>
</blockquote>
</li>
<li>리서치 결과, 해당 일자는 블랙프라이데이로 확인되었다. </li>
<li>따라서, 블랙프라이데이에 대한 세부적인 EDA를 진행해보았다. </li>
</ul>
<h3 id="3-블랙프라이데이-카테고리별-판매량">3. 블랙프라이데이 카테고리별 판매량</h3>
<pre><code>select 
p.product_category_name as category, 
count(*) as items_sold 
from orders o
join orders_items oi 
on o.order_id = oi.order_id 
join products p 
on oi.product_id = p.product_id
where o.order_status = &#39;delivered&#39; 
and date(o.order_purchase_timestamp) = &#39;2017-11-24&#39;
group by p.product_category_name
order by items_sold desc;</code></pre><ul>
<li>마찬가지로 SQL에서 CSV파일을 생성한 뒤 Top10 판매 카테고리를 파이썬에서 시각화하였다.</li>
</ul>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/bb3d2d80-3da0-4584-a0ea-f718d3e2e13c/image.png" alt=""></p>
<blockquote>
<ul>
<li>그 결과, 침구/식탁/욕실용품이 압도적인 판매량을 보였고, 가구/인테리어, 공구/정원용품, 스포츠/레저, 뷰티/헬스, 휴대폰/통신기기, 시계/선물용품, 장난감, 컴퓨터/IT/악세사리, 향수가 뒤를 이었다. </li>
</ul>
</blockquote>
<ul>
<li>1-2개 카테고리만 압도적으로 나오는 관계로 로그 변환를 해주어 재확인하였다.</li>
</ul>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/e16f3d0e-e584-49a9-8e59-50575d7be28d/image.png" alt=""></p>
<h3 id="4-블랙프라이데이-카테고리별-aov">4. 블랙프라이데이 카테고리별 AOV</h3>
<pre><code>SELECT
  category,
  AVG(order_value) AS avg_order_value
FROM (
  SELECT
    o.order_id,
    p.product_category_name AS category,
    SUM(oi.price + oi.freight_value) AS order_value
  FROM orders o
  JOIN orders_items oi
    ON o.order_id = oi.order_id
  JOIN products p
    ON oi.product_id = p.product_id
  WHERE o.order_status = &#39;delivered&#39;
    AND DATE(o.order_purchase_timestamp) = &#39;2017-11-24&#39;
  GROUP BY o.order_id, p.product_category_name
) t
GROUP BY category
ORDER BY avg_order_value DESC;</code></pre><ul>
<li>판매량에 이어 Top10 AOV(판매금액)을 확인하였다. </li>
</ul>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/57b43818-39d5-4659-a784-88d144181fc8/image.png" alt=""></p>
<blockquote>
<ul>
<li>농업/산업/상업용 장비가 압도적인 판매량을 기록하였다. </li>
</ul>
</blockquote>
<ul>
<li>대형가전, 가전제품, 건설/공구/조명, 시계/선물용품, 악기, 카메라/영상장비, 거실가구, 사무용가구, 아이디어/기획상품이 뒤를 이었다. </li>
</ul>
<h3 id="5-블랙프라이데이-할부율-분석">5. 블랙프라이데이 할부율 분석</h3>
<ul>
<li>가설: 세일 이벤트 기간이 고객들의 고가 물품 소비를 촉진하여 할부율이 높아졌을 것이다. <pre><code>SELECT
period,
COUNT(*) AS orders,
ROUND(AVG(payment_installments), 2) AS avg_installments,
ROUND(SUM(CASE WHEN payment_installments &gt; 1 THEN 1 ELSE 0 END) / COUNT(*) * 100, 2) AS installment_rate
FROM (
SELECT
  o.order_id,
  CASE
    WHEN DATE(o.order_purchase_timestamp) BETWEEN &#39;2017-11-10&#39; AND &#39;2017-11-23&#39; THEN &#39;Before&#39;
    WHEN DATE(o.order_purchase_timestamp) = &#39;2017-11-24&#39; THEN &#39;Black Friday&#39;
    WHEN DATE(o.order_purchase_timestamp) BETWEEN &#39;2017-11-25&#39; AND &#39;2017-12-08&#39; THEN &#39;After&#39;
  END AS period,
  op.payment_installments
FROM orders o
JOIN order_payments op ON o.order_id = op.order_id
WHERE o.order_status = &#39;delivered&#39;
  AND DATE(o.order_purchase_timestamp) BETWEEN &#39;2017-11-10&#39; AND &#39;2017-12-08&#39;
) t
GROUP BY period;</code></pre></li>
<li>비교 기간은 블랙프라이데이 전후 2주로 잡았다. </li>
<li>전체 기간으로 잡고 비교한다면 블랙프라이데이가 아닌 계절 요인, olist 서비스 확대 요인 등 불순물이 추가되어 목적이 흐려질 수 있다는 판단으로 전후 비교 기간을 2주로 설정하였다. </li>
</ul>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/8bb0904e-aefd-43ef-8e78-59ee29294c96/image.png" alt=""></p>
<blockquote>
<ul>
<li>블랙프라이데이의 평균 할부 개월 수는 3.51개월로, 블랙프라이데이 전 2주와 후 2주와 비교했을 때 가장 많았다. </li>
</ul>
</blockquote>
<ul>
<li>할부율 또한 57.95%로 가장 높음을 확인할 수 있다. </li>
<li>이는 블랙프라이데이가 고가 상품 구매를 유도하거나, 평소 미루던 소비를 앞당긴 이벤트 였음을 시사하고 있다. </li>
</ul>
<h3 id="6-블랙프라이데이-배송-지연율-분석">6. 블랙프라이데이 배송 지연율 분석</h3>
<ul>
<li><p>가설: 주문량이 폭증한만큼, 배송 지연율 또한 높아졌을 것이다. 
SELECT
period,
round(AVG(is_delayed)*100,2) AS delay_rate
FROM (
SELECT
  CASE</p>
<pre><code>WHEN DATE(order_purchase_timestamp) BETWEEN &#39;2017-11-10&#39; AND &#39;2017-11-23&#39; THEN &#39;Before&#39;
WHEN DATE(order_purchase_timestamp) = &#39;2017-11-24&#39; THEN &#39;Black Friday&#39;
WHEN DATE(order_purchase_timestamp) BETWEEN &#39;2017-11-25&#39; AND &#39;2017-12-08&#39; THEN &#39;After&#39;</code></pre><p>  END AS period,
  is_delayed
FROM delivery_review_dm
WHERE DATE(order_purchase_timestamp) BETWEEN &#39;2017-11-10&#39; AND &#39;2017-12-08&#39;
) t
GROUP BY period;</p>
</li>
<li><p>앞서 생성한 배송-리뷰 데이터마트(delivery_review_dm)를 활용하였다. </p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/43f3b66e-bcb4-48cc-b1ee-f42a2a6d959f/image.png" alt=""></p>
<blockquote>
<ul>
<li>블랙프라이데이의 배송지연율이 평소(Before)이 배해 거의 1.7배 가까이 상승했음을 알 수 있다. </li>
</ul>
</blockquote>
<ul>
<li>즉, 블랙프라이데이 기간 중 주문 급증으로 인해 배송 시스템의 처리 한계를 초과했고, 그 영향이 이벤트 이후까지 이어졌음을 알 수 있다. </li>
</ul>
<ul>
<li>종합적으로, 고객은 블랙프라이데이에 더 많은, 더 큰 금액을, 더 쉽게 결제했지만, 운영 측면(배송)은 이를 따라가지 못했음을 알 수 있다. </li>
<li>이는 앞서 분석한 지역별 배송 지연 인사이트 발굴과 연결지어 해석할 수 있다. </li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[브라질 지역별 배송 지연 인사이트 발굴 ]]></title>
            <link>https://velog.io/@hyuunk_/%EB%B8%8C%EB%9D%BC%EC%A7%88-%EC%A7%80%EC%97%AD%EB%B3%84-%EB%B0%B0%EC%86%A1-%EC%A7%80%EC%97%B0-%EC%9D%B8%EC%82%AC%EC%9D%B4%ED%8A%B8-%EB%B0%9C%EA%B5%B4</link>
            <guid>https://velog.io/@hyuunk_/%EB%B8%8C%EB%9D%BC%EC%A7%88-%EC%A7%80%EC%97%AD%EB%B3%84-%EB%B0%B0%EC%86%A1-%EC%A7%80%EC%97%B0-%EC%9D%B8%EC%82%AC%EC%9D%B4%ED%8A%B8-%EB%B0%9C%EA%B5%B4</guid>
            <pubDate>Mon, 05 Jan 2026 08:03:57 GMT</pubDate>
            <description><![CDATA[<h3 id="1-sellers-분포">1. Sellers 분포</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/abfd8373-a0b1-4344-a85d-fad67edd7b6b/image.png" alt=""></p>
<blockquote>
<p>[특징 및 인사이트]</p>
</blockquote>
<ul>
<li>zip 기준으로 &#39;대표위경도&#39;를 만들어서 같은 동네에 사는 각각의 주문들은 하나의 점으로 만들어 주었다.(평균) </li>
<li>Sellers는 대도시 밀집 지역인 남동부 지역에 밀집되어 있다. </li>
</ul>
<pre><code># geolocation을 zip 기준으로 &#39;대표위경도&#39; 만들기 (평균)
geo_agg = ( 
    geo.groupby(&#39;geolocation_zip_code_prefix&#39;, as_index=False)
    .agg({&#39;geolocation_lat&#39; : &#39;mean&#39;, &#39;geolocation_lng&#39; : &#39;mean&#39;})
)</code></pre><pre><code># sellers에 위경도 붙이기 
sellers_map_df = sellers[[&#39;seller_id&#39;, &#39;seller_zip_code_prefix&#39;, &#39;seller_city&#39;, &#39;seller_state&#39;]].merge(
    geo_agg,
    left_on=&#39;seller_zip_code_prefix&#39;, 
    right_on=&#39;geolocation_zip_code_prefix&#39;,
    how=&#39;left&#39;
)</code></pre><h3 id="2-customers-분포">2. Customers 분포</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/d1bf9066-5d41-4053-910e-693577dbc2c3/image.png" alt=""></p>
<blockquote>
<p>[인사이트]</p>
</blockquote>
<ul>
<li>Olist의 주 고객들은 남동부와 북동부 지역에 밀집되어 있다. </li>
<li>Customer도 Seller와 마찬가지로 customer 테이블에서 아이디/zip 코드/도시/주 칼럼을 geo_agg와 merge하여 customers_map_df 데이터프레임을 생성한 뒤 지도를 그린다. </li>
</ul>
<h3 id="3-지연-건수">3. 지연 건수</h3>
<pre><code># 배송이 estimated time보다 delay된 지역만 뽑아서 보기 

df_delay = orders[[
    &#39;order_id&#39;, &#39;customer_id&#39;,
    &#39;order_purchase_timestamp&#39;,
    &#39;order_delivered_customer_date&#39;, 
    &#39;order_estimated_delivery_date&#39;
]].copy()

# daytime 변환 
for c in [&#39;order_purchase_timestamp&#39;, &#39;order_delivered_customer_date&#39;, &#39;order_estimated_delivery_date&#39;]:
    df_delay[c] = pd.to_datetime(df_delay[c], errors=&#39;coerce&#39;)

# delivered &amp; estimated 있는 것만 
df_delay = df_delay.dropna(subset=[&#39;order_delivered_customer_date&#39;,&#39;order_estimated_delivery_date&#39;])

# 지연일수
df_delay[&#39;delay_days&#39;] = (df_delay[&#39;order_delivered_customer_date&#39;] - df_delay[&#39;order_estimated_delivery_date&#39;]).dt.days

# delay된 주문만
df_delay = df_delay[df_delay[&#39;delay_days&#39;] &gt; 0].copy()

df_delay[[&#39;order_id&#39;,&#39;delay_days&#39;]].head()</code></pre><blockquote>
<p>[output]</p>
</blockquote>
<ul>
<li>실제 배송이 지연된 주문과 그 지연 일수를 정리한 분석용 데이터셋 도출</li>
</ul>
<pre><code>df_delay = df_delay.merge(
    cus[[&#39;customer_id&#39;,&#39;customer_zip_code_prefix&#39;,&#39;customer_city&#39;,&#39;customer_state&#39;]],
    on=&#39;customer_id&#39;,
    how=&#39;left&#39;
)

df_delay = df_delay.merge(
    geo_agg[[&#39;geolocation_zip_code_prefix&#39;,&#39;geolocation_lat&#39;,&#39;geolocation_lng&#39;]],
    left_on=&#39;customer_zip_code_prefix&#39;,
    right_on=&#39;geolocation_zip_code_prefix&#39;,
    how=&#39;left&#39;
)

df_delay = df_delay.dropna(subset=[&#39;geolocation_lat&#39;,&#39;geolocation_lng&#39;]).copy()
df_delay.shape</code></pre><p><img src="https://velog.velcdn.com/images/hyuunk_/post/83dda402-3f75-463b-982b-0dfddfbb766a/image.png" alt=""></p>
<blockquote>
<p>[인사이트]</p>
</blockquote>
<ul>
<li>위 지도는 지연 주문의 절대적인 개수이다.</li>
<li>대도시 밀집 지역인 남동부 지역에 전체 주문량이 많은만큼 확실히 지연 주문 건수도 많은 것을 알 수 있다. </li>
<li>초록색: 지연 주문 한 자릿 수 / 노란색: 지연 주문 두 자릿 수 / 주황색: 지연 주문 3-4자릿수</li>
<li>하지만 위 지도는 절대적인 지연 주문량만 나타낼 뿐, 지연율 및 배송 속도 평균 등 성능을 비교하기에는 무리가 있다. </li>
<li>그래서 초록색 지역은 배송이 빠르고 주황색 지역은 배송이 느리다로 해석하면 안된다. </li>
</ul>
<h3 id="4-배송-지표-만들기">4. 배송 지표 만들기</h3>
<pre><code># 배송 + 고객
base = (orders[[&#39;order_id&#39;,&#39;customer_id&#39;,
                &#39;order_purchase_timestamp&#39;,
                &#39;order_delivered_customer_date&#39;,
                &#39;order_estimated_delivery_date&#39;]]
        .merge(cus[[&#39;customer_id&#39;,&#39;customer_state&#39;]],
               on=&#39;customer_id&#39;, how=&#39;left&#39;))

# datetime 변환
for c in [&#39;order_purchase_timestamp&#39;,
          &#39;order_delivered_customer_date&#39;,
          &#39;order_estimated_delivery_date&#39;]:
    base[c] = pd.to_datetime(base[c], errors=&#39;coerce&#39;)

base = base.dropna(subset=[
    &#39;order_purchase_timestamp&#39;,
    &#39;order_delivered_customer_date&#39;,
    &#39;order_estimated_delivery_date&#39;
])

# 배송 지표
base[&#39;shipping_days&#39;] = (
    base[&#39;order_delivered_customer_date&#39;]
    - base[&#39;order_purchase_timestamp&#39;]
).dt.days

base[&#39;delay_days&#39;] = (
    base[&#39;order_delivered_customer_date&#39;]
    - base[&#39;order_estimated_delivery_date&#39;]
).dt.days

base[&#39;is_delayed&#39;] = (base[&#39;delay_days&#39;] &gt; 0).astype(int)

# 리뷰 결합
base = base.merge(
    ordrev[[&#39;order_id&#39;,&#39;review_score&#39;]],
    on=&#39;order_id&#39;,
    how=&#39;left&#39;
)
</code></pre><pre><code>state_metrics = (
    base.groupby(&#39;customer_state&#39;)
    .agg(
        주문건수=(&#39;order_id&#39;,&#39;count&#39;),
        배송지연율=(&#39;is_delayed&#39;,&#39;mean&#39;),
        평균배송소요일=(&#39;shipping_days&#39;,&#39;mean&#39;),
        평균지연일수=(&#39;delay_days&#39;, lambda x: x[x&gt;0].mean()),
        평균리뷰평점=(&#39;review_score&#39;,&#39;mean&#39;)
    )
    .reset_index()
    .sort_values(&#39;배송지연율&#39;, ascending=False)
)

state_metrics</code></pre><p><img src="https://velog.velcdn.com/images/hyuunk_/post/c7d975e2-de25-4686-ada0-75f82e7ad018/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/f5a2f865-98cf-4330-aa07-f7ada04fd21b/image.png" alt="">
<img src="https://velog.velcdn.com/images/hyuunk_/post/684e8dcb-1443-404f-96f9-b5bfcafde672/image.png" alt=""></p>
<h3 id="5-지연율-지도-시각화">5. 지연율 지도 시각화</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/65a6190a-708c-494e-9ecc-8fc2a6b31603/image.png" alt=""></p>
<blockquote>
<p>[분석 결과 요약 및 인사이트 정리]</p>
</blockquote>
<ul>
<li>분석 결과 요약: 브라질 북부/북동부 지역에서 배송 지연율이 상대적으로 높게 나타난 것을 볼 수 있다. 반면, 남동부 지역은 낮은 지연율을 보였다. 이는 판매자/고객 밀집 지역과 물류 인프라 수준 차이에서 기인한 구조적 문제로 해석할 수 있다. </li>
</ul>
<blockquote>
<ul>
<li>결과 해석: </li>
</ul>
</blockquote>
<h4 id="1-북부북동부-지역-높은-배송-지연율">1) 북부/북동부 지역: 높은 배송 지연율</h4>
<ul>
<li>Amazonas, Para, Maranhao, 일부 Nordeste 지역에서 지연율이 가장 높다. </li>
<li>주문 절대량은 남동부에 비해 많지 않지만, 주문 대비 지연 비율이 매우 높다 </li>
<li>장거리 배송에 필요한 물류 허브 부족, 도로/운송 인프라 열세, 배송 리드타임 변동성 등으로 해석할 수 있다. <h4 id="2-중부-지역-중간-수준의-지연율">2) 중부 지역: 중간 수준의 지연율</h4>
</li>
<li>Mato Grosso, Goias 등 </li>
<li>주문량과 지연율 모두 중간 수준 </li>
<li>물류 접근성은 일부 확보했으나 남동부 대비 여전히 거리 연결성의 한계가 존재한다. <h4 id="3-남동부남부-낮은-배송-지연율">3) 남동부/남부: 낮은 배송 지연율</h4>
</li>
<li>SP, Rio de Janerio, Parana, Minas Gerais 등 </li>
<li>대도시 밀집 지역으로, 주문량이 매우 많지만 지연율은 오히려 낮다. </li>
<li>Sellers 분포 지도에서 알 수 있듯이, 판매자 밀집 지역이라 평균 배송 거리가 짧다. 물류센터 및 허브가 집중 되어있는 곳이기 때문이다. </li>
</ul>
<blockquote>
<ul>
<li>인사이트 </li>
</ul>
</blockquote>
<p>1) 지연율은 주문량과 반드시 비례하지 않는다. 오히려 물류 인프라와 지역 접근성의 문제로 해석할 수있다. 
2) 북부/북동부: SLA 차등 적용 필요, 물류 파트너 확대 or 지역 허브 전략 검토 필요 
3) 남동부: 대규모 주문 처리에도 안정적인 현재 구조 </p>
<h3 id="6-분포비교시각적-검증">6. 분포비교(시각적 검증)</h3>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/9eb3af0f-bfbb-4ec3-94d0-e2f8ba7b9a5c/image.png" alt=""></p>
<blockquote>
<p>[인사이트]
<img src="https://velog.velcdn.com/images/hyuunk_/post/7c065296-0844-404a-ae46-9b2086f11f5e/image.png" alt=""></p>
</blockquote>
]]></description>
        </item>
        <item>
            <title><![CDATA[포트폴리오 구성 ]]></title>
            <link>https://velog.io/@hyuunk_/%ED%8F%AC%ED%8A%B8%ED%8F%B4%EB%A6%AC%EC%98%A4-%EA%B5%AC%EC%84%B1-wv2f6noe</link>
            <guid>https://velog.io/@hyuunk_/%ED%8F%AC%ED%8A%B8%ED%8F%B4%EB%A6%AC%EC%98%A4-%EA%B5%AC%EC%84%B1-wv2f6noe</guid>
            <pubDate>Mon, 29 Dec 2025 05:47:20 GMT</pubDate>
            <description><![CDATA[<blockquote>
<p>브라질 olist 데이터분석 프로젝트 포트폴리오 구성 </p>
</blockquote>
<h3 id="1-개요-및-데이터-소개">1. 개요 및 데이터 소개</h3>
<h3 id="2-rfm-세그먼트-배송-리뷰-데이터마트-생성sql">2. RFM 세그먼트, 배송, 리뷰 데이터마트 생성(SQL)</h3>
<ul>
<li>RFM 세그먼트별로 평균 배송일자 및 리뷰점수 상관관계 분석을 위한 데이터마트 생성 </li>
<li>주문 데이터 기반 고객 RFM 지표 산출 </li>
<li>SQL을 활용한 고객 세그먼트 분류 </li>
</ul>
<h3 id="3-rfm-고객-세그먼트별-특징-시각화tableau">3. RFM 고객 세그먼트별 특징 시각화(Tableau)</h3>
<h3 id="4-sql-카테고리-관련-데이터마트-생성-sql">4. SQL 카테고리 관련 데이터마트 생성 (SQL)</h3>
<h3 id="5-rfm-고객-세그먼트별-특징-시각화추가분석">5. RFM 고객 세그먼트별 특징 시각화(추가분석)</h3>
<h3 id="6-지역별-배송-지연-인사이트-발굴python">6. 지역별 배송 지연 인사이트 발굴(Python)</h3>
<h3 id="7-프로젝트를-마치며">7. 프로젝트를 마치며...</h3>
]]></description>
        </item>
        <item>
            <title><![CDATA[2025-12-23 (월)]]></title>
            <link>https://velog.io/@hyuunk_/2025-12-23-%EC%9B%94</link>
            <guid>https://velog.io/@hyuunk_/2025-12-23-%EC%9B%94</guid>
            <pubDate>Wed, 24 Dec 2025 01:31:26 GMT</pubDate>
            <description><![CDATA[<h2 id="kpi-선정을-위한-가설-설정">KPI 선정을 위한 가설 설정</h2>
<blockquote>
<ul>
<li>이커머스 주문은 평일보다는 시간적으로 여유가 있는 주말에 더 많이 발생할 것이다.(기각)</li>
</ul>
</blockquote>
<ul>
<li>배송 소요일수가 짧을수록 고객 리뷰 점수가 높을 것이다. (채택)</li>
<li>대도시 고객일수록 상품 가격 대비 실제 결제 금액이 더 높을 것이다. (채택)</li>
</ul>
<h2 id="eda를-위한-kpi-선정">EDA를 위한 KPI 선정</h2>
<ul>
<li>아직 전처리는 하지 않은 상태에서, Raw Data를 훑어보며 주요 KPI를 선정하였다. </li>
<li>이커머스에서 활용되는 주요 KPI가 있지만, 세부적인 EDA를 위해 세워둔 일부 가설을 기반으로 더 디테일 한 KPI를 선정하였다 .</li>
</ul>
<blockquote>
<p>Olist Store E-commerce analysis Project KPI</p>
</blockquote>
<pre><code>   - KPI1: 주말 vs 평일 주문 기준 결제 통계 (order_purchase_timestamp) 
   - KPi2: 리뷰스코어 5점이고 결제 수단이 신용카드인 주문 건수 
   - KPI3: 펫샵(pet_shop) 카테고리 주문의 평균 배송 소요일수
   - KPI4: 상파울로 거주 고객의 평균 상품 가격 및 평균 결제 금액 
   - KPI5: 배송 소요일수(배송완료일-주문일시)와 리뷰 평점 간의 관계</code></pre><h2 id="eda-1">EDA 1</h2>
<h3 id="kpi1-주말-vs-평일-주문-기준-결제-통계">KPI1: 주말 vs 평일 주문 기준 결제 통계</h3>
<pre><code>[쿼리]
SELECT * FROM olistdata.orders;
SELECT * FROM olistdata.order_payments;

select kpi1.day_end, 
concat(round(kpi1.total_payment / (select sum(payment_value) from order_payments) * 100, 2)
, &#39;%&#39;) as percentage_payment_values 
from 
(select ord.day_end , sum(pmt.payment_value) as total_payment
from order_payments as pmt 
join 
(select distinct order_id, 
case 
when weekday(order_purchase_timestamp) in (5,6) then &quot;Weekend&quot; 
else &quot;Weekday&quot;
end as Day_end 
from orders) as ord
on ord.order_id = pmt.order_id
group by ord.day_end) as kpi1 ; </code></pre><p>[Result]
<img src="https://velog.velcdn.com/images/hyuunk_/post/ac4b2bf1-a427-4619-901e-4d91088330cf/image.png" alt=""></p>
<blockquote>
<p>[Insight]</p>
</blockquote>
<ul>
<li>전체 결제 금액 중, 약 3/4가 평일에 발생하였음 </li>
<li>즉, 고객들은 주말보다 평일에 훨씬 더 많이 구매 </li>
<li>주말 쇼핑이 주력일 거라는 가설과는 다른 결과 </li>
</ul>
<h3 id="kpi2-리뷰스코어-5점이고-결제-수단이-신용카드인-주문-건수">KPI2: 리뷰스코어 5점이고 결제 수단이 신용카드인 주문 건수</h3>
<pre><code>[쿼리]
select 
count(pmt.order_id) as Total_Orders
from 
order_payments pmt
inner join order_reviews rev on pmt.order_id = rev.order_id
where 
rev.review_score = 5 
and pmt.payment_type = &#39;credit_card&#39; ;</code></pre><p>[Result]
<img src="https://velog.velcdn.com/images/hyuunk_/post/d2c558fb-7b60-4bee-b523-b25d9e7d3118/image.png" alt=""></p>
<blockquote>
<p>[Insight]</p>
</blockquote>
<ul>
<li>고객 만족도 + 핵심 결제수단이 동시에 충족된 주문이 상당히 많음 </li>
</ul>
<h3 id="kpi3-펫샵pet_shop-카테고리-주문의-평균-배송-소요일수">KPI3: 펫샵(pet_shop) 카테고리 주문의 평균 배송 소요일수</h3>
<pre><code>[쿼리]
select 
prod.product_category_name, 
round(avg(datediff(ord.order_delivered_customer_date , ord.order_purchase_timestamp)), 0) as Avg_delivery_days 
from orders ord 
join 
(select product_id, order_id, product_category_name
from products
join orders_items using(product_id)) as prod 
on ord.order_id = prod.order_id 
where prod.product_category_name = &quot;Pet_shop&quot;
group by prod.product_category_name ; </code></pre><p>[Result]
 <img src="https://velog.velcdn.com/images/hyuunk_/post/29de8509-9d46-45a5-9d81-a4be7ca7866a/image.png" alt=""></p>
<blockquote>
<p>[Insight]</p>
</blockquote>
<ul>
<li>평균 11일의 소요 시간은 이커머스 기준으로 다소 긴 편에 속함 </li>
<li>추가분석1: &quot;배송이 느린 카테고리는 고객 만족도에 어떠한 영향을 주는가?&quot;</li>
<li>추가분석2: &quot;펫샵 vs 전체 평균 배송일 비교&quot;</li>
<li>추가분석3: &quot;펫샵 내 배송일수 구간별 리뷰 분포&quot;</li>
</ul>
<h3 id="kpi4-상파울로-거주-고객의-평균-상품-가격-및-평균-결제-금액">KPI4: 상파울로 거주 고객의 평균 상품 가격 및 평균 결제 금액</h3>
<pre><code>[쿼리]
with orderItemsAvg as ( 
select round(avg(item.price)) as avg_order_item_price
from orders_items item
join orders ord 
on item.order_id = ord.order_id 
join customers cust on ord.customer_id = cust.customer_id 
where cust.customer_city = &quot;Sao Paulo&quot; 
)
select 
(select avg_order_item_price from orderItemsAvg) as avg_order_item_price, 
round(avg(pmt.payment_value)) as avg_payment_value 
from order_payments pmt 
join orders ord on pmt.order_id = ord.order_id 
join customers cust on ord.customer_id = cust.customer_id 
where 
cust.customer_city = &quot;Sao Paulo&quot; ;</code></pre><p>[Result]
<img src="https://velog.velcdn.com/images/hyuunk_/post/386800f8-5a13-4907-9fa7-e4754a0f3eaf/image.png" alt=""></p>
<blockquote>
<p>[Insight]</p>
</blockquote>
<ul>
<li>실제 상품 가격 평균은 108(약 29,000원), 최종 결제 금액은 136(약 36,700원)</li>
<li>상품 가격 외 추가 비용이 결제 금액에 반영되고 있음을 알 수 있음(ex.배송비, 세금, 복수 상품 구매 가능성)</li>
<li>추가분석 1: 상파울루 vs 타 도시 평균 비교 </li>
<li>추가분석 2: 상파울루 배송비 평균 추정 </li>
</ul>
<h3 id="kpi5-배송-소요일수배송완료일-주문일시와-리뷰-평점-간의-관계">KPI5: 배송 소요일수(배송완료일-주문일시)와 리뷰 평점 간의 관계</h3>
<pre><code>[쿼리]
select 
rew.review_score, 
round(avg(datediff(ord.order_delivered_customer_date , order_purchase_timestamp)),0) as &quot;Avg shipping days&quot; 
from orders as ord 
join order_reviews as rew on rew.order_id = ord.order_id 
group by rew.review_score 
order by rew.review_score ;</code></pre><p>[Result]
<img src="https://velog.velcdn.com/images/hyuunk_/post/b4520773-cca5-4d8e-bf23-2b487a066a46/image.png" alt=""></p>
<blockquote>
<p>[Insight]</p>
</blockquote>
<ul>
<li>배송이 빠를수록 리뷰 점수가 높고, 배송이 느릴수록 리뷰점수가 낮다</li>
<li>리뷰1점 고객은 평균 21일 대기, 리뷰 5점 고객은 평균 11일 대시(10일 차이)</li>
<li>배송 속도는 고객 만족도를 결정짓는 핵심 요인 중 하나 </li>
</ul>
<blockquote>
<p>[총평]
본 분석은 고객의 구매 시점, 결제 방식, 배송 리드타임이 고객 만족도에 미치는 영향을 검증하기 위해 다섯 가지 가설을 설정하였다. 분석 결과, 주문은 주말보다 평일에 집중되어 있었으며, 고만족 고객은 신용카드 결제에 집중되는 경향을 보였다. 또한 특정 카테고리에서는 배송 기간이 상대적으로 길게 나타났고, 대도시 고객일수록 상품 가격 대비 실제 결제 금액이 높았다. 마지막으로 배송 소요일수와 리뷰 점수 간에는 명확한 음의 상관관계가 확인되어, 배송 속도가 고객 만족도를 결정짓는 핵심 요인임을 데이터로 입증하였다.</p>
</blockquote>
]]></description>
        </item>
        <item>
            <title><![CDATA[2025-12-17 (수) ]]></title>
            <link>https://velog.io/@hyuunk_/Olist-e-commerce-%EB%8D%B0%EC%9D%B4%ED%84%B0%EB%B6%84%EC%84%9D</link>
            <guid>https://velog.io/@hyuunk_/Olist-e-commerce-%EB%8D%B0%EC%9D%B4%ED%84%B0%EB%B6%84%EC%84%9D</guid>
            <pubDate>Thu, 18 Dec 2025 10:00:29 GMT</pubDate>
            <description><![CDATA[<ul>
<li>포트폴리오에 넣을 새로운 프로젝트 주제로 &#39;Olist dataset&#39;을 선정하였다. </li>
<li>선정이유: <pre><code>     1) SQL 입문자로써, 이만한 양질의 데이터를 찾기 어려웠다. 
     2) 선행분석들이 많아 어려움을 겪을 때 참고할 수 있는 자료가 많았다.</code></pre></li>
<li>진행목표: <pre><code>     12월 안에 SQL 및 Python을 통한 분석과 POWER BI 대시보드, 포트폴리            오까지 완성하는 것을 목표.</code></pre></li>
<li>세부계획: <pre><code>     1) 매일 분석한 내용들을 벨로그에 기록하고 회고. 
     2) 추가 작성 예정 </code></pre></li>
</ul>
<h2 id="테이블-및-컬럼-확인">테이블 및 컬럼 확인</h2>
<p><img src="https://velog.velcdn.com/images/hyuunk_/post/6c4a461d-c958-409b-b4e6-0fda538942c8/image.png" alt=""></p>
<h2 id="데이터-정제">데이터 정제</h2>
<ul>
<li>처음 데이터를 다운로드 받았을 때, 텍스트 데이터 중 알파벳이 깨져 들어있는 행들이 많았다. </li>
<li>우선, 엑셀 상으로 Ctrl+F 기능을 통해 한글로 잘못들어온 글자들을 알파벳으로 변환해주었다. &#39;geolocation&#39; 테이블의 &#39;city&#39; 컬럼에서 고급필터 기능을 통해 어떤 도시 종류들이 있는지 쭉 뽑아낸다음, 잘못 받아진 글자들을 변환하였다. </li>
<li>예를 들어, 찾-&gt;a, 챕-&gt;e, 찼-&gt;o으로 잘못 변환되어 받아졌음을 확인하고 ctrl+F를 통해 한번에 바꾸어 주었다.</li>
<li>마찬가지로, &#39;order_reviews&#39; 테이블의 &#39;comment message&#39; 칼럼에도 해당 작업이 필요함을 뒤늦게 확인하였다. (Python을 통해 SQL로 Data Import를 하는 과정에서 깨져서 들어온 글자 때문에 인코딩 문제를 겪으며 해당 테이블에도 같은 문제가 있음을 파악하였다)</li>
<li>너무 많은 텍스트 데이터가 있어서, 엑셀 상으로 하나하나 바꾸어주는 대신에 기존에 설정한 &#39;utf-8&#39;을 &#39;latin1&#39;로 바꿔주어서 깨진 특수 문자를 전부 강제로 문자로 읽게 하였다.</li>
</ul>
<h2 id="환경-세팅-단계인프라">환경 세팅 단계(인프라)</h2>
<h2 id="1-물리적-스키마-생성">1. 물리적 스키마 생성</h2>
<ul>
<li>MySQL에서 &#39;olistdata&#39;라는 이름으로 물리적 스키마를 생성해주었다.(Creata New Scheme)</li>
</ul>
<h2 id="2-데이터-적재">2. 데이터 적재</h2>
<ul>
<li>MySQL Database 카테고리에 있는 &#39;Data Import&#39;기능을 통해 테이블을 불러오려했으나 잘 되지않았다. </li>
<li>&#39;Table Data Import Wizard&#39;를 통해 파일별로 데이터를 불러오려했으나 시간이 너무 소요되는 문제가 있었다. </li>
<li>그래서 그냥 Python으로 SQL에 데이터를 보내는 방식을 택하기로 하였다. <pre><code>[사용코드]
</code></pre></li>
</ul>
<p>engine = create_engine(
    &quot;mysql+pymysql://root:Marine13258!@localhost:3306/olistdata&quot;
)
base_path = r&quot;C:\Users\SAMSUNG\OneDrive\바탕 화면\olist dataset&quot;</p>
<p>files = {
    r&quot;C:\Users\SAMSUNG\OneDrive\바탕 화면\olist dataset\olist_customers_dataset.csv&quot;: &quot;customers&quot;,
    r&quot;C:\Users\SAMSUNG\OneDrive\바탕 화면\olist dataset\olist_geolocation_dataset.csv&quot;: &quot;geolocation&quot;,
    r&quot;C:\Users\SAMSUNG\OneDrive\바탕 화면\olist dataset\olist_order_items_dataset.csv&quot;: &quot;orders_items&quot;,
    r&quot;C:\Users\SAMSUNG\OneDrive\바탕 화면\olist dataset\olist_order_payments_dataset.csv&quot;: &quot;order_payments&quot;,
    r&quot;C:\Users\SAMSUNG\OneDrive\바탕 화면\olist dataset\olist_order_reviews_dataset.csv&quot;: &quot;order_reviews&quot;,
    r&quot;C:\Users\SAMSUNG\OneDrive\바탕 화면\olist dataset\olist_orders_dataset.csv&quot;: &quot;orders&quot;,
    r&quot;C:\Users\SAMSUNG\OneDrive\바탕 화면\olist dataset\olist_products_dataset.csv&quot;: &quot;products&quot;,
    r&quot;C:\Users\SAMSUNG\OneDrive\바탕 화면\olist dataset\olist_sellers_dataset.csv&quot;: &quot;sellers&quot;,
}</p>
<p>for files, table in files.items(): 
    print(f&quot;{table}) 업로드 중...&quot;)
    df = pd.read_csv(
        os.path.join(base_path, files),
        encoding=&quot;latin1&quot;
    )
    df.to_sql(table, engine, if_exists=&#39;replace&#39;, index=False)
    print(f&quot;{table} 완료&quot;)</p>
<p>print(&quot;전체 업로드 완료&quot;)</p>
<p>```</p>
<h2 id="분석-단계">분석 단계</h2>
<h2 id="1-모델링">1. 모델링</h2>
<ul>
<li>&#39;Moldes&#39; 카테고리에서 데이터 구조를 파악하고 EER 다이어그램을 통해 도식화하기 위해 데이터를 뜯어 보았다. </li>
<li>용어나 PK, FK 등 개념들이 헷갈려서 기본서와 유튜브를 다시 찾아보았다. </li>
<li>그래도 아직 어떻게 해야할지 감이 잡히지 않는다. </li>
</ul>
<blockquote>
<p>&lt;앞으로의 진행방향&gt; </p>
</blockquote>
<ul>
<li>데이터 구조 파악 (PK/FK 설정)</li>
<li>다이어그램으로 도식화 </li>
<li>데이터 전처리 </li>
<li>KPI 설정 </li>
<li>분석 주제 설정 </li>
</ul>
]]></description>
        </item>
    </channel>
</rss>