Sid Dani

Welcome to an ongoing library of prompts for data collection, cleaning, exploration, analysis, and visualization. This is an invaluable resource for data analysts and data scientists, serving as a dynamic and evolving repository of practical and tested prompts.

Given its ongoing nature, I encourage you to bookmark this page and visit it regularly. I’ll be periodically updating the prompts to ensure they're current, relevant, and challenging, reflecting the ever-evolving landscape of data analysis. You'll discover new additions and alterations that will continually test and grow your skill set.

When practicing with these prompts, it's crucial to remember to use fake or anonymized data. This helps protect privacy and ensures ethical data handling, particularly when dealing with sensitive information. All prompts are created with the aim of enhancing your skills, not for analyzing real, confidential data.

The prompts encompass the full spectrum of data analysis, from data collection and cleaning to the interpretation of findings and provision of data-driven recommendations. You'll also find exercises on maintaining comprehensive data documentation and conducting exploratory data analysis.

This is more than a static blog post; it's an evolving platform designed to keep you at the forefront of data analysis techniques and practices. Bookmark this page, visit often, and keep practicing with fake or anonymized data. Let's dive into the fascinating world of data analysis together!

Data Collection and Cleaning

Data analysts gather data from various sources, such as databases, APIs, or spreadsheets, and clean it to ensure accuracy and consistency.
Example: A data analyst at a healthcare organization might collect patient data from different hospital departments, clean and standardize it to ensure consistent formatting, and merge it into a single dataset for analysis.

Prompts

▶

Generate Data:

I want you to act as a fake data generator. I need a dataset that has [x] rows and [y] columns: [insert column names]
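If you would rather generate the practice data locally, the same idea can be sketched with Python's standard library. The column names, name list, and city list below are made up for illustration, and the fixed seed is only there so the sample is reproducible.

```python
import random


def generate_fake_rows(n_rows, seed=0):
    """Return n_rows dictionaries of clearly fake customer records."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    first_names = ["Alex", "Sam", "Jordan", "Taylor", "Casey"]
    cities = ["Springfield", "Riverton", "Lakeside"]
    rows = []
    for i in range(1, n_rows + 1):
        name = rng.choice(first_names)
        rows.append({
            "customer_id": i,
            "name": name,
            # example.com is reserved for documentation, so no real inbox is used
            "email": f"{name.lower()}{i}@example.com",
            "age": rng.randint(18, 80),
            "city": rng.choice(cities),
        })
    return rows


rows = generate_fake_rows(5)
for row in rows:
    print(row)
```

Because every value is drawn from invented lists, nothing in the output can be traced back to a real person.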

▶

Generate Data From DDL

Please help me generate sample data for the following SQL DDL table definition:
SQL DDL:
[Provide your SQL DDL table definition, including table name, column names, and data types]
Based on the table definition, please generate a set of somewhat realistic sample data that can be used for testing and mock data generation. Ensure that the sample data is consistent with the meaning of the column names and adheres to the specified data types.

▶

Design Pandas functions

Please help me perform a specific operation (x) on the following example DataFrame represented as a table in Markdown format:
[Insert Example DataFrame]
Operation (x): [Describe the desired operation, e.g., filter rows based on a condition, calculate a new column, sort the DataFrame, or group by a specific column]
Please provide the necessary Pandas code to perform the specified operation (x) on this example DataFrame, and show the resulting DataFrame after the operation is applied.

▶

Clean Dataset

Please provide a Python code snippet that demonstrates how to clean and preprocess a dataset, including handling missing values, removing duplicates, and standardizing data formats. Use a sample dataset with columns 'Name,' 'Age,' 'Gender,' and 'Email' for this demonstration.
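As a point of reference for judging a response to this prompt, here is a minimal sketch of the three steps it names, using Pandas. The sample rows are invented, and the choices of median imputation for 'Age' and an "unknown" placeholder for 'Email' are assumptions; other strategies may suit your data better.

```python
import pandas as pd

# Made-up sample with typical problems: stray whitespace, mixed case,
# a missing age, a missing email, and a duplicated person.
raw = pd.DataFrame({
    "Name": [" alice smith", "Bob Jones", "alice smith ", "Carla Diaz"],
    "Age": [29, None, 29, 41],
    "Gender": ["F", "male", "f", "Female"],
    "Email": ["ALICE@EXAMPLE.COM", "bob@example.com", "alice@example.com", None],
})


def clean(df):
    out = df.copy()
    # Standardize text formats first so that duplicates become detectable.
    out["Name"] = out["Name"].str.strip().str.title()
    out["Email"] = out["Email"].str.strip().str.lower()
    # Reduce every gender label to its first letter, then map to one code.
    out["Gender"] = (
        out["Gender"].str.strip().str.lower().str[0].map({"f": "F", "m": "M"})
    )
    # Remove the exact duplicates that standardization exposed.
    out = out.drop_duplicates().reset_index(drop=True)
    # Handle missing values: median for ages, explicit placeholder for emails.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Email"] = out["Email"].fillna("unknown")
    return out


cleaned = clean(raw)
print(cleaned)
```

Standardizing before deduplicating matters here: the two "alice smith" rows only become identical once whitespace and letter case are normalized.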

▶

Merge Datasets

Please provide a Python code snippet that demonstrates how to merge two datasets using the Pandas library. Assume that the first dataset, 'df1,' contains columns 'ID,' 'Name,' and 'Age,' and the second dataset, 'df2,' contains columns 'ID,' 'City,' and 'Country.' Merge the two datasets on the 'ID' column, and show the resulting merged dataset.
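For comparison with whatever the model returns, a minimal sketch of this merge is shown below. The rows are invented, and the choice to show both an inner and an outer join is an assumption; the prompt itself does not say which join type you want, so it is worth specifying.

```python
import pandas as pd

# Made-up rows; IDs 2 and 3 appear in both tables.
df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["Alex", "Sam", "Jordan"],
    "Age": [34, 27, 45],
})
df2 = pd.DataFrame({
    "ID": [2, 3, 4],
    "City": ["Riverton", "Lakeside", "Springfield"],
    "Country": ["US", "CA", "US"],
})

# Inner join: keep only IDs present in both datasets.
inner = df1.merge(df2, on="ID", how="inner")

# Outer join: keep every ID; the _merge column records where each row came from.
outer = df1.merge(df2, on="ID", how="outer", indicator=True)

print(inner)
print(outer)
```

The `indicator=True` column is a quick way to spot IDs that exist in only one dataset before deciding how to handle them.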

▶

Build a simple data scraper

Please provide a Python code snippet that demonstrates how to scrape data from the homepage of www.xyz.com

▶

Collect Data from an API

Please provide a Python code snippet that demonstrates how to collect data from a public REST API endpoint using the 'requests' library. As an example, use the following API endpoint that returns JSON data about users: https://jsonplaceholder.typicode.com/users. Retrieve the data, parse the JSON response, and display the result in a readable format.

Data Exploration and Analysis

Data analysts explore datasets to understand their structure, identify patterns, trends, and relationships, and perform statistical analyses to test hypotheses.
Example: A data analyst at an e-commerce company might analyze customer purchase data to identify seasonal trends, high-performing products, and customer segments with different spending behaviors.

Prompts

▶

Explore Data

I want you to act as a data engineer and code for me. I have a dataset of [describe dataset]. Please write code for data visualisation and exploration.

▶

Calculate Running Average

As a data scientist, I have a table with two columns: [Insert column names]. I'd like to calculate a running average for [specify the desired value or column]. Can you provide the SQL code to accomplish this in BigQuery?
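To sanity-check the SQL you get back, it can help to compute the same running average on a few sample values by hand. The sketch below assumes "running average" means the mean of all rows from the first up to the current one (in SQL terms, a window from UNBOUNDED PRECEDING to CURRENT ROW); the `daily_sales` values are invented.

```python
from itertools import accumulate


def running_average(values):
    """Average of all values from the first row up to and including each row."""
    totals = accumulate(values)  # cumulative sums: 10, 30, 60, 100
    return [total / count for count, total in enumerate(totals, start=1)]


daily_sales = [10, 20, 30, 40]
print(running_average(daily_sales))  # [10.0, 15.0, 20.0, 25.0]
```

If the query is meant to use a fixed-size moving window instead (for example, the last 7 rows), say so in the prompt, since the two produce different numbers.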

▶

Modify an Existing Query

Please help me modify the following SQL query to achieve a slightly different result:

[Insert Original SQL Query]

Original Query Purpose: [Describe the purpose or goal of the original SQL query]

Desired Modification: [Explain the specific modification you want to make to the query, such as changing the filtering criteria, adding or removing columns, modifying the aggregation, or altering the sorting order]

Please provide the modified SQL query that achieves the desired result, along with an explanation of the changes made and how the new query differs from the original one.

▶

Translate SQL Dialects

What is the equivalent of the FUNC1 function in BigQuery?

▶

Compare Two Similar SQL Queries

Please help me compare the following two similar SQL queries and explain the differences between them:
[SQL QUERY 1]
[SQL QUERY 2]
Analyze both SQL queries and provide a detailed comparison that highlights the differences in terms of structure, syntax, filtering criteria, columns selected, aggregation, and any other relevant aspects. Additionally, explain how these differences may impact the results returned by each query and any potential implications for performance or data accuracy.

▶

Power BI Modeling

As a Power BI expert, please analyze the details of my current project [insert project details here], focusing on the table structure and relationships. Are there any issues or areas for improvement you can identify within the table?

Chain Prompting

▶

Generate SQL Query

As a senior data analyst,
[insert schema & data sample]
given the above schemas and data, write a detailed and correct [insert DBMS] SQL query to answer the analytical question:

[question]

Comment the query with your logic.

▶

Double Check SQL Query

Double check the Postgres query above for common mistakes, including:

Remembering to add NULLS LAST to an ORDER BY DESC clause
Handling case sensitivity, e.g. using ILIKE instead of LIKE
Ensuring the join columns are correct
Casting values to the appropriate type

Rewrite the query here if there are any mistakes. If it looks good as it is, just reproduce the original query.

▶

Debug Query Against DB

[insert query from previous prompt]

The query above produced the following error:

[insert query error]

Rewrite the query with the error fixed:
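One way to read the three prompts in this chain is as a loop: generate a query, ask for a self-review, run it, and feed any error back. The sketch below shows that flow under stated assumptions: `ask` (sends a prompt to a model and returns its reply) and `run_query` (executes SQL and returns rows) are hypothetical callables you would supply yourself, and the retry limit is arbitrary.

```python
CHECK_PROMPT = (
    "Double check the Postgres query above for common mistakes, including:\n\n"
    "Remembering to add NULLS LAST to an ORDER BY DESC clause\n"
    "Handling case sensitivity, e.g. using ILIKE instead of LIKE\n"
    "Ensuring the join columns are correct\n"
    "Casting values to the appropriate type\n\n"
    "Rewrite the query here if there are any mistakes. "
    "If it looks good as it is, just reproduce the original query."
)


def build_generate_prompt(schema, dbms, question):
    """First link in the chain: ask for a commented query."""
    return (
        "As a senior data analyst,\n"
        f"{schema}\n"
        f"given the above schemas and data, write a detailed and correct {dbms} "
        f"SQL query to answer the analytical question:\n\n{question}\n\n"
        "Comment the query with your logic."
    )


def build_debug_prompt(query, error):
    """Third link: hand the failing query and its error back to the model."""
    return (
        f"{query}\n\n"
        "The query above produced the following error:\n\n"
        f"{error}\n\n"
        "Rewrite the query with the error fixed:"
    )


def run_chain(ask, run_query, schema, question, dbms="Postgres", max_fixes=2):
    """Generate, double check, then execute with up to max_fixes repair rounds.

    ask and run_query are hypothetical callables provided by the caller.
    """
    query = ask(build_generate_prompt(schema, dbms, question))
    # The check prompt refers to "the query above", so prepend the query.
    query = ask(f"{query}\n\n{CHECK_PROMPT}")
    for _ in range(max_fixes):
        try:
            return query, run_query(query)
        except Exception as exc:
            query = ask(build_debug_prompt(query, exc))
    # Final attempt; an error here propagates to the caller.
    return query, run_query(query)
```

Run this only against a test database with fake or anonymized data, since the generated SQL is executed as-is.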

Reporting and Visualization

Data analysts create reports and visualizations to present their findings in a clear and concise manner to stakeholders, often using tools like Tableau or Power BI.
Example: A data analyst working for a marketing agency might create a dashboard displaying the performance metrics of an advertising campaign, such as impressions, click-through rates, and conversions, to help clients understand the campaign's effectiveness.

▶

Write PySpark Struct

Please help me create PySpark StructType and StructField schema definitions for the following dataset:

Dataset columns:

Column Name: [Name of the first column]
Data Type: [Data type of the first column, e.g., StringType, IntegerType, DoubleType, etc.]
Nullable: [True/False, indicating if the first column can contain null values]
Column Name: [Name of the second column]
Data Type: [Data type of the second column]
Nullable: [True/False, indicating if the second column can contain null values]

[Continue with further columns as needed]

Please provide the PySpark code for creating the StructType and StructField objects that define the schema for this dataset.

▶

Choose Visualization Method

As an expert in data visualization, I need your help to choose the best visualization method for the following problem:
[PROBLEM]

Please describe the problem in detail and recommend the most appropriate visualization method to effectively communicate the information. Explain why you think this method is the best choice.

▶

Visualize Data

Write Python code to visualize [metric] using [choose viz method].

▶

Explore Data

[Insert data sample]
Can you do visualizations & descriptive analyses to help me understand the data?
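For a quick local baseline to compare against the model's answer, basic descriptive statistics and a rough text histogram need nothing beyond the standard library. This is only a sketch: the sample values are invented, and a real exploration would likely use a plotting library instead of `#` characters.

```python
import statistics
from collections import Counter


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def text_histogram(values):
    """One line per distinct value, with one '#' per occurrence."""
    counts = Counter(values)
    return [f"{value:>3} | {'#' * counts[value]}" for value in sorted(counts)]


sample = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(sample)
print(summary)
print("\n".join(text_histogram(sample)))
```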

▶

Perform Linear Regression

[insert data sample]
Can you try regressions and look for patterns? Can you run regression diagnostics?
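It helps to know roughly what a regression with diagnostics involves before reading the model's output. The sketch below fits a simple one-variable least-squares line by hand and reports residuals and R squared; the `ad_spend` and `sales` numbers are invented, and real diagnostics would usually go further (residual plots, outlier checks, and so on).

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def diagnostics(xs, ys, slope, intercept):
    """Residuals and R squared for a fitted line."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mean_y = sum(ys) / len(ys)
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return {"residuals": residuals, "r_squared": 1 - ss_res / ss_tot}


# Made-up data: advertising spend versus sales.
ad_spend = [1, 2, 3, 4, 5]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_spend, sales)
report = diagnostics(ad_spend, sales, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"r_squared={report['r_squared']:.4f}")
print("residuals:", [round(r, 3) for r in report["residuals"]])
```

Residuals that show a clear pattern (for example, all positive at the ends and negative in the middle) suggest a straight line is the wrong model, even when R squared looks high.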

Business Insights & Recommendations

Data analysts interpret their findings and provide data-driven insights to support decision-making and improve business processes.
Example: A data analyst at a manufacturing company might analyze production data to identify bottlenecks in the assembly line, and recommend process improvements to increase efficiency and reduce costs.

▶

Write OKRs

Write OKRs for my [N]-person data team. The focus for this quarter is [X], [Y], and [Z].

Maintain Data Documentation

Data analysts are responsible for maintaining documentation of data sources, data dictionaries, and data processing steps to ensure transparency, reproducibility, and easy access to information for other team members.
Example: A data analyst working on a financial reporting project might create and maintain a data dictionary outlining the meaning and format of each column in the dataset, as well as document the data processing and transformation steps taken during the analysis.

▶

Write documentation for functions

I want you to act as a software developer. Please provide documentation for func1 below. [Insert function]

▶

Extract structure out of data sample

Please help me extract the structure of the following data sample:

Data Sample:
[Provide a sample of your data, either as a small dataset, a JSON snippet, or a few rows of a CSV file]

Based on this sample, please provide the inferred structure, including column names, data types, and any relationships or hierarchies that can be observed in the data. Additionally, provide any suggestions or best practices for storing and processing this data using appropriate tools and technologies.
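For a small CSV sample, a first pass at structure inference is easy to do yourself, which gives you something to check the model's answer against. The sketch below guesses a type for each column and whether it contains empty values; the sample rows are invented, and only int, float, and str are distinguished (dates, for instance, are reported as str).

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of int, float, str that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "str"  # nothing to go on, so fall back to the widest type
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "str"


def infer_structure(csv_text):
    """Map each column name to its inferred type and nullability."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {
        col: {
            "type": infer_type([row[col] for row in rows]),
            "nullable": any(row[col] == "" for row in rows),
        }
        for col in columns
    }


sample_csv = """id,name,score,signup_date
1,Alex,9.5,2023-01-04
2,Sam,,2023-02-11
3,Jordan,7,2023-03-20
"""

structure = infer_structure(sample_csv)
for column, info in structure.items():
    print(column, info)
```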

Exploratory Data Analysis

ChatGPT can assist data analysts in conducting exploratory data analysis by suggesting statistical techniques, visualizations, and tools to identify patterns, trends, and relationships in data.

▶

Suggest Statistical Technique

I want to do [X] with the following data: [insert data]. Can you suggest statistical techniques that will help me do [X]? Provide a SQL code sample if possible.

▶

Missing Data Ideas

If I am missing [X] data, what is the best way to measure [X]?

▶

Find Best Visualisation Ideas