Welcome to an ongoing library of prompts for data collection, cleaning, exploration, analysis, and visualization. This is an invaluable resource for data analysts and data scientists, serving as a dynamic and evolving repository of practical and tested prompts.
Given its ongoing nature, I encourage you to bookmark this page and visit it regularly. I’ll be periodically updating the prompts to ensure they're current, relevant, and challenging, reflecting the ever-evolving landscape of data analysis. You'll discover new additions and alterations that will continually test and grow your skill set.
When practicing with these prompts, it's crucial to remember to use fake or anonymized data. This helps protect privacy and ensures ethical data handling, particularly when dealing with sensitive information. All prompts are created with the aim of enhancing your skills, not for analyzing real, confidential data.
The prompts encompass the full spectrum of data analysis, from data collection and cleaning to the interpretation of findings and provision of data-driven recommendations. You'll also find exercises on maintaining comprehensive data documentation and conducting exploratory data analysis.
This is more than a static blog post: it is an evolving collection meant to keep you current with data analysis techniques and practices. Let's dive into the fascinating world of data analysis together!
Data analysts gather data from various sources, such as databases, APIs, or spreadsheets, and clean it to ensure accuracy and consistency.
Example: A data analyst at a healthcare organization might collect patient data from different hospital departments, clean and standardize it to ensure consistent formatting, and merge it into a single dataset for analysis.
I want you to act as a fake data generator. I need a dataset that has [x] rows and [y] columns: [insert column names]
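If you would rather generate the fake rows locally, here is a minimal sketch using only the Python standard library. The function name `make_fake_rows`, the type tags "int", "float", and "str", and the example column names are all assumptions made for illustration; swap in whatever columns your exercise needs.

```python
import random
import string


def make_fake_rows(n_rows, columns, seed=0):
    """Return n_rows dicts of synthetic values, one key per column.

    `columns` maps a column name to one of the hypothetical type tags
    "int", "float", or "str" (these tags are an assumption of this sketch).
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    generators = {
        "int": lambda: rng.randint(0, 100),
        "float": lambda: round(rng.uniform(0, 1000), 2),
        "str": lambda: "".join(rng.choices(string.ascii_lowercase, k=8)),
    }
    return [
        {name: generators[kind]() for name, kind in columns.items()}
        for _ in range(n_rows)
    ]


# Hypothetical column names, chosen only for the demonstration.
example_columns = {"customer_id": "int", "spend": "float", "nickname": "str"}
rows = make_fake_rows(5, example_columns)
for row in rows:
    print(row)
```

Because the generator is seeded, the same call always returns the same rows, which makes practice results easy to compare.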
Please help me generate sample data for the following SQL DDL table definition:
SQL DDL:
[Provide your SQL DDL table definition, including table name, column names, and data types]
Based on the table definition, please generate a set of somewhat realistic sample data that can be used for testing and mock data generation. Ensure that the sample data is consistent with the meaning of the column names and adheres to the specified data types.
Please help me perform a specific operation (x) on the following example DataFrame represented as a table in Markdown format:
[Insert Example DataFrame]
Operation (x): [Describe the desired operation, e.g., filter rows based on a condition, calculate a new column, sort the DataFrame, or group by a specific column]
Please provide the necessary Pandas code to perform the specified operation (x) on this example DataFrame, and show the resulting DataFrame after the operation is applied.
Please provide a Python code snippet that demonstrates how to clean and preprocess a dataset, including handling missing values, removing duplicates, and standardizing data formats. Use a sample dataset with columns 'Name,' 'Age,' 'Gender,' and 'Email' for this demonstration.
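For comparison, an answer to this prompt might look roughly like the sketch below. The sample rows are invented, and the specific choices (median fill for missing ages, dropping rows without a name, reducing gender to a single uppercase letter) are assumptions rather than the only reasonable approach.

```python
import pandas as pd

# Synthetic sample data, entirely made up for this sketch.
raw = pd.DataFrame({
    "Name": [" alice smith", "Bob Jones", "Bob Jones", "carol lee ", None],
    "Age": [29, None, None, 41, 35],
    "Gender": ["F", "male", "male", "Female", "M"],
    "Email": ["Alice@Example.com", "bob@example.com", "bob@example.com",
              " CAROL@example.com", "dan@example.com"],
})


def clean(df):
    out = df.copy()
    # Standardize text formats first so that duplicates compare equal.
    out["Name"] = out["Name"].str.strip().str.title()
    out["Email"] = out["Email"].str.strip().str.lower()
    out["Gender"] = out["Gender"].str.strip().str.upper().str[0]
    # Remove exact duplicate rows, then rows that have no name at all.
    out = out.drop_duplicates().dropna(subset=["Name"])
    # Fill the remaining missing ages with the median of the known ages.
    out = out.assign(Age=out["Age"].fillna(out["Age"].median()))
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Standardizing before deduplicating matters here: rows that differ only in case or stray whitespace would otherwise survive `drop_duplicates`.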
Please provide a Python code snippet that demonstrates how to merge two datasets using the Pandas library. Assume that the first dataset, 'df1,' contains columns 'ID,' 'Name,' and 'Age,' and the second dataset, 'df2,' contains columns 'ID,' 'City,' and 'Country.' Merge the two datasets on the 'ID' column, and show the resulting merged dataset.
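A response to this prompt would likely resemble the following sketch. The row values are invented for illustration; only the column names come from the prompt. The outer merge with `indicator=True` is an extra, added here because it is a handy way to see which IDs failed to match.

```python
import pandas as pd

# Made-up rows; the column names follow the prompt above.
df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["Ana", "Ben", "Cy"],
    "Age": [34, 28, 45],
})
df2 = pd.DataFrame({
    "ID": [2, 3, 4],
    "City": ["Lyon", "Osaka", "Lima"],
    "Country": ["France", "Japan", "Peru"],
})

# Inner merge: keep only IDs present in both datasets.
merged = df1.merge(df2, on="ID", how="inner")
print(merged)

# Outer merge with an indicator column showing where each row came from.
outer = df1.merge(df2, on="ID", how="outer", indicator=True)
print(outer)
```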
Please provide a Python code snippet that demonstrates how to scrape data from the homepage of www.xyz.com
Please provide a Python code snippet that demonstrates how to collect data from a public REST API endpoint using the 'requests' library. As an example, use the endpoint https://jsonplaceholder.typicode.com/users, which returns JSON data about users. Retrieve the data, parse the JSON response, and display the result in a readable format.
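As a point of reference, a reply to this prompt could be structured like the sketch below. The helper names `fetch_users` and `summarize` are hypothetical, the 10-second timeout is an arbitrary choice, and the sample record is made up so the formatting step can be tried without a network connection.

```python
import requests

USERS_URL = "https://jsonplaceholder.typicode.com/users"


def fetch_users(url=USERS_URL, timeout=10):
    """Download the endpoint and return the parsed JSON (a list of dicts)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.json()


def summarize(users):
    """Turn user records into short, readable one-line strings."""
    return [f"{u['id']}: {u['name']} <{u['email']}>" for u in users]


# Offline demonstration with a made-up record shaped like the API's output.
sample = [{"id": 1, "name": "Example User", "email": "user@example.com"}]
for line in summarize(sample):
    print(line)
```

To hit the real endpoint, call `summarize(fetch_users())`. Keeping the download and the formatting in separate functions lets you test the formatting without network access.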
Data analysts explore datasets to understand their structure, identify patterns, trends, and relationships, and perform statistical analyses to test hypotheses.
Example: A data analyst at an e-commerce company might analyze customer purchase data to identify seasonal trends, high-performing products, and customer segments with different spending behaviors.
I want you to act as a data engineer and code for me. I have a dataset of [describe dataset]. Please write code for data visualisation and exploration.
As a data scientist, I have a table with two columns: [Insert column names]. I'd like to calculate a running average for [specify the desired value or column]. Can you provide the SQL code to accomplish this in BigQuery?
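Before trusting the SQL you get back, it can help to compute the same running average on a small sample in pandas and compare the numbers. The sketch below assumes a hypothetical table with `day` and `amount` columns and a 3-row window. In SQL the equivalent idea is usually a window function along the lines of `AVG(amount) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)`.

```python
import pandas as pd

# Made-up data: six consecutive days with a simple, easy-to-check pattern.
sales = pd.DataFrame({
    "day": pd.date_range("2024-01-01", periods=6, freq="D"),
    "amount": [10, 20, 30, 40, 50, 60],
})

# A running average only makes sense on ordered rows.
sales = sales.sort_values("day")

# Average of the current row and the two rows before it.
sales["running_avg_3"] = sales["amount"].rolling(window=3, min_periods=1).mean()

# Average of every row up to and including the current one.
sales["cumulative_avg"] = sales["amount"].expanding().mean()

print(sales)
```

Setting `min_periods=1` means the first two rows average over whatever rows exist so far instead of returning missing values.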
Please help me modify the following SQL query to achieve a slightly different result:
[Insert Original SQL Query]
Original Query Purpose: [Describe the purpose or goal of the original SQL query]
Desired Modification: [Explain the specific modification you want to make to the query, such as changing the filtering criteria, adding or removing columns, modifying the aggregation, or altering the sorting order]
Please provide the modified SQL query that achieves the desired result, along with an explanation of the changes made and how the new query differs from the original one.
What is the equivalent of the FUNC1 function in BigQuery?
Please help me compare the following two similar SQL queries and explain the differences between them:
[SQL QUERY 1]
[SQL QUERY 2]
Analyze both SQL queries and provide a detailed comparison that highlights the differences in terms of structure, syntax, filtering criteria, columns selected, aggregation, and any other relevant aspects. Additionally, explain how these differences may impact the results returned by each query and any potential implications for performance or data accuracy.
As a Power BI expert, please analyze the details of my current project [insert project details here], focusing on the table structure and relationships. Are there any issues or areas for improvement you can identify within the table?
As a senior data analyst,
[insert schema & data sample]
given the above schemas and data, write a detailed and correct [insert DBMS] SQL query to answer the analytical question:
[question]
Comment the query with your logic.
Double check the Postgres query above for common mistakes, including:
Rewrite the query here if there are any mistakes. If it looks good as it is, just reproduce the original query.
[insert query from previous prompt]
The query above produced the following error:
[insert query error]
Rewrite the query with the error fixed:
Data analysts create reports and visualizations to present their findings in a clear and concise manner to stakeholders, often using tools like Tableau or Power BI.
Example: A data analyst working for a marketing agency might create a dashboard displaying the performance metrics of an advertising campaign, such as impressions, click-through rates, and conversions, to help clients understand the campaign's effectiveness.
Please help me create PySpark StructType and StructField schema definitions for the following dataset:
Dataset columns:
[Continue with further columns as needed]
Please provide the PySpark code for creating the StructType and StructField objects that define the schema for this dataset.
As an expert in data visualization, I need your help to choose the best visualization method for the following problem:
[PROBLEM]
Please describe the problem in detail and recommend the most appropriate visualization method to effectively communicate the information. Explain why you think this method is the best choice.
Write Python code to visualize [metric] using [chosen visualization method].
[Insert data sample]
Can you do visualizations & descriptive analyses to help me understand the data?
[insert data sample]
Can you try regressions and look for patterns? Can you run regression diagnostics?
Data analysts interpret their findings and provide data-driven insights to support decision-making and improve business processes.
Example: A data analyst at a manufacturing company might analyze production data to identify bottlenecks in the assembly line, and recommend process improvements to increase efficiency and reduce costs.
Write OKRs for my [N]-person data team. The focus for this quarter is [X], [Y], and [Z].
Data analysts are responsible for maintaining documentation of data sources, data dictionaries, and data processing steps to ensure transparency, reproducibility, and easy access to information for other team members.
Example: A data analyst working on a financial reporting project might create and maintain a data dictionary outlining the meaning and format of each column in the dataset, as well as document the data processing and transformation steps taken during the analysis.
I want you to act as a software developer. Please provide documentation for func1 below. [Insert function]
Please help me extract the structure of the following data sample:
Data Sample:
[Provide a sample of your data, either as a small dataset, a JSON snippet, or a few rows of a CSV file]
Based on this sample, please provide the inferred structure, including column names, data types, and any relationships or hierarchies that can be observed in the data. Additionally, provide any suggestions or best practices for storing and processing this data using appropriate tools and technologies.
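To sanity-check the structure the model reports, you can run a rough inference yourself. The sketch below is a minimal version using only the standard library: it assumes the sample is a flat list of JSON objects and simply records which Python types appear under each field. The function name `infer_structure` and the sample records are made up for illustration.

```python
import json


def infer_structure(records):
    """Map each field name to the sorted list of Python type names observed."""
    seen = {}
    for record in records:
        for key, value in record.items():
            seen.setdefault(key, set()).add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Made-up JSON sample with a nullable field and a mixed numeric field.
sample_json = """
[
  {"id": 1, "name": "A", "tags": ["x"], "score": 9.5},
  {"id": 2, "name": null, "tags": [], "score": 7}
]
"""
structure = infer_structure(json.loads(sample_json))
for field, types in structure.items():
    print(field, types)
```

A field that lists more than one type, such as a mix of `NoneType` and `str`, is a hint that the column is nullable or inconsistently typed and deserves a note in your data dictionary.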
ChatGPT can assist data analysts in conducting exploratory data analysis by suggesting statistical techniques, visualizations, and tools to identify patterns, trends, and relationships in data.
I want to do [X] with the following data [insert data]. Can you suggest statistical techniques that will help me do [X]? Provide a SQL code sample if possible.
If I am missing [X] data, what is the best way to measure [X]?