Load Data from Twitter to Snowflake

This project uses Python and Tweepy to load semi-structured data from Twitter into a data model in Snowflake.

Kim Te
Towards Data Science


End-to-End Design

Here is a diagram of what the end-to-end project ultimately looks like and what you should expect to achieve after following this article.

Diagram by author. This is a high-level outline of the project design.

Table of Contents

  1. Pull Data from the Twitter API
  2. Land Raw Twitter Data into Snowflake
  3. Parse, Load, and Schedule Twitter Data in Snowflake

This project mostly stemmed from an interest in learning Python after years of doing ETL (extract-transform-load) using data integration software such as Alteryx and SnapLogic. I wanted to get a little bit of hands-on work with Python and decided to build out a small project. Before starting this project, I took a Python course on Udemy to prepare. I chose to use Twitter because it has a large volume of publicly accessible data, Python for personal interest, and Snowflake because it has a 30-day free trial.

Hopefully, these forthcoming materials will be helpful to you. I did some research online but only found bits and pieces scattered across the world wide web. I found that there are a ton of data scientists using Twitter data for machine learning purposes. To level set, this is a data engineering project, not a data science one. If you are a data scientist, some parts of this could still be useful in terms of retrieving raw data.

As alluded to in the table of contents, this project is broken down into 3 main sections. Some short tangential posts and official documentation are also linked for further support.

What Will Be Covered

  1. How to get your Twitter API keys and tokens
  2. How to pull Twitter data using the Tweepy user_timeline method
  3. Loading Tweepy user_timeline data into a local folder on your computer
  4. Loading raw data from local storage into a Snowflake internal named stage that you will create
  5. Using streams and tasks to load data automatically from your Snowflake internal stage into basic database tables

What Will Not Be Covered

  1. How to schedule your Python script to retrieve Twitter data. There are tons of online materials for this already
  2. Snowflake basics — I will provide links where I can but I recommend reviewing the official documentation
  3. How to “productionize” this design for an enterprise platform

Are you ready? If so, then start with pulling data from Twitter!

Pull Data from the Twitter API

In this section, I set up my Twitter account, get API keys and tokens, and then land data locally on my computer. Landing the data in S3 or similar storage is also possible, but I kept it local for simplicity.

Also, coming from a background in working with enterprise data systems, I understand how bureaucratic it can be to have an AWS team provision an S3 bucket for you. Hence, this post lands data locally; perhaps you can use this as an initial proof of concept before landing the data in a formal storage service. Storing the data in S3 (or a similar service) instead of on your local computer is definitely recommended for the long term.

Screenshot by author. This is what your files should look like stored locally by the end of this section.

The diagram you see below will be a recurring image throughout this project. Highlighted in GREEN is the section of the project we will be covering.

Diagram by author. Let’s get the data!!

But First, There Are Some Admin Steps

  1. Get API keys and tokens from the Twitter Developer Portal. If you are unsure how to obtain that info, feel free to read my short post on Getting Twitter API Keys and Tokens.
  2. Store those keys into a config file. Here is a super simple example of my own config.
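
The config can be as simple as a small Python module holding the four credentials. The variable names below are placeholders and only need to match whatever your main script imports; keep this file out of version control.

# config.py -- variable names are placeholders; keep this file out of source control
api_key = "YOUR_API_KEY"
api_secret = "YOUR_API_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"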

For my development, I just wanted to pull a single Twitter user’s tweets over time, so I mostly use the API.user_timeline method.

In the Python script below, I am pulling tweets by Elon Musk (twitter_id = 44196397), but you can change that variable to whichever account you choose. There are a lot of free websites for looking up an account’s numeric Twitter ID, or you can use the handy-dandy Postman Collection that Twitter provides. I used this Postman Collection for my initial testing of the API endpoints.
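
Here is a minimal sketch of that script, assuming Tweepy’s standard v1.1 API wrapper. The folder path and file naming are placeholders, so point them at whatever local folder you plan to stage from later.

import json
import os
from datetime import datetime

import tweepy

from config import api_key, api_secret, access_token, access_token_secret

# Authenticate with the keys and tokens stored in config.py
auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Pull the most recent 100 tweets for a single user (44196397 = Elon Musk)
twitter_id = 44196397
tweets = api.user_timeline(user_id=twitter_id, count=100)

# Write one JSON object per line to a timestamped file in a local folder
folder = r"xxx\data"  # placeholder -- replace with your own local folder
file_name = "rawtweets_" + datetime.now().strftime("%Y%m%d%H%M%S") + ".json"
with open(os.path.join(folder, file_name), "w") as f:
    for tweet in tweets:
        f.write(json.dumps(tweet._json) + "\n")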

At the end of the script, the returned JSON data is written to local files on my computer. Be sure to update the folder path to wherever you want the files to land.

If you run print(tweet._json), you should see the tweets that will be printed into your JSON file as seen in the screenshot below. I don’t recommend printing this if you expect an especially large volume of tweets.

Here is what the files should look like once they are stored locally in your folder. I actually started this mini-project using a now-banned Twitter account (as you can tell by the file names…) so I had to pivot to another Twitter account.

To conclude this first part of the project, you should now have local files with your Twitter data ready to go. The next step is to load the raw data into Snowflake.

Land Raw Twitter Data into Snowflake

Okay, so you have your raw data locally, but you probably want to share it with other team members, or at least not have it stored solely on your computer. What’s the point of having tons of data if no one can access it for analysis?

The next section walks you through how to land the local files into a Snowflake internal named stage which will serve as your historical storage for raw data. We will do this by using native Snowflake functions.

Diagram by author. Let’s get the raw data into Snowflake!

If you successfully pulled data from Twitter into JSON files and opened up any of them, you should see the following JSON objects and corresponding key/value pairs.

Screenshot by author. This is an example of the raw JSON data.

To load these JSON files into Snowflake, you will use a stage. A stage object lives within a database, so first you will need to create a database in Snowflake along with any additional foundational objects. My SQL script is broken down line by line below, with condensed, illustrative snippets of the key statements along the way.

To break down the code:

  • lines 4–9: Creates the initial database and objects that are used in the project. I chose to set up the stg/dm/rpt layers as schemas but you can change that based upon your personal or professional experience.
Screenshot by author. This is what you should see after creating the database and schemas.
  • lines 11–12: To load the JSON data into a named stage, we need to define a file format. You can do that directly in the creation of the stage but I chose to create a reusable file format.
Screenshot by author.

To see file formats, navigate to the Databases Tab > SOCIAL_MEDIA_DB > File Formats Tab. You can also run the following script in your worksheet.

show file formats;
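
As a rough, illustrative sketch, the setup so far might look like the following. The database and schema names are the ones used throughout this article, while the file format name and its options are placeholders of my own.

-- Create the database and the stg/dm/rpt schemas
create or replace database social_media_db;

create schema social_media_db.stg;
create schema social_media_db.dm;
create schema social_media_db.rpt;

use schema social_media_db.stg;

-- Reusable file format for the raw JSON files (one tweet object per line).
-- If your files hold a single JSON array instead, add strip_outer_array = true.
create or replace file format twitter_json_format
  type = 'json';
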
  • lines 14–15: Here I create an internal named stage called twitter_stage and use the previously created file format.
Screenshot by author.

To see stages, navigate to the Databases Tab > SOCIAL_MEDIA_DB > Stages Tab. You can also run the following script in your worksheet.

show stages;
Screenshot by author.
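
The stage itself can be a one-liner that reuses the file format from above (again, the file format name is my own placeholder):

create or replace stage twitter_stage
  file_format = (format_name = 'twitter_json_format');

Once you run the PUT command described next, list @twitter_stage; is a quick way to confirm that the files actually arrived.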

After you create the stage in lines 14–15, you will need to run a PUT command to load the files:

  • Open command prompt
  • Connect to your Snowflake account by running the below code. I set up the trialaccount connection details in my SnowSQL config file.
snowsql -c trialaccount
  • Change your context as needed, then run the following PUT command. Be sure to change the folder path to the one you decided on in the first section of this article.
put file://xxx\data\rawtweets* @twitter_stage;
Screenshot by author. This is what you should see after running the PUT command.
  • lines 17–25: Here I create a warehouse named jobrunner_wh which I will use for running my future scheduled tasks. It is up to you whether you want to create a new warehouse or use the default warehouse. I specifically chose to create a new warehouse to separate and monitor resources.
  • lines 30–36: This DDL creates the staging table in the previously created stg schema. Our raw JSON only has one column ($1) but Snowflake provides additional metadata that you can pull and the ability to parse basic information using the parse_json function.
  • lines 39–40: This creates the Snowflake stream that will track changes to the stg_raw_twitter table. Because I am pulling the last 100 tweets from the Tweepy user_timeline method, some of the tweets are redundant. I do not want those redundant tweets to cause dupes downstream in my curated and consumption tables. The stream will track inserts, updates, and deletes to the staging table which we will use when loading the data mart table.
  • lines 43–59: This part of the script is the copy command you can run to load data from our named stage twitter_stage into our staging table stg_raw_twitter. We will use this again later when we automate all of the tasks (a condensed sketch of the staging table, stream, and COPY statements follows this breakdown).
Screenshot by author. This is what you should see after running the COPY INTO command.
  • line 62: Use this select statement to see the changes to stg_raw_twitter that are captured by the stream stg_raw_twitter_stream that we created in line 39. Streams essentially track changes to the previous version of the table.
Screenshot by author. Highlighted in yellow are the changes made to stg_raw_twitter.
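
Pulling the last few bullets together, here is a condensed, illustrative version of those statements. The column names, warehouse sizing, and COPY options are simplified placeholders; adjust them to your own needs.

use schema social_media_db.stg;

-- Warehouse used later by the scheduled tasks
create or replace warehouse jobrunner_wh
  warehouse_size = 'xsmall'
  auto_suspend = 60
  auto_resume = true
  initially_suspended = true;

-- Staging table: one VARIANT column for the raw JSON plus file metadata
create or replace table stg_raw_twitter (
  raw_tweet        variant,
  file_name        string,
  file_row_number  number
);

-- Stream that tracks changes to the staging table
create or replace stream stg_raw_twitter_stream on table stg_raw_twitter;

-- Load the staged JSON files into the staging table (the stage's file format is used)
copy into stg_raw_twitter
from (
  select
    $1,                        -- the raw tweet as a VARIANT
    metadata$filename,         -- which staged file the row came from
    metadata$file_row_number   -- the row's position within that file
  from @twitter_stage
)
on_error = 'continue';

-- See the changes captured by the stream
select * from stg_raw_twitter_stream;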

There you go! You have landed raw data in the form of JSON into Snowflake and begun tracking changes to that data. Next, we’ll build out more analyst-friendly tables and summarize the end-to-end build.

Parse, Load, and Schedule Twitter Data in Snowflake

Finally, let’s parse the data and load it from your staging layer > data mart layer > reporting layer. This load of data will be scheduled using Snowflake streams and tasks.

Diagram by author. Let’s make the data more analyst-friendly!

At this point, you should now have data in a table called stg_raw_twitter. This data remains mostly raw JSON, which still requires parsing for downstream analysis if you are using common data visualization software.

In this section, we will do the following:

  1. Create a data mart (DM) table that has some parsed data from the raw JSON
  2. Create a task to load that DM table
  3. Create a reporting (RPT) table that aggregates data from the DM table
  4. Create a task to load that RPT table
  5. Outline the end-to-end data flow based upon the tasks and stream we have created in this project

To break down the code (a condensed, illustrative sketch of these statements follows the breakdown):

  • lines 4–12: This creates the data mart table dm_tweets that can eventually be joined to other tables to produce a reporting table. For simplicity in this project, I only had one DM table.
  • lines 15–78: This is a heftier section of code but it essentially does an upsert into the DM table from the stream that we created against the STG table. This means that it will update any changed rows, insert any new rows, and delete any removed rows.
Screenshot by author of data output in Snowflake Web UI.
  • lines 60–107: This creates a task named load_dm_tweets that runs the above MERGE INTO statement. This task is dependent on the successful load of the load_stg_tweets task we created in the previous section.
  • lines 112–120: Here, we create the reporting table rpt_social. Now, this part is totally up to your discretion but I created a simple aggregate table to get the distinct number of users and tweets per day.
Screenshot by author of data output in Snowflake Web UI.
  • lines 123–152: This section creates a task named load_rpt_social to CREATE or REPLACE the rpt_social table using a pretty basic SELECT statement. This is where you can change any of the logic as you please. Just make sure the DDL for the reporting table is modified to match your changes.
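
Here is a condensed, illustrative version of those statements. The parsed columns, the date handling, and the task options are simplified placeholders, but the overall pattern (data mart table, MERGE task, reporting task) is the same.

use schema social_media_db.dm;

-- Data mart table with a handful of parsed columns (adjust to the fields you care about)
create or replace table dm.dm_tweets (
  tweet_id        number,
  tweet_date      date,
  tweet_text      string,
  user_id         number,
  user_name       string,
  retweet_count   number,
  favorite_count  number
);

-- Task that merges new stream rows into the data mart table. It assumes a task named
-- load_stg_tweets already wraps the COPY command from the previous section on a schedule
-- and is resolvable from the current schema.
create or replace task load_dm_tweets
  warehouse = jobrunner_wh
  after load_stg_tweets
as
merge into dm.dm_tweets t
using (
  select
    raw_tweet:id::number               as tweet_id,
    -- Twitter's created_at looks like "Wed Oct 10 20:19:24 +0000 2018"; TRY_ keeps a
    -- format mismatch from failing the load, but verify the mask against your data
    try_to_timestamp_tz(raw_tweet:created_at::string,
                        'DY MON DD HH24:MI:SS TZHTZM YYYY')::date as tweet_date,
    raw_tweet:text::string             as tweet_text,
    raw_tweet:user.id::number          as user_id,
    raw_tweet:user.screen_name::string as user_name,
    raw_tweet:retweet_count::number    as retweet_count,
    raw_tweet:favorite_count::number   as favorite_count
  from stg.stg_raw_twitter_stream
  -- the stream also exposes metadata$action and metadata$isupdate, which a fuller
  -- merge can use to handle updates and deletes explicitly
  where metadata$action = 'INSERT'
  qualify row_number() over (partition by raw_tweet:id order by file_name desc) = 1
) s
on t.tweet_id = s.tweet_id
when matched then update set
  t.tweet_text     = s.tweet_text,
  t.retweet_count  = s.retweet_count,
  t.favorite_count = s.favorite_count
when not matched then insert
  (tweet_id, tweet_date, tweet_text, user_id, user_name, retweet_count, favorite_count)
values
  (s.tweet_id, s.tweet_date, s.tweet_text, s.user_id, s.user_name, s.retweet_count, s.favorite_count);

-- Task that rebuilds the simple reporting aggregate after the data mart load finishes
create or replace task load_rpt_social
  warehouse = jobrunner_wh
  after load_dm_tweets
as
create or replace table rpt.rpt_social as
select
  tweet_date,
  count(distinct user_id)  as distinct_users,
  count(distinct tweet_id) as distinct_tweets
from dm.dm_tweets
group by tweet_date;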

Task Order of Operations

We have now created three tasks that depend on one another. The diagram below shows the high-level relationship.

Diagram by author. Straightforward stuff.
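
One operational note: tasks are created in a suspended state, so once everything is in place, resume each one, starting with the children and finishing with the scheduled root task (run these from the schema where the tasks live):

alter task load_rpt_social resume;
alter task load_dm_tweets resume;
alter task load_stg_tweets resume;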

After running the SQL code from the project, you should now have the following objects in addition to the initially created database objects:

Stage(s)

  • twitter_stage

Table(s)

  • stg_raw_twitter
  • dm_tweets
  • rpt_social

Stream(s)

  • stg_raw_twitter_stream

Task(s)

  • load_stg_tweets
  • load_dm_tweets
  • load_rpt_social

To do a quick recap, this is what you have accomplished in this small project:

  1. Downloaded raw Twitter data using the Tweepy API class
  2. Created the initial database objects
  3. Loaded raw data into those initial database objects
  4. Instituted change tracking on your staging table
  5. Created and loaded downstream tables
  6. Created task dependencies to ensure downstream data is loaded

Considerations

This project can be improved in several ways. For example, rather than scheduling the load_stg_tweets task to load on a specific frequency, this could be loaded based upon the arrival of new data in twitter_stage. The data model itself is very simplistic and might not include the specific KPIs you or your team cares about. Additionally, this is not primed for production purposes and I’m sure there are other items to improve upon so feel free to share them!

Conclusion

Congrats if you made it through this Twitter to Snowflake project, and I hope it was helpful to you. Drop a line if you have any questions or comments. I’ve included a revision history below in case you return to this in the future.


Disclosures

  • TWITTER, TWEET, RETWEET, and the Twitter logo are trademarks of Twitter, Inc., or its affiliates.
  • All opinions are my own and not reflective of past, present, or future employment.

Revision History

  1. 17 January 2021 — This article was originally published on 15 January 2021 as a series of articles. However, thanks to some very valid feedback, I have consolidated the series into one long article.
