Pandas Tutor: Using Pyodide to Teach Data Science at Scale

Hi, we’re Sam Lau and Philip Guo, and we teach data science classes at UC San Diego. In this guest post we’ll tell you about our free educational tool, Pandas Tutor, that helps students learn data science using the popular pandas library. The above screenshot shows how you can use it to write Python and pandas code in a web-based editor and see visualizations of what your code does step-by-step.

After giving an overview of Pandas Tutor, we’ll dive into a case study of how we ported it to Pyodide and why we feel that Pyodide is amazing for educational use cases like ours.

What is Pandas Tutor?

Pandas is now an industry-standard tool for data science, but as teachers we’ve seen firsthand how it can be hard for students to learn due to its complex API. For example, this code selects, sorts, and groups values from a dogs dataset to produce a summary table. But if you run this code in a Jupyter notebook, all you see is the end result:

Wait, what exactly did pandas do to turn the dogs dataset into this output summary table? It’s not at all clear from what you see above.

If you run this same code in our Pandas Tutor tool, it will show you what’s going on step-by-step as your code selects, sorts, groups, and calculates the group medians:

Try this example live here.

Instructors can use Pandas Tutor to aid their teaching, and students can use it to understand and debug their homework assignments.

Why did we port Pandas Tutor to Pyodide?

The original version of Pandas Tutor ran the user’s code on our Linux server by spinning up a new Docker container for every code execution request. Its main limitation was slowness: from the time a user hit the ‘Run’ button, it took up to 5 seconds for the server to run their code, produce a step-by-step execution trace, and send it over the network to their browser. With many concurrent users, latency could reach 10 seconds, or the server could crash from running too many Docker containers. This made for a frustrating user experience when an instructor was lecturing or when hundreds of students were using the tool at the same time to visualize in-class code examples.

Beyond user experience, this server-based setup was also a pain for us to maintain since we had to carefully configure Linux and Docker to handle ever-increasing scale. Remember, we’re university instructors, not professional software developers or DevOps staff. That means we don’t have the know-how or institutional resources required to maintain a production-scale server deployment setup.

That’s why we were thrilled when we discovered Pyodide, since it could help us overcome all these limitations. Specifically, now that we have ported Pandas Tutor over to use Pyodide:

Users see their visualizations instantly since all their code runs in-browser (after an initial Pyodide startup time). No more waiting 5-10 seconds for server-side sandboxed execution and network data transfer every time they attempt to run a piece of code.
Relatedly, with near-zero latency we don’t even need a ‘Run’ button anymore! Pandas Tutor continually live-runs the user’s code as they edit so that they always see the most up-to-date visualizations.
We no longer need to maintain a complex server-side Linux and Docker setup. Rather, we can serve the app as a JavaScript bundle on a static hosting service or CDN.
It also becomes easier to secure our server since we’re no longer running untrusted user-written code on it.
Pandas Tutor now scales effortlessly no matter how many concurrent users are on the site … if there are 500+ students in a large lecture hall using it at the same time, no problem! After all, everyone is running Python code inside their laptop’s browser due to the magic of Pyodide. This kind of scale was impossible with our server-based setup.
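
Live-running code on every edit is commonly paired with a short debounce, so that the code re-runs only once the user pauses typing. We have not described how Pandas Tutor schedules its re-runs, so the following is only a minimal sketch of one common approach; the names debounce, runAndRender, and onEditorChange, as well as the 300 ms delay, are assumptions made for illustration.

```javascript
// Hypothetical helper: returns a wrapper that delays calling `fn` until
// `delayMs` milliseconds have passed without another call.
function debounce(fn, delayMs) {
  let timer = null;
  return function (...args) {
    if (timer !== null) {
      clearTimeout(timer); // a newer edit supersedes the pending run
    }
    timer = setTimeout(() => {
      timer = null;
      fn(...args);
    }, delayMs);
  };
}

// Illustrative usage: `runAndRender` stands in for whatever re-runs the
// user's code and redraws the visualization.
const runs = [];
const runAndRender = (code) => runs.push(code);
const onEditorChange = debounce(runAndRender, 300);

// Three rapid edits lead to a single run with the latest code.
onEditorChange("df");
onEditorChange("df.sort");
onEditorChange("df.sort_values('age')");
```

In a real page, onEditorChange would be wired to the code editor's change event, and the delay would be tuned to how long a typical run takes.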

In sum, we feel that Pyodide is tremendously powerful for scalable educational use cases like ours since it allows anyone with a web browser to quickly try Python and its huge ecosystem of packages without any installation or system maintenance.

How did we port it and what issues did we face?

It was remarkably pleasant to port Pandas Tutor over to Pyodide from our original server-based setup. With some help from Pyodide core developer Roman Yurchak we found a way to do it without changing our Pandas Tutor codebase at all.

Step 1: create a self-contained Pandas Tutor wheel

The recommended way to load custom Python code into Pyodide is by using a wheel, so we first created a wheel to package up the Pandas Tutor backend code. That was straightforward since it’s pure-Python and doesn’t require any WASM code compilation.

However, we ran into our first issue right away: Since Pandas Tutor has almost a dozen package dependencies, we originally thought we could use micropip to install them all, as the Pyodide documentation recommends. Unfortunately, some packages couldn’t be installed with micropip since they didn’t have wheels available on PyPI (e.g., micropip.install('docopt') doesn’t work). Our solution was to bundle (i.e., inline) all dependencies into a self-contained Pandas Tutor wheel so that our web frontend code needs only one call to micropip instead of almost a dozen separate calls:

// micropip must be loaded before it can be imported
await pyodide.loadPackage("micropip");
const micropip = pyodide.pyimport("micropip");
await micropip.install("https://pandastutor.com/build/pandastutor-1.0-py3-none-any.whl");

This approach also has the nice side-effect of pinning all the dependency versions for better API stability. The only dependency we didn’t bundle was pandas because that requires WASM-compiled code. But fortunately Pyodide comes with pandas built-in!

Step 2: import the Pandas Tutor module in JavaScript

We added one line to our frontend JavaScript to import main.py from Pandas Tutor when the page loads (after the wheel gets installed from Step 1):

const pandastutor_py = pyodide.pyimport("pandas_tutor.main");

This is a synchronous call that imports pandas_tutor and all dependencies when the web app first loads, so it takes 3-5 seconds to finish since pandas and other libraries must be loaded. Unfortunately, browser caching can’t speed things up here since even if all files were cached locally, Pyodide still needs to run the import and initialization code.
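
Putting Steps 1 and 2 together, the page-load sequence might look roughly like the sketch below. It uses only the Pyodide calls loadPyodide, loadPackage, and pyimport plus micropip.install. The function name initPandasTutor, passing loadPyodide in as a parameter, and the elapsed-time measurement are assumptions added for illustration, not a copy of our frontend code.

```javascript
// Sketch of the page-load sequence from Steps 1 and 2.
// `loadPyodide` is passed in so this function does not depend on how the
// Pyodide script was added to the page; `wheelUrl` is the wheel from Step 1.
async function initPandasTutor(loadPyodide, wheelUrl) {
  const start = Date.now();
  const pyodide = await loadPyodide();

  // pandas ships with Pyodide; micropip must be loaded before pyimport.
  await pyodide.loadPackage(["pandas", "micropip"]);

  // Install the self-contained wheel with a single micropip call.
  const micropip = pyodide.pyimport("micropip");
  await micropip.install(wheelUrl);

  // The slow, synchronous step: runs the import and initialization code.
  const pandastutorPy = pyodide.pyimport("pandas_tutor.main");

  // elapsedMs is a rough measure of the startup cost discussed above.
  return { pandastutorPy, elapsedMs: Date.now() - start };
}
```

Logging elapsedMs is a cheap way to see how much of the startup delay remains once the downloads are served from the browser cache.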

One feature that might speed this up would be for Pyodide to serialize a binary memory snapshot taken right after everything is initialized. The user’s browser could then load that snapshot directly on the first page visit and cache it for future visits. (But maybe that snapshot would be too big to load efficiently. Related GitHub Issues.)

Step 3: call Pandas Tutor backend code from JavaScript

Now we can call the Pandas Tutor backend from our frontend JavaScript with just one line:

// 1) frontend grabs userCode from the user's in-browser code editor
// 2) run the user's code to produce an execution trace ...
const executionTrace = pandastutor_py.run_user_code(userCode);
// 3) frontend processes executionTrace to produce step-by-step visualizations

This line runs the user’s code in Pyodide using the Pandas Tutor backend, which analyzes it to produce an execution trace that the frontend uses to render step-by-step visualizations. The coolest thing about this line of code is that it seamlessly calls a Python function directly from JavaScript.

Before we ported to Pyodide, this process involved an HTTP request to the server, which spun up a new Docker container, ran the Pandas Tutor backend inside it, and transferred the resulting JSON execution trace over the internet to the user’s browser. There were multiple points of possible failure, and robust exception handling was a pain to implement. Now everything happens with a single function call in-browser, and exception handling is just a JavaScript try...catch!
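
As a rough illustration of that try...catch, the sketch below wraps the backend call so the frontend always gets back a plain result object. The helper name runSafely and the shape of the returned object are assumptions for this example; the only thing it relies on is that an uncaught Python exception reaches JavaScript as a thrown error.

```javascript
// Hypothetical wrapper: turns a thrown error into a plain result object
// so the UI can show either the visualization or an error message.
function runSafely(runUserCode, userCode) {
  try {
    return { ok: true, trace: runUserCode(userCode) };
  } catch (err) {
    // When the call goes through Pyodide, the error message normally
    // carries the Python traceback text.
    const message = String(err && err.message ? err.message : err);
    return { ok: false, message };
  }
}

// Stand-in for the real backend call, used here only to show both paths.
const fakeBackend = (code) => {
  if (code.includes("dfs")) {
    throw new Error("NameError: name 'dfs' is not defined");
  }
  return { steps: 3 };
};

const good = runSafely(fakeBackend, "df.head()");
const bad = runSafely(fakeBackend, "dfs.head()");
```

With the real backend, the first argument would be something like (code) => pandastutor_py.run_user_code(code).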

Bonus: other fine-tuning and ongoing work

Graceful degradation: To continue supporting older browsers, our frontend code detects if there are any problems loading the Pyodide version of Pandas Tutor and, if so, automatically falls back to the original server-side version. Also, loadPyodide crashes on some older mobile devices (presumably due to limited memory) and we can’t seem to catch the exception to recover, so we disable it by default on mobile and add a toggle to let the user explicitly enable it. (This is also considerate since it prevents the page from preemptively downloading large amounts of data on users’ mobile phone plans.)
Auto-importer: We want users to be able to import any package when writing their code in Pandas Tutor. But Pyodide’s loadPackagesFromImports works only on packages that come with Pyodide. To give users more flexibility, we wrote an auto-importer that automatically calls micropip.install() whenever the user’s code tries to import an unknown package. These package wheels get downloaded from PyPI on-demand and cached in the browser. (Note that this works only if the package name on PyPI matches the name in the import statement.)
Preloading: As soon as the user visits the pandastutor.com landing page, it starts downloading Pyodide and selected packages (e.g., pandas, numpy) in the background. That way, by the time the user goes to the visualizer page, these packages may already have been cached in their browser, so they can start working right away. (Thanks to Michael Kennedy from the Talk Python podcast for this suggestion!)
Caching file downloads: [work-in-progress] Some pandas code uses functions like read_csv or read_html to download data files from the web. It would be great to cache those files locally in-browser so that when the user re-runs that code in the future, Pandas Tutor doesn’t need to repeat the same (potentially-large) downloads. (The browser may already cache Pyodide-issued fetch requests, but using the Pyodide/Emscripten filesystem API may give us finer-grained control over file caching.)
Start-up speed: [work-in-progress] Even if all packages are cached, it still takes a few seconds to initialize Pandas Tutor at page load time. We plan to profile our code and might try using pyc-wheel to compile *.py in our wheel into *.pyc files (related GitHub Issues).
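
To make the auto-importer idea above more concrete, here is a rough sketch of how it could work on the JavaScript side. This is not the Pandas Tutor implementation: findImportedModules and installMissing are hypothetical names, the regex scan is only a heuristic, and the isAvailable and install callbacks are assumptions (install would typically wrap micropip.install).

```javascript
// Heuristic scan for top-level module names in `import x` and
// `from x import y` statements. A line-based regex cannot handle every
// case (e.g. import text inside a multi-line string).
function findImportedModules(pythonCode) {
  const modules = new Set();
  for (const rawLine of pythonCode.split("\n")) {
    const line = rawLine.split("#")[0]; // drop trailing comments
    const fromMatch = line.match(/^\s*from\s+([A-Za-z_]\w*)/);
    if (fromMatch) {
      modules.add(fromMatch[1]);
      continue;
    }
    const importMatch = line.match(/^\s*import\s+(.+)$/);
    if (importMatch) {
      // `import a.b as c, d` yields the top-level names a and d.
      for (const part of importMatch[1].split(",")) {
        const name = part.trim().split(/[\s.]/)[0];
        if (/^[A-Za-z_]\w*$/.test(name)) modules.add(name);
      }
    }
  }
  return [...modules];
}

// Tries to install every imported module that is not already available.
// `isAvailable(name)` and `install(name)` are supplied by the caller.
async function installMissing(pythonCode, isAvailable, install) {
  const installed = [];
  for (const name of findImportedModules(pythonCode)) {
    if (isAvailable(name)) continue;
    try {
      await install(name); // e.g. (name) => micropip.install(name)
      installed.push(name);
    } catch (err) {
      // Typical causes: no compatible wheel on PyPI, or a PyPI name that
      // differs from the import name.
      console.warn(`could not install ${name}`);
    }
  }
  return installed;
}
```

The name-matching caveat shows up directly here: for `from bs4 import BeautifulSoup` the scan yields bs4, while the PyPI package is named beautifulsoup4, so a production version would need a small table of known aliases.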

The future: Pandas Tutor + Pyodide everywhere!

Pyodide is really exciting for us since it opens the door to embedding Pandas Tutor into any existing website so that anyone can learn data science within the context of their favorite websites without installing anything.

To achieve this, we’re building a bookmarklet that injects Pandas Tutor + Pyodide into any webpage to add an inline code editor that can access its DOM. This will allow students to write Python code to, say, parse HTML tables using BeautifulSoup, clean up that raw data, and plot it. Here’s a mock-up screenshot of injecting Pandas Tutor (in the dashed red box) into a Wikipedia page to explore and plot country populations for a social studies class:

(This idea originally came from the 2017 research paper, DS.js: Turn Any Webpage into an Example-Centric Live Programming Environment for Learning Data Science, by Xiong Zhang and Philip Guo. Download PDF)

Next, we can take this idea even further … wouldn’t it be cool if the official pandas API docs were runnable? With this bookmarklet, we can inject Pandas Tutor + Pyodide into any API docs to automatically parse their code examples, visualize them, and let visitors edit that code live to explore different parameter values:

We can also do the same with Stack Overflow or other help forum sites because they often have self-contained code examples that can be directly visualized:

That’s all for now, but this is just the beginning of our journey with Pyodide. Stay tuned for more in the coming months as we use it to build tools to teach data science at scale!

– Sam and Philip
