hpr3328 :: Pandas Part 2

Enigma continues his discussion of his favorite Python module, Pandas.

Hosted by Enigma on Wednesday, 2021-05-05. Flagged as Clean and released under a CC-BY-SA license.
Tags: python, pandas, Data, Data Science. Comments: 2.

Duration: 00:11:59

Part of the series: A Little Bit of Python.

Initially based on the podcast "A Little Bit of Python", by Michael Foord, Andrew Kuchling, Steve Holden, Dr. Brett Cannon and Jesse Noller. https://www.voidspace.org.uk/python/weblog/arch_d7_2009_12_19.shtml#e1138

Now the series is open to all.

Part two in the For the Love of Data series. Enigma covers part 2 of Pandas.
The following topics are discussed:

1) Another way to apply a condition to a field
2) Creating a DataFrame from a dictionary
3) Appending one DataFrame to another
4) Joining DataFrames with merge and join
5) Writing output to CSV
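
The topics above can be sketched roughly as follows. This is not the episode's sample code: the column names (`Name`, `Score`, `Team`), the data, and the `score_level` helper are invented for illustration. `pd.concat` is used for appending because `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0.

```python
import io
import pandas as pd

# Hypothetical data; the names and scores are invented for this sketch.
scores = pd.DataFrame({"Name": ["Ann", "Bob"], "Score": [9.1, 6.4]})
more_scores = pd.DataFrame({"Name": ["Cy"], "Score": [7.8]})

# Append one DataFrame to another (pd.concat replaces the older DataFrame.append).
all_scores = pd.concat([scores, more_scores], ignore_index=True)

# Apply a condition to a field with a row-wise function (hypothetical helper).
def score_level(row):
    return "High" if row["Score"] >= 7 else "Low"

all_scores["Level"] = all_scores.apply(score_level, axis=1)

# merge joins on a shared column; join aligns on the index instead.
teams = pd.DataFrame({"Name": ["Ann", "Bob", "Cy"], "Team": ["Red", "Blue", "Red"]})
merged = all_scores.merge(teams, on="Name", how="left")
joined = all_scores.set_index("Name").join(teams.set_index("Name"))

# Write the result as CSV; passing a file path such as "scores.csv" works the same way.
buffer = io.StringIO()
merged.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; in practice a file path is the more common argument to `to_csv`.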

Part 2 Sample code
Follow me on twitter @Ed_N1gma

Come chat on irc.freenode.net #hackerexchange

Comments

Comment #1 posted on 2021-05-05 19:49:39 by Mr. Young

Another great show

Thanks for another great show. I look forward to your next one.

As to your use of `df.apply` in lieu of `np.select`, here's my 2 cents:

`apply` is more readable in most cases, but `np.select` is more performant. When performance matters, or when the dataset is very large, you might want to use `np.select`. For instance, on your example here, `np.select` ran roughly 8x faster on my PC.

```
%timeit df.apply(Scorelevel, axis=1)

448 µs ± 2.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

```
%timeit np.select(cond_list, choice_list, default='Require Activation')

55.6 µs ± 440 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

In many cases readability can trump the need for speed, but I just wanted to give a counterpoint.
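
To make the comparison concrete, here is a minimal sketch of the two approaches side by side. The episode's `Scorelevel` function, `cond_list`, and `choice_list` are not shown on this page, so the `score_level` function, the thresholds, and the labels below are assumptions made for illustration. Only the default `'Require Activation'` comes from the timing commands above.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; the real dataset from the episode is not shown here.
df = pd.DataFrame({"Score": [9.5, 8.2, 6.0, 3.1]})

# Assumed stand-in for the episode's Scorelevel function (labels are invented).
def score_level(row):
    if row["Score"] >= 9:
        return "Excellent"
    if row["Score"] >= 8:
        return "Good"
    if row["Score"] >= 6:
        return "Fair"
    return "Require Activation"

# Row-wise: Python calls score_level once per row.
via_apply = df.apply(score_level, axis=1)

# Column-wise: np.select evaluates whole arrays; the first matching condition wins.
cond_list = [df["Score"] >= 9, df["Score"] >= 8, df["Score"] >= 6]
choice_list = ["Excellent", "Good", "Fair"]
via_select = np.select(cond_list, choice_list, default="Require Activation")

# Both approaches should label every row identically.
same = list(via_apply) == via_select.tolist()
print(same)
```

Because `np.select` takes the first condition that matches, the conditions can be written as simple lower bounds in descending order instead of closed ranges.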

Comment #2 posted on 2021-05-05 19:58:07 by Mr. Young

One more speed gain

If you really want to fly, you can convert the pandas Series to NumPy arrays first. For your example, it ran roughly 2x faster than `np.select` on the pandas Series.

Example:
```
# Pull the column out as a NumPy array once and reuse it in every condition
score = df['Score'].values
cond_list = [score >= 9,
             (score >= 8) & (score < 9),
             (score >= 7) & (score < 8),
             (score >= 6) & (score < 7),
             (score >= 5) & (score < 6),
             (score >= 4) & (score < 5)]

%timeit np.select(cond_list, choice_list, default='Require Activation')
23.5 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
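
A self-contained sketch of the same idea, using hypothetical data and invented labels. It uses `Series.to_numpy()`, which the pandas documentation recommends over `.values`, and converts the column once before building the conditions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Score": [9.5, 7.2, 4.4, 1.0]})  # hypothetical data

# Convert the column once; every comparison below is then plain NumPy.
score = df["Score"].to_numpy()

cond_list = [score >= 9, (score >= 7) & (score < 9), (score >= 4) & (score < 7)]
choice_list = ["High", "Medium", "Low"]  # invented labels
levels = np.select(cond_list, choice_list, default="Require Activation")

# The conditions are plain ndarrays rather than pandas Series.
print(type(cond_list[0]).__name__, levels.tolist())
```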
