Beat the Dealer by Ed Thorp, published in 1962 (1966 rev. ed) laid out a winning strategy for playing Blackjack through card counting. While an earlier book Playing Blackjack to Win, by Roger Baldwin et al (1957) exists, and Thorp gives them due credit, Beat the Dealer is certainly the book that popularized the concept and sold over 1,000,000 copies.

When Professor Thorp did his calculations, computers were a little bit slower. I’d like to think he used the 1960’s version of a desktop computer: the Elliot Brothers 903 Elliot 903 description .

This machine had up to 64K of memory (core, not modern-day RAM), 18-bit words and a cycle time of 6 microseconds. The desktop computer I will use to recompute the basic Blackjack strategies has a 4 GHz clock rate, or 0.25 nanosecond cycle time (that’s about 24,000 times faster then the Elliot 903), not to mention (a) 64-bit words, (b) a million times more memory, (c) many more instructions than the handful the Elliot had, and (d) 8 cores running in parallel. Plus I can program in Python instead of FORTRAN or Assembly.

Unfortunately, Thorp did not use the Elliott. It was probably too slow. In his book’s acknowledgment, Thorp says he is grateful for the use of MIT’s IBM 704 computer. That’s this one:

What I would like to do is recalculate the Basic Strategy that Thorp outlined in his book. I will use a technique from Reinforcement Learning, by Sutton & Barlow. Rich Sutton is the father of reinforcement learning and I highly recommend his book as a well-written, clear and easy to understand introduction to the subject.

The technique we will implement is the one labeled “Monte Carlo ES (Exploring Starts)” in Chapter 5 of the second edition of Reinforcement Learning.

First, a little introduction to what we are doing. For the detailed description see Reinforcement Learning.

A strategy $\pi$ is a function that maps the current state (what we can observe) into an action that we can take. There is a very simple (at least conceptually) way to compute how good any particular strategy is. We can simulate many games using this strategy and compute the value of the game. The value of the game is simply the average result that we can expect. In our game, we will say each game can result in a -1 (loss), +1 (win), or +1.5 (win with a Blackjack, or a natural 21). Of course, the other way to calculate the value of a strategy would be to calculate the probability of each outcome, and multiply those probabilities times the corresponding outcome – the sum of these would be the expected value of the outcome, what we call the “value” of the game. If I give you a strategy for Blackjack, which would be a table of what to do based on what you can see at each deal of the cards, and asked you to calculate the probability of winning, you can immediately see that this would be fairly complicated. For each deal, you would have to compute the probability of getting that deal, then the probability of getting each additional card (if you hit), and so on, each time following the strategy I gave you. On the other hand, the Monte Carlo method is much simpler. I just have the computer play lots of games using that strategy, and just add up the results. I am much less likely to make a mistake in computing things, because I am just counting, not computing probabilities. Of course, the answer I get is not exact, and I may have to do a lot of simulations to get reasonable precision, but why not let the computer do all the work, instead of me?

So we can see how to compute the value of a given strategy $\pi$, but how would we go about finding the optimal strategy, usually written $\pi^*$? The idea is one of “policy improvement.” If we have a policy $\pi$, can we improve it a little bit? The answer is yes. The idea is that when we calculated the value of a particular strategy, we computed a function $q_\pi(s,a)$, which was the expected value (or return) when starting in state s and taking action a. The idea of policy improvement is that we make a slight change to $\pi$ by making it greedy with respect to the $q()$ function – that is

$$\begin{align}
\pi(s) = \arg \max_{a} q(s,a)
\end{align}$$

This says that we take the action $a$ at a state $s$ that maximizes the value, according to the current $q$ function. So we could see a way to keep improving our strategy: (1) compute the q for our best strategy so far, then (2) improve that strategy with a greedy choice of actions, and then keep repeating steps 1 and 2. The problem with this is that the step 1 takes a long time (forever if we want an exact answer), and thus each iteration is quite exhausting.

It turns out, that we can make each step shorter: we do, instead, the two steps: (1) move q towards the proper q for our current strategy $\pi$, and (2) improve $\pi$ slightly with respect to the new q.

This is the Monte Carlo ES algorithm and it can be shown to converge to the optimal strategy. We start with Reinforcement Learning Section 5.3 and make some slight modifications.

We use $Q_T$ as the total return for a state, action pair (over all our games) and $Q_C$ as the count, meaning the number of times we have visited that state, action pair. Thus we can compute the average return as $Q_T/Q_C$ for any state, action pair.

In the following, $S$ is the set of all states, and $A(s)$ is the set of all actions you can take for a particular state $s$.

Initialize, for all $s \in S, a \in A(s)$:
	set $Q_T(s,a)=0$ and $Q_C(s,a)=0$ for all $s,a$
	set $\pi(s)$ = some random action for all $s$
Repeat Forever:
	Choose $S_0 \in S$ and $A_0 \in A(S_0)$ such that all pairs have probability > 0
	Generate a game starting from $S_0, A_0 $, following $\pi$, with terminal reward $R$
	For each pair $s$, $a$ appearing in the game:
	$Q_T(s,a) \mathrel{+}=R$
	$Q_C(s,a) \mathrel{+}=1$
	Update the policy: $\pi(s)=\arg \max_{a} \frac{Q_T(s,a)}{Q_C(s,a)}$

Our implemention looks like this in Python. I have set it up so that it computes the strategy two times: (1) for decks=None (meaning infinite decks) and decks=1 (one deck).

Source Code

#!/usr/bin/env python
"""
Use Reinforcement Learning to compute Blackjack strategy
Following Reinforcement Learning: An Introduction, by Sutton & Barto

Copyright (C) 2022 Solverworld

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.

Classes Defined:
State is the state of the game
Deck is the deck
Game
cards are 1-10, 10=facecards, 1=ace
End Game values:
  -1  # lost (-2 if doubled)
  +1  # win
  +1.5 #win with black jack

Some issues:
* If we are looking for states with zero count, they will not be in our q[][] dict
  because they are only added when we count.
* We are also including state counts for impossible to reach states, e.g. soft 4. This
  is because we add the dictionary {'hit':0, 'stand':0} whenever we need to add just one
  of these.
* How do know if we are done?  What are some convergence metrics?  Could use visit counts for estimation of variance,
  Look at lowest visits count states, depends on probability of visiting that state as well.  Compute a smoothed
  game value?
* Have not added double, splitting of pairs.  Will need to worry about handling permitted actions at each state

This program will use the Monte Carlo ES (Exploring Starts) strategy from Sutton & Barlo,
every 2 million games played, it will: output the current strategy and game value to the screen, and the
* output the q-values to a pickle file
* output the current strategy and game value
"""
import collections
import pickle
import random
import argparse
import tqdm
import numpy as np
import time
import statistics
import copy
import sys

from dataclasses import dataclass
from typing import List

# random.seed(1)
# control thorp with global variable sutton - changes to hit on 18 vs Ace
sutton = False

parser = argparse.ArgumentParser(
  description="Computer learns to play blackjack",
  epilog='''
''')
#The eval option evaluates the Thorp strategy, otherwise we calculate the optimal strategy
parser.add_argument('--eval', action='store_true', help='do a policy eval')
options = parser.parse_args()

# A policy is a mapping State->action.  We will use State.hash() value as the key to avoid issues with
# immutability, etc.  Actions are: hit, stand, double, split
actions = ['hit', 'stand', 'double', 'split']


class GameOver(StopIteration):
  pass


def print_policy(policy, tag='', prev_policy=None):
  # print out the policy, compare to prev_policy if provided
  for soft in [False, True]:
    print(f'TABLE soft={soft} {tag}')
    for dealer in list(range(2, 11)) + [1]:
      print(f'{dealer:6}  ', end='')
    print()
    for your_tot in reversed(range(12, 21)):
      print(f'{your_tot}: ', end='')
      for dealer in list(range(2, 11)) + [1]:
        state = State(my_total=your_tot, dealer_shows=dealer, soft=soft)
        ch = '*' if prev_policy and policy[state] != prev_policy[state] else ' '
        print(f'{policy[state]+ch:7} ', end='')
      print()


class Deck:
  # The set of 13 cards
  the_cards = list(range(1, 11)) + [10, 10, 10]

  def __init__(self, decks=None, shuffle_limit=12):
    # Create a new deck with decks the number of decks.  None means infinite i.e. with replacement
    # For efficiency, keep the deck object around and draw repeatedly from it.  It will be reshuffled
    # when less than shuffle_limit cards are left.  Call start_new_game for this check to be done at the
    # start of a game if you do not want a reshuffle to happen mid-game; otherwise no need to call start_new_game ever
    self.decks = decks
    self.cards = None
    self.shuffle_limit = shuffle_limit
    self.shuffle()     # Assign self.cards

  def shuffle(self):
    if self.decks is not None:
      deck: List[int]
      deck = Deck.the_cards * 4 * self.decks
      # create the list of cards once so we can just pop off one for each card dealt
      random.shuffle(deck)
      self.cards = deck
    else:
      # the deck with replacement, i.e. infinite deck size.  Let's put 52 for consistency
      self.cards = random.choices(Deck.the_cards, k=52)

  def start_new_game(self):
    if len(self.cards) < self.shuffle_limit:
      self.shuffle()

  def deal(self):
    try:
      # EAFP
      return self.cards.pop(0)
    except IndexError:
      self.shuffle()
      return self.cards.pop(0)


class Game:
  def __init__(self, deck=None):
    # Must pass in a Deck object
    assert deck
    self.deck = deck
    self.dealer = [self.deck.deal(), self.deck.deal()]  # only first card is showing
    self.player = [self.deck.deal(), self.deck.deal()]
    self.reward = 0  # will be filled in when game is over
    self.player_total = 0
    self.soft = False
    self.update()

  def start(self):
    # Need to call this before playing in order to figure out if anyone wins before a play
    # and we want the object to exist, so can't raise exception in __init__
    assert len(self.player) == 2
    dealer_total = self.compute_dealer_total()
    if self.player_total == 21:
      if dealer_total != 21:
        self.reward = 1.5
        raise GameOver
      else:
        # tie
        self.reward = 0
        raise GameOver
    else:
      if dealer_total == 21:
        self.reward = -1
        raise GameOver

  def play(self, action):
    # player plays action
    if action == 'hit':
      self.player.append(self.deck.deal())
      self.update()
      if self.player_total > 21:
        self.reward = -1
        raise GameOver
      if self.player_total < 21:
        return
      # total is 21, so we are done (could we double down on soft 21?)
    # assume stand is other, game is over
    dealer_total = self.compute_dealer_total()
    while dealer_total <= 16:
      self.dealer.append(self.deck.deal())
      dealer_total = self.compute_dealer_total()
    if dealer_total > 21:
      # dealer bust
      self.reward = 1
      raise GameOver
    if dealer_total > self.player_total:
      self.reward = -1
      raise GameOver
    elif dealer_total < self.player_total:
      self.reward = 1
      raise GameOver
    else:
      self.reward = 0
      raise GameOver

  def __repr__(self):
    return f'dealer {self.dealer} player {self.player_total} {self.soft}'

  @property
  def state(self):
    # state=State(my_total=self.player_total,dealer_shows=self.dealer[0],soft=self.soft,num_cards=len(self.player))
    state = State(my_total=self.player_total, dealer_shows=self.dealer[0], soft=self.soft)
    return state

  @staticmethod
  def all_states():
    # return a list of all possible states
    ret = []
    for player in range(4, 21):    # 4-20, 21 means game is over (could we double down?)
      for dealer in range(1, 11):  # 1-10
        ret.append(State(my_total=player, dealer_shows=dealer, soft=False))
        if player >= 12:
          ret.append(State(my_total=player, dealer_shows=dealer, soft=True))
    return ret

  def update(self):
    # update the player total and soft attributes
    s = sum(self.player)
    self.soft = False
    if 1 in self.player and s <= 11:
      s += 10
      self.soft = True
    self.player_total = s

  def compute_dealer_total(self):
    s = sum(self.dealer)
    if 1 in self.dealer and s <= 11:
      s += 10
    return s

  def result(self):
    return self.reward


'''
Note: splitting is tricky, maybe create 2 games?  But results of those 2 games have to be combined
to give the value for the current state 
'''


# Contain the state of a blackjack game, we want it hashable so it can be used as a dict key
@dataclass(eq=True, frozen=True)
class State:
  my_total: int  # my total so far
  dealer_shows: int  # dealer upcard
  soft: bool  # whether my total is soft
  # num_cards: int     #number of cards I have (important for splitting)

  # def hash(self):
  #  return (self.my_total, self.dealer_shows,self.soft)


def print_statistics(q, tag=''):
  # print some stats on the q_counts[state][action] function
  values = [y for x in q.values() for y in x.values()]
  hist, bin_edges = np.histogram(values)
  print(f'histogram of state visit counts {tag}')
  for c, edge in zip(hist, bin_edges):
    print(f'{edge=:8.1f} {c:8d}')


def print_some_counts(q, tag=''):
  # sort the states by counts and print from each end of the list
  # we have to create a dict with a single key from the q[][] configuration
  # note that we will have some zero counts because we add both hit and stand
  # when we create a new entry.
  d = {(k1, k2): v2 for k1, v1 in q.items() for k2, v2 in v1.items()}
  keys = list(d)
  keys.sort(key=lambda x: q[x[0]][x[1]])
  print(f'high and low counts {tag}')
  for i in list(range(10))+list(range(-5, 0)):
    s1 = f'{keys[i][0]}'
    s2 = f'{keys[i][1]}'
    print(f'{s1:50} {s2:5}  {d[keys[i]]:10}')


def evaluate_progress(returns, num_games=0):
  # compute mean return and some statistics for it from list of returns
  # We divide returns into 10 groups of equal length, and compute the mean and stddev of the groups
  n = len(returns)//10
  # This little bit of zip python magic makes groups of length N
  means = []
  for group in zip(*(iter(returns),)*n):
    means.append(statistics.mean(group))
  m = statistics.mean(means)
  d = statistics.pstdev(means)
  print(f'GAMEVALUE {num_games} {len(returns)} {m:8.3f} {d:8.3f}', flush=True)


def monte_carlo_es(count=100.0, decks=None):
  # return the optimal policy pi[], smoothed game value, use exploring starts (ES)
  count = int(count)
  permitted_actions = ['hit', 'stand']
  # because of exploring starts, we cannot compute the value of a game this easily
  # we can turn off ES for 200K games every so often and use the current strategy to evaluate it
  # we will also calculate the variance of the returns of 10 groups of 1/10 the games
  num_games_eval = int(2e5)
  eval_returns = []   # store them for eval
  eval_mode = False   # in this mode we will store the returns and not do ES

  # This returns the default dict to initialize a new q[state]
  def default_q():
    return {'hit': 0, 'stand': 0}

  def update_policy(state_list, policy):
    # update policy p at states using q_total, q_counts
    qmax = -100
    amax = None
    for state in state_list:
      q2 = q_total[state]
      c2 = q_counts[state]
      for a in permitted_actions:
        if c2[a] > 0:
          q = q2[a] / c2[a]
        else:
          q = 0
        if q > qmax:
          qmax = q
          amax = a
      # set the policy to the maximum q value
      policy[state] = amax

  # exploring starts to estimate optimal policy
  policy = collections.defaultdict(lambda: 'stand')  # maps state->action, start with stand
  prev_policy = None  # keep track of a previous one for print out comparisons
  # Instead of keeping a list of the returns for each state, we will keep the totals and the count
  # We do not use a defaultdict because we want to know how many states have zero counts and there are only about 360
  # states anyway (18*10*2)
  # q_total = collections.defaultdict(default_q)  # q_total[state][action] to value (total)
  q_total = {k: default_q() for k in Game.all_states()}
  # q_counts = collections.defaultdict(default_q)  # q_counts[state][action] to count
  q_counts = {k: default_q() for k in Game.all_states()}
  game_total = 0
  game_count = 0
  deck = Deck(decks=decks)
  for i in tqdm.tqdm(range(count)):
    # Generate a game, will raise exception when done
    # Need to save the states
    if i > 0 and i % 2000000 == 0:
      # do some intermediate reporting
      print_policy(policy, tag=f'games={i//1000000}M', prev_policy=prev_policy)
      prev_policy = copy.deepcopy(policy)
      print_some_counts(q_counts, tag=f'at games {i//1000000}M')
      print(flush=True)
      # save q to a file so can evaluate later.  Do not need policy because it can be generated from q
      with open(f'q-{i//1000000}-{decks}.pickle', 'wb') as fp:
        pickle.dump([q_total, q_counts], fp)
      # start eval period
      eval_mode = True
    if len(eval_returns) >= num_games_eval:
      eval_mode = False
      evaluate_progress(eval_returns, num_games=i)
      eval_returns.clear()
    states = []
    actions = []
    game = Game(deck=deck)  # create a new Game, use existing deck for computation efficiency
    deck.start_new_game()   # make sure enough cards to finish this game
    try:
      game.start()  # need to determine if anyone won before a move
      while True:
        if eval_mode:
          action = policy[game.state]  # pick an action by the current policy
        elif len(game.player) == 2:
          # we pick a random action for the first play (Exploring Starts)
          action = random.choice(permitted_actions)
        else:
          action = policy[game.state]  # pick an action by the current policy
        states.append(game.state)
        actions.append(action)
        game.play(action)  # apply action to game
    except GameOver:
      terminal_reward = game.result()
      game_total += terminal_reward
      if eval_mode:
        eval_returns.append(terminal_reward)
      game_count += 1
      for state, action in zip(states, actions):
        q_total[state][action] += terminal_reward
        q_counts[state][action] += 1
      # set the policy to the max q value
      update_policy(states, policy)
  return policy


def evaluate_policy_mcts(policy, count=100):
  # First-visit MC (cannot revisit a state anyway)
  # Following First-visit MC prediction for estimating V for a policy
  # policy is a function that maps state to action
  # Instead of keeping a list of the returns for each state, we will keep the totals and the count
  permitted_actions = ['hit', 'stand']
  returns_total = collections.defaultdict(float)
  returns_count = collections.defaultdict(int)
  game_total = 0
  game_count = 0
  for _ in tqdm.tqdm(range(count)):
    # Generate a game, will raise exception when done
    # Need to save the states
    states = []
    deck = Deck(decks=None)
    game = Game(deck=deck)  # create a new Game
    try:
      game.start()  # need to determine if anyone won before a move
      while True:
        states.append(game.state)
        action = policy(game.state)  # pick an action by the current policy
        game.play(action)  # apply action to game
    except GameOver:
      terminal_reward = game.result()
      # G=terminal_reward  # no intermediate rewards and no discount factor
      game_total += terminal_reward
      game_count += 1
      for state in states:
        returns_total[state] += terminal_reward
        returns_count[state] += 1
  return returns_total, returns_count, game_total, game_count


def mimic_dealer(state):
  if state.my_total <= 16:
    return 'hit'
  else:
    return 'stand'


def thorp(state):
  #compute the Thorp strategy
  if state.soft:
    return thorp_soft(state)
  if state.dealer_shows == 1 or state.dealer_shows >= 7:
    stand = state.my_total >= 17
  elif state.dealer_shows in [4, 5, 6]:
    stand = state.my_total >= 12
  else:
    stand = state.my_total >= 13
  return 'stand' if stand else 'hit'


def thorp_soft(state):
  #compute the Thorp strategy for soft totals
  stand_on_19 = [1, 9, 10] if sutton else [9, 10]
  if state.dealer_shows in stand_on_19:
    stand = state.my_total >= 19
  else:
    stand = state.my_total >= 18
  return 'stand' if stand else 'hit'


def evaluate(use_policy, set_sutton=False):
  # compute game value for a policy
  global sutton
  sutton = set_sutton
  num = int(5e6)
  returns_total, returns_count, game_total, game_count = evaluate_policy_mcts(use_policy, count=num)
  # returns_total,returns_count,game_total,game_count=evaluate_policy(mimic_dealer,count=num)
  print(f'after running {num}')
  print(f'mean game value {game_total / game_count:8.5f}')
  return game_total / game_count


"""
  keys = list(returns_total.keys())
  keys.sort(key=lambda x: x.dealer_shows * 1000 + x.soft * 50 + x.my_total)
  for state in keys:
    st = repr(state)
    print(f'{st:64} {returns_count[state]:4d} {returns_total[state] / returns_count[state]:8.3f}')
"""


if options.eval:
  print(time.strftime('%c'))
  print('Thorp, infinite deck')
  ret = []
  for i in range(10):
    val = evaluate(thorp, set_sutton=False)
    ret.append(val)
  m = statistics.mean(ret)
  p = statistics.pstdev(ret)
  print(f'mean={m} pstddev={p}')
  print('Sutton, infinite deck')
  ret = []
  for i in range(10):
    val = evaluate(thorp, set_sutton=True)
    ret.append(val)
  m = statistics.mean(ret)
  p = statistics.pstdev(ret)
  print(f'mean={m} pstddev={p}')
  print(time.strftime('%c'))
  sys.exit(0)

for use_decks in [None, 1]:
  print(time.strftime('%c'))
  games = 250e6
  print(f'decks={use_decks} games to be played={games}')
  opt_policy = monte_carlo_es(count=games, decks=use_decks)
  print_policy(opt_policy, tag='at end')
print(time.strftime('%c'))

This is the result after running 248,200,000 games for decks=None (infinite)

GAMEVALUE 248200000 200000   -0.024    0.007
TABLE soft=False at end
     2       3       4       5       6       7       8       9      10       1  
20: stand   stand   stand   stand   stand   stand   stand   stand   stand   stand   
19: stand   stand   stand   stand   stand   stand   stand   stand   stand   stand   
18: stand   stand   stand   stand   stand   stand   stand   stand   stand   stand   
17: stand   stand   stand   stand   stand   stand   stand   stand   stand   stand   
16: stand   stand   stand   stand   stand   hit     hit     hit     hit     hit     
15: stand   stand   stand   stand   stand   hit     hit     hit     hit     hit     
14: stand   stand   stand   stand   stand   hit     hit     hit     hit     hit     
13: stand   stand   stand   stand   stand   hit     hit     hit     hit     hit     
12: hit     hit     stand   stand   stand   hit     hit     hit     hit     hit     
TABLE soft=True at end
     2       3       4       5       6       7       8       9      10       1  
20: stand   stand   stand   stand   stand   stand   stand   stand   stand   stand   
19: stand   stand   stand   stand   stand   stand   stand   stand   stand   stand   
18: stand   stand   stand   stand   stand   stand   stand   hit     hit     hit     
17: hit     hit     hit     hit     hit     hit     hit     hit     hit     hit     
16: hit     hit     hit     hit     hit     hit     hit     hit     hit     hit     
15: hit     hit     hit     hit     hit     hit     hit     hit     hit     hit     
14: hit     hit     hit     hit     hit     hit     hit     hit     hit     hit     
13: hit     hit     hit     hit     hit     hit     hit     hit     hit     hit     
12: hit     hit     hit     hit     hit     hit     hit     hit     hit     hit

This is the decks=1 result.

GAMEVALUE 248200000 200000   -0.020    0.004
TABLE soft=False at end
     2       3       4       5       6       7       8       9      10       1  
20: stand   stand   stand   stand   stand   stand   stand   stand   stand   stand   
19: stand   stand   stand   stand   stand   stand   stand   stand   stand   stand   
18: stand   stand   stand   stand   stand   stand   stand   stand   stand   stand   
17: stand   stand   stand   stand   stand   stand   stand   stand   stand   stand   
16: stand   stand   stand   stand   stand   hit     hit     hit     hit     hit     
15: stand   stand   stand   stand   stand   hit     hit     hit     hit     hit     
14: stand   stand   stand   stand   stand   hit     hit     hit     hit     hit     
13: stand   stand   stand   stand   stand   hit     hit     hit     hit     hit     
12: hit     hit     stand   stand   stand   hit     hit     hit     hit     hit     
TABLE soft=True at end
     2       3       4       5       6       7       8       9      10       1  
20: stand   stand   stand   stand   stand   stand   stand   stand   stand   stand   
19: stand   stand   stand   stand   stand   stand   stand   stand   stand   stand   
18: stand   stand   stand   stand   stand   stand   stand   hit     hit     stand   
17: hit     hit     hit     hit     hit     hit     hit     hit     hit     hit     
16: hit     hit     hit     hit     hit     hit     hit     hit     hit     hit     
15: hit     hit     hit     hit     hit     hit     hit     hit     hit     hit     
14: hit     hit     hit     hit     hit     hit     hit     hit     hit     hit     
13: hit     hit     hit     hit     hit     hit     hit     hit     hit     hit     
12: hit     hit     hit     hit     hit     hit     hit     hit     hit     hit

You can see that the only difference between the 2 results is the Soft 18 vs. a dealer Ace (1). For 1 deck, you should stand on soft 18 vs A, which matched Thorp’s 1966 book. For infinite decks, you should hit soft 18 vs A, which is the result that Sutton & Barto get in Reinforcement Learning (for an infinite deck).

As a quick check on our results, Thorp says that if you follow the “Basic Strategy” with no doubling or splitting, the “casino edge will be only about 2 percent.” We get a game value of -0.020, which seems like a good match.

Note that we have not included the ability to split pairs or double down, we will leave that as an exercise for the reader.

Just to see how close these two results are, here is a graph of the q value (average) for both deck types as we play the 250,000,000 games.

q-value for soft 18 vs A, Infinite Decks

March 20, 2020February 14, 2022 by admin

DIY Hand Sanitizer – Methanol is Deadly

Follow @solverworld Tweet this

With the current COVID-19 crisis, most retail and online stores have been unable to keep hand sanitizer in stock. This has created a crisis for people that need to clean their hands and are unable to access soap and water. The CDC recommends hand washing for 20 seconds frequently to avoid contamination ¹. If this is not possible, they recommend using a hand sanitizer with at least 60% alcohol.

Because of the shortage of commercial hand sanitizer, recipes have been circulating on social media for a homemade or DIY (Do It Yourself) version. While there is rarely any scientific standing behind these recipes, there is also a long standing recipe published by the World Health Organization (WHO). Putting aside any possible problems people may have following these recipes precisely enough to make an effective hand sanitizer, I want to address one deadly consequence that could occur from faulty or improper manufacturing practices.

It is possible that someone attempting to make their own hand sanitizer in this crisis situation makes one simple, deadly, mistake. They use an alcohol containing a small amount of methanol.

First, some background. The alcohol that is used in what are commonly called alcoholic drinks is ethyl alcohol (sometimes called ethanol). It is taxed by the Federal government (in the US) at a rate of $13.50 per proof gallon ². A proof gallon is 1 gallon at 100 proof, or 1.25 gallons of 80 proof, and so on.

Two other kinds of alcohol are isopropyl alcohol and methanol (methyl alcohol). Isopropyl alcohol is what people commonly buy as rubbing alcohol, with percentages of alcohol ranging from 70% to 95%. Methanol is sold for use as fuel in camping stoves (alcohol stoves) and as a thinner for various paints, such as shellac.

The problem with methanol is that it is deadly toxic to humans. It only takes 10 mL (about 1/3 of an ounce) to cause blindness in an adult, or 30mL to cause death ³. It is the impurity in moonshine (or any improperly distilled spirit) that can make it deadly.

The other property of methanol of note is that it is absorbed through the skin. In fact, it is absorbed at the rate of 2mg/cm2/hr [See update below]. That may not seem like much, except that the hand surface area is about 2% of your total body surface area, which amounts to roughly 1000 cm2 for two hands ⁴. It is left as an exercise for the reader to determine just how long you can rub methanol on your hands before suffering disastrous consequences, including neural toxicity, blindness, and death. Hint: not very long and not something anyone should try.

There is one more important fact before we come to the punch line. In order to sell ethyl alcohol for non-drinking purposes and avoid the payment of the excise tax, the Federal government allows it to be adulterated with certain substances that make it undrinkable. This alcohol is then called “denatured alcohol” and allowed to be sold without the payment of the excise tax, which would be $6.75 per quart of pure (100%) ethyl alcohol otherwise. What are these things added, called denaturants? It depends on the purpose of the alcohol, and there are whole set of recipes pre-approved by the Federal government, under CFR (code of Federal Regulations) Title 27 Part 21 ⁵. For example, S.D.A (Specially Denatured Alcohol) 23-H is made by adding acetone and methyl isobutyl ketone (MIK). S.D.A. 30 is made by adding methanol (approximately 9%). I hope you can now see the problem – if someone makes hand sanitizer, and uses a denatured alcohol containing methanol, the most common type sold in hardware stores, either without knowing or without caring, that is going to have deadly consequences when people rub it on their hands.

California banned the sale of denatured alcohol last year.

In case you think this is just so ridiculous that it could not happen, I have bad news. There have been cases reported in the past ⁶. And with the current pressure on supplies of hand sanitizer and the reported price gouging by some retailers, it is bound to become more common.

The FDA has relaxed its rules on the production of hand sanitizer, allowing pharmacies and others to produce hand sanitizer in the current situation.

Please make sure you trust your supplier when you buy hand sanitizer. Anybody can print a label and put it on a bottle – that’s no guarantee of what’s inside.

Update 13 Feb 2022:

Thanks to the astute reader who pointed out my previously incorrect estimate of hand surface area. It has been corrected. I apologize.

Also, there are variations in absorption rates reported. Two references to 0.192 mg/cm^2/min are ⁷ and ⁸. The one I used (for 2.0 mg/cm^2/hr) was ⁹. Please note the different time units in the two estimates.

Also, there is a wide range in the reported amount of methanol that can cause blindness. This reference gives 3.16-11.85 g/person ¹⁰. I don’t look at that and say, I guess I will ingest 3.06g because it’s under the limit! So please be safe.

Notes:

https://www.cdc.gov/coronavirus/2019-ncov/prepare/prevention.html ↩
https://www.ttb.gov/tax-audit/tax-and-fee-rates ↩
https://en.wikipedia.org/wiki/Methanol ↩
J Physiol Anthropol. 2007 Jun;26(4):475-83. doi: 10.2114/jpa2.26.475. https://pubmed.ncbi.nlm.nih.gov/17704626/ ↩
https://www.govinfo.gov/content/pkg/CFR-2019-title27-vol1/xml/CFR-2019-title27-vol1-part21.xml ↩
https://www.researchgate.net/publication/326275500_Methanol_as_an_Unlisted_Ingredient_in_Supposedly_Alcohol-Based_Hand_Rub_Can_Pose_Serious_Health_Risk ↩
J Avella, et al, J. Analytical Toxicology, Vol. 29, Oct 2005. ↩
Int Arch Occup Environ Health 1980, 47(1):81-8 https://pubmed.ncbi.nlm.nih.gov/7429648/ ↩
Ann Occup Environ Med 2017, 29:44 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5625597 ↩
Ann Occup Environ Med 2017, 29:44 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5625597/ ↩

July 22, 2019February 15, 2022 by admin

Is Reinforcement Learning a Slow Learner?

Follow @solverworld Tweet this

A prominent AI researcher recently gave a webinar with the ACM (Association of Computing Machinery) expressing dismay at the current performance of AI systems and giving his thoughts on the directions research should take. Yann LeCun, Chief AI Scientist at Facebook and Professor at NYU, gave a talk titled “The Power and Limits of Deep Learning”. Current AI systems have no ability to model the real-world. For example, babies learn quite early that a truck that drives off a platform and hovers in the air is unexpected. Current AI systems do not have this ability – they might, after many, many training examples, might be able predict this type of behavior for very specific vehicles. “Sure, I know a red fire truck will fall down, but I have no idea what this Prius is going to do. Let’s watch…” This same type of thing happens in the simpler task of image recognition. A human can get the idea of an elephant from a few images, but our most sophisticated image recognition systems need many thousands of training examples to recognize a new object. And even then, it will have difficulty in recognizing a different view (Elephant rear-end?, Elephant with trunk hidden behind a wall?) if it has not specifically been trained with those types of views.

Similarly, Reinforcement Learning, a technique used to train AI systems to do things like play video games at (or above) human levels, is a slow learner. It takes 83 hours of real-time play for the RL systems to achieve a level a human player can achieve in 15 minutes.

Two basic algorithms used in AI are (1) supervised learning and (2) unsupervised learning. Supervised learning is an algorithm trained by showing it an image (or training example) along with the desired response. “Hello computer. This image is a car. This next image is a bird.” This goes on for millions of images (The ImageNet dataset, used in a lot of benchmark tests, has over 15 million images, and often a subset of over 1 million images is used for training). The training is also repeated over that same set many times (many “epochs”). On the other hand, unsupervised learning tries to make sense of the data without any human “supervised” advice. An example is a clustering algorithm that tries to group items into clusters, or groups, so that items within each group are similar to each other in some way.

Prof. LeCun’s suggestion is that unsupervised learning, or what he calls self-supervised learning, might provide a better approach. He said “Prediction is the essence of intelligence.” We will see whether computers will be able to generate predictions from just a few examples.

References

June 18, 2019February 15, 2022 by admin

Bitwise Compression: What Is It and How Does it Work?

Follow @solverworld Tweet this

File compression programs such as Zip, Pkzip, and Winzip can make a file smaller and take up less space on your hard disk. These programs can then reverse the process, or uncompress the file(s), and restore them exactly as they were before. This is called lossless compression. There lossy compression programs, in which the restored files are close but not exactly the same as the original files. This might seem not useful for data files (like databases, text, or spreadsheets), but can be critically important for things like movies or photographic images. We are going to be only discussing lossless compression today.

The very best (in terms of compression ratio) programs use advanced techniques. For many more details and a history of progress, see Matt Mahoney’s Data Compression Explained. Matt Mahoney also runs the The Hutter Prize, a contest (with cash prizes) for the best programs in a specific data compression task. For reference, the current record for compressing the first 100MB of Wikipedia content has a compression ratio of 0.15285.

There is a specific technique that all the top compression programs use that I will explain in this blog post, called Bitwise Compression. We will get to that in a bit, but first we will need to discuss some basic data compression and information theory.

Basic Compression

The basic idea of data compression is that repeated things can be assigned shorter codes than the length they originally were. This means that things that are more likely should be assigned shorter codes. Morse Code tries to be efficient by assigning the most common letter, “E”, the shortest code ¹. Let’s look at a toy example, and suppose that we have just 4 symbols in a file and we want to compress the file. Suppose that the 4 symbols are A,B,C, and D and have the following probabilities:

SYMBOL	PROBABILITY
A	0.75
B	0.15
C	0.05
D	0.05

The probabilities can either be thought of as the likelihood of those symbols appearing in future files that we will see, or the fraction of the symbols in a particular file that we are trying to compress. Note that in endeavors like the Hutter Prize, there is a specific file that we are trying to compress and optimize for that file as much as we want; general compression programs do not have that luxury.

Given the above symbols and probabilities, we might ask what is the optimal way to encode them in bits. A particular solution to this program was invented by David A. Huffman, in his 1952 paper, called Huffman Coding. He asked the question: if I assigned bit sequences to each of those symbols, what assignment would produce the shortest output bit sequence on average. Think of this like Morse Code, where instead of each symbol being assigned a certain sequence of dots and dashes, we assign a certain sequence of 0s and 1s.

For example, suppose we just decided to assign A=00, B=01, C=10, and D=11. That is, 2 bits for each symbol. Of course, this encoding method can be reversed, because we just take each pair of bits, and translate back into ABCD as we go. What is the average number of bits used per symbol for this code? By inspection, the answer is 2.0 bits per symbol. Dr. Huffman says we can do better. Using his algorithm, we get the following optimal assignment ²

SYMBOL	CODE	LENGTH	PROBABILITY	CONTRIBUTION
A	1	1	0.75	0.75
B	01	2	0.15	0.30
C	001	3	0.05	0.15
D	000	3	0.05	0.15
TOTAL				1.35 bits/symbol

This assignment shows that 1.35 bits/symbol is achieved. Huffman says that is this the best we can do by assigning a bit sequence to each symbol. Are there other methods that could do better? Claude Shannon, who invented the entire field of Information Theory with his 2 part blockbuster paper published in the Bell System Technical Journal in 1948, said yes. He said that the information content of a symbol stream could be measured by somethings called Entropy, in units of bits. The formula for computing Entropy is

\[
E=\sum_{k=0}^{N-1}-p_i \log_2 p_i
\]

where $\log_2$ is the logarithm to the base 2, which, in case you have forgotten can be calculated as
\begin{align}
\log_2 x = \frac{\log x}{\log 2}
\end{align}
where log is the logarithm to your favorite base (such as the more common 10 or e). The units of Entropy are bits/symbol. The number of bits used for any entire file would then be Entropy times the number of symbols in the file.

For our toy ABCD example, the entropy we get is
\[
E=1.15402 \textrm{ bits/symbol}
\]

Now, Shannon did not know the exact practical way to get to an encoding method that could achieve this, but he showed that it was possible. One way to do better would be to encode pairs of symbols into Huffman Codes. That is, we make a table like we did above, but the symbols are now AA, AB, AC, AD, BA, etc. We get the probabilities of these combinations by multiplying the probabilities of the individual symbols together (putting aside conditional probabilities for the time being). This will get us closer to the 1.15402 bits/symbol we are looking for, but not there yet. The reason is that we constrained to using a whole number of bits for each symbol and we end up wasting output bits for each symbol.

Arithmetic Coding

Enter Arithmetic Coding, invented in the 1970s at IBM, and published in the IBM Journal of Research and Development. It basically allows us to out fractional bits for each symbol, and overall, the number of bits used approaches the Entropy we calculated above (the Arithmetic Coding algorithm gets us to within 1 byte of the Entropy in our entire sequence, so let’s call that good enough). There are many descriptions on how AC works, starting with Arithmetic Coding. We are not going to go into the details here.

Context

In any real file, the individual bytes do not show up with independent probabilities. For example, in an English language text file, after the letter “q”, the letter “u” is very likely to show up, far more likely than it’s overall frequency of occurrence in the file. How do we take advantage of this information? It turns out, that using Arithmetic Coding (AC), it very straightforward. AC, at each step, just needs the probabilities of each symbol, and it can output the proper bit sequence to encode those symbols. The decoder just needs the same probabilities provided to it, and it can recover the original symbol sequence. One way that file compression algorithms can take advantage of this information is by building models (or tables) as a file is processed, which try to predict what the next symbol will be. This prediction is done by giving a probability for each symbol. Now, in a technique known as Prediction by Partial Matching (PPM), we look at the previously seen symbols and then predict the next symbol. For example, if we see “sequ” we might give a high probability to the next symbol being “e”. This is done by keeping track of how many times various symbols are seen for each context, or previous symbols. Where it gets difficult is that you want the probabilities to be adaptive, so they can adjust to varying parts of the file. It turns out that we cannot just compute these probabilities by context for the entire file and use those for encoding. They would have to sent along to the decoder. While this would result in extremely good compression ratios, the size of the data (the probability for every context) would be huge, and negate the space savings.

For a bunch of reasons, a variation of this method, Bitwise Compression, was developed. It works basically the same way, but each symbol being encode is a bit from the source file. That is, we only have 2 symbols, “0” and “1”. This makes everything simpler, including the AC algorithm, which now only needs the probability of a 1 to encode each bit. A good question to ask is, are we losing anything by doing this Bitwise Compression? At first glance, it seems like you might be taking up extra space because each bit has to be encoded, rather than entire bytes.

Bitwise Compression

Now, one thing is obvious. If we did not use any context, this would not work. Without any context, each bit is likely to be 50/50, so we would not gain any compression benefit. So what I want to show here is that compressing by bytes is equivalent to compressing by bits, with the context of the byte encoded so far.

Let’s start by comparing two methods of compressing a text file:
(1) Compress each byte based on its probability of occurrence in the file, with no context. This is easily estimated by simply computing the probability of each byte (based on the number of times it occurs in the file), then computing the entropy of those probabilities.
(2) Compress each byte, one bit at a time, where the context of each bit prediction is the previously seen byte. That is, for the first bit prediction, the context is nothing. That is, the first bit will be predicted based on the probability of the first bit of each byte being 1. The second bit will be predicted based on the previous context being ‘0’ or ‘1’, depending on what the first bit was. A simple way to handle the bit context in Python is to make a context string, which is a ‘1’ followed by the actual bits seen so far. This distinguishes 3 zeros seen from 1 zeros seen: 1000 (0x8) means 3 zeros seen, while 10 (0x3) means a single 0 seen, and 1 (0x1) means nothing seen yet (the first bit of a byte). In the accompanying code, this is called the bitcontext.

You can follow along with the python code in the related gitlab account, which you can get by doing

git clone https://gitlab.com/dgrunberg/bitwise-compress.git

If you have Python 3.5 or later, you should be able to run this code, like this:

python study-bits.py --meg 1 --input enwik8

Where enwik8 is the 100MB of Wikipedia data you can get from the Hutter Prize site. This is perform the analysis on the first 1000000 bytes of enwik8. The Estimated compression bits it puts out is computed by calculating the entropy of the probabilities generated as inputs to the AC.

The results are

Method – no previous bytes context + bitcontext	Compression Ratio
Entropy by byte: 5.0589 bits/byte	0.6324
Encode by bit – Estimated bits out by calculating entropy	0.6324
Encode by bit – Actual bits output	0.6324

So you can see, for this simple example file, that encoding by bit with a single byte bitcontext gives the same result as encoding based on byte probabilities.
Now, for a preview of how context can help with encoding, let us consider using the single previous byte as context. The results are:

Method – single previous byte as context + bitcontext	Compression Ratio
Encode by bit – Estimated bits out by calculating entropy	0.4793
Encode by bit – Actual bits output	0.4794

So we went from compression ratio of 0.6324 to 0.4794 by using a single previous byte as context. If you follow the history of compression programs from Matt Mahoney’s excellent site, you will see that to get from here to the current record holder ratio of about 0.15, a number of additional things get added:

An adaptive probability model is used, where the count of occurrences of each context is encoded into a single byte (called a bit history). This gets updated each time a context is encountered through a table lookup
Each count is mapped to a probability through a single table for the entire context; the table gets updated as each bit is revealed.
Many different models are used, including various numbers of previous bytes and some unigram word models
These models are combined linearly (called the mixer), with the weights adjusted adaptively
This mixed probability is then run through a SSE (Secondary Symbol Estimation), which is another adaptively updated table that corrects this probability. Sometimes 2 or more SSE stages are used.
More than 1 set of weights are used, where the particular weight to be used is chosen based on some other context characteristics.
A preprocessor and dictionary is used, which replaces many of the words in the file with various dictionary entries and escape codes.

-THE END

Notes:

E is just one single dot. The second most common letter, T, is a single dash. ↩
One minor point I glossed over: since the codes in general are different lengths, we have to be able to decode them without any specific indication of where each code ends. That is, we need to be able to parse out the individual bit sequences from the overall stream. The property of the chosen codes that ensures this will happen is called the prefix property. This means that no code can be a prefix (first part) of any other code. Huffman Codes have this property. ↩

Initialize, for all \(s \in S, a \in A(s)\):
	set \(Q_T(s,a)=0\) and \(Q_C(s,a)=0\) for all \(s,a\)
	set \(\pi(s)\) = some random action for all \(s\)
Repeat Forever:
	Choose \(S_0 \in S\) and \(A_0 \in A(S_0)\) such that all pairs have probability > 0
	Generate a game starting from \(S_0, A_0 \), following \(\pi\), with terminal reward \(R\)
	For each pair \(s\), \(a\) appearing in the game:
	\(Q_T(s,a) \mathrel{+}=R\)
	\(Q_C(s,a) \mathrel{+}=1\)
	Update the policy: \(\pi(s)=\arg \max_{a} \frac{Q_T(s,a)}{Q_C(s,a)}\)

Business

Blackjack Strategy – A Monte Carlo Method